openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix race in waitUntilSettled https://review.openstack.org/429939 | 00:05 |
---|---|---|
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix race in waitUntilSettled https://review.openstack.org/429939 | 00:06 |
jeblair | i fixed that by reading test debug logs \o/ | 00:06 |
clarkb | jeblair: so the count of events would be off? | 00:10 |
clarkb | er count of completed events | 00:10 |
jeblair | clarkb: well, correct at the time, but when a nodepool request completes, it makes a new event. so we need to make sure we check events after checking for outstanding nodepool requestss. | 00:12 |
jeblair | basically, there's a right order to check that race-free | 00:12 |
clarkb | right, I mean on the assertion side after waiting for settled things were failing beacuse numebrs came out wrong? | 00:13 |
clarkb | (trying to understand how the failures manifested) | 00:13 |
jeblair | clarkb: ah -- the failure i saw was premature test ending, because a waitForSettled returned true before jobs ran | 00:13 |
jeblair | so yeah, any of the assertions could fail (pipelines not empty, jobs not run, etc) | 00:14 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Implement job aborting https://review.openstack.org/426306 | 00:17 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Set job class attributes in __init__ https://review.openstack.org/428859 | 00:18 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Run pre and post playbooks https://review.openstack.org/427494 | 00:18 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add SourceContext class https://review.openstack.org/427846 | 00:18 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Report exceptions in launcher https://review.openstack.org/428499 | 00:18 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add some gearman related debugging https://review.openstack.org/428530 | 00:19 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Use ZUUL_TEST_ROOT in launcher https://review.openstack.org/428391 | 00:19 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add inheritance debugging https://review.openstack.org/429065 | 00:19 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Re-enable test_mutex https://review.openstack.org/429122 | 00:21 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Re-enable test_mutex https://review.openstack.org/429122 | 00:21 |
jeblair | adam_g: ^ updated that to include playbooks (i ran tests/make_playbooks.py) | 00:22 |
adam_g | jeblair: ah cool, thanks | 00:34 |
*** pleia2_ is now known as pleia2 | 01:04 | |
*** nt has quit IRC | 05:08 | |
*** saneax-_-|AFK is now known as saneax | 06:45 | |
*** jamielennox is now known as jamielennox|away | 07:59 | |
*** jamielennox|away is now known as jamielennox | 08:06 | |
openstackgerrit | Oleksandr Karpenko proposed openstack-infra/zuul master: create directory inside state_dir/times if needed https://review.openstack.org/430112 | 08:17 |
*** hashar has joined #zuul | 08:18 | |
*** abregman has joined #zuul | 08:32 | |
*** abregman is now known as abregman|afk | 10:31 | |
*** abregman|afk is now known as abregman | 12:00 | |
*** Cibo_ has quit IRC | 12:22 | |
*** Cibo_ has joined #zuul | 12:22 | |
*** saneax is now known as saneax-_-|AFK | 12:53 | |
*** nt has joined #zuul | 12:59 | |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool master: Consume Task and TaskManager from shade https://review.openstack.org/414759 | 13:55 |
pabelanger | morning | 14:38 |
pabelanger | jeblair: mordred: Shrews: any ideas of how many (new) servers we want for the PTG? | 14:51 |
pabelanger | will they eventually become production servers or is this just for testing things? | 14:52 |
*** bhavik1 has joined #zuul | 15:17 | |
jeblair | pabelanger: why don't we do an all-in-one zuulv3 server? and maybe a second server for the nodepool launcher? i'd guess the nodepool launcher could become production later. the zuulv3 server might end up getting wiped, or we might morph it into the main zuulv3-scheduler server. | 15:33 |
pabelanger | jeblair: WFM | 15:34 |
*** Cibo_ has quit IRC | 15:40 | |
*** Cibo_ has joined #zuul | 15:45 | |
*** Cibo_ has quit IRC | 15:50 | |
*** Cibo_ has joined #zuul | 15:51 | |
*** Cibo_ has quit IRC | 15:55 | |
*** bhavik1 has quit IRC | 15:55 | |
*** Cibo_ has joined #zuul | 16:17 | |
*** openstackgerrit has quit IRC | 16:35 | |
jeblair | pabelanger: comments on 429883 | 17:06 |
*** Cibo_ has quit IRC | 17:09 | |
*** Cibo_ has joined #zuul | 17:10 | |
pabelanger | jeblair: great, thanks for the comment. In fact, we no longer need __eq__ in the Project object. I'm assuming we can drop it too? | 17:14 |
*** openstackgerrit has joined #zuul | 17:16 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul feature/zuulv3: Re-enable test_queue_names https://review.openstack.org/429883 | 17:16 |
pabelanger | updates^ | 17:16 |
jeblair | pabelanger: yes | 17:18 |
*** Cibo_ has quit IRC | 17:19 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Re-enable multiple gerrit connection test https://review.openstack.org/406699 | 17:20 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Re-enable multiple gerrit connection test https://review.openstack.org/406699 | 17:20 |
*** Cibo_ has joined #zuul | 17:20 | |
jeblair | jamielennox: ^ updated to add newly required playbooks | 17:20 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Re-enable test_abandoned_not_timer https://review.openstack.org/427985 | 17:23 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Re-enable test_abandoned_not_timer https://review.openstack.org/427985 | 17:24 |
jeblair | SpamapS: ^ updated to add newly required playbooks | 17:24 |
*** Cibo_ has quit IRC | 17:27 | |
*** abregman has quit IRC | 17:35 | |
*** saneax-_-|AFK is now known as saneax | 17:36 | |
pabelanger | okay, nodepool-launcher server should be ready for review: https://review.openstack.org/#/q/topic:infra-ptg-pike | 17:40 |
jeblair | pabelanger, mordred: see comments on 429850 | 17:42 |
jeblair | pabelanger: w00t! | 17:43 |
pabelanger | jeblair: sure, I'll dive more into it shortly | 17:48 |
SpamapS | jeblair: oh thanks.. I've been staring at the problems with that, btw. I have a few local changes that get it further, but ultimately I think something happens that makes it generate NullChange instead of FakeChange. | 17:50 |
SpamapS | starting to get a little cross-eyed at the fakes | 17:51 |
jeblair | SpamapS: i think the timer is expected to generate nullchange objects? | 17:52 |
SpamapS | jeblair: right, but those don't interact as well with the fakes in v3 for some reason. THat's where my debugging is at right now. | 17:55 |
*** hashar has quit IRC | 18:00 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Assign node set to node requests https://review.openstack.org/428428 | 18:15 |
Shrews | jeblair: I believe I addressed all of your concerns now for ^^^^ | 18:16 |
jeblair | Shrews: yeah, lgtm | 18:17 |
Shrews | jeblair: and the hang I experienced yesterday had to do with moving storeNode() to the NodeLaunch thread and not giving the thread time to complete from the poll actions | 18:17 |
Shrews | that was a doozy | 18:17 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Add 'hostname-format' to provider config section https://review.openstack.org/429841 | 18:25 |
Shrews | jeblair: and the bike is sheddededed ^^^ | 18:25 |
pabelanger | Shrews: jeblair: left comment on 429841 about the usage of 'template' | 18:29 |
Shrews | pabelanger: yeah, i don't 'template' there either. can easily remove it if there is agreement | 18:31 |
pabelanger | Shrews: so we do in fact remove it in nodepool.yaml for openstack-infra. Since we are refactoring now, we can likely drop it, but want to make sure everybody is cool with it | 18:37 |
*** saneax is now known as saneax-_-|AFK | 18:58 | |
jhesketh | Morning | 19:01 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Re-enable test_abandoned_not_timer https://review.openstack.org/427985 | 19:32 |
SpamapS | jeblair: oo, progress. Now with your playbook, _one_ job actually does get held in builds. :) | 19:35 |
SpamapS | http://paste.openstack.org/show/598008/ | 19:38 |
SpamapS | my new fail | 19:38 |
SpamapS | Exception: Node <Node 0000000000 static:ubuntu-xenial> is not locked | 19:39 |
SpamapS | http://paste.openstack.org/show/598009/ is better | 19:40 |
SpamapS | Shrews: around? | 19:40 |
Shrews | SpamapS: infra meeting atm | 19:40 |
SpamapS | oooooooh | 19:40 |
SpamapS | derp | 19:40 |
SpamapS | Shrews: ok, when you're available, I need to try and understand if and why my test might be trying to use unlocked nodes | 19:49 |
*** hashar has joined #zuul | 19:51 | |
Shrews | SpamapS: ok. kinda sorta freeish now | 19:57 |
Shrews | SpamapS: is there a review #? | 20:00 |
SpamapS | Shrews: I've been digging through, I think it's a logic error in the launcher. | 20:00 |
SpamapS | there may be a path in where acceptNodes isn't called before useNode | 20:00 |
SpamapS | Shrews: yeah 427985 | 20:01 |
SpamapS | I think an exception elsewhere is causing the problem. | 20:01 |
Shrews | SpamapS: so, what zuul is supposed to do when it gets a FULFILLED node request is go through the node set and lock each one, then set its state to IN_USE. | 20:03 |
SpamapS | Shrews: yeah this isn't nodepool's problem at all | 20:03 |
SpamapS | it's a race | 20:03 |
Shrews | SpamapS: but, i'm not familiar enough with the zuul code to say if it is doing that. | 20:04 |
SpamapS | I put a big sleep in right before the failing assertion and the test passes | 20:04 |
SpamapS | Shrews: but thanks anyway :) | 20:04 |
Shrews | not suggesting nodepool is the issue. zuul is supposed to do those things *after* nodepool changes the request to FULFILLED | 20:05 |
Shrews | but if it is a race, that's really going to be fun :) | 20:05 |
SpamapS | well but there's no nodepool running anyway | 20:06 |
SpamapS | jeblair: did you have a patch stack addressing the test races you were seeing? I may need to rebase on it. | 20:06 |
Shrews | SpamapS: i do recall jeblair putting something up for waitUntilSettled | 20:07 |
Shrews | SpamapS: https://review.openstack.org/429939 | 20:08 |
SpamapS | let's see if we rebase on that what happens.. | 20:14 |
jeblair | SpamapS: that merged, and i rebased your change on the branch tip which includes it in order to get the playbooks | 20:16 |
jeblair | SpamapS: from the logs, it looks like zuul is canceling a build during a reconfiguration (and so returns the nodeset), but then tries to start the build again (and tries to re-use the existing, already returned nodeset). | 20:27 |
jeblair | SpamapS: there's probably at least two bugs here: 1) a job canceled in reconfiguration shouldn't be relaunched; 2) the job shouldn't have been canceled during reconfiguration. | 20:29 |
*** ianw has quit IRC | 21:07 | |
jeblair | SpamapS: i see why the job is canceled -- it does not match the newly defined job because the newly defined job is in a different repo. basically, the updateConfigLayout trick used in that test isn't going to work anymore. probably we instead want to make a live configuration change to the repo with the configuration rather than changing to a different repo. | 21:08 |
*** harlowja has quit IRC | 21:09 | |
jeblair | SpamapS: the re-use of the nodeset is a separate issue; i don't know if we're going to have a test that covers that or not. we can reproduce that on its own by having a job with the same name before and after reconfiguration, but different attributes. | 21:10 |
*** harlowja has joined #zuul | 21:35 | |
*** jamielennox is now known as jamielennox|away | 21:42 | |
Shrews | jeblair: What do you think we should do with READY nodes that have been created for a node request, but that request was removed while we launched them? Leave the nodes for other node requests, or delete them? | 21:44 |
Shrews | For common node types, it would be more efficient to leave them. But for rarely used node types, they might linger for a while, using resources. | 21:45 |
clarkb | Shrews: current behavior us they linger until the cleanup timwout | 21:46 |
Shrews | clarkb: that's the approach I favor, too | 21:47 |
pabelanger | reminds me of https://review.openstack.org/#/c/83980/ something I had issues with when I had limited cloud resources | 21:47 |
Shrews | pabelanger: interesting. in fact, i'm not sure min_ready is something that makes sense in the new v3 world since allocation is now per-request | 21:49 |
jeblair | Shrews, clarkb: ++ | 21:49 |
jeblair | Shrews: i think we still want to support min-ready so that we can be more responsive | 21:49 |
jeblair | (i don't think we have to have it by the ptg :) | 21:50 |
Shrews | jeblair: we may need some thought around that then. which node types do we build upfront? | 21:50 |
Shrews | perhaps that logic already exists in nodepool though. haven't seen that bit of it yet | 21:51 |
clarkb | Shrews: ya its label specofic currwbtly | 21:52 |
jeblair | Shrews: yeah, min-ready is attached to a label, so nodepool can just periodically check that the min-ready levels are satisfied, and if not, spin up some more nodes | 21:52 |
jeblair | Shrews: that's the "watermark" in "watermark_sleep" | 21:52 |
jeblair | it refills the pool | 21:52 |
Shrews | mmm, code simile | 21:52 |
pabelanger | never did finish that patch, ended up abandoning it | 21:53 |
jeblair | Shrews: to your original question, i think "leave them" for now is the most compatible with where we want to go in the future | 21:55 |
pabelanger | okay, early EOD for me. If people would like to review https://review.openstack.org/#/q/topic:infra-ptg-pike I can start spinning up nodepool-launcher for tomorrow. zuulv3-dev, still needs some work | 21:56 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Unallocate new nodes if request is pulled https://review.openstack.org/430453 | 21:58 |
Shrews | ooh, that's incomplete, actually | 22:00 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Unallocate new nodes if request is pulled https://review.openstack.org/430453 | 22:03 |
Shrews | that should do it | 22:03 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Split merger and launcher git roots https://review.openstack.org/430456 | 22:05 |
jeblair | pabelanger: reviewed! | 22:16 |
*** hashar has quit IRC | 22:16 | |
*** jamielennox|away is now known as jamielennox | 22:42 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Split merger and launcher git roots https://review.openstack.org/430456 | 22:51 |
openstackgerrit | Adam Gandelman proposed openstack-infra/zuul feature/zuulv3: Drop test_node_label https://review.openstack.org/430473 | 23:20 |
jeblair | adam_g: is there an included python functions test fixture file we can delete with that? | 23:29 |
adam_g | jeblair: oh yes, one min | 23:31 |
SpamapS | jeblair: Thanks! Makes perfect sense.. and I remember thinking "I wonder if trying to switch repos is going to screw things up?" but forgot that while debugging. :-P | 23:33 |
openstackgerrit | Adam Gandelman proposed openstack-infra/zuul feature/zuulv3: Drop test_node_label https://review.openstack.org/430473 | 23:49 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Re-enable test_abandoned_not_timer https://review.openstack.org/427985 | 23:52 |
SpamapS | jeblair: ^ so the latest version uses the upstream repo and copyies it into layout-idle. During reconfigure, this aborts the jobs, which I believe is the exact bug the test is trying to avoid. | 23:52 |
SpamapS | or, not avoid.. but.. it's the bad behavior that we want to make sure isn't happening | 23:53 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!