Nick | Message | Time |
---|---|---|
SpamapS | Shrews: that was waffle house | 00:06 |
SpamapS | or the back of the karaoke bar | 00:07 |
SpamapS | not sure | 00:07 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix sql reporting start/end times https://review.openstack.org/508362 | 00:11 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix sql reporting start/end times https://review.openstack.org/508362 | 00:48 |
SpamapS | There seems to be a bug in stream.html or the streaming server. | 02:51 |
SpamapS | sometimes it thinks the stream has ended | 02:51 |
SpamapS | but I think it's just timing out | 02:51 |
*** hashar has joined #zuul | 06:12 | |
openstackgerrit | Krzysztof Klimonda proposed openstack-infra/zuul feature/zuulv3: Add zuul supplementary groups before setgid/setuid https://review.openstack.org/508444 | 08:51 |
*** bhavik1 has joined #zuul | 08:57 | |
*** bhavik1 has quit IRC | 09:10 | |
*** jesusaur has quit IRC | 10:59 | |
*** jesusaur has joined #zuul | 11:03 | |
*** dkranz has joined #zuul | 12:46 | |
*** hashar is now known as hasharAway | 13:25 | |
Shrews | jeblair: mordred: noting this here b/c #infra is so busy... I see the nodepool zk connection get suspended at 2017-09-29 11:20:57,681 and it never was unsuspended until i manually kicked nodepool two hours after that (http://paste.openstack.org/show/622307/) | 13:56 |
Shrews | we see 1 case before that in the paste where it behaved properly. i'm not sure why we would never get the connection back, even after we freed up space on the zk server | 13:57 |
Shrews | chalk it up to zk getting confused? it may be the same reason zuul is now not doing anything | 13:58 |
mordred | Shrews: good call on in here ... | 13:58 |
mordred | infra-root: let's use #zuul for digging into deep issues like the zuul zk thing - and #openstack-infra for fielding job migration issues | 13:59 |
Shrews | mordred: ++ | 14:00 |
fungi | sounds fine to me | 14:04 |
fungi | Shrews: do you think it was at all related to nodepool filling up its filesystem, or were those more likely separate problems? | 14:05 |
jeblair | is everything running now? | 14:05 |
fungi | oh, now i see you mean failure to recover after cleanup | 14:05 |
Shrews | fungi: i think that was probably the catalyst for the cascade of failures | 14:05 |
Shrews | fungi: right. after freeing space, nodepool was still suspended in its zk connection | 14:06 |
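Editor's aside: the "suspended and never unsuspended" behavior Shrews describes matches the ZooKeeper session-state pattern. Below is a minimal, hypothetical sketch of a connection-state listener, modeled on Kazoo's listener API but using a stand-in state enum rather than the real kazoo types, that makes recovery observable instead of leaving the client apparently stuck:

```python
import enum
import logging
import threading

class ZKState(enum.Enum):
    # Stand-in for kazoo.client.KazooState values.
    CONNECTED = "connected"
    SUSPENDED = "suspended"
    LOST = "lost"

class ConnectionWatcher:
    """Track ZooKeeper session state the way a Kazoo listener would.

    A client that logs SUSPENDED but never re-checks its state can
    look stuck even after the server recovers; exposing an event
    lets callers wait for (or assert on) reconnection.
    """

    def __init__(self):
        self.log = logging.getLogger("zk.watcher")
        self._connected = threading.Event()

    def listener(self, state):
        # Kazoo invokes listeners from its connection thread.
        if state is ZKState.SUSPENDED:
            self.log.warning("ZooKeeper connection suspended")
            self._connected.clear()
        elif state is ZKState.LOST:
            self.log.error("ZooKeeper session lost")
            self._connected.clear()
        else:
            self.log.info("ZooKeeper connection (re)established")
            self._connected.set()

    def wait_for_connection(self, timeout=None):
        return self._connected.wait(timeout)
```

With a watcher like this, the two-hour hang would have shown up as a wait that never completed rather than silence.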
Shrews | jeblair: i think zuul might be wedged? i don't see any progress now that nodepool is doing things | 14:06 |
Shrews | oh, maybe now it's progressing | 14:08 |
jeblair | it looks like zuul is stuck in a loop trying to return an unlocked node | 14:08 |
jeblair | we may lack an appropriate exception handler there | 14:09 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Handle errors returning nodesets on canceled jobs https://review.openstack.org/508532 | 14:47 |
fungi | that looks promising | 14:47 |
fungi | lgtm | 14:49 |
jeblair | Shrews, mordred: ^ can you look at that? | 14:49 |
jeblair | i think we're going to want to force-merge; update; and restart zuul | 14:49 |
Shrews | looking | 14:50 |
Shrews | jeblair: i sort of feel like we should make sure we try to unlock the nodes, even with the exception | 14:52 |
Shrews | that way nodepool at least has a chance of cleaning up any remains | 14:53 |
jeblair | Shrews: that exception is "unlocking failed"; i'm not sure what else we can do? | 14:53 |
Shrews | jeblair: what if storeNode() fails? | 14:54 |
Shrews | jeblair: or if it's just a single node that can't be unlocked | 14:54 |
Shrews | maybe we should move the handling to _unlockNodes()... i can't unlock this one, but going to try to unlock the rest, then bail | 14:56 |
jeblair | Shrews: okay, i'll work something up for review | 14:56 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Always try to unlock nodes when returning https://review.openstack.org/508532 | 15:01 |
jeblair | Shrews: how's that look? | 15:01 |
Shrews | jeblair: doesn't that still leave the potential for leaving the nodes locked? | 15:05 |
Shrews | the for instance in my head: a 2-node nodeset, something happens with node A while in returnNodeSet(), node B is left in a locked state | 15:06 |
jeblair | Shrews: i *think* we loop over all nodes in all cases? | 15:08 |
Shrews | maybe the safest thing is to make sure _unlockNodes() is always called (even on exception) but that method will ignore unlock exceptions and try to unlock all nodes | 15:08 |
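Editor's aside: the behavior Shrews is proposing here, best-effort unlocking that tolerates a single bad node, could be sketched like this. Names such as `unlock_node` and `UnlockError` are illustrative stand-ins, not Zuul's real API:

```python
import logging

log = logging.getLogger("zuul.nodepool")

class UnlockError(Exception):
    """Illustrative stand-in for a real per-node unlock failure."""

def unlock_node(node):
    # Placeholder for the real per-node unlock call.
    if node.get("broken"):
        raise UnlockError("cannot unlock %s" % node["id"])
    node["locked"] = False

def unlock_nodes(nodes):
    """Best-effort unlock: try every node, log failures, never raise.

    One bad node should not leave the rest of the nodeset locked;
    nodepool then has a chance to clean up whatever was released.
    """
    for node in nodes:
        try:
            unlock_node(node)
        except Exception:
            log.exception("Error unlocking node %s", node.get("id"))
```

This is the "ignore unlock exceptions but try every node" shape that the merged change (508532) settled on.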
Shrews | jeblair: oh, you're right. i saw that as raising an exception, not logging | 15:09 |
Shrews | duh | 15:09 |
Shrews | jeblair: +2 | 15:09 |
jeblair | w00t | 15:09 |
Shrews | my brain is stuck at 7am i think :) | 15:10 |
jeblair | Shrews: that's when i wrote the first patch ;) | 15:14 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Always try to unlock nodes when returning https://review.openstack.org/508532 | 15:30 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add inheritance path to zuul vars https://review.openstack.org/508543 | 15:34 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add inheritance path to zuul vars https://review.openstack.org/508543 | 15:40 |
jeblair | i've restarted zuul | 15:45 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update zuul-changes script for v3 https://review.openstack.org/508553 | 16:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 16:38 |
*** electrofelix has quit IRC | 16:47 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 17:02 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 17:40 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 18:04 |
*** robled has quit IRC | 18:10 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 18:10 |
*** robled has joined #zuul | 18:12 | |
*** robled has joined #zuul | 18:12 | |
*** hasharAway has quit IRC | 18:51 | |
*** hashar has joined #zuul | 18:52 | |
jeblair | if anyone wants to write some zuul patches: | 20:12 |
jeblair | 1) map pipeline priority to nodepool request priority -- low=300, normal=200, high=100 | 20:13 |
jeblair | 2) load average limit in executor -- stop accepting executor:execute jobs if load average > configurable threshold (default 30) | 20:14 |
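Editor's aside: the precedence-to-priority mapping in item 1 is small enough to sketch directly. This is a hypothetical helper, and Zuul's actual attribute names may differ; the key point is that nodepool serves lower numbers first:

```python
# Hypothetical mapping helper; Zuul's real option names may differ.
PRECEDENCE_TO_PRIORITY = {
    "low": 300,
    "normal": 200,
    "high": 100,
}

def request_priority(precedence):
    """Map a pipeline precedence to a nodepool request priority.

    Nodepool serves lower numbers first, so high precedence gets
    the smallest number. Unknown values fall back to normal.
    """
    return PRECEDENCE_TO_PRIORITY.get(precedence, 200)
```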
jeblair | i'm working on the fix/tests to/for the project-template issue | 20:16 |
Shrews | jeblair: i'd also like to add #3) add review+PS # to NodeRequest object and change nodepool to output this for the 'request-list' command | 20:18 |
Shrews | and i'll happily work on that myself | 20:18 |
fungi | looks like we can check against os.getloadavg()[0] for a float of the one-minute load average | 20:23 |
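Editor's aside: a minimal sketch of the check fungi describes, using `os.getloadavg()` from the standard library. The 30.0 default is the threshold jeblair suggested above; the function name is illustrative:

```python
import os

# Default threshold suggested in the discussion above.
LOAD_THRESHOLD = 30.0

def over_load(threshold=LOAD_THRESHOLD):
    """Return True if the 1-minute load average exceeds the threshold.

    os.getloadavg() returns the 1-, 5-, and 15-minute load averages
    as floats; only the first is used here.
    """
    load1, _load5, _load15 = os.getloadavg()
    return load1 > threshold
```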
jeblair | fungi: awesome, thanks! | 20:24 |
fungi | i'm looking now to see where we can poke that into the source | 20:24 |
fungi | adding mocking and tests for it is likely to be much more work than the actual feature | 20:24 |
jeblair | fungi: a rare legit use of mock imho | 20:25 |
fungi | though i also need to disappear in a few minutes to work on dinner, so i'll see how far i get | 20:25 |
mordred | fungi: in zuul/executor/server.py you'll see in ExecutorExecuteWorker there is already some delay in there | 20:27 |
fungi | cool, i was looking in the right file at least ;) | 20:27 |
mordred | fungi: now - I'm not sure if THAT is the right place to put this | 20:28 |
mordred | fungi: yah - actually doesn't seem like a terrible place | 20:30 |
jeblair | mordred, fungi: i think there's 2 approaches: | 20:32 |
jeblair | a) avoid grabbing a job if we're over load -- you can do that in handleNoop, but it's some low-level gearman stuff -- you'll have to tell gear to send another pre_sleep packet and change the connection state to sleep. | 20:33 |
jeblair | b) unregister the job when the load is high, and reregister when it's low. you'll have to find some place to do this -- the executor doesn't exactly have a main loop. may need to be a new thread... not sure. | 20:34 |
jeblair | a might be the best approach; it's only a couple of lines of weird gearman stuff. | 20:34 |
fungi | i'll admit i'm fairly confused by what zuul.executor.server.ExecutorExecuteWorker.handleNoop() is doing. i expect i need to go digging in gear to be able to wrap my head around it | 20:35 |
fungi | since that method doesn't seem to actually get called by zuul itself | 20:35 |
jeblair | fungi: ya it's worth a look: https://git.openstack.org/cgit/openstack-infra/gear/tree/gear/__init__.py#n2108 | 20:35 |
jeblair | fungi: the weird gearman stuff i describe is actually in the next function -- basically, we want to do what we would do if we got a no_job packet in our noop handler if we decide we don't want any jobs. | 20:36 |
jeblair | oh you know what, that approach will not work | 20:36 |
jeblair | we still need to handle cancel jobs | 20:36 |
jeblair | so we have to do (b) | 20:37 |
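Editor's aside: approach (b), a governor that unregisters the function under high load and re-registers it when load drops, might look roughly like this. The worker object and its `registerFunction`/`unregisterFunction` methods are hypothetical stand-ins for the real gear worker API, and the loop and threshold details are illustrative:

```python
import os
import threading

class LoadGovernor(threading.Thread):
    """Periodically compare the 1-minute load average against a
    threshold and (un)register the executor:execute function.

    `worker` is a hypothetical object exposing registerFunction()
    and unregisterFunction(); it stands in for the real gear worker.
    """

    def __init__(self, worker, function="executor:execute",
                 max_load=30.0, interval=30.0):
        super().__init__(daemon=True)
        self.worker = worker
        self.function = function
        self.max_load = max_load
        self.interval = interval
        self.registered = True  # assume we start registered
        self._stopped = threading.Event()

    def run(self):
        while not self._stopped.is_set():
            self.check()
            self._stopped.wait(self.interval)

    def check(self):
        load = os.getloadavg()[0]
        if self.registered and load > self.max_load:
            # Too busy: stop advertising the function to gearman.
            self.worker.unregisterFunction(self.function)
            self.registered = False
        elif not self.registered and load <= self.max_load:
            # Load dropped: start accepting jobs again.
            self.worker.registerFunction(self.function)
            self.registered = True

    def stop(self):
        self._stopped.set()
```

As jeblair notes, this needs its own thread since the executor doesn't have a main loop; cancel requests keep working because only the execute function is withdrawn.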
fungi | though i'll need to come back to it, dinner calls. if someone has an implementation in the next hour-ish i'm happy to review, or i can pick it back up then | 20:37 |
jeblair | this has been helpful to talk through anyway | 20:37 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 20:45 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix bug with multiple project-templates https://review.openstack.org/508612 | 20:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 20:48 |
mordred | jeblair: ^^ there is first stab at precedence to priority - no tests - wanted to make sure that approach was solid before doing tests | 20:49 |
mordred | jeblair: also, re: your patch - I love it when hours of work result in a single colon | 20:49 |
jeblair | mordred: 613 lgtm | 20:51 |
mordred | jeblair: awesome. I'll poke at tests now | 20:51 |
jeblair | mordred: look at test_nodepool_failure | 20:51 |
jeblair | mordred: it pauses the fake nodepool and examines requests | 20:51 |
jeblair | mordred: so you can probably build on that with a test that puts a change in check and gate and then examines the requests | 20:52 |
mordred | ++ | 20:52 |
mordred | jeblair: if you have a sec to help me out ... | 21:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: SourceContext improvements https://review.openstack.org/508620 | 21:26 |
jeblair | mordred: ya | 21:26 |
mordred | jeblair: FakeNodepool has a getNodeRequests and a getNodes - but we put the priority in the path - so I think I need a getNodeIds or something which returns self.client.get_children(self.NODE_ROOT) | 21:26 |
mordred | or - REQUEST_ROOT, that is | 21:27 |
jeblair | mordred: the priority should be for the node request, so getNodeRequests should return something useful | 21:27 |
jeblair | lemme look at it | 21:27 |
mordred | oh - data['_oid'] perhaps | 21:27 |
mordred | yah | 21:27 |
jeblair | mordred: ya that looks like it should do it | 21:27 |
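Editor's aside: since the discussion says the priority is embedded in the request path (a zero-padded prefix on the request id), a test helper to recover it could be as simple as this hypothetical function:

```python
def request_priority_from_id(request_id):
    """Recover the priority from a request id like "100-0000000042".

    Hypothetical test helper: per the discussion above, the priority
    is the prefix of the request znode name, before the first "-".
    """
    priority, _, _sequence = request_id.partition("-")
    return int(priority)
```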
*** henry_ has joined #zuul | 21:33 | |
fungi | did something happen to drop the load average on the executors? swap utilization is still high but load is at/under 1.0 for most of them now | 21:34 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 21:35 |
fungi | well, spiking back up again now. ~10-ish | 21:36 |
*** henry_ has quit IRC | 21:37 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 21:43 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 21:49 |
*** hashar has quit IRC | 21:54 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 21:56 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 22:16 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix bug with multiple project-templates https://review.openstack.org/508612 | 23:07 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 23:10 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 23:10 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 23:11 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: SourceContext improvements https://review.openstack.org/508620 | 23:12 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Update zuul-changes script for v3 https://review.openstack.org/508553 | 23:12 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!