Friday, 2017-09-29

00:06 <SpamapS> Shrews: that was waffle house
00:07 <SpamapS> or the back of the karaoke bar
00:07 <SpamapS> not sure
00:11 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix sql reporting start/end times  https://review.openstack.org/508362
00:48 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Fix sql reporting start/end times  https://review.openstack.org/508362
02:51 <SpamapS> There seems to be a bug in stream.html or the streaming server.
02:51 <SpamapS> sometimes it thinks the stream has ended
02:51 <SpamapS> but I think it's just timing out
06:12 *** hashar has joined #zuul
08:51 <openstackgerrit> Krzysztof Klimonda proposed openstack-infra/zuul feature/zuulv3: Add zuul supplementary groups before setgid/setuid  https://review.openstack.org/508444
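
The change above addresses a classic privilege-drop ordering pitfall: supplementary groups can only be set while the process is still root, so they must be applied before setgid()/setuid(). A minimal sketch of the pattern (a generic illustration, not the change's actual code):

    import grp
    import os
    import pwd

    def drop_privileges(username):
        pw = pwd.getpwnam(username)
        # Supplementary group membership can only be changed while we
        # are still root, so set it before dropping gid/uid.
        groups = [g.gr_gid for g in grp.getgrall() if username in g.gr_mem]
        os.setgroups(groups)
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)
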
08:57 *** bhavik1 has joined #zuul
09:10 *** bhavik1 has quit IRC
10:59 *** jesusaur has quit IRC
11:03 *** jesusaur has joined #zuul
12:46 *** dkranz has joined #zuul
13:25 *** hashar is now known as hasharAway
13:56 <Shrews> jeblair: mordred: noting this here b/c #infra is so busy... i see the nodepool zk connection get suspended at 2017-09-29 11:20:57,681, and it was never resumed until i manually kicked nodepool two hours after that  (http://paste.openstack.org/show/622307/)
13:57 <Shrews> we see one case before that in the paste where it behaved properly. i'm not sure why we never got the connection back, even after we freed up space on the zk server
13:58 <Shrews> chalk it up to zk getting confused? it may be the same reason zuul is now not doing anything
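
For context, nodepool observes these transitions through a kazoo connection-state listener: SUSPENDED means kazoo is trying to reconnect and should normally return to CONNECTED on its own once the server is healthy again, which is why the stuck suspension here was surprising. A minimal standalone sketch of the mechanism (not nodepool's actual handler):

    import logging

    from kazoo.client import KazooClient
    from kazoo.protocol.states import KazooState

    log = logging.getLogger('zk')

    def connection_listener(state):
        # kazoo invokes this from its own thread on every transition.
        if state == KazooState.SUSPENDED:
            log.warning('zk connection suspended; kazoo is reconnecting')
        elif state == KazooState.LOST:
            log.error('zk session lost; ephemeral nodes (locks) are gone')
        else:  # KazooState.CONNECTED
            log.info('zk connection (re-)established')

    client = KazooClient(hosts='127.0.0.1:2181')
    client.add_listener(connection_listener)
    client.start()
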
13:58 <mordred> Shrews: good call on moving this in here ...
13:59 <mordred> infra-root: let's use #zuul for digging into deep issues like the zuul zk thing - and #openstack-infra for fielding job migration issues
14:00 <Shrews> mordred: ++
14:04 <fungi> sounds fine to me
14:05 <fungi> Shrews: do you think it was at all related to nodepool filling up its filesystem, or were those more likely separate problems?
14:05 <jeblair> is everything running now?
14:05 <fungi> oh, now i see you mean failure to recover after cleanup
14:05 <Shrews> fungi: i think that was probably the catalyst for the cascade of failures
14:06 <Shrews> fungi: right. after freeing space, nodepool's zk connection was still suspended
14:06 <Shrews> jeblair: i think zuul might be wedged? i don't see any progress now that nodepool is doing things
14:08 <Shrews> oh, maybe now it's progressing
14:08 <jeblair> it looks like zuul is stuck in a loop trying to return an unlocked node
14:09 <jeblair> we may lack an appropriate exception handler there
14:47 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Handle errors returning nodesets on canceled jobs  https://review.openstack.org/508532
14:47 <fungi> that looks promising
14:49 <fungi> lgtm
14:49 <jeblair> Shrews, mordred: ^ can you look at that?
14:49 <jeblair> i think we're going to want to force-merge, update, and restart zuul
14:50 <Shrews> looking
14:52 <Shrews> jeblair: i sort of feel like we should make sure we try to unlock the nodes, even with the exception
14:53 <Shrews> that way nodepool at least has a chance of cleaning up whatever remains
14:53 <jeblair> Shrews: that exception is "unlocking failed"; i'm not sure what else we can do?
14:54 <Shrews> jeblair: what if storeNode() fails?
14:54 <Shrews> jeblair: or if it's just a single node that can't be unlocked
14:56 <Shrews> maybe we should move the handling to _unlockNodes()... i can't unlock this one, but i'll try to unlock the rest, then bail
14:56 <jeblair> Shrews: okay, i'll work something up for review
15:01 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Always try to unlock nodes when returning  https://review.openstack.org/508532
15:01 <jeblair> Shrews: how's that look?
15:05 <Shrews> jeblair: doesn't that still leave the potential for leaving the nodes locked?
15:06 <Shrews> the scenario in my head: a 2-node nodeset, something happens with node A while in returnNodeSet(), and node B is left in a locked state
15:08 <jeblair> Shrews: i *think* we loop over all nodes in all cases?
15:08 <Shrews> maybe the safest thing is to make sure _unlockNodes() is always called (even on exception), but have that method ignore unlock exceptions and try to unlock all nodes
15:09 <Shrews> jeblair: oh, you're right. i saw that as raising an exception, not logging
15:09 <Shrews> duh
15:09 <Shrews> jeblair: +2
15:09 <jeblair> w00t
15:10 <Shrews> my brain is stuck at 7am i think  :)
15:14 <jeblair> Shrews: that's when i wrote the first patch ;)
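
The shape the fix converged on: attempt the state update for every node, then always fall through to an unlock pass that logs per-node failures instead of raising, so one stuck node can't leave the rest of the nodeset locked. A rough sketch of that logic (method and attribute names here are assumptions, not zuul's exact code):

    def _unlockNodes(self, nodes):
        for node in nodes:
            try:
                self.zk.unlockNode(node)
            except Exception:
                # Log and keep going so the remaining nodes still get
                # unlocked even if this one is stuck.
                self.log.exception("Error unlocking node: %s", node)

    def returnNodeSet(self, nodeset):
        try:
            for node in nodeset.getNodes():
                node.state = 'used'
                self.zk.storeNode(node)
        except Exception:
            self.log.exception("Error returning nodeset: %s", nodeset)
        finally:
            # Always attempt the unlock, even if storeNode() failed above.
            self._unlockNodes(nodeset.getNodes())
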
15:30 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Always try to unlock nodes when returning  https://review.openstack.org/508532
15:34 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add inheritance path to zuul vars  https://review.openstack.org/508543
15:40 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Add inheritance path to zuul vars  https://review.openstack.org/508543
15:45 <jeblair> i've restarted zuul
16:04 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update zuul-changes script for v3  https://review.openstack.org/508553
16:38 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient  https://review.openstack.org/508563
16:47 *** electrofelix has quit IRC
17:02 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient  https://review.openstack.org/508563
17:40 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects  https://review.openstack.org/508576
18:04 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects  https://review.openstack.org/508576
18:10 *** robled has quit IRC
18:10 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects  https://review.openstack.org/508576
18:12 *** robled has joined #zuul
18:51 *** hasharAway has quit IRC
18:52 *** hashar has joined #zuul
20:12 <jeblair> if anyone wants to write some zuul patches:
20:13 <jeblair> 1) map pipeline priority to nodepool request priority -- low=300, normal=200, high=100 (see the sketch below)
20:14 <jeblair> 2) load average limit in executor -- stop accepting executor:execute jobs if load average > configurable threshold (default 30)
20:16 <jeblair> i'm working on the fix and tests for the project-template issue
20:18 <Shrews> jeblair: i'd also like to add 3) add the review + patchset number to the NodeRequest object and change nodepool to output it in the 'request-list' command
20:18 <Shrews> and i'll happily work on that myself
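
For item 1, the mapping itself is tiny: nodepool fulfills numerically lower request priorities first, so high pipeline precedence gets the smallest number. A sketch of the idea (the constant names are placeholders, not necessarily zuul's real identifiers):

    # Pipeline precedence values as zuul might define them (placeholders).
    PRECEDENCE_HIGH = 0
    PRECEDENCE_NORMAL = 1
    PRECEDENCE_LOW = 2

    # Nodepool serves lower-numbered requests first.
    PRIORITY_MAP = {
        PRECEDENCE_HIGH: 100,
        PRECEDENCE_NORMAL: 200,
        PRECEDENCE_LOW: 300,
    }

    def request_priority(precedence):
        return PRIORITY_MAP[precedence]
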
20:23 <fungi> looks like we can check against os.getloadavg()[0] for a float of the one-minute load average
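
That is standard library behavior: os.getloadavg() returns the 1-, 5-, and 15-minute load averages as a tuple of floats, raising OSError where the value is unobtainable. For example:

    import os

    try:
        one_min, five_min, fifteen_min = os.getloadavg()
    except OSError:
        one_min = 0.0  # load average unavailable on this platform
    print(one_min)     # e.g. 0.42
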
20:24 <jeblair> fungi: awesome, thanks!
20:24 <fungi> i'm looking now to see where we can poke that into the source
20:24 <fungi> adding mocking and tests for it is likely to be much more work than the actual feature
20:25 <jeblair> fungi: a rare legit use of mock imho
20:25 <fungi> though i also need to disappear in a few minutes to work on dinner, so i'll see how far i get
20:27 <mordred> fungi: in zuul/executor/server.py you'll see ExecutorExecuteWorker already has some delay handling in it
20:27 <fungi> cool, i was looking in the right file at least ;)
20:28 <mordred> fungi: now - I'm not sure if THAT is the right place to put this
20:30 <mordred> fungi: yah - actually doesn't seem like a terrible place
20:32 <jeblair> mordred, fungi: i think there are 2 approaches:
20:33 <jeblair> a) avoid grabbing a job if we're over load -- you can do that in handleNoop, but it's some low-level gearman stuff -- you'll have to tell gear to send another pre_sleep packet and change the connection state to sleep.
20:34 <jeblair> b) unregister the job when the load is high, and reregister when it's low.  you'll have to find some place to do this -- the executor doesn't exactly have a main loop.  may need to be a new thread... not sure.
20:34 <jeblair> (a) might be the best approach; it's only a couple of lines of weird gearman stuff.
20:35 <fungi> i'll admit i'm fairly confused by what zuul.executor.server.ExecutorExecuteWorker.handleNoop() is doing. i expect i need to go digging in gear to be able to wrap my head around it
20:35 <fungi> since that method doesn't seem to actually get called by zuul itself
20:35 <jeblair> fungi: ya it's worth a look: https://git.openstack.org/cgit/openstack-infra/gear/tree/gear/__init__.py#n2108
20:36 <jeblair> fungi: the weird gearman stuff i describe is actually in the next function -- basically, if we decide we don't want any jobs, our noop handler should do what it would do on receiving a no_job packet.
20:36 <jeblair> oh, you know what, that approach will not work
20:36 <jeblair> we still need to handle job cancellations
20:37 <jeblair> so we have to do (b)
20:37 <fungi> though i'll need to come back to it; dinner calls. if someone has an implementation in the next hour-ish i'm happy to review, or i can pick it back up then
20:37 <jeblair> this has been helpful to talk through anyway
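
A rough sketch of approach (b): a small governor thread that unregisters the executor:execute gearman function when the one-minute load average crosses the threshold and reregisters it when load subsides, leaving in-flight builds (and cancel handling) untouched. The worker method names follow the gear library's Worker API, but treat the whole thing as an illustration rather than zuul's eventual implementation:

    import os
    import threading

    class LoadGovernor:
        def __init__(self, worker, max_load=30.0, interval=30):
            self.worker = worker          # assumed to be a gear.Worker
            self.max_load = max_load
            self.interval = interval      # seconds between checks
            self.accepting = True
            self._stop = threading.Event()

        def _run(self):
            while not self._stop.wait(self.interval):
                load = os.getloadavg()[0]
                if self.accepting and load > self.max_load:
                    # Stop advertising the function; running builds and
                    # cancel requests are unaffected.
                    self.worker.unRegisterFunction('executor:execute')
                    self.accepting = False
                elif not self.accepting and load <= self.max_load:
                    self.worker.registerFunction('executor:execute')
                    self.accepting = True

        def start(self):
            threading.Thread(target=self._run, daemon=True).start()

        def stop(self):
            self._stop.set()
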
20:45 <openstackgerrit> Clark Boylan proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient  https://review.openstack.org/508563
20:46 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix bug with multiple project-templates  https://review.openstack.org/508612
20:48 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority  https://review.openstack.org/508613
20:49 <mordred> jeblair: ^^ that's a first stab at precedence to priority - no tests - wanted to make sure the approach was solid before writing tests
20:49 <mordred> jeblair: also, re: your patch - I love it when hours of work result in a single colon
20:51 <jeblair> mordred: 613 lgtm
20:51 <mordred> jeblair: awesome. I'll poke at tests now
20:51 <jeblair> mordred: look at test_nodepool_failure
20:51 <jeblair> mordred: it pauses the fake nodepool and examines requests
20:52 <jeblair> mordred: so you can probably build on that with a test that puts a change in check and gate and then examines the requests
20:52 <mordred> ++
21:25 <mordred> jeblair: if you have a sec to help me out ...
21:26 <openstackgerrit> James E. Blair proposed openstack-infra/zuul feature/zuulv3: SourceContext improvements  https://review.openstack.org/508620
21:26 <jeblair> mordred: ya
21:26 <mordred> jeblair: FakeNodepool has a getNodeRequests and a getNodes - but we put the priority in the path - so I think I need a getNodeIds or something which returns self.client.get_children(self.NODE_ROOT)
21:27 <mordred> or - REQUEST_ROOT, that is
21:27 <jeblair> mordred: the priority should be for the node request, so getNodeRequests should return something useful
21:27 <jeblair> lemme look at it
21:27 <mordred> oh - data['_oid'] perhaps
21:27 <mordred> yah
21:27 <jeblair> mordred: ya that looks like it should do it
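
Piecing together jeblair's suggestion and the _oid detail, the test might look roughly like this: pause the fake nodepool so requests queue up unfulfilled, enqueue a change, and assert on the priority prefix of each request's id. Everything here (helper names, the '_oid' format) is inferred from the conversation above, not verified against zuul's test suite:

    def test_nodepool_priority(self):
        # Hold requests unfulfilled so we can inspect them.
        self.fake_nodepool.paused = True

        A = self.fake_gerrit.addFakeChange('org/project', 'master', 'A')
        self.fake_gerrit.addEvent(A.getPatchsetCreatedEvent(1))
        self.waitUntilSettled()

        reqs = self.fake_nodepool.getNodeRequests()
        # Request ids are sequence nodes under a priority-prefixed
        # path, e.g. '200-0000000001' for a normal-precedence pipeline.
        self.assertTrue(reqs[0]['_oid'].startswith('200-'))
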
21:33 *** henry_ has joined #zuul
21:34 <fungi> did something happen to drop the load average on the executors? swap utilization is still high but load is at/under 1.0 for most of them now
21:35 <openstackgerrit> Merged openstack-infra/zuul-jobs master: Add helpful error message about required-projects  https://review.openstack.org/508576
21:36 <fungi> well, spiking back up again now. ~10-ish
21:37 *** henry_ has quit IRC
21:43 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority  https://review.openstack.org/508613
21:49 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority  https://review.openstack.org/508613
21:54 *** hashar has quit IRC
21:56 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority  https://review.openstack.org/508613
22:16 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate  https://review.openstack.org/508629
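
The title of that last change describes a common Python pitfall: mutating a dict (e.g. from another thread) while iterating it raises "RuntimeError: dictionary changed size during iteration". The usual defense is to iterate over a snapshot, roughly as below (an illustration of the pattern, not the patch's actual diff):

    builds = {'b1': 'running', 'b2': 'paused'}

    # list(...) snapshots the items up front, so the loop stays safe
    # even if the dict is mutated mid-iteration.
    for build_id, state in list(builds.items()):
        if state == 'paused':
            del builds[build_id]  # would raise RuntimeError without the copy
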
23:07 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Fix bug with multiple project-templates  https://review.openstack.org/508612
23:10 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority  https://review.openstack.org/508613
23:10 <openstackgerrit> Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate  https://review.openstack.org/508629
23:11 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority  https://review.openstack.org/508613
23:12 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: SourceContext improvements  https://review.openstack.org/508620
23:12 <openstackgerrit> Merged openstack-infra/zuul feature/zuulv3: Update zuul-changes script for v3  https://review.openstack.org/508553
