Nick | Message | Time |
---|---|---|
SpamapS | Shrews: that was waffle house | 00:06 |
SpamapS | or the back of the karaoke bar | 00:07 |
SpamapS | not sure | 00:07 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix sql reporting start/end times https://review.openstack.org/508362 | 00:11 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix sql reporting start/end times https://review.openstack.org/508362 | 00:48 |
SpamapS | There seems to be a bug in stream.html or the streaming server. | 02:51 |
SpamapS | sometimes it thinks the stream has ended | 02:51 |
SpamapS | but I think it's just timing out | 02:51 |
*** hashar has joined #zuul | 06:12 | |
openstackgerrit | Krzysztof Klimonda proposed openstack-infra/zuul feature/zuulv3: Add zuul supplementary groups before setgid/setuid https://review.openstack.org/508444 | 08:51 |
*** bhavik1 has joined #zuul | 08:57 | |
*** bhavik1 has quit IRC | 09:10 | |
*** jesusaur has quit IRC | 10:59 | |
*** jesusaur has joined #zuul | 11:03 | |
*** dkranz has joined #zuul | 12:46 | |
*** hashar is now known as hasharAway | 13:25 | |
Shrews | jeblair: mordred: noting this here b/c #infra is so busy... I see the nodepool zk connection get suspended at 2017-09-29 11:20:57,681 and it never was unsuspended until i manually kicked nodepool two hours after that (http://paste.openstack.org/show/622307/) | 13:56 |
Shrews | we see 1 case before that in the paste where it behaved properly. i'm not sure why we would never get the connection back, even after we freed up space on the zk server | 13:57 |
Shrews | chalk it up to zk getting confused? it may be the same reason zuul is now not doing anything | 13:58 |
mordred | Shrews: good call on in here ... | 13:58 |
mordred | infra-root: let's use #zuul for digging into deep issues like the zuul zk thing - and #openstack-infra for fielding job migration issues | 13:59 |
Shrews | mordred: ++ | 14:00 |
fungi | sounds fine to me | 14:04 |
fungi | Shrews: do you think it was at all related to nodepool filling up its filesystem, or were those more likely separate problems? | 14:05 |
jeblair | is everything running now? | 14:05 |
fungi | oh, now i see you mean failure to recover after cleanup | 14:05 |
Shrews | fungi: i think that was probably the catalyst for the cascade of failures | 14:05 |
Shrews | fungi: right. after freeing space, nodepool was still suspended in its zk connection | 14:06 |
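Editor's aside: the "suspended and never unsuspended" behavior Shrews describes matches the ZooKeeper session-state pattern. Below is a minimal, hypothetical sketch of a connection-state listener, modeled on Kazoo's listener API but using a stand-in state enum rather than the real kazoo types, that makes recovery observable instead of leaving the client apparently stuck:

```python
import enum
import logging
import threading

class ZKState(enum.Enum):
    # Stand-in for kazoo.client.KazooState values.
    CONNECTED = "connected"
    SUSPENDED = "suspended"
    LOST = "lost"

class ConnectionWatcher:
    """Track ZooKeeper session state the way a Kazoo listener would.

    A client that logs SUSPENDED but never re-checks its state can
    look stuck even after the server recovers; exposing an event
    lets callers wait for (or assert on) reconnection.
    """

    def __init__(self):
        self.log = logging.getLogger("zk.watcher")
        self._connected = threading.Event()

    def listener(self, state):
        # Kazoo invokes listeners from its connection thread.
        if state is ZKState.SUSPENDED:
            self.log.warning("ZooKeeper connection suspended")
            self._connected.clear()
        elif state is ZKState.LOST:
            self.log.error("ZooKeeper session lost")
            self._connected.clear()
        else:
            self.log.info("ZooKeeper connection (re)established")
            self._connected.set()

    def wait_for_connection(self, timeout=None):
        return self._connected.wait(timeout)
```

With a watcher like this, the two-hour hang would have shown up as a wait that never completed rather than silence.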
Shrews | jeblair: i think zuul might be wedged? i don't see any progress now that nodepool is doing things | 14:06 |
Shrews | oh, maybe now it's progressing | 14:08 |
jeblair | it looks like zuul is stuck in a loop trying to return an unlocked node | 14:08 |
jeblair | we may lack an appropriate exception handler there | 14:09 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Handle errors returning nodesets on canceled jobs https://review.openstack.org/508532 | 14:47 |
fungi | that looks promising | 14:47 |
fungi | lgtm | 14:49 |
jeblair | Shrews, mordred: ^ can you look at that? | 14:49 |
jeblair | i think we're going to want to force-merge; update; and restart zuul | 14:49 |
Shrews | looking | 14:50 |
Shrews | jeblair: i sort of feel like we should make sure we try to unlock the nodes, even with the exception | 14:52 |
Shrews | that way nodepool at least has a chance of cleaning up any remains | 14:53 |
jeblair | Shrews: that exception is "unlocking failed"; i'm not sure what else we can do? | 14:53 |
Shrews | jeblair: what if storeNode() fails? | 14:54 |
Shrews | jeblair: or if it's just a single node that can't be unlocked | 14:54 |
Shrews | maybe we should move the handling to _unlockNodes()... i can't unlock this one, but going to try to unlock the rest, then bail | 14:56 |
jeblair | Shrews: okay, i'll work something up for review | 14:56 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Always try to unlock nodes when returning https://review.openstack.org/508532 | 15:01 |
jeblair | Shrews: how's that look? | 15:01 |
Shrews | jeblair: doesn't that still leave the potential for leaving the nodes locked? | 15:05 |
Shrews | the for instance in my head: a 2-node nodeset, something happens with node A while in returnNodeSet(), node B is left in a locked state | 15:06 |
jeblair | Shrews: i *think* we loop over all nodes in all cases? | 15:08 |
Shrews | maybe the safest thing is to make sure _unlockNodes() is always called (even on exception) but that method will ignore unlock exceptions and try to unlock all nodes | 15:08 |
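Editor's aside: the behavior Shrews is proposing here, best-effort unlocking that tolerates a single bad node, could be sketched like this. Names such as `unlock_node` and `UnlockError` are illustrative stand-ins, not Zuul's real API:

```python
import logging

log = logging.getLogger("zuul.nodepool")

class UnlockError(Exception):
    """Illustrative stand-in for a real per-node unlock failure."""

def unlock_node(node):
    # Placeholder for the real per-node unlock call.
    if node.get("broken"):
        raise UnlockError("cannot unlock %s" % node["id"])
    node["locked"] = False

def unlock_nodes(nodes):
    """Best-effort unlock: try every node, log failures, never raise.

    One bad node should not leave the rest of the nodeset locked;
    nodepool then has a chance to clean up whatever was released.
    """
    for node in nodes:
        try:
            unlock_node(node)
        except Exception:
            log.exception("Error unlocking node %s", node.get("id"))
```

This is the "ignore unlock exceptions but try every node" shape that the merged change (508532) settled on.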
Shrews | jeblair: oh, you're right. i saw that as raising an exception, not logging | 15:09 |
Shrews | duh | 15:09 |
Shrews | jeblair: +2 | 15:09 |
jeblair | w00t | 15:09 |
Shrews | my brain is stuck at 7am i think :) | 15:10 |
jeblair | Shrews: that's when i wrote the first patch ;) | 15:14 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Always try to unlock nodes when returning https://review.openstack.org/508532 | 15:30 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Add inheritance path to zuul vars https://review.openstack.org/508543 | 15:34 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Add inheritance path to zuul vars https://review.openstack.org/508543 | 15:40 |
jeblair | i've restarted zuul | 15:45 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Update zuul-changes script for v3 https://review.openstack.org/508553 | 16:04 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 16:38 |
*** electrofelix has quit IRC | 16:47 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 17:02 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 17:40 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 18:04 |
*** robled has quit IRC | 18:10 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 18:10 |
*** robled has joined #zuul | 18:12 | |
*** robled has joined #zuul | 18:12 | |
*** hasharAway has quit IRC | 18:51 | |
*** hashar has joined #zuul | 18:52 | |
jeblair | if anyone wants to write some zuul patches: | 20:12 |
jeblair | 1) map pipeline priority to nodepool request priority -- low=300, normal=200, high=100 | 20:13 |
jeblair | 2) load average limit in executor -- stop accepting executor:execute jobs if load average > configurable threshold (default 30) | 20:14 |
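Editor's aside: the precedence-to-priority mapping in item 1 is small enough to sketch directly. This is a hypothetical helper, and Zuul's actual attribute names may differ; the key point is that nodepool serves lower numbers first:

```python
# Hypothetical mapping helper; Zuul's real option names may differ.
PRECEDENCE_TO_PRIORITY = {
    "low": 300,
    "normal": 200,
    "high": 100,
}

def request_priority(precedence):
    """Map a pipeline precedence to a nodepool request priority.

    Nodepool serves lower numbers first, so high precedence gets
    the smallest number. Unknown values fall back to normal.
    """
    return PRECEDENCE_TO_PRIORITY.get(precedence, 200)
```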
jeblair | i'm working on the fix/tests to/for the project-template issue | 20:16 |
Shrews | jeblair: i'd also like to add #3) add review+PS # to NodeRequest object and change nodepool to output this for the 'request-list' command | 20:18 |
Shrews | and i'll happily work on that myself | 20:18 |
fungi | looks like we can check against os.getloadavg()[0] for a float of the one-minute load average | 20:23 |
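Editor's aside: a minimal sketch of the check fungi describes, using `os.getloadavg()` from the standard library. The 30.0 default is the threshold jeblair suggested above; the function name is illustrative:

```python
import os

# Default threshold suggested in the discussion above.
LOAD_THRESHOLD = 30.0

def over_load(threshold=LOAD_THRESHOLD):
    """Return True if the 1-minute load average exceeds the threshold.

    os.getloadavg() returns the 1-, 5-, and 15-minute load averages
    as floats; only the first is used here.
    """
    load1, _load5, _load15 = os.getloadavg()
    return load1 > threshold
```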
jeblair | fungi: awesome, thanks! | 20:24 |
fungi | i'm looking now to see where we can poke that into the source | 20:24 |
fungi | adding mocking and tests for it is likely to be much more work than the actual feature | 20:24 |
jeblair | fungi: a rare legit use of mock imho | 20:25 |
fungi | though i also need to disappear in a few minutes to work on dinner, so i'll see how far i get | 20:25 |
mordred | fungi: in zuul/executor/server.py you'll see in ExecutorExecuteWorker there is already some delay in there | 20:27 |
fungi | cool, i was looking in the right file at least ;) | 20:27 |
mordred | fungi: now - I'm not sure if THAT is the right place to put this | 20:28 |
mordred | fungi: yah - actually doesn't seem like a terrible place | 20:30 |
jeblair | mordred, fungi: i think there's 2 approaches: | 20:32 |
jeblair | a) avoid grabbing a job if we're over load -- you can do that in handleNoop, but it's some low-level gearman stuff -- you'll have to tell gear to send another pre_sleep packet and change the connection state to sleep. | 20:33 |
jeblair | b) unregister the job when the load is high, and reregister when it's low. you'll have to find some place to do this -- the executor doesn't exactly have a main loop. may need to be a new thread... not sure. | 20:34 |
jeblair | a might be the best approach; it's only a couple of lines of weird gearman stuff. | 20:34 |
fungi | i'll admit i'm fairly confused by what zuul.executor.server.ExecutorExecuteWorker.handleNoop() is doing. i expect i need to go digging in gear to be able to wrap my head around it | 20:35 |
fungi | since that method doesn't seem to actually get called by zuul itself | 20:35 |
jeblair | fungi: ya it's worth a look: https://git.openstack.org/cgit/openstack-infra/gear/tree/gear/__init__.py#n2108 | 20:35 |
jeblair | fungi: the weird gearman stuff i describe is actually in the next function -- basically, we want to do what we would do if we got a no_job packet in our noop handler if we decide we don't want any jobs. | 20:36 |
jeblair | oh you know what, that approach will not work | 20:36 |
jeblair | we still need to handle cancel jobs | 20:36 |
jeblair | so we have to do (b) | 20:37 |
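Editor's aside: approach (b), a governor that unregisters the function under high load and re-registers it when load drops, might look roughly like this. The worker object and its `registerFunction`/`unregisterFunction` methods are hypothetical stand-ins for the real gear worker API, and the loop and threshold details are illustrative:

```python
import os
import threading

class LoadGovernor(threading.Thread):
    """Periodically compare the 1-minute load average against a
    threshold and (un)register the executor:execute function.

    `worker` is a hypothetical object exposing registerFunction()
    and unregisterFunction(); it stands in for the real gear worker.
    """

    def __init__(self, worker, function="executor:execute",
                 max_load=30.0, interval=30.0):
        super().__init__(daemon=True)
        self.worker = worker
        self.function = function
        self.max_load = max_load
        self.interval = interval
        self.registered = True  # assume we start registered
        self._stopped = threading.Event()

    def run(self):
        while not self._stopped.is_set():
            self.check()
            self._stopped.wait(self.interval)

    def check(self):
        load = os.getloadavg()[0]
        if self.registered and load > self.max_load:
            # Too busy: stop advertising the function to gearman.
            self.worker.unregisterFunction(self.function)
            self.registered = False
        elif not self.registered and load <= self.max_load:
            # Load dropped: start accepting jobs again.
            self.worker.registerFunction(self.function)
            self.registered = True

    def stop(self):
        self._stopped.set()
```

As jeblair notes, this needs its own thread since the executor doesn't have a main loop; cancel requests keep working because only the execute function is withdrawn.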
fungi | though i'll need to come back to it, dinner calls. if someone has an implementation in the next hour-ish i'm happy to review, or i can pick it back up then | 20:37 |
jeblair | this has been helpful to talk through anyway | 20:37 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul-jobs master: Make fetch-tox-output more resilient https://review.openstack.org/508563 | 20:45 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Fix bug with multiple project-templates https://review.openstack.org/508612 | 20:46 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 20:48 |
mordred | jeblair: ^^ there is first stab at precedence to priority - no tests - wanted to make sure that approach was solid before doing tests | 20:49 |
mordred | jeblair: also, re: your patch - I love it when hours of work result in a single colon | 20:49 |
jeblair | mordred: 613 lgtm | 20:51 |
mordred | jeblair: awesome. I'll poke at tests now | 20:51 |
jeblair | mordred: look at test_nodepool_failure | 20:51 |
jeblair | mordred: it pauses the fake nodepool and examines requests | 20:51 |
jeblair | mordred: so you can probably build on that with a test that puts a change in check and gate and then examines the requests | 20:52 |
mordred | ++ | 20:52 |
mordred | jeblair: if you have a sec to help me out ... | 21:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: SourceContext improvements https://review.openstack.org/508620 | 21:26 |
jeblair | mordred: ya | 21:26 |
mordred | jeblair: FakeNodepool has a getNodeRequests and a getNodes - but we put the priority in the path - so I think I need a getNodeIds or something which returns self.client.get_children(self.NODE_ROOT) | 21:26 |
mordred | or - REQUEST_ROOT, that is | 21:27 |
jeblair | mordred: the priority should be for the node request, so getNodeRequests should return something useful | 21:27 |
jeblair | lemme look at it | 21:27 |
mordred | oh - data['_oid'] perhaps | 21:27 |
mordred | yah | 21:27 |
jeblair | mordred: ya that looks like it should do it | 21:27 |
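Editor's aside: since the discussion says the priority is embedded in the request path (a zero-padded prefix on the request id), a test helper to recover it could be as simple as this hypothetical function:

```python
def request_priority_from_id(request_id):
    """Recover the priority from a request id like "100-0000000042".

    Hypothetical test helper: per the discussion above, the priority
    is the prefix of the request znode name, before the first "-".
    """
    priority, _, _sequence = request_id.partition("-")
    return int(priority)
```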
*** henry_ has joined #zuul | 21:33 | |
fungi | did something happen to drop the load average on the executors? swap utilization is still high but load is at/under 1.0 for most of them now | 21:34 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add helpful error message about required-projects https://review.openstack.org/508576 | 21:35 |
fungi | well, spiking back up again now. ~10-ish | 21:36 |
*** henry_ has quit IRC | 21:37 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 21:43 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 21:49 |
*** hashar has quit IRC | 21:54 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 21:56 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 22:16 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix bug with multiple project-templates https://review.openstack.org/508612 | 23:07 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 23:10 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Protect against builds dict changing while we iterate https://review.openstack.org/508629 | 23:10 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Map pipeline precedence to nodepool node priority https://review.openstack.org/508613 | 23:11 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: SourceContext improvements https://review.openstack.org/508620 | 23:12 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Update zuul-changes script for v3 https://review.openstack.org/508553 | 23:12 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!