dmsimard | Did we know about https://github.com/ansible/ansible-runner ? Because I didn't. | 00:46 |
dmsimard | https://github.com/ansible/ansible-runner#ansible-runner-as-a-python-interface | 00:46 |
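As a rough sketch of the Python interface linked above, using ansible-runner's documented run() entry point; the private_data_dir and playbook paths here are placeholders, not anything from the discussion:

    import ansible_runner

    # Placeholder paths for illustration only.
    r = ansible_runner.run(private_data_dir='/tmp/demo', playbook='site.yml')
    print(r.status)  # e.g. 'successful' or 'failed'
    print(r.rc)      # exit code of the underlying ansible-playbook run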
*** dkranz has quit IRC | 01:30 | |
*** swest has quit IRC | 01:36 | |
*** swest has joined #zuul | 01:50 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: mqtt: add basic reporter https://review.openstack.org/535543 | 01:52 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: mqtt: add basic reporter https://review.openstack.org/535543 | 02:09 |
*** spsurya has joined #zuul | 02:38 | |
*** snapiri has joined #zuul | 04:38 | |
*** swest has quit IRC | 05:06 | |
*** swest has joined #zuul | 05:07 | |
*** bhavik1 has joined #zuul | 05:58 | |
*** bhavik1 has quit IRC | 06:31 | |
*** Guest1988 has joined #zuul | 07:12 | |
*** spsurya has quit IRC | 07:14 | |
*** Guest1988 has quit IRC | 07:17 | |
*** ssbarnea_ has joined #zuul | 07:35 | |
*** gtema has joined #zuul | 08:01 | |
*** hashar has joined #zuul | 08:05 | |
*** gtema has quit IRC | 08:15 | |
*** gtema has joined #zuul | 08:17 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: WIP: Improve zuul stream tests https://review.openstack.org/565169 | 08:38 |
[GNU] | pi-u | 08:56 |
[GNU] | uh... sorry :) | 08:56 |
[GNU] | good morning | 08:56 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: WIP: Improve zuul stream tests https://review.openstack.org/565169 | 08:59 |
*** spsurya has joined #zuul | 09:01 | |
*** electrofelix has joined #zuul | 09:06 | |
*** yolanda_ is now known as yolanda | 09:39 | |
*** tobasco is now known as tobasco-afk | 10:11 | |
*** hashar is now known as hasharAway | 10:16 | |
*** tobasco-afk is now known as tobasco | 10:58 | |
*** gtema has quit IRC | 11:09 | |
*** AJaeger is now known as AJaeger_away | 11:39 | |
*** elyezer has quit IRC | 11:51 | |
*** gtema has joined #zuul | 11:55 | |
*** hasharAway is now known as hashar | 11:58 | |
*** elyezer has joined #zuul | 12:03 | |
*** dkranz has joined #zuul | 12:22 | |
*** rlandy has joined #zuul | 12:33 | |
*** gtema has quit IRC | 12:39 | |
*** gouthamr has quit IRC | 12:59 | |
*** gtema has joined #zuul | 13:01 | |
*** gouthamr has joined #zuul | 13:02 | |
*** pwhalen has quit IRC | 13:17 | |
*** pwhalen has joined #zuul | 13:21 | |
*** pwhalen has joined #zuul | 13:21 | |
*** ssbarnea_ has quit IRC | 13:34 | |
*** lsell has left #zuul | 13:34 | |
*** ssbarnea_ has joined #zuul | 13:36 | |
*** gtema has quit IRC | 13:39 | |
*** gouthamr has quit IRC | 13:42 | |
*** gouthamr has joined #zuul | 13:45 | |
*** persia has quit IRC | 15:03 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Static driver: Use node states defined in zk.py https://review.openstack.org/565238 | 15:14 |
*** xinliang has quit IRC | 15:17 | |
*** xinliang has joined #zuul | 15:17 | |
*** hashar is now known as hasharAway | 15:23 | |
openstackgerrit | Merged openstack-infra/nodepool master: Static driver: Use node states defined in zk.py https://review.openstack.org/565238 | 15:53 |
*** electrofelix has quit IRC | 15:54 | |
SpamapS | Hey, this is kinda cool | 15:55 |
SpamapS | https://pyup.io/ | 15:55 |
SpamapS | Somebody was like "Well nobody relies on distro libs anymore.. how could we get the same benefits but without all that open collaboration?" :-P | 15:56 |
clarkb | corvus: ok I think I may have a non-naive fix for the handling of non-live item layouts. It fails at least one test but I think the change in behavior is more accurate now? Once the current run finishes I'll push it up for others to look at assuming I don't have any new surprises | 16:06 |
corvus | clarkb: cool, i'll take a look | 16:09 |
dmsimard | corvus: it's no longer a time sensitive issue but I was wondering if you had an opinion on this issue I triggered yesterday: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-04-29.log.html#t2018-04-29T23:28:44 | 16:18 |
dmsimard | I suppose creating a branch from an old reference (with a now broken zuul configuration) is somewhat similar to doing a force push on an active branch with a broken config | 16:19 |
corvus | dmsimard: yeah, "don't do that" would be my first advice :) | 16:19 |
dmsimard | corvus: so the fix is indeed to force push the fixes ? | 16:19 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Don't generate layouts on non live items https://review.openstack.org/565251 | 16:20 |
dmsimard | because even https://review.openstack.org/#/c/565082/ (which cherry-picks all the fixes) doesn't load, but only for that particular branch | 16:20 |
corvus | dmsimard: i don't think you need to force-push them; if there's only one repo involved, you should be able to fix it in that repo. | 16:20 |
clarkb | corvus: ^ that should be in a "fails at least one test in an expected manner" state, as that illustrates the behavior change well. I figure once we have that data we can consider whether that needs to be addressed (eg don't change that) or whether the new behavior is more correct? | 16:21 |
dmsimard | corvus: that's exactly the problem though, it doesn't look like I'm able to fix it without force pushing | 16:21 |
clarkb | dmsimard: you aren't able to push a single change that fixes the config? | 16:22 |
clarkb | dmsimard: worst case you disable the jobs, then roll forward in a logical change by change basis | 16:22 |
dmsimard | clarkb: I submitted https://review.openstack.org/#/c/565082/ and none of the jobs are running, scheduler appears to short circuit on http://paste.openstack.org/raw/720098/ | 16:23 |
clarkb | dmsimard: that is an invalid nodeset | 16:23 |
clarkb | should be debian-stable now | 16:23 |
dmsimard | clarkb: I know, 565082 fixes that | 16:24 |
clarkb | so now we figure out where jessie is coming from | 16:24 |
corvus | dmsimard: the error says it's on stable/0.14.7, but the change you linked is stable/0.14 | 16:24 |
dmsimard | corvus: aye, I screwed up and ended up settling on a 0.14 branch, 0.14.7 was deleted | 16:25 |
clarkb | ya looking at that change zuul +1'd it | 16:25 |
dmsimard | corvus: there's another one for 0.14 | 16:25 |
dmsimard | clarkb: it's +1 but only that sphinx job ran, nothing else (not even tox-pep8 and stuff defined in project-config) | 16:25 |
corvus | dmsimard: so you created 0.14.7 with a broken config, then created 0.14, then deleted 0.14.7? | 16:25 |
dmsimard | corvus: I created 0.14.7, deleted it and created 0.14 -- both had the same HEAD | 16:26 |
corvus | dmsimard: the error you linked is before you uploaded 565082; that error can't be from that change. | 16:27 |
dmsimard | corvus: then I cherry-picked the one change I wanted to release in 0.14.7 but rabbit holed myself into non-working integration jobs so I cherry-picked the different fixes (including the debian-jessie -> debian-stable change) | 16:27 |
dmsimard | corvus: likely from https://review.openstack.org/#/c/565081/ | 16:27 |
dmsimard | corvus: (or outright at the branch creation? I'm not sure if that would've triggered it, let me go look at the logs again) | 16:27 |
dmsimard | I should've probably just manually hacked a tag together without trying to do it properly through a branch, gerrit and stuff :( | 16:28 |
clarkb | to be clear the above error isn't really an issue right now right? zuul is running at least one job successfully and +1'ing things. It is not however running all of the expected jobs | 16:29 |
clarkb | so we want to figure out why those jobs aren't running | 16:29 |
dmsimard | clarkb: sort of, I expected the whole scheduler to be borked but I tested to make sure it wasn't (could've been a nice DoS vector..) | 16:30 |
dmsimard | it's only that specific branch for some reason | 16:30 |
corvus | 2018-04-30 16:26:47,501 DEBUG zuul.layout: Pipeline variant <Job openstack-tox-pep8 branches: {MatchAny:{BranchMatcher:^master}} source: openstack-infra/project-config/zuul.d/zuul-jobs.yaml@master#2664> did not match <Change 0x7f50561a9a20 565082,1> | 16:31 |
dmsimard | there's a branch matcher? :/ | 16:31 |
corvus | there are several things about that line which are confusing to me | 16:32 |
dmsimard | oh, for openstack-tox-pep8 | 16:32 |
corvus | that's not where the pipeline variant was defined, for one. | 16:34 |
corvus | the line number is correct for zuul.d/projects.yaml, not zuul-jobs.yaml | 16:37 |
corvus | so the filename is wrong there... | 16:37 |
corvus | dmsimard: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/projects.yaml#n2664 | 16:38 |
corvus | dmsimard: you have branch matchers for all those jobs. | 16:38 |
clarkb | ah it's the nova problem again | 16:39 |
corvus | build-sphinx arrives via one of the project-templates you have there, so it doesn't have a matcher. that's why it's the only one that ran. | 16:39 |
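To illustrate the matching behavior corvus describes, a minimal sketch (not Zuul's actual implementation) of why a ^master branch matcher skips a stable/0.14 change while an unmatched template job still runs:

    import re

    # A job variant attached with a branch pattern only applies to changes
    # whose branch matches that pattern.
    def variant_applies(branch_pattern, change_branch):
        return re.match(branch_pattern, change_branch) is not None

    print(variant_applies('^master', 'stable/0.14'))  # False: the tox variants are skipped
    print(variant_applies('^master', 'master'))       # True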
corvus | clarkb: the nova problem? | 16:39 |
dmsimard | huh, someone added a matcher | 16:39 |
dmsimard | I wasn't aware of that | 16:39 |
clarkb | corvus: branch matchers outside the config change update confuse people when the config update doesn't do what they expect | 16:40 |
clarkb | corvus: basically stop using branch matchers unless you know you really need them seemed to be the outcome | 16:40 |
corvus | dmsimard: https://review.openstack.org/543503 | 16:40 |
dmsimard | Oh, we have them redefined in the feature/1.0 branch | 16:41 |
corvus | so the only bug i see here is that the source context filenames aren't correct | 16:41 |
clarkb | corvus: shouldn't the jobs defined in https://review.openstack.org/#/c/565082/1/zuul.d/layout.yaml run as well? | 16:41 |
dmsimard | corvus: okay, so that explains the lack of tox jobs -- what about the ara jobs not running themselves ? i.e, as defined in https://review.openstack.org/#/c/565082/1/zuul.d/layout.yaml | 16:42 |
dmsimard | clarkb beat me to it | 16:42 |
corvus | hrm, i don't see why they aren't running | 16:43 |
clarkb | updateConfig() checks if the filename starts with zuul.d/; does gerrit drop the project name from the beginning of filenames in the events? | 16:45 |
corvus | clarkb: it logged that it created a new configuration for that change | 16:46 |
corvus | i just did a recheck, and it logged: | 16:46 |
corvus | 2018-04-30 16:26:42,198 DEBUG zuul.ConfigLoader: Created layout id b33b88923f204b63be46086621565244 | 16:46 |
corvus | (the lines around there are mostly related to job selection for that change) | 16:46 |
dmsimard | You can see the traces with: grep -B50 "openstack/ara/zuul.d/jobs.yaml@stable/0.14" debug.log, the last one occurred around 40 minutes ago which doesn't seem to correlate with anything | 16:47 |
corvus | dmsimard: that correlates with a tenant reconfiguration event. it failed which means zuul is running on an old config (the last one before your 0.14.7 branch creation) | 16:52 |
corvus | perhaps there is an error with dynamic reconfigurations on branches which are not in the running config | 16:52 |
dmsimard | Yup, I definitely found a bug I think. Sorry :/ | 16:54 |
corvus | i'm not sure it's a bug | 16:54 |
dmsimard | corvus: wait, you're saying this is holding back the entire scheduler ? my assessment was that it only impacted that specific branch/project | 16:54 |
corvus | dmsimard: oh, no, broken configuration is a big deal :) | 16:55 |
dmsimard | corvus: I tested a job change in the master branch of ara to make sure zuul was not entirely broken and it worked just fine: https://review.openstack.org/#/c/565083/ | 16:55 |
corvus | okay, yeah, _loadDynamicProjectData only loads configuration from branches zuul knows about | 16:56 |
corvus | so that explains why the new jobs are running -- zuul won't load any configuration from that branch until it's able to reload its config. it can't reload the config until the configuration on that branch is fixed. | 16:57 |
corvus | or rather, the new jobs "aren't" running. | 16:57 |
corvus | dmsimard: zuul will continue to use the last config it generated until the problem is corrected | 16:58 |
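A rough sketch of the behavior corvus describes, assuming a hypothetical read_branch_config callback (this is not Zuul's actual _loadDynamicProjectData):

    # Speculative (dynamic) configuration is only read from branches already
    # present in the running layout, so a branch that never produced a valid
    # layout contributes nothing until the tenant config reloads successfully.
    def load_dynamic_project_data(project, branches_in_running_config, read_branch_config):
        configs = []
        for branch in branches_in_running_config.get(project, []):
            configs.append(read_branch_config(project, branch))
        return configs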
dmsimard | corvus: right, so in order to correct that, I need to merge that change (which passed jobs anyway I guess?) | 16:58 |
dmsimard | If Zuul was -1'ing it, I'd have to force push | 16:58 |
corvus | dmsimard: yes, or remove ara from the tenant config file | 16:58 |
dmsimard | ok so this is at least consistent with my understanding | 16:59 |
dmsimard | I understand why things broke, just expected zuul to pick up fixes speculatively (or bubble up that verbose exception to the review like it usually does) | 17:00 |
clarkb | oh and the issue is it loaded the new branch as merged config rather than random pre merged config | 17:00 |
*** hasharAway is now known as hashar | 17:01 | |
*** hashar is now known as hasharAway | 17:01 | |
corvus | clarkb: when the new branch was created, it could not create a layout from it because it was broken. zuul only loads dynamic config from branches in the current running config (because how could there be a branch that isn't in the running config?), so even though it was creating a dynamic layout for 565082, it isn't actually reading any of the content there. | 17:02 |
corvus | dmsimard: an implication of that is that you can't actually trust zuul's +1 on that change -- it could still have an error. | 17:02 |
corvus | you'll just have to merge it and see what happens :) | 17:02 |
clarkb | corvus: fwiw http://logs.openstack.org/51/565251/1/check/tox-py35/94bf75b/testr_results.html.gz did as expected. The old test failed on parent failing to merge because the syntax error was caught in the parent layout merging then child saw parent had merge failure. But when we switch to delaying merging and layout parsing for the actual child change we see that as a syntax error at that point. I think the new | 17:02 |
clarkb | behavior is more accurate to the behavior we want and avoids needing to bisect the queue. | 17:03 |
dmsimard | corvus: the reason why I created the branch is no longer valid (flask ended up releasing a fixed version 3 days later) but I can do it out of curiosity | 17:03 |
clarkb | but I didn't want to update the test until there was discussion about it | 17:03 |
clarkb | dmsimard: you may want to sort it out just in case there is another breaking change pre 1.0 release | 17:04 |
clarkb | you'll be ready then :) | 17:04 |
* dmsimard sighs | 17:04 | |
*** spsurya has quit IRC | 17:04 | |
corvus | clarkb: yeah, that makes sense. i sort of think that the current behavior is a little more user friendly in that it says "your change might be fine, but there's a problem with a dependency", whereas the new behavior would say "there's a problem with the config" and you'd have to figure out whether it was this change or the dependency. but i'm not sure that's worth the extra cpu cost. | 17:08 |
clarkb | corvus: ya and the parent itself when tested should show the error in the most accurate location | 17:09 |
corvus | yeah, that at least doesn't change | 17:09 |
clarkb | corvus: maybe we update the error message to say it could be in a parent change as well as the current change | 17:09 |
corvus | yeah, we could detect whether there's a change ahead with the error. otoh, the error message will include the repo/path/file/line number, so it's probably not *too* hard to figure out it's not this change :) | 17:10 |
corvus | (or, sorry, we could detect whether there's a change ahead that updates the config, and therefore decide that we should add that it might not be in this change) | 17:11 |
corvus | clarkb: anyway, i don't think that's necessary; we can probably just do the simple thing and not change the messages, and just update that test to expect the new error. | 17:11 |
clarkb | ok I'll go ahead and do that then | 17:11 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Don't generate layouts on non live items https://review.openstack.org/565251 | 17:24 |
clarkb | corvus: ^ done | 17:25 |
tobiash | clarkb: lgtm | 17:38 |
clarkb | Shrews: have you checked on nodepool yet for making a release? I've not noticed any issues | 18:02 |
Shrews | clarkb: i looked last friday but not today yet | 18:03 |
Shrews | clarkb: give me a few minutes and i can take a quick peek | 18:04 |
pabelanger | nb03.o.o is stuck in retry loop for zookeeper, but don't think it is a new issue | 18:05 |
Shrews | pabelanger: iirc, last time that happened was on a host (i think the same one?) with some networking issues | 18:07 |
Shrews | arm server maybe? | 18:07 |
Shrews | some errors in the syslog last time | 18:08 |
dmsimard | corvus, clarkb: circling back to the change that didn't run ara integration jobs -- I merged the integration test fixes (didn't need force push, zuul was +1'ing) and the follow up change ran all the jobs properly. | 18:09 |
Shrews | clarkb: i do not see any launcher issues | 18:09 |
Shrews | at least nothing obvious | 18:09 |
Shrews | dmsimard: fyi, you still have 2 nodes in hold for 28 days | 18:10 |
dmsimard | Shrews: I was positive those had been deleted, let me go and do that now :/ | 18:10 |
dmsimard | Shrews: deleted, thanks for reminding me | 18:11 |
corvus | dmsimard: great, thanks! | 18:12 |
corvus | dmsimard: (re ara) | 18:13 |
pabelanger | Shrews: yah, we lost connection to zookeeper, but nodepool didn't properly reconnect. Just looping now, but if we stop / start zookeeper session should start again. I suspect we've exposed a bug somewhere | 18:13 |
Shrews | pabelanger: where do you see zk connection loss? I only see ovh.net connection errors | 18:21 |
Shrews | pabelanger: actually, let's go to #-infra | 18:21 |
*** hasharAway is now known as hashar | 18:44 | |
*** xinliang has quit IRC | 18:45 | |
jlk | clarkb: corvus: This is very relevant to your interests. https://github.com/github/git/pull/758 This can lead to remote code execution on hosts that process git and submodules. | 18:50 |
corvus | jlk: that 404's for me | 18:51 |
corvus | jlk: though, happily, zuul itself doesn't do anything with submodules, so perhaps whatever the issue is won't affect us? | 18:52 |
jlk | oh shoot. Uh, hrm. | 18:52 |
jlk | I wish I could delete IRC messages. That may be embargoed | 18:53 |
corvus | (jobs, of course, might handle submodules themselves) | 18:53 |
corvus | jlk: well, the whole repo is 404 for me: https://github.com/github/git/ | 18:53 |
jlk | yeah I misread the url. I was thinking it was the public git repo, not the github fork of git | 18:55 |
*** xinliang has joined #zuul | 18:57 | |
*** xinliang has quit IRC | 18:57 | |
*** xinliang has joined #zuul | 18:57 | |
*** elyezer has quit IRC | 19:07 | |
clarkb | corvus: re https://review.openstack.org/#/c/565251/ did you want to approve that? tobias +2'd but didn't approve | 19:08 |
*** elyezer has joined #zuul | 19:20 | |
*** elyezer has quit IRC | 19:25 | |
*** elyezer has joined #zuul | 19:38 | |
corvus | clarkb: done | 19:42 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Increase unit testing of host / group vars https://review.openstack.org/559405 | 19:59 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Inventory groups should be under children key https://review.openstack.org/559406 | 19:59 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix host_vars https://review.openstack.org/565293 | 19:59 |
openstackgerrit | Merged openstack-infra/zuul master: Don't generate layouts on non live items https://review.openstack.org/565251 | 20:00 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Freeze regex projects https://review.openstack.org/565294 | 20:14 |
corvus | tobiash: ^ that's in response to the project-regex change | 20:16 |
clarkb | corvus: do you need to change the freeze=False there? | 20:17 |
clarkb | lines 942 and 957 | 20:18 |
corvus | clarkb: oh wow i did that wrong :) | 20:19 |
clarkb | this is what code review is for :) | 20:19 |
corvus | clarkb: i meant to leave the freeze on line 978, but that does in fact break, so i need to figure out what's missing in my mental model :) | 20:20 |
tobiash | corvus: I think the copied objects were still frozen | 20:20 |
tobiash | that's why I did it like that | 20:20 |
clarkb | could be lines 944 and 959 breaking it? | 20:20 |
clarkb | need to freeze after those? | 20:20 |
tobiash | or did your weekend-fix change this? | 20:20 |
corvus | clarkb: no -- we do need to leave the result of the project-template parsing unfrozen (basically, i look at it as the object is half-constructed at that point, the project parser finishes constructing it, then freezes it and returns it) | 20:21 |
corvus | it is the regex copy/modification that breaks it | 20:22 |
corvus | i thought that a copy.copy would produce an unfrozen copy, but i guess it doesn't. so we may just need to implement a copy method on project | 20:22 |
openstackgerrit | Merged openstack-infra/zuul master: Fix host_vars https://review.openstack.org/565293 | 20:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Freeze regex projects https://review.openstack.org/565294 | 20:26 |
corvus | tobiash, clarkb: ^ i guess that's what we have to do | 20:27 |
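For context, a minimal sketch of the freeze/copy pattern being discussed, assuming a simplified Freezable class rather than Zuul's actual Project model:

    import copy

    class Freezable:
        def __init__(self):
            self._frozen = False

        def freeze(self):
            self._frozen = True

        def __setattr__(self, name, value):
            # Once frozen, further modification is an error.
            if getattr(self, '_frozen', False):
                raise Exception("Attempt to modify a frozen object")
            super().__setattr__(name, value)

        def copy(self):
            # copy.copy() carries the _frozen flag over to the clone, so the
            # clone must be explicitly unfrozen before it can be modified.
            clone = copy.copy(self)
            object.__setattr__(clone, '_frozen', False)
            return clone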
clarkb | Shrews: pabelanger following up from earlier, are we happy with nodepool and should we make a release? | 20:27 |
clarkb | I noted the sha1 we used when I restarted things in the status log and on irc iirc | 20:27 |
corvus | (earlier in my change series for the config refactor, i did something similar with leaving objects unfrozen, but it made me nervous; i like the safety it adds to have things frozen quickly) | 20:27 |
tobiash | corvus: yes, that looks much cleaner :) | 20:30 |
corvus | if that checks out, i'll +3 both changes | 20:30 |
clarkb | sounds good | 20:30 |
clarkb | do we want to restart with that feature in place too? | 20:32 |
clarkb | http://logs.openstack.org/05/559405/3/check/tox-py35/63139a5/job-output.txt.gz that held up the host vars test change | 20:33 |
clarkb | does that look familiar to anyone? | 20:33 |
corvus | clarkb: i don't think it's urgent; let's not go to extra effort for it | 20:34 |
Shrews | clarkb: i don't have any specific reason not to do a release of nodepool, if everything looks good from an infra operational perspective (i think it does?) | 20:34 |
clarkb | Shrews: ya it lgtm. http://nb01.openstack.org/images/ has up to date images and nodepool is giving zuul test instances | 20:36 |
clarkb | e9b82226a5641042e1aad1329efa6e3b376e7f3a is the sha1 we are running | 20:39 |
pabelanger | +1 | 20:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Add waits to test_nodepool_priority https://review.openstack.org/565335 | 20:44 |
corvus | clarkb: i think that should help with that | 20:44 |
clarkb | corvus: I also notice that the priority of the two requests seems to be the same | 20:45 |
*** JosefWells has quit IRC | 20:45 | |
clarkb | 100-0000x vs 100-0000y | 20:45 |
corvus | clarkb: you mean in the failure message? | 20:45 |
corvus | 2018-04-30 20:13:34.551824 | ubuntu-xenial | testtools.matchers._impl.MismatchError: '100-0000000001' != '100-0000000002' | 20:46 |
corvus | if so, that's the right request, it just arrived in the wrong order, so it got sequence number 1 instead of 2 | 20:46 |
clarkb | corvus: doesn't the 100/200/300 prefix indicate priority though? | 20:47 |
corvus | (basically, the test probably *shouldn't* check the sequence number, but since it does, we should make sure it's consistent. the other way to fix this would be to match against '100-.*' and then do something else to verify that's the gate change) | 20:47 |
clarkb | oh wait I get it | 20:48 |
clarkb | 200-foo may have sequence of 0001 | 20:48 |
corvus | clarkb: yes, so 100-2 means "this arrived last, but service it first" | 20:48 |
corvus | (the -2 is 2 out of 2 (there are 3 requests, sequenced 0, 1, 2)) | 20:48 |
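An illustrative check along the lines corvus suggests (not the actual test code): assert on the priority prefix and treat the sequence number as opaque:

    import re

    request_id = '100-0000000002'
    # '100' is the gate priority; the trailing digits are the ZooKeeper
    # sequence number, which depends on request arrival order.
    match = re.match(r'^(\d{3})-(\d+)$', request_id)
    assert match is not None and match.group(1) == '100'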
corvus | tobiash: left a question for you on 563194; may be easier to chat about it in irc when you have a minute | 20:54 |
tobiash | corvus: looking | 20:55 |
tobiash | there is a difference | 20:56 |
tobiash | the current behavior checks if parent-change-enqueued is used in the tenant | 20:57 |
tobiash | but doesn't check whether it needs to be emitted for a specific pipeline, aka there is no check that there is definitely a listener for it | 20:58 |
tobiash | so say I have a gate pipeline using parent-change-enqueued on gate | 20:58 |
corvus | tobiash: gotcha | 20:58 |
*** dkranz has quit IRC | 20:58 | |
tobiash | currently it also emits this event when a change is enqueued in check (where using that event doesn't make that much sense) | 20:59 |
tobiash | that behavior killed our github in combination with the searching over all installations ;) | 20:59 |
corvus | tobiash: that's the only use of that variable, so if you want to do the calculation on every enqueue, then we should probably remove the old code; or we could update the static calculation to store pipeline names | 20:59 |
tobiash | good point | 21:00 |
tobiash | I think I'll try the static calculation with pipeline names | 21:00 |
corvus | cool | 21:00 |
tobiash | but probably next week, have to get a new openshift cluster into production this week | 21:01 |
*** hashar has quit IRC | 21:07 | |
clarkb | anyone else willing to quickly review https://review.openstack.org/#/c/565335/1 ? I'm going to recheck the changes that fell out of the gate due to the bug that this should fix | 21:22 |
corvus | clarkb: it probably is rare enough we can recheck-bash them in | 21:23 |
clarkb | corvus: ya I think so too | 21:24 |
clarkb | (But I also want to restart zuul today to get that fix in for tosky and don't want that in without the test that goes with it) | 21:25 |
corvus | project regex changes are +3 | 21:25 |
tobiash | \o/ | 21:30 |
corvus | clarkb: https://review.openstack.org/564847 is for you to +3 | 21:31 |
clarkb | looking | 21:32 |
clarkb | corvus: do we still think that is a good idea? there was some questioning that iirc | 21:32 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Add release note about config memory improvements https://review.openstack.org/565348 | 21:35 |
corvus | clarkb: and that reminds me ^ | 21:35 |
clarkb | (I still think release note about re2 is good thing and tobiash +2'd it so I guess I will +3 it) | 21:36 |
corvus | clarkb: i believe the only issue/question was related to the ordering of the release notes, which it turns out was just a bug in reno that can happen any time and is unrelated to retro notes | 21:36 |
clarkb | ah ok | 21:36 |
corvus | dhellman made a fix, unsure of release status | 21:37 |
corvus | clarkb: it's fixed in the commit after the latest tag, so it'll be in the next reno rel | 21:40 |
corvus | until then, our release notes page will be in random order :) | 21:40 |
corvus | https://review.openstack.org/564857 is a trivial change that would be good to merge soon | 21:41 |
clarkb | corvus: +3 | 21:41 |
openstackgerrit | Merged openstack-infra/zuul master: Add regex support to project stanzas https://review.openstack.org/535713 | 21:47 |
openstackgerrit | Merged openstack-infra/zuul master: Freeze regex projects https://review.openstack.org/565294 | 21:47 |
openstackgerrit | Merged openstack-infra/zuul master: Add waits to test_nodepool_priority https://review.openstack.org/565335 | 21:51 |
openstackgerrit | Merged openstack-infra/zuul master: Sometimes GitHub doesn't return repo permissions https://review.openstack.org/564666 | 21:51 |
*** ssbarnea_ has quit IRC | 21:56 | |
clarkb | 559405 looks like it may fail in the gate again :/ | 21:59 |
clarkb | this appears to be a slow test node though? | 21:59 |
corvus | unfortunately i think my config refactor has altered test runtimes and we're forced into a stabilization period :/ | 22:00 |
clarkb | shaking out new races? | 22:00 |
clarkb | what do you think about restarting zuul scheduler today before 559405 gets in? | 22:01 |
corvus | yeah, or old races | 22:01 |
clarkb | (to fix tosky) | 22:01 |
corvus | clarkb: i think we should | 22:01 |
clarkb | ok lets hop over to -infra and coordinate that. | 22:01 |
clarkb | separately did you want to be the person to tag and release nodepool? Probably best to do that after we restart zuul | 22:01 |
corvus | clarkb: yeah, i'll prepare that now | 22:11 |
clarkb | zuul.rpcclient.RPCFailure: Invalid tenant: openstack. I guess I have to wait for the tenant to be ready before that works | 22:13 |
corvus | clarkb: yep, that'll repeat for several minutes until the initial load is complete | 22:14 |
corvus | i have prepared a local tag of nodepool 3.0.1 at e9b82226a5641042e1aad1329efa6e3b376e7f3a | 22:14 |
clarkb | I stopped the script and will restart it when the tenant is loaded | 22:15 |
clarkb | are you saying I didn't have to stop it, it would just poll until ready? | 22:15 |
corvus | i'll push it when we're happy with the openstack-infra restart | 22:15 |
corvus | clarkb: oh, that was output from the re-enqueue script? | 22:15 |
clarkb | ya | 22:15 |
corvus | clarkb: sorry, you did the right thing. | 22:15 |
clarkb | ok cool | 22:15 |
corvus | clarkb: that's *also* emitted in the debug log due to requests from zuul-web if it's running. that's what i was thinking. | 22:16 |
openstackgerrit | Merged openstack-infra/zuul master: Add release note about re2 https://review.openstack.org/564847 | 22:40 |
openstackgerrit | Merged openstack-infra/zuul master: Fix some code description https://review.openstack.org/564862 | 22:43 |
clarkb | corvus: I've told the release team that things seem to be happy post restart. Is there anything else you want to check before pushing your nodepool tag? | 22:47 |
clarkb | (I can help) | 22:47 |
openstackgerrit | Merged openstack-infra/zuul master: Add debug info to test_slow_start https://review.openstack.org/564857 | 22:47 |
clarkb | http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz that change just can't have two passing runs in a row | 22:48 |
corvus | clarkb: nope, i'll push it now | 22:53 |
corvus | nodepool 3.0.1 pushed | 22:53 |
clarkb | corvus: reading the job log from that failed run above it seems like it requested an extra node? | 22:58 |
clarkb | hrm no it never marked the request as done? | 22:58 |
corvus | clarkb: well, the immediate issue is the test timed out | 22:59 |
clarkb | corvus: once the request is marked fulfilled is zuul supposed to delete the request? | 22:59 |
clarkb | corvus: ya reading through the log it appears to have run all the jobs expected of the three changes, then spent a lot of time waiting for this one request to go away and then timed out | 22:59 |
clarkb | 200-0000000007 is the request that stuck around | 22:59 |
corvus | clarkb: yeah, zuul should lock the nodes and delete the request | 23:01 |
clarkb | ya I'm reading that code now. I think that is what failed to happen | 23:01 |
clarkb | http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_438198 that must be why | 23:02 |
corvus | oh yep. i think that's a side effect of a slow node. | 23:02 |
clarkb | though that seems to happen near the end of the job | 23:02 |
clarkb | so maybe its just normal zk shutdown/cleanup? | 23:03 |
corvus | clarkb: the line you linked is before the shutdown: http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_757177 | 23:03 |
clarkb | https://pypi.org/project/nodepool/ fyi that has new release | 23:07 |
corvus | we... should update the nodepool setup.cfg description | 23:10 |
clarkb | we never even process the result event for that request being fulfilled so it's getting stuck well away from where I thought it might be | 23:11 |
clarkb | Shrews: corvus I think http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_408134 and the next log line are the root cause of this failure | 23:21 |
clarkb | it is asking zk for a request there and getting back an empty list but zk is creating the record on its end I think | 23:22 |
clarkb | but zuul doesn't seem to ever treat that node as belonging to any of its open node requests | 23:22 |
clarkb | then we see the disconnect issue not long after | 23:23 |
corvus | clarkb: i don't see a response for that | 23:23 |
corvus | clarkb: i think the next line is a response to the previous request | 23:23 |
corvus | clarkb: i'm assuming xid=96 means that | 23:23 |
clarkb | ah | 23:23 |
clarkb | so ya the disconnect there basically means you never got your request but then later zk has that record | 23:23 |
corvus | yeah, that sounds legit | 23:24 |
*** rlandy has quit IRC | 23:24 | |
clarkb | which means, for the test suite, maybe avoiding zk disconnects is the best idea | 23:24 |
clarkb | (however that happens) | 23:24 |
corvus | though, maybe there's a way to make this more robust :) | 23:24 |
clarkb | it is odd that clients don't have to ack record requests for them to be completed | 23:25 |
corvus | clarkb: here's the tb from that thread: http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_440022 | 23:26 |
corvus | clarkb: so zuul did re-request those nodes; is the problem that the test suite saw the extra node request and so never thought it was settled? | 23:28 |
clarkb | yup | 23:30 |
clarkb | I think zuul ended up making a new request for the nodes and that got the test to run all the jobs it thought it had to run | 23:30 |
clarkb | but zk still had a request outstanding that zuul is/was never going to clean up because it was no longer attached to any jobs | 23:30 |
corvus | oh, well the tests consider it not to be settled as long as zuul.nodepool has outstanding requests | 23:31 |
corvus | we could put a try/except there to unwind that | 23:32 |
clarkb | if it excepts try to delete it? | 23:32 |
corvus | heh, this is a really confusing situation. to summarize: the create operation was interrupted but succeeded. so the request actually exists in zookeeper, but zuul thought it failed. however, zuul kept a record of it. so, oddly enough, it existed in *both* places. | 23:34 |
corvus | the first obvious fix is to have zuul forget its record of the request if it fails. that would prevent a weird sort of memory leak, and would also happen to cause this test to succeed (which is an amazing side effect) | 23:35 |
corvus | but there would still be an outstanding request in zookeeper, which nodepool would, presumably fulfill. | 23:35 |
clarkb | right and we'd leak both the request and the nodes | 23:40 |
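A rough sketch of the "forget the local record on failure" idea, assuming a hypothetical submit_request helper (not Zuul's actual node request code):

    from kazoo.client import KazooClient
    from kazoo.exceptions import KazooException

    def submit_request(zk: KazooClient, path: str, data: bytes, local_requests: dict):
        # Record the request locally, then try to create it in ZooKeeper.
        local_requests[path] = data
        try:
            zk.create(path, data, sequence=True, makepath=True)
        except KazooException:
            # Unwind the local record on apparent failure; note the create may
            # still have succeeded server-side despite the connection error.
            del local_requests[path]
            raise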
corvus | i wonder if zk actually *does* return the result node id, but because kazoo does an extra 'get' operation on that id, and *that* failed, | 23:40 |
corvus | i believe this is in the traceback: https://github.com/python-zk/kazoo/blob/master/kazoo/client.py#L905 | 23:40 |
clarkb | oh, so the request is completing, then in order to return data back it reads the data in another request? | 23:41 |
corvus | hrm, i may be wrong about that: https://github.com/python-zk/kazoo/blob/master/kazoo/handlers/utils.py#L73 | 23:42 |
corvus | that doesn't look like that's what's happening | 23:42 |
corvus | it does look like the exception was raised while waiting for the response to create | 23:43 |
corvus | so yeah, i'm not sure how to handle this on the zk side | 23:43 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Install g++ on platform:rpm https://review.openstack.org/565070 | 23:46 |
clarkb | ianw: fdegir ^ fixed with your feedback | 23:46 |
clarkb | corvus: ^ thats an easy one I Think | 23:46 |
corvus | yep | 23:47 |
clarkb | as for zookeeper it is curious that the client would see the request as failed but the server wouldn't | 23:48 |
corvus | i have to run; maybe we can pick this up tomorrow | 23:48 |
clarkb | no problem I too need to start looking at things around the house | 23:49 |
clarkb | thank you for helping fix that bug and getting zuul restarted | 23:49 |