Monday, 2018-04-30

dmsimardDid we know about https://github.com/ansible/ansible-runner ? Because I didn't.00:46
dmsimardhttps://github.com/ansible/ansible-runner#ansible-runner-as-a-python-interface00:46
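For reference, the Python interface that page describes looks roughly like the following; the private_data_dir path and playbook name are placeholders, so treat this as a sketch of the documented usage rather than a tested invocation:
```python
import ansible_runner

# Run a playbook through the ansible-runner Python API; private_data_dir is a
# directory laid out the way ansible-runner expects (inventory/, project/, env/).
r = ansible_runner.run(private_data_dir='/tmp/runner-demo', playbook='site.yml')

print(r.status, r.rc)       # e.g. "successful", 0
for event in r.events:      # per-task event dicts emitted by the runner
    print(event.get('event'))
print(r.stats)              # final play recap
```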
*** dkranz has quit IRC01:30
*** swest has quit IRC01:36
*** swest has joined #zuul01:50
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: mqtt: add basic reporter  https://review.openstack.org/53554301:52
openstackgerritTristan Cacqueray proposed openstack-infra/zuul master: mqtt: add basic reporter  https://review.openstack.org/53554302:09
*** spsurya has joined #zuul02:38
*** snapiri has joined #zuul04:38
*** swest has quit IRC05:06
*** swest has joined #zuul05:07
*** bhavik1 has joined #zuul05:58
*** bhavik1 has quit IRC06:31
*** Guest1988 has joined #zuul07:12
*** spsurya has quit IRC07:14
*** Guest1988 has quit IRC07:17
*** ssbarnea_ has joined #zuul07:35
*** gtema has joined #zuul08:01
*** hashar has joined #zuul08:05
*** gtema has quit IRC08:15
*** gtema has joined #zuul08:17
openstackgerritTobias Henkel proposed openstack-infra/zuul master: WIP: Improve zuul stream tests  https://review.openstack.org/56516908:38
[GNU]pi-u08:56
[GNU]uh... sorry :)08:56
[GNU]good morning08:56
openstackgerritTobias Henkel proposed openstack-infra/zuul master: WIP: Improve zuul stream tests  https://review.openstack.org/56516908:59
*** spsurya has joined #zuul09:01
*** electrofelix has joined #zuul09:06
*** yolanda_ is now known as yolanda09:39
*** tobasco is now known as tobasco-afk10:11
*** hashar is now known as hasharAway10:16
*** tobasco-afk is now known as tobasco10:58
*** gtema has quit IRC11:09
*** AJaeger is now known as AJaeger_away11:39
*** elyezer has quit IRC11:51
*** gtema has joined #zuul11:55
*** hasharAway is now known as hashar11:58
*** elyezer has joined #zuul12:03
*** dkranz has joined #zuul12:22
*** rlandy has joined #zuul12:33
*** gtema has quit IRC12:39
*** gouthamr has quit IRC12:59
*** gtema has joined #zuul13:01
*** gouthamr has joined #zuul13:02
*** pwhalen has quit IRC13:17
*** pwhalen has joined #zuul13:21
*** pwhalen has joined #zuul13:21
*** ssbarnea_ has quit IRC13:34
*** lsell has left #zuul13:34
*** ssbarnea_ has joined #zuul13:36
*** gtema has quit IRC13:39
*** gouthamr has quit IRC13:42
*** gouthamr has joined #zuul13:45
*** persia has quit IRC15:03
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool master: Static driver: Use node states defined in zk.py  https://review.openstack.org/56523815:14
*** xinliang has quit IRC15:17
*** xinliang has joined #zuul15:17
*** hashar is now known as hasharAway15:23
openstackgerritMerged openstack-infra/nodepool master: Static driver: Use node states defined in zk.py  https://review.openstack.org/56523815:53
*** electrofelix has quit IRC15:54
SpamapSHey, this is kinda cool15:55
SpamapShttps://pyup.io/15:55
SpamapSSomebody was like "Well nobody relies on distro libs anymore.. how could we get the same benefits but without all that open collaboration?" :-P15:56
clarkbcorvus: ok I think I may have a non-naive fix for the handling of non-live item layouts. It fails at least one test but I think the change in behavior is more accurate now? Once the current run finishes I'll push it up for others to look at, assuming I don't have any new surprises16:06
corvusclarkb: cool, i'll take a look16:09
dmsimardcorvus: it's no longer a time sensitive issue but I was wondering if you had an opinion on this issue I triggered yesterday: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-04-29.log.html#t2018-04-29T23:28:4416:18
dmsimardI suppose creating a branch from an old reference (with a now broken zuul configuration) is somewhat similar to doing a force push on an active branch with a broken config16:19
corvusdmsimard: yeah, "don't do that" would be my first advice :)16:19
dmsimardcorvus: so the fix is indeed to force push the fixes ?16:19
openstackgerritClark Boylan proposed openstack-infra/zuul master: Don't generate layouts on non live items  https://review.openstack.org/56525116:20
dmsimardbecause even https://review.openstack.org/#/c/565082/ (which cherry-picks all the fixes) doesn't load, but only for that particular branch16:20
corvusdmsimard: i don't think you need to force-push them; if there's only one repo involved, you should be able to fix it in that repo.16:20
clarkbcorvus: ^ that should be in a 'fails at least one test in an expected manner' state, as that illustrates the behavior change well. I figure once we have that data we can consider whether that needs to be addressed (eg don't change that) or whether the new behavior is more correct?16:21
dmsimardcorvus: that's exactly the problem though, it doesn't look like I'm able to fix it without force pushing16:21
clarkbdmsimard: you aren't able to push a single change that fixes the config?16:22
clarkbdmsimard: worst case you disable the jobs, then roll forward in a logical change by change basis16:22
dmsimardclarkb: I submitted https://review.openstack.org/#/c/565082/ and none of the jobs are running, scheduler appears to short circuit on http://paste.openstack.org/raw/720098/16:23
clarkbdmsimard: that is an invalid nodeset16:23
clarkbshould be debian-stable now16:23
dmsimardclarkb: I know, 565082 fixes that16:24
clarkbso now we figure out where jessie is coming from16:24
corvusdmsimard: the error says it's on stable/0.14.7, but the change you linked is stable/0.1416:24
dmsimardcorvus: aye, I screwed up and ended up settling on a 0.14 branch, 0.14.7 was deleted16:25
clarkbya looking at that change zuul +1'd it16:25
dmsimardcorvus: there's another one for 0.1416:25
dmsimardclarkb: it's +1 but only that sphinx job ran, nothing else (not even tox-pep8 and stuff defined in project-config)16:25
corvusdmsimard: so you created 0.14.7 with a broken config, then created 0.14, then deleted 0.14.7?16:25
dmsimardcorvus: I created 0.14.7, deleted it and created 0.14 -- both had the same HEAD16:26
corvusdmsimard: the error you linked is before you uploaded 565082; that error can't be from that change.16:27
dmsimardcorvus: then I cherry-picked the one change I wanted to release in 0.14.7 but rabbit holed myself into non-working integration jobs so I cherry-picked the different fixes (including the debian-jessie -> debian-stable change)16:27
dmsimardcorvus: likely from https://review.openstack.org/#/c/565081/16:27
dmsimardcorvus: (or outright at the branch creation? I'm not sure if that would've triggered it, let me go look at the logs again)16:27
dmsimardI should've probably just manually hacked a tag together without trying to do it properly through a branch, gerrit and stuff :(16:28
clarkbto be clear the above error isn't really an issue right now right? zuul is running at least one job successfully and +1'ing things. It is not however running all of the expected jobs16:29
clarkbso we want to figure out why those jobs aren't running16:29
dmsimardclarkb: sort of, I expected the whole scheduler to be borked but I tested to make sure it wasn't (could've been a nice DoS vector..)16:30
dmsimardit's only that specific branch for some reason16:30
corvus2018-04-30 16:26:47,501 DEBUG zuul.layout: Pipeline variant <Job openstack-tox-pep8 branches: {MatchAny:{BranchMatcher:^master}} source: openstack-infra/project-config/zuul.d/zuul-jobs.yaml@master#2664> did not match <Change 0x7f50561a9a20 565082,1>16:31
dmsimardthere's a branch matcher? :/16:31
corvusthere are several things about that line which are confusing to me16:32
dmsimardoh, for openstack-tox-pep816:32
corvusthat's not where the pipeline variant was defined, for one.16:34
corvusthe line number is correct for zuul.d/projects.yaml, not zuul-jobs.yaml16:37
corvusso the filename is wrong there...16:37
corvusdmsimard: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/projects.yaml#n266416:38
corvusdmsimard: you have branch matchers for all those jobs.16:38
clarkbah its the nova problem again16:39
corvusbuild-sphinx arrives via one of the project-templates you have there, so it doesn't have a matcher.  that's why it's the only one that ran.16:39
corvusclarkb: the nova problem?16:39
dmsimardhuh, someone added a matcher16:39
dmsimardI wasn't aware of that16:39
clarkbcorvus: branch matchers outside the config change end up confusing people when the config update doesn't do what they expect16:40
clarkbcorvus: basically stop using branch matchers unless you know you really need them seemed to be the outcome16:40
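As an illustration of the branch matcher behaviour in that debug line (the class below is a simplified stand-in, not Zuul's actual implementation), a job variant restricted to ^master simply never applies to a change on stable/0.14:
```python
import re

class BranchMatcher:
    """Simplified stand-in for the matcher shown in the debug line above."""
    def __init__(self, pattern):
        self.regex = re.compile(pattern)

    def matches(self, branch):
        return bool(self.regex.match(branch))

# The openstack-tox-pep8 variant in project-config carries "branches: ^master",
# so it applies to master changes but not to 565082 on stable/0.14.
matcher = BranchMatcher('^master')
print(matcher.matches('master'))       # True  -> variant applies, job runs
print(matcher.matches('stable/0.14'))  # False -> "did not match", job skipped
```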
corvusdmsimard: https://review.openstack.org/54350316:40
dmsimardOh, we have them redefined in the feature/1.0 branch16:41
corvusso the only bug i see here is that the source context filenames aren't correct16:41
clarkbcorvus: shouldn't the jobs defined in https://review.openstack.org/#/c/565082/1/zuul.d/layout.yaml run as well?16:41
dmsimardcorvus: okay, so that explains the lack of tox jobs -- what about the ara jobs not running themselves ? i.e, as defined in https://review.openstack.org/#/c/565082/1/zuul.d/layout.yaml16:42
dmsimardclarkb beat me to it16:42
corvushrm, i don't see why they aren't running16:43
clarkbupdateConfig() checks if the filename starts with zuul.d/. Does gerrit drop the project name from the beginning of filenames in the events?16:45
corvusclarkb: it logged that it created a new configuration for that change16:46
corvusi just did a recheck, and it logged:16:46
corvus2018-04-30 16:26:42,198 DEBUG zuul.ConfigLoader: Created layout id b33b88923f204b63be4608662156524416:46
corvus(the lines around there are mostly related to job selection for that change)16:46
dmsimardYou can see the traces with: grep -B50 "openstack/ara/zuul.d/jobs.yaml@stable/0.14" debug.log, the last one occurred around 40 minutes ago which doesn't seem to correlate with anything16:47
corvusdmsimard: that correlates with a tenant reconfiguration event.  it failed which means zuul is running on an old config (the last one before your 0.14.7 branch creation)16:52
corvusperhaps there is an error with dynamic reconfigurations on branches which are not in the running config16:52
dmsimardYup, I definitely found a bug I think. Sorry :/16:54
corvusi'm not sure it's a bug16:54
dmsimardcorvus: wait, you're saying this is holding back the entire scheduler ? my assessment was that it only impacted that specific branch/project16:54
corvusdmsimard: oh, no, broken configuration is a big deal :)16:55
dmsimardcorvus: I tested a job change in the master branch of ara to make sure zuul was not entirely broken and it worked just fine: https://review.openstack.org/#/c/565083/16:55
corvusokay, yeah, _loadDynamicProjectData only loads configuration from branches zuul knows about16:56
corvusso that explains why the new jobs are running -- zuul won't load any configuration from that branch until it's able to reload its config.  it can't reload the config until the configuration on that branch is fixed.16:57
corvusor rather, the new jobs "aren't" running.16:57
corvusdmsimard: zuul will continue to use the last config it generated until the problem is corrected16:58
dmsimardcorvus: right, so in order to correct that, I need to merge that change (which passed jobs anyway I guess?)16:58
dmsimardIf Zuul was -1'ing it, I'd have to force push16:58
corvusdmsimard: yes, or remove ara from the tenant config file16:58
dmsimardok so this is at least consistent with my understanding16:59
dmsimardI understand why things broke, just expected zuul to pick up fixes speculatively (or bubble up that verbose exception to the review like it usually does)17:00
clarkboh and the issue is it loaded the new branch as merged config rather than random pre-merged config17:00
*** hasharAway is now known as hashar17:01
*** hashar is now known as hasharAway17:01
corvusclarkb: when the new branch was created, it could not create a layout from it because it was broken.  zuul only loads dynamic config from branches in the current running config (because how could there be a branch that isn't in the running config?), so even though it was creating a dynamic layout for 565082, it isn't actually reading any of the content there.17:02
corvusdmsimard: an implication of that is that you can't actually trust zuul's +1 on that change -- it could still have an error.17:02
corvusyou'll just have to merge it and see what happens :)17:02
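A rough sketch of the behaviour being described, with hypothetical names (the real method is _loadDynamicProjectData, but the helper and attributes below are illustrative only):
```python
def load_dynamic_project_data(project, change, running_config):
    # Speculative config is only read for branches that already exist in the
    # running configuration.  A branch that was created with a broken config
    # never made it into the running layout, so nothing from its zuul.d/ is
    # read here and Zuul's +1 does not actually exercise the new config.
    known_branches = running_config.get_known_branches(project)  # hypothetical helper
    if change.branch not in known_branches:
        return
    # ... otherwise merge the change's zuul.yaml / zuul.d/ content into the
    # proposed dynamic layout ...
```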
clarkbcorvus: fwiw http://logs.openstack.org/51/565251/1/check/tox-py35/94bf75b/testr_results.html.gz did as expected. The old test failed on the parent failing to merge because the syntax error was caught in the parent layout merging, then the child saw the parent had a merge failure. But when we switch to delaying merging and layout parsing for the actual child change, we see that as a syntax error at that point. I think the new behavior is more accurate to the behavior we want and avoids needing to bisect the queue.17:03
dmsimardcorvus: the reason why I created the branch is no longer valid (flask ended up releasing a fixed version 3 days later) but I can do it out of curiosity17:03
clarkbbut I didn't want to update the test until there was discussion about it17:03
clarkbdmsimard: you may want to sort it out just in case there is another breaking change pre 1.0 release17:04
clarkbyou'll be ready then :)17:04
* dmsimard sighs17:04
*** spsurya has quit IRC17:04
corvusclarkb: yeah, that makes sense.  i sort of think that the current behavior is a little more user friendly in that it says "your change might be fine, but there's a problem with a dependency", whereas the new behavior would say "there's a problem with the config" and you'd have to figure out whether it was this change or the dependency.  but i'm not sure that's worth the extra cpu cost.17:08
clarkbcorvus: ya and the parent itself when tested should show the error in the most accurate location17:09
corvusyeah, that at least doesn't change17:09
clarkbcorvus: maybe we update the error message to say it could be in a parent change as well as the current change17:09
corvusyeah, we could detect whether there's a change ahead with the error.  otoh, the error message will include the repo/path/file/line number, so it's probably not *too* hard to figure out it's not this change :)17:10
corvus(or, sorry, we could detect whether there's a change ahead that updates the config, and therefore decide that we should add that it might not be in this change)17:11
corvusclarkb: anyway, i don't think that's necessary; we can probably just do the simple thing and not change the messages, and just update that test to expect the new error.17:11
clarkbok I'll go ahead and do that then17:11
openstackgerritClark Boylan proposed openstack-infra/zuul master: Don't generate layouts on non live items  https://review.openstack.org/56525117:24
clarkbcorvus: ^ done17:25
tobiashclarkb: lgtm17:38
clarkbShrews: have you checked on nodepool yet for making a release? I've not noticed any issues18:02
Shrewsclarkb: i looked last friday but not today yet18:03
Shrewsclarkb: give me a few minutes and i can take a quick peek18:04
pabelangernb03.o.o is stuck in retry loop for zookeeper, but don't think it is a new issue18:05
Shrewspabelanger: iirc, last time that happened was on a host (i think the same one?) with some networking issues18:07
Shrewsarm server maybe?18:07
Shrewssome errors in the syslog last time18:08
dmsimardcorvus, clarkb: circling back to the change that didn't run ara integration jobs -- I merged the integration test fixes (didn't need force push, zuul was +1'ing) and the follow up change ran all the jobs properly.18:09
Shrewsclarkb: i do not see any launcher issues18:09
Shrewsat least nothing obvious18:09
Shrewsdmsimard: fyi, you still have 2 nodes in hold for 28 days18:10
dmsimardShrews: I was positive those had been deleted, let me go and do that now :/18:10
dmsimardShrews: deleted, thanks for reminding me18:11
corvusdmsimard: great, thanks!18:12
corvusdmsimard: (re ara)18:13
pabelangerShrews: yah, we lost connection to zookeeper, but nodepool didn't properly reconnect. Just looping now, but if we stop / start zookeeper session should start again. I suspect we've exposed a bug somewhere18:13
Shrewspabelanger: where do you see zk connection loss? I only see ovh.net connection errors18:21
Shrewspabelanger: actually, let's go to #-infra18:21
*** hasharAway is now known as hashar18:44
*** xinliang has quit IRC18:45
jlkclarkb: corvus: This is very relevant to your interests. https://github.com/github/git/pull/758 This can lead to remote code execution on hosts that process git and submodules.18:50
corvusjlk: that 404's for me18:51
corvusjlk: though, happily, zuul itself doesn't do anything with submodules, so perhaps whatever the issue is won't affect us?18:52
jlkoh shoot. Uh, hrm.18:52
jlkI wish I could delete IRC messages. That may be embargo18:53
corvus(jobs, of course, might handle submodules themselves)18:53
corvusjlk: well, the whole repo is 404 for me: https://github.com/github/git/18:53
jlkyeah I misread the url. I was thinking it was the public git repo, not the github fork of git18:55
*** xinliang has joined #zuul18:57
*** xinliang has quit IRC18:57
*** xinliang has joined #zuul18:57
*** elyezer has quit IRC19:07
clarkbcorvus: re https://review.openstack.org/#/c/565251/ did you want to approve that? tobias +2'd but didn't approve19:08
*** elyezer has joined #zuul19:20
*** elyezer has quit IRC19:25
*** elyezer has joined #zuul19:38
corvusclarkb: done19:42
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Increase unit testing of host / group vars  https://review.openstack.org/55940519:59
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Inventory groups should be under children key  https://review.openstack.org/55940619:59
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Fix host_vars  https://review.openstack.org/56529319:59
openstackgerritMerged openstack-infra/zuul master: Don't generate layouts on non live items  https://review.openstack.org/56525120:00
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Freeze regex projects  https://review.openstack.org/56529420:14
corvustobiash: ^ that's in response to the project-regex change20:16
clarkbcorvus: do you need to chagne the freeze=False there?20:17
clarkblines 942 and 95720:18
corvusclarkb: oh wow i did that wrong :)20:19
clarkbthis is what code review is for :)20:19
corvusclarkb: i meant to leave the freeze on line 978, but that does in fact break, so i need to figure out what's missing in my mental model :)20:20
tobiashcorvus: I think the copies objects were still frozen20:20
tobiashthat's why I did it like that20:20
clarkbcould be lines 944 and 959 breaking it?20:20
clarkbneed to freeze after those?20:20
tobiashor did your weekend-fix change this?20:20
corvusclarkb: no -- we do need to leave the result of the project-template parsing unfrozen (basically, i look at it as the object is half-constructed at that point, the project parser finishes constructing it, then freezes it and returns it)20:21
corvusit is the regex copy/modification that breaks it20:22
corvusi thought that a copy.copy would produce an unfrozen copy, but i guess it doesn't.  so we may just need to implement a copy method on project20:22
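The point about copy.copy() can be shown with a small stand-alone sketch; this is not Zuul's Project class, just a minimal model of the freeze/copy interaction being discussed:
```python
import copy

class Project:
    """Minimal model of a freezable config object."""
    def __init__(self, name):
        self.name = name
        self._frozen = False

    def freeze(self):
        self._frozen = True

    def __setattr__(self, name, value):
        if getattr(self, '_frozen', False):
            raise Exception("Project is frozen")
        super().__setattr__(name, value)

    def copy(self):
        # copy.copy() duplicates __dict__ as-is, so a copy of a frozen project
        # is itself frozen; an explicit copy method can hand back an unfrozen
        # duplicate that the regex expansion is free to modify before freezing.
        c = copy.copy(self)
        object.__setattr__(c, '_frozen', False)
        return c

p = Project('openstack/ara')
p.freeze()
q = p.copy()
q.name = 'openstack/ara-infra'  # allowed: q stays unfrozen until frozen again
```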
openstackgerritMerged openstack-infra/zuul master: Fix host_vars  https://review.openstack.org/56529320:25
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Freeze regex projects  https://review.openstack.org/56529420:26
corvustobiash, clarkb: ^ i guess that's what we have to do20:27
clarkbShrews: pabelanger following up from earlier, are we happy with nodepool and should we make a release?20:27
clarkbI noted the sha1 we used when I restarted things in the status log and on irc iirc20:27
corvus(earlier in my change series for the config refactor, i did something similar with leaving objects unfrozen, but it made me nervous; i like the safety it adds to have things frozen quickly)20:27
tobiashcorvus: yes, that looks much cleaner :)20:30
corvusif that checks out, i'll +3 both changes20:30
clarkbsounds good20:30
clarkbdo we want to restart with that feature in place too?20:32
clarkbhttp://logs.openstack.org/05/559405/3/check/tox-py35/63139a5/job-output.txt.gz that held up the host vars test change20:33
clarkbdoes that look familiar to anyone?20:33
corvusclarkb: i don't think it's urgent; let's not go to extra effort for it20:34
Shrewsclarkb: i don't have any specific reason not to do a release of nodepool, if everything looks good from an infra operational perspective (i think it does?)20:34
clarkbShrews: ya it lgtm. http://nb01.openstack.org/images/ has up to date images and nodepool is giving zuul test instances20:36
clarkbe9b82226a5641042e1aad1329efa6e3b376e7f3a is the sha1 we are running20:39
pabelanger+120:41
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Add waits to test_nodepool_priority  https://review.openstack.org/56533520:44
corvusclarkb: i think that should help with that20:44
clarkbcorvus: I also notice that the priority of the two requests seems to be the same20:45
*** JosefWells has quit IRC20:45
clarkb100-0000x vs 100-0000y20:45
corvusclarkb: you mean in the failure message?20:45
corvus2018-04-30 20:13:34.551824 | ubuntu-xenial | testtools.matchers._impl.MismatchError: '100-0000000001' != '100-0000000002'20:46
corvusif so, that's the right request, it just arrived in the wrong order, so it got sequence number 1 instead of 220:46
clarkbcorvus: doesn't the 100/200/300 prefix indicate priority though?20:47
corvus(basically, the test probably *shouldn't* check the sequence number, but since it does, we should make sure it's consistent.  the other way to fix this would be to match against '100-.*' and then do something else to verify that's the gate change)20:47
clarkboh wait I get it20:48
clarkb200-foo may have sequence of 000120:48
corvusclarkb: yes, so 100-2 means "this arrived last, but service it first"20:48
corvus(the -2 is 2 out of 2 (there are 3 requests, sequenced 0, 1, 2))20:48
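So the assertion could match only the priority prefix; a minimal sketch of that approach:
```python
import re

# A request id like "100-0000000002" is "<priority>-<arrival sequence>": the
# gate request keeps priority 100 even if it happens to arrive in a different
# order and picks up sequence 1 instead of 2.
def has_gate_priority(request_id):
    # Assert only on the priority prefix instead of an exact sequence number
    # the test can't control.
    return re.fullmatch(r'100-\d+', request_id) is not None

assert has_gate_priority('100-0000000001')
assert has_gate_priority('100-0000000002')
assert not has_gate_priority('200-0000000000')
```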
corvustobiash: left a question for you on 563194; may be easier to chat about it in irc when you have a minute20:54
tobiashcorvus: looking20:55
tobiashthere is a difference20:56
tobiashthe current behavior checks if parent-change-enqueued is used in the tenant20:57
tobiashbut doesn't check whether it needs to be emitted for a specific pipeline, aka there is no check that there is definitely a listener for it20:58
tobiashso say I have a gate pipeline using parent-change-enqueued on gate20:58
corvustobiash: gotcha20:58
*** dkranz has quit IRC20:58
tobiashcurrently it also emits this event when a change is enqueued in check (where using that event doesn't make that much sense)20:59
tobiashthat behavior killed our github in combination with the searching over all installations ;)20:59
corvustobiash: that's the only use of that variable, so if you want to do the calculation on every enqueue, then we should probably remove the old code; or we could update the static calculation to store pipeline names20:59
tobiashgood point21:00
tobiashI think I'll try the static calculation with pipeline names21:00
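One way that static calculation could look; the attribute names here are guesses at the model, not Zuul's actual fields, so this is only a sketch of the idea:
```python
def parent_change_enqueued_pipelines(layout):
    # At configuration load time, record which pipelines the
    # parent-change-enqueued triggers actually reference, so the event is only
    # emitted for enqueues in those pipelines instead of for every pipeline.
    interested = set()
    for pipeline in layout.pipelines.values():
        for event_filter in getattr(pipeline, 'event_filters', []):        # hypothetical
            if 'parent-change-enqueued' in getattr(event_filter, 'types', []):
                interested.update(getattr(event_filter, 'pipelines', []))  # hypothetical
    return interested

# Later, when a change is enqueued in pipeline P, only emit
# parent-change-enqueued if P is in this precomputed set; an enqueue in check
# then no longer fans out GitHub lookups that only a gate listener wants.
```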
corvuscool21:00
tobiashbut probably next week, have to get a new openshift cluster productive this week21:01
*** hashar has quit IRC21:07
clarkbanyone else willing to quickly review https://review.openstack.org/#/c/565335/1 ? I'm going to recheck the changes that fell out of gate due to the bug that this should fix21:22
corvusclarkb: it probably is rare enough we can recheck-bash them in21:23
clarkbcorvus: ya I think so too21:24
clarkb(But I also want to restart zuul today to get that fix in for tosky and don't want that in without the test that goes with it)21:25
corvusproject regex changes are +321:25
tobiash\o/21:30
corvusclarkb: https://review.openstack.org/564847 is for you to +321:31
clarkblooking21:32
clarkbcorvus: do we still think that is a good idea? there was some questioning that iirc21:32
openstackgerritJames E. Blair proposed openstack-infra/zuul master: Add release note about config memory improvements  https://review.openstack.org/56534821:35
corvusclarkb: and that reminds me ^21:35
clarkb(I still think the release note about re2 is a good thing and tobiash +2'd it so I guess I will +3 it)21:36
corvusclarkb: i believe the only issue/question was related to the ordering of the release notes, which it turns out was just a bug in reno that can happen any time and is unrelated to retro notes21:36
clarkbah ok21:36
corvusdhellman made a fix, unsure of release status21:37
corvusclarkb: it's fixed in the commit after the latest tag, so it'll be in the next reno rel21:40
corvusuntil then, our release notes page will be in random order :)21:40
corvushttps://review.openstack.org/564857 is a trivial change that would be good to merge soon21:41
clarkbcorvus: +321:41
openstackgerritMerged openstack-infra/zuul master: Add regex support to project stanzas  https://review.openstack.org/53571321:47
openstackgerritMerged openstack-infra/zuul master: Freeze regex projects  https://review.openstack.org/56529421:47
openstackgerritMerged openstack-infra/zuul master: Add waits to test_nodepool_priority  https://review.openstack.org/56533521:51
openstackgerritMerged openstack-infra/zuul master: Sometimes GitHub doesn't return repo permissions  https://review.openstack.org/56466621:51
*** ssbarnea_ has quit IRC21:56
clarkb559405 looks like it may fail in the gate again :/21:59
clarkbthis appears to be a slow test node though?21:59
corvusunfortunately i think my config refactor has altered test runtimes and we're forced into a stabilization period :/22:00
clarkbshaking out new races?22:00
clarkbwhat do you think about restarting zuul scheduler today before 559405 gets in?22:01
corvusyeah, or old races22:01
clarkb(to fix tosky)22:01
corvusclarkb: i think we should22:01
clarkbok lets hop over to -infra and coordinate that.22:01
clarkbseparately did you want to be the person to tag and release nodepool? Probably best to do that after we restart zuul22:01
corvusclarkb: yeah, i'll prepare that now22:11
clarkbzuul.rpcclient.RPCFailure: Invalid tenant: openstack. I guess I have to wait for the tenant to be ready before that works22:13
corvusclarkb: yep, that'll repeat for several minutes until the initial load is complete22:14
corvusi have prepared a local tag of nodepool 3.0.1 at e9b82226a5641042e1aad1329efa6e3b376e7f3a22:14
clarkbI stopped the script and will restart it when the tenant is loaded22:15
clarkbare you saying I didn't have to stop it, it would just poll until ready?22:15
corvusi'll push it when we're happy with the openstack-infra restart22:15
corvusclarkb: oh, that was output from the re-enqueue script?22:15
clarkbya22:15
corvusclarkb: sorry, you did the right thing.22:15
clarkbok cool22:15
corvusclarkb: that's *also* emitted in the debug log due to requests from zuul-web if it's running.  that's what i was thinking.22:16
openstackgerritMerged openstack-infra/zuul master: Add release note about re2  https://review.openstack.org/56484722:40
openstackgerritMerged openstack-infra/zuul master: Fix some code description  https://review.openstack.org/56486222:43
clarkbcorvus: I've told the release team that things seem to be happy post restart. Is there anything else you want to check before pushing your nodepool tag?22:47
clarkb(I can help)22:47
openstackgerritMerged openstack-infra/zuul master: Add debug info to test_slow_start  https://review.openstack.org/56485722:47
clarkbhttp://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz that change just can't have two passing runs in a row22:48
corvusclarkb: nope, i'll push it now22:53
corvusnodepool 3.0.1 pushed22:53
clarkbcorvus: reading the job log from that failed run above it seems like it requested an extra node?22:58
clarkbhrm no it never marked the request as done?22:58
corvusclarkb: well, the immediate issue is the test timed out22:59
clarkbcorvus: once the request is marked fulfilled is zuul supposed to delete the request?22:59
clarkbcorvus: ya reading through the log it appears to have run all the jobs expected of the three changes, then spent a lot of time waiting for this one request to go away and then timed out22:59
clarkb200-0000000007 is the request that stuck around22:59
corvusclarkb: yeah, zuul should lock the nodes and delete the request23:01
clarkbya I'm reading that code now. I think that is what failed to happen23:01
clarkbhttp://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_438198 that must be why23:02
corvusoh yep.  i think that's a side effect of a slow node.23:02
clarkbthough that seems to happen near the end of the job23:02
clarkbso maybe its just normal zk shutdown/cleanup?23:03
corvusclarkb: the line you linked is before the shutdown: http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_75717723:03
clarkbhttps://pypi.org/project/nodepool/ fyi that has new release23:07
corvuswe... should update the nodepool setup.cfg description23:10
clarkbwe never even process the result event for that request being fulfilled, so it's getting stuck well away from where I thought it might be23:11
clarkbShrews: corvus I think http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_408134 and the next log line are the root cause of this failure23:21
clarkbit is asking zk for a request there and getting back an empty list but zk is creating the record on its end I think23:22
clarkbbut zuul doesn't seem to ever treat that node as belonging to any of its open node requests23:22
clarkbthen we see the disconnect issue not long after23:23
corvusclarkb: i don't see a response for that23:23
corvusclarkb: i think the next line is a response to the previous request23:23
corvusclarkb: i'm assuming xid=96 means that23:23
clarkbah23:23
clarkbso ya the disconnect there basically means you never got the response to your request, but then later zk has that record23:23
corvusyeah, that sounds legit23:24
*** rlandy has quit IRC23:24
clarkbwhich means, for the test suite, maybe avoiding zk disconnects is the best idea23:24
clarkb(however that happens)23:24
corvusthough, maybe there's a way to make this more robust :)23:24
clarkbit is odd that clients don't have to ack record requests for them to be completed23:25
corvusclarkb: here's the tb from that thread: http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_44002223:26
corvusclarkb: so zuul did re-request those nodes; is the problem that the test suite saw the extra node request and so never thought it was settled?23:28
clarkbyup23:30
clarkbI think zuul ended up making a new request for the nodes and that got the test to run all the jobs it thought it had to run23:30
clarkbbut zk still had a request outstanding that zuul is/was never going to clean up because it was no longer attached to any jobs23:30
corvusoh, well the tests consider it not to be settled as long as zuul.nodepool has outstanding requests23:31
corvuswe could put a try/except there to unwind that23:32
clarkbif it excepts try to delete it?23:32
corvusheh, this is a really confusing situation.  to summarize: the create operation was interrupted but succeeded.  so the request actually exists in zookeeper, but zuul thought it failed.  however, zuul kept a record of it.  so, oddly enough, it existed in *both* places.23:34
corvusthe first obvious fix is to have zuul forget its record of the request if it fails.  that would prevent a weird sort of memory leak, and would also happen to cause this test to succeed (which is an amazing side effect)23:35
corvusbut there would still be an outstanding request in zookeeper, which nodepool would, presumably fulfill.23:35
clarkbright and we'd leak both the request and the nodes23:40
corvusi wonder if zk actually *does* return the result node id, but because kazoo does an extra 'get' operation on that id, and *that* failed,23:40
corvusi believe this is in the traceback: https://github.com/python-zk/kazoo/blob/master/kazoo/client.py#L90523:40
clarkboh so the request is completing then in order to return data back it is reading the data in another request?23:41
corvushrm, i may be wrong about that: https://github.com/python-zk/kazoo/blob/master/kazoo/handlers/utils.py#L7323:42
corvusthat doesn't look like that's what's happening23:42
corvusit does look like the exception was raised while waiting for the response to create23:43
corvusso yeah, i'm not sure how to handle this on the zk side23:43
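A hedged sketch of the "forget the local record if the create fails" fix discussed above; the function name and the request object's attributes are illustrative, not Zuul's actual API, and as noted it still leaves the possibly-created znode (and its nodes) to be leaked on the ZooKeeper side:
```python
from kazoo.exceptions import KazooException

def submit_node_request(zk_client, request):
    try:
        request.path = zk_client.create(
            '/nodepool/requests/100-', request.serialize(),  # serialize() is illustrative
            makepath=True, sequence=True, ephemeral=True)
    except KazooException:
        # The create can be interrupted *after* the server applied it, so the
        # znode may exist even though we saw an exception.  Dropping the local
        # record at least avoids tracking a request we never confirmed, which
        # is the double-accounting seen in the failed test run.
        request.path = None
        raise
```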
openstackgerritClark Boylan proposed openstack-infra/zuul master: Install g++ on platform:rpm  https://review.openstack.org/56507023:46
clarkbianw: fdegir ^ fixed with your feedback23:46
clarkbcorvus: ^ that's an easy one I think23:46
corvusyep23:47
clarkbas for zookeeper it is curious that the client would see the request as failed but server wouldn't23:48
corvusi have to run; maybe we can pick this up tomorrow23:48
clarkbno problem I too need to start looking at things around the house23:49
clarkbthank you for helping fix that bug and getting zuul restarted23:49
