dmsimard | Did we know about https://github.com/ansible/ansible-runner ? Because I didn't. | 00:46 |
dmsimard | https://github.com/ansible/ansible-runner#ansible-runner-as-a-python-interface | 00:46 |
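As a rough sketch of the Python interface linked above, using ansible-runner's documented run() entry point; the private_data_dir and playbook paths here are placeholders, not anything from the discussion:

    import ansible_runner

    # Placeholder paths for illustration only.
    r = ansible_runner.run(private_data_dir='/tmp/demo', playbook='site.yml')
    print(r.status)  # e.g. 'successful' or 'failed'
    print(r.rc)      # exit code of the underlying ansible-playbook run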
*** dkranz has quit IRC | 01:30 | |
*** swest has quit IRC | 01:36 | |
*** swest has joined #zuul | 01:50 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: mqtt: add basic reporter https://review.openstack.org/535543 | 01:52 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: mqtt: add basic reporter https://review.openstack.org/535543 | 02:09 |
*** spsurya has joined #zuul | 02:38 | |
*** snapiri has joined #zuul | 04:38 | |
*** swest has quit IRC | 05:06 | |
*** swest has joined #zuul | 05:07 | |
*** bhavik1 has joined #zuul | 05:58 | |
*** bhavik1 has quit IRC | 06:31 | |
*** Guest1988 has joined #zuul | 07:12 | |
*** spsurya has quit IRC | 07:14 | |
*** Guest1988 has quit IRC | 07:17 | |
*** ssbarnea_ has joined #zuul | 07:35 | |
*** gtema has joined #zuul | 08:01 | |
*** hashar has joined #zuul | 08:05 | |
*** gtema has quit IRC | 08:15 | |
*** gtema has joined #zuul | 08:17 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: WIP: Improve zuul stream tests https://review.openstack.org/565169 | 08:38 |
[GNU] | pi-u | 08:56 |
[GNU] | uh... sorry :) | 08:56 |
[GNU] | good morning | 08:56 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: WIP: Improve zuul stream tests https://review.openstack.org/565169 | 08:59 |
*** spsurya has joined #zuul | 09:01 | |
*** electrofelix has joined #zuul | 09:06 | |
*** yolanda_ is now known as yolanda | 09:39 | |
*** tobasco is now known as tobasco-afk | 10:11 | |
*** hashar is now known as hasharAway | 10:16 | |
*** tobasco-afk is now known as tobasco | 10:58 | |
*** gtema has quit IRC | 11:09 | |
*** AJaeger is now known as AJaeger_away | 11:39 | |
*** elyezer has quit IRC | 11:51 | |
*** gtema has joined #zuul | 11:55 | |
*** hasharAway is now known as hashar | 11:58 | |
*** elyezer has joined #zuul | 12:03 | |
*** dkranz has joined #zuul | 12:22 | |
*** rlandy has joined #zuul | 12:33 | |
*** gtema has quit IRC | 12:39 | |
*** gouthamr has quit IRC | 12:59 | |
*** gtema has joined #zuul | 13:01 | |
*** gouthamr has joined #zuul | 13:02 | |
*** pwhalen has quit IRC | 13:17 | |
*** pwhalen has joined #zuul | 13:21 | |
*** pwhalen has joined #zuul | 13:21 | |
*** ssbarnea_ has quit IRC | 13:34 | |
*** lsell has left #zuul | 13:34 | |
*** ssbarnea_ has joined #zuul | 13:36 | |
*** gtema has quit IRC | 13:39 | |
*** gouthamr has quit IRC | 13:42 | |
*** gouthamr has joined #zuul | 13:45 | |
*** persia has quit IRC | 15:03 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool master: Static driver: Use node states defined in zk.py https://review.openstack.org/565238 | 15:14 |
*** xinliang has quit IRC | 15:17 | |
*** xinliang has joined #zuul | 15:17 | |
*** hashar is now known as hasharAway | 15:23 | |
openstackgerrit | Merged openstack-infra/nodepool master: Static driver: Use node states defined in zk.py https://review.openstack.org/565238 | 15:53 |
*** electrofelix has quit IRC | 15:54 | |
SpamapS | Hey, this is kinda cool | 15:55 |
SpamapS | https://pyup.io/ | 15:55 |
SpamapS | Somebody was like "Well nobody relies on distro libs anymore.. how could we get the same benefits but without all that open collaboration?" :-P | 15:56 |
clarkb | corvus: ok I think I may have a non-naive fix for the handling of non-live item layouts. It fails at least one test but I think the change in behavior is more accurate now? Once the current run finishes I'll push it up for others to look at assuming I don't have any new surprises | 16:06 |
corvus | clarkb: cool, i'll take a look | 16:09 |
dmsimard | corvus: it's no longer a time sensitive issue but I was wondering if you had an opinion on this issue I triggered yesterday: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2018-04-29.log.html#t2018-04-29T23:28:44 | 16:18 |
dmsimard | I suppose creating a branch from an old reference (with a now broken zuul configuration) is somewhat similar to doing a force push on an active branch with a broken config | 16:19 |
corvus | dmsimard: yeah, "don't do that" would be my first advice :) | 16:19 |
dmsimard | corvus: so the fix is indeed to force push the fixes ? | 16:19 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Don't generate layouts on non live items https://review.openstack.org/565251 | 16:20 |
dmsimard | because even https://review.openstack.org/#/c/565082/ (which cherry-picks all the fixes) doesn't load, but only for that particular branch | 16:20 |
corvus | dmsimard: i don't think you need to force-push them; if there's only one repo involved, you should be able to fix it in that repo. | 16:20 |
clarkb | corvus: ^ that should be in a "fails at least one test in an expected manner" state, as that illustrates the behavior change well. I figure once we have that data we can consider whether that needs to be addressed (eg don't change that) or whether the new behavior is more correct? | 16:21 |
dmsimard | corvus: that's exactly the problem though, it doesn't look like I'm able to fix it without force pushing | 16:21 |
clarkb | dmsimard: you aren't able to push a single change that fixes the config? | 16:22 |
clarkb | dmsimard: worst case you disable the jobs, then roll forward in a logical change by change basis | 16:22 |
dmsimard | clarkb: I submitted https://review.openstack.org/#/c/565082/ and none of the jobs are running, scheduler appears to short circuit on http://paste.openstack.org/raw/720098/ | 16:23 |
clarkb | dmsimard: that is an invalid nodeset | 16:23 |
clarkb | should be debian-stable now | 16:23 |
dmsimard | clarkb: I know, 565082 fixes that | 16:24 |
clarkb | so now we figure out where jessie is coming from | 16:24 |
corvus | dmsimard: the error says it's on stable/0.14.7, but the change you linked is stable/0.14 | 16:24 |
dmsimard | corvus: aye, I screwed up and ended up settling on a 0.14 branch, 0.14.7 was deleted | 16:25 |
clarkb | ya looking at that change zuul +1'd it | 16:25 |
dmsimard | corvus: there's another one for 0.14 | 16:25 |
dmsimard | clarkb: it's +1 but only that sphinx job ran, nothing else (not even tox-pep8 and stuff defined in project-config) | 16:25 |
corvus | dmsimard: so you created 0.14.7 with a broken config, then created 0.14, then deleted 0.14.7? | 16:25 |
dmsimard | corvus: I created 0.14.7, deleted it and created 0.14 -- both had the same HEAD | 16:26 |
corvus | dmsimard: the error you linked is before you uploaded 565082; that error can't be from that change. | 16:27 |
dmsimard | corvus: then I cherry-picked the one change I wanted to release in 0.14.7 but rabbit holed myself into non-working integration jobs so I cherry-picked the different fixes (including the debian-jessie -> debian-stable change) | 16:27 |
dmsimard | corvus: likely from https://review.openstack.org/#/c/565081/ | 16:27 |
dmsimard | corvus: (or outright at the branch creation? I'm not sure if that would've triggered it, let me go look at the logs again) | 16:27 |
dmsimard | I should've probably just manually hacked a tag together without trying to do it properly through a branch, gerrit and stuff :( | 16:28 |
clarkb | to be clear the above error isn't really an issue right now right? zuul is running at least one job successfully and +1'ing things. It is not however running all of the expected jobs | 16:29 |
clarkb | so we want to figure out why those jobs aren't running | 16:29 |
dmsimard | clarkb: sort of, I expected the whole scheduler to be borked but I tested to make sure it wasn't (could've been a nice DoS vector..) | 16:30 |
dmsimard | it's only that specific branch for some reason | 16:30 |
corvus | 2018-04-30 16:26:47,501 DEBUG zuul.layout: Pipeline variant <Job openstack-tox-pep8 branches: {MatchAny:{BranchMatcher:^master}} source: openstack-infra/project-config/zuul.d/zuul-jobs.yaml@master#2664> did not match <Change 0x7f50561a9a20 565082,1> | 16:31 |
dmsimard | there's a branch matcher? :/ | 16:31 |
corvus | there are several things about that line which are confusing to me | 16:32 |
dmsimard | oh, for openstack-tox-pep8 | 16:32 |
corvus | that's not where the pipeline variant was defined, for one. | 16:34 |
corvus | the line number is correct for zuul.d/projects.yaml, not zuul-jobs.yaml | 16:37 |
corvus | so the filename is wrong there... | 16:37 |
corvus | dmsimard: http://git.openstack.org/cgit/openstack-infra/project-config/tree/zuul.d/projects.yaml#n2664 | 16:38 |
corvus | dmsimard: you have branch matchers for all those jobs. | 16:38 |
clarkb | ah it's the nova problem again | 16:39 |
corvus | build-sphinx arrives via one of the project-templates you have there, so it doesn't have a matcher. that's why it's the only one that ran. | 16:39 |
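To illustrate the matching behavior corvus describes, a minimal sketch (not Zuul's actual implementation) of why a ^master branch matcher skips a stable/0.14 change while an unmatched template job still runs:

    import re

    # A job variant attached with a branch pattern only applies to changes
    # whose branch matches that pattern.
    def variant_applies(branch_pattern, change_branch):
        return re.match(branch_pattern, change_branch) is not None

    print(variant_applies('^master', 'stable/0.14'))  # False: the tox variants are skipped
    print(variant_applies('^master', 'master'))       # True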
corvus | clarkb: the nova problem? | 16:39 |
dmsimard | huh, someone added a matcher | 16:39 |
dmsimard | I wasn't aware of that | 16:39 |
clarkb | corvus: branch matchers outside the config change update confuse people when the config update doesn't do what they expect | 16:40 |
clarkb | corvus: basically stop using branch matchers unless you know you really need them seemed to be the outcome | 16:40 |
corvus | dmsimard: https://review.openstack.org/543503 | 16:40 |
dmsimard | Oh, we have them redefined in the feature/1.0 branch | 16:41 |
corvus | so the only bug i see here is that the source context filenames aren't correct | 16:41 |
clarkb | corvus: shouldn't the jobs defined in https://review.openstack.org/#/c/565082/1/zuul.d/layout.yaml run as well? | 16:41 |
dmsimard | corvus: okay, so that explains the lack of tox jobs -- what about the ara jobs not running themselves ? i.e, as defined in https://review.openstack.org/#/c/565082/1/zuul.d/layout.yaml | 16:42 |
dmsimard | clarkb beat me to it | 16:42 |
corvus | hrm, i don't see why they aren't running | 16:43 |
clarkb | updateConfig() checks if the filename starts with zuul.d/; does gerrit drop the project name from the beginning of filenames in the events? | 16:45 |
corvus | clarkb: it logged that it created a new configuration for that change | 16:46 |
corvus | i just did a recheck, and it logged: | 16:46 |
corvus | 2018-04-30 16:26:42,198 DEBUG zuul.ConfigLoader: Created layout id b33b88923f204b63be46086621565244 | 16:46 |
corvus | (the lines around there are mostly related to job selection for that change) | 16:46 |
dmsimard | You can see the traces with: grep -B50 "openstack/ara/zuul.d/jobs.yaml@stable/0.14" debug.log, the last one occurred around 40 minutes ago which doesn't seem to correlate with anything | 16:47 |
corvus | dmsimard: that correlates with a tenant reconfiguration event. it failed which means zuul is running on an old config (the last one before your 0.14.7 branch creation) | 16:52 |
corvus | perhaps there is an error with dynamic reconfigurations on branches which are not in the running config | 16:52 |
dmsimard | Yup, I definitely found a bug I think. Sorry :/ | 16:54 |
corvus | i'm not sure it's a bug | 16:54 |
dmsimard | corvus: wait, you're saying this is holding back the entire scheduler ? my assessment was that it only impacted that specific branch/project | 16:54 |
corvus | dmsimard: oh, no, broken configuration is a big deal :) | 16:55 |
dmsimard | corvus: I tested a job change in the master branch of ara to make sure zuul was not entirely broken and it worked just fine: https://review.openstack.org/#/c/565083/ | 16:55 |
corvus | okay, yeah, _loadDynamicProjectData only loads configuration from branches zuul knows about | 16:56 |
corvus | so that explains why the new jobs are running -- zuul won't load any configuration from that branch until it's able to reload its config. it can't reload the config until the configuration on that branch is fixed. | 16:57 |
corvus | or rather, the new jobs "aren't" running. | 16:57 |
corvus | dmsimard: zuul will continue to use the last config it generated until the problem is corrected | 16:58 |
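A rough sketch of the behavior corvus describes, assuming a hypothetical read_branch_config callback (this is not Zuul's actual _loadDynamicProjectData):

    # Speculative (dynamic) configuration is only read from branches already
    # present in the running layout, so a branch that never produced a valid
    # layout contributes nothing until the tenant config reloads successfully.
    def load_dynamic_project_data(project, branches_in_running_config, read_branch_config):
        configs = []
        for branch in branches_in_running_config.get(project, []):
            configs.append(read_branch_config(project, branch))
        return configs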
dmsimard | corvus: right, so in order to correct that, I need to merge that change (which passed jobs anyway I guess?) | 16:58 |
dmsimard | If Zuul was -1'ing it, I'd have to force push | 16:58 |
corvus | dmsimard: yes, or remove ara from the tenant config file | 16:58 |
dmsimard | ok so this is at least consistent with my understanding | 16:59 |
dmsimard | I understand why things broke, just expected zuul to pick up fixes speculatively (or bubble up that verbose exception to the review like it usually does) | 17:00 |
clarkb | oh and the issue is it loaded the new branch as merged config rather than random pre merged config | 17:00 |
*** hasharAway is now known as hashar | 17:01 | |
*** hashar is now known as hasharAway | 17:01 | |
corvus | clarkb: when the new branch was created, it could not create a layout from it because it was broken. zuul only loads dynamic config from branches in the current running config (because how could there be a branch that isn't in the running config?), so even though it was creating a dynamic layout for 565082, it isn't actually reading any of the content there. | 17:02 |
corvus | dmsimard: an implication of that is that you can't actually trust zuul's +1 on that change -- it could still have an error. | 17:02 |
corvus | you'll just have to merge it and see what happens :) | 17:02 |
clarkb | corvus: fwiw http://logs.openstack.org/51/565251/1/check/tox-py35/94bf75b/testr_results.html.gz did as expected. The old test failed on parent failing to merge because the syntax error was caught in the parent layout merging then child saw parent had merge failure. But when we switch to delaying merging and layout parsing for the actual child change we see that as a syntax error at that point. I think the new | 17:02 |
clarkb | behavior is more accurate to the behavior we want and avoids needing to bisect the queue. | 17:03 |
dmsimard | corvus: the reason why I created the branch is no longer valid (flask ended up releasing a fixed version 3 days later) but I can do it out of curiosity | 17:03 |
clarkb | but I didn't want to update the test until there was discussion about it | 17:03 |
clarkb | dmsimard: you may want to sort it out just in case there is another breaking change pre 1.0 release | 17:04 |
clarkb | you'll be ready then :) | 17:04 |
* dmsimard sighs | 17:04 | |
*** spsurya has quit IRC | 17:04 | |
corvus | clarkb: yeah, that makes sense. i sort of think that the current behavior is a little more user friendly in that it says "your change might be fine, but there's a problem with a dependency", whereas the new behavior would say "there's a problem with the config" and you'd have to figure out whether it was this change or the dependency. but i'm not sure that's worth the extra cpu cost. | 17:08 |
clarkb | corvus: ya and the parent itself when tested should show the error in the most accurate location | 17:09 |
corvus | yeah, that at least doesn't change | 17:09 |
clarkb | corvus: maybe we update the error message to say it could be in a parent change as well as the current change | 17:09 |
corvus | yeah, we could detect whether there's a change ahead with the error. otoh, the error message will include the repo/path/file/line number, so it's probably not *too* hard to figure out it's not this change :) | 17:10 |
corvus | (or, sorry, we could detect whether there's a change ahead that updates the config, and therefore decide that we should add that it might not be in this change) | 17:11 |
corvus | clarkb: anyway, i don't think that's necessary; we can probably just do the simple thing and not change the messages, and just update that test to expect the new error. | 17:11 |
clarkb | ok I'll go ahead and do that then | 17:11 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Don't generate layouts on non live items https://review.openstack.org/565251 | 17:24 |
clarkb | corvus: ^ done | 17:25 |
tobiash | clarkb: lgtm | 17:38 |
clarkb | Shrews: have you checked on nodepool yet for making a release? I've not noticed any issues | 18:02 |
Shrews | clarkb: i looked last friday but not today yet | 18:03 |
Shrews | clarkb: give me a few minutes and i can take a quick peek | 18:04 |
pabelanger | nb03.o.o is stuck in retry loop for zookeeper, but don't think it is a new issue | 18:05 |
Shrews | pabelanger: iirc, last time that happened was on a host (i think the same one?) with some networking issues | 18:07 |
Shrews | arm server maybe? | 18:07 |
Shrews | some errors in the syslog last time | 18:08 |
dmsimard | corvus, clarkb: circling back to the change that didn't run ara integration jobs -- I merged the integration test fixes (didn't need force push, zuul was +1'ing) and the follow up change ran all the jobs properly. | 18:09 |
Shrews | clarkb: i do not see any launcher issues | 18:09 |
Shrews | at least nothing obvious | 18:09 |
Shrews | dmsimard: fyi, you still have 2 nodes in hold for 28 days | 18:10 |
dmsimard | Shrews: I was positive those had been deleted, let me go and do that now :/ | 18:10 |
dmsimard | Shrews: deleted, thanks for reminding me | 18:11 |
corvus | dmsimard: great, thanks! | 18:12 |
corvus | dmsimard: (re ara) | 18:13 |
pabelanger | Shrews: yah, we lost connection to zookeeper, but nodepool didn't properly reconnect. Just looping now, but if we stop / start zookeeper session should start again. I suspect we've exposed a bug somewhere | 18:13 |
Shrews | pabelanger: where do you see zk connection loss? I only see ovh.net connection errors | 18:21 |
Shrews | pabelanger: actually, let's go to #-infra | 18:21 |
*** hasharAway is now known as hashar | 18:44 | |
*** xinliang has quit IRC | 18:45 | |
jlk | clarkb: corvus: This is very relevant to your interests. https://github.com/github/git/pull/758 This can lead to remote code execution on hosts that process git and submodules. | 18:50 |
corvus | jlk: that 404's for me | 18:51 |
corvus | jlk: though, happily, zuul itself doesn't do anything with submodules, so perhaps whatever the issue is won't affect us? | 18:52 |
jlk | oh shoot. Uh, hrm. | 18:52 |
jlk | I wish I could delete IRC messages. That may be embargoed | 18:53 |
corvus | (jobs, of course, might handle submodules themselves) | 18:53 |
corvus | jlk: well, the whole repo is 404 for me: https://github.com/github/git/ | 18:53 |
jlk | yeah I misread the url. I was thinking it was the public git repo, not the github fork of git | 18:55 |
*** xinliang has joined #zuul | 18:57 | |
*** xinliang has quit IRC | 18:57 | |
*** xinliang has joined #zuul | 18:57 | |
*** elyezer has quit IRC | 19:07 | |
clarkb | corvus: re https://review.openstack.org/#/c/565251/ did you want to approve that? tobias +2'd but didn't approve | 19:08 |
*** elyezer has joined #zuul | 19:20 | |
*** elyezer has quit IRC | 19:25 | |
*** elyezer has joined #zuul | 19:38 | |
corvus | clarkb: done | 19:42 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Increase unit testing of host / group vars https://review.openstack.org/559405 | 19:59 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Inventory groups should be under children key https://review.openstack.org/559406 | 19:59 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Fix host_vars https://review.openstack.org/565293 | 19:59 |
openstackgerrit | Merged openstack-infra/zuul master: Don't generate layouts on non live items https://review.openstack.org/565251 | 20:00 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Freeze regex projects https://review.openstack.org/565294 | 20:14 |
corvus | tobiash: ^ that's in response to the project-regex change | 20:16 |
clarkb | corvus: do you need to change the freeze=False there? | 20:17 |
clarkb | lines 942 and 957 | 20:18 |
corvus | clarkb: oh wow i did that wrong :) | 20:19 |
clarkb | this is what code review is for :) | 20:19 |
corvus | clarkb: i meant to leave the freeze on line 978, but that does in fact break, so i need to figure out what's missing in my mental model :) | 20:20 |
tobiash | corvus: I think the copied objects were still frozen | 20:20 |
tobiash | that's why I did it like that | 20:20 |
clarkb | could be lines 944 and 959 breaking it? | 20:20 |
clarkb | need to freeze after those? | 20:20 |
tobiash | or did your weekend-fix change this? | 20:20 |
corvus | clarkb: no -- we do need to leave the result of the project-template parsing unfrozen (basically, i look at it as the object is half-constructed at that point, the project parser finishes constructing it, then freezes it and returns it) | 20:21 |
corvus | it is the regex copy/modification that breaks it | 20:22 |
corvus | i thought that a copy.copy would produce an unfrozen copy, but i guess it doesn't. so we may just need to implement a copy method on project | 20:22 |
openstackgerrit | Merged openstack-infra/zuul master: Fix host_vars https://review.openstack.org/565293 | 20:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Freeze regex projects https://review.openstack.org/565294 | 20:26 |
corvus | tobiash, clarkb: ^ i guess that's what we have to do | 20:27 |
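For context, a minimal sketch of the freeze/copy pattern being discussed, assuming a simplified Freezable class rather than Zuul's actual Project model:

    import copy

    class Freezable:
        def __init__(self):
            self._frozen = False

        def freeze(self):
            self._frozen = True

        def __setattr__(self, name, value):
            # Once frozen, further modification is an error.
            if getattr(self, '_frozen', False):
                raise Exception("Attempt to modify a frozen object")
            super().__setattr__(name, value)

        def copy(self):
            # copy.copy() carries the _frozen flag over to the clone, so the
            # clone must be explicitly unfrozen before it can be modified.
            clone = copy.copy(self)
            object.__setattr__(clone, '_frozen', False)
            return clone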
clarkb | Shrews: pabelanger following up from earlier, are we happy with nodepool and should we make a release? | 20:27 |
clarkb | I noted the sha1 we used when I restarted things in the status log and on irc iirc | 20:27 |
corvus | (earlier in my change series for the config refactor, i did something similar with leaving objects unfrozen, but it made me nervous; i like the safety it adds to have things frozen quickly) | 20:27 |
tobiash | corvus: yes, that looks much cleaner :) | 20:30 |
corvus | if that checks out, i'll +3 both changes | 20:30 |
clarkb | sounds good | 20:30 |
clarkb | do we want to restart with that feature in place too? | 20:32 |
clarkb | http://logs.openstack.org/05/559405/3/check/tox-py35/63139a5/job-output.txt.gz that held up the host vars test change | 20:33 |
clarkb | does that look familiar to anyone? | 20:33 |
corvus | clarkb: i don't think it's urgent; let's not go to extra effort for it | 20:34 |
Shrews | clarkb: i don't have any specific reason not to do a release of nodepool, if everything looks good from an infra operational perspective (i think it does?) | 20:34 |
clarkb | Shrews: ya it lgtm. http://nb01.openstack.org/images/ has up to date images and nodepool is giving zuul test instances | 20:36 |
clarkb | e9b82226a5641042e1aad1329efa6e3b376e7f3a is the sha1 we are running | 20:39 |
pabelanger | +1 | 20:41 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Add waits to test_nodepool_priority https://review.openstack.org/565335 | 20:44 |
corvus | clarkb: i think that should help with that | 20:44 |
clarkb | corvus: I also notice that the priority of the two requests seems to be the same | 20:45 |
*** JosefWells has quit IRC | 20:45 | |
clarkb | 100-0000x vs 100-0000y | 20:45 |
corvus | clarkb: you mean in the failure message? | 20:45 |
corvus | 2018-04-30 20:13:34.551824 | ubuntu-xenial | testtools.matchers._impl.MismatchError: '100-0000000001' != '100-0000000002' | 20:46 |
corvus | if so, that's the right request, it just arrived in the wrong order, so it got sequence number 1 instead of 2 | 20:46 |
clarkb | corvus: doesn't the 100/200/300 prefix indicate priority though? | 20:47 |
corvus | (basically, the test probably *shouldn't* check the sequence number, but since it does, we should make sure it's consistent. the other way to fix this would be to match against '100-.*' and then do something else to verify that's the gate change) | 20:47 |
clarkb | oh wait I get it | 20:48 |
clarkb | 200-foo may have sequence of 0001 | 20:48 |
corvus | clarkb: yes, so 100-2 means "this arrived last, but service it first" | 20:48 |
corvus | (the -2 is 2 out of 2 (there are 3 requests, sequenced 0, 1, 2)) | 20:48 |
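An illustrative check along the lines corvus suggests (not the actual test code): assert on the priority prefix and treat the sequence number as opaque:

    import re

    request_id = '100-0000000002'
    # '100' is the gate priority; the trailing digits are the ZooKeeper
    # sequence number, which depends on request arrival order.
    match = re.match(r'^(\d{3})-(\d+)$', request_id)
    assert match is not None and match.group(1) == '100'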
corvus | tobiash: left a question for you on 563194; may be easier to chat about it in irc when you have a minute | 20:54 |
tobiash | corvus: looking | 20:55 |
tobiash | there is a difference | 20:56 |
tobiash | the current behavior checks if parent-change-enqueued is used in the tenant | 20:57 |
tobiash | but doesn't check whether it needs to be emitted for a specific pipeline, aka there is no check that there is definitely a listener for it | 20:58 |
tobiash | so say I have a gate pipeline using parent-change-enqueued on gate | 20:58 |
corvus | tobiash: gotcha | 20:58 |
*** dkranz has quit IRC | 20:58 | |
tobiash | currently it also emits this event when a change is enqueued in check (where using that event doesn't make that much sense) | 20:59 |
tobiash | that behavior killed our github in combination with the searching over all installations ;) | 20:59 |
corvus | tobiash: that's the only use of that variable, so if you want to do the calculation on every enqueue, then we should probably remove the old code; or we could update the static calculation to store pipeline names | 20:59 |
tobiash | good point | 21:00 |
tobiash | I think I'll try the static calculation with pipeline names | 21:00 |
corvus | cool | 21:00 |
tobiash | but probably next week, have to get a new openshift cluster into production this week | 21:01 |
*** hashar has quit IRC | 21:07 | |
clarkb | anyone else willing to quickly review https://review.openstack.org/#/c/565335/1 ? I'm going to recheck the changes that fell out of the gate due to the bug that this should fix | 21:22 |
corvus | clarkb: it probably is rare enough we can recheck-bash them in | 21:23 |
clarkb | corvus: ya I think so too | 21:24 |
clarkb | (But I also want to restart zuul today to get that fix in for tosky and don't want that in without the test that goes with it) | 21:25 |
corvus | project regex changes are +3 | 21:25 |
tobiash | \o/ | 21:30 |
corvus | clarkb: https://review.openstack.org/564847 is for you to +3 | 21:31 |
clarkb | looking | 21:32 |
clarkb | corvus: do we still think that is a good idea? there was some questioning that iirc | 21:32 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Add release note about config memory improvements https://review.openstack.org/565348 | 21:35 |
corvus | clarkb: and that reminds me ^ | 21:35 |
clarkb | (I still think release note about re2 is good thing and tobiash +2'd it so I guess I will +3 it) | 21:36 |
corvus | clarkb: i believe the only issue/question was related to the ordering of the release notes, which it turns out was just a bug in reno that can happen any time and is unrelated to retro notes | 21:36 |
clarkb | ah ok | 21:36 |
corvus | dhellman made a fix, unsure of release status | 21:37 |
corvus | clarkb: it's fixed in the commit after the latest tag, so it'll be in the next reno rel | 21:40 |
corvus | until then, our release notes page will be in random order :) | 21:40 |
corvus | https://review.openstack.org/564857 is a trivial change that would be good to merge soon | 21:41 |
clarkb | corvus: +3 | 21:41 |
openstackgerrit | Merged openstack-infra/zuul master: Add regex support to project stanzas https://review.openstack.org/535713 | 21:47 |
openstackgerrit | Merged openstack-infra/zuul master: Freeze regex projects https://review.openstack.org/565294 | 21:47 |
openstackgerrit | Merged openstack-infra/zuul master: Add waits to test_nodepool_priority https://review.openstack.org/565335 | 21:51 |
openstackgerrit | Merged openstack-infra/zuul master: Sometimes GitHub doesn't return repo permissions https://review.openstack.org/564666 | 21:51 |
*** ssbarnea_ has quit IRC | 21:56 | |
clarkb | 559405 looks like it may fail in the gate again :/ | 21:59 |
clarkb | this appears to be a slow test node though? | 21:59 |
corvus | unfortunately i think my config refactor has altered test runtimes and we're forced into a stabilization period :/ | 22:00 |
clarkb | shaking out new races? | 22:00 |
clarkb | what do you think about restarting zuul scheduler today before 559405 gets in? | 22:01 |
corvus | yeah, or old races | 22:01 |
clarkb | (to fix tosky) | 22:01 |
corvus | clarkb: i think we should | 22:01 |
clarkb | ok lets hop over to -infra and coordinate that. | 22:01 |
clarkb | separately did you want to be the person to tag and release nodepool? Probably best to do that after we restart zuul | 22:01 |
corvus | clarkb: yeah, i'll prepare that now | 22:11 |
clarkb | zuul.rpcclient.RPCFailure: Invalid tenant: openstack. I guess I have to wait for the tenant to be ready before that works | 22:13 |
corvus | clarkb: yep, that'll repeat for several minutes until the initial load is complete | 22:14 |
corvus | i have prepared a local tag of nodepool 3.0.1 at e9b82226a5641042e1aad1329efa6e3b376e7f3a | 22:14 |
clarkb | I stopped the script and will restart it when the tenant is loaded | 22:15 |
clarkb | are you saying I didn't have to stop it, it would just poll until ready? | 22:15 |
corvus | i'll push it when we're happy with the openstack-infra restart | 22:15 |
corvus | clarkb: oh, that was output from the re-enqueue script? | 22:15 |
clarkb | ya | 22:15 |
corvus | clarkb: sorry, you did the right thing. | 22:15 |
clarkb | ok cool | 22:15 |
corvus | clarkb: that's *also* emitted in the debug log due to requests from zuul-web if it's running. that's what i was thinking. | 22:16 |
openstackgerrit | Merged openstack-infra/zuul master: Add release note about re2 https://review.openstack.org/564847 | 22:40 |
openstackgerrit | Merged openstack-infra/zuul master: Fix some code description https://review.openstack.org/564862 | 22:43 |
clarkb | corvus: I've told the release team that things seem to be happy post restart. Is there anything else you want to check before pushing your nodepool tag? | 22:47 |
clarkb | (I can help) | 22:47 |
openstackgerrit | Merged openstack-infra/zuul master: Add debug info to test_slow_start https://review.openstack.org/564857 | 22:47 |
clarkb | http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz that change just can't have two passing runs in a row | 22:48 |
corvus | clarkb: nope, i'll push it now | 22:53 |
corvus | nodepool 3.0.1 pushed | 22:53 |
clarkb | corvus: reading the job log from that failed run above it seems like it requested an extra node? | 22:58 |
clarkb | hrm no it never marked the request as done? | 22:58 |
corvus | clarkb: well, the immediate issue is the test timed out | 22:59 |
clarkb | corvus: once the request is marked fulfilled is zuul supposed to delete the request? | 22:59 |
clarkb | corvus: ya reading through the log it appears to have run all the jobs expected of the three changes, then spent a lot of time waiting for this one request to go away and then timed out | 22:59 |
clarkb | 200-0000000007 is the request that stuck around | 22:59 |
corvus | clarkb: yeah, zuul should lock the nodes and delete the request | 23:01 |
clarkb | ya I'm reading that code now. I think that is what failed to happen | 23:01 |
clarkb | http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_438198 that must be why | 23:02 |
corvus | oh yep. i think that's a side effect of a slow node. | 23:02 |
clarkb | though that seems to happen near the end of the job | 23:02 |
clarkb | so maybe its just normal zk shutdown/cleanup? | 23:03 |
corvus | clarkb: the line you linked is before the shutdown: http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_757177 | 23:03 |
clarkb | https://pypi.org/project/nodepool/ fyi that has new release | 23:07 |
corvus | we... should update the nodepool setup.cfg description | 23:10 |
clarkb | we never even process the result event for that request being fulfilled so it's getting stuck well away from where I thought it might be | 23:11 |
clarkb | Shrews: corvus I think http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_408134 and the next log line are the root cause of this failure | 23:21 |
clarkb | it is asking zk for a request there and getting back an empty list but zk is creating the record on its end I think | 23:22 |
clarkb | but zuul doesn't seem to ever treat that node as belonging to any of its open node requests | 23:22 |
clarkb | then we see the disconnect issue not long after | 23:23 |
corvus | clarkb: i don't see a response for that | 23:23 |
corvus | clarkb: i think the next line is a response to the previous request | 23:23 |
corvus | clarkb: i'm assuming xid=96 means that | 23:23 |
clarkb | ah | 23:23 |
clarkb | so ya the disconnect there basically means you never got your request but then later zk has that record | 23:23 |
corvus | yeah, that sounds legit | 23:24 |
*** rlandy has quit IRC | 23:24 | |
clarkb | which means, for the test suite, maybe avoiding zk disconnects is the best idea | 23:24 |
clarkb | (however that happens) | 23:24 |
corvus | though, maybe there's a way to make this more robust :) | 23:24 |
clarkb | it is odd that clients don't have to ack record requests for them to be completed | 23:25 |
corvus | clarkb: here's the tb from that thread: http://logs.openstack.org/05/559405/3/check/tox-py35/31da68d/job-output.txt.gz#_2018-04-30_22_43_16_440022 | 23:26 |
corvus | clarkb: so zuul did re-request those nodes; is the problem that the test suite saw the extra node request and so never thought it was settled? | 23:28 |
clarkb | yup | 23:30 |
clarkb | I think zuul ended up making a new request for the nodes and that got the test to run all the jobs it thought it had to run | 23:30 |
clarkb | but zk still had a request outstanding that zuul is/was never going to clean up because it was no longer attached to any jobs | 23:30 |
corvus | oh, well the tests consider it not to be settled as long as zuul.nodepool has outstanding requests | 23:31 |
corvus | we could put a try/except there to unwind that | 23:32 |
clarkb | if it excepts try to delete it? | 23:32 |
corvus | heh, this is a really confusing situation. to summarize: the create operation was interrupted but succeeded. so the request actually exists in zookeeper, but zuul thought it failed. however, zuul kept a record of it. so, oddly enough, it existed in *both* places. | 23:34 |
corvus | the first obvious fix is to have zuul forget its record of the request if it fails. that would prevent a weird sort of memory leak, and would also happen to cause this test to succeed (which is an amazing side effect) | 23:35 |
corvus | but there would still be an outstanding request in zookeeper, which nodepool would, presumably fulfill. | 23:35 |
clarkb | right and we'd leak both the request and the nodes | 23:40 |
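A rough sketch of the "forget the local record on failure" idea, assuming a hypothetical submit_request helper (not Zuul's actual node request code):

    from kazoo.client import KazooClient
    from kazoo.exceptions import KazooException

    def submit_request(zk: KazooClient, path: str, data: bytes, local_requests: dict):
        # Record the request locally, then try to create it in ZooKeeper.
        local_requests[path] = data
        try:
            zk.create(path, data, sequence=True, makepath=True)
        except KazooException:
            # Unwind the local record on apparent failure; note the create may
            # still have succeeded server-side despite the connection error.
            del local_requests[path]
            raise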
corvus | i wonder if zk actually *does* return the result node id, but because kazoo does an extra 'get' operation on that id, and *that* failed, | 23:40 |
corvus | i believe this is in the traceback: https://github.com/python-zk/kazoo/blob/master/kazoo/client.py#L905 | 23:40 |
clarkb | oh, so the request is completing, then in order to return data back it reads the data in another request? | 23:41 |
corvus | hrm, i may be wrong about that: https://github.com/python-zk/kazoo/blob/master/kazoo/handlers/utils.py#L73 | 23:42 |
corvus | that doesn't look like that's what's happening | 23:42 |
corvus | it does look like the exception was raised while waiting for the response to create | 23:43 |
corvus | so yeah, i'm not sure how to handle this on the zk side | 23:43 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul master: Install g++ on platform:rpm https://review.openstack.org/565070 | 23:46 |
clarkb | ianw: fdegir ^ fixed with your feedback | 23:46 |
clarkb | corvus: ^ thats an easy one I Think | 23:46 |
corvus | yep | 23:47 |
clarkb | as for zookeeper it is curious that the client would see the request as failed but the server wouldn't | 23:48 |
corvus | i have to run; maybe we can pick this up tomorrow | 23:48 |
clarkb | no problem I too need to start looking at things around the house | 23:49 |
clarkb | thank you for helping fix that bug and getting zuul restarted | 23:49 |