*** mordred has quit IRC | 00:31 | |
*** mordred has joined #openstack-infra-incident | 00:33 | |
*** ajmiller has joined #openstack-infra-incident | 04:08 | |
*** ajmiller_ has joined #openstack-infra-incident | 04:09 | |
*** ajmiller has quit IRC | 04:13 | |
*** ajmiller_ has quit IRC | 04:21 | |
-openstackstatus- NOTICE: Gerrit is going to be restarted | 11:13 | |
-openstackstatus- NOTICE: Gerrit had to be restarted because it was not responsive. As a consequence, some of the test results have been lost, from 08:30 UTC to 10:30 UTC approximately. Please recheck any jobs affected by this problem. | 11:33 | |
-openstackstatus- NOTICE: Gerrit had to be restarted because it was not responsive. As a consequence, some of the test results have been lost, from 09:30 UTC to 11:30 UTC approximately. Please recheck any jobs affected by this problem. | 11:35 | |
*** ig0r__ has joined #openstack-infra-incident | 13:19 | |
*** ig0r_ has quit IRC | 13:22 | |
*** greghaynes has quit IRC | 13:22 | |
*** greghaynes has joined #openstack-infra-incident | 13:38 | |
*** ajmiller has joined #openstack-infra-incident | 14:17 | |
-openstackstatus- NOTICE: Launchpad OpenID SSO is currently experiencing issues preventing login. The Launchpad team is working on the issue | 14:57 | |
*** ChanServ changes topic to "Launchpad OpenID SSO is currently experiencing issues preventing login. The Launchpad team is working on the issue" | 14:57 | |
*** jeblair has joined #openstack-infra-incident | 15:02 | |
*** AJaeger has joined #openstack-infra-incident | 15:02 | |
mordred | hey jeblair | 15:03 |
---|---|---|
jeblair | fungi, mordred, yolanda: over here, i'd like to focus on 1) why the check job missed the failure (and why its output was SO LONG: http://logs.openstack.org/87/292187/3/gate/gate-project-config-layout/6641a40/console.html ) | 15:03 |
*** clarkb has joined #openstack-infra-incident | 15:03 | |
jeblair | and 2) if something else is wrong with zuul enqueuing or reporting changes.... | 15:04 |
fungi | 7.4m | 15:04 |
fungi | yeah | 15:04 |
AJaeger | With 1000+ projects, that file is really long ;( | 15:04 |
jeblair | so... someone saw a traceback on validating the zuul layout.... | 15:04 |
jeblair | was that in production zuul or only locally? | 15:05 |
yolanda | yes, there was that traceback that fungi spotted on logs. Also on my local tests, zuul-server -t was failing | 15:05 |
yolanda | removing the kolla change made the tests pass locally | 15:05 |
jeblair | so, do we think that the kolla change never ended up in the running zuul config because of the validation failure in production? | 15:06 |
yolanda | we think that the kolla change caused the failure | 15:06 |
jeblair | or is it possible it did end up in the running config, and that caused the false pipeline requirement failures? | 15:06 |
yolanda | that one ^ | 15:06 |
AJaeger | Also, why does tools/layout-checks.py not find the broken regex? | 15:06 |
yolanda | however, it has been forced, and i applied puppet and i still don't see any success. But i think that the reconfig has not been triggered | 15:07 |
jeblair | yeah, that's question 1 :) | 15:07 |
fungi | i wonder if zuul is running with an incomplete config | 15:07 |
fungi | or was at the time | 15:07 |
jeblair | fungi: that's a possibility worth considering as well | 15:07 |
fungi | though it looks like this was raised during validation phase so shouldn't have been actually used? | 15:08 |
jeblair | fungi: i _don't_ think it was during validation | 15:09 |
clarkb | ya reconfigure doesn't validate iirc | 15:09 |
jeblair | i'm digging in to see where it would have bailed | 15:09 |
fungi | oh, this was actually loading. for some reason i keep thinking it does a validation pass prior to loading | 15:10 |
fungi | i'm checking to see what's going on with that job passing. first making sure i can replicate this failure locally with tox -e zuul | 15:13 |
yolanda | also i haven't seen any reload, although i launched puppet with the ansible play on that host | 15:14 |
clarkb | loading that yaml I don't see any obvious quoting or escape errors | 15:15 |
clarkb | [{'skip-if': [{'project': '^openstack/kolla.*$', 'all-files-match-any': ['^.*\\.rst$', '^doc/.*']}], 'name': '^gate-kolla(.*)?-(?!(docs$)).*$'}] | 15:15 |
fungi | yeah, though it seems to be barfing on the name regex | 15:16 |
fungi | i think | 15:16 |
fungi | from the traceback | 15:16 |
clarkb | yes I think so too, it is possible that this is maybe a raw string vs a string string problem? /me uses that string as loaded by yaml to test | 15:16 |
*** SamYaple has joined #openstack-infra-incident | 15:17 | |
SamYaple | o/ sorry for the trouble! | 15:17 |
*** nibalizer has joined #openstack-infra-incident | 15:17 | |
nibalizer | ohai | 15:17 |
*** igorbelikov has joined #openstack-infra-incident | 15:17 | |
jeblair | SamYaple: no worries, this is supposed to be bulletproof | 15:18 |
clarkb | that compiles cleanly | 15:18 |
SamYaple | relevant paste shows the regex *should* be valid http://paste.openstack.org/show/490527/ | 15:18 |
fungi | oh, i need the pillow build deps to be able to tox this | 15:18 |
clarkb | SamYaple: yup my testing agrees with you | 15:18 |
jeblair | fungi: ? | 15:18 |
SamYaple | i do something special, which i think is breaking it, (.*)? | 15:18 |
clarkb | SamYaple: even when loading it in from yaml and all the associated escapes and possible type conversions | 15:18 |
SamYaple | yea clarkb i just did the same | 15:18 |
SamYaple | but my guess is the (.*)? is breaking it because it basically says match everything but everything may not exist | 15:19 |
SamYaple | its valid, clearly, but i can see it breaking somewhere | 15:19 |
fungi | jeblair: to be able to tox -e zuul in project-config it pip installs pillow | 15:19 |
jeblair | fungi: i don't understand why it would do that | 15:20 |
jeblair | fungi: or to be more clear, i don't understand why that should be necessary | 15:20 |
fungi | jeblair: depends on what "that" is | 15:20 |
jeblair | clearly... | 15:21 |
fungi | jeblair: are you saying you don't see why trying to reproduce this with tox -e zuul in project-config would be necessary? | 15:21 |
jeblair | fungi: installing pillow | 15:21 |
fungi | i don't know what "that" is | 15:21 |
fungi | oh | 15:21 |
clarkb | iirc sphinx depends on it | 15:22 |
yolanda | ok zuul was reconfigured now | 15:22 |
jeblair | fungi: (to completely level reset: the task you are working on is of the utmost importance and i am saddened that it is being made difficult by seemingly unnecessary complication) | 15:22 |
yolanda | removal of that job stopped the error, and i can see a successful Reconfiguration complete now | 15:23 |
yolanda | and changes being added to gate finally! | 15:24 |
fungi | jeblair: Collecting Pillow (from blockdiag>=1.5.0->sphinxcontrib-blockdiag>=1.1.0->-r /home/fungi/work/openstack/openstack-infra/project-config/.test/zuul/test-requirements.txt (line 4)) | 15:24 |
clarkb | SamYaple: the internets do say python has had bugs with nested repeating modifiers | 15:24 |
SamYaple | clarkb: this was my thought as well | 15:24 |
clarkb | SamYaple: with (.*)? the * and ? are nested and that could indeed be the trouble | 15:24 |
fungi | so it's an indirect test requirement of zuul, which i guess we install in the virtualenv | 15:24 |
clarkb | SamYaple: if you rewrite it to be just .* that should work | 15:24 |
SamYaple | clarkb: i wanted to see the compiled pattern because im not sure if my pattern is used directly or not | 15:25 |
jeblair | fungi: well, it should be a test requirement of zuul for building zuul docs... | 15:25 |
SamYaple | clarkb: it should, but now im scared | 15:25 |
jeblair | fungi: not sure how it ended up in the project config test... | 15:25 |
clarkb | is zuul still precise? | 15:25 |
clarkb | to the puppetboard fact list | 15:26 |
SamYaple | ohhh clarkb good question | 15:26 |
mordred | yes | 15:26 |
clarkb | it is | 15:26 |
SamYaple | let me check old python | 15:26 |
clarkb | so it will have an older version of python | 15:26 |
SamYaple | docker away! | 15:26 |
mordred | mordred@zuul:~$ python --version | 15:26 |
mordred | Python 2.7.3 | 15:26 |
clarkb | ya 2.7.6 is when it was supposedly fixed which is what trusty has | 15:27 |
clarkb | this is of course all on stackoverflow so take it with a grain of salt | 15:27 |
nibalizer | clarkb: ya 12.04 | 15:27 |
SamYaple | coolest thing ever btw -- `docker run --rm -it ubuntu:12.04 bash` | 15:27 |
jeblair | i'm guessing we're running that job on trusty now? | 15:27 |
SamYaple | will confirm | 15:27 |
jeblair | 2016-03-15 08:15:08.339 | Building remotely on ubuntu-trusty-osic-cloud1-8764915 (ubuntu-trusty) in workspace /home/jenkins/workspace/gate-project-config-layout | 15:27 |
SamYaple | http://paste.openstack.org/show/490531/ | 15:28 |
SamYaple | confirmed clarkb ^ | 15:28 |
SamYaple | let me test a new regex | 15:28 |
jeblair | report from zuul internals: a) zuul does validate the config as part of the loading process in basically the same way it does in test config; this failure is past yaml validation and into "try to build the data structure" part of the reload / test config (it should fail in the same way both ways) | 15:28 |
clarkb | SamYaple: just rewrite it to .* and that should be equivalent | 15:28 |
jeblair | b) the connections changes have made zuul slightly more fragile if reloads break | 15:28 |
clarkb | the (.*)? | 15:28 |
SamYaple | yea im testing it now clarkb | 15:28 |
SamYaple | im trying to match mesos (and future projects too) | 15:29 |
jeblair | c) that fragility, in this case, should only affect the timer trigger (basically, a reload failure at this point will break the timer trigger, probably until a successful reload) | 15:29 |
fungi | jeblair: looks like it was running on bare-trusty before https://review.openstack.org/285722 switched it to ubuntu-trusty, so it's been testing on trusty for a while | 15:29 |
yolanda | so ... looking at the zuul layout job for the change of SamYaple in http://logs.openstack.org/87/292187/3/check/gate-project-config-layout/d3cd3a1/console.html | 15:29 |
yolanda | i see an <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'now' | 15:29 |
yolanda | an exception on apscheduler | 15:30 |
yolanda | can this cause a false positive? | 15:30 |
SamYaple | clarkb: yea, using just .* it doesnt match mesos. ill go the explicit route | 15:30 |
jeblair | fungi: yeah, but aiui, we haven't tried to use this regex before now; so this was probably lurking since we changed the job to trusty | 15:30 |
fungi | agreed | 15:30 |
clarkb | SamYaple: is it because there is a -mesos? | 15:30 |
*** ChanServ changes topic to "situation normal" | 15:30 | |
-openstackstatus- NOTICE: Launchpad SSO is back to normal - happy hacking | 15:30 | |
SamYaple | clarkb: yea. enough being fancy. ill just have to update this for any new projects we have | 15:31 |
clarkb | SamYaple: the trailing .* should match the -somethings | 15:31 |
fungi | also tox -e zuul is succeeding for me on a recent platform. gonna hold a bare-precise and try there | 15:31 |
SamYaple | clarkb: im sure i can do this fancy (like i did before) but i _know_ i can do it explicit, so i may as well | 15:31 |
clarkb | SamYaple: re.match(r'^gate-kolla-(?!(docs$)).*$', 'gate-kolla-mesos') that works | 15:31 |
jeblair | yolanda: it's a bug but should not cause any of the behaviors we are seeing | 15:32 |
SamYaple | clarkb: gate-kolla-docs, gate-kolla-mesos-docs | 15:32 |
SamYaple | those are the excludes | 15:32 |
SamYaple | potentially in the future including gate-kolla-ansible-docs, i was trying to save some future commits | 15:33 |
clarkb | hrm that makes it trickier | 15:33 |
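
A minimal sketch of why excluding both docs jobs is trickier than the single-lookahead test suggests. The pattern and job names are taken from the discussion above; the helper name `matches` is made up for illustration.

```python
import re

# Pattern from clarkb's test above. The negative lookahead is evaluated
# only at the position immediately after "gate-kolla-".
PATTERN = r'^gate-kolla-(?!(docs$)).*$'


def matches(name):
    """Return True if the job name matches PATTERN (hypothetical helper)."""
    return re.match(PATTERN, name) is not None


for job in ('gate-kolla-mesos', 'gate-kolla-docs', 'gate-kolla-mesos-docs'):
    print(job, matches(job))
```

gate-kolla-docs is rejected, but gate-kolla-mesos-docs still matches, because at the lookahead position the remaining text is `mesos-docs`, which does not start with `docs`.
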
jeblair | fungi, yolanda: so i think we've answered question 1; i haven't found a connection between question 1 and question 2 (failed pipeline requirements checking) yet... | 15:35 |
jeblair | zuul should have been running with the old config, aside from the fact that the timer triggers won't activate | 15:35 |
fungi | i'm going to try reenqueuing a change which was failing to enter the gate pipeline earlier just to confirm there wasn't something else going wrong | 15:37 |
jeblair | fungi: cool, let me know which one | 15:37 |
jeblair | fungi: preferably before you do it | 15:37 |
fungi | jeblair: openstack/releases 292539,3 | 15:38 |
fungi | it's the one i quoted from the zuul debug log earlier | 15:38 |
fungi | back in #openstack-infra | 15:38 |
jeblair | fungi: cool go for it | 15:38 |
clarkb | SamYaple: re.match(r'^gate-kolla-(.*-)?(?!(docs.*)).*$', 'gate-kolla-mesos') may be close | 15:38 |
fungi | fired the enqueue | 15:38 |
jeblair | fungi: oh, i think 'zuul enqueue' may bypass pipeline requirements :) | 15:38 |
fungi | oh, maybe some of them at least. i know it still performs the mergeable check for gerrit but that may not have been an issue | 15:39 |
fungi | jeblair: Exception: Gerrit error executing gerrit query --format json --all-approvals --c | 15:40 |
fungi | omments --commit-message --current-patch-set --dependencies --files --patch-sets | 15:40 |
fungi | --submit-records 292539 | 15:40 |
fungi | er, that's a bunch of stray newlines, sorry | 15:40 |
fungi | i'll get the full traceback into a paste | 15:40 |
SamYaple | clarkb: yea ive been playing with it, but that doesnt work either :/ | 15:41 |
SamYaple | tried a few variants | 15:41 |
clarkb | SamYaple: that regex works on gate-kolla-docs, gate-kolla-mesos-docs, and gate-kolla-mesos | 15:41 |
clarkb | (I am not super familiar with all the combinations here) | 15:42 |
fungi | jeblair: full enqueue traceback http://paste.openstack.org/show/490535/ | 15:43 |
SamYaple | clarkb: it does not work. | 15:43 |
SamYaple | re.match(r'^gate-kolla-(.*-)?(?!(docs.*)).*$', 'gate-kolla-mesos-docs') | 15:43 |
SamYaple | that _shouldnt_ match, but it does | 15:43 |
SamYaple | re.match(r'^gate-kolla-(.*-)?(?!(docs.*)).*$', 'gate-kolla-docs') | 15:44 |
clarkb | SamYaple: I thought we don't want that to match | 15:44 |
clarkb | which is why we exclude docs in the first place | 15:44 |
fungi | as for why the job is not catching this, i've confirmed it seems to be the difference between re.compile() behavior on the ubuntu 12.04 python 2.7 and the ubuntu 14.04 python 2.7 stdlib | 15:44 |
fungi | as suspected | 15:44 |
SamYaple | clarkb: gate-kolla-mesos-docs and gate-kolla-docs _shouldnt_ match that regex, but gate-kolla-mesos-docs _does_ match. | 15:45 |
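
SamYaple's observation can be checked directly. The sketch below uses only the pattern and job names quoted above. The optional `(.*-)?` group does not help here: the regex engine is free to backtrack and leave the group unmatched, at which point the lookahead sees `mesos-docs` rather than `docs`.

```python
import re

# clarkb's candidate with an optional "<project>-" prefix group.
candidate = re.compile(r'^gate-kolla-(.*-)?(?!(docs.*)).*$')

# Map each job name to whether the candidate pattern matches it.
results = {
    job: candidate.match(job) is not None
    for job in ('gate-kolla-mesos', 'gate-kolla-docs', 'gate-kolla-mesos-docs')
}
print(results)
```
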
fungi | at least i do get the same failure/traceback from the layout check under tox on precise as our zuul server threw when reconfiguring | 15:45 |
fungi | but not with newer platforms (ubuntu trusty, debian jessie) | 15:45 |
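
Since the failure only appears when the pattern is compiled by the older interpreter, one way to catch it is to compile every job-name regex with the same Python that runs zuul. The sketch below is an assumption about how such a check could look; it is not the contents of tools/layout-checks.py or of any change discussed here, and `find_uncompilable` and the sample patterns are hypothetical.

```python
import re
import sys


def find_uncompilable(patterns):
    """Return (pattern, error message) pairs for regexes that fail to
    compile under the interpreter running this script (hypothetical helper)."""
    failures = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as exc:
            failures.append((pattern, str(exc)))
    return failures


if __name__ == '__main__':
    # Sample input only; a real check would read the names from the layout.
    sample = [r'^gate-kolla.*(?<!docs)$', r'^gate-(unclosed']
    print('python', sys.version.split()[0])
    for pattern, message in find_uncompilable(sample):
        print('bad regex:', pattern, '->', message)
```

A check like this is only meaningful when it runs on the same platform and Python version as the server, which is the gap identified above.
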
clarkb | SamYaple: re.match(r'^gate-kolla-(.*-)?(?!(docs))$', 'gate-kolla-mesos-docs') returns None | 15:45 |
jeblair | fungi: i don't understand that error :/ | 15:45 |
clarkb | oh derp thats without the trailing .* | 15:45 |
clarkb | gah regexes | 15:46 |
jeblair | fungi: i don't see anything corresponding in the gerrit log, and that command works for me manually... | 15:46 |
SamYaple | clarkb: da | 15:46 |
fungi | jeblair: well, it's also ugly because i apparently copied a bunch of stray newlines into the paste | 15:48 |
fungi | hard to tell the difference between where i've pasted in line breaks and where the input form for paste.o.o is wrapping long lines | 15:49 |
jeblair | fungi: the same exception is in the debug log, but also does not make sense (and it's not because of the newlines) | 15:49 |
fungi | ahh, yeah i didn't start pulling up the zuul debug log again for that error | 15:49 |
clarkb | jeblair: is gerrit executing the command but returning a non zero return code? | 15:50 |
jeblair | clarkb: not when i try it manually | 15:50 |
jeblair | we can update logging with a reconfigure.... | 15:51 |
jeblair | so i think we should turn on zuul gerrit debug log, reconfigure zuul, then try fungi's enqueue again | 15:51 |
jeblair | yolanda: you did not disable puppet, correct? | 15:51 |
fungi | and yeah, that gerrit query does work for me with my account, at least | 15:52 |
yolanda | no, is enabled | 15:52 |
jeblair | ok, i am disabling puppet on zuul now, and will manually update the log config and request reconfiguration | 15:52 |
fungi | sounds good | 15:54 |
jeblair | fungi: okay, go for it | 15:54 |
fungi | rerunning now | 15:54 |
fungi | and... no traceback | 15:54 |
fungi | heisenbug? | 15:54 |
jeblair | of course not | 15:54 |
fungi | le sigh | 15:55 |
fungi | zuul also updated the change with a "starting gate jobs" comment that time, so seems to have worked | 15:56 |
jeblair | fungi: do you have another change handy that failed requirements earlier where we can workflow it in gerrit? | 15:56 |
SamYaple | clarkb: i ended up with "^gate-kolla.*(?<!docs)$" | 15:58 |
SamYaple | gate-kolla prefix, anything not ending in docs | 15:58 |
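
A quick check of the lookbehind pattern SamYaple settled on, using the job names mentioned earlier in the channel (gate-kolla-ansible-docs is the possible future job, not an existing one). This is a sketch for illustration only.

```python
import re

# Final pattern from the discussion: "gate-kolla" prefix, and the name must
# not end in "docs" (negative lookbehind evaluated at the end of the string).
final = re.compile(r'^gate-kolla.*(?<!docs)$')

should_match = ['gate-kolla-mesos']
should_skip = ['gate-kolla-docs', 'gate-kolla-mesos-docs', 'gate-kolla-ansible-docs']

# Collect the names the pattern actually matches.
matched = [job for job in should_match + should_skip if final.match(job)]
print(matched)
```

Only gate-kolla-mesos is matched. The trade-off is that any gate-kolla job whose name ends in `docs` is excluded, whatever precedes it.
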
fungi | jeblair: that was the only example i was working from but there were several other users reporting similar behavior in channel around the same time | 16:01 |
fungi | jeblair: i'm digging some up now | 16:01 |
fungi | SamYaple: fwiw, something like https://review.openstack.org/293015 would have prevented the original issue from merging | 16:02 |
fungi | based on some manual tests | 16:02 |
jeblair | AJaeger: hi, can you not leave recheck comments on those? :) | 16:03 |
SamYaple | fungi: yes it would have | 16:03 |
fungi | jeblair: 292074 291762 292521,1 are a few more i see mentioned in scrollback | 16:04 |
jeblair | fungi: rechecked, merged, merged :( | 16:05 |
fungi | oh | 16:05 |
jeblair | i need one where i can actually reproduce what people were claiming :( | 16:05 |
fungi | morgan mentioned 292653 which may fit the profile | 16:08 |
fungi | it's not enqueued but was approved at 10:26 utc today | 16:08 |
fungi | 292653,1 | 16:08 |
fungi | though that may have been around the time yolanda and jhesketh were trying to fix things by restarting gerrit | 16:09 |
SamYaple | fyi -- https://review.openstack.org/293021 the pastes there show its validated and shouldnt cause this problem again | 16:09 |
yolanda | on gerrit restart, several changes were lost | 16:09 |
SamYaple | thanks everyone, im headed out now. ping me if there are more issues | 16:10 |
fungi | looks like 292653,1 was prior to the gerrit restart even | 16:10 |
fungi | SamYaple: thanks! | 16:10 |
jeblair | 2016-03-15 00:20:19,240 DEBUG zuul.DependentPipelineManager: Change <Change 0x7f4fc785fd10 292653,1> can not merge, ignoring | 16:10 |
jeblair | 2016-03-15 00:20:19,240 DEBUG zuul.DependentPipelineManager: Change <Change 0x7f4fc785fd10 292653,1> is not ready to be enqueued, ignoring | 16:10 |
fungi | jeblair: those log entries are too old i think. we're looking for when it handled the workflow +1 at 10:26 utc | 16:11 |
jeblair | fungi: ah, then it's probably a missed event | 16:12 |
jeblair | cause that's the last | 16:12 |
fungi | so we likely need to find one which was approved after the gerrit restart solved whatever was going on with the event queue (which presumably was a distinct issue from the zuul config problem) | 16:14 |
fungi | gerrit start time seems to have been 11:13 utc | 16:17 |
fungi | i need to take a quick break to do morning stuff which i didn't get to because i saw everything was on fire when i woke up and now it's afternoon | 16:23 |
fungi | will brb | 16:23 |
fungi | COFFEE | 16:23 |
jeblair | i'm still looking for a patch that exhibits the reported problem; i have not found one | 16:24 |
fungi | jeblair: morgan mentioned in the infra channel another odd inconsistency in gerrit which could be query-related/impacting | 16:25 |
jeblair | okay i give up | 16:26 |
fungi | specifically, the resulting list at https://review.openstack.org/#/q/project:openstack/stackalytics+status:open is not showing the code review and workflow votes for his "Change company affiliation" patch which can be seen in the details at https://review.openstack.org/#/c/292653/ | 16:26 |
jeblair | i've gone through *a lot* of changes | 16:26 |
jeblair | and so far, the only ones that *might* have reproduced the issue | 16:26 |
jeblair | have been rechecked by AJaeger | 16:26 |
jeblair | i'm going to put zuul's logging configuration back | 16:27 |
fungi | oh, nevermind, this one may be some sort of startup race | 16:27 |
fungi | looks like the approval was added the same moment gerrit was being started | 16:27 |
jeblair | AJaeger: the next time there is a problem with zuul, would you please consult with people who are working on debugging it before leaving 'recheck' comments? they destroy any chance we have to try to understand the problem and fix it | 16:28 |
jeblair | i don't think there's anything left to do here, so i'm going back to -infra | 16:29 |
fungi | i retract my earlier statement now that i have some coffee in me | 16:52 |
fungi | i was comparing the approval time on that change to itself because i pasted the wrong time i was looking at earlier. the start time on gerrit is 11:13 utc, so that change was approved a solid 45 minutes before gerrit was restarted, implying we were indeed missing events | 16:53 |
fungi | but i agree, we can move this back to #openstack-infra | 16:54 |
AJaeger | jeblair: understood, sorry about those rechecks | 18:34 |
*** ig0r_ has joined #openstack-infra-incident | 20:04 | |
*** ig0r__ has quit IRC | 20:06 | |
*** AJaeger has left #openstack-infra-incident | 20:52 | |
*** SamYaple is now known as NotSamYaple | 20:52 | |
*** NotSamYaple is now known as SamYaple | 20:53 | |
*** dhellmann has joined #openstack-infra-incident | 23:07 |