Thursday, 2017-08-17

SpamapSmordred: It's proving pretty gross in python too, so I'm glad I flipped00:04
openstackgerritTristan Cacqueray proposed openstack-infra/zuul-jobs master: Add configure-logserver role  https://review.openstack.org/48911302:31
*** openstack has quit IRC04:42
*** openstack has joined #zuul04:47
*** eventingmonkey has joined #zuul04:47
*** leifmadsen has quit IRC04:47
*** leifmadsen has joined #zuul04:47
*** rfolco has joined #zuul04:47
*** jlk has joined #zuul04:49
*** ianw has joined #zuul04:49
*** jhesketh has joined #zuul04:49
*** jlk has quit IRC04:49
*** jlk has joined #zuul04:49
*** tobiash has joined #zuul04:50
*** jesusaurum has joined #zuul04:51
*** maxamillion has joined #zuul04:52
*** rcarrill1 has joined #zuul04:53
*** cinerama has joined #zuul04:54
*** mnaser has joined #zuul04:55
*** tristanC has joined #zuul05:02
*** isaacb has joined #zuul05:18
*** dmsimard|off has quit IRC05:48
*** dmsimard has joined #zuul05:49
*** dmsimard is now known as dmsimard|off05:49
*** isaacb has quit IRC06:17
*** isaacb has joined #zuul06:28
*** SotK has joined #zuul07:43
*** smyers has quit IRC09:05
*** smyers has joined #zuul09:16
*** electrofelix has joined #zuul09:53
*** smyers has quit IRC10:12
*** robled has quit IRC10:12
*** robled has joined #zuul10:17
*** robled has quit IRC10:17
*** robled has joined #zuul10:17
*** smyers has joined #zuul10:21
*** jkilpatr has quit IRC10:44
*** pbelamge has joined #zuul10:56
pbelamgehello all10:56
pbelamgeI am getting this error in gate pipeline:10:57
pbelamge<QueueItem 0x7fe63c495410 for <Change 0x7fe63c50be10 44,1> in gate> is a failing item because ['it did not merge']10:57
pbelamgeanyone faced this kind of error before?10:57
pbelamgelog entries before this line:10:57
pbelamgehttps://thepasteb.in/p/lOhONzjJ3JZCB10:58
SpamapS2017-07-30 02:00:12,386 DEBUG zuul.GerritSource: Change <Change 0x7fe63c50be10 44,1> did not appear in the git repo11:00
SpamapSpbelamge: perhaps you're using a mirror for merging?11:00
pbelamgehttps://thepasteb.in/p/AnhrA18DMNzHv11:01
pbelamgethat is the git directory mentioned in the zuul.conf and the zuul_url11:02
pbelamgelet me know if you need anything to see if I am missing anything?11:08
pbelamgein layout.yaml I used jenkins as user name and in zuul.conf as well11:21
pbelamgeI am sending Workflow (1) and Code-Review (2) from gerrit11:21
pbelamgeadded jenkins in non-interactive users group in gerrit11:21
pbelamgeso, Verified (2) and Submit is automatically executed11:22
SpamapSSo zuul got the event. But it appears it couldn't find it later.11:22
pbelamgeright11:22
SpamapSrather, it appears Zuul couldn't find the change.11:22
pbelamgewhat could be wrong?11:23
SpamapSnot entirely sure11:23
*** jkilpatr has joined #zuul11:23
SpamapSI'm not super experienced debugging zuul problems. However, have you checked the gerrit logs to see what zuul tried to fetch?11:23
pbelamgeok, let me take a look11:24
pbelamgelog from sshd_log11:28
pbelamgehttps://thepasteb.in/p/LghNnY3jRGOsZ11:32
pbelamgelog from httpd_log11:32
pbelamgehttps://thepasteb.in/p/nZhlN6WJzM3UY11:33
*** jkilpatr has quit IRC11:37
*** jkilpatr has joined #zuul11:37
SpamapSpbelamge: some more experienced zuul debuggers will be online in the next 4-5 hours ... I'm not even supposed to be awake yet. ;)12:05
* SpamapS decides to try and get another hour of sleep after reporting and fixing https://github.com/ansible/ansible/issues/28325 in ansible :-P12:14
*** pleia2 has joined #zuul12:42
*** isaacb has quit IRC12:53
*** gothicmindfood has joined #zuul13:06
*** pbelamge has quit IRC13:25
*** amoralej is now known as amoralej|lunch13:40
*** dkranz has joined #zuul14:04
*** amoralej|lunch is now known as amoralej14:25
*** isaacb has joined #zuul14:37
*** isaacb has quit IRC14:43
*** jeblair has joined #zuul14:56
ShrewsDoes anyone need any help with the current set of "must-be-done" tasks?15:17
SpamapSmordred: hah, thanks for the +1 on that PR. :)15:35
SpamapSmordred: the workaround is to just add all 3 known types. :-P15:36
*** openstackgerrit has joined #zuul15:55
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Use cached branch list in dynamic config  https://review.openstack.org/49461815:55
* SpamapS notes that jamielennox got BonnyCI ALMOST to v3 last night15:59
SpamapSjust ran into https://review.openstack.org/#/c/493059/ which we're fixing now16:00
SpamapS(post-review instead of allow-secrets)16:00
jeblairSpamapS, jamielennox: \o/16:01
pabelangernice16:02
ShrewsSpamapS: for some reason, i thought you were already on v316:12
Shrewsbut yay!16:12
SpamapSI know right?16:15
SpamapShrm... status webapp seems stuck16:17
SpamapS  Job base not defined16:20
SpamapSderp16:20
SpamapSI think just needs a parent: null16:22
jeblairSpamapS: ah yep that's another new thing16:22
SpamapSluckily I've been paying attention :)16:22
jeblairpabelanger: what's the status of the mirror name fix?16:23
jeblairi just had a job bomb because it ran on inap with the wrong mirror name16:23
SpamapSoh yay, as soon as I pushed it.. Zuul saw it and sprang to life.16:25
SpamapSor not16:26
SpamapShm16:26
Shrewsdoes it have to be "parent: base" or is "parent: null" also acceptable?16:29
SpamapSGuessing I'm hitting an unexpected url here....http://paste.openstack.org/show/618704/16:29
SpamapSShrews: the ultimate base needs to be parent: null16:29
jeblairShrews: to define a parent job, "parent: null".  we no longer need to add "parent: base" to every other job as that's now the default.16:29
ShrewsSpamapS: ah, right. makes sense16:29
SpamapSor yeah, all ultimate base jobs I should say, there's not just one..16:30
jeblairSpamapS: yeah, maybe that's hitting the status url with the wrong (or no?) tenant name (which should be the first path component of the url)16:30
SpamapShttps://zuul.opentech.bonnyci.org/BonnyCI/status.json <-- wee16:31
SpamapSjeblair: it's hitting with _only_ the tenant name.16:31
SpamapSshould not really be a server side ERROR log IMO16:31
jeblairSpamapS: i agree16:32
SpamapSmaybe our log config is off16:39
SpamapSbecause it looks like it is being logged using .exception()16:39
SpamapSwhich I'd only expect to see in debug logs16:39
SpamapSlevel=WARNING ... so.. hrm16:40
SpamapSOH16:41
SpamapS.exception() does ERROR? that seems.. hrm.16:41
pabelangerjeblair: I think we need to restart executors16:41
pabelangerlooking now16:42
jeblairSpamapS: yes -- i think that's appropriate.  generally if there's an uncaught exception i want to know about it.  :)  i think we just need to make sure there's a tenant validity check before that point.16:43
jeblairSpamapS: iow -- any *other* error formatting the status json is worth noting :)16:43
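
A quick standalone illustration of the behavior under discussion: Logger.exception() is shorthand for error(..., exc_info=True), so it always emits at ERROR severity with the traceback attached, which is above a WARNING-level handler's threshold:

    import logging

    logging.basicConfig(level=logging.WARNING)
    log = logging.getLogger('demo')

    try:
        {}['missing']
    except KeyError:
        # .exception() logs at ERROR with the traceback attached, so
        # this appears even though the handler level is WARNING.
        log.exception('status request failed')
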
SpamapSyeah, just some lazy coding in there expecting things16:43
SpamapSI'm really just trying to see how to make webapp return 40416:44
SpamapSahh HTTPNotFound k16:44
pabelangerjeblair: ya, looks like we need to restart executors. I can do that once I get some food16:44
jeblairpabelanger: thx16:45
pabelangerShrews: here is an interesting nodepool failure: http://logs.openstack.org/30/493330/2/gate/gate-dsvm-nodepool-ubuntu-src/993f8ce/logs/screen-nodepool-builder.txt.gz#_Aug_17_16_02_33_54375716:48
pabelangerwe lost connection with zookeeper some reason16:48
SpamapSThat should be something that resolves itself.16:49
SpamapSOne should expect to lose one's zk from time to time.16:49
Shrewspabelanger: but seems like the builder recovered correctly, yeah?16:49
Shrewsyeah, can't do anything about zk disappearing when we only have one node in the cluster. as long as it recovers correctly...16:50
Shrewsbut in a dsvm job, you really don't expect zk to just go away. weird that it did16:51
Shrewslooks like it happened during the upload, that failed, then the upload worker suspended itself correctly.16:54
openstackgerritClint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Return 404 on unknown tenants  https://review.openstack.org/49464216:54
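
A hedged sketch of the tenant-validity check that change adds, written against webob (which Zuul's webapp used at the time); the handler shape and the tenants mapping are assumptions, not Zuul's actual code:

    import webob
    import webob.exc

    def status_app(request, tenants):
        # The tenant name is the first path component of the URL,
        # e.g. /BonnyCI/status.json -> "BonnyCI".
        tenant_name = request.path.strip('/').split('/')[0]
        if tenant_name not in tenants:
            # Unknown tenant: return 404 rather than letting a lookup
            # error escape and be logged as an unhandled exception.
            raise webob.exc.HTTPNotFound()
        # "status()" is a hypothetical accessor for the tenant's
        # status payload.
        return webob.Response(json_body=tenants[tenant_name].status())
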
pabelangerShrews: ya, likely need to recover more gracefully for d-g hook: http://logs.openstack.org/30/493330/2/gate/gate-dsvm-nodepool-ubuntu-src/993f8ce/console.html16:54
pabelangerlets see if it happens more often16:55
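
For context on the "suspended itself correctly" behavior: kazoo delivers connection-state transitions to a listener, and a worker can pause on SUSPENDED and resume on CONNECTED. A minimal sketch, with prints standing in for real suspend/resume logic:

    from kazoo.client import KazooClient, KazooState

    zk = KazooClient(hosts='127.0.0.1:2181')

    def state_listener(state):
        # kazoo reports SUSPENDED when the connection drops and LOST
        # when the session expires; a worker can pause itself on
        # SUSPENDED and pick work back up once CONNECTED returns.
        if state == KazooState.SUSPENDED:
            print('zookeeper connection suspended; pausing uploads')
        elif state == KazooState.LOST:
            print('zookeeper session lost; state must be rebuilt')
        else:  # KazooState.CONNECTED
            print('reconnected; resuming uploads')

    zk.add_listener(state_listener)
    zk.start()
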
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Use cached branch list in dynamic config  https://review.openstack.org/49461817:38
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Allow multiple semaphore definitions within a project  https://review.openstack.org/49465017:38
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Reload configuration when branches are created or deleted  https://review.openstack.org/49465117:38
pabelangerjeblair: just ze01.o.o or are we launching jobs across other executors now for restarting17:48
jeblairpabelanger: all 4 now17:48
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: WIP Replace paste/webob with aiohttp  https://review.openstack.org/49465517:50
mordredSpamapS: saw you poking at webapp things - I wrote that ^^ a couple of months ago as an exploration - figured I'd stick it up in case anyone feels like poking further at webapp/zuul-web, in case it's helpful17:51
pabelangerjeblair: thanks17:51
pabelangerHmm, looks like we leak /var/run/zuul-executor/zuul-executor.pid on stop. Will find out why17:53
pabelangerk, all have restarted. Now to fix puppet-zuul17:55
jeblairpabelanger: because zuul can't delete it because it's written as root but zuul drops privileges17:57
jeblairmordred: this error is new to me: http://logs.openstack.org/50/494650/1/check/tox-pep8/50bd841/job-output.txt.gz#_2017-08-17_17_43_53_01904017:58
*** jkilpatr has quit IRC18:02
jeblairand again: http://logs.openstack.org/51/494651/1/check/tox-pep8/c1a2cb1/job-output.txt.gz#_2017-08-17_17_56_46_88820118:02
jeblairi wonder if those have something to do with the executors restarting18:03
*** jkilpatr has joined #zuul18:03
jeblairpabelanger, mordred: what's the status on tarball/publish jobs?18:04
pabelangerlet me pull up etherpad18:05
pabelangerboth ssh / testpypi credentials are added to project-config: https://review.openstack.org/494276/ was to start testing testpypi creds on executor and if pip install was working18:06
pabelangerI'be added the publish-openstack-python-branch-tarball to post pipeline for zuul, and confirmed it uploaded to zuulv3-dev.o.o18:07
pabelangermordred: still has (pre-)python-tarball playbooks to create, but happy to take over if needed18:08
jeblairpabelanger: +3 49427618:08
pabelangertwine role is started, and ready to push up pending above18:08
pabelangerso, I can take back pre-python-tarball and python-tarball playbooks now to keep iterating forward18:09
jeblairmordred: ^ are you on that or do you want to hand it off?18:10
jeblairalso, other folks were asking about what they could do to help.18:10
pabelangermost of the playbook content is written, we just need to reorg it18:10
jeblairpabelanger: when 494276 lands you should be able to move forward on testing pypi uploads at least, right?18:10
pabelangerjeblair: yes, I'll push on that18:11
mordredjeblair, pabelanger: I'm fine with pabelanger taking those if he's in the groove on it18:11
pabelangerk18:11
mordredI wanted to see if I could get the secret name thing in real quick - cause I think we can use it to good effect on both tarballs and logs18:12
pabelangerwhich one is that?18:12
mordredpabelanger: oh - I also, speaking of - I think we can collapse the logs role/job to at least use the add-fileserver role like you mentioned yesterday18:12
mordredpabelanger: 494650 - I need to fix a unittest real quick (now that I'm done with this morning's project-config fun :) )18:13
pabelangermordred: ya, I did that already with https://review.openstack.org/494314/ for base-test logs18:13
mordredjeblair: OH MY - that's a fantastic error18:13
mordredpabelanger: neat!18:13
pabelangerI think you have the other patch in merge conflict18:13
jeblairmordred: yeah, i'm kinda inclined to see if it shows up again not in proximity to an executor restart before we spend too much time on it18:14
mordredjeblair: kk18:14
jeblairmordred: (even though i would not expect an executor restart to manifest like that;  just trying not to rabbithole)18:14
mordred"dictionary changed size during iteration" <-- seems extra weird - but yah18:14
pabelangerI see that a lot when doing the python3 conversion in nodepool18:15
mordredyah - I'm just not sure why we'd only see it near an executor restart - so I'm hoping it's just a heisenbug18:15
jeblairmordred: let me know when you have the secrets name thing fixed (or if you need another set of eyes on it).  i'm working on the branch cache (should be done, just waiting on stable test results) and will work on the tmpfs-for-secrets idea next (probably after lunch).18:16
mordredjeblair: cool18:16
mordredjeblair: and will do - also, when you get a spare second, https://review.openstack.org/#/c/494260/ will let us do https://review.openstack.org/#/c/494281/18:17
mordred(and the depends have landed, etc)18:17
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Allow requesting secrets by a different name  https://review.openstack.org/49434318:19
mordrednobody should let me code18:19
jeblairmordred: hire 4 devs and you have a deal.  otherwise, no dice.18:21
mordredit's literally the same thing I fixed between patch 1 and patch 2 - just one line down18:23
jeblairmordred: ara: +2 on the first, -1 on the second (one of us is confused)18:25
*** electrofelix has quit IRC18:30
mordredjeblair: I just got another ANSIBLE PARSE ERROR18:30
jeblairokay so that's a thing :(18:31
jeblairmordred: i wonder if our ansible version is different on the new executors18:31
jeblairmordred: left some comments on secrets change18:33
mordredjeblair: well - I got it on ze01 - but I have now officially switched to trying to figure out what's up18:34
jeblairhrm, ansible 2.3.2.0 on ze01 and ze04.18:34
jeblairmordred: ok cool.  i have to afk now, back after lunch.18:34
mordredyah - and both with the same python version18:34
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Allow requesting secrets by a different name  https://review.openstack.org/49434318:39
Shrews            for key in result_dict.keys():18:43
Shrews                if key != 'changed':18:43
Shrews                    result_dict.pop(key)18:43
Shrewsmordred: that seems suspicious18:44
Shrewsthe "if '_zuul_nolog_return'" portion of v2_runner_on_ok()18:44
mordredShrews: yes - I agree- that looks very bad18:46
mordredI _think_ there are two issues here ... one is an issue in the callback plugin - the second is that there was a problem connecting18:47
Shrewsmordred: interestingly, i can reproduce that 'dict changed size' error on that exact code, but so far only in py3. we aren't running ansible under py3, are we?18:49
mordredShrews: the local code - the callback plugin - does run under python318:51
Shrewsthat doesn't seem safe, since ansible isn't really py3 ready18:52
mordredthe core folks said core is py3 ready - it's modules that are the problem (we chatted with them about it)18:52
*** jesusaurum is now known as jesusaur18:52
Shrewsah yes, i believe i have seen a similar statement18:52
mordredShrews: so running ansible-playbook with python3 but setting ansible_python_interpreter so that it uses python2 on the remote nodes18:53
mordredand toshio said if we encounter any py3 issues in core they'd consider them urgent/bad bugs18:53
Shrewsmordred: want me to put up a fix for that?18:56
mordredShrews: sure! thanks18:56
openstackgerritDavid Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix zuul_stream callback dict modification  https://review.openstack.org/49467918:58
Shrewsmordred: jeblair: ^^^18:58
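
The fix follows the standard remedy for this Python 3 pitfall: keys() is a live view of the dict, so the loop has to iterate over a snapshot instead. A sketch of the corrected loop:

    # Under Python 3, popping from result_dict while iterating its
    # keys() view raises "RuntimeError: dictionary changed size
    # during iteration"; copying the keys first avoids that.
    for key in list(result_dict.keys()):
        if key != 'changed':
            result_dict.pop(key)
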
mordrednow- why that playbook couldn't connect is a whole different question to track down - especially considering we're using persistent conection stuff18:59
mordredSpamapS: we did wind up using persistent connections yeah?19:00
*** amoralej is now known as amoralej|off19:01
SpamapSmordred: fer what? sorry, I lost context.19:01
mordredSpamapS: for ansible to remote nodes19:01
SpamapSOh yes we're still persistent.19:01
SpamapSBut we don't persist across runs.19:01
SpamapSThe bubblewrap dies and kills the ssh with it.19:02
mordredhrm19:02
SpamapSNot doing that proved pretty squirrelly (Technical Term)19:02
SpamapSI think it would be doable, but we had to get down into more ansible guts to do it right IIRC.19:03
pabelangermordred: jeblair: good news, pip install worked in bwrap on executor19:07
pabelangerjeblair: I think we also need to restart zuulv3.o.o for nodepool.cloud variable. I can see nodepool adding it to zookeeper, but still not showing up in our inventory files.19:10
pabelangerI only reset our executors19:10
mordredpabelanger: pip install --user yeah?19:12
pabelangermordred: yes19:13
pabelangermordred: it is missing our pip.conf settings, so we'll have to iterate on that too19:14
mordredpabelanger: good point19:14
mordredpabelanger: we'll need to be careful with that - as executors are in dfw, so we can't use any nodepool variables that we normally would for setting mirrors19:18
Shrewshrm, i wonder if https://review.openstack.org/494679 will need to be force merged19:20
*** isaacb has joined #zuul19:21
pabelangermordred: ++19:23
mordredpabelanger: we might need to have a special role or version of configure-mirror for setting up only ~/.pydistutils and ~/.pip.conf?19:24
mordredpabelanger: oh - I've got an idea ...19:25
pabelangerI'm going to restart zuulv3.o.o to pick up new inventory variables for nodepool.cloud19:25
mordredkk19:25
pabelangermordred: also waiting for your idea19:25
pabelangerzuulv3.o.o back19:27
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Allow configure-mirrors to do only homedir  https://review.openstack.org/49468419:30
mordredpabelanger: how about that ^^ ?19:30
pabelangerya, that might work19:31
mordredpabelanger: then in the publish playbook you're working on, since that's specific to openstack anyway, you can just do role: configure-mirrors: mirror_fqdn: mirror.dfw.rax.openstack.org only_homedir: true19:31
mordreds/role: /roles: - /19:31
mordredShrews: I'm curious as to why this issue with the dict just started all of a sudden19:33
Shrewsyah, me too19:33
jeblairShrews: thanks! +319:33
mordredalso - that's in if '_zuul_nolog_return' in result_dict: which only occurs in like one place19:33
Shrewslike i said, i don't think it will merge w/o magic19:34
jeblairShrews: oh why not?19:34
Shrewsb/c zuulv3 has the buggy version, causing the failure, yeah?19:34
jeblairShrews: i thought it was sporadic19:34
Shrewsjeblair: my patch failed w/ the same error19:35
Shrewsso... less sporadic now? dunno19:35
mordredjeblair: turning on -vvv to the logs requires a restart right? or do we have a command for that?19:35
jeblairmordred: zuul-executor verbose19:35
mordredI'm going to turn that on19:36
jeblairkk19:36
jeblairmordred: we have 4 executors running now19:36
jeblairmordred: if you want to scale that back to 1, i think that'd be fine19:36
mordredjeblair: I actually haven't seen the error anywhere OTHER than ze01 fwiw21:37
jeblairmordred: curiouser and curiouser19:37
mordredyah ....19:37
Shrewsand failed yet again  :(19:37
Shrewsweird it's just showing up19:37
jeblairShrews: okay i'll push it through19:37
mordredAug 17 19:28:12 ze01 kernel: [6224688.842500] traps: ansible-playboo[28587] general protection ip:50ee24 sp:7ffe1c61e848 error:0 in python3.5[400000+3a8000]19:38
jeblairmordred: if you're ready for that -- unless you want.....19:38
mordredwe're getting segfaults19:38
mordredfrom ansible-playbook19:38
jeblairi um19:38
mordredthat's in /var/log/syslog19:38
SpamapSwoof19:38
Shrewsneat19:38
mordredwhich have not been in the syslog until just recently19:39
mordredthat, in fact, was the first occurrence19:39
jeblairmordred: so not 1:1 with these errors?19:40
Shrews2.4 wasn't released was it?19:40
jeblairShrews: we're on 2.3.2.019:41
jeblairon all 4 executors19:41
Shrewsk. i knew 2.4 was very close19:41
mordredjeblair: no - there's only a few of them19:41
jeblairmordred: i think the segfaults have been around a while19:48
mordredawesome19:48
jeblairmordred: they're in all our syslogs, back to aug 1119:48
mordredmaybe that happens when we restart an executor with a playbook running?19:48
jeblairmordred: maybe?19:48
mordredthing to keep our eyes on at least19:48
jeblairi'm at least inclined to separate it from the callback issue at this point19:48
mordred++19:49
mordredagree19:49
jeblairso we've got callback first reported (could be older -- this is manual) around 17:5819:49
mordredI'm rechecking the secret patch while verbose is on to see if I can get a traceback in the logs19:49
jeblairmordred: i have a kernel of an idea but it's not fleshed out -19:52
mordredok19:53
jeblairmordred: our jobdir paths just got longer, and some of the paths in an error i just checked are 120 bytes long; the shebang limit is 128.  i don't know how that could come in to play.19:53
jeblairmaybe the ansible module interpolation stuff?  i dunno.  just brainstorming.19:54
mordredoh. well- that could be the cause of the remote error - it's possible the issue with callback warning has been happening for a while and we just didn't notice19:54
mordredsorry - don't know if you caught that when you came back - there are two issues - the job isn't failing because of the callback error19:54
mordredjeblair: http://logs.openstack.org/43/494343/4/check/tox-pep8/4b474ca/job-output.txt.gz#_2017-08-17_18_43_45_46099719:55
jeblairmordred: ah.  so force-merging Shrews' change won't help.19:55
mordredjeblair: is a real issue19:55
jeblairright, but not one which causes ansible to fail?19:55
mordredno - I mean that link is the link to the real issue19:55
jeblairmordred: yes i know.  i see both issues.  i'm just trying to understand which is fatal.19:56
jeblaircallback, unreachable, or both?19:56
mordredthe unreachable one19:56
Shrewsmight this be a good time to try autohold?19:56
mordredcallback errors are never fatal - they just cause lack of further output to log for that task19:56
jeblairmordred: run 'zuul-executor keep' as well please so we can inspect the jobdir contents19:57
jeblairShrews: maybe so19:58
mordredjeblair: done19:58
jeblairit's also worth entertaining the idea that, if this only happens on infracloud nodes, the network is legitimately in a worse-than-normal state.19:58
mordredjeblair: the error is happening consistently right after Install /etc/pip.conf19:59
jeblairinfracloud had its mirror hosts replaced this morning.  i also can't tie that to the problem, but it's worth noting.19:59
jeblair(aside from the fact that the network there is incomprehensible and we have switches acting as hubs so it's anyone's guess)20:00
mordredoh - sorry - it's happening consistently ON Install /etc/pip.conf20:00
pabelangerYa, new mirrors and slow networking20:00
jeblairmordred: oh, so that's between tasks within one playbook; it's not on a playbook boundary.  so that reduces (not eliminates) likelihood of it being a network problem since that *should* be cached.20:01
jeblair(pydistutils.cfg task comes after pip.conf task)20:02
mordredyah20:02
mordredthis makes me think back to your shebang path thing20:02
jeblairmordred: get a hit with vvv and keep yet?20:02
mordred/var/lib/zuul/builds/4b474ca8037b49c3afd351b9dee20611/.ansible/remote_tmp/ansible-tmp-1502995425.1263802-116590003515092 is what it reports as the remote_tmp20:02
mordredjeblair: with vvv yes20:02
mordrednothing new interesting in log20:03
jeblairokay, rechecking20:03
jeblairmaybe with keep we can grep for long paths20:03
jeblairansible-tmp-1502995425.1263802-116590003515092 looks like it may be a variable length path component too20:03
mordredjeblair: in log ..20:04
mordredb'mkdir: cannot create directory \\xe2\\x80\\x98/var/lib/zuul\\xe2\\x80\\x99: Permission denied\\n')"20:04
mordredmkdir -p \\"` echo /var/lib/zuul/builds/9ed3901c28ff4e72b4b6233cbb08fce2/.ansible/remote_tmp/ansible-tmp-1502999815.994324-89732834853800 `\\"20:05
mordredis the command it reports associated with that I believe20:05
mordredjeblair: http://paste.openstack.org/show/618720/ <-- relevant section in whole20:06
jeblair\\xe2\\x80\\x98 is utf-8 for "Left single quotation mark"20:08
mordredyah. '‘/var/lib/zuul’' is what I got from python decode20:08
*** jkilpatr has quit IRC20:09
jeblair/var/lib/zuul should pretty much either exist (yes) or not exist within bwrap.20:10
jeblairer remote_tmp?20:11
jeblairis this happening on the remote host?20:11
mordredyah- I think that's it trying to make the remote_tmp dir - and yes, I think so?20:11
mordredyes: b"<15.184.65.176> (1, b'', b'mkdir: cannot create directory \\xe2\\x80\\x98/var/lib/zuul\\xe2\\x80\\x99: Permission denied\\n')"20:11
jeblairso that's changed recently too, from /tmp/.... to /var/lib/zuul/....20:12
jeblairbecause the remote_tmp is set to the same path as the local_tmp20:12
jeblairi guess when we did that, we didn't expect the local tmp path to be anything different20:12
jeblairwhy would this *sometimes* work?20:12
pabelangeris it failing all the time now?20:13
mordredyah. so maybe we have some hosts missing a writable /var/lib/zuul? have we changed /var/lib/zuul on base images any time recently?20:13
mordredand yah - I think we should try the autohold feature - it might be good to look at a node and figure out what's up with it20:14
jeblairpabelanger: i don't think so.20:14
jeblairmordred: you want to do that?  'zuul autohold' on zuulv3.o.o20:14
mordredkk20:14
jeblairmordred: maybe let's add it for tox-pep8 tox-py35 tox-docs and tox-cover20:15
mordredthe interface says job is optional ...20:15
mordredcan I just add it for openstack-infra/zuul ?20:15
mordredShrews: ^^ ?20:15
jeblairi'll work on a patch that changes remote_tmp20:16
mordredand does project need to be git.openstack.org/openstack-infra/zuul ?20:16
mordredok. tenant job and reason are required20:16
mordredShrews: I would like to report a bug in the helptext :)20:16
Shrewsi think openstack-infra/zuul will be enough since it's unique20:17
pabelangerjeblair: I am wondering if we had older version of ansible-playbook and when I restarted ze01 this morning, it started using ansible 2.3.2.020:17
mordredneat20:17
mordredconfigparser.NoSectionError: No section: 'gearman'20:17
mordredoh. I should not be mordred20:18
mordredI have submitted an autohold for tox-pep8 tox-py35 tox-docs and tox-cover20:18
mordredI'm going to recheck20:19
Shrewsmordred: | 0000017891 | infracloud-vanilla   | nova | ubuntu-xenial | fbc53ee2-2044-4710-b276-a0fb73cc623e | hold     | 00:00:00:02 | unlocked | ubuntu-xenial-infracloud-vanilla-0000017891   | 15.184.65.200 | 15.184.65.200 |      | 22       | nl01-6551-PoolWorker.infracloud-vanilla-main   | openstack git.openstack.org/openstack-infra/zuul tox-pep8 | track down connectivity problem |20:20
Shrewsfyi20:20
pabelanger2017-08-16 20:03:48,173 DEBUG zuul.AnsibleJob: [build: 917b0d5c1778477eae27ca466ef70a66] Job root: /var/lib/zuul/builds/917b0d5c1778477eae27ca466ef70a66 was the first time we started using /var/lib/zuul/builds on ze01.o.o20:20
pabelangerwhich resulted in http://logs.openstack.org/10/494310/1/check/tox-py35/917b0d5/job-output.txt.gz error20:21
mordredShrews: woot20:22
pabelanger0ce80c17f2004e098b2092e37204cdc2 was the last job to run using /tmp, and didn't have that issue20:22
pabelangermind you the job was aborted20:22
Shrews| 0000017896 | inap-mtl01           | nova | ubuntu-xenial | 335daeb7-e5f5-4fe2-b28b-d6774d877701 | hold     | 00:00:01:59 | unlocked | ubuntu-xenial-inap-mtl01-0000017896           | 198.72.124.67 | 198.72.124.67 |      | 22       | nl01-6551-PoolWorker.inap-mtl01-main           | openstack git.openstack.org/openstack-infra/zuul tox-docs | track down connectivity problem |20:23
Shrewsis the other one so far20:23
mordredjeblair: there is, in fact, no /var/lib/zuul on the node20:23
pabelangerOh, is that on the remote side?20:24
pabelangerYa, that explains it20:24
mordredyah20:24
pabelangerwe used /tmp20:24
mordredjeblair: OH !  I've got the whole thing now20:24
mordredjeblair: ze01 is the only executor using /var/lib/zuul20:24
mordredthe others are still using /tmp20:24
mordredthat's why it only fails sometimes20:24
pabelangerYa20:24
mordredand always on ze0120:24
mordredPHEW20:25
mordredwell - we know the entire story now :)20:25
mordredand it turns out the error is, in fact, that it can't create a directory20:25
pabelangerya20:25
pabelanger /tmp is 77720:25
pabelangerwe should likely have it use /home/zuul or ansible_ssh_user on the remote side20:25
jeblairwhy are the other executors not using varlibzuul?20:25
pabelangerI don't see puppet running on ze0220:26
pabelangerdid we accept ssh keys on puppetmaster?20:26
jeblairpabelanger: probably not20:27
jeblairwhy isn't that part of launch-node?20:27
pabelangerYa, hostkeys aren't accepted on puppetmaster20:28
jeblairpabelanger: would you mind fixing that please?20:28
pabelangerjeblair: not sure, it has always been a manual process since I've been creating nodes20:28
pabelangersure20:28
pabelangeraccepted now20:30
*** jkilpatr has joined #zuul20:30
pabelangerI cannot remember why we changed remote_tmp in ansible.cfg20:31
jeblairmordred, pabelanger, Shrews: i need help with this.  this is the reason the local and remote directories have the same root: https://review.openstack.org/34688020:31
pabelangerwas it something with async?20:31
jeblairkeep in mind, that's for zuulv2.5.  i haven't worked out if it's applicable still.20:31
mordredjeblair: my reading there is that it's mostly about async20:32
pabelangerYa, that is what I seem to remember too20:32
mordredoh - aso - we don't set keep_remote_files anymore20:33
mordred            # NB: when setting pipelining = True, keep_remote_files20:33
mordred            # must be False (the default).  Otherwise it apparently20:33
mordred            # will override the pipelining option and effectively20:33
mordred            # disable it.20:33
mordredso I think both reasons we did that are now gone20:33
jeblairmordred, pabelanger: okay -- two options: 1) we can remote remote_tmp and revert to default behavior.  or 2) i can create a new local tmpdir explicitly for setting remote_tmp (this will work both locally and remotely, and we can make sure it's cleaned up with jobdir).20:33
pabelangerright, I seem to remember that also20:34
pabelangerI am open to trying option 120:34
mordredme too20:34
jeblairoption 1 should say '*remove* remote_tmp' but i bet you got that20:34
pabelanger+120:35
mordredour setting of this seems mostly to have been about working around an issue with two things we don't use anymore - so I vote for option 1 because it's less complexity20:35
mordredand if there IS a reason we need to be explicit about it - I'd like to learn that reason and document it20:35
mordredjeblair: that said - I'm not *oppposed* to 2 and think that would also work fine - I just don't think we need it20:36
Shrewsmordred: don't forget to delete your held nodes when you're done with them20:36
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove remote_tmp setting  https://review.openstack.org/49469520:36
mordredShrews: what's my process for that?20:36
Shrewsmordred: nodepool delete ID   .... i can do it for you since i'm already there20:37
jeblairmordred, pabelanger, Shrews: ^ there's that.20:37
mordredShrews: cool. thanks20:37
mordredjeblair: I'm going to turn off verbose and keep20:37
Shrewsmordred: safe to delete them now?20:37
jeblairwe can merge that by shutting down ze01, or force-merge it and restart all of them.20:37
mordredShrews: yah20:37
mordredjeblair: I'm fine with either approach - we should restart all of them anyway though to pick up that change20:38
mordredjeblair: so maybe shut down ze01, land that change, then restart all20:38
mordredI'm on ze01 and can shut it down real quick20:39
pabelangermordred: jeblair: just thinking, with bwrap now, we might also be able to drop local_tmp too. Since each ansible-playbook is namespaced20:40
*** isaacb has quit IRC20:40
pabelangeralso, I have +2'd 49469520:40
jeblairmordred: ++20:40
jeblairlet's make sure we get that and shrews change in place for the restarts20:40
mordreddone20:40
mordredthe Shrews change landed already I believe yeah?20:41
jeblairnope 494679 i just rechecked it20:41
jeblairshould go in now20:41
jeblairand i self-approved my change as it has 2x+220:42
jeblair(my change may still fail check if it hit ze01 in which case we may need to recheck)20:42
mordredjeblair: k. we're gonna have to recheck your change - it managed to get three fails before ze01 went away20:42
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove remote_tmp setting  https://review.openstack.org/49469520:43
jeblairthat'll speed things up.  commit msg mod20:43
mordredhttps://review.openstack.org/#/c/494343/ is reviewable while we're waiting20:43
openstackgerritClint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add known_hosts generation for multiple nodes  https://review.openstack.org/49470020:46
jeblairmordred: lgtm20:47
SpamapS^^ is a bit rough20:47
SpamapSand I'm not sure we want that in the predominant base job :-P20:47
*** dkranz has quit IRC20:48
mordredSpamapS: dude20:51
SpamapScan you imagine that in jinja? ;-)20:52
mordredNO20:52
SpamapSneither could I20:52
jeblairnow.... xpath... you could totally do that in xpath.20:53
SpamapSyou can build minecraft in xpath20:53
mordredyou can build xpath in minecraft20:53
clarkbre ecdsa aren't you not supposed to use those at all? (Just thinking it's a lot of effort to go through to support that when you shouldn't even use them)20:56
clarkbI apparently have ecdsa host keys on this machine though. Interesting20:57
clarkbalso we are assuming no dsa because we aren't going to talk to ancient things anymore?20:57
jeblairit looks like all changes are now failing with this: http://logs.openstack.org/95/494695/2/check/tox-docs/efc3986/job-output.txt.gz#_2017-08-17_20_53_28_87968021:02
jeblairpabelanger: are we still missing a piece for the nodepool cloud mirror thing?21:02
pabelangerjeblair: no, just waiting for puppet to apply the change to servers21:04
pabelangerokay, so that is it21:04
pabelangerah21:04
pabelangernodepool.cloud is missing from that inventory21:04
pabelangerwhy is that21:05
jeblairpabelanger: could it be because the executors are running an old zuul?21:05
pabelangerya, that is possible. I didn't confirm /opt/zuul was latest version on zuul executors21:06
jeblairokay, i'm going to fix this21:06
pabelangerthanks21:06
pabelangerhttp://logs.openstack.org/79/494679/1/check/tox-cover/a8082ad/inventory.yaml does have nodepool.cloud, which is ze0121:08
jeblairmordred: this one (A worker was found in a dead state) hit again: http://paste.openstack.org/show/618724/21:17
jeblairi'm convinced at this point that's going to kill us in production21:17
jeblairhttps://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/strategy/__init__.py#L58421:21
jeblairmordred, dmsimard|off: ara seems to be adding "Initializing new DB from scratch" to the start of our job output; is there a way to avoid that?21:24
jeblairsee top of http://logs.openstack.org/79/494679/1/check/tox-docs/4b6b007/job-output.txt.gz21:24
jeblair2017-08-17 21:14:22,654 DEBUG zuul.AnsibleJob: [build: 4b6b0071d22b4093aa915b76d9d713e6] Ansible output: b'ERROR! A worker was found in a dead state'21:26
jeblairAug 17 21:14:22 ze01 kernel: [6231058.789222] ansible-playboo[10382]: segfault at a9 ip 000000000050ee24 sp 00007fffb6c19a38 error 4 in python3.5[400000+3a8000]21:26
jeblairmordred: ^ the "dead state" error is caused by segfault21:26
jeblairnote they check for sigsegv here: https://github.com/ansible/ansible/blob/devel/lib/ansible/executor/task_queue_manager.py#L33321:28
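
For reference, this is roughly how a parent detects a signal-killed child in Python, which is the kind of check that ansible code performs; the playbook invocation is a placeholder:

    import signal
    import subprocess

    proc = subprocess.run(['ansible-playbook', 'playbook.yaml'])  # placeholder invocation
    if proc.returncode < 0:
        # A negative returncode from subprocess means the child was
        # killed by a signal; -returncode is the signal number.
        if -proc.returncode == signal.SIGSEGV:
            print('ansible-playbook segfaulted')
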
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Remove remote_tmp setting  https://review.openstack.org/49469521:32
jeblairSpamapS: i think we need to allow ansible-playbook to write a core file.  it's running in bwrap in a python subprocess.  any ideas how to do that?21:35
jeblair(also, we need to consider where it will be written)21:36
jeblair(well, technically it's an ansible worker process that's getting the segv and will coredump, but i assume getting this to work for ansible-playbook should work for the worker process too)21:37
SpamapSjeblair: ulimit should work inside the process namespace. We can do that with a wrapper around ansible-playbook21:37
jeblairSpamapS: so a little shell script which does "ulimit -c unlimited; ansible-playbook ...." then we call that from our popen?21:41
SpamapSjeblair: exactly.21:42
jeblaircool, i'll hack that up real quick...21:42
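
A sketch of the same idea without a shell wrapper, raising the limit in the forked child just before exec; the command and cwd are placeholders:

    import resource
    import subprocess

    def _unlimit_core():
        # Runs in the forked child before exec, so only the
        # ansible-playbook process (and its children) are affected.
        resource.setrlimit(resource.RLIMIT_CORE,
                           (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

    proc = subprocess.Popen(
        ['ansible-playbook', 'playbook.yaml'],        # placeholder command
        preexec_fn=_unlimit_core,
        cwd='/var/lib/zuul/builds/example/work')      # placeholder jobdir
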
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Fix zuul_stream callback dict modification  https://review.openstack.org/49467921:42
SpamapSjeblair: I presume you already tried just letting the executor write cores?21:42
SpamapSI'm not 100% sure but I thought when you make a new process namespace it takes the parent's settings.21:42
jeblairSpamapS: i have not.... i... perhaps erroneously... thought it wouldn't make it all the way through that.21:42
jeblairthat would be much easier so maybe we should try that first.21:43
SpamapSIt may not21:43
SpamapSyou can test that with zuul-bwrap21:43
jeblairgood point21:43
jeblairSpamapS: that works21:44
jeblairi'll start with the init script then21:44
jeblairnow where will the core be written i wonder?21:45
SpamapSpwd21:45
SpamapSso if that's not writable, that's a problem21:46
SpamapSI believe we chdir to / of the bwrap21:46
jeblairwell, it's more that it's in the jobdir so it will be deleted21:46
jeblairso i guess we're going to need to run with 'keep' enabled for a while to catch this21:46
SpamapSah no, no chdir21:46
jeblairthe popen has cwd=self.jobdir.work_root21:47
EmilienMjeblair: fyi, dmsimard is on pto and back next week IIRC21:47
jeblairso it'll be in jobdir/work i think21:47
SpamapSwe chdir to {workdir} so yay21:47
jeblairEmilienM: thanks21:47
SpamapSjeblair: yeah and the bwrap also does --chdir {workdir}21:48
SpamapSis the kernel the same version on all nodes btw?21:54
SpamapSjust wondering21:54
jeblairSpamapS: ze01 is older 4.4.0-79-generic vs 4.4.0-92-generic21:56
jeblairSpamapS: you reckon i should reboot?21:56
SpamapSNot the worst idea, if for no other reason than normalization21:57
jeblairSpamapS: yeah; though to increase the chances of catching the error, i was planning on only keeping ze01 in service21:57
SpamapSthat 4.4.0 kernel in xenial is particularly insidious .. it's really a bunch of post-4.4.0 patches masquerading as 4.4.021:58
jeblairbut if something in the kernel has altered it (fixed or made worse), i'd rather know sooner, so i'll reboot21:58
SpamapSYeah I can't point to anything and go "It's that!" but... that's hundreds of patches.21:59
jeblairi'll wait till the current crop of tests finish21:59
jeblairwe got 4-6 of these an hour today21:59
SpamapSI remember there was a really bad bug in the python in Ubuntu 14.04 that broke our tests for a while. I wonder if we have something similar going on.22:01
clarkbthat bug was in the garbage collector trying to free already freed entries in a circular list22:16
clarkbso possible you've found another one of those in python22:16
jeblairclarkb: py3.5?22:19
clarkbit was 3.422:19
jeblairze01 is rebooted, running latest kernel, with ulimit -c unlimited and keep enabled22:19
jeblairso we're looking for Ansible output: b'ERROR! A worker was found in a dead state' in the log after 2017-08-17 22:0022:20
pabelangerk22:22
jeblairi'm going to recheck all zuul patches.22:22
*** openstackgerrit has quit IRC22:33
jeblairoho 3 segfaults i think22:37
jeblairyeah, timestamps line up22:37
SpamapSstatus lighting up like a christmas tree22:37
jeblairno core files though :(22:38
SpamapSblurgh22:43
*** openstackgerrit has joined #zuul22:47
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Remove callback_whitelist setting  https://review.openstack.org/48821422:47
dmsimard|offjeblair: re - db initialization in ARA, not yet but it's on my to-do.22:50
* mordred now has working water in his house and does NOT have water running down the street from the broken sprinkler pipe22:52
mordredjeblair: have I missed anything new or exciting?22:54
jeblairSpamapS: i used /proc to verify that the zuul process was ulimit -c 0 still (maybe something something systemd?  or start-stop-daemon?  or python-daemon?)22:54
jeblairSpamapS: i use prlimit (which i just learned about!) to set it to unlimited while running (!!!!one!)22:55
jeblairmordred: yeah; the scrollback about segfaults is interesting; that's what i'm poking at22:55
SpamapSjeblair: ooooooooooooooooooo neat22:55
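
prlimit(2) is also exposed to Python 3.4+ on Linux as resource.prlimit(), so the same live adjustment can be scripted; the pid here is hypothetical:

    import resource

    pid = 12345  # hypothetical: the running zuul-executor process
    # Raise the core-file size limit of an already-running process,
    # without restarting it.
    resource.prlimit(pid, resource.RLIMIT_CORE,
                     (resource.RLIM_INFINITY, resource.RLIM_INFINITY))
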
jeblairnow waiting for another segfault, post 22:55 utc22:56
jeblairoh wow all the jobs are done?22:56
jeblairi guess most of those were merge conflicts22:56
mordredjeblair: just read - lovely!22:57
jeblairrechecking all changes again22:57
mordredjeblair, dmsimard|off: also, I'm very curious as to how the db init line from ara got in to the zuul-stream log file22:57
dmsimard|offmordred: me too, likely faulty logging config from ara22:58
mordredaha. nod. I can squelch that22:59
jeblairit looks like web log streaming may be broken22:59
dmsimard|offlogging config (and logging in general) in ara is either horrible or nonexistent so it wouldn't surprise me22:59
jeblairdirect finger to ze01 works22:59
jeblairmordred, dmsimard|off: we're also getting alembic log lines to ansible stdout/stderr (so they're showing up in the executor debug log)23:01
jeblair2017-08-17 22:59:07,638 DEBUG zuul.AnsibleJob: [build: 52eab059248f4642b08660e3988cfc64] Ansible output: b'INFO  [alembic.runtime.migration] Running upgrade 22aa8072d705 -> 5716083d63f5, ansible_metadata'23:01
jeblaireg ^23:01
dmsimard|offjeblair: that would also be ara, yes23:01
jeblairthat's not user visible, but it does make the executor logs chatty so would be nice to clean up (the entries showing up in the console log are more important to nix before production)23:02
dmsimard|offhttps://storyboard.openstack.org/#!/story/2000931 is aptly described as "Get rid of alembic foreground logging, it's pretty annoying"23:03
jeblairit's only 7% of our log output at the moment.  :)23:03
dmsimard|offI'll add a logging general topic to my 1.0 checklist sir :)23:04
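
One conventional way to squelch a chatty third-party logger like alembic's, as a sketch (not necessarily how ara later fixed it):

    import logging

    # Drop alembic's INFO-level migration messages ("Running upgrade
    # ...") while keeping its warnings and errors.
    logging.getLogger('alembic').setLevel(logging.WARNING)
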
jeblair2017-08-17 23:00:56,323 DEBUG zuul.AnsibleJob: [build: 5aaa32661bae496ba0e01006aa3d67cb] Ansible output: b'ERROR! A worker was found in a dead state'23:04
jeblairding!23:04
pabelangerwinner winner, chicken dinner23:06
dmsimard|offif I keep piling up those 1.0 ARA todo's I'm going to pull a Zuul v3 and ship it in 2 years at this rate :D23:06
* dmsimard|off hides23:06
jeblairSpamapS, mordred: still no core file23:08
clarkbjeblair: is pam resetting ulimits?23:08
clarkbit may do that if there are user switches happening23:08
clarkbyou should be able to check via /proc I think23:08
jeblairclarkb: i just checked an ansible-playbook process via proc and it has core set to unlimited.23:09
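
That /proc check can be scripted as well; a sketch reading the live limits of a running process (the pid is hypothetical):

    pid = 12345  # hypothetical ansible-playbook pid
    with open('/proc/%d/limits' % pid) as f:
        for line in f:
            # The row of interest looks like:
            # "Max core file size   unlimited   unlimited   bytes"
            if line.startswith('Max core file size'):
                print(line.rstrip())
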
SpamapSjeblair: work_dir is rw mounted right?23:09
jeblairSpamapS: yes, it's jobdir/work so should be writable in all cases23:10
jeblaircat /proc/sys/kernel/core_pattern23:10
jeblair|/usr/share/apport/apport %p %s %c %P23:10
* jeblair is not amused23:11
jeblairSpamapS: any chance you know if that has caused the core dumps to go to some place where i can recover them?23:13
mordredjeblair: maybe we shoudl uninstall apport23:14
jeblair/var/crash is empty23:14
clarkbianw may know since there was debugging of cores from apache I think23:14
jeblair echo -n "core" > /proc/sys/kernel/core_pattern23:15
jeblairi did that23:15
mordredApport is not enabled by default in stable releases, even if it is installed. The automatic crash interception component of apport is disabled by default in stable releases for a number of reasons:23:15
mordredI beg to differ with them23:15
jeblairnow looking for a segfault after 23:1523:16
jeblairanother global recheck23:16
mordredjeblair: echo "/tmp/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern23:17
jeblairmordred: does /tmp/cores need to be world-writable?23:18
jeblairif i were to do that?23:18
mordredjeblair: unsure - reading more23:19
mordredjeblair: so - actually your thing seems better23:19
mordredjeblair: "core" is the default and will put a core file in the current dir23:20
jeblairk.  i *think* we're set up for that, but if we have problems, we can try with tmp23:20
mordredcool23:20
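
A small sketch of inspecting the sysctl they were adjusting; a pattern starting with '|' means a helper such as apport intercepts the dump instead of writing a core file:

    with open('/proc/sys/kernel/core_pattern') as f:
        pattern = f.read().strip()

    if pattern.startswith('|'):
        # The kernel pipes crashes to a helper (e.g. apport), so no
        # core file lands in the process's working directory.
        print('cores piped to handler:', pattern)
    else:
        print('core files written (relative to cwd) as:', pattern)
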
jeblair-rw------- 1 zuul zuul 84430848 Aug 17 23:27 /var/lib/zuul/builds/33ff13fb20cf44dea20078f1ea61eac5/work/core23:29
jeblairfinally!23:29
SpamapSjeblair: woot23:30
jeblair#0  0x000000000050ee24 in visit_decref () at ../Modules/gcmodule.c:37323:31
jeblairthat function is the same in 3.5.2 and tip of master23:37
clarkbI want to say that may be where the old bug was23:38
jeblairianw: ^ is any of this looking familiar?23:38
clarkbthey do list traversals and decrement references. Problem is it's a circular list so if you don't clean pointers up properly on last pass you can hit old freed memory on the next pass and attempt to free already freed memory23:39
jeblairclarkb: was there a fix for that?23:39
clarkbyes, the fix came from 3.5 iirc23:39
clarkbI'm finding links now23:40
clarkbhttps://bugs.launchpad.net/ubuntu/+source/python3.4/+bug/1367907 is what I filed in ubuntu to fix trusty, http://bugs.python.org/issue21435 is bug upstream in python23:41
openstackLaunchpad bug 1367907 in python3.4 (Ubuntu Trusty) "Segfault in gc with cyclic trash" [Undecided,In progress]23:41
SpamapSUbuntu is usually _really_ close to mainline releases on python.23:41
clarkbbt is different23:41
clarkbso this is probably a new bug23:41
ianwjeblair: apart from the unfortunate familiarity of spending 97% of the time just trying to get a coredump, no sorry :/23:41
jeblairhttps://etherpad.openstack.org/p/uHxJU6CrXW23:42
ianwcan you do a bt full or is it crazy?23:47
clarkbhttp://bugs.python.org/issue3118123:48
clarkbinteresting that they seem to think it is related to yaml23:48
jeblairclarkb: and that's py2723:48
clarkbhttp://bugs.python.org/issue27945 seems like a more robust reporting of what may be this issue and last bug is dup of it23:48
clarkbjeblair: ya, ^ is all the versions though and I think is the same bug23:49
clarkbthere are patches too23:49
jeblairianw: will check in a sec23:49
clarkbis it just me or does gerrit make it a million times easier to understand what the actual fixes are23:50
clarkbhttps://github.com/python/cpython/commit/2f7f533cf6fb57fcedcbc7bd454ac59fbaf2c65523:50
jeblairianw: http://paste.ubuntu.com/25335706/23:53
clarkbit's not too difficult to build your own python particularly on debuntu23:54
clarkbyou could likely grab that patch and see if it fixes23:54
clarkb(on suse it's a major pita because everything needs patching by distro)23:54
mordredjeblair, clarkb: since it mentions yaml - i wonder if there is any difference between cyaml and non-cyaml23:59
clarkbmordred: looking at the second bug it seems to be related to a large rewrite of the dict type23:59
mordredah23:59
