SpamapS | mordred: It's proving pretty gross in python too, so I'm glad I flipped | 00:04 |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul-jobs master: Add configure-logserver role https://review.openstack.org/489113 | 02:31 |
*** openstack has quit IRC | 04:42 | |
*** openstack has joined #zuul | 04:47 | |
*** eventingmonkey has joined #zuul | 04:47 | |
*** leifmadsen has quit IRC | 04:47 | |
*** leifmadsen has joined #zuul | 04:47 | |
*** rfolco has joined #zuul | 04:47 | |
*** jlk has joined #zuul | 04:49 | |
*** ianw has joined #zuul | 04:49 | |
*** jhesketh has joined #zuul | 04:49 | |
*** jlk has quit IRC | 04:49 | |
*** jlk has joined #zuul | 04:49 | |
*** tobiash has joined #zuul | 04:50 | |
*** jesusaurum has joined #zuul | 04:51 | |
*** maxamillion has joined #zuul | 04:52 | |
*** rcarrill1 has joined #zuul | 04:53 | |
*** cinerama has joined #zuul | 04:54 | |
*** mnaser has joined #zuul | 04:55 | |
*** tristanC has joined #zuul | 05:02 | |
*** isaacb has joined #zuul | 05:18 | |
*** dmsimard|off has quit IRC | 05:48 | |
*** dmsimard has joined #zuul | 05:49 | |
*** dmsimard is now known as dmsimard|off | 05:49 | |
*** isaacb has quit IRC | 06:17 | |
*** isaacb has joined #zuul | 06:28 | |
*** SotK has joined #zuul | 07:43 | |
*** smyers has quit IRC | 09:05 | |
*** smyers has joined #zuul | 09:16 | |
*** electrofelix has joined #zuul | 09:53 | |
*** smyers has quit IRC | 10:12 | |
*** robled has quit IRC | 10:12 | |
*** robled has joined #zuul | 10:17 | |
*** robled has quit IRC | 10:17 | |
*** robled has joined #zuul | 10:17 | |
*** smyers has joined #zuul | 10:21 | |
*** jkilpatr has quit IRC | 10:44 | |
*** pbelamge has joined #zuul | 10:56 | |
pbelamge | hello all | 10:56 |
pbelamge | I am getting this error in gate pipeline: | 10:57 |
pbelamge | <QueueItem 0x7fe63c495410 for <Change 0x7fe63c50be10 44,1> in gate> is a failing item because ['it did not merge'] | 10:57 |
pbelamge | anyone faced this kind of error before? | 10:57 |
pbelamge | log entries before this line: | 10:57 |
pbelamge | https://thepasteb.in/p/lOhONzjJ3JZCB | 10:58 |
SpamapS | 2017-07-30 02:00:12,386 DEBUG zuul.GerritSource: Change <Change 0x7fe63c50be10 44,1> did not appear in the git repo | 11:00 |
SpamapS | pbelamge: perhaps you're using a mirror for merging? | 11:00 |
pbelamge | https://thepasteb.in/p/AnhrA18DMNzHv | 11:01 |
pbelamge | that is the git directory mentioned in the zuul.conf and the zuul_url | 11:02 |
pbelamge | let me know if you need anything to see if I am missing anything? | 11:08 |
pbelamge | in layout.yaml I used jenkins as user name and in zuul.conf as well | 11:21 |
pbelamge | I am sending Workflow (1) and Code-Review (2) from gerrit | 11:21 |
pbelamge | added jenkins in non-interactive users group in gerrit | 11:21 |
pbelamge | so, Verified (2) and Submit is automatically executed | 11:22 |
SpamapS | So zuul got the event. But it appears it couldn't find it later. | 11:22 |
pbelamge | right | 11:22 |
SpamapS | rather, it appears Zuul couldn't find the change. | 11:22 |
pbelamge | what could be wrong? | 11:23 |
SpamapS | not entirely sure | 11:23 |
*** jkilpatr has joined #zuul | 11:23 | |
SpamapS | I'm not super experienced debugging zuul problems. However, have you checked the gerrit logs to see what zuul tried to fetch? | 11:23 |
pbelamge | ok, let me take a look | 11:24 |
pbelamge | log from sshd_log | 11:28 |
pbelamge | https://thepasteb.in/p/LghNnY3jRGOsZ | 11:32 |
pbelamge | log from httpd_log | 11:32 |
pbelamge | https://thepasteb.in/p/nZhlN6WJzM3UY | 11:33 |
*** jkilpatr has quit IRC | 11:37 | |
*** jkilpatr has joined #zuul | 11:37 | |
SpamapS | pbelamge: some more experienced zuul debuggers will be online in the next 4-5 hours ... I'm not even supposed to be awake yet. ;) | 12:05 |
* SpamapS decides to try and get another hour of sleep after reporting and fixing https://github.com/ansible/ansible/issues/28325 in ansible :-P | 12:14 | |
*** pleia2 has joined #zuul | 12:42 | |
*** isaacb has quit IRC | 12:53 | |
*** gothicmindfood has joined #zuul | 13:06 | |
*** pbelamge has quit IRC | 13:25 | |
*** amoralej is now known as amoralej|lunch | 13:40 | |
*** dkranz has joined #zuul | 14:04 | |
*** amoralej|lunch is now known as amoralej | 14:25 | |
*** isaacb has joined #zuul | 14:37 | |
*** isaacb has quit IRC | 14:43 | |
*** jeblair has joined #zuul | 14:56 | |
Shrews | Does anyone need any help with the current set of "must-be-done" tasks? | 15:17 |
SpamapS | mordred: hah, thanks for the +1 on that PR. :) | 15:35 |
SpamapS | mordred: the workaround is to just add all 3 known types. :-P | 15:36 |
*** openstackgerrit has joined #zuul | 15:55 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Use cached branch list in dynamic config https://review.openstack.org/494618 | 15:55 |
* SpamapS notes that jamielennox got BonnyCI ALMOST to v3 last night | 15:59 | |
SpamapS | just ran into https://review.openstack.org/#/c/493059/ which we're fixing now | 16:00 |
SpamapS | (post-review instead of allow-secrets) | 16:00 |
jeblair | SpamapS, jamielennox: \o/ | 16:01 |
pabelanger | nice | 16:02 |
Shrews | SpamapS: for some reason, i thought you were already on v3 | 16:12 |
Shrews | but yay! | 16:12 |
SpamapS | I know right? | 16:15 |
SpamapS | hrm... status webapp seems stuck | 16:17 |
SpamapS | Job base not defined | 16:20 |
SpamapS | derp | 16:20 |
SpamapS | I think just needs a parent: null | 16:22 |
jeblair | SpamapS: ah yep that's another new thing | 16:22 |
SpamapS | luckily I've been paying attention :) | 16:22 |
jeblair | pabelanger: what's the status of the mirror name fix? | 16:23 |
jeblair | i just had a job bomb because it ran on inap with the wrong mirror name | 16:23 |
SpamapS | oh yay, as soon as I pushed it.. Zuul saw it and sprang to life. | 16:25 |
SpamapS | or not | 16:26 |
SpamapS | hm | 16:26 |
Shrews | does it have to be "parent: base" or is "parent: null" also acceptable? | 16:29 |
SpamapS | Guessing I'm hitting an unexpected url here....http://paste.openstack.org/show/618704/ | 16:29 |
SpamapS | Shrews: the ultimate base needs to be parent: null | 16:29 |
jeblair | Shrews: to define a base job, "parent: null". we no longer need to add "parent: base" to every other job as that's now the default. | 16:29 |
Shrews | SpamapS: ah, right. makes sense | 16:29 |
SpamapS | or yeah, all ultimate base jobs I should say, there's not just one.. | 16:30 |
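For readers following along, the base-job convention being described here can be sketched in Zuul v3 job configuration. This is an illustrative fragment, not taken from any repo in the discussion; the job names are assumptions:

```yaml
# An "ultimate base" job explicitly declares it has no parent.
- job:
    name: base
    parent: null
    description: Root job; everything else inherits from this.

# Other jobs no longer need "parent: base" spelled out,
# since base is now the implicit default parent.
- job:
    name: tox-pep8
    description: Inherits from base automatically.
```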
jeblair | SpamapS: yeah, maybe that's hitting the status url with the wrong (or no?) tenant name (which should be the first path component of the url) | 16:30 |
SpamapS | https://zuul.opentech.bonnyci.org/BonnyCI/status.json <-- wee | 16:31 |
SpamapS | jeblair: it's hitting with _only_ the tenant name. | 16:31 |
SpamapS | should not really be a server side ERROR log IMO | 16:31 |
jeblair | SpamapS: i agree | 16:32 |
SpamapS | maybe our log config is off | 16:39 |
SpamapS | because it looks like it is being logged using .exception() | 16:39 |
SpamapS | which I'd only expect to see in debug logs | 16:39 |
SpamapS | level=WARNING ... so.. hrm | 16:40 |
SpamapS | OH | 16:41 |
SpamapS | .exception() does ERROR? that seems.. hrm. | 16:41 |
pabelanger | jeblair: I think we need to restart executors | 16:41 |
pabelanger | looking now | 16:42 |
jeblair | SpamapS: yes -- i think that's appropriate. generally if there's an uncaught exception i want to know about it. :) i think we just need to make sure there's a tenant validity check before that point. | 16:43 |
jeblair | SpamapS: iow -- any *other* error formatting the status json is worth noting :) | 16:43 |
SpamapS | yeah, just some lazy coding in there expecting things | 16:43 |
SpamapS | I'm really just trying to see how to make webapp return 404 | 16:44 |
SpamapS | ahh HTTPNotFound k | 16:44 |
pabelanger | jeblair: ya, looks like we need to restart executors. I can do that once I get some food | 16:44 |
jeblair | pabelanger: thx | 16:45 |
pabelanger | Shrews: here is an interesting nodepool failure: http://logs.openstack.org/30/493330/2/gate/gate-dsvm-nodepool-ubuntu-src/993f8ce/logs/screen-nodepool-builder.txt.gz#_Aug_17_16_02_33_543757 | 16:48 |
pabelanger | we lost connection with zookeeper for some reason | 16:48 |
SpamapS | That should be something that resolves itself. | 16:49 |
SpamapS | One should expect to lose one's zk from time to time. | 16:49 |
Shrews | pabelanger: but seems like the builder recovered correctly, yeah? | 16:49 |
Shrews | yeah, can't do anything about zk disappearing when we only have one node in the cluster. as long as it recovers correctly... | 16:50 |
Shrews | but in a dsvm job, you really don't expect zk to just go away. weird that it did | 16:51 |
Shrews | looks like it happened during the upload, that failed, then the upload worker suspended itself correctly. | 16:54 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul feature/zuulv3: Return 404 on unknown tenants https://review.openstack.org/494642 | 16:54 |
pabelanger | Shrews: ya, likely need to recover more gracefully for d-g hook: http://logs.openstack.org/30/493330/2/gate/gate-dsvm-nodepool-ubuntu-src/993f8ce/console.html | 16:54 |
pabelanger | lets see if it happens more often | 16:55 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Use cached branch list in dynamic config https://review.openstack.org/494618 | 17:38 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Allow multiple semaphore definitions within a project https://review.openstack.org/494650 | 17:38 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Reload configuration when branches are created or deleted https://review.openstack.org/494651 | 17:38 |
pabelanger | jeblair: just ze01.o.o or are we launching jobs across other executors now for restarting | 17:48 |
jeblair | pabelanger: all 4 now | 17:48 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: WIP Replace paste/webob with aiohttp https://review.openstack.org/494655 | 17:50 |
mordred | SpamapS: saw you poking at webapp things - I wrote that ^^ a couple of months ago as an exploration - figured I'd stick it up in case anyone decides they feel like poking further at webapp/zuul-web, in case it's helpful | 17:51 |
pabelanger | jeblair: thanks | 17:51 |
pabelanger | Hmm, looks like we leak /var/run/zuul-executor/zuul-executor.pid on stop. Will find out why | 17:53 |
pabelanger | k, all have restarted. Now to fix puppet-zuul | 17:55 |
jeblair | pabelanger: because zuul can't delete it because it's written as root but zuul drops privileges | 17:57 |
jeblair | mordred: this error is new to me: http://logs.openstack.org/50/494650/1/check/tox-pep8/50bd841/job-output.txt.gz#_2017-08-17_17_43_53_019040 | 17:58 |
*** jkilpatr has quit IRC | 18:02 | |
jeblair | and again: http://logs.openstack.org/51/494651/1/check/tox-pep8/c1a2cb1/job-output.txt.gz#_2017-08-17_17_56_46_888201 | 18:02 |
jeblair | i wonder if those have something to do with the executors restarting | 18:03 |
*** jkilpatr has joined #zuul | 18:03 | |
jeblair | pabelanger, mordred: what's the status on tarball/publish jobs? | 18:04 |
pabelanger | let me pull up etherpad | 18:05 |
pabelanger | both ssh / testpypi credentials are added to project-config: https://review.openstack.org/494276/ was to start testing testpypi creds on executor and if pip install was working | 18:06 |
pabelanger | I've added the publish-openstack-python-branch-tarball to the post pipeline for zuul, and confirmed it uploaded to zuulv3-dev.o.o | 18:07 |
pabelanger | mordred: still has (pre-)python-tarball playbooks to create, but happy to take over if needed | 18:08 |
jeblair | pabelanger: +3 494276 | 18:08 |
pabelanger | twine role is started, and ready to push up pending above | 18:08 |
pabelanger | so, I can take back the pre-python-tarball and python-tarball playbooks now to keep iterating forward | 18:09 |
jeblair | mordred: ^ are you on that or do you want to hand it off? | 18:10 |
jeblair | also, other folks were asking about what they could do to help. | 18:10 |
pabelanger | most of the playbook content is written, we just need to reorg it | 18:10 |
jeblair | pabelanger: when 494276 lands you should be able to move forward on testing pypi uploads at least, right? | 18:10 |
pabelanger | jeblair: yes, I'll push on that | 18:11 |
mordred | jeblair, pabelanger: I'm fine with pabelanger taking those if he's in the groove on it | 18:11 |
pabelanger | k | 18:11 |
mordred | I wanted to see if I could get the secret name thing in real quick - cause I think we can use it to good effect on both tarballs and logs | 18:12 |
pabelanger | which one is that? | 18:12 |
mordred | pabelanger: oh - I also, speaking of - I think we can collapse the logs role/job to at least use the add-fileserver role like you mentioned yesterday | 18:12 |
mordred | pabelanger: 494650 - I need to fix a unittest real quick (now that I'm done with this morning's project-config fun :) ) | 18:13 |
pabelanger | mordred: ya, I did that already with https://review.openstack.org/494314/ for base-test logs | 18:13 |
mordred | jeblair: OH MY - that's a fantastic error | 18:13 |
mordred | pabelanger: neat! | 18:13 |
pabelanger | I think you have the other patch in merge conflict | 18:13 |
jeblair | mordred: yeah, i'm kinda inclined to see if it shows up again not in proximity to an executor restart before we spend too much time on it | 18:14 |
mordred | jeblair: kk | 18:14 |
jeblair | mordred: (even though i would not expect an executor restart to manifest like that; just trying not to rabbithole) | 18:14 |
mordred | "dictionary changed size during iteration" <-- seems extra weird - but yah | 18:14 |
pabelanger | I see that a lot when doing python3 convert in nodepool | 18:15 |
mordred | yah - I'm just not sure why we'd only see it near an executor restart - so I'm hoping it's just a heisenbug | 18:15 |
jeblair | mordred: let me know when you have the secrets name thing fixed (or if you need another set of eyes on it). i'm working on the branch cache (should be done, just waiting on stable test results) and will work on the tmpfs-for-secrets idea next (probably after lunch). | 18:16 |
mordred | jeblair: cool | 18:16 |
mordred | jeblair: and will do - also, when you get a spare second, https://review.openstack.org/#/c/494260/ will let us do https://review.openstack.org/#/c/494281/ | 18:17 |
mordred | (and the depends have landed, etc) | 18:17 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Allow requesting secrets by a different name https://review.openstack.org/494343 | 18:19 |
mordred | nobody should let me code | 18:19 |
jeblair | mordred: hire 4 devs and you have a deal. otherwise, no dice. | 18:21 |
mordred | it's literally the same thing I fixed between patch 1 and patch 2 - just one line down | 18:23 |
jeblair | mordred: ara: +2 on the first, -1 on the second (one of us is confused) | 18:25 |
*** electrofelix has quit IRC | 18:30 | |
mordred | jeblair: I just got another ANSIBLE PARSE ERROR | 18:30 |
jeblair | okay so that's a thing :( | 18:31 |
jeblair | mordred: i wonder if our ansible version is different on the new executors | 18:31 |
jeblair | mordred: left some comments on secrets change | 18:33 |
mordred | jeblair: well - I got it on ze01 - but I have now officially switched to trying to figure out what's up | 18:34 |
jeblair | hrm, ansible 2.3.2.0 on ze01 and ze04. | 18:34 |
jeblair | mordred: ok cool. i have to afk now, back after lunch. | 18:34 |
mordred | yah - and both with the same python version | 18:34 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul feature/zuulv3: Allow requesting secrets by a different name https://review.openstack.org/494343 | 18:39 |
Shrews | for key in result_dict.keys(): | 18:43 |
Shrews | if key != 'changed': | 18:43 |
Shrews | result_dict.pop(key) | 18:43 |
Shrews | mordred: that seems suspicious | 18:44 |
Shrews | the "if '_zuul_nolog_return'" portion of v2_runner_on_ok() | 18:44 |
mordred | Shrews: yes - I agree- that looks very bad | 18:46 |
mordred | I _think_ there are two issues here ... one is an issue in the callback plugin - the second is that there was a problem connecting | 18:47 |
Shrews | mordred: interestingly, i can reproduce that 'dict changed size' error on that exact code, but so far only in py3. we aren't running ansible under py3, are we? | 18:49 |
mordred | Shrews: the local code - the callback plugin - does run under python3 | 18:51 |
Shrews | that doesn't seem safe, since ansible isn't really py3 ready | 18:52 |
mordred | the core folks said core is py3 ready - it's modules that are the problem (we chatted with them about it) | 18:52 |
*** jesusaurum is now known as jesusaur | 18:52 | |
Shrews | ah yes, i believe i have seen a similar statement | 18:52 |
mordred | Shrews: so running ansible-playbook with python3 but setting ansible_python_interpreter so that it uses python2 on the remote nodes | 18:53 |
mordred | and toshio said if we encounter any py3 issues in core they'd consider them urgent/bad bugs | 18:53 |
Shrews | mordred: want me to put up a fix for that? | 18:56 |
mordred | Shrews: sure! thanks | 18:56 |
openstackgerrit | David Shrewsbury proposed openstack-infra/zuul feature/zuulv3: Fix zuul_stream callback dict modification https://review.openstack.org/494679 | 18:58 |
Shrews | mordred: jeblair: ^^^ | 18:58 |
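Shrews' reproduction above is plain Python 3 behavior: `dict.keys()` returns a live view, and popping entries while iterating over it raises RuntimeError ("dictionary changed size during iteration", the exact error seen in the job logs). Here is a minimal sketch of the bug and the conventional fix, snapshotting the keys first; the function names are illustrative, not the actual zuul_stream code:

```python
def strip_result_buggy(result_dict):
    # Mutating the dict while iterating its live keys view
    # raises RuntimeError under Python 3.
    for key in result_dict.keys():
        if key != 'changed':
            result_dict.pop(key)


def strip_result_fixed(result_dict):
    # Copy the keys into a list first, then mutate freely.
    for key in list(result_dict.keys()):
        if key != 'changed':
            result_dict.pop(key)


d = {'changed': True, 'stdout': 'ok', 'rc': 0}
strip_result_fixed(d)
print(d)  # {'changed': True}

try:
    strip_result_buggy({'changed': True, 'stdout': 'ok', 'rc': 0})
except RuntimeError as e:
    print('caught:', e)  # caught: dictionary changed size during iteration
```

Note this is why the failure only appeared under Python 3: the Python 2 `dict.keys()` returned a fresh list, so the same loop was safe there.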
mordred | now - why that playbook couldn't connect is a whole different question to track down - especially considering we're using the persistent connection stuff | 18:59 |
mordred | SpamapS: we did wind up using persistent connections yeah? | 19:00 |
*** amoralej is now known as amoralej|off | 19:01 | |
SpamapS | mordred: fer what? sorry, I lost context. | 19:01 |
mordred | SpamapS: for ansible to remote nodes | 19:01 |
SpamapS | Oh yes we're still persistent. | 19:01 |
SpamapS | But we don't persist across runs. | 19:01 |
SpamapS | The bubblewrap dies and kills the ssh with it. | 19:02 |
mordred | hrm | 19:02 |
SpamapS | Not doing that proved pretty squirrelly (Technical Term) | 19:02 |
SpamapS | I think it would be doable, but we had to get down into more ansible guts to do it right IIRC. | 19:03 |
pabelanger | mordred: jeblair: good news, pip install worked in bwrap on executor | 19:07 |
pabelanger | jeblair: I think we also need to restart zuulv3.o.o for nodepool.cloud variable. I can see nodepool adding it to zookeeper, but still not showing up in our inventory files. | 19:10 |
pabelanger | I only reset our executors | 19:10 |
mordred | pabelanger: pip install --user yeah? | 19:12 |
pabelanger | mordred: yes | 19:13 |
pabelanger | mordred: it is missing our pip.conf settings, so we'll have to iterate on that too | 19:14 |
mordred | pabelanger: good point | 19:14 |
mordred | pabelanger: we'll need to be careful with that - as executors are in dfw, so we can't use any nodepool variables that we normally would for setting mirrors | 19:18 |
Shrews | hrm, i wonder if https://review.openstack.org/494679 will need to be force merged | 19:20 |
*** isaacb has joined #zuul | 19:21 | |
pabelanger | mordred: ++ | 19:23 |
mordred | pabelanger: we might need to have a special role or version of configure-mirror for setting up only ~/.pydistutils and ~/.pip.conf? | 19:24 |
mordred | pabelanger: oh - I've got an idea ... | 19:25 |
pabelanger | I'm going to restart zuulv3.o.o to pick up new inventory variables for nodepool.cloud | 19:25 |
mordred | kk | 19:25 |
pabelanger | mordred: also waiting for your idea | 19:25 |
pabelanger | zuulv3.o.o back | 19:27 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul-jobs master: Allow configure-mirrors to do only homedir https://review.openstack.org/494684 | 19:30 |
mordred | pabelanger: how about that ^^ ? | 19:30 |
pabelanger | ya, that might work | 19:31 |
mordred | pabelanger: then in the publish playbook you're working on, since that's specific to openstack anyway, you can just do role: configure-mirrors: mirror_fqdn: mirror.dfw.rax.openstack.org only_homedir: true | 19:31 |
mordred | s/role: /roles: - / | 19:31 |
mordred | Shrews: I'm curious as to why this issue with the dict just started all of a sudden | 19:33 |
Shrews | yah, me too | 19:33 |
jeblair | Shrews: thanks! +3 | 19:33 |
mordred | also - that's in if '_zuul_nolog_return' in result_dict: which only occurs in like one place | 19:33 |
Shrews | like i said, i don't think it will merge w/o magic | 19:34 |
jeblair | Shrews: oh why not? | 19:34 |
Shrews | b/c zuulv3 has the buggy version, causing the failure, yeah? | 19:34 |
jeblair | Shrews: i thought it was sporadic | 19:34 |
Shrews | jeblair: my patch failed w/ the same error | 19:35 |
Shrews | so... less sporadic now? dunno | 19:35 |
mordred | jeblair: turning on -vvv to the logs requires a restart right? or do we have a command for that? | 19:35 |
jeblair | mordred: zuul-executor verbose | 19:35 |
mordred | I'm going to turn that on | 19:36 |
jeblair | kk | 19:36 |
jeblair | mordred: we have 4 executors running now | 19:36 |
jeblair | mordred: if you want to scale that back to 1, i think that'd be fine | 19:36 |
mordred | jeblair: I actually haven't seen the error anywhere OTHER than ze01 fwiw | 19:37 |
jeblair | mordred: curiouser and curiouser | 19:37 |
mordred | yah .... | 19:37 |
Shrews | and failed yet again :( | 19:37 |
Shrews | weird it's just showing up | 19:37 |
jeblair | Shrews: okay i'll push it through | 19:37 |
mordred | Aug 17 19:28:12 ze01 kernel: [6224688.842500] traps: ansible-playboo[28587] general protection ip:50ee24 sp:7ffe1c61e848 error:0 in python3.5[400000+3a8000] | 19:38 |
jeblair | mordred: if you're ready for that -- unless you want..... | 19:38 |
mordred | we're getting segfaults | 19:38 |
mordred | from ansible-playbook | 19:38 |
jeblair | i um | 19:38 |
mordred | that's in /var/log/syslog | 19:38 |
SpamapS | woof | 19:38 |
Shrews | neat | 19:38 |
mordred | which have not been in the syslog until just recently | 19:39 |
mordred | that, in fact, was the first occurrence | 19:39 |
jeblair | mordred: so not 1:1 with these errors? | 19:40 |
Shrews | 2.4 wasn't released was it? | 19:40 |
jeblair | Shrews: we're on 2.3.2.0 | 19:41 |
jeblair | on all 4 executors | 19:41 |
Shrews | k. i knew 2.4 was very close | 19:41 |
mordred | jeblair: no - there's only a few of them | 19:41 |
jeblair | mordred: i think the segfaults have been around a while | 19:48 |
mordred | awesome | 19:48 |
jeblair | mordred: they're in all our syslogs, back to aug 11 | 19:48 |
mordred | maybe that happens when we restart an executor with a playbook running? | 19:48 |
jeblair | mordred: maybe? | 19:48 |
mordred | thing to keep our eyes on at least | 19:48 |
jeblair | i'm at least inclined to separate it from the callback issue at this point | 19:48 |
mordred | ++ | 19:49 |
mordred | agree | 19:49 |
jeblair | so we've got callback first reported (could be older -- this is manual) around 17:58 | 19:49 |
mordred | I'm rechecking the secret patch while verbose is on to see if I can get a traceback in the logs | 19:49 |
jeblair | mordred: i have a kernel of an idea but it's not fleshed out - | 19:52 |
mordred | ok | 19:53 |
jeblair | mordred: our jobdir paths just got longer, and some of the paths in an error i just checked are 120 bytes long; the shebang limit is 128. i don't know how that could come in to play. | 19:53 |
jeblair | maybe the ansible module interpolation stuff? i dunno. just brainstorming. | 19:54 |
mordred | oh. well- that could be the cause of the remote error - it's possible the issue with callback warning has been happening for a while and we just didn't notice | 19:54 |
mordred | sorry - don't know if you caught that when you came back - there are two issues - the job isn't failing because of the callback error | 19:54 |
mordred | jeblair: http://logs.openstack.org/43/494343/4/check/tox-pep8/4b474ca/job-output.txt.gz#_2017-08-17_18_43_45_460997 | 19:55 |
jeblair | mordred: ah. so force-merging Shrews' change won't help. | 19:55 |
mordred | jeblair: is a real issue | 19:55 |
jeblair | right, but not one which causes ansible to fail? | 19:55 |
mordred | no - I mean that link is the link to the real issue | 19:55 |
jeblair | mordred: yes i know. i see both issues. i'm just trying to understand which is fatal. | 19:56 |
jeblair | callback, unreachable, or both? | 19:56 |
mordred | the unreachable one | 19:56 |
Shrews | might this be a good time to try autohold? | 19:56 |
mordred | callback errors are never fatal - they just cause lack of further output to log for that task | 19:56 |
jeblair | mordred: run 'zuul-executor keep' as well please so we can inspect the jobdir contents | 19:57 |
jeblair | Shrews: maybe so | 19:58 |
mordred | jeblair: done | 19:58 |
jeblair | it's also worth entertaining the idea that, if this only happens on infracloud nodes, that the network is legitimately in a worse-than-normal state. | 19:58 |
mordred | jeblair: the error is happening consistently right after Install /etc/pip.conf | 19:59 |
jeblair | infracloud had its mirror hosts replaced this morning. i also can't tie that to the problem, but it's worth noting. | 19:59 |
jeblair | (aside from the fact that the network there is incomprehensible and we have switches acting as hubs so it's anyone's guess) | 20:00 |
mordred | oh - sorry - it's happening consistently ON Install /etc/pip.conf | 20:00 |
pabelanger | Ya, new mirrors and slow networking | 20:00 |
jeblair | mordred: oh, so that's between tasks within one playbook; it's not on a playbook boundary. so that reduces (not eliminates) likelihood of it being a network problem since that *should* be cached. | 20:01 |
jeblair | (pydistutils.cfg task comes after pip.conf task) | 20:02 |
mordred | yah | 20:02 |
mordred | this makes me think back to your shebang path thing | 20:02 |
jeblair | mordred: get a hit with vvv and keep yet? | 20:02 |
mordred | /var/lib/zuul/builds/4b474ca8037b49c3afd351b9dee20611/.ansible/remote_tmp/ansible-tmp-1502995425.1263802-116590003515092 is what it reports as the remote_tmp | 20:02 |
mordred | jeblair: with vvv yes | 20:02 |
mordred | nothing new interesting in log | 20:03 |
jeblair | okay, rechecking | 20:03 |
jeblair | maybe with keep we can grep for long paths | 20:03 |
jeblair | ansible-tmp-1502995425.1263802-116590003515092 looks like it may be a variable length path component too | 20:03 |
mordred | jeblair: in log .. | 20:04 |
mordred | b'mkdir: cannot create directory \\xe2\\x80\\x98/var/lib/zuul\\xe2\\x80\\x99: Permission denied\\n')" | 20:04 |
mordred | mkdir -p \\"` echo /var/lib/zuul/builds/9ed3901c28ff4e72b4b6233cbb08fce2/.ansible/remote_tmp/ansible-tmp-1502999815.994324-89732834853800 `\\" | 20:05 |
mordred | is the command it reports associated with that I believe | 20:05 |
mordred | jeblair: http://paste.openstack.org/show/618720/ <-- relevant section in whole | 20:06 |
jeblair | \\xe2\\x80\\x98 is utf-8 for "Left single quotation mark" | 20:08 |
mordred | yah. '‘/var/lib/zuul’' is what I got from python decode | 20:08 |
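The decode mordred describes can be checked directly; `\xe2\x80\x98` and `\xe2\x80\x99` are the UTF-8 encodings of U+2018/U+2019, the curly quotation marks GNU coreutils puts around file names in error messages (a standalone snippet, not code from Zuul):

```python
# The raw bytes as they appear (escaped) in the executor log line above.
msg = b'mkdir: cannot create directory \xe2\x80\x98/var/lib/zuul\xe2\x80\x99: Permission denied\n'

# Decoding recovers the human-readable mkdir error with curly quotes.
print(msg.decode('utf-8').strip())
# mkdir: cannot create directory ‘/var/lib/zuul’: Permission denied
```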
*** jkilpatr has quit IRC | 20:09 | |
jeblair | /var/lib/zuul should pretty much either exist (yes) or not exist within bwrap. | 20:10 |
jeblair | er remote_tmp? | 20:11 |
jeblair | is this happening on the remote host? | 20:11 |
mordred | yah- I think that's it trying to make the remote_tmp dir - and yes, I think so? | 20:11 |
mordred | yes: b"<15.184.65.176> (1, b'', b'mkdir: cannot create directory \\xe2\\x80\\x98/var/lib/zuul\\xe2\\x80\\x99: Permission denied\\n')" | 20:11 |
jeblair | so that's changed recently too, from /tmp/.... to /var/lib/zuul/.... | 20:12 |
jeblair | because the remote_tmp is set to the same path as the local_tmp | 20:12 |
jeblair | i guess when we did that, we didn't expect the local tmp path to be anything different | 20:12 |
jeblair | why would this *sometimes* work? | 20:12 |
pabelanger | is it failing all the time now? | 20:13 |
mordred | yah. so maybe we have some hosts missing a writable /var/lib/zuul? have we changed /var/lib/zuul on base images any time recently? | 20:13 |
mordred | and yah - I think we should try the autohold feature - it might be good to look at a node and figure out what's up with it | 20:14 |
jeblair | pabelanger: i don't think so. | 20:14 |
jeblair | mordred: you want to do that? 'zuul autohold' on zuulv3.o.o | 20:14 |
mordred | kk | 20:14 |
jeblair | mordred: maybe let's add it for tox-pep8 tox-py35 tox-docs and tox-cover | 20:15 |
mordred | the interface says job is optional ... | 20:15 |
mordred | can I just add it for openstack-infra/zuul ? | 20:15 |
mordred | Shrews: ^^ ? | 20:15 |
jeblair | i'll work on a patch that changes remote_tmp | 20:16 |
mordred | and does project need to be git.openstack.org/openstack-infra/zuul ? | 20:16 |
mordred | ok. tenant job and reason are required | 20:16 |
mordred | Shrews: I would like to report a bug in the helptext :) | 20:16 |
Shrews | i think openstack-infra/zuul will be enough since it's unique | 20:17 |
pabelanger | jeblair: I am wondering if we had older version of ansible-playbook and when I restarted ze01 this morning, it started using ansible 2.3.2.0 | 20:17 |
mordred | neat | 20:17 |
mordred | configparser.NoSectionError: No section: 'gearman' | 20:17 |
mordred | oh. I should be not mordred | 20:18 |
mordred | I have submitted an autohold for tox-pep8 tox-py35 tox-docs and tox-cover | 20:18 |
mordred | I'm going to recheck | 20:19 |
Shrews | mordred: | 0000017891 | infracloud-vanilla | nova | ubuntu-xenial | fbc53ee2-2044-4710-b276-a0fb73cc623e | hold | 00:00:00:02 | unlocked | ubuntu-xenial-infracloud-vanilla-0000017891 | 15.184.65.200 | 15.184.65.200 | | 22 | nl01-6551-PoolWorker.infracloud-vanilla-main | openstack git.openstack.org/openstack-infra/zuul tox-pep8 | track down connectivity problem | | 20:20 |
Shrews | fyi | 20:20 |
pabelanger | 2017-08-16 20:03:48,173 DEBUG zuul.AnsibleJob: [build: 917b0d5c1778477eae27ca466ef70a66] Job root: /var/lib/zuul/builds/917b0d5c1778477eae27ca466ef70a66 was the first time we started using /var/lib/zuul/builds on ze01.o.o | 20:20 |
pabelanger | which resulted in http://logs.openstack.org/10/494310/1/check/tox-py35/917b0d5/job-output.txt.gz error | 20:21 |
mordred | Shrews: woot | 20:22 |
pabelanger | 0ce80c17f2004e098b2092e37204cdc2 was the last job to run using /tmp, and didn't have that issue | 20:22 |
pabelanger | mind you the job was aborted | 20:22 |
Shrews | | 0000017896 | inap-mtl01 | nova | ubuntu-xenial | 335daeb7-e5f5-4fe2-b28b-d6774d877701 | hold | 00:00:01:59 | unlocked | ubuntu-xenial-inap-mtl01-0000017896 | 198.72.124.67 | 198.72.124.67 | | 22 | nl01-6551-PoolWorker.inap-mtl01-main | openstack git.openstack.org/openstack-infra/zuul tox-docs | track down connectivity problem | | 20:23 |
Shrews | is the other one so far | 20:23 |
mordred | jeblair: there is, in fact, no /var/lib/zuul on the node | 20:23 |
pabelanger | Oh, is that on the remote side? | 20:24 |
pabelanger | Ya, that explains it | 20:24 |
mordred | yah | 20:24 |
pabelanger | we used /tmp | 20:24 |
mordred | jeblair: OH ! I've got the whole thing now | 20:24 |
mordred | jeblair: ze01 is the only executor using /var/lib/zuul | 20:24 |
mordred | the others are still using /tmp | 20:24 |
mordred | that's why it only fails sometimes | 20:24 |
pabelanger | Ya | 20:24 |
mordred | and always on ze01 | 20:24 |
mordred | PHEW | 20:25 |
mordred | well - we know the entire story now :) | 20:25 |
mordred | and it turns out the error is, in fact, that it can't create a directory | 20:25 |
pabelanger | ya | 20:25 |
pabelanger | /tmp is 777 | 20:25 |
pabelanger | we should likely have it use /home/zuul or ansible_ssh_user on the remote side | 20:25 |
jeblair | why are the other executors not using varlibzuul? | 20:25 |
pabelanger | I don't see puppet running on ze02 | 20:26 |
pabelanger | did we accept ssh keys on puppetmaster? | 20:26 |
jeblair | pabelanger: probably not | 20:27 |
jeblair | why isn't that part of launch-node? | 20:27 |
pabelanger | Ya, hostkeys aren't accepted on puppetmaster | 20:28 |
jeblair | pabelanger: would you mind fixing that please? | 20:28 |
pabelanger | jeblair: not sure, it has always been a manual process since I've created nodes | 20:28 |
pabelanger | sure | 20:28 |
pabelanger | accepted now | 20:30 |
*** jkilpatr has joined #zuul | 20:30 | |
pabelanger | I cannot remember why we changed remote_tmp in ansible.cfg | 20:31 |
jeblair | mordred, pabelanger, Shrews: i need help with this. this is the reason the local and remote directories have the same root: https://review.openstack.org/346880 | 20:31 |
pabelanger | was it something with async? | 20:31 |
jeblair | keep in mind, that's for zuulv2.5. i haven't worked out if it's applicable still. | 20:31 |
mordred | jeblair: my reading there is that it's mostly about async | 20:32 |
pabelanger | Ya, that is what I seem to remember too | 20:32 |
mordred | oh - also - we don't set keep_remote_files anymore | 20:33 |
mordred | # NB: when setting pipelining = True, keep_remote_files | 20:33 |
mordred | # must be False (the default). Otherwise it apparently | 20:33 |
mordred | # will override the pipelining option and effectively | 20:33 |
mordred | # disable it. | 20:33 |
mordred | so I think both reasons we did that are now gone | 20:33 |
jeblair | mordred, pabelanger: okay -- two options: 1) we can remote remote_tmp and revert to default behavior. or 2) i can create a new local tmpdir explicitly for setting remote_tmp (this will work both locally and remotely, and we can make sure it's cleaned up with jobdir). | 20:33 |
pabelanger | right, I seem to remember that also | 20:34 |
pabelanger | I am open to trying option 1 | 20:34 |
mordred | me too | 20:34 |
jeblair | option 1 should say '*remove* remote_tmp' but i bet you got that | 20:34 |
pabelanger | +1 | 20:35 |
mordred | our setting of this seems mostly to have been about working around an issue with two things we don't use anymore - so I vote for option 1 because it's less complexity | 20:35 |
mordred | and if there IS a reason we need to be explicit about it - I'd like to learn that reason and document it | 20:35 |
mordred | jeblair: that said - I'm not *opposed* to 2 and think that would also work fine - I just don't think we need it | 20:36 |
Shrews | mordred: don't forget to delete your held nodes when you're done with them | 20:36 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove remote_tmp setting https://review.openstack.org/494695 | 20:36 |
mordred | Shrews: what's my process for that? | 20:36 |
Shrews | mordred: nodepool delete ID .... i can do it for you since i'm already there | 20:37 |
jeblair | mordred, pabelanger, Shrews: ^ there's that. | 20:37 |
mordred | Shrews: cool. thanks | 20:37 |
mordred | jeblair: I'm going to turn off verbose and keep | 20:37 |
Shrews | mordred: safe to delete them now? | 20:37 |
jeblair | we can merge that by shutting down ze01, or force-merge it and restart all of them. | 20:37 |
mordred | Shrews: yah | 20:37 |
mordred | jeblair: I'm fine with either approach - we should restart all of them anyway though to pick up that change | 20:38 |
mordred | jeblair: so maybe shut down ze01, land that change, then restart all | 20:38 |
mordred | I'm on ze01 and can shut it down real quick | 20:39 |
pabelanger | mordred: jeblair: just thinking, with bwrap now, we might also be able to drop local_tmp too. Since each ansible-playbook is namespaced | 20:40 |
*** isaacb has quit IRC | 20:40 | |
pabelanger | also, I have +2'd 494695 | 20:40 |
jeblair | mordred: ++ | 20:40 |
jeblair | let's make sure we get that and shrews change in place for the restarts | 20:40 |
mordred | done | 20:40 |
mordred | the Shrews change landed already I believe yeah? | 20:41 |
jeblair | nope 494679 i just rechecked it | 20:41 |
jeblair | should go in now | 20:41 |
jeblair | and i self-approved my change as it has 2x+2 | 20:42 |
jeblair | (my change may still fail check if it hit ze01 in which case we may need to recheck) | 20:42 |
mordred | jeblair: k. we're gonna have to recheck your change - it managed to get three fails before ze01 went away | 20:42 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul feature/zuulv3: Remove remote_tmp setting https://review.openstack.org/494695 | 20:43 |
jeblair | that'll speed things up. commit msg mod | 20:43 |
mordred | https://review.openstack.org/#/c/494343/ is reviewable while we're waiting | 20:43 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add known_hosts generation for multiple nodes https://review.openstack.org/494700 | 20:46 |
jeblair | mordred: lgtm | 20:47 |
SpamapS | ^^ is a bit rough | 20:47 |
SpamapS | and I'm not sure we want that in the predominant base job :-P | 20:47 |
*** dkranz has quit IRC | 20:48 | |
mordred | SpamapS: dude | 20:51 |
SpamapS | can you imagine that in jinja? ;-) | 20:52 |
mordred | NO | 20:52 |
SpamapS | neither could I | 20:52 |
jeblair | now.... xpath... you could totally do that in xpath. | 20:53 |
SpamapS | you can build minecraft in xpath | 20:53 |
mordred | you can build xpath in minecraft | 20:53 |
clarkb | re ecdsa aren't you not supposed to use those at all? (Just thinking it's a lot of effort to go through to support that as it changes when you shouldn't even use them) | 20:56 |
clarkb | I apparently have ecdsa host keys on this machine though. Interesting | 20:57 |
clarkb | also we are assuming no dsa because we aren't going to talk to ancient things anymore? | 20:57 |
jeblair | it looks like all changes are now failing with this: http://logs.openstack.org/95/494695/2/check/tox-docs/efc3986/job-output.txt.gz#_2017-08-17_20_53_28_879680 | 21:02 |
jeblair | pabelanger: are we still missing a piece for the nodepool cloud mirror thing? | 21:02 |
pabelanger | jeblair: no, just waiting for puppet to apply the change to servers | 21:04 |
pabelanger | okay, so that is it | 21:04 |
pabelanger | ah | 21:04 |
pabelanger | nodepool.cloud is missing from that inventory | 21:04 |
pabelanger | why is that | 21:05 |
jeblair | pabelanger: could it be because the executors are running an old zuul? | 21:05 |
pabelanger | ya, that is possible. I didn't confirm /opt/zuul was latest version on zuul executors | 21:06 |
jeblair | okay, i'm going to fix this | 21:06 |
pabelanger | thanks | 21:06 |
pabelanger | http://logs.openstack.org/79/494679/1/check/tox-cover/a8082ad/inventory.yaml does have nodepool.cloud, which is ze01 | 21:08 |
jeblair | mordred: this one (A worker was found in a dead state) hit again: http://paste.openstack.org/show/618724/ | 21:17 |
jeblair | i'm convinced at this point that's going to kill us in production | 21:17 |
jeblair | https://github.com/ansible/ansible/blob/devel/lib/ansible/plugins/strategy/__init__.py#L584 | 21:21 |
jeblair | mordred, dmsimard|off: ara seems to be adding "Initializing new DB from scratch" to the start of our job output; is there a way to avoid that? | 21:24 |
jeblair | see top of http://logs.openstack.org/79/494679/1/check/tox-docs/4b6b007/job-output.txt.gz | 21:24 |
jeblair | 2017-08-17 21:14:22,654 DEBUG zuul.AnsibleJob: [build: 4b6b0071d22b4093aa915b76d9d713e6] Ansible output: b'ERROR! A worker was found in a dead state' | 21:26 |
jeblair | Aug 17 21:14:22 ze01 kernel: [6231058.789222] ansible-playboo[10382]: segfault at a9 ip 000000000050ee24 sp 00007fffb6c19a38 error 4 in python3.5[400000+3a8000] | 21:26 |
jeblair | mordred: ^ the "dead state" error is caused by segfault | 21:26 |
jeblair | note they check for sigsegv here: https://github.com/ansible/ansible/blob/devel/lib/ansible/executor/task_queue_manager.py#L333 | 21:28 |
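[Editor's note: the "worker was found in a dead state" message is Ansible noticing that a forked worker exited abnormally. A minimal, self-contained sketch (not Ansible's actual code) of how a parent process sees a segfaulted child on POSIX — `subprocess` reports death-by-signal as a negative return code:]

```python
# Sketch: detecting that a child process died from a segfault.
# The child deliberately sends itself SIGSEGV to simulate the crash.
import signal
import subprocess
import sys

proc = subprocess.Popen(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
)
proc.wait()

# A negative returncode means "killed by signal -returncode".
if proc.returncode < 0:
    sig = signal.Signals(-proc.returncode)
    print(f"worker died from signal {sig.name}")  # worker died from signal SIGSEGV
```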
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Remove remote_tmp setting https://review.openstack.org/494695 | 21:32 |
jeblair | SpamapS: i think we need to allow ansible-playbook to write a core file. it's running in bwrap in a python subprocess. any ideas how to do that? | 21:35 |
jeblair | (also, we need to consider where it will be written) | 21:36 |
jeblair | (well, technically it's an ansible worker process that's getting the segv and will coredump, but i assume getting this to work for ansible-playbook should work for the worker process too) | 21:37 |
SpamapS | jeblair: ulimit should work inside the process namespace. We can do that with a wrapper around ansible-playbook | 21:37 |
jeblair | SpamapS: so a little shell script which does "ulimit -c unlimited; ansible-playbook ...." then we call that from our popen? | 21:41 |
SpamapS | jeblair: exactly. | 21:42 |
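[Editor's note: the shell-wrapper idea ("ulimit -c unlimited; ansible-playbook ...") can also be done from Python directly, by raising the core soft limit in the child between fork() and exec() via `subprocess`'s `preexec_fn` hook. A hedged sketch — in place of ansible-playbook, the child here is a stand-in that just reports its own limit, to show the setting survives the exec:]

```python
# Sketch: raise RLIMIT_CORE for a subprocess without a shell wrapper.
import resource
import subprocess
import sys

def _allow_core_dumps():
    # Runs in the child after fork() and before exec(); the raised limit
    # is inherited by whatever the child execs (and its own children).
    # Only the soft limit is raised; raising the hard limit needs privileges.
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

# Stand-in for "ansible-playbook ...": a child that prints its soft limit.
out = subprocess.check_output(
    [sys.executable, "-c",
     "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])"],
    preexec_fn=_allow_core_dumps,
)
print(out.decode().strip())
```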
jeblair | cool, i'll hack that up real quick... | 21:42 |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Fix zuul_stream callback dict modification https://review.openstack.org/494679 | 21:42 |
SpamapS | jeblair: I presume you already tried just letting the executor write cores? | 21:42 |
SpamapS | I'm not 100% sure but I thought when you make a new process namespace it takes the parent's settings. | 21:42 |
jeblair | SpamapS: i have not.... i... perhaps erroneously... thought it wouldn't make it all the way through that. | 21:42 |
jeblair | that would be much easier so maybe we should try that first. | 21:43 |
SpamapS | It may not | 21:43 |
SpamapS | you can test that with zuul-bwrap | 21:43 |
jeblair | good point | 21:43 |
jeblair | SpamapS: that works | 21:44 |
jeblair | i'll start with the init script then | 21:44 |
jeblair | now where will the core be written i wonder? | 21:45 |
SpamapS | pwd | 21:45 |
SpamapS | so if that's not writable, that's a problem | 21:46 |
SpamapS | I believe we chdir to / of the bwrap | 21:46 |
jeblair | well, it's more that it's in the jobdir so it will be deleted | 21:46 |
jeblair | so i guess we're going to need to run with 'keep' enabled for a while to catch this | 21:46 |
SpamapS | ah no, no chdir | 21:46 |
jeblair | the popen has cwd=self.jobdir.work_root | 21:47 |
EmilienM | jeblair: fyi, dmsimard is on pto and back next week IIRC | 21:47 |
jeblair | so it'll be in jobdir/work i think | 21:47 |
SpamapS | we chdir to {workdir} so yay | 21:47 |
jeblair | EmilienM: thanks | 21:47 |
SpamapS | jeblair: yeah and the bwrap also does --chdir {workdir} | 21:48 |
SpamapS | is the kernel the same version on all nodes btw? | 21:54 |
SpamapS | just wondering | 21:54 |
jeblair | SpamapS: ze01 is older 4.4.0-79-generic vs 4.4.0-92-generic | 21:56 |
jeblair | SpamapS: you reckon i should reboot? | 21:56 |
SpamapS | Not the worst idea, if for no other reason than normalization | 21:57 |
jeblair | SpamapS: yeah; though to increase the chances of catching the error, i was planning on only keeping ze01 in service | 21:57 |
SpamapS | that 4.4.0 kernel in xenial is particularly insidious .. it's really a bunch of post-4.4.0 patches masquerading as 4.4.0 | 21:58 |
jeblair | but if something in the kernel has altered it (fixed or made worse), i'd rather know sooner, so i'll reboot | 21:58 |
SpamapS | Yeah I can't point to anything and go "It's that!" but... that's hundreds of patches. | 21:59 |
jeblair | i'll wait till the current crop of tests finish | 21:59 |
jeblair | we got 4-6 of these an hour today | 21:59 |
SpamapS | I remember there was a really bad bug in the python in Ubuntu 14.04 that broke our tests for a while. I wonder if we have something similar going on. | 22:01 |
clarkb | that bug was in the garbage collector trying to free already freed entries in a circular list | 22:16 |
clarkb | so possible you've found another one of those in python | 22:16 |
jeblair | clarkb: py3.5? | 22:19 |
clarkb | it was 3.4 | 22:19 |
jeblair | ze01 is rebooted, running latest kernel, with ulimit -c unlimited and keep enabled | 22:19 |
jeblair | so we're looking for Ansible output: b'ERROR! A worker was found in a dead state' in the log after 2017-08-17 22:00 | 22:20 |
pabelanger | k | 22:22 |
jeblair | i'm going to recheck all zuul patches. | 22:22 |
*** openstackgerrit has quit IRC | 22:33 | |
jeblair | oho 3 segfaults i think | 22:37 |
jeblair | yeah, timestamps line up | 22:37 |
SpamapS | status lighting up like a christmas tree | 22:37 |
jeblair | no core files though :( | 22:38 |
SpamapS | blurgh | 22:43 |
*** openstackgerrit has joined #zuul | 22:47 | |
openstackgerrit | Merged openstack-infra/zuul feature/zuulv3: Remove callback_whitelist setting https://review.openstack.org/488214 | 22:47 |
dmsimard|off | jeblair: re - db initialization in ARA, not yet but it's on my to-do. | 22:50 |
* mordred now has working water in his house and does NOT have water running down the street from the broken sprinkler pipe | 22:52 | |
mordred | jeblair: have I missed anything new or exciting? | 22:54 |
jeblair | SpamapS: i used /proc to verify that the zuul process was ulimit -c 0 still (maybe something something systemd? or start-stop-daemon? or python-daemon?) | 22:54 |
jeblair | SpamapS: i use prlimit (which i just learned about!) to set it to unlimited while running (!!!!one!) | 22:55 |
jeblair | mordred: yeah; the scrollback about segfaults is interesting; that's what i'm poking at | 22:55 |
SpamapS | jeblair: ooooooooooooooooooo neat | 22:55 |
jeblair | now waiting for another segfault, post 22:55 utc | 22:56 |
jeblair | oh wow all the jobs are done? | 22:56 |
jeblair | i guess most of those were merge conflicts | 22:56 |
mordred | jeblair: just read - lovely! | 22:57 |
jeblair | rechecking all changes again | 22:57 |
mordred | jeblair, dmsimard|off: also, I'm very curious as to how the db init line from ara got in to the zuul-stream log file | 22:57 |
dmsimard|off | mordred: me too, likely faulty logging config from ara | 22:58 |
mordred | aha. nod. I can squelch that | 22:59 |
jeblair | it looks like web log streaming may be broken | 22:59 |
dmsimard|off | logging config (and logging in general) in ara is either horrible or nonexistent so it wouldn't surprise me | 22:59 |
jeblair | direct finger to ze01 works | 22:59 |
jeblair | mordred, dmsimard|off: we're also getting alembic log lines to ansible stdout/stderr (so they're showing up in the executor debug log) | 23:01 |
jeblair | 2017-08-17 22:59:07,638 DEBUG zuul.AnsibleJob: [build: 52eab059248f4642b08660e3988cfc64] Ansible output: b'INFO [alembic.runtime.migration] Running upgrade 22aa8072d705 -> 5716083d63f5, ansible_metadata' | 23:01 |
jeblair | eg ^ | 23:01 |
dmsimard|off | jeblair: that would also be ara, yes | 23:01 |
jeblair | that's not user visible, but it does make the executor logs chatty so would be nice to clean up (the entries showing up in the console log are more important to nix before production) | 23:02 |
dmsimard|off | https://storyboard.openstack.org/#!/story/2000931 is aptly described as "Get rid of alembic foreground logging, it's pretty annoying" | 23:03 |
jeblair | it's only 7% of our log output at the moment. :) | 23:03 |
dmsimard|off | I'll add a logging general topic to my 1.0 checklist sir :) | 23:04 |
jeblair | 2017-08-17 23:00:56,323 DEBUG zuul.AnsibleJob: [build: 5aaa32661bae496ba0e01006aa3d67cb] Ansible output: b'ERROR! A worker was found in a dead state' | 23:04 |
jeblair | ding! | 23:04 |
pabelanger | winner winner, chicken dinner | 23:06 |
dmsimard|off | if I keep piling up those 1.0 ARA todo's I'm going to pull a Zuul v3 and ship it in 2 years at this rate :D | 23:06 |
* dmsimard|off hides | 23:06 | |
jeblair | SpamapS, mordred: still no core file | 23:08 |
clarkb | jeblair: is pam resetting ulimits? | 23:08 |
clarkb | it may do that if there are user switches happening | 23:08 |
clarkb | you should be able to check via /proc I think | 23:08 |
jeblair | clarkb: i just checked an ansible-playbook process via proc and it has core set to unlimited. | 23:09 |
SpamapS | jeblair: work_dir is rw mounted right? | 23:09 |
jeblair | SpamapS: yes, it's jobdir/work so should be writable in all cases | 23:10 |
jeblair | cat /proc/sys/kernel/core_pattern | 23:10 |
jeblair | |/usr/share/apport/apport %p %s %c %P | 23:10 |
* jeblair is not amused | 23:11 | |
jeblair | SpamapS: any chance you know if that has caused the core dumps to go to some place where i can recover them? | 23:13 |
mordred | jeblair: maybe we should uninstall apport | 23:14 |
jeblair | /var/crash is empty | 23:14 |
clarkb | ianw may know since there was debugging of cores from apache I think | 23:14 |
jeblair | echo -n "core" > /proc/sys/kernel/core_pattern | 23:15 |
jeblair | i did that | 23:15 |
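[Editor's note: the apport detour happens because Ubuntu sets /proc/sys/kernel/core_pattern to a pipe (`|/usr/share/apport/apport ...`), so the kernel hands cores to a helper instead of writing a file in the crashing process's cwd. A small illustrative check — the helper function is ours, not a full core_pattern parser:]

```python
# Sketch: check whether the kernel will pipe core dumps to a helper
# (e.g. apport) rather than writing a core file.
def cores_are_piped(pattern: str) -> bool:
    # A leading "|" in core_pattern means "pipe the core to this program".
    return pattern.strip().startswith("|")

with open("/proc/sys/kernel/core_pattern") as f:
    pattern = f.read()

if cores_are_piped(pattern):
    helper = pattern.strip().lstrip("|").split()[0]
    print("core dumps are piped to:", helper)
else:
    # A plain pattern (default "core") is written relative to the
    # crashing process's current working directory.
    print("core files written as:", pattern.strip() or "core")
```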
mordred | Apport is not enabled by default in stable releases, even if it is installed. The automatic crash interception component of apport is disabled by default in stable releases for a number of reasons: | 23:15 |
mordred | I beg to differ with them | 23:15 |
jeblair | now looking for a segfault after 23:15 | 23:16 |
jeblair | another global recheck | 23:16 |
mordred | jeblair: echo "/tmp/cores/core.%e.%p.%h.%t" > /proc/sys/kernel/core_pattern | 23:17 |
jeblair | mordred: does /tmp/cores need to be world-writable? | 23:18 |
jeblair | if i were to do that? | 23:18 |
mordred | jeblair: unsure - reading more | 23:19 |
mordred | jeblair: so - actually your thing seems better | 23:19 |
mordred | jeblair: "core" is the default and will put a core file in the current dir | 23:20 |
jeblair | k. i *think* we're set up for that, but if we have problems, we can try with tmp | 23:20 |
mordred | cool | 23:20 |
jeblair | -rw------- 1 zuul zuul 84430848 Aug 17 23:27 /var/lib/zuul/builds/33ff13fb20cf44dea20078f1ea61eac5/work/core | 23:29 |
jeblair | finally! | 23:29 |
SpamapS | jeblair: woot | 23:30 |
jeblair | #0 0x000000000050ee24 in visit_decref () at ../Modules/gcmodule.c:373 | 23:31 |
jeblair | that function is the same in 3.5.2 and tip of master | 23:37 |
clarkb | I want to say that may be where the old bug was | 23:38 |
jeblair | ianw: ^ is any of this looking familiar? | 23:38 |
clarkb | they do list traversals and decrement references. Problem is its a circular list so if you don't clean pointers up properly on last pass you can hit old freed memory on the next pass and attempt to free already freed memory | 23:39 |
jeblair | clarkb: was there a fix for that? | 23:39 |
clarkb | yes, the fix came from 3.5 iirc | 23:39 |
clarkb | I'm finding links now | 23:40 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/python3.4/+bug/1367907 is what I filed in ubuntu to fix trusty, http://bugs.python.org/issue21435 is bug upstream in python | 23:41 |
openstack | Launchpad bug 1367907 in python3.4 (Ubuntu Trusty) "Segfault in gc with cyclic trash" [Undecided,In progress] | 23:41 |
SpamapS | Ubuntu is usually _really_ close to mainline releases on python. | 23:41 |
clarkb | bt is different | 23:41 |
clarkb | so this is probably a new bug | 23:41 |
ianw | jeblair: apart from the unfortunate familiarity of spending 97% of the time just trying to get a coredump, no sorry :/ | 23:41 |
jeblair | https://etherpad.openstack.org/p/uHxJU6CrXW | 23:42 |
ianw | can you do a bt full or is it crazy? | 23:47 |
clarkb | http://bugs.python.org/issue31181 | 23:48 |
clarkb | interesting that they seem to think it is related to yaml | 23:48 |
jeblair | clarkb: and that's py27 | 23:48 |
clarkb | http://bugs.python.org/issue27945 seems like a more robust reporting of what may be this issue and last bug is dup of it | 23:48 |
clarkb | jeblair: ya, ^ is all the versions though and I think is the same bug | 23:49 |
clarkb | there are patches too | 23:49 |
jeblair | ianw: will check in a sec | 23:49 |
clarkb | is it just me or does gerrit make it a million times easier to understand what the actual fixes are | 23:50 |
clarkb | https://github.com/python/cpython/commit/2f7f533cf6fb57fcedcbc7bd454ac59fbaf2c655 | 23:50 |
jeblair | ianw: http://paste.ubuntu.com/25335706/ | 23:53 |
clarkb | it's not too difficult to build your own python particularly on debuntu | 23:54 |
clarkb | you could likely grab that patch and see if it fixes | 23:54 |
clarkb | (on suse its a major pita because everything needs patching by distro) | 23:54 |
mordred | jeblair, clarkb: since it mentions yaml - i wonder if there is any difference between cyaml and non-cyaml | 23:59 |
clarkb | mordred: looking at the second bug it seems to be related to a large rewrite of the dict type | 23:59 |
mordred | ah | 23:59 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!