Sunday, 2017-09-10

pabelangerjeblair: mordred: Shrews: I am noticing odd behavior with nl01 / nl02. Currently have 2 trusty nodes marked ready (unlocked) and we built a new image. I would have expected the existing ready nodes to have been used first00:10
pabelangerjeblair: mordred: Shrews: also see 2 opensuse nodes online, even though we have min-ready: 1 for labels00:12
*** Shrews has quit IRC01:22
*** Shrews has joined #zuul01:23
*** xinliang has quit IRC02:04
*** mordred has quit IRC02:09
*** xinliang has joined #zuul02:21
*** xinliang has joined #zuul02:21
*** mordred has joined #zuul02:37
*** jamielennox has quit IRC03:08
*** jamielennox has joined #zuul03:13
* SpamapS wishes he was heading to DEN to work on the transition :-P03:54
*** yolanda has quit IRC04:15
* tobiash sits at the airport (MUC) and waits for departure06:08
openstackgerritDirk Mueller proposed openstack-infra/nodepool master: Reorder except statements to avoid unnecessary traces  https://review.openstack.org/50231212:13
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Fix typo - the word is "available"  https://review.openstack.org/50232014:37
mordredjeblair: http://logs.openstack.org/65/500365/10/check/shade-functional-devstack/7e8d1d3/job-output.txt.gz - which is one of the attempts to use the new devstack job - is failing here:14:51
mordredhttp://logs.openstack.org/65/500365/10/check/shade-functional-devstack/7e8d1d3/job-output.txt.gz14:51
mordredgah14:51
mordredhttp://logs.openstack.org/65/500365/10/check/shade-functional-devstack/7e8d1d3/job-output.txt.gz#_2017-09-09_20_44_47_81076814:52
mordredjeblair: it looks like it's failing in a pre-run task, we're not logging anything, then we shift to the post-run14:52
mordredjeblair: I can't see anything in the logs to indicate what might have gone wrong14:52
jeblairmordred: what's the last log lines from that playbook?14:53
mordredjeblair: ah! I think: 2017-09-09 20:44:47,933 DEBUG zuul.AnsibleJob: [build: 7e8d1d39913a4d67b94c1c9d1c99f903] Ansible output: b'ERROR! A worker was found in a dead state'15:00
jeblairlikely another segfault :(15:01
clarkbon newer python?15:01
mordredjeblair, clarkb: I'm not convinced we have the right python installed15:03
mordredk. nevermind. apt-cache was just showing me weird things15:03
mordredpython3.5 is already the newest version (3.5.2-2ubuntu0~16.04.2~openstackinfra).15:03
mordredjeblair, clarkb: "luckily" it seems to be consistent and reproducible - I think all of those jobs failed out in the same place15:04
*** jkilpatr has joined #zuul15:04
jeblairmordred: "cool"15:04
*** jkilpatr has quit IRC15:10
mordredjeblair, clarkb, fungi: if you have a sec: https://review.openstack.org/#/c/50227015:12
* mordred would like to clear that stack if possible15:12
mordredalso https://review.openstack.org/#/c/502256/ is an easy one15:14
fungimordred: you've got some earlier changes adding generate-log-url which seem to run counter to that. should they be abandoned now?15:25
* fungi is mildly confused about which ones we're keeping15:26
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Add migration tool for v2 to v3 conversion  https://review.openstack.org/49180515:40
mordredfungi: yah - one sec15:42
mordredfungi: so, https://review.openstack.org/#/c/502270 https://review.openstack.org/#/c/502271 and https://review.openstack.org/#/c/502320 are good as is https://review.openstack.org/#/c/502272/15:45
fungithanks!15:45
mordredfungi: I've marked https://review.openstack.org/#/c/502266 and https://review.openstack.org/#/c/502265 WIP - once the others have landed and we've verified, we can rework those15:45
fungiknowing what not to review is at least as important as knowing what to review at this point ;)15:45
mordredfungi: +10015:45
mordredupdated https://review.openstack.org/502265 to actually be correct and to depend on that base-test patch15:49
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: Add max-job-timeout tenant setting  https://review.openstack.org/50233216:22
mordredclarkb: I can do some hacking here in the tc/board room, but going all the way to tracking down the python segfault is a bit too much - any chance you can help with that one?16:30
clarkbmordred: sure what can I do to help?16:30
clarkbI'm slowly booting this morning so haven't quite followed all the segfault stuff16:30
mordredclarkb: I think we just need to trigger/run one of the changes with core files enabled and then gdb a little bit16:31
mordredclarkb: you can see all the devstack jobs here: https://review.openstack.org/#/c/500365/ hit the retry limit (because they're all hitting the segfault)16:31
mordredclarkb: so turning on keep on the executors and rechecking that should make it easy to reproduce the segfault (it seems to be happening every time)16:32
mordredactually - I betcha we could make a much smaller patch that reproduces it - lemme see if I can make you one of those real quick16:32
clarkbok16:34
mordredclarkb: https://review.openstack.org/502335 is a smaller one but still uses the devstack_local_conf parameter (the python module implementing that seems to be where the crashes are happening)16:35
mordredI also wonder if there's $something we can do, maybe in our callback plugin, that puts in a segfault handler so that we can maybe emit a $something into the logs16:36
clarkbya catching sigsegv and recording state might be helpful16:37
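Python's stdlib `faulthandler` module is one concrete way to do what clarkb suggests: it installs low-level handlers for SIGSEGV and friends and dumps the Python tracebacks of all threads before the process dies. A minimal sketch — the child-process demo below is illustrative only (it deliberately crashes a throwaway interpreter), not zuul's actual callback plugin code:

```python
# Enable faulthandler so a SIGSEGV dumps Python tracebacks to stderr
# instead of the worker dying silently. The demo crashes a child
# interpreter via a NULL dereference to show the handler firing.
import subprocess
import sys

crasher = ("import faulthandler, ctypes; "
           "faulthandler.enable(); "
           "ctypes.string_at(0)")  # NULL dereference -> SIGSEGV
proc = subprocess.run([sys.executable, "-c", crasher],
                      stderr=subprocess.PIPE)
report = proc.stderr.decode()
# faulthandler writes "Fatal Python error: Segmentation fault" plus the
# current thread's traceback before the child is killed by the signal.
print(report.splitlines()[0])
```

Setting PYTHONFAULTHANDLER=1 in the environment of the forked ansible-playbook would achieve the same without code changes.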
clarkbmordred: so I should see 502335 hit the retry limits ya?16:38
mordredclarkb: yah16:40
mordredclarkb: OR - if we don't, then we can compare it to the previous patch and see what it's doing that doesn't segfault when the others do16:40
clarkbok16:42
clarkband keep keeps the bubblewrap env around?16:42
mordredclarkb: yah - if you run "zuul-executor keep" on the executor it tells it to not delete the build dir from /var/lib/zuul/builds16:45
clarkband is that global? or can I scope it to a specific job?16:46
mordredclarkb: there's a zuul-bwrap command you can use to manually run a playbook via bubblewrap16:46
mordredclarkb: it's global16:46
mordredclarkb: scoping it to a job might be a nice improvement though16:46
clarkbmordred: and you double checked the version of python?16:49
* clarkb looks at python16:49
clarkbVersion: 3.5.1-3 is what apt-cache says we have16:50
mordredclarkb: yah - apt-cache seems to be lying16:52
mordredclarkb: I tried apt-get update ; apt-get install python3.5 and it said: "python3.5 is already the newest version (3.5.2-2ubuntu0~16.04.2~openstackinfra)"16:53
clarkbPython 3.5.2 (default, Aug 18 2017, 13:17:48)16:53
clarkbis what python itself reports, which is closer to ^ than 3.5.1-316:53
mordredclarkb: yah - I think we should figure out what's up with that - or see if the right one has hit the proposed bucket yet16:54
clarkbany objections to installing aptitude? I find it easier to debug these sorts of things using aptitude than apt-*16:55
clarkbalso finger streaming really kills the browser16:58
clarkbI wonder if we can have it poll every say 10 seconds16:58
clarkbmordred: dpkg -s also says 3.5.1-316:59
clarkbaha dpkg -l to the rescue17:01
clarkbmordred: package confusion seems to be python3 vs python3.517:01
clarkband python3 is not a real package; it is a dependency package that depends on the python3.5 package17:02
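The confusion clarkb untangles here — dpkg reporting the `python3` metapackage's 3.5.1 while the interpreter really comes from the 3.5.2 `python3.5` package — can also be cross-checked from inside the interpreter itself, which sidesteps the packaging layer entirely. A small illustrative sketch (the version numbers in the comments are the ones from this discussion, not something the code asserts):

```python
# The xenial "python3" package is a metapackage that merely depends on
# "python3.5", so querying the wrong package name reports the
# metapackage version (3.5.1-3 in this log) even though the interpreter
# installed by the python3.5 package is 3.5.2. Asking the running
# interpreter directly avoids that ambiguity.
import platform
import sys

print(sys.executable)             # which binary is actually running
print(platform.python_version())  # the real interpreter version
print(sys.version_info[:3])       # same information, as a tuple
```

This is essentially what clarkb does at 16:53 by looking at the banner python prints on startup.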
clarkbmordred: your reproduction change seems to have failed properly17:03
clarkbuser jenkins tried to write to zuul's homedir17:04
mordredclarkb: oh - goodie17:07
clarkbalso /new/shade no such file or directory17:08
clarkbso it's mostly just dir movement17:08
mordredcool. so - where did you see that error and was there actually a segfault?17:09
mordredclarkb: oh - that actually didn't reproduce the failure17:09
clarkbmordred: http://logs.openstack.org/35/502335/1/check/shade-functional-devstack/2b1bdd8/job-output.txt.gz#_2017-09-10_16_57_59_497595 ya, no reproduced segfault17:09
clarkbit's a legit fail17:09
mordredclarkb: that failed like a normal job ...yah17:09
mordredclarkb: whereas the previous change did fail with the segfault17:10
mordredclarkb: https://review.openstack.org/#/c/50036517:10
clarkboh well that particular job failed with http://logs.openstack.org/65/500365/11/check/shade-functional-devstack/fe7c728/job-output.txt.gz#_2017-09-10_16_48_59_505509 which is another legit fail17:11
clarkbno segfault there that I see17:11
clarkbI see that is why you added heat to that one17:12
clarkbyou sure it wasn't just the error on clone causing problems?17:13
* clarkb looks at different logs17:13
mordredclarkb: "./stack.sh: line 512: generate-subunit: command not found" also a different thing, but I don't think the main thing17:17
clarkbmordred: that's happening because the error on clone is bailing everything out early I think17:17
mordredgotcha17:17
clarkbmordred: so fixing the heat clone I think may fix it in general? I'm not seeing segfaults?17:18
mordredclarkb: cool. I'm not sure why we didn't get good logging on that previous run - but lemme fix that in the real one and see how it goes17:18
clarkbya let's do that17:18
mordredclarkb: there's a flag I can set to turn off fail on clone, right?17:19
clarkbmordred: ya, ERROR_ON_CLONE is the devstack flag and I think it's set by default in jeblair's base job? Not sure how setting it twice goes in the conflict resolution17:19
mordredlet's find out :)17:19
clarkbmordred: but you shouldn't need it just add heat to your jobs like you did in the recent job17:21
clarkbthen you get to the path problems in your functional test script17:21
mordredah - yes17:23
openstackgerritMerged openstack-infra/zuul-jobs master: Add zuul_log_url to emit-job-header  https://review.openstack.org/50227017:36
openstackgerritMerged openstack-infra/zuul-jobs master: Rename log_path and generate-log-url  https://review.openstack.org/50227117:36
clarkbmordred: http://logs.openstack.org/65/500365/12/check/shade-ansible-devel-functional-devstack/cb514cc/job-output.txt.gz#_2017-09-10_17_43_56_361478 I think you are in business now?17:46
clarkbmordred: new errors17:46
clarkbfun fun fun17:46
clarkbI'm gonna push a new patchset17:48
mordredclarkb: I actually just realized I can kill both of these hook scripts ...17:53
clarkbya I haven't looked at those yet17:53
clarkbfixed a playbook problem in latest patchset17:54
mordredone sec - patch coming17:54
mordred(I can do a thing in v3 I couldn't do well before)17:54
mordredclarkb: check out https://review.openstack.org/500365 now18:02
openstackgerritClark Boylan proposed openstack-infra/zuul-jobs master: Actually set zuul_log_path  https://review.openstack.org/50234018:04
mordredclarkb: ^^ related to that, https://review.openstack.org/#/c/50227218:06
openstackgerritMerged openstack-infra/zuul-jobs master: Actually set zuul_log_path  https://review.openstack.org/50234018:09
mordredtristanC: https://review.openstack.org/#/c/502332 has a nit - but I think we should fix it - the patch otherwise looks great to me18:15
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Override tox requirments with zuul git repos  https://review.openstack.org/48971918:19
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Remove extra quotation mark  https://review.openstack.org/50234318:27
tristanCmordred: oops, fixing it now18:39
openstackgerritMerged openstack-infra/zuul-jobs master: Fix typo - the word is "available"  https://review.openstack.org/50232018:40
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: Add max-job-timeout tenant setting  https://review.openstack.org/50233218:40
tristanCmordred: actually, next we should tinker with how to generalize tenant resource limits; having a list of 'max_' settings doesn't look clean18:42
tristanCfor example, we probably also want to limit more things such as available labels18:44
tobiashtristanC: you intend a limits setting like this: http://paste.openstack.org/show/620830/ ?18:44
tristanCtobiash: yes, something like that would look better imo18:45
tristanCmaybe the structure could support min, max, in types of limit18:45
mordredtristanC, tobiash: I think we need to consider label/tenant mapping and label/job or project mapping fairly broadly - realized it when working on some of the job conversions ...18:47
tristanCmordred: on another topic is there some work already started to move the status.json controller to zuul-web (iirc through gearman) ?18:48
mordredone of the things seemed like it would be a good use for static nodes - until we realized there is no way (currently) to prevent some random job in some random repo from requesting a given static node18:48
mordredtristanC: there is not - I wrote down some thoughts on it somewhere, but I do not believe anyone has started work on it yet18:48
mordredtristanC: I also wrote this: https://review.openstack.org/#/c/494655/ which doesn't work and was just me poking around in general at what paste->aiohttp conversion looks like18:49
tobiashmordred: for this there was the idea of an 'allowed labels' setting18:49
mordredtobiash: for a tenant in main.yaml?18:49
tobiashmordred: yes18:50
tobiashmordred: didn't have time for this yet unfortunately18:50
mordredtobiash: I'd also like to ponder the idea of teaching nodepool about tenants - so that you could have different max quotas per tenant for instance18:51
tobiashmordred: but this might also fit into a limits description18:51
tobiashmordred: jeblair was against this as nodepool also has other users than zuul, this also could be handled directly in zuul I think18:51
tristanCmordred: how about starting with making formatStatusJSON a gearman function and create the per-tenant status endpoint in zuul-web?18:54
mordredtristanC: I think that's a great idea18:55
tristanCi ask those things because it will help the work on the dashboard; it seems like having a per-tenant status page will help us tinker with what we want the zuul interface to look like for jobs and builds18:56
openstackgerritMerged openstack-infra/zuul-jobs master: Remove extra quotation mark  https://review.openstack.org/50234318:56
tristanCmordred: ok great, i'll try to get that formatStatusJSON function exposed over gearman now18:57
mordredtristanC: when I last spoke with jeblair about status.json and gearman, I believe his suggestion was to keep the status cache with the web side - so move the _refresh_status_cache and whatnot to zuul-web too19:02
fungiin retrospect, wondering whether 500320 should really have just been constraints instead of upper constraints, given it's in zuul-jobs and the idea of the constraints list being a rendering of transitive upper versions from a global requirements list is quite openstack-specific19:21
*** dkranz has quit IRC19:34
mordredfungi: we could do that20:07
fungiwe can do it later for all i care, just keeping a mental tally20:08
mordredyah. I think renaming it is a fine idea20:08
*** olaph has quit IRC20:16
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Rename tox_upper_constraints_file to tox_constraints_file  https://review.openstack.org/50234820:17
mordredfungi: ^^20:17
*** olaph has joined #zuul20:19
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Add migration tool for v2 to v3 conversion  https://review.openstack.org/49180520:26
*** olaph1 has joined #zuul20:36
*** olaph has quit IRC20:38
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Add migration tool for v2 to v3 conversion  https://review.openstack.org/49180520:50
jeblairi am at my hotel in denver and have caught up with scrollback here.22:07
dmsimardneed some ci_mirrors.sh reviews merged to fix fedora in v3 still: https://review.openstack.org/#/c/502268/ and (once tested) https://review.openstack.org/#/c/501954/22:20
mordredjeblair, pabelanger, clarkb: http://logs.openstack.org/05/491805/13/check/zuul-migrate/274648b/job-output.txt.gz - is reported as POST_FAILURE but I don't see anything that failed22:27
mordredhowever, trying to figure out what was wrong led me down a little rabbit hole ...22:27
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Produce subunit and html at the end of tox  https://review.openstack.org/50235922:28
dmsimardmordred: the end of that text file is truncated, no?22:30
jeblairdmsimard: i went ahead and approved 268 since it's just a base-test add22:30
dmsimardjeblair: \o/22:30
mordreddmsimard: it's always truncated - we can't log to the file anymore after we gzip it22:31
jeblairmordred: we already have support for non-python subunit22:31
dmsimardmordred: yeah but I mean, if there'd be a failure after that point you wouldn't see it in the log ? :)22:31
jeblairmordred: if you change the filenames, you're going to have to change the subunit stuff i'm working on too22:31
mordredjeblair: I can unchange the file names - I was more concerned about the todo around the fact that the roles were using the pre-existing virtualenv as well as the tox_envlist variable22:33
jeblairmordred: thanks, i think that'd be best (i elaborated in a review comment)22:34
mordredjeblair: kk - will do22:34
dmsimardjeblair, mordred: would it be okay to send a patch to gzip the job-output.json file? We have the mimetype for it ( https://github.com/openstack-infra/puppet-openstackci/blob/master/templates/logs.vhost.erb#L49 ) and it's >=1MB for each run22:38
mordreddmsimard: wfm22:38
jeblair++22:38
dmsimardbut of course the post-logs role is one of those things in base that can easily break22:39
dmsimardso... ugh22:39
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Produce subunit and html at the end of tox  https://review.openstack.org/50235922:41
mordreddmsimard: yup. it's the multi-step dance of joy!22:43
dmsimardmordred: I'm not sure at what point the job-output.json file ends up being created; my grep-fu is failing and only returning things about job-output.txt.22:43
dmsimardmordred: is it just continually written to by zuul_json ?22:44
dmsimardoh wait nevermind, found it -- it's in v2_on_stats of zuul_json22:45
jeblairdmsimard: yep22:45
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Add migration tool for v2 to v3 conversion  https://review.openstack.org/49180522:45
ShrewsIs there early check-in somewhere?22:45
dmsimardShrews: for what, the badges ?22:46
ShrewsYeah22:47
dmsimardon the lobby floor down the escalator on the left hand side (when coming in the hotel)22:47
Shrewsthx22:48
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Decode stdout from ansible-playbook before logging  https://review.openstack.org/50236223:01
openstackgerritDavid Moreau Simard proposed openstack-infra/zuul-jobs master: Gzip the job-output.json file  https://review.openstack.org/50236323:04
dmsimardjeblair: I'm not terribly confident about ^ - mostly confused about whether this should occur on localhost or not23:04
jeblairdmsimard: i think it would be better to gzip on localhost like the text console log.23:05
dmsimardah I think I get it now23:06
pabelangerjeblair: mordred: 491805 is a dead worker23:09
pabelanger2017-09-10 21:21:20,087 DEBUG zuul.AnsibleJob: [build: 274648b3571a4cfa82a7aeb61fd9f447] Ansible output: b'ERROR! A worker was found in a dead state'23:09
jeblairclarkb: ^23:09
openstackgerritDavid Moreau Simard proposed openstack-infra/zuul-jobs master: Gzip the job-output.json file  https://review.openstack.org/50236323:09
dmsimardjeblair: I think that should work ^23:09
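The change dmsimard is iterating on here compresses job-output.json on the executor (localhost), mirroring how the text console log is already handled. The heart of such a step is just stdlib gzip; a self-contained sketch, where the temp directory and JSON content are stand-ins for the real executor-side paths:

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(path):
    """Write path.gz next to path, leaving the original in place."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

# stand-in for the executor-side job-output.json
workdir = tempfile.mkdtemp()
json_log = os.path.join(workdir, "job-output.json")
with open(json_log, "w") as f:
    f.write('{"plays": []}')

gzip_file(json_log)  # produces job-output.json.gz alongside the original
```

Serving the resulting .gz so browsers still render it as JSON is what the logs.vhost.erb mimetype mapping dmsimard links above takes care of.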
jeblairclarkb: did you set it to save core files?23:10
clarkbjeblair: I didn't do anything because I couldn't find segfaults23:11
clarkbthey were legit fails in pre that caused it to hit the retry limit23:11
jeblairwell, we need to set it up to save core files before it segfaults23:11
jeblairSep 10 23:10:23 ze02 kernel: [2183616.902108] ansible-playboo[4575]: segfault at a9 ip 000000000050eaa4 sp 00007ffe55dfc808 error 4 in python3.5[400000+3a7000]23:13
jeblairthat's the corresponding segfault to pabelanger's error23:13
jeblairpabelanger: what executors are running now?23:14
pabelangerjeblair: I haven't changed anything over the last few days, but looking at the SRU SpamapS pushed, we should enable xenial-proposed and pull in the new python3.523:16
pabelangerhttps://launchpad.net/ubuntu/+source/python3.5/3.5.2-2ubuntu0~16.04.223:16
jeblairi thought it was well established that we are running the correct python 3.5.  is that still in doubt?23:17
pabelangerI _think_ we are still running our patched version but I haven't really looked into the state of the current failures23:18
jeblairclarkb, mordred: didn't i see you do that earlier?  are you satisfied the correct python is being used?23:19
dmsimardpython -c "import sys; print(sys.version_info)" ?23:19
pabelangerhowever python3 and python3-all are 3.5.1, whereas we have the python3.5 package from the ppa23:19
clarkbyes I think it is correct python23:19
pabelangerhttp://paste.openstack.org/show/620834/23:20
jeblairclarkb: great, thanks.  did you check all 4 executors?  pabelanger do we still have 4, is that correct?23:20
pabelangerjeblair: yes 4 executors, but only ze01, ze02 are online23:21
dmsimardclarkb: I guess the executor processes did get restarted (to run on the new python?)23:21
jeblairpabelanger: thanks23:21
pabelangerunless somebody has started ze03 / ze0423:21
jeblairpabelanger: well, if you're not sure, let's check23:21
jeblairit's not running on 0323:21
jeblairand it's not running on 0423:22
pabelangerk23:22
pabelangerthat pastebin for apt packaging is a little odd however, I don't know why we have both 3.5.1 and 3.5.2 installed23:23
pabelangermaybe we should think about removing the unpatched version?23:23
clarkbjeblair: I only checked the first two23:23
clarkbpabelanger: we have both versions because python3 deps on python3.523:24
pabelangerclarkb: so, possible this is a new traceback23:30
jeblairfor pid in `pgrep zuul-executor`; do prlimit --core=unlimited:unlimited --pid $pid; done23:32
jeblairclarkb, pabelanger: ihave done that on both executors23:32
jeblairzuul-executor keep23:32
jeblairand i did that on both as well23:32
jeblairso the next time there's a segfault, we should have a core file in the build directory23:33
jeblairecho -n "core" >/proc/sys/kernel/core_pattern23:33
jeblairoh, also that ^, but that was only needed on ze0223:33
jeblairto get rid of the crazy apport thing23:33
pabelangerjeblair: k23:36
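jeblair's `prlimit` loop above raises RLIMIT_CORE on the already-running executors from the outside; the same thing can be done from inside a Python process, and child processes such as the forked ansible-playbook inherit the limit. A sketch using the stdlib `resource` module — it only raises the soft limit to whatever the hard limit allows, rather than assuming root privileges:

```python
# Programmatic equivalent of "prlimit --core=unlimited:unlimited":
# raise this process's soft core-file limit to the hard limit. Children
# (e.g. a forked ansible-playbook) inherit the limit, so a segfault can
# leave a core file behind in the kept build directory.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print(resource.getrlimit(resource.RLIMIT_CORE))
```

Note this covers only the rlimit half of the setup; the core_pattern change at 23:33 (writing plain "core" files instead of piping to apport) still has to be done system-wide via /proc.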
openstackgerritMerged openstack-infra/zuul-jobs master: Gzip the job-output.json file  https://review.openstack.org/50236323:37
jeblairmordred: can you re-review https://review.openstack.org/502273 ?  clarkb can you review that as well?23:40
dmsimardoh yeah I wanted to look at that. Looking now.23:43
pabelangerjeblair: should we kick off the rechecks and try to reproduce the segfault now?23:47
jeblairpabelanger: go for it23:48

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!