Monday, 2017-09-11

jeblairSep 10 23:51:03 ze01 kernel: [2079277.750440] ansible-playboo[14535]: segfault at a9 ip 000000000050eaa4 sp 00007ffcc0c3ffc8 error 4 in python3.5[400000+3a7000]00:00
pabelangernice00:01
jeblair2017-09-10 23:51:03,890 DEBUG zuul.AnsibleJob: [build: 57e588b84fd145d99b4049d974d270fe] Ansible output: b'ERROR! A worker was found in a dead state'00:01
jeblair-rw------- 1 zuul zuul 94801920 Sep 10 23:51 /var/lib/zuul/builds/57e588b84fd145d99b4049d974d270fe/work/core00:02
pabelangeryay00:02
jeblairhttp://paste.ubuntu.com/25511193/00:05
jeblair#0  0x000000000050eaa4 in visit_decref () at ../Modules/gcmodule.c:37300:06
jeblairthat's the same location as before00:06
pabelangerya, I'm wondering if something isn't setup properly00:06
jeblairthe stack above it is different; this time it's the powershell module that triggered it, not the ssh connection plugin00:09
jeblairpabelanger: like what?00:10
pabelangerjeblair: on friday I did run apt-get upgrade on zuul-executors because it said a new version of PPA for python3.5 needed to be installed.  This was on ze02.o.o, but I also ran it on ze01.o.o, since apt was also complaining.  However, we should have already been running PPA00:12
ianwi think my irc is messed up, it looks like jeblair just said "powershell" was causing zuul segfaults :)00:12
pabelangerso I wonder if upgrading to the PPA version caused an issue00:12
jeblairwe've had 31 segfaults on ze01 today, compared to 13 yesterday,  4 the day before, 10, 3,3, 8 before that00:12
pabelangerbut, cannot think why00:12
jeblairianw: ansible-playbook segfaults :)00:13
pabelangerjeblair: how did we install the patched version of python a few weeks ago when you ran your series of rechecks00:13
jeblairpabelanger: from the ppa that mordred created00:13
jeblair cat /etc/apt/sources.list.d/openstack-ci-core-ubuntu-python-bpo-27945-backport-xenial.list00:14
jeblairdeb http://ppa.launchpad.net/openstack-ci-core/python-bpo-27945-backport/ubuntu xenial main00:14
* tobiash finally arrived at the hotel in denver00:15
jeblairpabelanger: i don't see anything in dpkg.log after sept 600:16
pabelangerjeblair: ya, that would have been the day I did apt-get upgrade00:16
openstackgerritMerged openstack-infra/zuul-jobs master: Allow overriding the workspace directory in prepare-workspace  https://review.openstack.org/50046600:17
pabelangerso wednesday00:17
jeblairpabelanger: so you installed a new version of our patched python from the ppa.  why was there a new version in our ppa?00:18
pabelangerjeblair: right, I don't know. So, that makes me think we didn't have the ppa version installed, because nowhere in puppet-zuul do we have ensure => latest for python3.500:19
pabelangerSo, I wonder if we somehow uninstalled the fix00:19
pabelangerbut, that would mean our PPA doesn't include the actual fix00:19
pabelangerI'm also unsure why our PPA diff is largely different from xenial-proposed diff00:21
jeblairpabelanger: the version on the system matches what's in the ppa 3.5.2-2ubuntu0~16.04.2~openstackinfra00:21
pabelangerour ppa: https://launchpadlibrarian.net/333758013/python3.5_3.5.1-10_3.5.2-2ubuntu0~16.04.2~openstackinfra.diff.gz00:21
pabelangerxenial-proposed: http://launchpadlibrarian.net/336180849/python3.5_3.5.2-2~16.04_3.5.2-2ubuntu0~16.04.2.diff.gz00:21
pabelangerjeblair: have we discussed moving ansible-playbook back to the python2 interpreter? Or do we want to ensure everything is python3 for zuulv300:31
jeblairpabelanger: that's not entirely straightforward.  i don't think we can release a piece of software that runs partly under python2 and partly under python3 and expect anyone to take it seriously.  so either we make python3 work, or *everything* goes back to python2, including zuul (which includes web streaming).00:37
jeblairor rust00:37
jeblairi could live with zuul in rust running ansible in python2 :)00:38
pabelangerjeblair: sure, I just got the feeling from ansiblefest, python3 support and ansible was best effort right now. When I talked about the issue of dead worker, ansible testing hasn't seen it yet. So, we are likely in new territory with python300:39
jeblairbut even as a temporary measure, we'd have to find a way to invoke ansible under python2 from a zuul installation running in python300:39
pabelangerya00:40
jeblairpabelanger: nope, ansible core is py3 ready.  ansible modules are best effort.00:40
jeblairpabelanger: this is not an ansible bug, this is a python bug.00:40
SpamapShow's the hacking going folks?00:40
* SpamapS reading..00:40
SpamapSmore python problems?00:41
pabelangerjeblair: ya, so I'm kinda surprised we are able to segfault python3.5 at all00:41
jeblairSpamapS: yeah, segfaults in the same location00:41
SpamapSjeblair: even w/ fix? :(00:41
SpamapSjeblair: so maybe we found a separate class of segfault there00:41
jeblairapparently so00:41
SpamapSthar be dragons00:41
SpamapSjeblair: maybe we could use jython00:42
* SpamapS waits for beers to be cleaned off screens00:42
jeblairi pastebinned the gdb backtrace above00:42
jeblairi need to afk for a bit00:42
pabelangersame, I'm about to step away for social but will return once finished00:43
SpamapSkk.. I'm not super great at python debugging but I'll peek and see if it compares to the supposedly fixed bug00:43
ianwvery suspicious that it's in this same frozen import magic area as before -> http://paste.openstack.org/show/620835/00:56
ianwi mean, of all the dictionary creation magic that happens, it hits it around this area ...00:58
*** jkilpatr has joined #zuul01:27
*** jkilpatr has quit IRC01:34
SpamapSalso why powershell?03:49
SpamapSI'm sure it's the mechanism importing, and not powershell itself.. but... why?03:49
ianwSpamapS: http://paste.openstack.org/show/620839/03:55
ianwnote that does *not* replicate a segfault ... just my attempts at it03:56
ianwit's definitely coming around from when builtins.__build_class__ builds a ShellModule class03:56
SpamapSwhoa03:57
SpamapSis that a dictionary key'd by a subclassed multiprocessing.Process?03:57
ianwi was just trying to pass something back and forward and keep a global dict there03:57
SpamapSkk, just trying to wrap head around it03:57
SpamapSSo the updated Python definitely has "fixed" the original reproducer effects. But it's entirely possible there are off-by-one's left that are triggered by deeper usage.03:58
ianwjust reading those ... i'm not sure how similar they are03:58
ianwthey have to do with basically modifying a dict while inside it03:58
ianwthere is sooooooo much going on around all this though ... that may be happening03:59
ianw__build_class__ is just deep magic ... all i can find about it is https://mail.python.org/pipermail/python-3000/2007-March/006338.html04:00
SpamapSIIRC the ansible plugin interface makes use of imp()04:02
SpamapSwhich means Python devs throw their hands up and say NOPE04:02
ianwSpamapS: yep, i was walking "upwards" from http://paste.openstack.org/show/620835/ to see where we might call in04:03
ianwbasically it multi-process forks() and then "connection = self._shared_loader_obj.connection_loader.get"04:04
ianwbut really, it's the alloc from line #54 in http://paste.ubuntu.com/25511193/  that's triggered a gc, and then presumably found some bad object04:07
SpamapSI'm still going back to my old twitch in the back of my head when better python devs than I told me never ever fork without exec in python.04:09
ianwhmm, i can walk the garbage collection ... i wonder if we can leverage that to figure out what it was touching ...04:36
ianwjeblair / SpamapS : I started a story on this -> https://storyboard.openstack.org/#!/story/200118605:58
ianwin short, it seems extremely suspicious that the object that seems to cause these problems is a code-generation object for https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/ansible/callback/zuul_stream.py?h=feature/zuulv3#n3905:59
*** hashar has joined #zuul07:45
*** xinliang has quit IRC08:25
*** xinliang has joined #zuul08:25
*** xinliang has quit IRC08:25
*** xinliang has joined #zuul08:25
*** electrofelix has joined #zuul09:04
electrofelixTo reply to the email thread I started about the UI, is there a good place to host screenshots to help with the discussion? Would storyboard be an option? Just figured that attachments would get stripped09:20
*** jkilpatr has joined #zuul11:04
*** jkilpatr has quit IRC11:11
*** jkilpatr has joined #zuul11:23
*** yolanda has joined #zuul11:50
*** dkranz has joined #zuul12:27
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: WIP: web: add /{tenant}/status controller  https://review.openstack.org/50245312:30
*** fbo_ has joined #zuul12:49
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Decode stdout from ansible-playbook before logging  https://review.openstack.org/50236213:35
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Emit a message to the job log if ansible crashes  https://review.openstack.org/50246813:35
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Emit a message to the job log if ansible crashes  https://review.openstack.org/50246813:36
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Decode stdout from ansible-playbook before logging  https://review.openstack.org/50236213:36
fungielectrofelix: closest thing we had was pholio.openstack.org but after the ux team dissolved and nobody had ever used the service we decommissioned it14:23
fungielectrofelix: i guess gerrit can host images if you push them in throwaway changes, though that's likely suboptimal14:25
fungi(it will display images in the diff view)14:25
clarkbit's also probably not great long term for a repo you care about14:26
clarkb(could make a throwaway image repo though)14:26
electrofelixfungi clarkb: would using https://snag.gy/ or https://screencloud.net/ be considered ok? 6 month lifespan on snag.gy14:54
tristanCmordred: hum, should the public keys be exposed over gearman by the scheduler so that we could move the /keys endpoint to zuul-web too?15:05
jeblairtristanC: sounds reasonable15:06
fungielectrofelix: seems as good as anything. i usually just push stuff up to a temporary directory on a personal website15:09
tristanCjeblair: would you mind checking this gearman worker implementation in the scheduler: https://review.openstack.org/#/c/502453/1/zuul/scheduler.py15:11
*** yolanda has quit IRC15:11
* tristanC working on moving the webapp.py to zuul-web15:12
*** yolanda has joined #zuul15:16
*** yolanda has quit IRC15:16
*** yolanda has joined #zuul15:16
mordredtristanC: ++ to keys15:18
Shrewsfyi, putting https://review.openstack.org/502137 through to fix websocket tests. survived several rechecks. someone ping me if they discover another ipv6 test failure.15:29
Shrewsthere is still a race in that test. trying to hunt that down now15:30
jeblairhttps://etherpad.openstack.org/p/zuulv3-ptg15:51
jeblairthat's the list of tasks we discussed15:51
tristanCianw: jeblair: what's the exact python version that segfaults?15:52
*** dmellado has joined #zuul15:54
*** yolanda has quit IRC15:55
openstackgerritMerged openstack-infra/zuul feature/zuulv3: Support IPv6 in the finger log streamer  https://review.openstack.org/50213715:55
tristanCjeblair: would it be http://ppa.launchpad.net/openstack-ci-core/python-bpo-27945-backport/ubuntu/pool/main/p/python3.5/ ?15:58
jeblairtristanC: that's the one15:59
jeblairtristanC: that was the current version of py35 in ubuntu with the patch in python bug 27945 applied15:59
openstackbug 33002 in gnome-session (Ubuntu) "duplicate for #27945 logout dialog UI objections" [Wishlist,Fix released] https://launchpad.net/bugs/3300215:59
jeblairthat's not the bug :)15:59
jeblairhttps://bugs.python.org/issue2794515:59
jeblairthat bug15:59
Shrewspabelanger: another thing to note, even though a node may be READY, it could already be allocated to another request (the allocated_to field in ZK)16:03
Shrewsjust something to check if you see it again16:03
Shrewsalso tobiash ^^^16:04
SpamapShttps://bugs.launchpad.net/ubuntu/zesty/+source/python3.5/+bug/171172416:04
openstackLaunchpad bug 1711724 in python3.6 (Ubuntu Zesty) "Segfaults with dict" [High,Fix committed] - Assigned to Clint Byrum (clint-fewbar)16:04
SpamapSis the Ubuntu bug16:04
pabelangerShrews: ya, next time I see it, I'll try to get into zk-shell16:05
SpamapSjeblair: ianw makes a fair point about zuul_stream. I wonder if we can somehow disable streaming and see if the problem goes away. it would at least give us an idea of where to go to work around the issue.16:05
pabelangerShrews: or maybe add info to --detail output?16:05
jeblairSpamapS: mordred is working on a patch to make linesplit not a generator in an attempt to avoid triggering the bug16:06
jeblairmeanwhile tristanC is looking into the bug itself16:07
Shrewspabelanger: yeah, if it's not already there16:07
jeblairSpamapS: (the hope being that we can eventually squash the bug, but in the mean time, maybe avoid triggering it while we try to do v3 cutover)16:07
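[editor's note] A rough sketch of the workaround idea described above: process each line inside ordinary loops with a callback rather than yielding lines from a generator. This is not the code in change 502492. `stream_lines` and `handle` are hypothetical names, and the buffering approach is an assumption made for illustration.

```python
import socket


def stream_lines(sock, handle, bufsize=4096):
    """Call handle(line) for each newline-terminated line read from sock.

    Hypothetical helper: plain loops and a callback, no generator involved.
    """
    buff = b""
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # peer closed the connection
            break
        buff += chunk
        # Everything before the last newline is a complete line;
        # the remainder stays in the buffer for the next recv().
        *lines, buff = buff.split(b"\n")
        for line in lines:
            handle(line)
    if buff:  # trailing data without a final newline
        handle(buff)


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    sender.sendall(b"task one\ntask two\npartial")
    sender.close()
    stream_lines(receiver, lambda line: print(line.decode()))
    receiver.close()
```

Whether avoiding the generator actually sidesteps the interpreter crash is exactly what the channel is trying to find out; the sketch only shows the shape of such a change.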
fungimordred: are your topic:zuul-v3-migration changes intended to be topic:zuulv3 instead? (i just stumbled across them via a depends-on)16:11
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249216:11
mordredfungi: yes they are - whoops16:12
fungimordred: no worries, i'll reset them16:12
jeblairremote:   https://review.openstack.org/502273 Add log processing roles16:12
*** yolanda has joined #zuul16:13
jeblairmordred, clarkb, dmsimard: ^ updated to address comments, plus i added a 'when' line for subunit (because we only want those in certain pipelines)16:13
*** haypo has joined #zuul16:13
fungiif this is a help to anyone, my zuulv3 review dashboard query string in gertty: "is:open AND (project:openstack-infra/openstack-zuul-jobs OR project:openstack-infra/openstack-zuul-roles OR project:openstack-infra/zuul-jobs OR (project:^openstack-infra/.* AND topic:zuulv3)) AND NOT label:Workflow=-1"16:14
haypohi. tristanC pointed me to https://storyboard.openstack.org/#!/story/2001186 -- you are using Python 3.5.2 which has https://bugs.python.org/issue26617 bug - i don't know at this point if your crash is related to this bug or not?16:14
jeblairfungi: ++ that's pretty close to mine too :)16:14
fungi27 changes currently matching16:14
haypothe best would be to have a simple reproducer and try python 3.5.416:14
tristanChaypo: thanks for stepping in!16:14
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Add node.allocated_to to node detail output  https://review.openstack.org/50249516:16
Shrewspabelanger: ^^^16:16
jeblairhaypo, tristanC: so we should check whether https://hg.python.org/cpython/rev/c9b7272e2553 is in our version?16:16
pabelangerShrews: ty!16:16
haypojeblair: it's not in 3.5.2 : https://docs.python.org/3.5/whatsnew/changelog.html#changelog16:21
haypojeblair: it was fixed in 3.5.316:21
haypojeblair: the question is if it's the same bug or not16:21
mordredjeblair: +2 from me16:21
jeblairhaypo: yes, and it doesn't look like the fix was backported in ubuntu16:21
haypojeblair: but i'm almost 100% that your python3 (3.5.2) has the https://bugs.python.org/issue26617 bug16:22
haypojeblair: ubuntu, haha16:22
haypojeblair: i sent a ping every 3 months to backport a fix, two people validated a test package which contained the fix16:22
haypojeblair: a year and a half later, the bug was still not fixed16:22
jeblairhaypo: we can probably apply that fix to a local package16:22
jeblairhaypo: oy :(16:22
haypojeblair: it was a critical segfault in... the garbage collector, again ;)16:22
haypojeblair: https://bugs.launchpad.net/ubuntu/+source/python3.4/+bug/1367907 opened since  2014-09-10, still not fixed16:23
openstackLaunchpad bug 1367907 in python3.4 (Ubuntu Trusty) "Segfault in gc with cyclic trash" [Undecided,In progress]16:23
haypojeblair: straightforward fix, with a very short script to reproduce the crash, it's not possible to work around the crash, and *easy* to get it using asyncio...16:23
haypojeblair: i never understood why ubuntu doesn't care about this bug16:24
mordredjeblair, haypo that patch does not seem to be in our version - I'll prepare a version that has it16:24
jeblairclarkb opened that bug 3 years and one day ago.16:24
openstackbug 3 in Launchpad itself "Custom information for each translation team" [Low,Fix released] https://launchpad.net/bugs/316:24
haypoi don't understand either why Linux distros don't upgrade to the latest 3.5.x instead of cherry-picking16:24
jeblairanything would be better than this16:24
mordredhaypo: because insanity16:24
haypomordred: while i'm not sure that https://bugs.python.org/issue26617 is your bug, it shouldn't hurt to get the fix ;)16:25
haypomordred: "because insanity" ? can you elaborate?16:25
haypoi should ask my colleagues who maintain python for RHEL, Fedora, Centos and SCL16:25
mordredhaypo: the reason they don't just upgrade to 3.5.x ... I was just being snarky16:25
mordredhaypo: the reason is that the policy for stable releases of distros is to not upgrade versions of software they have16:26
haypomordred: for python, we (python core developers) try to make sure that we don't change the behaviour in stable versions16:27
hayposorry i have to go, please ping me back if you are still stuck!16:30
jeblairhaypo: thanks!16:31
jeblairremote:   https://review.openstack.org/502499 Add log processing jobs to base-test16:33
openstackgerritDavid Shrewsbury proposed openstack-infra/zuul feature/zuulv3: DNM: tracking down test race  https://review.openstack.org/50250016:34
jeblairfungi:   https://review.openstack.org/502273  is good reading16:48
fungithanks16:48
fungiwas having trouble tracking that down for some reason16:48
fungiaha, git grep and codesearch won't do string matches on file paths ;)16:49
jeblairfungi: yeah... i wonder if we should put the names of our roles in our roles... :)16:52
*** harlowja has joined #zuul16:54
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Add node.allocated_to to node detail output  https://review.openstack.org/50249516:55
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249217:04
*** jkilpatr_ has joined #zuul17:12
openstackgerritMerged openstack-infra/nodepool master: Add SSH Host Key Verifier Strategy  https://review.openstack.org/48348517:12
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix node list output  https://review.openstack.org/50250417:13
*** hashar is now known as hasharAway17:13
*** jkilpatr has quit IRC17:14
Shrewspabelanger: care to +3 that? ^^^17:14
fungiShrews: lgtm, +317:16
Shrewsfungi: thx17:16
tobiashpabelanger, Shrews: now I found the (lengthy) discussion about the nodepool issues we discussed (node_failure when running into openstack quota and also the 'build instead of directly allocate' issue)17:16
tobiashpabelanger, Shrews: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2017-07-05.log.html#t2017-07-05T20:18:4917:16
pabelangertobiash: thanks, I'll read shortly17:20
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix node list output  https://review.openstack.org/50250417:25
openstackgerritJames E. Blair proposed openstack-infra/zuul-jobs master: DNM: test base-tests logstash stuff  https://review.openstack.org/50251517:28
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: web: add /{tenant}/status controller  https://review.openstack.org/50245317:37
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249217:40
mordreddmsimard: ^^ there is a change to the functional test in that patch that I do not understand why it's needed :(17:43
electrofelixI've noticed some problems with trying to run the zuul tox tests within Jenkins: because there is no tty, the tests are significantly slower. I can replicate this by running without a tty allocated17:44
electrofelixhas anyone seen that before and any suggestions on how to avoid the slowdown (keeps causing the tests to timeout)17:44
mordredelectrofelix: I have not seen that issue - but we haven't run anything in a jenkins in a while17:45
electrofelixmordred: based on tests I can reproduce this whenever I disable a tty allocation to run the tests within a docker container and I get the same slowdown17:54
*** bhavik1 has joined #zuul17:55
SpamapSelectrofelix: that's worth diagnosing. I'd guess it has something to do with forking and running ansible17:59
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Fix node list output  https://review.openstack.org/50250418:01
*** bhavik1 has quit IRC18:07
*** yolanda has quit IRC18:07
*** jkilpatr_ has quit IRC18:15
*** jkilpatr has joined #zuul18:16
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: web: add /{tenant}/status route  https://review.openstack.org/50245318:38
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: web: add /{source}/{project}.pem route  https://review.openstack.org/50253018:38
clarkbmordred: your devstack things appear to be failing due to missing passwords set in localrc18:46
openstackgerritMonty Taylor proposed openstack-infra/zuul-jobs master: Override tox requirments with zuul git repos  https://review.openstack.org/48971918:55
*** xinliang has quit IRC18:57
mordredShrews: so - it's about building on top of: https://review.openstack.org/#/c/491805/19:01
SpamapSanything I can do to help btw?19:02
* SpamapS may be able to review or code while y'all are socializing19:02
clarkbSpamapS: yes actually19:05
clarkbSpamapS: there is possibility we need another SRU for python319:06
clarkbSpamapS: tldr is the python3 gc problem in python3.4 on trusty that took forever to merge? well it affects python3.5 in xenial now too because ya19:06
SpamapSOHHHH19:06
SpamapSseriously?19:06
clarkbSpamapS: we are currently checking if mordred's current ppa package for python3.5 with circular gc fix fixes zuul19:06
clarkbSpamapS: so shortly we should have a good idea if that is the culprit19:06
SpamapSThat's wild.. yeah we can get it moving faster this time.19:07
SpamapSespecially if it has a reproducer19:07
clarkbSpamapS: ya I think that bug lived so long that ubuntu managed to grab xenial 3.5 python without noticing that it too needed patching19:07
clarkbwhen the bug was filed we did not have a xenial r python3.4 to worry about19:07
clarkbbut then things happened and ya19:07
clarkbI think upstream python may have a reproducer for that bug too19:07
clarkbSpamapS: https://bugs.launchpad.net/ubuntu/+source/python3.4/+bug/1367907 has links to upstream bug too19:08
openstackLaunchpad bug 1367907 in python3.4 (Ubuntu Trusty) "Segfault in gc with cyclic trash" [Undecided,In progress]19:08
mordredShrews: skip-if:all-files-match-any -> irrelevant-files - if the pattern matches job.old and project matches19:09
*** xinliang has joined #zuul19:10
*** xinliang has quit IRC19:10
*** xinliang has joined #zuul19:10
openstackgerritClark Boylan proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249219:27
*** yolanda has joined #zuul19:28
clarkbSpamapS: looking like we still segfault though19:29
clarkbso maybe more investigation needed19:29
openstackgerritClark Boylan proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249219:40
*** yolanda has quit IRC19:43
*** yolanda_ has joined #zuul19:49
*** yolanda_ is now known as yolanda19:49
*** jkilpatr has quit IRC19:56
*** kmalloc_ has joined #zuul20:16
*** ianw has quit IRC20:18
*** mattclay_ has joined #zuul20:18
*** bstinson_ has joined #zuul20:22
*** kmalloc has quit IRC20:24
*** mattclay has quit IRC20:24
*** bstinson has quit IRC20:24
*** kmalloc_ is now known as kmalloc20:24
*** mattclay_ is now known as mattclay20:24
*** _ari_ has quit IRC20:28
*** bstinson_ is now known as bstinson20:34
*** _ari_ has joined #zuul20:38
*** openstackgerrit has quit IRC20:48
*** ianw has joined #zuul21:01
*** dkranz has quit IRC21:07
Shrewspabelanger: i have solved our delay issue21:12
Shrewsor, figured out a possible reason, at least21:12
Shrewsin the zuul repo, we define 3 ubuntu-xenial nodes for a functional test. we have min-ready set to 2 for that label, so it's likely that it will always need to build a new node for that21:13
Shrewsjeblair: mordred: ^^^21:14
Shrewsoh, but i also see two xenial nodes that have been ready and unallocated for a couple of hours. seems something is likely wonky there21:17
*** hasharAway has quit IRC21:18
Shrewsboth rax-iad21:18
Shrewsoh, 3 in rax-iad21:20
*** jkilpatr has joined #zuul21:25
Shrewsi think this is intentional, due to our node locality logic21:26
Shrewsan existing ready node will only be used for a request if it is from the same provider handling the request, launched by the same pool handler, and in the same AZ21:28
Shrewsi think that logic, and our min-ready definition, are slightly at odds21:31
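[editor's note] The locality rule described above can be sketched as a simple filter. This is an illustration, not nodepool's implementation: the `Node` record, its field names, and `reusable_ready_nodes` are all hypothetical, and the provider and pool values in the demo (other than rax-iad) are made up.

```python
from collections import namedtuple

# Hypothetical node record; field names are illustrative only.
Node = namedtuple("Node", "id state provider pool az allocated_to")


def reusable_ready_nodes(nodes, provider, pool, az):
    """Return ready, unallocated nodes that match the handler's locality.

    Mirrors the rule from the discussion: same provider, same pool, same AZ.
    """
    return [
        n for n in nodes
        if n.state == "ready"
        and n.allocated_to is None
        and n.provider == provider
        and n.pool == pool
        and n.az == az
    ]


if __name__ == "__main__":
    nodes = [
        Node("0001", "ready", "rax-iad", "main", "az1", None),
        Node("0002", "ready", "rax-iad", "main", "az1", None),
    ]
    # A handler for a different provider finds nothing to reuse, so it
    # has to build, even though two ready nodes are sitting idle.
    print(reusable_ready_nodes(nodes, "other-provider", "main", "az1"))
    print(len(reusable_ready_nodes(nodes, "rax-iad", "main", "az1")))
```

Under these assumptions, a request picked up by a pool handler outside rax-iad cannot use the idle rax-iad nodes, which is one way min-ready nodes could sit unallocated while new ones are built.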
clarkbcould we short circuit request handling by attempting to fulfill from any ready nodes first21:34
clarkbthen forward the request to the launchers if unable to?21:34
Shrewsclarkb: we always attempt to fulfill from ready nodes first21:36
clarkbbut on a provider specific basis21:36
Shrewsthe launchers are the ones handling the request (1 thread per pool), and we can't control which one handles the request21:36
clarkbI'm saying bypass providers entirely at first and scan the ready nodes and hand out what is available or fall back to the provider21:37
Shrewswe would need a thing in between21:37
clarkbya21:37
Shrewsyeah, that's a whole new architecture21:37
clarkbbasically two layers of requests21:37
clarkbone that is what zuul talks to and the other an implementation detail to boot if necessary21:37
Shrewszuul only talks to ZK21:38
clarkbby submitting a request21:38
Shrewscorrect21:38
clarkbthat request would submit some inner level boot request if ready cant fulfill for some reason21:38
Shrewsand the pool threads consume that request list21:38
clarkb(I understand this is not how it works today, suggesting it may be a good way to fix it)21:39
Shrewswell, that's not a simple change, i would think  :)21:39
Shrewsbut yeah, that's certainly a way21:40
dmsimardyou guys still over in vail ?21:44
clarkbdmsimard: I left to go to tc session on onboarding new contributors21:55
*** yolanda has quit IRC21:57
tristanCclarkb: is there a way to test the crm114 filters manually? I'm trying to understand if it's possible to use it as a post job...21:58
clarkbtristanC: yes, you can run crm114 locally if you install it21:58
clarkbit's been a long time since I did it but it's basically an interpreter for programs and you feed it stdin21:58
tristanCwhat does it expect in the argument directory?21:58
clarkbtristanC: I believe that is the location to store its learned state content22:00
clarkbtristanC: so you can give it a path in /tmp and it writes to it22:00
tristanCoh I see, then how does it differentiate success from failure logs?22:00
clarkbhttps://review.openstack.org/#/c/502492/5 could use reviews as a segfault workaround22:04
tobiashtristanC: you tell it as an argument22:05
clarkbtobiash: tristanC correct, you tell it based on the result of the job22:06
clarkbthen it uses that information to learn22:06
SpamapSdmsimard: FYI, running 'sfconfig' for the first time right now.22:06
SpamapSdmsimard: it looks like I'm getting a zuulv2 ... want 322:06
dmsimardSpamapS: oi, exciting22:06
clarkbso if job fails you tell it the logs belong to failed job and if the job passed you tell it it is from a successful job22:06
dmsimardSpamapS: I'm sure tristanC can help22:07
dmsimardhe's a pro22:07
SpamapStristanC: I assume I need to change arch.yaml a bit22:07
SpamapSmaybe zuul3-* ?22:07
dmsimardSpamapS: maybe v3 isn't installed by default22:07
dmsimardI dunno22:07
dmsimardthe v3 that sf2.6 runs is effectively a snapshot, it doesn't follow the feature/zuulv3 branch closely so just be aware of that22:08
SpamapSoh I'm also getting a gerrit that I don't want, oops22:08
SpamapSdmsimard: I was figuring I'd learn how to circumvent that with git ASAP :)22:08
tristanCSpamapS: gerrit is still needed because it's the default place to store the config repo, we may make it optional in further release22:08
dmsimardtristanC: ah I guess with v3 we could host it in github22:09
tristanCSpamapS: yes, you need to add zuul3-scheduler, zuul3-executor and nodepool3-launcher in the arch22:09
SpamapStristanC: Can't I just wrestle control of zuul config to put it somewhere else?22:09
tristanCSpamapS: the trick is that the config-update job expects to clone that repository from the local gerrit22:10
SpamapSI dunno what that means22:10
tristanCSpamapS: when a change is merged in that config repo, e.g. adding a new project to the zuulv3 main.yaml, then the config-update will apply that change to the zuul configuration and reload the service22:11
dmsimardSpamapS: software factory has a 'config' repo which is akin to upstream project-config where the job, zuul and nodepool config lives22:12
dmsimardit self hosts that repository inside gerrit22:12
dmsimardbut could be made to self host it on github with v3 later22:12
dmsimardor something else22:12
SpamapSk22:13
SpamapSso now I tried editing arch.yaml and it's telling me there's a conflict because I asked for zuul and zuul3 but I only have zuul3 stuff now22:14
SpamapS:_P22:14
tristanCclarkb: tobiash: alright it makes more sense, so basically "cat success.log | ./classify-log.crm data/ SUCCESS"; and then use the FAILED argument to process failed log22:14
*** harlowja has quit IRC22:14
tristanCSpamapS: you should use the master version, in 2.6 zuul3 needed a specific node22:14
tristanCSpamapS: install softwarefactory-project.io/repos/sf-release-master.rpm and yum update -y22:15
SpamapSwee22:15
clarkbtristanC: yup22:15
pabelangerSpamapS: Oh, awesome22:16
tristanCthen to run crm114 as a post-job it will need a way to retrieve previous success logs22:17
pabelangerSpamapS: oops, should have been Shrews22:17
SpamapStristanC: thanks, it's happening now22:20
SpamapSI have dueling Zuul installs here22:20
SpamapSSF on right, BonnyCI on left22:20
SpamapSbut I think SF is going to win out22:20
SpamapSsince Bonny still doesn't know how to find python3 on CentOS22:20
SpamapStristanC: looks like I still need jenkins?22:21
* SpamapS removed it thinking it could be eliminated22:21
dmsimardSpamapS: we don't know how to find py3 on centos either, tell us when you find it22:21
* dmsimard hides22:21
SpamapS:)22:22
*** yolanda has joined #zuul22:22
tristanCSpamapS: hum, after installing the master version, you should probably use "sfconfig --upgrade" to make sure any configuration changes are updated22:22
* SpamapS checks under the rock marked "Disable SELinux" where he usually finds hard centos problem solutions.22:23
SpamapStristanC: k, it says managesf needs the jenkins user22:23
SpamapS"TASK [sf-gerrit : Add jenkins user (for zuul-launcher and managesf)] *********************************************************************************************************************************************************************************************************22:23
SpamapSso I assume since i still have 'managesf' that it's still needed.22:24
tristanCSpamapS: yes you could remove jenkins, but when doing so you have to set "zuul_jenkins_credentials: False" in /etc/software-factory/custom-vars.yaml22:24
clarkbmordred: jeblair rereading things, haypo seemed pretty convinced the bug is the one that I filed with ubuntu 3 years ago22:24
clarkbmordred: jeblair so looking at ze01 3.5.2-2ubuntu0~16.04.3 is what we have installed. Does that contain mordred's fix?22:24
tristanCSpamapS:  that's because zuul-launcher uses jenkins_rsa key and "Jenkins CI" user in gerrit, but the service can be removed22:25
clarkbI'm testing with the simple reproducer22:25
tristanCSpamapS: well there is work in progress to run sf with selinux enforcing + new policies for zuul, this is type enforcement file I'm working on for zuul's domain: paste.openstack.org/show/620869/22:26
SpamapStristanC: ok ... I'll deal with having a useless jenkins for now :)22:27
clarkb3.5.2-2ubuntu0~16.04.2~openstackinfra is what is installed on ze03; fwiw, note that the ~openstackinfra suffix is gone on ze01. But neither fails on the crash.py script in the original bug22:27
SpamapShrm22:30
SpamapS"+ ssh gerrit gerrit create-account jenkins -g '\"Non-Interactive' 'Users\"' --email jenkins@factory0.cloud.phx3.gdg --full-name '\"Jenkins' 'CI\"' --ssh-key -", "fatal: invalid email address"],22:30
haypoclarkb: which bug? (url?)22:30
clarkbhaypo: the one you looked at earlier on storyboard and pointed to https://bugs.python.org/issue2661722:31
tristanCSpamapS: arg, that may be a gerrit thing where it's a bit sensitive with valid email address22:31
clarkbhaypo: that has a crash.py script unsure if expected to crash 100% though22:31
SpamapStristanC: doesn't like my pretend tld? ;-)22:31
SpamapStristanC: I'm trying your "zuul_jenkins_credentials: false" suggestion22:32
clarkbhaypo: mordred applied the patch from that to ubuntu python3.5 and we are still getting segfaults, though I guess no one has looked at a core dump from the current segfaults, so they could be unrelated?22:32
clarkbjeblair: mordred ^ I've got to do docs thing now on docs retention22:32
haypoclarkb: ah https://bugs.python.org/issue26617 -- well, use Python 3.5.4 :-)22:32
dmsimardSpamapS: you're not at the ptg, are you ?22:32
clarkbhaypo: well I'm saying the reproducer doesn't reproduce22:32
haypoclarkb: 3.5.4 fixes also a security issue ;-)22:32
clarkbhaypo: oh nice22:32
clarkbmordred: jeblair ^ maybe we just need to build a 3.5.4?22:33
clarkbbut I doubt that gets SRU'd22:33
tristanCSpamapS: it seems like your fqdn needs to match the "TLD by Apache Commons Validator v1.4.1", so probably uses a more common tld indeed22:33
SpamapStristanC: fun22:33
haypoclarkb: the test is a little bit different than the attached reproducer, https://github.com/python/cpython/pull/2695/files22:33
tristanCSpamapS: that's a good reason to work on making it optional :-)22:34
SpamapSmy fqdn == local 'hostname -f' output .. so I'm wondering what it thinks is invalid22:34
haypoclarkb: maybe you are testing a python3 which contains the fix? i didn't follow the discussion. it's getting late here, have to go ;-)22:34
clarkbhaypo: ya I tried to test the one with and one without. But I'll look at the one in the test and see if that behaves differently when I get a chance22:34
haypoclarkb: someone said that he/she will include the fix in your python3 package22:35
clarkbya mordred did that, but its only on a subset of our nodes so I should be able to a/b test it22:35
haypoclarkb: ok22:36
SpamapStristanC: looks like zuul_jenkins_credentials is not used to decide whether or not to do that part.22:36
clarkbactually 26617 looks newer than the one I filed /me looks more closely22:36
clarkbya 21435 was my thing ok22:37
clarkb(now I am slightly less confused on the situation)22:37
SpamapStristanC: also, credentials is the english spelling. Not sure what the standard is in software factory. ;)22:37
haypoclarkb: https://bugs.python.org/issue21435 is https://github.com/python/cpython/commit/5fbc7b12f776109678dc34fdb49b420750a3e5ff and was fixed in Python 3.4.1 and 3.5.022:39
haypoclarkb: we are talking about python 3.5.222:39
clarkbhaypo: thank you for clarifying that. Saw the link to my 3 year old launchpad bug and got confused22:39
clarkbso that one should definitely be fixed in python3.5 on ubuntu22:40
clarkbquestion is whether or not that other, 26617, is22:40
clarkbhaypo: was your reproducer 100% deterministic?22:40
haypoclarkb: i don't recall, sorry. i don't know if you have the same bug. it can also be a bug in any third party C extensions22:40
haypoclarkb: PYTHONMALLOC=debug of Python 3.6 may help here, if you get access to python 3.622:41
SpamapSthat's an option btw22:44
SpamapSwe could just try python3.622:44
SpamapSbackported from artful22:44
clarkbif only to confirm bug is gone22:45
SpamapSright22:45
clarkbmay not be terrible idea22:45
SpamapSif you have a decently small reproducer you could even bisect22:45
SpamapSpython will build relatively quickly without a 'make clean' in between22:46
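Editor's note: the PYTHONMALLOC=debug suggestion above can be tried from a small wrapper. This is a minimal sketch, assuming a Python 3.6 or newer interpreter is available; the helper names `run_with_debug_malloc` and `crashed_by_signal` are hypothetical and are not part of Zuul or Ansible.

```python
import os
import subprocess
import sys


def run_with_debug_malloc(args, python=sys.executable):
    """Run `python args...` with debug hooks installed on the allocators.

    PYTHONMALLOC=debug is honoured by Python 3.6+; `python` is assumed to
    point at such an interpreter.
    """
    env = dict(os.environ, PYTHONMALLOC="debug")
    # -X faulthandler dumps the Python traceback if the process segfaults.
    cmd = [python, "-X", "faulthandler"] + list(args)
    return subprocess.run(
        cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )


def crashed_by_signal(result):
    """On POSIX, a negative return code means the child died from a signal."""
    return result.returncode < 0
```

Pointing `args` at a reproducer script and checking `crashed_by_signal` would give a quick pass/fail signal per interpreter build, which is the kind of check a bisect needs.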
dmsimardSpamapS: credencials is a typo and should be fixed ;)22:46
*** harlowja has joined #zuul22:50
SpamapSthis is pretty annoying now I got it to stop making jenkins gerrit users22:52
SpamapSbut it still needs a zuul gerrit user22:52
*** harlowja has quit IRC23:06
clarkbI have successfully run haypos test case on ze01 without a segfault and run it on ze03 with a segfault. 01 has mordreds new python and 03 doesn't (I think)23:08
clarkbso I think any current segfaults are either not that bug or we are not running the python version we expect in bubblewrap or somewhere23:09
clarkbwe don't fork python/ansible in such a way that it wouldn't load up the newer python do we?23:09
SpamapSbubblewrap mounts /usr from the host23:10
SpamapSso would be hard to get another version of python in the way23:11
clarkbwell I know we didn't restart the zuul executor23:11
clarkband maybe we misunderstand where the segfault is happening? I dunno23:11
SpamapSthat's entireyl possible23:11
SpamapSas is spehlinge23:11
* clarkb makes a paste23:14
SpamapStristanC: failing on SSO stuff :-P23:14
clarkbhttp://paste.openstack.org/show/620893/ is derived from haypos test case and that does fail on 03 but not 01 when run under python3.423:14
clarkber 3.523:14
*** toabctl has quit IRC23:15
*** jkilpatr has quit IRC23:16
*** harlowja has joined #zuul23:17
clarkbI think it would be good if someone could grab a current core and if we restarted services just to be sure new python is in use everywhere23:25
ianwhaypo: i had a little more of a look in https://storyboard.openstack.org/#!/story/2001186 ; i'm not seeing obvious connections to weak references23:26
clarkbianw: ya I'm beginning to suspect it's a different segfault based on my testing23:26
clarkbianw: since haypos problem is fixed on ze01 with haypos thing patched in23:26
ianwhaypo: in summary, seems related to an invalid gi_frame in a generator object ... if that rings any bells23:27
*** jkilpatr has joined #zuul23:29
clarkbianw: https://review.openstack.org/#/c/502492/5 avoids using a generator23:30
clarkbwhich is another thing to try to see if problem goes away23:30
clarkbI expect that generator runs quite a bit because its reading 4096 bytes at a time or so and our streamed logs can be many megabytes23:30
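Editor's note: the two reading styles being compared can be sketched as follows. This is an illustration only, not the actual zuul_stream code from change 502492; the function names are hypothetical, and the 4096-byte chunk size is taken from the message above.

```python
import socket
import threading

CHUNK = 4096  # chunk size mentioned in the discussion


def read_chunks(sock, size=CHUNK):
    """Generator variant: yield each chunk until the peer closes."""
    while True:
        data = sock.recv(size)
        if not data:
            return
        yield data


def read_all(sock, size=CHUNK):
    """Loop variant: the same reads, with no generator object involved."""
    parts = []
    while True:
        data = sock.recv(size)
        if not data:
            break
        parts.append(data)
    return b"".join(parts)


def _send_and_close(sock, payload):
    sock.sendall(payload)
    sock.close()


def demo(payload):
    """Feed the same payload through both variants over a local socket pair."""
    results = []
    for reader in (lambda s: b"".join(read_chunks(s)), read_all):
        writer_end, reader_end = socket.socketpair()
        sender = threading.Thread(
            target=_send_and_close, args=(writer_end, payload)
        )
        sender.start()
        results.append(reader(reader_end))
        sender.join()
        reader_end.close()
    return results
```

Both variants return identical bytes; the difference is only that the second never creates a generator frame, which is the property the proposed change relies on to sidestep the suspected gi_frame problem.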
mordredjeblair: https://review.openstack.org/#/c/502492/23:32
mordredclarkb: where are you?23:32
clarkbmordred: in the docs retention session23:32
clarkbdhellmann asked I be here for this months ago23:32
mordredclarkb: kk23:33
mordredclarkb: mostly just curious23:33
clarkbatrium in steamboat is the room iirc23:33
*** openstackgerrit has joined #zuul23:39
openstackgerritPaul Belanger proposed openstack-infra/zuul feature/zuulv3: Rename success ansible variable to zuul_success  https://review.openstack.org/50286323:39
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249223:40
clarkbmordred: ^ that failed due to merge conflict?23:41
openstackgerritMonty Taylor proposed openstack-infra/zuul feature/zuulv3: Shift zuul_stream socket reading to use a for loop  https://review.openstack.org/50249223:42
mordredclarkb: yah23:47
