Wednesday, 2019-04-24

*** jangutter has quit IRC00:22
*** jangutter has joined #zuul00:30
*** jangutter has quit IRC00:40
*** jangutter has joined #zuul00:46
*** jamesmcarthur has joined #zuul00:52
*** jamesmcarthur has quit IRC00:54
*** jamesmcarthur has joined #zuul00:58
tristanCclarkb: fwiw we have been running with 41b6b0ea33 picked on 3.6.1 without leak, here is our memory graph: https://softwarefactory-project.io/grafana/?panelId=12054&fullscreen&orgId=1&var-datasource=default&var-server=zs.softwarefactory-project.io&var-inter=$__auto_interval_inter&from=now%2FM&to=now01:01
clarkbtristanC: that is good to know thank you01:02
clarkbso probably leaking somewhere else01:02
tristanCon the other hand, we are also running zuul-3.8.0/nodepool-3.5.0 without noticeable leak, perhaps this is because of the software-collection provided python-3.5.101:11
*** jamesmcarthur has quit IRC01:14
*** rfolco has joined #zuul01:15
*** jamesmcarthur_ has joined #zuul01:18
*** jamesmcarthur_ has quit IRC01:34
*** jamesmcarthur has joined #zuul02:05
*** jamesmcarthur has quit IRC02:12
*** jamesmcarthur has joined #zuul02:18
*** bhavikdbavishi has joined #zuul02:26
*** bhavikdbavishi has quit IRC02:35
openstackgerritTristan Cacqueray proposed zuul/zuul master: web: add OpenAPI documentation  https://review.opendev.org/53554102:39
*** bhavikdbavishi has joined #zuul03:19
*** bhavikdbavishi1 has joined #zuul03:22
*** bhavikdbavishi has quit IRC03:23
*** bhavikdbavishi1 is now known as bhavikdbavishi03:23
*** jamesmcarthur has quit IRC04:17
*** jamesmcarthur has joined #zuul04:18
*** jamesmcarthur has quit IRC04:23
*** jamesmcarthur has joined #zuul04:29
*** jamesmcarthur has quit IRC04:45
*** jamesmcarthur has joined #zuul04:45
*** jamesmcarthur has quit IRC04:50
*** jamesmcarthur has joined #zuul04:50
*** jamesmcarthur has quit IRC04:54
*** bjackman has joined #zuul05:03
*** jamesmcarthur has joined #zuul05:06
*** jamesmcarthur has quit IRC05:11
*** quiquell|off is now known as quiquell|rover05:17
*** jamesmcarthur has joined #zuul05:17
*** jamesmcarthur has quit IRC05:22
*** jamesmcarthur has joined #zuul05:25
*** jamesmcarthur has quit IRC05:32
*** bjackman has quit IRC05:32
*** bjackman has joined #zuul05:35
*** bjackman_ has joined #zuul05:38
*** bjackman has quit IRC05:41
*** electrofelix has joined #zuul06:11
*** jamesmcarthur has joined #zuul06:11
*** bhavikdbavishi1 has joined #zuul06:15
*** jamesmcarthur has quit IRC06:15
*** bhavikdbavishi has quit IRC06:16
*** bhavikdbavishi1 is now known as bhavikdbavishi06:16
*** pcaruana has joined #zuul06:24
*** jvv_ has quit IRC06:25
*** homeski has joined #zuul06:27
*** quiquell|rover is now known as quique|rover|brb06:54
*** gtema has joined #zuul06:57
*** themroc has joined #zuul07:06
*** saneax has joined #zuul07:09
*** quique|rover|brb is now known as quiquell|rover07:14
*** jpena|off is now known as jpena07:48
openstackgerritFabien Boucher proposed zuul/zuul master: web: add build modal with a parameter form  https://review.opendev.org/64448508:07
bjackman_tristanC, who should I pester to get a Workflow +1 on https://review.opendev.org/#/c/649900/ ?08:24
tristanCbjackman_: i would but i'm not a zuul maintainer... and we need a couple of CR+2 before +Workflow08:29
bjackman_tristanC, sure, just wondering who is the right person?08:43
bjackman_mordred, are you a maintainer?08:44
tristanCbjackman_: iirc zuul-maint should ping the right people (to ask for review on https://review.opendev.org/#/c/649900/ )08:44
bjackman_tristanC, ah cheers!08:44
bjackman_zuul-maint: (ping in case the above message has not already done it)08:45
*** sshnaidm|afk is now known as sshnaidm|09:16
*** sshnaidm| is now known as sshnaidm09:16
openstackgerritTristan Cacqueray proposed zuul/zuul master: web: add support for checkbox and list parameters  https://review.opendev.org/64866109:24
openstackgerritTristan Cacqueray proposed zuul/zuul master: web: add support for checkbox and list parameters  https://review.opendev.org/64866109:25
*** mrhillsman has quit IRC09:55
*** mrhillsman has joined #zuul09:58
*** bhavikdbavishi has quit IRC10:00
openstackgerritFabien Boucher proposed zuul/zuul master: Pagure driver - https://pagure.io/pagure/  https://review.opendev.org/60440410:13
*** gtema has quit IRC10:53
*** jpena is now known as jpena|lunch11:18
*** gtema has joined #zuul11:27
*** bhavikdbavishi has joined #zuul11:34
*** gtema has quit IRC11:42
*** bhavikdbavishi1 has joined #zuul11:50
*** bhavikdbavishi has quit IRC11:50
*** bhavikdbavishi1 is now known as bhavikdbavishi11:50
*** quiquell|rover is now known as quique|rover|eat12:00
*** rlandy has joined #zuul12:01
*** bhavikdbavishi has quit IRC12:01
*** rlandy is now known as rlandy|ruck12:02
*** jamesmcarthur has joined #zuul12:10
*** bhavikdbavishi has joined #zuul12:14
*** jamesmcarthur has quit IRC12:15
*** jamesmcarthur has joined #zuul12:16
*** themroc has quit IRC12:21
*** jamesmcarthur has quit IRC12:32
*** quique|rover|eat is now known as quiquell|rover12:41
*** jamesmcarthur has joined #zuul12:44
*** irclogbot_3 has quit IRC12:55
*** irclogbot_3 has joined #zuul12:57
*** altlogbot_1 has quit IRC12:57
openstackgerritMonty Taylor proposed zuul/zuul master: Update references for opendev  https://review.opendev.org/65423812:59
*** altlogbot_0 has joined #zuul12:59
*** bjackman_ has quit IRC12:59
*** bjackman_ has joined #zuul13:01
pabelangerzuul-build-image looks to be broken: http://logs.openstack.org/97/653497/2/check/zuul-build-image/701b7ac/job-output.txt.gz#_2019-04-24_12_37_11_93639213:03
pabelangerdoes anybody know the status of the intermediate registry? I know we've had a few issues over the last few weeks13:04
openstackgerritSean McGinnis proposed zuul/zuul-jobs master: ensure-twine: Don't install --user if running in venv  https://review.opendev.org/65524113:09
*** rlandy|ruck is now known as rlandy|ruck|mtg13:11
*** rfolco is now known as rfolco_sick13:18
openstackgerritSean McGinnis proposed zuul/zuul-jobs master: ensure-twine: Don't install --user if running in venv  https://review.opendev.org/65524113:22
*** bhavikdbavishi has quit IRC13:23
clarkbpabelanger: mordred was converting it to swift backed storage. That may need a followup13:28
pabelangermordred: happen to have an idea what is happening? Getting an auth failure13:34
pabelangershould we make the job non-voting until the infra is better? I'm unsure of the bandwidth people have before travel13:35
*** jpena|lunch is now known as jpena13:40
clarkb13:41
clarkber13:41
clarkbcorvus also updated the cred on the registry. Maybe that didnt get updated in the secrets?13:41
openstackgerritSean McGinnis proposed zuul/zuul-jobs master: Add environment debugging to ensure-twine role  https://review.opendev.org/65543713:44
mordredclarkb: I updated the secret13:45
mordredclarkb: yesterday ... but maybe I missed something somewhere?13:46
mordredclarkb: https://review.opendev.org/#/c/655228/ fwiw13:46
*** rfolco_sick is now known as rfolco13:52
openstackgerritSean McGinnis proposed zuul/zuul-jobs master: Add environment debugging to ensure-twine role  https://review.opendev.org/65543714:00
*** jamesmcarthur has quit IRC14:00
*** jamesmcarthur has joined #zuul14:00
pabelangerShrews: did you get the changes you wanted landed for opendev nodepool?14:17
Shrewspabelanger: nope. i have  no idea what's happening with the gate jobs (they pass in check)14:19
Shrewsthe failures seem quite exquisitely random, such as: http://logs.openstack.org/62/654462/3/gate/puppet-beaker-rspec-infra/e743f98/job-output.txt.gz#_2019-04-24_00_32_35_44564214:23
pabelangerShrews: that usually happens when another node comes online in openstack provider with same IP14:24
clarkbwe spent a good chunk (all of it for me basically) debugging those networking issues14:24
Shrewsthe other fail was attempting to contact mirror.mtl01.inap.openstack.org14:25
clarkbseems there were maybe multiple things. Reused IPs in inap and maybe unhappy executor nodes or cloud networking14:25
AJaegerteam, do we want to merge smcginnis' change https://review.opendev.org/#/c/655437 to debug the ensure-twine role?14:25
clarkbin any case mgagne fixed inap and we rebooted the executors and it is happier now14:25
* Shrews rechecks14:26
corvusfungi: i approved https://review.opendev.org/65543714:28
smcginnisThanks14:30
fungithanks corvus!14:30
fungii've run down yet more avenues to attempt to figure out what changed in our environment to start exposing that14:31
funginone of the system packages updated in the relevant timeframe look to be likely culprits for such a behavior shift14:31
fungiand unlike i had assumed, we're not actually automatically upgrading our ansible interpreters on our executors either14:32
fungiso they're still running the versions of ansible which were current at the time those virtualenvs were first created14:32
smcginnisWith 655437 merged, should I do another release-test release? Or is it enough to reenqueue one of the failed ones?14:32
fungishould be fine for me to reenqueue one, i'll get details from you in another channel so we don't make too much unrelated noise in here14:33
fungioh, actually, i can look one up easily14:33
fungiyay zuul builds dashboard14:33
fungiet voila: http://zuul.opendev.org/t/openstack/builds/?job_name=release-openstack-python&project=openstack%2Frelease-test14:35
corvusfungi: ah, i'm surprised we aren't upgrading ansible too14:35
fungicorvus: yeah, i think that's something for us to look into soon14:35
*** electrofelix has quit IRC14:37
fungioh, hah, getting ahead of myself. i reenqueued that test tag but 655437 hasn't merged yet, so ignore the useless failure we'll probably get from it14:37
smcginnisOops, a few minutes yet in gate. :)14:38
fungiyeah, no worries, i have e-mails to read still14:38
clarkbfungi: you were able to confirm we didn't upgrade ansible? when I looked the file timestamps were newer than the most recent ansible 2.7 release14:39
fungiwhich file timestamps? i ran `/usr/lib/zuul/ansible/2.7/bin/ansible --version` and also checked the last modified time on that executable14:40
fungi(and did the same for 2.6 and 2.5 as well)14:40
clarkbthe timestamps on the files in /var/lib/zuul/ansible/2.714:40
clarkbthere were some from april 2014:41
fungioh, those are apparently where we *download* newer ansible, but not where we *install* it14:41
clarkbah14:41
fungiit's /usr vs /var (i didn't catch that my first time through either and corvus had to point it out)14:42
openstackgerritMerged zuul/zuul-jobs master: Add environment debugging to ensure-twine role  https://review.opendev.org/65543714:42
fungithough it's not clear to me why we have unpacked ansible installs in /var when we're installing them into virtualenvs in /usr14:43
fungii need to take a closer look at the ansible manager probably to understand the reasons14:43
*** jamesmcarthur has quit IRC14:48
*** quiquell|rover is now known as quiquell|off14:49
corvusfungi: i think the zuul modules overrides are in /var14:50
clarkbin response to the trouble we had debugging connection issues yesterday cloudnull wrote an ansible callback plugin to run a traceroute on connection failure14:52
clarkbthat seems like something that we may want to install in zuul's ansible by default?14:52
clarkbcurrently the code is up on a gist but I can see about having cloudnull push that to zuul if we think that would be useful14:54
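The callback plugin clarkb mentions lives in cloudnull's gist and is not reproduced here; the sketch below only illustrates the general shape such a plugin could take, using Ansible's standard callback API. The plugin name, traceroute flags, and timeout are assumptions, not the gist's contents.

```python
# Hypothetical sketch of a "traceroute on connection failure" callback
# plugin; drop it into a callback_plugins/ directory to enable it.
import subprocess

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    """Run a traceroute toward any host Ansible reports as unreachable."""

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'traceroute_on_unreachable'  # assumed name

    def v2_runner_on_unreachable(self, result):
        host = result._host.address or result._host.get_name()
        try:
            out = subprocess.run(
                ['traceroute', '-n', host],
                stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                universal_newlines=True, timeout=60)
            self._display.display(
                'traceroute to %s:\n%s' % (host, out.stdout))
        except Exception as exc:  # diagnostics must never break the run
            self._display.warning('traceroute to %s failed: %s' % (host, exc))
```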
fungismcginnis: "Invalid options for debug: real_prefix,VIRTUAL_ENV,path,version,prefix,base_exec_prefix,executable" http://logs.openstack.org/59/59ce65e9e66fe3ea203b77812ee14e69ebdb192a/release/release-openstack-python/abaae0d/14:57
openstackgerritSean McGinnis proposed zuul/zuul-jobs master: Correct debug statement for ensure-twine environment  https://review.opendev.org/65546614:57
smcginnisfungi: Sorry, I messed that up.14:57
* smcginnis needs to spend more time with ansible14:57
*** pcaruana has quit IRC15:06
*** jamesmcarthur has joined #zuul15:07
*** rlandy|ruck|mtg is now known as rlandy15:16
*** rfolco is now known as rfolco|ruck15:19
pabelangerfungi: so, if we used twine to upload zuul-3.8.1.dev2.tar.gz to pypi, are there any flags we need to pass to make it a pre-release? Eg: if somebody did pip install zuul, would 3.7.0 be selected or 3.8.0.dev2?15:19
pabelangerI also _think_ our current release-zuul-python job should work15:19
pabelangertry to find logs from last release15:19
fungipypi and pip figure that out on their own, courtesy of pep 44015:20
pabelangergreat15:20
fungilike if you look at https://pypi.org/project/ansible/#history those "pre-release" tags are added automatically based on there being an "a" or "b" or "rc" in the name, and "dev" should do the same15:21
pabelangerfungi: yup! 2.7.0.dev015:22
pabelangerthanks for link15:22
*** sshnaidm is now known as sshnaidm|off15:23
webknjaz> zuul-3.8.0.tar.gz (10.3 MB)15:24
webknjazwhat do you put there? 0_o15:24
fungipabelanger: indeed, i confirmed on one of my own projects where i've been uploading dev versions and it does mark them as pre-release15:24
webknjazI can confirm that too15:24
webknjazI'm actually the one who requested that label on the UI :)15:25
fungiawesome!15:25
fungiand yeah, pypi/warehouse lists the most recent non-pre-release version by default too15:26
webknjazFYI I've found a good way of keeping the release pipeline healthy15:26
webknjazpublish every commit in master to `test.pypi.org`15:26
webknjazwe do this in ansible-lint and molecule15:26
pabelangerfungi: so, looking at the last release-zuul-job, I don't see it renaming files after sdist / bdist_wheel. So I think we could add it to post (or promote?) and publishing should just work15:27
webknjazfacilitated by `setuptools-scm` in particular15:27
pabelangerhttp://logs.openstack.org/7b/7b17aa534d193743f5307f4b399fa1690f72da6d/release/release-zuul-python/aabfa42/job-output.txt.gz15:27
webknjazrenaming dist? don't do that manually!15:27
fungipabelanger: probably not promote since that would run on the change ref rather than the merged state15:29
fungii think post is what you want15:29
webknjaz> if somebody did pip install zuul15:29
webknjazpre-release install would require `--pre` CLI arg15:29
pabelangerwebknjaz: yup, I mostly didn't want humans getting dev releases by default15:30
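The PEP 440 behaviour fungi and webknjaz describe is easy to verify locally with the `packaging` library, the same version-handling implementation pip vendors; a minimal check:

```python
# Quick check of the PEP 440 pre-release handling discussed above.
from packaging.version import Version

for v in ('3.7.0', '3.8.0', '3.8.1.dev2', '2.7.0.dev0', '3.8.0rc1'):
    print(v, Version(v).is_prerelease)

# 3.7.0 False, 3.8.0 False, 3.8.1.dev2 True, 2.7.0.dev0 True, 3.8.0rc1 True.
# pip only considers the True ones when run with --pre, or when an exact
# pre-release version is requested explicitly.
```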
webknjazpabelanger: what renaming were you talking about?15:31
openstackgerritMerged zuul/zuul-jobs master: Correct debug statement for ensure-twine environment  https://review.opendev.org/65546615:31
openstackgerritPaul Belanger proposed zuul/zuul master: Add release-zuul-python to post pipeline  https://review.opendev.org/65547415:32
pabelangerwebknjaz: some of our puppet jobs would rename things before uploading. I was confusing that with our python jobs.15:33
pabelangercorvus: fungi: I _think_ ^ is all we need to do15:33
pabelangerI'll check nodepool release job shortly15:34
webknjazpabelanger: nowadays it probably makes sense to move to PEP 517 entrypoint for building Python dists15:35
fungiwebknjaz: yeah, i'm not sure we've thought through the implications of pep 517 on pbr, but it's possible it "just works" at this point15:36
fungiworth trying at any rate15:36
pabelangeryah, will defer to other humans on pypi / pep things :)15:37
fungiwebknjaz: also a number of us dislike toml on principle, and the idea of carrying toml files in our git repositories is undesirable so may need a bit of a workaround15:37
webknjazthat's a standard now and it's pretty good15:38
webknjaznot sure about pbr though15:38
fungimore that using toml in any way could be seen as an endorsement of its creator15:38
webknjazif you use editable installs, you'd need to use `pip install --no-build-isolation -e .`15:38
webknjazthere's a lot of tools using `pyproject.toml`, it's already endorsed. And it's def better than throwing install-time invoked code at users15:39
mordredwebknjaz: yeah - I know. but just because other projects have adopted it doesn't mean we're necessarily excited to do so, when the reasoning behind the creation of toml in the first place is so gross. it basically makes many of us quite sad and annoyed, so we haven't really put any energy into figuring out what the pbr interactions with it will be or need to be15:43
fungii guess it's a question of whether install-time code generation is better than supporting misogyny15:44
mordredwe'll clearly have to get over it at some point15:44
webknjazI like the reasoning behind toml15:44
mordredI don't. it's "I can't be bothered to read the yaml spec"15:44
mordredand that's the type of lazy and narcissistic behavior that is ruining our industry15:44
webknjazi'm more of "the implementations of other formats are broken"15:45
fungi"...so let's create yet another instead of improving one"15:45
mordredyeah. can't be bothered to write a better impl of an existing widely used format15:45
Shrewspabelanger: looks like my change merged while i was at the doc. just waiting for puppet to do its thing15:45
webknjaztoml is stricter15:45
pabelangerShrews: yay15:45
mordredthe only reason I touch, or ever will touch, toml is when I am forced to do so15:46
fungibut anyway, my main concern with toml remains that tpl is a detestable human and i'd rather not be associated with him in any way i can avoid15:46
mordredyeah. that too15:46
*** pcaruana has joined #zuul15:46
mordredand that the file format is named after himself is yet another great example of runaway ego15:47
mordredbut - as I said - at some point it will become unavoidable in python just like it is in rust15:47
mordredand we'll be forced to interact with it15:47
fungisome rudimentary "research" was performed in the course of dstufft's choice of toml for pep 517, in which (paraphrasing) "we asked some of the women working in the python community if they had any concerns and none of them spoke up"15:47
pabelangerouch15:48
webknjazI saw better explanations of why it's been chosen for PEP 51715:49
webknjazsomewhere outside that pep there's a gist with that15:49
mordredthe reason was that pyyaml was too hard to vendor, since it depends on libyaml15:50
webknjazwell15:50
pabelangermordred: so, https://github.com/ansible/ansible-zuul-jobs/blob/master/playbooks/ansible-network-vyos-base/files/bootstrap.yaml is how we are doing per job keys for the vyos network appliance (debian based), mostly because vyos_config is blocked in zuul. I know we have a discussion going on the ML to drop blacklisting of action modules in zuul, but I also cannot remember the reason for dropping ansible network modules by15:50
pabelangerdefault15:50
webknjazI'll tell you more15:50
mordredand for reasons surpassing understanding there is still no yaml support in the stdlib15:50
webknjazpyyaml is "maintained" by a Perl guy15:50
mordredyeah15:50
webknjazwho doesn't like to accept any help15:50
webknjazruamel is not much better either15:51
webknjazthere's no toml in stdlib either, after all...15:51
mordredyeah. but it's written in such a way that the pip people can vendor it15:51
mordredwhich is how they solve dependencies15:52
webknjazright15:52
mordredeven though it's an atrocious practice15:52
webknjazI know15:52
*** bjackman_ has quit IRC15:52
webknjazthis practice is fine, because it solves chicken/egg in a way15:52
mordredbut even hating toml - I'd argue that python stdlib should carry both yaml and toml support - as they are pretty standard15:52
SpamapSI'm less angry about TOML, and more frustrated at the yaml libs being poorly maintained.15:53
webknjazI've tried contributing to pyyaml so many times15:53
SpamapSYAML has eaten the world.15:53
mordredyeah. yaml is pretty darned important15:53
SpamapSwebknjaz:yeah I bounced off of it once too.15:53
webknjazand one other Ansible Core Dev does that too15:53
webknjazwhen we realised that they are going to trash one of Ansible's main deps - the only way to save it is to talk some sense into the maintainers...15:54
fungiwell, pyyaml is virtually unmaintained, so there's plenty of room for someone to write a from-scratch reimplementation suitable for inclusion in the python stdlib (and no ruamel.yaml is not it, the entire incestuous ruamel dependency set makes me not want to depend on it in my projects)15:57
mordredfungi: ++15:58
fungii expect a lot of projects would jump at the chance to swap out pyyaml for a slimmer and better-maintained alternative, even if it didn't maintain bug-compatibility with pyyaml15:59
fungior, heck, even if it didn't mirror the pyyaml api15:59
fungii also feel like it wouldn't need to have optional libyaml support, could just be pure-python-only16:00
fungiand at this stage, could probably focus entirely on python>=3.516:00
corvusmordred, tobiash, fungi, AJaeger: smcginnis got this: http://logs.openstack.org/59/59ce65e9e66fe3ea203b77812ee14e69ebdb192a/release/release-openstack-python/781edd8/ara-report/result/a6f1d958-2d17-4b8a-a6bf-97cb1bf6e31d/16:00
corvuswhich shows that the pip install task running on localhost (ie, the executor) is running in the zuul managed ansible virtualenv16:01
smcginnisAJaeger: pointed out the failed one was "Host:localhost" and the one that passed was "Host: ubuntu-bionic".16:01
fungiyeah, the latter ran on a dedicated job node rather than the executor16:01
corvusthis makes for awkwardness -- we're trying to run "pip install --user" which will now work on remote hosts, but not on localhost due to the venv16:02
corvuspabelanger: ^ ping you too16:02
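For reference, the condition the debug patch was trying to surface — whether the interpreter a localhost task resolves is actually sitting inside a virtualenv such as Zuul's managed Ansible venv — boils down to a couple of interpreter attributes. A minimal illustration, not the ensure-twine role's code:

```python
# Is the interpreter this task resolved running inside a virtualenv
# (e.g. /usr/lib/zuul/ansible/2.7) rather than the system python?
import os
import sys


def in_virtualenv():
    # classic virtualenv sets sys.real_prefix; venv (PEP 405) makes
    # sys.base_prefix differ from sys.prefix; activation also exports
    # VIRTUAL_ENV into the environment.
    return (
        hasattr(sys, 'real_prefix')
        or getattr(sys, 'base_prefix', sys.prefix) != sys.prefix
        or bool(os.environ.get('VIRTUAL_ENV'))
    )


if __name__ == '__main__':
    print('executable:', sys.executable)
    print('in virtualenv:', in_virtualenv())
```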
fungibut what's really weird here is that this job was working fine until a week ago, and we can't find anything that changed, but also it was only intermittently failing at first (reenqueuing the tag ref to trigger it again would often succeed)16:03
fungiso this suggests perhaps something was inconsistent across our farm of executors a week ago, but in the past few days has been made consistent16:03
corvusit seems very strange to me that ansible would pass through all the virtualenv information such that a *shell* task would then pick it up16:03
pabelanger"executable": "/usr/lib/zuul/ansible/2.7/bin/python3"16:03
pabelangerI think that is the issue16:03
pabelangerthe next task is using python3, which maybe because the path patch for venv is being used over system python16:04
fungiyes, the twine_python variable gets "python3" passed into it, unqualified. one workaround would be for us to set to "/usr/bin/python3" instead16:04
fungibut that hasn't changed since november16:04
smcginnisI suppose it would be safe for us to hard code the known sys path to which python we want.16:05
pabelangershould we append the virtualenv path in: https://opendev.org/zuul/zuul/commit/70ec13a7caf8903a95b0f9e08dc1facd2aa75e84 so OS versions take priority over virtualenv?16:06
pabelangerhttps://opendev.org/zuul/zuul/commit/70ec13a7caf8903a95b0f9e08dc1facd2aa75e8416:06
smcginnisBut did someone say ensure-twine may be used by other non-OpenStack/OpenDev deployments of Zuul? In that case, we would be hardcoding an assumption for everyone's environment.16:06
corvuspabelanger: is *that* the thing that changed?16:06
fungismcginnis: well, we can (and do in fact) set twine_python in our job definition, so could just update it there16:06
corvussmcginnis, fungi: yes, but it would be great if ensure-twine worked, rather than was guaranteed to be broken everywhere :/16:07
pabelangercorvus: I'll have to look when release jobs started failing16:07
fungicorvus: i agree16:07
smcginnisWait, is https://opendev.org/zuul/zuul/commit/70ec13a7caf8903a95b0f9e08dc1facd2aa75e84 the cause of these failures? Before 3 weeks ago it wouldn't have picked up ansible's virtualenv path first?16:07
pabelangersmcginnis: when did jobs start failing?16:07
fungipabelanger: smcginnis: well, also it would depend on when we restarted our executors, i think16:07
smcginnispabelanger: 16 days ago I think was the first one.16:08
fungiwhich could explain why it was random at first if we had only restarted one or a few executors around the 16th/17th16:08
pabelangeryah16:08
corvusi'm starting to think we should revert that patch16:08
corvusara should be the exception here, not the rule16:08
fungismcginnis: oh, there was one from the 8th? the earliest i had on record was from the 17th16:08
fungibut entirely likely i missed earlier examples16:09
smcginnisfungi: I should double check. The first I was aware of was python-solumclient.16:09
*** altlogbot_0 has quit IRC16:09
smcginnisfungi: OK, sorry, it was on the 17th - http://lists.openstack.org/pipermail/release-job-failures/2019-April/001135.html16:09
fungianyway, we did restart all our executors recently, which could account for why it's now failing consistently16:09
fungiperhaps a week ago we had some executors running with 70ec13a and some without?16:10
corvusyeah, could happen if they were rebooted16:10
pabelangercorvus: revert should be okay, we also landed a patch to ara-report that allows user to define path to ara. Which our jobs could use until we determine right approach16:12
corvuspabelanger: yeah.  perhaps zuul could expose the path to its virtualenv as a variable.  i think it would be fine for the ara-report role to expect to look for the ara in the virtualenv, since it's closely tied to zuul16:13
*** altlogbot_3 has joined #zuul16:13
corvuspabelanger: then it's the thing that has to care about virtualenvs, not everything else16:13
pabelangerthat seems reasonable16:13
smcginnisThat could be useful to find out from zuul.16:13
*** ianychoi has quit IRC16:14
corvussomeone want to propose a revert?16:14
pabelangersure16:14
*** ianychoi has joined #zuul16:15
clarkbcorvus: unrelated to the venv stuff, did you see tristanC reported that they cherry-picked my security fix back to 3.6.1 and don't have the memory leak16:17
clarkblooking like that may be unrelated16:17
openstackgerritPaul Belanger proposed zuul/zuul master: Revert "Prepend path with bin dir of ansible virtualenv"  https://review.opendev.org/65549116:19
pabelangercorvus: smcginnis: fungi: ^16:19
pabelangertobiash: ^fyi since it originally was your patch16:19
fungithanks pabelanger!16:20
smcginnisThanks!16:20
pabelangernp16:20
fungiheading back to #openstack-infra with related opendev logistics16:21
mordredcorvus: there's 2x+2 on the revert - but I left off the +A just in case16:22
corvusmordred: let's see if tobiash shows up soon16:22
tobiashI'll be here in a bit16:24
corvusmordred, pabelanger: am i correct in my assumption that if zuul itself were installed in a virtualenv, jobs would see the virtualenv if the operator activated it then ran zuul, but if they just ran "/path/to/venv/zuul-executor", jobs would not?16:27
tobiashcorvus: correct16:28
corvusso, broadly speaking, there are two ways we can go: assume that zuul will either not be installed in a virtualenv, or if it is, that operators will run it directly without activation.  then jobs in zuul-jobs can assume a clean (non-virtualenv) environment.  with that approach, we should merge the revert.16:30
corvusthe other approach would be for zuul-jobs to not make any assumptions about virtualenvs and always be defensive16:30
corvushowever, if we do that, jobs may have difficulty doing things like finding an appropriate python interpreter which is not in a virtualenv16:31
corvusi think that's what's causing me to lean toward the assume-non-virtualenv approach16:31
tobiashI'd also vote for the revert16:34
tobiashand ++ for the idea to put the virtualenv into a variable16:34
mordredI also vote for the revert and the assumption that zuul will not be run in an activated virtualenv16:35
corvusokay, i think we're good to +A that change now; i will do so16:36
smcginnisExecutors will need to be restarted after that revert lands to ensure it's picked up, correct?16:36
mordredthat said - I also think it wouldn't be terrible to be defensive in zuul-jobs - it might be worth testing to see if making an additional explicit virtualenv in a job rather than using --user would work as expected16:36
corvuspabelanger: if you have a minute to do the zuul variable change, that'd be great; i'm still swamped.16:36
pabelangeryup, I can do that shortly16:37
clarkbsmcginnis: yes16:37
smcginnisk16:37
pabelangermordred: yah, I think that is also a good idea. I've been defaulting to a virtualenv in our zuul jobs over --user so far16:39
mordredpabelanger: ++16:39
mordredpabelanger: yah - I think it's overall safer - and in general the jobs know what they installed, so using a tool from a venv in the job isn't terribly onerous16:40
pabelangeryah16:40
mordredbasically - maximize the explicitness16:40
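The "explicit virtualenv per job" approach pabelanger and mordred describe needs nothing beyond the standard library; a rough sketch of what a job-side helper could do (the paths and package name are illustrative):

```python
# Sketch: build a throwaway venv in the job workspace and install the tool
# there, instead of relying on `pip install --user` and whatever ambient
# interpreter (possibly Zuul's Ansible venv) the task happens to resolve.
import subprocess
import venv
from pathlib import Path

workspace = Path.home() / '.venvs' / 'twine'   # illustrative location
venv.EnvBuilder(with_pip=True, clear=True).create(workspace)

pip = workspace / 'bin' / 'pip'
subprocess.check_call([str(pip), 'install', 'twine'])

# The job then calls the tool by full path, so no PATH or venv-activation
# assumptions leak in:
subprocess.check_call([str(workspace / 'bin' / 'twine'), '--version'])
```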
pabelangermordred: also, did you see my ping in the backscroll about blacklisted network modules?16:40
mordredNOPE! lemme go look16:40
*** altlogbot_3 has quit IRC16:43
*** manjeets_ is now known as manjeets16:43
*** altlogbot_0 has joined #zuul16:45
*** mattw4 has joined #zuul16:47
AJaegerteam, want to merge https://review.opendev.org/654238 for opendev changes? That fixes testsuite as well and passes finally...16:49
smcginnisPost failure on that path revert: http://logs.openstack.org/91/655491/1/check/zuul-quick-start/5a0bfb3/job-output.txt.gz#_2019-04-24_16_49_06_93169416:50
smcginnisActually, real failure a few lines up from that link.16:51
AJaegersmcginnis: see my ping two lines above - that should fix it ;)16:52
smcginnisAJaeger: Oh, great!16:52
*** altlogbot_0 has quit IRC16:53
pabelangerAJaeger: +216:54
pabelangercorvus: tobiash: fungi: clarkb: if you'd also like to review https://review.opendev.org/65423816:55
*** homeski has left #zuul16:56
AJaegertobiash: want to +A as well? thanks, pabelanger and tobiash for reviewing16:56
*** altlogbot_2 has joined #zuul16:56
corvusgood -- that shows the intermediate registry is fixed too16:57
tobiashAJaeger: done16:57
*** homeski has joined #zuul16:57
AJaegerthanks16:58
AJaegeronce that is in, we can recheck https://review.opendev.org/#/c/654230 to fix nodepool16:58
homeskiAre the jobs for https://github.com/ansible-network/ansible-zuul-jobs being moved somewhere else?16:59
corvuspabelanger: ^17:00
pabelangerhomeski: yes, we've retired that repo in favor of https://github.com/ansible/ansible-zuul-jobs17:01
homeskithanks!17:02
*** jpena is now known as jpena|off17:06
pabelangerso, not directly zuul related, but we've been doing a promote pipeline job to deploy our zuul control plane for zuul.ansible.com. We've been having random job failures for some reason, which I haven't been able to debug properly yet, getting rc: -13 results.17:08
tobiashfrom ansible?17:08
pabelangertobiash: yah17:08
pabelangerI'll get a link to a log in a minute17:08
pabelangerhowever, I also wanted to give molecule a try again, and see if that would also expose the issue. So far it hasn't.17:09
pabelangerhowever, switching to molecule has reduced our job run times17:09
pabelangeroriginally ~10:46, with molecule 6:2617:09
pabelangerthat is against 16 nodes17:10
pabelangertobiash: https://logs.zuul.ansible.com/periodic-1hr/github.com/ansible-network/windmill-config/master/windmill-config-deploy/62b26f5/bastion/ara-report/result/ad040c37-a769-4be8-a483-9507aee28d6d/17:11
pabelangertobiash: talking to core it has to do with SIGPIPE, but I don't fully understand why yet17:12
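For context, an `rc` of -13 in an Ansible result follows the usual subprocess convention: a negative return code means the child was killed by that signal number, and signal 13 is SIGPIPE. A small, Linux-specific reproduction of that shape:

```python
# A negative return code means the child died from a signal; -13 maps to
# SIGPIPE, matching what pabelanger was told.
import signal
import subprocess

print(signal.Signals(13).name)             # SIGPIPE

# Reproduce the shape of the failure: a process whose stdout pipe goes away
# gets SIGPIPE on its next write, and subprocess reports returncode == -13.
proc = subprocess.Popen(['yes'], stdout=subprocess.PIPE)
proc.stdout.close()                        # reader goes away
proc.wait()
print(proc.returncode)                     # -13 on Linux
```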
*** jamesmcarthur_ has joined #zuul17:13
tobiashpabelanger: do you have more log than just the return code?17:14
tobiashis this an ansible run inside the executor or an ansible run driven by a job?17:14
*** jamesmcarthur has quit IRC17:15
pabelangertobiash: no more logs; not the executor, it runs on a static node (bastion)17:16
pabelangertobiash: it seems limited to 2 servers17:17
pabelangerhowever, servers appear to be healthy17:17
pabelangertobiash: also, a follow up from yesterday: http://paste.openstack.org/show/749715/17:20
pabelangerthat is another error on same get pr function17:20
tobiashyah, that's definitely worth a retry17:21
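Review 655204 itself is not shown here, but the kind of transient GitHub lookup failure in that paste is the textbook case for a bounded retry with backoff. A generic sketch, not the driver's actual change:

```python
# Generic retry-with-backoff wrapper for a flaky remote call such as the
# GitHub "get PR" lookup discussed above; an illustration only.
import logging
import time

log = logging.getLogger(__name__)


def retry(func, attempts=3, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Call func(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                raise
            log.warning('Attempt %d/%d failed (%s); retrying in %.1fs',
                        attempt, attempts, exc, delay)
            time.sleep(delay)
            delay *= backoff


# Usage (hypothetical): retry(lambda: github.pull_request(org, repo, number))
```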
*** mattw4 has quit IRC17:28
*** electrofelix has joined #zuul17:46
smcginnisCan https://review.opendev.org/#/c/655491/ be rechecked now that other patches have landed? Or are there still issues with that one?17:47
*** themroc has joined #zuul17:47
pabelangersmcginnis: I believe so17:51
smcginnisOK, I'll give it a kick. Thanks pabelanger17:52
AJaegersmcginnis: wait...17:52
smcginnisAh, AJaeger already got it. :)17:52
AJaegersmcginnis: I just added recheck - and then remembered that we need to merge https://review.opendev.org/654238 first. It failed py35 and py36 tests ;(17:53
smcginnisOh, I missed that that one failed too.17:55
AJaegerthe failing tests passed now, so check back in 30+ mins and let's hope it's merged by then17:57
*** jamesmcarthur_ has quit IRC18:08
*** jamesmcarthur has joined #zuul18:19
Shrewscorvus: given the recent zuul memleak discovery, you may be interested in https://review.opendev.org/#/c/654599/18:28
corvusShrews: ++thx18:30
Shrewscorvus: oh, also https://review.opendev.org/653549 which might cause some thread info to not be output18:34
corvusyep.  that was borken.18:36
pabelangerError determining manifest MIME type for docker://127.0.0.1:54567/zuul/zuul:latest: pinging docker registry returned: Get https://127.0.0.1:54567/v2/: EOF18:46
pabelangeris that new?18:46
pabelangerhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/job-output.txt.gz#_2019-04-24_18_25_41_79370318:46
pabelangerthat ran in vexxhost, so maybe related to ipv6 still18:47
clarkbperhaps relatedto the swift change18:49
fungiseems like it certainly could be18:53
fungicould be related to swift backend implementation which just went in, i mean18:53
fungiis it saying ti wants /v2/zuul/zuul instead of just /zuul/zuul?18:56
*** openstackgerrit has quit IRC18:57
pabelangerhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/be894b95-7e79-440c-8592-d19536833162/ is the command we are using19:00
pabelangerlooking at man page for skopeo, it looks to be correct19:01
*** electrofelix has quit IRC19:03
pabelangermordred: happen to have any thoughts?^19:04
clarkbya we had it working earlier19:11
clarkbwith the localhost socat proxy thing and everything which is why i suspect the swift storage for docker registry19:11
clarkbpabelanger: some github bugs imply that that would happen if we have no labels in the registry19:16
clarkbskopeo is asking for zuul/zuul:latest and there isn't a :latest ?19:16
pabelangeroh, maybe. Is this the first image we are publishing?19:16
pabelangernow that we're back with swift19:16
clarkbya though reading it more closely it is copying from the buildset registry to the swift backed intermediate registry and it is the local one-off buildset registry it is complaining about19:17
clarkbso I think we need to look backwards in the job and see when it pushed into the buildset registry and make sure that looks good19:17
pabelangerI think that is: http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/913fa7ec-8642-454c-8943-403b5bfd8314/19:18
clarkbhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/job-output.txt.gz#_2019-04-24_17_37_36_444962 is the console log side of it19:18
pabelangeroh, that is different of what I linked19:19
clarkbwe need to see the commands for that I think19:19
pabelangerchecking ara19:19
pabelangerhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/169531ed-c013-49a4-be1a-000daf3591bc/19:20
pabelangerdocker push zuul-jobs.buildset-registry:5000/zuul/zuul:latest19:21
clarkbhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/169531ed-c013-49a4-be1a-000daf3591bc/ yup found it too19:21
AJaegernow the jobs passed for https://review.opendev.org/#/c/65423819:21
clarkbso now lets make sure that zuul-jobs.buildset-registry:5000 and our socat are the same location19:21
pabelangerclarkb: is that http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/342e18b4-323d-446d-bf6f-20b03d192af4/ ?19:22
clarkbhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/342e18b4-323d-446d-bf6f-20b03d192af4/ ya that is the socat side19:22
clarkbnow to find the /etc/hosts on remote node side19:22
pabelangerHmm, no output for that19:23
pabelangerbut also, task didn't change19:23
clarkbheh lineinfile doesn't record anything19:23
clarkbhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/14f98992-7e1d-4874-99c4-f06291d90767/ it changed there19:23
clarkbif we do it again later it not changing is correct because the data is already in there19:23
pabelangerokay, cool19:24
clarkbpabelanger: maybe we update the role to record /etc/hosts after modifying it?19:27
clarkbassuming the two IPs line up I'm stumped though not sure why it would error like that19:27
pabelangerclarkb: good idea19:27
pabelangerclarkb: cat /etc/hosts good enough?19:28
clarkbpabelanger: ya it should be19:28
pabelangerk, incoming patch19:28
clarkbpabelanger: reading more bugs it may also be an authentication issue19:32
*** openstackgerrit has joined #zuul19:33
openstackgerritPaul Belanger proposed zuul/zuul-jobs master: Output contents of /etc/hosts after modifying it  https://review.opendev.org/65554119:33
clarkbis it possible that ran before the fix for that?19:33
clarkbor perhaps we don't have auth fully fixed?19:33
pabelangernot sure19:33
pabelangerbut we have 1 zuul patch in gate now19:33
pabelangerso, it has passed check19:33
*** rlandy is now known as rlandy|biab19:33
clarkbfix for intermediate registry merged at 1600UTC19:34
clarkbthat job started at 17:25UTC19:34
clarkbis it possible that we load the job config when queued and it queued before the fix merged?19:34
clarkbpabelanger: maybe we watch it a little more and see if it recurs19:36
clarkband if yes merge your change above19:36
pabelangerclarkb: wfm19:39
pabelangerbtw, I'd love a review on https://review.opendev.org/655204/ we are seeing an issue where, when a new PR is opened in github, the change isn't enqueued by zuul into a pipeline. http://paste.openstack.org/show/749715/ is 1 example of what we see.19:41
*** openstackgerrit has quit IRC19:42
AJaegerwe have unstable tests ;( https://review.opendev.org/654238 is failing a second time in the gate - now again py35, see http://logs.openstack.org/38/654238/3/gate/tox-py35/eaf6262/ ;(19:53
pabelangerwe lost connection to gearman19:55
pabelangermaybe a slow node19:55
pabelangermaybe we should start dstat and collect some extra info19:55
corvusclarkb: what job started at 1725?20:12
clarkbcorvus: http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/20:13
corvusthat can't be right20:13
clarkbhttp://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/job-output.txt.gz#_2019-04-24_17_25_33_43001120:14
corvusoh... now i see it20:14
corvusso that one was probably due to AJaeger's recheck at 17:2320:15
clarkbhrm that is after the fix merged20:15
corvusyep20:15
corvusthe run *before* that was after the fix merged too20:15
corvusAJaeger: issued a recheck at 160120:15
corvus(and i believe that was intentionally right after the fix merged)20:15
corvusalso, zuul-upload-image (which, for this purpose, is nearly the same as zuul-build-image) worked too in the gate run20:16
corvusso there were 2 passes of the "build image" job before that failure20:16
corvus2 passes with the new registry config20:17
corvusfwiw, this should be over ipv620:18
corvusi think we should hold nodes for zuul-build-image and zuul-upload-image20:19
clarkbwfm20:19
AJaegerhttps://review.opendev.org/654238 is currently running zuul-build-image...20:20
corvusautoholds set20:20
*** themroc has quit IRC20:20
* AJaeger calls it a day - good night!20:21
*** mattw4 has joined #zuul20:22
*** pcaruana has quit IRC20:39
*** rlandy|biab is now known as rlandy20:40
clarkbI've scanned the list of commits between when we weren't leaking memory in the scheduler and when we were. I think the most likely culprits are 9f7c642a 6d5ecd65 4a7894bc 3704095c but even then none of them stand out to me (note I've left off the security fix because tristanC reported it doesn't cause memory issues on their cherry-picked version)21:01
pabelangerI'm trying to get up some monitoring on zuul.a.c so I can add another metric too21:02
corvusclarkb: if the memory leak is related to layouts, system size will affect it21:06
clarkbcorvus: thats a good point. We are way more active so may trigger it more quickly21:07
corvusand the data structure sizes are larger21:07
clarkbok so those 4 + the security fix :)21:08
corvusclarkb: you can strike 4a7894bc -- it's a test-only change21:11
clarkbah yup21:11
clarkb(in my brain I had an idea that yaml could be the thing leaking but not if it is only in tests)21:11
corvusclarkb: are we seeing jumps?  if so, correlating jumps with either gate resets or reconfigurations would point toward 3704095c21:12
corvus(of course, the same applies to the security fix)21:13
clarkbhttp://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all ya there was a big jump about 1850UTC today21:13
corvusyeah, but that's stretched over a long period of time... that could just be activity related21:14
clarkbfwiw I'm not in a great spot to debug this further right now. Too much pre travel prep going on. Also I assume we want to dig in with the objgraph stuff?21:46
clarkbcorvus: maybe we can use board meeting "quiet" time to do that?21:46
corvusi would be happy to help with pointers21:47
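The "objgraph stuff" referred to here is the usual leak-hunting workflow: sample object counts from the live scheduler over time, then walk back-references from whatever keeps growing. A minimal sketch of that workflow; the `Layout` type name is just an example suspect:

```python
# Minimal objgraph workflow for chasing a leak like this one.
import objgraph

# 1) Print the delta of object counts since the previous call; repeat this
#    periodically (e.g. across gate resets or reconfigurations) and watch
#    for types that only ever go up.
objgraph.show_growth(limit=20)

# 2) Once a suspicious type is identified, grab an instance and find what
#    still holds a reference to it.
leaked = objgraph.by_type('Layout')
if leaked:
    chain = objgraph.find_backref_chain(
        leaked[-1], objgraph.is_proper_module)
    objgraph.show_chain(chain, filename='layout-backrefs.dot')
```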
corvusclarkb: as long as we're restarting anyway though, if any of the patches on your shortlist are non-critical for opendev, we could manually apply reverts and restart21:47
corvusthat could get us a lot of information at little cost in personnel time21:47
clarkbthats a good point. I think the security fix is probably critical but some of the others may not be21:48
corvusi think we can live without the allow nonexistent project one (as long as we don't have any current config errors in any tenant, it's safe to remove)21:48
clarkb"Tolerate missing project" "Forward artifacts to child jobs within buildset" "Centralize job canceling"21:49
clarkbwe can probably live without centralize job cancelling too21:49
clarkb9f7c642a and 3704095c21:49
corvusre artifacts -- we are probably not taking advantage of that yet, so it's probably okay to remove, but that's less certain21:50
*** rlandy is now known as rlandy|bbl22:00
corvuswhat in the world is this error?  http://logs.openstack.org/91/655491/1/check/zuul-quick-start/4a2e58e/ara-report/result/f5c97f96-c38d-4017-96be-a8c767e4d4bf/22:05
SpamapScorvus:did the canonical name get hard coded instead of using zuul.something maybe?22:08
corvusSpamapS: oh, huh i didn't even notice that, thanks.  that makes a lot more sense now.22:08
corvusfungi, clarkb: we probably need https://review.opendev.org/654238 before the executor change can land22:09
corvusSpamapS: ^ i think that has the fix you are implying22:09
clarkbnoted22:09
fungiugh22:10
fungishall i enqueue it too and promote?22:10
clarkbprobably should if we expect the other to merge22:11
corvusit failed again with the registry22:11
fungioh, so it did. do we expect that's fixed now?22:12
clarkbI don't think so but we have holds in place for it now22:12
corvusno, we don't know why it's sometimes failing.  we set autoholds, but i don't see that any were triggered22:13
corvustrying to figure out why now22:14
corvushttp://paste.openstack.org/show/749728/22:17
clarkbhrm we deleted the node before we could hold it?22:19
corvus2019-04-24 20:35:28,383 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0005637791 (state: in-use, allocated_to: 300-0003033704)22:20
corvus2019-04-24 20:36:19,020 INFO nodepool.NodeDeleter: Deleting ZK node id=0005637791, state=deleting, external_id=7a67b97b-c65f-44b1-a51f-ad2c9715081c22:20
corvusalmost an hour before the autohold22:20
corvusthat would explain the failure to connect to the buildset registry if the node was gone22:23
corvuszuul never logged unlocking it until the end22:24
corvusso i don't know why nodepool thought it was unlocked22:24
clarkbwas there a zk disconnect during that time period?22:25
corvusyes22:28
corvusclarkb: do you think this could be the cause of the mysterious "connection issues"?22:28
corvuswe observed that they came in batches, which is exactly what you would expect from a mass node deletion event22:29
corvusclarkb: the ultimate cause may be the memory leak22:29
clarkbya I think this may explain it22:29
corvuswe're *very* sensitive to swapping22:29
corvusand the last time there was a memleak we saw similar behavior22:30
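The failure mode pieced together above — the swapping scheduler misses ZooKeeper heartbeats, its session expires, the ephemeral lock znodes it held vanish, and nodepool then sees in-use nodes as unlocked and deletes them — is inherent to how ZooKeeper ephemeral nodes behave. A small kazoo sketch of the effect; the host and lock path are illustrative:

```python
# Why a ZK session loss makes a locked node look unlocked: kazoo locks are
# ephemeral znodes, removed server-side when the session expires.
from kazoo.client import KazooClient
from kazoo.protocol.states import KazooState

zk = KazooClient(hosts='zk01.example.org:2181')


def state_listener(state):
    if state == KazooState.LOST:
        # Session expired: every ephemeral node created by this client,
        # including its lock nodes, has been removed by ZooKeeper.
        print('ZK session lost; all locks held by this client are gone')
    elif state == KazooState.SUSPENDED:
        print('ZK connection suspended; locks are at risk if this persists')


zk.add_listener(state_listener)
zk.start()

lock = zk.Lock('/nodepool/nodes/0005637791/lock')  # illustrative path
lock.acquire()
# ... if the session is lost here, other clients can acquire this lock even
# though this process still believes it holds it.
```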
clarkbI should write a big sign on my office wall that says "Check if zk disconnected and if it did: are there memory leaks"22:30
clarkbthen maybe I won't neglect to look at it22:30
corvusmay i have a copy?22:30
clarkbyes22:30
corvusback to -infra22:31
*** openstackgerrit has joined #zuul23:30
openstackgerritJames E. Blair proposed zuul/zuul master: Revert "Prepend path with bin dir of ansible virtualenv"  https://review.opendev.org/65549123:30

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!