*** jangutter has quit IRC | 00:22 | |
*** jangutter has joined #zuul | 00:30 | |
*** jangutter has quit IRC | 00:40 | |
*** jangutter has joined #zuul | 00:46 | |
*** jamesmcarthur has joined #zuul | 00:52 | |
*** jamesmcarthur has quit IRC | 00:54 | |
*** jamesmcarthur has joined #zuul | 00:58 | |
tristanC | clarkb: fwiw we have been running with 41b6b0ea33 picked on 3.6.1 without leak, here is our memory graph: https://softwarefactory-project.io/grafana/?panelId=12054&fullscreen&orgId=1&var-datasource=default&var-server=zs.softwarefactory-project.io&var-inter=$__auto_interval_inter&from=now%2FM&to=now | 01:01 |
clarkb | tristanC: that is good to know thank you | 01:02 |
clarkb | so probably leaking somewhere else | 01:02 |
tristanC | on the other hand, we are also running zuul-3.8.0/nodepool-3.5.0 without noticeable leak, perhaps this is because of the software-collection provided python-3.5.1 | 01:11 |
*** jamesmcarthur has quit IRC | 01:14 | |
*** rfolco has joined #zuul | 01:15 | |
*** jamesmcarthur_ has joined #zuul | 01:18 | |
*** jamesmcarthur_ has quit IRC | 01:34 | |
*** jamesmcarthur has joined #zuul | 02:05 | |
*** jamesmcarthur has quit IRC | 02:12 | |
*** jamesmcarthur has joined #zuul | 02:18 | |
*** bhavikdbavishi has joined #zuul | 02:26 | |
*** bhavikdbavishi has quit IRC | 02:35 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: add OpenAPI documentation https://review.opendev.org/535541 | 02:39 |
*** bhavikdbavishi has joined #zuul | 03:19 | |
*** bhavikdbavishi1 has joined #zuul | 03:22 | |
*** bhavikdbavishi has quit IRC | 03:23 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:23 | |
*** jamesmcarthur has quit IRC | 04:17 | |
*** jamesmcarthur has joined #zuul | 04:18 | |
*** jamesmcarthur has quit IRC | 04:23 | |
*** jamesmcarthur has joined #zuul | 04:29 | |
*** jamesmcarthur has quit IRC | 04:45 | |
*** jamesmcarthur has joined #zuul | 04:45 | |
*** jamesmcarthur has quit IRC | 04:50 | |
*** jamesmcarthur has joined #zuul | 04:50 | |
*** jamesmcarthur has quit IRC | 04:54 | |
*** bjackman has joined #zuul | 05:03 | |
*** jamesmcarthur has joined #zuul | 05:06 | |
*** jamesmcarthur has quit IRC | 05:11 | |
*** quiquell|off is now known as quiquell|rover | 05:17 | |
*** jamesmcarthur has joined #zuul | 05:17 | |
*** jamesmcarthur has quit IRC | 05:22 | |
*** jamesmcarthur has joined #zuul | 05:25 | |
*** jamesmcarthur has quit IRC | 05:32 | |
*** bjackman has quit IRC | 05:32 | |
*** bjackman has joined #zuul | 05:35 | |
*** bjackman_ has joined #zuul | 05:38 | |
*** bjackman has quit IRC | 05:41 | |
*** electrofelix has joined #zuul | 06:11 | |
*** jamesmcarthur has joined #zuul | 06:11 | |
*** bhavikdbavishi1 has joined #zuul | 06:15 | |
*** jamesmcarthur has quit IRC | 06:15 | |
*** bhavikdbavishi has quit IRC | 06:16 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 06:16 | |
*** pcaruana has joined #zuul | 06:24 | |
*** jvv_ has quit IRC | 06:25 | |
*** homeski has joined #zuul | 06:27 | |
*** quiquell|rover is now known as quique|rover|brb | 06:54 | |
*** gtema has joined #zuul | 06:57 | |
*** themroc has joined #zuul | 07:06 | |
*** saneax has joined #zuul | 07:09 | |
*** quique|rover|brb is now known as quiquell|rover | 07:14 | |
*** jpena|off is now known as jpena | 07:48 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: web: add build modal with a parameter form https://review.opendev.org/644485 | 08:07 |
bjackman_ | tristanC, who should I pester to get a Workflow +1 on https://review.opendev.org/#/c/649900/ ? | 08:24 |
tristanC | bjackman_: i would but i'm not a zuul maintainer... and we need a couple of CR+2 before +Workflow | 08:29 |
bjackman_ | tristanC, sure, just wondering who is the right person? | 08:43 |
bjackman_ | mordred, are you a maintainer? | 08:44 |
tristanC | bjackman_: iirc zuul-maint should ping the right people (to ask for review on https://review.opendev.org/#/c/649900/ ) | 08:44 |
bjackman_ | tristanC, ah cheers! | 08:44 |
bjackman_ | zuul-maint: (ping in case the above message has not already done it) | 08:45 |
*** sshnaidm|afk is now known as sshnaidm| | 09:16 | |
*** sshnaidm| is now known as sshnaidm | 09:16 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: add support for checkbox and list parameters https://review.opendev.org/648661 | 09:24 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: web: add support for checkbox and list parameters https://review.opendev.org/648661 | 09:25 |
*** mrhillsman has quit IRC | 09:55 | |
*** mrhillsman has joined #zuul | 09:58 | |
*** bhavikdbavishi has quit IRC | 10:00 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Pagure driver - https://pagure.io/pagure/ https://review.opendev.org/604404 | 10:13 |
*** gtema has quit IRC | 10:53 | |
*** jpena is now known as jpena|lunch | 11:18 | |
*** gtema has joined #zuul | 11:27 | |
*** bhavikdbavishi has joined #zuul | 11:34 | |
*** gtema has quit IRC | 11:42 | |
*** bhavikdbavishi1 has joined #zuul | 11:50 | |
*** bhavikdbavishi has quit IRC | 11:50 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 11:50 | |
*** quiquell|rover is now known as quique|rover|eat | 12:00 | |
*** rlandy has joined #zuul | 12:01 | |
*** bhavikdbavishi has quit IRC | 12:01 | |
*** rlandy is now known as rlandy|ruck | 12:02 | |
*** jamesmcarthur has joined #zuul | 12:10 | |
*** bhavikdbavishi has joined #zuul | 12:14 | |
*** jamesmcarthur has quit IRC | 12:15 | |
*** jamesmcarthur has joined #zuul | 12:16 | |
*** themroc has quit IRC | 12:21 | |
*** jamesmcarthur has quit IRC | 12:32 | |
*** quique|rover|eat is now known as quiquell|rover | 12:41 | |
*** jamesmcarthur has joined #zuul | 12:44 | |
*** irclogbot_3 has quit IRC | 12:55 | |
*** irclogbot_3 has joined #zuul | 12:57 | |
*** altlogbot_1 has quit IRC | 12:57 | |
openstackgerrit | Monty Taylor proposed zuul/zuul master: Update references for opendev https://review.opendev.org/654238 | 12:59 |
*** altlogbot_0 has joined #zuul | 12:59 | |
*** bjackman_ has quit IRC | 12:59 | |
*** bjackman_ has joined #zuul | 13:01 | |
pabelanger | zuul-build-image looks to be broken: http://logs.openstack.org/97/653497/2/check/zuul-build-image/701b7ac/job-output.txt.gz#_2019-04-24_12_37_11_936392 | 13:03 |
pabelanger | does anybody know the status of the intermediate registry? I know we've had a few issues over the last few weeks | 13:04 |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: ensure-twine: Don't install --user if running in venv https://review.opendev.org/655241 | 13:09 |
*** rlandy|ruck is now known as rlandy|ruck|mtg | 13:11 | |
*** rfolco is now known as rfolco_sick | 13:18 | |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: ensure-twine: Don't install --user if running in venv https://review.opendev.org/655241 | 13:22 |
*** bhavikdbavishi has quit IRC | 13:23 | |
clarkb | pabelanger: mordred was converting it to swift backed storage. That may need a followup | 13:28 |
pabelanger | mordred: happen to have an idea what is happening? Getting an auth failure | 13:34 |
pabelanger | should we make the job non-voting until the infra is better? I'm unsure what bandwidth people have before travels | 13:35 |
*** jpena|lunch is now known as jpena | 13:40 | |
clarkb | 13:41 | |
clarkb | er | 13:41 |
clarkb | corvus also updated the cred on the registry. Maybe that didn't get updated in the secrets? | 13:41 |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: Add environment debugging to ensure-twine role https://review.opendev.org/655437 | 13:44 |
mordred | clarkb: I updated the secret | 13:45 |
mordred | clarkb: yesterday ... but maybe I missed something somewhere? | 13:46 |
mordred | clarkb: https://review.opendev.org/#/c/655228/ fwiw | 13:46 |
*** rfolco_sick is now known as rfolco | 13:52 | |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: Add environment debugging to ensure-twine role https://review.opendev.org/655437 | 14:00 |
*** jamesmcarthur has quit IRC | 14:00 | |
*** jamesmcarthur has joined #zuul | 14:00 | |
pabelanger | Shrews: did you get the changes you wanted landed for opendev nodepool? | 14:17 |
Shrews | pabelanger: nope. i have no idea what's happening with the gate jobs (they pass in check) | 14:19 |
Shrews | the failures seem quite exquisitely random, such as: http://logs.openstack.org/62/654462/3/gate/puppet-beaker-rspec-infra/e743f98/job-output.txt.gz#_2019-04-24_00_32_35_445642 | 14:23 |
pabelanger | Shrews: that usually happens when another node comes online in openstack provider with same IP | 14:24 |
clarkb | we spent a good chunk of time (all of it for me, basically) debugging those networking issues | 14:24 |
Shrews | the other fail was attempting to contact mirror.mtl01.inap.openstack.org | 14:25 |
clarkb | seems there were maybe multiple things: reused IPs in inap and maybe unhappy executor nodes or cloud networking | 14:25 |
AJaeger | team, do we want to merge smcginnis' change https://review.opendev.org/#/c/655437 to debug the ensure-twine role? | 14:25 |
clarkb | in any case mgagne fixed inap and we rebooted the executors and it is happier now | 14:25 |
* Shrews rechecks | 14:26 | |
corvus | fungi: i approved https://review.opendev.org/655437 | 14:28 |
smcginnis | Thanks | 14:30 |
fungi | thanks corvus! | 14:30 |
fungi | i've run down yet more avenues to attempt to figure out what changed in our environment to start exposing that | 14:31 |
fungi | none of the system packages updated in the relevant timeframe look to be likely culprits for such a behavior shift | 14:31 |
fungi | and unlike i had assumed, we're not actually automatically upgrading our ansible interpreters on our executors either | 14:32 |
fungi | so they're still running the versions of ansible which were current at the time those virtualenvs were first created | 14:32 |
smcginnis | With 655437 merged, should I do another release-test release? Or is it enough to reenqueue one of the failed ones? | 14:32 |
fungi | should be fine for me to reenqueue one, i'll get details from you in another channel so we don't make too much unrelated noise in here | 14:33 |
fungi | oh, actually, i can look one up easily | 14:33 |
fungi | yay zuul builds dashboard | 14:33 |
fungi | et voila: http://zuul.opendev.org/t/openstack/builds/?job_name=release-openstack-python&project=openstack%2Frelease-test | 14:35 |
corvus | fungi: ah, i'm surprised we aren't upgrading ansible too | 14:35 |
fungi | corvus: yeah, i think that's something for us to look into soon | 14:35 |
*** electrofelix has quit IRC | 14:37 | |
fungi | oh, hah, getting ahead of myself. i reenqueued that test tag but 655437 hasn't merged yet, so ignore the useless failure we'll probably get from it | 14:37 |
smcginnis | Oops, a few minutes yet in gate. :) | 14:38 |
fungi | yeah, no worries, i have e-mails to read still | 14:38 |
clarkb | fungi: you were able to confirm we didn't upgrade ansible? when I looked, the file timestamps were newer than the most recent ansible 2.7 release | 14:39 |
fungi | which file timestamps? i ran `/usr/lib/zuul/ansible/2.7/bin/ansible --version` and also checked the last modified time on that executable | 14:40 |
fungi | (and did the same for 2.6 and 2.5 as well) | 14:40 |
clarkb | the file timestamps on the files in /var/lib/zuul/ansible/2.7 | 14:40 |
clarkb | there were some from april 20 | 14:41 |
fungi | oh, those are apparently where we *download* newer ansible, but not where we *install* it | 14:41 |
clarkb | ah | 14:41 |
fungi | it's /usr vs /var (i didn't catch that my first time through either and corvus had to point it out) | 14:42 |
openstackgerrit | Merged zuul/zuul-jobs master: Add environment debugging to ensure-twine role https://review.opendev.org/655437 | 14:42 |
fungi | though it's not clear to me why we have unpacked ansible installs in /var when we're installing them into virtualenvs in /usr | 14:43 |
fungi | i need to take a closer look at the ansible manager probably to understand the reasons | 14:43 |
*** jamesmcarthur has quit IRC | 14:48 | |
*** quiquell|rover is now known as quiquell|off | 14:49 | |
corvus | fungi: i think the zuul module overrides are in /var | 14:50 |
clarkb | in response to the trouble we had debugging connection issues yesterday, cloudnull wrote an ansible callback plugin to run a traceroute on connection failure | 14:52 |
clarkb | that seems like something that we may want to install in zuul's ansible by default? | 14:52 |
clarkb | currently the code is up on a gist but I can see about having cloudnull push that to zuul if we think that would be useful | 14:54 |
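A callback plugin of the sort clarkb describes hooks Ansible's unreachable-host event and shells out to traceroute. The sketch below is only an illustration of the approach; the plugin name, the host lookup, and the traceroute invocation are assumptions, not the contents of cloudnull's gist.

```python
# callback_plugins/traceroute_on_unreachable.py -- illustrative sketch only,
# not cloudnull's actual plugin.
import subprocess

from ansible.plugins.callback import CallbackBase


class CallbackModule(CallbackBase):
    """Run a traceroute toward any host Ansible reports as unreachable."""

    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'notification'
    CALLBACK_NAME = 'traceroute_on_unreachable'

    def v2_runner_on_unreachable(self, result):
        # Prefer ansible_host if set, otherwise fall back to the inventory name.
        host = result._host.vars.get('ansible_host', result._host.get_name())
        try:
            output = subprocess.check_output(
                ['traceroute', '-n', host],
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                timeout=60)
            self._display.display('traceroute to %s:\n%s' % (host, output))
        except Exception as exc:
            # Never let the debugging aid break the play.
            self._display.warning('traceroute to %s failed: %s' % (host, exc))
```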
fungi | smcginnis: "Invalid options for debug: real_prefix,VIRTUAL_ENV,path,version,prefix,base_exec_prefix,executable" http://logs.openstack.org/59/59ce65e9e66fe3ea203b77812ee14e69ebdb192a/release/release-openstack-python/abaae0d/ | 14:57 |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: Correct debug statement for ensure-twine environment https://review.opendev.org/655466 | 14:57 |
smcginnis | fungi: Sorry, I messed that up. | 14:57 |
* smcginnis needs to spend more time with ansible | 14:57 | |
*** pcaruana has quit IRC | 15:06 | |
*** jamesmcarthur has joined #zuul | 15:07 | |
*** rlandy|ruck|mtg is now known as rlandy | 15:16 | |
*** rfolco is now known as rfolco|ruck | 15:19 | |
pabelanger | fungi: so, if we used twine to upload zuul-3.8.1.dev2.tar.gz to pypi, are there any flags we need to pass to make it a pre-release? Eg: if somebody did pip install zuul, would 3.7.0 be selected or 3.8.0.dev2? | 15:19 |
pabelanger | I also _think_ our current release-zuul-python job should work | 15:19 |
pabelanger | trying to find logs from the last release | 15:19 |
fungi | pypi and pip figure that out on their own, courtesy of pep 440 | 15:20 |
pabelanger | great | 15:20 |
fungi | like if you look at https://pypi.org/project/ansible/#history those "pre-release" tags are added automatically based on there being an "a" or "b" or "rc" in the name, and "dev" should do the same | 15:21 |
pabelanger | fungi: yup! 2.7.0.dev0 | 15:22 |
pabelanger | thanks for link | 15:22 |
*** sshnaidm is now known as sshnaidm|off | 15:23 | |
webknjaz | > zuul-3.8.0.tar.gz (10.3 MB) | 15:24 |
webknjaz | what do you put there? 0_o | 15:24 |
fungi | pabelanger: indeed, i confirmed on one of my own projects where i've been uploading dev versions and it does mark them as pre-release | 15:24 |
webknjaz | I can confirm that too | 15:24 |
webknjaz | I'm actually the one who requested that label on the UI :) | 15:25 |
fungi | awesome! | 15:25 |
fungi | and yeah, pypi/warehouse lists the most recent non-pre-release version by default too | 15:26 |
webknjaz | FYI I've found a good way of keeping the release pipeline healthy | 15:26 |
webknjaz | publish every commit in master to `test.pypi.org` | 15:26 |
webknjaz | we do this in ansible-lint and molecule | 15:26 |
pabelanger | fungi: so, looking at the last release-zuul-job, I don't see it renaming files after sdist / bdist_wheel. So I think we could add it to post (or promote?) and publishing should just work | 15:27 |
webknjaz | facilitated by `setuptools-scm` in particular | 15:27 |
pabelanger | http://logs.openstack.org/7b/7b17aa534d193743f5307f4b399fa1690f72da6d/release/release-zuul-python/aabfa42/job-output.txt.gz | 15:27 |
webknjaz | renaming dist? don't do that manually! | 15:27 |
fungi | pabelanger: probably not promote since that would run on the change ref rather than the merged state | 15:29 |
fungi | i think post is what you want | 15:29 |
webknjaz | > if somebody did pip install zuul | 15:29 |
webknjaz | pre-release install would require `--pre` CLI arg | 15:29 |
pabelanger | webknjaz: yup, I mostly didn't want humans getting dev releases by default | 15:30 |
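A quick way to sanity-check what pip and PyPI will treat as a pre-release under PEP 440 is the `packaging` library (the same parser pip uses). The version strings below are just examples:

```python
from packaging.version import Version

for v in ('3.7.0', '3.8.0', '3.8.1.dev2', '2.8.0a1', '2.8.0rc1'):
    kind = 'pre-release' if Version(v).is_prerelease else 'final'
    print('%-12s -> %s' % (v, kind))

# A plain `pip install zuul` resolves to the newest final release;
# pre-releases like 3.8.1.dev2 are only considered with `pip install --pre zuul`.
```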
webknjaz | pabelanger: what renaming were you talking about? | 15:31 |
openstackgerrit | Merged zuul/zuul-jobs master: Correct debug statement for ensure-twine environment https://review.opendev.org/655466 | 15:31 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Add release-zuul-python to post pipeline https://review.opendev.org/655474 | 15:32 |
pabelanger | webknjaz: some of our puppet jobs would rename artifacts before uploading them. I was confusing that with our python jobs. | 15:33 |
pabelanger | corvus: fungi: I _think_ ^ is all we need to do | 15:33 |
pabelanger | I'll check nodepool release job shortly | 15:34 |
webknjaz | pabelanger: nowadays it probably makes sense to move to PEP 517 entrypoint for building Python dists | 15:35 |
fungi | webknjaz: yeah, i'm not sure we've thought through the implications of pep 517 on pbr, but it's possible it "just works" at this point | 15:36 |
fungi | worth trying at any rate | 15:36 |
pabelanger | yah, will defer to other humans on pypi / pep things :) | 15:37 |
fungi | webknjaz: also a number of us dislike toml on principle, and the idea of carrying toml files in our git repositories is undesirable so may need a bit of a workaround | 15:37 |
webknjaz | that's a standard now and it's pretty good | 15:38 |
webknjaz | not sure about pbr though | 15:38 |
fungi | more that using toml in any way could be seen as an endorsement of its creator | 15:38 |
webknjaz | if you use editable installs, you'd need to use `pip install --no-build-isolation -e .` | 15:38 |
webknjaz | there's a lot of tools using `pyproject.toml`, it's already endorsed. And it's def better than throwing install-time invoked code at users | 15:39 |
mordred | webknjaz: yeah - I know. but just because other projects have adopted it doesn't mean we're necessarily excited to do so, when the reasoning behind the creation of toml in the first place is so gross. it basically makes many of us quite sad and annoyed, so we haven't really put any energy into figuring out what the pbr interactions with it will be or need to be | 15:43 |
fungi | i guess it's a question of whether install-time code generation is better than supporting misogyny | 15:44 |
mordred | we'll clearly have to get over it at some point | 15:44 |
webknjaz | I like the reasoning behind toml | 15:44 |
mordred | I don't. it's "I can't be bothered to read the yaml spec" | 15:44 |
mordred | and that's the type of lazy and narcissistic behavior that is ruining our industry | 15:44 |
webknjaz | i'm more of "the implementations of other formats are broken" | 15:45 |
fungi | "...so let's create yet another instead of improving one" | 15:45 |
mordred | yeah. can't be bothered to write a better impl of an existing widely used format | 15:45 |
Shrews | pabelanger: looks like my change merged while i was at the doc. just waiting for puppet to do its thing | 15:45 |
webknjaz | toml is stricter | 15:45 |
pabelanger | Shrews: yay | 15:45 |
mordred | the only reason I touch, or ever will touch, toml is when I am forced to do so | 15:46 |
fungi | but anyway, my main concern with toml remains that tpl is a detestable human and i'd rather not be associated with him in any way i can avoid | 15:46 |
mordred | yeah. that too | 15:46 |
*** pcaruana has joined #zuul | 15:46 | |
mordred | and that the file format is named after himself is yet another great example of runaway ego | 15:47 |
mordred | but - as I said - at some point it will become unavoidable in python just like it is in rust | 15:47 |
mordred | and we'll be forced to interact with it | 15:47 |
fungi | some rudimentary "research" was performed in the course of dstufft's choice of toml for pep 517, in which (paraphrasing) "we asked some of the women working in the python community if they had any concerns and none of them spoke up" | 15:47 |
pabelanger | ouch | 15:48 |
webknjaz | I saw better explanations of why it's been chosen for PEP 517 | 15:49 |
webknjaz | somewhere outside that pep there's a gist with that | 15:49 |
mordred | the reason was that pyyaml was too hard to vendor, since it depends on libyaml | 15:50 |
webknjaz | well | 15:50 |
pabelanger | mordred: so, https://github.com/ansible/ansible-zuul-jobs/blob/master/playbooks/ansible-network-vyos-base/files/bootstrap.yaml is how we are doing per-job keys for the vyos network appliance (debian based), mostly because vyos_config is blocked in zuul. I know we have a discussion going on the ML to drop blacklisting of action modules in zuul, but I also cannot remember the reason for dropping ansible network modules by | 15:50 |
pabelanger | default | 15:50 |
webknjaz | I'll tell you more | 15:50 |
mordred | and for reasons surpassing understanding there is still no yaml support in the stdlib | 15:50 |
webknjaz | pyyaml is "maintained" by a Perl guy | 15:50 |
mordred | yeah | 15:50 |
webknjaz | who doesn't like to accept any help | 15:50 |
webknjaz | ruamel is not much better either | 15:51 |
webknjaz | there's no toml in stdlib either, after all... | 15:51 |
mordred | yeah. but it's written in such a way that the pip people can vendor it | 15:51 |
mordred | which is how they solve dependencies | 15:52 |
webknjaz | right | 15:52 |
mordred | even though it's an atrocious practice | 15:52 |
webknjaz | I know | 15:52 |
*** bjackman_ has quit IRC | 15:52 | |
webknjaz | this practice is fine, because it solves chicken/egg in a way | 15:52 |
mordred | but even hating toml - I'd argue that python stdlib should carry both yaml and toml support - as they are pretty standard | 15:52 |
SpamapS | I'm less angry about TOML, and more frustrated at the yaml libs being poorly maintained. | 15:53 |
webknjaz | I've tried contributing to pyyaml so many times | 15:53 |
SpamapS | YAML has eaten the world. | 15:53 |
mordred | yeah. yaml is pretty darned important | 15:53 |
SpamapS | webknjaz: yeah I bounced off of it once too. | 15:53 |
webknjaz | and one other Ansible Core Dev does that too | 15:53 |
webknjaz | when we realised that they are going to trash one of Ansible's main deps - the only way to save it is to talk some sense into the maintainers... | 15:54 |
fungi | well, pyyaml is virtually unmaintained, so there's plenty of room for someone to write a from-scratch reimplementation suitable for inclusion in the python stdlib (and no ruamel.yaml is not it, the entire incestuous ruamel dependency set makes me not want to depend on it in my projects) | 15:57 |
mordred | fungi: ++ | 15:58 |
fungi | i expect a lot of projects would jump at the chance to swap out pyyaml for a slimmer and better-maintained alternative, even if it didn't maintain bug-compatibility with pyyaml | 15:59 |
fungi | or, heck, even if it didn't mirror the pyyaml api | 15:59 |
fungi | i also feel like it wouldn't need to have optional libyaml support, could just be pure-python-only | 16:00 |
fungi | and at this stage, could probably focus entirely on python>=3.5 | 16:00 |
corvus | mordred, tobiash, fungi, AJaeger: smcginnis got this: http://logs.openstack.org/59/59ce65e9e66fe3ea203b77812ee14e69ebdb192a/release/release-openstack-python/781edd8/ara-report/result/a6f1d958-2d17-4b8a-a6bf-97cb1bf6e31d/ | 16:00 |
corvus | which shows that the pip install task running on localhost (ie, the executor) is running in the zuul managed ansible virtualenv | 16:01 |
smcginnis | AJaeger pointed out the failed one was "Host: localhost" and the one that passed was "Host: ubuntu-bionic". | 16:01 |
fungi | yeah, the latter ran on a dedicated job node rather than the executor | 16:01 |
corvus | this makes for awkwardness -- we're trying to run "pip install --user" which will now work on remote hosts, but not on localhost due to the venv | 16:02 |
corvus | pabelanger: ^ ping you too | 16:02 |
fungi | but what's really weird here is that this job was working fine until a week ago, and we can't find anything that changed, but also it was only intermittently failing at first (reenqueuing the tag ref to trigger it again would often succeed) | 16:03 |
fungi | so this suggests perhaps something was inconsistent across our farm of executors a week ago, but in the past few days has been made consistent | 16:03 |
corvus | it seems very strange to me that ansible would pass through all the virtualenv information such that a *shell* task would then pick it up | 16:03 |
pabelanger | "executable": "/usr/lib/zuul/ansible/2.7/bin/python3" | 16:03 |
pabelanger | I think that is the issue | 16:03 |
pabelanger | the next task is using python3, which may be because the venv PATH patch means it is being used over system python | 16:04 |
fungi | yes, the twine_python variable gets "python3" passed into it, unqualified. one workaround would be for us to set it to "/usr/bin/python3" instead | 16:04 |
fungi | but that hasn't changed since november | 16:04 |
smcginnis | I suppose it would be safe for us to hard code the known sys path to which python we want. | 16:05 |
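The failure mode being discussed comes down to the localhost task running under Zuul's Ansible virtualenv, where pip refuses `--user` installs. The snippet below is an illustration of how a script can tell whether its interpreter is inside a virtualenv, not the actual ensure-twine change:

```python
import sys

# Classic virtualenv sets sys.real_prefix; venv (and virtualenv >= 20) makes
# sys.prefix differ from sys.base_prefix.
in_venv = (getattr(sys, 'real_prefix', None) is not None
           or sys.prefix != getattr(sys, 'base_prefix', sys.prefix))

print('interpreter:', sys.executable)
print('inside a virtualenv:', in_venv)
# Inside a virtualenv, pip refuses --user installs because user
# site-packages are not visible there.
```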
pabelanger | should we append the virtualenv path in: https://opendev.org/zuul/zuul/commit/70ec13a7caf8903a95b0f9e08dc1facd2aa75e84 so OS versions take priority over virtualenv? | 16:06 |
pabelanger | https://opendev.org/zuul/zuul/commit/70ec13a7caf8903a95b0f9e08dc1facd2aa75e84 | 16:06 |
smcginnis | But did someone say ensure-twine may be used by other non-OpenStack/OpenDev deployments of Zuul? In that case, we would be hardcoding an assumption for everyone's environment. | 16:06 |
corvus | pabelanger: is *that* the thing that changed? | 16:06 |
fungi | smcginnis: well, we can (and do in fact) set twine_python in our job definition, so could just update it there | 16:06 |
corvus | smcginnis, fungi: yes, but it would be great if ensure-twine worked, rather than was guaranteed to be broken everywhere :/ | 16:07 |
pabelanger | corvus: I'll have to look when release jobs started failing | 16:07 |
fungi | corvus: i agree | 16:07 |
smcginnis | Wait, is https://opendev.org/zuul/zuul/commit/70ec13a7caf8903a95b0f9e08dc1facd2aa75e84 the cause of these failures? Before 3 weeks ago it wouldn't have picked up ansible's virtualenv path first? | 16:07 |
pabelanger | smcginnis: when did jobs start failing? | 16:07 |
fungi | pabelanger: smcginnis: well, also it would depend on when we restarted our executors, i think | 16:07 |
smcginnis | pabelanger: 16 days ago I think was the first one. | 16:08 |
fungi | which could explain why it was random at first if we had only restarted one or a few executors around the 16th/17th | 16:08 |
pabelanger | yah | 16:08 |
corvus | i'm starting to think we should revert that patch | 16:08 |
corvus | ara should be the exception here, not the rule | 16:08 |
fungi | smcginnis: oh, there was one from the 8th? the earliest i had on record was from the 17th | 16:08 |
fungi | but entirely likely i missed earlier examples | 16:09 |
smcginnis | fungi: I should double check. The first I was aware of was python-solumclient. | 16:09 |
*** altlogbot_0 has quit IRC | 16:09 | |
smcginnis | fungi: OK, sorry, it was on the 17th - http://lists.openstack.org/pipermail/release-job-failures/2019-April/001135.html | 16:09 |
fungi | anyway, we did restart all our executors recently, which could account for why it's now failing consistently | 16:09 |
fungi | perhaps a week ago we had some executors running with 70ec13a and some without? | 16:10 |
corvus | yeah, could happen if they were rebooted | 16:10 |
pabelanger | corvus: revert should be okay, we also landed a patch to ara-report that allows the user to define the path to ara, which our jobs could use until we determine the right approach | 16:12 |
corvus | pabelanger: yeah. perhaps zuul could expose the path to its virtualenv as a variable. i think it would be fine for the ara-report role to expect to look for the ara in the virtualenv, since it's closely tied to zuul | 16:13 |
*** altlogbot_3 has joined #zuul | 16:13 | |
corvus | pabelanger: then it's the thing that has to care about virtualenvs, not everything else | 16:13 |
pabelanger | that seems reasonable | 16:13 |
smcginnis | That could be useful to find out from zuul. | 16:13 |
*** ianychoi has quit IRC | 16:14 | |
corvus | someone want to propose a revert? | 16:14 |
pabelanger | sure | 16:14 |
*** ianychoi has joined #zuul | 16:15 | |
clarkb | corvus: unrelated to the venv stuff, did you see tristanC's report that they cherry-picked my security fix back to 3.6.1 and don't have the memory leak | 16:17 |
clarkb | looking like that may be unrelated | 16:17 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Revert "Prepend path with bin dir of ansible virtualenv" https://review.opendev.org/655491 | 16:19 |
pabelanger | corvus: smcginnis: fungi: ^ | 16:19 |
pabelanger | tobiash: ^fyi since it originally was your patch | 16:19 |
fungi | thanks pabelanger! | 16:20 |
smcginnis | Thanks! | 16:20 |
pabelanger | np | 16:20 |
fungi | heading back to #openstack-infra with related opendev logistics | 16:21 |
mordred | corvus: there's 2x+2 on the revert - but I left off the +A just in case | 16:22 |
corvus | mordred: let's see if tobiash shows up soon | 16:22 |
tobiash | I'll be here in a bit | 16:24 |
corvus | mordred, pabelanger: am i correct in my assumption that if zuul itself were installed in a virtualenv, jobs would see the virtualenv if the operator activated it then ran zuul, but if they just ran "/path/to/venv/zuul-executor", jobs would not? | 16:27 |
tobiash | corvus: correct | 16:28 |
corvus | so, broadly speaking, there are two ways we can go: assume that zuul will either not be installed in a virtualenv, or if it is, that operators will run it directly without activation. then jobs in zuul-jobs can assume a clean (non-virtualenv) environment. with that approach, we should merge the revert. | 16:30 |
corvus | the other approach would be for zuul-jobs to not make any assumptions about virtualenvs and always be defensive | 16:30 |
corvus | however, if we do that, jobs may have difficulty doing things like finding an appropriate python interpreter which is not in a virtualenv | 16:31 |
corvus | i think that's what's causing me to lean toward the assume-non-virtualenv approach | 16:31 |
tobiash | I'd also vote for the revert | 16:34 |
tobiash | and ++ for the idea to put the virtualenv into a variable | 16:34 |
mordred | I also vote for the revert and the assumption that zuul will not be run in an activated virtualenv | 16:35 |
corvus | okay, i think we're good to +A that change now; i will do so | 16:36 |
smcginnis | Executors will need to be restarted after that revert lands to ensure it's picked up, correct? | 16:36 |
mordred | that said - I also think it wouldn't be terrible to be defensive in zuul-jobs - it might be worth testing to see if making an additional explicit virtualenv in a job rather than using --user would work as expected | 16:36 |
corvus | pabelanger: if you have a minute to do the zuul variable change, that'd be great; i'm still swamped. | 16:36 |
pabelanger | yup, I can do that shortly | 16:37 |
clarkb | smcginnis: yes | 16:37 |
smcginnis | k | 16:37 |
pabelanger | mordred: yah, I think that is also a good idea. I've been defaulting to a virtualenv in our zuul jobs over --user so far | 16:39 |
mordred | pabelanger: ++ | 16:39 |
mordred | pabelanger: yah - I think it's overall safer - and in general the jobs know what they installed, so using a tool from a venv in the job isn't terribly onerous | 16:40 |
pabelanger | yah | 16:40 |
mordred | basically - maximize the explicitness | 16:40 |
pabelanger | mordred: also, did you see my ping in the scrollback about blacklisted network modules? | 16:40 |
mordred | NOPE! lemme go look | 16:40 |
*** altlogbot_3 has quit IRC | 16:43 | |
*** manjeets_ is now known as manjeets | 16:43 | |
*** altlogbot_0 has joined #zuul | 16:45 | |
*** mattw4 has joined #zuul | 16:47 | |
AJaeger | team, want to merge https://review.opendev.org/654238 for opendev changes? That fixes testsuite as well and passes finally... | 16:49 |
smcginnis | Post failure on that path revert: http://logs.openstack.org/91/655491/1/check/zuul-quick-start/5a0bfb3/job-output.txt.gz#_2019-04-24_16_49_06_931694 | 16:50 |
smcginnis | Actually, real failure a few lines up from that link. | 16:51 |
AJaeger | smcginnis: see my ping two lines above - that should fix it ;) | 16:52 |
smcginnis | AJaeger: Oh, great! | 16:52 |
*** altlogbot_0 has quit IRC | 16:53 | |
pabelanger | AJaeger: +2 | 16:54 |
pabelanger | corvus: tobiash: fungi: clarkb: if you'd also like to review https://review.opendev.org/654238 | 16:55 |
*** homeski has left #zuul | 16:56 | |
AJaeger | tobiash: want to +A as well? thanks, pabelanger and tobiash for reviewing | 16:56 |
*** altlogbot_2 has joined #zuul | 16:56 | |
corvus | good -- that shows the intermediate registry is fixed too | 16:57 |
tobiash | AJaeger: done | 16:57 |
*** homeski has joined #zuul | 16:57 | |
AJaeger | thanks | 16:58 |
AJaeger | once that is in, we can recheck https://review.opendev.org/#/c/654230 to fix nodepool | 16:58 |
homeski | Are the jobs for https://github.com/ansible-network/ansible-zuul-jobs being moved somewhere else? | 16:59 |
corvus | pabelanger: ^ | 17:00 |
pabelanger | homeski: yes, we've retired that repo in favor of https://github.com/ansible/ansible-zuul-jobs | 17:01 |
homeski | thanks! | 17:02 |
*** jpena is now known as jpena|off | 17:06 | |
pabelanger | so, not directly zuul related, but we've been doing a promote pipeline job to deploy our zuul control plane for zuul.ansible.com. We've been having random job failures for some reason, which I haven't been able to debug properly yet, getting rc: -13 results. | 17:08 |
tobiash | from ansible? | 17:08 |
pabelanger | tobiash: yah | 17:08 |
pabelanger | I'll get a link to a log in a minute | 17:08 |
pabelanger | however, I also wanted to give molecule a try again, and see if that would also expose the issue. So far it hasn't. | 17:09 |
pabelanger | however, switching to molecule has reduced our job run times | 17:09 |
pabelanger | originally ~10:46, with molecule 6:26 | 17:09 |
pabelanger | that is against 16 nodes | 17:10 |
pabelanger | tobiash: https://logs.zuul.ansible.com/periodic-1hr/github.com/ansible-network/windmill-config/master/windmill-config-deploy/62b26f5/bastion/ara-report/result/ad040c37-a769-4be8-a483-9507aee28d6d/ | 17:11 |
pabelanger | tobiash: from talking to core, it has to do with SIGPIPE, but I don't fully understand why yet | 17:12 |
*** jamesmcarthur_ has joined #zuul | 17:13 | |
tobiash | pabelanger: do you have more log than just the return code? | 17:14 |
tobiash | is this an ansible run inside the executor or an ansible run driven by a job? | 17:14 |
*** jamesmcarthur has quit IRC | 17:15 | |
pabelanger | tobiash: no more logs; it's not the executor, it's on a static node (bastion) | 17:16 |
pabelanger | tobiash: it seems limited to 2 servers | 17:17 |
pabelanger | however, servers appear to be healthy | 17:17 |
pabelanger | tobiash: also, a follow up from yesterday: http://paste.openstack.org/show/749715/ | 17:20 |
pabelanger | that is another error on same get pr function | 17:20 |
tobiash | yah, that's definitely worth a retry | 17:21 |
*** mattw4 has quit IRC | 17:28 | |
*** electrofelix has joined #zuul | 17:46 | |
smcginnis | Can https://review.opendev.org/#/c/655491/ be rechecked now that other patches have landed? Or are there still issues with that one? | 17:47 |
*** themroc has joined #zuul | 17:47 | |
pabelanger | smcginnis: I believe so | 17:51 |
smcginnis | OK, I'll give it a kick. Thanks pabelanger | 17:52 |
AJaeger | smcginnis: wait... | 17:52 |
smcginnis | Ah, AJaeger already got it. :) | 17:52 |
AJaeger | smcginnis: I just added recheck - and then remembered that we need to merge https://review.opendev.org/654238 first. It failed py35 and py36 tests ;( | 17:53 |
smcginnis | Oh, I missed that that one failed too. | 17:55 |
AJaeger | the failing tests passed now, so check back in 30+ mins and let's hope it's merged by then | 17:57 |
*** jamesmcarthur_ has quit IRC | 18:08 | |
*** jamesmcarthur has joined #zuul | 18:19 | |
Shrews | corvus: given the recent zuul memleak discovery, you may be interested in https://review.opendev.org/#/c/654599/ | 18:28 |
corvus | Shrews: ++thx | 18:30 |
Shrews | corvus: oh, also https://review.opendev.org/653549 which might cause some thread info to not be output | 18:34 |
corvus | yep. that was borken. | 18:36 |
pabelanger | Error determining manifest MIME type for docker://127.0.0.1:54567/zuul/zuul:latest: pinging docker registry returned: Get https://127.0.0.1:54567/v2/: EOF | 18:46 |
pabelanger | is that new? | 18:46 |
pabelanger | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/job-output.txt.gz#_2019-04-24_18_25_41_793703 | 18:46 |
pabelanger | that ran in vexxhost, so maybe related to ipv6 still | 18:47 |
clarkb | perhaps related to the swift change | 18:49 |
fungi | seems like it certainly could be | 18:53 |
fungi | could be related to swift backend implementation which just went in, i mean | 18:53 |
fungi | is it saying it wants /v2/zuul/zuul instead of just /zuul/zuul? | 18:56 |
*** openstackgerrit has quit IRC | 18:57 | |
pabelanger | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/be894b95-7e79-440c-8592-d19536833162/ is the command we are using | 19:00 |
pabelanger | looking at man page for skopeo, it looks to be correct | 19:01 |
*** electrofelix has quit IRC | 19:03 | |
pabelanger | mordred: happen to have any thoughts?^ | 19:04 |
clarkb | ya we had it working earlier | 19:11 |
clarkb | with the localhost socat proxy thing and everything, which is why i suspect the swift storage for the docker registry | 19:11 |
clarkb | pabelanger: some github bugs imply that that would happen if we have no labels in the registry | 19:16 |
clarkb | skopeo is asking for zuul/zuul:latest and there isn't a :latest ? | 19:16 |
pabelanger | oh, maybe. Is this the first image we are publishing? | 19:16 |
pabelanger | now that we're back with swift | 19:16 |
clarkb | ya though reading it more closely it is copying from the buildset registry to the swift-backed intermediate registry and it is the local one-off buildset registry it is complaining about | 19:17 |
clarkb | so I think we need to look backwards in the job and see when it pushed into the buildset registry and make sure that looks good | 19:17 |
pabelanger | I think that is: http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/913fa7ec-8642-454c-8943-403b5bfd8314/ | 19:18 |
clarkb | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/job-output.txt.gz#_2019-04-24_17_37_36_444962 is the console log side of it | 19:18 |
pabelanger | oh, that is different from what I linked | 19:19 |
clarkb | we need to see the commands for that I think | 19:19 |
pabelanger | checking ara | 19:19 |
pabelanger | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/169531ed-c013-49a4-be1a-000daf3591bc/ | 19:20 |
pabelanger | docker push zuul-jobs.buildset-registry:5000/zuul/zuul:latest | 19:21 |
clarkb | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/169531ed-c013-49a4-be1a-000daf3591bc/ yup found it too | 19:21 |
AJaeger | now the jobs passed for https://review.opendev.org/#/c/654238 | 19:21 |
clarkb | so now let's make sure that zuul-jobs.buildset-registry:5000 and our socat are the same location | 19:21 |
pabelanger | clarkb: is that http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/342e18b4-323d-446d-bf6f-20b03d192af4/ ? | 19:22 |
clarkb | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/342e18b4-323d-446d-bf6f-20b03d192af4/ ya that is the socat side | 19:22 |
clarkb | now to find the /etc/hosts on the remote node side | 19:22 |
pabelanger | Hmm, no output for that | 19:23 |
pabelanger | but also, task didn't change | 19:23 |
clarkb | heh lineinfile doesn't record anything | 19:23 |
clarkb | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ara-report/result/14f98992-7e1d-4874-99c4-f06291d90767/ it changed there | 19:23 |
clarkb | if we do it again later, it not changing is correct because the data is already in there | 19:23 |
pabelanger | okay, cool | 19:24 |
clarkb | pabelanger: maybe we update the role to record /etc/hosts after modifying it? | 19:27 |
clarkb | assuming the two IPs line up, I'm stumped though; not sure why it would error like that | 19:27 |
pabelanger | clarkb: good idea | 19:27 |
pabelanger | clarkb: cat /etc/hosts good enough? | 19:28 |
clarkb | pabelanger: ya it should be | 19:28 |
pabelanger | k, incoming patch | 19:28 |
clarkb | pabelanger: reading more bugs it may also be an authentication issue | 19:32 |
*** openstackgerrit has joined #zuul | 19:33 | |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: Output contents of /etc/hosts after modifying it https://review.opendev.org/655541 | 19:33 |
clarkb | is it possible that ran before the fix for that? | 19:33 |
clarkb | or perhaps we don't have auth fully fixed? | 19:33 |
pabelanger | not sure | 19:33 |
pabelanger | but we have 1 zuul patch in gate now | 19:33 |
pabelanger | so, it has passed check | 19:33 |
*** rlandy is now known as rlandy|biab | 19:33 | |
clarkb | fix for intermediate registry merged at 1600UTC | 19:34 |
clarkb | that job started at 17:25UTC | 19:34 |
clarkb | is it possible that we load the job config when queued and it queued before the fix merged? | 19:34 |
clarkb | pabelanger: maybe we watch it a little more and see if it recurs | 19:36 |
clarkb | and if yes merge your change above | 19:36 |
pabelanger | clarkb: wfm | 19:39 |
pabelanger | btw, I'd love a review on https://review.opendev.org/655204/ we are seeing an issue where, when a new PR is opened in github, the change isn't enqueued by zuul into a pipeline. http://paste.openstack.org/show/749715/ is 1 example of what we see. | 19:41 |
*** openstackgerrit has quit IRC | 19:42 | |
AJaeger | we have unstable tests ;( https://review.opendev.org/654238 is failing a second time in gate - now again on py35, see http://logs.openstack.org/38/654238/3/gate/tox-py35/eaf6262/ ;( | 19:53 |
pabelanger | we lost connection to gearman | 19:55 |
pabelanger | maybe a slow node | 19:55 |
pabelanger | maybe we should start dstat and collect some extra info | 19:55 |
corvus | clarkb: what job started at 1725? | 20:12 |
clarkb | corvus: http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/ | 20:13 |
corvus | that can't be right | 20:13 |
clarkb | http://logs.openstack.org/38/654238/3/check/zuul-build-image/5c232a5/job-output.txt.gz#_2019-04-24_17_25_33_430011 | 20:14 |
corvus | oh... now i see it | 20:14 |
corvus | so that one was probably due to AJaeger's recheck at 17:23 | 20:15 |
clarkb | hrm that is after the fix merged | 20:15 |
corvus | yep | 20:15 |
corvus | the run *before* that was after the fix merged too | 20:15 |
corvus | AJaeger: issued a recheck at 1601 | 20:15 |
corvus | (and i believe that was intentionally right after the fix merged) | 20:15 |
corvus | also, zuul-upload-image (which, for this purpose, is nearly the same as zuul-build-image) worked too in the gate run | 20:16 |
corvus | so there were 2 passes of the "build image" job before that failure | 20:16 |
corvus | 2 passes with the new registry config | 20:17 |
corvus | fwiw, this should be over ipv6 | 20:18 |
corvus | i think we should hold nodes for zuul-build-image and zuul-upload-image | 20:19 |
clarkb | wfm | 20:19 |
AJaeger | https://review.opendev.org/654238 is currently running zuul-build-image... | 20:20 |
corvus | autoholds set | 20:20 |
*** themroc has quit IRC | 20:20 | |
* AJaeger calls it a day - good night! | 20:21 | |
*** mattw4 has joined #zuul | 20:22 | |
*** pcaruana has quit IRC | 20:39 | |
*** rlandy|biab is now known as rlandy | 20:40 | |
clarkb | I've scanned the list of commits between when we weren't leaking memory in the scheduler and when we were. I think the most likely culprits are 9f7c642a 6d5ecd65 4a7894bc 3704095c but even then none of them stand out to me (note I've left off the security fix because tristanC reported it doesn't cause memory issues on their cherry-picked version) | 21:01 |
pabelanger | I'm trying to get some monitoring up on zuul.a.c so I can add another metric too | 21:02 |
corvus | clarkb: if the memory leak is related to layouts, system size will affect it | 21:06 |
clarkb | corvus: thats a good point. We are way more active so may trigger it more quickly | 21:07 |
corvus | and the data structure sizes are larger | 21:07 |
clarkb | ok so those 4 + the security fix :) | 21:08 |
corvus | clarkb: you can strike 4a7894bc -- it's a test-only change | 21:11 |
clarkb | ah yup | 21:11 |
clarkb | (in my brain I had an idea that yaml could be the thing leaking but not if it is only in tests) | 21:11 |
corvus | clarkb: are we seeing jumps? if so, correlating jumps with either gate resets or reconfigurations would point toward 3704095c | 21:12 |
corvus | (of course, the same applies to the security fix) | 21:13 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all ya there was a big jump about 1850UTC today | 21:13 |
corvus | yeah, but that's stretched over a long period of time... that could just be activity related | 21:14 |
clarkb | fwiw I'm not in a great spot to debug this further right now. Too much pre travel prep going on. Also I assume we want to dig in with the objgraph stuff? | 21:46 |
clarkb | corvus: maybe we can use board meeting "quiet" time to do that? | 21:46 |
corvus | i would be happy to help with pointers | 21:47 |
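For reference, the kind of objgraph session being discussed might look like the sketch below; it is illustrative only, and the 'Layout' type name is just an example of a Zuul object that could be leaking:

```python
import objgraph

# Which object types dominate the heap right now?
objgraph.show_most_common_types(limit=10)

# Record a baseline, let the scheduler process events for a while,
# then print which types grew.
objgraph.show_growth(limit=10)
# ... wait ...
objgraph.show_growth(limit=10)

# If one type keeps growing, dump what is keeping an instance alive.
suspects = objgraph.by_type('Layout')
if suspects:
    objgraph.show_backrefs(suspects[-1], max_depth=4,
                           filename='layout-backrefs.dot')
```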
corvus | clarkb: as long as we're restarting anyway though, if any of the patches on your shortlist are non-critical for opendev, we could manually apply reverts and restart | 21:47 |
corvus | that could get us a lot of information at little cost in personnel time | 21:47 |
clarkb | thats a good point. I think the security fix is probably critical but some of the others may not be | 21:48 |
corvus | i think we can live without the allow nonexistent project one (as long as we don't have any current config errors in any tenant, it's safe to remove) | 21:48 |
clarkb | "Tolerate missing project" "Forward artifacts to child jobs within buildset" "Centralize job canceling" | 21:49 |
clarkb | we can probably live without centralize job cancelling too | 21:49 |
clarkb | 9f7c642a and 3704095c | 21:49 |
corvus | re artifacts -- we are probably not taking advantage of that yet, so it's probably okay to remove, but that's less certain | 21:50 |
*** rlandy is now known as rlandy|bbl | 22:00 | |
corvus | what in the world is this error? http://logs.openstack.org/91/655491/1/check/zuul-quick-start/4a2e58e/ara-report/result/f5c97f96-c38d-4017-96be-a8c767e4d4bf/ | 22:05 |
SpamapS | corvus: did the canonical name get hardcoded instead of using zuul.something maybe? | 22:08 |
corvus | SpamapS: oh, huh i didn't even notice that, thanks. that makes a lot more sense now. | 22:08 |
corvus | fungi, clarkb: we probably need https://review.opendev.org/654238 before the executor change can land | 22:09 |
corvus | SpamapS: ^ i think that has the fix you are implying | 22:09 |
clarkb | noted | 22:09 |
fungi | ugh | 22:10 |
fungi | shall i enqueue it too and promote? | 22:10 |
clarkb | probably should if we expect the other to merge | 22:11 |
corvus | it failed again with the registry | 22:11 |
fungi | oh, so it did. do we expect that's fixed now? | 22:12 |
clarkb | I don't think so but we have holds in place for it now | 22:12 |
corvus | no, we don't know why it's sometimes failing. we set autoholds, but i don't see that any were triggered | 22:13 |
corvus | trying to figure out why now | 22:14 |
corvus | http://paste.openstack.org/show/749728/ | 22:17 |
clarkb | hrm we deleted the node before we could hold it? | 22:19 |
corvus | 2019-04-24 20:35:28,383 DEBUG nodepool.DeletedNodeWorker: Marking for deletion unlocked node 0005637791 (state: in-use, allocated_to: 300-0003033704) | 22:20 |
corvus | 2019-04-24 20:36:19,020 INFO nodepool.NodeDeleter: Deleting ZK node id=0005637791, state=deleting, external_id=7a67b97b-c65f-44b1-a51f-ad2c9715081c | 22:20 |
corvus | almost an hour before the autohold | 22:20 |
corvus | that would explain the failure to connect to the buildset registry if the node was gone | 22:23 |
corvus | zuul never logged unlocking it until the end | 22:24 |
corvus | so i don't know why nodepool thought it was unlocked | 22:24 |
clarkb | was there a zk disconnect during that time period? | 22:25 |
corvus | yes | 22:28 |
corvus | clarkb: do you think this could be the cause of the mysterious "connection issues"? | 22:28 |
corvus | we observed that they came in batches, which is exactly what you would expect from a mass node deletion event | 22:29 |
corvus | clarkb: the ultimate cause may be the memory leak | 22:29 |
clarkb | ya I think this may explain it | 22:29 |
corvus | we're *very* sensitive to swapping | 22:29 |
corvus | and the last time there was a memleak we saw similar behavior | 22:30 |
clarkb | I should write a big sign on my office wall that says "Check if zk disconnected and if it did: are there memory leaks" | 22:30 |
clarkb | then maybe I won't neglect to look at it | 22:30 |
corvus | may i have a copy? | 22:30 |
clarkb | yes | 22:30 |
corvus | back to -infra | 22:31 |
*** openstackgerrit has joined #zuul | 23:30 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Revert "Prepend path with bin dir of ansible virtualenv" https://review.opendev.org/655491 | 23:30 |