Tuesday, 2022-10-18

-@gerrit:opendev.org- Ian Wienand proposed:04:30
- [zuul/zuul-jobs] 861587: Pin sphinx to 5.2.3 https://review.opendev.org/c/zuul/zuul-jobs/+/861587
- [zuul/zuul-jobs] 854933: linter: Use capitals for names https://review.opendev.org/c/zuul/zuul-jobs/+/854933
- [zuul/zuul-jobs] 861559: Fix ansible-lint name[template] https://review.opendev.org/c/zuul/zuul-jobs/+/861559
- [zuul/zuul-jobs] 861560: Add names to include tasks https://review.opendev.org/c/zuul/zuul-jobs/+/861560
- [zuul/zuul-jobs] 861562: Standarise block/when ordering https://review.opendev.org/c/zuul/zuul-jobs/+/861562
- [zuul/zuul-jobs] 861563: Update to ansible-lint 6.8.2 https://review.opendev.org/c/zuul/zuul-jobs/+/861563
- [zuul/zuul-jobs] 861588: [wip] sphinx circular dependencies error https://review.opendev.org/c/zuul/zuul-jobs/+/861588
- [zuul/zuul-jobs] 861698: Workaround stevedore/python3.7 issues https://review.opendev.org/c/zuul/zuul-jobs/+/861698
@jim:acmegating.comzuul-maint: how does this look for a zuul release?  commit 61d01732afcd2c55e901876c1bb0408846e4f21b (tag: 7.1.0)14:54
@jim:acmegating.com(opendev restarted on that over the weekend)14:54
@fungicide:matrix.org"Fix pipeline event starvation due to cleanup locks" is the only thing that has merged since, and doesn't look critical to include15:29
@jim:acmegating.comyeah, it's a low-probability / low-consequence bug that's been around for a while.15:33
@mbecker12:matrix.org> <@avass:vassast.org> We're seeing ansible `command` modules getting stuck before executing in like 1/100 jobs, and when that happens in cleanup it's really bad because we've got jobs hung indefinitely unless they're dequeued and the process on the executor is killed.17:50
> I've been digging for a while and from what I can see the on_task_start callback runs to print the task banner, but the build node never seems to run the actual task so ansible-playbook never exits. But I can't find any obvious race condition in logging, or in the command module anywhere. Is anyone else experiencing something similar?
Hi, we're seeing a bunch of stuck jobs again ever since updating to zuul 7, somewhat reminiscent of what Albin wrote here.
Feels like it can happen to any ansible play or module. It looks like it never gets executed and never gets past printing the task name. I can't really find any information about the execution and don't really know how to debug it due to this.
I remember that after this change https://review.opendev.org/c/zuul/zuul/+/849795 on a previous zuul version, we saw fewer of these events.
Does anyone see similar behavior or have any pointers on how to debug this?
@clarkb:matrix.orgWe reverted that libc install from testing because debian backported the fix to bullseye. OpenDev hasn't noticed this problem returning since doing that. More likely the changes are probably related to being on newer ansible?17:52
@clarkb:matrix.orgOpenDev ran on Ansible 5 and discovered that connection pipeline configuration isn't read as expected there, but the tasks still ran just a bit slower potentially. We've since switched to Ansible 6. Don't recall seeing problems like that on either version of newer ansible though17:53
@mbecker12:matrix.orgAh yeah, I remember reading about the libc backporting here somewhere. We have been running ansible 5 for a while now, even before upgrading to zuul 7 18:02
@clarkb:matrix.orgmbecker12: are the tasks that get stuck of a consistent type?18:09
@mbecker12:matrix.orgit's hard to find a pattern. I saw it happen in all phases (run, pre-/post-run, cleanup), in various tasks. Maybe the majority of events occurred in tasks running the `command` module but I also saw it in the upload-logs-s3 role for example18:15
@clarkb:matrix.orgmbecker12: stracing the processes might help (I believe this is how Albin Vass caught the prior issue). Maybe also enabling ansible debug output to see if ansible itself logs the issue?18:39
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan:18:55
- [zuul/zuul-client] 861477: Update docker images to python 3.10 https://review.opendev.org/c/zuul/zuul-client/+/861477
- [zuul/zuul-client] 861667: Remove python3.6 testing https://review.opendev.org/c/zuul/zuul-client/+/861667
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul-operator] 861478: Update the docker images to python 3.10 https://review.opendev.org/c/zuul/zuul-operator/+/86147819:07
@jim:acmegating.comstrace for a start, and i believe gdb was involved at some point.  that would require adding in the python debug packages to get the symbols (you could build a custom image, or just log in as root and apt-get install them)19:09
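A minimal sketch of the strace step suggested above, assuming a Linux executor host with `strace` installed and a stuck process whose command line contains `ansible-playbook`. The helper names and the output location are hypothetical; `-f`, `-tt`, `-p` and `-o` are standard strace options. It only builds the command and does not attach to anything by itself.

```python
from pathlib import Path


def find_pids(needle, proc_root="/proc"):
    """Return sorted PIDs whose command line contains `needle` (Linux /proc layout)."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if needle in cmdline:
            pids.append(int(entry.name))
    return sorted(pids)


def strace_command(pid, out_dir="/tmp"):
    """Build an strace invocation that attaches to `pid` and follows its children."""
    return [
        "strace",
        "-f",            # follow forked child processes
        "-tt",           # timestamp each syscall
        "-p", str(pid),  # attach to the running process
        "-o", f"{out_dir}/strace-{pid}.txt",
    ]
```

Passing each PID from `find_pids("ansible-playbook")` through `strace_command` gives a command line that can be run as root on the executor while a job is hung.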
-@gerrit:opendev.org- MICHAEL KELLY proposed:19:50
- [zuul/zuul-operator] 861279: bug: Select scheduler pod based on instance name on update https://review.opendev.org/c/zuul/zuul-operator/+/861279
- [zuul/zuul-operator] 853592: Allow the specification of storageClassName in PVCs https://review.opendev.org/c/zuul/zuul-operator/+/853592
- [zuul/zuul-operator] 853695: Prefix zuul-specific resources with instance name https://review.opendev.org/c/zuul/zuul-operator/+/853695
- [zuul/zuul-operator] 853696: Prefix nodepool specific resources with instance name https://review.opendev.org/c/zuul/zuul-operator/+/853696
- [zuul/zuul-operator] 861488: helm: Add a basic helm chart for zuul-operator https://review.opendev.org/c/zuul/zuul-operator/+/861488
@fungicide:matrix.orgClark: very minor q on https://review.opendev.org/86147919:54
@fungicide:matrix.orgoh. never mind. looking closer at zuul's feedback on the earlier patchset answers it for me19:56
@jim:acmegating.compushed 7.1.020:07
@clarkb:matrix.orgfungi: ya I'm not sure how that got through previously but the line was just too long so I shortened it20:16
@clarkb:matrix.orghrm the zuul-registry change hit that netifaces thing with our container image builds20:24
@clarkb:matrix.orgcorvus: ^ the tldr on that is we do a two stage docker image build. The first stage installs all the build requirements (including gcc in this case) and we build wheels for all the deps. Then the second stage copies over the wheels and is supposed to install them via the wheel cache. However, it seems like pip is choosing to pick an sdist instead of the wheels. The theories I had for why that is happening are that maybe pip updated and changed behavior or maybe we're not copying the wheels like we think we are20:25
@clarkb:matrix.orgI've been meaning to look at it more closely (nodepool and zuul-registry both exhibit it I guess) but haven't had time with the ptg and other things20:26
@jim:acmegating.comClark: is it intermittent?20:26
@clarkb:matrix.orgthat I'm not sure of yet. It would only affect changes that had wheels that need to be built20:27
@clarkb:matrix.orglet me push a noop change to zuul-registry and nodepool to gather more info on that aspect20:27
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 861797: DNM checking docker image builds https://review.opendev.org/c/zuul/nodepool/+/86179720:29
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul-registry] 861798: DNM testing docker image builds https://review.opendev.org/c/zuul/zuul-registry/+/86179820:30
@clarkb:matrix.orgthings I note. The assemble script does seem to update pip to latest and there was a pip release on october 1520:38
@jim:acmegating.comso if noop test fails, we can try pinning20:43
@fungicide:matrix.orgi don't see anything in the pip 22.3 release notes which would explain the behavior change: https://pip.pypa.io/en/stable/news/#v22-320:45
@clarkb:matrix.orgI'm also trying to figure out why we pass nodepool_base to install-from-bindep. As far as I can tell $1 is ignored in that script?20:48
@clarkb:matrix.orgevery pip command in install-from-bindep uses --cache-dir=/output/wheels so the wheels should be there if pip wrote them in the first place20:48
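One way to check the "are the wheels actually there" theory is to look for a matching wheel file under the cache directory. This is a sketch under assumptions: `cached_wheels` and `normalize` are hypothetical helpers, and the only things relied on are that wheel file names start with `<distribution>-<version>-` and that project names compare equal after PEP 503 normalization. It does not reproduce pip's own cache lookup logic.

```python
import re
from pathlib import Path


def normalize(name):
    """PEP 503 normalization: runs of '-', '_' and '.' are equivalent, case is ignored."""
    return re.sub(r"[-_.]+", "-", name).lower()


def cached_wheels(cache_dir, project):
    """List wheel files for `project` found anywhere under `cache_dir`."""
    wanted = normalize(project)
    hits = []
    for wheel in Path(cache_dir).rglob("*.whl"):
        # The first dash-separated field of a wheel file name is the distribution.
        distribution = wheel.name.split("-", 1)[0]
        if normalize(distribution) == wanted:
            hits.append(wheel)
    return sorted(hits)
```

Running `cached_wheels("/output/wheels", "netifaces")` in the second build stage would show whether a wheel was copied over at all, which separates "wheel missing" from "wheel present but not picked by pip".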
@clarkb:matrix.orgok my zuul-registry change doesn't reproduce for two reasons. zuul-registry is on python3.8 before my change and netifaces has a python3.8 wheel on pypi that is used instead. And the job runs on focal which doesn't support podman but we do podman testing for the registry in that job I think and it fails. This is why I updated to jammy in that change. 20:52
@clarkb:matrix.orgnodepool does reproduce so I expect this is a consistent issue for python3.10 based images and I suspect something with the recent pip release20:52
@clarkb:matrix.organd it would only happen for installs that need gcc or some other wheel built that cannot build out of the base environment.20:53
@clarkb:matrix.orgI'm going to make a netifaces wheel in my local cache and see if I can install using it as expected.20:54
@clarkb:matrix.orgya this seems trivially reproducible. Using pip 22.3 `pip install --cache-dir=/some/cache/dir netifaces` then run the same command out of a different venv and it says `  Using cached netifaces-0.11.0.tar.gz (30 kB)` instead of the wheel21:02
@fungicide:matrix.organd versions of pip <22.3 don't exhibit the same behavior even with python 3.10?21:05
@clarkb:matrix.orgI haven't tested that yet21:06
@clarkb:matrix.orgI'll do that now21:06
@clarkb:matrix.orghrm my testing may have been broken because I didn't install the wheel library. I'll start over with that installed21:09
@clarkb:matrix.orgok installing the wheel lib made this act a bit more like I expected. I believe that pip 22.2.2 is not compatible with the wheel cache produced by pip 22.321:16
@jim:acmegating.comwe did build the zuul image for the tag (i don't think that's a surprise since there's no netifaces).  just fyi.21:16
@clarkb:matrix.orgWhat is happening is we upgrade to pip 22.3 in the assemble script's venv and produce a wheel cache with that version of pip. Then when we install-from-bindep we use the global pip install from the base python images which currently have pip 22.2.2 not 22.321:17
@clarkb:matrix.orgI think that means we either need to upgrade pip in install-from-bindep or not update pip in assemble21:17
@jim:acmegating.comi wonder if the assemble upgrade is vestigial21:17
@clarkb:matrix.organd all of this is reproducible using python3.10, virtualenvs, and various manipulations of the pip version21:17
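The reproduction described above can be sketched as a list of commands: two virtualenvs with different pip versions sharing one cache directory. This is an assumption-laden sketch, not the exact steps that were run: `repro_commands`, the directory layout and the `bin/python` path (POSIX venv layout) are hypothetical, and the function only builds the commands without executing them.

```python
from pathlib import Path


def repro_commands(workdir, builder_pip="22.3", installer_pip="22.2.2",
                   package="netifaces", python="python3"):
    """Build commands for two venvs that share a single pip cache directory."""
    work = Path(workdir)
    cache = work / "cache"
    steps = []
    for name, pip_version in (("builder", builder_pip), ("installer", installer_pip)):
        venv = work / name
        venv_python = str(venv / "bin" / "python")
        # Create the venv, then pin pip and install wheel so pip can build wheels.
        steps.append([python, "-m", "venv", str(venv)])
        steps.append([venv_python, "-m", "pip", "install", f"pip=={pip_version}", "wheel"])
        # Install the package through the shared cache; the second venv should
        # report a cached wheel if the cache is usable across pip versions.
        steps.append([venv_python, "-m", "pip", "install", f"--cache-dir={cache}", package])
    return steps
```

Each returned command can be passed to `subprocess.run(cmd, check=True)` in order; the line to watch in the second install's output is whether pip reports a cached `.whl` or a cached `.tar.gz`.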
@clarkb:matrix.orgI'm going to cross check what a virtualenv produces in our base images as it is possible the pip version in a virtualenv does not match the global install (unlikely but possible)21:19
@jim:acmegating.comoh, the pip "upgrade" is really a pip install in a venv just in order to get a clean build env21:19
@clarkb:matrix.orgyup21:19
@jim:acmegating.combut still seems like maybe we could omit it?21:20
@clarkb:matrix.orgyes it appears the global pip on the image has the same version as the pip that a virtualenv produces21:20
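A small sketch of that kind of cross check: compare the version reported by `pip --version` in two environments. The helper names are hypothetical, and the only assumption about pip is that its `--version` output starts with `pip <version>`.

```python
import re


def parse_pip_version(output):
    """Extract the version from `pip --version` output such as 'pip 22.3 from ...'."""
    match = re.match(r"pip (\S+)", output.strip())
    if match is None:
        raise ValueError(f"unrecognized pip --version output: {output!r}")
    return match.group(1)


def same_pip(first_output, second_output):
    """Return True when both `pip --version` outputs report the same version."""
    return parse_pip_version(first_output) == parse_pip_version(second_output)
```

Feeding it the output of `pip --version` from the global install and from a fresh virtualenv gives a quick yes/no answer on whether the two can diverge.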
@clarkb:matrix.orgI think we can just avoid upgrading. The next thing I'm going to test before pushing that change is if 22.3 can find wheels made by 22.2.2 in a wheel cache. If so then that should be extra safe21:21
@jim:acmegating.cominstalling 'build' may need to be retained based on commit comments21:22
@clarkb:matrix.orgyes we need to install wheel and build both.21:22
@clarkb:matrix.orgBut I think we can avoid using -U and omit pip21:22
@clarkb:matrix.org22.3 does not find wheels produced by 22.2.2 so this is neither forward nor backward compatible cache content21:22
@jim:acmegating.com5e12438e0d is the commit that added pip and wheel21:22
@jim:acmegating.comoh wait no it goes further back than that21:23
@fungicide:matrix.orgyou may want to test without wheel and build just to be sure. i know pip installs pep-517 build requirements (i.e. setup_requires) in a separate isolated ephemeral environment, but that may only be when using pyproject.toml integration21:23
@clarkb:matrix.orgfungi: without wheel installed it always runs setup.py21:23
@clarkb:matrix.orgthis was why my initial testing was invalid21:24
@clarkb:matrix.orgthere wasn't a wheel to use either way21:24
@jim:acmegating.com0e1cd6ee85 is the commit.  so it's been in there from the start.21:24
@jim:acmegating.comso yeah, i think skip pip, include wheel/build sounds good21:24
@fungicide:matrix.orgalso, as of pip 22.3, you no longer need it installed in the venv. you can just run an external version of it, possibly even one inside another venv, and tell it where the environment is you want it to manage for that invocation21:25
@fungicide:matrix.orgthe pypa lot have been working toward making it so that your venv no longer needs any of the setup requires or package management tools installed in a venv21:26
-@gerrit:opendev.org- MICHAEL KELLY proposed: [zuul/zuul-jobs] 861799: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/86179921:26
@fungicide:matrix.org * the pypa lot have been working toward making it so that your venv no longer needs any of the setup requires or package management tools installed inside it21:26
@fungicide:matrix.orgalso they added a zipapp distribution of pip, so that you can download and run it without needing to install it anywhere (since the python interpreter can directly load zipapps)21:27
@clarkb:matrix.orgfungi: I wonder if not making the cache forward/backward compatible was intentional21:27
@clarkb:matrix.orgor at least known. They didn't call it out in the release notes21:28
@fungicide:matrix.orgthat's entirely possible, but i would have expected that to be mentioned in the release notes21:28
-@gerrit:opendev.org- MICHAEL KELLY proposed: [zuul/zuul-operator] 861488: helm: Add a basic helm chart for zuul-operator https://review.opendev.org/c/zuul/zuul-operator/+/86148821:30
@clarkb:matrix.orgremote:   https://review.opendev.org/c/opendev/system-config/+/861800 Stop updating pip in our docker assemble script21:35
@jim:acmegating.comClark: does opendev have sufficient testing of that?  you could test that by, once the check build finishes, change the nodepool FROM line to insecure-ci-registry/....21:37
@clarkb:matrix.orgya doesn't look like we really test that on the opendev side. Good idea. Should be able to push up an update to my noop nodepool change to switch out the FROM lines and have that test it21:38
@clarkb:matrix.orgI'm going to work on cleaning up some of my local debugging so that I can write a pip bug if for no other reason than to document the change of behavior with them21:39
@clarkb:matrix.orghttps://github.com/pypa/pip/issues/1152722:01
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 861797: DNM checking docker image builds https://review.opendev.org/c/zuul/nodepool/+/86179722:04
@clarkb:matrix.orgThat updates the change to use the CI registry images that were just built22:04
-@gerrit:opendev.org- MICHAEL KELLY proposed: [zuul/zuul-jobs] 861799: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/86179922:18
@clarkb:matrix.orgThere is good and bad news. The good news is that opendev's uwsgi image builds actually do test this for us. The bad news is those builds and nodepool's still fail22:29
@clarkb:matrix.orgWhich now has me wondering how this ever worked22:29
-@gerrit:opendev.org- MICHAEL KELLY proposed: [zuul/zuul-jobs] 861799: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/86179922:32
-@gerrit:opendev.org- MICHAEL KELLY proposed: [zuul/zuul-operator] 861488: helm: Add a basic helm chart for zuul-operator https://review.opendev.org/c/zuul/zuul-operator/+/86148822:39
-@gerrit:opendev.org- MICHAEL KELLY proposed: [zuul/zuul-jobs] 861799: helm: Add job for linting helm charts https://review.opendev.org/c/zuul/zuul-jobs/+/86179922:42
@clarkb:matrix.orgI need to step away from this now. I've been awake for far too many hours and my brain is mush. Maybe pypa will have some feedback on my issue that will help figure out what is going on. I'm also happy for another set of eyes to look at it (in fact I think at this point that is a good idea)22:45
@michael_kelly_anet:matrix.orgTrying to figure out the failure in `zuul-tox-docs` for https://review.opendev.org/c/zuul/zuul-jobs/+/861799 - not clear exactly why it's failing.  The underlying sphinx command it's running seems to also fail against previous versions when I run it locally...?23:12
@michael_kelly_anet:matrix.orgAnyone able to give me a clue here?23:15
@jim:acmegating.comMichael Kelly: possibly fixed by https://review.opendev.org/861587 which will need to be rebased off of its parent in order to merge (cc ianw )23:19
@michael_kelly_anet:matrix.orgfun times.  I'll rebase onto that and see what I get23:20
@iwienand:matrix.orgcorvus: Michael Kelly ahh yes sorry -- i meant to look at that more closely yesterday but got sidetracked into another zuul-jobs problem with python3.7 and stevedore!23:20
@michael_kelly_anet:matrix.orgThanks corvus 23:20
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 861587: Pin sphinx to 5.2.3 https://review.opendev.org/c/zuul/zuul-jobs/+/86158723:21
@iwienand:matrix.org^ that is the quick fix until i figure out exactly what it doesn't like about the includes23:21
@michael_kelly_anet:matrix.orgianw: such is life :)23:21
@jim:acmegating.comianw: lgtm, thx23:22
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 861698: Workaround stevedore/python3.7 issues https://review.opendev.org/c/zuul/zuul-jobs/+/86169823:25
@iwienand:matrix.org^ that is the stevedore gate-breaker -- but only for things that trigger buster (python 3.7 really) testing23:25
@iwienand:matrix.orghttps://review.opendev.org/q/topic:ansible-lint-6.8.2 is on top of those -- the -1's are related to the prior -- but are otherwise no-op linter fixup things.  reviews on those welcome too 23:27
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 861587: Pin sphinx to 5.2.3 https://review.opendev.org/c/zuul/zuul-jobs/+/86158723:48
-@gerrit:opendev.org- Ian Wienand proposed:23:51
- [zuul/zuul-jobs] 861698: Workaround stevedore/python3.7 issues https://review.opendev.org/c/zuul/zuul-jobs/+/861698
- [zuul/zuul-jobs] 854933: linter: Use capitals for names https://review.opendev.org/c/zuul/zuul-jobs/+/854933
- [zuul/zuul-jobs] 861559: Fix ansible-lint name[template] https://review.opendev.org/c/zuul/zuul-jobs/+/861559
- [zuul/zuul-jobs] 861560: Add names to include tasks https://review.opendev.org/c/zuul/zuul-jobs/+/861560
- [zuul/zuul-jobs] 861562: Standarise block/when ordering https://review.opendev.org/c/zuul/zuul-jobs/+/861562
- [zuul/zuul-jobs] 861563: Update to ansible-lint 6.8.2 https://review.opendev.org/c/zuul/zuul-jobs/+/861563
- [zuul/zuul-jobs] 861588: [wip] sphinx circular dependencies error https://review.opendev.org/c/zuul/zuul-jobs/+/861588
