@jsoo1:matrix.org | fungi: sorry, flaked again | 05:20 |
---|---|---|
-@gerrit:opendev.org- Benjamin Schanzel proposed: [zuul/zuul] 937716: WIP: Allow pinning pipelines on status page https://review.opendev.org/c/zuul/zuul/+/937716 | 07:37 | |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 938783: Remove config object freezing https://review.opendev.org/c/zuul/zuul/+/938783 | 10:40 | |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 938783: Remove config object freezing https://review.opendev.org/c/zuul/zuul/+/938783 | 12:57 | |
@fungicide:matrix.org | jsoo1: yes, this time it looks like the integration test job ran into a problem creating a gerrit user, though i didn't find any obvious reason comparing the transcript side-by-side with the gerrit container log from it | 13:50 |
@jsoo1:matrix.org | fungi: same thing. Maybe there's a reproducible bug? | 14:43 |
@fungicide:matrix.org | yeah, sort of odd to see that crop up twice in a row. those two builds ran in different cloud providers on entirely different continents even, so probably not a performance-induced race condition in the test setup | 14:58 |
@fungicide:matrix.org | but also zuul-nox-py312 is failing on the most recent recheck | 14:59 |
@fungicide:matrix.org | and the py312 job failed a couple of tests because ansible deemed ssh on 127.0.0.1 unreachable for some reason | 15:21 |
@fungicide:matrix.org | that's extra strange | 15:21 |
@jsoo1:matrix.org | Oh is that because on those hosts localhost is ipv6 only? | 16:03 |
@fungicide:matrix.org | mmm, unlikely, i think all our systems are configured dual-stack. i'll double-check though... | 16:04 |
@fungicide:matrix.org | https://zuul.opendev.org/t/zuul/build/94503031fb3e46e1a4eaaa425344e817/log/zuul-info/zuul-info.ubuntu-noble.txt#20-25 says the node used for that build had both 127.0.0.1/8 and ::1/128 bound to the lo interface | 16:06 |
@clarkb:matrix.org | note ansible says "unreachable" for any ssh failures up to and including "ansible actually ssh'd into the remote node but failed to copy its remote payload over to tmp due to permissions or space issues" | 16:12 |
@clarkb:matrix.org | they basically treat application errors as connectivity errors in the way they are reported which can lead to confusion | 16:13 |
@fungicide:matrix.org | yeah, i was unsuccessful in finding any more specific error in the log, i think ansible hides it away | 16:15 |
@fungicide:matrix.org | so it could have been a dns lookup error, a full /tmp, or any number of other causes | 16:16 |
@fungicide:matrix.org | looks like we've hit that gerrit account setup failure thrice in sequence, and i see another change failing in check with the same as well. i wonder if something changed with the gerrit image... | 16:58 |
@fungicide:matrix.org | looks like the quickstart compose file just grabs docker.io/gerritcodereview/gerrit unversioned | 17:00 |
@fungicide:matrix.org | if i'm reading the dockerhub interface correctly, the last time those images changed was 2024-12-02, over a month ago | 17:02 |
@clarkb:matrix.org | There were the changes to gerrit permissions not allowing you to push acls, but corvus addressed that and this sounds different | 17:27 |
@fungicide:matrix.org | yeah, this seems to have just cropped up in the past 24 hours. side-by-side comparisons from earlier gerrit.log files are also not yielding any obvious differences | 17:28 |
@fungicide:matrix.org | looks like it started to happen consistently around 22:45 utc, a prior run just before 20:00 succeeded | 17:30 |
@fungicide:matrix.org | the past 6 failures listed at https://zuul.opendev.org/t/zuul/builds?job_name=zuul-quick-start&skip=0 seem to all have that as the cause (the stray failure at 19:25 yesterday looks unrelated) | 17:32 |
@clarkb:matrix.org | https://zuul.opendev.org/t/zuul/build/94eee2f75f1c49dda27a2f1d7459f11b/log/container_logs/gerritconfig.log this is the container that creates the zuul account in gerrit | 17:40 |
@clarkb:matrix.org | based on that log it is stuck waiting for gerrit to be up | 17:40 |
@clarkb:matrix.org | the gerrit log indicates it is up though so maybe our detection mechanism for that is broken | 17:40 |
@fungicide:matrix.org | yeah, it never gets past waiting for gerrit to come up, but the gerrit.log shows gerrit thinks it fully started | 17:40 |
@clarkb:matrix.org | we wait for port 29418 to be listening on the host | 17:40 |
@clarkb:matrix.org | https://zuul.opendev.org/t/zuul/build/94eee2f75f1c49dda27a2f1d7459f11b/log/container_logs/gerrit.log#154 indicates that it should be listening but maybe this is related to the localhost connectivity problems too | 17:41 |
@clarkb:matrix.org | might make sense to hold the node and see what is going on | 17:43 |
@clarkb:matrix.org | note this isn't using host networking; it's looking up the gerrit container via the container networking | 17:44 |
@fungicide:matrix.org | yeah, i was trying to work out what additional logs we could/should collect in that job, but maybe a hold is more straightforward for now | 17:44 |
@jim:acmegating.com | it's not stuck waiting for it to start; it's failing to create the user: https://zuul.opendev.org/t/zuul/build/94eee2f75f1c49dda27a2f1d7459f11b/console | 17:44 |
@clarkb:matrix.org | huh so the container log is incomplete? | 17:45 |
@fungicide:matrix.org | i'm thinking it must not flush regularly | 17:45 |
@jim:acmegating.com | oh you're looking at https://zuul.opendev.org/t/zuul/build/3bf117748cb7479ba83b73b4fc99368e/log/container_logs/gerritconfig.log#11 | 17:45 |
@jim:acmegating.com | yeah that's waiting for gerrit to start :) | 17:45 |
@clarkb:matrix.org | corvus: yes and that playbook is the one that creates the zuul user | 17:46 |
@clarkb:matrix.org | my assumption was that we never create the user because we don't recognize gerrit as up | 17:46 |
@fungicide:matrix.org | right, we started from trying to figure out why the account creation never seems to go through | 17:46 |
@jim:acmegating.com | yeah, so the outside playbook has decided that gerrit started, but not the inside one | 17:46 |
@jim:acmegating.com | i agree that smells like container connectivity issues | 17:47 |
@fungicide:matrix.org | in that case, i wonder if the name resolution issues in https://zuul.opendev.org/t/zuul/build/94503031fb3e46e1a4eaaa425344e817 could have a similar underlying cause | 17:48 |
@fungicide:matrix.org | "ssh: Could not resolve hostname test_node: Name or service not known" | 17:49 |
@clarkb:matrix.org | I wonder if podman updated on jammy | 17:50 |
@fungicide:matrix.org | not recently that i can see: https://changelogs.ubuntu.com/changelogs/pool/universe/libp/libpod/libpod_3.4.4+ds1-1ubuntu1.22.04.3/changelog | 17:52 |
@fungicide:matrix.org | 3.4.4+ds1-1ubuntu1.22.04.3 is the current version in jammy and jammy-updates | 17:53 |
@fungicide:matrix.org | unless puc is lagging behind... | 17:54 |
@clarkb:matrix.org | it may be another supporting package since podman relies on a bunch of tertiary things. But ya I'm guessing holding a node is going to be quickest path to discovery here | 17:55 |
@fungicide:matrix.org | autohold is set and 938346 rechecked | 17:58 |
@fungicide:matrix.org | 200.225.47.45 is the held node | 18:58 |
@fungicide:matrix.org | https://zuul.opendev.org/t/zuul/build/06d1225d4261480093c3600b1d145d8b is the corresponding build report | 19:03 |
@fungicide:matrix.org | interestingly, apt policy podman shows it's got 3.4.4+ds1-1ubuntu1 installed but that it could upgrade to 3.4.4+ds1-1ubuntu1.22.04.3 | 19:08 |
@fungicide:matrix.org | the installed version is from jammy/universe while the newer available version is found in jammy-updates/universe and jammy-security/universe | 19:09 |
@clarkb:matrix.org | `podman exec -it zuul-tutorial_executor_1 bash` then running python3's `socket.gethostbyname('gerrit')` results in `socket.gaierror: [Errno -2] Name or service not known` | 19:10 |
@clarkb:matrix.org | `gerrit` is the name the setup.yaml playbook checks for port 29418 so that may explain it? | 19:12 |
@fungicide:matrix.org | i guess docker does some sort of fancy "lookup containers as if they're hostnames" magic? | 19:13 |
@clarkb:matrix.org | yes docker/podman are supposed to intercept dns requests and populate the logical names for you | 19:13 |
@clarkb:matrix.org | in opendev we don't generally deal with that because we use host networking which disables that behavior | 19:14 |
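
The manual check above can be sketched in Python. This is a minimal illustration, not the logic of the setup.yaml playbook. The function name `probe_service` and its injectable `resolve` and `connect` parameters are hypothetical, added so the two failure modes (the name does not resolve, or the name resolves but nothing accepts the connection) can be told apart and tested without a network.

```python
import socket


def probe_service(host, port, resolve=socket.gethostbyname,
                  connect=socket.create_connection, timeout=3.0):
    """Classify why a service might look "down" from inside a container.

    Returns "unresolvable", "unreachable", or "listening".
    `resolve` and `connect` default to the stdlib calls and can be
    swapped out for fakes (a hypothetical convenience for this sketch).
    """
    try:
        address = resolve(host)
    except socket.gaierror:
        # The same error class reported for 'gerrit' in the executor container.
        return "unresolvable"
    try:
        conn = connect((address, port), timeout)
    except OSError:
        # The name resolved, but the TCP connection failed or timed out.
        return "unreachable"
    conn.close()
    return "listening"
```

Run inside the executor container, `probe_service("gerrit", 29418)` would be expected to report the resolution failure seen here, while a healthy setup should report that the port is listening.
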
@fungicide:matrix.org | got it. let's set aside my extreme fright at that discovery for the time being | 19:14 |
@fungicide:matrix.org | presumably that's how it was working previously, so what broke it? | 19:15 |
@clarkb:matrix.org | maybe the packages that set up dns stuff for podman did update to match the security update but the main runtime didn't and there is an incompatibility? | 19:15 |
@clarkb:matrix.org | https://pypi.org/project/podman-compose/#history I think this is the problem actually | 19:16 |
@clarkb:matrix.org | based only on the latest release timestamp | 19:16 |
@jim:acmegating.com | it may not be creating the container with the correct name | 19:17 |
@clarkb:matrix.org | I have confirmed that the latest 1.3.0 version is what we have installed | 19:17 |
@clarkb:matrix.org | we can push an update to use the previous version and see if that fixes it | 19:18 |
@jim:acmegating.com | Clark: while you do that, do you mind if i take over this host and downgrade to 1.2.0 and test | 19:18 |
@jim:acmegating.com | i want to see what the difference in container metadata is | 19:18 |
@clarkb:matrix.org | corvus: go for it | 19:18 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 938837: Pin podman-compose to 1.2.0 https://review.opendev.org/c/zuul/zuul/+/938837 | 19:20 | |
@fungicide:matrix.org | looping back around to the installed podman version being old, it was a workaround for https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2024394 implemented via https://review.opendev.org/c/zuul/zuul-jobs/+/886552 (so it's indeed intentional) | 19:21 |
@jim:acmegating.com | * ``` | 19:22 |
- "--network=zuul-tutorial_zuul:alias=gerrit", | ||
+ "--network=zuul-tutorial_zuul", | ||
+ "--network-alias=gerrit", | ||
``` | ||
@jim:acmegating.com | maybe the old podman doesn't support the separate network-alias arg | 19:22 |
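
To compare the two argument spellings in the diff above, both can be normalized into the same structure. This is a hypothetical helper (`parse_network_args` is not part of podman or podman-compose). It handles only the two forms quoted in the diff and assumes nothing about which podman versions accept which form.

```python
def parse_network_args(args):
    """Return {network_name: sorted aliases} from podman-style arguments.

    Understands only the two spellings quoted in the discussion:
      --network=NAME:alias=ALIAS
      --network=NAME  plus a separate  --network-alias=ALIAS
    Simplification for this sketch: a separate alias is attached to
    every network named on the command line.
    """
    networks = {}
    loose_aliases = []
    for arg in args:
        if arg.startswith("--network-alias="):
            loose_aliases.append(arg.split("=", 1)[1])
        elif arg.startswith("--network="):
            value = arg.split("=", 1)[1]
            name, _, options = value.partition(":")
            aliases = networks.setdefault(name, set())
            for option in options.split(","):
                if option.startswith("alias="):
                    aliases.add(option.split("=", 1)[1])
    for aliases in networks.values():
        aliases.update(loose_aliases)
    return {name: sorted(aliases) for name, aliases in networks.items()}


combined = ["--network=zuul-tutorial_zuul:alias=gerrit"]
separate = ["--network=zuul-tutorial_zuul", "--network-alias=gerrit"]
assert parse_network_args(combined) == parse_network_args(separate)
```

Both forms request the same alias, so if only one of them yields a resolvable `gerrit` name, the difference lies in how the installed podman handles that spelling rather than in what was requested.
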
@clarkb:matrix.org | I guess another approach may be to update the job to noble under the assumption that newer podman compose would work. Or use docker compose or something | 19:23 |
@clarkb:matrix.org | but if the pin works on jammy that is probably a reasonable workaround until one of those steps is taken | 19:23 |
@fungicide:matrix.org | supposedly the podman on noble doesn't suffer from 2024394. we may end up trading that bug for newer ones, but at some point we'll need to upgrade either way | 19:23 |
@jim:acmegating.com | i take it 4394 is not fixed on jammy? | 19:24 |
@fungicide:matrix.org | correct | 19:25 |
@fungicide:matrix.org | people running jammy continue to rediscover that bug report | 19:25 |
@fungicide:matrix.org | as recently as two months ago, so if it's not been fixed in ~1.5 years already it's unlikely to ever be | 19:26 |
@fungicide:matrix.org | debian would also be an option, if that aligns better with other jobs we're running | 19:26 |
@jim:acmegating.com | it sounds like the quickstart is broken on jammy then if we have to pin both podman and podman-compose | 19:26 |
@jim:acmegating.com | i agree with Clark let's do the podman-compose pin for the test, but we should immediately upgrade the job to noble to see if we can drop the pins. then after that let's see about switching to "docker compose". i do have a change for that which could be put into shape quickly. | 19:28 |
@fungicide:matrix.org | the sad thing is podman was working on jammy, ubuntu folk introduced this regression later with a backported patch but then never fixed it | 19:28 |
@jim:acmegating.com | yes, everything about all of this is very table-flippy. | 19:29 |
@jim:acmegating.com | the manual downgrade worked btw | 19:30 |
@fungicide:matrix.org | le sigh | 19:30 |
@jim:acmegating.com | i approved 837 | 19:30 |
@jim:acmegating.com | anyone writing a noble patch or should i? | 19:30 |
@fungicide:matrix.org | lgtm as well | 19:30 |
@fungicide:matrix.org | i had not started one yet, no | 19:31 |
@clarkb:matrix.org | nor me | 19:31 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 938840: Use noble for quickstart https://review.opendev.org/c/zuul/zuul/+/938840 | 19:34 | |
@fungicide:matrix.org | thanks! | 19:34 |
@jim:acmegating.com | oh btw, https://review.opendev.org/923084 is already running that on noble, so i'm optimistic about that :) | 19:35 |
@jim:acmegating.com | (with podman under "docker compose") | 19:36 |
@fungicide:matrix.org | even better! | 19:36 |
@jim:acmegating.com | oh, though the pip install of git-review may fail | 19:39 |
@clarkb:matrix.org | I think on noble you may have to install to a venv now? | 19:40 |
@fungicide:matrix.org | i expect so, yes | 19:40 |
@clarkb:matrix.org | global installs are disabled unless you pass the flag to say I really meant it | 19:40 |
@fungicide:matrix.org | or with --break-system-packages or by removing the externally-managed flagfile or a variety of other options | 19:40 |
@fungicide:matrix.org | could e.g. create a /opt/git-review venv, pip install inside there, then drop a symlink from /usr/local/bin/git-review to /opt/git-review/bin/git-review, for example | 19:42 |
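
The venv-plus-symlink idea above can be written out as the commands it implies. This is a sketch under stated assumptions: `venv_install_plan` is a hypothetical helper that only builds the command list and does not run anything, and it assumes the package installs a console script with the same name as the package (as git-review does).

```python
from pathlib import PurePosixPath


def venv_install_plan(package, venv_dir, link_dir="/usr/local/bin"):
    """Return the commands for a venv install plus a symlink on PATH.

    Nothing is executed here; the caller decides how to run the plan.
    """
    venv = PurePosixPath(venv_dir)
    script = venv / "bin" / package  # assumes script name == package name
    return [
        ["python3", "-m", "venv", str(venv)],
        [str(venv / "bin" / "pip"), "install", package],
        ["ln", "-s", str(script), str(PurePosixPath(link_dir) / package)],
    ]


for command in venv_install_plan("git-review", "/opt/git-review"):
    print(" ".join(command))
```

This avoids `--break-system-packages` and leaves the system Python untouched.
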
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 923084: WIP: Try docker compose with podman https://review.opendev.org/c/zuul/zuul/+/923084 | 19:43 | |
@jim:acmegating.com | i think we should just use the distro git-review | 19:43 |
@jim:acmegating.com | that's what i do in that change ^ | 19:44 |
@fungicide:matrix.org | hear hear | 19:44 |
@fungicide:matrix.org | there haven't been any critical fixes to git-review in ages, and we really don't need any of the newer features for this | 19:44 |
@jim:acmegating.com | assuming 840 fails because of that, i will move that part into 840 | 19:44 |
@fungicide:matrix.org | ugh, now zuul-nox-py311 is failing for 938837 | 19:52 |
@fungicide:matrix.org | "Timeout waiting for Zuul to settle" in TestAWSDriver.testawsdiskimage_snapshot | 19:53 |
@jim:acmegating.com | good chance the niz stack i'm working on improves that; i'd recheck-bash it and not worry for now. | 19:54 |
@fungicide:matrix.org | will do, thanks | 20:03 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 20:11 | |
- [zuul/zuul] 937946: Add image/upload delete lifecycle https://review.opendev.org/c/zuul/zuul/+/937946 | ||
- [zuul/zuul] 937947: Add web API image delete endpoints https://review.opendev.org/c/zuul/zuul/+/937947 | ||
- [zuul/zuul] 938022: Allow deleting images through web UI https://review.opendev.org/c/zuul/zuul/+/938022 | ||
- [zuul/zuul] 938023: Add REST API method to trigger image build https://review.opendev.org/c/zuul/zuul/+/938023 | ||
- [zuul/zuul] 938024: Add a web UI button to build an image https://review.opendev.org/c/zuul/zuul/+/938024 | ||
- [zuul/zuul] 938087: Add labels and flavors to web https://review.opendev.org/c/zuul/zuul/+/938087 | ||
- [zuul/zuul] 938088: Add niz nodes to rest api nodes list endpoint https://review.opendev.org/c/zuul/zuul/+/938088 | ||
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 923084: WIP: Try docker compose with podman https://review.opendev.org/c/zuul/zuul/+/923084 | 20:52 | |
@fungicide:matrix.org | ahh good, i was having trouble parsing the error message from that last attempt | 20:52 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 20:55 | |
- [zuul/zuul] 938840: Use noble for quickstart https://review.opendev.org/c/zuul/zuul/+/938840 | ||
- [zuul/zuul] 923084: WIP: Try docker compose with podman https://review.opendev.org/c/zuul/zuul/+/923084 | ||
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [zuul/zuul] 938837: Pin podman-compose to 1.2.0 https://review.opendev.org/c/zuul/zuul/+/938837 | 21:15 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 21:35 | |
- [zuul/zuul] 937384: Use a TreeCache for job request queues https://review.opendev.org/c/zuul/zuul/+/937384 | ||
- [zuul/zuul] 937385: Reduce ZK lock contention in executor https://review.opendev.org/c/zuul/zuul/+/937385 | ||
- [zuul/zuul] 937386: Disable cache event log https://review.opendev.org/c/zuul/zuul/+/937386 | ||
- [zuul/zuul] 937387: Make executor sensor messages more useful https://review.opendev.org/c/zuul/zuul/+/937387 | ||
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 21:54 | |
- [zuul/zuul] 938840: Use noble for quickstart https://review.opendev.org/c/zuul/zuul/+/938840 | ||
- [zuul/zuul] 923084: Switch quickstart to docker compose v2 https://review.opendev.org/c/zuul/zuul/+/923084 | ||
@clarkb:matrix.org | corvus: I left a comment on https://review.opendev.org/c/zuul/zuul-jobs/+/925916 so I didn't approve it. But feel free if you think that can be sorted out later or is unimportant | 22:02 |
@clarkb:matrix.org | and a couple of notes/questions on https://review.opendev.org/c/zuul/zuul/+/923084 too | 22:06 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 22:07 | |
- [zuul/zuul] 937946: Add image/upload delete lifecycle https://review.opendev.org/c/zuul/zuul/+/937946 | ||
- [zuul/zuul] 937947: Add web API image delete endpoints https://review.opendev.org/c/zuul/zuul/+/937947 | ||
- [zuul/zuul] 938022: Allow deleting images through web UI https://review.opendev.org/c/zuul/zuul/+/938022 | ||
- [zuul/zuul] 938023: Add REST API method to trigger image build https://review.opendev.org/c/zuul/zuul/+/938023 | ||
- [zuul/zuul] 938024: Add a web UI button to build an image https://review.opendev.org/c/zuul/zuul/+/938024 | ||
- [zuul/zuul] 938087: Add labels and flavors to web https://review.opendev.org/c/zuul/zuul/+/938087 | ||
- [zuul/zuul] 938088: Add niz nodes to rest api nodes list endpoint https://review.opendev.org/c/zuul/zuul/+/938088 | ||
@jim:acmegating.com | Clark: i suspect either it is necessary for users, or something changed in the interim. i was definitely being minimal. so i think we should +w it and if you or anyone else gets curious about reverting it, that's easy enough to test with another change and depends-on. | 22:09 |
@jim:acmegating.com | Clark: did you do a service disabling thing for opendev we could copypasto into 923084? | 22:11 |
@clarkb:matrix.org | corvus: yes there is a service disablement block let me find it | 22:14 |
@clarkb:matrix.org | corvus: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/install-docker/tasks/Ubuntu.noble.yaml#L23-L37 | 22:15 |
@clarkb:matrix.org | I've approved https://review.opendev.org/c/zuul/zuul-jobs/+/925916 a followup to test the cleanup should be fine | 22:15 |
-@gerrit:opendev.org- Zuul merged on behalf of John Soo: [zuul/zuul] 938346: Configure notify settings when reporting to gerrit https://review.opendev.org/c/zuul/zuul/+/938346 | 22:42 | |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: | 22:55 | |
- [zuul/zuul-jobs] 926164: Add ability to exclude a specific platform https://review.opendev.org/c/zuul/zuul-jobs/+/926164 | ||
- [zuul/zuul-jobs] 925916: ensure-podman: add tasks to configure socket group https://review.opendev.org/c/zuul/zuul-jobs/+/925916 | ||
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 923084: Switch quickstart to docker compose v2 https://review.opendev.org/c/zuul/zuul/+/923084 | 23:22 | |
@jim:acmegating.com | Clark: ^ thanks, updated. | 23:23 |