*** ykarel__ is now known as ykarel | 06:55 | |
EnriqueVallespiGil[m] | Hi folks, we're facing slow connection against opendev with a lot of: fatal: unable to access 'https://opendev.org/openstack/ansible-role-lunasa-hsm/': Failed to connect to opendev.org port 443: Connection timed out | 13:18 |
EnriqueVallespiGil[m] | Unsure what's going on there, but I've just noticed this. Is there any ongoing service degradation that is known? | 13:18 |
fungi | EnriqueVallespiGil[m]: we replaced our load balancer for that around 20:00 utc yesterday (the old one had some issue where its global ipv6 address and gateways kept vanishing) | 13:22 |
fungi | when did you notice issues begin? | 13:22 |
EnriqueVallespiGil[m] | 22:30 CET, but it's a cronjob that runs hourly, so it might have started earlier | 13:28 |
EnriqueVallespiGil[m] | So 20:30 UTC | 13:29 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs https://review.opendev.org/c/zuul/zuul-jobs/+/955903 | 13:41 |
corvus | i'm approving a zuul change to drop ansible 8 and add ansible 11. the current default is and will remain ansible 9, and aside from that ^, codesearch doesn't show any use of other ansible versions (of course that could be different on older branches). | 13:43 |
fungi | thanks for the heads up! | 13:44 |
fungi | EnriqueVallespiGil[m]: that definitely sounds like it could correlate to when we swapped out the haproxy server for a new one | 13:44 |
fungi | EnriqueVallespiGil[m]: do you know if you're connecting to it over ipv4 or ipv6? | 13:45 |
opendevreview | Merged zuul/zuul-jobs master: limit-log-files: allow unlimited files https://review.opendev.org/c/zuul/zuul-jobs/+/953854 | 13:46 |
fungi | testing with ping to the ipv6 address of the new gitea-lb03 from my home (usa) i'm seeing 0% packet loss... and from our mirror.gra1.ovh server (france) also 0% lost | 13:50 |
fungi | EnriqueVallespiGil[m]: are you still seeing connection issues right now, or was it only around 20:30 utc yesterday that it happened? | 13:52 |
EnriqueVallespiGil[m] | Let me check right now, but I was seeing it 25 min ago | 13:56 |
fungi | on mirror.gra1.ovh if i `git clone https://opendev.org/openstack/nova` it takes 99 seconds to complete, which seems fairly typical | 13:56 |
fungi | theory time: since we're using source address hash persistence in haproxy, i wonder if before the lb swap you were being persisted to a backend that was performing well, and now you're being persisted to one that's performing poorly. i'll start checking all the backends to see if one's getting especially overloaded | 13:57 |
fungi | load average on all the backends is in the ~1-2 range, so i don't think any are being swamped at the moment | 13:58 |
EnriqueVallespiGil[m] | it's in the same shape, with some connection timeouts, but this service is living in an openstack cluster that might be underperforming. | 14:01 |
fungi | from mirror.gra1.ovh i git cloned openstack/ansible-role-lunasa-hsm directly from each of the 6 gitea backend servers over https and all succeeded, completion times were between 0.5-2.5 seconds each | 14:04 |
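(A minimal sketch of the per-backend check fungi describes above. The backend names, the HTTPS port, and the direct-access URLs are assumptions for illustration, not the actual opendev inventory.)

```bash
# Hypothetical loop cloning the same repo directly from each gitea backend,
# bypassing the load balancer, to see whether one backend is notably slower.
# Backend hostnames and the :3081 HTTPS port are assumptions.
repo=openstack/ansible-role-lunasa-hsm
for backend in gitea09 gitea10 gitea11 gitea12 gitea13 gitea14; do
    dest=$(mktemp -d)
    echo "=== ${backend} ==="
    time git clone -q "https://${backend}.opendev.org:3081/${repo}" "${dest}"
    rm -rf "${dest}"
done
```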
EnriqueVallespiGil[m] | I'll ping back here if I find it's not something on our side, thanks @fungi++ | 14:04 |
fungi | EnriqueVallespiGil[m]: you're welcome, if you need to dig deeper we can try sharing traceroutes in both directions to see if there's a spot somewhere on the internet where this is choking | 14:05 |
EnriqueVallespiGil[m] | That'd be great! I'll ping back here | 14:05 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs https://review.opendev.org/c/zuul/zuul-jobs/+/955903 | 14:25 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs https://review.opendev.org/c/zuul/zuul-jobs/+/955903 | 14:32 |
clarkb | fungi: EnriqueVallespiGil[m] is it possible you're trying to connect to the old server? I shut down services there | 15:08 |
clarkb | that would explain connection timeouts too | 15:08 |
clarkb | but yes more information about which ip protocol is used, the actual ip address(es) being connected to, etc would be helpful if this persists | 15:10 |
clarkb | EnriqueVallespiGil[m]: fungi: the other thought I've got is maybe you have firewall rules allowing access to the old ip address but not the new | 15:11 |
fungi | gonna go grab lunch, back shortly | 15:16 |
EnriqueVallespiGil[m] | Definitely it's that, clarkb: we have a whitelist firewall and I don't have ping to 38.108.68.97. | 15:18 |
EnriqueVallespiGil[m] | I have allowed 38.108.68.66. Let me change this on Monday | 15:18 |
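(For illustration, a sketch of the kind of allow-list change being described. The actual firewall in Enrique's environment is unknown, so the use of iptables, the chain, and the rule placement are all assumptions; only the two addresses come from the discussion above.)

```bash
# Hypothetical allow-list update: permit outbound HTTPS to the new
# gitea-lb03 address; the old gitea-lb02 entry can be dropped later.
# iptables and the OUTPUT chain are assumptions about the local setup.
iptables -A OUTPUT -d 38.108.68.97 -p tcp --dport 443 -j ACCEPT    # new load balancer
# once the new address is confirmed working:
# iptables -D OUTPUT -d 38.108.68.66 -p tcp --dport 443 -j ACCEPT  # old load balancer
```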
clarkb | cool with that solved do we want to proceed with https://review.opendev.org/c/opendev/system-config/+/955829 and https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 to start removing configs for the old server? | 15:34 |
corvus | lgtm | 15:40 |
clarkb | corvus: all but one image build job was successful last night. All of the image builds succeeded too. The one job that failed did so after running out of disk space | 15:40 |
clarkb | that implies to me the changes we've made the last couple days have had a positive impact on reliability | 15:41 |
clarkb | now, is trixie just a larger image, or are we right at the disk limit, or is it something else? not sure yet | 15:41 |
clarkb | https://zuul.opendev.org/t/opendev/build/f040d487b54144449b4c290649f8fa07/log/job-output.txt#9023-9032 disk space seems ok ish there but a few tasks down it says there is no more space | 15:42 |
clarkb | I should say all of the dib builds succeeded and we've moved the errors from image build to post processing | 15:42 |
corvus | that error is a little confusing | 15:43 |
corvus | is it saying it ran out of space in zuul_return? because that's writing a file on the executor. | 15:43 |
clarkb | corvus: ya though running ansible plays/tasks requires copying files to the remote /tmp iirc. Maybe it is saying when trying to execute that code it ran out of disk when copying the executable artifact to the remote? | 15:44 |
corvus | nothing should be copied for a zuul_return task | 15:45 |
corvus | it's an action module that runs on the executor | 15:45 |
clarkb | ah, in that case I agree very odd. Maybe the executor ran out of disk space? | 15:45 |
corvus | (not saying that didn't happen -- just saying it's not supposed to) | 15:45 |
clarkb | (and it is coincidence that we suffered the consequences during that task in this job?) | 15:46 |
corvus | yes very likely | 15:46 |
corvus | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=2025-07-25T01:46:57.222Z&to=2025-07-25T04:53:06.306Z&timezone=utc | 15:46 |
corvus | did we miss a trick with the executor replacement? | 15:47 |
clarkb | I think the old ones had volumes for /var/lib/zuul (or is it /var/run/zuul) not sure if that was replicated/copied over | 15:47 |
corvus | i don't think the new ones do | 15:48 |
clarkb | and they were ~80GB large iirc | 15:48 |
clarkb | corvus: the old volumes are probably still there if we want to rotate them in (may need to gracefully shut down an executor, clear out the mount path, then mount to avoid orphaning a bunch of disk content that we won't be able to clean up later?) | 15:48 |
corvus | i think the volumes are gone; i don't see any in dfw | 15:50 |
clarkb | corvus: oh you know what? I wonder if we used the ephemeral drive | 15:53 |
clarkb | (I agree there are no executor volumes listed in DFW) | 15:53 |
clarkb | I think that might explain why I recall them being 80GB as that is the size of the ephemeral drive iirc | 15:54 |
corvus | okay, with much work i have extracted a cacti graph that confirms they were 80gb | 15:55 |
corvus | so yeah, we just need to stop, adjust fstab, remount, start | 15:56 |
clarkb | with some file copying in the middle there (there is a small amount of content in /opt where the ephemeral drive is currently mounted) | 15:57 |
corvus | what is /opt/containerd ? | 15:58 |
clarkb | looks like that includes project-config, which would be copied in by hourly jobs. So maybe we put each executor in the emergency file as it gets modified, to avoid the hourly jobs copying a new project-config in while we're doing the data move | 15:58 |
corvus | (the only other thing there is project-config, which is only used by ansible and ansible would recreate, so we could just delete it and ignore it) | 15:58 |
clarkb | corvus: maybe /opt/containerd is a side effect of using podman? | 15:59 |
clarkb | it looks like its contents are largely empty (it's a few dirs with no real content?) | 15:59 |
corvus | yeah | 16:00 |
clarkb | googling indicates it may be a side effect of docker not podman so maybe some carry over from installing docker compose? | 16:00 |
clarkb | I suspect that we can probably clean out both the project-config content and the containerd content and let them recreate if necessary. | 16:00 |
clarkb | https://github.com/moby/moby/issues/41672 | 16:01 |
clarkb | "its probably ok to delete them but theywill be recreated on startup" | 16:01 |
clarkb | so something like stop executor, delete /opt contents, move /var/lib/zuul into /opt, unmount /opt, mount to /var/lib/zuul, start zuul executor? | 16:02 |
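(A sketch of the sequence clarkb outlines above, as it might look on a single executor. The device name, fstab contents, and the way the service is stopped and started are assumptions; the real hosts may manage the executor via docker-compose and use a different device path.)

```bash
# Hypothetical remount of the ephemeral drive from /opt to /var/lib/zuul on one
# executor; run only after a graceful shutdown, with the host in the emergency
# file so hourly ansible doesn't repopulate /opt/project-config mid-move.
# /dev/xvde1 and the systemd unit name are assumptions.
systemctl stop zuul-executor

rm -rf /opt/containerd /opt/project-config     # both get recreated if needed
mv /var/lib/zuul/* /opt/                       # stage executor state onto the ephemeral drive

umount /opt
# repoint the ephemeral drive's fstab entry from /opt to /var/lib/zuul
sed -i 's|^/dev/xvde1[[:space:]]\+/opt|/dev/xvde1 /var/lib/zuul|' /etc/fstab
mount /var/lib/zuul

systemctl start zuul-executor
```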
corvus | i'm going to do half the executors at a time; they're a little busy now, but i expect load to trail off | 16:03 |
corvus | i'll just delete everything then | 16:03 |
clarkb | oh and put the executors in the emergency file so that we don't recreate project config in the wrong place due to the hourly job | 16:03 |
clarkb | corvus: ^ | 16:03 |
corvus | yep | 16:03 |
clarkb | looks like hourlies just started | 16:03 |
clarkb | so that may also race edits to the emergency file; we may want to wait a minute for them to finish first | 16:04 |
corvus | yep. they're all in emergency and i have run graceful on half | 16:04 |
corvus | i had the first half in the emergency file for about 5 minutes or so, just now decided to run the second half | 16:04 |
corvus | either way, the hourlies should be long done by the time the graceful is finished | 16:05 |
clarkb | ++ | 16:05 |
corvus | this is a good find. this should be the last of the observed image build issues | 16:06 |
clarkb | corvus: do we think the zuul image build tests are failing for similar reasons? | 16:08 |
corvus | you mean changes proposed to zuul-providers? | 16:10 |
corvus | or do you mean zuul unit tests? | 16:11 |
corvus | i don't think there are any open changes to zuul-providers | 16:12 |
clarkb | corvus: to zuul test jobs (image builds). digging deeper, I think that may be caused by quay.io maintenance and "elevated 500 errors" | 16:12 |
clarkb | https://status.redhat.com/ | 16:12 |
clarkb | I don't think we need to bother with that more at the moment. Just focus on the disk thing and let quay figure out their problems | 16:12 |
corvus | i think any job that ran in periodic-nightly could hit this -- i think that's what pushed the executor disk usage over the edge | 16:12 |
clarkb | makes sense since they all run at once. A big stampede | 16:12 |
corvus | oh you mean like 'zuul-build-image' :) | 16:13 |
clarkb | ya | 16:13 |
corvus | haha so many image builds | 16:13 |
corvus | it's like... you used exactly the right words to describe that, but i still didn't know what you meant because we have so many things that could be described as "zuul image build jobs" :) | 16:14 |
clarkb | we should run a generative ai job to build png images with llms based on our favorite things. Ghostbusters, penguins, gnus. Maybe some ankylosaurs | 16:14 |
clarkb | just to add to the confusion | 16:14 |
fungi | let's see what i missed | 16:44 |
fungi | 955829 approved and +2 on 955830 but didn't approve it yet since i think they don't share a queue | 16:46 |
fungi | also ++ to ankle dinosaurs, dunno why | 16:48 |
clarkb | quay reports the 502 issues are resolved but they are extending the maintenance period by one hour... Not sure what that means for jobs but hopefully things will be happy again soon | 16:49 |
fungi | oh, i missed that it was a reported issue | 16:49 |
clarkb | fungi: ya I linked the status.redhat.com page above which has more details | 16:50 |
opendevreview | Merged opendev/system-config master: Drop gitea-lb02 from our inventory https://review.opendev.org/c/opendev/system-config/+/955829 | 17:04 |
clarkb | I think the quay read only maintenance mode will end at 17:58 UTC if I'm reading the status page correctly | 17:09 |
clarkb | `Error response from daemon: {"message":"can't talk to a V1 container registry"}` this is a new error message (seen in the codesearch deploy). I'm guessing this is not a persistent issue and instead some sort of api fumble on the registry side since a bunch of other jobs are happy right now | 17:24 |
clarkb | that was running docker-compose pull on codesearch that emitted the error | 17:24 |
corvus | 1,2,3,5 are stopped, i'll begin the work on them | 17:34 |
corvus | and 4 | 17:39 |
clarkb | I'm going to approve the dns cleanup for gitea-lb02 now. I won't delete the server in case guilhermesp ricolin or mnaser indicate an interest in debugging further in the next few days | 17:40 |
fungi | wfm, i was only holding off until the depends-on merged | 17:41 |
fungi | which it has, even though deploy jobs failed | 17:41 |
fungi | i hope the quay folks are able to have a pleasant weekend | 17:42 |
opendevreview | Merged opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 | 17:43 |
corvus | oops, we bind-mount the site variables from /opt/project-config | 17:50 |
fungi | so that needs container restarts i guess? | 17:53 |
corvus | well, i deleted /opt/project-config because i thought we only copied stuff from there. so i'll copy it from one of the other executors, and then on the remaining ones, i'll make sure to save it. | 17:54 |
fungi | ah | 17:54 |
clarkb | as a heads up I have an appliance repair person coming by this afternoon to fix our range. I may be distracted for a bit once they show up | 17:55 |
fungi | otherwise you may need to switch to raw foods | 17:55 |
clarkb | I wasn't filled with a bunch of confidence when he seemed to know over the phone exactly what needed to be done, then discovered the necessary part was out of stock (so we've been waiting a couple of weeks) | 17:57 |
clarkb | confidence in the range, that is. The person seems to know what they are doing | 17:57 |
clarkb | worried we'll be doing the same repair in six months but only time will tell | 17:58 |
corvus | did its gpu burn out? | 17:59 |
clarkb | the primary stove top burner only operates in full on or full off mode. So it's not moderating the output. Some sort of switch or resistor thing? | 18:00 |
fungi | that would work fine for me. i use a cast iron wok and only turn the burner all the way on or all the way off. in-between is irrelevant | 18:01 |
corvus | okay ze01 looks good now, starting the others | 18:01 |
corvus | done: ze01-ze05 have been restarted on the new configuration; ze06-ze12 are gracefully shutting down | 18:03 |
corvus | okay status page says green now | 18:05 |
corvus | quay even | 18:05 |
clarkb | oh good, the amount of orange and red on the zuul dashboard should go down now | 18:06 |
fungi | lots of the houses out here are named, and i've decided to name ours "skeleton quay" (second place choice was "dock shadows") | 18:07 |
corvus | yes, but how will you pronounce it? and how will everyone else who sees it? (i know the answer to the second one) | 18:08 |
fungi | i mean, the whole point is that the pun only works if you know the pronunciation | 18:09 |
fungi | though "skeleton cay" and "skeleton key" were considered, they just don't work quite as well | 18:10 |
clarkb | aren't they all keys in your part of the world? | 18:10 |
clarkb | or maybe you have to go down to florida for that terminology shift to occur? | 18:10 |
fungi | at the moment i'm on a peninsula because "beach nourishment" keeps all the inlets north of us shoaled over for the benefit of people who crazily built vacation homes on top of them | 18:11 |
clarkb | TIL cay is a derivation from Arawak "cairi" | 18:11 |
clarkb | which eventually becomes key in english | 18:12 |
fungi | yeah, "cay" would have been the most traditional | 18:12 |
clarkb | oh, it's a specific Arawak language: Taino | 18:12 |
fungi | from a regional perspective, nobody calls anything a key, cay or quay here, they're just "islands" or "banks" or "shore" | 18:13 |
fungi | it's well down into florida before you encounter the term locally | 18:13 |
corvus | that makes it very awkward to talk about quay.io since the folks that operate the service pronounce it "kway". | 18:14 |
corvus | https://access.redhat.com/articles/6970915 | 18:14 |
corvus | at least they are aware of the difficulty | 18:14 |
fungi | hah | 18:14 |
clarkb | ya we always referred to them simply as islands growing up. But technically many of the islands we called islands are also keys | 18:15 |
fungi | most of the variability of spelling for cay/quay/key is found in the caribbean sea thanks to repeated colonization and re-colonization by various european countries who disagreed on language and spelling | 18:16 |
clarkb | (but not all islands were keys as some of them were volcanic in origin not sedimentary on top of reefs) | 18:16 |
fungi | yeah, ironically most of the caribbean islands are volcanic, but colonists couldn't tell the difference | 18:17 |
fungi | unless, you know, they were actually erupting | 18:17 |
fungi | the indigenous islanders knew through oral histories, but nobody really paid attention to them (and also the european colonists typically displaced or genocided them just like in other parts of the world) | 18:19 |
fungi | the decimation of port royal in jamaica was a good example | 18:20 |
fungi | (1692ad) | 18:21 |
fungi | granted that was more of a seismic event not an eruption, it's all tied to the same tectonic fault there | 18:26 |
clarkb | fungi: I think Friday is a light meeting day. Should I approve https://review.opendev.org/c/opendev/system-config/+/955544 now? | 18:31 |
clarkb | (also quay seems happier which is another prereq I think) | 18:31 |
fungi | i think this is the perfect time for that, yes | 18:31 |
clarkb | cool approved | 18:33 |
opendevreview | Merged opendev/system-config master: Switch IRC and matrix bots to log with journald rather than syslog https://review.opendev.org/c/opendev/system-config/+/955544 | 18:58 |
clarkb | seems like it did restart bots | 19:04 |
clarkb | yes docker ps -a reports they all updated | 19:06 |
clarkb | and /var/log/containers/docker-gerritbot.log has entries more recent than the restart so the journald vs syslog equivalence seems to be working as expected | 19:07 |
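(A sketch of the kind of spot checks described above. The container name is a guess for illustration; the exact journald field values depend on how the journald log driver tags each bot.)

```bash
# Hypothetical checks after switching the bots to the journald log driver.
docker ps -a --format '{{.Names}}\t{{.Status}}'   # confirm the bot containers restarted

# the journald log driver tags each container's output with CONTAINER_NAME,
# so it can be filtered directly (the name "gerritbot" here is a guess):
journalctl CONTAINER_NAME=gerritbot --since "1 hour ago" -n 50

# and the syslog-style file should still be receiving entries if the
# journald -> /var/log/containers path is wired up as intended:
tail -n 5 /var/log/containers/docker-gerritbot.log
```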
clarkb | I'm about to grab lunch but this lgtm. let me know if you see something unexpected | 19:07 |
corvus | the other executors are stopped, i'll update them now | 19:47 |
corvus | #status log updated all zuul executors to mount the ephemeral drive at /var/lib/zuul | 20:00 |
corvus | that's all done now, and the executors are out of the emergency file | 20:00 |
opendevstatus | corvus: finished logging | 20:00 |
clarkb | fingers now crossed that we have all green image builds overnight | 20:00 |
clarkb | does anyone know how the meeting logs captured by limnoria on eavesdrop01.opendev.org end up on meetings.opendev.org which is really just a vhost on static which presumably serves out of AFS? It doesn't look like we have afs mounted on eavesdrop01.opendev.org | 20:03 |
corvus | meetings proxies to eavesdrop | 20:04 |
corvus | so they're just served by apache on eavesdrop from local files | 20:04 |
clarkb | oh ok so it isn't afs | 20:04 |
corvus | right | 20:04 |
clarkb | and it looks like that data is on a cinder volume. Thanks, this helps my planning for replacing the eavesdrop server | 20:05 |
corvus | np. i traced through that just recently. :) | 20:06 |
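(A minimal sketch of how the proxy arrangement corvus describes could be confirmed from the outside; the log path used here is an assumption purely for illustration.)

```bash
# Hypothetical check that meetings.opendev.org is a reverse proxy in front of
# eavesdrop rather than an AFS-backed vhost on static: fetch the same (assumed)
# path via both names and compare the responses/headers.
path="/irclogs/"   # illustrative path; the real layout may differ
curl -sI "https://meetings.opendev.org${path}" | head -n 5
curl -sI "https://eavesdrop01.opendev.org${path}" | head -n 5
```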
clarkb | I think what I should do is boot a new server with a new volume, rsync the data into the volume, then when we land the change to add the new server to the inventory (which configures it to run the containers), stop services on the old server and rsync again. Something along those lines | 20:07 |
clarkb | but I'll think about that a bit to see if there is a less impactful or simpler approach | 20:07 |
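(A rough sketch of the two-pass rsync approach described above, as run from the replacement server. Hostnames, data paths, and the compose invocation are assumptions for illustration.)

```bash
# Hypothetical two-pass migration of the eavesdrop data to a new server.
OLD=eavesdrop01.opendev.org
DATA=/var/lib/limnoria          # assumed location of the meeting/bot data

# pass 1: bulk copy onto the new server's volume while the old host keeps running
rsync -avz "root@${OLD}:${DATA}/" "${DATA}/"

# ...land the inventory change that configures the new server, then:
# pass 2: stop the bots on the old server and do a short catch-up sync
ssh "root@${OLD}" "cd /etc/limnoria-compose && docker-compose down"   # assumed path/invocation
rsync -avz --delete "root@${OLD}:${DATA}/" "${DATA}/"
# finally start the containers on the new server
```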
clarkb | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-24h&to=now&timezone=utc shows much better disk utilization percentages now. We probably need to wait and see what periodic jobs do though | 20:43 |
clarkb | jrosser: fungi I went ahead and abandoned https://review.opendev.org/c/zuul/zuul-jobs/+/954280 as I believe the dib update should be sufficient to address the issue | 20:59 |
fungi | yep | 21:00 |