Friday, 2025-07-25

06:55 *** ykarel__ is now known as ykarel
13:18 <EnriqueVallespiGil[m]> Hi folks, we're facing slow connections against opendev with a lot of: fatal: unable to access 'https://opendev.org/openstack/ansible-role-lunasa-hsm/': Failed to connect to opendev.org port 443: Connection timed out
13:18 <EnriqueVallespiGil[m]> Unsure what's going on there, but I've just noticed this. Is there any known ongoing service degradation?
13:22 <fungi> EnriqueVallespiGil[m]: we replaced our load balancer for that around 20:00 utc yesterday (the old one had some issue where its global ipv6 address and gateways kept vanishing)
13:22 <fungi> when did you notice issues begin?
13:28 <EnriqueVallespiGil[m]> 22:30 CET, but it's a cronjob running hourly, so it might have been earlier
13:29 <EnriqueVallespiGil[m]> So 20:30 UTC
13:41 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs  https://review.opendev.org/c/zuul/zuul-jobs/+/955903
13:43 <corvus> i'm approving a zuul change to drop ansible 8 and add ansible 11. the current default is and will remain ansible 9, and aside from that ^, codesearch doesn't show any use of other ansible versions (of course that could be different on older branches).
13:44 <fungi> thanks for the heads up!
13:44 <fungi> EnriqueVallespiGil[m]: that definitely sounds like it could correlate with when we swapped out the haproxy server for a new one
13:45 <fungi> EnriqueVallespiGil[m]: do you know if you're connecting to it over ipv4 or ipv6?
13:46 <opendevreview> Merged zuul/zuul-jobs master: limit-log-files: allow unlimited files  https://review.opendev.org/c/zuul/zuul-jobs/+/953854
13:50 <fungi> testing with ping to the ipv6 address of the new gitea-lb03 from my home (usa) i'm seeing 0% packet loss... and from our mirror.gra1.ovh server (france) also 0% lost
13:52 <fungi> EnriqueVallespiGil[m]: are you still seeing connection issues right now, or was it only around 20:30 utc yesterday that it happened?
13:56 <EnriqueVallespiGil[m]> Let me check right now, but I was seeing it 25 min ago
13:56 <fungi> on mirror.gra1.ovh if i `git clone https://opendev.org/openstack/nova` it takes 99 seconds to complete, which seems fairly typical
13:57 <fungi> theory time: since we're using source address hash persistence in haproxy, i wonder if before the lb swap you were being persisted to a backend that was performing well, and now you're being persisted to one that's performing poorly. i'll start checking all the backends to see if one's getting especially overloaded
13:58 <fungi> load average on all the backends is in the ~1-2 range, so i don't think any are being swamped at the moment
14:01 <EnriqueVallespiGil[m]> it's in the same shape, with some connection timeouts, but this service is living in an openstack cluster that might be underperforming.
14:04 <fungi> from mirror.gra1.ovh i git cloned openstack/ansible-role-lunasa-hsm directly from each of the 6 gitea backend servers over https and all succeeded; completion times were between 0.5 and 2.5 seconds each
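
[editor's note: a minimal sketch of the per-backend timing check described above; the backend hostnames are placeholders and the https port is assumed to be 3081, since the real backend names and port live in opendev's system-config.]

    # Time a shallow clone of the repo in question directly from each gitea
    # backend, bypassing the load balancer. Hostnames/port are placeholders.
    for backend in gitea01 gitea02 gitea03 gitea04 gitea05 gitea06; do
        dest=$(mktemp -d)
        echo "=== ${backend} ==="
        time git clone --depth 1 \
            "https://${backend}.opendev.org:3081/openstack/ansible-role-lunasa-hsm" \
            "${dest}"
        rm -rf "${dest}"
    done
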
14:04 <EnriqueVallespiGil[m]> I'll ping back here if I see it isn't something else on our side, thanks @fungi++
14:05 <fungi> EnriqueVallespiGil[m]: you're welcome, if you need to dig deeper we can try sharing traceroutes in both directions to see if there's a spot somewhere on the internet where this is choking
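
[editor's note: a minimal sketch of the bidirectional path check being offered here, assuming mtr is available on both ends; plain traceroute works as well, and CLIENT_IP is a placeholder for whatever public address the reporter shares.]

    # From the client side, toward the load balancer:
    mtr --report --report-cycles 50 opendev.org

    # From the opendev side, toward the client's public address:
    mtr --report --report-cycles 50 CLIENT_IP
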
14:05 <EnriqueVallespiGil[m]> That'd be great! I'll ping back here
14:25 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs  https://review.opendev.org/c/zuul/zuul-jobs/+/955903
14:32 <opendevreview> James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs  https://review.opendev.org/c/zuul/zuul-jobs/+/955903
15:08 <clarkb> fungi: EnriqueVallespiGil[m]: is it possible you're trying to connect to the old server? because I shut down services there
15:08 <clarkb> that would explain connection timeouts too
15:10 <clarkb> but yes, more information about which ip protocol is used, the actual ip address(es) being connected to, etc would be helpful if this persists
15:11 <clarkb> EnriqueVallespiGil[m]: fungi: the other thought I've got is maybe you have firewall rules allowing access to the old ip address but not the new one
15:16 <fungi> gonna go grab lunch, back shortly
15:18 <EnriqueVallespiGil[m]> Definitely it's that, clarkb: we have a whitelist firewall and I don't have ping to 38.108.68.97.
15:18 <EnriqueVallespiGil[m]> I have allowed 38.108.68.66. Let me change this on Monday
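
[editor's note: a minimal sketch of the allowlist change being described, assuming an iptables-based egress whitelist; the reporter's actual firewall tooling isn't stated, so the rule form and the persistence step are illustrative.]

    # Allow outbound https to the new gitea-lb03 address (38.108.68.97)
    # alongside the previously whitelisted 38.108.68.66.
    iptables -I OUTPUT -d 38.108.68.97 -p tcp --dport 443 -j ACCEPT

    # Persist it, e.g. via iptables-persistent (path is distro-dependent).
    iptables-save > /etc/iptables/rules.v4
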
15:34 <clarkb> cool, with that solved do we want to proceed with https://review.opendev.org/c/opendev/system-config/+/955829 and https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 to start removing configs for the old server?
15:40 <corvus> lgtm
15:40 <clarkb> corvus: all but one image build job was successful last night. All of the image builds succeeded too. The one job that failed did so after running out of disk space
15:41 <clarkb> that implies to me the changes we've made over the last couple days have had a positive impact on reliability
15:41 <clarkb> now, is trixie just a larger image, or are we right on the disk limit, or something else? not sure yet
15:42 <clarkb> https://zuul.opendev.org/t/opendev/build/f040d487b54144449b4c290649f8fa07/log/job-output.txt#9023-9032 disk space seems ok-ish there, but a few tasks down it says there is no more space
15:42 <clarkb> I should say all of the dib builds succeeded and we've moved the errors from image build to post processing
15:43 <corvus> that error is a little confusing
15:43 <corvus> is it saying it ran out of space in zuul_return? because that's writing a file on the executor.
15:44 <clarkb> corvus: ya, though running ansible plays/tasks requires copying files to the remote /tmp iirc. Maybe it is saying that when trying to execute that code it ran out of disk while copying the executable artifact to the remote?
15:45 <corvus> nothing should be copied for a zuul_return task
15:45 <corvus> it's an action module that runs on the executor
15:45 <clarkb> ah, in that case I agree, very odd. Maybe the executor ran out of disk space?
15:45 <corvus> (not saying that didn't happen -- just saying it's not supposed to)
15:46 <clarkb> (and it is coincidence that we suffered the consequences during that task in this job?)
15:46 <corvus> yes, very likely
15:46 <corvus> https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=2025-07-25T01:46:57.222Z&to=2025-07-25T04:53:06.306Z&timezone=utc
15:47 <corvus> did we miss a trick with the executor replacement?
15:47 <clarkb> I think the old ones had volumes for /var/lib/zuul (or is it /var/run/zuul?); not sure if that was replicated/copied over
15:48 <corvus> i don't think the new ones do
15:48 <clarkb> and they were ~80GB large iirc
15:48 <clarkb> corvus: the old volumes are probably still there if we want to rotate them in (may need to gracefully shut down an executor, clear out the mount path, then mount to avoid orphaning a bunch of disk content that we won't be able to clean up later?)
15:50 <corvus> i think the volumes are gone; i don't see any in dfw
15:53 <clarkb> corvus: oh you know what? I wonder if we used the ephemeral drive
15:53 <clarkb> (I agree there are no executor volumes listed in DFW)
15:54 <clarkb> I think that might explain why I recall them being 80GB, as that is the size of the ephemeral drive iirc
15:55 <corvus> okay, with much work i have extracted a cacti graph that confirms they were 80gb
15:56 <corvus> so yeah, we just need to stop, adjust fstab, remount, start
15:57 <clarkb> with some file copying in the middle there (there is a small amount of content in /opt where the ephemeral drive is currently mounted)
15:58 <corvus> what is /opt/containerd ?
15:58 <clarkb> looks like that includes project-config, which would be copied in by hourly jobs. So maybe we put each executor in the emergency file as it gets modified too, in order to avoid hourly jobs copying a new project-config in as we're doing the data move
15:58 <corvus> (the only other thing there is project-config, which is only used by ansible and ansible would recreate, so we could just delete it and ignore it)
15:59 <clarkb> corvus: maybe /opt/containerd is a side effect of using podman?
15:59 <clarkb> it looks like its contents are largely empty (it's a few dirs with no real content?)
16:00 <corvus> yeah
16:00 <clarkb> googling indicates it may be a side effect of docker, not podman, so maybe some carryover from installing docker compose?
16:00 <clarkb> I suspect that we can probably clean out both the project-config content and the containerd content and let them recreate if necessary.
16:01 <clarkb> https://github.com/moby/moby/issues/41672
16:01 <clarkb> "it's probably ok to delete them but they will be recreated on startup"
16:02 <clarkb> so something like: stop executor, delete /opt contents, move /var/lib/zuul into /opt, unmount /opt, mount to /var/lib/zuul, start zuul executor?
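
[editor's note: a minimal sketch of that sequence on a single executor, assuming the ephemeral drive is currently mounted at /opt; the service unit name is a placeholder (the real hosts manage the executor differently) and the fstab edit is shown by hand rather than scripted, since the real entry is host-specific.]

    # Stop the executor so nothing writes under /var/lib/zuul (unit name is a placeholder).
    systemctl stop zuul-executor

    # Clear the ephemeral drive's current contents; project-config and
    # /opt/containerd get recreated automatically when needed.
    rm -rf /opt/*

    # Move the existing executor state onto the ephemeral drive, then remount
    # the drive at /var/lib/zuul.
    mv /var/lib/zuul/* /opt/
    umount /opt

    # Edit /etc/fstab so the ephemeral drive's mountpoint is /var/lib/zuul
    # instead of /opt, then mount it at the new location.
    ${EDITOR:-vi} /etc/fstab
    mount /var/lib/zuul

    systemctl start zuul-executor
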
16:03 <corvus> i'm going to do half the executors at a time; they're a little busy now, but i expect load to trail off
16:03 <corvus> i'll just delete everything then
16:03 <clarkb> oh, and put the executors in the emergency file so that we don't recreate project-config in the wrong place due to the hourly job
16:03 <clarkb> corvus: ^
16:03 <corvus> yep
16:03 <clarkb> looks like hourlies just started
16:04 <clarkb> so that may also race edits to the emergency file, if we want to wait a minute for them to finish first
16:04 <corvus> yep. they're all in emergency and i have run graceful on half
16:04 <corvus> i had the first half in the emergency file for about 5 minutes or so, just now decided to run the second half
16:05 <corvus> either way, the hourlies should be long done by the time the graceful is finished
16:05 <clarkb> ++
16:06 <corvus> this is a good find. this should be the last of the observed image build issues
16:08 <clarkb> corvus: do we think the zuul image build tests are failing for similar reasons?
16:10 <corvus> you mean changes proposed to zuul-providers?
16:11 <corvus> or do you mean zuul unit tests?
16:12 <corvus> i don't think there are any open changes to zuul-providers
16:12 <clarkb> corvus: to zuul test jobs (image builds). digging deeper, I think that may be caused by quay.io maintenance and "elevated 500 errors"
16:12 <clarkb> https://status.redhat.com/
16:12 <clarkb> I don't think we need to bother with that more at the moment. Just focus on the disk thing and let quay figure out their problems
16:12 <corvus> i think any job that ran in periodic-nightly could hit this -- i think that's what pushed the executor disk usage over the edge
16:12 <clarkb> makes sense since they all run at once. A big stampede
16:13 <corvus> oh you mean like 'zuul-build-image'  :)
16:13 <clarkb> ya
16:13 <corvus> haha so many image builds
16:14 <corvus> it's like... you used exactly the right words to describe that, but i still didn't know what you meant because we have so many things that could be described as "zuul image build jobs" :)
16:14 <clarkb> we should run a generative ai job to build png images with llms based on our favorite things. Ghostbusters, penguins, gnus. Maybe some anklesaurs
16:14 <clarkb> just to add to the confusion
16:44 <fungi> let's see what i missed
16:46 <fungi> 955829 approved, and +2 on 955830, but didn't approve it yet since i think they don't share a queue
16:48 <fungi> also ++ to ankle dinosaurs, dunno why
16:49 <clarkb> quay reports the 502 issues are resolved but they are extending the maintenance period by one hour... Not sure what that means for jobs but hopefully things will be happy again soon
16:49 <fungi> oh, i missed that it was a reported issue
16:50 <clarkb> fungi: ya, I linked the status.redhat.com page above, which has more details
17:04 <opendevreview> Merged opendev/system-config master: Drop gitea-lb02 from our inventory  https://review.opendev.org/c/opendev/system-config/+/955829
17:09 <clarkb> I think the quay read-only maintenance mode will end at 17:58 UTC if I'm reading the status page correctly
17:24 <clarkb> `Error response from daemon: {"message":"can't talk to a V1 container registry"}` this is a new error message (seen in the codesearch deploy). I'm guessing this is not a persistent issue and instead some sort of api fumble on the registry side, since a bunch of other jobs are happy right now
17:24 <clarkb> that was running docker-compose pull on codesearch that emitted the error
17:34 <corvus> 1,2,3,5 are stopped, i'll begin the work on them
17:39 <corvus> and 4
17:40 <clarkb> I'm going to approve the dns cleanup for gitea-lb02 now. I won't delete the server in case guilhermesp, ricolin, or mnaser indicate an interest in debugging further in the next few days
17:41 <fungi> wfm, i was only holding off until the depends-on merged
17:41 <fungi> which it has, even though deploy jobs failed
17:42 <fungi> i hope the quay folks are able to have a pleasant weekend
17:43 <opendevreview> Merged opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records  https://review.opendev.org/c/opendev/zone-opendev.org/+/955830
17:50 <corvus> oops, we bind-mount the site variables from /opt/project-config
17:53 <fungi> so that needs container restarts i guess?
17:54 <corvus> well, i deleted /opt/project-config because i thought we only copied stuff from there. so i'll copy it from one of the other executors, and then on the remaining ones, i'll make sure to save it.
17:54 <fungi> ah
17:55 <clarkb> as a heads up, I have an appliance repair person coming by this afternoon to fix our range. I may be distracted for a bit once they show up
17:55 <fungi> otherwise you may need to switch to raw foods
17:57 <clarkb> I wasn't filled with a bunch of confidence when he seemed to know over the phone exactly what needed to be done, then discovered the necessary part was out of stock (so we've been waiting a couple of weeks)
17:57 <clarkb> confidence in the range, that is. Person seems to know what they are doing
17:58 <clarkb> worried we'll be doing the same repair in six months, but only time will tell
17:59 <corvus> did its gpu burn out?
18:00 <clarkb> the primary stove top burner only operates in full-on or full-off mode. So it's not moderating the output. Some sort of switch or resistor thing?
18:01 <fungi> that would work fine for me. i use a cast iron wok and only turn the burner all the way on or all the way off. in-between is irrelevant
18:01 <corvus> okay, ze01 looks good now, starting the others
18:03 <corvus> done: ze01-ze05 have been restarted on the new configuration; ze06-ze12 are gracefully shutting down
18:05 <corvus> okay, status page says green now
18:05 <corvus> quay's, even
18:06 <clarkb> oh good, the amount of orange and red on the zuul dashboard should go down now
18:07 <fungi> lots of the houses out here are named, and i've decided to name ours "skeleton quay" (second place choice was "dock shadows")
18:08 <corvus> yes, but how will you pronounce it? and how will everyone else who sees it? (i know the answer to the second one)
18:09 <fungi> i mean, the whole point is that the pun only works if you know the pronunciation
18:10 <fungi> though "skeleton cay" and "skeleton key" were considered, they just don't work quite as well
18:10 <clarkb> aren't they all keys in your part of the world?
18:10 <clarkb> or maybe you have to go down to florida for that terminology shift to occur?
18:11 <fungi> at the moment i'm on a peninsula because "beach nourishment" keeps all the inlets north of us shoaled over for the benefit of people who crazily built vacation homes on top of them
18:11 <clarkb> TIL cay is a derivation from Arawak "cairi"
18:12 <clarkb> which eventually becomes key in english
18:12 <fungi> yeah, "cay" would have been the most traditional
18:12 <clarkb> oh, it's a specific Arawak language: Taino
18:13 <fungi> from a regional perspective, nobody calls anything a key, cay or quay here, they're just "islands" or "banks" or "shore"
18:13 <fungi> it's well down into florida before you encounter the term locally
18:14 <corvus> that makes it very awkward to talk about quay.io, since the folks that operate the service pronounce it "kway".
18:14 <corvus> https://access.redhat.com/articles/6970915
18:14 <corvus> at least they are aware of the difficulty
18:14 <fungi> hah
18:15 <clarkb> ya, we always referred to them simply as islands growing up. But technically many of the islands we called islands are also keys
18:16 <fungi> most of the variability of spelling for cay/quay/key is found in the caribbean sea thanks to repeated colonization and re-colonization by various european countries who disagreed on language and spelling
18:16 <clarkb> (but not all islands were keys, as some of them were volcanic in origin, not sedimentary on top of reefs)
18:17 <fungi> yeah, ironically most of the caribbean islands are volcanic, but colonists couldn't tell the difference
18:17 <fungi> unless, you know, they were actually erupting
18:19 <fungi> the indigenous islanders knew through oral histories, but nobody really paid attention to them (and also the european colonists typically displaced or genocided them just like in other parts of the world)
18:20 <fungi> the decimation of port royal in jamaica was a good example
18:21 <fungi> (1692 ad)
18:26 <fungi> granted, that was more of a seismic event, not an eruption; it's all tied to the same tectonic fault there
18:31 <clarkb> fungi: I think Friday is a light meeting day. Should I approve https://review.opendev.org/c/opendev/system-config/+/955544 now?
18:31 <clarkb> (also quay seems happier, which is another prereq I think)
18:31 <fungi> i think this is the perfect time for that, yes
18:33 <clarkb> cool, approved
18:58 <opendevreview> Merged opendev/system-config master: Switch IRC and matrix bots to log with journald rather than syslog  https://review.opendev.org/c/opendev/system-config/+/955544
19:04 <clarkb> seems like it did restart the bots
19:06 <clarkb> yes, docker ps -a reports they all updated
19:07 <clarkb> and /var/log/containers/docker-gerritbot.log has entries more recent than the restart, so the journald vs syslog equivalence seems to be working as expected
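
[editor's note: a minimal sketch of that kind of verification, assuming the bot runs as a docker container named gerritbot using the journald log driver; the container and log file names mirror the ones mentioned above but are otherwise assumptions.]

    # Confirm the container came back up with the journald log driver.
    docker inspect --format '{{.HostConfig.LogConfig.Type}}' gerritbot

    # Recent lines should appear in the journal, tagged with the container name...
    journalctl CONTAINER_NAME=gerritbot -n 20 --no-pager

    # ...and in the file that the journal/syslog forwarding still writes out.
    tail -n 20 /var/log/containers/docker-gerritbot.log
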
19:07 <clarkb> I'm about to grab lunch but this lgtm. let me know if you see something unexpected
19:47 <corvus> the other executors are stopped, i'll update them now
20:00 <corvus> #status log updated all zuul executors to mount the ephemeral drive at /var/lib/zuul
20:00 <corvus> that's all done now, and the executors are out of the emergency file
20:00 <opendevstatus> corvus: finished logging
20:00 <clarkb> fingers now crossed that we have all green image builds overnight
20:03 <clarkb> does anyone know how the meeting logs captured by limnoria on eavesdrop01.opendev.org end up on meetings.opendev.org, which is really just a vhost on static that presumably serves out of AFS? It doesn't look like we have afs mounted on eavesdrop01.opendev.org
20:04 <corvus> meetings proxies to eavesdrop
20:04 <corvus> so they're just served by apache on eavesdrop from local files
20:04 <clarkb> oh ok, so it isn't afs
20:04 <corvus> right
20:05 <clarkb> and it looks like that data is on a cinder volume. Thanks, this helps my planning for replacing the eavesdrop server
20:06 <corvus> np. i traced through that just recently.  :)
20:07 <clarkb> I think what I should do is boot a new server with a new volume, rsync the data onto the volume, then when we land the change to add the new server to inventory (which configures it to run the containers) stop services on the old server and rsync again. Something along those lines
20:07 <clarkb> but I'll think about that a bit to see if there is a less impactful or simpler approach
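
[editor's note: a minimal sketch of that two-pass copy, assuming the meeting/log data lives under a hypothetical /var/lib/limnoria path on the cinder volume and that the replacement host is reachable as eavesdrop02; both names are placeholders.]

    # First pass while the old server is still serving; moves the bulk of the data.
    rsync -avx --delete /var/lib/limnoria/ root@eavesdrop02.opendev.org:/var/lib/limnoria/

    # Once the inventory change lands and services on the old server are stopped,
    # a second short pass picks up anything written in the meantime.
    rsync -avx --delete /var/lib/limnoria/ root@eavesdrop02.opendev.org:/var/lib/limnoria/
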
20:43 <clarkb> https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-24h&to=now&timezone=utc shows much better disk utilization percentages now. We probably need to wait and see what periodic jobs do though
20:59 <clarkb> jrosser: fungi: I went ahead and abandoned https://review.opendev.org/c/zuul/zuul-jobs/+/954280 as I believe the dib update should be sufficient to address the issue
21:00 <fungi> yep
