*** ykarel__ is now known as ykarel | 06:55 | |
EnriqueVallespiGil[m] | Hi folks, we're facing slow connection against opendev with a lot of: fatal: unable to access 'https://opendev.org/openstack/ansible-role-lunasa-hsm/': Failed to connect to opendev.org port 443: Connection timed out | 13:18 |
EnriqueVallespiGil[m] | Unsure what's going on there, but I've just noticed this. Is there any ongoing service degradation that is known? | 13:18 |
fungi | EnriqueVallespiGil[m]: we replaced our load balancer for that around 20:00 utc yesterday (the old one had some issue where its global ipv6 address and gateways kept vanishing) | 13:22 |
fungi | when did you notice issues begin? | 13:22 |
EnriqueVallespiGil[m] | 22:30 CET, but it's a cronjob that runs hourly, so it might have started earlier | 13:28 |
EnriqueVallespiGil[m] | So 20:30 UTC | 13:29 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs https://review.opendev.org/c/zuul/zuul-jobs/+/955903 | 13:41 |
corvus | i'm approving a zuul change to drop ansible 8 and add ansible 11. the current default is and will remain ansible 9, and aside from that ^, codesearch doesn't show any use of other ansible versions (of course that could be different on older branches). | 13:43 |
fungi | thanks for the heads up! | 13:44 |
fungi | EnriqueVallespiGil[m]: that definitely sounds like it could correlate to when we swapped out the haproxy server for a new one | 13:44 |
fungi | EnriqueVallespiGil[m]: do you know if you're connecting to it over ipv4 or ipv6? | 13:45 |
opendevreview | Merged zuul/zuul-jobs master: limit-log-files: allow unlimited files https://review.opendev.org/c/zuul/zuul-jobs/+/953854 | 13:46 |
fungi | testing with ping to the ipv6 address of the new gitea-lb03 from my home (usa) i'm seeing 0% packet loss... and from our mirror.gra1.ovh server (france) also 0% lost | 13:50 |
fungi | EnriqueVallespiGil[m]: are you still seeing connection issues right now, or was it only around 20:30 utc yesterday that it happened? | 13:52 |
EnriqueVallespiGil[m] | Let me check right now, but I was seeing it 25 min ago | 13:56 |
fungi | on mirror.gra1.ovh if i `git clone https://opendev.org/openstack/nova` it takes 99 seconds to complete, which seems fairly typical | 13:56 |
fungi | theory time: since we're using source address hash persistence in haproxy, i wonder if before the lb swap you were being persisted to a backend that was performing well, and now you're being persisted to one that's performing poorly. i'll start checking all the backends to see if one's getting especially overloaded | 13:57 |
fungi | load average on all the backends is in the ~1-2 range, so i don't think any are being swamped at the moment | 13:58 |
EnriqueVallespiGil[m] | it's in the same shape, with some connection timeouts, but this service is living in an openstack cluster that might be underperforming. | 14:01 |
fungi | from mirror.gra1.ovh i git cloned openstack/ansible-role-lunasa-hsm directly from each of the 6 gitea backend servers over https and all succeeded, completion times were between 0.5-2.5 seconds each | 14:04 |
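(A minimal sketch of the per-backend check fungi describes above. The backend names, the HTTPS port, and the direct-access URLs are assumptions for illustration, not the actual opendev inventory.)

```bash
# Hypothetical loop cloning the same repo directly from each gitea backend,
# bypassing the load balancer, to see whether one backend is notably slower.
# Backend hostnames and the :3081 HTTPS port are assumptions.
repo=openstack/ansible-role-lunasa-hsm
for backend in gitea09 gitea10 gitea11 gitea12 gitea13 gitea14; do
    dest=$(mktemp -d)
    echo "=== ${backend} ==="
    time git clone -q "https://${backend}.opendev.org:3081/${repo}" "${dest}"
    rm -rf "${dest}"
done
```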
EnriqueVallespiGil[m] | I'll ping back here if I find it's not something on our side, thanks @fungi++ | 14:04 |
fungi | EnriqueVallespiGil[m]: you're welcome, if you need to dig deeper we can try sharing traceroutes in both directions to see if there's a spot somewhere on the internet where this is choking | 14:05 |
EnriqueVallespiGil[m] | That'd be great! I'll ping back here | 14:05 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs https://review.opendev.org/c/zuul/zuul-jobs/+/955903 | 14:25 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Remove xenial test jobs https://review.opendev.org/c/zuul/zuul-jobs/+/955903 | 14:32 |
clarkb | fungi: EnriqueVallespiGil[m] is it possible you're trying to connect to the old server? I shut down services there | 15:08 |
clarkb | that would explain connection timeouts too | 15:08 |
clarkb | but yes more information about which ip protocol is used, the actual ip address(es) being connected to, etc would be helpful if this persists | 15:10 |
clarkb | EnriqueVallespiGil[m]: fungi: the other thought I've got is maybe you have firewall rules allowing access to the old ip address but not the new | 15:11 |
fungi | gonna go grab lunch, back shortly | 15:16 |
EnriqueVallespiGil[m] | Definitely it's that, clarkb: we have a whitelist firewall and I don't have ping to 38.108.68.97. | 15:18 |
EnriqueVallespiGil[m] | I have allowed 38.108.68.66. Let me change this on Monday | 15:18 |
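(For illustration, a sketch of the kind of allow-list change being described. The actual firewall in Enrique's environment is unknown, so the use of iptables, the chain, and the rule placement are all assumptions; only the two addresses come from the discussion above.)

```bash
# Hypothetical allow-list update: permit outbound HTTPS to the new
# gitea-lb03 address; the old gitea-lb02 entry can be dropped later.
# iptables and the OUTPUT chain are assumptions about the local setup.
iptables -A OUTPUT -d 38.108.68.97 -p tcp --dport 443 -j ACCEPT    # new load balancer
# once the new address is confirmed working:
# iptables -D OUTPUT -d 38.108.68.66 -p tcp --dport 443 -j ACCEPT  # old load balancer
```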
clarkb | cool with that solved do we want to proceed with https://review.opendev.org/c/opendev/system-config/+/955829 and https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 to start removing configs for the old server? | 15:34 |
corvus | lgtm | 15:40 |
clarkb | corvus: all but one image build job was successful last night. All of the image builds succeeded too. The one job that failed did so after running out of disk space | 15:40 |
clarkb | that implies to me the changes we've made the last couple days have had a positive impact on reliability | 15:41 |
clarkb | now, is trixie just a larger image, or are we right at the disk limit, or is it something else? not sure yet | 15:41 |
clarkb | https://zuul.opendev.org/t/opendev/build/f040d487b54144449b4c290649f8fa07/log/job-output.txt#9023-9032 disk space seems ok ish there but a few tasks down it says there is no more space | 15:42 |
clarkb | I should say all of the dib builds succeeded and we've moved the errors from image build to post processing | 15:42 |
corvus | that error is a little confusing | 15:43 |
corvus | is it saying it ran out of space in zuul_return? because that's writing a file on the executor. | 15:43 |
clarkb | corvus: ya though running ansible plays/tasks requires copying files to the remote /tmp iirc. Maybe it is saying when trying to execute that code it ran out of disk when copying the executable artifact to the remote? | 15:44 |
corvus | nothing should be copied for a zuul_return task | 15:45 |
corvus | it's an action module that runs on the executor | 15:45 |
clarkb | ah, in that case I agree very odd. Maybe the executor ran out of disk space? | 15:45 |
corvus | (not saying that didn't happen -- just saying it's not supposed to) | 15:45 |
clarkb | (and it is coincidence that we suffered the consequences during that task in this job?) | 15:46 |
corvus | yes very likely | 15:46 |
corvus | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=2025-07-25T01:46:57.222Z&to=2025-07-25T04:53:06.306Z&timezone=utc | 15:46 |
corvus | did we miss a trick with the executor replacement? | 15:47 |
clarkb | I think the old ones had volumes for /var/lib/zuul (or is it /var/run/zuul) not sure if that was replicated/copied over | 15:47 |
corvus | i don't think the new ones do | 15:48 |
clarkb | and they were ~80GB large iirc | 15:48 |
clarkb | corvus: the old volumes are probably still there if we want to rotate them in (may need to gracefully shut down an executor, clear out the mount path, then mount to avoid orphaning a bunch of disk content that we won't be able to clean up later?) | 15:48 |
corvus | i think the volumes are gone; i don't see any in dfw | 15:50 |
clarkb | corvus: oh you know what? I wonder if we used the ephemeral drive | 15:53 |
clarkb | (I agree there are no executor volumes listed in DFW) | 15:53 |
clarkb | I think that might explain why I recall them being 80GB as that is the size of the ephemeral drive iirc | 15:54 |
corvus | okay, with much work i have extracted a cacti graph that confirms they were 80gb | 15:55 |
corvus | so yeah, we just need to stop, adjust fstab, remount, start | 15:56 |
clarkb | with some file copying in the middle there (there is a small amount of content in /opt where the ephemeral drive is currently mounted) | 15:57 |
corvus | what is /opt/containerd ? | 15:58 |
clarkb | looks like that includes project-config, which would be copied in by hourly jobs. So maybe we put each executor in the emergency file as it gets modified, to avoid the hourly jobs copying a new project-config in while we're doing the data move | 15:58 |
corvus | (the only other thing there is project-config, which is only used by ansible and ansible would recreate, so we could just delete it and ignore it) | 15:58 |
clarkb | corvus: maybe /opt/containerd is a side effect of using podman? | 15:59 |
clarkb | it looks like its contents are largely empty (it's a few dirs with no real content?) | 15:59 |
corvus | yeah | 16:00 |
clarkb | googling indicates it may be a side effect of docker not podman so maybe some carry over from installing docker compose? | 16:00 |
clarkb | I suspect that we can probably clean out both the project-config content and the containerd content and let them recreate if necessary. | 16:00 |
clarkb | https://github.com/moby/moby/issues/41672 | 16:01 |
clarkb | "its probably ok to delete them but theywill be recreated on startup" | 16:01 |
clarkb | so something like stop executor, delete /opt contents, move /var/lib/zuul into /opt, unmount /opt, mount to /var/lib/zuul, start zuul executor? | 16:02 |
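(A sketch of the sequence clarkb outlines above, as it might look on a single executor. The device name, fstab contents, and the way the service is stopped and started are assumptions; the real hosts may manage the executor via docker-compose and use a different device path.)

```bash
# Hypothetical remount of the ephemeral drive from /opt to /var/lib/zuul on one
# executor; run only after a graceful shutdown, with the host in the emergency
# file so hourly ansible doesn't repopulate /opt/project-config mid-move.
# /dev/xvde1 and the systemd unit name are assumptions.
systemctl stop zuul-executor

rm -rf /opt/containerd /opt/project-config     # both get recreated if needed
mv /var/lib/zuul/* /opt/                       # stage executor state onto the ephemeral drive

umount /opt
# repoint the ephemeral drive's fstab entry from /opt to /var/lib/zuul
sed -i 's|^/dev/xvde1[[:space:]]\+/opt|/dev/xvde1 /var/lib/zuul|' /etc/fstab
mount /var/lib/zuul

systemctl start zuul-executor
```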
corvus | i'm going to do half the executors at a time; they're a little busy now, but i expect load to trail off | 16:03 |
corvus | i'll just delete everything then | 16:03 |
clarkb | oh and put the executors in the emergency file so that we don't recreate project config in the wrong place due to the hourly job | 16:03 |
clarkb | corvus: ^ | 16:03 |
corvus | yep | 16:03 |
clarkb | looks like hourlies just started | 16:03 |
clarkb | so that may also race edits to the emergency file; we may want to wait a minute for them to finish first | 16:04 |
corvus | yep. they're all in emergency and i have run graceful on half | 16:04 |
corvus | i had the first half in the emergency file for about 5 minutes or so, just now decided to run the second half | 16:04 |
corvus | either way, the hourlies should be long done by the time the graceful is finished | 16:05 |
clarkb | ++ | 16:05 |
corvus | this is a good find. this should be the last of the observed image build issues | 16:06 |
clarkb | corvus: do we think the zuul image build tests are failing for similar reasons? | 16:08 |
corvus | you mean changes proposed to zuul-providers? | 16:10 |
corvus | or do you mean zuul unit tests? | 16:11 |
corvus | i don't think there are any open changes to zuul-providers | 16:12 |
clarkb | corvus: to zuul test jobs (image builds). digging deeper, I think that may be caused by quay.io maintenance and "elevated 500 errors" | 16:12 |
clarkb | https://status.redhat.com/ | 16:12 |
clarkb | I don't think we need to bother with that more at the moment. Just focus on the disk thing and let quay figure out their problems | 16:12 |
corvus | i think any job that ran in periodic-nightly could hit this -- i think that's what pushed the executor disk usage over the edge | 16:12 |
clarkb | makes sense since they all run at once. A big stampede | 16:12 |
corvus | oh you mean like 'zuul-build-image' :) | 16:13 |
clarkb | ya | 16:13 |
corvus | haha so many image builds | 16:13 |
corvus | it's like... you used exactly the right words to describe that, but i still didn't know what you meant because we have so many things that could be described as "zuul image build jobs" :) | 16:14 |
clarkb | we should run a generative ai job to build png images with llms based on our favorite things. Ghostbusters, penguins, gnus. Maybe some ankylosaurs | 16:14 |
clarkb | just to add to the confusion | 16:14 |
fungi | let's see what i missed | 16:44 |
fungi | 955829 approved and +2 on 955830 but didn't approve it yet since i think they don't share a queue | 16:46 |
fungi | also ++ to ankle dinosaurs, dunno why | 16:48 |
clarkb | quay reports the 502 issues are resolved but they are extending the maintenance period by one hour... Not sure what that means for jobs but hopefully things will be happy again soon | 16:49 |
fungi | oh, i missed that it was a reported issue | 16:49 |
clarkb | fungi: ya I linked the status.redhat.com page above which has more details | 16:50 |
opendevreview | Merged opendev/system-config master: Drop gitea-lb02 from our inventory https://review.opendev.org/c/opendev/system-config/+/955829 | 17:04 |
clarkb | I think the quay read only maintenance mode will end at 17:58 UTC if I'm reading the status page correctly | 17:09 |
clarkb | `Error response from daemon: {"message":"can't talk to a V1 container registry"}` this is a new error message (seen in the codesearch deploy). I'm guessing this is not a persistent issue and instead some sort of api fumble on the registry side since a bunch of other jobs are happy right now | 17:24 |
clarkb | that was running docker-compose pull on codesearch that emitted the error | 17:24 |
corvus | 1,2,3,5 are stopped, i'll begin the work on them | 17:34 |
corvus | and 4 | 17:39 |
clarkb | I'm going to approve the dns cleanup for gitea-lb02 now. I won't delete the server in case guilhermesp ricolin or mnaser indicate an interest in debugging further in the next few days | 17:40 |
fungi | wfm, i was only holding off until the depends-on merged | 17:41 |
fungi | which it has, even though deploy jobs failed | 17:41 |
fungi | i hope the quay folks are able to have a pleasant weekend | 17:42 |
opendevreview | Merged opendev/zone-opendev.org master: Reset opendev.org TTLS and drop gitea-lb02 records https://review.opendev.org/c/opendev/zone-opendev.org/+/955830 | 17:43 |
corvus | oops, we bind-mount the site variables from /opt/project-config | 17:50 |
fungi | so that needs container restarts i guess? | 17:53 |
corvus | well, i deleted /opt/project-config because i thought we only copied stuff from there. so i'll copy it from one of the other executors, and then on the remaining ones, i'll make sure to save it. | 17:54 |
fungi | ah | 17:54 |
clarkb | as a heads up I have an appliance repair person coming by this afternoon to fix our range. I may be distracted for a bit once they show up | 17:55 |
fungi | otherwise you may need to switch to raw foods | 17:55 |
clarkb | I wasn't filled with a bunch of confidence when he seemed to know over the phone exactly what needed to be done, then discovered the necessary part was out of stock (so we've been waiting a couple of weeks) | 17:57 |
clarkb | confidence in the range, that is. The person seems to know what they are doing | 17:57 |
clarkb | worried we'll be doing the same repair in six months but only time will tell | 17:58 |
corvus | did its gpu burn out? | 17:59 |
clarkb | the primary stove top burner only operates in full on or full off mode. So it's not moderating the output. Some sort of switch or resistor thing? | 18:00 |
fungi | that would work fine for me. i use a cast iron wok and only turn the burner all the way on or all the way off. in-between is irrelevant | 18:01 |
corvus | okay ze01 looks good now, starting the others | 18:01 |
corvus | done: ze01-ze05 have been restarted on the new configuration; ze06-ze12 are gracefully shutting down | 18:03 |
corvus | okay status page says green now | 18:05 |
corvus | quay even | 18:05 |
clarkb | oh good, the amount of orange and red on the zuul dashboard should go down now | 18:06 |
fungi | lots of the houses out here are named, and i've decided to name ours "skeleton quay" (second place choice was "dock shadows") | 18:07 |
corvus | yes, but how will you pronounce it? and how will everyone else who sees it? (i know the answer to the second one) | 18:08 |
fungi | i mean, the whole point is that the pun only works if you know the pronunciation | 18:09 |
fungi | though "skeleton cay" and "skeleton key" were considered, they just don't work quite as well | 18:10 |
clarkb | aren't they all keys in your part of the world? | 18:10 |
clarkb | or maybe you have to go down to florida for that terminology shift to occur? | 18:10 |
fungi | at the moment i'm on a peninsula because "beach nourishment" keeps all the inlets north of us shoaled over for the benefit of people who crazily built vacation homes on top of them | 18:11 |
clarkb | TIL cay is a derivation from Arawak "cairi" | 18:11 |
clarkb | which eventually becomes key in english | 18:12 |
fungi | yeah, "cay" would have been the most traditional | 18:12 |
clarkb | oh, it's a specific Arawak language: Taino | 18:12 |
fungi | from a regional perspective, nobody calls anything a key, cay or quay here, they're just "islands" or "banks" or "shore" | 18:13 |
fungi | it's well down into florida before you encounter the term locally | 18:13 |
corvus | that makes it very awkward to talk about quay.io since the folks that operate the service pronounce it "kway". | 18:14 |
corvus | https://access.redhat.com/articles/6970915 | 18:14 |
corvus | at least they are aware of the difficulty | 18:14 |
fungi | hah | 18:14 |
clarkb | ya we always referred to them simply as islands growing up. But technically many of the islands we called islands are also keys | 18:15 |
fungi | most of the variability of spelling for cay/quay/key is found in the caribbean sea thanks to repeated colonization and re-colonization by various european countries who disagreed on language and spelling | 18:16 |
clarkb | (but not all islands were keys as some of them were volcanic in origin not sedimentary on top of reefs) | 18:16 |
fungi | yeah, ironically most of the caribbean islands are volcanic, but colonists couldn't tell the difference | 18:17 |
fungi | unless, you know, they were actually erupting | 18:17 |
fungi | the indigenous islanders knew through oral histories, but nobody really paid attention to them (and also the european colonists typically displaced or genocided them just like in other parts of the world) | 18:19 |
fungi | the decimation of port royal in jamaica was a good example | 18:20 |
fungi | (1692ad) | 18:21 |
fungi | granted that was more of a seismic event not an eruption, it's all tied to the same tectonic fault there | 18:26 |
clarkb | fungi: I think Friday is a light meeting day. Should I approve https://review.opendev.org/c/opendev/system-config/+/955544 now? | 18:31 |
clarkb | (also quay seems happier which is another prereq I think) | 18:31 |
fungi | i think this is the perfect time for that, yes | 18:31 |
clarkb | cool approved | 18:33 |
opendevreview | Merged opendev/system-config master: Switch IRC and matrix bots to log with journald rather than syslog https://review.opendev.org/c/opendev/system-config/+/955544 | 18:58 |
clarkb | seems like it did restart bots | 19:04 |
clarkb | yes docker ps -a reports they all updated | 19:06 |
clarkb | and /var/log/containers/docker-gerritbot.log has entries more recent than the restart so the journald vs syslog equivalence seems to be working as expected | 19:07 |
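(A sketch of the kind of spot checks described above. The container name is a guess for illustration; the exact journald field values depend on how the journald log driver tags each bot.)

```bash
# Hypothetical checks after switching the bots to the journald log driver.
docker ps -a --format '{{.Names}}\t{{.Status}}'   # confirm the bot containers restarted

# the journald log driver tags each container's output with CONTAINER_NAME,
# so it can be filtered directly (the name "gerritbot" here is a guess):
journalctl CONTAINER_NAME=gerritbot --since "1 hour ago" -n 50

# and the syslog-style file should still be receiving entries if the
# journald -> /var/log/containers path is wired up as intended:
tail -n 5 /var/log/containers/docker-gerritbot.log
```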
clarkb | I'm about to grab lunch but this lgtm. let me know if you see something unexpected | 19:07 |
corvus | the other executors are stopped, i'll update them now | 19:47 |
corvus | #status log updated all zuul executors to mount the ephemeral drive at /var/lib/zuul | 20:00 |
corvus | that's all done now, and the executors are out of the emergency file | 20:00 |
opendevstatus | corvus: finished logging | 20:00 |
clarkb | fingers now crossed that we have all green image builds overnight | 20:00 |
clarkb | does anyone know how the meeting logs captured by limnoria on eavesdrop01.opendev.org end up on meetings.opendev.org which is really just a vhost on static which presumably serves out of AFS? It doesn't look like we have afs mounted on eavesdrop01.opendev.org | 20:03 |
corvus | meetings proxies to eavesdrop | 20:04 |
corvus | so they're just served by apache on eavesdrop from local files | 20:04 |
clarkb | oh ok so it isn't afs | 20:04 |
corvus | right | 20:04 |
clarkb | and it looks like that data is on a cinder volume. Thanks, this helps my planning for replacing the eavesdrop server | 20:05 |
corvus | np. i traced through that just recently. :) | 20:06 |
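(A minimal sketch of how the proxy arrangement corvus describes could be confirmed from the outside; the log path used here is an assumption purely for illustration.)

```bash
# Hypothetical check that meetings.opendev.org is a reverse proxy in front of
# eavesdrop rather than an AFS-backed vhost on static: fetch the same (assumed)
# path via both names and compare the responses/headers.
path="/irclogs/"   # illustrative path; the real layout may differ
curl -sI "https://meetings.opendev.org${path}" | head -n 5
curl -sI "https://eavesdrop01.opendev.org${path}" | head -n 5
```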
clarkb | I think what I should do is boot a new server with a new volume, rsync the data into the volume, then when we land the change to add the new server to the inventory (which configures it to run the containers), stop services on the old server and rsync again. Something along those lines | 20:07 |
clarkb | but I'll think about that a bit to see if there is a less impactful or simpler approach | 20:07 |
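(A rough sketch of the two-pass rsync approach described above, as run from the replacement server. Hostnames, data paths, and the compose invocation are assumptions for illustration.)

```bash
# Hypothetical two-pass migration of the eavesdrop data to a new server.
OLD=eavesdrop01.opendev.org
DATA=/var/lib/limnoria          # assumed location of the meeting/bot data

# pass 1: bulk copy onto the new server's volume while the old host keeps running
rsync -avz "root@${OLD}:${DATA}/" "${DATA}/"

# ...land the inventory change that configures the new server, then:
# pass 2: stop the bots on the old server and do a short catch-up sync
ssh "root@${OLD}" "cd /etc/limnoria-compose && docker-compose down"   # assumed path/invocation
rsync -avz --delete "root@${OLD}:${DATA}/" "${DATA}/"
# finally start the containers on the new server
```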
clarkb | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-24h&to=now&timezone=utc shows much better disk utilization percentages now. We probably need to wait and see what periodic jobs do though | 20:43 |
clarkb | jrosser: fungi I went ahead and abandoned https://review.opendev.org/c/zuul/zuul-jobs/+/954280 as I believe the dib update should be sufficient to address the issue | 20:59 |
fungi | yep | 21:00 |