nick | message | time |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Dec 17 19:00:18 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MID2FVRVSSZBARY7TM64ZWOE2WXI264E/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | This will be our last meeting of 2024. The next two tuesdays overlap with major holidays (December 24 and 31) so we'll skip the next two weeks of meetings and be back here January 7, 2025 | 19:01 |
clarkb | #topic Zuul-launcher image builds | 19:03 |
clarkb | Not sure if we've got corvus around but yesterday I'm pretty sure I saw zuul changes get proposed for adding image build management API stuff to zuul web | 19:03 |
clarkb | including things like buttons to delete images | 19:03 |
corvus | o/ | 19:03 |
corvus | yep i'm partway through that | 19:04 |
corvus | a little more work needed on the backend for the image deletion lifecycle | 19:04 |
corvus | then the buttons will actually do something | 19:04 |
corvus | then we're ready to try it out. :) | 19:04 |
clarkb | corvus: zuul-client will get similar support too I hope? | 19:04 |
corvus | that should be able to happen, but it's not a priority for me at the moment. | 19:05 |
clarkb | ack | 19:06 |
clarkb | I guess if it is in the API then any api client can manage it | 19:06 |
clarkb | just a matter of adding support | 19:06 |
corvus | yeah | 19:06 |
corvus | i think the web is going to be a much nicer way of interacting with this stuff | 19:06 |
corvus | so i'd recommend it for opendev admins. :) | 19:06 |
corvus | (it's a lot of uuids) | 19:07 |
clarkb | any other image build updates? | 19:07 |
corvus | not from me | 19:07 |
clarkb | #topic Upgrading old servers | 19:08 |
clarkb | I have updates on adding noble support but that has a topic of its own | 19:08 |
clarkb | anyone else have updates for general server update related efforts? | 19:08 |
tonyb | Nothing new from me | 19:08 |
clarkb | cool let's dive into the next topic then | 19:08 |
clarkb | #topic Docker compose plugin with podman service for servers | 19:08 |
clarkb | The background on this is that Ubuntu Noble comes with python3.12, and docker-compose (the python tool) doesn't run on python3.12. Rather than try to resurrect that tool we instead switch to the docker compose golang implementation | 19:09 |
clarkb | then because it's a clean break we take this as an opportunity to run containers with podman on the backend rather than docker proper | 19:09 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/937641 docker compose and podman for Noble | 19:09 |
clarkb | This change and the rest of topic:podman-prep show that this generally works for a number of canary services (paste, meetpad, gitea, gerrit, mailman) within our CI jobs | 19:10 |
clarkb | There are a few things that do break though and most of topic:podman-prep has been written to update things to be forward and backward compatible so the same config management works on pre-Noble and Noble installations | 19:10 |
clarkb | things we've found so far: podman doesn't support syslog logging output, so we update that to journald which seems to be largely equivalent. Docker compose changes the actual underlying container names, so we switch to using docker-compose/docker compose commands which use logical container names. | 19:11 |
clarkb | oh and docker-compose pull is more verbose than docker compose pull, so we had to rewrite some ansible that checks if we pulled new images to more explicitly check whether that has happened | 19:12 |
clarkb | on the whole fairly minor things that we've been able to work around in every case, which I think is really promising | 19:12 |
tonyb | There is a CLI flag to keep the python style names if you want | 19:12 |
clarkb | tonyb: no I think this is better since it gets us into a more forward looking position | 19:12 |
clarkb | and the changes are straightforward | 19:13 |
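As a rough illustration of the syslog-to-journald switch mentioned above, a service's compose file might carry a logging stanza like the following; the service name, image reference, and tag option are illustrative assumptions, not OpenDev's actual configuration.

```yaml
# Hypothetical compose snippet: route container logs to journald instead of syslog.
services:
  paste:
    image: quay.io/opendevorg/lodgeit:latest  # placeholder image reference
    logging:
      driver: journald
      options:
        tag: paste  # assumed tag, for filtering with `journalctl -t paste`
```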
clarkb | as far as the implementation itself goes there are a few things to be aware of | 19:13 |
tonyb | okay | 19:13 |
clarkb | I've configured the podman.socket systemd unit to listen on docker's default socket path and disabled docker / put docker.socket on a different socket path | 19:13 |
clarkb | what this means is if you run `docker` or `docker compose` you automatically talk to podman on the backend | 19:14 |
tonyb | Which is kinda cute :) | 19:14 |
clarkb | when you run `podman` you also talk to podman on the backend but it seems to use a different (podman default) socket path. The good news is that despite the change of socket path podman commands also see all of the resources created via docker commands, so I don't think this is an issue | 19:14 |
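A minimal sketch of how that socket swap might look if done as a systemd drop-in deployed by Ansible; the drop-in path and task shape are assumptions, not the actual system-config change.

```yaml
# Hypothetical Ansible task: make podman.socket listen on Docker's default path.
- name: Point podman.socket at the default Docker socket path
  copy:
    dest: /etc/systemd/system/podman.socket.d/override.conf  # assumed drop-in path
    content: |
      [Socket]
      ListenStream=
      ListenStream=/var/run/docker.sock
# A daemon-reload and restart of podman.socket would be needed afterwards.
```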
clarkb | I also put a shell shim at /usr/local/bin/docker-compose to replace the python docker-compose command that would normally go there and it just passes through to `docker compose` | 19:15 |
clarkb | this is a compatibility layer for our configuration management that we can drop once everything is on noble or newer | 19:15 |
clarkb | oh and I've opted to install both docker-compose-v2 and podman from ubuntu distro packages. Not sure if we want to try and install one or the other or both from upstream sources | 19:15 |
clarkb | I went with distro packages because it made bootstrapping simpler, but now that we know it generally works we can expand to other sources if we like | 19:16 |
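For reference, the distro-package route amounts to something like this; the package names are the ones mentioned above, the task itself is a sketch.

```yaml
# Hypothetical Ansible task: pull compose v2 and podman from Ubuntu's archive.
- name: Install compose v2 and podman from distro packages
  apt:
    name:
      - docker-compose-v2
      - podman
    state: present
```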
clarkb | I think where we are at now is deciding how we want to proceed with this since nothing has exploded dramatically | 19:17 |
clarkb | in particular I think we have two options. One is to continue to use that WIP change to check our services in CI and fix them up before we land the changes to the install-docker role | 19:17 |
corvus | i think using docker cli commands when we run things manually makes sense to match our automation | 19:17 |
clarkb | the other is to accept we've probably done enough testing to proceed with something in production (like paste) and then get something into production more quickly | 19:17 |
clarkb | corvus: ++ | 19:17 |
clarkb | I've mostly used docker cli commands on the held node to test things out of habit too | 19:18 |
clarkb | if we want to proceed to production with a service first I can clean up the WIP change to make it mergeable, then the next step would be redeploying paste I think | 19:18 |
clarkb | otherwise I'll continue to iterate through our services and see what needs updates and fix them under topic:podman-prep | 19:18 |
tonyb | I like the get it into production approach | 19:19 |
frickler | +1 just maybe not right before the holidays? | 19:19 |
corvus | i recognize this is unhelpful, but i'm ambivalent about whether we go into prod now or continue to iterate. our pre-merge testing is pretty good. but maybe putting something simple in production will help us find unknown unknowns sooner. | 19:19 |
tonyb | Given the switch is combined with the OS update I think it's fairly safe and controllable | 19:20 |
clarkb | ya the unknown unknowns is probably the biggest concern at this point so pushing something into production probably makes the most sense | 19:20 |
clarkb | tonyb: I agree cc frickler | 19:20 |
corvus | i think if paste blows up we can roll it back pretty easily despite the holidays. | 19:20 |
clarkb | ok sounds like I should clean up that change so we can merge it and then maybe tomorrowish start a new paste | 19:20 |
clarkb | oh the last thing I wanted to call out about this effort is we landed changes to "fix" logging for several services yesterday all of which have restarted onto the new docker compose config except for gerrit | 19:21 |
clarkb | after lunch today I'd like to restart gerrit so that it is running under that new config (mostly a heads up; I should be able to do that without trouble since the images aren't updating, just the logging config for docker) | 19:22 |
tonyb | We can keep the existing paste around; rollback, if needed, would then just be DNS plus a data dump+restore | 19:22 |
corvus | yeah i think doing a gerrit restart soon would be good | 19:22 |
clarkb | tonyb: ++ | 19:22 |
clarkb | ok I think that is the feedback I needed to take the next few steps on this effort. Thanks! | 19:22 |
clarkb | any other questions/concerns/feedback? | 19:22 |
corvus | thank you clarkb | 19:23 |
tonyb | clarkb: Thanks for doing this, I'm sorry I slowed things down. | 19:23 |
clarkb | tonyb: I don't think you slowed things down. I wasn't able to get to it until I got to it | 19:23 |
tonyb | clarkb: also ++ on a gerrit restart | 19:23 |
clarkb | #topic Docker Hub Rate Limits | 19:25 |
clarkb | Last week when we updated Gerrit's image to pick up the openid fix I wrote we discovered that we were at the docker hub rate limit despite not pulling any images for days | 19:25 |
clarkb | this led to us discovering that docker hub treats entire ipv6 /64s as a single IP for rate limit purposes | 19:25 |
clarkb | at the time we worked around this by editing /etc/hosts to use ipv4 addresses for the two locations that are used to fetch docker images from docker hub and that sorted us out | 19:26 |
clarkb | I think we could update our zuul jobs to maybe do that on the fly if we like though I haven't looked into doing that myself | 19:26 |
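If jobs did this on the fly it could look something like the sketch below. The hostnames are the usual Docker Hub fetch endpoints (an assumption here, not something confirmed in the discussion) and the addresses are documentation-range placeholders, not the real IPv4 addresses used.

```yaml
# Hypothetical Ansible task: pin Docker Hub endpoints to IPv4 so rate limits
# are not shared across an entire IPv6 /64.
- name: Pin Docker Hub fetch endpoints to IPv4 addresses
  lineinfile:
    path: /etc/hosts
    line: "{{ item }}"
  loop:
    - "203.0.113.10 registry-1.docker.io"              # placeholder address
    - "203.0.113.11 production.cloudflare.docker.com"  # placeholder address
```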
clarkb | I know corvus is advocating for more of a get off docker hub approach which is partially what is driving the noble podman work | 19:26 |
clarkb | basically I've been going down that path myself too | 19:27 |
corvus | i think we have everything we need to start mirroring images to quay | 19:27 |
clarkb | oh right the role to drive that landed in zuul-jobs last week too | 19:27 |
corvus | next step for that would be to write a simple zuul job that used the new role and run it in a periodic pipeline | 19:28 |
clarkb | I think starting with the python base image(s) and mariadb image(s) would be a good start | 19:28 |
corvus | ++ | 19:28 |
clarkb | corvus: where do you think that job should live and what namespace should we publish to? | 19:29 |
clarkb | like should opendev host the job and also host the images in the quay opendev location? | 19:29 |
corvus | yeah for those images, i think so | 19:30 |
clarkb | ya I think that makes sense since we would be primary consumers of them | 19:30 |
corvus | what's a little less clear is: | 19:30 |
corvus | say the zuul project needs something else; should it make its own job and use its own namespace, or should we have the opendev job be open to anyone? | 19:31 |
corvus | i don't think it'd be a huge imposition on us to review one-liner additions.... but also, it can be self-serve for individual projects so maybe it should be? | 19:31 |
clarkb | ya getting us out of the critical path unless there is broader usefulness is probably a good thing | 19:32 |
corvus | (also, i guess if the list of images gets very long, we'd need multiple jobs anyway...) | 19:32 |
corvus | ok, so multiple jobs managed by projects; sounds good to me | 19:33 |
tonyb | Yeah I think I'm leaning that way also | 19:33 |
corvus | (we might also want one job per image for rate limit purposes :) | 19:33 |
clarkb | that is a very good point actually | 19:34 |
corvus | oh also, i'd put the "pull" part of the job in a pre-run playbook. ;) | 19:34 |
clarkb | also a good low impact change over holidays for anyone who finds time to do it | 19:35 |
clarkb | that is also a good idea | 19:35 |
corvus | oh phooey we can't | 19:35 |
corvus | we'd need to split the role. oh well. | 19:35 |
clarkb | can cross that bridge if/when we get there | 19:35 |
corvus | yep. | 19:35 |
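Putting the pieces of this exchange together, a per-image periodic mirroring job might be shaped roughly like this; the job name, playbook path, variable names, and quay namespace are all illustrative assumptions rather than the real zuul-jobs interface.

```yaml
# Hypothetical Zuul config: one periodic job per mirrored image.
- job:
    name: opendev-mirror-python-base
    description: Mirror the python base image from Docker Hub to quay.io.
    run: playbooks/mirror-image/run.yaml  # assumed playbook wrapping the new role
    vars:
      mirror_source: docker.io/opendevorg/python-base  # assumed image name
      mirror_target: quay.io/opendevorg/python-base

- project:
    periodic:
      jobs:
        - opendev-mirror-python-base
```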
clarkb | anything else on this topic? | 19:35 |
corvus | nak | 19:36 |
clarkb | #topic Rax-ord Noble Nodes with 1 VCPU | 19:36 |
clarkb | I don't really have anything new on this topic, but it seems like this problem occurs very infrequently | 19:36 |
clarkb | I wanted to make sure I wasn't missing anything important on this or that I'm wrong about ^ that observation | 19:37 |
tonyb | I trust your investigation | 19:37 |
clarkb | the rax flex folks indicated that anything xen related is not in their wheelhouse unfortunately | 19:37 |
clarkb | so I suspect that addressing this is going to require someone to file a proper ticket with rax | 19:38 |
clarkb | not sure that is high on my priority list unless the error rate increases | 19:38 |
tonyb | Can we add something to detect this early and fail the job, which would get a new node from the pool? | 19:38 |
clarkb | tonyb: yes, I think you could do a check along the lines of: if the ansible bios version is too low and the cpu count is 1, exit with failure | 19:39 |
clarkb | I don't think we want to check only the cpu count since we may have jobs that intentionally try to use fewer cpus and then wonder why they fail in a year or three, so try to constrain it as much as possible to the symptoms we have identified | 19:39 |
tonyb | clarkb: ++ | 19:40 |
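A minimal sketch of that early-failure check as an Ansible task, using the standard ansible_processor_vcpus and ansible_bios_version facts; the BIOS version threshold is a placeholder, not a value anyone confirmed.

```yaml
# Hypothetical pre-job check: bail out on the known-bad 1-vCPU Xen symptom.
- name: Abort early on suspect 1-vCPU nodes with an old BIOS
  fail:
    msg: "Node matches the bad rax-ord flavor symptoms: 1 vCPU and old BIOS"
  when:
    - ansible_processor_vcpus == 1
    - ansible_bios_version is version('4.1', '<')  # placeholder threshold
```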
clarkb | #topic Ubuntu-ports mirror fixups | 19:40 |
tonyb | I'll see if I can get something written today. | 19:40 |
clarkb | thanks | 19:40 |
clarkb | fungi discovered a few of our afs mirrors were stale and managed to fix them all in a pretty straightforward manner except for ubuntu-ports | 19:41 |
clarkb | reprepro was complaining about a corrupt Berkeley DB for ubuntu-ports; fungi rebuilt the db which fixed the db issue, but then some tempfile which records package info was growing without bound, recording the same packages over and over again | 19:42 |
clarkb | eventually that would hit disk limits for the disk the temp file was written to and we would fail | 19:42 |
clarkb | where we've ended up is that fungi has deleted the ubuntu-ports RW volume content and has started reprepro over from scratch | 19:42 |
clarkb | the RO volumes still have the old stale content so our jobs should continue to run successfully | 19:42 |
clarkb | this is mostly a heads up about this situation as it may take some time to correct | 19:43 |
clarkb | the rebuild from scratch is running in a screen session on mirror-update, though I haven't checked in on it myself yet | 19:43 |
clarkb | fungi is enjoying some well deserved vacation time so we don't have an update from him on this but we can check directly on progress in the screen if anything comes up in the short term | 19:44 |
clarkb | sounded like fungi would try to monitor it as he can though | 19:44 |
clarkb | #topic Open Discussion | 19:44 |
clarkb | Anything else? | 19:44 |
clarkb | I guess it is worth mentioning that I'm expecting to be around this week. But then the two weeks after I'll be in and out as things happen with the kids etc | 19:45 |
clarkb | I don't really have a concrete schedule as I'm not currently traveling anywhere so it will be more organic time off I guess | 19:45 |
tonyb | Same, although I'll be more busy with my kids the week after Christmas | 19:45 |
clarkb | I hope everyone else gets to enjoy some time off too. | 19:46 |
clarkb | Also worth mentioning that since we don't have any weekly meetings for a while please do bring up important topics via our typical comms channels if necessary | 19:47 |
tonyb | noted | 19:47 |
clarkb | and thank you everyone for all the help this year. I think it ended up being quite productive within OpenDev | 19:48 |
tonyb | clarkb: and thank you | 19:48 |
corvus | thanks all and thank you clarkb ! | 19:48 |
tonyb | clarkb: Oh did you have annual report stuff to write that you'd like eyes on? | 19:48 |
clarkb | tonyb: they have changed how we draft that stuff this year so I don't | 19:49 |
clarkb | there is still a small section of content but its much smaller in scope | 19:49 |
tonyb | clarkb: Oh cool? | 19:50 |
clarkb | I've used it to call out software updates (gerrit, gitea, etherpad etc) as well as onboarding new clouds (raxflex), and a notice that we lost one of two arm clouds | 19:50 |
tonyb | nice | 19:50 |
clarkb | more of a quick highlights list than an in-depth recap | 19:50 |
clarkb | if you feel strongly about something not in that list let me know and I can probably add it | 19:51 |
clarkb | and with that I think we can end the meeting | 19:52 |
tonyb | I don't have any strong feelings about it. It just occurred to me that we normally have $stuff to do at this time of year | 19:52 |
clarkb | ack | 19:52 |
tonyb | ++ | 19:52 |
clarkb | we'll be back here at our normal time and location on January 7 | 19:52 |
clarkb | #endmeeting | 19:53 |
opendevmeet | Meeting ended Tue Dec 17 19:53:05 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:53 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.html | 19:53 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.txt | 19:53 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.log.html | 19:53 |
-opendevstatus- | NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage. | 21:00 |
-opendevstatus- | NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart | 22:12 |