Tuesday, 2024-12-17

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Dec 17 19:00:18 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MID2FVRVSSZBARY7TM64ZWOE2WXI264E/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbThis will be our last meeting of 2024. The next two tuesdays overlap with major holidays (December 24 and 31) so we'll skip the next two weeks of meetings and be back here January 7, 202519:01
clarkb#topic Zuul-launcher image builds19:03
clarkbNot sure if we've got corvus around but yesterday I'm pretty sure I saw zuul changes get proposed for adding image build management API stuff to zuul web19:03
clarkbincluding things like buttons to delete images19:03
corvuso/19:03
corvusyep i'm partway through that19:04
corvusa little more work needed on the backend for the image deletion lifecycle19:04
corvusthen the buttons will actually do something19:04
corvusthen we're ready to try it out.  :)19:04
clarkbcorvus: zuul-client will get similar support too I hope?19:04
corvusthat should be able to happen, but it's not a priority for me at the moment.19:05
clarkback19:06
clarkbI guess if it is in the API then any api client can manage it19:06
clarkbjust a matter of adding support19:06
corvusyeah19:06
corvusi think the web is going to be a much nicer way of interacting with this stuff19:06
corvusso i'd recommend it for opendev admins.  :)19:06
corvus(it's a lot of uuids)19:07
clarkbany other image build updates?19:07
corvusnot from me19:07
clarkb#topic Upgrading old servers19:08
clarkbI have updates on adding Noble support but that has a topic of its own19:08
clarkbanyone else have updates for general server update related efforts?19:08
tonybNothing new from me19:08
clarkbcool let's dive into the next topic then19:08
clarkb#topic Docker compose plugin with podman service for servers19:08
clarkbThe background on this is that Ubuntu Noble comes with python3.12 and docker-compose (the python tool) doesn't run on python3.12. Rather than try to resurrect that tool we instead switch to the docker compose golang implementation19:09
clarkbthen because it's a clean break we take this as an opportunity to run containers with podman on the backend rather than docker proper19:09
clarkb#link https://review.opendev.org/c/opendev/system-config/+/937641 docker compose and podman for Noble19:09
clarkbThis change and the rest of topic:podman-prep show that this generally works for a number of canary services (paste, meetpad, gitea, gerrit, mailman) within our CI jobs19:10
clarkbThere are a few things that do break though and most of topic:podman-prep has been written to update things to be forward and backward compatible so the same config management works on pre Noble and Noble installations19:10
clarkbthings we've found so far: podman doesn't support syslog logging output so we update that to journald which seems to be largely equivalent. Docker compose changes the actual underlying container names so we switch to using docker-compose/docker compose commands which use logical container names.19:11
clarkboh and docker-compose pull is more verbose than docker compose pull so we had to rewrite some ansible that checks if we pulled new images to more explicitly check if that has happened19:12
clarkbon the whole fairly minor things that we've been able to work around in every case which I think is really promising19:12
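For illustration, the logging change amounts to a per-service compose stanza along these lines (the service and image names here are invented placeholders, not the actual production config):

    services:
      paste:
        image: docker.io/example/paste:latest   # placeholder image reference
        logging:
          # podman has no syslog log driver, so the services switch to journald
          driver: journald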
tonybThere is a CLI flag to keep the python style names if you want19:12
clarkbtonyb: no I think this is better since it gets us into a more forward looking position19:12
clarkband the changes are straightforward19:13
clarkbas far as the implementation itself goes there are a few things to be aware of19:13
tonybokay19:13
clarkbI've configured the podman.socket systemd unit to listen on docker's default socket path and disabled docker / put docker.socket on a different socket path19:13
clarkbwhat this means is if you run `docker` or `docker compose` you automatically talk to podman on the backend19:14
tonybWhich is kinda cute :)19:14
clarkbwhen you run `podman` you also talk to podman on the backend but it seems to use a different (podman default) socket path. The good news is that despite the change of socket path podman commands also see all of the docker command created resources so I don't think this is an issue19:14
clarkbI also put a shell shim at /usr/local/bin/docker-compose to replace the python docker-compose command that would normally go there and it just passes through to `docker compose`19:15
clarkbthis is a compatibility layer for our configuration management that we can drop once everything is on noble or newer19:15
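A minimal sketch of that arrangement, assuming a systemd drop-in for the socket path and a one-line shim (the drop-in path is a guess, the real install-docker role changes are more involved, and production moves docker.socket to another path rather than simply disabling it):

    # point podman.socket at docker's default socket path
    sudo mkdir -p /etc/systemd/system/podman.socket.d
    cat <<'EOF' | sudo tee /etc/systemd/system/podman.socket.d/docker-socket.conf
    [Socket]
    ListenStream=
    ListenStream=/var/run/docker.sock
    EOF
    sudo systemctl daemon-reload
    # simplified: disable docker entirely instead of relocating docker.socket
    sudo systemctl disable --now docker.service docker.socket
    sudo systemctl enable --now podman.socket

    # compatibility shim replacing the old python docker-compose entrypoint
    printf '#!/bin/sh\nexec docker compose "$@"\n' | sudo tee /usr/local/bin/docker-compose >/dev/null
    sudo chmod +x /usr/local/bin/docker-compose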
clarkboh and I've opted to install both docker-compose-v2 and podman from ubuntu distro packages. Not sure if we want to try and install one or the other or both from upstream sources19:15
clarkbI went with distro packages because it made bootstrapping simpler, but now that we know it generally works we can expand to other sources if we like19:16
clarkbI think where we are at now is deciding how we want to proceed with this since nothing has exploded dramatically19:17
clarkbin particular I think we have two options. One is to continue to use that WIP change to check our services in CI and fix them up before we land the changes to the install-docker role19:17
corvusi think using docker cli commands when we run things manually makes sense to match our automation19:17
clarkbthe other is to accept we've probably done enough testing to proceed with something in production (like paste) and then get something into production more quickly19:17
clarkbcorvus: ++19:17
clarkbI've mostly used docker cli commands on the held node to test things out of habit too19:18
clarkbif we want to proceed to production with a service first I can clean up the WIP change to make it mergable then next step would be redeploying paste I think19:18
clarkbotherwise I'll continue to iterate through our services and see what needs updates and fix them under the topic:podman-prep topic19:18
tonybI like the get it into production approach19:19
frickler+1 just maybe not right before the holidays?19:19
corvusi recognize this is unhelpful, but i'm ambivalent about whether we go into prod now or continue to iterate.  our pre-merge testing is pretty good.  but maybe putting something simple in production will help us find unknown unknowns sooner.19:19
tonybGiven the switch is combined with the OS update I think it's fairly safe and controllable19:20
clarkbya the unknown unknowns are probably the biggest concern at this point so pushing something into production probably makes the most sense19:20
clarkbtonyb: I agree cc frickler 19:20
corvusi think if paste blows up we can roll it back pretty easily despite the holidays.19:20
clarkbok sounds like I should clean up that change so we can merge it and then maybe tomorrowish start a new paste19:20
clarkboh the last thing I wanted to call out about this effort is we landed changes to "fix" logging for several services yesterday all of which have restarted onto the new docker compose config except for gerrit19:21
clarkbafter lunch today I'd like to restart gerrit so that it is running under that new config (mostly a heads up; I should be able to do that without trouble since the images aren't updating, just the logging config for docker)19:22
tonybWe can keep the existing paste around; then rollback if needed would just be DNS and a data dump+restore19:22
corvusyeah i think doing a gerrit restart soon would be good19:22
clarkbtonyb: ++19:22
clarkbok I think that is the feedback I needed to take the next few steps on this effort. Thanks!19:22
clarkbany other questions/concerns/feedback?19:22
corvusthank you clarkb 19:23
tonybclarkb: Thanks for doing this, I'm sorry I slowed things down.19:23
clarkbtonyb: I don't think you slowed things down. I wasn't able to get to it until I got to it19:23
tonybclarkb: also ++ on a gerrit restart19:23
clarkb#topic Docker Hub Rate Limits19:25
clarkbLast week when we updated Gerrit's image to pick up the openid fix I wrote we discovered that we were at the docker hub rate limit despite not pulling any images for days19:25
clarkbthis led to us discovering that docker hub treats entire ipv6 /64s as a single IP for rate limit purposes19:25
clarkbat the time we worked around this by editing /etc/hosts to use ipv4 addresses for the two locations that are used to fetch docker images from docker hub and that sorted us out19:26
clarkbI think we could update our zuul jobs to maybe do that on the fly if we like though I haven't looked into doing that myself19:26
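Done on the fly, that workaround boils down to something like the following shell snippet (the exact hostnames Docker Hub pulls go through are assumptions here; check what was actually pinned in production):

    # pin IPv4 addresses for the Docker Hub endpoints so pulls don't share the
    # per-/64 IPv6 rate limit bucket with other hosts
    for host in registry-1.docker.io production.cloudflare.docker.com; do
        addr=$(getent ahostsv4 "$host" | awk '{print $1; exit}')
        [ -n "$addr" ] && echo "$addr $host" | sudo tee -a /etc/hosts
    done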
clarkbI know corvus is advocating for more of a get off docker hub approach which is partially what is driving the noble podman work19:26
clarkbbasically I've been going down that path myself too19:27
corvusi think we have everything we need to start mirroring images to quay19:27
clarkboh right the role to drive that landed in zuul-jobs last week too19:27
corvusnext step for that would be to write a simple zuul job that used the new role and run it in a periodic pipeline19:28
clarkbI think starting with the python base image(s) and mariadb image(s) would be a good start19:28
corvus++19:28
clarkbcorvus: where do you think that job should live and what namespace should we publish to?19:29
clarkblike should opendev host the job and also host the images in the quay opendev location?19:29
corvusyeah for those images, i think so19:30
clarkbya I think that makes sense since we would be primary consumers of them19:30
corvuswhat's a little less clear is:19:30
corvussay the zuul project needs something else; should it make its own job and use its own namespace, or should we have the opendev job be open to anyone?19:31
corvusi don't think it'd be a huge imposition on us to review one-liner additions.... but also, it can be self-serve for individual projects so maybe it should be?19:31
clarkbya getting us out of the critical path unless there is broader usefulness is probably a good thing19:32
corvus(also, i guess if the list of images gets very long, we'd need multiple jobs anyway...)19:32
corvusok, so multiple jobs managed by projects; sounds good to me19:33
tonybYeah I think I'm leaning that way also19:33
corvus(we might also want one job per image for rate limit purposes :)19:33
clarkbthat is a very good point actually19:34
corvusoh also, i'd put the "pull" part of the job in a pre-run playbook.  ;)19:34
clarkbalso a good low impact change over holidays for anyone who finds time to do it19:35
clarkbthat is also a good idea19:35
corvusoh phooey we can't19:35
corvuswe'd need to split the role.  oh well.19:35
clarkbcan cross that bridge if/when we get there19:35
corvusyep.19:35
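Sketched out, one of those periodic per-image mirror jobs might look roughly like this (job name, playbook path, secret, variable names, and the quay namespace are illustrative guesses rather than the actual role interface in zuul-jobs):

    - job:
        name: mirror-mariadb-image
        description: Copy the mariadb image from Docker Hub to quay.io.
        run: playbooks/mirror-image/run.yaml
        secrets:
          - name: quay_credentials
            secret: opendevmirror-quay-credentials
        vars:
          # one image per job keeps each run small for rate limit purposes
          mirror_source: docker.io/library/mariadb:latest
          mirror_target: quay.io/opendevmirror/mariadb:latest

    - project:
        periodic:
          jobs:
            - mirror-mariadb-image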
clarkbanything else on this topic?19:35
corvusnak19:36
clarkb#topic Rax-ord Noble Nodes with 1 VCPU19:36
clarkbI don't really have anything new on this topic, but it seems like this problem occurs very infrequently19:36
clarkbI wanted to make sure I wasn't missing anything important on this or that I'm wrong about ^ that observation19:37
tonybI trust your investigation19:37
clarkbthe rax flex folks indicated that anything xen related is not in their wheelhouse unfortunately19:37
clarkbso I suspect that addressing this is going to require someone to file a proper ticket with rax19:38
clarkbnot sure that is high on my priority list unless the error rate increases19:38
tonybCan we add something to detect this early and fail the job, which would get a new node from the pool?19:38
clarkbtonyb: yes I think you could do a check along the lines of: if the ansible bios version is too low and num cpus is 1 then exit with failure19:39
clarkbI don't think we only want to check the cpu count since we may have jobs that intentionally try to use fewer CPUs and then wonder why they fail in a year or three, so try and constrain it as much as possible to the symptoms we have identified19:39
tonybclarkb: ++19:40
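A sketch of what that early check could look like as an ansible task (the BIOS version threshold below is a placeholder, and exactly where the task would live in the base job is not decided here):

    - name: Fail early on suspect rax-ord nodes that booted with a single VCPU
      fail:
        msg: Node has 1 VCPU on an old Xen BIOS; fail so the job retries on a fresh node
      when:
        - ansible_processor_vcpus == 1
        - ansible_bios_version is version('4.2', '<')  # placeholder threshold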
clarkb#topic Ubuntu-ports mirror fixups19:40
tonybI'll see if I can get something written today.19:40
clarkbthanks19:40
clarkbfungi discovered a few of our afs mirrors were stale and managed to fix them all in a pretty straightforward manner except for ubuntu-ports19:41
clarkbreprepro was complaining about a corrupt berkeley db for ubuntu-ports, fungi rebuilt the db which fixed the db issue but then some tempfile which records package info was growing to infinity recording the same packages over and over again19:42
clarkbeventually that would hit disk limits for the disk the temp file was written to and we would fail19:42
clarkbwhere we've ended up is that fungi has deleted the ubuntu-ports RW volume content and has started reprepro over from scratch19:42
clarkbthe RO volumes still have the old stale content so our jobs should continue to run successfully19:42
clarkbthis is mostly a heads up about this situation as it may take some time to correct19:43
clarkbthe rebuild from scratch is running in a screen on mirror-update, though I haven't checked in on it myself yet19:43
clarkbfungi is enjoying some well deserved vacation time so we don't have an update from him on this but we can check directly on progress in the screen if anything comes up in the short term19:44
clarkbsounded like fungi would try to monitor it as he can though19:44
clarkb#topic Open Discussion19:44
clarkbAnything else?19:44
clarkbI guess it is worth mentioning that I'm expecting to be around this week. But then the two weeks after I'll be in and out as things happen with the kids etc19:45
clarkbI don't really have a concrete schedule as i'm not currently traveling anywhere so it will be more organic time off I guess19:45
tonybSame, although I'll be more busy with my kids the week after Christmas19:45
clarkbI hope everyone else gets to enjoy some time off too.19:46
clarkbAlso worth mentioning that since we don't have any weekly meetings for a while please do bring up important topics via our typical comms channels if necessary19:47
tonybnoted19:47
clarkband thank you everyone for all the help this year. I think it ended up being quite productive within OpenDev19:48
tonybclarkb: and thank you19:48
corvusthanks all and thank you clarkb !19:48
tonybclarkb: Oh did you have annual report stuff to write that you'd like eyes on?19:48
clarkbtonyb: they have changed how we draft that stuff this year so I don't19:49
clarkbthere is still a small section of content but its much smaller in scope19:49
tonybclarkb: Oh cool?19:50
clarkbI've used it to call out software updates (gerrit, gitea, etherpad etc) as well as onboarding new clouds (raxflex) as well as a notice that we lost one of two arm clouds19:50
tonybnice19:50
clarkbmore of a quick highlights than an indepth recap19:50
clarkbif you feel strongly about something not in that list let me know and I can probably add it19:51
clarkband with that I think we can end the meeting19:52
tonybI don't have any strong feelings about it.  It just occurred to me that we normally have $stuff to do at this time of year19:52
clarkback19:52
tonyb++19:52
clarkbwe'll be back here at our normal time and location on January 719:52
clarkb#endmeeting19:53
opendevmeetMeeting ended Tue Dec 17 19:53:05 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:53
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.html19:53
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.txt19:53
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.log.html19:53
-opendevstatus- NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage.21:00
-opendevstatus- NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart22:12
