nick | message | time |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Dec 17 19:00:18 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/MID2FVRVSSZBARY7TM64ZWOE2WXI264E/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | This will be our last meeting of 2024. The next two tuesdays overlap with major holidays (December 24 and 31) so we'll skip the next two weeks of meetings and be back here January 7, 2025 | 19:01 |
clarkb | #topic Zuul-launcher image builds | 19:03 |
clarkb | Not sure if we've got corvus around but yesterday I'm pretty sure I saw zuul changes get proposed for adding image build management API stuff to zuul web | 19:03 |
clarkb | including things like buttons to delete images | 19:03 |
corvus | o/ | 19:03 |
corvus | yep i'm partway through that | 19:04 |
corvus | a little more work needed on the backend for the image deletion lifecycle | 19:04 |
corvus | then the buttons will actually do something | 19:04 |
corvus | then we're ready to try it out. :) | 19:04 |
clarkb | corvus: zuul-client will get similar support too I hope? | 19:04 |
corvus | that should be able to happen, but it's not a priority for me at the moment. | 19:05 |
clarkb | ack | 19:06 |
clarkb | I guess if it is in the API then any api client can manage it | 19:06 |
clarkb | just a matter of adding support | 19:06 |
corvus | yeah | 19:06 |
corvus | i think the web is going to be a much nicer way of interacting with this stuff | 19:06 |
corvus | so i'd recommend it for opendev admins. :) | 19:06 |
corvus | (it's a lot of uuids) | 19:07 |
clarkb | any other image build updates? | 19:07 |
corvus | not from me | 19:07 |
clarkb | #topic Upgrading old servers | 19:08 |
clarkb | I have updates on adding noble support but that has a topic of its own | 19:08 |
clarkb | anyone else have updates for general server update related efforts? | 19:08 |
tonyb | Nothing new from me | 19:08 |
clarkb | cool let's dive into the next topic then | 19:08 |
clarkb | #topic Docker compose plugin with podman service for servers | 19:08 |
clarkb | The background on this is that Ubuntu Noble comes with python3.12, and docker-compose (the python tool) doesn't run on python3.12. Rather than try to resurrect that tool we instead switch to the docker compose golang implementation | 19:09 |
clarkb | then because it's a clean break we take this as an opportunity to run containers with podman on the backend rather than docker proper | 19:09 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/937641 docker compose and podman for Noble | 19:09 |
clarkb | This change and the rest of topic:podman-prep show that this generally works for a number of canary services (paste, meetpad, gitea, gerrit, mailman) within our CI jobs | 19:10 |
clarkb | There are a few things that do break though and most of topic:podman-prep has been written to update things to be forward and backward compatible so the same config management works on pre-Noble and Noble installations | 19:10 |
clarkb | things we've found so far: podman doesn't support syslog logging output, so we update that to journald which seems to be largely equivalent. Docker compose changes the actual underlying container names, so we switch to using docker-compose/docker compose commands which use logical container names. | 19:11 |
clarkb | oh and docker-compose pull is more verbose than docker compose pull, so we had to rewrite some ansible that checks if we pulled new images to more explicitly check whether that has happened | 19:12 |
clarkb | on the whole fairly minor things that we've been able to work around in every case, which I think is really promising | 19:12 |
tonyb | There is a CLI flag to keep the python style names if you want | 19:12 |
clarkb | tonyb: no I think this is better since it gets us into a more forward looking position | 19:12 |
clarkb | and the changes are straightforward | 19:13 |
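As a rough illustration of the syslog-to-journald switch mentioned above, a service's compose file might carry a logging stanza like the following; the service name, image reference, and tag option are illustrative assumptions, not OpenDev's actual configuration.

```yaml
# Hypothetical compose snippet: route container logs to journald instead of syslog.
services:
  paste:
    image: quay.io/opendevorg/lodgeit:latest  # placeholder image reference
    logging:
      driver: journald
      options:
        tag: paste  # assumed tag, for filtering with `journalctl -t paste`
```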
clarkb | as far as the implementation itself goes there are a few things to be aware of | 19:13 |
tonyb | okay | 19:13 |
clarkb | I've configured the podman.socket systemd unit to listen on docker's default socket path and disabled docker / put docker.socket on a different socket path | 19:13 |
clarkb | what this means is if you run `docker` or `docker compose` you automatically talk to podman on the backend | 19:14 |
tonyb | Which is kinda cute :) | 19:14 |
clarkb | when you run `podman` you also talk to podman on the backend but it seems to use a different (podman default) socket path. The good news is that despite the change of socket path podman commands also see all of the resources created via docker commands, so I don't think this is an issue | 19:14 |
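A minimal sketch of how that socket swap might look if done as a systemd drop-in deployed by Ansible; the drop-in path and task shape are assumptions, not the actual system-config change.

```yaml
# Hypothetical Ansible task: make podman.socket listen on Docker's default path.
- name: Point podman.socket at the default Docker socket path
  copy:
    dest: /etc/systemd/system/podman.socket.d/override.conf  # assumed drop-in path
    content: |
      [Socket]
      ListenStream=
      ListenStream=/var/run/docker.sock
# A daemon-reload and restart of podman.socket would be needed afterwards.
```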
clarkb | I also put a shell shim at /usr/local/bin/docker-compose to replace the python docker-compose command that would normally go there and it just passes through to `docker compose` | 19:15 |
clarkb | this is a compatibility layer for our configuration management that we can drop once everything is on noble or newer | 19:15 |
clarkb | oh and I've opted to install both docker-compose-v2 and podman from ubuntu distro packages. Not sure if we want to try and install one or the other or both from upstream sources | 19:15 |
clarkb | I went with distro packages because it made bootstrapping simpler, but now that we know it generally works we can expand to other sources if we like | 19:16 |
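For reference, the distro-package route amounts to something like this; the package names are the ones mentioned above, the task itself is a sketch.

```yaml
# Hypothetical Ansible task: pull compose v2 and podman from Ubuntu's archive.
- name: Install compose v2 and podman from distro packages
  apt:
    name:
      - docker-compose-v2
      - podman
    state: present
```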
clarkb | I think where we are at now is deciding how we want to proceed with this since nothing has exploded dramatically | 19:17 |
clarkb | in particular I think we have two options. One is to continue to use that WIP change to check our services in CI and fix them up before we land the changes to the install-docker role | 19:17 |
corvus | i think using docker cli commands when we run things manually makes sense to match our automation | 19:17 |
clarkb | the other is to accept we've probably done enough testing to proceed with something in production (like paste) and then get something into production more quickly | 19:17 |
clarkb | corvus: ++ | 19:17 |
clarkb | I've mostly used docker cli commands on the held node to test things out of habit too | 19:18 |
clarkb | if we want to proceed to production with a service first I can clean up the WIP change to make it mergeable, then the next step would be redeploying paste I think | 19:18 |
clarkb | otherwise I'll continue to iterate through our services and see what needs updates and fix them under topic:podman-prep | 19:18 |
tonyb | I like the get it into production approach | 19:19 |
frickler | +1 just maybe not right before the holidays? | 19:19 |
corvus | i recognize this is unhelpful, but i'm ambivalent about whether we go into prod now or continue to iterate. our pre-merge testing is pretty good. but maybe putting something simple in production will help us find unknown unknowns sooner. | 19:19 |
tonyb | Given the switch is combined with the OS update I think it's fairly safe and controllable | 19:20 |
clarkb | ya the unknown unknowns is probably the biggest concern at this point so pushing something into production probably makes the most sense | 19:20 |
clarkb | tonyb: I agree cc frickler | 19:20 |
corvus | i think if paste blows up we can roll it back pretty easily despite the holidays. | 19:20 |
clarkb | ok sounds like I should clean up that change so we can merge it and then maybe tomorrowish start a new paste | 19:20 |
clarkb | oh the last thing I wanted to call out about this effort is we landed changes to "fix" logging for several services yesterday all of which have restarted onto the new docker compose config except for gerrit | 19:21 |
clarkb | after lunch today I'd like to restart gerrit so that it is running under that new config (mostly a heads up; I should be able to do that without trouble since the images aren't updating, just the logging config for docker) | 19:22 |
tonyb | We can keep the existing paste around; rollback, if needed, would then just be DNS plus a data dump+restore | 19:22 |
corvus | yeah i think doing a gerrit restart soon would be good | 19:22 |
clarkb | tonyb: ++ | 19:22 |
clarkb | ok I think that is the feedback I needed to take the next few steps on this effort. Thanks! | 19:22 |
clarkb | any other questions/concerns/feedback? | 19:22 |
corvus | thank you clarkb | 19:23 |
tonyb | clarkb: Thanks for doing this, I'm sorry I slowed things down. | 19:23 |
clarkb | tonyb: I don't think you slowed things down. I wasn't able to get to it until I got to it | 19:23 |
tonyb | clarkb: also ++ on a gerrit restart | 19:23 |
clarkb | #topic Docker Hub Rate Limits | 19:25 |
clarkb | Last week when we updated Gerrit's image to pick up the openid fix I wrote we discovered that we were at the docker hub rate limit despite not pulling any images for days | 19:25 |
clarkb | this led to us discovering that docker hub treats entire ipv6 /64s as a single IP for rate limit purposes | 19:25 |
clarkb | at the time we worked around this by editing /etc/hosts to use ipv4 addresses for the two locations that are used to fetch docker images from docker hub and that sorted us out | 19:26 |
clarkb | I think we could update our zuul jobs to maybe do that on the fly if we like though I haven't looked into doing that myself | 19:26 |
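If jobs did this on the fly it could look something like the sketch below. The hostnames are the usual Docker Hub fetch endpoints (an assumption here, not something confirmed in the discussion) and the addresses are documentation-range placeholders, not the real IPv4 addresses used.

```yaml
# Hypothetical Ansible task: pin Docker Hub endpoints to IPv4 so rate limits
# are not shared across an entire IPv6 /64.
- name: Pin Docker Hub fetch endpoints to IPv4 addresses
  lineinfile:
    path: /etc/hosts
    line: "{{ item }}"
  loop:
    - "203.0.113.10 registry-1.docker.io"              # placeholder address
    - "203.0.113.11 production.cloudflare.docker.com"  # placeholder address
```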
clarkb | I know corvus is advocating for more of a get off docker hub approach which is partially what is driving the noble podman work | 19:26 |
clarkb | basically I've been going down that path myself too | 19:27 |
corvus | i think we have everything we need to start mirroring images to quay | 19:27 |
clarkb | oh right the role to drive that landed in zuul-jobs last week too | 19:27 |
corvus | next step for that would be to write a simple zuul job that used the new role and run it in a periodic pipeline | 19:28 |
clarkb | I think starting with the python base image(s) and mariadb image(s) would be a good start | 19:28 |
corvus | ++ | 19:28 |
clarkb | corvus: where do you think that job should live and what namespace should we publish to? | 19:29 |
clarkb | like should opendev host the job and also host the images in the quay opendev location? | 19:29 |
corvus | yeah for those images, i think so | 19:30 |
clarkb | ya I think that makes sense since we would be primary consumers of them | 19:30 |
corvus | what's a little less clear is: | 19:30 |
corvus | say the zuul project needs something else; should it make its own job and use its own namespace, or should we have the opendev job be open to anyone? | 19:31 |
corvus | i don't think it'd be a huge imposition on us to review one-liner additions.... but also, it can be self-serve for individual projects so maybe it should be? | 19:31 |
clarkb | ya getting us out of the critical path unless there is broader usefulness is probably a good thing | 19:32 |
corvus | (also, i guess if the list of images gets very long, we'd need multiple jobs anyway...) | 19:32 |
corvus | ok, so multiple jobs managed by projects; sounds good to me | 19:33 |
tonyb | Yeah I think I'm leaning that way also | 19:33 |
corvus | (we might also want one job per image for rate limit purposes :) | 19:33 |
clarkb | that is a very good point actually | 19:34 |
corvus | oh also, i'd put the "pull" part of the job in a pre-run playbook. ;) | 19:34 |
clarkb | also a good low impact change over holidays for anyone who finds time to do it | 19:35 |
clarkb | that is also a good idea | 19:35 |
corvus | oh phooey we can't | 19:35 |
corvus | we'd need to split the role. oh well. | 19:35 |
clarkb | can cross that bridge if/when we get there | 19:35 |
corvus | yep. | 19:35 |
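Putting the pieces of this exchange together, a per-image periodic mirroring job might be shaped roughly like this; the job name, playbook path, variable names, and quay namespace are all illustrative assumptions rather than the real zuul-jobs interface.

```yaml
# Hypothetical Zuul config: one periodic job per mirrored image.
- job:
    name: opendev-mirror-python-base
    description: Mirror the python base image from Docker Hub to quay.io.
    run: playbooks/mirror-image/run.yaml  # assumed playbook wrapping the new role
    vars:
      mirror_source: docker.io/opendevorg/python-base  # assumed image name
      mirror_target: quay.io/opendevorg/python-base

- project:
    periodic:
      jobs:
        - opendev-mirror-python-base
```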
clarkb | anything else on this topic? | 19:35 |
corvus | nak | 19:36 |
clarkb | #topic Rax-ord Noble Nodes with 1 VCPU | 19:36 |
clarkb | I don't really have anything new on this topic, but it seems like this problem occurs very infrequently | 19:36 |
clarkb | I wanted to make sure I wasn't missing anything important on this or that I'm wrong about ^ that observation | 19:37 |
tonyb | I trust your investigation | 19:37 |
clarkb | the rax flex folks indicated that anything xen related is not in their wheelhouse unfortunately | 19:37 |
clarkb | so I suspect that addressing this is going to require someone to file a proper ticket with rax | 19:38 |
clarkb | not sure that is high on my priority list unless the error rate increases | 19:38 |
tonyb | Can we add something to detect this early and fail the job, which would get a new node from the pool? | 19:38 |
clarkb | tonyb: yes, I think you could do a check along the lines of: if the ansible bios version is too low and the cpu count is 1, exit with failure | 19:39 |
clarkb | I don't think we want to check only the cpu count since we may have jobs that intentionally try to use fewer cpus and then wonder why they fail in a year or three, so try to constrain it as much as possible to the symptoms we have identified | 19:39 |
tonyb | clarkb: ++ | 19:40 |
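A minimal sketch of that early-failure check as an Ansible task, using the standard ansible_processor_vcpus and ansible_bios_version facts; the BIOS version threshold is a placeholder, not a value anyone confirmed.

```yaml
# Hypothetical pre-job check: bail out on the known-bad 1-vCPU Xen symptom.
- name: Abort early on suspect 1-vCPU nodes with an old BIOS
  fail:
    msg: "Node matches the bad rax-ord flavor symptoms: 1 vCPU and old BIOS"
  when:
    - ansible_processor_vcpus == 1
    - ansible_bios_version is version('4.1', '<')  # placeholder threshold
```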
clarkb | #topic Ubuntu-ports mirror fixups | 19:40 |
tonyb | I'll see if I can get something written today. | 19:40 |
clarkb | thanks | 19:40 |
clarkb | fungi discovered a few of our afs mirrors were stale and managed to fix them all in a pretty straightforward manner except for ubuntu-ports | 19:41 |
clarkb | reprepro was complaining about a corrupt Berkeley DB for ubuntu-ports; fungi rebuilt the db which fixed the db issue, but then some tempfile which records package info was growing without bound, recording the same packages over and over again | 19:42 |
clarkb | eventually that would hit disk limits for the disk the temp file was written to and we would fail | 19:42 |
clarkb | where we've ended up is that fungi has deleted the ubuntu-ports RW volume content and has started reprepro over from scratch | 19:42 |
clarkb | the RO volumes still have the old stale content so our jobs should continue to run successfully | 19:42 |
clarkb | this is mostly a heads up about this situation as it may take some time to correct | 19:43 |
clarkb | the rebuild from scratch is running in a screen session on mirror-update, though I haven't checked in on it myself yet | 19:43 |
clarkb | fungi is enjoying some well deserved vacation time so we don't have an update from him on this but we can check directly on progress in the screen if anything comes up in the short term | 19:44 |
clarkb | sounded like fungi would try to monitor it as he can though | 19:44 |
clarkb | #topic Open Discussion | 19:44 |
clarkb | Anything else? | 19:44 |
clarkb | I guess it is worth mentioning that I'm expecting to be around this week. But then the two weeks after I'll be in and out as things happen with the kids etc | 19:45 |
clarkb | I don't really have a concrete schedule as I'm not currently traveling anywhere so it will be more organic time off I guess | 19:45 |
tonyb | Same, although I'll be more busy with my kids the week after Christmas | 19:45 |
clarkb | I hope everyone else gets to enjoy some time off too. | 19:46 |
clarkb | Also worth mentioning that since we don't have any weekly meetings for a while please do bring up important topics via our typical comms channels if necessary | 19:47 |
tonyb | noted | 19:47 |
clarkb | and thank you everyone for all the help this year. I think it ended up being quite productive within OpenDev | 19:48 |
tonyb | clarkb: and thank you | 19:48 |
corvus | thanks all and thank you clarkb ! | 19:48 |
tonyb | clarkb: Oh did you have annual report stuff to write that you'd like eyes on? | 19:48 |
clarkb | tonyb: they have changed how we draft that stuff this year so I don't | 19:49 |
clarkb | there is still a small section of content but its much smaller in scope | 19:49 |
tonyb | clarkb: Oh cool? | 19:50 |
clarkb | I've used it to call out software updates (gerrit, gitea, etherpad etc) as well as onboarding new clouds (raxflex), and a notice that we lost one of two arm clouds | 19:50 |
tonyb | nice | 19:50 |
clarkb | more of a quick highlights list than an in-depth recap | 19:50 |
clarkb | if you feel strongly about something not in that list let me know and I can probably add it | 19:51 |
clarkb | and with that I think we can end the meeting | 19:52 |
tonyb | I don't have any strong feelings about it. It just occurred to me that we normally have $stuff to do at this time of year | 19:52 |
clarkb | ack | 19:52 |
tonyb | ++ | 19:52 |
clarkb | we'll be back here at our normal time and location on January 7 | 19:52 |
clarkb | #endmeeting | 19:53 |
opendevmeet | Meeting ended Tue Dec 17 19:53:05 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:53 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.html | 19:53 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.txt | 19:53 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-17-19.00.log.html | 19:53 |
-opendevstatus- | NOTICE: Gerrit will be restarted to pick up a small configuration update. You may notice a short Gerrit outage. | 21:00 |
-opendevstatus- | NOTICE: You may have noticed the Gerrit restart was a bit bumpy. We have identified an issue with Gerrit caches that we'd like to address which we think will make this better. This requires one more restart | 22:12 |