Tuesday, 2024-11-19

* frickler will be a couple of minutes late, need a break after the long TC meeting18:57
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov 19 19:00:04 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
* tonyb is on a train so coverage may be sporadic 19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BQ2IWL7BUMNUXYWCQV62DZQCF2AI7E5U/ Our Agenda19:00
clarkb#topic Announcements19:00
tonybalso in an EU timezone for the next few weeks19:00
clarkboh fun19:00
tonybshould be!19:00
clarkbNext week is a major US holiday week. I plan to be around Monday and Tuesday and will host the weekly meeting. But we should probably expect slower response times from various places19:01
clarkbAnything else to announce?19:01
clarkb#topic Zuul-launcher image builds19:03
clarkbcorvus has continued to iterate on the mechanics of uploading images to swift, downloading them to the launcher and reuploading to the clouds19:03
clarkband good news is the time to build a qcow2 image then shuffle it around is close to if not better than the time the current builders do it in19:04
clarkb#link https://review.opendev.org/935455 Setup a raw image cloud for raw image testing19:04
clarkbqcow2 images are relatively small compared to the raw and vhd images we also deal with so next step is testing this process with the larger image types19:04
clarkbThere is still opportunity to add image build jobs for other distros and releases as well19:04
tonybit's high on my to-do list19:05
clarkbcool19:05
clarkbanything else on this topic cc corvus 19:05
clarkbseems like good slow but steady progress19:06
clarkbI'll keep the meeting moving as we have a number of items to get through. We can always swing back to topics if we have time at the end or after the meeting etc19:07
clarkb#topic Backup Server Pruning19:07
clarkbthe smaller backup server got close to filling up its disk again and fungi pruned it again. Thank you for that19:07
clarkbbut this is a good reminder that we have a couple of changes proposed to help alleviate some of that by purging things from the backup servers once they are no longer needed19:07
clarkb#link https://review.opendev.org/c/opendev/system-config/+/933700 Backup deletions managed through Ansible19:08
clarkb#link https://review.opendev.org/c/opendev/system-config/+/934768 Handle backup verification for purged backups19:08
clarkboh I missed that fungi had +2'd but not approved the first one19:08
fungii wasn't sure if those needed close attention after merging19:09
clarkbshould we go ahead and either fix the indentation then approve or just approve it?19:09
fungii think we can just approve19:09
tonybI agree, we can do the indentation after if we want19:10
clarkbI think the test case on line 34 in https://review.opendev.org/c/opendev/system-config/+/933700/23/testinfra/test_borg_backups.py should ensure that it is fairly safe19:10
clarkbthen after it is landed we can touch the retired flag in the ethercalc dir and then add ethercalc to the list of retirements (to catch up the other server) check that worked, then add it to the purge list (to catch up the other server) and check that worked19:11
clarkbthen if that is good we can retire the rest of the items in our retirement list19:11
clarkbI think this should make managing disk consumption a bit more sane19:12
clarkbanything else related to backups?19:12
clarkbactually I shouldn't need to manually touch the retired file19:13
clarkbgoing through the motions to retire it on the other server should fix the one I did manually19:13
clarkb#topic Upgrading old servers19:13
clarkbtonyb: anything new to report on wiki or other server replacements?19:14
clarkb(I did have a note about the docker compose situation noble but was going to bring that up during the docker compose portion of the agenda)19:14
tonybnothing new.19:15
tonybI guess I discovered that noble is going to be harder than expected due to python being too new for docker-compose v119:15
clarkb#topic Docker Hub Rate Limits19:15
clarkb#undo19:15
opendevmeetRemoving item from minutes: #topic Docker Hub Rate Limits19:15
clarkbya thats the bit I was going to discuss during the docker compose podman section since they are related19:16
clarkbI have an idea for that that may not be terrible19:16
tonybokay that's cool19:16
clarkb#topic Docker Hub Rate Limits19:16
corvus[i'm here now]19:16
clarkbBefore we get there another related topic is that people have been noticing we're hitting docker hub rate limits more often19:16
clarkb#link https://www.docker.com/blog/november-2024-updated-plans-announcement/19:16
tonybI don't know why it's only just come up as we have noble servers19:16
clarkbtonyb: because mirrors don't use docker-compose I think19:17
clarkbbasically you've avoided the intersection so far19:17
clarkbso that blog post says anonymous requests will get 10 pulls per hour now19:17
clarkbwhich is a reduction from whatever the old value is. However, if I go through the dance of getting an anonymous pull token and inspect that token it says 100 pulls per 6 hours19:17
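(The token dance clarkb mentions follows Docker's documented rate-limit check against their ratelimitpreview/test repository; this is a rough sketch of that procedure, and the exact header values returned are whatever Docker Hub currently advertises:)

```shell
# Fetch an anonymous pull token from Docker Hub's auth service, then read
# the ratelimit headers off a manifest request made with that token.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
        | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i '^ratelimit'
# a header like "ratelimit-limit: 100;w=21600" means 100 pulls per 6 hours
```

(Note the HEAD request itself does not count against the pull limit, which is why it is safe to poll this way.)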
corvusmaybe they walked it back due to complaints...19:18
clarkbI've also experimentally checked docker image pull against 12 different alpine image tags and about 12 various library images from docker hub19:18
clarkband had no errors19:18
clarkbcorvus: ya that could be. Another thought I had was maybe they rate limit the library images differently than the normal images but once you hit the limit it fails for all pulls. But kolla/base reported the same 100 pulls per 6 hours limit that the other library images did so I don't think it is that19:19
fricklerwell 100/6h is 16/h, not too far off, just a bit more burst allowed19:19
clarkbfrickler: ya the burst is important for CI workloads though particularly since we cache things (if you use the proxy cache)19:19
corvusone contingency plan would be to mirror the dockerhub images we need on quay.io (or elsewhere); i started on https://review.opendev.org/935574 for that.19:19
clarkbplanning for contingencies and generally trying to be on the lookout for anything that helps us understand their changes (if any) would be good19:20
clarkbbut otherwise I now feel like I understand less today than I did yesterday. This doesn't feel like a drop everything emergency but something we should work to understand and then address19:20
tonybassuming that still works with speculative builds that seems like a solid contingency 19:20
clarkbanother suggested improvement was to lean on buildset registries more to stash all the images a buildset will need and not just those we may build locally19:21
clarkbthis way we're fetching images like mariadb once per buildset instead of N times for each of N jobs using it19:21
fungiheck, some jobs themselves may be pulling the same image multiple times for different hosts19:22
tonybtrue.19:22
clarkbso ya be on the lookout for more concrete info and feel free to experiment with making the jobs more image pull efficient19:23
clarkband maybe we'll know more next week19:23
clarkb#topic Docker compose plugin with podman service for servers19:23
clarkbThis agenda item is related to the previous in that it would allow us to migrate off of docker hub for opendev images and preserve our speculative testing of our images19:23
clarkbadditionally tonyb found that python docker-compose doesn't run on python3.12 on noble19:24
clarkbwhich is another reason to switch to docker compose19:24
corvus(in particular, getting python-base and python-builder on quay.io could be a big win for total image pulls)19:24
clarkball of this is coming together to make this effort a bit of a higher priority19:24
clarkbI'd like to talk about the noble docker compose thing first19:25
clarkbI suspect that in our install docker role we can do something hacky like have it install an alias/symlink/something that maps docker-compose to docker compose and then we won't have to rewrite our playbooks/roles until after everything has left docker-compose behind19:25
clarkbthe two tools have similar enough command lines that I suspect the only place we would run into trouble is anywhere we parse command output and we might have to check both versions instead19:26
tonybyeah, that's kinda gross but it'd work for now19:26
clarkbbut this way we don't need to rewrite every role using docker-compose today and as long as we don't do an in place upgrade we're replacing the servers anyway and they'll use the proper tool in that transition19:26
clarkbI think this wouldn't work if we did an in place switch between docker-compose and docker compose19:26
clarkbbut as long as its old focal server replaced by new noble server that should mostly work19:27
tonybyeah.   I guess we can come back and tidy up the roles after the fact19:27
clarkbassuming we do that the next question is do we also have the install docker role configure docker-compose (which is now really docker compose) to run podman for everything? I think we were leaning that way when we first discussed this19:27
clarkbthe upside to that is we get the speculative testing and don't have to be on docker hub19:28
clarkbtonyb: exactly19:28
corvusi think the last time someone talked about parsing output of "docker-compose" i suggested an alternative...  like maybe an "inspect" command we could use instead.19:28
clarkbcorvus: ++19:28
corvusit may actually be simpler to do it once with podman19:28
clarkbso to tl;dr all my typing above I think there are two improvements we should make to our docker installation role. A) set it up to alias docker-compose to docker compose somehow and B) configure docker compose to rely on podman as the runtime19:29
tonybwell one place that'd be hard is docker compose pull and looking for updates but yes generally avoiding parsing the output is good19:29
corvusjust because the difference between them is so small, but it's all at install time.  so switching from docker to podman later is more work than "docker compose" plugin with podman now.19:29
clarkbtonyb: ya I think thats the only place we do it. So we'd inspect, pull, inspect and see if they change or something19:29
clarkbcorvus: exactly19:29
corvustonyb: yeah, that's the thing i suggested an alternative to.  no one took me up on it at the time, but it's in scrollback somewhere.19:30
tonybokay 19:30
clarkbtonyb: I know you brought this up as a need for noble servers. I'm not sure if you are interested in making those changes to the docker installation role19:30
tonybyeah I can 19:30
corvusi think the only outstanding question for https://review.opendev.org/923084 is how to set up the podman socket path -- i think in a previous meeting we identified a potential easy way of doing that.19:31
tonybI don't know exactly what's needed for the podman enablement 19:31
clarkbI can probably help though I feel like this week is already swamped and next week is the holiday, but ping me for reviews or ideas etc. I ended up brainstorming this a bit the other day so enough is paged in I think I can be useful19:31
tonyband making it nice but I can take direction 19:31
clarkbtonyb: 923084 does it in a different context so we have to map that into system-config19:31
tonybnoted19:31
clarkband i guess address the question of the socket path19:31
tonybokay19:32
clarkbsounds like no one is terribly concerned about these hacks and we should be able to get away from them as soon as we're sufficiently migrated19:32
clarkbAnything else on these subjects?19:32
corvusoh yeah docker "contexts" is the thing19:32
corvusthat might make setting the path easy19:32
tonybokay cool19:33
corvus#link https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-01-19.00.log.html#l-91 docker contexts19:33
tonyband this is for noble+19:33
clarkbthanks!19:33
clarkbtonyb: yes19:33
corvusyep19:33
tonybperfect 19:33
corvusoh ha19:34
corvusclarkb: also you suggested we could set the env var in a "docker-compose" compat tool shim :)19:34
corvus(in that meeting)19:34
tonyb( for the record I'm about to get off the train which probably equates to offline)19:34
clarkbno wonder when I was thinking about tonyb's problem I was like this is the solution19:34
tonybyeah that could work I guess19:35
corvus¿por que no los dos?19:35
clarkbtonyb: ack don't hurt yourself trying to type and walk at the same time19:35
clarkb#topic Enabling mailman3 bounce processing19:35
clarkblet's keep this show moving forward19:35
clarkblast week lists.opendev.org and lists.zuul-ci.org lists were all set to (or already set to) enable bounce processing19:35
clarkbfungi: do you know if openstack-discuss got its config updated?19:35
clarkbthen separately I haven't received any notifications of members hitting the limits for any of the lists I moderate and can't find evidence of anyone with a score higher than 3 (5 is the threshold)19:36
clarkbso I'm curious if anyone has seen that in action yet19:36
fungiclarkb: i've not done that yet, no19:37
clarkbok it would probably be good to set as I suspect that's where we'll get the most timely feedback19:37
clarkbthen if we're still happy with the results enabling this by default on new lists is likely to require we define a custom mailman list style and create new lists with that style19:37
clarkbthe documentation is pretty sparse on how you're actually supposed to create a new style unfortunately19:38
fungii've just now switched "process bounces" to "yes" for openstack-discuss19:38
clarkb(there is a rest api endpoint for it but without info on how you set the millions of config options)19:38
clarkbfungi: thanks!19:38
clarkbfungi: you don't happen to know what would be required to set up a new list style do you?19:39
clarkb(we probably actually need two, one for private lists and one for public lists)19:39
fungino clue, short of the instructions for making a mailman plugin in python. might be worth one of us asking on the mailman-users ml19:39
corvusi think i have an instance of a user where bounce processing is not working19:39
clarkbcorvus: are they hitting the threshold then not getting removed?19:40
clarkbthe bounce disable warnings configuration item implies there is some delay that must be reached after being above the threshold before you are removed19:42
clarkbI wonder if the 7 day threshold reset is resetting them to 0 before hitting that if so19:42
corvushrm, i got a message about an unprocessed bounce a while back, but the user does not appear to be a member anymore.  so this may not be actionable.19:42
corvusnot sure what happened in the interim.19:42
clarkback, I guess we monitor for current behavior and debug from there19:42
clarkbfungi: ++ that seems like a good idea.19:42
clarkbI'm going to keep things moving as we have less than 20 minutes and still several topics to cover19:43
clarkb#topic Intermediate Insecure CI Registry Pruning19:43
corvus0x8e19:43
clarkbAs scheduled/announced we started this on Friday. We hit a few issues. The first was 404s on object delete requests19:43
clarkbthat was a simple fix as we can simply ignore 404 errors when trying to delete something. The other was that we weren't paginating object listings so were capped at 10k objects per listing request19:43
clarkbthis gave the pruning process an incomplete (and possibly inaccurate) picture of what should be deleted vs kept19:44
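(The pagination fix described above amounts to a marker-based listing loop; this is a minimal sketch where the function name and the shape of the listing command are hypothetical — swift caps each listing response, so the loop feeds the last name seen back in as the next request's marker:)

```shell
# paginate LIST_CMD...: call LIST_CMD with the current marker until it
# returns an empty page, emitting every object name exactly once.
paginate() {
  marker=""
  while page=$("$@" "$marker"); [ -n "$page" ]; do
    printf '%s\n' "$page"
    # last name on this page becomes the marker for the next request
    marker=$(printf '%s\n' "$page" | tail -n 1)
  done
}
```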
corvusthat means we've fixed two bugs that could have caused the previous issues!  :)19:44
clarkbthe process was restarted after fixing the problems and has been running since late friday. We anticipate it will take at least 6 days though I think it is trending slowly to be longer as it goes on19:44
clarkbas far as I can tell no one has had problems with the intermediate registry while this is running either which is a good sign we're not over deleting anything19:45
clarkb#link https://review.opendev.org/c/opendev/system-config/+/935542 Enable daily pruning after this bulk prune is complete19:45
corvus0x8e out of 0xff means we're 55% through the blobs.19:45
clarkbonce this is done we should be able to run regular pruning that doesn't take days to complete since we'll have far fewer objects to contend with19:45
clarkbthat change enables a cron job to do this. Which is probably good for hygiene purposes but shouldn't be merged until after we complete this manual run and are happy with it19:46
clarkbwe'll be much more disk efficient as a result \o/19:46
clarkbcorvus: napkin math has that taking closer to 8 days now?19:46
* clarkb looks at the clock and continues on19:47
clarkb#topic Gerrit 3.10 Upgrade Planning19:47
corvusare we starting from friday or saturday?19:47
clarkbcorvus: I think we started at roughly 00:00 UTC saturday19:48
clarkband we're almost at 20:00 UTC tuesday so like 7.75 days?19:48
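(The napkin math above, written out; the elapsed-time figure is an approximation taken from the timestamps in the discussion:)

```shell
# 0x8e (142) of 0xff (255) blob prefixes pruned after roughly 3.8 days
# of runtime (Sat 00:00 UTC to Tue ~19:45 UTC).
awk 'BEGIN {
  done = 142; total = 255; elapsed = 3.8
  printf "%d%% complete\n", int(done / total * 100)
  printf "~%.1f days total at this rate\n", elapsed * total / done
}'
```

(55% matches corvus's figure; the total estimate lands between the 6 and 8 day guesses depending on the exact elapsed time assumed, and on whether the remaining prefixes prune at the same rate.)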
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document19:48
clarkbI've announced the upgrade for December 6. I'm hoping that I can focus on getting the last bits of checks and testing done next week before the holiday so that the week after we aren't rushed19:49
clarkbIf you have time to read the etherpad and call out any additional concerns I would appreciate it. There is a held node too if you are interested in checking out the newer version of gerrit19:49
clarkbI'm not too terribly concerned about this though as there is a straightforward rollback path which i have also tested19:49
clarkb#topic mirror.sjc3.raxflex.opendev.org cinder volume issues19:50
clarkbYesterday after someone complained about this mirror not working I discovered it couldn't read sector 0 on its cinder volume backing the cache dirs19:50
clarkband then ~5 hours later it seems the server itself shutdown (maybe due to kernel panic after being in this state?)19:51
funginope, services got restarted19:51
clarkbin general that shouldn't restart VMs though? I guess maybe if you restart libvirt or something19:51
fungiwhich apparently resulted in a lot of server instances shutting off or rebooting19:51
fungiwell, that was the explanation we got anyway19:51
clarkbanyway klamath in #opendev reports that it should be happier now and that this wasn't intentional so we've stuck the mirror back into service19:51
clarkbthere are two things to consider though. One is that we are using a volume of type capacity instead of type standard and it is suggested we could change that19:52
clarkbthe other is if we rebuild our networks we will get bigger mtus19:52
fungifull 1500-byte mtus to the server instances, specifically19:52
clarkbto rebuild the networks I think the most straightforward option is to simply delete everything we've got and let cloud launcher recreate that stuff for us19:52
clarkbdoing that likely means deleting the mirror anyway based on mordred's report of not being able to change ports for standard port creation on instance create processes19:53
clarkbso long story short we should pick a time where we intentionally stop using this cloud, delete the networks and all servers, rerun cloud launcher to recreate networks, then rebuild our mirror using the new networks and a new cinder volume of type standard19:54
fungisounds good19:54
clarkbthat might be a good big exercise for the week after the gerrit upgrade19:54
clarkbin theory things will slow down as we near the end of the year making those changes less impactful19:54
clarkbwe should also check that the cloud launcher is running happily before we do that19:55
clarkb(to avoid delay in reconfiguring the networks)19:55
fungiyep19:55
clarkb#topic Open Discussion19:56
clarkbhttps://review.opendev.org/c/opendev/lodgeit/+/935712 someone noticed problems with captcha rendering in lodgeit today and has already pushed up a fix19:56
clarkbwe've also been updating our openafs packages and rolling those out with reboots to affected servers19:56
fungi#link https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs19:57
fungii'm still planning to do the openinfra.org mailing list migration on december 2, i have the start of a migration plan outline in an etherpad i'll share when i get it a little more fleshed out19:58
fungisending an announcement about it to the foundation ml later today19:58
clarkbfungi: is that something we can/should test in the system-config-run mm3 job?19:58
clarkbI assume we create a new domain then move the lists over then add the lists to our config?19:59
clarkbI'll wait for the etherpad no need to run through the whole process here19:59
fungino, database update query19:59
clarkboh fun19:59
fungimailman core, hyperkitty and postorius all use the django db19:59
fungiso can basically just change the domain/host references there and then update our ansible data to match so it doesn't recreate the old lists20:00
clarkband we are at time20:00
clarkbfungi: got it20:00
clarkbmakes sense20:00
clarkbthank you everyone! as mentioned we'll be back here next week per usual despite the holiday for several of us20:00
fungithanks clarkb!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov 19 20:00:33 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.log.html20:00
clarkband now time for lunch20:02

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!