Tuesday, 2024-11-19

* frickler will be a couple of minutes late, need a break after the long TC meeting18:57
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov 19 19:00:04 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
* tonyb is on a train so coverage may be sporadic 19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BQ2IWL7BUMNUXYWCQV62DZQCF2AI7E5U/ Our Agenda19:00
clarkb#topic Announcements19:00
tonybalso in an EU timezone for the next few weeks19:00
clarkboh fun19:00
tonybshould be!19:00
clarkbNext week is a major US holiday week. I plan to be around Monday and Tuesday and will host the weekly meeting. But we should probably expect slower response times from various places19:01
clarkbAnything else to announce?19:01
clarkb#topic Zuul-launcher image builds19:03
clarkbcorvus has continued to iterate on the mechanics of uploading images to swift, downloading them to the launcher and reuploading to the clouds19:03
clarkband good news is the time to build a qcow2 image then shuffle it around is close to if not better than the time the current builders do it in19:04
clarkb#link https://review.opendev.org/935455 Setup a raw image cloud for raw image testing19:04
clarkbqcow2 images are relatively small compared to the raw and vhd images we also deal with so next step is testing this process with the larger image types19:04
clarkbThere is still opportunity to add image build jobs for other distros and releases as well19:04
tonybit's high on my to-do list19:05
clarkbcool19:05
clarkbanything else on this topic cc corvus 19:05
clarkbseems like good slow but steady progress19:06
clarkbI'll keep the meeting moving as we have a number of items to get through. We can always swing back to topics if we have time at the end or after the meeting etc19:07
clarkb#topic Backup Server Pruning19:07
clarkbthe smaller backup server got close to filling up its disk again and fungi pruned it again. Thank you for that19:07
clarkbbut this is a good reminder that we have a couple of changes proposed to help alleviate some of that by purging things from the backup servers once they are no longer needed19:07
clarkb#link https://review.opendev.org/c/opendev/system-config/+/933700 Backup deletions managed through Ansible19:08
clarkb#link https://review.opendev.org/c/opendev/system-config/+/934768 Handle backup verification for purged backups19:08
clarkboh I missed that fungi had +2'd but not approved the first one19:08
fungii wasn't sure if those needed close attention after merging19:09
clarkbshould we go ahead and either fix the indentation then approve or just approve it?19:09
fungii think we can just approve19:09
tonybI agree, we can do the indentation after if we want19:10
clarkbI think the test case on line 34 in https://review.opendev.org/c/opendev/system-config/+/933700/23/testinfra/test_borg_backups.py should ensure that it is fairly safe19:10
clarkbthen after it is landed we can touch the retired flag in the ethercalc dir and then add ethercalc to the list of retirements (to catch up the other server) check that worked, then add it to the purge list (to catch up the other server) and check that worked19:11
clarkbthen if that is good we can retire the rest of the items in our retirement list19:11
clarkbI think this should make managing disk consumption a bit more sane19:12
clarkbanything else related to backups?19:12
clarkbactually I shouldn't need to manually touch the retired file19:13
clarkbgoing through the motions to retire it on the other server should fix the one I did manually19:13
clarkb#topic Upgrading old servers19:13
clarkbtonyb: anything new to report on wiki or other server replacements?19:14
clarkb(I did have a note about the docker compose situation noble but was going to bring that up during the docker compose portion of the agenda)19:14
tonybnothing new.19:15
tonybI guess I discovered that noble is going to be harder than expected due to python being too new for docker-compose v119:15
clarkb#topic Docker Hub Rate Limits19:15
clarkb#undo19:15
opendevmeetRemoving item from minutes: #topic Docker Hub Rate Limits19:15
clarkbya thats the bit I was going to discuss during the docker compose podman section since they are related19:16
clarkbI have an idea for that that may not be terrible19:16
tonybokay that's cool19:16
clarkb#topic Docker Hub Rate Limits19:16
corvus[i'm here now]19:16
clarkbBefore we get there another related topic is that people have been noticing we're hitting docker hub rate limits more often19:16
clarkb#link https://www.docker.com/blog/november-2024-updated-plans-announcement/19:16
tonybI don't know why it's only just come up as we have noble servers19:16
clarkbtonyb: because mirrors don't use docker-compose I think19:17
clarkbbasically you've avoided the intersection so far19:17
clarkbso that blog post says anonymous requests will get 10 pulls per hour now19:17
clarkbwhich is a reduction from whatever the old value is. However, if I go through the dance of getting an anonymous pull token and inspect that token it says 100 pulls per 6 hours19:17
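(The token dance clarkb mentions follows Docker's documented rate-limit check against their ratelimitpreview/test repository; this is a rough sketch of that procedure, and the exact header values returned are whatever Docker Hub currently advertises:)

```shell
# Fetch an anonymous pull token from Docker Hub's auth service, then read
# the ratelimit headers off a manifest request made with that token.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
        | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i '^ratelimit'
# a header like "ratelimit-limit: 100;w=21600" means 100 pulls per 6 hours
```

(Note the HEAD request itself does not count against the pull limit, which is why it is safe to poll this way.)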
corvusmaybe they walked it back due to complaints...19:18
clarkbI've also experimentally checked docker image pull against 12 different alpine image tags and about 12 various library images from docker hub19:18
clarkband had no errors19:18
clarkbcorvus: ya that could be. Another thought I had was maybe they rate limit the library images differently than the normal images but once you hit the limit it fails for all pulls. But kolla/base reported the same 100 pulls per 6 hours limit that the other library images did so I don't think it is that19:19
fricklerwell 100/6h is 16/h, not too far off, just a bit more burst allowed19:19
clarkbfrickler: ya the burst is important for CI workloads though particularly since we cache things (if you use the proxy cache)19:19
corvusone contingency plan would be to mirror the dockerhub images we need on quay.io (or elsewhere); i started on https://review.opendev.org/935574 for that.19:19
clarkbplanning for contingencies and generally trying to be on the lookout for anything that helps us understand their changes (if any) would be good19:20
clarkbbut otherwise I now feel like I understand less today than I did yesterday. This doesn't feel like a drop everything emergency but something we should work to understand and then address19:20
tonybassuming that still works with speculative builds that seems like a solid contingency 19:20
clarkbanother suggested improvement was to lean on buildset registries more to stash all the images a buildset will need and not just those we may build locally19:21
clarkbthis way we're fetching images like mariadb once per buildset instead of N times for each of N jobs using it19:21
fungiheck, some jobs themselves may be pulling the same image multiple times for different hosts19:22
tonybtrue.19:22
clarkbso ya be on the lookout for more concrete info and feel free to experiment with making the jobs more image pull efficient19:23
clarkband maybe we'll know more next week19:23
clarkb#topic Docker compose plugin with podman service for servers19:23
clarkbThis agenda item is related to the previous in that it would allow us to migrate off of docker hub for opendev images and preserve our speculative testing of our images19:23
clarkbadditionally tonyb found that python docker-compose doesn't run on python3.12 on noble19:24
clarkbwhich is another reason to switch to docker compose19:24
corvus(in particular, getting python-base and python-builder on quay.io could be a big win for total image pulls)19:24
clarkball of this is coming together to make this effort a bit of a higher priority19:24
clarkbI'd like to talk about the noble docker compose thing first19:25
clarkbI suspect that in our install docker role we can do something hacky like have it install an alias/symlink/something that maps docker-compose to docker compose and then we won't have to rewrite our playbooks/roles until after everything has left docker-compose behind19:25
clarkbthe two tools have similar enough command lines that I suspect the only place we would run into trouble is anywhere we parse command output and we might have to check both versions instead19:26
tonybyeah, that's kinda gross but it'd work for now19:26
clarkbbut this way we don't need to rewrite every role using docker-compose today and as long as we don't do an in place upgrade we're replacing the servers anyway and they'll use the proper tool in that transition19:26
clarkbI think this wouldn't work if we did an in place switch between docker-compose and docker compose19:26
clarkbbut as long as its old focal server replaced by new noble server that should mostly work19:27
tonybyeah.   I guess we can come back and tidy up the roles after the fact19:27
clarkbassuming we do that the next question is do we also have the install docker role configure docker-compose (which is now really docker compose) to run podman for everything? I think we were leaning that way when we first discussed this19:27
clarkbthe upside to that is we get the speculative testing and don't have to be on docker hub19:28
clarkbtonyb: exactly19:28
corvusi think the last time someone talked about parsing output of "docker-compose" i suggested an alternative...  like maybe an "inspect" command we could use instead.19:28
clarkbcorvus: ++19:28
corvusit may actually be simpler to do it once with podman19:28
clarkbso to tl;dr all my typing above I think there are two improvements we should make to our docker installation role. A) set it up to alias docker-compose to docker compose somehow and B) configure docker compose to rely on podman as the runtime19:29
tonybwell one place that'd be hard is docker compose pull and looking for updates but yes generally avoiding parsing the output is good19:29
corvusjust because the difference between them is so small, but it's all at install time.  so switching from docker to podman later is more work than "docker compose" plugin with podman now.19:29
clarkbtonyb: ya I think thats the only place we do it. So we'd inspect, pull, inspect and see if they change or something19:29
clarkbcorvus: exactly19:29
corvustonyb: yeah, that's the thing i suggested an alternative to.  no one took me up on it at the time, but it's in scrollback somewhere.19:30
tonybokay 19:30
clarkbtonyb: I know you brought this up as a need for noble servers. I'm not sure if you are interested in making those changes to the docker installation role19:30
tonybyeah I can 19:30
corvusi think the only outstanding question for https://review.opendev.org/923084 is how to set up the podman socket path -- i think in a previous meeting we identified a potential easy way of doing that.19:31
tonybI don't know exactly what's needed for the podman enablement 19:31
clarkbI can probably help though I feel like this week is already swamped and next week is the holiday, but ping me for reviews or ideas etc. I ended up brainstorming this a bit the other day so enough is paged in I think I can be useful19:31
tonyband making it nice but I can take direction 19:31
clarkbtonyb: 923084 does it in a different context so we have to map that into system-config19:31
tonybnoted19:31
clarkband i guess address the question of the socket path19:31
tonybokay19:32
clarkbsounds like no one is terribly concerned about these hacks and we should be able to get away from them as soon as we're sufficiently migrated19:32
clarkbAnything else on these subjects?19:32
corvusoh yeah docker "contexts" is the thing19:32
corvusthat might make setting the path easy19:32
tonybokay cool19:33
corvus#link https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-01-19.00.log.html#l-91 docker contexts19:33
tonyband this is for noble+19:33
clarkbthanks!19:33
clarkbtonyb: yes19:33
corvusyep19:33
tonybperfect 19:33
corvusoh ha19:34
corvusclarkb: also you suggested we could set the env var in a "docker-compose" compat tool shim :)19:34
corvus(in that meeting)19:34
tonyb( for the record I'm about to get off the train which probably equates to offline)19:34
clarkbno wonder when I was thinking about tonyb's problem I was like this is the solution19:34
tonybyeah that could work I guess19:35
corvus¿por que no los dos?19:35
clarkbtonyb: ack don't hurt yourself trying to type and walk at the same time19:35
clarkb#topic Enabling mailman3 bounce processing19:35
clarkblet's keep this show moving forward19:35
clarkblast week lists.opendev.org and lists.zuul-ci.org lists were all set to (or already set to) enable bounce processing19:35
clarkbfungi: do you know if openstack-discuss got its config updated?19:35
clarkbthen separately I haven't received any notifications of members hitting the limits for any of the lists I moderate and can't find evidence of anyone with a score higher than 3 (5 is the threshold)19:36
clarkbso I'm curious if anyone has seen that in action yet19:36
fungiclarkb: i've not done that yet, no19:37
clarkbok it would probably be good to set as I suspect that's where we'll get the most timely feedback19:37
clarkbthen if we're still happy with the results enabling this by default on new lists is likely to require we define a custom mailman list style and create new lists with that style19:37
clarkbthe documentation is pretty sparse on how you're actually supposed to create a new style unfortunately19:38
fungii've just now switched "process bounces" to "yes" for openstack-discuss19:38
clarkb(there is a rest api endpoint for it but without info on how you set the millions of config options)19:38
clarkbfungi: thanks!19:38
clarkbfungi: you don't happen to know what would be required to set up a new list style do you?19:39
clarkb(we probably actually need two, one for private lists and one for public lists)19:39
fungino clue, short of the instructions for making a mailman plugin in python. might be worth one of us asking on the mailman-users ml19:39
corvusi think i have an instance of a user where bounce processing is not working19:39
clarkbcorvus: are they hitting the threshold then not getting removed?19:40
clarkbthe bounce disable warnings configuration item implies there is some delay that must be reached after being above the threshold before you are removed19:42
clarkbI wonder if the 7 day threshold reset is resetting them to 0 before hitting that if so19:42
corvushrm, i got a message about an unprocessed bounce a while back, but the user does not appear to be a member anymore.  so this may not be actionable.19:42
corvusnot sure what happened in the interim.19:42
clarkback, I guess we monitor for current behavior and debug from there19:42
clarkbfungi: ++ that seems like a good idea.19:42
clarkbI'm going to keep things moving as we have less than 20 minutes and still several topics to cover19:43
clarkb#topic Intermediate Insecure CI Registry Pruning19:43
corvus0x8e19:43
clarkbAs scheduled/announced we started this on Friday. We hit a few issues. The first was 404s on object delete requests19:43
clarkbthat was a simple fix as we can simply ignore 404 errors when trying to delete something. The other was that we weren't paginating object listings so were capped at 10k objects per listing request19:43
clarkbthis gave the pruning process an incomplete (and possibly inaccurate) picture of what should be deleted vs kept19:44
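(The pagination fix described above amounts to a marker-based listing loop; this is a minimal sketch where the function name and the shape of the listing command are hypothetical — swift caps each listing response, so the loop feeds the last name seen back in as the next request's marker:)

```shell
# paginate LIST_CMD...: call LIST_CMD with the current marker until it
# returns an empty page, emitting every object name exactly once.
paginate() {
  marker=""
  while page=$("$@" "$marker"); [ -n "$page" ]; do
    printf '%s\n' "$page"
    # last name on this page becomes the marker for the next request
    marker=$(printf '%s\n' "$page" | tail -n 1)
  done
}
```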
corvusthat means we've fixed two bugs that could have caused the previous issues!  :)19:44
clarkbthe process was restarted after fixing the problems and has been running since late friday. We anticipate it will take at least 6 days though I think it is trending slowly to be longer as it goes on19:44
clarkbas far as I can tell no one has had problems with the intermediate registry while this is running either which is a good sign we're not over deleting anything19:45
clarkb#link https://review.opendev.org/c/opendev/system-config/+/935542 Enable daily pruning after this bulk prune is complete19:45
corvus0x8e out of 0xff means we're 55% through the blobs.19:45
clarkbonce this is done we should be able to run regular pruning that doesn't take days to complete since we'll have far fewer objects to contend with19:45
clarkbthat change enables a cron job to do this. Which is probably good for hygiene purposes but shouldn't be merged until after we complete this manual run and are happy with it19:46
clarkbwe'll be much more disk efficient as a result \o/19:46
clarkbcorvus: napkin math has that taking closer to 8 days now?19:46
* clarkb looks at the clock and continues on19:47
clarkb#topic Gerrit 3.10 Upgrade Planning19:47
corvusare we starting from friday or saturday?19:47
clarkbcorvus: I think we started at roughly 00:00 UTC saturday19:48
clarkband we're almost at 20:00 UTC tuesday so like 7.75 days?19:48
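(The napkin math above, written out; the elapsed-time figure is an approximation taken from the timestamps in the discussion:)

```shell
# 0x8e (142) of 0xff (255) blob prefixes pruned after roughly 3.8 days
# of runtime (Sat 00:00 UTC to Tue ~19:45 UTC).
awk 'BEGIN {
  done = 142; total = 255; elapsed = 3.8
  printf "%d%% complete\n", int(done / total * 100)
  printf "~%.1f days total at this rate\n", elapsed * total / done
}'
```

(55% matches corvus's figure; the total estimate lands between the 6 and 8 day guesses depending on the exact elapsed time assumed, and on whether the remaining prefixes prune at the same rate.)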
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document19:48
clarkbI've announced the upgrade for December 6. I'm hoping that I can focus on getting the last bits of checks and testing done next week before the holiday so that the week after we aren't rushed19:49
clarkbIf you have time to read the etherpad and call out any additional concerns I would appreciate it. There is a held node too if you are interested in checking out the newer version of gerrit19:49
clarkbI'm not too terribly concerned about this though as there is a straightforward rollback path which i have also tested19:49
clarkb#topic mirror.sjc3.raxflex.opendev.org cinder volume issues19:50
clarkbYesterday after someone complained about this mirror not working I discovered it couldn't read sector 0 on its cinder volume backing the cache dirs19:50
clarkband then ~5 hours later it seems the server itself shutdown (maybe due to kernel panic after being in this state?)19:51
funginope, services got restarted19:51
clarkbin general that shouldn't restart VMs though? I guess maybe if you restart libvirt or something19:51
fungiwhich apparently resulted in a lot of server instances shutting off or rebooting19:51
fungiwell, that was the explanation we got anyway19:51
clarkbanyway klamath in #opendev reports that it should be happier now and that this wasn't intentional so we've stuck the mirror back into service19:51
clarkbthere are two things to consider though. One is that we are using a volume of type capacity instead of type standard and it is suggested we could change that19:52
clarkbthe other is if we rebuild our networks we will get bigger mtus19:52
fungifull 1500-byte mtus to the server instances, specifically19:52
clarkbto rebuild the networks I think the most straightforward option is to simply delete everything we've got and let cloud launcher recreate that stuff for us19:52
clarkbdoing that likely means deleting the mirror anyway based on mordred's report of not being able to change ports for standard port creation on instance create processes19:53
clarkbso long story short we should pick a time where we intentionally stop using this cloud, delete the networks and all servers, rerun cloud launcher to recreate networks, then rebuild our mirror using the new networks and a new cinder volume of type standard19:54
fungisounds good19:54
clarkbthat might be a good big exercise for the week after the gerrit upgrade19:54
clarkbin theory things will slow down as we near the end of the year making those changes less impactful19:54
clarkbwe should also check that the cloud launcher is running happily before we do that19:55
clarkb(to avoid delay in reconfiguring the networks)19:55
fungiyep19:55
clarkb#topic Open Discussion19:56
clarkbhttps://review.opendev.org/c/opendev/lodgeit/+/935712 someone noticed problems with captcha rendering in lodgeit today and has already pushed up a fix19:56
clarkbwe've also been updating our openafs packages and rolling those out with reboots to affected servers19:56
fungi#link https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs19:57
fungii'm still planning to do the openinfra.org mailing list migration on december 2, i have the start of a migration plan outline in an etherpad i'll share when i get it a little more fleshed out19:58
fungisending an announcement about it to the foundation ml later today19:58
clarkbfungi: is that something we can/should test in the system-config-run mm3 job?19:58
clarkbI assume we create a new domain then move the lists over then add the lists to our config?19:59
clarkbI'll wait for the etherpad no need to run through the whole process here19:59
fungino, database update query19:59
clarkboh fun19:59
fungimailman core, hyperkitty and postorius all use the django db19:59
fungiso can basically just change the domain/host references there and then update our ansible data to match so it doesn't recreate the old lists20:00
clarkband we are at time20:00
clarkbfungi: got it20:00
clarkbmakes sense20:00
clarkbthank you everyone! as mentioned we'll be back here next week per usual despite the holiday for several of us20:00
fungithanks clarkb!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov 19 20:00:33 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.log.html20:00
clarkband now time for lunch20:02

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!