Tuesday, 2024-06-04

clarkbmeeting time19:00
* tonyb waves19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Jun  4 19:00:39 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/JH2FNAUEA32AS4GH475AHYEPLP4FUGPE/ Our Agenda19:00
clarkb#topic Announcements19:01
clarkbI will not be able to run the meeting on June 1819:01
clarkbthat is two weeks from today. More than happy for someone else to take over or we can skip if people prefer that19:01
tonybI'll be back in Australia by then so I probably can't run it 19:02
clarkbI will be afk from the 14th-19th19:02
tonybokay.  I'll be travelling for part of that.19:03
clarkbwe can sort out the plan next week. Plenty of time19:03
tonybpoor fungi 19:03
tonybsounds good 19:03
clarkb#topic Upgrading old servers19:03
clarkb#link https://etherpad.opendev.org/p/opendev-mediawiki-upgrade19:03
clarkbI think this has been the recent focus of this effort19:03
tonybyup19:04
clarkblooks like there was a change just pushed too, I should #link that as well19:04
tonybthere are some things to read about the plan etc 19:04
clarkb#link https://review.opendev.org/c/opendev/system-config/+/921321 Mediawiki deployment for OpenDev19:04
tonybreviews very welcome. it works for local testing but needs more comprehensive testing19:05
clarkbtonyb: anything specific in the plan etherpad you want us to be looking at or careful about?19:05
tonybThere is a small announcement that could do with a review19:05
clarkbtonyb: ya I think if we can get something close enough to migrate data into we can live with small issues like theming not working and plan a cutover. I assume we'll want to shut down the old wiki so we don't have content divergence?19:05
tonybas I'd like to get that out reasonably soon19:06
clarkb#link https://etherpad.opendev.org/p/opendev-wiki-announce Announcement for wiki changes19:06
clarkbthat one I assume19:06
tonybYes we'll shut down the current server ASAP19:06
tonybThere is plenty of planning stuff like moving away from rax-trove etc but I'm pretty happy with the progress19:07
clarkbyup I think this is great. I'm planning to dig into it more today after meetings and lunch19:07
tonybthe bare bones upgrade of the host OS is IMO a solid improvement19:08
tonybI'll try a noble server this week19:08
tonybI think that's pretty much all there is to say for server upgrades19:08
clarkbthanks19:08
clarkb#topic AFS Mirror Cleanup19:09
clarkbNot much new here other than that devstack-gate has been fully retired19:09
clarkbI've been distracted by many other things like gerrit upgrades and gitea upgrades and cloud shutdowns which we'll get to shortly19:09
clarkb#topic Gerrit 3.9 Upgrade19:09
clarkbThis happened. It seemed to go well. People are even noticing some of the new features like suggested edits19:09
clarkbHas anyone else seen or heard of any issues since the upgrade?19:10
clarkbI guess there was the small gertty issue which is resolvable by starting with a new sqlite db19:10
tonybThat's all I'm aware of19:11
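
For reference, the gertty workaround mentioned above amounts to moving the old sqlite database aside so a fresh one is created and resynced on the next start; a minimal sketch, assuming the default database location (the path may differ if dburi is customized in ~/.gertty.yaml):

    # stop gertty first, then move the old database out of the way; gertty
    # will build and resync a new one on the next start
    mv ~/.gertty.db ~/.gertty.db.bak
    gertty
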
clarkb#link https://review.opendev.org/c/opendev/system-config/+/920938 is ready when we think we are unlikely to revert19:11
clarkbI'm going to go ahead and remove my WIP vote on that change since we aren't aware of any problems19:12
tonybI'm happy to go ahead whenever19:13
clarkbThat change has a few children which add the 3.10 image builds and testing. The testing change seems to error when trying to pull the 3.10 image, which I thought should be in the intermediate ci registry. But I wonder if it is because the tag doesn't exist at all19:13
clarkbcorvus: ^ just noting that in case you've seen it like does the tooling try to fetch from docker hub proper and then fail even if it could find stuff locally?19:13
clarkbin any case that isn't urgent but getting some early feedback on whether or not the next release works and is upgradeable is always good (particularly after the last upgrade was broken initially)19:14
clarkbthank you to everyone that helped with the upgrade. Despite my concerns about feeling underprepared, things went smoothly, which probably speaks to how prepared we actually were19:14
clarkb#topic Gitea 1.22 Upgrade19:15
clarkbI was hoping there would be a 1.22.1 by now but as far as I can tell there isn't. I'm also likely going to put this on the back burner for the immediate future as I've got several other more time sensitive things to worry about before I take that time off19:15
tonybThat's fair.19:16
clarkbThat said I think the next step is going to be getting the upgrade change into shape, so if people can review that it is still helpful19:16
clarkbthen we can upgrade, and after that run the doctor tooling one by one on our backend nodes19:17
tonybWhatever works, if we decide we need it then we're pretty prepared thanks to your work19:17
clarkbtesting of the doctor tool seems to indicate running it is straightforward. We should be able to do it with minimal impact by taking one node out of service, doctoring it, and repeating that in a loop19:17
clarkbtonyb: ya I think reviews are the most useful thing at this point for the gitea upgrade19:17
clarkbsince the initial pass is working as is the doctor tool in testing19:17
tonybCool19:18
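
The per-backend loop described above would look roughly like the sketch below. The backend names, the load balancer step, and the doctor invocation are all illustrative assumptions rather than the actual OpenDev procedure; check gitea doctor --help for the real subcommands and flags.

    # hypothetical backend names, purely for illustration
    for backend in gitea-backend-01 gitea-backend-02; do
        # 1. take this backend out of the load balancer rotation (site specific)
        # 2. run gitea's doctor checks on it inside its container
        ssh "$backend" "docker exec gitea gitea doctor check --all"
        # 3. put the backend back in rotation before moving on to the next one
    done
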
clarkb#topic Fixing Cloud Launcher Ansible19:19
clarkbfrickler fixed up the security groups in the osuosl cloud but then almost immediately we ran into the git perms trust issue that is a side effect of recent git packaging updates for security fixes19:19
clarkbThis appears to be our only infra-prod-* job affected by the git updates so overall pretty good19:19
clarkbfor fixing cloud launcher I went ahead and wrote a change that trusts the ansible role repos when we clone them19:20
clarkb#link https://review.opendev.org/c/opendev/system-config/+/921061 workaround the Git issue19:20
clarkbI did find that you can pass config options to git clone but they only apply after most of the cloning is complete19:20
clarkbso I'm like 99% sure that doing that won't work for this case as the perms check occurs before cloning begins19:20
tonybYeah it'd be nice if the options applied "earlier" but I think what you've done is good for now19:21
clarkbOne thing to keep in mind reviewing that change is I'm pretty sure we don't have any real test coverage for it19:21
clarkbso careful review is a good idea and tonyb already caught one big issue with it (now fixed)19:21
tonybThe general git safe.directory problem doesn't have a one-size-fits-all solution19:22
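
The workaround being discussed boils down to git's safe.directory setting, which marks a repository owned by another user as trusted; a minimal sketch (the paths and URL are examples only):

    # trust a specific cloned role repository; safe.directory is ignored in a
    # repository's own config, so it has to live in global or system config
    git config --global --add safe.directory /path/to/ansible-role-repo

    # the wildcard form trusts every repository, which is broader than needed
    git config --global --add safe.directory '*'

    # git clone does accept -c key=value options, but as noted above they only
    # take effect late in the clone, so they do not avoid the ownership check
    git clone -c safe.directory=/path/to/ansible-role-repo https://example.org/ansible-role.git
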
corvusclarkb: (sorry i'm late) i don't think lack of image in dockerhub should be a problem; that may point to a bug/flaw19:22
clarkbcorvus: ok, I feel like this has happened before and I've double checked the dependencies and provides/requires and I think everything is working in the order I would expect19:23
clarkband I just wondered if using a new tag is part of the issue since 3.10 doesn't exist anywhere yet but a :latest would be present typically19:23
clarkbI guess I'll look more closely19:24
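
One quick way to check whether a tag has ever been published to Docker Hub is to ask for its manifest; a minimal sketch (the image and tag names are assumptions for illustration):

    # errors out with something like "manifest unknown" if the tag does not exist
    docker manifest inspect opendevorg/gerrit:3.10

    # skopeo answers the same question without needing a local docker daemon
    skopeo inspect docker://docker.io/opendevorg/gerrit:3.10
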
clarkb#topic Increase Mailman 3 Out Runner Count19:24
corvusyeah; lemme know if you want me to dig into details with you19:24
clarkbcorvus: thanks19:24
clarkbI don't know if anyone else has noticed, but recently I had some confusion over emails being in the openstack-discuss list archive and not in my inbox, which made me think there was some problem19:24
clarkbthe issue was that delivery for that list takes time. Upwards of 10 minutes.19:24
clarkb#link https://review.opendev.org/c/opendev/system-config/+/920765 Increase the number of out runners to speed up outbound mail delivery19:25
clarkbIn that change I linked to a discussion about similar problems another mail list host had and ultimately the biggest impact for them was increasing the number of runner instances19:25
tonybMakes sense to me.19:25
clarkbI've gone ahead and proposed we do the same. This isn't super critical but I think it will help avoid confusion in the future and keep discussions moving without unnecessary delay19:26
tonybI'm cool with your fix but it'd be cool to have a way to verify it has the intended impact19:26
corvusare we still letting exim do the verp?19:26
tonybnot that I'd block the change until that exists19:26
clarkbcorvus: yes I believe exim is doing verp19:26
clarkbcorvus: and disabling verp is one of the suggestions in the thread I found about improving this performance19:27
clarkbhowever, it seems like we can try this first since it is less impactful to behavior (in theory anyway)19:27
corvusk.  exim doing verp make things pretty fast, since if we're delivering 10k messages, and 5k are for gmail and 4k are for redhat and 1k are for everyone else (made up numbers), that should boil down to just a handful of outgoing smtp transactions for mailman.19:28
corvuss/make things/should make things/19:28
clarkbcorvus: makes sense. And ya I suspect the bottleneck may be between mailman and exim based on that thread (but I haven't profiled it locally)19:29
corvusif, however, mailman is doing 5k smtp transactions to exim for gmail, then we've lost the advantage19:29
fungifrom what i gather, mailman 3 limits the number of addresses per message to the mta to a fairly small number in order to avoid some spam detection heuristics that may trigger when you have too many recipients19:29
corvus(it should do, ideally, 1, but there are recipient limits, so maybe 10 total, 500 recipients each)19:29
fungii don't recall what the default is for sure, but think it may be something like 10 addresses per delivery19:30
corvusokay, so there might be an opportunity to tune that, so that we can maximize the work exim does19:30
clarkbsounds good. Do we think that is something to do instead of increasing the out runner instances or something to try next after this update?19:30
fungibut yeah, the usual recommendation is to increase the number of threads mailman/django will use to submit messages to the mta so it doesn't just serialize and block19:30
corvusmy gut says try both, order doesn't matter19:31
clarkback I guess we proceed with this and try the other thing too19:31
corvus(non-exclusive)19:31
corvusyep19:31
corvus(sending multiple 500 recipient messages to exim in parallel is an ideal outcome)19:32
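
For context, both of the knobs discussed above live in Mailman 3's mailman.cfg. A hedged sketch, assuming the stock section and option names (verify against the deployed config and schema before copying; the runner count is expected to be a power of two):

    [runner.out]
    # run more outgoing runners in parallel so handing mail to the MTA is not
    # serialized behind a single process
    instances: 4

    [mta]
    # allow more recipients per message passed to exim so that exim's
    # VERP/batching stays effective
    max_recipients: 500
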
clarkb#topic OpenMetal Cloud Rebuild19:32
clarkbThe Inmotion/OpenMetal folks sent email recently calling out that they have updated their openstack cloud as a service tooling to deploy on new platforms and deploy a newer openstack version19:33
clarkbthey have volunteered to help out with the provisioning in the near future so I've been trying to prepare cleanup/shutdown of the existing cloud so that we can gracefully replace it19:33
clarkbThe hardware needs to be reused rather than setting up a new cloud adjacent to the old one which means shutting everything down first19:34
clarkb#link https://review.opendev.org/c/openstack/project-config/+/921072 Nodepool cleanups19:34
clarkb#link https://review.opendev.org/c/opendev/system-config/+/921075 System-config cleanups19:34
clarkbmy goal is to get these changes landed over the next day or so as I'm meeting with Yuriys at 1pm Pacific time tomorrow to discuss further actions19:34
clarkbI expect that nodepool will be in a dirty state (with stuck nodes and maybe stuck images) after the cleanup changes land. corvus  pointed out the erase command which I'll use if it comes to that19:35
clarkbAlso if anyone else is interested in joining tomorrow just let me know. I can share the conference link19:35
tonybHappy to help with all/any of that19:35
clarkbtonyb: will do. I think the immediate next step is to land the change which should clean up images in nodepool and see where that gets us19:35
clarkboh and the system-config change should be landable too19:35
corvusping me if there are nodepool cleanup probs19:36
clarkbcorvus: will do19:36
clarkbthen maybe tomorrow we land the final cleanup and do the erase step19:36
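
For reference, the erase step mentioned above is nodepool's built-in cleanup for a whole provider; a minimal sketch (the provider name is an example, and the exact flags/confirmation may differ, see nodepool erase --help):

    # remove nodepool's ZooKeeper records (nodes and images) for a provider
    # that is being retired; this only clears nodepool's bookkeeping, it does
    # not delete anything in the cloud itself
    nodepool erase inmotion
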
clarkbI'm hopeful that by the end of the week or early next week we'll be able to spin up a new mirror and add the new cloud to nodepool under its new openmetal name19:37
tonybThat seems doable :)19:37
corvus++19:38
clarkb#topic Testing Rackspace's New Cloud Product19:38
clarkbYesterday a ticket was opened in our rax nodepool account to let us know that their new openstack flex product is in a beta/testing/trial period and we could opt into testing it. They seem specifically interested in any feedback we may have19:39
clarkbI think this is a good idea, but the info is a bit scarce so not sure if it is a good fit for us yet. I'm also pretty swamped with the other stuff going on so not sure I'll have time to run this down before I take that time off19:39
corvusheh my first question is "are you sure?" :)19:39
corvusyou=rax19:39
clarkbWanted to call it out if anyone else wants to look into this more closely. The product is called openstack flex and it might be worth pinging cloudnull to get details or a referral to someone else with that info19:40
clarkbcorvus: ya I think the upside is we really do batter a cloud pretty good so if they take that data and feedback and improve with it then we're helping19:40
clarkbat the same time we might make their lives miserable :)19:40
tonybI can take a stab at that if you'd like19:40
clarkbtonyb: sure, I think the main thing is starting up some communication to see if this fits our use case and then go for it I guess19:41
clarkb#topic Open Discussion19:42
clarkbdansmith discovered today that centos 8 stream seems to have been deleted upstream19:42
clarkbour mirrors have faithfully synced this state and now centos 8 stream jobs are breaking19:42
tonybOh nice19:43
clarkbI think that means we can probably put removing centos 8 stream nodes on the todo list for nodepool cleanups19:43
clarkband we can also remove the centos afs mirror as everything should live in centos-stream now19:43
fungithe new rax service might be interesting if it performs better, is kvm-based, has stable support for nested kvm acceleration, etc19:43
clarkbbut all the content has been deleted already so that is mostly just bookkeeping19:43
clarkbfungi: ++19:43
fungiwe do get a lot of users complaining about performance or lack of cpu features in our current rax nodes19:44
tonybI can find out more19:45
tonybI assume I need to do that via the Account/webUI?19:45
tonybI can also ping cloudnull for an informal chat19:46
clarkbtonyb: that would be one approach though it might get us to the first level of support first19:46
clarkban informal chat might be better if cloudnull has time to at least point us at someone else19:46
tonybOkay19:48
clarkbsounds like that might be everything19:50
clarkbthank you for your time19:50
clarkb#endmeeting19:50
opendevmeetMeeting ended Tue Jun  4 19:50:28 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:50
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-04-19.00.html19:50
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-04-19.00.txt19:50
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-04-19.00.log.html19:50
