Tuesday, 2024-06-04

clarkbmeeting time19:00
* tonyb waves19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Jun  4 19:00:39 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/JH2FNAUEA32AS4GH475AHYEPLP4FUGPE/ Our Agenda19:00
clarkb#topic Announcements19:01
clarkbI will not be able to run the meeting on June 1819:01
clarkbthat is two weeks from today. More than happy for someone else to take over or we can skip if people prefer that19:01
tonybI'll be back in Australia by then so I probably can't run it 19:02
clarkbI will be afk from the 14th-19th19:02
tonybokay.  I'll be travelling for part of that.19:03
clarkbwe can sort out the plan next week. Plenty of time19:03
tonybpoor fungi 19:03
tonybsounds good 19:03
clarkb#topic Upgrading old servers19:03
clarkb#link https://etherpad.opendev.org/p/opendev-mediawiki-upgrade19:03
clarkbI think this has been the recent focus of this effort19:03
tonybyup19:04
clarkblooks like there was a change just pushed too, I should #link that as well19:04
tonybthere are some things to read about the plan etc 19:04
clarkb#link https://review.opendev.org/c/opendev/system-config/+/921321 Mediawiki deployment for OpenDev19:04
tonybreviews very welcome. it works for local testing but needs more comprehensive testing19:05
clarkbtonyb: anything specific in the plan etherpad you want us to be looking at or careful about?19:05
tonybThere is a small announcement that could do with a review19:05
clarkbtonyb: ya I think if we can get something close enough to migrate data into we can live with small issues like theming not working and plan a cutover. I assume we'll want to shut down the old wiki so we don't have content divergence?19:05
tonybas I'd like to get that out reasonably soon19:06
clarkb#link https://etherpad.opendev.org/p/opendev-wiki-announce Announcement for wiki changes19:06
clarkbthat one I assume19:06
tonybYes we'll shut down the current server ASAP19:06
tonybThere is plenty of planning stuff like moving away from rax-trove etc but I'm pretty happy with the progress19:07
clarkbyup I think this is great. I'm planning to dig into it more today after meetings and lunch19:07
tonybthe bare bones upgrade of the host OS is IMO a solid improvement19:08
tonybI'll try a noble server this week19:08
tonybI think that's pretty much all there is to say for server upgrades19:08
clarkbthanks19:08
clarkb#topic AFS Mirror Cleanup19:09
clarkbNot much new here other than that devstack-gate has been fully retired19:09
clarkbI've been distracted by many other things like gerrit upgrades and gitea upgrades and cloud shutdowns which we'll get to shortly19:09
clarkb#topic Gerrit 3.9 Upgrade19:09
clarkbThis happened. It seemed to go well. People are even noticing some of the new features like suggested edits19:09
clarkbHas anyone else seen or heard of any issues since the upgrade?19:10
clarkbI guess there was the small gertty issue which is resolvable by starting with a new sqlite db19:10
tonybThat's all I'm aware of19:11
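
For reference, the gertty workaround mentioned above amounts to moving the old sqlite database aside so a fresh one is created and resynced on the next start; a minimal sketch, assuming the default database location (the path may differ if dburi is customized in ~/.gertty.yaml):

    # stop gertty first, then move the old database out of the way; gertty
    # will build and resync a new one on the next start
    mv ~/.gertty.db ~/.gertty.db.bak
    gertty
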
clarkb#link https://review.opendev.org/c/opendev/system-config/+/920938 is ready when we think we are unlikely to revert19:11
clarkbI'm going to go ahead and remove my WIP vote on that change since we aren't aware of any problems19:12
tonybI'm happy to go ahead whenever19:13
clarkbThat change has a few children which add the 3.10 image builds and testing. The testing change seems to error when trying to pull the 3.10 image, which I thought should be in the intermediate ci registry. But I wonder if it is because the tag doesn't exist at all19:13
clarkbcorvus: ^ just noting that in case you've seen it like does the tooling try to fetch from docker hub proper and then fail even if it could find stuff locally?19:13
clarkbin any case that isn't urgent but getting some early feedback on whether or not the next release works and is upgradeable is always good (particularly after the last upgrade was broken initially)19:14
clarkbthank you to everyone that helped with the upgrade. Despite my concerns about feeling underprepared, things went smoothly, which probably speaks to how prepared we actually were19:14
clarkb#topic Gitea 1.22 Upgrade19:15
clarkbI was hoping there would be a 1.22.1 by now but as far as I can tell there isn't. I'm also likely going to put this on the back burner for the immediate future as I've got several other more time sensitive things to worry about before I take that time off19:15
tonybThat's fair.19:16
clarkbThat said I think the next step is going to be getting the upgrade change into shape, so if people can review that it is still helpful19:16
clarkbthen we can upgrade, and after that run the doctor tooling one by one on our backend nodes19:17
tonybWhatever works, if we decide we need it then we're pretty prepared thanks to your work19:17
clarkbtesting of the doctor tool seems to indicate running it is straightforward. We should be able to do it with minimal impact by taking one node out of service, doctoring it, and repeating that in a loop19:17
clarkbtonyb: ya I think reviews are the most useful thing at this point for the gitea upgrade19:17
clarkbsince the initial pass is working as is the doctor tool in testing19:17
tonybCool19:18
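
The per-backend loop described above would look roughly like the sketch below. The backend names, the load balancer step, and the doctor invocation are all illustrative assumptions rather than the actual OpenDev procedure; check gitea doctor --help for the real subcommands and flags.

    # hypothetical backend names, purely for illustration
    for backend in gitea-backend-01 gitea-backend-02; do
        # 1. take this backend out of the load balancer rotation (site specific)
        # 2. run gitea's doctor checks on it inside its container
        ssh "$backend" "docker exec gitea gitea doctor check --all"
        # 3. put the backend back in rotation before moving on to the next one
    done
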
clarkb#topic Fixing Cloud Launcher Ansible19:19
clarkbfrickler fixed up the security groups in the osuosl cloud but then almost immediately we ran into the git perms trust issue that is a side effect of recent git packaging updates for security fixes19:19
clarkbThis appears to be our only infra-prod-* job affected by the git updates so overall pretty good19:19
clarkbfor fixing cloud launcher I went ahead and wrote a change that trusts the ansible role repos when we clone them19:20
clarkb#link https://review.opendev.org/c/opendev/system-config/+/921061 workaround the Git issue19:20
clarkbI did find that you can pass config options to git clone but they only apply after most of the cloning is complete19:20
clarkbso I'm like 99% sure that doing that won't work for this case as the perms check occurs before cloning begins19:20
tonybYeah it'd be nice if the options applied "earlier" but I think what you've done is good for now19:21
clarkbOne thing to keep in mind reviewing that change is I'm pretty sure we don't have any real test coverage for it19:21
clarkbso careful review is a good idea and tonyb already caught one big issue with it (now fixed)19:21
tonybThe general git safe.directory problem doesn't have a one-size-fits-all solution19:22
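
The workaround being discussed boils down to git's safe.directory setting, which marks a repository owned by another user as trusted; a minimal sketch (the paths and URL are examples only):

    # trust a specific cloned role repository; safe.directory is ignored in a
    # repository's own config, so it has to live in global or system config
    git config --global --add safe.directory /path/to/ansible-role-repo

    # the wildcard form trusts every repository, which is broader than needed
    git config --global --add safe.directory '*'

    # git clone does accept -c key=value options, but as noted above they only
    # take effect late in the clone, so they do not avoid the ownership check
    git clone -c safe.directory=/path/to/ansible-role-repo https://example.org/ansible-role.git
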
corvusclarkb: (sorry i'm late) i don't think lack of image in dockerhub should be a problem; that may point to a bug/flaw19:22
clarkbcorvus: ok, I feel like this has happened before and I've double checked the dependencies and provides/requires and I think everything is working in the order I would expect19:23
clarkband I just wondered if using a new tag is part of the issue since 3.10 doesn't exist anywhere yet but a :latest would be present typically19:23
clarkbI guess I'll look more closely19:24
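
One quick way to check whether a tag has ever been published to Docker Hub is to ask for its manifest; a minimal sketch (the image and tag names are assumptions for illustration):

    # errors out with something like "manifest unknown" if the tag does not exist
    docker manifest inspect opendevorg/gerrit:3.10

    # skopeo answers the same question without needing a local docker daemon
    skopeo inspect docker://docker.io/opendevorg/gerrit:3.10
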
clarkb#topic Increase Mailman 3 Out Runner Count19:24
corvusyeah; lemme know if you want me to dig into details with you19:24
clarkbcorvus: thanks19:24
clarkbI don't know if anyone else has noticed, but recently I had some confusion over emails being in the openstack-discuss list archive and not in my inbox, which made me think there was some problem19:24
clarkbthe issue was that delivery for that list takes time. Upwards of 10 minutes.19:24
clarkb#link https://review.opendev.org/c/opendev/system-config/+/920765 Increase the number of out runners to speed up outbound mail delivery19:25
clarkbIn that change I linked to a discussion about similar problems another mail list host had and ultimately the biggest impact for them was increasing the number of runner instances19:25
tonybMakes sense to me.19:25
clarkbI've gone ahead and proposed we do the same. This isn't super critical but I think it will help avoid confusion in the future and keep discussions moving without unnecessary delay19:26
tonybI'm cool with your fix but it'd be cool to have a way to verify it has the intended impact19:26
corvusare we still letting exim do the verp?19:26
tonybnot that I'd block the change until that exists19:26
clarkbcorvus: yes I believe exim is doing verp19:26
clarkbcorvus: and disabling verp is one of the suggestions in the thread I found about improving this performance19:27
clarkbhowever, it seems like we can try this first since it is less impactful to behavior (in theory anyway)19:27
corvusk.  exim doing verp make things pretty fast, since if we're delivering 10k messages, and 5k are for gmail and 4k are for redhat and 1k are for everyone else (made up numbers), that should boil down to just a handful of outgoing smtp transactions for mailman.19:28
corvuss/make things/should make things/19:28
clarkbcorvus: makes sense. And ya I suspect the bottleneck may be between mailman and exim based on that thread (but I haven't profiled it locally)19:29
corvusif, however, mailman is doing 5k smtp transactions to exim for gmail, then we've lost the advantage19:29
fungifrom what i gather, mailman 3 limits the number of addresses per message to the mta to a fairly small number in order to avoid some spam detection heuristics that may trigger when you have too many recipients19:29
corvus(it should do, ideally, 1, but there are recipient limits, so maybe 10 total, 500 recipients each)19:29
fungii don't recall what the default is for sure, but think it may be something like 10 addresses per delivery19:30
corvusokay, so there might be an opportunity to tune that, so that we can maximize the work exim does19:30
clarkbsounds good. Do we think that is something to do instead of increasing the out runner instances or something to try next after this update?19:30
fungibut yeah, the usual recommendation is to increase the number of threads mailman/django will use to submit messages to the mta so it doesn't just serialize and block19:30
corvusmy gut says try both, order doesn't matter19:31
clarkback I guess we proceed with this and try the other thing too19:31
corvus(non-exclusive)19:31
corvusyep19:31
corvus(sending multiple 500 recipient messages to exim in parallel is an ideal outcome)19:32
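
For context, both of the knobs discussed above live in Mailman 3's mailman.cfg. A hedged sketch, assuming the stock section and option names (verify against the deployed config and schema before copying; the runner count is expected to be a power of two):

    [runner.out]
    # run more outgoing runners in parallel so handing mail to the MTA is not
    # serialized behind a single process
    instances: 4

    [mta]
    # allow more recipients per message passed to exim so that exim's
    # VERP/batching stays effective
    max_recipients: 500
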
clarkb#topic OpenMetal Cloud Rebuild19:32
clarkbThe Inmotion/OpenMetal folks sent email recently calling out that they have updated their openstack cloud as a service tooling to deploy on new platforms and deploy a newer openstack version19:33
clarkbthey have volunteered to help out with the provisioning in the near future so I've been trying to prepare cleanup/shutdown of the existing cloud so that we can gracefully replace it19:33
clarkbThe hardware needs to be reused rather than setting up a new cloud adjacent to the old one which means shutting everything down first19:34
clarkb#link https://review.opendev.org/c/openstack/project-config/+/921072 Nodepool cleanups19:34
clarkb#link https://review.opendev.org/c/opendev/system-config/+/921075 System-config cleanups19:34
clarkbmy goal is to get these changes landed over the next day or so as I'm meeting with Yuriys at 1pm Pacific time tomorrow to discuss further actions19:34
clarkbI expect that nodepool will be in a dirty state (with stuck nodes and maybe stuck images) after the cleanup changes land. corvus  pointed out the erase command which I'll use if it comes to that19:35
clarkbAlso if anyone else is interested in joining tomorrow just let me know. I can share the conference link19:35
tonybHappy to help with all/any of that19:35
clarkbtonyb: will do. I think the immediate next step is to land the change which should clean up images in nodepool and see where that gets us19:35
clarkboh and the system-config change should be landable too19:35
corvusping me if there are nodepool cleanup probs19:36
clarkbcorvus: will do19:36
clarkbthen maybe tomorrow we land the final cleanup and do the erase step19:36
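
For reference, the erase step mentioned above is nodepool's built-in cleanup for a whole provider; a minimal sketch (the provider name is an example, and the exact flags/confirmation may differ, see nodepool erase --help):

    # remove nodepool's ZooKeeper records (nodes and images) for a provider
    # that is being retired; this only clears nodepool's bookkeeping, it does
    # not delete anything in the cloud itself
    nodepool erase inmotion
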
clarkbI'm hopeful that by the end of the week or early next week we'll be able to spin up a new mirror and add the new cloud to nodepool under its new openmetal name19:37
tonybThat seems doable :)19:37
corvus++19:38
clarkb#topic Testing Rackspace's New Cloud Product19:38
clarkbYesterday a ticket was opened in our rax nodepool account to let us know that their new openstack flex product is in a beta/testing/trial period and we could opt into testing it. They seem specifically interested in any feedback we may have19:39
clarkbI think this is a good idea, but the info is a bit scarce so not sure if it is a good fit for us yet. I'm also pretty swamped with the other stuff going on so not sure I'll have time to run this down before I take that time off19:39
corvusheh my first question is "are you sure?" :)19:39
corvusyou=rax19:39
clarkbWanted to call it out if anyone else wants to look into this more closely. The product is called openstack flex and it might be worth pinging cloudnull to get details or a referral to someone else with that info19:40
clarkbcorvus: ya I think the upside is we really do batter a cloud pretty good so if they take that data and feedback and improve with it then we're helping19:40
clarkbat the same time we might make their lives miserable :)19:40
tonybI can take a stab at that if you'd like19:40
clarkbtonyb: sure, I think the main thing is starting up some communication to see if this fits our use case and then go for it I guess19:41
clarkb#topic Open Discussion19:42
clarkbdansmith discovered today that centos 8 stream seems to have been deleted upstream19:42
clarkbour mirrors have faithfully synced this state and now centos 8 stream jobs are breaking19:42
tonybOh nice19:43
clarkbI think that means we can probably put removing centos 8 stream nodes on the todo list for nodepool cleanups19:43
clarkband we can also remove the centos afs mirror as everything should live in centos-stream now19:43
fungithe new rax service might be interesting if it performs better, is kvm-based, has stable support for nested kvm acceleration, etc19:43
clarkbbut all the content has been deleted already so that is mostly just bookkeeping19:43
clarkbfungi: ++19:43
fungiwe do get a lot of users complaining about performance or lack of cpu features in our current rax nodes19:44
tonybI can find out more19:45
tonybI assume I need to do that via the Account/webUI?19:45
tonybI can also ping cloudnull for an informal chat19:46
clarkbtonyb: that would be one approach though it might get us to the first level of support first19:46
clarkban informal chat might be better if cloudnull has time to at least point us at someone else19:46
tonybOkay19:48
clarkbsounds like that might be everything19:50
clarkbthank you for your time19:50
clarkb#endmeeting19:50
opendevmeetMeeting ended Tue Jun  4 19:50:28 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:50
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-04-19.00.html19:50
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-04-19.00.txt19:50
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-04-19.00.log.html19:50
