Tuesday, 2024-06-25

*** diablo_rojo_phone is now known as Guest1073411:21
fricklerfyi I'm going to be like 15min late after the long TC meeting18:49
clarkbfrickler: noted18:50
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Jun 25 19:00:27 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IWHUW7OFDHZSL7MGPOK53LJ3ZR43OSOF/ Our Agenda19:00
clarkb#topic Announcements19:01
clarkbIt's been a couple of weeks since the last meeting but other than that I'm not aware of any important announcements19:01
clarkbmight be worth noting that a week from Thursday is a major US holiday and I expect those of us in the US will be busy enjoying the day off19:01
clarkbanything else to announce?19:02
clarkbseems like that's it19:04
clarkb#topic Upgrading Old Servers19:04
clarkbtonyb continues to poke at configuration management and related infrastructure for a rebuilt wiki server19:05
tonybYup it's getting closer19:05
clarkbThe last big question that came up was where logs should live for the apache embedded in the container images that is responsible for running php stuff. I suggested that we could hook that into our regular system of container logs going to syslog and then to log files on disk19:05
clarkbI like that approach as it keeps the logs distinct from the host side ssl terminating apache that we use on most of our systems and are likely to use here as well19:06
clarkbbut there are many ways to do that so I guess let tonyb know if you have different ideas19:06
tonybI still need to configure elastic search but I can build a server in CI that "works" and seems to do the important stuff19:07
clarkbthat is great progress19:07
fungiwas the conclusion to directly expose the container apache or reverse-proxy to it from an outer apache?19:07
tonybAt some stage soon I want to try importing the images from the existing server and pointing a held node at the trove DB19:07
tonybfungi: I'm still using a reverse proxy19:08
clarkbtonyb: if you do that you'll probably need to make a copy of the db? since I'm not sure it's safe to have two different installs talking to the same db?19:08
clarkbbut that sounds like a great step to include as part of the how do we migrate story19:09
tonybI'll double check on that ... fortunately there is a mariadb server just waiting for data :)19:09
fungiyeah, weird as it sounds, proxying from an apache to another apache makes the most sense for consistency with our other systems19:10
fungijust wanted to double-check19:10
tonybfungi: when I get a more complete install I'll need you to poke at the antispam stuff to see if it's working as expected19:10
fungiyep19:11
tonybIt's installed but that's about as far as I have gotten19:11
tonybIt's funny to connect to http://new-server/ and watch all the 30X replies to get to a 200 ;P19:12
clarkbredirect all the things19:12
clarkbtonyb: you were also poking at booting noble nodes but ran into trouble with the afs packaging situation on noble (due to our use of a ppa that doesn't have packages yet)19:12
clarkbtonyb: has there been any progress on that? That's something I wanted to catch up on after my week off and haven't managed to19:13
tonybI have patches up that I think passed CI yesterday the topic is noble-mirror or something like that19:13
tonybI poked the Ubuntu stable team to get the fixed packages moved into -updates19:14
clarkb#link https://review.opendev.org/q/topic:noble-mirror Deploying noble based mirror servers19:14
clarkbanything else for server upgrades?19:15
tonybI think that's it for now19:15
clarkb#topic AFS Mirror Cleanups19:15
clarkbThis is something I've started pushing on again recently.19:15
clarkbtopic:drop-ubuntu-xenial has at least one new change. This change drops xenial from system-config base role testing. I noted in the commit message for that change that this might be one of the more "dangerous" changes in the xenial cleanup for us. The reason for that is we require the infra-prod-service-base job to succeed before running most other jobs19:16
clarkbso if we start failing against xenial we could prevent other jobs from running even for stuff not on xenial. That said I think the risk is still relatively low as the base role stuff doesn't change often19:17
clarkbopen to feedback if you've got it (probably best to keep that in the review though)19:17
clarkbI've also been trying to catch up with centos 8 stream cleanup now that none of those test nodes are really functional19:18
clarkbtopic:drop-centos-8-stream has a whole bunch of fun changes to make that cleanup happen. The vast majority should be safe. There is an openstack-zuul-jobs change that is -1'd by zuul because projects have c8s jobs for fips19:18
clarkbI've been trying to get the word out and brought this up with the openstack qa team and tc. Sounds like the qa team may send email about it19:19
clarkbat some point I expect we may end up force merging that change though and forcing projects to address the problems if they haven't already19:19
clarkbI think the risk is really low though considering that all the jobs are failing on that platform since no packages are available.19:21
clarkb#topic Gitea 1.22 Upgrade19:21
clarkbThis is the other item that is/was high on my todo list. Unfortunately there still isn't a 1.22.1 release yet. I don't think many if any of the issues they have with 1.22.0 will affect us but there were a lot of issues and some of them seemed important (for general use)19:21
clarkbso I'm feeling more confident waiting for a 1.22.1 release before we upgrade19:21
clarkbonce we have upgraded we can proceed with fixing up the database encodings and all that19:22
clarkbanyway no new info here other than 1.22.1 is still absent19:22
clarkb#topic Improving Mailman Mail Throughput19:22
clarkbas discussed last time we likely need to increase both mailman queue batch sizes and exim verp batch sizes19:23
clarkbI don't think a change has been pushed for that yet unless I missed it in today's early morning scrollback but fungi was looking into it19:23
clarkbfungi: is there anything else to add?19:23
tonyb#link https://review.opendev.org/c/opendev/system-config/+/92270319:24
fungioh, i pushed one, sorry19:24
tonybI think that's what we're talking about right?19:24
clarkboh hey I did miss it in scrollback thanks19:25
fungithanks tonyb19:25
clarkbyup19:25
clarkbI'll give that a review after the meeting19:25
clarkb#topic OpenMetal Cloud Rebuild19:26
clarkbSo this was all going great until it wasn't :)19:26
clarkbWe had the cloud fully running in nodepool with max-servers 50 set and jobs were running etc. But then there were failures booting nodes and frickler's debugging indicated we ran out of disk?19:26
clarkbsounds like maybe ceph isn't backing all of the disk related stuff that we expect it to be backing and that led to us filling the smaller portion of disk allocated to not ceph19:27
clarkbfrickler: is that a reasonable summary?19:27
fricklerI sent a mail about that to the openmetal thread19:27
fricklerthe issue is that glance and nova cannot share the ceph pool properly in the current configuration19:28
clarkbya you also suggested some kolla setting changes that could be applied to fix it. I think it is a good idea to work through the openmetal folks on this as I suspect it may affect their product as a whole and this is something we want to ensure is working for us and everyone else19:28
fricklerthus nova needs to download each image and upload it to ceph again19:28
clarkbaha19:28
clarkbthat would be inefficient19:28
fricklerwhich fills up the root partition which is only like 200G and our images are large19:28
fricklerit was actually openmetal who discovered the disk full condition, too19:29
fricklermaybe you can poke yuri again since there was no response yet to my mail afaict19:29
clarkbfrickler: I haven't seen a response to your email yet, any chance they responded to you directly instead of cc'ing the rest of us? or should I try to follow up with them in a bit?19:29
clarkback can do19:29
clarkbthere was also the question of the number of ceph placement groups per osd being very low. The docs suggest 100 per osd but we're at like 4? I can mention that too19:30
frickleronce that is fixed we can do another production attempt and then possibly decide about the ceph tuning19:30
clarkbthough the docs are also a bit hand wavey around how much it actually matters19:30
clarkb#link https://docs.ceph.com/en/latest/dev/placement-group/19:31
clarkbsounds good thank you for helping with the debugging on that19:31
fricklerwell it will worsen the performance a bit, but difficult to tell how much19:31
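Note: the placement group arithmetic discussed above can be sketched roughly as follows. This is an illustration only, not the cluster's real numbers: the OSD count, pool replica size, and pg_num values are hypothetical, the helper names are invented for this note, and the target of 100 placement groups per OSD is simply the figure clarkb quotes from the docs. Rounding up to a power of two is an assumption based on common Ceph sizing guidance.

```python
import math


def pg_replicas_per_osd(pools, num_osds):
    """Average number of placement group replicas each OSD carries.

    `pools` is a list of (pg_num, replica_size) tuples, one per pool.
    """
    return sum(pg_num * size for pg_num, size in pools) / num_osds


def suggested_pg_num(num_osds, replica_size, target_per_osd=100):
    """Rough pg_num for a single pool holding most of the data.

    Uses (OSDs * target) / replica size, rounded up to a power of two.
    """
    raw = math.ceil(num_osds * target_per_osd / replica_size)
    return 1 << (raw - 1).bit_length()


# Hypothetical cluster: 12 OSDs, one 3x replicated pool with pg_num=16.
print(pg_replicas_per_osd([(16, 3)], num_osds=12))  # 4.0
print(suggested_pg_num(num_osds=12, replica_size=3))  # 512
```

With these made-up inputs the cluster sits at about 4 placement group replicas per OSD, far below the quoted target. How much that matters in practice is the open question raised in the discussion.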
clarkbanything else openmetal related?19:32
fricklerI think that should be all for now19:33
frickleroh19:33
fricklerask about the monitoring thing again, I think that also went unnoticed in your mail or unresponded at least19:33
clarkbcan do /me scribbles some notes on the todo list19:34
clarkb#topic Testing Rackspace's New Cloud Offering19:34
clarkbrax recently reached out to fungi and myself about this. Still not a ton of info, but I'm working to try and schedule a short meeting to allow us to discuss it more synchronously and determine the next step here19:35
clarkbI proposed July 8 as a possible day as it avoids the holiday next week and isn't this week, but have no idea what their schedule is like and haven't heard back yet.19:35
fricklerdo you have a mail about this that you could share?19:36
clarkbBut hopefully we'll be able to sync up and learn more and do something productive around this. I think helping them burn in the new product if we can is a great idea19:36
clarkbfrickler: just the ticket they filed against the nodepool account19:36
fungii think the details were the same as what they sent opendev, yeah19:37
fricklerah, that sounded like you were contacted directly, ok19:37
clarkbya it's basically just "we have a new product in test/beta/limited availability would you like to be an early user"19:38
fungithe separate outreach was technically to the foundation staff about using it, but the actual foundation wouldn't really have much use for it beyond supplying resources for opendev to use, i think19:38
fricklerack19:38
fungirackspace contacted business development folks on the foundation staff, who basically forwarded it to me and clarkb19:38
fungiand in the end it was a matter of "yeah they already reached out to opendev about this same thing"19:39
clarkband without any more info than we got in the ticket19:39
clarkbonce I hear back anything more actionable I can share that info19:40
fricklerso more business than technical, I'm fine to be left out of that ;)19:40
clarkb#topic Open Discussion19:40
clarkbI wanted to mention that dib was also hit by the c8s stuff but is getting sorted out19:41
clarkbmostly an fyi, I don't think we need help with it and jobs should be passing again as of today19:41
corvussome of the recent zuul performance improvements have resulted in a significant (40%+) reduction in peak zk data size in opendev:19:42
corvushttps://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-30d&to=now&viewPanel=3819:42
clarkbstarlingx also ran into pip 24.1 problems (there is a bunch of discussion of that in #opendev from today) related to pip failing to handle metadata on some packages19:42
fungioh wow!19:42
fungithat's a nice performance gain19:42
tonybcorvus: awesome!19:42
corvusfungi: yeah, i was not expecting it to be so big with opendev's specific characteristics, so was a pleasant surprise :)19:43
corvusclarkb: is an underlying cause the age of the package in question?  i'm wondering if there may be more time-bombs like that...19:43
fungiwell, there are a number of backward-incompatible changes in pip 24.119:44
clarkbcorvus: sort of. The package is older and newer versions do fix it19:45
clarkbcorvus: but I think you could hit the same issue with modern packages too19:46
corvusack19:46
fungiit dropped support for python 3.7, refuses to install anything with a non-pep440-compliant version string, and also vendors newer libs for things like metadata processing which may have gotten more strict than they used to be19:46
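Note: a quick way to spot version strings that are likely to trip the PEP 440 check mentioned above is sketched below. This is not how pip does it (pip relies on its vendored `packaging` library). The sketch uses the canonical-form regular expression from PEP 440, which is stricter than what pip accepts: spellings that merely need normalization, or that carry a local version suffix, fail here but may still install. The function name and sample strings are made up for illustration.

```python
import re

# Canonical-form pattern from PEP 440 (no local version segment).
_CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?"
    r"(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical_pep440(version):
    """Return True if `version` is already in canonical PEP 440 form."""
    return _CANONICAL.match(version) is not None


for candidate in ["1.0", "2.0.0rc1", "1.0.dev456", "1.0-stable", "not-a-version"]:
    print(candidate, is_canonical_pep440(candidate))
```

Strings that fail this check are only candidates for closer inspection. Running them through `packaging.version.Version` would give the authoritative answer.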
clarkbI may end up being away from irc/matrix at some point today and/or tomorrow. I really need to finally get around to RMAing this laptop and before I do I want to retest with ubuntu noble and wayland vs x11 etc to ensure this isn't just a software problem. I think I'm going to start diving into that today if I can and then sit on the phone tomorrow with lenovo if still necessary19:49
clarkbI got a talk accepted to the openinfra summit event in korea and need a working laptop before that event19:49
clarkbthe annoying thing is it almost mostly works if I disable modesetting in the kernel but if I do that everything has to run at 1920x1080 (including external displays) and I can't control display brightness19:50
tonybclarkb: congrats and noted19:50
clarkbLast call otherwise I think we can have 10 minutes back for $meal/sleep/etc19:51
tonyb.zZ19:51
clarkbthanks everyone!19:52
clarkb#endmeeting19:52
opendevmeetMeeting ended Tue Jun 25 19:52:11 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:52
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-25-19.00.html19:52
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-25-19.00.txt19:52
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-06-25-19.00.log.html19:52
* corvus falls asleep into bowl of rice19:52
