Tuesday, 2025-03-11

clarkbwe will start our weekly meeting in just a minute or two18:59
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Mar 11 19:00:18 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/X75EQWC4QAGVRJKPB3HTTKHKONAM26JJ/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI don't have anything to announce. Did anyone else have something?19:00
* tonyb has nothing19:02
clarkb#topic Zuul-launcher image builds19:03
clarkbI think the big update here is that we did do the raxflex surgery to switch tenants and update networks (and their MTUs)19:03
clarkbthat happened for both zuul-launcher and nodepool and as far as I know is working19:03
corvusyay!19:04
corvusi know at least some nodes were provided by zuul-launcher yesterday19:04
corvusi need to check on a node failure though19:04
clarkbthat would imply things are mostly working I think. That's good19:04
corvusso let's say i should confirm that's actually working with zl19:04
corvushrm, i think there may be a rax flex problem with zl19:05
corvus2025-03-10 17:26:56,062 ERROR zuul.Launcher:   TypeError: create_server() requires either 'image' or 'boot_volume'19:05
corvusnot sure what the deal is there19:05
corvusi'll follow up in #opendev tho19:06
clarkbnodepool is using images to boot off of not boot from volume fwiw19:06
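[editor's sketch] For readers following along, here is a minimal illustration of the kind of argument check that produces the error quoted above. The function `build_server_request` and its parameters are hypothetical stand-ins, not zuul-launcher or SDK code; only the error text is taken from the log.

```python
from typing import Optional


def build_server_request(name: str,
                         image: Optional[str] = None,
                         boot_volume: Optional[str] = None) -> dict:
    """Hypothetical helper: assemble the boot arguments for a new server."""
    # A server needs something to boot from: an image or an existing volume.
    if image is None and boot_volume is None:
        raise TypeError(
            "create_server() requires either 'image' or 'boot_volume'")
    request = {"name": name}
    if boot_volume is not None:
        # Boot-from-volume takes precedence in this sketch (an assumption).
        request["boot_volume"] = boot_volume
    else:
        request["image"] = image
    return request
```

Under this reading, the traceback would mean the launcher reached the create call without an image reference for that provider, though the log does not confirm the cause.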
clarkback. Anything else to bring up on this topic?19:06
corvusi think we're really close to being ready to start expanding the use of niz19:06
corvuslike maybe one or two more features to add to the launcher19:06
corvusso it's probably worth asking how to proceed?19:07
corvusi'm thinking...19:07
corvuscontinue to add some more providers, and switch the zuul tenant over to use niz exclusively.  do that without the new flavors though (so we use the old flavors -- 8gb everywhere)19:07
fungii do have a change up to add rackspace flex dfw319:07
corvus(basically, decouple the adding of new flavors from the switching to niz)19:08
corvusfungi: oh cool19:08
funginote that sjc3 and dfw3 use different flavor names, so my change moves the label->flavor mapping up into the region layer19:08
corvusif we do what i suggest -- should we also dial down max-servers in nodepool just a little?  or let them duke it out?19:09
fungi#link https://review.opendev.org/943104 Add the DFW3 region for Rackspace Flex19:09
clarkbcorvus: I suspect we can dial back max-servers by some small amount in nodepool19:09
corvusfungi: nice -- that's why we can do it in both places19:09
fungiyeah, i found that convenient19:09
fungigreat design!19:09
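[editor's sketch] The idea of moving the label->flavor mapping into the region layer can be pictured roughly as below. Every name here (the label, the flavor strings, the dictionary layout) is invented for illustration and is not the real Rackspace Flex flavor naming or the zuul-launcher configuration schema.

```python
# Hypothetical data: each region carries its own label -> flavor mapping,
# so two regions can expose the same label under different flavor names.
PROVIDER = {
    "regions": {
        "sjc3": {"flavors": {"standard-8gb": "example-sjc3-8gb"}},
        "dfw3": {"flavors": {"standard-8gb": "example-dfw3-8gb"}},
    },
}


def resolve_flavor(provider: dict, region: str, label: str) -> str:
    """Look up the cloud flavor name for a label within one region."""
    try:
        return provider["regions"][region]["flavors"][label]
    except KeyError as exc:
        raise LookupError(
            f"no flavor for label {label!r} in region {region!r}") from exc
```

Jobs keep asking for a single label while each region translates it to whatever flavor name that region uses.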
clarkbwe tend to float under max-servers due to $cloudproblems anyway so we may find that eventually leads to them duking it out :)19:10
corvuswe're definitely getting to the point where more images would be good to have :)  we have everything needed for zuul i think, but it would be good to have other images that other tenants use19:10
corvusclarkb: good point19:10
corvusi think that's about it from me19:11
clarkbthanks for the update. Adding more images and starting with the zuul tenant sounds great to me19:11
clarkb#topic Updating Flavors in OVH19:12
clarkbrelated is the OVH flavor surgery that we've discussed with amorin19:12
clarkb#link https://etherpad.opendev.org/p/ovh-flavors19:12
clarkbthe proposal is to do this Monday (March 17) ish19:12
corvusdid amorin respond about scheduling?19:12
corvusi missed a bunch of scrollback due to travel19:12
clarkb19:21:33*       amorin | corvus: I will check with the team tomorrow about march 17 and let you know.19:13
clarkbthis was the last message I saw19:13
clarkbso I guess we're still waiting on final confirmation they will proceed then19:13
corvuscool, i did miss that.  gtk.19:13
corvusprobably worth a ping to check on the status, but no rush19:13
clarkb++19:13
clarkband ya otherwise it's mostly a matter of being aware of this as a thing we're trying to do. It should make working with ovh in zuul-launcher and nodepool nicer19:14
clarkbmore future proofed by getting off the bespoke scheduling system19:14
clarkb#topic Running infra-prod Jobs in Parallel on Bridge19:16
clarkbYesterday we landed the change to increase our infra-prod jobs semaphore limit to 2. Since then we have been running our infra-prod jobs in parallel19:16
corvusyaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaay19:17
tonyb\o/  #greatsuccess19:17
funginow deploys are happening ~2x as fast19:17
fungiit's very nice!19:17
clarkbI monitored deploy, opendev-prod-hourly, and periodic buildsets as well as general demand/load on bridge and thankfully no major issues came up19:17
clarkb#link https://review.opendev.org/c/opendev/system-config/+/943999 A small logs fixup19:17
clarkbthis fixup addresses ansible logging on bridge that ianw noticed is off19:18
clarkb#link https://review.opendev.org/c/opendev/system-config/+/943992 A small job ordering fixup19:18
clarkband this change makes puppet-else depend on letsencrypt because puppet deployed services do use LE certs19:18
clarkbneither of which is urgent nor impacting functionality at the moment19:18
corvusit's important that puppet work so we can update our jenkins configuration :)19:19
clarkbindeed19:19
corvus(amusingly, since change #1 is a puppet change, we can say 943998 changes later and we still have puppet)19:20
clarkbother than those two changes I think the next step is considering increasing that semaphore limit to a larger number. Bridge has 8 vcpus and load avg was below 2 every time I checked. I think that means we can safely bump up to 4 or 5 in the semaphore19:20
corvus4 sgtm19:21
tonybI like 4 also 19:21
clarkbI was also hoping to land a few more regular system-config changes to see deploy do its thing but the one that merged today largely noop'd19:21
clarkbfungi's change from yesterday did not noop and was a good exercise thankfully19:22
clarkbok maybe try to land a change or two to system-config later today and if that and periodic look good bump up to 4 tomorrow19:22
fungionce we have performance graphs we can check easily again, we might consider increasing it further19:23
clarkbanything else on this topic? It was a long time coming but it's here and seems to be working so yay and thank you to everyone who helped make it happen19:23
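[editor's sketch] The semaphore limit discussed in this topic caps how many jobs may run at once. The snippet below shows that behaviour with plain Python threads. It is a generic illustration, not Zuul's semaphore implementation, and `run_jobs` with its dummy workload is hypothetical.

```python
import threading
import time


def run_jobs(job_count: int, limit: int) -> int:
    """Run job_count dummy jobs, at most `limit` at once; return the peak."""
    semaphore = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    running = 0
    peak = 0

    def job() -> None:
        nonlocal running, peak
        with semaphore:  # blocks while `limit` jobs already hold a slot
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.01)  # stand-in for real deployment work
            with lock:
                running -= 1

    threads = [threading.Thread(target=job) for _ in range(job_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return peak
```

Raising `limit` from 2 to 4 lets more jobs overlap, which is why the load average on the host is the figure being watched before each bump.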
clarkb#topic Upgrading old servers19:24
clarkbAs expected last week was consumed by infra-prod changes and the start of this week too19:24
clarkbso I don't have anything new on this front. But it is the next big item on my todo list to tackle19:25
clarkbtonyb: did you have any updates?19:25
tonyb:( No.19:26
clarkb#topic Sprinting to Upgrade Servers to Noble19:26
clarkbso as mentioned I'm hoping to pick this up again starting tomorrow most likely19:26
clarkbHelp with reviews is always appreciated as is help bootstrapping replacement servers19:26
clarkb#link https://etherpad.opendev.org/p/opendev-server-replacement-sprint19:26
clarkbfeel free to throw some notes in that etherpad and I'll try to get it updated from my side as I pick it up again19:27
clarkbThen in related news I started brainstorming what the Gerrit replacement looks like. I think it is basically boot a new gerrit and volume, have system-config deploy it "empty" without replication configured. Sanity check things look right and functional. Then sync over current prod state and recheck operation. Then schedule a downtime to do a final sync and cut over dns19:28
clarkbthe main thing is that we don't want replication to run until the new server is in production to avoid overwriting content on gitea19:28
clarkbif you can think of any other sync points that need to be handled carefully let me know19:28
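[editor's sketch] The Gerrit replacement outline above can be written down as an ordered plan with one check for the constraint that replication must not run before cutover. The step wording is a paraphrase of the messages above, the explicit "enable replication" step is an assumption implied by them, and `replication_is_safe` is a hypothetical helper.

```python
from typing import List

# Paraphrased steps; the final step is an assumption, not stated verbatim.
PLAN: List[str] = [
    "boot new gerrit server and volume",
    "deploy empty gerrit without replication",
    "sanity check new server",
    "sync current production state",
    "recheck operation",
    "scheduled downtime: final sync",
    "cut over dns",
    "enable replication",
]


def replication_is_safe(plan: List[str]) -> bool:
    """True if replication is enabled only after the DNS cutover."""
    return plan.index("cut over dns") < plan.index("enable replication")
```

A check like this only guards ordering. It says nothing about whether each sync actually completed.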
clarkb#topic Running certcheck on bridge19:29
clarkbas mentioned last week I volunteered to look at running this within an infra-prod job that runs daily instead19:29
clarkbI haven't done that yet but am hopeful I'll be able to look before next meeting just because I have so much infra-prod stuff paged in at this point19:30
clarkband with things running in parallel the wall clock time cost for that should be minimal19:30
clarkb#topic Working through our TODO list19:31
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:31
clarkbthis is just your weekly reminder to look at that list if you need good ideas for what to work on next. I think I can probably mark off infra-prod running in parallel now too19:31
clarkb#topic Open Discussion19:32
clarkbAnything else?19:32
fungii didn't have anything19:33
clarkbmy kids are out on spring break week after next. I may try to pop out a day or two to do family things as a result. but we have nothing big planned so my guess is it wouldn't be anything crazy19:34
tonybNothing from me this week19:34
clarkbthat was quick. I think we can end early today then19:36
clarkbthank you everyone for your time and help19:36
clarkbsee you back here next week19:36
clarkb#endmeeting19:36
opendevmeetMeeting ended Tue Mar 11 19:36:22 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:36
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-11-19.00.html19:36
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-11-19.00.txt19:36
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-11-19.00.log.html19:36
