Tuesday, 2025-07-08

19:00 <clarkb> meeting time!
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Jul  8 19:00:45 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EHPWD6ZYIOJ6KPI2D6QUN36KUXOV6YHL/ Our Agenda
19:00 <clarkb> #topic Announcements
19:01 <clarkb> I'll be out next week from the 15th to 17th. Sounded like fungi was willing to chair next week's meeting if there is sufficient interest
19:01 <fungi> yeah, i can take it
19:01 <fungi> unless someone else wants to
19:02 <clarkb> anything else to announce or should we jump into the agenda?
19:03 <fungi> i have nothing
19:03 <clarkb> #topic Zuul-launcher
19:03 <clarkb> the agenda is a bit out of date on this topic since things are moving so quickly
19:04 <clarkb> but we noticed mixed provider nodes continuing to impact jobs. corvus found bugs and fixed them
19:04 <corvus> fixes for all known bugs related to mixed provider nodesets are in production; i spot-checked some periodic builds this morning and didn't see failures related to this
19:04 <corvus> so if someone sees a new issue, pls raise it
19:05 <fungi> yeah, other folks who had been tracking that symptom in their project reported no further incidents
19:05 <clarkb> excellent
19:05 <clarkb> Another problem we ran into was the gnutls errors from image builds fetching git repos to update our git repo image caches
19:05 <corvus> i have changes for a couple of issues related to old/stuck requests; those are gating now
19:06 <clarkb> mnasiadka updated dib to retry git requests. I found that one of the nodes I looked at seemed to alternate between requesting git resources via ipv4 and ipv6
19:07 <clarkb> when it stalled out it was trying ipv6; after the timeout and retry it used ipv4 and no further problems were observed, even after it switched back to ipv6 again
19:07 <clarkb> I thought that maybe there were stray interface updates, but spot checking that provider shows we configure the interfaces statically for ipv6, so I'm really stumped
19:07 <corvus> what do test nodes use for dns now?
19:07 <clarkb> I think for now relying on the retries is probably a decent enough workaround and we can dig in further if problems become worse
19:08 <clarkb> corvus: should be unbound listening on localhost with unbound forwarding to google and cloudflare
19:08 <clarkb> if the host has an ipv6 address the ipv6 google and cloudflare addrs are used. If not then ipv4
19:08 <clarkb> (they don't recurse themselves)
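
[Illustrative sketch: a minimal unbound forwarding configuration along the lines clarkb describes above. The forward addresses are the well-known public google and cloudflare resolvers; the exact opendev unbound.conf layout is assumed here, not quoted.]

    server:
        interface: 127.0.0.1
        interface: ::1

    forward-zone:
        name: "."
        # hosts with a global ipv6 address forward to the ipv6 resolvers
        forward-addr: 2001:4860:4860::8888   # google
        forward-addr: 2606:4700:4700::1111   # cloudflare
        # ipv4-only hosts would instead use 8.8.8.8 and 1.1.1.1
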
19:08 <corvus> so if there's dns flapping, it would be google/cloudflare...
19:09 <clarkb> however image builds occur in a chroot which may impact /etc/resolv.conf; I haven't checked on that more closely
19:09 <clarkb> corvus: ya, either google or cloudflare, or maybe something about the chroot setup overriding the host dns
19:09 <corvus> we have a 1h ttl on opendev.org...
19:09 <clarkb> right, and the flapping between the protocols occurred over a time span of less than an hour, and it did so multiple times
19:09 <clarkb> it's definitely a really odd situation.
19:10 <corvus> weird. nothing is jumping out at me either.
19:10 <clarkb> another idea is that it's just Internet failures
19:10 <clarkb> in which case retrying is probably the most correct thing to do and we're doing that now
19:10 <fungi> that seems like the most probable cause fitting all the observed behaviors
19:10 <corvus> and retrying seems to be sufficient?
19:11 <clarkb> corvus: yes, in the example I was digging into there was a single retry for one failure and it proceeded successfully from there
19:11 <clarkb> it appears to be quite intermittent
19:11 <corvus> i'm happy enough to just back away slowly from this then.  :)
19:11 <fungi> somebody's got a router somewhere in the path that's inadvertently hashing a small percentage of flows into a black hole
19:11 <fungi> and maybe only for ipv6
19:12 <clarkb> corvus: ya, that's about where I've ended up
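
[Illustrative sketch: the retry-on-transient-failure approach being relied on here, written as a small python helper. The real retry logic lives in dib's source-repositories element, so the function and parameters below are assumptions for illustration, not the actual implementation.]

    import subprocess
    import time

    def fetch_with_retry(git_url, dest, attempts=3, delay=30):
        """Clone a git repo, retrying on transient network failures."""
        for attempt in range(1, attempts + 1):
            try:
                subprocess.run(
                    ["git", "clone", git_url, dest],
                    check=True,
                    timeout=600,  # give up on a stalled transfer rather than hang
                )
                return
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
                if attempt == attempts:
                    raise
                # transient failure (gnutls error, stalled ipv6 path, etc.);
                # back off briefly and try again
                time.sleep(delay)
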
19:13 <clarkb> there are also trixie and EL10 image builds that I figured we could touch on briefly, but I gave them their own topics
19:13 <clarkb> anything other than ^ we want to discuss on this topic?
19:13 <corvus> oh yeah
19:14 <corvus> i think we're really close to having zero nodepool traffic
19:14 <corvus> and i think the probability of us wanting to roll back is nearing zero
19:14 <corvus> so... do we want to decommission the nodepool servers?
19:14 <fungi> that would be awesome
19:15 <clarkb> I'm on board with that. I was thinking it is probably a good idea to stop services for a week or so and keep an eye out for any unexpected corner cases we missed
19:15 <corvus> sounds like a plan.  a day or so after it looks like zero traffic, i'll stop, then wait a week to delete
19:15 <clarkb> then shut down and remove the servers entirely after that period. Mostly just thinking there may be jobs that run infrequently enough or are lost in the periodic job noise that we may not notice until we shut it down properly
19:15 <clarkb> corvus: sounds great
19:16 <corvus> eot from me
19:16 <clarkb> #topic Adding Debian Trixie Images
19:16 <clarkb> related to the last topic, frickler has done some work to add Debian Trixie images, and only in zuul-launcher, not nodepool
19:16 <clarkb> I think that is the correct choice at this point, particularly since we are close to shutting down Nodepool, but worth calling out
19:17 <corvus> ++
19:17 <clarkb> The other bits worth noting are that we are relying on a prerelease version of dib to build those images, since they are really debian testing images until trixie is released
19:17 <clarkb> we will likely need to make some small updates to s/testing/trixie/ post release too
19:18 <clarkb> thinking out loud here, we should probably try to get a dib release out soon too to capture these updates, but I think I'm ok with using depends-on and pulling dib from source. We may need to be careful about reviewing depends-on before approving zuul-provider changes if we do that long term though?
19:18 <corvus> i think we switched zuul-provider to just always build from source
19:18 <clarkb> but even with depends-on, zuul-provider won't merge and upload an image until the dib change is reviewed and approved
19:19 <clarkb> so I think risk there is low. Mostly just want zuul-provider reviewers to know they need to think about any dib depends-on too
19:19 <corvus> yes, zuul-providers always uses dib from source now, so releases are irrelevant to it
19:20 <corvus> (for all image builds)
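
[Illustrative sketch: how an unmerged dib change gets pulled into a zuul-providers image build via Zuul's cross-repo Depends-On commit message footer; the change number below is a made-up placeholder.]

    Add Debian trixie image builds

    Depends-On: https://review.opendev.org/c/openstack/diskimage-builder/+/999999
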
19:20 <clarkb> cool
19:20 <clarkb> oh the other thing is we aren't mirroring testing/trixie
19:20 <clarkb> this may break jobs that try to run on the new nodes as I think we try to configure mirrors for all debian images today
19:20 <clarkb> we can figure that out if/when it becomes a problem after images are built
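
[Illustrative sketch: the kind of apt source entry the mirror configuration would write on debian test nodes; the mirror hostname pattern and path are assumptions, and trixie is not actually mirrored yet, which is the gap being described.]

    deb http://mirror.REGION.PROVIDER.opendev.org/debian trixie main
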
19:21 <clarkb> #topic EL10 Images
19:21 <clarkb> then related to that we're also in the process of adding CentOS 10 Stream and Rocky Linux 10 images
19:21 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/953460 CentOS 10 Stream
19:22 <clarkb> #link https://review.opendev.org/c/opendev/zuul-providers/+/954265 Rocky Linux 10
19:22 <clarkb> in addition to all of the prior notes about dib and mirrors, the extra consideration here is that these nodes cannot boot in rackspace classic
19:22 <clarkb> I think I've come to terms with that and I think it may be a good way to push other providers (and rax) to expand on more capable hardware
19:23 <clarkb> but it does mean that like half our resources can't boot those images right now
19:23 <clarkb> thank you to everyone pushing both trixie and EL10 along. I think they have been a great illustration of how zuul-launcher is an improvement over nodepool when it comes to debug cycles and adding new images
19:24 <clarkb> we're able to do everything upfront rather than post merge on nodepool's daily rebuild cycle
19:24 <fungi> and this opens us up to the possibility of adding image acceptance tests
19:25 <fungi> as actual zuul jobs
19:26 <clarkb> #topic Gerrit 3.11 Upgrade Planning
19:26 <clarkb> The only thing I've done on this topic recently is clean up the old held nodes
19:26 <clarkb> I realized they are probably stale and don't represent our current images and config anymore (for example the cleanup of the h2 compaction timeout increases)
19:27 <clarkb> The other thing I've realized is that we've got a handful of unrelated changes that all touch on the gerrit images that we might want to bundle up and apply with a single gerrit restart
19:27 <clarkb> There is the move to host gerrit images on quay
19:27 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/882900 Host Gerrit images on quay.io
19:27 <clarkb> There is the update to remove cla stuff
19:27 <clarkb> (which has content baked into the image iirc)
19:28 <clarkb> then there are the updates to the zuul status viewer which I don't think we've deployed yet either
19:28 <clarkb> fungi: maybe you and I can sync up on doing quay and the cla stuff one day and do a single restart to pick up both?
19:29 <fungi> yeah, i expect at least a week lead time to announce merging the acl change to remove cla enforcement from all remaining projects in our gerrit
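
[Illustrative sketch: the project.config setting that Gerrit's contributor agreement enforcement hangs off of; the actual acl change being announced may look different.]

    [receive]
        requireContributorAgreement = false
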
19:29 <clarkb> then once that is done I want to hold new nodes with that up-to-date configuration state and container images and dig into planning the upgrade from there
19:29 <corvus> correct, i don't think the status plugin is deployed, but it was tested in a system-config build.
19:30 <clarkb> fungi: ok let's sync up after the meeting and figure out what planning for that looks like
19:30 <clarkb> but once all of that is caught up I can dig into the release notes and update the plan with testing input
19:30 <clarkb> #link https://www.gerritcodereview.com/3.11.html
19:30 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade
19:30 <clarkb> any other questions, concerns or comments in relation to the gerrit 3.10 to 3.11 upgrade planning?
19:31 <fungi> none on my side
19:31 <clarkb> #topic Upgrading old servers
19:32 <clarkb> we had a good sprint on this ~last week and the week prior
19:32 <clarkb> but since then I haven't seen any new servers (which is fine)
19:32 <fungi> no news on refstack
19:32 <clarkb> ack. Other than refstack the next "easy" server I've identified is eavesdrop
19:32 <clarkb> so when I find time that is the most likely next server to get replaced
19:34 <clarkb> #topic Trialing Matrix for OpenDev Comms
19:35 <clarkb> unlike gerrit upgrade prep, it wasn't that I realized there were prereqs here; I just ran out of time between debugging image build and gitea stuff and the holiday
19:35 <clarkb> that said, I should be able to write that spec this week as long as fires stay away. Not having a holiday helps too
19:35 <clarkb> it's on my todo list after reviewing PBR updates and some local laptop updates I need to do
19:35 <clarkb> oh and some paperwork that is due today
19:36 <clarkb> so ya hopefully soon
19:37 <clarkb> #topic Working through our TODO list
19:37 <clarkb> And now it's time for the weekly reminder that we have a TODO list that still needs a new home
19:37 <clarkb> #link https://etherpad.opendev.org/p/opendev-january-2025-meetup
19:37 <clarkb> that is related enough to spec work that I should probably roll those two efforts into adjacent blocks of time
19:37 <clarkb> #topic Pre PTG Planning
19:37 <clarkb> #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document
19:38 <clarkb> I've started to put the planning for this 'event' into its own dedicated document.
19:38 <clarkb> Feel free to add agenda items and add suggestions for specific timeframes and days in the etherpad. Once we've got a bit more of a schedule I'll announce this on the mailing list as well to make it more official
19:39 <clarkb> I can also port over still relevant items from the last event to this one. We've managed to complete a few things but the list is long and there are likely updates
19:41 <clarkb> #topic Open Discussion
19:41 <clarkb> ~August will be our next Service Coordinator election
19:41 <clarkb> I'm calling that out now in part to remind myself to figure out planning for that but also to encourage others to volunteer if interested.
19:44 <clarkb> sounds like that may be everything?
19:45 <clarkb> Thank you everyone for keeping OpenDev up and running
19:45 <fungi> thanks clarkb!
19:45 <corvus> thanks!
19:45 <clarkb> #endmeeting
19:45 <opendevmeet> Meeting ended Tue Jul  8 19:45:44 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
19:45 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.html
19:45 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.txt
19:45 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.log.html
