clarkb | meeting time! | 19:00 |
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Jul 8 19:00:45 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EHPWD6ZYIOJ6KPI2D6QUN36KUXOV6YHL/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | I'll be out next week from the 15th to 17th. Sounded like fungi was willing to chair next week's meeting if there is sufficient interest | 19:01 |
fungi | yeah, i can take it | 19:01 |
fungi | unless someone else wants to | 19:01 |
clarkb | anything else to announce or should we jump into the agenda? | 19:02 |
fungi | i have nothing | 19:03 |
clarkb | #topic Zuul-launcher | 19:03 |
clarkb | the agenda is a bit out of date on this topic since things are moving so quickly | 19:03 |
clarkb | but we noticed mixed provider nodes continuing to impact jobs. corvus found bugs and fixed them | 19:04 |
corvus | fixes for all known bugs related to mixed provider nodesets are in production; i spot-checked some periodic builds this morning and didn't see failures related to this | 19:04 |
corvus | so if someone sees a new issue, pls raise it | 19:04 |
fungi | yeah, other folks who had been tracking that symptom in their projects reported no further occurrences | 19:05 |
clarkb | excellent | 19:05 |
clarkb | Another problem we ran into was gnutls errors from image builds fetching git repos to update our git repo image caches | 19:05 |
corvus | i have changes for a couple of issues related to old/stuck requests; those are gating now | 19:05 |
clarkb | mnasiadka updated dib to retry git requests. I found that one of the nodes I looked at seemed to alternate between requesting git resources via ipv4 and ipv6 protocols | 19:06 |
clarkb | when it stalled out it was trying ipv6; then after the timeout and retry it used ipv4 and no further problems were observed, even after it switched back to ipv6 again | 19:07 |
clarkb | I thought that maybe there were stray interface updates, but spot checking that provider, we configure the interfaces statically for ipv6, so I'm really stumped | 19:07 |
corvus | what do test nodes use for dns now? | 19:07 |
clarkb | I think for now relying on the retries is probably a decent enough workaround and we can dig in further if problems become worse | 19:07 |
clarkb | corvus: should be unbound listening on localhost with unbound forwarding to google and cloudflare | 19:08 |
clarkb | if the host has an ipv6 address the ipv6 google and cloudflare addrs are used. If not then ipv4 | 19:08 |
clarkb | (they don't recurse themselves) | 19:08 |
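For reference, a minimal sketch of the resolver setup clarkb describes, using stock unbound syntax; the actual file is managed by OpenDev's configuration management and the exact options may differ. The forwarder addresses are the well-known Google and Cloudflare public resolvers:

```
# /etc/unbound/unbound.conf (sketch; real file is config-management-driven)
server:
    interface: 127.0.0.1
forward-zone:
    name: "."
    # on ipv6-capable hosts the v6 equivalents are used instead:
    # 2001:4860:4860::8888 (google) and 2606:4700:4700::1111 (cloudflare)
    forward-addr: 8.8.8.8    # google
    forward-addr: 1.1.1.1    # cloudflare
```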
corvus | so if there's dns flapping, it would be google/cloudflare... | 19:08 |
clarkb | however image builds occur in a chroot, which may impact /etc/resolv.conf; I haven't checked on that more closely | 19:09 |
clarkb | corvus: ya either google or cloudflare or something about chroot setup overriding the host dns maybe | 19:09 |
corvus | we have a 1h ttl on opendev.org... | 19:09 |
clarkb | right and the flapping between the protocols occurred over a time span of less than an hour, and it did so multiple times | 19:09 |
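As a side note, the TTL is easy to confirm with dig; the second column of each answer record is the remaining TTL in seconds (output omitted here rather than invented):

```
$ dig +noall +answer opendev.org A
$ dig +noall +answer opendev.org AAAA
```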
clarkb | it's definitely a really odd situation. | 19:09 |
corvus | weird. nothing is jumping out at me either. | 19:10 |
clarkb | another idea is that it's just Internet failures | 19:10 |
clarkb | in which case retrying is probably the most correct thing to do and we're doing that now | 19:10 |
fungi | that seems like the most probable cause fitting all the observed behaviors | 19:10 |
corvus | and retrying seems to be sufficient? | 19:10 |
clarkb | corvus: yes in the example I was digging into there was a single retry for one failure and it proceeded successfully from there | 19:11 |
clarkb | it appears to be quite intermittent | 19:11 |
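As a rough illustration of the retry pattern being relied on here (not dib's actual code, which implements this inside its own elements), a retry wrapper around a git fetch might look like:

```python
import shutil
import subprocess
import time

def clone_with_retry(url, dest, attempts=3, delay=30, timeout=600):
    """Clone a git repo, retrying on transient network failures."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["git", "clone", url, dest],
                           check=True, timeout=timeout)
            return
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            shutil.rmtree(dest, ignore_errors=True)  # drop any partial clone
            if attempt == attempts:
                raise  # out of retries; surface the failure
            time.sleep(delay)  # back off before retrying
```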
corvus | i'm happy enough to just back away slowly from this then. :) | 19:11 |
fungi | somebody's got a router somewhere in the path that's inadvertently hashing a small percentage of flows into a black hole | 19:11 |
fungi | and maybe only for ipv6 | 19:11 |
clarkb | corvus: ya that's about where I've ended up | 19:12 |
clarkb | there are also trixie and EL10 image builds that I figured we could touch on briefly but I gave them their own topics | 19:13 |
clarkb | anything other than ^ we want to discuss on this topic? | 19:13 |
corvus | oh yeah | 19:13 |
corvus | i think we're really close to having zero nodepool traffic | 19:14 |
corvus | and i think probability of us wanting to roll-back is nearing zero | 19:14 |
corvus | so... do we want to decommission the nodepool servers? | 19:14 |
fungi | that would be awesome | 19:14 |
clarkb | I'm on board with that. I was thinking it is probably a good idea to stop services for a week or so and keep an eye out for any unexpected corner cases we missed | 19:15 |
corvus | sounds like a plan. a day or so after it looks like zero traffic, i'll stop, then wait a week to delete | 19:15 |
clarkb | then shut down and remove the servers entirely after that period. Mostly just thinking there may be jobs that run infrequently enough, or are lost in the periodic job noise, that we may not notice until we shut it down properly | 19:15 |
clarkb | corvus: sounds great | 19:15 |
corvus | eot from me | 19:16 |
clarkb | #topic Adding Debian Trixie Images | 19:16 |
clarkb | related to the last topic, frickler has done some work to add Debian Trixie images, and only in zuul-launcher, not nodepool | 19:16 |
clarkb | I think that is the correct choice at this point particularly since we are close to shutting down Nodepool but worth calling out | 19:16 |
corvus | ++ | 19:17 |
clarkb | The other bits worth noting are that we are relying on a prerelease version of dib to build those images, since they are really debian testing images until trixie is released | 19:17 |
clarkb | we will likely need to make some small updates to s/testing/trixie/ post release too | 19:17 |
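For concreteness, that post-release flip roughly amounts to changing the release name the debian elements receive; a hedged sketch of what it means for a manual dib invocation (the real change would land in the zuul-providers build definitions):

```
DIB_RELEASE=trixie disk-image-create -o debian-trixie debian-minimal vm
```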
clarkb | thinking out loud here, we should probably try and get a dib release out soon too to capture these updates, but I think I'm ok with using depends-on and pulling dib from source. We may need to be careful about reviewing depends-on before approving zuul-provider changes if we do that long term though? | 19:18 |
corvus | i think we switched zuul-provider to just always build from source | 19:18 |
clarkb | but even with depends-on, zuul-provider won't merge and upload an image until the dib change is reviewed and approved | 19:18 |
clarkb | so I think risk there is low. Mostly just want zuul-provider reviewers to know they need to think about any dib depends-on too | 19:19 |
corvus | yes, zuul-providers always uses dib from source now, so releases are irrelevant to it | 19:19 |
corvus | (for all image builds) | 19:20 |
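For anyone unfamiliar, the depends-on mechanism discussed here is Zuul's commit-message footer; a zuul-providers change can pull in a not-yet-merged dib change like this (change number hypothetical):

```
Add new image build

Depends-On: https://review.opendev.org/c/openstack/diskimage-builder/+/999999
```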
clarkb | cool | 19:20 |
clarkb | oh the other thing is we aren't mirroring testing/trixie | 19:20 |
clarkb | this may break jobs that try to run on the new nodes as I think we try to configure mirrors for all debian images today | 19:20 |
clarkb | we can figure that out if/when it becomes a problem after images are built | 19:20 |
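For context, the mirror configuration in question boils down to pointing apt at a per-region mirror; a hypothetical example of the sources.list entry a trixie node would need once mirroring exists (the hostname varies by provider region, and the path assumes the existing debian mirror layout):

```
deb http://mirror.dfw.rax.opendev.org/debian trixie main
```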
clarkb | #topic EL10 Images | 19:21 |
clarkb | then related to that we're also in the process of adding CentOS 10 Stream and Rocky Linux 10 images | 19:21 |
clarkb | #link https://review.opendev.org/c/opendev/zuul-providers/+/953460 CentOS 10 Stream | 19:21 |
clarkb | #link https://review.opendev.org/c/opendev/zuul-providers/+/954265 Rocky Linux 10 | 19:22 |
clarkb | in addition to all of the prior notes about dib and mirrors the extra consideration here is that these nodes cannot boot in rackspace classic | 19:22 |
clarkb | I think I've come to terms with that and I think it may be a good way to push other providers (and rax) to expand on more capable hardware | 19:22 |
clarkb | but it does mean that like half our resources can't boot those images right now | 19:23 |
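The discussion doesn't spell out why the boots fail; one likely candidate is EL10's raised CPU baseline (x86-64-v3), which older virtualization hosts may not expose. Assuming that's the blocker, glibc 2.33+ can report whether a given host meets the baseline:

```
/lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'
```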
clarkb | thank you to everyone pushing both trixie and EL10 along. I think they have been a great illustration for how zuul-launcher is an improvement over nodepool when it comes to debug cycles and adding new images | 19:23 |
clarkb | we're able to do everything upfront rather than post merge on nodepool's daily rebuild cycle | 19:24 |
fungi | and this opens us up to the possibility of adding image acceptance tests | 19:24 |
fungi | as actual zuul jobs | 19:25 |
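As a sketch of what such an acceptance test could look like as an ordinary zuul job (job, playbook, and label names here are invented for illustration):

```yaml
- job:
    name: debian-trixie-image-acceptance
    description: Boot a node from the freshly built image and run sanity checks.
    run: playbooks/image-sanity.yaml
    nodeset:
      nodes:
        - name: node
          label: debian-trixie
```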
clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:26 |
clarkb | The only thing I've done on this topic recently is clean up the old held nodes | 19:26 |
clarkb | I realized they are probably stale and don't represent our current images and config anymore (for example the cleanup of h2 compaction timeout increases) | 19:26 |
clarkb | The other thing I've realized is that we've got a handful of unrelated changes that all touch on the gerrit images that we might want to bundle up and apply with a single gerrit restart | 19:27 |
clarkb | There is the move to host gerrit images on quay | 19:27 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/882900 Host Gerrit images on quay.io | 19:27 |
clarkb | There is the update to remove cla stuff | 19:27 |
clarkb | (which has content baked into the image iirc) | 19:27 |
clarkb | then there are the updates to the zuul status viewer which I don't think we've deployed yet either | 19:28 |
clarkb | fungi: maybe you and I can sync up on doing quay and the cla stuff one day and do a single restart to pick up both? | 19:28 |
fungi | yeah, i expect at least a week lead time to announce merging the acl change to remove cla enforcement from all remaining projects in our gerrit | 19:29 |
clarkb | then once that is done I want to hold new nodes with that up to date configuration state and container images and dig into planning the upgrade from there | 19:29 |
corvus | correct, i don't think the status plugin is deployed, but it was tested in a system-config build. | 19:29 |
clarkb | fungi: ok let's sync up after the meeting and figure out what planning for that looks like | 19:30 |
clarkb | but once all of that is caught up I can dig into the release notes and update the plan with testing input | 19:30 |
clarkb | #link https://www.gerritcodereview.com/3.11.html | 19:30 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade | 19:30 |
clarkb | any other questions, concerns or comments in relation to the gerrit 3.10 to 3.11 upgrade planning? | 19:30 |
fungi | none on my side | 19:31 |
clarkb | #topic Upgrading old servers | 19:31 |
clarkb | we had a good sprint on this ~last week and the week prior | 19:32 |
clarkb | but since then I haven't seen any new servers (which is fine) | 19:32 |
fungi | no news on refstack | 19:32 |
clarkb | ack. Other than refstack the next "easy" server I've identified is eavesdrop | 19:32 |
clarkb | so when I find time that is the most likely next server to get replaced | 19:32 |
clarkb | #topic Trialing Matrix for OpenDev Comms | 19:34 |
clarkb | unlike gerrit upgrade prep, where I didn't dive in because I realized there were prereqs, here I just ran out of time between debugging image build and gitea stuff and the holiday | 19:35 |
clarkb | that said I should be able to write that spec this week as long as fires stay away. Having no holiday this week helps too | 19:35 |
clarkb | it's on my todo list after reviewing PBR updates and some local laptop updates I need to do | 19:35 |
clarkb | oh and some paperwork that is due today | 19:35 |
clarkb | so ya hopefully soon | 19:36 |
clarkb | #topic Working through our TODO list | 19:37 |
clarkb | And now it's time for the weekly reminder that we have a TODO list that still needs a new home | 19:37 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:37 |
clarkb | that is related enough to spec work that I should probably roll those two efforts into adjacent blocks of time | 19:37 |
clarkb | #topic Pre PTG Planning | 19:37 |
clarkb | #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document | 19:37 |
clarkb | I've started to put the planning for this 'event' into its own dedicated document. | 19:38 |
clarkb | Feel free to add agenda items and add suggestions for specific timeframes and days in the etherpad. Once we've got a bit more of a schedule I'll announce this on the mailing list as well to make it more official | 19:38 |
clarkb | I can also port over still relevant items from the last event to this one. We've managed to complete a few things but the list is long and there are likely updates | 19:39 |
clarkb | #topic Open Discussion | 19:41 |
clarkb | ~August will be our next Service Coordinator election | 19:41 |
clarkb | I'm calling that out now in part to remind myself to figure out planning for that but also to encourage others to volunteer if interested. | 19:41 |
clarkb | sounds like that may be everything? | 19:44 |
clarkb | Thank you everyone for keeping OpenDev up and running | 19:45 |
fungi | thanks clarkb! | 19:45 |
corvus | thanks! | 19:45 |
clarkb | #endmeeting | 19:45 |
opendevmeet | Meeting ended Tue Jul 8 19:45:44 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:45 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.html | 19:45 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.txt | 19:45 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.log.html | 19:45 |