clarkb | meeting time! | 19:00 |
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Jul 8 19:00:45 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EHPWD6ZYIOJ6KPI2D6QUN36KUXOV6YHL/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | I'll be out next week from the 15th to 17th. Sounded like fungi was willing to chair next week's meeting if there is sufficient interest | 19:01 |
fungi | yeah, i can take it | 19:01 |
fungi | unless someone else wants to | 19:01 |
clarkb | anything else to announce or should we jump into the agenda? | 19:02 |
fungi | i have nothing | 19:03 |
clarkb | #topic Zuul-launcher | 19:03 |
clarkb | the agenda is a bit out of date on this topic since things are moving so quickly | 19:03 |
clarkb | but we noticed mixed provider nodes continuing to impact jobs. corvus found bugs and fixed them | 19:04 |
corvus | fixes for all known bugs related to mixed provider nodesets are in production; i spot-checked some periodic builds this morning and didn't see failures related to this | 19:04 |
corvus | so if someone sees a new issue, pls raise it | 19:04 |
fungi | yeah, other folks who had been tracking that symptom in their projects reported no further occurrences | 19:05 |
clarkb | excellent | 19:05 |
clarkb | Another problem we ran into was gnutls errors from image builds fetching git repos to update our git repo image caches | 19:05 |
corvus | i have changes for a couple of issues related to old/stuck requests; those are gating now | 19:05 |
clarkb | mnasiadka updated dib to retry git requests. I found that one of the nodes I looked at seemed to alternate between requesting git resources via ipv4 and ipv6 protocols | 19:06 |
clarkb | when it stalled out it was trying ipv6; then after the timeout and retry it used ipv4 and no further problems were observed, even after it switched back to ipv6 again | 19:07 |
clarkb | I thought that maybe there were stray interface updates, but spot checking that provider, we configure the interfaces statically for ipv6, so I'm really stumped | 19:07 |
corvus | what do test nodes use for dns now? | 19:07 |
clarkb | I think for now relying on the retries is probably a decent enough workaround and we can dig in further if problems become worse | 19:07 |
clarkb | corvus: should be unbound listening on localhost with unbound forwarding to google and cloudflare | 19:08 |
clarkb | if the host has an ipv6 address the ipv6 google and cloudflare addrs are used. If not then ipv4 | 19:08 |
clarkb | (they don't recurse themselves) | 19:08 |
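For reference, a minimal sketch of the resolver setup clarkb describes, using stock unbound syntax; the actual file is managed by OpenDev's configuration management and the exact options may differ. The forwarder addresses are the well-known Google and Cloudflare public resolvers:

```
# /etc/unbound/unbound.conf (sketch; real file is config-management-driven)
server:
    interface: 127.0.0.1
forward-zone:
    name: "."
    # on ipv6-capable hosts the v6 equivalents are used instead:
    # 2001:4860:4860::8888 (google) and 2606:4700:4700::1111 (cloudflare)
    forward-addr: 8.8.8.8    # google
    forward-addr: 1.1.1.1    # cloudflare
```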
corvus | so if there's dns flapping, it would be google/cloudflare... | 19:08 |
clarkb | however image builds occur in a chroot, which may impact /etc/resolv.conf; I haven't checked on that more closely | 19:09 |
clarkb | corvus: ya either google or cloudflare or something about chroot setup overriding the host dns maybe | 19:09 |
corvus | we have a 1h ttl on opendev.org... | 19:09 |
clarkb | right and the flapping between the protocols occurred over a time span of less than an hour, and it did so multiple times | 19:09 |
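As a side note, the TTL is easy to confirm with dig; the second column of each answer record is the remaining TTL in seconds (output omitted here rather than invented):

```
$ dig +noall +answer opendev.org A
$ dig +noall +answer opendev.org AAAA
```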
clarkb | it's definitely a really odd situation. | 19:09 |
corvus | weird. nothing is jumping out at me either. | 19:10 |
clarkb | another idea is that it's just Internet failures | 19:10 |
clarkb | in which case retrying is probably the most correct thing to do and we're doing that now | 19:10 |
fungi | that seems like the most probable cause fitting all the observed behaviors | 19:10 |
corvus | and retrying seems to be sufficient? | 19:10 |
clarkb | corvus: yes in the example I was digging into there was a single retry for one failure and it proceeded successfully from there | 19:11 |
clarkb | it appears to be quite intermittent | 19:11 |
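As a rough illustration of the retry pattern being relied on here (not dib's actual code, which implements this inside its own elements), a retry wrapper around a git fetch might look like:

```python
import shutil
import subprocess
import time

def clone_with_retry(url, dest, attempts=3, delay=30, timeout=600):
    """Clone a git repo, retrying on transient network failures."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["git", "clone", url, dest],
                           check=True, timeout=timeout)
            return
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            shutil.rmtree(dest, ignore_errors=True)  # drop any partial clone
            if attempt == attempts:
                raise  # out of retries; surface the failure
            time.sleep(delay)  # back off before retrying
```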
corvus | i'm happy enough to just back away slowly from this then. :) | 19:11 |
fungi | somebody's got a router somewhere in the path that's inadvertently hashing a small percentage of flows into a black hole | 19:11 |
fungi | and maybe only for ipv6 | 19:11 |
clarkb | corvus: ya that's about where I've ended up | 19:12 |
clarkb | there are also trixie and EL10 image builds that I figured we could touch on briefly but I gave them their own topics | 19:13 |
clarkb | anything other than ^ we want to discuss on this topic? | 19:13 |
corvus | oh yeah | 19:13 |
corvus | i think we're really close to having zero nodepool traffic | 19:14 |
corvus | and i think probability of us wanting to roll-back is nearing zero | 19:14 |
corvus | so... do we want to decommission the nodepool servers? | 19:14 |
fungi | that would be awesome | 19:14 |
clarkb | I'm on board with that. I was thinking it is probably a good idea to stop services for a week or so and keep an eye out for any unexpected corner cases we missed | 19:15 |
corvus | sounds like a plan. a day or so after it looks like zero traffic, i'll stop, then wait a week to delete | 19:15 |
clarkb | then shut down and remove the servers entirely after that period. Mostly just thinking there may be jobs that run infrequently enough, or are lost in the periodic job noise, that we may not notice until we shut it down properly | 19:15 |
clarkb | corvus: sounds great | 19:15 |
corvus | eot from me | 19:16 |
clarkb | #topic Adding Debian Trixie Images | 19:16 |
clarkb | related to the last topic, frickler has done some work to add Debian Trixie images, and only in zuul-launcher, not nodepool | 19:16 |
clarkb | I think that is the correct choice at this point particularly since we are close to shutting down Nodepool but worth calling out | 19:16 |
corvus | ++ | 19:17 |
clarkb | The other bits worth noting are that we are relying on a prerelease version of dib to build those images, since they are really debian testing images until trixie is released | 19:17 |
clarkb | we will likely need to make some small updates to s/testing/trixie/ post release too | 19:17 |
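For concreteness, that post-release flip roughly amounts to changing the release name the debian elements receive; a hedged sketch of what it means for a manual dib invocation (the real change would land in the zuul-providers build definitions):

```
DIB_RELEASE=trixie disk-image-create -o debian-trixie debian-minimal vm
```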
clarkb | thinking out loud here, we should probably try and get a dib release out soon too to capture these updates, but I think I'm ok with using depends-on and pulling dib from source. We may need to be careful about reviewing depends-on before approving zuul-provider changes if we do that long term though? | 19:18 |
corvus | i think we switched zuul-provider to just always build from source | 19:18 |
clarkb | but even with depends-on, zuul-provider won't merge and upload an image until the dib change is reviewed and approved | 19:18 |
clarkb | so I think risk there is low. Mostly just want zuul-provider reviewers to know they need to think about any dib depends-on too | 19:19 |
corvus | yes, zuul-providers always uses dib from source now, so releases are irrelevant to it | 19:19 |
corvus | (for all image builds) | 19:20 |
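For anyone unfamiliar, the depends-on mechanism discussed here is Zuul's commit-message footer; a zuul-providers change can pull in a not-yet-merged dib change like this (change number hypothetical):

```
Add new image build

Depends-On: https://review.opendev.org/c/openstack/diskimage-builder/+/999999
```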
clarkb | cool | 19:20 |
clarkb | oh the other thing is we aren't mirroring testing/trixie | 19:20 |
clarkb | this may break jobs that try to run on the new nodes as I think we try to configure mirrors for all debian images today | 19:20 |
clarkb | we can figure that out if/when it becomes a problem after images are built | 19:20 |
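For context, the mirror configuration in question boils down to pointing apt at a per-region mirror; a hypothetical example of the sources.list entry a trixie node would need once mirroring exists (the hostname varies by provider region, and the path assumes the existing debian mirror layout):

```
deb http://mirror.dfw.rax.opendev.org/debian trixie main
```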
clarkb | #topic EL10 Images | 19:21 |
clarkb | then related to that we're also in the process of adding CentOS 10 Stream and Rocky Linux 10 images | 19:21 |
clarkb | #link https://review.opendev.org/c/opendev/zuul-providers/+/953460 CentOS 10 Stream | 19:21 |
clarkb | #link https://review.opendev.org/c/opendev/zuul-providers/+/954265 Rocky Linux 10 | 19:22 |
clarkb | in addition to all of the prior notes about dib and mirrors the extra consideration here is that these nodes cannot boot in rackspace classic | 19:22 |
clarkb | I think I've come to terms with that and I think it may be a good way to push other providers (and rax) to expand on more capable hardware | 19:22 |
clarkb | but it does mean that like half our resources can't boot those images right now | 19:23 |
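The discussion doesn't spell out why the boots fail; one likely candidate is EL10's raised CPU baseline (x86-64-v3), which older virtualization hosts may not expose. Assuming that's the blocker, glibc 2.33+ can report whether a given host meets the baseline:

```
/lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'
```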
clarkb | thank you to everyone pushing both trixie and EL10 along. I think they have been a great illustration for how zuul-launcher is an improvement over nodepool when it comes to debug cycles and adding new images | 19:23 |
clarkb | we're able to do everything upfront rather than post merge on nodepool's daily rebuild cycle | 19:24 |
fungi | and this opens us up to the possibility of adding image acceptance tests | 19:24 |
fungi | as actual zuul jobs | 19:25 |
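As a sketch of what such an acceptance test could look like as an ordinary zuul job (job, playbook, and label names here are invented for illustration):

```yaml
- job:
    name: debian-trixie-image-acceptance
    description: Boot a node from the freshly built image and run sanity checks.
    run: playbooks/image-sanity.yaml
    nodeset:
      nodes:
        - name: node
          label: debian-trixie
```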
clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:26 |
clarkb | The only thing I've done on this topic recently is clean up the old held nodes | 19:26 |
clarkb | I realized they are probably stale and don't represent our current images and config anymore (for example the cleanup of h2 compaction timeout increases) | 19:26 |
clarkb | The other thing I've realized is that we've got a handful of unrelated changes that all touch on the gerrit images that we might want to bundle up and apply with a single gerrit restart | 19:27 |
clarkb | There is the move to host gerrit images on quay | 19:27 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/882900 Host Gerrit images on quay.io | 19:27 |
clarkb | There is the update to remove cla stuff | 19:27 |
clarkb | (which has content baked into the image iirc) | 19:27 |
clarkb | then there are the updates to the zuul status viewer which I don't think we've deployed yet either | 19:28 |
clarkb | fungi: maybe you and I can sync up on doing quay and the cla stuff one day and do a single restart to pick up both? | 19:28 |
fungi | yeah, i expect at least a week lead time to announce merging the acl change to remove cla enforcement from all remaining projects in our gerrit | 19:29 |
clarkb | then once that is done I want to hold new nodes with that up to date configuration state and container images and dig into planning the upgrade from there | 19:29 |
corvus | correct, i don't think the status plugin is deployed, but it was tested in a system-config build. | 19:29 |
clarkb | fungi: ok let's sync up after the meeting and figure out what planning for that looks like | 19:30 |
clarkb | but once all of that is caught up I can dig into the release notes and update the plan with testing input | 19:30 |
clarkb | #link https://www.gerritcodereview.com/3.11.html | 19:30 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade | 19:30 |
clarkb | any other questions, concerns or comments in relation to the gerrit 3.10 to 3.11 upgrade planning? | 19:30 |
fungi | none on my side | 19:31 |
clarkb | #topic Upgrading old servers | 19:31 |
clarkb | we had a good sprint on this ~last week and the week prior | 19:32 |
clarkb | but since then I haven't seen any new servers (which is fine) | 19:32 |
fungi | no news on refstack | 19:32 |
clarkb | ack. Other than refstack the next "easy" server I've identified is eavesdrop | 19:32 |
clarkb | so when I find time that is the most likely next server to get replaced | 19:32 |
clarkb | #topic Trialing Matrix for OpenDev Comms | 19:34 |
clarkb | unlike gerrit upgrade prep, where I didn't dive in because I realized there were prereqs, here I just ran out of time between debugging image build and gitea stuff and the holiday | 19:35 |
clarkb | that said I should be able to write that spec this week as long as fires stay away. Having no holiday this week helps too | 19:35 |
clarkb | it's on my todo list after reviewing PBR updates and some local laptop updates I need to do | 19:35 |
clarkb | oh and some paperwork that is due today | 19:35 |
clarkb | so ya hopefully soon | 19:36 |
clarkb | #topic Working through our TODO list | 19:37 |
clarkb | And now it's time for the weekly reminder that we have a TODO list that still needs a new home | 19:37 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:37 |
clarkb | that is related enough to spec work that I should probably roll those two efforts into adjacent blocks of time | 19:37 |
clarkb | #topic Pre PTG Planning | 19:37 |
clarkb | #link https://etherpad.opendev.org/p/opendev-preptg-october-2025 Planning happening in this document | 19:37 |
clarkb | I've started to put the planning for this 'event' into its own dedicated document. | 19:38 |
clarkb | Feel free to add agenda items and add suggestions for specific timeframes and days in the etherpad. Once we've got a bit more of a schedule I'll announce this on the mailing list as well to make it more official | 19:38 |
clarkb | I can also port over still relevant items from the last event to this one. We've managed to complete a few things but the list is long and there are likely updates | 19:39 |
clarkb | #topic Open Discussion | 19:41 |
clarkb | ~August will be our next Service Coordinator election | 19:41 |
clarkb | I'm calling that out now in part to remind myself to figure out planning for that but also to encourage others to volunteer if interested. | 19:41 |
clarkb | sounds like that may be everything? | 19:44 |
clarkb | Thank you everyone for keeping OpenDev up and running | 19:45 |
fungi | thanks clarkb! | 19:45 |
corvus | thanks! | 19:45 |
clarkb | #endmeeting | 19:45 |
opendevmeet | Meeting ended Tue Jul 8 19:45:44 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:45 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.html | 19:45 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.txt | 19:45 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-08-19.00.log.html | 19:45 |