clarkb | I'll start the meeting in just a few minutes. Heads up that the HVAC person showed up early so if I disappear for a minute or two I'm probably answering a question or modifying HVAC settings. But I shouldn't be gone long | 18:58 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Jul 1 19:00:05 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YQ43ECZ5W6GKF4CTPVARL2U5XCQOKQCB/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | Friday is a US holiday so I expect several of us won't be around that day | 19:00 |
clarkb | I'm also going to be out the 15th-17th which means someone else will need to run the meeting on the 15th or we can skip it | 19:01 |
clarkb | I'm happy to defer that decision to everyone else since I won't be here :) | 19:01 |
fungi | i expect to be around | 19:01 |
corvus | me too | 19:01 |
fungi | happy to run meetings whenever | 19:02 |
clarkb | thanks I'll let you decide that week if you need to send a meeting agenda then fungi | 19:02 |
clarkb | Anything else to announce? | 19:02 |
fungi | openinfra summit schedule is out | 19:02 |
clarkb | #topic Zuul-launcher | 19:04 |
clarkb | I guess the big news here is that we're using zuul launcher for the majority of nodes now | 19:04 |
fungi | yay! | 19:04 |
clarkb | Be on the lookout for unexpected behaviors. I found one yesterday and corvus quickly had a patch for that just merged | 19:04 |
corvus | yep! a few nodes still supplied for images we don't have yet | 19:04 |
clarkb | (more often than expected you can get nodesets with nodes from different cloud regions) | 19:05 |
corvus | also, there are a few labels we don't have for images that we do have (like nested-virt) | 19:05 |
clarkb | #link https://review.opendev.org/c/opendev/zuul-providers/+/953269 Add Ubuntu Focal and Bionic images | 19:05 |
fungi | someone reported a cirros image file missing in a job just a little bit ago, not sure if it was on a zuul or nodepool built node though | 19:05 |
corvus | i'm hoping to restart launchers with the latest fixes today; those also include additional metrics, so i'll merge updates to the graphs then too. | 19:05 |
clarkb | fungi: ya unfortunately they didn't provide a build log to check | 19:06 |
fungi | and we did at least confirm that we're building images with that file in the expected path | 19:06 |
fungi | or configured to do so anyway | 19:06 |
corvus | fungi: for nodepool, niz, or both? | 19:07 |
fungi | both | 19:07 |
fungi | but hard to dig deeper until we get more details | 19:07 |
corvus | good... there's a lot of image metadata just in the web ui now; so looking into that should be a little bit easier | 19:08 |
corvus | but i know all of the web ui stuff still needs some work | 19:08 |
clarkb | corvus: both mnasiadka's focal+bionic change and frickler's debian trixie depend on dib changes that haven't merged. If/when they do merge are we using dib from releases or master in the image build jobs? | 19:08 |
fungi | otherwise we'll need to do another dib release | 19:09 |
corvus | releases, i think, and if that's right, that makes the depends-on testing a dangerous non-co-gating situation | 19:09 |
corvus | (i wonder if we should have the job run, but fail, if it uses dib from source) | 19:10 |
corvus | but let's double check that, since i'm not sure | 19:10 |
clarkb | corvus: ++ that seems like a good plan. or we can use it from source too | 19:10 |
clarkb | from source all the time I mean | 19:10 |
corvus | yeah | 19:10 |
clarkb | ok so to summarize continue to try and debug issues that may be related to the niz switch, support corvus in improving things via code reviews etc, review the dib changes necessary to build additional images, and then try to rollout more images | 19:11 |
clarkb | and maybe we need to toggle how we're installing dib to make things less flaky post merge | 19:11 |
corvus | ++ | 19:11 |
clarkb | #link https://review.opendev.org/c/opendev/zuul-providers/+/951471 Debian Trixie Image builds | 19:12 |
clarkb | One thing I wanted to do this morning when looking at the "cirros image is missing" claim is log into an image built by niz to double check. But the web ui only shows in use nodes and doesn't supply their IPs. It sounds like the json blob may have some of that info if anyone else needs to look it up | 19:12 |
corvus | i'm pretty sure we only install dib from source if the repo is there | 19:13 |
clarkb | I suspect improving the web ui around that sort of thing is going to happen too as we get deeper into this | 19:13 |
corvus | so easiest way to make it install from source all the time is just to add it to required-projects for the image build jobs | 19:13 |
clarkb | makes sense | 19:13 |
corvus | clarkb: yeah, i think we should also be able to show the image id we used in the web ui/json | 19:14 |
clarkb | ++ | 19:14 |
fungi | listing not-yet-used nodes in the ui could be weird, since the nodes list is tenant scoped and those nodes wouldn't be associated with a tenant yet? | 19:14 |
clarkb | I have confirmed that nodepool list doesn't seem to show you niz images. I didn't expect it to but thought maybe since they both use zk there would be enough overlap for that to magically work | 19:14 |
corvus | fungi: yeah... though we do show building nodes that are assigned to a tenant | 19:15 |
corvus | the only thing that won't show up is unassigned ready nodes (typically from min-ready, but possibly from aborted buildsets) | 19:15 |
fungi | fair, if they've got an associated node request | 19:15 |
corvus | there is a way to filter for "ready nodes that could possibly be used by this tenant"; it's a bit more complex, but i think we can/should do it. | 19:15 |
fungi | cool! | 19:16 |
clarkb | I think that would've been useful for me today so I'm ++ on doing that | 19:16 |
corvus | (so then those ready nodes would show up in multiple tenant listings, until their probability field collapses) | 19:16 |
clarkb | each unassigned node is a quantum computer | 19:17 |
fungi | nodes that are "available" to the tenant | 19:17 |
clarkb | anything else on this topic? | 19:17 |
corvus | i think that's it from me | 19:17 |
clarkb | oh mnasiadka just mentioned in #opendev that image build jobs are running out of disk | 19:18 |
clarkb | so we may need to do more optimization of the disk usage in the jobs. But details are scarce right now. Needs more characterization | 19:18 |
corvus | oof. i guess we'll followup on that in opendev | 19:18 |
corvus | we do have cacti graphs | 19:18 |
clarkb | I think this was on the image build node itself | 19:18 |
clarkb | not the launchers fwiw | 19:18 |
clarkb | but ya we can followup there later | 19:19 |
corvus | oh derp sorry | 19:19 |
clarkb | #topic Gerrit shutdown problems | 19:19 |
clarkb | Last week I finally got around to doing the "testing" of gerrit shutdown processes in production | 19:19 |
clarkb | And I think the hunch that h2 db compaction was the cause of slow gerrit shutdown was accurate. | 19:19 |
clarkb | we ran a manual kill -HUP against gerrit to rule out sigint vs sighup behavior differences and sighup produced the same slow shutdown. It ended up taking about 6-7 minutes to finally shut down | 19:20 |
clarkb | while we were waiting I ran a strace against gerrit and it was read/writing/seeking in h2 db files. And after the process completed the db files were smaller than when we started | 19:20 |
clarkb | we did the restart to apply the revert of h2 compaction so the testing was also the fix and I expect the next restart will be happy | 19:21 |
clarkb | fungi: you helped out with ^ anything else to add? | 19:21 |
fungi | nope, just that there was good solid evidence to support your guess | 19:21 |
fungi | looking forward to smooter restarts in the future | 19:22 |
fungi | smoother | 19:22 |
clarkb | so ya good news I think we've corrected this which means I can return to figuring out a gerrit 3.11 upgrade | 19:22 |
clarkb | #topic Gerrit 3.11 Upgrade Planning | 19:22 |
clarkb | however, I haven't started on this since hopefully fixing gerrit restarts | 19:22 |
clarkb | #link https://www.gerritcodereview.com/3.11.html | 19:22 |
clarkb | reading over the release notes is still helpful if you haven't done it yet | 19:22 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.11 Planning Document for the eventual Upgrade | 19:22 |
clarkb | and you can put any thoughts or notes in that etherpad. I'd like to pick this up again either this week or next so hopefully there will be proper updates soon | 19:23 |
clarkb | #topic Upgrading old servers | 19:23 |
clarkb | There are updates on this topic! | 19:23 |
fungi | lots | 19:24 |
clarkb | late last week and over the weekend corvus replaced all of the core zuul servers with noble nodes | 19:24 |
clarkb | (schedulers, mergers, executors, launchers) | 19:24 |
corvus | i did self-approve some pro-forma changes. hope that's okay. :) | 19:24 |
fungi | perfectly | 19:24 |
clarkb | then I followed that work up by replacing the zookeeper nodes behind zuul and nodepool yesterday | 19:25 |
corvus | one thing that went wrong with the zuul upgrade: the load balancer | 19:25 |
clarkb | corvus: the problem is that we hardcode IP addrs in the proxy config right? | 19:25 |
corvus | entirely my fault, because i forgot 2 things: 1) we moved the config files; and 2) the backend server config is explicit | 19:25 |
clarkb | I feel like we do that because then we aren't reliant on DNS for that to work. Maybe we should just let haproxy do dns lookups? | 19:26 |
corvus | i cleaned up our old config file locations on both load balancers (zuul and gitea), so hopefully that won't bite anyone else :) | 19:26 |
corvus | yeah, i'm kind of thinking that just letting dns or ansible write that automatically might be okay | 19:26 |
clarkb | I'm on board with that. Worst case we notice flakiness on the frontend and can revert | 19:27 |
clarkb | then tackle it some other way | 19:27 |
corvus | automatic config is... uh... what i was expecting to happen... and i think our process for replacing servers would work with that. | 19:27 |
corvus | probably just need to still be able to take a server in and out easily, but just not specify its ip address. | 19:28 |
clarkb | ya we do that with zookeeper actually | 19:28 |
clarkb | the servers are all listed with explicit IPs but ansible figures out what they are for us and puts them in the config file | 19:28 |
corvus | i like that approach | 19:29 |
clarkb | and it does so via looking at hosts in the zookeeper ansible group | 19:29 |
corvus | sounds like consensus to change that. that was the only followup from my weekend | 19:29 |
tonyb | Makes sense to me | 19:29 |
clarkb | zookeeper replacement went smoothly. I did have one unexpected election behavior but afterwards I thought it through and the behavior makes sense to me after the fact | 19:30 |
clarkb | #link https://review.opendev.org/c/opendev/zone-opendev.org/+/953844 Remove old zk servers from DNS | 19:30 |
clarkb | after lunch today I plan to approve ^ unless there are objections to remove the old zk servers from DNS. Then I'll plan to delete the old zk servers after or tomorrow morning (again if there are no objections) | 19:30 |
clarkb | corvus already acked doing ^ so let me know otherwise I'm planning to proceed with cleanup | 19:31 |
corvus | clarkb: one thought: now that we've shown we can do that particular upgrade process, we should make sure not to copy those questionable notes about upgrades from the etherpad. | 19:31 |
corvus | or revise them or whatever | 19:31 |
clarkb | corvus: ya I did put a note about them possibly being FUD in the etherpad already but I should go ahead and delete them or cross them out and mark them invalid | 19:31 |
corvus | ++ | 19:32 |
clarkb | now that we have all these nodes running on noble with docker compose instead of docker-compose we can clean up their docker-compose.yaml files | 19:32 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/953846 | 19:32 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/953848 | 19:32 |
clarkb | it was writing these changes that helped discover the nodes from multiple clouds behavior | 19:32 |
clarkb | those aren't urgent but it might be nice to get rid of the warnings | 19:34 |
corvus | looking forward to that :) | 19:34 |
clarkb | I also swapped out mirror-update servers not sure if we discussed that previously. | 19:34 |
clarkb | Eavesdrop and refstack are the "easy" nodes I have remaining on the easy list | 19:34 |
clarkb | fungi: I know we've been busy with plenty of other stuff but any word on refstack cleanup? | 19:34 |
fungi | nothing | 19:37 |
clarkb | ok. The list is still quite big but we're slowly whittling it down. Thanks for the help and happy to have more | 19:37 |
clarkb | #topic OFTC Matrix bridge no longer supporting new users | 19:38 |
clarkb | I have an action item to go write a spec for this I just haven't gotten to it yet. Maybe I should do that before I start looking at gerrit 3.11 | 19:38 |
clarkb | I guess there isn't anything new on this until I do that | 19:40 |
clarkb | #topic Working through our TODO list | 19:40 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:40 |
clarkb | I also need to migrate this list to something a bit more permanent/better | 19:40 |
clarkb | but a friendly reminder that the list exists if you get bored :) | 19:41 |
clarkb | #topic Pre PTG Planning | 19:43 |
clarkb | I haven't heard any additional feedback on the week of October 6-10 for holding an opendev pre ptg | 19:43 |
clarkb | I think we should all pencil in those dates and I'll start on an announcement and an agenda that we can fill in before then | 19:44 |
clarkb | ok no more feedback is good feedback I'll proceed with that as the plan for now | 19:47 |
clarkb | #topic Open Discussion | 19:47 |
clarkb | Anything else? | 19:47 |
fungi | i got nothin' | 19:51 |
clarkb | in that case thanks everyone for your time here and elsewhere keeping opendev up and running | 19:52 |
clarkb | we'll be back next week at the same time and location | 19:52 |
clarkb | #endmeeting | 19:52 |
opendevmeet | Meeting ended Tue Jul 1 19:52:22 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:52 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-01-19.00.html | 19:52 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-01-19.00.txt | 19:52 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-07-01-19.00.log.html | 19:52 |
fungi | thanks clarkb! | 19:54 |
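
Editor's note on the load balancer discussion (19:25-19:29): the idea of having automation write the haproxy backend server list from a host group, the way clarkb describes for zookeeper, could look roughly like the sketch below. This is only an illustration in plain Python, not the actual system-config Ansible templates. The function name `render_backend`, the backend name, the hostnames, and the IP addresses are all hypothetical.

```python
from typing import Mapping


def render_backend(name: str, hosts: Mapping[str, str], port: int) -> str:
    """Render an haproxy backend stanza with one `server` line per host.

    `hosts` maps a hostname to the IP address that automation looked up
    for it, so nobody has to hardcode addresses in the config by hand.
    """
    lines = [f"backend {name}"]
    # Sort so the generated config is stable between runs.
    for hostname in sorted(hosts):
        lines.append(f"    server {hostname} {hosts[hostname]}:{port} check")
    return "\n".join(lines)


# Hypothetical inventory group; real values would come from the inventory.
inventory = {
    "zuul02.example.org": "203.0.113.11",
    "zuul01.example.org": "203.0.113.10",
}

config = render_backend("balance_zuul_https", inventory, 443)
print(config)
```

Replacing a server then only means changing group membership and re-rendering; taking a server out of rotation temporarily would still need its own mechanism, as corvus notes at 19:28.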
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!