Tuesday, 2025-03-04

clarkbjust about meeting time18:59
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Mar  4 19:00:09 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5FMGNRCSUJQZYAOHNLLXQI4QT7LIG4C7/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbPTG preparation is underway. I wasn't planning on participating as OpenDev. Just wanted to make that clear. I think we've found more value in our out of band meetups19:01
clarkband that gives each of us the ability to attend other PTG activities without additional distraction19:01
clarkbAnything else to add?19:01
clarkbSounds like no. We can continue19:03
clarkb#topic Redeploying raxflex resources19:03
clarkbAs mentioned previously we need to redeploy raxflex resources to take advantage of better network MTUs and to have tenant alignment across regions19:04
clarkbthis work is underway with new mirrors being booted in the new regions. DNS has been updated for sjc3 test nodes to talk to that mirror in the other tenant19:04
clarkbWe did discover that floating ips are intended to be mandatory for externally routable IPs and the MTUs we got by default were quite large so fungi manually reduced them to 1500 at the neutron network level19:04
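A minimal sketch of the kind of MTU cap described above, applied at the neutron network level with openstacksdk; the cloud and network names are placeholders and this is not necessarily how the change was actually applied:

```python
# Hypothetical sketch: cap a neutron network's MTU at 1500 using openstacksdk.
# The cloud name ("raxflex-sjc3") and network name ("opendevzuul-network")
# are placeholders, not the real resource names.
import openstack

conn = openstack.connect(cloud="raxflex-sjc3")
net = conn.network.find_network("opendevzuul-network", ignore_missing=False)
if net.mtu and net.mtu > 1500:
    conn.network.update_network(net, mtu=1500)
    print(f"lowered MTU on {net.name} from {net.mtu} to 1500")
else:
    print(f"{net.name} MTU is already {net.mtu}, nothing to do")
```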
corvusi think i'm just assuming fungi will approve all the changes to switch when he has bandwidth19:04
clarkb#link https://review.opendev.org/q/hashtag:flex-dfw319:05
clarkband ya fungi has pushed a number of changes that need to be stepped through in a particular order to effect the migration. That link should contain all of them though I suspect many/most/all have sufficient reviews at this point19:05
corvusi don't think any zuul-launcher coordination is necessary; so i say go when ready19:05
fungisorry, trying to switch gears, openstack tc meeting is just wrapping up late19:06
clarkbcorvus: ack that was going to be my last question. In that case I agree proceed when ready fungi19:06
clarkbfungi: let me know if you have anything else to add otherwise I'll continue19:07
fungii guess just make sure i haven't missed anything (besides lingering cleanup)19:07
corvus(looks like the image upload container switch merged and worked)19:07
clarkbnothing stood out to me when I reviewed the changes19:07
clarkbI double checked that the reenablement stack ensures images go before max-servers etc and that all looked fine19:08
fungithere are a bunch of moving parts in different repos with dependencies and pauses for deploy jobs to complete before approving the next part19:08
clarkbprobably at this point the best thing is to start rolling forward and just adjust if necessary19:09
clarkbI also double checked quotas in both new tenants and the math worked out for me19:09
fungiyeah, seems we're primarily constrained by the ram quota19:10
fungiwe could increase by 50% if just the ram were raised19:10
clarkbya a good followup once everything is moved is asking if we can bump up memory quotas to meet the 50 instance quota19:11
clarkbthen we'd have 100 instances total (50 each in two regions)19:11
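Back-of-the-envelope math for that target, assuming our usual 8 GB test node flavor (an assumption; the exact flavor RAM was not stated in the meeting):

```python
# Hedged back-of-the-envelope quota math. The 8 GB-per-node figure is an
# assumption based on a typical test node flavor, not a quoted quota value.
GB_PER_NODE = 8
TARGET_NODES_PER_REGION = 50
REGIONS = 2

ram_needed_per_region = GB_PER_NODE * TARGET_NODES_PER_REGION  # 400 GB
total_nodes = TARGET_NODES_PER_REGION * REGIONS                # 100 nodes
print(f"RAM quota needed per region: {ram_needed_per_region} GB "
      f"to reach {total_nodes} nodes total")
```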
clarkb#topic Zuul-launcher image builds19:11
clarkblets keep moving we're 1/4 through our time and have more items to discuss19:12
corvusyeah most action is in previous and next topics19:12
clarkbI also wanted to ask if we should drop max-servers on the nodepool side for raxflex to give zuul-launcher more of a chance to get nodes but I think that can happen after fungi flips stuff over19:12
clarkbnot sure if you think that will be helpful based on usage patterns19:13
corvusthe quota support should help us get nodes now19:13
corvusit was actually useful to test that even19:13
corvus(that we still get nodes and not errors when we're at unexpected quota)19:13
clarkbgot it. The less than ideal state ensures we exercise our handling of that scenario (which is reasonably common in the real world so a good thing)19:14
clarkbI guess we can discuss the next item then19:14
corvusso for the occasional test like we're doing now, i think it's fine.  once we use more regularly, yeah, we should probably start to split the quota a bit.19:14
corvus++19:14
clarkb#topic Updating Flavors in OVH19:14
clarkbthe zuul-launcher effort looked into expanding into ovh then discovered we have a single flavor to use there19:14
clarkbthat flavor is an 8GB memory 8cpu 80GB disk flavor that gets scheduled to specific hardware for us and we aren't supposed to use the general purpose flavors in ovh19:15
clarkbthis started a conversation with amorin about expanding the flavor options for us as zuul-launcher is attempting to support a spread of 4gb, 8gb, and 16gb nodes19:15
clarkb#link https://etherpad.opendev.org/p/ovh-flavors19:15
corvus(and we learned bad things will happen if we do!  they can't be scheduled on the same hypervisors, so we would end up excluding ourselves from our own resources)19:15
clarkbthis document attempts to capture some of the problems and considerations with making the change19:15
clarkbya the existing flavors are using some custom scheduling parameters that don't work with a mix of flavors. They can move us to entirely new flavors that don't use those scheduling methods and we can avoid this problem. However, we have to take a downtime to do so19:16
clarkbconsidering that ovh is a fairly large chunk of our quota corvus suggested we might want to hold off on this until after the openstack release is settled19:16
corvuswhen is that?19:17
clarkbin general I'm on board with the plan. I think it gets us away from weird special case behaviors and makes the system more flexible19:17
clarkb#link https://releases.openstack.org/epoxy/schedule.html19:17
clarkbthe actual release is April 2. But things usually settle after the first rc which is March 14 ish19:18
fungifwiw, test usage should already be falling now that the freeze is done19:18
funginext week is when release candidates and stable branches appear19:18
clarkbya and after ^ we're past the major demands on the CI system. So week after next is fine? or maybe late next week19:18
corvuswould >= mar 17 be a good choice?19:19
clarkbya I think so19:19
corvussounds good, i'll let amorin know in #opendev 19:19
tonybIs there any additional logging we want to add to zuul before the stable branches appear.... To try and find where we're missing branch creation events #tangent19:19
fungii'd even go so far as to say now is probably fine, but would want to look at the graphs a bit more to confirm19:19
clarkbtonyb: I think we can follow up on that in open discussion. It is a good question19:21
clarkbAnything else on the subject of ovh flavor migrations?19:21
tonyb++19:21
clarkb#topic Running infra-prod Jobs in Parallel on Bridge19:22
clarkbThis replaces the "fix known_hosts problems on bridge" topic from last week19:22
clarkbthis is a followup to that which is basically lets get the parallel infra-prod jobs running finally19:22
corvus[i would like to engage on the branch thing in open discussion, but need a reminder of the problem and current state; so if anyone has background bandwidth to find a link or something before we get there i would find that helpful]19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94243919:23
clarkbThis is a change that ianw proposed and I've since edited to address some concerns. But the tl;dr is we can use a paused parent job that sets up the bridge to run infra-prod ansible in subsequent jobs. That paused job will hold a lock preventing any other parent job from starting (to avoid conflict across pipelines)19:23
clarkbthen we can run those child jobs in parallel. To start, I've created a new semaphore with a limit of 1 so that we can transition and observe that the current behavior is maintained. Then we can bump that semaphore limit to 2 or 5 etc and have the infra prod jobs run in parallel19:24
corvuslooks great and i see no reason not to +w it whenever you have time to check on it.19:24
clarkbya I think the main thing is to call it out now so anyone else can review it if they would like to. Then I would like to +w it tomorrow19:25
clarkbI should be able to make plenty of time to monitor it tomorrow19:25
clarkband I'm happy to answer any questions about it in the interim19:25
clarkbany questions or concerns to raise in the meeting about this change?19:25
tonybNone from me19:26
clarkb#topic Upgrading old servers19:26
clarkbcool feel free to ping me or leave review comments19:27
clarkbthat takes us to our long term topic on upgrading old servers. I don't have any tasks on this in flight at the moment. Refactoring our CD pipeline stuff took over temporarily19:27
clarkbdoes anyone else have anything to bring up on this subject?19:27
tonybOne small thing, I did discover that upgrading mediawiki beyond 1.31 will likely break openid auth19:28
fungidid they drop support for the plugin?19:28
tonybSo that essentially means that once we get the automation in place we're blocked behind keycloak updates19:29
fungimakes sense19:29
tonybfungi: not officially, but the core dropped functions and the plugin wasn't updated.19:29
clarkbwe'd still be in a better position overall than the situation today right?19:29
clarkbso this is more of a speedbump than a complete roadblock?19:30
tonybI suppose there may be scope to fix it upstream 19:30
fungii remember when i was trying to update mediawiki a few years ago, i was struggling with getting the openid plugin working and didn't find a resolution19:30
tonybclarkb: correct it was just a new surprise I discovered yesterday so I wanted to share19:30
clarkbcool just making sure I understand the implications19:31
tonyb++19:31
clarkb#topic Sprinting to Upgrade Servers to Jammy19:32
clarkbThe initial sprint got a number of servers upgraded but there is still a fairly large backlog to get through.19:32
clarkbI should be able to shift gears back to this effort next week and treat next week as another sprint19:32
clarkbwith known_hosts addressed and hopefully parallel infra-prod execution things should move a bit more quickly too19:33
fungiand what was the reason to upgrade them to jammy instead of noble at this point?19:33
clarkboh sorry thats a typo arg it used to say focal19:33
clarkbthen I "fixed" it to jammy19:33
fungiaha, makes more sense19:33
clarkb#undo19:33
opendevmeetRemoving item from minutes: #topic Sprinting to Upgrade Servers to Jammy19:33
clarkb#topic Sprinting to Upgrade Servers to Noble19:33
clarkb#link https://etherpad.opendev.org/p/opendev-server-replacement-sprint19:34
fungisince noble also gets us the ability to use podman, so we can deploy containers from quay with speculative testing19:34
clarkbif anyone else has time and bandwidth to boot a new server or three that would be appreciated too. My general thought on this process is to continue to build out a set of servers that provide us feedback before we get to gerrit19:34
clarkbyup exactly19:34
clarkbthere are benefits all around19:34
clarkbreduces our dependency on docker hub, leaves old distro releases behind and so on19:35
clarkbbut ya I don't really have anything new on this subject this week. Going to try and push forward again next week19:35
clarkbHelp is very much welcome19:35
clarkbwhich takes us to19:35
clarkb#topic Docker Hub rate limits19:35
clarkbDocker hub announced that on March 1, 2025 anonymous image pull rate limits would go from 100 per 6 hours per ipv4 address and per ipv6 /64 to 10 per hour19:36
clarkbI haven't done the dance of manually testing that their api reports those new limits yet but I've been operating under the assumption that they are in place19:36
corvus"please don't use our service"19:36
clarkbit is possible this is a good thing for us because the rate limit resets hourly rather than every 6 hours and our jobs often run for about an hour so we might get away with this19:37
clarkbbut it is a great reminder that we should do our best to mitigate and get off the service19:37
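The manual check mentioned above can be done without consuming a pull: Docker documents a ratelimitpreview/test image whose manifest HEAD request returns the current anonymous limits in response headers. A rough sketch:

```python
# Rough sketch of checking Docker Hub's anonymous pull rate limit headers,
# per Docker's documented ratelimitpreview/test method. A HEAD request does
# not count against the pull limit.
import requests

token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io",
            "scope": "repository:ratelimitpreview/test:pull"},
    timeout=30,
).json()["token"]

resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
# Expected header values look like "10;w=3600" (10 pulls per 3600 second window).
for header in ("ratelimit-limit", "ratelimit-remaining", "docker-ratelimit-source"):
    print(header, resp.headers.get(header))
```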
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94332619:37
clarkbthis change updates our use of selenium containers to fetch from mirrored images on quay19:37
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94321619:37
clarkbthis change forces docker fetches to go over ipv4 rather than ipv6 as we get more rate limit quota that way due to how ipv6 is set up in the clouds we use (many nodes can share the same /64)19:38
clarkbtonyb: on 943216 I made a note about a problem with that approach on long lived servers19:38
clarkbessentially if ip addrs in dns records change then we can orphan bad/broken and potentially even security risky values19:38
tonybYup.   I'll address that when I'm at my laptop 19:38
clarkbcool19:39
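For illustration only (not necessarily how 943216 is implemented): one plausible form of the IPv4 pinning under discussion, and why stale entries are the worry on long lived servers.

```python
# Hypothetical illustration of pinning registry-1.docker.io to its IPv4
# addresses (e.g. via /etc/hosts entries) so pulls avoid the shared-/64
# IPv6 rate limit bucket. The concern raised in review: these lines are a
# point-in-time snapshot, so if Docker Hub's A records later change, a
# long-lived server keeps talking to stale (possibly reassigned) addresses
# unless something re-resolves and rewrites them periodically.
import socket

def ipv4_hosts_lines(hostname: str) -> list[str]:
    addrs = sorted({info[4][0] for info in
                    socket.getaddrinfo(hostname, 443, family=socket.AF_INET)})
    return [f"{addr} {hostname}" for addr in addrs]

for line in ipv4_hosts_lines("registry-1.docker.io"):
    print(line)
```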
clarkbbeyond these changes and the longer term efforts we've got in place I don't really have any good answers. It's mostly a case of keep moving away as much as we possibly can19:39
clarkb#topic Running certcheck on bridge19:40
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94271919:40
clarkbthis is fungi's change to move certcheck over to bridge. I think ianw does have a good point there though which is we can run this out of a zuul job instead and avoid baking extra stuff into bridge that we don't need19:41
clarkbthe suggested job is infra-prod-letsencrypt but I don't think it even needs to be an LE job if we can generate the list of domains somehow (that letsencrypt job may be a good place because it already has the list of domains available)19:41
fungimakes sense, all this really started as a naive attempt to get rid of a github connectivity failure in a job because we install certcheck from git in order to work around the fact that the cacti server is stuck on an old ubuntu version where the certcheck package is too old to work correctly any longer19:42
fungii'm not sure i have the time or appetite to develop a certcheck notification job (i didn't really have the time to work on moving it off the cacti server, but gave it a shot)19:43
clarkbmaybe we leave that open for a week or two and see if anyone gets to it otherwise we can go ahead with what you put together already19:43
fungiturned out the job failure was actually due to a github outage, but by then i'd already written the initial change19:44
clarkbya the status quo is probably sufficient to wait a week or two19:44
fungiso went ahead and pushed it for review in case anyone found it useful19:44
clarkb#topic Working through our TODO list19:45
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:45
clarkbthis is just another friendly reminder that if you get bored or need a change of pace this list has a good set of things to dive into19:45
tonyb++19:46
clarkbI've also added at least one item there that we didn't discuss during the meetup but made sense as something to get done (the parallel infra prod jobs)19:46
clarkbso feel free to edit the list as well19:46
clarkb#topic Open Discussion19:46
clarkbI had two really quick things then lets discuss zuul branch creation behavior19:47
fungistory time wrt bulk branch creation and missed events?19:47
fungiah, yeah19:47
clarkbfirst is we updated gitea to use an external memcached for caching rather than a process internal golang hashmap. The idea is to avoid golang gc pause-the-world behavior19:47
clarkbso far my anecdotal experience is that gitea is much more consistent in its response times so that is good. but please say something if you notice abnormally long responses suddenly that may indicate additional problems or a misdiagnosed problem19:48
clarkbthen the other thing is the oftc matrix bridge may shutdown at the end of this month. Those of us relying on matrix for irc connectivity in one form or another should be prepared with some other alternative (run your own bridge, use native irc client, etc)19:48
clarkbok that was it from me19:48
clarkbI can do my best to recollect the zuul thing, but if someone else did research already you go for it19:49
fungii mostly just remember a gerrit to gitea replication issue with branches not appearing in gitea. what's the symptom you're talking about, tonyb?19:50
clarkbthe symptom tonyb refers to is the one that starlingx also hit recently19:51
clarkbbasically when projects like openstack or starlingx use an automated script to create branches across many projects those events go into zuul and zuul needs to load their configs. But for some subset of projects the config for those branches isn't loaded and changes to those branches are ignored by zuul19:51
clarkbtriggering a full tenant reload of the affected zuul tenant fixes the problem19:52
fungioh, right, that thanks19:52
tonybThat's more detail than I had.19:52
tonybBut that's the issue I was thinking of19:52
clarkbI don't think it is clear yet if gerrit is failing to emit the events correctly, if zuul is getting the events then mishandling them, or something else19:53
corvushttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-01-05.log.html that day has some talk about missed deleted events19:53
tonybI thought there was a question about the source of the problem: is Gerrit sending all the events? Is zuul missing them?19:54
corvusi believe we do log all gerrit events, so it should be possible to answer that q19:54
tonybOkay.19:54
corvusoh, i have a vague memory that maybe last time this came up, we were like 1 day past the retention time or something?19:55
corvusmaybe that was a different issue though.19:55
clarkbyes I think that was correct for the latest starlingx case19:55
tonybThat sounds familiar 19:55
clarkbthey did the creation in january then showed up in february asking why changes weren't getting tested and by then the logs were gone19:55
corvus2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: Received data from Gerrit event stream: 19:55
corvus2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh:   {'eventCreatedOn': 1741046440,19:55
corvusjust confirmed that has the raw event stream from gerrit19:56
fungiso perhaps we should proactively look for this symptom immediately after openstack stable/2025.1 branches are created19:56
clarkb++19:56
corvusso yeah, if we find out in time, we should be able to dig through that and answer the first question.  probably the next one after that too.19:56
clarkbyou should be able to query the zuul api for the valid branches for each project in zuul19:56
clarkbso create branches. Wait an hour or two (since reloading config isn't fast) then query the api for those that hit the problem. If we find at least one go back to the zuul logs and see if we got the events from gerrit and work from there?19:56
tonybYup if we have logging that'll be helpful 19:56
corvus++19:57
tonyb++19:57
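A sketch of the post-branching check described above: ask the Zuul API which branches it has configuration for in a given project and compare against expectations. The per-project endpoint's response shape is an assumption here, so treat the field access as illustrative.

```python
# Illustrative sketch: after bulk branch creation, check which branches Zuul
# believes exist for a project. Assumes the zuul-web per-project endpoint
# exposes per-branch config entries under "configs" with a "source_context";
# verify the actual response shape before relying on this.
import requests

ZUUL = "https://zuul.opendev.org/api"

def zuul_branches(tenant: str, project: str) -> set[str]:
    data = requests.get(f"{ZUUL}/tenant/{tenant}/project/{project}",
                        timeout=30).json()
    branches = set()
    for config in data.get("configs", []):
        ctx = config.get("source_context") or {}
        if ctx.get("branch"):
            branches.add(ctx["branch"])
    return branches

expected = {"master", "stable/2025.1"}  # example expectation for this cycle
seen = zuul_branches("openstack", "openstack/nova")
missing = expected - seen
print("missing from zuul config:", missing or "none")
```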
fricklerjust note the branches aren't all created at the same time, but by release patches per project (team)19:57
clarkbbut they should all be created within a week period next week?19:58
clarkbmaybe just check for missing branches on friday and call that sufficient19:58
clarkb(friday of next week I mean)19:58
tonybYup.  I believe "large" projects like OSA have hit it 19:58
fricklerthere might be exceptions but that should mostly work19:58
frickleroh, osa is one of the cycle trailing projects, so kind of special19:59
fungithe goal is not to necessarily find all incidences of the problem, but to find at least one so we can collect the relevant log data19:59
clarkbyup exactly19:59
clarkband we can reload tenant configs as the general solution otherwise19:59
tonybYeah but, when they branch.... All by their lonesome they've seen this behaviour so it isn't like we need "all of openstack" to trigger it20:00
clarkbgot it20:00
clarkband we are at time20:00
clarkbI think that is a good plan and we can take it from there20:00
clarkbfeel free to discuss further in #opendev or on the mailing list20:00
clarkbthank you everyone!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Mar  4 20:00:56 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.log.html20:00
