clarkb | just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Mar 4 19:00:09 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5FMGNRCSUJQZYAOHNLLXQI4QT7LIG4C7/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | PTG preparation is underway. I wasn't planning on participating as OpenDev. Just wanted to make that clear. I think we've found more value in our out of band meetups | 19:01 |
clarkb | and that gives each of us the ability to attend other PTG activities without additional distraction | 19:01 |
clarkb | Anything else to add? | 19:01 |
clarkb | Sounds like no. We can continue | 19:03 |
clarkb | #topic Redeploying raxflex resources | 19:03 |
clarkb | As mentioned previously we need to redeploy raxflex resources to take advantage of better network MTUs and to have tenant alignment across regions | 19:04 |
clarkb | this work is underway with new mirrors being booted in the new regions. DNS has been updated for sjc3 test nodes to talk to that mirror in the other tenant | 19:04 |
clarkb | We did discover that floating ips are intended to be mandatory for externally routable IPs and the MTUs we got by default were quite large so fungi manually reduced them to 1500 at the neutron network level | 19:04 |
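For reference, a minimal sketch of that kind of MTU cap using openstacksdk; the cloud and network names below are placeholders rather than the actual raxflex values, and the real change was made by fungi directly against neutron.

```python
# Hedged sketch: cap a tenant network's MTU at 1500 via openstacksdk.
# "raxflex-sjc3" and "opendev-ci-net" are placeholder names, not the real ones.
import openstack

conn = openstack.connect(cloud="raxflex-sjc3")
net = conn.network.find_network("opendev-ci-net")
# Lowering the MTU at the neutron network level keeps guest interfaces at a
# conventional 1500 instead of the oversized default the provider hands out.
conn.network.update_network(net, mtu=1500)
```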
corvus | i think i'm just assuming fungi will approve all the changes to switch when he has bandwidth | 19:04 |
clarkb | #link https://review.opendev.org/q/hashtag:flex-dfw3 | 19:05 |
clarkb | and ya fungi has pushed a number of changes that need to be stepped through in a particular order to effect the migration. That link should contain all of them though I suspect many/most/all have sufficient reviews at this point | 19:05 |
corvus | i don't think any zuul-launcher coordination is necessary; so i say go when ready | 19:05 |
fungi | sorry, trying to switch gears, openstack tc meeting is just wrapping up late | 19:06 |
clarkb | corvus: ack that was going to be my last question. In that case I agree proceed when ready fungi | 19:06 |
clarkb | fungi: let me know if you have anything else to add otherwise I'll continue | 19:07 |
fungi | i guess just make sure i haven't missed anything (besides lingering cleanup) | 19:07 |
corvus | (looks like the image upload container switch merged and worked) | 19:07 |
clarkb | nothing stood out to me when I reviewed the changes | 19:07 |
clarkb | I double checked that the reenablement stack ensures images go before max-servers etc and that all looked fine | 19:08 |
fungi | there are a bunch of moving parts in different repos with dependencies and pauses for deploy jobs to complete before approving the next part | 19:08 |
clarkb | probably at this point the best thing is to start rolling forward and just adjust if necessary | 19:09 |
clarkb | I also double checked quotas in both new tenants and the math worked out for me | 19:09 |
fungi | yeah, seems we're primarily constrained by the ram quota | 19:10 |
fungi | we could increase by 50% if just the ram were raised | 19:10 |
clarkb | ya a good followup once everything is moved is asking if we can bump up memory quotas to meet the 50 instance quota | 19:11 |
clarkb | then we'd have 100 instances total (50 each in two regions) | 19:11 |
clarkb | #topic Zuul-launcher image builds | 19:11 |
clarkb | lets keep moving we're 1/4 through our time and have more items to discuss | 19:12 |
corvus | yeah most action is in previous and next topics | 19:12 |
clarkb | I also wanted to ask if we should drop max-servers on the nodepool side for raxflex to give zuul-launcher more of a chance to get nodes but I think that can happen after fungi flips stuff over | 19:12 |
clarkb | not sure if you think that will be helpful based on usage patterns | 19:13 |
corvus | the quota support should help us get nodes now | 19:13 |
corvus | it was actually useful to test that even | 19:13 |
corvus | (that we still get nodes and not errors when we're at unexpected quota) | 19:13 |
clarkb | got it. The less than ideal state ensures we exercise our handling of that scenario (which is reasonably common in the real world so a good thing) | 19:14 |
clarkb | I guess we can discuss the next item then | 19:14 |
corvus | so for the occasional test like we're doing now, i think it's fine. once we use it more regularly, yeah, we should probably start to split the quota a bit. | 19:14 |
corvus | ++ | 19:14 |
clarkb | #topic Updating Flavors in OVH | 19:14 |
clarkb | the zuul-launcher effort looked into expanding into ovh then discovered we have a single flavor to use there | 19:14 |
clarkb | that flavor is an 8GB memory 8cpu 80GB disk flavor that gets scheduled to specific hardware for us and we aren't supposed to use the general purpose flavors in ovh | 19:15 |
clarkb | this started a conversation with amorin about expanding the flavor options for us as zuul-launcher is attempting to support a spread of 4gb, 8gb, and 16gb nodes | 19:15 |
clarkb | #link https://etherpad.opendev.org/p/ovh-flavors | 19:15 |
corvus | (and we learned bad things will happen if we do! they can't be scheduled on the same hypervisors, so we would end up excluding ourselves from our own resources) | 19:15 |
clarkb | this document attempts to capture some of the problems and considerations with making the change | 19:15 |
clarkb | ya the existing flavors are using some custom scheduling parameters that don't work with a mix of flavors. They can move us to entirely new flavors that don't use those scheduling methods and we can avoid this problem. However, we have to take a downtime to do so | 19:16 |
clarkb | considering that ovh is a fairly large chunk of our quota corvus suggested we might want to hold off on this until after the openstack release is settled | 19:16 |
corvus | when is that? | 19:17 |
clarkb | in general I'm on board with the plan. I think it gets us away from weird special case behaviors and makes the system more flexible | 19:17 |
clarkb | #link https://releases.openstack.org/epoxy/schedule.html | 19:17 |
clarkb | the actual release is April 2. But things usually settle after the first rc which is March 14 ish | 19:18 |
fungi | fwiw, test usage should already be falling now that the freeze is done | 19:18 |
fungi | next week is when release candidates and stable branches appear | 19:18 |
clarkb | ya and after ^ we're past the major demands on the CI system. So week after next is fine? or maybe late next week | 19:18 |
corvus | would >= mar 17 be a good choice? | 19:19 |
clarkb | ya I think so | 19:19 |
corvus | sounds good, i'll let amorin know in #opendev | 19:19 |
tonyb | Is there any additional logging we want to add to zuul before the stable branches appear.... To try and find where we're missing branch creation events #tangent | 19:19 |
fungi | i'd even go so far as to say now is probably fine, but would want to look at the graphs a bit more to confirm | 19:19 |
clarkb | tonyb: I think we can followup on that in open discussion .it is a good question | 19:21 |
clarkb | Anything else on the subject of ovh flavor migrations? | 19:21 |
tonyb | ++ | 19:21 |
clarkb | #topic Running infra-prod Jobs in Parallel on Bridge | 19:22 |
clarkb | This replaces the "fix known_hosts problems on bridge" topic from last week | 19:22 |
clarkb | this is a followup to that which is basically lets get the parallel infra-prod jobs running finally | 19:22 |
corvus | [i would like to engage on the branch thing in open discussion, but need a reminder of the problem and current state; so if anyone has background bandwidth to find a link or something before we get there i would find that helpful] | 19:23 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942439 | 19:23 |
clarkb | This is a change that ianw proposed and I've since edited to address some concerns. But the tl;dr is we can use a paused parent job that sets up the bridge to run infra-prod ansible in subsequent jobs. That paused job will hold a lock preventing any other parent job from starting (to avoid conflict across pipelines) | 19:23 |
clarkb | then we can run those child jobs in parallel. To start I've created a new semaphore with a limit of 1 so that we can transition and observe the current behavior is maintained. Then we can bump that semaphore limit to 2 or 5 etc and have the infra prod jobs run in parallel | 19:24 |
corvus | looks great and i see no reason not to +w it whenever you have time to check on it. | 19:24 |
clarkb | ya I think the main thing is to call it out now so anyone else can review it if they would like to. Then I would like to +w it tomorrow | 19:25 |
clarkb | I should be able to make plenty of time to monitor it tomorrow | 19:25 |
clarkb | and I'm happy to answer any questions about it in the interim | 19:25 |
clarkb | any questions or concerns to raise in the meeting about this change? | 19:25 |
tonyb | None from me | 19:26 |
clarkb | #topic Upgrading old servers | 19:26 |
clarkb | cool feel free to ping me or leave review comments | 19:27 |
clarkb | that takes us to our long term topic on upgrading old servers. I don't have any tasks on this inflight at the moment. Refactoring our CD pipeline stuff took over temporarily | 19:27 |
clarkb | does anyone else have anything to bring up on this subject? | 19:27 |
tonyb | One small thing, I did discover that upgrading mediawiki beyond 1.31 will likely break openid auth | 19:28 |
fungi | did they drop support for the plugin? | 19:28 |
tonyb | So that essentially means that once we get the automation on support we're blocked behind keycloak updates | 19:29 |
fungi | makes sense | 19:29 |
tonyb | fungi: not officially, but the core dropped functions and the plugin wasn't updated. | 19:29 |
clarkb | we'd still be in a better position overall than the situation today right? | 19:29 |
clarkb | so this is more of a speedbump than a complete roadblock? | 19:30 |
tonyb | I suppose there may be scope to fix it upstream | 19:30 |
fungi | i remember when i was trying to update mediawiki a few years ago, i was struggling with getting the openid plugin working and didn't find a resolution | 19:30 |
tonyb | clarkb: correct it was just a new surprise I discovered yesterday so I wanted to share | 19:30 |
clarkb | cool just making sure I understand the implications | 19:31 |
tonyb | ++ | 19:31 |
clarkb | #topic Sprinting to Upgrade Servers to Jammy | 19:32 |
clarkb | The initial sprint got a number of servers upgraded but there is still a fairly large backlog to get through. | 19:32 |
clarkb | I should be able to shift gears back to this effort next week and treat next week as another sprint | 19:32 |
clarkb | with known_hosts addressed and hopefully parallel infra-prod execution things should move a bit more quickly too | 19:33 |
fungi | and what was the reason to upgrade them to jammy instead of noble at this point? | 19:33 |
clarkb | oh sorry thats a typo arg it used to say focal | 19:33 |
clarkb | then I "fixed" it to jammy | 19:33 |
fungi | aha, makes more sense | 19:33 |
clarkb | #undo | 19:33 |
opendevmeet | Removing item from minutes: #topic Sprinting to Upgrade Servers to Jammy | 19:33 |
clarkb | #topic Sprinting to Upgrade Servers to Noble | 19:33 |
clarkb | #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint | 19:34 |
fungi | since noble also gets us the ability to use podman, so we can deploy containers from quay with speculative testing | 19:34 |
clarkb | if anyone else has time and bandwidth to boot a new server or three that would be appreciated too. My general thought on this process is to continue to build out a set of servers that provide us feedback before we get to gerrit | 19:34 |
clarkb | yup exactly | 19:34 |
clarkb | there are benefits all around | 19:34 |
clarkb | reduces our dependency on docker hub, leaves old distro releases behind and so on | 19:35 |
clarkb | but ya I don't really have anything new on this subject this week. Going to try and push forward again next week | 19:35 |
clarkb | Help is very much welcome | 19:35 |
clarkb | which takes us to | 19:35 |
clarkb | #topic Docker Hub rate limits | 19:35 |
clarkb | Docker hub announced that on March 1, 2025 anonymous image pull rate limits would go from 100 per 6 hours per ipv4 address and per ipv6 /64 to 10 per hour | 19:36 |
clarkb | I haven't done the dance of manually testing that their api reports those new limits yet but I've been operating under the assumption that they are in place | 19:36 |
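For reference, a minimal sketch of that "dance" using Docker Hub's documented ratelimitpreview repository; the header names and the note that HEAD requests don't consume quota follow Docker's published docs and may change on their end.

```python
# Hedged sketch: read Docker Hub's anonymous pull rate limit headers.
import requests

# Anonymous token scoped to the dedicated rate-limit test repository.
token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io",
            "scope": "repository:ratelimitpreview/test:pull"},
).json()["token"]

# Per Docker's docs, a HEAD request returns the limit headers without
# counting against the quota itself.
resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
)
print("limit:", resp.headers.get("ratelimit-limit"))
print("remaining:", resp.headers.get("ratelimit-remaining"))
```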
corvus | "please don't use our service" | 19:36 |
clarkb | it is possible this is a good thing for us because the rate limit resets hourly rather than every 6 hours and our jobs often run for about an hour so we might get away with this | 19:37 |
clarkb | but it is a great reminder that we should do our best to mitigate and get off the service | 19:37 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/943326 | 19:37 |
clarkb | this change updates our use of selenium containers to fetch from mirrored images on quay | 19:37 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/943216 | 19:37 |
clarkb | this change forces docker fetches to go over ipv4 rather than ipv6 as we get more rate limit quota that way due to how ipv6 is set up in the clouds we use (many nodes can share the same /64) | 19:38 |
clarkb | tonyb: on 943216 I made a note about a problem with that approach on long lived servers | 19:38 |
clarkb | essentially if ip addrs in dns records change then we can orphan bad/broken and potentially even security risky values | 19:38 |
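Purely to illustrate the technique and its failure mode, not as a description of what 943216 actually does: one way to force IPv4 is to pin the current A records for the registry endpoints, which is exactly where stale pins become a problem on long-lived servers. The hostnames below are examples.

```python
# Hedged sketch: resolve only the IPv4 A records for Docker Hub endpoints
# and emit /etc/hosts-style pins. The host list is an example, and this is
# an assumption about the general approach, not the actual change.
import socket

HOSTS = ["registry-1.docker.io", "auth.docker.io"]

for host in HOSTS:
    # AF_INET restricts resolution to IPv4 addresses.
    addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443, socket.AF_INET)})
    for addr in addrs:
        print(f"{addr} {host}")

# Caveat from the discussion: if these records later change upstream, the
# pinned entries become stale (broken or even security-risky) unless
# something refreshes them.
```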
tonyb | Yup. I'll address that when I'm at my laptop | 19:38 |
clarkb | cool | 19:39 |
clarkb | beyond these changes and the longer term efforts we've got in place I don't really have any good answers. It's mostly a case of moving away as much as we possibly can | 19:39 |
clarkb | #topic Running certcheck on bridge | 19:40 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942719 | 19:40 |
clarkb | this is fungi's change to move certcheck over to bridge. I think ianw does have a good point there though which is we can run this out of a zuul job instead and avoid baking extra stuff into bridge that we don't need | 19:41 |
clarkb | the suggested job is infra-prod-letsencrypt but I don't think it even needs to be an LE job if we can generate the list of domains somehow (that letsencrypt job may be a good place because it already has the list of domains available) | 19:41 |
fungi | makes sense, all this really started as a naive attempt to get rid of a github connectivity failure in a job because we install certcheck from git in order to work around the fact that the cacti server is stuck on an old ubuntu version where the certcheck package is too old to work correctly any longer | 19:42 |
fungi | i'm not sure i have the time or appetite to develop a certcheck notification job (i didn't really have the time to work on moving it off the cacti server, but gave it a shot) | 19:43 |
clarkb | maybe we leave that open for a week or two and see if anyone gets to it otherwise we can go ahead with what you put together already | 19:43 |
fungi | turned out the job failure was actually due to a github outage, but by then i'd already written the initial change | 19:44 |
clarkb | ya the status quo is probably sufficient to wait a week or two | 19:44 |
fungi | so went ahead and pushed it for review in case anyone found it useful | 19:44 |
clarkb | #topic Working through our TODO list | 19:45 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:45 |
clarkb | this is just another friendly reminder that if you get bored or need a change of pace this list has a good set of things to dive into | 19:45 |
tonyb | ++ | 19:46 |
clarkb | I've also added at least one item there that we didn't discuss during the meetup but made sense as something to get done (the parallel infra prod jobs) | 19:46 |
clarkb | so feel free to edit the list as well | 19:46 |
clarkb | #topic Open Discussion | 19:46 |
clarkb | I had two really quick things then lets discuss zuul branch creation behavior | 19:47 |
fungi | story time wrt bulk branch creation and missed events? | 19:47 |
fungi | ah, yeah | 19:47 |
clarkb | first is we updated gitea to use an external memcached for caching rather than a process internal golang hashmap. The idea is to avoid golang gc pause the world behavior | 19:47 |
clarkb | so far my anecdotal experience is that gitea is much more consistent in its response times so that is good. but please say something if you notice abnormally long responses suddenly that may indicate additional problems or a misdiagnosed problem | 19:48 |
clarkb | then the other thing is the oftc matrix bridge may shutdown at the end of this month. Those of us relying on matrix for irc connectivity in one form or another should be prepared with some other alternative (run your own bridge, use native irc client, etc) | 19:48 |
clarkb | ok that was it from me | 19:48 |
clarkb | I can do my best to recollect the zuul thing, but if someone else did research already you go for it | 19:49 |
fungi | i mostly just remember a gerrit to gitea replication issue with branches not appearing in gitea. what's the symptom you're talking about, tonyb? | 19:50 |
clarkb | the symptom tonyb refers to is the one that starlingx also hit recently | 19:51 |
clarkb | basically when projects like openstack or starlingx use an automated script to create branches across many projects those events go into zuul and zuul needs to load their configs. But for some subset of projects the config for those branches isn't loaded and changes to those branches are ignored by zuul | 19:51 |
clarkb | triggering a full tenant reload of the affected zuul tenant fixes the problem | 19:52 |
fungi | oh, right, that thanks | 19:52 |
tonyb | That's more detail than I had. | 19:52 |
tonyb | But that's the issue I was thinking of | 19:52 |
clarkb | I don't think it is clear yet if gerrit is failing to emit the events correctly, if zuul is getting the events then mishandling them, or something else | 19:53 |
corvus | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-01-05.log.html that day has some talk about missed deleted events | 19:53 |
tonyb | I thought there was a question about the source of the problem is Gerrit sending all the events? Is zuul missing them? | 19:54 |
corvus | i believe we do log all gerrit events, so it should be possible to answer that q | 19:54 |
tonyb | Okay. | 19:54 |
corvus | oh, i have a vague memory that maybe last time this came up, we were like 1 day past the retention time or something? | 19:55 |
corvus | maybe that was a different issue though. | 19:55 |
clarkb | yes I think that was correct for the latest starlingx case | 19:55 |
tonyb | That sounds familiar | 19:55 |
clarkb | they did the creation in january then showed up in february asking why changes weren't getting tested and by then the logs were gone | 19:55 |
corvus | 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: Received data from Gerrit event stream: | 19:55 |
corvus | 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: {'eventCreatedOn': 1741046440, | 19:55 |
corvus | just confirmed that has the raw event stream from gerrit | 19:56 |
fungi | so perhaps we should proactively look for this symptom immediately after openstack stable/2025.1 branches are created | 19:56 |
clarkb | ++ | 19:56 |
corvus | so yeah, if we find out in time, we should be able to dig through that and answer the first question. probably the next one after that too. | 19:56 |
clarkb | you should be able to query the zuul api for the valid branches for each project in zuul | 19:56 |
clarkb | so create branches. Wait an hour or two (since reloading config isn't fast) then query the api for those that hit the problem. If we find at least one go back to the zuul logs and see if we got the events from gerrit and work from there? | 19:56 |
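A rough sketch of that check, comparing Gerrit's branch list against what zuul reports for a project; the Gerrit endpoint is documented, but the shape of zuul's per-project response (the "configs"/branch fields) is an assumption to verify against the live API.

```python
# Hedged sketch: find branches Gerrit has that zuul hasn't loaded config for.
import json
import requests
from urllib.parse import quote

project = "openstack/nova"  # example project, not one known to be affected

# Gerrit's documented branch listing; responses carry a ")]}'" XSSI prefix.
text = requests.get(
    f"https://review.opendev.org/projects/{quote(project, safe='')}/branches/"
).text
gerrit_branches = {
    b["ref"].removeprefix("refs/heads/")
    for b in json.loads(text.split("\n", 1)[1])
    if b["ref"].startswith("refs/heads/")
}

# Zuul's per-project endpoint; the assumption is that its "configs" entries
# name the branches zuul actually loaded configuration from. Adjust the
# field lookups to whatever the live API returns.
data = requests.get(
    f"https://zuul.opendev.org/api/tenant/openstack/project/{project}"
).json()
zuul_branches = {
    c.get("branch") or c.get("source_context", {}).get("branch")
    for c in data.get("configs", [])
}

print("missing from zuul:", sorted(gerrit_branches - zuul_branches))
```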
tonyb | Yup if we have logging that'll be helpful | 19:56 |
corvus | ++ | 19:57 |
tonyb | ++ | 19:57 |
frickler | just note the branches aren't all created at the same time, but by release patches per project (team) | 19:57 |
clarkb | but they should all be created within a week period next week? | 19:58 |
clarkb | maybe just check for missing branches on friday and call that sufficient | 19:58 |
clarkb | (friday of next week I mean) | 19:58 |
tonyb | Yup. I believe "large" projects like OSA have hit it | 19:58 |
frickler | there might be exceptions but that should mostly work | 19:58 |
frickler | oh, osa is one of the cycle trailing projects, so kind of special | 19:59 |
fungi | the goal is not to necessarily find all incidences of the problem, but to find at least one so we can collect the relevant log data | 19:59 |
clarkb | yup exactly | 19:59 |
clarkb | and we can reload tenant configs as the general solution otherwise | 19:59 |
tonyb | Yeah but, when they branch.... all by their lonesome they've seen this behaviour so it isn't like we need "all of openstack" to trigger it | 20:00 |
clarkb | got it | 20:00 |
clarkb | and we are at time | 20:00 |
clarkb | I think that is a good plan and we can take it from there | 20:00 |
clarkb | feel free to discuss further in #opendev or on the mailing list | 20:00 |
clarkb | thank you everyone! | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Mar 4 20:00:56 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.log.html | 20:00 |