clarkb | just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Mar 4 19:00:09 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/5FMGNRCSUJQZYAOHNLLXQI4QT7LIG4C7/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | PTG preparation is underway. I wasn't planning on participating as OpenDev. Just wanted to make that clear. I think we've found more value in our out of band meetups | 19:01 |
clarkb | and that gives each of us the ability to attend other PTG activities without additional distraction | 19:01 |
clarkb | Anything else to add? | 19:01 |
clarkb | Sounds like no. We can continue | 19:03 |
clarkb | #topic Redeploying raxflex resources | 19:03 |
clarkb | As mentioned previously we need to redeploy raxflex resources to take advantage of better network MTUs and to have tenant alignment across regions | 19:04 |
clarkb | this work is underway with new mirrors being booted in the new regions. DNS has been updated for sjc3 test nodes to talk to that mirror in the other tenant | 19:04 |
clarkb | We did discover that floating ips are intended to be mandatory for externally routable IPs and the MTUs we got by default were quite large so fungi manually reduced them to 1500 at the neutron network level | 19:04 |
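For reference, a minimal sketch of that kind of MTU cap using openstacksdk; the cloud and network names below are placeholders rather than the actual raxflex values, and the real change was made by fungi directly against neutron.

```python
# Hedged sketch: cap a tenant network's MTU at 1500 via openstacksdk.
# "raxflex-sjc3" and "opendev-ci-net" are placeholder names, not the real ones.
import openstack

conn = openstack.connect(cloud="raxflex-sjc3")
net = conn.network.find_network("opendev-ci-net")
# Lowering the MTU at the neutron network level keeps guest interfaces at a
# conventional 1500 instead of the oversized default the provider hands out.
conn.network.update_network(net, mtu=1500)
```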
corvus | i think i'm just assuming fungi will approve all the changes to switch when he has bandwidth | 19:04 |
clarkb | #link https://review.opendev.org/q/hashtag:flex-dfw3 | 19:05 |
clarkb | and ya fungi has pushed a number of changes that need to be stepped through in a particular order to effect the migration. That link should contain all of them though I suspect many/most/all have sufficient reviews at this point | 19:05 |
corvus | i don't think any zuul-launcher coordination is necessary; so i say go when ready | 19:05 |
fungi | sorry, trying to switch gears, openstack tc meeting is just wrapping up late | 19:06 |
clarkb | corvus: ack that was going to be my last question. In that case I agree proceed when ready fungi | 19:06 |
clarkb | fungi: let me know if you have anything else to add otherwise I'll continue | 19:07 |
fungi | i guess just make sure i haven't missed anything (besides lingering cleanup) | 19:07 |
corvus | (looks like the image upload container switch merged and worked) | 19:07 |
clarkb | nothing stood out to me when I reviewed the changes | 19:07 |
clarkb | I double checked that the reenablement stack ensures images go before max-servers etc and that all looked fine | 19:08 |
fungi | there are a bunch of moving parts in different repos with dependencies and pauses for deploy jobs to complete before approving the next part | 19:08 |
clarkb | probably at this point the best thing is to start rolling forward and just adjust if necessary | 19:09 |
clarkb | I also double checked quotas in both new tenants and the math worked out for me | 19:09 |
fungi | yeah, seems we're primarily constrained by the ram quota | 19:10 |
fungi | we could increase by 50% if just the ram were raised | 19:10 |
clarkb | ya a good followup once everything is moved is asking if we can bump up memory quotas to meet the 50 instance quota | 19:11 |
clarkb | then we'd have 100 instances total (50 each in two regions) | 19:11 |
clarkb | #topic Zuul-launcher image builds | 19:11 |
clarkb | lets keep moving we're 1/4 through our time and have more items to discuss | 19:12 |
corvus | yeah most action is in previous and next topics | 19:12 |
clarkb | I also wanted to ask if we should drop max-servers on the nodepool side for raxflex to give zuul-launcher more of a chance to get nodes but I think that can happen after fungi flips stuff over | 19:12 |
clarkb | not sure if you think that will be helpful based on usage patterns | 19:13 |
corvus | the quota support should help us get nodes now | 19:13 |
corvus | it was actually useful to test that even | 19:13 |
corvus | (that we still get nodes and not errors when we're at unexpected quota) | 19:13 |
clarkb | got it. The less than ideal state ensures we exercise our handling of that scenario (which is reasonably common in the real world so a good thing) | 19:14 |
clarkb | I guess we can discuss the next item then | 19:14 |
corvus | so for the occasional test like we're doing now, i think it's fine. once we use it more regularly, yeah, we should probably start to split the quota a bit. | 19:14 |
corvus | ++ | 19:14 |
clarkb | #topic Updating Flavors in OVH | 19:14 |
clarkb | the zuul-launcher effort looked into expanding into ovh then discovered we have a single flavor to use there | 19:14 |
clarkb | that flavor is an 8GB memory 8cpu 80GB disk flavor that gets scheduled to specific hardware for us and we aren't supposed to use the general purpose flavors in ovh | 19:15 |
clarkb | this started a conversation with amorin about expanding the flavor options for us as zuul-launcher is attempting to support a spread of 4gb, 8gb, and 16gb nodes | 19:15 |
clarkb | #link https://etherpad.opendev.org/p/ovh-flavors | 19:15 |
corvus | (and we learned bad things will happen if we do! they can't be scheduled on the same hypervisors, so we would end up excluding ourselves from our own resources) | 19:15 |
clarkb | this document attempts to capture some of the problems and considerations with making the change | 19:15 |
clarkb | ya the existing flavors are using some custom scheduling parameters that don't work with a mix of flavors. They can move us to entirely new flavors that don't use those scheduling methods and we can avoid this problem. However, we have to take a downtime to do so | 19:16 |
clarkb | considering that ovh is a fairly large chunk of our quota corvus suggested we might want to hold off on this until after the openstack release is settled | 19:16 |
corvus | when is that? | 19:17 |
clarkb | in general I'm on board with the plan. I think it gets us away from weird special case behaviors and makes the system more flexible | 19:17 |
clarkb | #link https://releases.openstack.org/epoxy/schedule.html | 19:17 |
clarkb | the actual release is April 2. But things usually settle after the first rc which is March 14 ish | 19:18 |
fungi | fwiw, test usage should already be falling now that the freeze is done | 19:18 |
fungi | next week is when release candidates and stable branches appear | 19:18 |
clarkb | ya and after ^ we're past the major demands on the CI system. So week after next is fine? or maybe late next week | 19:18 |
corvus | would >= mar 17 be a good choice? | 19:19 |
clarkb | ya I think so | 19:19 |
corvus | sounds good, i'll let amorin know in #opendev | 19:19 |
tonyb | Is there any additional logging we want to add to zuul before the stable branches appear.... To try and find where we're missing branch creation events #tangent | 19:19 |
fungi | i'd even go so far as to say now is probably fine, but would want to look at the graphs a bit more to confirm | 19:19 |
clarkb | tonyb: I think we can followup on that in open discussion .it is a good question | 19:21 |
clarkb | Anything else on the subject of ovh flavor migrations? | 19:21 |
tonyb | ++ | 19:21 |
clarkb | #topic Running infra-prod Jobs in Parallel on Bridge | 19:22 |
clarkb | This replaces the "fix known_hosts problems on bridge" topic from last week | 19:22 |
clarkb | this is a followup to that which is basically lets get the parallel infra-prod jobs running finally | 19:22 |
corvus | [i would like to engage on the branch thing in open discussion, but need a reminder of the problem and current state; so if anyone has background bandwidth to find a link or something before we get there i would find that helpful] | 19:23 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942439 | 19:23 |
clarkb | This is a change that ianw proposed and I've since edited to address some concerns. But the tl;dr is we can use a paused parent job that sets up the bridge to run infra-prod ansible in subsequent jobs. That paused job will hold a lock preventing any other parent job from starting (to avoid conflict across pipelines) | 19:23 |
clarkb | then we can run those child jobs in parallel. To start I've created a new semaphore with a limit of 1 so that we can transition and observe the current behavior is maintained. Then we can bump that semaphore limit to 2 or 5 etc and have the infra prod jobs run in parallel | 19:24 |
corvus | looks great and i see no reason not to +w it whenever you have time to check on it. | 19:24 |
clarkb | ya I think the main thing is to call it out now so anyone else can review it if they would like to. Then I would like to +w it tomorrow | 19:25 |
clarkb | I should be able to make plenty of time to monitor it tomorrow | 19:25 |
clarkb | and I'm happy to answer any questions about it in the interim | 19:25 |
clarkb | any questions or concerns to raise in the meeting about this change? | 19:25 |
tonyb | None from me | 19:26 |
clarkb | #topic Upgrading old servers | 19:26 |
clarkb | cool feel free to ping me or leave review comments | 19:27 |
clarkb | that takes us to our long term topic on upgrading old servers. I don't have any tasks on this inflight at the moment. Refactoring our CD pipeline stuff took over temporarily | 19:27 |
clarkb | does anyone else have anything to bring up on this subject? | 19:27 |
tonyb | One small thing, I did discover that upgrading mediawiki beyond 1.31 will likely break openid auth | 19:28 |
fungi | did they drop support for the plugin? | 19:28 |
tonyb | So that essentially means that once we get the automation on support we're blocked behind keycloak updates | 19:29 |
fungi | makes sense | 19:29 |
tonyb | fungi: not officially, but the core dropped functions and the plugin wasn't updated. | 19:29 |
clarkb | we'd still be in a better position overall than the situation today right? | 19:29 |
clarkb | so this is more of a speedbump than a complete roadblock? | 19:30 |
tonyb | I suppose there may be scope to fix it upstream | 19:30 |
fungi | i remember when i was trying to update mediawiki a few years ago, i was struggling with getting the openid plugin working and didn't find a resolution | 19:30 |
tonyb | clarkb: correct it was just a new surprise I discovered yesterday so I wanted to share | 19:30 |
clarkb | cool just making sure I understand the implications | 19:31 |
tonyb | ++ | 19:31 |
clarkb | #topic Sprinting to Upgrade Servers to Jammy | 19:32 |
clarkb | The initial sprint got a number of servers upgraded but there is still a fairly large backlog to get through. | 19:32 |
clarkb | I should be able to shift gears back to this effort next week and treat next week as another sprint | 19:32 |
clarkb | with known_hosts addressed and hopefully parallel infra-prod execution things should move a bit more quickly too | 19:33 |
fungi | and what was the reason to upgrade them to jammy instead of noble at this point? | 19:33 |
clarkb | oh sorry thats a typo arg it used to say focal | 19:33 |
clarkb | then I "fixed" it to jammy | 19:33 |
fungi | aha, makes more sense | 19:33 |
clarkb | #undo | 19:33 |
opendevmeet | Removing item from minutes: #topic Sprinting to Upgrade Servers to Jammy | 19:33 |
clarkb | #topic Sprinting to Upgrade Servers to Noble | 19:33 |
clarkb | #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint | 19:34 |
fungi | since noble also gets us the ability to use podman, so we can deploy containers from quay with speculative testing | 19:34 |
clarkb | if anyone else has time and bandwidth to boot a new server or three that would be appreciated too. My general thought on this process is to continue to build out a set of servers that provide us feedback before we get to gerrit | 19:34 |
clarkb | yup exactly | 19:34 |
clarkb | there are benefits all around | 19:34 |
clarkb | reduces our dependency on docker hub, leaves old distro releases behind and so on | 19:35 |
clarkb | but ya I don't really have anything new on this subject this week. Going to try and push forward again next week | 19:35 |
clarkb | Help is very much welcome | 19:35 |
clarkb | which takes us to | 19:35 |
clarkb | #topic Docker Hub rate limits | 19:35 |
clarkb | Docker hub announced that on March 1, 2025 anonymous image pull rate limits would go from 100 per 6 hours per ipv4 address and per ipv6 /64 to 10 per hour | 19:36 |
clarkb | I haven't done the dance of manually testing that their api reports those new limits yet but I've been operating under the assumption that they are in place | 19:36 |
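For reference, a minimal sketch of that "dance" using Docker Hub's documented ratelimitpreview repository; the header names and the note that HEAD requests don't consume quota follow Docker's published docs and may change on their end.

```python
# Hedged sketch: read Docker Hub's anonymous pull rate limit headers.
import requests

# Anonymous token scoped to the dedicated rate-limit test repository.
token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io",
            "scope": "repository:ratelimitpreview/test:pull"},
).json()["token"]

# Per Docker's docs, a HEAD request returns the limit headers without
# counting against the quota itself.
resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
)
print("limit:", resp.headers.get("ratelimit-limit"))
print("remaining:", resp.headers.get("ratelimit-remaining"))
```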
corvus | "please don't use our service" | 19:36 |
clarkb | it is possible this is a good thing for us because the rate limit resets hourly rather than every 6 hours and our jobs often run for about an hour so we might get away with this | 19:37 |
clarkb | but it is a great reminder that we should do our best to mitigate and get off the service | 19:37 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/943326 | 19:37 |
clarkb | this change updates our use of selenium containers to fetch from mirrored images on quay | 19:37 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/943216 | 19:37 |
clarkb | this change forces docker fetches to go over ipv4 rather than ipv6 as we get more rate limit quota that way due to how ipv6 is set up in the clouds we use (many nodes can share the same /64) | 19:38 |
clarkb | tonyb: on 943216 I made a note about a problem with that approach on long lived servers | 19:38 |
clarkb | essentially if ip addrs in dns records change then we can orphan bad/broken and potentially even security risky values | 19:38 |
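Purely to illustrate the technique and its failure mode, not as a description of what 943216 actually does: one way to force IPv4 is to pin the current A records for the registry endpoints, which is exactly where stale pins become a problem on long-lived servers. The hostnames below are examples.

```python
# Hedged sketch: resolve only the IPv4 A records for Docker Hub endpoints
# and emit /etc/hosts-style pins. The host list is an example, and this is
# an assumption about the general approach, not the actual change.
import socket

HOSTS = ["registry-1.docker.io", "auth.docker.io"]

for host in HOSTS:
    # AF_INET restricts resolution to IPv4 addresses.
    addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443, socket.AF_INET)})
    for addr in addrs:
        print(f"{addr} {host}")

# Caveat from the discussion: if these records later change upstream, the
# pinned entries become stale (broken or even security-risky) unless
# something refreshes them.
```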
tonyb | Yup. I'll address that when I'm at my laptop | 19:38 |
clarkb | cool | 19:39 |
clarkb | beyond these changes and the longer term efforts we've got in place I don't really have any good answers. It's mostly a case of moving away as much as we possibly can | 19:39 |
clarkb | #topic Running certcheck on bridge | 19:40 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942719 | 19:40 |
clarkb | this is fungi's change to move certcheck over to bridge. I think ianw does have a good point there though which is we can run this out of a zuul job instead and avoid baking extra stuff into bridge that we don't need | 19:41 |
clarkb | the suggested job is infra-prod-letsencrypt but I don't think it even needs to be an LE job if we can generate the list of domains somehow (that letsencrypt job may be a good place because it already has the list of domains available) | 19:41 |
fungi | makes sense, all this really started as a naive attempt to get rid of a github connectivity failure in a job because we install certcheck from git in order to work around the fact that the cacti server is stuck on an old ubuntu version where the certcheck package is too old to work correctly any longer | 19:42 |
fungi | i'm not sure i have the time or appetite to develop a certcheck notification job (i didn't really have the time to work on moving it off the cacti server, but gave it a shot) | 19:43 |
clarkb | maybe we leave that open for a week or two and see if anyone gets to it otherwise we can go ahead with what you put together already | 19:43 |
fungi | turned out the job failure was actually due to a github outage, but by then i'd already written the initial change | 19:44 |
clarkb | ya the status quo is probably sufficient to wait a week or two | 19:44 |
fungi | so went ahead and pushed it for review in case anyone found it useful | 19:44 |
clarkb | #topic Working through our TODO list | 19:45 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:45 |
clarkb | this is just another friendly reminder that if you get bored or need a change of pace this list has a good set of things to dive into | 19:45 |
tonyb | ++ | 19:46 |
clarkb | I've also added at least one item there that we didn't discuss during the meetup but made sense as something to get done (the parallel infra prod jobs) | 19:46 |
clarkb | so feel free to edit the list as well | 19:46 |
clarkb | #topic Open Discussion | 19:46 |
clarkb | I had two really quick things then lets discuss zuul branch creation behavior | 19:47 |
fungi | story time wrt bulk branch creation and missed events? | 19:47 |
fungi | ah, yeah | 19:47 |
clarkb | first is we updated gitea to use an external memcached for caching rather than a process internal golang hashmap. The idea is to avoid golang gc pause the world behavior | 19:47 |
clarkb | so far my anecdotal experience is that gitea is much more consistent in its response times so that is good. but please say something if you notice abnormally long responses suddenly that may indicate additional problems or a misdiagnosed problem | 19:48 |
clarkb | then the other thing is the oftc matrix bridge may shutdown at the end of this month. Those of us relying on matrix for irc connectivity in one form or another should be prepared with some other alternative (run your own bridge, use native irc client, etc) | 19:48 |
clarkb | ok that was it from me | 19:48 |
clarkb | I can do my best to recollect the zuul thing, but if someone else did research already you go for it | 19:49 |
fungi | i mostly just remember a gerrit to gitea replication issue with branches not appearing in gitea. what's the symptom you're talking about, tonyb? | 19:50 |
clarkb | the symptom tonyb refers to is the one that starlingx also hit recently | 19:51 |
clarkb | basically when projects like openstack or starlingx use an automated script to create branches across many projects those events go into zuul and zuul needs to load their configs. But for some subset of projects the config for those branches isn't loaded and changes to those branches are ignored by zuul | 19:51 |
clarkb | triggering a full tenant reload of the affected zuul tenant fixes the problem | 19:52 |
fungi | oh, right, that thanks | 19:52 |
tonyb | That's more detail than I had. | 19:52 |
tonyb | But that's the issue I was thinking of | 19:52 |
clarkb | I don't think it is clear yet if gerrit is failing to emit the events correctly, if zuul is getting the events then mishandling them, or something else | 19:53 |
corvus | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-01-05.log.html that day has some talk about missed deleted events | 19:53 |
tonyb | I thought there was a question about the source of the problem is Gerrit sending all the events? Is zuul missing them? | 19:54 |
corvus | i believe we do log all gerrit events, so it should be possible to answer that q | 19:54 |
tonyb | Okay. | 19:54 |
corvus | oh, i have a vague memory that maybe last time this came up, we were like 1 day past the retention time or something? | 19:55 |
corvus | maybe that was a different issue though. | 19:55 |
clarkb | yes I think that was correct for the latest starlingx case | 19:55 |
tonyb | That sounds familiar | 19:55 |
clarkb | they did the creation in january then showed up in february asking why changes weren't getting tested and by then the logs were gone | 19:55 |
corvus | 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: Received data from Gerrit event stream: | 19:55 |
corvus | 2025-03-04 00:00:40,918 DEBUG zuul.GerritConnection.ssh: {'eventCreatedOn': 1741046440, | 19:55 |
corvus | just confirmed that has the raw event stream from gerrit | 19:56 |
fungi | so perhaps we should proactively look for this symptom immediately after openstack stable/2025.1 branches are created | 19:56 |
clarkb | ++ | 19:56 |
corvus | so yeah, if we find out in time, we should be able to dig through that and answer the first question. probably the next one after that too. | 19:56 |
clarkb | you should be able to query the zuul api for the valid branches for each project in zuul | 19:56 |
clarkb | so create branches. Wait an hour or two (since reloading config isn't fast) then query the api for those that hit the problem. If we find at least one go back to the zuul logs and see if we got the events from gerrit and work from there? | 19:56 |
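A rough sketch of that check, comparing Gerrit's branch list against what zuul reports for a project; the Gerrit endpoint is documented, but the shape of zuul's per-project response (the "configs"/branch fields) is an assumption to verify against the live API.

```python
# Hedged sketch: find branches Gerrit has that zuul hasn't loaded config for.
import json
import requests
from urllib.parse import quote

project = "openstack/nova"  # example project, not one known to be affected

# Gerrit's documented branch listing; responses carry a ")]}'" XSSI prefix.
text = requests.get(
    f"https://review.opendev.org/projects/{quote(project, safe='')}/branches/"
).text
gerrit_branches = {
    b["ref"].removeprefix("refs/heads/")
    for b in json.loads(text.split("\n", 1)[1])
    if b["ref"].startswith("refs/heads/")
}

# Zuul's per-project endpoint; the assumption is that its "configs" entries
# name the branches zuul actually loaded configuration from. Adjust the
# field lookups to whatever the live API returns.
data = requests.get(
    f"https://zuul.opendev.org/api/tenant/openstack/project/{project}"
).json()
zuul_branches = {
    c.get("branch") or c.get("source_context", {}).get("branch")
    for c in data.get("configs", [])
}

print("missing from zuul:", sorted(gerrit_branches - zuul_branches))
```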
tonyb | Yup if we have logging that'll be helpful | 19:56 |
corvus | ++ | 19:57 |
tonyb | ++ | 19:57 |
frickler | just note the branches aren't all created at the same time, but by release patches per project (team) | 19:57 |
clarkb | but they should all be created within a week period next week? | 19:58 |
clarkb | maybe just check for missing branches on friday and call that sufficient | 19:58 |
clarkb | (friday of next week I mean) | 19:58 |
tonyb | Yup. I believe "large" projects like OSA have hit it | 19:58 |
frickler | there might be exceptions but that should mostly work | 19:58 |
frickler | oh, osa is one of the cycle trailing projects, so kind of special | 19:59 |
fungi | the goal is not to necessarily find all incidences of the problem, but to find at least one so we can collect the relevant log data | 19:59 |
clarkb | yup exactly | 19:59 |
clarkb | and we can reload tenant configs as the general solution otherwise | 19:59 |
tonyb | Yeah but, when they branch.... all by their lonesome they've seen this behaviour so it isn't like we need "all of openstack" to trigger it | 20:00 |
clarkb | got it | 20:00 |
clarkb | and we are at time | 20:00 |
clarkb | I think that is a good plan and we can take it from there | 20:00 |
clarkb | feel free to discuss further in #opendev or on the mailing list | 20:00 |
clarkb | thank you everyone! | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Mar 4 20:00:56 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-03-04-19.00.log.html | 20:00 |