Tuesday, 2024-12-10

08:38 *** diablo_rojo_phone is now known as Guest2639
19:00 <clarkb> meeting time
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Dec 10 19:00:31 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YTAWQCPLRMRJARKVH4XZR3LXF5DJ4IHQ/ Our Agenda
19:00 <clarkb> #topic Announcements
19:00 <clarkb> tonyb mentioned not being here today for the meeting due to travel
19:01 <clarkb> Anything else to announce?
19:02 <clarkb> sounds like now
19:02 <clarkb> *no
19:02 <clarkb> #topic Zuul-launcher image builds
19:03 <clarkb> corvus: anything new here? Curious if api improvements have made it into zuul yet
19:04 <clarkb> Not sure if corvus is around right now. We can get back to this later if that changes
19:04 <clarkb> #topic Backup Server Pruning
19:05 <clarkb> We retired backups via the new automation and fungi reran the prune script, both to check that it works post retirement and because the server was complaining about being near full again. Good news is that got us down to 68% utilization (previous low water mark was ~77% iirc)
19:05 <clarkb> a definite improvement
19:06 <fungi> most thanks for that go to ianw and you, i just pushed some buttons
19:06 <clarkb> The next step is to purge these retired backups
19:06 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/937040 Purge retired backups on vexxhost backup server
19:06 <fungi> i should be able to take care of that after the meeting, once i grab some food
19:06 <clarkb> and once that has landed and purged things we should rerun the manual prune again just to be sure that the prune script is still happy and also to get a new low water mark
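
(For context, a minimal sketch of the kind of pruning borg supports; the retention flags and repository path here are illustrative assumptions, not the actual prune script used on the backup servers.)

    # prune old borg archives according to a retention policy (flags are illustrative)
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /opt/backups/example-server
    # borg 1.2 and newer needs an explicit compact to actually free the disk space
    borg compact /opt/backups/example-server
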
19:06 <clarkb> fungi: thanks!
19:07 <clarkb> it's worth noting that so far we have only done the retirements and purges on one of two backup servers. Once we're happy with this entire process on one server we can consider how we may want to apply it to the other server
19:07 <clarkb> it's just far less urgent to do on the other as it has more disk space, and doing it one at a time is a nice safety mechanism to avoid overdeleting things
19:07 <clarkb> Any questions or concerns on backup pruning?
19:09 <fungi> none on my end
19:09 <clarkb> #topic Upgrading old servers
19:09 <clarkb> tonyb isn't here but I did have some updates on my end
19:09 <clarkb> Now that the gerrit upgrade is mostly behind us (we still have some changes to land and test, more on that later) the next big gerrit related item on my todo list is upgrading the server
19:10 <clarkb> I suspect the easiest way to do this will be to boot a new server with a new cinder volume and then sync everything over to it. Then schedule a downtime where we can do a final sync and update DNS
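
(As a rough illustration of that sync-then-cutover approach; the hostname and data path below are assumptions for the sketch, not the actual review server layout.)

    # initial bulk copy while the old server stays in service (illustrative host/path)
    rsync -aHAX --info=progress2 /home/gerrit2/ review-new.opendev.org:/home/gerrit2/
    # during the downtime: stop gerrit, run a final delta sync, then repoint DNS
    rsync -aHAX --delete /home/gerrit2/ review-new.opendev.org:/home/gerrit2/
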
19:10 <clarkb> any concerns with that approach or preferences for another method?
19:12 <fungi> in the past we've given advance notice of the new ip address due to companies baking 29418/tcp into firewall policies
19:12 <clarkb> good point. Though I'm not sure if we did that for the move to vexxhost
19:12 <fungi> no idea if that's a tradition worth continuing to observe
19:13 <corvus> o/
19:13 <clarkb> we can probably do some notice since we're taking a downtime anyway. Not sure it will be the month of notice we've had in the past
19:13 <frickler> do we want to stay on vexxhost? or maybe move to raxflex due to IPv6 issues?
19:13 <fungi> i'd be okay with not worrying about it
19:13 <fungi> not worrying about the advance notice i mean
19:14 <clarkb> frickler: considering raxflex once lost all of the data on our cinder volumes I am extremely wary of doing that
19:14 <frickler> although raxflex doesn't have v6 yet, either, I remember now
19:14 <fungi> frickler: rackspace flex currently has worse ipv6 issues, in that it has no ipv6
19:14 <clarkb> I think if we were upgrading in a year the consideration would be different
19:14 <fungi> yeah, that
19:14 <clarkb> and I'm not sure what flavor sizes are available there
19:14 <clarkb> I don't think we can get away with a much smaller gerrit node (mostly for memory)
19:14 <fungi> also flex folks say it's still under some churn before reaching general availability, so i've been hesitant to suggest moving any control plane systems there so soon
19:15 <corvus> i'm happy with gerrit and friends staying in rax classic for now.
19:15 <clarkb> corvus: it's in vexxhost fwiw
19:16 <clarkb> I can take a survey of ipv6 availability and flavor sizes and operational concerns in each cloud before booting anything though
19:16 <fungi> vexxhost's v6 routes in bgp have been getting filtered/dropped by some isps, hence frickler's concern
19:16 <corvus> oh i do remember that now...  :)
19:16 <corvus> what's the rax classic ipv6 situation?
19:16 <fungi> quite good
19:16 <clarkb> rax classic ipv6 generally works except for sometimes not between rax classic nodes (this is why ansible is set to use ipv4)
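
(For illustration, one common way to force ansible to connect over v4 is to pin ansible_host to the IPv4 address in the inventory; this is a generic sketch with a documentation address, not necessarily how the opendev inventory actually does it.)

    # inventory/hosts.yaml (hypothetical entry)
    all:
      hosts:
        example01.opendev.org:
          ansible_host: 203.0.113.10   # connect via IPv4 even though the host also has AAAA records
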
19:16 <clarkb> I don't think we've seen any connectivity problems to the outside world, only within the cloud
19:17 <corvus> maybe a move back to rax classic would be a good idea then?
19:17 <fungi> yeah, they've had some quo filtering issues in the past with v6 between regions as well, though not presently to my knowledge
19:17 <fungi> s/quo/qos/
19:18 <clarkb> possibly, the issues with xen not reliably working with noble probably need better understanding before committing to that
19:18 <clarkb> also not sure how big the big flavors are, would have to check that
19:18 <clarkb> (from memory they got pretty big when we were doing elasticsearch though)
19:18 <clarkb> which is the other related topic. I'd like to target noble for this new node so I may take a detour first and get noble with docker compose and podman working on a simpler service
19:19 <clarkb> and then we can make a decision on the exact location :)
19:19 <corvus> sounds like our choices are likely: a) aging hardware; b) unreliable v6; c) cloud could disappear at any time
19:20 <corvus> oh -- are the different vexxhost regions any different in regards to v6?
19:20 <frickler> d) semi-regular issues with vouchers expiring ;)
19:20 <corvus> maybe gerrit is in ymq and maybe sjc could be better?
19:20 <clarkb> corvus: that is a good question. frickler do you have problems with ipv6 reachability for opendev.org / gitea?
19:20 <clarkb> corvus: ya ymq for gerrit. gitea is in sjc1
19:21 <fungi> wrt v6 in sjc1 vs ca-ymq-1, probably the same since they're part of one larger allocation, but we'd have to do some long-term testing to know
19:21 <frickler> I remember more issues with gitea, but I need to check
19:21 <corvus> fungi: yeah, though those regions are so different otherwise i wouldn't take it as a given
19:21 <frickler> 2604:e100:1::/48 vs. 2604:e100:3::/48
19:22 <frickler> I think they were mostly affected similarly
19:22 <corvus> ack. oh well.  :|
19:22 <clarkb> so anyway I think the roundabout process here is sorting out docker/compose/podman on noble, update a service like paste, concurrently survey hosting options, then we decide where to host gerrit and proceed with it on noble
19:22 <corvus> clarkb: ++
19:23 <fungi> all of 2604:e100 seems to be assigned to vexxhost, fwiw
19:23 <clarkb> I'm probably going to start with the container stuff as I expect that to require more time and effort
19:23 <clarkb> if others want to do a more proper hosting survey I'd welcome that, otherwise I'll poke at it as time allows
19:23 <clarkb> Any other questions/concerns/etc?
19:24 <fungi> 2604:e100::/32 is a singular allocation according to arin's whois
19:24 <clarkb> #topic Docker compose plugin with podman service for servers
19:25 <clarkb> No new updates on this other than that I'm likely going to start poking at it and may have updates in the future
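
(A minimal sketch of the approach under discussion, assuming the stock podman packages on noble; podman.socket is the standard unit name, but the exact setup that eventually lands may differ.)

    # enable podman's docker-compatible API socket
    sudo systemctl enable --now podman.socket
    # point the docker compose plugin (and docker cli) at that socket
    export DOCKER_HOST=unix:///run/podman/podman.sock
    docker compose up -d
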
19:25 <clarkb> #topic Docker Hub Rate Limits
19:25 <clarkb> Sort of an update on this. I noticed yesterday that kolla ran a lot of jobs (like enough to have every single one of our nodes dedicated to kolla I think)
19:26 <clarkb> for about 6-8 hours after that point docker hub rate limits were super common
19:26 <clarkb> this has me wondering if at least some of the change we've observed in rate limit behavior is related to usage changes on our end
19:26 <clarkb> though it could just be coincidental
19:26 <fungi> some of our providers have limited address pools that are dedicated to us and get recycled frequently
19:27 <clarkb> if anyone knows of potentially suspicious changes though, digging into those might help us better understand the rate limits and make this more reliable while we sort out alternatives
19:27 <clarkb> fungi: yup though I was seeing this in rax-ord/rax-iad too which I don't think have the same super limited ip pools
19:27 <clarkb> but maybe they are more limited than I expect
19:28 <frickler> kolla tried to get rid of using dockerhub, though I'm not sure whether that succeeded to 100%
19:28 <fungi> well, they may not allocate a lot of unused addresses, so deleting and replacing servers is still likely to cycle through the smaller number of available addresses
19:28 <clarkb> as for alternatives https://review.opendev.org/c/zuul/zuul-jobs/+/935574 has the review approvals it needs but has struggled to merge
19:28 <clarkb> fungi: good point
19:28 <clarkb> corvus: to land 935574 I suspect approving it during our evening time is going to be least contended for rate limit quota
19:29 <corvus> clarkb: ack
19:29 <clarkb> corvus: maybe we should just try approving it now and if it fails again approve it again this evening and try to get it in?
19:29 <clarkb> s/try approving/try rechecking/
19:29 <corvus> i just reapproved it now, but let's try to do that at the end of day too
19:29 <clarkb> ++
19:29 <clarkb> then for opendev (and probably zuul) mirroring the python images (both ours and the parents) and the mariadb images are likely to have the biggest impact
19:30 <clarkb> everything else in opendev is like a single container image that we're often pulling from the intermediate registry anyway
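
(For illustration, mirroring an upstream image into another registry is typically a one-line copy; the destination org here is a made-up placeholder, not an actual opendev mirror location.)

    # copy the upstream mariadb image from docker hub into a hypothetical quay.io mirror org
    skopeo copy docker://docker.io/library/mariadb:10.11 docker://quay.io/example-mirror/mariadb:10.11
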
19:31 * clarkb scribbled a note to follow that change and get it landed
19:31 <clarkb> we can follow up once that happens. Anything else on this topic?
19:31 <fungi> i can keep an eye on it and recheck too if it fails
19:32 <clarkb> #topic Gerrit 3.10 Upgrade
19:32 <clarkb> The upgrade is complete \o/
19:32 <clarkb> and I think it went pretty smoothly
19:32 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:33 <clarkb> this document still has a few todos on it though most of them have been captured into changes under topic:upgrade-gerrit-3.10
19:33 <corvus> thanks!
19:33 <fungi> sorry my isp decided to experience a fiber cut 10 minutes into the maintenance
19:33 <clarkb> essentially it's: update some test jobs to test the right version of gerrit, drop gerrit 3.9 image builds once we feel confident we won't need to revert, then add 3.11 image builds and 3.10 -> 3.11 upgrade testing
19:33 <corvus> fungi: heck of a time to start gardening
19:34 <clarkb> reviews on those are super appreciated. There is one gotcha in that the change to add 3.11 image builds will also rebuild our 3.10 image because I had to modify files to update acls in testing for 3.11. The side effect of this is we'll pull in my openid redirect fix from upstream when that change lands
19:34 <clarkb> I want to test the behavior of that change against 3.10 (prior testing was only against 3.9) before we merge it so that we can confidently restart gerrit on the rebuilt 3.10 image post merge
19:34 <clarkb> all that to say, if you want to defer to me on approving things that is fine. But reviews are still helpful
19:35 <clarkb> my goal is to test the openid redirect stuff this afternoon then tomorrow morning land changes and restart gerrit quickly to pick up the update
19:35 <fungi> sounds great
19:36 <fungi> i also need to look into what's needed so we can update git-review testing
19:36 <clarkb> thank you to everyone who helped with the upgrade. I'm really happy with how it went
19:36 <clarkb> any gerrit 3.10 concerns before we move on?
19:37 <clarkb> #topic Diskimage Builder Testing and Release
19:38 <clarkb> we have a request to make a dib release to pick up gentoo element fixes
19:38 <clarkb> unfortunately the openeuler tests are failing consistently after our mirror updates and the distro doesn't seem to have a fallback package location. fungi pushed a change to disable the tests which I have approved
19:38 <clarkb> getting those tests out of the way should allow us to land a couple of other bug fixes (one for svc-map and another for centos 9 installing grub when using efi)
19:39 <clarkb> and then I think we can push a tag for a new release
19:39 <clarkb> I also noticed that our openeuler image builds in nodepool are failing, likely due to a similar issue, and we need to pause them
19:39 <clarkb> do we know who to reach out to in openeuler land? I feel like this is a good "we're resetting things and you should be involved in that" point
19:40 <clarkb> as I don't think we'll get fixes into dib or our nodepool install otherwise
19:40 <fungi> i'm assuming the proposer of the change for the openeuler mirror update is someone who could speak for their community
19:41 <clarkb> good point. I wonder if they read email :)
19:41 <clarkb> I've made some notes to try and track them down and point them at what needs updating
19:42 <fungi> we could cc them on a change pausing the image builds, they did cc some of us directly for reviews on the mirror update change
19:42 <clarkb> I like that idea
19:42 <clarkb> fungi: did you want to push that up?
19:43 <fungi> i can, sure
19:43 <fungi> i'll mention the dib change from earlier today too
19:43 <clarkb> also your dib fix got a -1 testing noble and I'm guessing that's related to the broken noble package update that jayf just discovered
19:43 <JayF> yep
19:43 <clarkb> they pushed a kernel update but didn't push an updated kernel modules package
19:43 <clarkb> (or maybe the other way around)
19:43 <clarkb> fungi: thanks!
19:43 <JayF> that's exactly it clarkb
19:43 <fungi> well, sure. why would they? (distro'ing is hard work)
19:43 <clarkb> #topic Rax-ord Noble Nodes with 1 VCPU
19:43 <JayF> and then I tried to report the bug and got told if I needed commercial support to go buy it
19:44 <fungi> bwahahaha
19:44 <JayF> you can imagine how long I stayed in that channel
19:44 <clarkb> oh I thought you said they confirmed it
19:44 <clarkb> that is an unfortunate response
19:44 <JayF> they confirmed it but were like "huh, weird, whatever"
19:44 <clarkb> twice in as many months we've had jobs with weird failures due to having only 1 vcpu. The jobs all ran noble on rax ord iirc
19:44 <JayF> and I pushed them to acknowledge it was broken for everyone
19:45 <clarkb> JayF: ack
19:45 <JayF> that's what got that response, when I really just wanted someone to point me to the LP project to complain to :D
19:45 <clarkb> on closer inspection of the zuul collected debugging info the ansible facts show a clear difference in xen version between noble nodes with 8vcpu and those with 1vcpu
19:45 <fungi> people with actual support contracts will be pushing them to fix it soon enough anyway
19:45 <JayF> I assumed they, like I would if reported for Ironic, would want to unbreak their users ASAP
19:46 <clarkb> my best hunch at this point is that this is not a bug in nodepool using the wrong flavor or a bug in openstack using the wrong flavor (part of this assumption is other aspects of the flavor like disk and memory totals check out) and instead a bug in xen or noble or both that prevents the VMs from seeing all of the cpus
19:47 <clarkb> we pinged a couple of raxflex folks about it but they both indicated that the xen stuff is outside of their purview. I suspect this means the best path forward here is to file a support ticket
19:48 <clarkb> open to other ideas for getting this sorted out. I worry that a ticket might get dismissed as not having enough info or something
19:48 <fungi> we do have some code already that rejects nodes very early in jobs so they get retried
19:49 <frickler> likely watching whether we can gather more data might be the only option
19:49 <fungi> in base-jobs i think
19:49 <frickler> oh, also https://review.opendev.org/c/zuul/zuul-jobs/+/937376?usp=dashboard
19:50 <frickler> to at least have a bit more evidence possibly
19:50 <clarkb> oh ya we could have some check that was along the lines of: if rax and vcpu == 1 then fail
19:50 <clarkb> maybe we start with the data collection change and then add ^ and if this problem happens more than once a month we escalate
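
(A minimal sketch of that kind of check as an ansible task in a pre-run validation role; ansible_processor_vcpus is a standard ansible fact, but the exact placement and condition are assumptions, not the actual validate-host implementation.)

    - name: Fail early on nodes that booted with too few CPUs so zuul retries the job elsewhere
      fail:
        msg: "Node only sees {{ ansible_processor_vcpus }} VCPU(s); expected at least 2"
      when:
        - ansible_processor_vcpus | int < 2
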
19:51 <fungi> pretty sure there are similar checks in the validate-host role, i just need to remember where it's defined
19:52 <frickler> fungi: my patch is also within that role
19:52 <fungi> aha, yep
19:53 <clarkb> I just found the logs showing that frickler's update works. I'll approve it now
19:53 <clarkb> #topic Open Discussion
19:54 <clarkb> Anything else before our hour is up?
19:54 <corvus> no updates on the niz image stuff (as expected); i'm hoping to get back to that soonish
19:55 <clarkb> A reminder that we'll have our last weekly meeting of the year next week and then we'll be back January 7
19:55 <fungi> same bat time, same bat channel
19:56 <clarkb> sounds like that is everything. Thank you for your time and all your help running OpenDev
19:56 <clarkb> #endmeeting
19:56 <opendevmeet> Meeting ended Tue Dec 10 19:56:56 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
19:56 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.html
19:56 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.txt
19:56 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.log.html
19:57 <fungi> thanks clarkb!
