Tuesday, 2024-12-10

08:38 *** diablo_rojo_phone is now known as Guest2639
19:00 <clarkb> meeting time
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Dec 10 19:00:31 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YTAWQCPLRMRJARKVH4XZR3LXF5DJ4IHQ/ Our Agenda
19:00 <clarkb> #topic Announcements
19:00 <clarkb> tonyb mentioned not being here today for the meeting due to travel
19:01 <clarkb> Anything else to announce?
19:02 <clarkb> sounds like now
19:02 <clarkb> *no
19:02 <clarkb> #topic Zuul-launcher image builds
19:03 <clarkb> corvus: anything new here? Curious if api improvements have made it into zuul yet
19:04 <clarkb> Not sure if corvus is around right now. We can get back to this later if that changes
19:04 <clarkb> #topic Backup Server Pruning
19:05 <clarkb> We retired backups via the new automation and fungi reran the prune script, both to check that it works post retirement and because the server was complaining about being near full again. Good news is that got us down to 68% utilization (previous low water mark was ~77% iirc)
19:05 <clarkb> a definite improvement
19:06 <fungi> most thanks for that go to ianw and you, i just pushed some buttons
19:06 <clarkb> The next step is to purge these retired backups
19:06 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/937040 Purge retired backups on vexxhost backup server
19:06 <fungi> i should be able to take care of that after the meeting, once i grab some food
19:06 <clarkb> and once that has landed and purged things we should rerun the manual prune again just to be sure that the prune script is still happy and also to get a new low water mark
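
(For context, a minimal sketch of the kind of pruning borg supports; the retention flags and repository path here are illustrative assumptions, not the actual prune script used on the backup servers.)

    # prune old borg archives according to a retention policy (flags are illustrative)
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /opt/backups/example-server
    # borg 1.2 and newer needs an explicit compact to actually free the disk space
    borg compact /opt/backups/example-server
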
19:06 <clarkb> fungi: thanks!
19:07 <clarkb> it's worth noting that so far we have only done the retirements and purges on one of two backup servers. Once we're happy with this entire process on one server we can consider how we may want to apply it to the other server
19:07 <clarkb> it's just far less urgent to do on the other as it has more disk space, and doing it one at a time is a nice safety mechanism to avoid overdeleting things
19:07 <clarkb> Any questions or concerns on backup pruning?
19:09 <fungi> none on my end
19:09 <clarkb> #topic Upgrading old servers
19:09 <clarkb> tonyb isn't here but I did have some updates on my end
19:09 <clarkb> Now that the gerrit upgrade is mostly behind us (we still have some changes to land and test, more on that later) the next big gerrit related item on my todo list is upgrading the server
19:10 <clarkb> I suspect the easiest way to do this will be to boot a new server with a new cinder volume and then sync everything over to it. Then schedule a downtime where we can do a final sync and update DNS
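
(As a rough illustration of that sync-then-cutover approach; the hostname and data path below are assumptions for the sketch, not the actual review server layout.)

    # initial bulk copy while the old server stays in service (illustrative host/path)
    rsync -aHAX --info=progress2 /home/gerrit2/ review-new.opendev.org:/home/gerrit2/
    # during the downtime: stop gerrit, run a final delta sync, then repoint DNS
    rsync -aHAX --delete /home/gerrit2/ review-new.opendev.org:/home/gerrit2/
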
19:10 <clarkb> any concerns with that approach or preferences for another method?
19:12 <fungi> in the past we've given advance notice of the new ip address due to companies baking 29418/tcp into firewall policies
19:12 <clarkb> good point. Though I'm not sure if we did that for the move to vexxhost
19:12 <fungi> no idea if that's a tradition worth continuing to observe
19:13 <corvus> o/
19:13 <clarkb> we can probably do some notice since we're taking a downtime anyway. Not sure it will be the month of notice we've had in the past
19:13 <frickler> do we want to stay on vexxhost? or maybe move to raxflex due to IPv6 issues?
19:13 <fungi> i'd be okay with not worrying about it
19:13 <fungi> not worrying about the advance notice i mean
19:14 <clarkb> frickler: considering raxflex once lost all of the data on our cinder volumes I am extremely wary of doing that
19:14 <frickler> although raxflex doesn't have v6 yet, either, I remember now
19:14 <fungi> frickler: rackspace flex currently has worse ipv6 issues, in that it has no ipv6
19:14 <clarkb> I think if we were upgrading in a year the consideration would be different
19:14 <fungi> yeah, that
19:14 <clarkb> and I'm not sure what flavor sizes are available there
19:14 <clarkb> I don't think we can get away with a much smaller gerrit node (mostly for memory)
19:14 <fungi> also flex folks say it's still under some churn before reaching general availability, so i've been hesitant to suggest moving any control plane systems there so soon
19:15 <corvus> i'm happy with gerrit and friends staying in rax classic for now.
19:15 <clarkb> corvus: it's in vexxhost fwiw
19:16 <clarkb> I can take a survey of ipv6 availability and flavor sizes and operational concerns in each cloud before booting anything though
19:16 <fungi> vexxhost's v6 routes in bgp have been getting filtered/dropped by some isps, hence frickler's concern
19:16 <corvus> oh i do remember that now...  :)
19:16 <corvus> what's the rax classic ipv6 situation?
19:16 <fungi> quite good
19:16 <clarkb> rax classic ipv6 generally works except for sometimes not between rax classic nodes (this is why ansible is set to use ipv4)
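
(For illustration, one common way to force ansible to connect over v4 is to pin ansible_host to the IPv4 address in the inventory; this is a generic sketch with a documentation address, not necessarily how the opendev inventory actually does it.)

    # inventory/hosts.yaml (hypothetical entry)
    all:
      hosts:
        example01.opendev.org:
          ansible_host: 203.0.113.10   # connect via IPv4 even though the host also has AAAA records
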
19:16 <clarkb> I don't think we've seen any connectivity problems to the outside world, only within the cloud
19:17 <corvus> maybe a move back to rax classic would be a good idea then?
19:17 <fungi> yeah, they've had some quo filtering issues in the past with v6 between regions as well, though not presently to my knowledge
19:17 <fungi> s/quo/qos/
19:18 <clarkb> possibly, the issues with xen not reliably working with noble probably need better understanding before committing to that
19:18 <clarkb> also not sure how big the big flavors are, would have to check that
19:18 <clarkb> (from memory they got pretty big when we were doing elasticsearch though)
19:18 <clarkb> which is the other related topic. I'd like to target noble for this new node so I may take a detour first and get noble with docker compose and podman working on a simpler service
19:19 <clarkb> and then we can make a decision on the exact location :)
19:19 <corvus> sounds like our choices are likely: a) aging hardware; b) unreliable v6; c) cloud could disappear at any time
19:20 <corvus> oh -- are the different vexxhost regions any different in regards to v6?
19:20 <frickler> d) semi-regular issues with vouchers expiring ;)
19:20 <corvus> maybe gerrit is in ymq and maybe sjc could be better?
19:20 <clarkb> corvus: that is a good question. frickler do you have problems with ipv6 reachability for opendev.org / gitea?
19:20 <clarkb> corvus: ya ymq for gerrit. gitea is in sjc1
19:21 <fungi> wrt v6 in sjc1 vs ca-ymq-1, probably the same since they're part of one larger allocation, but we'd have to do some long-term testing to know
19:21 <frickler> I remember more issues with gitea, but I need to check
19:21 <corvus> fungi: yeah, though those regions are so different otherwise i wouldn't take it as a given
19:21 <frickler> 2604:e100:1::/48 vs. 2604:e100:3::/48
19:22 <frickler> I think they were mostly affected similarly
19:22 <corvus> ack. oh well.  :|
19:22 <clarkb> so anyway I think the roundabout process here is sorting out docker/compose/podman on noble, update a service like paste, concurrently survey hosting options, then we decide where to host gerrit and proceed with it on noble
19:22 <corvus> clarkb: ++
19:23 <fungi> all of 2604:e100 seems to be assigned to vexxhost, fwiw
19:23 <clarkb> I'm probably going to start with the container stuff as I expect that to require more time and effort
19:23 <clarkb> if others want to do a more proper hosting survey I'd welcome that, otherwise I'll poke at it as time allows
19:23 <clarkb> Any other questions/concerns/etc?
19:24 <fungi> 2604:e100::/32 is a singular allocation according to arin's whois
19:24 <clarkb> #topic Docker compose plugin with podman service for servers
19:25 <clarkb> No new updates on this other than that I'm likely going to start poking at it and may have updates in the future
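
(A minimal sketch of the approach under discussion, assuming the stock podman packages on noble; podman.socket is the standard unit name, but the exact setup that eventually lands may differ.)

    # enable podman's docker-compatible API socket
    sudo systemctl enable --now podman.socket
    # point the docker compose plugin (and docker cli) at that socket
    export DOCKER_HOST=unix:///run/podman/podman.sock
    docker compose up -d
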
19:25 <clarkb> #topic Docker Hub Rate Limits
19:25 <clarkb> Sort of an update on this. I noticed yesterday that kolla ran a lot of jobs (like enough to have every single one of our nodes dedicated to kolla I think)
19:26 <clarkb> for about 6-8 hours after that point docker hub rate limits were super common
19:26 <clarkb> this has me wondering if at least some of the change we've observed in rate limit behavior is related to usage changes on our end
19:26 <clarkb> though it could just be coincidental
19:26 <fungi> some of our providers have limited address pools that are dedicated to us and get recycled frequently
19:27 <clarkb> if anyone knows of potentially suspicious changes though, digging into those might help us better understand the rate limits and make this more reliable while we sort out alternatives
19:27 <clarkb> fungi: yup though I was seeing this in rax-ord/rax-iad too which I don't think have the same super limited ip pools
19:27 <clarkb> but maybe they are more limited than I expect
19:28 <frickler> kolla tried to get rid of using dockerhub, though I'm not sure whether that succeeded to 100%
19:28 <fungi> well, they may not allocate a lot of unused addresses, so deleting and replacing servers is still likely to cycle through the smaller number of available addresses
19:28 <clarkb> as for alternatives https://review.opendev.org/c/zuul/zuul-jobs/+/935574 has the review approvals it needs but has struggled to merge
19:28 <clarkb> fungi: good point
19:28 <clarkb> corvus: to land 935574 I suspect approving it during our evening time is going to be least contended for rate limit quota
19:29 <corvus> clarkb: ack
19:29 <clarkb> corvus: maybe we should just try approving it now and if it fails again approve it again this evening and try to get it in?
19:29 <clarkb> s/try approving/try rechecking/
19:29 <corvus> i just reapproved it now, but let's try to do that at the end of day too
19:29 <clarkb> ++
19:29 <clarkb> then for opendev (and probably zuul) mirroring the python images (both ours and the parents) and the mariadb images are likely to have the biggest impact
19:30 <clarkb> everything else in opendev is like a single container image that we're often pulling from the intermediate registry anyway
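
(For illustration, mirroring an upstream image into another registry is typically a one-line copy; the destination org here is a made-up placeholder, not an actual opendev mirror location.)

    # copy the upstream mariadb image from docker hub into a hypothetical quay.io mirror org
    skopeo copy docker://docker.io/library/mariadb:10.11 docker://quay.io/example-mirror/mariadb:10.11
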
19:31 * clarkb scribbled a note to follow that change and get it landed
19:31 <clarkb> we can follow up once that happens. Anything else on this topic?
19:31 <fungi> i can keep an eye on it and recheck too if it fails
19:32 <clarkb> #topic Gerrit 3.10 Upgrade
19:32 <clarkb> The upgrade is complete \o/
19:32 <clarkb> and I think it went pretty smoothly
19:32 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document
19:33 <clarkb> this document still has a few todos on it though most of them have been captured into changes under topic:upgrade-gerrit-3.10
19:33 <corvus> thanks!
19:33 <fungi> sorry my isp decided to experience a fiber cut 10 minutes into the maintenance
19:33 <clarkb> essentially it's: update some test jobs to test the right version of gerrit, drop gerrit 3.9 image builds once we feel confident we won't need to revert, then add 3.11 image builds and 3.10 -> 3.11 upgrade testing
19:33 <corvus> fungi: heck of a time to start gardening
19:34 <clarkb> reviews on those are super appreciated. There is one gotcha in that the change to add 3.11 image builds will also rebuild our 3.10 image because I had to modify files to update acls in testing for 3.11. The side effect of this is we'll pull in my openid redirect fix from upstream when that change lands
19:34 <clarkb> I want to test the behavior of that change against 3.10 (prior testing was only against 3.9) before we merge it so that we can confidently restart gerrit on the rebuilt 3.10 image post merge
19:34 <clarkb> all that to say, if you want to defer to me on approving things that is fine. But reviews are still helpful
19:35 <clarkb> my goal is to test the openid redirect stuff this afternoon then tomorrow morning land changes and restart gerrit quickly to pick up the update
19:35 <fungi> sounds great
19:36 <fungi> i also need to look into what's needed so we can update git-review testing
19:36 <clarkb> thank you to everyone who helped with the upgrade. I'm really happy with how it went
19:36 <clarkb> any gerrit 3.10 concerns before we move on?
19:37 <clarkb> #topic Diskimage Builder Testing and Release
19:38 <clarkb> we have a request to make a dib release to pick up gentoo element fixes
19:38 <clarkb> unfortunately the openeuler tests are failing consistently after our mirror updates and the distro doesn't seem to have a fallback package location. fungi pushed a change to disable the tests which I have approved
19:38 <clarkb> getting those tests out of the way should allow us to land a couple of other bug fixes (one for svc-map and another for centos 9 installing grub when using efi)
19:39 <clarkb> and then I think we can push a tag for a new release
19:39 <clarkb> I also noticed that our openeuler image builds in nodepool are failing, likely due to a similar issue, and we need to pause them
19:39 <clarkb> do we know who to reach out to in openeuler land? I feel like this is a good "we're resetting things and you should be involved in that" point
19:40 <clarkb> as I don't think we'll get fixes into dib or our nodepool install otherwise
19:40 <fungi> i'm assuming the proposer of the change for the openeuler mirror update is someone who could speak for their community
19:41 <clarkb> good point. I wonder if they read email :)
19:41 <clarkb> I've made some notes to try and track them down and point them at what needs updating
19:42 <fungi> we could cc them on a change pausing the image builds, they did cc some of us directly for reviews on the mirror update change
19:42 <clarkb> I like that idea
19:42 <clarkb> fungi: did you want to push that up?
19:43 <fungi> i can, sure
19:43 <fungi> i'll mention the dib change from earlier today too
19:43 <clarkb> also your dib fix got a -1 testing noble and I'm guessing that's related to the broken noble package update that jayf just discovered
19:43 <JayF> yep
19:43 <clarkb> they pushed a kernel update but didn't push an updated kernel modules package
19:43 <clarkb> (or maybe the other way around)
19:43 <clarkb> fungi: thanks!
19:43 <JayF> that's exactly it clarkb
19:43 <fungi> well, sure. why would they? (distro'ing is hard work)
19:43 <clarkb> #topic Rax-ord Noble Nodes with 1 VCPU
19:43 <JayF> and then I tried to report the bug and got told if I needed commercial support to go buy it
19:44 <fungi> bwahahaha
19:44 <JayF> you can imagine how long I stayed in that channel
19:44 <clarkb> oh I thought you said they confirmed it
19:44 <clarkb> that is an unfortunate response
19:44 <JayF> they confirmed it but were like "huh, weird, whatever"
19:44 <clarkb> twice in as many months we've had jobs with weird failures due to having only 1 vcpu. The jobs all ran noble on rax ord iirc
19:44 <JayF> and I pushed them to acknowledge it was broken for everyone
19:45 <clarkb> JayF: ack
19:45 <JayF> that's what got that response, when I really just wanted someone to point me to the LP project to complain to :D
19:45 <clarkb> on closer inspection of the zuul collected debugging info the ansible facts show a clear difference in xen version between noble nodes with 8vcpu and those with 1vcpu
19:45 <fungi> people with actual support contracts will be pushing them to fix it soon enough anyway
19:45 <JayF> I assumed they, like I would if reported for Ironic, would want to unbreak their users ASAP
19:46 <clarkb> my best hunch at this point is that this is not a bug in nodepool using the wrong flavor or a bug in openstack using the wrong flavor (part of this assumption is other aspects of the flavor like disk and memory totals check out) and instead a bug in xen or noble or both that prevents the VMs from seeing all of the cpus
19:47 <clarkb> we pinged a couple of raxflex folks about it but they both indicated that the xen stuff is outside of their purview. I suspect this means the best path forward here is to file a support ticket
19:48 <clarkb> open to other ideas for getting this sorted out. I worry that a ticket might get dismissed as not having enough info or something
19:48 <fungi> we do have some code already that rejects nodes very early in jobs so they get retried
19:49 <frickler> likely watching whether we can gather more data might be the only option
19:49 <fungi> in base-jobs i think
19:49 <frickler> oh, also https://review.opendev.org/c/zuul/zuul-jobs/+/937376?usp=dashboard
19:50 <frickler> to at least have a bit more evidence possibly
19:50 <clarkb> oh ya we could have some check that was along the lines of: if rax and vcpu == 1 then fail
19:50 <clarkb> maybe we start with the data collection change and then add ^ and if this problem happens more than once a month we escalate
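
(A minimal sketch of that kind of check as an ansible task in a pre-run validation role; ansible_processor_vcpus is a standard ansible fact, but the exact placement and condition are assumptions, not the actual validate-host implementation.)

    - name: Fail early on nodes that booted with too few CPUs so zuul retries the job elsewhere
      fail:
        msg: "Node only sees {{ ansible_processor_vcpus }} VCPU(s); expected at least 2"
      when:
        - ansible_processor_vcpus | int < 2
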
19:51 <fungi> pretty sure there are similar checks in the validate-host role, i just need to remember where it's defined
19:52 <frickler> fungi: my patch is also within that role
19:52 <fungi> aha, yep
19:53 <clarkb> I just found the logs showing that frickler's update works. I'll approve it now
19:53 <clarkb> #topic Open Discussion
19:54 <clarkb> Anything else before our hour is up?
19:54 <corvus> no updates on the niz image stuff (as expected); i'm hoping to get back to that soonish
19:55 <clarkb> A reminder that we'll have our last weekly meeting of the year next week and then we'll be back January 7
19:55 <fungi> same bat time, same bat channel
19:56 <clarkb> sounds like that is everything. Thank you for your time and all your help running OpenDev
19:56 <clarkb> #endmeeting
19:56 <opendevmeet> Meeting ended Tue Dec 10 19:56:56 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
19:56 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.html
19:56 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.txt
19:56 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.log.html
19:57 <fungi> thanks clarkb!
