clarkb | meeting time | 19:00 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Dec 10 19:00:31 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/YTAWQCPLRMRJARKVH4XZR3LXF5DJ4IHQ/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | tonyb mentioned not being here today for the meeting due to travel | 19:00 |
clarkb | Anything else to announce? | 19:01 |
clarkb | sounds like no | 19:02 |
clarkb | #topic Zuul-launcher image builds | 19:02 |
clarkb | corvus: anything new here? Curious if api improvements have made it into zuul yet | 19:03 |
clarkb | Not sure if corvus is around right now. We can get back to this later if that changes | 19:04 |
clarkb | #topic Backup Server Pruning | 19:04 |
clarkb | We retired backups via the new automation and fungi reran the prune script, both to check that it works post-retirement and because the server was complaining about being near full again. Good news is that got us down to 68% utilization (previous low water mark was ~77% iirc) | 19:05 |
clarkb | a definite improvement | 19:05 |
fungi | most thanks for that go to ianw and you, i just pushed some buttons | 19:06 |
clarkb | The next step is to purge the backups for these retired servers | 19:06 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/937040 Purge retired backups on vexxhost backup server | 19:06 |
fungi | i should be able to take care of that after the meeting, once i grab some food | 19:06 |
clarkb | and once that has landed and purged things we should rerun the manual prune again just to be sure that the prune script is still happy and also to get a new low water mark | 19:06 |
clarkb | fungi: thanks! | 19:06 |
clarkb | it's worth noting that so far we have only done the retirements and purges on one of two backup servers. Once we're happy with this entire process on one server we can consider how we may want to apply it to the other server | 19:07 |
clarkb | it's just far less urgent to do on the other as it has more disk space, and doing it one at a time is a nice safety mechanism to avoid over-deleting things | 19:07 |
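For context, a minimal sketch of the kind of retention-based prune being described, assuming borg as the backup tool; the repository layout and retention values here are placeholders, not the real system-config prune script:

```python
# Hypothetical retention-based prune loop; paths, repo layout, and retention
# values are placeholders, not the actual system-config prune script.
import pathlib
import subprocess

BACKUP_ROOT = pathlib.Path("/opt/backups")            # assumed repo layout
RETENTION = ["--keep-daily", "7", "--keep-weekly", "4", "--keep-monthly", "6"]

for repo in sorted(BACKUP_ROOT.glob("*/backup")):      # one borg repo per backed-up host (assumed)
    print(f"pruning {repo}")
    subprocess.run(["borg", "prune", "--stats", *RETENTION, str(repo)], check=True)
    # borg >= 1.2 needs a separate compact step to actually free the space
    subprocess.run(["borg", "compact", str(repo)], check=True)
```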
clarkb | Any questions or concerns on backup pruning? | 19:07 |
fungi | none on my end | 19:09 |
clarkb | #topic Upgrading old servers | 19:09 |
clarkb | tonyb isn't here but I did have some updates on my end | 19:09 |
clarkb | Now that the gerrit upgrade is mostly behind us (we still have some changes to land and test, more on that later) the next big gerrit related item on my todo list is upgrading the server | 19:09 |
clarkb | I suspect the easiest way to do this will be to boot a new server with a new cinder volume and then sync everything over to it. Then schedule a downtime where we can do a final sync and update DNS | 19:10 |
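A rough sketch of the two-pass sync being described, run from the new server; the hostname and paths are placeholders for illustration, not the actual migration plan:

```python
# Hypothetical two-pass migration: a warm bulk sync while the old server is
# still live, then a short final catch-up during the downtime before DNS is
# updated. Run on the new server; hostname and paths are placeholders.
import subprocess

OLD = "review-old.opendev.org"                       # placeholder hostname
PATHS = ["/home/gerrit2/review_site/git/", "/home/gerrit2/review_site/index/"]

def sync(final: bool = False) -> None:
    for path in PATHS:
        cmd = ["rsync", "-aHAX", "--info=progress2"]
        if final:
            cmd.append("--delete")                   # only once Gerrit is stopped
        cmd += [f"root@{OLD}:{path}", path]
        subprocess.run(cmd, check=True)

sync()              # pass 1: bulk copy while Gerrit is still running
# ... stop Gerrit on the old server, then:
sync(final=True)    # pass 2: final catch-up during the outage, then update DNS
```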
clarkb | any concerns with that approach or preferences for another method? | 19:10 |
fungi | in the past we've given advance notice of the new ip address due to companies baking 29418/tcp into firewall policies | 19:12 |
clarkb | good point. Though I'm not sure if we did that for the move to vexxhost | 19:12 |
fungi | no idea if that's a tradition worth continuing to observe | 19:12 |
corvus | o/ | 19:13 |
clarkb | we can probably do some notice since we're taking a downtime anyway. Not sure it needs to be the month of notice we've given in the past | 19:13 |
frickler | do we want to stay on vexxhost? or maybe move to raxflex due to IPv6 issues? | 19:13 |
fungi | i'd be okay with not worrying about it | 19:13 |
fungi | not worrying about the advance notice i mean | 19:13 |
clarkb | frickler: considering raxflex once lost all of the data on our cinder volumes I am extremely wary of doing that | 19:14 |
frickler | although raxflex doesn't have v6 yet, either, I remember now | 19:14 |
fungi | frickler: rackspace flex currently has worse ipv6 issues, in that it has no ipv6 | 19:14 |
clarkb | I think if we were upgrading in a year the consideration would be different | 19:14 |
fungi | yeah, that | 19:14 |
clarkb | and I'm not sure what flavor sizes are available there | 19:14 |
clarkb | I don't think we can get away with a much smaller gerrit node (mostly for memory) | 19:14 |
fungi | also flex folks say it's still under some churn before reaching general availability, so i've been hesitant to suggest moving any control plane systems there so soon | 19:14 |
corvus | i'm happy with gerrit and friends staying in rax classic for now. | 19:15 |
clarkb | corvus: its in vexxhost fwiw | 19:15 |
clarkb | I can take a survey of ipv6 availability and flavor sizes and operational concerns in each cloud before booting anything though | 19:16 |
fungi | vexxhost's v6 routes in bgp have been getting filtered/dropped by some isps, hence frickler's concern | 19:16 |
corvus | oh i do remember that now... :) | 19:16 |
corvus | what's the rax classic ipv6 situation? | 19:16 |
fungi | quite good | 19:16 |
clarkb | rax classic ipv6 generally works except for sometimes not between rax classic nodes (this is why ansible is set to use ipv4) | 19:16 |
clarkb | I don't think we've seen any connectivity problems to the outside world only within the cloud | 19:16 |
corvus | maybe a move back to rax classic would be a good idea then? | 19:17 |
fungi | yeah, they've had some qos filtering issues in the past with v6 between regions as well, though not presently to my knowledge | 19:17 |
clarkb | possibly, though the issues with xen not reliably working with noble probably need better understanding before committing to that | 19:18 |
clarkb | also not sure how big the big flavors are; would have to check that | 19:18 |
clarkb | (from memory they got pretty big when we were doing elasticsearch though) | 19:18 |
clarkb | which is the other related topic. I'd like to target noble for this new node so I may take a detour first and get noble with docker compose and podman working on a simpler service | 19:18 |
clarkb | and then we can make a decision on the exact location :) | 19:19 |
corvus | sounds like our choices are likely: a) aging hardware; b) unreliable v6; c) cloud could disappear at any time | 19:19 |
corvus | oh -- are the different vexxhost regions any different in regards to v6? | 19:20 |
frickler | d) semi-regular issues with vouchers expiring ;) | 19:20 |
corvus | maybe gerrit is in ymq and maybe sjc could be better? | 19:20 |
clarkb | corvus: that is a good question. frickler do you have problems with ipv6 reachability for opendev.org / gitea? | 19:20 |
clarkb | corvus: ya ymq for gerrit. gitea is in sjc1 | 19:20 |
fungi | wrt v6 in sjc1 vs ca-ymq-1, probably the same since they're part of one larger allocation, but we'd have to do some long-term testing to know | 19:21 |
frickler | I remember more issues with gitea, but I need to check | 19:21 |
corvus | fungi: yeah, though those regions are so different otherwise i wouldn't take it as a given | 19:21 |
frickler | 2604:e100:1::/48 vs. 2604:e100:3::/48 | 19:21 |
frickler | I think they were mostly affected similarly | 19:22 |
corvus | ack. oh well. :| | 19:22 |
clarkb | so anyway I think the roundabout process here is: sort out docker compose/podman on noble, update a service like paste, concurrently survey hosting options, then decide where to host gerrit and proceed with it on noble | 19:22 |
corvus | clarkb: ++ | 19:22 |
fungi | all of 2604:e100 seems to be assigned to vexxhost, fwiw | 19:23 |
clarkb | I'm probably going to start with the container stuff as I expect that to require more time and effort | 19:23 |
clarkb | if others want to do a more proper hosting survey I'd welcome that otherwise I'll poke at it as time allows | 19:23 |
clarkb | Any other questions/concerns/etc? | 19:23 |
fungi | 2604:e100::/32 is a singular allocation according to arin's whois | 19:24 |
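A small sketch of the kind of reachability probe such a hosting survey could use, forcing IPv6 TCP connects to a few service endpoints; the host/port list is only an example:

```python
# Probe IPv6 reachability by forcing AF_INET6 resolution and a TCP connect.
# The target list is an example, not an authoritative service inventory.
import socket

TARGETS = [("opendev.org", 443), ("review.opendev.org", 443), ("review.opendev.org", 29418)]

for host, port in TARGETS:
    try:
        addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)[0][4]
        with socket.create_connection(addr[:2], timeout=5):
            print(f"ok   {host} [{addr[0]}]:{port}")
    except OSError as exc:
        print(f"FAIL {host}:{port} over IPv6: {exc}")
```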
clarkb | #topic Docker compose plugin with podman service for servers | 19:24 |
clarkb | No new updates on this other than that I'm likely going to start poking at it and may have updates in the future | 19:25 |
clarkb | #topic Docker Hub Rate Limits | 19:25 |
clarkb | Sort of an update on this. I noticed yesterday that kolla ran a lot of jobs (like enough to have every single one of our nodes dedicated to kolla I think) | 19:25 |
clarkb | for about 6-8 hours after that point docker hub rate limits were super common | 19:26 |
clarkb | this has me wondering if at least some of the change we've observed in rate limit behavior is related to usage changes on our end | 19:26 |
clarkb | though it could just be coincidental | 19:26 |
fungi | some of our providers have limited address pools that are dedicated to us and get recycled frequently | 19:26 |
clarkb | if anyone knows of potentially suspicious changes, though, digging into those might help us better understand the rate limits and make things more reliable while we sort out alternatives | 19:27 |
clarkb | fungi: yup though I was seeing this in rax-ord/rax-iad too which I don't think have the same super limited ip pools | 19:27 |
clarkb | but maybe they are more limited than I expect | 19:27 |
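One way to check how much anonymous pull quota a given node's IP has left is Docker's documented ratelimitpreview endpoint; a stdlib-only sketch, with header names as documented by Docker at the time of writing:

```python
# Query the anonymous Docker Hub pull quota visible from this host's IP using
# the documented ratelimitpreview/test image; a HEAD request does not consume quota.
import json
import urllib.request

TOKEN_URL = ("https://auth.docker.io/token"
             "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull")
MANIFEST_URL = "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest"

with urllib.request.urlopen(TOKEN_URL) as resp:
    token = json.load(resp)["token"]

req = urllib.request.Request(MANIFEST_URL, method="HEAD",
                             headers={"Authorization": f"Bearer {token}"})
with urllib.request.urlopen(req) as resp:
    for header in ("ratelimit-limit", "ratelimit-remaining", "docker-ratelimit-source"):
        print(f"{header}: {resp.headers.get(header)}")
```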
frickler | kolla tried to get rid of using dockerhub, though I'm not sure whether that succeeded to 100% | 19:28 |
fungi | well, they may not allocate a lot of unused addresses, so deleting and replacing servers is still likely to cycle through the smaller number of available addresses | 19:28 |
clarkb | as for alternatives https://review.opendev.org/c/zuul/zuul-jobs/+/935574 has the review approvals it needs but has struggled to merge | 19:28 |
clarkb | fungi: good point | 19:28 |
clarkb | corvus: to land 935574 I suspect approving it during our evening time is when rate limit quota will be least contended | 19:28 |
corvus | clarkb: ack | 19:29 |
clarkb | corvus: maybe we should just try rechecking it now and if it fails again approve it again this evening and try to get it in? | 19:29 |
corvus | i just reapproved it now, but let's try to do that at the end of day too | 19:29 |
clarkb | ++ | 19:29 |
clarkb | then for opendev (and probably zuul) mirroring the python images (both ours and the parents) and the mariadb images is likely to have the biggest impact | 19:29 |
clarkb | everything else in opendev is like a single container image that we're often pulling from the intermediate registry anyway | 19:30 |
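For reference, a sketch of what mirroring a short list of heavily pulled images could look like, shelling out to skopeo; the image list and destination namespace are illustrative assumptions, not the agreed plan:

```python
# Illustrative mirror loop: copy a few frequently pulled images from Docker Hub
# to an alternative registry with skopeo. Image list and destination are assumptions.
import subprocess

IMAGES = ["library/python:3.11-slim", "library/python:3.12-slim", "library/mariadb:10.11"]
DESTINATION = "quay.io/opendevmirror"                # placeholder namespace

for image in IMAGES:
    name_tag = image.split("/", 1)[1]                # e.g. "python:3.11-slim"
    subprocess.run(
        ["skopeo", "copy", "--all",                  # --all copies every architecture in the manifest list
         f"docker://docker.io/{image}",
         f"docker://{DESTINATION}/{name_tag}"],
        check=True,
    )
```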
* clarkb scribbled a note to follow that change and get it landed | 19:31 | |
clarkb | we can followup once that happens. Anything else on this topic? | 19:31 |
fungi | i can keep an eye on it and recheck too if it fails | 19:31 |
clarkb | #topic Gerrit 3.10 Upgrade | 19:32 |
clarkb | The upgrade is complete \o/ | 19:32 |
clarkb | and I think it went pretty smoothly | 19:32 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document | 19:32 |
clarkb | this document still has a few todos on it though most of them have been captured into changes under topic:upgrade-gerrit-3.10 | 19:33 |
corvus | thanks! | 19:33 |
fungi | sorry my isp decided to experience a fiber cut 10 minutes into the maintenance | 19:33 |
clarkb | essentially it's: update some test jobs to test the right version of gerrit, drop gerrit 3.9 image builds once we feel confident we won't need to revert, then add 3.11 image builds and 3.10 -> 3.11 upgrade testing | 19:33 |
corvus | fungi: heck of a time to start gardening | 19:33 |
clarkb | reviews on those are super appreciated. There is one gotcha in that the change to add 3.11 image builds will also rebuild our 3.10 image because I had to modify files to update acls in testing for 3.11. The side effect of this is we'll pull in my openid redirect fix from upstream when that change lands | 19:34 |
clarkb | I want to test the behavior of that change against 3.10 (prior testing was only against 3.9) before we merge it so that we can confidently restart gerrit on the rebuilt 3.10 image post merge | 19:34 |
clarkb | all that to say: if you want to defer to me on approving things that is fine. But reviews are still helpful | 19:34 |
clarkb | my goal is to test the openid redirect stuff this afternoon then tomorrow morning land changes and restart gerrit quickly to pick up the update | 19:35 |
fungi | sounds great | 19:35 |
fungi | i also need to look into what's needed so we can update git-review testing | 19:36 |
clarkb | thank you to everyone who helped with the upgrade. I'm really happy with how it went | 19:36 |
clarkb | any gerrit 3.10 concerns before we move on? | 19:36 |
clarkb | #topic Diskimage Builder Testing and Release | 19:37 |
clarkb | we have a request to make a dib release to pick up gentoo element fixes | 19:38 |
clarkb | unfortunately the openeuler tests are failing consistently after our mirror updates and the distro doesn't seem to have a fallback package location. Fungi pushed a change to disable the tests, which I have approved | 19:38 |
clarkb | getting those tests out of the way should allow us to land a couple of other bug fixes (one for svc-map and another for centos 9 installing grub when using efi) | 19:38 |
clarkb | and then I think we can push a tag for a new release | 19:39 |
clarkb | I also noticed that our openeuler image builds in nodepool are failing likely due to a similar issue and we need to pause them | 19:39 |
clarkb | do we know who to reach out to in openeuler land? I feel like this is a good "we're resetting things and you should be involved in that" point | 19:39 |
clarkb | as I don't think we'll get fixes into dib or our nodepool install otherwise | 19:40 |
fungi | i'm assuming the proposer of the change for the openeuler mirror update is someone who could speak for their community | 19:40 |
clarkb | good point. I wonder if they read email :) | 19:41 |
clarkb | I've made some notes to try and track them down and point them at what needs updating | 19:41 |
fungi | we could cc them on a change pausing the image builds, they did cc some of us directly for reviews on the mirror update change | 19:42 |
clarkb | I like that idea | 19:42 |
clarkb | fungi: did you want to push that up? | 19:42 |
fungi | i can, sure | 19:43 |
fungi | i'll mention the dib change from earlier today too | 19:43 |
clarkb | also your dib fix got a -1 testing noble and I'm guessing that's related to the broken noble package update that JayF just discovered | 19:43 |
JayF | yep | 19:43 |
clarkb | they pushed a kernel update but didn't push an updated kernel modules package | 19:43 |
clarkb | (or maybe the other way around) | 19:43 |
clarkb | fungi: thanks! | 19:43 |
JayF | that's exactly it clarkb | 19:43 |
fungi | well, sure. why would they? (distro'ing is hard work) | 19:43 |
clarkb | #topic Rax-ord Noble Nodes with 1 VCPU | 19:43 |
JayF | and then I tried to report the bug and got told if I needed commercial support to go buy it | 19:43 |
fungi | bwahahaha | 19:44 |
JayF | you can imagine how long I stayed in that channel | 19:44 |
clarkb | oh I thought you said they confirmed it | 19:44 |
clarkb | that is an unfortunate response | 19:44 |
JayF | they confirmed it but were like "huh, weird, whatever" | 19:44 |
clarkb | twice in as many months we've had jobs with weird failures due to having only 1 vcpu. The jobs all ran noble on rax-ord iirc | 19:44 |
JayF | and I pushed them to acknowledge it was broken for everyone | 19:44 |
clarkb | JayF: ack | 19:45 |
JayF | that's what got that response, when I really just wanted someone to point me to the LP project to complain to :D | 19:45 |
clarkb | on closer inspection of the zuul-collected debugging info, the ansible facts show a clear difference in xen version between noble nodes with 8 vcpus and those with 1 vcpu | 19:45 |
fungi | people with actual support contracts will be pushing them to fix it soon enough anyway | 19:45 |
JayF | I assumed they, like I would if reported for Ironic, would want to unbreak their users ASAP | 19:45 |
clarkb | my best hunch at this point is that this is not a bug in nodepool using the wrong flavor or a bug in openstack using the wrong flavor (part of this assumption is that other aspects of the flavor, like disk and memory totals, check out) and is instead a bug in xen or noble or both that prevents the VMs from seeing all of the cpus | 19:46 |
clarkb | we pinged a couple of raxflex folks about it but they both indicated that the xen stuff is outside of their purview. I suspect this means the best path forward here is to file a support ticket | 19:47 |
clarkb | open to other ideas for getting this sorted out. I worry that a ticket might get dismissed as not having enough info or something | 19:48 |
fungi | we do have some code already that rejects nodes very early in jobs so they get retried | 19:48 |
frickler | likely watching and gathering more data might be the only option | 19:49 |
fungi | in base-jobs i think | 19:49 |
frickler | oh, also https://review.opendev.org/c/zuul/zuul-jobs/+/937376?usp=dashboard | 19:49 |
frickler | to at least have a bit more evidence possibly | 19:50 |
clarkb | oh ya we could have some check that was along the lines of: if rax and vcpu == 1 then fail | 19:50 |
clarkb | maybe we start with the data collection change and then add ^ and if this problem happens more than once a month we escalate | 19:50 |
fungi | pretty sure there are similar checks in the validate-host role, i just need to remember where it's defined | 19:51 |
frickler | fungi: my patch is also within that role | 19:52 |
fungi | aha, yep | 19:52 |
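A sketch of the kind of early rejection check being discussed; the real check would live in the validate-host/base-jobs Ansible, so this standalone Python is only illustrative, and the provider lookup is a placeholder:

```python
# Illustrative fail-fast check: exit non-zero (so the job can be retried on a
# fresh node) when a rax classic node reports an implausibly low CPU count.
# The NODEPOOL_PROVIDER environment variable is a placeholder data source.
import os
import sys

provider = os.environ.get("NODEPOOL_PROVIDER", "")
cpus = os.cpu_count() or 0

if provider.startswith("rax-") and cpus < 2:
    print(f"refusing suspect node: provider={provider} cpus={cpus}", file=sys.stderr)
    sys.exit(1)
print(f"node ok: provider={provider} cpus={cpus}")
```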
clarkb | I just found the logs showing that frickler's update works. I'll approve it now | 19:53 |
clarkb | #topic Open Discussion | 19:53 |
clarkb | Anything else before our hour is up? | 19:54 |
corvus | no updates on the niz image stuff (as expected); i'm hoping to get back to that soonish | 19:54 |
clarkb | A reminder that we'll have our last weekly meeting of the year next week and then we'll be back January 7 | 19:55 |
fungi | same bat time, same bat channel | 19:55 |
clarkb | sounds like that is everything. Thank you for your time and all your help running OpenDev | 19:56 |
clarkb | #endmeeting | 19:56 |
opendevmeet | Meeting ended Tue Dec 10 19:56:56 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:56 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.html | 19:56 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.txt | 19:56 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-12-10-19.00.log.html | 19:56 |
fungi | thanks clarkb! | 19:57 |