clarkb | Just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Apr 15 19:00:11 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ICGFTZD6Y3OYOHTZWY4HN4VYLTN7R6BY/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | We announced the Gerrit server upgrade for April 21, 2025 at 1600 UTC (we'll talk more about that later) | 19:00 |
clarkb | anything else to announce? I guess the PTG is over as is the openstack release (mostly) so we're able to do more potentially impactful changes | 19:01 |
clarkb | I pulled the meetpad servers out of the emergency file so they updated over the weekend and I've tested the service since | 19:03 |
clarkb | if that's it I think we can dive into the agenda | 19:03 |
clarkb | #topic Zuul-launcher image builds | 19:03 |
corvus | a status update first, then i think we should make some decisions. | 19:03 |
corvus | status: i'm rolling out fixes to the defects previously observed: vhd upload timeouts and leaked node cleanups. very soon (today/tomorrow?) i'll be ready to proceed with the previously discussed/agreed move of opendev and zuul tenants to using only launcher nodes | 19:03 |
corvus | there's still an outstanding item before i'm ready to propose moving larger (openstack) tenants to the launcher: statsd output. i expect to have that done and ready to move openstack, maybe about a month from now? | 19:03 |
corvus | which leads us to the thing we should start making decisions about: i haven't seen any activity on getting the remaining images prepped. specifically, i don't see any work on centos or arm images. | 19:04 |
corvus | we've been talking about this for many months now. if someone here is going to volunteer to do that, i think now would be a good time to record that and an expected completion date. if no one is interested in that work, then we should start making alternate plans now. | 19:04 |
clarkb | This is a good opportunity for someone who is interested but doesn't have a lot of time or doesn't have root as it's largely contained to zuul jobs | 19:05 |
clarkb | re arm64 image builds they are much quicker than before after changes made to the cloud | 19:06 |
corvus | in the case of no volunteers, i think it would be reasonable to send a message to the service-discuss list (and maybe someone should bounce it to openstack-discuss?) telling people that we need someone to take ownership of maintaining those images, otherwise, by mid-may they will become best effort and possibly start returning node errors, and at some later date (july?), they will be removed and will always return errors. | 19:06 |
clarkb | so I'm far less worried about them being too slow or annoyingly slow | 19:06 |
corvus | that should give people adequate notice to step up and fix them before we remove them due to lack of maintenance. | 19:06 |
fungi | i can also communicate that need to the openstack tc, maybe they can find volunteers to add those resources since some projects in openstack are relying on them | 19:06 |
clarkb | that all seems reasonable to me | 19:07 |
corvus | fungi: sounds good | 19:07 |
corvus | should we do those in parallel (email / tc) or one then the other? | 19:07 |
clarkb | maybe send to service-discuss first then we can point the openstack tc to that thread | 19:07 |
clarkb | ? | 19:07 |
fungi | if someone else wants to send the notification to service-discuss, i can take on mentioning it on the openstack-discuss list and also to the tc at the same time | 19:08 |
corvus | i'm happy to draft that email message | 19:08 |
corvus | s/happy/willing/ | 19:08 |
clarkb | sounds like a plan | 19:08 |
corvus | (i'm not happy about dropping images, i want someone to volunteer, and i would be happy to answer any questions they have) | 19:09 |
corvus | okay, clarkb want to #action? | 19:09 |
clarkb | #action corvus draft message to service-discuss asking for volunteers to port ci image builds otherwise images may become unmaintained | 19:09 |
clarkb | #action fungi followup to corvus' message with the openstack TC to see if anyone in openstack relying on those images is interested | 19:10 |
fungi | will do | 19:10 |
clarkb | anything else on this topic? | 19:10 |
corvus | thanks, that's it from me | 19:10 |
clarkb | #topic Container hygiene tasks | 19:10 |
clarkb | #link https://review.opendev.org/q/topic:%22opendev-python3.12%22+status:open Update images to use python3.12 | 19:10 |
clarkb | this has stalled out somewhat while I work on moving Gerrit servers (our next topic) | 19:11 |
clarkb | but this is still a topic whose changes could use another set of reviews if people have time. Though once Gerrit stuff is done I'll probably proceed with fungi's +2 if no one else reviews | 19:11 |
clarkb | Basically not urgent but a good thing to flush out of our queue for general hygiene | 19:12 |
clarkb | #topic Switching Gerrit to run on Review03 | 19:12 |
clarkb | As mentioned at the start of the meeting we've announced that April 21 (Monday) at 16:00 we're taking an outage to swap the servers around | 19:12 |
clarkb | The new server is very similar to the old one (still BFV, still running in vexxhost ca-ymq-1, using the same sized server) but with new backing resources | 19:13 |
clarkb | It is enrolled in our inventory as a review-staging group member and I've synced data and started gerrit on it. This means you can go to https://review03.opendev.org and test it today. If you want to log in you need to update /etc/hosts to set up review.opendev.org to point at its IP(s) | 19:14 |
clarkb | the reason for needing /etc/hosts to login is the openid redirects use the review.opendev.org name | 19:14 |
clarkb | #link https://etherpad.opendev.org/p/i_vt63v18c3RKX2VyCs3 review03 testing and service migration planning | 19:14 |
clarkb | I've also been working on this document to track what I've done, what needs to be done, and the migration plan for Monday | 19:14 |
clarkb | feel free to leave comments on there if you have questions or concerns or testing shows something unexpected | 19:15 |
clarkb | Keep in mind that anything you do to review03 between now and monday will be deleted when we switch servers | 19:15 |
clarkb | we aren't merging content; we will replace what is on 03 with what is on 02 during the outage | 19:15 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/947044 Prep work to ensure port 29418 ssh host keys match on old and new servers | 19:15 |
clarkb | This change is for something I discovered when spinning up the new server. Running init on the new server generated new ecdsa and ed25519 ssh host keys. | 19:16 |
clarkb | This change adds management of those host keys to ansible and I've updated private vars to match review02 so that review02 and review03 should have the same hostkeys | 19:16 |
clarkb | when you review that change please double check the private vars content if you have time. That's probably the trickiest thing about the change: just mapping all the data properly | 19:16 |
clarkb | and finally I realized that since we manage DNS with gerrit changes updating DNS during the outage is a bit tricky. I've documented the half manual path for getting that done on the etherpad. That may be something good to check for viability | 19:17 |
clarkb | basically we have to force merge the change to swap dns (because zuul won't be working yet) then we have to manually run the service-nameserver.yaml playbook (because zuul isn't running yet). Doable but different | 19:18 |
fungi | we could in theory wait for the dns change to deploy and start the outage then | 19:18 |
fungi | which could avoid additional manual steps unless we need to roll back | 19:18 |
clarkb | fungi: possibly, one upside to using the path I describe is it will exercise replication for us | 19:18 |
clarkb | which is something I can't easily test until we switch | 19:19 |
clarkb | so it's a quick early indication of whether or not that fundamental feature is functional | 19:19 |
clarkb | corvus: I did want to ask if you have any concerns with leaving zuul up during the outage | 19:20 |
clarkb | I think it will be fine and zuul will gracefully degrade so I haven't been planning on touching zuul. But let me know if that is a bad plan | 19:20 |
fungi | though another downside of updating dns after we bring up the new server is that we can't have dns occurring in parallel with the rsync and startup | 19:20 |
fungi | can't have dns propagation happening in parallel i mean | 19:20 |
corvus | clarkb: yeah i thought about that and agree that we should leave it up, but just monitor logs on the schedulers and look for how it handles the failover | 19:20 |
clarkb | fungi: ya it would have to wait for gerrit to be up on 03 | 19:21 |
fungi | so deploying dns first could shorten the effective outage duration | 19:21 |
clarkb | fungi: ya by about 5 minutes I guess. I'll have to think about that alternative a bit more | 19:21 |
clarkb | any questions or concerns about the preparation, timing, testing, etc? | 19:22 |
fungi | none from me | 19:22 |
clarkb | I think my next goal is to login to review03 and make sure that functions and I see my personal content that I expect. Then maybe first thing tomorrow is a good time to land the ssh key management update if people can review before then | 19:23 |
clarkb | and then before Friday we land the dns ttl update to shorten to 5 minutes on the review.o.o record | 19:24 |
clarkb | it just occurred to me that review.openstack.org is probably in DNS too and will need manual intervention | 19:24 |
clarkb | I'll update the etherpad for ^ after the meeting | 19:24 |
clarkb | fungi: ^ you will need to do that I think actually | 19:24 |
clarkb | since it is cloudflare now | 19:24 |
fungi | yeah, can do, thanks for the reminder | 19:25 |
fungi | i can adjust that during the outage | 19:25 |
clarkb | thanks! | 19:25 |
clarkb | #topic Upgrading old servers | 19:26 |
clarkb | Today I noticed that our noble nodes have a ~881MB /boot partition | 19:26 |
clarkb | due to historical reasons I have concerns about small /boot partitions but I did some cross checking against a personal machine that I had problems on and I think we're ok with that cloud image's setup | 19:27 |
clarkb | the initrds for kvm nodes are small enough that even with the smaller size I think we can fit ~twice as many kernels as my personal machine can do on a /boot that is 1.9GB large | 19:27 |
clarkb | But I wanted to call this out as a change from our existing older nodes that do not have a /boot partition | 19:27 |
fungi | yeah, that's plenty of room | 19:28 |
clarkb | any other server upgrade things to note? | 19:28 |
fungi | not from me | 19:29 |
clarkb | ya seems like not, we can continue | 19:29 |
clarkb | #topic Running certcheck on bridge | 19:29 |
clarkb | I was going to drop this from the agenda but I forgot so it's here :) | 19:29 |
fungi | no update | 19:29 |
fungi | yeah, please do | 19:29 |
clarkb | no updates from me. That said once Gerrit is sorted out I should have time to look at this | 19:29 |
clarkb | so maybe it will go away temporarily. We'll see | 19:30 |
fungi | we can defer it to when we get prometheus done | 19:30 |
clarkb | Related to this the wiki cert expires in 2X days | 19:30 |
clarkb | and some consortium of cert groups just announced a plan to limit cert validity down to 47 days over the next several years | 19:30 |
clarkb | March 2026 will be the end of getting a cert with one year of validity. So we might get one year now. Then another year in less than a year depending on how things go | 19:31 |
clarkb | anyway that's all for later | 19:31 |
fungi | wow, 47 days will be considerably less than let's encrypt does now even | 19:31 |
clarkb | fungi: yup | 19:31 |
fungi | by half | 19:32 |
clarkb | fungi: the original argument was for 90 days but apple apparently convinced everyone to go for ~half that | 19:32 |
clarkb | but that won't take effect until 2029 or something | 19:32 |
clarkb | #topic Working through our TODO list | 19:32 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:32 |
clarkb | just a reminder that we have a higher level todo list on this etherpad | 19:32 |
clarkb | it might want a better permanent home but until then feel free to pick up work off that list if you need something to do or want to get more involved in opendev | 19:33 |
clarkb | I'm happy to answer any questions or concerns about the list and guide people on those tasks if they get picked up | 19:33 |
clarkb | and feel free to add things there as time goes on and we have new things | 19:33 |
clarkb | #topic Rotating mailman 3 logs | 19:33 |
fungi | not done yet | 19:34 |
clarkb | is there a held node or anything along those lines yet? | 19:34 |
fungi | no, not even a proposed change yet unless i did it and forgot | 19:34 |
clarkb | ack | 19:34 |
clarkb | we've lived without it for a while but eventually that file is going to become too unwieldy so we should try and figure out something here | 19:35 |
clarkb | let me know if I can help (I did discover the problem after all) | 19:35 |
clarkb | #topic Moving hound image to quay.io | 19:35 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/947010 | 19:35 |
clarkb | We did this to lodgeit after running the service on noble. Codesearch has also been running on noble for some time and can use quay.io instead of docker too | 19:36 |
clarkb | Just another step in the slow process of getting off of docker hub | 19:36 |
clarkb | we'll be able to do this with the gerrit images too once we're confident we aren't rolling back | 19:36 |
clarkb | in related news it seems like the docker hub issues are less prevalent. I think all of the little changes we've made to rely on docker hub less add up to more reliability which is nice | 19:37 |
clarkb | (though the recent base python image updates did hit the issues pretty reliably but I think that was due to the number of images being updated at once) | 19:37 |
clarkb | #topic Open Discussion | 19:38 |
clarkb | Anything else? | 19:38 |
fungi | i've got nothing | 19:38 |
clarkb | the weather is going to be great all week here before raining on the weekend again. So don't be surprised if I pop out for a bike ride here and there (definitely going to try for this afternoon) | 19:39 |
corvus | 1 thing | 19:39 |
corvus | * openEuler (x86 and Arm)... (full message at <https://matrix.org/oftc/media/v1/media/download/AUJLRSPeJHqFGHLgsbmtZfjle05809G8xfMA4kubBTrqB0se1-QBVrK6t_uAgY2M0OgPHkcXo8OnLmIOHKz0IMBCeWgxAFvQAG1hdHJpeC5vcmcva0NPaE9WQmFNVUhUaVB0a1VsdlV4SnNt>) | 19:39 |
corvus | that look okay in irc? i can make an alternative if not | 19:40 |
clarkb | corvus: it gave us irc folks a link to a paste essentially | 19:40 |
corvus | oh sorry. that was my alternative anyway. :) | 19:40 |
clarkb | openeuler and gentoo are effectively dead right now because they stopped building and there wasn't sufficient pick up to fix them | 19:40 |
corvus | okay, should i omit those from this email? | 19:41 |
clarkb | so you might treat them differently in your email and note that they don't currently build as is and will need a larger effort for the interested parties if those still exist. Otherwise letting them die is a good thing | 19:41 |
clarkb | the others should all currently have working builds and are more straightforward to port from nodepool to zuul-launcher | 19:41 |
corvus | okay, i'll try for that. we can see how it looks in the etherpad. | 19:41 |
corvus | cool, thx | 19:41 |
clarkb | I'll keep things open until 19:45 and if there is nothing before then I think we can end 15 minutes early | 19:43 |
clarkb | thanks everyone! | 19:45 |
fungi | thanks clarkb! | 19:45 |
clarkb | we'll be back here same time and location next week | 19:45 |
clarkb | #endmeeting | 19:45 |
opendevmeet | Meeting ended Tue Apr 15 19:45:13 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:45 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-04-15-19.00.html | 19:45 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-04-15-19.00.txt | 19:45 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-04-15-19.00.log.html | 19:45 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!