* frickler will be a couple of minutes late, needs a break after the long TC meeting | 18:57 | |
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Nov 19 19:00:04 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
* tonyb is on a train so coverage may be sporadic | 19:00 | |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/BQ2IWL7BUMNUXYWCQV62DZQCF2AI7E5U/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
tonyb | also in an EU timezone for the next few weeks | 19:00 |
clarkb | oh fun | 19:00 |
tonyb | should be! | 19:00 |
clarkb | Next week is a major US holiday week. I plan to be around Monday and Tuesday and will host the weekly meeting. But we should probably expect slower response times from various places | 19:01 |
clarkb | Anything else to announce? | 19:01 |
clarkb | #topic Zuul-launcher image builds | 19:03 |
clarkb | corvus has continued to iterate on the mechanics of uploading images to swift, downloading them to the launcher and reuploading to the clouds | 19:03 |
clarkb | and the good news is the time to build a qcow2 image then shuffle it around is close to, if not better than, the time the current builders do it in | 19:04 |
clarkb | #link https://review.opendev.org/935455 Setup a raw image cloud for raw image testing | 19:04 |
clarkb | qcow2 images are relatively small compared to the raw and vhd images we also deal with so next step is testing this process with the larger image types | 19:04 |
clarkb | There is still opportunity to add image build jobs for other distros and releases as well | 19:04 |
tonyb | it's high on my to-do list | 19:05 |
clarkb | cool | 19:05 |
clarkb | anything else on this topic cc corvus | 19:05 |
clarkb | seems like good slow but steady progress | 19:06 |
clarkb | I'll keep the meeting moving as we have a number of items to get through. We can always swing back to topics if we have time at the end or after the meeting etc | 19:07 |
clarkb | #topic Backup Server Pruning | 19:07 |
clarkb | the smaller backup server got close to filling up its disk again and fungi pruned it again. Thank you for that | 19:07 |
clarkb | but this is a good reminder that we have a couple of changes proposed to help alleviate some of that by purging things from the backup servers once they are no longer needed | 19:07 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/933700 Backup deletions managed through Ansible | 19:08 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/934768 Handle backup verification for purged backups | 19:08 |
clarkb | oh I missed that fungi had +2'd but not approved the first one | 19:08 |
fungi | i wasn't sure if those needed close attention after merging | 19:09 |
clarkb | should we go ahead and either fix the indentation then approve or just approve it? | 19:09 |
fungi | i think we can just approve | 19:09 |
tonyb | I agree, we can do the indentation after if we want | 19:10 |
clarkb | I think the test case on line 34 in https://review.opendev.org/c/opendev/system-config/+/933700/23/testinfra/test_borg_backups.py should ensure that it is fairly safe | 19:10 |
clarkb | then after it is landed we can touch the retired flag in the ethercalc dir, add ethercalc to the list of retirements (to catch up the other server) and check that worked, then add it to the purge list and check that worked | 19:11 |
clarkb | then if that is good we can retire the rest of the items in our retirement list | 19:11 |
clarkb | I think this should make managing disk consumption a bit more sane | 19:12 |
clarkb | anything else related to backups? | 19:12 |
clarkb | actually I shouldn't need to manually touch the retired file | 19:13 |
clarkb | going through the motions to retire it on the other server should fix the one I did manually | 19:13 |
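As a conceptual sketch of the retire/purge flow being discussed (the real logic, flag file name, and paths live in change 933700 and may differ):

```sh
# sketch only: the flag file name and backup paths are assumptions here
for stream in /opt/backups/borg-*; do
  if [ -f "$stream/retired" ]; then
    echo "backup stream $stream is retired and eligible for purging"
    # the destructive step stays commented out in this sketch
    # rm -rf "$stream"
  fi
done
```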
clarkb | #topic Upgrading old servers | 19:13 |
clarkb | tonyb: anything new to report on wiki or other server replacements? | 19:14 |
clarkb | (I did have a note about the docker compose situation on noble but was going to bring that up during the docker compose portion of the agenda) | 19:14 |
tonyb | nothing new. | 19:15 |
tonyb | I guess I discovered that noble is going to be harder than expected due to python being too new for docker-compose v1 | 19:15 |
clarkb | #topic Docker Hub Rate Limits | 19:15 |
clarkb | #undo | 19:15 |
opendevmeet | Removing item from minutes: #topic Docker Hub Rate Limits | 19:15 |
clarkb | ya that's the bit I was going to discuss during the docker compose podman section since they are related | 19:16 |
clarkb | I have an idea for that that may not be terrible | 19:16 |
tonyb | okay that's cool | 19:16 |
clarkb | #topic Docker Hub Rate Limits | 19:16 |
corvus | [i'm here now] | 19:16 |
clarkb | Before we get there another related topic is that people have been noticing we're hitting docker hub rate limits more often | 19:16 |
clarkb | #link https://www.docker.com/blog/november-2024-updated-plans-announcement/ | 19:16 |
tonyb | I don't know why it's only just come up as we have noble servers | 19:16 |
clarkb | tonyb: because mirrors don't use docker-compose I think | 19:17 |
clarkb | basically you've avoided the intersection so far | 19:17 |
clarkb | so that blog post says anonymous requests will get 10 pulls per hour now | 19:17 |
clarkb | which is a reduction from whatever the old value is. However, if I go through the dance of getting an anonymous pull token and inspect that token it says 100 pulls per 6 hours | 19:17 |
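For reference, the "dance" clarkb mentions is Docker's documented rate-limit check: fetch an anonymous token scoped to their test repo and read the ratelimit headers off a manifest request. A minimal version, assuming curl and jq are available:

```sh
# anonymous pull token for the repo docker documents for limit checks
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
  | jq -r .token)
# the manifest endpoint reports the current limit in response headers,
# e.g. "ratelimit-limit: 100;w=21600" (100 pulls per 21600s = 6 hours)
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i '^ratelimit'
```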
corvus | maybe they walked it back due to complaints... | 19:18 |
clarkb | I've also experimentally checked docker image pull against 12 different alpine image tags and about 12 various library images from docker hub | 19:18 |
clarkb | and had no errors | 19:18 |
clarkb | corvus: ya that could be. Another thought I had was maybe they rate limit the library images differently than the normal images but once you hit the limit it fails for all pulls. But kolla/base reported the same 100 pulls per 6 hours limit that the other library images did so I don't think it is that | 19:19 |
frickler | well 100/6h is 16/h, not too far off, just a bit more burst allowed | 19:19 |
clarkb | frickler: ya the burst is important for CI workloads though particularly since we cache things (if you use the proxy cache) | 19:19 |
corvus | one contingency plan would be to mirror the dockerhub images we need on quay.io (or elsewhere); i started on https://review.opendev.org/935574 for that. | 19:19 |
clarkb | planning for contingencies and generally trying to be on the lookout for anything that helps us understand their changes (if any) would be good | 19:20 |
clarkb | but otherwise I now feel like I understand less today than I did yesterday. This doesn't feel like a drop everything emergency but something we should work to understand and then address | 19:20 |
tonyb | assuming that still works with speculative builds that seems like a solid contingency | 19:20 |
clarkb | another suggested improvement was to lean on buildset registries more to stash all the images a buildset will need and not just those we may build locally | 19:21 |
clarkb | this way we're fetching images like mariadb once per buildset instead of N times for each of N jobs using it | 19:21 |
fungi | heck, some jobs themselves may be pulling the same image multiple times for different hosts | 19:22 |
tonyb | true. | 19:22 |
clarkb | so ya be on the lookout for more concrete info and feel free to experiment with making the jobs more image pull efficient | 19:23 |
clarkb | and maybe we'll know more next week | 19:23 |
clarkb | #topic Docker compose plugin with podman service for servers | 19:23 |
clarkb | This agenda item is related to the previous in that it would allow us to migrate off of docker hub for opendev images and preserve our speculative testing of our images | 19:23 |
clarkb | additionally tonyb found that python docker-compose doesn't run on python3.12 on noble | 19:24 |
clarkb | which is another reason to switch to docker compose | 19:24 |
corvus | (in particular, getting python-base and python-builder on quay.io could be a big win for total image pulls) | 19:24 |
clarkb | all of this is coming together to make this effort a bit of a higher priority | 19:24 |
clarkb | I'd like to talk about the noble docker compose thing first | 19:25 |
clarkb | I suspect that in our install docker role we can do something hacky like have it install an alias/symlink/something that maps docker-compose to docker compose and then we won't have to rewrite our playbooks/roles until after everything has left docker-compose behind | 19:25 |
clarkb | the two tools have similar enough command lines that I suspect the only place we would run into trouble is anywhere we parse command output and we might have to check both versions instead | 19:26 |
tonyb | yeah, that's kinda gross but it'd work for now | 19:26 |
clarkb | but this way we don't need to rewrite every role using docker-compose today and as long as we don't do in place upgrade we're replacing the servers anyway and they'll use the proper tool in that transition | 19:26 |
clarkb | I think this wouldn't work if we did an in place switch between docker-compose and docker compose | 19:26 |
clarkb | but as long as it's an old focal server replaced by a new noble server that should mostly work | 19:27 |
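A minimal sketch of that compatibility hack, assuming a wrapper script is an acceptable form of "alias/symlink/something":

```sh
#!/bin/sh
# hypothetical /usr/local/bin/docker-compose shim: forward legacy
# docker-compose invocations to the compose v2 plugin unchanged
exec docker compose "$@"
```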
tonyb | yeah. I guess we can come back and tidy up the roles after the fact | 19:27 |
clarkb | assuming we do that the next question is do we also have install-docker configure docker-compose (which is now really docker compose) to run podman for everything? I think we were leaning that way when we first discussed this | 19:27 |
clarkb | the upside to that is we get the speculative testing and don't have to be on docker hub | 19:28 |
clarkb | tonyb: exactly | 19:28 |
corvus | i think the last time someone talked about parsing output of "docker-compose" i suggested an alternative... like maybe an "inspect" command we could use instead. | 19:28 |
clarkb | corvus: ++ | 19:28 |
corvus | it may actually be simpler to do it once with podman | 19:28 |
clarkb | so to tl;dr all my typing above I think there are two improvements we should make to our docker installation role. A) set it up to alias docker-compose to docker compose somehow and B) configure docker compose to rely on podman as the runtime | 19:29 |
tonyb | well, one place that'd be hard is docker compose pull and looking for updates, but yes, generally avoiding parsing the output is good | 19:29 |
corvus | just because the difference between them is so small, but it's all at install time. so switching from docker to podman later is more work than "docker compose" plugin with podman now. | 19:29 |
clarkb | tonyb: ya I think thats the only place we do it. So we'd inspect, pull, inspect and see if they change or something | 19:29 |
clarkb | corvus: exactly | 19:29 |
corvus | tonyb: yeah, that's the thing i suggested an alternative to. no one took me up on it at the time, but it's in scrollback somewhere. | 19:30 |
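One possible shape for that inspect-based check, with an assumed image name (a sketch, not necessarily the exact alternative corvus proposed):

```sh
# compare image ids before and after the pull instead of parsing output
before=$(docker image inspect --format '{{.Id}}' mariadb:lts 2>/dev/null)
docker compose pull
after=$(docker image inspect --format '{{.Id}}' mariadb:lts 2>/dev/null)
if [ "$before" != "$after" ]; then
  echo "new image pulled, recreating containers"
  docker compose up -d
fi
```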
tonyb | okay | 19:30 |
clarkb | tonyb: I know you brought this up as a need for noble servers. I'm not sure if you are interested in making those changes to the docker installation role | 19:30 |
tonyb | yeah I can | 19:30 |
corvus | i think the only outstanding question for https://review.opendev.org/923084 is how to set up the podman socket path -- i think in a previous meeting we identified a potential easy way of doing that. | 19:31 |
tonyb | I don't know exactly what's needed for the podman enablement | 19:31 |
clarkb | I can probably help though I feel like this week is already swamped and next week is the holiday, but ping me for reviews or ideas etc. I ended up brainstorming this a bit the other day so enough is paged in I think I can be useful | 19:31 |
tonyb | and making it nice but I can take direction | 19:31 |
clarkb | tonyb: 923084 does it in a different context so we have to map that into system-config | 19:31 |
tonyb | noted | 19:31 |
clarkb | and i guess address the question of the socket path | 19:31 |
tonyb | okay | 19:32 |
clarkb | sounds like no one is terribly concerned about these hacks and we should be able to get away from them as soon as we're sufficiently migrated | 19:32 |
clarkb | Anything else on these subjects? | 19:32 |
corvus | oh yeah docker "contexts" is the thing | 19:32 |
corvus | that might make setting the path easy | 19:32 |
tonyb | okay cool | 19:33 |
corvus | #link https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-01-19.00.log.html#l-91 docker contexts | 19:33 |
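For the record, the contexts approach sketched in that log looks roughly like this on a systemd host (assuming podman's default rootful socket path):

```sh
# enable podman's docker-compatible api socket
systemctl enable --now podman.socket
# register it as a named docker cli context and make it the default
docker context create podman --docker "host=unix:///run/podman/podman.sock"
docker context use podman
# docker compose now talks to podman without per-command env vars
```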
tonyb | and this is for noble+ | 19:33 |
clarkb | thanks! | 19:33 |
clarkb | tonyb: yes | 19:33 |
corvus | yep | 19:33 |
tonyb | perfect | 19:33 |
corvus | oh ha | 19:34 |
corvus | clarkb: also you suggested we could set the env var in a "docker-compose" compat tool shim :) | 19:34 |
corvus | (in that meeting) | 19:34 |
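Combined with the shim sketched earlier, that variant might look like this (socket path again assumes rootful podman):

```sh
#!/bin/sh
# shim variant that also pins the podman socket via DOCKER_HOST
export DOCKER_HOST="${DOCKER_HOST:-unix:///run/podman/podman.sock}"
exec docker compose "$@"
```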
tonyb | ( for the record I'm about to get off the train which probably equates to offline) | 19:34 |
clarkb | no wonder when I was thinking about tonyb's problem I was like this is the solution | 19:34 |
tonyb | yeah that could work I guess | 19:35 |
corvus | ¿por que no los dos? | 19:35 |
clarkb | tonyb: ack don't hurt yourself trying to type and walk at the same time | 19:35 |
clarkb | #topic Enabling mailman3 bounce processing | 19:35 |
clarkb | let's keep this show moving forward | 19:35 |
clarkb | last week the lists.opendev.org and lists.zuul-ci.org lists were all set (or already set) to enable bounce processing | 19:35 |
clarkb | fungi: do you know if openstack-discuss got its config updated? | 19:35 |
clarkb | then separately I haven't received any notifications of members hitting the limits for any of the lists I moderate and can't find evidence of anyone with a score higher than 3 (5 is the threshold) | 19:36 |
clarkb | so I'm curious if anyone has seen that in action yet | 19:36 |
fungi | clarkb: i've not done that yet, no | 19:37 |
clarkb | ok it would probably be good to set as I suspect that's where we'll get the most timely feedback | 19:37 |
clarkb | then if we're still happy with the results enabling this by default on new lists is likely to require we define a custom mailman list style and create new lists with that style | 19:37 |
clarkb | the documentation is pretty sparse on how you're actually supposed to create a new style unfortunately | 19:38 |
fungi | i've just now switched "process bounces" to "yes" for openstack-discuss | 19:38 |
clarkb | (there is a rest api endpoint for it but without info on how you set the millions of config options) | 19:38 |
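For anyone picking this up later, the read side of that REST API is simple enough, even if creating a genuinely new style appears to require a plugin. Host, credentials, and list name below are placeholders:

```sh
# list the styles mailman core currently knows about
curl -s -u restadmin:changeme http://localhost:8001/3.1/lists/styles
# list creation accepts a style_name parameter, so a custom style (once
# provided by a plugin) would be applied at creation time like this
curl -s -u restadmin:changeme \
  -d fqdn_listname=example-list@lists.opendev.org \
  -d style_name=legacy-default \
  http://localhost:8001/3.1/lists
```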
clarkb | fungi: thanks! | 19:38 |
clarkb | fungi: you don't happen to know what would be required to set up a new list style do you? | 19:39 |
clarkb | (we probably actually need two, one for private lists and one for public lists) | 19:39 |
fungi | no clue, short of the instructions for making a mailman plugin in python. might be worth one of us asking on the mailman-users ml | 19:39 |
corvus | i think i have an instance of a user where bounce processing is not working | 19:39 |
clarkb | corvus: are they hitting the threshold then not getting removed? | 19:40 |
clarkb | the bounce disable warnings configuration item implies there is some delay that must be reached after being above the threshold before you are removed | 19:42 |
clarkb | I wonder if the 7 day threshold reset is resetting them to 0 before hitting that if so | 19:42 |
corvus | hrm, i got a message about an unprocessed bounce a while back, but the user does not appear to be a member anymore. so this may not be actionable. | 19:42 |
corvus | not sure what happened in the interim. | 19:42 |
clarkb | ack, I guess we monitor for current behavior and debug from there | 19:42 |
clarkb | fungi: ++ that seems like a good idea. | 19:42 |
clarkb | I'm going to keep things moving as we have less than 20 minutes and still several topics to cover | 19:43 |
clarkb | #topic Intermediate Insecure CI Registry Pruning | 19:43 |
corvus | 0x8e | 19:43 |
clarkb | As scheduled/announced we started this on Friday. We hit a few issues. The first was 404s on object delete requests | 19:43 |
clarkb | that was a simple fix: ignore 404 errors when trying to delete something. The other was that we weren't paginating object listings so were capped at 10k objects per listing request | 19:43 |
clarkb | this gave the pruning process an incomplete (and possibly inaccurate) picture of what should be deleted vs kept | 19:44 |
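The pagination fix amounts to standard swift marker-based paging; schematically (endpoint, container name, and auth handling are placeholders):

```sh
# container listings return at most 10000 names per request; keep
# re-requesting with marker=<last name seen> until a page comes back empty
marker=""
while :; do
  page=$(curl -s -H "X-Auth-Token: $TOKEN" \
    "https://swift.example.com/v1/AUTH_registry/blobs?marker=$marker")
  [ -z "$page" ] && break
  printf '%s\n' "$page"
  marker=$(printf '%s\n' "$page" | tail -n 1)  # url-encode in real code
done
```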
corvus | that means we've fixed two bugs that could have caused the previous issues! :) | 19:44 |
clarkb | the process was restarted after fixing the problems and has been running since late friday. We anticipate it will take at least 6 days though I think it is trending slowly to be longer as it goes on | 19:44 |
clarkb | as far as I can tell no one has had problems with the intermediate registry while this is running either which is a good sign we're not over deleting anything | 19:45 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/935542 Enable daily pruning after this bulk prune is complete | 19:45 |
corvus | 0x8e out of 0xff means we're 55% through the blobs. | 19:45 |
clarkb | once this is done we should be able to run regular pruning that doesn't take days to complete since we'll have far fewer objects to contend with | 19:45 |
clarkb | that change enables a cron job to do this. Which is probably good for hygiene purposes but shouldn't be merged until after we complete this manual run and are happy with it | 19:46 |
clarkb | we'll be much more disk efficient as a result \o/ | 19:46 |
clarkb | corvus: napkin math has that taking closer to 8 days now? | 19:46 |
* clarkb looks at the clock and continues on | 19:47 | |
clarkb | #topic Gerrit 3.10 Upgrade Planning | 19:47 |
corvus | are we starting from friday or saturday? | 19:47 |
clarkb | corvus: I think we started at roughly 00:00 UTC saturday | 19:48 |
clarkb | and we're almost at 20:00 UTC tuesday so like 7.75 days? | 19:48 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document | 19:48 |
clarkb | I've announced the upgrade for December 6. I'm hoping that I can focus on getting the last bits of checks and testing done next week before the holiday so that the week after we aren't rushed | 19:49 |
clarkb | If you have time to read the etherpad and call out any additional concerns I would appreciate it. There is a held node too if you are interested in checking out the newer version of gerrit | 19:49 |
clarkb | I'm not too terribly concerned about this though as there is a straightforward rollback path which i have also tested | 19:49 |
clarkb | #topic mirror.sjc3.raxflex.opendev.org cinder volume issues | 19:50 |
clarkb | Yesterday after someone complained about this mirror not working I discovered it couldn't read sector 0 on its cinder volume backing the cache dirs | 19:50 |
clarkb | and then ~5 hours later it seems the server itself shut down (maybe due to kernel panic after being in this state?) | 19:51 |
fungi | nope, services got restarted | 19:51 |
clarkb | in general that shouldn't restart VMs though? I guess maybe if you restart libvirt or something | 19:51 |
fungi | which apparently resulted in a lot of server instances shutting off or rebooting | 19:51 |
fungi | well, that was the explanation we got anyway | 19:51 |
clarkb | anyway klamath in #opendev reports that it should be happier now and that this wasn't intentional so we've stuck the mirror back into service | 19:51 |
clarkb | there are two things to consider though. One is that we are using a volume of type capacity instead of type standard and it is suggested we could change that | 19:52 |
clarkb | the other is if we rebuild our networks we will get bigger mtus | 19:52 |
fungi | full 1500-byte mtus to the server instances, specifically | 19:52 |
clarkb | to rebuild the networks I think the most straightforward option is to simply delete everything we've got and let cloud launcher recreate that stuff for us | 19:52 |
clarkb | doing that likely means deleting the mirror anyway based on mordred's report of not being able to change ports for standard port creation on instance create processes | 19:53 |
clarkb | so long story short we should pick a time where we intentionally stop using this cloud, delete the networks and all servers, rerun cloud launcher to recreate networks, then rebuild our mirror using the new networks and a new cinder volume of type standard | 19:54 |
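Loosely, that sequence might look like the following; every name here is a placeholder and the network recreation itself is the cloud launcher's job:

```sh
# take the region out of service first, then tear down and rebuild
openstack server delete mirror01.sjc3.raxflex.opendev.org
# ...delete the old routers/subnets/networks so the launcher recreates
# them with full 1500-byte mtus, then rerun the cloud launcher playbook;
# finally boot a replacement mirror with a standard-type cache volume
openstack volume create --type standard --size 200 mirror02-cache
```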
fungi | sounds good | 19:54 |
clarkb | that might be a good big exercise for the week after the gerrit upgrade | 19:54 |
clarkb | in theory things will slow down as we near the end of the year making those changes less impactful | 19:54 |
clarkb | we should also check that the cloud launcher is running happily before we do that | 19:55 |
clarkb | (to avoid delay in reconfiguring the networks) | 19:55 |
fungi | yep | 19:55 |
clarkb | #topic Open Discussion | 19:56 |
clarkb | https://review.opendev.org/c/opendev/lodgeit/+/935712 someone noticed problems with captcha rendering in lodgeit today and has already pushed up a fix | 19:56 |
clarkb | we've also been updating our openafs packages and rolling those out with reboots to affected servers | 19:56 |
fungi | #link https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs | 19:57 |
fungi | i'm still planning to do the openinfra.org mailing list migration on december 2, i have the start of a migration plan outline in an etherpad i'll share when i get it a little more fleshed out | 19:58 |
fungi | sending an announcement about it to the foundation ml later today | 19:58 |
clarkb | fungi: is that something we can/should test in the system-config-run mm3 job? | 19:58 |
clarkb | I assume we create a new domain then move the lists over then add the lists to our config? | 19:59 |
clarkb | I'll wait for the etherpad no need to run through the whole process here | 19:59 |
fungi | no, database update query | 19:59 |
clarkb | oh fun | 19:59 |
fungi | mailman core, hyperkitty and postorius all use the django db | 19:59 |
fungi | so can basically just change the domain/host references there and then update our ansible data to match so it doesn't recreate the old lists | 20:00 |
clarkb | and we are at time | 20:00 |
clarkb | fungi: got it | 20:00 |
clarkb | makes sense | 20:00 |
clarkb | thank you everyone! as mentioned we'll be back here next week per usual despite the holiday for several of us | 20:00 |
fungi | thanks clarkb! | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Nov 19 20:00:33 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-19-19.00.log.html | 20:00 |
clarkb | and now time for lunch | 20:02 |