Monday, 2024-12-30

*** jhorstmann is now known as Guest4499  [05:30]
*** jhorstmann is now known as Guest4516  [08:58]
*** jhorstmann is now known as Guest4517  [09:03]
<opendevreview> Dr. Jens Harbott proposed openstack/project-config master: Run periodic wheel cache publishing only on master  https://review.opendev.org/c/openstack/project-config/+/938316  [11:06]
<frickler> ^^ that's the only thing I noticed while checking periodic pipeline results, but that's not a new thing and not related to the ze01 issues  [11:07]
*** jhorstmann is now known as Guest4538  [14:53]
<fungi> thanks!  [15:24]
<opendevreview> Merged openstack/project-config master: Run periodic wheel cache publishing only on master  https://review.opendev.org/c/openstack/project-config/+/938316  [15:33]
<clarkb> infra-root not sure what availability is for the next week but https://review.opendev.org/c/opendev/system-config/+/938144 and https://review.opendev.org/c/opendev/system-config/+/937641 are the two next big things on my todo list  [16:09]
<clarkb> the first will upgrade gitea to the latest bugfix release and the second is the change to do podman + docker compose on noble. The main risk with the second change is that if I've got something wrong, we try to use docker compose or podman or both on older platforms already running production services  [16:10]
<clarkb> I expect to be around today, but then not so much the next two days, then likely back again thursday and friday  [16:20]
<clarkb> and if we're less comfortable upgrading systems or modifying their underlying runtime configuration I'll find other things to do. Like maybe record some git for gerrit video or at least start on that again  [16:20]
<clarkb> I wrote the outline for a script somewhere I should dig up  [16:20]
* fungi sighs at https://bugs.debian.org/1089501 having been open for weeks now with no confirmation (also i hunted for upstream bug reports and changes in their gerrit, but no dice)  [16:21]
<clarkb> fungi: are you running 6.12? Looks like my tumbleweed install is about to switch  [16:22]
<fungi> yeah, and i see some linux 6.13 build changes pending in the openafs gerrit but nothing for these 6.12 build issues  [16:23]
<fungi> which leads me to wonder if the issue is specific to debian's toolchain  [16:23]
<clarkb> I don't have openafs installed on this system otherwise I would be finding out momentarily  [16:24]
<fungi> maybe this will light a fire under me to finally try out kafs  [16:24]
<clarkb> could be that the 6.13 patches cover 6.12 too?  [16:24]
<fungi> possible, but it didn't look that way (at least it wasn't touching the code related to the compile errors)  [16:25]
<fungi> clarkb: today seems like a great day for the gitea update and the noble docker compose podman switch (we have very few systems on noble yet so impact should be limited)  [16:39]
<fungi> i'm around all day to approve and monitor them in whichever order we think is best  [16:40]
<Clark[m]> Gitea first is probably best?  [16:45]
<Clark[m]> Sorry eating quick breakfast  [16:45]
<fungi> sure, no problem. i'll approve it now, since it will take some time to clear the gate  [16:46]
<Clark[m]> fungi: ya the main risk is if I added a bug that applies podman docker compose stuff to jammy focal et al  [16:46]
<fungi> in unrelated news, it seems like openstack-discuss somehow reverted to not doing verp probes before disabling subscriptions sometime in the past week or two, i'm trying to get to the bottom of it now  [16:47]
<fungi> looks like the behavior change started sometime between the 20th (the last time i saw an auto-disable that quoted a bounced verp probe message) and the 28th (the first time i started seeing subscriptions disabled without quoting a probe message)  [16:55]
<Clark[m]> Is the config still correct on disk?  [16:59]
<fungi> yeah, untouched since we added that option on the 2nd according to the file's last update time, and it's still showing up at /opt/mailman/mailman-extra.cfg if i check with a shell in the container  [17:00]
<fungi> the compose file hasn't changed either  [17:00]
<Clark[m]> There were the minor changes to how we run docker and compose to bootstrap the system  [17:01]
<Clark[m]> For the podman docker compose prep. But I thought all of those were for bootstrapping the install  [17:01]
<fungi> well, the mailman containers also don't seem to have been restarted since the 2nd  [17:01]
<Clark[m]> That would corroborate the idea that the changes I made shouldn't affect the running system  [17:04]
<Clark[m]> Is doing verp probes maybe also a per-list setting?  [17:04]
<fungi> oh! this may be a red herring  [17:08]
<fungi> the slew of messages i began receiving on saturday were for auto-unsubscribe, not auto-disable  [17:09]
<fungi> looking at the list options for bounce processing, mailman will send a subscription disabled reminder every 7 days (configurable) and will do so 3 times (also configurable) before unsubscribing  [17:10]
* fungi does math  [17:10]
<fungi> 5 days of bouncing, followed by a verp probe that bounces, and then three weeks of reminders before unsubscribing is 26 days, the span of time between the 2nd and the 28th  [17:11]
<fungi> so these are all subscribers i previously received auto-disable notifications about  [17:12]
<clarkb> ok so this is just normal processing, but unexpected due to time gaps  [17:13]
<fungi> precisely, move along, nothing to see here, these are not the droids you're looking for  [17:13]
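The 26-day figure fungi arrives at follows directly from the bounce-processing settings described above; a trivial shell sanity check of that arithmetic:

    # 5 days of regular bounces plus a failed verp probe, then 3 weekly
    # "subscription disabled" reminders before the final unsubscribe
    echo $(( 5 + 3 * 7 ))   # prints 26, matching the gap from the 2nd to the 28th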
<fungi> though separately i've noticed a bug specific to list moderators, it seems like message held notifications are treated like explicit verp probes, so if a moderator's bounce score is already at the limit and then their mta rejects a message held notification for some random spam (likely if they don't completely skip spam checking for things from the list server), then they can get their subscription disabled  [17:15]
<clarkb> interesting. I always try to put in explicit rules for mailing lists I subscribe to, in order to both sort them into the correct location and to help ensure they don't get flagged as spam  [17:17]
<clarkb> I've even noticed that sometimes gmail and google groups don't get along...  [17:17]
<clarkb> looking at the podman change again I feel like it should be safe. The new behavior is clearly in a file marked with noble in the name and the old behavior is in a task file labeled default  [17:38]
<clarkb> I can't help but be paranoid about it given its potential for widespread impact though  [17:38]
<clarkb> but this is like the third time I've looked at it to double check  [17:39]
<fungi> also things are quiet at the moment so even if there is unanticipated disruption it's unlikely to inconvenience anyone besides us  [17:43]
<clarkb> Reminder there will be no meeting tomorrow  [17:45]
<opendevreview> Merged opendev/system-config master: Update gitea to v1.22.6  https://review.opendev.org/c/opendev/system-config/+/938144  [17:49]
<clarkb> https://gitea09.opendev.org:3081/opendev/system-config/ updated and looks ok at first glance  [17:53]
<fungi> yeah, working for me too  [17:54]
<clarkb> git clone works for me too  [17:56]
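A minimal version of that smoke test, hitting the upgraded gitea09 backend directly rather than the load-balanced opendev.org frontend (the target directory is just an illustrative scratch path):

    # clone straight from the freshly upgraded backend as a quick sanity check
    git clone https://gitea09.opendev.org:3081/opendev/system-config /tmp/system-config-smoke-test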
<fungi> for some reason i always forget https://github.com/nelhage/reptyr exists when i actually need it  [17:56]
<clarkb> the gerrit cache h2 files continue to be far more reasonable in size  [17:59]
<clarkb> I still think increasing compaction time is probably a good idea but doing a reset like we did seems to be a good halfway step  [18:00]
<clarkb> all 6 gitea backends have updated at this point  [18:00]
<fungi> deploy completed, yep  [18:00]
<clarkb> https://zuul.opendev.org/t/openstack/build/7e98115a9ecd488f9314a88fdfd02d46 the deploy job was a success too  [18:00]
<fungi> should we give it some time to bake? or start in on the noble docker compose change now?  [18:01]
<clarkb> looking at the check job list for the noble docker compose change I suspect that one will take some time to run through the gate. It's probably ok to approve it now?  [18:02]
<fungi> unrelated, the sunday backups for gerrit seem to have been failing consistently for a while (months i think). has it been looked into yet?  [18:02]
<clarkb> I haven't looked but it may have to do with collisions between repacking and backups taking too long or something?  [18:03]
<clarkb> though I think we repack more often than weekly  [18:03]
<fungi> looks like it successfully backed up on october 20, but also failed two sundays prior and every sunday after  [18:05]
<clarkb> and just to one backup server too right? So it's specific to the time it seems like  [18:06]
<fungi> yes, just one notification each sunday since the end of september (excepting one sunday in october where it didn't error)  [18:06]
<fungi> at least i'm assuming it was only one backup server, the "action required" e-mail comes at the same time every sunday  [18:07]
<fungi> ~17:46 utc every time  [18:08]
<fungi> that coincides with the time for backup01.ord.rax backups and not backup02.ca-ymq-1.vexxhost  [18:09]
<fungi> and it succeeds consistently on all other days of the week  [18:09]
<fungi> so i agree it's likely hitting a conflict with some weekly scheduled task  [18:09]
<fungi> Failed to create/acquire the lock /opt/backups-202010/borg-review02/backup/lock.exclusive (timeout).  [18:13]
<fungi> i'm not immediately finding what it might be conflicting with  [18:17]
<fungi> aha!  [18:19]
<fungi> 0 0 * * 0 /usr/local/bin/verify-borg-backups >> /var/log/verify-borg-backups.log 2>&1  [18:19]
<fungi> Sun Dec 29 16:08:08 UTC 2024 Verifying /opt/backups/borg-review02/backup ...  [18:20]
<fungi> Mon Dec 30 01:53:46 UTC 2024 ... done  [18:20]
<fungi> not urgent since we're only missing one day a week, but worth mulling over how we might want to go about it (maybe create a cow snapshot and verify that so we don't need to play duelling lockfiles?)  [18:21]
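A rough sketch of that copy-on-write snapshot idea, assuming the backup repos live on an LVM volume; the volume group, logical volume, and mount point names are placeholders, not what the backup servers actually use:

    # snapshot the backup volume so verification runs against a frozen copy
    # and never contends for the live repository's lock.exclusive
    lvcreate --snapshot --size 50G --name backups-verify /dev/backupvg/backups
    mkdir -p /mnt/backups-verify
    mount /dev/backupvg/backups-verify /mnt/backups-verify
    borg check /mnt/backups-verify/borg-review02/backup
    umount /mnt/backups-verify
    lvremove -f /dev/backupvg/backups-verify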
<fungi> for now i'm sufficiently satisfied that we at least have a smoking gun and the behavior is sensible, if somewhat suboptimal  [18:23]
<clarkb> agreed  [18:31]
<clarkb> one option would be to simply reschedule the verification to occur when a less important backup may run (rather than review)  [18:31]
<clarkb> still less than ideal but better than now  [18:31]
<fungi> well, it just loops through all the backups, so when it starts verifying review's depends on how long all the ones before it in the list take to complete  [18:43]
<clarkb> right but it roughly coincides with when review02 backs up to rax ord. If we change the time that the verification occurs then that shouldn't happen?  [18:44]
<fungi> which for the past few months has landed at a time window conflicting with that particular backup  [18:44]
<clarkb> I suspect it's a time window of less than half an hour  [18:44]
<clarkb> oh wait, the verification took many hours, I see that now  [18:45]
<fungi> well, the verifications start at midnight utc, it takes about 16 hours to get to review and then nearly 10 hours to do that one  [18:46]
<clarkb> ya that was unexpected but I noticed the timestamps you posted and that is unfortunate  [18:47]
<fungi> and yesterday review's backup tried to run about 1.5-2 hours into the verify for its backups  [18:47]
<clarkb> I bet if we prune the backup server that will make things go quicker  [18:48]
<fungi> probably  [18:48]
<clarkb> since verification is going through each backup and checking them separately iirc.  [18:48]
<clarkb> Pruning will remove extra backups based on our pruning retention rules, which in theory should speed up verification  [18:48]
<fungi> and i guess that one hasn't been pruned in a very long time due to having more space  [18:49]
<clarkb> yup  [18:49]
<fungi> i can go ahead and do that now  [18:49]
<fungi> in progress now in a root screen session  [18:50]
<fungi> starting at 84% of 3tb used  [18:52]
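For reference, a prune along the lines of what fungi kicked off would look roughly like the following; the retention counts shown are illustrative placeholders, not necessarily the values OpenDev's prune script actually uses:

    # drop archives that fall outside the retention policy
    borg prune --stats --keep-daily 7 --keep-weekly 4 --keep-monthly 12 \
        /opt/backups/borg-review02/backup
    # borg >= 1.2 defers space reclamation to a separate compact step
    borg compact /opt/backups/borg-review02/backup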
<fungi> for some reason system-config-run-eavesdrop has been queued an hour for 937641  [19:10]
<fungi> thankfully that one goes pretty quickly once it actually starts  [19:11]
<Clark[m]> Ya 15 minutes ish to completion now  [19:14]
<fungi> and the hourly semaphore should be released by the time it hits deploy  [19:19]
<fungi> yeah, hourly just completed seconds ago now  [19:19]
<sergem> Hello Team! My name is Sergiy, I would like to ask your advice on the docker hub pull-through registry issue. We already have a Docker2Mirror vhost in our mirror role deployed on our mirror servers. It acts as a proxy to docker hub, but it is just apache and not anything more than that. It seems like we need to run a registry, not just apache, because we still hit limits in our gates in the openstack-helm and airship projects  [19:27]
<sergem> There is a registry role in system-config that creates a registry. Is it deployed anywhere we can use it?  [19:27]
<clarkb> the problem isn't specific to apache. The problem appears to be that docker hub has reduced their unauthenticated pull limits  [19:27]
<clarkb> if you used a registry instead of apache you would still theoretically have the same problems  [19:27]
<clarkb> however our apache setup does evict things after 24 hours if they have a longer ttl. I don't know what TTLs docker hub uses. But in either case, registry or apache, you still need to fetch things from upstream if they have expired or are not already present  [19:28]
<clarkb> it is for this reason we've largely suggested that people do their best to migrate away from docker hub and/or use authenticated requests if possible  [19:28]
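If authenticated requests are an option, pulls then count against a per-account quota instead of the shared anonymous per-address limit; a minimal sketch, where the credential environment variables are hypothetical secrets injected by the job rather than anything OpenDev provides:

    # log in once per job node, then pull as usual
    echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
    docker pull docker.io/library/ubuntu:24.04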
<clarkb> there was previous discussion about this with kozhukalov, I'm trying to find a link to it now  [19:29]
<clarkb> https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2024-12-20.log.html here  [19:30]
<sergem> we use quay for our images but we rely on ubuntu base images and they are published on docker hub or aws ecr only, with ratelimits...  [19:30]
<sergem> thanks for the link  [19:30]
<fungi> we've started mirroring some of our dependencies to quay as well  [19:31]
<fungi> so that we can pull them from there instead of from dockerhub  [19:31]
<clarkb> https://review.opendev.org/c/zuul/zuul-jobs/+/935574 is how far we've gotten with mirroring  [19:31]
<clarkb> nothing is actually mirrored yet but that role and job should provide a framework to do so  [19:31]
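At its core, a mirroring job built on that framework would do something like the following copy; the quay organization and the tag are placeholder names for illustration:

    # copy the upstream image (including all architectures in its manifest
    # list) from docker hub to a quay repository we control
    skopeo copy --all \
        docker://docker.io/library/ubuntu:24.04 \
        docker://quay.io/exampleorg/ubuntu:24.04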
<fungi> the kolla team is doing something similar for some of their dependencies, i gather, e.g. https://quay.io/repository/openstack.kolla/fluentd  [19:31]
<clarkb> re the registry in system-config: that registry runs as the "intermediate" registry for job artifacts  [19:32]
<clarkb> we prune it regularly and it isn't meant to be a trusted source of images. Instead it holds speculative future states  [19:32]
<clarkb> we could in theory run a separate trusted registry that only hosts trusted things, but quay and github and docker hub and google cloud and amazon etc etc are already doing that, so relying on them (quay has been popular) seems like a better first step?  [19:33]
<clarkb> unfortunately docker the tool and the ecosystem were built around this idea of docker hub being a central hub, and that worked well for about a decade and now it isn't financially feasible? and we're in this bit of a mess  [19:34]
<clarkb> you even have to switch away from using docker the tool if you stop hosting images on docker hub, because docker the tool only understands mirrors for docker hub and not for any other registry  [19:34]
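That limitation shows up in dockerd's own configuration: the registry-mirrors option only ever applies to pulls from Docker Hub, with no equivalent knob for quay.io or other registries. A sketch of wiring a pull-through proxy in that way, with the mirror hostname and port as placeholders:

    # /etc/docker/daemon.json - registry-mirrors affects docker hub pulls only
    printf '{\n  "registry-mirrors": ["https://mirror.example.opendev.org:8082"]\n}\n' \
        | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker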
<fungi> cue merge notification from opendevreview any moment... ;)  [19:35]
<clarkb> I don't really want to spend a lot of time invested in making docker hub less painful as we've been burned before and we should just move on  [19:35]
<clarkb> one other thing to note is that docker hub treats an ipv6 /64 as a single ip for rate limiting purposes  [19:35]
<clarkb> you may find your jobs are more reliable if you force communication to docker hub over ipv4  [19:35]
<clarkb> tonyb was working on a small script / ansible role that could make that happen for jobs  [19:36]
<fungi> (because some of our cloud providers just give all server instances global addresses from a single shared /64 allocation)  [19:36]
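One simple way a job could force IPv4 toward Docker Hub is to pin the registry endpoints' A records in /etc/hosts before pulling; this is only a sketch of the general idea, not tonyb's actual script, and the endpoint list is an assumption:

    # resolve only IPv4 addresses for docker hub endpoints and pin them,
    # so pulls stop sharing the provider-wide ipv6 /64 rate limit bucket
    for host in registry-1.docker.io auth.docker.io production.cloudflare.docker.com; do
        addr=$(dig +short A "$host" | grep -m1 -E '^[0-9.]+$')
        [ -n "$addr" ] && echo "$addr $host" | sudo tee -a /etc/hosts
    done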
<sergem> thanks guys, reading...  [19:38]
<fungi> but yeah, longer-term it seems like trying to avoid using any of docker's/moby's tools and services is the most sensible path forward, painful as such transitions may be. they seem to have decided to burn their community to the ground and walk away  [19:38]
<opendevreview> Merged opendev/system-config master: Run containers on Noble with docker compose and podman  https://review.opendev.org/c/opendev/system-config/+/937641  [19:40]
<fungi> and there's ^ some of our own related transition work  [19:40]
<clarkb> fungi: that is going to run all the infra-prod deploy jobs  [19:41]
<clarkb> we should probably check the first few don't do anything unexpected  [19:41]
<fungi> yes, i expect it to be a few hours  [19:41]
<fungi> but i have the in-progress job results up in front of me while i multitask  [19:41]
<clarkb> codesearch seems to have taken the right path in /var/log/ansible/service-codesearch.yaml.log on bridge  [19:42]
<clarkb> included: /home/zuul/src/opendev.org/opendev/system-config/playbooks/roles/install-docker/tasks/default.yaml for codesearch01.opendev.org  [19:42]
<clarkb> and the container is still running and dockerd is running too so that lgtm on codesearch  [19:43]
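The same spot check can be repeated as each service's deploy job finishes; a sketch of what that looks like from bridge and from the service host (the log file name varies per service):

    # on bridge: confirm which platform-specific task file ansible included
    grep 'install-docker/tasks/' /var/log/ansible/service-codesearch.yaml.log
    # on the service host: confirm the containers and the docker daemon are still up
    sudo docker ps
    systemctl is-active docker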
<fungi> agreed  [19:44]
<fungi> eavesdrop is almost done as well and etherpad should be quick. gitea will take a while though  [19:44]
<fungi> what do we have running on noble at this point besides some (possibly not in service yet) mirror servers?  [19:45]
<clarkb> gitea should be quicker than an upgrade but ya still touching individual servers one after another  [19:45]
<fungi> oh, and we don't run containers on the mirrors anyway  [19:46]
<fungi> so yeah in theory i think all these deploy jobs should be running on non-noble servers, and hopefully all skip the new logic  [19:46]
<clarkb> yup exactly  [19:46]
<fungi> in which case the only expected behavior difference will be when we start creating replacement servers in the future, which was blocked on python docker-compose not working on noble. okay now i get it  [19:47]
<clarkb> I've checked codesearch, eavesdrop and etherpad and they look good to me so far  [19:47]
<clarkb> fungi: ya the concern was that if we accidentally changed existing behavior we could get problems. But the intention is that this will be a noop for everything running today  [19:48]
<clarkb> so far behavior appears to be consistent with what we expect (no new behaviors; old behaviors are loaded via the new default.yaml code path)  [19:48]
<fungi> right. perfect  [19:48]
<clarkb> spot checked gitea09 since it is done and looks good too. That implies the other giteas should be good as well  [19:50]
<clarkb> fungi: the problem we want to avoid is any unintended copy/paste failures from main.yaml to default.yaml or potentially applying podman things to existing older servers, and so far I see no evidence of this type of problem  [19:54]
<clarkb> gitea-lb looks good too.  [19:56]
<clarkb> and grafana too so ya I think this is happy. It's about lunch time here. But I can do more spot checks after I eat  [19:58]
<fungi> enjoy, i'll keep an eye on the deploy as it progresses  [20:02]
<tonyb> Nice work!  thank you for merging that one.  that'll free up my ansible-devel stuff  [20:15]
<clarkb> I spot checked some of the zuul services and gerrit and they all look good too  [20:32]
<clarkb> looks like all of the deploy jobs were successful  [20:33]
<tonyb> Nice  [20:33]
<fungi> yes, completed without issue  [20:37]
*** jhorstmann is now known as Guest4561  [21:21]
<fungi> borg prune has reclaimed about 10% of the volume on backup01.ord.rax so far, not sure how long it's going to take to complete though  [21:55]
*** jhorstmann is now known as Guest4564  [22:00]
*** jhorstmann is now known as Guest4567  [22:53]
*** jhorstmann is now known as Guest4570  [23:50]
