*** jhorstmann is now known as Guest4499 | 05:30 | |
*** jhorstmann is now known as Guest4516 | 08:58 | |
*** jhorstmann is now known as Guest4517 | 09:03 | |
opendevreview | Dr. Jens Harbott proposed openstack/project-config master: Run periodic wheel cache publishing only on master https://review.opendev.org/c/openstack/project-config/+/938316 | 11:06 |
frickler | ^^ that's the only thing I noticed while checking periodic pipeline results, but that's not a new thing and not related to the ze01 issues | 11:07 |
*** jhorstmann is now known as Guest4538 | 14:53 | |
fungi | thanks! | 15:24 |
opendevreview | Merged openstack/project-config master: Run periodic wheel cache publishing only on master https://review.opendev.org/c/openstack/project-config/+/938316 | 15:33 |
clarkb | infra-root not sure what availability is for the next week but https://review.opendev.org/c/opendev/system-config/+/938144 and https://review.opendev.org/c/opendev/system-config/+/937641 are the two next big things on my todo list | 16:09 |
clarkb | the first will upgrade gitea to the latest bugfix release and the second is the change to do podman + docker compose on noble. The main risk with this second change is that if I've got something wrong we could end up using docker compose or podman or both on older platforms already running production services | 16:10 |
clarkb | I expect to be around today, but then not so much the next two days, then likely back again thursday and friday | 16:20 |
clarkb | and if we're less comfortable upgrading systems or modifying their underlying runtime configuration I'll find other things to do. Like maybe record some git for gerrit video or at least start up on that again | 16:20 |
clarkb | I wrote the outline for a script somewhere I should dig up | 16:20 |
* fungi sighs at https://bugs.debian.org/1089501 having been open for weeks now with no confirmation (also i hunted for upstream bug reports and changes in their gerrit, but no dice) | 16:21 | |
clarkb | fungi: are you running 6.12? Looks like my tumbleweed install is about to switch | 16:22 |
fungi | yeah, and i see some linux 6.13 build changes pending in the openafs gerrit but nothing for these 6.12 build issues | 16:23 |
fungi | which leads me to wonder if the issue is specific to debian's toolchain | 16:23 |
clarkb | I don't have openafs installed on this system otherwise I would be finding out momentarily | 16:24 |
fungi | maybe this will light a fire under me to finally try out kafs | 16:24 |
clarkb | could be that the 6.13 patches cover 6.12 too? | 16:24 |
fungi | possible, but it didn't look that way (at least it wasn't touching the code related to the compile errors) | 16:25 |
fungi | clarkb: today seems like a great day for the gitea update and the noble docker compose podman switch (we have very few systems on noble yet so impact should be limited) | 16:39 |
fungi | i'm around all day to approve and monitor them in whichever order we think is best | 16:40 |
Clark[m] | Gitea first is probably best? | 16:45 |
Clark[m] | Sorry eating quick breakfast | 16:45 |
fungi | sure, no problem. i'll approve it now, since it will take some time to clear the gate | 16:46 |
Clark[m] | fungi: ya the main risk is if I added a bug that applies podman docker compose stuff to jammy focal et al | 16:46 |
fungi | in unrelated news, it seems like openstack-discuss somehow reverted to not doing verp probes before disabling subscriptions sometime in the past week or two, i'm trying to get to the bottom of it now | 16:47 |
fungi | looks like the behavior change started sometime between the 20th (the last time i saw an auto-disable that quoted a bounced verp probe message) and the 28th (the first time i started seeing subscriptions disabled without quoting a probe message) | 16:55 |
Clark[m] | Is the config still correct on disk? | 16:59 |
fungi | yeah, untouched since we added that option on the 2nd according to the file last update time, and is still showing up at /opt/mailman/mailman-extra.cfg if i check with a shell in the container | 17:00 |
fungi | the compose file hasn't changed either | 17:00 |
Clark[m] | There were the minor changes to how we run docker and compose to bootstrap the system | 17:01 |
Clark[m] | For the podman docker compose prep. But I thought all of those were for bootstrapping the install | 17:01 |
fungi | well, the mailman containers also don't seem to have been restarted since the 2nd | 17:01 |
Clark[m] | That would corroborate the idea that the changes I made shouldn't affect the running system | 17:04 |
Clark[m] | Is it a per list setting also to do verp probes maybe? | 17:04 |
fungi | oh! this may be a red herring | 17:08 |
fungi | the slew of messages i began receiving on saturday were for auto-unsubscribe, not auto-disable | 17:09 |
fungi | looking at the list options for bounce processing, mailman will send a subscription disabled reminder every 7 days (configurable) and will do so 3 times (also configurable) before unsubscribing | 17:10 |
* fungi does math | 17:10 | |
fungi | 5 days of bouncing, followed by a verp probe that bounces, and then three weeks of reminders before unsubscribing is 26 days, the span of time between the 2nd and the 28th | 17:11 |
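For reference, fungi's arithmetic above works out as follows; a quick sketch in Python, where the December 2024 dates are inferred from the surrounding log timestamps and the 5-day/7-day/3-reminder values come from the settings described earlier:

```python
from datetime import date, timedelta

# fungi's bounce-processing timeline: 5 days of bounces to hit the score
# threshold, a VERP probe that bounces, then 3 "subscription disabled"
# reminders sent 7 days apart before the address is unsubscribed.
bounce_days = 5
reminder_interval = 7   # days between reminders (configurable per list)
reminder_count = 3      # reminders before unsubscribing (configurable per list)

total = bounce_days + reminder_interval * reminder_count
print(total)  # 26 days

start = date(2024, 12, 2)             # "the 2nd" from the log
print(start + timedelta(days=total))  # 2024-12-28, i.e. "the 28th"
```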
fungi | so these are all subscribers i previously received auto-disable notifications about | 17:12 |
clarkb | ok so this is just normal processing but unexpected due to time gaps | 17:13 |
fungi | precisely, move along, nothing to see here, these are not the droids you're looking for | 17:13 |
fungi | though separately i've noticed a bug specific to list moderators, it seems like message held notifications are treated like explicit verp probes, so if a moderator's bounce score is already at the limit and then their mta rejects a message held notification for some random spam (likely if they don't completely skip spam checking for things from the list server), then they can get | 17:15 |
fungi | their subscription disabled | 17:15 |
clarkb | interesting. I always try to put in explicit rules for mailing lists I subscribe to in order to both sort them into the correct location but also to help ensure they don't get flagged as spam | 17:17 |
clarkb | I've even noticed that sometimes gmail and google groups don't get along... | 17:17 |
clarkb | looking at the podman change again I feel like it should be safe. The new behavior is clearly in a file marked with noble in the name and old behavior is in a task file labeled default | 17:38 |
clarkb | I can't help but be paranoid about it given its potential for widespread impact though | 17:38 |
clarkb | but this is like the third time I've looked at it to double check | 17:39 |
fungi | also things are quiet at the moment so even if there is unanticipated disruption it's unlikely to inconvenience anyone besides us | 17:43 |
clarkb | Reminder there will be no meeting tomorrow | 17:45 |
opendevreview | Merged opendev/system-config master: Update gitea to v1.22.6 https://review.opendev.org/c/opendev/system-config/+/938144 | 17:49 |
clarkb | https://gitea09.opendev.org:3081/opendev/system-config/ updated and looks ok at first glance | 17:53 |
fungi | yeah, working for me too | 17:54 |
clarkb | git clone works for me too | 17:56 |
fungi | for some reason i always forget https://github.com/nelhage/reptyr exists when i actually need it | 17:56 |
clarkb | the gerrit cache h2 files continue to be far more reasonable in size | 17:59 |
clarkb | I still think increasing compaction time is probably a good idea but doing a reset like we did seems to be a good halfway step | 18:00 |
clarkb | all 6 gitea backends have updated at this point | 18:00 |
fungi | deploy completed, yep | 18:00 |
clarkb | https://zuul.opendev.org/t/openstack/build/7e98115a9ecd488f9314a88fdfd02d46 the deploy job was a success too | 18:00 |
fungi | should we give it some time to bake? or start in on the noble docker compose change now? | 18:01 |
clarkb | looking at the check job list for the noble docker compose change I suspect that one will take some time to run through the gate. Its probably ok to approve it now? | 18:02 |
fungi | unrelated, the sunday backups for gerrit seem to have been failing consistently for a while (months i think). has it been looked into yet? | 18:02 |
clarkb | I haven't looked but it may have to do with collisions between repacking and backups taking too long or something? | 18:03 |
clarkb | though I think we repack more often than weekly | 18:03 |
fungi | looks like it successfully backed up on october 20, but also failed two sundays prior and every sunday after | 18:05 |
clarkb | and just to one backup server too right? So its specific to the time it seems like | 18:06 |
fungi | yes, just one notification each sunday since the end of september (excepting one sunday in october where it didn't error) | 18:06 |
fungi | at least i'm assuming it was only one backup server, the "action required" e-mail comes at the same time every sunday | 18:07 |
fungi | ~17:46 utc every time | 18:08 |
fungi | that coincides with the time for backup01.ord.rax backups and not backup02.ca-ymq-1.vexxhost | 18:09 |
fungi | and it succeeds consistently on all other days of the week | 18:09 |
fungi | so i agree it's likely hitting a conflict with some weekly scheduled task | 18:09 |
fungi | Failed to create/acquire the lock /opt/backups-202010/borg-review02/backup/lock.exclusive (timeout). | 18:13 |
fungi | i'm not immediately finding what it might be conflicting with | 18:17 |
fungi | aha! | 18:19 |
fungi | 0 0 * * 0 /usr/local/bin/verify-borg-backups >> /var/log/verify-borg-backups.log 2>&1 | 18:19 |
fungi | Sun Dec 29 16:08:08 UTC 2024 Verifying /opt/backups/borg-review02/backup ... | 18:20 |
fungi | Mon Dec 30 01:53:46 UTC 2024 ... done | 18:20 |
fungi | not urgent since we're only missing one day a week, but worth mulling over how we might want to go about it (maybe create a cow snapshot and verify that so we don't need to play duelling lockfiles?) | 18:21 |
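A minimal sketch of the kind of check behind the diagnosis above, assuming the repository layout implied by the pasted lock error (this is not the actual verify-borg-backups script, and the `borg-*/backup` glob is an assumption): borg keeps a `lock.exclusive` entry inside a repository while an operation holds it, so listing which repos are currently locked shows where a backup run and the weekly verification would collide.

```python
#!/usr/bin/env python3
# Sketch only, not OpenDev's real backup tooling: report which borg
# repositories under the backup volume are currently exclusively locked.
# The base path comes from the lock error quoted in the log above.
import pathlib

backup_root = pathlib.Path("/opt/backups-202010")

for repo in sorted(backup_root.glob("borg-*/backup")):
    # borg leaves lock.exclusive in the repo while a backup, prune, or
    # verification pass holds the lock
    state = "LOCKED" if (repo / "lock.exclusive").exists() else "free"
    print(f"{state:6} {repo}")
```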
fungi | for now i'm sufficiently satisfied that we at least have a smoking gun and the behavior is sensible, if somewhat suboptimal | 18:23 |
clarkb | agreed | 18:31 |
clarkb | one option would be to simply reschedule the verification to occur when a less important backup may run (rather than review) | 18:31 |
clarkb | still less than ideal but better than now | 18:31 |
fungi | well, it just loops through all the backups, so when it starts verifying review's depends on how long all the ones before it in the list take to complete | 18:43 |
clarkb | right but it roughly coincides with then review02 backs up to rax ord. If we change the time that the verification occurs then that shouldn't happen? | 18:44 |
fungi | which for the past few months has landed at a time window conflicting with that particular backup | 18:44 |
clarkb | I suspect its a time window of less than half an hour | 18:44 |
clarkb | oh wait, the verification took many hours, I see that now | 18:45 |
fungi | well, the verifications start at midnight utc, it takes about 16 hours to get to review and then nearly 10 hours to do that one | 18:46 |
clarkb | ya that was unexpected but I noticed the timestamps you posted and that is unfortunate | 18:47 |
fungi | and yesterday review's backup tried to run about 1.5-2 hours into the verify for its backups | 18:47 |
clarkb | I bet if we prune the backup server that will make things go quicker | 18:48 |
fungi | probably | 18:48 |
clarkb | since verification is going through each backup and checking them separately iirc. | 18:48 |
clarkb | Pruning will remove extra backups based on our pruning retention rules which in theory should speed up verification | 18:48 |
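For context, the pruning clarkb mentions is borg's retention-based prune. A rough sketch of what such an invocation looks like; the retention values are placeholders, since the actual rules used on the backup servers aren't shown in this log:

```python
#!/usr/bin/env python3
# Illustration only: the shape of a retention-based borg prune. The repo path
# is taken from the log above; the --keep-* values are placeholders.
import subprocess

REPO = "/opt/backups-202010/borg-review02/backup"

subprocess.run(
    [
        "borg", "prune", "--stats",
        "--keep-daily", "7",     # placeholder retention values
        "--keep-weekly", "4",
        "--keep-monthly", "6",
        REPO,
    ],
    check=True,
)
```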
fungi | and i guess that one hasn't been pruned in a very long time due to having more space | 18:49 |
clarkb | yup | 18:49 |
fungi | i can go ahead and do that now | 18:49 |
fungi | in progress now in a root screen session | 18:50 |
fungi | starting at 84% of 3tb used | 18:52 |
fungi | for some reason system-config-run-eavesdrop has been queued an hour for 937641 | 19:10 |
fungi | thankfully that one goes pretty quickly once it actually starts | 19:11 |
Clark[m] | Ya 15 minutes ish to completion now | 19:14 |
fungi | and the hourly sempahore should be released by the time it hits deploy | 19:19 |
fungi | yeah, hourly just completed seconds ago now | 19:19 |
sergem | Hello Team! My name is Sergiy, I would like to ask your advice on the docker hub pull-through registry issue. We already have a Docker2Mirror vhost in our mirror role deployed on our mirror servers. It acts as a proxy to docker hub, but it is just apache and nothing more than that. It seems like we need to run a registry, not just apache, because we still hit limits in our gates in openstack-helm and airship projects | 19:27 |
sergem | There is a role registry in system-config that creates a registry. Is it deployed anywhere we can use it? | 19:27 |
clarkb | the problem isn't specific to apache. The problem appears to be that docker hub has reduced their unauthenticated pull limits | 19:27 |
clarkb | if you used a registry instead of apache you would still theoretically have the same problems | 19:27 |
clarkb | however our apache setup does evict things after 24 hours if they have a longer ttl. I don't know what TTLs docker hub uses. But in either case, a registry vs apache, you still need to fetch things from upstream if they have expired or are not already present | 19:28 |
clarkb | it is for this reason we've largely suggested that people do their best to migrate away from docker hub and/or use authenticated requests if possible | 19:28 |
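As a side note, Docker Hub exposes its pull rate-limit counters through response headers on a special test repository, which is one way to see what limit an unauthenticated client at a given address currently gets. A small sketch using requests, following the procedure from Docker's own documentation (not an OpenDev tool):

```python
#!/usr/bin/env python3
# Query Docker Hub's rate-limit check endpoint and print the pull limit and
# remaining count the registry attributes to this client.
import requests

token = requests.get(
    "https://auth.docker.io/token",
    params={
        "service": "registry.docker.io",
        "scope": "repository:ratelimitpreview/test:pull",
    },
    timeout=30,
).json()["token"]

resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
# Header values look like "100;w=21600", i.e. 100 pulls per 6-hour window.
print("limit:    ", resp.headers.get("ratelimit-limit"))
print("remaining:", resp.headers.get("ratelimit-remaining"))
print("source:   ", resp.headers.get("docker-ratelimit-source"))
```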
clarkb | there was previous discussion about this with kozhukalov I'm trying to find a link to now | 19:29 |
clarkb | https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2024-12-20.log.html here | 19:30 |
sergem | we use quay for our images but we rely on ubuntu base images and they are published on docker hub or aws ecr only with ratelimits... | 19:30 |
sergem | thanks for the link | 19:30 |
fungi | we've started mirroring some of our dependencies to quay as well | 19:31 |
fungi | so that we can pull them from there instead of from dockerhub | 19:31 |
clarkb | https://review.opendev.org/c/zuul/zuul-jobs/+/935574 is where we've gotten for mirroring | 19:31 |
clarkb | nothing is actually mirrored yet but that role and job should provide a framework to do so | 19:31 |
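The linked zuul-jobs change is an Ansible role, but the underlying operation is essentially a registry-to-registry copy. A rough illustration of that idea with skopeo driven from Python; the image names and the quay organization are placeholders, and the real role may work differently:

```python
#!/usr/bin/env python3
# Illustrative mirroring loop: copy selected images from Docker Hub to quay.io.
import subprocess

MIRRORS = {
    "docker.io/library/python:3.12": "quay.io/example-org/python:3.12",
    "docker.io/library/ubuntu:24.04": "quay.io/example-org/ubuntu:24.04",
}

for src, dst in MIRRORS.items():
    # skopeo copies between registries without a local docker daemon;
    # --all preserves every architecture in a multi-arch manifest list.
    subprocess.run(
        ["skopeo", "copy", "--all", f"docker://{src}", f"docker://{dst}"],
        check=True,
    )
```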
fungi | the kolla team is doing something similar for some of their dependencies, i gather, e.g. https://quay.io/repository/openstack.kolla/fluentd | 19:31 |
clarkb | re the registry in system-config that registry runs as the "intermediate" registry for job artifacts | 19:32 |
clarkb | we prune it regularly and it isn't meant to be a trusted source of images. Instead it holds speculative future states | 19:32 |
clarkb | we could in theory run a separate trusted registry that only hosts trusted things, but quay and github and docker hub and google cloud and amazon etc etc are already doing that so relying on them (quay has been popular) seems like a better first step? | 19:33 |
clarkb | unfortunately docker the tooling and the ecosystem were built around this idea of docker hub being a central hub, and that worked well for about a decade, and now it isn't financially feasible? and we're in this bit of a mess | 19:34 |
clarkb | you even have to switch away from using docker the tool if you stop hosting images on docker hub because docker the tool only understands mirrors for docker hub and not for any other registry | 19:34 |
fungi | cue merge notification from opendevreview any moment... ;) | 19:35 |
clarkb | I don't really want to invest a lot of time in making docker hub less painful as we've been burned before and we should just move on | 19:35 |
clarkb | one other thing to note is that docker hub treats an ipv6 /64 as a single ip for rate limiting purposes | 19:35 |
clarkb | you may find your jobs are more reliable if you force communication to docker hub over ipv4 | 19:35 |
clarkb | tonyb was working on a small script / ansible role that could make that happen for jobs | 19:36 |
fungi | (because some of our cloud providers just give all server instances global addresses from a single shared /64 allocation) | 19:36 |
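One simple way to force IPv4 for Docker Hub traffic, shown here as an illustration only (tonyb's role mentioned above may take a different approach, and the hostname list is an assumption), is to pin the Hub hostnames to their A records in /etc/hosts on the test node:

```python
#!/usr/bin/env python3
# Resolve only IPv4 addresses for Docker Hub endpoints and emit /etc/hosts
# entries, so pulls don't originate from a shared provider /64.
import socket

HUB_HOSTS = [
    "registry-1.docker.io",
    "auth.docker.io",
    "production.cloudflare.docker.com",  # CDN host that serves image layers
]

for host in HUB_HOSTS:
    # AF_INET restricts the lookup to A records, i.e. IPv4 only.
    addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(host, 443, socket.AF_INET)})
    for addr in addrs:
        print(f"{addr} {host}")  # append this output to /etc/hosts on the node
```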
sergem | thanks guys, reading... | 19:38 |
fungi | but yeah, longer-term it seems like trying to avoid using any of docker's/moby's tools and services is the most sensible path forward, painful as such transitions may be. they seem to have decided to burn their community to the ground and walk away | 19:38 |
opendevreview | Merged opendev/system-config master: Run containers on Noble with docker compose and podman https://review.opendev.org/c/opendev/system-config/+/937641 | 19:40 |
fungi | and there's ^ some of our own related transition work | 19:40 |
clarkb | fungi: that is going to run all the infra-prod deploy jobs | 19:41 |
clarkb | we should probably check the first few don't do anything unexpected | 19:41 |
fungi | yes, i expect it to be a few hours | 19:41 |
fungi | but have the in-progress job results up in front of me while i multitask | 19:41 |
clarkb | codesearch seems to have taken the right path in /var/log/ansible/service-codesearch.yaml.log on bridge | 19:42 |
clarkb | included: /home/zuul/src/opendev.org/opendev/system-config/playbooks/roles/install-docker/tasks/default.yaml for codesearch01.opendev.org | 19:42 |
clarkb | and the container is still running and dockerd is running too so that lgtm on codesearch | 19:43 |
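The manual checks clarkb describes amount to confirming which container runtime a host is actually using after the deploy and that its containers stayed up. A quick sketch of that kind of spot check (not an OpenDev tool; just one way to do it):

```python
#!/usr/bin/env python3
# Report whether docker or podman answers on this host, then list the running
# containers and their status.
import shutil
import subprocess

def runtime_in_use() -> str:
    """Return 'docker' if dockerd answers, 'podman' if podman does, else 'none'."""
    for cmd in ("docker", "podman"):
        if shutil.which(cmd) and subprocess.run(
            [cmd, "info"], capture_output=True
        ).returncode == 0:
            return cmd
    return "none"

rt = runtime_in_use()
print("runtime:", rt)
if rt != "none":
    subprocess.run([rt, "ps", "--format", "{{.Names}}\t{{.Status}}"], check=True)
```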
fungi | agreed | 19:44 |
fungi | eavesdrop is almost done as well and etherpad should be quick. gitea will take a while though | 19:44 |
fungi | what do we have running on noble at this point besides some (possibly not in service yet) mirror servers? | 19:45 |
clarkb | gitea should be quicker than an upgrade but ya still touching individual servers one after another | 19:45 |
fungi | oh, and we don't container anything on the mirrors anyway | 19:46 |
fungi | so yeah in theory i think all these deploy jobs should be running on non-noble servers, and hopefully all skip the new logic | 19:46 |
clarkb | yup exactly | 19:46 |
fungi | in which case the only expected behavior difference will be when we start creating replacement servers in the future, which was blocked on python docker-compose not working on noble. okay now i get it | 19:47 |
clarkb | I've checked codesearch, eavesdrop and etherpad and they look good to me so far | 19:47 |
clarkb | fungi: ya the concern was that if we accidentally changed existing behavior we could get problems. But the intention is that this will be a noop for everything running today | 19:48 |
clarkb | so far behavior appears to be consistent with what we expect (no new behaviors; old behaviors are loaded via the new default.yaml code path) | 19:48 |
fungi | right. perfect | 19:48 |
clarkb | spot checked gitea09 since it is done and looks good too. That implies the other giteas should be good as well | 19:50 |
clarkb | fungi: the problem we want to avoid is any unintended copy pasta failures from main.yaml to default.yaml or potentially applying podman things to existing older servers and so far I see no evidence of this type of problem | 19:54 |
clarkb | gitea-lb looks good too. | 19:56 |
clarkb | and grafana too so ya I think this is happy. Its about lunch time here. But I can do more spot checks after I eat | 19:58 |
fungi | enjoy, i'll keep an eye on the deploy as it progresses | 20:02 |
tonyb | Nice work! thank you for merging that one. that'll free up my ansible-devel stuff | 20:15 |
clarkb | I spot checked some of the zuul services and gerrit and they all look good too | 20:32 |
clarkb | looks like all of the deploy jobs were successful | 20:33 |
tonyb | Nice | 20:33 |
fungi | yes, completed without issue | 20:37 |
*** jhorstmann is now known as Guest4561 | 21:21 | |
fungi | borg prune has reclaimed about 10% of the volume on backup01.ord.rax so far, not sure how long it's going to take to complete though | 21:55 |
*** jhorstmann is now known as Guest4564 | 22:00 | |
*** jhorstmann is now known as Guest4567 | 22:53 | |
*** jhorstmann is now known as Guest4570 | 23:50 |