Monday, 2024-12-30

*** jhorstmann is now known as Guest4499  [05:30]
*** jhorstmann is now known as Guest4516  [08:58]
*** jhorstmann is now known as Guest4517  [09:03]
<opendevreview> Dr. Jens Harbott proposed openstack/project-config master: Run periodic wheel cache publishing only on master  https://review.opendev.org/c/openstack/project-config/+/938316  [11:06]
<frickler> ^^ that's the only thing I noticed while checking periodic pipeline results, but that's not a new thing and not related to the ze01 issues  [11:07]
*** jhorstmann is now known as Guest4538  [14:53]
<fungi> thanks!  [15:24]
<opendevreview> Merged openstack/project-config master: Run periodic wheel cache publishing only on master  https://review.opendev.org/c/openstack/project-config/+/938316  [15:33]
<clarkb> infra-root not sure what availability is for the next week but https://review.opendev.org/c/opendev/system-config/+/938144 and https://review.opendev.org/c/opendev/system-config/+/937641 are the two next big things on my todo list  [16:09]
<clarkb> the first will upgrade gitea to the latest bugfix release and the second is the change to do podman + docker compose on noble. The main risk with the second change is that if I've got something wrong, we try to use docker compose or podman or both on older platforms already running production services  [16:10]
<clarkb> I expect to be around today, but then not so much the next two days, then likely back again thursday and friday  [16:20]
<clarkb> and if we're less comfortable upgrading systems or modifying their underlying runtime configuration I'll find other things to do. Like maybe record some git for gerrit video or at least start on that again  [16:20]
<clarkb> I wrote the outline for a script somewhere I should dig up  [16:20]
* fungi sighs at https://bugs.debian.org/1089501 having been open for weeks now with no confirmation (also i hunted for upstream bug reports and changes in their gerrit, but no dice)  [16:21]
<clarkb> fungi: are you running 6.12? Looks like my tumbleweed install is about to switch  [16:22]
<fungi> yeah, and i see some linux 6.13 build changes pending in the openafs gerrit but nothing for these 6.12 build issues  [16:23]
<fungi> which leads me to wonder if the issue is specific to debian's toolchain  [16:23]
<clarkb> I don't have openafs installed on this system otherwise I would be finding out momentarily  [16:24]
<fungi> maybe this will light a fire under me to finally try out kafs  [16:24]
<clarkb> could be that the 6.13 patches cover 6.12 too?  [16:24]
<fungi> possible, but it didn't look that way (at least it wasn't touching the code related to the compile errors)  [16:25]
<fungi> clarkb: today seems like a great day for the gitea update and the noble docker compose podman switch (we have very few systems on noble yet so impact should be limited)  [16:39]
<fungi> i'm around all day to approve and monitor them in whichever order we think is best  [16:40]
<Clark[m]> Gitea first is probably best?  [16:45]
<Clark[m]> Sorry eating quick breakfast  [16:45]
<fungi> sure, no problem. i'll approve it now, since it will take some time to clear the gate  [16:46]
<Clark[m]> fungi: ya the main risk is if I added a bug that applies podman docker compose stuff to jammy focal et al  [16:46]
<fungi> in unrelated news, it seems like openstack-discuss somehow reverted to not doing verp probes before disabling subscriptions sometime in the past week or two, i'm trying to get to the bottom of it now  [16:47]
<fungi> looks like the behavior change started sometime between the 20th (the last time i saw an auto-disable that quoted a bounced verp probe message) and the 28th (the first time i started seeing subscriptions disabled without quoting a probe message)  [16:55]
<Clark[m]> Is the config still correct on disk?  [16:59]
<fungi> yeah, untouched since we added that option on the 2nd according to the file's last update time, and it's still showing up at /opt/mailman/mailman-extra.cfg if i check with a shell in the container  [17:00]
<fungi> the compose file hasn't changed either  [17:00]
<Clark[m]> There were the minor changes to how we run docker and compose to bootstrap the system  [17:01]
<Clark[m]> For the podman docker compose prep. But I thought all of those were for bootstrapping the install  [17:01]
<fungi> well, the mailman containers also don't seem to have been restarted since the 2nd  [17:01]
<Clark[m]> That would corroborate the idea that the changes I made shouldn't affect the running system  [17:04]
<Clark[m]> Is doing verp probes maybe also a per-list setting?  [17:04]
<fungi> oh! this may be a red herring  [17:08]
<fungi> the slew of messages i began receiving on saturday were for auto-unsubscribe, not auto-disable  [17:09]
<fungi> looking at the list options for bounce processing, mailman will send a subscription disabled reminder every 7 days (configurable) and will do so 3 times (also configurable) before unsubscribing  [17:10]
* fungi does math  [17:10]
<fungi> 5 days of bouncing, followed by a verp probe that bounces, and then three weeks of reminders before unsubscribing is 26 days, the span of time between the 2nd and the 28th  [17:11]
<fungi> so these are all subscribers i previously received auto-disable notifications about  [17:12]
<clarkb> ok so this is just normal processing, but unexpected due to time gaps  [17:13]
<fungi> precisely, move along, nothing to see here, these are not the droids you're looking for  [17:13]
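The 26-day figure fungi arrives at follows directly from the bounce-processing settings described above; a trivial shell sanity check of that arithmetic:

    # 5 days of regular bounces plus a failed verp probe, then 3 weekly
    # "subscription disabled" reminders before the final unsubscribe
    echo $(( 5 + 3 * 7 ))   # prints 26, matching the gap from the 2nd to the 28th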
<fungi> though separately i've noticed a bug specific to list moderators, it seems like message held notifications are treated like explicit verp probes, so if a moderator's bounce score is already at the limit and then their mta rejects a message held notification for some random spam (likely if they don't completely skip spam checking for things from the list server), then they can get their subscription disabled  [17:15]
<clarkb> interesting. I always try to put in explicit rules for mailing lists I subscribe to, in order to both sort them into the correct location and to help ensure they don't get flagged as spam  [17:17]
<clarkb> I've even noticed that sometimes gmail and google groups don't get along...  [17:17]
<clarkb> looking at the podman change again I feel like it should be safe. The new behavior is clearly in a file marked with noble in the name and the old behavior is in a task file labeled default  [17:38]
<clarkb> I can't help but be paranoid about it given its potential for widespread impact though  [17:38]
<clarkb> but this is like the third time I've looked at it to double check  [17:39]
<fungi> also things are quiet at the moment so even if there is unanticipated disruption it's unlikely to inconvenience anyone besides us  [17:43]
<clarkb> Reminder there will be no meeting tomorrow  [17:45]
<opendevreview> Merged opendev/system-config master: Update gitea to v1.22.6  https://review.opendev.org/c/opendev/system-config/+/938144  [17:49]
<clarkb> https://gitea09.opendev.org:3081/opendev/system-config/ updated and looks ok at first glance  [17:53]
<fungi> yeah, working for me too  [17:54]
<clarkb> git clone works for me too  [17:56]
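A minimal version of that smoke test, hitting the upgraded gitea09 backend directly rather than the load-balanced opendev.org frontend (the target directory is just an illustrative scratch path):

    # clone straight from the freshly upgraded backend as a quick sanity check
    git clone https://gitea09.opendev.org:3081/opendev/system-config /tmp/system-config-smoke-test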
<fungi> for some reason i always forget https://github.com/nelhage/reptyr exists when i actually need it  [17:56]
<clarkb> the gerrit cache h2 files continue to be far more reasonable in size  [17:59]
<clarkb> I still think increasing compaction time is probably a good idea but doing a reset like we did seems to be a good halfway step  [18:00]
<clarkb> all 6 gitea backends have updated at this point  [18:00]
<fungi> deploy completed, yep  [18:00]
<clarkb> https://zuul.opendev.org/t/openstack/build/7e98115a9ecd488f9314a88fdfd02d46 the deploy job was a success too  [18:00]
<fungi> should we give it some time to bake? or start in on the noble docker compose change now?  [18:01]
<clarkb> looking at the check job list for the noble docker compose change I suspect that one will take some time to run through the gate. It's probably ok to approve it now?  [18:02]
<fungi> unrelated, the sunday backups for gerrit seem to have been failing consistently for a while (months i think). has it been looked into yet?  [18:02]
<clarkb> I haven't looked but it may have to do with collisions between repacking and backups taking too long or something?  [18:03]
<clarkb> though I think we repack more often than weekly  [18:03]
<fungi> looks like it successfully backed up on october 20, but also failed two sundays prior and every sunday after  [18:05]
<clarkb> and just to one backup server too right? So it's specific to the time it seems like  [18:06]
<fungi> yes, just one notification each sunday since the end of september (excepting one sunday in october where it didn't error)  [18:06]
<fungi> at least i'm assuming it was only one backup server, the "action required" e-mail comes at the same time every sunday  [18:07]
<fungi> ~17:46 utc every time  [18:08]
<fungi> that coincides with the time for backup01.ord.rax backups and not backup02.ca-ymq-1.vexxhost  [18:09]
<fungi> and it succeeds consistently on all other days of the week  [18:09]
<fungi> so i agree it's likely hitting a conflict with some weekly scheduled task  [18:09]
<fungi> Failed to create/acquire the lock /opt/backups-202010/borg-review02/backup/lock.exclusive (timeout).  [18:13]
<fungi> i'm not immediately finding what it might be conflicting with  [18:17]
<fungi> aha!  [18:19]
<fungi> 0 0 * * 0 /usr/local/bin/verify-borg-backups >> /var/log/verify-borg-backups.log 2>&1  [18:19]
<fungi> Sun Dec 29 16:08:08 UTC 2024 Verifying /opt/backups/borg-review02/backup ...  [18:20]
<fungi> Mon Dec 30 01:53:46 UTC 2024 ... done  [18:20]
<fungi> not urgent since we're only missing one day a week, but worth mulling over how we might want to go about it (maybe create a cow snapshot and verify that so we don't need to play duelling lockfiles?)  [18:21]
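A rough sketch of that copy-on-write snapshot idea, assuming the backup repos live on an LVM volume; the volume group, logical volume, and mount point names are placeholders, not what the backup servers actually use:

    # snapshot the backup volume so verification runs against a frozen copy
    # and never contends for the live repository's lock.exclusive
    lvcreate --snapshot --size 50G --name backups-verify /dev/backupvg/backups
    mkdir -p /mnt/backups-verify
    mount /dev/backupvg/backups-verify /mnt/backups-verify
    borg check /mnt/backups-verify/borg-review02/backup
    umount /mnt/backups-verify
    lvremove -f /dev/backupvg/backups-verify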
<fungi> for now i'm sufficiently satisfied that we at least have a smoking gun and the behavior is sensible, if somewhat suboptimal  [18:23]
<clarkb> agreed  [18:31]
<clarkb> one option would be to simply reschedule the verification to occur when a less important backup may run (rather than review)  [18:31]
<clarkb> still less than ideal but better than now  [18:31]
<fungi> well, it just loops through all the backups, so when it starts verifying review's depends on how long all the ones before it in the list take to complete  [18:43]
<clarkb> right but it roughly coincides with when review02 backs up to rax ord. If we change the time that the verification occurs then that shouldn't happen?  [18:44]
<fungi> which for the past few months has landed at a time window conflicting with that particular backup  [18:44]
<clarkb> I suspect it's a time window of less than half an hour  [18:44]
<clarkb> oh wait, the verification took many hours, I see that now  [18:45]
<fungi> well, the verifications start at midnight utc, it takes about 16 hours to get to review and then nearly 10 hours to do that one  [18:46]
<clarkb> ya that was unexpected but I noticed the timestamps you posted and that is unfortunate  [18:47]
<fungi> and yesterday review's backup tried to run about 1.5-2 hours into the verify for its backups  [18:47]
<clarkb> I bet if we prune the backup server that will make things go quicker  [18:48]
<fungi> probably  [18:48]
<clarkb> since verification is going through each backup and checking them separately iirc.  [18:48]
<clarkb> Pruning will remove extra backups based on our pruning retention rules, which in theory should speed up verification  [18:48]
<fungi> and i guess that one hasn't been pruned in a very long time due to having more space  [18:49]
<clarkb> yup  [18:49]
<fungi> i can go ahead and do that now  [18:49]
<fungi> in progress now in a root screen session  [18:50]
<fungi> starting at 84% of 3tb used  [18:52]
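For reference, a prune along the lines of what fungi kicked off would look roughly like the following; the retention counts shown are illustrative placeholders, not necessarily the values OpenDev's prune script actually uses:

    # drop archives that fall outside the retention policy
    borg prune --stats --keep-daily 7 --keep-weekly 4 --keep-monthly 12 \
        /opt/backups/borg-review02/backup
    # borg >= 1.2 defers space reclamation to a separate compact step
    borg compact /opt/backups/borg-review02/backup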
<fungi> for some reason system-config-run-eavesdrop has been queued an hour for 937641  [19:10]
<fungi> thankfully that one goes pretty quickly once it actually starts  [19:11]
<Clark[m]> Ya 15 minutes ish to completion now  [19:14]
<fungi> and the hourly semaphore should be released by the time it hits deploy  [19:19]
<fungi> yeah, hourly just completed seconds ago now  [19:19]
<sergem> Hello Team! My name is Sergiy, I would like to ask your advice on the docker hub pull-through registry issue. We already have a Docker2Mirror vhost in our mirror role deployed on our mirror servers. It acts as a proxy to docker hub, but it is just apache and not anything more than that. It seems like we need to run a registry, not just apache, because we still hit limits in our gates in the openstack-helm and airship projects  [19:27]
<sergem> There is a registry role in system-config that creates a registry. Is it deployed anywhere we can use it?  [19:27]
<clarkb> the problem isn't specific to apache. The problem appears to be that docker hub has reduced their unauthenticated pull limits  [19:27]
<clarkb> if you used a registry instead of apache you would still theoretically have the same problems  [19:27]
<clarkb> however our apache setup does evict things after 24 hours if they have a longer ttl. I don't know what TTLs docker hub uses. But in either case, registry or apache, you still need to fetch things from upstream if they have expired or are not already present  [19:28]
<clarkb> it is for this reason we've largely suggested that people do their best to migrate away from docker hub and/or use authenticated requests if possible  [19:28]
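If authenticated requests are an option, pulls then count against a per-account quota instead of the shared anonymous per-address limit; a minimal sketch, where the credential environment variables are hypothetical secrets injected by the job rather than anything OpenDev provides:

    # log in once per job node, then pull as usual
    echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
    docker pull docker.io/library/ubuntu:24.04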
<clarkb> there was previous discussion about this with kozhukalov, I'm trying to find a link to it now  [19:29]
<clarkb> https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2024-12-20.log.html here  [19:30]
<sergem> we use quay for our images but we rely on ubuntu base images and they are published on docker hub or aws ecr only, with ratelimits...  [19:30]
<sergem> thanks for the link  [19:30]
<fungi> we've started mirroring some of our dependencies to quay as well  [19:31]
<fungi> so that we can pull them from there instead of from dockerhub  [19:31]
<clarkb> https://review.opendev.org/c/zuul/zuul-jobs/+/935574 is how far we've gotten with mirroring  [19:31]
<clarkb> nothing is actually mirrored yet but that role and job should provide a framework to do so  [19:31]
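At its core, a mirroring job built on that framework would do something like the following copy; the quay organization and the tag are placeholder names for illustration:

    # copy the upstream image (including all architectures in its manifest
    # list) from docker hub to a quay repository we control
    skopeo copy --all \
        docker://docker.io/library/ubuntu:24.04 \
        docker://quay.io/exampleorg/ubuntu:24.04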
<fungi> the kolla team is doing something similar for some of their dependencies, i gather, e.g. https://quay.io/repository/openstack.kolla/fluentd  [19:31]
<clarkb> re the registry in system-config: that registry runs as the "intermediate" registry for job artifacts  [19:32]
<clarkb> we prune it regularly and it isn't meant to be a trusted source of images. Instead it holds speculative future states  [19:32]
<clarkb> we could in theory run a separate trusted registry that only hosts trusted things, but quay and github and docker hub and google cloud and amazon etc etc are already doing that, so relying on them (quay has been popular) seems like a better first step?  [19:33]
<clarkb> unfortunately docker the tool and the ecosystem were built around this idea of docker hub being a central hub, and that worked well for about a decade and now it isn't financially feasible? and we're in this bit of a mess  [19:34]
<clarkb> you even have to switch away from using docker the tool if you stop hosting images on docker hub, because docker the tool only understands mirrors for docker hub and not for any other registry  [19:34]
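That limitation shows up in dockerd's own configuration: the registry-mirrors option only ever applies to pulls from Docker Hub, with no equivalent knob for quay.io or other registries. A sketch of wiring a pull-through proxy in that way, with the mirror hostname and port as placeholders:

    # /etc/docker/daemon.json - registry-mirrors affects docker hub pulls only
    printf '{\n  "registry-mirrors": ["https://mirror.example.opendev.org:8082"]\n}\n' \
        | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker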
<fungi> cue merge notification from opendevreview any moment... ;)  [19:35]
<clarkb> I don't really want to spend a lot of time invested in making docker hub less painful as we've been burned before and we should just move on  [19:35]
<clarkb> one other thing to note is that docker hub treats an ipv6 /64 as a single ip for rate limiting purposes  [19:35]
<clarkb> you may find your jobs are more reliable if you force communication to docker hub over ipv4  [19:35]
<clarkb> tonyb was working on a small script / ansible role that could make that happen for jobs  [19:36]
<fungi> (because some of our cloud providers just give all server instances global addresses from a single shared /64 allocation)  [19:36]
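One simple way a job could force IPv4 toward Docker Hub is to pin the registry endpoints' A records in /etc/hosts before pulling; this is only a sketch of the general idea, not tonyb's actual script, and the endpoint list is an assumption:

    # resolve only IPv4 addresses for docker hub endpoints and pin them,
    # so pulls stop sharing the provider-wide ipv6 /64 rate limit bucket
    for host in registry-1.docker.io auth.docker.io production.cloudflare.docker.com; do
        addr=$(dig +short A "$host" | grep -m1 -E '^[0-9.]+$')
        [ -n "$addr" ] && echo "$addr $host" | sudo tee -a /etc/hosts
    done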
<sergem> thanks guys, reading...  [19:38]
<fungi> but yeah, longer-term it seems like trying to avoid using any of docker's/moby's tools and services is the most sensible path forward, painful as such transitions may be. they seem to have decided to burn their community to the ground and walk away  [19:38]
<opendevreview> Merged opendev/system-config master: Run containers on Noble with docker compose and podman  https://review.opendev.org/c/opendev/system-config/+/937641  [19:40]
<fungi> and there's ^ some of our own related transition work  [19:40]
<clarkb> fungi: that is going to run all the infra-prod deploy jobs  [19:41]
<clarkb> we should probably check the first few don't do anything unexpected  [19:41]
<fungi> yes, i expect it to be a few hours  [19:41]
<fungi> but i have the in-progress job results up in front of me while i multitask  [19:41]
<clarkb> codesearch seems to have taken the right path in /var/log/ansible/service-codesearch.yaml.log on bridge  [19:42]
<clarkb> included: /home/zuul/src/opendev.org/opendev/system-config/playbooks/roles/install-docker/tasks/default.yaml for codesearch01.opendev.org  [19:42]
<clarkb> and the container is still running and dockerd is running too so that lgtm on codesearch  [19:43]
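The same spot check can be repeated as each service's deploy job finishes; a sketch of what that looks like from bridge and from the service host (the log file name varies per service):

    # on bridge: confirm which platform-specific task file ansible included
    grep 'install-docker/tasks/' /var/log/ansible/service-codesearch.yaml.log
    # on the service host: confirm the containers and the docker daemon are still up
    sudo docker ps
    systemctl is-active docker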
<fungi> agreed  [19:44]
<fungi> eavesdrop is almost done as well and etherpad should be quick. gitea will take a while though  [19:44]
<fungi> what do we have running on noble at this point besides some (possibly not in service yet) mirror servers?  [19:45]
<clarkb> gitea should be quicker than an upgrade but ya still touching individual servers one after another  [19:45]
<fungi> oh, and we don't run containers on the mirrors anyway  [19:46]
<fungi> so yeah in theory i think all these deploy jobs should be running on non-noble servers, and hopefully all skip the new logic  [19:46]
<clarkb> yup exactly  [19:46]
<fungi> in which case the only expected behavior difference will be when we start creating replacement servers in the future, which was blocked on python docker-compose not working on noble. okay now i get it  [19:47]
<clarkb> I've checked codesearch, eavesdrop and etherpad and they look good to me so far  [19:47]
<clarkb> fungi: ya the concern was that if we accidentally changed existing behavior we could get problems. But the intention is that this will be a noop for everything running today  [19:48]
<clarkb> so far behavior appears to be consistent with what we expect (no new behaviors; old behaviors are loaded via the new default.yaml code path)  [19:48]
<fungi> right. perfect  [19:48]
<clarkb> spot checked gitea09 since it is done and looks good too. That implies the other giteas should be good as well  [19:50]
<clarkb> fungi: the problem we want to avoid is any unintended copy/paste failures from main.yaml to default.yaml or potentially applying podman things to existing older servers, and so far I see no evidence of this type of problem  [19:54]
<clarkb> gitea-lb looks good too.  [19:56]
<clarkb> and grafana too so ya I think this is happy. It's about lunch time here. But I can do more spot checks after I eat  [19:58]
<fungi> enjoy, i'll keep an eye on the deploy as it progresses  [20:02]
<tonyb> Nice work!  thank you for merging that one.  that'll free up my ansible-devel stuff  [20:15]
<clarkb> I spot checked some of the zuul services and gerrit and they all look good too  [20:32]
<clarkb> looks like all of the deploy jobs were successful  [20:33]
<tonyb> Nice  [20:33]
<fungi> yes, completed without issue  [20:37]
*** jhorstmann is now known as Guest4561  [21:21]
<fungi> borg prune has reclaimed about 10% of the volume on backup01.ord.rax so far, not sure how long it's going to take to complete though  [21:55]
*** jhorstmann is now known as Guest4564  [22:00]
*** jhorstmann is now known as Guest4567  [22:53]
*** jhorstmann is now known as Guest4570  [23:50]
