Tuesday, 2024-11-05

clarkbOur weekly team meeting will begin in ~5 minutes.18:55
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov  5 19:00:12 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/TW6HC5TVBZNEAUWX6HBSATZKK7USHXAB/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbNo big announcements today19:00
clarkbI will be out Thursday morning local time for an appointment then I'm also out Friday and Monday (long holiday weekend with family)19:01
clarkbdid anyone else have anything to announce?19:02
clarkbsounds like no19:03
clarkb#topic Zuul-launcher image builds19:03
clarkbyesterday corvus reported that we ran a job on an image built and uploaded by nodepool-in-zuul19:03
clarkbtrying to find that link now19:03
corvus#link niz build https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de619:04
clarkb#link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de619:04
clarkb#undo19:04
opendevmeetRemoving item from minutes: #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de619:04
corvusstill some work to do on the implementation (we're missing important functionality), but we've got the basics in place19:04
clarkbI assume the bulk of that is on the nodepool in zuul side of things? On the opendev side of things we still need jobs to build the various images we have?19:05
corvusyep, still a bit of niz implementation to do19:05
corvuson the opendev side:19:05
corvus* need to fix the image upload to use the correct expire-after headers19:05
corvus(since all 3 ways we tried to do that failed; so something needs fixing)19:06
corvus* should probably test with a raw image upload for benchmarking19:06
clarkbdo we know if the problem is the header value itself or something in the client? I'm guessing we don't actually know yet?19:06
corvus(but that should wait until after some more niz performance improvements)19:06
corvus* need to add more image build jobs (no rush, but not blocked by anything)19:06
corvusclarkb: tim said that it needs to be set for both the individual parts and the manifest; i take that to mean that the swift CLI client is only setting it on one of those19:07
corvusso one of the fixes is to fix swiftclient to set both19:07
corvusanother fix could be to fix openstacksdk19:07
clarkbgot it19:07
corvusa third fix could be to do something custom in an ansible module19:08
clarkbside note: swift has been a thing for like 13 years? kind of amazing we're the first to hit this issue19:08
corvushonestly don't know the relative merits of those, so step 1 is to evaluate those choices :)19:08
corvusyeah... i guess people want to keep big files?  :)19:08
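For reference, a rough sketch of what setting the expiry on both the parts and the manifest could look like with python-swiftclient, assuming a Static Large Object upload; the endpoint, token, container name, and two-week lifetime here are illustrative placeholders rather than our actual job configuration:

```python
import json
from hashlib import md5

from swiftclient.client import Connection


def read_chunks(path, size=1024 * 1024 * 1024):
    """Yield fixed-size chunks of a local file (1 GiB segments here)."""
    with open(path, "rb") as f:
        while True:
            data = f.read(size)
            if not data:
                return
            yield data


# Pre-authenticated connection with placeholder endpoint/token.
conn = Connection(
    preauthurl="https://swift.example.com/v1/AUTH_images",
    preauthtoken="TOKEN",
)

expiry = {"X-Delete-After": str(14 * 24 * 3600)}  # e.g. two weeks, in seconds
segments = []

for index, chunk in enumerate(read_chunks("image.qcow2")):
    name = "segments/image.qcow2/%08d" % index
    # The expiry header goes on each individual part...
    conn.put_object("images", name, chunk, headers=expiry)
    segments.append({
        "path": "images/" + name,
        "etag": md5(chunk).hexdigest(),
        "size_bytes": len(chunk),
    })

# ...and again on the SLO manifest, since it is a separate object and
# expiring only one of the two leaves the other behind.
conn.put_object(
    "images", "image.qcow2", json.dumps(segments),
    headers=expiry, query_string="multipart-manifest=put",
)
```

Whichever layer gets fixed (swiftclient, openstacksdk, or a custom Ansible module), the point is the same: the header has to ride along with both the segment PUTs and the manifest PUT.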
clarkbanything else?19:09
corvusanyway, that's about it i think19:09
clarkbthank you for the update. It's cool to see it work end to end like that19:09
clarkb#topic Backup Server Pruning19:09
corvusyw; ++19:09
clarkbAs previously mentioned I did some manual cleanup of ethercalc02 backups19:10
clarkbI then documented this process19:10
clarkb#link https://review.opendev.org/c/opendev/system-config/+/933354 Documentation of cleanup process applied to ethercalc02 backups.19:10
clarkbCouple of things to note. The first is that there is still some discussion over whether or not this is the best approach19:10
clarkbin particular ianw has proposed an alternative which automates most of this process19:11
clarkb#link https://review.opendev.org/c/opendev/system-config/+/933700 Automate backup deletions for old servers/services19:11
clarkbReviewing this change is high on my todo list (hoping to look after lunch today) as I do think it would be an improvement on what I've done19:11
clarkbThe other thing to note is that the backup server checks noticed the backup dir for ethercalc02 was abnormal after the cleanup. This is true since the dir was removed but we should probably avoid having it emit the warning in the first place if we're intentionally cleaning things up19:12
clarkboh and fungi ran our normal pruning process on the vexxhost backup server as we were running low on space and we're still working through the process for clearing unneeded content19:12
clarkbthe good news is once we have a process for this and apply it I think we'll free up about 20% of the total disk space on the server which is a good chunk19:13
fungiyep, i did that19:13
fungiworth noting the ethercalc removal didn't really free up any observable space, but we didn't expect it to19:13
clarkbso anyway please review the two changes above and weigh in on whether or not you think either approach is worth pursuing further19:13
clarkb#topic Upgrading old servers19:14
clarkbI don't see any updates on tonyb's mediawiki stack19:14
clarkbtonyb: anything to note? I know there was a pile of comments on that stack so want to make sure that I was clear enough and also not off base on what I wrote19:15
clarkbseparately I decided with Gerrit that I would like to upgrade Gerrit to 3.10 first then figure out the server upgrade. The reason for that is the service update is more straightforward and I think a better candidate for sorting out over the next while of holidays and such19:16
clarkbonce that is done I'll have to look at server replacement19:16
clarkbmore on Gerrit 3.10 later in the meeting19:16
clarkbany other server upgrade notes to make?19:17
clarkb#topic Docker compose plugin with podman service for servers19:19
clarkblet's move on. I didn't expect any updates on this topic since last week but wanted to give a chance for anyone to chime in if there was movement19:19
clarkbok sounds like there wasn't anything new. We can move on19:20
clarkb#topic Enabling mailman3 bounce processing19:20
clarkband now for topics new to this week19:20
clarkbfrickler added this one and the question aiui is basically can we enable mailman3's automatic bounce processing which removes mailing list members after a certain number of bounces19:21
frickleryes, so basically I just stumbled about this when looking at the mailman list admin UI19:21
clarkbLast friday I logged into lists.opendev.org and looked at the options and tried to understand how it works. Basically there are two values we can change: the score threshold that, when exceeded, gets members removed, and the time period a score is valid for before being reset19:21
fricklerand since we do see a large number of bounces in our exim logs, I thought maybe give it a try19:22
clarkbby default the threshold is 5 (I think you get 1 score point for a hard bounce and half a point for a soft bounce; not sure what the difference between hard and soft bounces is) and that value is reset weekly19:22
frickleryes, the default looked pretty sane to me19:23
clarkbone thing that occurred to me is we can enable this on service-discuss first; since we only get about one email a week there, we should avoid removing anyone too quickly while we see if it works as expected19:23
clarkbbut otherwise it does seem like a good idea to remove all the old addresses that are no longer valid19:23
clarkbfungi: corvus: ^ any thoughts or concerns on enabling this on some/all of our lists? I think dmarc/dkim validation was one concern?19:23
frickleriirc one could also set it to report to the list admin instead of auto-remove addresses?19:24
clarkbfrickler: yes, you can also do both things.19:24
clarkbI figured we'd have it alert list owners as well as disabling/removing people19:24
corvuswe definitely should if we can; i'm not familiar with the dkim issues that apparently caused us to turn it off19:24
fungias mentioned earlier when we discussed it, spurious dmarc-related rejections are less of a concern with mm3 because it doesn't just score on bounced posts, it follows up with a verp probe and uses the results of that to increase the bounce score19:24
clarkbgot it so previous concerns are theoretically a non issue in mm3. In that case should I go ahead and enable it on service-discuss and see what happens from there?19:25
fungibasically mm3 tries to avoid disabling subscriptions in cases where the bounce could have been related to the message content rather than an actual delivery problem at the destination19:25
clarkboh also, if you log in to a list's member list page it shows the current score for all the members19:25
clarkbwhich is another way to keep track of how it is processing people19:26
clarkbthen maybe enabling it on busier lists next week if nothing goes haywire?19:26
fungisounds good to me19:26
clarkbcool I'll do that this afternoon for service-discuss too19:27
clarkbanything else on this topic?19:27
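For the record, a hedged sketch of what flipping this on for one list could look like through mailmanclient; the attribute names are assumptions based on Mailman 3's documented bounce settings (the same knobs are reachable from the web UI described above), and the weekly score reset is left at its default:

```python
from mailmanclient import Client

# Placeholder REST endpoint and credentials; the real values live in our
# private mailman configuration.
client = Client("http://localhost:8001/3.1", "restadmin", "restpass")
mlist = client.get_list("service-discuss@lists.opendev.org")

settings = mlist.settings
# Assumed attribute names, per Mailman 3's bounce processing documentation.
settings["process_bounces"] = True
settings["bounce_score_threshold"] = 5.0   # the default threshold discussed above
settings["bounce_notify_owner_on_disable"] = True
settings["bounce_notify_owner_on_removal"] = True
settings.save()
```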
clarkb#topic Failures of insecure registry19:28
clarkbRecently we had a bunch of issues with the insecure-ci-registry not communicating with container image push/pull clients and that resulted in job timeouts for jobs pushing images in particular19:28
clarkb#link https://zuul.opendev.org/t/openstack/builds?job_name=osc-build-image&result=POST_FAILURE&skip=0 Example build failures19:29
fricklerseems to have been resolved by the latest image update? or did I miss something?19:29
clarkbAfter restarting the container a few times I noticed that the tracebacks recorded in the logs were happening in cheroot which had a more recent release than our current container image build. I pushed up a change to zuul-registry which rebuilt with latest cheroot as well as updating other system libs like openssl19:29
clarkband yes that image update seems to have made things happier.19:30
clarkbLooking at logs it seems that some clients try to negotiate with invalid/unsupported tls versions and the registry is now properly rejecting them. But my theory is this wasn't working previously and we'd eat up threads or some other limited resource on the server19:30
clarkbone thing that seemed to back this up is if you listed connections on the server prior to the update there were many tcp connections hanging around19:31
clarkbbut now that isn't the case.19:31
clarkbNo concrete evidence that this was the problem but it seems to be much happier now19:31
clarkbif you notice things going unhappy again please say something and check the container logs for any tracebacks19:31
frickler+119:32
clarkb#topic Gerrit 3.10 Upgrade Planning19:33
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document19:33
clarkbI've been looking at upgrading Gerrit recently19:33
corvuswait19:33
corvus*** We should plan cleanup of some sort for the backing store. Either test the prune command or swap to a new container and cleanup the old one.19:33
clarkboh right19:33
clarkb#undo19:33
opendevmeetRemoving item from minutes: #link https://etherpad.opendev.org/p/gerrit-upgrade-3.1019:33
corvus^ should decide that19:33
clarkbthe other thing that was brought up last week was that our swift container backing the insecure ci registry is quite large19:34
clarkbthere is a zuul registry pruning command but corvus reports the last time we tried it things didn't go well19:34
corvuslast time we ran prune, it manifested weird errors.  i think zuul-registry is better now, but i don't know if whatever problem happened was specifically identified and fixed.19:34
clarkbI think our options are to try the prune command again or instead we can point the registry at a new container then work on deleting the old container in its entirety after the fact19:34
corvusso i think if we want to clean it up, we should prepare to replace the container (ie, be ready to make a new one and swap it in). then run the prune command.  if we see issues, swap to the new container.19:35
funginote that cleaning up the old container is likely to require asking the cloud operators to use admin privileges, since the swift bulk delete has to recursively delete each object in the container and i think is limited to something like 1k (maybe it was 10k?) objects per api call19:35
corvusfungi: well, we could run it in screen for weeks19:35
fungiyeah, i mean, that's also an option19:35
fricklerdoesn't rclone also work for that?19:36
corvusprune would be doing similar, so... run that in screen if you do :)19:36
clarkbsounds like there is probably some value in trying the prune command since that produces more feedback to the registry codebase and our fallback is the same either way (new container)19:36
fungidoing that in rackspace classic for some of our old build log containers would take years of continuous running to complete19:36
corvusregistry has fewer, larger, objects19:36
fungifrickler: i think rclone was one of the suggested tools to use for bulk deletes, but it's still limited by what the swift api supports19:37
corvusso hopefully lower value for "years" :)19:37
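As an aside, the long-running cleanup being described is roughly this shape with python-swiftclient; the endpoint, token, and container name are placeholders, and rclone or the swift CLI's bulk delete would do the same work subject to the same per-request limits:

```python
from swiftclient.client import Connection

# Placeholder pre-authenticated connection; real credentials would come from
# the cloud configuration.
conn = Connection(
    preauthurl="https://swift.example.com/v1/AUTH_registry",
    preauthtoken="TOKEN",
)

container = "old-registry-container"  # hypothetical name for the retired container
marker = ""
deleted = 0

while True:
    # Container listings are paginated, so walk them with a marker until empty.
    _headers, listing = conn.get_container(container, marker=marker, limit=1000)
    if not listing:
        break
    for obj in listing:
        conn.delete_object(container, obj["name"])
        deleted += 1
    marker = listing[-1]["name"]

print("deleted %d objects" % deleted)
conn.delete_container(container)
```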
clarkbneed a volunteer to 1) prep a fallback container in swift 2) start zuul-registry prune command in screen then if there are problems 3) swap registry to fallback container and delete old container one way or another19:37
corvusif we swap, promotes won't work, obviously.  so it's a little disruptive.19:38
clarkbright, people would need to recheck things to regenerate images and upload them again19:38
clarkbso maybe there is a 0) which is announce this to service-announce as a potential outcome (rechecks required)19:38
corvuswell, promote comes from gate, so there's typically no recourse other than "merge another change"19:39
clarkboh right since promote is running post merge19:39
clarkbI should be able to work through this sometime next week. Maybe we announce it today for a wednesday or thursday implementation next week? that gives everyone a week of notice19:39
clarkbbut also happy for someone else to drive it and set a schedule19:40
corvusmight be good to do on a friday/weekend.  but if we're going to prune, no idea when we'd actually see problems come up, if they do.  could be at any time, and if the duration is "months", then that's hard to announce/schedule.19:40
clarkbI see. So maybe a warning that the work is happening and we'd like to know if unexpected results occur19:41
clarkbI should be able to start it a week from this friday but not this friday19:41
corvusyeah, that sounds like it fits our expected outcomes :)19:41
fungiwfm19:41
clarkbok I'll work on a draft email and make sure corvus reviews it for accuracy before sending sometime this week with a plan to perform the work November 1519:42
fungithanks!19:42
clarkband happy for others to help or drive any point along the way :)19:42
corvusclarkb: ++19:42
clarkb#topic Gerrit 3.10 Upgrade Planning19:43
clarkboh shoot I'm just realizing I didn't undo enough items19:43
clarkboh well19:43
clarkbthe records will look weird but I think we can get by19:43
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document19:44
clarkbso I've been looking at Gerrit 3.10 upgrade planning19:44
clarkbper our usual process I've put a document together and tried to identify areas of concern, breaking changes, and then go through and decide what if any impact any of them have on us19:44
clarkbI've also manually tested the upgrade (and we automatically test it too)19:44
clarkboverall this seems like another straightforward change for us but there are a couple of things to note19:44
clarkbone is that gerrit 3.10 can delete old log files itself so I pushed a change to configure that and we can remove our cronjob which does so post upgrade19:45
clarkbanother is that robot comments are deprecated (I believe zuul uses this for inline comments)19:45
clarkband that, because project imports can lead to change number collisions, some searches may now require project+changenumber19:46
clarkbwe haven't imported any projects from another gerrit server so I don't think this last item is a real concern for us19:46
fungiwhat does gerrit recommend switching to, away from robot comments?19:46
clarkbfungi: using a checks api plugin I think19:46
fungiis there a new comment type for that purpose?19:46
fungihuh, i thought the checks api plugin was also deprecated19:47
clarkbsee this is where things get very very confusing19:47
corvusnope that's the "checks plugin" :)19:47
fungid'oh19:47
clarkbthe system where you register jobs with gerrit and it triggers them is deprecated19:47
corvus(which is now a deprecated backend for the checks api)19:47
fungii have a feeling i'm going to regret asking questions now ;)19:47
clarkbthere is a different system, where you just teach gerrit how to query your ci system for info, that isn't deprecated19:48
clarkbbut ya my understanding is that integration with CI systems directly in Gerrit is expected to go through this system, so things like robot comments are deprecated19:48
corvusdoes the checks api plugin system support line comments?19:48
clarkbworst case we send normal comments from zuul19:48
clarkbcorvus: I don't know but I guess that is a good question since that is what robot comments were doing19:48
clarkbbut also the deprecation is listed as a breaking change but as far as I can tell they have only deprecated them not actually broken them19:49
clarkbso this is a problem that doesn't need solving for 3.10 we just need to be aware of it eventually needing a solution19:49
corvussince there isn't currently a zuul checks api implementation, if/when they break we will probably just fall back to regular human-comments19:49
corvusshould be a 2-line change to zuul19:49
clarkb++19:50
clarkbanother thing I wanted to call out is that gerrit made reindexing faster if you start with an old index19:50
fungithere is no robot only zuul19:50
clarkbfor this reason our upgrade process is slightly modified in that document to backup the indexes then copy them back in place if we do a downgrade. In theory this will speed up our downgrade process19:50
corvusthis reads to me like a check result can point to a single line of code and that's it.  https://gerrit.googlesource.com/gerrit/+/master/polygerrit-ui/app/api/checks.ts#43519:50
corvusif that's correct, then i don't think the checks api is an adequate replacement, so falling back to human-comments is the best option for what zuul does.19:51
clarkbI also tested the downgrade on a held node with 3 changes so that process works but I can't really comment on whether or not it is faster19:51
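In concrete terms, the index backup/restore step in the planning document is just a directory copy; a minimal sketch, assuming our usual site layout (the paths are assumptions, adjust to the real review site directory):

```python
import shutil
from pathlib import Path

# Assumed deployment layout; the real site path may differ.
site = Path("/home/gerrit2/review_site")
backup = site / "index.pre-3.10"

# Before the upgrade: keep a copy of the 3.9 indexes around.
shutil.copytree(site / "index", backup)

# Only if we downgrade: put the old indexes back so gerrit can reuse them
# rather than rebuilding from scratch.
# shutil.rmtree(site / "index")
# shutil.copytree(backup, site / "index")
```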
clarkbwe are running out of time so I'll end this topic here19:52
clarkbbut please look over the document and the 3.10 release notes and call out any additional concerns that you feel need investigation19:52
clarkbotherwise I'm looking at December 6, 2024 as the upgrade date as that is after the big thanksgiving holiday19:52
clarkbshould be plenty of time to get prepared by then19:52
clarkb#topic RTD Build Trigger Requests19:52
clarkbReally quickly before we run out of time I wanted to call out the read the docs build trigger api request failures in ansible19:53
clarkbtl;dr seems to be that making the same request with curl or python requests works but having ansible's uri module do so fails with an unauthorized error?19:53
clarkbadditional debugging with a mitm proxy hasn't shown any clear reason for why this is happening19:53
fricklerbut only on some distros like bookworm or noble, not on f41 or trixie19:53
fricklerso very weird indeed19:53
clarkbI think we could probably rewrite the jobs to use python requests or curl or something and just move on19:54
fungisomething is causing http basic auth to return a 403 error from the zuul-executor containers, reproducible from some specific distro platforms but works from others19:54
clarkbbut it is also interesting from a "what is ansible even doing here" perspective that may warrant further debugging19:54
fungiit does seem likely to be related to a shared library which will be fixed when we eventually move zuul images from bookworm to trixie19:55
fungibut hard to say what exactly19:55
clarkbdo we think that just rebuilding the image may correct it too?19:55
clarkb(I don't think so since iirc we do update the platform when we build zuul images)19:55
clarkbanother option to try could be updating to python3.12 in the zuul images19:56
fungialso strange that we didn't change zuul's container base recently, but the problem only started in late september19:56
fricklerwell trixie is only in testing as of now19:56
fungiyes19:56
clarkbcorvus: is python3.12 something that we should consider generally for zuul or is that not in the cards yet?19:56
clarkbfor the container images I mean. We're already testing with 3.1219:56
fricklerand py3.12 on noble seems broken/affected, too, so that wouldn't help19:56
clarkback19:56
corvusnot in a rush to change in general :)19:57
corvus(but happy to if we think it's a good idea)19:57
corvusbut all things equal, i'd maybe just leave it for the next os upgrade?  unless there's something pressing that 3.12 would make better19:57
clarkbmore mitmproxy testing or setting up a server to log headers and then diffing between platforms is probably the next debugging step if anyone wants to do that19:57
clarkbcorvus: I don't think there is a pressing need. unittests on 3.11 are faster too...19:58
fricklereither that or try the curl solution19:58
clarkbya we could just give up for now and switch to a different working implementation then revert when trixie happens19:58
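If we do take the rewrite route, the fallback is small; a sketch with python requests, where the webhook URL and credentials are placeholders for whatever the existing job already passes to the uri module:

```python
import requests

# Hypothetical placeholders; the real URL and credentials come from the
# job's secrets.
url = "https://readthedocs.org/api/v2/webhook/<project-slug>/<webhook-id>/"

resp = requests.post(
    url,
    auth=("<rtd-username>", "<rtd-password>"),  # HTTP basic auth, as today
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```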
fungimy guess is that there was some regression which landed in bookworm 6 weeks ago, or in a change backported into an ansible point release maybe19:58
fricklerjust before we close I'd also want to mention the promote-openstack-manuals-developer failures https://zuul.opendev.org/t/openstack/build/618e3a431a2145afb4344809a9aa84fa/console19:58
fungior a python point release19:58
fricklerno idea yet what's different there compared to promote-openstack-manuals runs19:59
clarkbit's a different target so I suspect that fungi's original idea is the right path19:59
clarkbjust not complete yet? basically need to get the paths and destinations all in alignment?19:59
fungiyeah, but my change didn't seem to fix the error19:59
clarkb(side note: another reason why I think developer doesn't need a different target is that it makes the publications more complicated)20:00
clarkband we are at time. Thank you everyone20:00
clarkbwe can continue discussion in #opendev or on the mailing list as necessary but I don't want to keep anyone here longer than the prescribed hour20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  5 20:00:33 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.log.html20:00
corvusclarkb:  thanks!20:00
fungithanks clarkb!20:01
