-opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers | 16:53 | |
-opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers | 17:42 | |
clarkb | almost meeting time | 18:59 |
fungi | ahoy! | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Aug 15 19:01:05 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/FRMRI2B7KC2HPOC5VTJYQBKARGCTY5GA/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | I'm back in my normal timezone. Other than that I didn't have anything to announce | 19:01 |
clarkb | oh! the openstack feature freeze happens right at the end of this month / start of September | 19:02 |
fungi | yeah, that's notable | 19:02 |
clarkb | something to be aware of as we make changes to avoid major impact to openstack | 19:02 |
fungi | it will be the busiest time for our zuul resources | 19:03 |
clarkb | sounds like openstack continues to struggle with reliability problems in the gate too so we may be asked to help diagnose issues | 19:03 |
clarkb | I expect they will become more urgent in feature freeze crunch time | 19:04 |
clarkb | #topic Topics | 19:04 |
clarkb | #topic Google Account for Infra root | 19:04 |
clarkb | our infra root email address got notified that an associated google account is on the chopping block December 1st if there is no activity before then | 19:05 |
clarkb | assuming we decide to preserve the account we'll need to do some sort of activity every two years or it will be deleted | 19:05 |
fungi | any idea what it's used for? | 19:05 |
fungi | or i guess not used for | 19:05 |
clarkb | deleted account names cannot be reused so we don't have to worry about someone taking it over at least | 19:05 |
fungi | formerly used for long ago | 19:05 |
clarkb | I'm bringing it up here in hopes someone else knows what it was used for :) | 19:06 |
corvus | i think they retire the address so it may be worth someone logging in just in case we decide it's important later | 19:06 |
corvus | but i don't recall using it for anything, sorry | 19:06 |
clarkb | ya if we login we'll reset that 2 year counter and the account itself may have clues for what it was used for | 19:07 |
clarkb | I can try to do that before our next meeting and hopefully have new info to share then | 19:08 |
clarkb | if anyone else recalls what it was for later please share | 19:08 |
clarkb | but we have time to sort this out for now at least | 19:08 |
clarkb | #topic Bastion Host Updates | 19:08 |
clarkb | #link https://review.opendev.org/q/topic:bridge-backups | 19:09 |
clarkb | this topic could still use root review by others | 19:09 |
fungi | i haven't found time to look yet | 19:09 |
clarkb | I also thought we may need to look at upgrading ansible on the bastion but I think ianw may have already taken care of that | 19:10 |
clarkb | double checking probably a good idea though | 19:10 |
clarkb | #topic Mailman 3 | 19:10 |
clarkb | fungi: it looks like the mailman 3 vhosting stuff is working as expected now. I recall reviewing some mm3 changes though I'm not sure where we've ended up since | 19:10 |
fungi | so the first three changes in topic:mailman3 still need more reviews but should be safe to merge | 19:12 |
fungi | the last change in that series currently is the version upgrade to latest mm3 releases | 19:12 |
clarkb | #link https://review.opendev.org/q/topic:mailman3+status:open | 19:13 |
fungi | i have a held node prepped to do another round of import testing, but got sideswiped by other tasks and haven't had time to run through those yet | 19:13 |
clarkb | ok I've got etherpad and gitea prod updates to land too. After the meeting we should make a rough plan for landing some of these things and pushing forward | 19:13 |
fungi | the upgrade is probably also safe to merge, but has no votes and i can understand if reviewers would rather wait until i've tested importing on the held node | 19:14 |
clarkb | that does seem like a good way to exercise the upgrade | 19:14 |
clarkb | to summarize, no known issues. General update changes needed. Upgrade change queued after general updates. Import testing needed for further migration | 19:15 |
fungi | also the manual steps for adding the django domains/postorius mailhost associations are currently recorded in the migration etherpad | 19:15 |
fungi | i'll want to add those to our documentation | 19:15 |
clarkb | ++ | 19:15 |
fungi | they involve things like ssh port forwarding so you can connect to the django admin url from localhost | 19:16 |
fungi | and need the admin credentials from the ansible inventory | 19:16 |
fungi | once i've done another round of import tests and we merge the version update, we should be able to start scheduling more domain migrations | 19:17 |
clarkb | once logged in the steps are a few button clicks though so pretty straightforward | 19:17 |
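For reference, the manual login step looks roughly like the sketch below. The hostname, forwarded port, and admin path are assumptions rather than values from the deployment (check the ansible inventory and the mailman3 compose file); only the general ssh port-forwarding approach is taken from the discussion above.

```shell
# Forward a local port to the django/postorius web UI on the list server
# (hostname and port 8000 are placeholders -- verify against the deployment)
ssh -L 8000:localhost:8000 root@lists01.opendev.org

# Then browse to http://localhost:8000/admin/ and sign in with the admin
# credentials recorded in the ansible inventory, and add the django domain
# and postorius mail host through the web forms.
```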
fungi | yup | 19:17 |
fungi | i just wish that part were easier to script | 19:17 |
clarkb | ++ | 19:17 |
fungi | from discussion on the mm3 list, it seems there is no api endpoint in postorius for the mailhost association step, which would be the real blocker (we'd have to hack up something based on the current source code for how postorius's webui works) | 19:18 |
clarkb | let's coordinate to land some of these changes after the meeting and then figure out docs and an upgrade as followups | 19:18 |
fungi | sounds great, thanks! that's all i had on this topic | 19:18 |
clarkb | #topic Gerrit Updates | 19:19 |
clarkb | I've kept this agenda item for two reasons. First I'm still hoping for some feedback on dealing with the replication task leaks. Second I'm hoping to start testing the 3.7 -> 3.8 upgrade very soon | 19:19 |
clarkb | For replication task leaks the recent restart where we moved those aside/deleted them showed that is a reasonable thing to do | 19:20 |
clarkb | we can either script that or simply stop bind mounting the dir where they are stored so docker rms them for us | 19:20 |
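A rough sketch of the "script that" option, for illustration only. The compose file location, site path, and the directory where the replication plugin persists its tasks are all assumptions here, not taken from the production deployment:

```shell
# Hypothetical cleanup around a Gerrit restart; paths and locations assumed.
# The replication plugin persists queued tasks under the site's data/ dir.
cd /etc/gerrit-compose   # assumed location of the compose file
docker-compose down
mv /home/gerrit2/review_site/data/replication/ref-updates \
   /home/gerrit2/review_site/data/replication/ref-updates.$(date +%F)
docker-compose up -d
```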
clarkb | For gerrit upgrades the base upgrade job is working (and has been working) but we need to go through the release notes and test things on a held node like reverts (if possible) and any new features or behaviors that concern us | 19:21 |
frickler | did you ever see my issue with starred changes? | 19:21 |
clarkb | frickler: I don't think I did | 19:22 |
frickler | seems I'm unable to view them because I have more than 1024 starred | 19:22 |
clarkb | interesting. This is querying the is:starred listing? | 19:22 |
frickler | yes, getting "Error 400 (Bad Request): too many terms in query: 1193 terms (max = 1024)" | 19:22 |
frickler | and similar via gerrit cli | 19:22 |
fungi | ick | 19:23 |
clarkb | have we brought this up on their mailing list or via a bug? | 19:23 |
frickler | I don't think so | 19:23 |
clarkb | ok I can write an email to repo discuss if you prefer | 19:23 |
frickler | I also use this very rarely, so it may be an older regression | 19:23 |
clarkb | ack | 19:23 |
clarkb | it's also possible that is a configurable limit | 19:23 |
frickler | that would be great, then I could at least find out which changes are starred and unstar them | 19:24 |
frickler | a maybe related issue is that stars aren't shown in any list view | 19:24 |
frickler | just in the view of the change itself | 19:24 |
clarkb | good point. I can ask for hints on methods for finding subsets for unstarring | 19:24 |
frickler | thx | 19:25 |
clarkb | #topic Server Upgrades | 19:26 |
clarkb | No new servers booted recently that I am aware of | 19:26 |
corvus | index.maxTerms | 19:26 |
corvus | frickler: ^ | 19:26 |
clarkb | However we had trouble with zuul executors running out of disk today. The underlying issue was that /var/lib/zuul was not a dedicated fs with extra space | 19:26 |
clarkb | so a reminder to all of us replacing servers and reviewing server replacements to check for volumes/filesystem mounts | 19:27 |
fungi | those got replaced over the first two weeks of july, so it's amazing we didn't start seeing problems before now | 19:27 |
corvus | https://gerrit-review.googlesource.com/Documentation/config-gerrit.html (i can't seem to deep link it, so search for maxTerms there) | 19:27 |
clarkb | #topic Fedora Cleanup | 19:28 |
corvus | re mounts -- i guess the question is what do we want to do in the future? update launch-node to have an option to switch around /opt? or make it a standard part of server documentation? (but then we have to remember to read the docs which we normally don't have to do for a simple server replacement) | 19:29 |
clarkb | #undo | 19:29 |
opendevmeet | Removing item from minutes: #topic Fedora Cleanup | 19:29 |
clarkb | corvus: maybe a good place to annotate that info is in our inventory file since I at least tend to look there in order to get the next server in sequence | 19:29 |
corvus | that's a good idea | 19:29 |
clarkb | because you are right that it will be easy to miss in proper documentation | 19:29 |
fungi | launch node does also have an option to add volumes, i think, which would be more portable outside rackspace | 19:30 |
clarkb | fungi: yes it does volume management and can do arbitrary mounts for volumes | 19:30 |
corvus | or... | 19:30 |
fungi | so if we moved the executors to, say, vexxhost or ovh we'd want to do it that way presumably | 19:30 |
corvus | we could update zuul executors to bind-mount in /opt as /var/lib/zuul, and/or reconfigure them to use /opt/zuul as the build dir | 19:31 |
clarkb | /opt/zuul is a good idea actually | 19:31 |
corvus | (one of those violates our "same path in container" rule, but the second doesn't) | 19:31 |
clarkb | since that reduces moving parts and keeps things simple | 19:31 |
corvus | yeah, /opt/zuul would keep the "same path" rule, so is maybe the best option... | 19:31 |
corvus | i like that. | 19:32 |
fungi | it does look like launch/src/opendev_launch/make_swap.sh currently hard-codes /opt as the mountpoint | 19:32 |
clarkb | yup and swap as the other portion | 19:33 |
fungi | so would need patching if we wanted to make it configurable | 19:33 |
clarkb | I like the simplicity of /opt/zuul | 19:33 |
fungi | thus patching the compose files seems more reasonable | 19:34 |
fungi | and if we're deploying outside rax we just need to remember to add a cinder volume for /opt | 19:34 |
frickler | ack, that's also what my zuul aio uses | 19:34 |
frickler | or have large enough / | 19:34 |
fungi | (unless the flavor in that cloud has tons of rootfs and we feel safe using that instead) | 19:34 |
fungi | yeah, exactly | 19:35 |
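As a sketch of the /opt/zuul idea discussed above (the compose file layout and service name are assumptions; only the "same path inside and outside the container" constraint comes from the conversation):

```yaml
# Hypothetical excerpt of an executor docker-compose.yaml
services:
  executor:
    volumes:
      # same path on the host and in the container, so zuul can be pointed at
      # /opt/zuul (which lives on the large ephemeral or cinder-backed disk)
      - /opt/zuul:/opt/zuul
      # the alternative, remapping /opt to /var/lib/zuul, would violate the
      # "same path in container" rule mentioned above:
      # - /opt:/var/lib/zuul
```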
frickler | coming back to index.maxTerms, do we want to try bumping that to 2k? | 19:36 |
frickler | or 1.5k? | 19:36 |
frickler | I think it'll likely require a restart, though? | 19:37 |
clarkb | at the very least we can probably bump it temporarily allowing you to adjust your star count | 19:37 |
clarkb | yes it will require a restart | 19:37 |
frickler | ok, I'll propose a patch and then we can discuss the details | 19:37 |
clarkb | I don't know what the memory scaling is like for terms but that would be my main concern | 19:38 |
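For reference, the knob corvus pointed at lives in gerrit.config; bumping it would look something like the snippet below (the value shown is illustrative, and as noted it only takes effect after a restart):

```ini
# gerrit.config -- raise the per-query term limit (default 1024)
[index]
        maxTerms = 2048
```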
clarkb | #topic Fedora Cleanup | 19:38 |
clarkb | tonyb and I looked at doing this the graceful way and then upstream deleted the packages anyway | 19:38 |
clarkb | I suspect this means we can forge ahead and simply remove the image type since they are largely non functional due to changes upstream of us | 19:39 |
clarkb | then we can clear out the mirror content | 19:39 |
clarkb | any concerns with that? I know nodepool recently updated its jobs to exclude fedora | 19:39 |
clarkb | I think devstack has done similar cleanup | 19:39 |
fungi | it's possible people have adjusted the urls in jobs to grab packages from the graveyard, but unlikely | 19:39 |
corvus | zuul-jobs is mostly fedora free now due to the upstream yank | 19:40 |
clarkb | I'm hearing we should just go ahead and remove the images :) | 19:40 |
clarkb | I'll try to push that forward this week too | 19:40 |
clarkb | cc tonyb if still interested | 19:40 |
corvus | (also it's worth specifically calling out that there is now no fedora testing in zuul jobs, meaning that the base job playbooks, etc, could break for fedora at any time) | 19:41 |
clarkb | even the software factory third party CI which uses fedora is on old nodes and not running jobs properly | 19:41 |
corvus | so if anyone adds fedora images back to opendev, please make sure to add them to zuul-jobs for testing first before using them in any other projects | 19:41 |
clarkb | ++ | 19:41 |
clarkb | and maybe software factory is interested in updating their third party ci | 19:41 |
fungi | maybe the fedora community wants to run a third-party ci | 19:41 |
fungi | since they do use zuul to build fedora packages | 19:42 |
fungi | (in a separate sf-based zuul deployment from the public sf, as i understand it) | 19:42 |
fungi | so it's possible they have newer fedora on theirs than the public sf | 19:42 |
clarkb | ya we can bring it up with the sf folks and take it from there | 19:43 |
fungi | or bookwar maybe | 19:43 |
corvus | for the base jobs roles, first party ci would be ideal | 19:43 |
fungi | certainly | 19:43 |
corvus | and welcome. just to be clear. | 19:44 |
fungi | just needs someone interested in working on that | 19:44 |
corvus | base job roles aren't completely effectively tested by third party ci | 19:44 |
corvus | (there is some testing, but not 100% coverage, due to the nature of base job roles) | 19:44 |
clarkb | good to keep in mind | 19:46 |
clarkb | #topic Gitea 1.20 | 19:46 |
clarkb | I sorted out the access log issue. Turns out there were additional undocumented in release notes breaking changes | 19:47 |
fungi | gotta love those | 19:47 |
clarkb | and you need different configs for access logs now. I still need to cross-check their format against production since the breaking change they did document is that the format may differ | 19:47 |
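As a rough illustration of the kind of config change involved, a minimal sketch is below; the exact keys are my recollection of the 1.20 logger rework and should be verified against the 1.20 breaking-changes notes rather than treated as authoritative:

```ini
; app.ini -- new-style access logger configuration (keys assumed, verify)
[log]
logger.access.MODE = file
```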
clarkb | Then I've got a whole list of TODOs in the commit message to work through | 19:47 |
clarkb | in general though I just need a block of focused time to page all this back in and get up to speed on it | 19:48 |
clarkb | but good news some progress here | 19:48 |
clarkb | #topic etherpad 1.9.1 | 19:48 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/887006 Etherpad 1.9.1 | 19:48 |
clarkb | Looks like fungi and ianw have tested the held node | 19:48 |
fungi | yes, seems good to me | 19:49 |
clarkb | cool so we can probably land this one real soon | 19:49 |
clarkb | as mentioned earlier we should sync up on a rough plan for some of these and start landing them | 19:49 |
clarkb | #topic Python Container Updates | 19:50 |
clarkb | we discovered last week when trying to sort out zookeeper installs on bookworm that the java packaging for bookworm is broken but not in a consistent manner | 19:50 |
clarkb | it seems to run package setup in different orders depending on which packages you have installed and it only breaks in one order | 19:50 |
clarkb | testing has package updates to fix this but they haven't made it back to bookworm yet. For zookeeper installs we are pulling the affected package from testing. | 19:51 |
clarkb | I think the only service this currently affects is gerrit | 19:51 |
clarkb | and we can probably take our time upgrading gerrit waiting for bookworm to be fixed properly | 19:51 |
clarkb | but be aware of that if you are doing java things on the new bookworm images | 19:51 |
corvus | #link https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129 | 19:51 |
clarkb | otherwise I think we are as ready as we can be for migrating our images to bookworm. Sounds like zuul and nodepool plan to do so after their next release | 19:53 |
corvus | clarkb: ... but local testing with a debian:bookworm image had the ca certs install working somehow... | 19:53 |
corvus | so ... actually... we might be able to build bookworm java app containers? | 19:53 |
clarkb | corvus: it appears to be related to the packages already installed and/or being installed affecting the order of installation | 19:53 |
clarkb | corvus: thats true we may be able to build the gerrit containers and sidestep the issue | 19:54 |
corvus | (but nobody will know why they work :) | 19:54 |
clarkb | specifically if the jre package is set up before the ca-certificates-java package it works. But if we go in the other order it breaks | 19:54 |
clarkb | the jre package depends on the certificates package so you can't do separate install invocations between them | 19:55 |
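The "pull the affected package from testing" workaround mentioned above could look roughly like the sketch below; the JRE package name is a stand-in and the details are assumptions, not a copy of the actual zookeeper role:

```shell
# Sketch: pin the fixed ca-certificates-java to testing, then install the JRE
# normally so apt pulls the fixed package regardless of setup ordering.
echo 'deb http://deb.debian.org/debian testing main' \
  > /etc/apt/sources.list.d/testing.list
cat > /etc/apt/preferences.d/ca-certificates-java <<'EOF'
Package: ca-certificates-java
Pin: release a=testing
Pin-Priority: 990
EOF
apt-get update
apt-get install -y default-jre-headless   # illustrative target package
```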
clarkb | #topic Open Discussion | 19:55 |
clarkb | Anything else? | 19:55 |
fungi | forgot to add to the agenda, rackspace issues | 19:56 |
fungi | around the end of july we started seeing frequent image upload errors to the iad glance | 19:56 |
fungi | that led to filling up the builders and they ceased to be able to update images anywhere for about 10 days | 19:57 |
fungi | i cleaned up the builders but the issue with glance in iad persists (we've paused uploads for it) | 19:57 |
fungi | that still needs more looking into, and probably a ticket opened | 19:57 |
clarkb | ++ I mentioned this last week but I think our best bet is to engage rax and show them how the other rax regions differ (if not entirely in behavior at least by degree) | 19:58 |
fungi | separately, we have a bunch of stuck "deleting" nodes in multiple rackspace regions (including iad i think), taking up the majority of the quotas | 19:58 |
fungi | frickler did some testing with a patched builder and increasing the hardcoded 60-minute timeout for images to become active did work around the issue | 19:59 |
fungi | for glance uploads i mean | 19:59 |
fungi | but clearly that's a pathological case and not something we should bother actually implementing | 19:59 |
fungi | and that's all i had | 20:00 |
frickler | yes, that worked when uploading not too many images in parallel | 20:00 |
frickler | but yes, 1h should be enough for any healthy cloud | 20:00 |
corvus | 60m or 6 hour? | 20:00 |
frickler | 1h is the default from sdk | 20:01 |
frickler | I bumped it to 10h on nb01/02 temporarily | 20:01 |
fungi | aha, so you patched the builder to specify a timeout when calling the upload method | 20:01 |
corvus | got it, thx. (so many timeout values) | 20:01 |
frickler | ack | 20:01 |
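For context, the timeout being discussed is the one the builder hands (directly or by default) to openstacksdk's image upload. Standalone, the call looks roughly like this minimal sketch; the cloud name, image name, and file path are placeholders:

```python
# Sketch: upload an image with a longer wait-for-active timeout via
# openstacksdk's cloud layer. The default timeout is 3600s (the 1h noted above).
import openstack

conn = openstack.connect(cloud='rax-iad')  # cloud name assumed
image = conn.create_image(
    'debian-bookworm-test',          # placeholder image name
    filename='debian-bookworm.vhd',  # placeholder file path
    wait=True,
    timeout=10 * 3600,  # allow up to 10 hours for the image to go active
)
print(image.status)
```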
clarkb | we are out of time | 20:02 |
fungi | anyway, we were seeing upwards of a 5.5 hour delay for images to become active there when uploading manually | 20:02 |
fungi | thanks clarkb! | 20:02 |
clarkb | I'm going to end it here but feel free to continue conversation | 20:02 |
clarkb | I just don't want to keep people from lunch/dinner/breakfast as necessary | 20:02 |
clarkb | thank you all! | 20:02 |
clarkb | #endmeeting | 20:02 |
opendevmeet | Meeting ended Tue Aug 15 20:02:50 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.html | 20:02 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.txt | 20:02 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.log.html | 20:02 |