clarkb | meeting time | 19:00 |
---|---|---|
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Aug 22 19:01:11 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VRBBT25TOTXJG3L5SXKWV3EELG34UC5E/ Our Agenda | 19:01 |
clarkb | We've actually got a fairly full agenda so I may move quicker than I'd like. But we can always go back to discussing items at the end of our hour if we have time | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | The service coordinator nomination period ends today. I haven't seen anyone nominate themselves yet. Does this mean everyone is happy with and prefers me to keep doing it? | 19:02 |
fungi | i'm happy to be not-it ;) | 19:02 |
fungi | (also you're doing a great job!) | 19:02 |
corvus | i can confirm 100% that is the correct interpretation | 19:02 |
clarkb | ok I guess I can make it official with an email later today, before the day ends, to avoid any needless process confusion | 19:03 |
clarkb | anything else to announce? | 19:04 |
clarkb | #topic Infra root google account | 19:05 |
clarkb | Just a quick note that I haven't tried to login yet so I have no news yet | 19:05 |
clarkb | but it is on my todo list and hopefully I can get to it soon. This week should be a bit less crazy for me than last week (half the visiting family is no longer here) | 19:05 |
clarkb | #topic Mailman 3 | 19:05 |
clarkb | fungi: we made some changes and got things to a good stable stopping point I think. What is next? mailman 3 upgrade? | 19:06 |
fungi | we merged the remaining fixes last week and the correct site names are showing up on archive pages now | 19:06 |
fungi | i've got a held node built with https://review.opendev.org/869210 and have just finished syncing a copy of all our production mm2 lists to it to run a test import | 19:07 |
fungi | #link https://review.opendev.org/869210 Upgrade to latest Mailman 3 releases | 19:07 |
clarkb | oh right we wanted to make sure that upgrading wouldn't put us in a worse position for the 2 -> 3 migration | 19:08 |
fungi | i'll step through the migration steps in https://etherpad.opendev.org/p/mm3migration to make sure they're still working correctly | 19:08 |
fungi | #link https://etherpad.opendev.org/p/mm3migration Mailman 3 Migration Plan | 19:08 |
fungi | at which point we can merge the upgrade change to swap out the containers and start scheduling new domain migrations | 19:08 |
fungi | that's where we're at currently | 19:09 |
clarkb | sounds good. Let us know when you feel ready for us to review and approve the upgrade change | 19:09 |
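A rough sketch of the per-list import commands this kind of migration plan typically exercises, for context only; the authoritative sequence is in the etherpad above, and the list name, paths, and Django settings below are placeholders.

```shell
# Placeholders throughout; run inside the mailman-core / mailman-web containers
# as appropriate, with Django settings already configured for hyperkitty_import.
LIST=service-discuss@lists.opendev.org

# Import the Mailman 2 list configuration and membership from its config.pck
mailman import21 "$LIST" /var/lib/mailman2/lists/service-discuss/config.pck

# Import the mbox archive into HyperKitty for the web archive pages
django-admin hyperkitty_import -l "$LIST" \
    /var/lib/mailman2/archives/private/service-discuss.mbox/service-discuss.mbox
```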
clarkb | #topic Gerrit Updates | 19:09 |
fungi | will do, thanks! | 19:09 |
clarkb | I did email the gerrit list last week about the too many query terms problem frickler has run into with starred changes | 19:10 |
clarkb | they seem to acknowledge that this is less than ideal (one idea was to log/report the query in its entirety so that you could use that information to find the starred changes) | 19:10 |
clarkb | but no suggestions for a workaround other than what we already know (bump index.maxTerms) | 19:11 |
clarkb | no one said don't do that either so I think we can proceed with frickler's change and then monitor the resulting performance situation | 19:11 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/892057 Bump index.maxTerms to address starred changes limit. | 19:11 |
clarkb | ideally we can approve that and restart gerrit with the new config during a timeframe that frickler is able to confirm it has fixed the issue quickly (so that we can revert and/or debug further if necessary) | 19:12 |
clarkb | frickler: I know you couldn't attend the meeting today, but maybe we can sync up later this week on a good time to do that restart with you | 19:13 |
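For reference, the setting under discussion is a single key in gerrit.config, and it only takes effect after a Gerrit restart, which is why the restart window matters; the value shown is illustrative only, the real one is whatever 892057 sets.

```ini
# gerrit.config excerpt; the value here is illustrative, not the one from 892057
[index]
    maxTerms = 10000
```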
clarkb | #topic Server Upgrades | 19:13 |
clarkb | no news here. Mostly a casualty of my traveling and having family around. I should have more time for this in the near future | 19:13 |
clarkb | (I don't like replacing servers when I feel I'm not in a position to revert or debug unexpected problems) | 19:14 |
clarkb | #topic Rax IAD image upload struggles | 19:14 |
clarkb | nodepool image uploads to rax iad are timing out | 19:15 |
clarkb | this problem seems to get worse the more uploads you perform at the same time | 19:15 |
clarkb | The other two rackspace regions do not have this problem (or if they share the underlying mechanism it doesn't manifest as badly so we don't really care/notice) | 19:15 |
fungi | specifically, the bulk of the time/increase seems to occur on the backend in glance after the upload completes | 19:16 |
clarkb | fungi and frickler have been doing the bulk of the debugging (thank you) | 19:16 |
clarkb | I think our end goal is to collect enough info that we can file a ticket with Rackspace to see if this is something they can fix up | 19:16 |
fungi | when we were trying to upload all our images, i clocked around 5 hours from the end of an upload to when it would first start to appear in the image list | 19:16 |
fungi | when we're not uploading any images, a single test upload is followed by around 30 minutes of time before it appears in the image list | 19:17 |
clarkb | and when you multiply the number of images we have by 30 minutes you end up with something suspiciously close to 5 hours | 19:18 |
fungi | and yes, what we're mostly lacking right now is someone to have time to file a ticket and try to explain our observations in a way that doesn't come across as us having unrealistic expectations | 19:18 |
corvus | in the meantime, are we increasing the upload timeout to accommodate 5h? | 19:19 |
clarkb | I think a key part of doing that is showing that dfw and ord don't suffer from this, which would indicate an actual problem with the region | 19:19 |
fungi | "we're trying to upload 15 images in parallel, around 20gb each, and your service can't keep up" is likely to result in us being told "please stop that" | 19:19 |
clarkb | fungi: note it should eventually balance out to an image an hour or so due to rebuild timelines | 19:20 |
clarkb | but due to the long processing times they all pile up instead | 19:20 |
fungi | corvus: frickler piecemeal unpaused some images manually to get fresher uploads for our more frequently used images to complete | 19:20 |
fungi | he did test by overriding the openstacksdk default wait timeout to something like 10 hours, just to confirm it did work around the issue | 19:21 |
clarkb | that isn't currently configurable in nodepool today? Or maybe we could do it with clouds.yaml? | 19:21 |
corvus | oh this is an sdk-level timeout? not the nodepool one? | 19:21 |
clarkb | ya | 19:21 |
fungi | yes | 19:21 |
fungi | nodepool doesn't currently expose a config option for passing the sdk timeout parameter | 19:22 |
fungi | we could add one, but also this seems like a pathological condition | 19:22 |
corvus | still, i think that would be an ok change. | 19:23 |
fungi | yeah, i agree, i just don't think we're likely to want to actually set that long term | 19:23 |
corvus | in general nodepool has a bunch of user-configurable timeouts for stuff like that because we know especially with clouds, things are going to be site dependent. | 19:23 |
corvus | yeah, it's not an ideal solution, but, i think, an acceptable one. :) | 19:24 |
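For context, the SDK-level knob being discussed is the wait/timeout pair on openstacksdk's image upload call. A minimal sketch of the kind of manual workaround described above, with the cloud name, image path, and exact timeout value all assumed:

```python
import openstack

# Assumed clouds.yaml entry name and image path; the 10h value mirrors the
# manual workaround mentioned above rather than any agreed-on setting.
conn = openstack.connect(cloud='rax', region_name='IAD')

image = conn.create_image(
    'ubuntu-jammy-test',
    filename='/opt/nodepool/images/ubuntu-jammy.vhd',
    disk_format='vhd',
    container_format='bare',
    wait=True,               # block until the image becomes active in glance
    timeout=10 * 60 * 60,    # much longer than the SDK's default wait
)
print(image.id, image.status)
```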
clarkb | other things to note. We had leaked images in all three regions. fungi cleared those out manually. I'm not sure why the leak detection and cleanup in nodepool didn't find them and take care of it for us. | 19:24 |
clarkb | and we had instances that we could not delete in all three regions that rackspace cleaned up for us after a ticket was submitted | 19:24 |
fungi | i cleared out around 1200 leaked images in iad (mostly due to the upload timeout problem i think, based on their ages). the other two regions have around 400 leaked images but have not been cleaned up yet | 19:25 |
corvus | maybe the leaked images had a different schema or something. if it happens again, ping me and i can try to see why. | 19:25 |
clarkb | thanks! | 19:25 |
fungi | corvus: sure, we can dig deeper. it seemed outwardly that nodepool decided the images never got created | 19:25 |
fungi | because the sdk returned an error when it gave up waiting for them | 19:26 |
corvus | ah, could be missing the metadata entirely then too | 19:26 |
fungi | yes, it could be missing followup steps the sdk would otherwise have performed | 19:26 |
fungi | if the metadata is something that gets set post-upload | 19:27 |
fungi | (from the cloud api side i mean) | 19:27 |
clarkb | anything else on this topic? | 19:28 |
fungi | not from me | 19:28 |
clarkb | #topic Fedora Cleanup | 19:28 |
clarkb | I pushed two changes earlier today to start on this. Bindep is the only fedora-latest user in codesearch that is also in the zuul tenant config. | 19:29 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Cleanup fedora-latest in bindep | 19:29 |
clarkb | er that's the next change, one second while I copy-paste the bindep one properly | 19:29 |
clarkb | #undo | 19:29 |
opendevmeet | Removing item from minutes: #link https://review.opendev.org/c/opendev/base-jobs/+/892380 | 19:29 |
clarkb | #link https://review.opendev.org/c/opendev/bindep/+/892378 Cleanup fedora-latest in bindep | 19:29 |
clarkb | This should be a very safe change. The next one which removes nodeset: fedora-latest is less safe because older branches may use it etc | 19:30 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove fedora-latest nodeset | 19:30 |
clarkb | I'm personally inclined to land both of them and we can revert 892380 if something unexpected happens. but that nodeset doesn't work anyway so those jobs should already be broken | 19:30 |
clarkb | We can then look at cleaning things up from nodepool and the mirrors etc | 19:31 |
clarkb | let me know if you disagree or find extra bits that need cleanup first | 19:31 |
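For anyone reviewing 892380, this is roughly the shape of the nodeset definition being removed from opendev/base-jobs; the node label name here is an assumption.

```yaml
# Approximate shape of the definition 892380 removes; label name assumed.
- nodeset:
    name: fedora-latest
    nodes:
      - name: fedora-latest
        label: fedora-37
```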
clarkb | #topic Gitea 1.20 Upgrade | 19:32 |
fungi | this cleanup also led to us rushing through an untested change for the zuul/zuul-jobs repo; we do need to remember that it has stakeholders beyond our deployment | 19:32 |
clarkb | Gitea has published a 1.20.3 release already. I think my impression that this is a big release with not a lot of new features is backed up by the amount of fixing they have had to do | 19:32 |
clarkb | But I think I've managed to work through all the documented breaking changes (and one undocumented breaking change) | 19:33 |
clarkb | there is a held node that seems to work here: https://158.69.78.38:3081/opendev/system-config | 19:33 |
clarkb | The main thing at this point would be to go over the change itself and that held node to make sure you are happy with the changes I had to make | 19:33 |
clarkb | and if so I can add the new necessary but otherwise completely ignored secret data to our prod secrets and merge the change when we can monitor it | 19:34 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/886993 Gitea 1.20 change | 19:34 |
clarkb | there is the change | 19:34 |
clarkb | looks like fungi did +2 it. I should probably figure out those secrets then | 19:35 |
clarkb | thanks for the review | 19:35 |
fungi | it looked fine to me, but i won't be around today to troubleshoot if something goes sideways with the upgrade | 19:35 |
clarkb | we can approve tomorrow. I'm not in a huge rush other than simply wanting it off my todo list | 19:35 |
clarkb | #topic Zuul changes and updates | 19:35 |
clarkb | There are a number of user facing changes that have been made or will be made very soon in zuul | 19:36 |
clarkb | I want to make sure we're all aware of them and have some sort of plan for working through them | 19:36 |
clarkb | First up Zuul has added Ansible 8 support. In the not too distant future Zuul will drop Ansible 6 support which is what we default to today | 19:36 |
clarkb | in the past what we've done is asked our users to test new ansible against their jobs if they are worried and set a hard cutover date via tenant config in the future | 19:37 |
corvus | and between those 2 events, zuul will switch the default from 6 to 8 | 19:37 |
fungi | also it's skipping ansible 7 support, right? | 19:37 |
corvus | yeah we're too slow, missed it. | 19:37 |
clarkb | OpenStack Bobcat releases ~Oct 6 | 19:37 |
clarkb | I'm thinking we switch all tenants to ansible 8 by default the week after that | 19:38 |
clarkb | though we should probably go ahead and switch the opendev tenant nowish. So on that date we'd switch all tenants that haven't switched yet | 19:38 |
corvus | oh that sounds extremely generous. i think we can and should look into switching much earlier | 19:38 |
clarkb | corvus: my concern with that is openstack will have a revolt if anything breaks due to their CI already being super flaky and the release coming up | 19:39 |
corvus | what if we switch opendev, wait a while, switch everyone but openstack, wait a while, and then switch openstack... | 19:39 |
fungi | wfm | 19:39 |
clarkb | that is probably fine | 19:39 |
frickler | with "a while" = 6 weeks that sounds fine | 19:40 |
corvus | i think if we run into any problems, we can throw the brakes easily, but given that zuul (including zuul-jobs) switched with no changes, this might go smoothly... | 19:40 |
clarkb | maybe try to switch opendev this week, everyone else less openstack the week after if opendev is happy. Then do openstack after the bobcat release? | 19:40 |
fungi | sounds good | 19:40 |
clarkb | that way we can get an email out and give people at least some lead time | 19:40 |
corvus | well that's not really what i'm suggesting | 19:40 |
corvus | we could do a week between each and have it wrapped up by mid september | 19:41 |
clarkb | hrm I think the main risk is that specific jobs break | 19:41 |
clarkb | and openstack is going to be angry about that given how unhappy openstack ci has been lately | 19:41 |
corvus | or we could do opendev now, and then everyone else 1 week after that and wrap it up 2 weeks from now | 19:41 |
frickler | where is this urgency coming from? | 19:42 |
corvus | we actually should be doing this much more quickly and much more frequently | 19:42 |
fungi | being able to merge the change in zuul that drops ansible 6 and being able to continue upgrading the opendev deployment when that happens | 19:42 |
clarkb | I think last time we went quickly with the idea that specific jobs could force ansible 6, but then other zuul config errors complicated that more than we anticipated | 19:42 |
corvus | we need to change ansible versions every 6 months to keep up with what's supported upstream | 19:42 |
corvus | so i would like us to acclimate to this being somewhat frequent and not a big deal | 19:42 |
clarkb | and to be fair we've been pushing openstack to address those but largely only frickler has put any effort into it :/ | 19:42 |
fungi | though as of an hour ago it seems like we've got buy-in from the openstack tc to start deleting branches in repos with unfixed zuul config errors | 19:43 |
frickler | what's wrong with running unsupported ansible versions? people are still running python2 | 19:43 |
clarkb | frickler: at least one issue is the size of the installation for ansible. Every version you support creates massive bloat in your zuul executors | 19:44 |
corvus | it's not our intention to use out-of-support ansible | 19:44 |
frickler | I don't think that is compatible with the state openstack is in | 19:44 |
corvus | how so? | 19:44 |
corvus | do we know that openstack runs jobs that don't work with ansible 8? | 19:45 |
fungi | we certainly can't encourage it, given zuul is designed to run untrusted payloads and ansible upstream won't be fixing security vulnerabilities in those versions | 19:45 |
frickler | there is no developer capacity to adapt | 19:45 |
clarkb | no we do not know that yet. JayF thought that some ironic jobs might use openstack collections though which are apparently not backward compatible in 8 | 19:45 |
frickler | if it all works fine, fine, but if not, adapting will take a long time | 19:45 |
corvus | zuul's executor can't use collections... | 19:45 |
fungi | i don't see how ironic jobs would be using collections with the executor's ansible | 19:45 |
clarkb | ah | 19:45 |
fungi | if they're using collections that would be in a nested ansible, so not affected by this | 19:46 |
clarkb | maybe a compromise would be to try it soonish and if things break in ways that aren't reasonable to address then we can revert to 6? But it sounds like we expect 8 to work so go for it until we have evidence to the contrary? | 19:46 |
corvus | yeah, i am fully on-board with throwing the emergency brake lever if we find problems | 19:46 |
clarkb | I'm not sure if installing the big pypi ansible package gets you the openstack stuff | 19:46 |
frickler | it does afaict | 19:47 |
corvus | i don't think we should assume that everything will break. :) | 19:47 |
frickler | the problem is to find out what breaks, how will you do that? | 19:47 |
clarkb | ok proposal: switch opendev nowish. If that looks happy plan to switch everyone else less openstack early next week. If that looks good switch openstack late next week or early the week after | 19:47 |
frickler | waiting for people to complain will not work | 19:47 |
clarkb | frickler: I mean if a tree falls in a forest and no one hears... | 19:48 |
clarkb | I understand your concern but if no one is paying attention then we aren't going to solve that either way | 19:48 |
clarkb | we can however react to those who are paying attention | 19:48 |
clarkb | and I think that is the best we can do whether we wait a long time or a short time | 19:48 |
corvus | that works for me; if we want to give openstack more buffer around the release, then switching earlier may help. either sounds good to me. | 19:48 |
clarkb | doesn't look like any of the TC took fungi's invitation to discuss it here so we can't get their input | 19:50 |
fungi | well, can't get their input in this meeting anyway | 19:51 |
clarkb | corvus: just pushed a change to switch opendev | 19:51 |
clarkb | let's land that asap and then if we don't have anything pushing the brakes by tomorrow I can send email to service-announce with the plan I sketched out above | 19:51 |
clarkb | we are running out of time though and I wanted to get to a few more items | 19:52 |
corvus | #link https://review.opendev.org/c/openstack/project-config/+/892405 Switch opendev tenant to Ansible 8 | 19:52 |
frickler | switching the ansible version does work speculatively, right? | 19:52 |
clarkb | frickler: it does at a job level yes | 19:52 |
clarkb | so we can ask people to test things before we switch them too | 19:52 |
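The two knobs in play here are the tenant-level default and the per-job override. A sketch under the assumption that both remain available during the transition (the real tenant change is 892405, linked just above); required tenant sections are elided for brevity.

```yaml
# Tenant-wide default, set in the Zuul tenant config
# (connection/source definitions elided for brevity):
- tenant:
    name: opendev
    default-ansible-version: '8'

# Per-job opt-out for a job that needs more time on the old version:
- job:
    name: some-legacy-job
    ansible-version: '6'
```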
clarkb | Zuul is also planning to uncombine stdout and stderr in command/shell like tasks | 19:52 |
corvus | this one is riskier. | 19:53 |
clarkb | I think this may be more likely to cause problems than the ansible 8 upgrade since it isn't always clear things are going to stderr, particularly when it has just worked historically | 19:53 |
clarkb | we probably need to test this more methodically from a base jobs/base roles standpoint and then move up the job inheritance ladder | 19:53 |
corvus | i think the best we can do there is speculatively test it with zuul-jobs first to see if anything explodes... | 19:54 |
corvus | then what clarkb said | 19:54 |
corvus | we might want to make new base jobs in opendev so we can flip tenants one at a time...? | 19:54 |
corvus | (because we can change the per-tenant default base job...) | 19:54 |
clarkb | oh that is an interesting idea I hadn't considered | 19:55 |
corvus | i would be very happy to let this one bake for quite a while in zuul.... | 19:55 |
clarkb | ok so this is less urgent. That is good to know | 19:55 |
clarkb | we can probably punt on decisions for it now. But keep it in mind and let us know if you have any good ideas for testing that ahead of time | 19:56 |
corvus | like, long enough for lots of community members to upgrade their zuuls and try it out, etc. | 19:56 |
clarkb | And finally early failure detection in tasks via regex matching of task output is in the works | 19:56 |
corvus | there's no ticking clock on this one... | 19:56 |
clarkb | ack | 19:56 |
corvus | the regex change is being gated as we speak | 19:56 |
clarkb | the failure stuff is more "this is a cool feature you might want to take advantage of" it won't affect existing jobs without intervention | 19:57 |
corvus | we'll try it out in zuul and maybe come up with some patterns that can be copied | 19:57 |
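Purely as a hypothetical illustration of the feature corvus describes, since it had not merged at meeting time; the attribute name (failure-output) and exact semantics are assumptions based on the discussion above.

```yaml
# Hypothetical: attribute name and behavior assumed, pending the change
# that was in the gate during this meeting.
- job:
    name: some-long-integration-job
    failure-output:
      - 'FATAL: database migration failed'
      - 'ERROR: deployment step returned non-zero'
```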
clarkb | #topic Base container image updates | 19:58 |
clarkb | really quickly before we run out of time. We are now in a good spot to convert the consumers of the base container images to bookworm | 19:58 |
clarkb | The only one I really expect to maybe have problems is gerrit due to the java stuff, and it may be a non-issue there since containers seem to have fewer issues with this | 19:58 |
corvus | i think zuul is ready for that | 19:59 |
clarkb | help appreciated updating any of our images :) | 19:59 |
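For anyone picking up one of those image updates, the change is typically just the base image tags in the consumer's Dockerfile. A sketch assuming the opendevorg builder/base image pattern and that the bookworm tags are published:

```dockerfile
# Sketch only: tags assumed; the assemble / install-from-bindep helpers come
# from the opendevorg builder and base images respectively.
FROM docker.io/opendevorg/python-builder:3.11-bookworm as builder
COPY . /tmp/src
RUN assemble

FROM docker.io/opendevorg/python-base:3.11-bookworm
COPY --from=builder /output/ /output
RUN /output/install-from-bindep
```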
clarkb | #topic Open Discussion | 19:59 |
clarkb | Anything important last minute before we call it a meeting? | 19:59 |
fungi | nothing here | 20:00 |
clarkb | Thank you everyone for your time. Feel free to continue discussion in our other venues | 20:00 |
clarkb | #endmeeting | 20:01 |
opendevmeet | Meeting ended Tue Aug 22 20:01:02 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:01 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.html | 20:01 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.txt | 20:01 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.log.html | 20:01 |
fungi | thanks clarkb! | 20:01 |