clarkb | hello it is meeting time | 19:01 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Sep 5 19:01:10 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | It feels like yesterday was a holiday. So many things this morning | 19:01 |
fungi | yesterday felt less like a holiday than it could have | 19:01 |
ianychoi | o/ | 19:02 |
fungi | but them's the breaks | 19:02 |
ianychoi | I have not been aware of such holidays but hope that many people had great holidays :) | 19:02 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/F5B2EF7BWK62UQZCTHVGEKER4XFRDSIE/ Our Agenda | 19:02 |
clarkb | ianychoi: I did! it was the last day my parents were here visiting and we made smoked bbq meats | 19:03 |
clarkb | #topic Announcements | 19:03 |
clarkb | I have nothing | 19:03 |
ianychoi | Wow great :) | 19:03 |
ianychoi | (I will bring translation topics during open discussion time.. thanks!) | 19:04 |
clarkb | #topic Infra Root Google Account Activity | 19:05 |
clarkb | I have nothing to report here | 19:05 |
clarkb | I've still got it on the todo list | 19:06 |
clarkb | hopefully soon | 19:06 |
clarkb | #topic Mailman 3 | 19:06 |
clarkb | as mentioned last week fungi thinks we're ready to start migrating additional domains and I agree. That means we need to schedule a time to migrate lists.kata-containers.io and lists.airshipit.org | 19:06 |
fungi | yeah, so i've decided which are the most active lists on each site to notify | 19:07 |
fungi | it seems like thursday september 14 may be a good date to do those two imports | 19:07 |
fungi | if people agree, i can notify the airship-discuss and kata-dev mailing lists with a message similar to the one i sent to zuul-discuss when the lists.zuul-ci.org site was migrated | 19:08 |
fungi | is there a time of day which would make sense to have more interested parties around to handle comms or whatever might arise? | 19:09 |
frickler | I have no idea where the main timezones of those communities are located | 19:10 |
clarkb | airship was in US central time iirc | 19:10 |
clarkb | and kata is fairly global | 19:10 |
frickler | so likely US morning would be best suited to give you some room to handle possible fallout | 19:11 |
fungi | yeah, i'm more concerned with having interested sysadmins around for the maintenance window. the communities themselves will know to plan for the list archives to be temporarily unavailable and for mail deliveries to be deferred | 19:11 |
fungi | i'm happy to run the migration commands (though they're also documented in the planning pad and scripted in system-config too) | 19:12 |
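The authoritative commands are the ones in the planning pad and in system-config, as fungi says; purely as a rough sketch of what a Mailman 2 to 3 import involves (the list address and file paths below are placeholders, not the real ones), the upstream tooling is driven roughly like this:

```python
import subprocess

# Placeholder list address and paths; the real values come from the migration
# plan and the system-config scripts mentioned above.
mlist = "kata-dev@lists.kata-containers.io"

# Import the Mailman 2 list configuration pickle into Mailman 3.
subprocess.run(
    ["mailman", "import21", mlist,
     "/var/lib/mailman/lists/kata-dev/config.pck"],
    check=True,
)

# Import the old pipermail archives into HyperKitty (run with the usual
# DJANGO_SETTINGS_MODULE pointing at the mailman-web settings).
subprocess.run(
    ["django-admin", "hyperkitty_import", "-l", mlist,
     "/var/lib/mailman/archives/private/kata-dev.mbox/kata-dev.mbox"],
    check=True,
)
```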
clarkb | I'm happy to help anytime after about 15:30 UTC | 19:12 |
fungi | but having more people around in general at those times helps | 19:12 |
clarkb | and thursday the 14th should work for me anytime after then | 19:12 |
fungi | the migration process will probably take around an hour start to finish, and that includes dns propagation | 19:13 |
fungi | i'll revisit what we did for the first migrations, but basically we leveraged dns as a means of making incoming messages temporarily undeliverable until the migration was done, and then updated dns to point to the new server | 19:14 |
fungi | for kata it may be easier since it's on its own server already, the reason we did it via dns for the first migrations is that they shared a server with other sites that weren't moving at the same times | 19:15 |
clarkb | using DNS seemed to work fine though so we can stick to that | 19:16 |
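As an illustration of that DNS-based cutover, a small dnspython loop like the one below can confirm propagation before deliveries are un-deferred; the hostname is one of the sites being moved, but the address and the polling approach are just a sketch, not the actual runbook:

```python
import time

import dns.exception
import dns.resolver  # pip install dnspython

# Placeholder values: the real names and addresses live in the migration plan.
SITE = "lists.airshipit.org"
NEW_SERVER_IP = "203.0.113.10"  # documentation-range address, illustrative only


def migrated(name: str, expected: str) -> bool:
    """Return True once the A record for the list site points at the new host."""
    try:
        answers = dns.resolver.resolve(name, "A")
    except dns.exception.DNSException:
        return False
    return any(rr.address == expected for rr in answers)


while not migrated(SITE, NEW_SERVER_IP):
    time.sleep(30)
print(f"{SITE} now resolves to {NEW_SERVER_IP}; cutover has propagated")
```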
fungi | frickler: does 15:30-16:30 utc work well for you? | 19:16 |
fungi | if so i'll let the airship and kata lists know that's the time we're planning | 19:16 |
frickler | I'm not sure I'll be around, but time is fine in general | 19:17 |
frickler | *the time | 19:17 |
fungi | okay, no worries. thanks! | 19:17 |
clarkb | great see yall then | 19:17 |
clarkb | anything else mailman 3 related? | 19:17 |
fungi | let's go with 15:30-16:30 utc on thursday 2023-09-14 | 19:17 |
fungi | i'll send announcements some time tomorrow | 19:17 |
clarkb | thanks | 19:18 |
clarkb | #topic Server Upgrades | 19:19 |
clarkb | Nothing new here either | 19:20 |
clarkb | #topic Rax IAD image upload struggles | 19:20 |
clarkb | Lots of progress/news here though | 19:20 |
clarkb | fungi filed a ticket with rax and the response was essentially that iad is expected to behave differently | 19:20 |
fungi | yes, sad panda | 19:20 |
clarkb | this means we can't rely on the cloud provider to fix it for us. Instead we've reduced the number of upload threads to 1 per builder and increased the image rebuild time intervals | 19:21 |
fungi | though that prompted us to look at whether we're too aggressively updating our images | 19:21 |
clarkb | the idea here is that we don't actually need to rebuild every image constantly and can more conservatively update things in a way that the cloud region can hopefully keep up with | 19:21 |
corvus | what's the timing look like now? how long between update cycles? and how long does a full update cycle take? | 19:22 |
fungi | yeah, basically update the default nodeset's label daily, current versions of other distros every 2 days, and older versions of distros weekly | 19:22 |
frickler | but that was only merged 2 days ago, so no effect seen yet | 19:22 |
corvus | makes sense | 19:22 |
fungi | noting that the "default nodeset" isn't necessarily always consistent across tenants, but we can come up with a flexible policy there | 19:23 |
clarkb | fungi: we currently define it centrally in opendev/base-jobs though | 19:23 |
fungi | and this is an experiment in order to hopefully get reasonably current images for jobs while minimizing the load we put on our providers' image services | 19:23 |
frickler | looking at the upload ids, some images seem to have taken 4 attempts in iad to be successful | 19:23 |
corvus | maybe we could look at adjusting that to so that current versions of all distros are updated daily. once we have more data. | 19:24 |
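For context, the tiered cadence fungi describes maps onto nodepool's per-diskimage rebuild-age setting (expressed in seconds); the sketch below only models the 1/2/7-day policy to make the tiers concrete, and the label groupings are assumed rather than taken from the actual config:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed tier groupings; the real knob is the rebuild-age value (in seconds)
# on each diskimage in the nodepool builder configuration.
REBUILD_INTERVALS = {
    "default-nodeset": timedelta(days=1),   # e.g. ubuntu-jammy
    "current-distros": timedelta(days=2),   # current versions of other distros
    "older-distros": timedelta(days=7),     # older distro versions
}


def rebuild_due(tier: str, last_built: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if an image in the given tier is older than its interval."""
    now = now or datetime.now(timezone.utc)
    return now - last_built >= REBUILD_INTERVALS[tier]


# A default-nodeset image built 30 hours ago is due; a 30-hour-old image in the
# every-2-days tier is not.
built = datetime.now(timezone.utc) - timedelta(hours=30)
print(rebuild_due("default-nodeset", built))   # True
print(rebuild_due("current-distros", built))   # False
```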
fungi | frickler: that also means we probably have new leaked images in iad; we should look at the metadata now to see if we can identify for sure why nodepool doesn't clean them up | 19:24 |
frickler | I was looking at nodepool stats in grafana and > 50% of the average used nodes were jammy | 19:24 |
frickler | fungi: yes | 19:24 |
fungi | corvus: yes, that also seems reasonable | 19:24 |
corvus | if you find a leaked image, ping me with the details please | 19:25 |
fungi | corvus: will do, thanks! | 19:25 |
fungi | i'll try to take a look after dinner | 19:25 |
fungi | i'll avoid cleaning them up until we have time to go over them | 19:26 |
corvus | it's probably enough to keep one around if you want to clean up others | 19:26 |
fungi | the handful we've probably leaked pales in comparison to the 1200 i cleaned up in iad recently | 19:26 |
fungi | so i'll probably just delay cleanup until we're satisfied | 19:27 |
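For the leak inspection itself, something along these lines with openstacksdk would list candidate images and their metadata without deleting anything; the cloud name, and the assumption that nodepool stamps its uploads with nodepool_* properties, are assumptions rather than verified details:

```python
import openstack

# Assumed clouds.yaml entry name; adjust to the real credentials for rax iad.
conn = openstack.connect(cloud="rax-iad")

for image in conn.image.images():
    props = image.properties or {}
    # Images carrying nodepool-style metadata but unknown to the builders are
    # leak candidates; print them for review rather than deleting anything.
    if any(key.startswith("nodepool") for key in props):
        print(image.id, image.name, image.status, props)
```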
clarkb | sounds like that is all for images | 19:29 |
clarkb | #topic Fedora cleanup | 19:29 |
clarkb | The nodeset removal from base-jobs landed | 19:29 |
clarkb | I've also seen some projects like ironic push changes to clean up their use of fedora | 19:29 |
clarkb | I think the next step is to actually remove the label (and images) from nodepool when we think people have had enough time to prepare | 19:30 |
clarkb | should we send an email announcing a date for that? | 19:30 |
frickler | well, none of that would still work, or was it only devstack that was broken? | 19:31 |
fungi | i thought devstack dropped fedora over a month ago | 19:31 |
fungi | dropped the devstack-defined fedora nodesets anyway | 19:31 |
frickler | but didn't all fedora testing stop working when they pulled their repo content? | 19:31 |
clarkb | frickler: yes, unless jobs got updated to pull from other locations | 19:32 |
fungi | yes, well anything that tried to install packages anyway | 19:32 |
clarkb | I don't think we need to wait very long as you are correct most things would be very broken | 19:32 |
clarkb | more of a final warning if anyone had this working somehow | 19:32 |
frickler | I don't think waiting a week or two will help anyone, but it also doesn't hurt us | 19:33 |
clarkb | ya I was thinking about a week | 19:33 |
clarkb | maybe announce removal for Monday? | 19:33 |
fungi | wfm | 19:33 |
frickler | ack | 19:33 |
clarkb | that gives me time to write the changes for it too :) | 19:33 |
clarkb | cool | 19:33 |
clarkb | #topic Zuul Ansible 8 Default | 19:34 |
clarkb | All of the OpenDev Zuul tenants are ansible 8 by default now | 19:34 |
clarkb | I haven't heard of or seen anyone needing to pin to ansible 6 either | 19:34 |
corvus | what's the openstack switcheroo date? | 19:34 |
fungi | yesterday | 19:35 |
clarkb | This doesn't need to be on the agenda for next week, but I wanted to make note of this and remind people to call out oddities if they see them | 19:35 |
frickler | it was yesterday | 19:35 |
frickler | there's only one concern that I mentioned earlier: we might not notice when jobs pass that actually should fail | 19:35 |
corvus | cool. :) sorry i misread comment from clark :) | 19:35 |
frickler | that might happen if the new ansible changed the error handling | 19:35 |
clarkb | frickler: yes, I think that risk exists but it seems to be a low probability | 19:35 |
clarkb | since ansible is fail by default if anything goes wrong generally | 19:35 |
fungi | frickler: i probably skimmed that comment too quickly earlier, what error handling changed in 8? | 19:36 |
fungi | or was it hypothetical? | 19:36 |
frickler | that was purely hypothetical | 19:36 |
fungi | okay, yes i agree that there are a number of hypothetical concerns with any upgrade | 19:36 |
fungi | since in theory any aspect of the software can change | 19:36 |
clarkb | yup mostly just be aware there may be behavior changes and if you see them please let zuul and opendev folks know | 19:38 |
fungi | from a practical standpoint, unless anyone has mentioned specific changes to error handling in ansible 8 i'm not going to lose sleep over the possibility of that sort of regression, but we should of course be mindful of the ever-present possibility | 19:38 |
clarkb | both of us will be interested in any observed differences even if they are minor | 19:38 |
clarkb | (one thing I want to look at if I ever find time is performance) | 19:38 |
clarkb | #topic Zuul PCRE regex support is deprecated | 19:39 |
clarkb | The automatic weekend upgrade of zuul pulled in changes to deprecate PCRE regexes within zuul. This results in warnings where regexes that re2 cannot support are used | 19:39 |
clarkb | There was a bug that caused these warnings to prevent new config updates from being usable. We tracked down and fixed those bugs and corvus restarted zuul schedulers outside of the automated upgrade system | 19:40 |
corvus | sorry for the disruption, and thanks for the help | 19:40 |
clarkb | Where that leaves us is opendev's zuul configs will need to be updated to remove pcre regexes. I don't think this is super urgent but cutting down on the warnings in the error list helps reduce noise | 19:40 |
fungi | no need to apologize, thanks for implementing a very useful feature | 19:41 |
fungi | and for the fast action on the fixes too | 19:41 |
corvus | #link https://review.opendev.org/893702 merged change from frickler for project-config | 19:41 |
corvus | #link https://review.opendev.org/893792 change to openstack-zuul-jobs | 19:42 |
corvus | that showed us an issue with zuul-sphinx which should be resolved soon | 19:42 |
corvus | i think those 2 changes will take care of most of the "central" stuff. | 19:42 |
corvus | after they merge, i can write a message letting the wider community know about the change, how to make updates, point at those changes, etc. | 19:43 |
fungi | i guess the final decision on the !zuul user match was that the \S trailer wasn't needed any longer? | 19:43 |
corvus | i agreed with that assessment and approved it | 19:43 |
fungi | okay, cool | 19:43 |
frickler | I also checked the git history of that line and it seemed to agree with that assessment | 19:44 |
fungi | yay for simplicity | 19:44 |
corvus | in general, there's a lot less line noise in our branch matchers now. :) | 19:44 |
fungi | thankfully | 19:44 |
corvus | also, quick reminder in case it's useful -- branches is a list, so you can make a whole list of positive and negated regexes if you need to. the list is boolean or. | 19:45 |
corvus | i haven't run into a case where that's necessary, or looks better than just a (a|b|c) sequence, but it's there if we need it. | 19:45 |
fungi | a list of or'ed negated regexes wouldn't work would it? | 19:46 |
corvus | i mean... it'd "work"... | 19:46 |
fungi | !a|!b would include everything | 19:46 |
opendevmeet | fungi: Error: "a|!b" is not a valid command. | 19:46 |
fungi | opendevmeet agrees | 19:46 |
opendevmeet | fungi: Error: "agrees" is not a valid command. | 19:46 |
fungi | and would like to subscribe to our newsletter | 19:46 |
corvus | but yes, probably of limited utility. :) | 19:46 |
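To make both points concrete with the stdlib re module (which, unlike re2, still accepts lookahead), here is roughly what moving from a PCRE-style negative lookahead to a negated positive matcher looks like, plus why OR-ing negated matchers ends up matching every branch:

```python
import re

branches = ["master", "stable/zed", "feature/foo"]

# PCRE-style negative lookahead: matches any branch that is not master.
# The stdlib re module compiles this fine, but re2 (and therefore Zuul's new
# matcher) rejects lookahead constructs entirely.
pcre_style = re.compile(r"^(?!master$).*")
print([b for b in branches if pcre_style.match(b)])  # ['stable/zed', 'feature/foo']

# The re2-friendly replacement is a positive regex that the config marks as
# negated; the effect is roughly this:
positive = re.compile(r"^master$")
print([b for b in branches if not positive.match(b)])  # ['stable/zed', 'feature/foo']

# And the pitfall from the discussion: OR-ing two *negated* matchers matches
# every branch, because any branch fails to match at least one of them.
neg_a = re.compile(r"^master$")
neg_b = re.compile(r"^stable/zed$")
print([b for b in branches if (not neg_a.match(b)) or (not neg_b.match(b))])
# ['master', 'stable/zed', 'feature/foo']  -- i.e. everything
```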
clarkb | anything else on regexes? | 19:47 |
corvus | nak | 19:47 |
clarkb | have a couple more things to get to and we are running out of time. Thanks | 19:47 |
clarkb | #topic Bookworm updates | 19:47 |
clarkb | #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm. | 19:47 |
clarkb | I think we're ready to proceed on this with zuul if zuul is ready. But nodepool may pose problems between ext4 options and older grub | 19:47 |
clarkb | I am helping someone debug this today and I'm not sure yet if bookworm is affected. But generally you can create an ext4 fs that grub doesn't like using newer mkfs | 19:48 |
fungi | what's grub's problem there? | 19:48 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1844012 appears related | 19:48 |
corvus | would we notice issues at the point where we already put the new image in service? so would need to roll back to "yesterday's"* image? | 19:48 |
clarkb | fungi: basically grub says "unknown filesystem" when it doesn't like the features enabled in the ext4 fs | 19:48 |
corvus | (* yesterday may not equal yesterday anymore) | 19:48 |
clarkb | corvus: no, we would fail to build images in the first place | 19:48 |
fungi | are we using grub in our bookworm containers? | 19:49 |
clarkb | fungi: the bookworm containers run dib which makes grub for all of our images | 19:49 |
clarkb | *all of our vm images | 19:49 |
corvus | oh ok. in that case... is there dib testing using the nodepool container? | 19:49 |
fungi | oh, right, that's the nodepool connection. now i understand | 19:49 |
clarkb | corvus: oh there should be so we can use a depends on | 19:49 |
clarkb | corvus: good idea | 19:49 |
corvus | clarkb: not sure if they're in the same tenant though | 19:49 |
clarkb | sorry I'm just being made aware of this as our meeting started so it's all new to me | 19:49 |
fungi | exciting | 19:50 |
corvus | but if depends-on doesn't work, some other ideas: | 19:50 |
clarkb | corvus: hrm they aren't. But we can probably figure out some way of testing that. Maybe hardcoding off of the intermediate registry | 19:50 |
corvus | yeah, manually specify the intermediate registry container | 19:50 |
corvus | or we could land it in nodepool and fast-revert if it breaks | 19:50 |
fungi | absolute worst case, new images we upload will fail to boot, so jobs will end up waiting indefinitely for node assignments until we roll back to prior images | 19:51 |
clarkb | it is possible that bookworm avoids the feature anyway and we're fine so definitely worth testing | 19:51 |
clarkb | fungi: dib fails hard on the grub failure so it shouldn't get that far | 19:51 |
clarkb | corvus: ya I'll keep the test in production alternative in mind if I can't find an easy way to test it otherwise | 19:51 |
fungi | oh, so our images will just get stale? even less impact, we just have to watch for it since it may take a while to notice otherwise | 19:51 |
corvus | oh, one other possibility might be a throwaway nodepool job that uses an older distro | 19:51 |
corvus | (since nodepool does have one functional job that builds images with dib; just probably not on an affected distro) | 19:52 |
clarkb | fungi: right dib running in bullseye today runs mkfs.ext4 and creates a new ext4 fs that focal grub can install into. When we switch to bookworm the concern is that grub will say unknown filesystem, exit with an error, and the image build will fail | 19:52 |
clarkb | but it shouldn't ever upload | 19:52 |
clarkb | corvus: oh cool I can look at that too. And see this is happening at build not boot time we don't need complicated verification of the end result. Just that the build itself succeeds | 19:53 |
clarkb | *since this is happening | 19:53 |
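For anyone who wants to reproduce the failure mode clarkb describes without a full dib run, formatting a scratch file with and without the newer ext4 features is a quick probe; the specific features named below (metadata_csum_seed, orphan_file) are the usual suspects when pairing newer e2fsprogs with pre-2.12 grub, but whether bookworm's e2fsprogs actually enables them by default is exactly the open question:

```python
import subprocess
import tempfile

# Create a small scratch image to format; purely for experimentation.
img = tempfile.NamedTemporaryFile(suffix=".img", delete=False)
subprocess.run(["truncate", "-s", "256M", img.name], check=True)

# Default mkfs.ext4: with a new enough e2fsprogs this may enable features
# (e.g. metadata_csum_seed or orphan_file) that older GRUB reports as
# "unknown filesystem".
subprocess.run(["mkfs.ext4", "-F", img.name], check=True)

# Conservative variant: explicitly disable the suspect features so an older
# grub-install inside the chroot can still read the filesystem.
subprocess.run(
    ["mkfs.ext4", "-F", "-O", "^metadata_csum_seed,^orphan_file", img.name],
    check=True,
)
```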
clarkb | that was all I had. Happy for zuul to proceed with bookworm in the meantime | 19:53 |
clarkb | #topic Open Discussion | 19:54 |
clarkb | ianychoi: I know you wanted to talk about the Zanata db/stats api stuff? | 19:54 |
ianychoi | Yep | 19:54 |
clarkb | I'm not aware of us doing anything special to prevent the stats api from being used. Which is why I wonder if it is an admin only function | 19:55 |
clarkb | If it is i think we can provide you or someone else with admin access to use the api. | 19:55 |
ianychoi | First, public APIs for user stats do not work - e.g., https://translate.openstack.org/rest/stats/user/ianychoi/2022-10-05..2023-03-22 | 19:55 |
clarkb | I am concerned that I don't know what all goes into that database though so am wary of providing a database dump. But others may know more and have other thoughts | 19:55 |
ianychoi | It previously worked for calculating translation stats to sync with Stackalytics and to calculate extra ATC status | 19:56 |
ianychoi | The root cause might be some messy DB state in the Zanata instance, but it is not easy to get help.. | 19:57 |
ianychoi | So I thought investigating the DB issues would be one idea. | 19:57 |
clarkb | I see. Probably before we get that far we should check the server logs to see if there are any errors associated with those requests. I do note that the zanata rest api documentation doesn't seem to show user stats just project stats | 19:58 |
clarkb | http://zanata.org/zanata-platform/rest-api-docs/resource_StatisticsResource.html maybe you can get the data on a project by project basis instead? | 19:59 |
frickler | I just noticed once again that I'm too young an infra-root to be able to access that host, but I'm fine with not changing that situation | 19:59 |
ianychoi | Yep I also figured out that project APIs are working well | 19:59 |
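For reference, a minimal comparison of the failing per-user call and the working per-project call could look like this with requests; the user URL is the one quoted above, while the project path is inferred from the linked StatisticsResource docs and the slugs are illustrative:

```python
import requests

BASE = "https://translate.openstack.org/rest"
HEADERS = {"Accept": "application/json"}

# Per-user stats endpoint that used to feed Stackalytics / extra-ATC counts;
# this is the call that currently fails.
user = requests.get(
    f"{BASE}/stats/user/ianychoi/2022-10-05..2023-03-22", headers=HEADERS, timeout=30
)
print(user.status_code)

# Project/version stats, which are reported to still work; the slugs here are
# illustrative, not a specific OpenStack project.
proj = requests.get(
    f"{BASE}/stats/proj/example-project/iter/master", headers=HEADERS, timeout=30
)
print(proj.status_code, proj.json() if proj.ok else proj.text)
```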
clarkb | frickler: interesting, I thought that server was getting users managed by ansible like the other servers do | 19:59 |
ianychoi | So, ideally some help working on this part together with infra-root would be great | 19:59 |
ianychoi | Or maybe me or Seongsoo need to step up :p | 20:00 |
clarkb | ianychoi: ok, the main thing is that this service is long deprecated so we're unlikely to be able to invest much in it. But I think we can check logs for obvious errors. | 20:00 |
fungi | frickler: you have an ssh account on translate.openstack.org | 20:00 |
fungi | but maybe it's set up wrong or something | 20:00 |
ianychoi | Agree with clarkb. Would you help check logs? | 20:01 |
ianychoi | Or feel free to point me to them so that I can investigate the detailed log messages | 20:01 |
fungi | service admins for that platform are probably added manually though not through ansible | 20:01 |
clarkb | ianychoi: yes a root will need to check the logs. I can look later today | 20:01 |
ianychoi | Thank you! | 20:02 |
frickler | ah, I was using the wrong username. so I can check tomorrow if no one else has time | 20:02 |
fungi | i think pleia2 added our sysadmins as zanata admins back when the server was first set up, but it likely hasn't been revisited since | 20:02 |
clarkb | sounds like a plan we can sync up from there | 20:02 |
clarkb | but we are out of time. Feel free to bring up more discussion in #opendev or on the service-discuss mailing list | 20:02 |
clarkb | thank you everyone! | 20:03 |
clarkb | #endmeeting | 20:03 |
opendevmeet | Meeting ended Tue Sep 5 20:03:03 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:03 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.html | 20:03 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.txt | 20:03 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.log.html | 20:03 |
clarkb | (I can smell lunch and I am very hungry :) ) | 20:03 |
ianychoi | Thank you all | 20:03 |
fungi | thanks! | 20:03 |