Tuesday, 2024-09-17

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep 17 19:00:28 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
NeilHanlono/ heya19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/OLEKXKOL5LLSYPUH6KMC5KSPZKYR24R6/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI didn't have this in the email but a reminder that if you are eligible to vote in the openstack TC election you have ~1 day to do so19:01
NeilHanlonty for reminder19:01
clarkb#topic Upgrading Old Servers19:03
clarkbtonyb: anything new with the wiki changes? I haven't seen any updates since I last reviewed them. Also I suspect you have been busy with election stuff?19:04
tonybnot doing election stuff this time.  just moving slower than I'd like19:05
tonybI've been addressing your review comments and testing locally 19:06
tonybI should have updates later today19:06
clarkbcool, looks like there were also some comments from frickler19:06
clarkbalso would it help to reorganize the meeting so this topic went at the end given the timezone delta?19:06
tonybyup, I'm looking at those as well19:07
tonybit might, I do tend to miss the very beginning of the meeting 19:07
tonybgood news is that Australia will do its DST transition within a month19:08
clarkbstill easy enough to change the order up for next time. I'll try to remember to do so19:09
clarkbanything else related to new servers?19:09
tonybnot from me19:09
clarkb#topic AFS Mirror Cleanups19:10
clarkbNothing really new on this topic from me, other than that I keep finding distractions when it comes to pushing on xenial cleanups. I do think the next step there is removing dead/idle projects from the zuul tenant config so that we can reduce the number of things with xenial references, then follow up with xenial removal in what remains19:11
clarkbI may take this off the agenda until I'm able to pick that up again19:11
clarkb#topic Rackspace Flex Cloud19:12
clarkbWanted to give an update on where we are with Rackspace's new Flex Cloud region, but I may drop this from next week's agenda too as I think we're overall in a good spot19:13
clarkbWe're using the entirety of our quota and most things seem to be working19:13
clarkbThe small issues we have seen include: this is a floating ip cloud so some jobs have had to adjust to using private ips in their configs instead of public ips (since nodes don't know their public ips)19:13
clarkbthe mtu on the network interfaces is only 1442 instead of the common 1500.19:14
clarkbAnd we sometimes have slowness scanning ssh keys from nodepool which was causing boot timeouts until we increased the timeout19:14
clarkbI do wonder if possibly the mtu thing could cause the slowness ^ there. But it seems like fragmentation should negotiate more quickly than that19:14
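(As a rough illustration of the private-ip adjustment mentioned above: a job can take the address from the usual Zuul/nodepool inventory variables. This is only a hedged sketch, not copied from any particular job, and assumes the nodepool.private_ipv4 / nodepool.public_ipv4 host variables are present.)

    # Hedged sketch: prefer the private address on clouds where the node
    # does not know its own public IP (e.g. floating-ip clouds).
    - name: Pick the address services should bind to
      hosts: all
      tasks:
        - name: Fall back to the public IP when no private one is set
          ansible.builtin.set_fact:
            service_bind_ip: "{{ nodepool.private_ipv4 | default(nodepool.public_ipv4, true) }}"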
fricklerdid we start using swift storage yet?19:15
clarkbContinue to be on the lookout for any unexpected behaviors, they have been receptive to our feedback so far and we can continue to feed that back as well19:15
clarkbfrickler: not yet19:15
clarkbwe did add this cloud region (and openmetal) to the nested virt labels and johnsom reports they both seem to be working for that19:15
clarkbuploading job logs to swift in that region is likely going to be the next good step we take19:15
fungirelated to my swift cleanup work though, it might be worthwhile long term to migrate from classic rackspace swift to flex swift and then ask them to just delete all the containers in our account once the current log data has expired19:16
clarkbas far as setting swift up goes I think the first step is figuring out how auth is supposed to work for that and if our existing auth setup is functional19:16
clarkbif it is then I think we can just add this as a new region in the list that we randomly select from. However I half expect we'll need some new settings and setup will be more involved19:17
fricklerso that would be worth keeping on the agenda then, I'd think19:17
clarkbsure we can do that if we want to track the swift effort that way19:17
frickleralso maybe tracking when they're ready to ramp up quota?19:18
clarkbfungi: I don't think you tried swift auth with our swift accounts in the spin up earlier right?19:18
fungii did not, no19:18
clarkbok so we don't have any idea yet on how that works19:18
clarkbI'll see if I have time later this week to experiment19:18
clarkbfrickler: ya though I half expect that to happen in an email response to the feedback thread I started so not sure we need to check in weekly on the quota situation19:19
clarkbany other questions or concerns or ideas related to the new cloud region?19:20
clarkbsounds like no19:21
clarkb#topic Etherpad 2.2.4 Upgrade19:21
clarkbSo we upgraded and everything seemed happy except for the meetpad integration19:21
clarkbit turns out in version 2.2.2 or similar they updated etherpad to assume it is always in the root window for jquery (I may get some of these details wrong because js)19:21
clarkband since meetpad embeds etherpad this broke etherpad19:22
clarkbother people using etherpad embedded (including jitsi meet users) noticed and reported the issue, which got fixed in the first commit after the 2.2.4 release. Unfortunately there is no 2.2.5 release yet, so we went ahead and deployed a new image that checks out the latest commit (by sha) as of the time of writing that change, and this has fixed things19:22
clarkbIdeally we won't run a random dev commit for very long so I'm still hopeful that 2.2.5 shows up soon. But things seem to work again19:23
tonybmakes sense given the ptg is coming up19:23
fungiyeah, we didn't want to leave it like that any longer than necessary19:24
fungijust glad we remembered to test it once the update was deployed19:24
clarkbif you notice any problems with etherpad or meetpad or the integration between the two please say something19:24
clarkbbut with my admittedly limited in scope and duration testing it seems to be working again19:24
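(For context, pinning an image build to a specific upstream commit rather than a release tag generally looks like the sketch below; the sha is a placeholder, not the actual commit that was deployed.)

    # Hedged sketch: build from the upstream fix by sha instead of a tag.
    git clone https://github.com/ether/etherpad-lite.git
    cd etherpad-lite
    git checkout <sha-of-the-first-commit-after-2.2.4>   # placeholder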
clarkb#topic Updating ansible+ansible-lint versions in our repos19:25
clarkbI'm selfishly keeping this item on the agenda because I'm having a tough time getting reviews :)19:25
clarkb#link https://review.opendev.org/c/openstack/project-config/+/92684819:25
clarkb#link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/92697019:25
clarkbI'd like to get these landed just as part of the ubuntu noble default nodeset maneuver19:26
clarkbI'm happy to address feedback if we feel strongly about any of those ansible rules (eg I can disable them and undo the updates)19:26
clarkbs/ansible rules/ansible-lint rules/19:26
clarkbbut I think getting that updated will help future proof us for a bit19:26
fricklerI was looking at those but still undecided whether to just accept it or complain like corvus did19:27
clarkbbasically don't take this as me advocating for anything in particular other than "run more up to date tools so we can keep up with python releases"19:27
corvusi offer my moral support for skipping those rules :)19:28
clarkbone upside to using a linter is we can avoid complaining about formatting ourselves. That said, as a group we're pretty good about avoiding review nit picks like that, and ansible-lint is extremely opinionated, so we're kind of in a weird situation there19:28
clarkbI suspect that other projects (nova maybe based on recent mailing list emails) get bigger benefits from just going with what the tool says to do19:29
fricklerall those "name"/"hosts" reorderings are the top ones I would likely want to not do19:30
fricklerbut I can also see benefit in just following those, similar to python projects just using black and putting an end to all formatting discussions19:31
clarkbya that's the main thing; the easiest thing is probably to just accept that someone else had an opinion and then fix it once19:32
clarkbanyway if no one feels strongly enough to -1 maybe we should proceed?19:32
clarkbwe can discuss further in review19:33
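(If skipping rules is the outcome, ansible-lint supports a skip_list in its config file. The rule ids below are illustrative only, not a proposal; key-order is the rule behind the name/hosts reordering mentioned later.)

    # Hedged example .ansible-lint snippet; rule ids shown are illustrative.
    skip_list:
      - key-order[play]    # stop enforcing name/hosts/tasks ordering in plays
      - name[casing]       # stop enforcing capitalised task names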
clarkb#topic Zuul-launcher image builds19:33
clarkbThe opendev/zuul-jobs project has been created and is hosting these image build configs now19:33
clarkb#link https://review.opendev.org/c/opendev/zuul-jobs/+/929141 Build a debian bullseye image with dib in a zuul job19:33
clarkbthis change successfully builds a debian bullseye image and I think it just merged19:33
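(Roughly speaking, the job drives diskimage-builder; a minimal hedged example of the kind of command involved, where the element list is illustrative rather than the job's actual configuration:)

    # Hedged example: build a Debian bullseye image with diskimage-builder.
    DIB_RELEASE=bullseye disk-image-create -o debian-bullseye \
        debian-minimal vm simple-init openssh-server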
clarkbI think the next step is to upload it to an intermediate location then configure zuul to fetch and upload that to clouds?19:34
corvusthat is one next step19:35
fricklerso one question I have about this: we use the cache built into our image to prime the new cache, do I understand this correctly?19:35
clarkbcorvus: ^ do we need to disable that image in the nodepool builders to prevent conflicts or will they coordinate via zk and it should work out?19:35
corvusthe other next step, which can also start right now is for someone to run with making more jobs for more images19:35
clarkbfrickler: correct. It's like we are doing mathematical induction on git caches19:35
corvusfrickler: yes19:35
fricklerbut can we start the induction in case we lose our existing images?19:35
corvusclarkb: it's safe to build duplicate images19:36
clarkbfrickler: I think the bootstrap process is to use an existing cloud image to run the job then the build will just take much longer to prime the cache essentially19:36
corvusfrickler: the build should be able to run on an empty cloud node (slowly)19:36
corvusyep19:36
tonybI'm keen to look at the "building more images" thing 19:36
clarkbfrickler: if we find that time is too long we could manually snapshot an instance with the git repos pre cloned and use that image19:36
corvuswe could test that case with a 3 hour job if we want19:36
corvustonyb: ++19:36
NeilHanlon(so I don't get distracted looking at opensearch, I have an update on rocky CI failures w.r.t. "should we mirror rocky")19:37
NeilHanlonjust tag me when you're ready :D19:37
clarkbNeilHanlon: will do19:37
corvusafter we get uploads to object storage working ...19:37
clarkbcorvus: are the uploads to intermediate storage and then eventually to the clouds something you'll be working on?19:38
corvus... the code to have the zuul-launcher actually create cloud images is nearly ready to merge; once that's done, we should have all the pieces in place to watch a zuul-launcher manage a full image build and upload process19:38
corvuswe will need to add the openstack driver though :)19:39
clarkbalso are we running a zuul-launcher node? or do we need to do that too19:39
corvushttps://review.opendev.org/92418819:39
corvussafe to merge any time19:39
clarkb#link https://review.opendev.org/c/opendev/system-config/+/924188/ Run a zuul-launcher19:40
clarkbthanks!19:40
corvusclarkb: if anyone else wants to do the upload to intermediate storage, i welcome it; otherwise i should be able to get to it in a bit.19:40
corvusone open question about that: what intermediate storage do we want?  existing log storage?  new rax flex container?19:40
fungialso because these are huge, we should think carefully about expirations19:41
clarkbcorvus: due to the size of these images and not needing them to live for 30 days I wonder if we should use a dedicated container19:41
clarkbit will just make it easier for humans to grok pruning of the content should we need to19:41
corvus(incidentally, one thing we might want to consider if we don't end up liking the process with cloud storage is that we could use a simple opendev fileserver for our intermediate storage; but i like the idea of starting with object storage)19:41
clarkbbut then we can probably upload to any/all/one of the existing swift locations19:42
corvusdedicated container sounds good.  and i was thinking an expiration of a couple of days should be okay to start with.  maybe we make it longer later, but that should keep the fallout small from any early errors in programming19:42
clarkb++19:42
corvusso if the rax-flex auth question is answered by then, maybe do it there?  otherwise... vexxhost?  rax-dfw?19:43
clarkbprobably not vexxhost since we don't use swift there (we made ceph sad when we tried)19:44
clarkbbut rax-dfw or ovh-bhs1 seem fine19:44
corvusdfw will use fewer intertubes19:44
clarkbthat seems like a good reason to choose it19:44
corvusok, so dedicated container in rax-flex or rax-dfw.  sgtm!19:44
corvusif someone gets rax flex working, maybe please just go ahead and create an extra container? :)19:45
fungiyeah, i noticed that rax classic dfw to rax flex sjc3 communication goes through the internet (but at least they share a common backbone provider)19:45
clarkbcorvus: will do if I manage that19:45
corvusthx!19:46
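(For reference, a hedged sketch of creating a dedicated container and uploading an image with a short expiry using python-swiftclient; the container name is a placeholder, and Swift expiry is applied per object via the X-Delete-After header.)

    # Hedged example; container name and expiry value are placeholders.
    swift post opendev-intermediate-images
    swift upload opendev-intermediate-images debian-bullseye.qcow2 \
        --header "X-Delete-After: 172800"   # expire after ~2 days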
clarkb#topic Mirroring Rocky Linux Packages19:46
clarkbNeilHanlon: hello!19:46
NeilHanlonhi :) 19:46
NeilHanlonso.. i can't get opensearch to do what I want, but19:46
NeilHanlonhttps://drop1.neilhanlon.me/irc/uploads/44fb256b36a4f97b/image.png 19:47
clarkblooks like something keyed off of depsolving?19:47
NeilHanlongreen is "successful", red is a job which had a "Depsolve Failed" message19:47
NeilHanlonyeah19:47
NeilHanlonhttps://drop1.neilhanlon.me/irc/uploads/17b1fdc1dad12d0b/image.png 19:48
NeilHanloni can't seem to generate a short URL otherwise I'd link to the viz19:48
fungiso indicates builds which hit some sort of package access problem i guess19:48
NeilHanlonyeah these I looked into and are almost all because the host got some mirror A for Appstream and mirror B for BaseOS which were not in sync19:48
NeilHanloni'm sure there's others which aren't matching this depsolve message, but the signal was clear for these ones at least19:49
NeilHanlonhttps://paste.opendev.org/show/bHtL7sBLms4vpOIOkxBN/ here.. the opensearch url :D 19:49
clarkbcool. I think that does point to using our own mirrors would have a benefit19:49
fungiwhich raises a related question then... when we mirror, how can we be sure we keep both of those in sync with each other?19:49
clarkb(side note I wonder if the proxies for the upstream mirrors should do some ip stickiness)19:49
fungior are they mirrored as a unit?19:50
clarkbfungi: we would be rsyncing from a single source so in theory that source will be in sync with itself19:50
clarkbfungi: rather than rsyncing from multiple locations which may be out of sync19:50
NeilHanlonright, yeah. using --delay-updates or so19:50
fungiand yeah, we do delay deletions19:50
NeilHanlonalternatively, I've sometimes set it up so that everything except the metadata is synced first, then the metadata can be fetched -- but if you're using --delete that wouldn't work19:51
clarkbso ya as mentioned before the next steps would be to ensure we've got enough disk (using centos 9 stream as a stand in I think we decided we do), then write a mirroring script (it should look very similar to centos 9 stream and other rsync scripts), then an admin can create the afs volume and merge things and get stuff published19:52
NeilHanlonalright, I can work on a mirroring script and open a change for that19:52
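(For a sense of shape, a hedged sketch of what such a script might look like, modeled loosely on the existing rsync-based mirror scripts; the upstream source, paths, excludes and volume name are placeholders.)

    #!/bin/bash -xe
    # Hedged sketch only; a real script should follow the mirror-update conventions.
    BASE=/afs/.openstack.org/mirror/rocky

    rsync -rltvz --delete --delete-after --delay-updates \
        --exclude="*/debug/*" --exclude="*/images/*" --exclude="*/isos/*" \
        rsync://SOME-UPSTREAM-MIRROR/rocky/9/ "$BASE/9/"

    # Publish only after the repo metadata consistency check passes.
    vos release -v mirror.rocky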
tonybSimilar to CentOS, I'm working on a tool that will ensure that all packages in the repomd are available in a mirror, which we can run after rsync and before the vos release19:52
clarkbNeilHanlon: that would be great. Then whoever ends up reviewing that can ensure the afs side is ready for it to land too19:52
tonybI don't think that will help with issues where BaseOS and Appstream are out of sync though :(19:52
clarkbwe can also set the quota on the afs volume such that we don't accidentally sync down too much content19:53
fungiyeah, if there's a semi-quick way we can double-check consistency, afs lets us just avoid publishing that state when it's wrong19:53
clarkbbetter to hit a quota limit than completely run out of disk19:53
NeilHanlonhear hear19:53
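(Capping the volume is a one-line AFS operation; a hedged example where the path and size in kilobytes are placeholders.)

    # Hedged example: set a hard quota (in KB) on the mirror volume.
    fs setquota -path /afs/.openstack.org/mirror/rocky -max 400000000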
tonybfungi: I ran the tool on an afs node and it was < 1min per repo, which is quick enough for me19:54
clarkbtonyb: that is plenty fast compared to how long rsync takes even not syncing any real data19:54
fungiyeah, that's quick, especially where afs is concerned19:54
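(As a very rough illustration of that kind of consistency check, and explicitly not tonyb's actual tool: parse repomd.xml, find the primary metadata, and verify each referenced package exists on disk. This sketch assumes a standard yum/dnf repo layout.)

    # Hedged sketch, not the actual tool: verify packages listed in the repo
    # metadata are present in the mirrored tree before running vos release.
    import gzip
    import sys
    import xml.etree.ElementTree as ET
    from pathlib import Path

    NS = {"repo": "http://linux.duke.edu/metadata/repo",
          "common": "http://linux.duke.edu/metadata/common"}

    def missing_packages(repo_root: str) -> list[str]:
        root = Path(repo_root)
        repomd = ET.parse(root / "repodata" / "repomd.xml")
        # Locate the primary metadata file referenced by repomd.xml.
        primary_href = next(
            d.find("repo:location", NS).get("href")
            for d in repomd.findall("repo:data", NS)
            if d.get("type") == "primary"
        )
        with gzip.open(root / primary_href) as f:
            primary = ET.parse(f)
        # Collect every package location that is not present on disk.
        return [
            pkg.find("common:location", NS).get("href")
            for pkg in primary.findall("common:package", NS)
            if not (root / pkg.find("common:location", NS).get("href")).exists()
        ]

    if __name__ == "__main__":
        gaps = missing_packages(sys.argv[1])
        print(f"{len(gaps)} missing packages")
        sys.exit(1 if gaps else 0)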
clarkbNeilHanlon: and don't hesitate to ask if any questions come up in preparing that script19:54
clarkb#topic Open Discussion19:55
clarkbwe have 5 minutes for anything else before our hour is up19:55
tonybI was thinking so, also very quick if we can avoid a bunch of job failures19:55
fungijust a heads up that i won't be around much thursday/friday this week, or over the weekend19:55
* frickler will also be offline starting thursday, hopefully just a couple of days19:56
* tonyb will be more around again ... albeit in AU :/19:56
clarkbthanks for the heads up19:57
clarkbsounds like that may be just about everything. Thank you for your time today19:57
clarkb#endmeeting19:57
opendevmeetMeeting ended Tue Sep 17 19:57:46 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:57
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-17-19.00.html19:57
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-17-19.00.txt19:57
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-17-19.00.log.html19:57
clarkbwe can end it there and everyone can go find $meal a couple minutes early19:57
corvusthanks!19:57
fungithanks clarkb!19:58
tonybThanks all20:00
