Tuesday, 2024-09-17

clarkbmeeting time19:00
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Sep 17 19:00:28 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
NeilHanlono/ heya19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/OLEKXKOL5LLSYPUH6KMC5KSPZKYR24R6/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI didn't have this in the email but a reminder that if you are eligible to vote in the openstack TC election you have ~1 day to do so19:01
NeilHanlonty for reminder19:01
clarkb#topic Upgrading Old Servers19:03
clarkbtonyb: anything new with the wiki changes? I haven't seen any updates since I last reviewed them. Also I suspect you have been busy with election stuff?19:04
tonybnot doing election stuff this time.  just moving slower than I'd like19:05
tonybI've been addressing your review comments and testing locally 19:06
tonybI should have updates later today19:06
clarkbcool, looks like there were also some comments from frickler19:06
clarkbalso would it help to reorganize the meeting so this topic went at the end given the timezone delta?19:06
tonybyup, I'm looking at those as well19:07
tonybit might, I do tend to miss the very beginning of the meeting 19:07
tonybgood news is that Australia will do its DST transition within a month19:08
clarkbstill easy enough to change the order up for next time. I'll try to remember to do so19:09
clarkbanything else related to new servers?19:09
tonybnot from me19:09
clarkb#topic AFS Mirror Cleanups19:10
clarkbNothing really new on this topic from me, other than that I keep finding distractions when it comes to pushing on xenial cleanups. I do think the next step there is removing dead/idle projects from the zuul tenant config so that we can reduce the number of things with xenial references, then follow up with xenial removal in what remains19:11
clarkbI may take this off the agenda until I'm able to pick that up again19:11
clarkb#topic Rackspace Flex Cloud19:12
clarkbWanted to give an update on where we are with Rackspace's new Flex Cloud region, but I may drop this from next week's agenda too as I think we're overall in a good spot19:13
clarkbWe're using the entirety of our quota and most things seem to be working19:13
clarkbThe small issues we have seen include: this is a floating ip cloud so some jobs have had to adjust to using private ips in their configs instead of public ips (since nodes don't know their public ips)19:13
clarkbthe mtu on the network interfaces is only 1442 instead of the common 1500.19:14
clarkbAnd we sometimes have slowness scanning ssh keys from nodepool which was causing boot timeouts until we increased the timeout19:14
clarkbI do wonder if possibly the mtu thing could cause the slowness ^ there. But it seems like fragmentation should negotiate more quickly than that19:14
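(As a rough illustration of the private-ip adjustment mentioned above: a job can take the address from the usual Zuul/nodepool inventory variables. This is only a hedged sketch, not copied from any particular job, and assumes the nodepool.private_ipv4 / nodepool.public_ipv4 host variables are present.)

    # Hedged sketch: prefer the private address on clouds where the node
    # does not know its own public IP (e.g. floating-ip clouds).
    - name: Pick the address services should bind to
      hosts: all
      tasks:
        - name: Fall back to the public IP when no private one is set
          ansible.builtin.set_fact:
            service_bind_ip: "{{ nodepool.private_ipv4 | default(nodepool.public_ipv4, true) }}"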
fricklerdid we start using swift storage yet?19:15
clarkbContinue to be on the lookout for any unexpected behaviors, they have been receptive to our feedback so far and we can continue to feed that back as well19:15
clarkbfrickler: not yet19:15
clarkbwe did add this cloud region (and openmetal) to the nested virt labels and johnsom reports they both seem to be working for that19:15
clarkbuploading job logs to swift in that region is likely going to be the next good step we take19:15
fungirelated to my swift cleanup work though, it might be worthwhile long term to migrate from classic rackspace swift to flex swift and then ask them to just delete all the containers in our account once the current log data has expired19:16
clarkbas far as setting swift up goes I think the first step is figuring out how auth is supposed to work for that and if our existing auth setup is functional19:16
clarkbif it is then I think we can just add this as a new region in the list that we randomly select from. However I half expect we'll need some new settings and setup will be more involved19:17
fricklerso that would be worth keeping on the agenda then, I'd think19:17
clarkbsure we can do that if we want to track the swift effort that way19:17
frickleralso maybe tracking when they're ready to ramp up quota?19:18
clarkbfungi: I don't think you tried swift auth with our swift accounts in the spin up earlier right?19:18
fungii did not, no19:18
clarkbok so we don't have any idea yet on how that works19:18
clarkbI'll see if I have time later this week to experiment19:18
clarkbfrickler: ya though I half expect that to happen in an email response to the feedback thread I started so not sure we need to check in weekly on the quota situation19:19
clarkbany other questions or concerns or ideas related to the new cloud region?19:20
clarkbsounds like no19:21
clarkb#topic Etherpad 2.2.4 Upgrade19:21
clarkbSo we upgraded and everything seemed happy except for the meetpad integration19:21
clarkbit turns out in version 2.2.2 or similar they updated etherpad to assume it is always in the root window for jquery (I may get some of these details wrong because js)19:21
clarkband since meetpad embeds etherpad this broke etherpad19:22
clarkbother people using etherpad embedded (including jitsi meet users) noticed and reported the issue, which got fixed in the first commit after the 2.2.4 release. Unfortunately there is no 2.2.5 release yet, so we went ahead and deployed a new image that checks out the latest commit (by sha) as of the time of writing that change, and this has fixed things19:22
clarkbIdeally we won't run a random dev commit for very long so I'm still hopeful that 2.2.5 shows up soon. But things seem to work again19:23
tonybmakes sense given the ptg is coming up19:23
fungiyeah, we didn't want to leave it like that any longer than necessary19:24
fungijust glad we remembered to test it once the update was deployed19:24
clarkbif you notice any problems with etherpad or meetpad or the integration between the two please say something19:24
clarkbbut with my admittedly limited in scope and duration testing it seems to be working again19:24
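(For context, pinning an image build to a specific upstream commit rather than a release tag generally looks like the sketch below; the sha is a placeholder, not the actual commit that was deployed.)

    # Hedged sketch: build from the upstream fix by sha instead of a tag.
    git clone https://github.com/ether/etherpad-lite.git
    cd etherpad-lite
    git checkout <sha-of-the-first-commit-after-2.2.4>   # placeholder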
clarkb#topic Updating ansible+ansible-lint versions in our repos19:25
clarkbI'm selfishly keeping this item on the agenda because I'm having a tough time getting reviews :)19:25
clarkb#link https://review.opendev.org/c/openstack/project-config/+/92684819:25
clarkb#link https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/92697019:25
clarkbI'd like to get these landed just as part of the ubuntu noble default nodeset maneuver19:26
clarkbI'm happy to address feedback if we feel strongly about any of those ansible rules (eg I can disable them and undo the updates)19:26
clarkbs/ansible rules/ansible-lint rules/19:26
clarkbbut I think getting that updated will help future proof us for a bit19:26
fricklerI was looking at those but still undecided whether to just accept it or complain like corvus did19:27
clarkbbasically don't take this as me advocating for anything in particular other than "run more up to date tools so we can keep up with python releases"19:27
corvusi offer my moral support for skipping those rules :)19:28
clarkbone upside to using a linter is we can avoid complaining about formatting ourselves. That said, as a group we're pretty good about avoiding review nit picks like that, and ansible-lint is extremely opinionated, so we're kind of in a weird situation there19:28
clarkbI suspect that other projects (nova maybe based on recent mailing list emails) get bigger benefits from just going with what the tool says to do19:29
fricklerall those "name"/"hosts" reorderings are the top ones I would likely want to not do19:30
fricklerbut I can also see benefit in just following those, similar to python projects just using black and putting an end to all formatting discussions19:31
clarkbya that's the main thing; the easiest thing is probably to just accept that someone else had an opinion and then fix it once19:32
clarkbanyway if no one feels strongly enough to -1 maybe we should proceed?19:32
clarkbwe can discuss further in review19:33
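(If skipping rules is the outcome, ansible-lint supports a skip_list in its config file. The rule ids below are illustrative only, not a proposal; key-order is the rule behind the name/hosts reordering mentioned later.)

    # Hedged example .ansible-lint snippet; rule ids shown are illustrative.
    skip_list:
      - key-order[play]    # stop enforcing name/hosts/tasks ordering in plays
      - name[casing]       # stop enforcing capitalised task names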
clarkb#topic Zuul-launcher image builds19:33
clarkbThe opendev/zuul-jobs project has been created and is hosting these image build configs now19:33
clarkb#link https://review.opendev.org/c/opendev/zuul-jobs/+/929141 Build a debian bullseye image with dib in a zuul job19:33
clarkbthis change successfully builds a debian bullseye image and I think it just merged19:33
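(Roughly speaking, the job drives diskimage-builder; a minimal hedged example of the kind of command involved, where the element list is illustrative rather than the job's actual configuration:)

    # Hedged example: build a Debian bullseye image with diskimage-builder.
    DIB_RELEASE=bullseye disk-image-create -o debian-bullseye \
        debian-minimal vm simple-init openssh-server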
clarkbI think the next step is to upload it to an intermediate location then configure zuul to fetch and upload that to clouds?19:34
corvusthat is one next step19:35
fricklerso one question I have about this: we use the cache built into our image to prime the new cache, do I understand this correctly?19:35
clarkbcorvus: ^ do we need to disable that image in the nodepool builders to prevent conflicts or will they coordinate via zk and it should work out?19:35
corvusthe other next step, which can also start right now is for someone to run with making more jobs for more images19:35
clarkbfrickler: correct. It's like we are doing mathematical induction on git caches19:35
corvusfrickler: yes19:35
fricklerbut can we start the induction in case we lose our existing images?19:35
corvusclarkb: it's safe to build duplicate images19:36
clarkbfrickler: I think the bootstrap process is to use an existing cloud image to run the job then the build will just take much longer to prime the cache essentially19:36
corvusfrickler: the build should be able to run on an empty cloud node (slowly)19:36
corvusyep19:36
tonybI'm keen to look at the "building more images" thing 19:36
clarkbfrickler: if we find that time is too long we could manually snapshot an instance with the git repos pre cloned and use that image19:36
corvuswe could test that case with a 3 hour job if we want19:36
corvustonyb: ++19:36
NeilHanlon(so I don't get distracted looking at opensearch, I have an update on rocky CI failures w.r.t. "should we mirror rocky")19:37
NeilHanlonjust tag me when you're ready :D19:37
clarkbNeilHanlon: will do19:37
corvusafter we get uploads to object storage working ...19:37
clarkbcorvus: are the uploads to intermediate storage and then eventually to the clouds something you'll be working on?19:38
corvus... the code to have the zuul-launcher actually create cloud images is nearly ready to merge; once that's done, we should have all the pieces in place to watch a zuul-launcher manage a full image build and upload process19:38
corvuswe will need to add the openstack driver though :)19:39
clarkbalso are we running a zuul-launcher node? or do we need to do that too19:39
corvushttps://review.opendev.org/92418819:39
corvussafe to merge any time19:39
clarkb#link https://review.opendev.org/c/opendev/system-config/+/924188/ Run a zuul-launcher19:40
clarkbthanks!19:40
corvusclarkb: if anyone else wants to do the upload to intermediate storage, i welcome it; otherwise i should be able to get to it in a bit.19:40
corvusone open question about that: what intermediate storage do we want?  existing log storage?  new rax flex container?19:40
fungialso because these are huge, we should think carefully about expirations19:41
clarkbcorvus: due to the size of these images and not needing them to live for 30 days I wonder if we should use a dedicated container19:41
clarkbit will just make it easier for humans to grok pruning of the content should we need to19:41
corvus(incidentally, one thing we might want to consider if we don't end up liking the process with cloud storage is that we could use a simple opendev fileserver for our intermediate storage; but i like the idea of starting with object storage)19:41
clarkbbut then we can probably upload to any/all/one of the existing swift locations19:42
corvusdedicated container sounds good.  and i was thinking an expiration of a couple of days should be okay to start with.  maybe we make it longer later, but that should keep the fallout small from any early errors in programming19:42
clarkb++19:42
corvusso if the rax-flex auth question is answered by then, maybe do it there?  otherwise... vexxhost?  rax-dfw?19:43
clarkbprobably not vexxhost since we don't use swift there (we made ceph sad when we tried)19:44
clarkbbut rax-dfw or ovh-bhs1 seem fine19:44
corvusdfw will use fewer intertubes19:44
clarkbthat seems like a good reason to choose it19:44
corvusok, so dedicated container in rax-flex or rax-dfw.  sgtm!19:44
corvusif someone gets rax flex working, maybe please just go ahead and create an extra container? :)19:45
fungiyeah, i noticed that rax classic dfw to rax flex sjc3 communication goes through the internet (but at least they share a common backbone provider)19:45
clarkbcorvus: will do if I manage that19:45
corvusthx!19:46
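(For reference, a hedged sketch of creating a dedicated container and uploading an image with a short expiry using python-swiftclient; the container name is a placeholder, and Swift expiry is applied per object via the X-Delete-After header.)

    # Hedged example; container name and expiry value are placeholders.
    swift post opendev-intermediate-images
    swift upload opendev-intermediate-images debian-bullseye.qcow2 \
        --header "X-Delete-After: 172800"   # expire after ~2 days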
clarkb#topic Mirroring Rocky Linux Packages19:46
clarkbNeilHanlon: hello!19:46
NeilHanlonhi :) 19:46
NeilHanlonso.. i can't get opensearch to do what I want, but19:46
NeilHanlonhttps://drop1.neilhanlon.me/irc/uploads/44fb256b36a4f97b/image.png 19:47
clarkblooks like something keyed off of depsolving?19:47
NeilHanlongreen is "successful", red is a job which had a "Depsolve Failed" message19:47
NeilHanlonyeah19:47
NeilHanlonhttps://drop1.neilhanlon.me/irc/uploads/17b1fdc1dad12d0b/image.png 19:48
NeilHanloni can't seem to generate a short URL otherwise I'd link to the viz19:48
fungiso indicates builds which hit some sort of package access problem i guess19:48
NeilHanlonyeah these I looked into and are almost all because the host got some mirror A for Appstream and mirror B for BaseOS which were not in sync19:48
NeilHanloni'm sure there's others which aren't matching this depsolve message, but the signal was clear for these ones at least19:49
NeilHanlonhttps://paste.opendev.org/show/bHtL7sBLms4vpOIOkxBN/ here.. the opensearch url :D 19:49
clarkbcool. I think that does point to using our own mirrors would have a benefit19:49
fungiwhich raises a related question then... when we mirror, how can we be sure we keep both of those in sync with each other?19:49
clarkb(side note I wonder if the proxies for the upstream mirrors should do some ip stickiness)19:49
fungior are they mirrored as a unit?19:50
clarkbfungi: we would be rsyncing from a single source so in theory that source will be in sync with itself19:50
clarkbfungi: rather than rsyncing from multiple locations which may be out of sync19:50
NeilHanlonright, yeah. using --delay-updates or so19:50
fungiand yeah, we do delay deletions19:50
NeilHanlonalternatively, I've sometimes set it up so that everything except the metadata is synced first, then the metadata can be fetched -- but if you're using --delete that wouldn't work19:51
clarkbso ya as mentioned before the next steps would be to ensure we've got enough disk (using centos 9 stream as a stand in I think we decided we do), then write a mirroring script (it should look very similar to centos 9 stream and other rsync scripts), then an admin can create the afs volume and merge things and get stuff published19:52
NeilHanlonalright, I can work on a mirroring script and open a change for that19:52
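(For a sense of shape, a hedged sketch of what such a script might look like, modeled loosely on the existing rsync-based mirror scripts; the upstream source, paths, excludes and volume name are placeholders.)

    #!/bin/bash -xe
    # Hedged sketch only; a real script should follow the mirror-update conventions.
    BASE=/afs/.openstack.org/mirror/rocky

    rsync -rltvz --delete --delete-after --delay-updates \
        --exclude="*/debug/*" --exclude="*/images/*" --exclude="*/isos/*" \
        rsync://SOME-UPSTREAM-MIRROR/rocky/9/ "$BASE/9/"

    # Publish only after the repo metadata consistency check passes.
    vos release -v mirror.rocky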
tonybSimilar to CentOS, I'm working on a tool that will ensure that all packages in the repomd are available in a mirror, which we can run after rsync and before the vos release19:52
clarkbNeilHanlon: that would be great. Then whoever ends up reviewing that can ensure the afs side is ready for it to land too19:52
tonybI don't think that will help with issues where BaseOS and Appstream are out of sync though :(19:52
clarkbwe can also set the quota on the afs volume such that we don't accidentally sync down too much content19:53
fungiyeah, if there's a semi-quick way we can double-check consistency, afs lets us just avoid publishing that state when it's wrong19:53
clarkbbetter to hit a quota limit than completely run out of disk19:53
NeilHanlonhear hear19:53
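(Capping the volume is a one-line AFS operation; a hedged example where the path and size in kilobytes are placeholders.)

    # Hedged example: set a hard quota (in KB) on the mirror volume.
    fs setquota -path /afs/.openstack.org/mirror/rocky -max 400000000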
tonybfungi: I ran the tool on an afs node and it was < 1min per repo, which is quick enough for me19:54
clarkbtonyb: that is plenty fast compared to how long rsync takes even not syncing any real data19:54
fungiyeah, that's quick, especially where afs is concerned19:54
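(As a very rough illustration of that kind of consistency check, and explicitly not tonyb's actual tool: parse repomd.xml, find the primary metadata, and verify each referenced package exists on disk. This sketch assumes a standard yum/dnf repo layout.)

    # Hedged sketch, not the actual tool: verify packages listed in the repo
    # metadata are present in the mirrored tree before running vos release.
    import gzip
    import sys
    import xml.etree.ElementTree as ET
    from pathlib import Path

    NS = {"repo": "http://linux.duke.edu/metadata/repo",
          "common": "http://linux.duke.edu/metadata/common"}

    def missing_packages(repo_root: str) -> list[str]:
        root = Path(repo_root)
        repomd = ET.parse(root / "repodata" / "repomd.xml")
        # Locate the primary metadata file referenced by repomd.xml.
        primary_href = next(
            d.find("repo:location", NS).get("href")
            for d in repomd.findall("repo:data", NS)
            if d.get("type") == "primary"
        )
        with gzip.open(root / primary_href) as f:
            primary = ET.parse(f)
        # Collect every package location that is not present on disk.
        return [
            pkg.find("common:location", NS).get("href")
            for pkg in primary.findall("common:package", NS)
            if not (root / pkg.find("common:location", NS).get("href")).exists()
        ]

    if __name__ == "__main__":
        gaps = missing_packages(sys.argv[1])
        print(f"{len(gaps)} missing packages")
        sys.exit(1 if gaps else 0)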
clarkbNeilHanlon: and don't hesitate to ask if any questions come up in preparing that script19:54
clarkb#topic Open Discussion19:55
clarkbwe have 5 minutes for anything else before our hour is up19:55
tonybI was thinking so, also very quick if we can avoid a bunch of job failures19:55
fungijust a heads up that i won't be around much thursday/friday this week, or over the weekend19:55
* frickler will also be offline starting thursday, hopefully just a couple of days19:56
* tonyb will be more around again ... albeit in AU :/19:56
clarkbthanks for the heads up19:57
clarkbsounds like that may be just about everything. Thank you for your time today19:57
clarkb#endmeeting19:57
opendevmeetMeeting ended Tue Sep 17 19:57:46 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:57
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-17-19.00.html19:57
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-17-19.00.txt19:57
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-09-17-19.00.log.html19:57
clarkbwe can end it there and everyone can go find $meal a couple minutes early19:57
corvusthanks!19:57
fungithanks clarkb!19:58
tonybThanks all20:00
