Tuesday, 2024-01-09

19:00 <clarkb> meeting time
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Jan  9 19:00:25 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/7IXDFVY34MYBW3WO2EEU3AIGOLAL6WRB/ Our Agenda
19:00 <clarkb> It's been a little while since we had these regularly
19:01 <clarkb> #topic Announcements
19:02 <clarkb> The OpenInfra Foundation Individual Board member election is happening now. Look for your ballot via email and vote.
19:02 <clarkb> This election also includes bylaw amendments to make the bylaws less openstack-specific
19:02 <clarkb> If you expected to have a ballot and can't find it, please reach out. There may have been email delivery problems
19:03 <clarkb> Separately we're going to feature OpenDev on the OpenInfra Live stream/podcast/show (I'm not sure exactly how you'd classify it)
19:04 <clarkb> That will happen on January 18th at 1500 UTC?
19:04 <clarkb> I know the day is correct but not positive on the time. Feel free to tune in
19:04 <corvus> clarkb: i think the kids are calling it a "realplayer tv show" now ;)
19:04 <fungi> also some streaming platforms have the ability for you to heckle us and ask questions
19:06 <clarkb> #topic Topics
19:07 <clarkb> #topic Server Upgrades
19:07 <clarkb> I believe that tonyb has gotten all of the mirror nodes upgraded at this point
19:07 <clarkb> Not sure if tonyb is around for the meeting, but I think the plan was to look at meetpad servers next
19:08 <tonyb> Correct
19:08 <tonyb> I started looking at meetpad. One thing that worries me a little is I can't quite see how we add the jvb nodes to meetpad
19:09 <clarkb> tonyb: it should be automated via configuration somehow
19:09 <clarkb> tonyb: I can look into that after the meeting
19:09 <tonyb> it seems to just be "magic" and I don't want any new jvb nodes to auto-register with the existing meetpad
19:09 <tonyb> clarkb: Thanks
19:09 <clarkb> tonyb: yes it should be magic and it happens via xmpp iirc
19:09 <fungi> we've scaled up and down if you look at git history
19:10 <tonyb> Ah okay.
19:10 <clarkb> so ya one approach would be to have a new jvb join the old meetpad, then replace the old meetpad and have the new jvb join the new thing. Or update config management to allow two side by side installations then update dns
19:10 <clarkb> we'll need to sort out how the magic happens in order to make a decision on approach I think
19:10 <tonyb> That was my thinking
19:12 <corvus> (i think a rolling replacement sounds good, but i haven't thought about it deeply)
19:12 <tonyb> I also looked at mediawiki and I'm reasonably close to starting that server.  translate looks like we'll just turn it off when the i18n team is ready, but I'm trying to help them with new weblate tools
19:12 <corvus> (just mostly that since we're not changing any software versions, we'd expect it to work)
19:12 <tonyb> so that leaves cacti and storyboard to look at
19:12 <clarkb> tonyb: we've got a spec to add prometheus and some agents on servers to replace cacti, which is one option there
19:13 <clarkb> but maybe the easiest thing right now is to just uplift cacti? I don't know
19:13 <fungi> cacti was in theory going to be retired in favor of prometheus
19:13 <fungi> yeah that
19:13 <clarkb> I think the main issue with prometheus was figuring out the agent stuff. Running the service to collect the data is straightforward
19:14 <tonyb> Okay, I know ianw was thinking prometheus would be a good place for me to start so I'd be happy to look at that
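
For context on the "agent stuff": the usual pattern is that every server runs a small exporter process publishing metrics over HTTP, and one central prometheus server scrapes them all. A minimal sketch of the exporter side using the python prometheus_client library; the metric, port, and sampling interval are illustrative assumptions, not anything opendev has deployed:

    import time
    from prometheus_client import Gauge, start_http_server

    # One gauge per sampled quantity. node_exporter does this (and much
    # more) in production; this only shows the mechanics.
    load1 = Gauge('node_load1', '1-minute load average')

    start_http_server(9100)  # prometheus scrapes http://<host>:9100/metrics
    while True:
        with open('/proc/loadavg') as f:
            load1.set(float(f.read().split()[0]))
        time.sleep(15)
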
19:14 <clarkb> alright let's move on, we have a fair number of things to discuss and it sounds like we're continuing to make progress there. Thanks!
19:15 <clarkb> #topic Python container updates
19:15 <clarkb> The zuul registry service migrated to bookworm images so I've proposed a change to drop the bullseye images it was relying on
19:15 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/905018 Drop Bullseye python3.11 images
19:15 <clarkb> That leaves us with zuul-operator on the bullseye python3.10 images as our last bullseye container images
19:16 <clarkb> #topic Upgrading Zuul's DB server
19:16 <clarkb> I realized while prepping for this meeting that I had completely spaced on this.
19:16 <tonyb> It happens at this time of year ;P
19:16 <clarkb> However, coincidentally hacker news had a post about postgres options recently
19:17 <clarkb> #link https://www.crunchydata.com/blog/an-overview-of-distributed-postgresql-architectures a recent rundown of postgresql options
19:17 <clarkb> I haven't read the article yet, but figured I should as a good next step on this item
19:17 <clarkb> did anyone else have new input to add?
19:18 * tonyb shakes head
19:19 <clarkb> #topic EMS discontinuing legacy consumer hosting plans
19:19 <clarkb> fungi indicated that at the last meeting the general consensus was that we should investigate a switch to the newer plans
19:20 <clarkb> fungi: have we done any discussion about this on the foundation side yet? I'm guessing we need a general ack there then we can reach out to element about changing the deployment type?
19:20 <fungi> they indicated in the notice that they'd let folks on the old plan have a half-normal minimum user license
19:21 <fungi> i did some cursory talking to wes about it and it sounded like they'd be able to work it in for 2024
19:21 <fungi> we would have to pay for a full year up front though
19:21 <clarkb> I don't expect we'll stop using matrix anytime soon
19:21 <clarkb> so that seems fine from a usage standpoint
19:22 <fungi> right, since we're supporting multiple openinfra projects with it, the cost is fairly easy to justify
19:22 <clarkb> fungi: in that case I guess we should reach out to Element. IIRC the email gave a contact for the conversion
19:22 <clarkb> maybe double check with wes that nothing has changed in the last few weeks before sending that email
19:22 * clarkb scribbles a note to do this stuff
19:22 <fungi> will do
19:23 <tonyb> Also gives us this year to test self-hosting a homeserver
19:23 <fungi> we've still got about a month to sort it
19:23 <clarkb> right we have until February 7
19:24 <frickler> do we really want to test self-hosting? also, would we get an export from element that would allow moving and keeping rooms and history?
19:24 <corvus> no export is needed; the system is fully distributed
19:24 <clarkb> they provided a link to a migration document in the email too
19:25 <clarkb> trying to find it
19:25 <fungi> but they do have a settings export we can use too
19:25 <clarkb> https://ems-docs.element.io/books/element-cloud-documentation/page/migrate-from-ems-to-self-hosted
19:25 <fungi> basically the homeserver config
19:25 <frickler> so you start a new homeserver with the same name and the rooms just magically migrate?
19:25 <tonyb> frickler: I think it's something to investigate during the year. Gives us more information for making a long term decision
19:25 <clarkb> we "own" the room names so it would largely be history and room config to worry about aiui
19:26 <corvus> the rooms and their contents exist on all matrix servers involved in the federation (typically homeservers of users in those rooms)
19:27 <corvus> if the history is exported, cool, but in theory i think a replacement server should be able to grab the history from any other server
19:28 <clarkb> oh interesting. So if you stand up a new server and have the well known file say it is the :opendev.org homeserver then clients will talk to the new server. That new server will sync out of the federated state the history of its rooms
19:28 <corvus> that's what i'd expect.  i have not tested it.
19:29 <clarkb> ack. Also looks like we can copy databases per the ems migration doc should that be necessary
19:29 <corvus> (you'd just need to use one of the other room ids initially)
19:29 <corvus> but i'm still in no rush to self-host.
19:29 <clarkb> in any case figuring that out is a next step. First up is figuring out a year of hosting
19:29 <clarkb> and if that is reasonable. Which I can help coordinate with fungi at the foundation and talking to element
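
The "well known file" mechanism clarkb describes is part of the Matrix spec: the opendev.org domain in user and room IDs is mapped to an actual homeserver by two small JSON documents served from that domain, so a replacement server mostly needs those pointers updated. A sketch of generating them; the matrix.opendev.org hostname and webroot path are hypothetical stand-ins:

    import json
    import pathlib

    # https://opendev.org/.well-known/matrix/server tells federating
    # homeservers where to connect; .../matrix/client tells clients.
    server = {"m.server": "matrix.opendev.org:443"}
    client = {"m.homeserver": {"base_url": "https://matrix.opendev.org"}}

    well_known = pathlib.Path("webroot/.well-known/matrix")
    well_known.mkdir(parents=True, exist_ok=True)
    (well_known / "server").write_text(json.dumps(server))
    (well_known / "client").write_text(json.dumps(client))
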
19:30 <clarkb> #topic Followup on haproxy update being broken
19:30 <clarkb> There was a lot of info under this item but the two main points seem to be "should we be more explicit about the versions of docker images we consume" and "should we prune less aggressively"
19:30 <corvus> (like, i'm not looking at ems as an interim step based on our conversations so far -- but i agree that keeping aware of future options is good)
19:31 <clarkb> I think for haproxy in particular we can and should probably stick with their lts tag
19:31 <fungi> i think we mostly covered the haproxy topic at the last meeting, but happy to revisit since not everyone was present
19:31 <corvus> ++lts tag
19:31 <clarkb> fungi: ack. I wanted to bring up one thing primarily on pruning
19:31 <clarkb> One gotcha with pruning is that it seems to be based on the image build/creation time, not when you started using the newer image(s)
19:32 <fungi> right, note that we hadn't actually pruned the old haproxy image we downgraded to; when i did the manual config change and pulled, it didn't need to retrieve the image
19:32 <clarkb> and so it is a bit of a clunky tool, but better than nothing for images like haproxy for example where we could easily revert
19:32 <clarkb> I'm happy for us to extend the time we keep images, but also be aware of this limitation with the pruning command
19:33 <corvus> i'm ambivalent about pruning because i'm not worried about not being able to pull an old version from a registry on demand
19:33 <fungi> the main thing it might offer is insurance against upstreams deleting their images
19:33 <fungi> but i don't think that's actually been an issue we've encountered yet?
19:33 <frickler> one concern of mine was being able to find out which version we were actually running last
19:33 <corvus> i'm not eager to run an image that upstream has deleted either
19:34 <fungi> frickler: yes, if we could add some more verbosity around our image management, that could help
19:34 <clarkb> frickler: we could update our ansible runs to do something like a docker ps -a and docker image list
19:34 <clarkb> and record that in our deployment logs
19:34 <fungi> even if it's just something that periodically interrogates docker for image ids and logs them to a file
19:34 <fungi> or yeah that
19:35 <frickler> maybe even somewhere more persistent than zuul build logs would be good
19:35 <corvus> i agree with frickler that leaving an image sitting around for some number of days provides a good indication of what we were probably running before
19:36 <clarkb> ok so the outstanding need is better records of what docker images we ran during which timeframes
19:36 <corvus> (we could stick version numbers in prometheus; it's not great for that though, but it's okay as long as they don't change too often)
19:36 <clarkb> ya this will probably require a bit more brainstorming
19:36 <corvus> (the only way to do that with prometheus increases the cardinality of metrics with each new version number)
19:37 <clarkb> maybe start with the simple thing of having ansible record a bit more info then try and improve on that for longer term retention
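
As a straw man for that record keeping, something like the following could run from ansible (or cron) on each host, appending a timestamped inventory of images and containers to a local log. The log path and format are invented for illustration:

    import datetime
    import subprocess

    LOG = '/var/log/docker-image-history.log'  # hypothetical location

    def capture(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    images = capture(['docker', 'image', 'ls', '--format',
                      '{{.Repository}}:{{.Tag}} {{.ID}} {{.CreatedAt}}'])
    containers = capture(['docker', 'ps', '-a', '--format',
                          '{{.Names}} {{.Image}} {{.Status}}'])

    with open(LOG, 'a') as f:
        f.write(f'=== {now} ===\nimages:\n{images}containers:\n{containers}')
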
19:38 <clarkb> I'll continue on as we have a few more items to discuss
19:38 <clarkb> #topic Followup on haproxy update being broken
19:38 <clarkb> Similar to the last one I'm not sure if this reached a conclusion but two things worth mentioning have happened recently. First zuul's doc quota was increased
19:38 <frickler> that's the topic we just had?
19:39 <clarkb> bah yes
19:39 <clarkb> #undo
19:39 <opendevmeet> Removing item from minutes: #topic Followup on haproxy update being broken
19:39 <clarkb> #topic AFS Quota issues
19:39 <clarkb> copy and paste failure
19:39 * fungi is now much less confused
19:39 <clarkb> Second is that there are some early discussions around having openeuler be more involved with opendev and possibly contributing some CI resources
19:39 <frickler> the zuul project quota was increased (not doc I think)
19:40 <clarkb> frickler: ya it hosts the zuul docs iirc
19:40 <clarkb> and website?
19:40 <frickler> IIUC the release artefacts
19:40 <clarkb> There may be an opportunity to leverage this interest in collaboration to clean up the openeuler mirrors and give them feedback on the growth problems
19:40 <corvus> everything under zuul-ci.org is on one volume
19:40 <fungi> zuul's docs are part of its project website
19:40 <fungi> yeah that
19:40 <corvus> and i increased it to 5gb
19:40 <clarkb> ahah
19:41 <clarkb> essentially work with the interested parties to improve the situation around mirrors for openeuler and maybe our CI quotas
19:41 <clarkb> responding to their latest queries about the sizes of VMs and how many is on my todo list after meetings and lunch
19:42 <clarkb> (you know we write that stuff down in a document but 100% of the time the questions get asked anyway)
19:42 <frickler> do you have a reference to those openeuler discussions or are they private for now?
19:42 <corvus> they have an openstack cloud?
19:43 <clarkb> frickler: I think keeping the email discussion small while we sort out if it is even possible is good, but once we know if it will go somewhere we can do that more publicly
19:43 <clarkb> corvus: yes sounds like it? We tried to be explicit that what we need is an openstack api endpoint and accounts that can provision VMs
19:43 <frickler> yeah, I just wanted to know whether I missed something somewhere
19:43 <fungi> for transparency: openeuler representatives were in discussion with openinfra foundation staff members and offered to supply system resources, so the foundation staff are trying to put them in touch with us to determine more scope around it
19:43 <fungi> it's all been private discussions so far
19:44 <corvus> neat
19:44 <clarkb> were there other outstanding afs quota concerns to discuss?
19:44 <fungi> since openstack is a primary use case for their distro, they have a vested interest in helping test openstack upstream on it
19:45 <frickler> some other mirror volumes need watching
19:45 <clarkb> for centos stream I seem to recall digging around in those mirrors and we end up with lots of packages with many versions
19:46 <frickler> centos-stream and ubuntu-ports look very close to their limit
19:46 <clarkb> in theory we only need the newest 2 to avoid installation failures
19:46 <clarkb> we could potentially write a smarter syncing script that scanned through and deleted older versions
19:46 <clarkb> for ubuntu ports I had thought we were still syncing old versions of the distro that we could delete but we aren't, so I'm not sure what we can do there
19:47 <clarkb> are we syncing more than arm64 packages maybe? like 32bit arm and or ppc? I think not
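
A rough sketch of the "smarter syncing script" idea from above: group the mirrored rpms by package name and keep only the newest couple of versions of each. The mirror path is hypothetical, and a real version would need rpm's actual version comparison rules (e.g. labelCompare) rather than the mtime ordering used here:

    import collections
    import pathlib
    import re

    MIRROR = pathlib.Path('/afs/.openstack.org/mirror/centos-stream')  # assumed
    KEEP = 2  # newest versions to retain so in-flight installs don't break

    # name-version-release.arch.rpm; naive split on the last two dashes
    rpm_re = re.compile(r'^(?P<name>.+)-[^-]+-[^-]+\.[^.]+\.rpm$')

    by_name = collections.defaultdict(list)
    for path in MIRROR.rglob('*.rpm'):
        match = rpm_re.match(path.name)
        if match:
            by_name[match['name']].append(path)

    for name, paths in by_name.items():
        # mtime as a crude stand-in for proper rpm version ordering
        paths.sort(key=lambda p: p.stat().st_mtime, reverse=True)
        for old in paths[KEEP:]:
            print('would delete', old)  # swap for old.unlink() once trusted
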
19:48 <clarkb> I don't think we have time to solve that in this meeting. Let's continue on as we have ~3 more topics to cover
19:48 <clarkb> #topic Broken wheel build issues
19:48 <frickler> I don't know, I just noticed these issues when checking whether we have room to mirror rocky
19:48 <clarkb> frickler: ack
19:49 <fungi> it's also possible that dropping old releases from our config isn't cleaning up the old packages associated with them
19:49 <clarkb> fungi: oh interesting. Worth double checking
19:49 <clarkb> for wheels I think we can stop building and mirroring them at any time because pip will prefer new sdists over old wheels right? so we don't even need to update the pip.conf in our test nodes
19:49 <fungi> correct
19:50 <clarkb> fungi: ^ you probably know off the top of your head if that is the case. But my main concern would be that we start testing older stuff accidentally if we stop building wheels
19:50 <fungi> unless you pass the pip option to prefer "binary" packages (wheels)
19:50 <clarkb> right
19:50 <fungi> but it's not on by default
19:50 <fungi> i'd treat that as a case of caveat emptor
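
A toy model of the pip behavior being relied on here (an illustration of the ordering, not pip's actual code): by default pip takes the highest version and only prefers a wheel when versions tie, while --prefer-binary ranks any wheel above any sdist:

    from packaging.version import Version

    candidates = [('1.9.0', 'wheel'), ('2.0.0', 'sdist')]

    def rank(candidate, prefer_binary=False):
        version, kind = candidate
        is_wheel = kind == 'wheel'
        if prefer_binary:
            return (is_wheel, Version(version))  # any wheel beats any sdist
        return (Version(version), is_wheel)      # newest version wins

    print(max(candidates, key=rank))                     # ('2.0.0', 'sdist')
    print(max(candidates, key=lambda c: rank(c, True)))  # ('1.9.0', 'wheel')
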
19:50 <clarkb> in that case I think it is reasonable to send email to the service announce list indicating we plan to stop running those jobs in the future (say beginning of february), ask if anyone is interested in keeping them alive, and if not the jobs will fall back to building from source
19:51 <clarkb> the fallback is slower and may require some bindep file updates but it isn't going to hard stop anyone from getting work done on centos distros
19:51 <fungi> wfm
19:51 <frickler> will we also clean out existing wheels at the same time? maybe keep the afs volume but not publish anymore?
19:52 <clarkb> frickler: I think we should keep the content for a bit as some of the existing wheels may be up to date for a while
19:52 <fungi> we could probably do it in phases
19:52 <frickler> ok
19:52 <clarkb> since pip's behavior is acceptable by default here we can still take advantage of the remaining benefit from the mirror for a bit
19:52 <clarkb> then maybe after 6-12 months clean it up
19:53 <clarkb> alright next topic
19:53 <clarkb> #topic Gitea repo-archives filling server disk
19:53 <fungi> fwiw, the python ecosystem has gotten a lot better about making cross-platform wheels for releases of things now, and in a more timely fashion
19:53 <fungi> so our pre-built wheels are far less necessary
19:53 <clarkb> when you ask gitea for a repo archive (tarball/zip/.bundle) it caches that on disk
19:54 <clarkb> then once a day it runs an internal cron task (using a go library implementation of cron, not system cron) to clean up any repo archives that are more than a day old
19:54 <fungi> oh, yeah this is a fun one. i'd somehow already pushed it to the back of my mind
19:54 <frickler> can we disable that functionality? we do have our own tarballs instead (at least for openstack)?
19:54 <corvus> i'm guessing people do that a lot to get releases even though like zero opendev projects make releases that way?
19:54 <corvus> what frickler said :)
19:55 <fungi> s/people/web crawlers/ i think
19:55 <clarkb> upstream indicated it could be web crawlers
19:55 <clarkb> so their suggestion was to update our robots.txt
19:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/904868 update robots.txt on upstream's suggestion
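
For reference, gitea serves archives under per-repository /archive/ paths, so the robots.txt approach amounts to a rule along these lines (a sketch; wildcard Disallow patterns are a de facto extension honored by the major crawlers rather than part of the original robots.txt standard, and the exact rules in the linked change may differ):

    User-agent: *
    Disallow: /*/*/archive/
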
19:55 <clarkb> and no we can't disable the feature
19:55 <clarkb> at least I haven't found a way to do that
19:55 <clarkb> the problem is the daily cleanup isn't actually cleaning up everything more than a day old
19:56 <clarkb> I've spent a bit of time rtfs'ing and looking at the database and I can't figure out why it is broken, but you can see on gitea12 that it falls about 4 hours behind each time it runs so we end up leaking and filling the disk
19:56 <clarkb> In addition to reducing the number of archives generated by asking bots to leave them alone we can also run a cron job that simply deletes all archives
19:56 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/904874 Run weekly removal of all cached repo archives
19:56 <frickler> does gitea break if we make the cache non-writeable?
19:57 <clarkb> frickler: I haven't tested that but I would assume so. I would expect a 500 error when you request the archive
19:57 <frickler> which would also be like disabling it, kind of
19:57 <fungi> i suppose it depends on your definition of "break" ;)
19:57 <clarkb> since we are already trying to delete archives more than a day old, deleting all archives once a week on the weekend seems safe
19:57 <clarkb> and when you ask it to delete all archives it does successfully delete all archives
19:58 <clarkb> I would prefer we not intentionally create 500 errors
19:58 <clarkb> there are valid reasons to get repo archives
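
A sketch of what the weekly fallback cleanup could look like. It assumes gitea keeps its archive cache in a repo-archive directory under its app data path; that path is an assumption, and the linked change (904874) may implement this differently:

    #!/usr/bin/env python3
    # Weekly cron: drop every cached repo archive. gitea regenerates an
    # archive on demand, so the worst case is a slower first download.
    import pathlib

    CACHE = pathlib.Path('/var/gitea/data/repo-archive')  # assumed location

    # reverse-sorted paths put a directory's contents before the directory
    for path in sorted(CACHE.rglob('*'), reverse=True):
        if path.is_file() or path.is_symlink():
            path.unlink()
        elif path.is_dir():
            path.rmdir()
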
19:58 <clarkb> I also noticed when looking at the cron jobs that gitea has a cron job that phones home to check whether it is running the latest release
19:59 <corvus> the cron might have a small window of breakage, but should immediately work on a retry so lgtm
19:59 <clarkb> I pushed https://review.opendev.org/c/opendev/system-config/+/905020 to disable that cron job because I hate the idea of a phone home for that
20:00 <clarkb> our hour is up and I have to context switch to another meeting
20:00 <clarkb> #topic Service Coordinator Election
20:01 <clarkb> really quickly before I end the meeting I wanted to call out that we're approaching the service coordinator election timeframe. I need to dig up emails to determine when I said that would happen (I believe it is end of january early february)
20:01 <clarkb> nothing for anyone to do at this point other than consider if they wish to assume the role and nominate themselves. And I'll work to get things official via email
20:01 <tonyb> If it matches openstack PTL/TC elections then they'll start in Feb
20:01 <clarkb> tonyb: it's slightly offset
20:01 <tonyb> okay
20:01 <clarkb> #topic Open Discussion
20:02 <clarkb> Anything else important before we call the meeting?
20:03 <tonyb> nope
20:03 <clarkb> sounds like no. Thank you everyone for your time and help running the opendev services!
20:03 <clarkb> we'll be back next week same time and location
20:03 <clarkb> #endmeeting
20:03 <opendevmeet> Meeting ended Tue Jan  9 20:03:27 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:03 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-01-09-19.00.html
20:03 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-01-09-19.00.txt
20:03 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-01-09-19.00.log.html
20:03 <corvus> thanks clarkb !
