Tuesday, 2025-02-25

clarkbJust about meeting time18:59
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Feb 25 19:00:22 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ZEEHBRE5DEZGXFXPGE4MYFH4NYGRUOIP/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI didn't have anything to announce (there is service coordinator election stuff but I've given that a full agenda topic for later)19:01
fungi#link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/message/I2XP4T2C47TEODOH4JYVUZNEWK33R3PN/ draft governance documents and feedback calls for proposed OpenInfra/LF merge19:02
clarkbthanks seems like that may be it19:03
clarkba good reminder to keep an eye on that whole thread as well19:03
clarkb#topic Zuul-launcher image builds19:04
clarkbas mentioned previously corvus has been attempting to dogfood the zuul-launcher system with a test change in zuul itself19:04
fungiyeah, one thing i wish hyperkitty had was a good way to deep-link to a message within a thread view19:04
clarkb#link https://review.opendev.org/c/zuul/zuul/+/940824 Somewhat successful dogfooding in this zuul change19:04
clarkbpreviously there were issues with quota limits and other bugs, but now we've got actual job runs that succeed on the images19:04
clarkbthats pretty cool and great progress on the underlying migration of nodepool into zuul19:05
clarkbI think there are still some quota problems with the latest buildset but some jobs ran19:05
clarkband the work to understand quotas has started in zuul-launcher so this should only get better19:05
clarkbnot sure if corvus has anything else to add, but ya good progress19:06
clarkb#topic Fixing known_hosts generation on bridge19:08
clarkbwhen deploying tracing02 last week I discovered that sometimes ssh known_hosts isn't updated when ansible runs in the infra-prod-base job19:09
clarkbEventually I was able to track that down to the code updating known_hosts running against system-config content that was already on bridge without updating it first19:09
clarkbsometimes ansible and ssh would work (known_hosts would update) because the load balancer jobs for zuul and gitea would sometimes run before the infra-prod-run job19:10
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94230719:10
clarkbThis change aims to fix that by updating system-config as part of bridge bootstrapping and then we have to ensure the other jobs for load balancers (and the zuul db) don't run concurrently and contend for the system-config on disk19:11
clarkbtesting this is a bit difficult without just doing it. fungi and corvus have reviewed and +2'd the change. Do we want to proceed with the update nowish or wait for some better time? The hourly jobs do run the bootstrap bridge job so we should start to get feedback pretty quickly19:12
fungii'd be fine going ahead with it19:12
clarkbok after the meeting I've got lunch but then maybe we go for it afterwards if there are no objections or suggestions for a different approach? That would be at about 2100 UTC19:12
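[Editor's note: the fix discussed above boils down to refreshing system-config on bridge before anything reads it, then recording host keys from that updated checkout. A minimal sketch of the known_hosts half using Ansible's known_hosts module might look like the following; the hostname, key type, and path are illustrative assumptions, not the actual system-config playbook.]

```yaml
# Hypothetical sketch only; the real logic lives in system-config.
# Hostname, key type, and known_hosts path are assumptions.
- name: Ensure the new server's host key is present on bridge
  ansible.builtin.known_hosts:
    path: /root/.ssh/known_hosts
    name: tracing02.opendev.org
    key: "{{ lookup('pipe', 'ssh-keyscan -t ed25519 tracing02.opendev.org') }}"
```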
clarkb#topic Upgrading old servers19:14
clarkbNow that I'm running the meeting I realize this topic and the next one can be folded together so why don't I just do that19:14
clarkbI've continued to try and upgrade servers from focal to noble with the most recent one being tracing0219:14
clarkbeverything is switched over with zuul talking to the new server, dns is cleaned up/updated, the last step is to delete the old server which I'll try to do today19:15
clarkb#link https://etherpad.opendev.org/p/opendev-server-replacement-sprint has some high level details on the backlog todo list. Help is super appreciated19:15
clarkbtonyb not sure if you are awake yet, but anything to update from your side of things?19:15
clarkbanything we can do to be useful etc?19:15
tonybNope nothing from me19:16
clarkb#topic Redeploying raxflex resources19:16
clarkbsort of related but with different motivation is that we should redeploy our raxflex ci resources19:16
clarkbthere are two drivers for this. The first is that doing so in sjc3 will get us updated networking with 1500 mtus. The other is there is a new dfw3 region we can use but its tenants are different from the ones we are currently using in sjc3. The same tenants in dfw3 are available in sjc3 so if we redeploy we align with dfw3 and get updated networking19:17
fungiyeah, i was hoping to hear back from folks who work on it as to whether we're okay to direct attach server instances to the PUBLICNET network since that was working in sjc3 at one point, though now it errors in both regions19:18
fungiif i don't need to create all the additional network/router/et cetera boilerplate to handle floating-ip in the new projects, i'd rather not19:18
clarkbonce we sort out ^ we can deploy new mirrors then we can rollover the nodepool configs19:19
fungiyes19:19
clarkbhave we asked cardoe yet? cardoe seems good at running down those questions19:19
clarkbin any case I suspect ^ may be the next step if we haven't yet19:21
fungino, i was trying to get the attention of cloudnull or dan_with19:21
fungibut can try to see if cardoe is able to at least find out, even though he doesn't work on that environment19:22
clarkbya cardoe seems to know who to talk to and is often available on irc19:22
clarkba good combo for us if not the most efficient19:22
clarkbanything else on this topic?19:22
funginot from me19:22
clarkb#topic Running certcheck on bridge19:23
clarkbI think I just saw changes for this today19:23
fungiyeah, i split up my earlier change to add the git version to bridge first19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/939187 and its parent19:23
fungibut up for discussion if we want a toggle based on platform version or something19:24
clarkbprobably just needs reviews at this point? Any other considerations to call out?19:24
fungialso not sure how (or if) we'd want to go about stopping certcheck on cacti once we're good with how it's working on bridge, ansible to remove the cronjob only? clean up the git checkout too?19:25
fungior just manually delete stuff?19:25
fungisimilar for cleaning up the git deployment on bridge if we go down this path to switch it to the distro package later19:25
clarkbI think we need to stop applying updates to cacti otherwise we don't get the benefit of switching to running this on bridge. And if we do that the lists will become stale over time so we should disable it19:26
clarkbI don't know that we need to do any cleanup beyond disabling it19:26
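[Editor's note: if the cleanup ends up being automated rather than manual, disabling the old cronjob is a one-task change with Ansible's cron module; the cron entry name and user in this sketch are assumptions.]

```yaml
# Hypothetical sketch; cron entry name and user are assumptions
- name: Remove the certcheck cronjob from cacti
  ansible.builtin.cron:
    name: certcheck
    user: root
    state: absent
```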
clarkbfungi: the other thing I notice is the change doesn't update our testing which is already doing certcheck stuff on bridge I think19:26
fungii realized that my earlier version of 939187 just did it all at once (moved off cacti onto bridge and switched from git to distro package), but it hadn't been getting reviews19:26
clarkbwe probably want to cleanup whatever special case code does ^ and rely on the production path in testing19:26
clarkbplaybooks/zuul/templates/gate-groups.yaml.j2 is the file I can leave a review after the meeting19:27
fungithanks19:28
clarkbanything else?19:29
fungianyway, if people hadn't reviewed the earlier version of 939187 and are actually in favor of going back to a big-bang switch for install method and server at the same time, i'm happy to revert today's split19:29
clarkback19:30
fungibut yes, that's all from me on this19:31
clarkb#topic Service Coordinator Election19:31
clarkbThe nomination period ended and as promised I nominated myself just before the time ran out as no one else had19:31
fungithanks!19:32
clarkbI don't see any other nominations here https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/19:32
clarkbso that means I'm it by default. If I've missed a nominee somehow please bring that up asap but I don't think I have19:32
fungiwelcome back!19:33
clarkbin six months we'll have another election and I'd be thrilled if someone else took on the role :)19:33
clarkbif you're interested feel free to reach out and we can talk. We can probably even do a more gradual transition if that works19:33
clarkb#topic Using more robust Gitea caches19:34
corvusclarkb: congratulations!19:34
clarkbAbout a week and a half ago or so mnaser pointed out that some requests to gitea13 were really slow. Then a refresh would be fast. This had us suspecting the gitea caches as that is pretty normal for uncached things. However, this was different because the time delta was huge like 20 seconds instead of half a second19:35
clarkband the same page would do it occasionally within a relatively short period of time which seemed at odds with the caching behavior (it evicts after 16 hours not half an hour)19:35
clarkbso I dug around in the gitea source code and found the memory cache is implemented as a single Go hashmap. It sounds like massive Go hashmaps can create problems for the Go GC system. My suspicion now is that AI crawlers (and other usage of gitea) are slowly causing that hashmap to grow to some crazy size and eventually impacting GC with pause-the-world behavior leading to19:36
clarkbthese long requests19:36
clarkbone thing in support of this is restarting the gitea service causes it to be happy again until it stops being happy some time later19:37
clarkbI've since dug into alternative options to the 'memory' cache adapter and there are three: redis, memcached, and twoqueue. Redis is no longer open source and twoqueue is implemented within gitea as another memory cache but using a more sophisticated implementation that allows you to configure a maximum entry count. I initially rejected these because redis isn't open source19:38
clarkb(but apparently valkey is an option) and twoqueue would still be running with Go GC. This led to a change implementing memcached as the cache system19:38
clarkb#link https://review.opendev.org/c/opendev/system-config/+/942650 Set up gitea with memcached for caching19:38
clarkbthe upside to memcached is it is a fairly simple service, you can limit the total memory consumption, and it is open source. The downsides are that it is another container image to pull from docker hub (I would mirror it before we go to production if we decide to go to production with it)19:39
clarkbso I guess I'm asking for reviews and feedback on this. I think we could also use twoqueue instead and see if we still have problems after setting a maximum cache entry limit19:39
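[Editor's note: for reference, switching gitea's cache adapter is a small app.ini change; this sketch assumes a memcached sidecar listening on localhost, which may not match the proposed change exactly.]

```ini
; Sketch of the proposed cache configuration; the memcached
; host/port are assumptions for a local sidecar container
[cache]
ADAPTER = memcache
HOST = 127.0.0.1:11211
```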
corvusi ❤️ memcached19:40
clarkbthat is probably simplest from an implementation perspective, and if it doesn't work we're in no worse position than today and could switch to memcached at that point19:40
fungiyeah, on its face this seems like a fine approach19:40
corvusthis is still distributed right, we'd put one memcached on each gitea server?19:41
clarkbcool I think that change is ready for review at this point as long as the new test case I added is functional19:41
clarkbcorvus: correct19:41
corvuswould centralized help?  let multiple gitea servers share cache and reduce io/cpu?19:41
corvus(at the cost of bandwidth and us needing to make that HA)19:41
clarkbcorvus: I'm not sure it would because we use hashed IPs for load balancing so we could have very different cache needs on one gitea vs another19:42
fungii guess that change is still meant to be wip since it seems to have some debugging enabled that may not make sense in production and also at least one todo comment19:42
clarkbso having separate caches makes sense to me19:42
corvusclarkb: yeah, i guess i was imagining that, like, lots of gitea servers might get a request for the nova README19:43
clarkbfungi: the debugging should be disabled in the latest patchset (-v really isn't debugging, it's not even that verbose) and the TODO is something I noticed adjacent to the change but not directly related to it19:43
fungii think lately it's that all the gitea servers get crawled constantly for every possible url they can serve, so caches on the whole are of questionable value19:43
clarkbcorvus: I'm also not sure if the keys they use are stable across instances19:43
clarkbcorvus: I think they should be because we use a consistent database (so project ids should be the same?) but I'm not positive of that19:44
corvusi definitely think that starting with unshared/distributed caching is best, since it's the smallest delta from our previously working system and is the smallest change to fix the current problem.  was mostly wondering if we should explore shared caching since that was the actual design goal of memcached and so it's natural to wonder if it's a good fit here.19:44
clarkbI think the safest thing for now is distributed caches19:44
clarkbah19:44
corvusyes agree, just can't help thinking another step ahead :)19:44
clarkbI think it would definitely be possible if/when we use a shared db backend19:44
clarkbit's not completely clear to me if we can until then19:45
corvusack19:45
clarkbfungi: we have to have a cache (there is no disabled caching option) and if my hunches are correct the default is actively making things worse so unfortunately we need to go with something that is less bad at least19:45
clarkbit does sound like we're happy to use memcached so maybe I should go ahead and propose a change to mirror that container image today and then update the change tomorrow to pull from the mirror19:46
fungiyeah, i just meant tuning the cache for performance may not yield much difference19:46
clarkbah19:47
corvus(even a basic LRU cache wouldn't be terrible for "crawl everything"; anything non-ai-crawler would still heat up part of the cache and keep it in memory)19:47
corvus(but the cs in me says there's almost certainly a better algorithm for that)19:47
clarkbevicting based on total requests since entry would probably work well here19:48
clarkbsince the AI things grab a page once before scanning again next week19:48
corvus++19:48
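[Editor's note: the eviction idea sketched above — drop whatever has gone the longest, measured in requests rather than wall time, without a hit, so that a crawler's one-shot fetches age out quickly — could be prototyped as something like the following toy cache. This is an illustration of the idea only, not gitea's actual implementation.]

```python
# Toy sketch of "evict based on requests since last hit": each
# request bumps a global counter; on overflow we drop the entry
# whose last hit is furthest back in request count. Pages a
# crawler fetched once are the first to go. Illustration only,
# not gitea's actual cache.
class RequestAgedCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.tick = 0            # global request counter
        self.data = {}           # key -> (value, last_hit_tick)

    def get(self, key):
        self.tick += 1
        if key in self.data:
            value, _ = self.data[key]
            self.data[key] = (value, self.tick)  # refresh on hit
            return value
        return None

    def put(self, key, value):
        self.tick += 1
        if key not in self.data and len(self.data) >= self.max_entries:
            # evict the entry with the most requests since its last hit
            stale = min(self.data, key=lambda k: self.data[k][1])
            del self.data[stale]
        self.data[key] = (value, self.tick)
```

A real implementation would also need TTLs and size accounting, but the request-count clock captures the crawler-resistance property being discussed.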
clarkbthanks for the feedback I think I see the path forward here19:49
clarkb#topic Working through our TODO list19:49
clarkbA reminder that I'm trying to keep a rough high level todo list on this etherpad19:49
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:49
clarkba good place for anyone to check if they'd like to get involved or if any of us get bored19:49
clarkbI intend to edit that list to include parallelizing infra-prod-* jobs19:49
clarkb#topic Open Discussion19:49
fungii've started looking into mailman log rotation19:50
clarkbThe Matrix Foundation may be shutting down the OFTC irc bridge at the end of march if they don't get additional funding. Something for people who rely on that bridge for irc access to be aware of19:50
fungimailman-core specifically, since mailman-web is just django apps and django already handles rotation for those19:50
fungii ran across this bug, which worries me: https://gitlab.com/mailman/mailman/-/issues/93119:50
clarkbfungi: I think the main questions I had about log rotation was ensuring that mailman doesn't have some mechanism it wants to use for that and figuring out what retention should be. I feel like mail things tend to move more slowly than web things so longer retention might be appropriate?19:51
fungiseems like the lmtp handler logs to smtp.log and the command to tell mailman to reopen log files doesn't fully work19:51
clarkbfungi: copytruncate should address that right?19:52
clarkbiirc it exists because some services don't handle rotation gracefully?19:52
fungiyeah, looking in separate bug report https://gitlab.com/mailman/mailman/-/issues/1078 copytruncate is suggested19:52
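[Editor's note: a copytruncate-based rotation stanza like the one suggested in that issue might look like this; the log path and retention are assumptions, not the containers' actual layout.]

```
# Hypothetical logrotate stanza; path and retention are assumptions
/var/lib/mailman/var/logs/smtp.log {
    weekly
    rotate 12
    compress
    missingok
    copytruncate
}
```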
clarkbsounds like that may be everything?19:55
fungiit's not clear to me that it fully works even then, whether we also need to do a `mailman reopen` or just sighup, et cetera19:55
clarkbya there is a comment there indicating copytruncate may still be problematic19:55
clarkbthough that is surprising to me since that should keep the existing fd and path. It's the rotated logs that are new and detached from the process19:55
fungiwell, you still need the logging process to gracefully reorient to the start of the file after truncation and not, e.g., seek to the old end address19:57
fungibut following those discussions it seems like it's partly about python logging base behaviors and handling multiple streams19:57
fungiwhich probably handles that okay19:57
fungihandles the truncation okay i mean19:58
clarkbhrm if it's python logging in play then I wonder if we need a python logging config instead19:58
clarkbthat is what we typically do for services like zuul, nodepool, etc and set the rotation schedule in that config19:58
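[Editor's note: the zuul/nodepool pattern referenced here is a standalone logging config with the rotation schedule baked into the handler rather than into logrotate. A sketch of what that could look like if mailman's smtp logger were configured the same way — the logger name, file path, and schedule are all assumptions, not mailman's actual configuration.]

```python
# Hypothetical sketch: rotation handled by python logging itself,
# in the style used for zuul/nodepool. Logger name, file path, and
# rotation schedule are assumptions, not mailman's actual config.
import logging
import logging.config

LOG_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'plain': {'format': '%(asctime)s %(levelname)s %(name)s: %(message)s'},
    },
    'handlers': {
        'smtp_file': {
            'class': 'logging.handlers.TimedRotatingFileHandler',
            'filename': '/tmp/smtp.log',   # assumed path
            'when': 'midnight',
            'backupCount': 30,             # assumed retention
            'formatter': 'plain',
        },
    },
    'loggers': {
        'mail.smtp': {'handlers': ['smtp_file'], 'level': 'INFO'},
    },
}

logging.config.dictConfig(LOG_CONFIG)
logging.getLogger('mail.smtp').info('lmtp connection accepted')
```

This sidesteps the reopen-after-rotation problem entirely, since the handler rotates its own file, but it does depend on mailman loading such a config, which is the open question raised below.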
fungii want to say we've seen something like this before, where sighup to the process isn't propagating to logging from different libraries that get imported19:59
fungiat least the aiosmtpd logging that ends up in smtp.log seems to be via python logging19:59
clarkboh but mailman would have to support loading a logging config from disk right?20:00
clarkbpython logging doesn't have a way to do that straight out of the library iirc20:00
clarkbwe are at time. The logging thing probably deserves some testing which we can do via a held node living long enough to do rotations20:00
clarkbthank you everyone for your time!20:00
fungihttps://github.com/aio-libs/aiosmtpd/issues/278 has some further details20:01
fungiwill follow up in #opendev20:01
clarkbfeel free to continue discussion in #opendev and/or on the mailing list20:01
clarkb#endmeeting20:01
opendevmeetMeeting ended Tue Feb 25 20:01:11 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:01
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.html20:01
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.txt20:01
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.log.html20:01
