Tuesday, 2025-02-25

clarkbJust about meeting time18:59
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Feb 25 19:00:22 2025 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ZEEHBRE5DEZGXFXPGE4MYFH4NYGRUOIP/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbI didn't have anything to announce (there is service coordinator election stuff but I've given that a full agenda topic for later)19:01
fungi#link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/message/I2XP4T2C47TEODOH4JYVUZNEWK33R3PN/ draft governance documents and feedback calls for proposed OpenInfra/LF merge19:02
clarkbthanks seems like that may be it19:03
clarkba good reminder to keep an eye on that whole thread as well19:03
clarkb#topic Zuul-launcher image builds19:04
clarkbas mentioned previously corvus has been attempting to dogfood the zuul-launcher system with a test change in zuul itself19:04
fungiyeah, one thing i wish hyperkitty had was a good way to deep-link to a message within a thread view19:04
clarkb#link https://review.opendev.org/c/zuul/zuul/+/940824 Somewhat successful dogfooding in this zuul change19:04
clarkbpreviously there were issues with quota limits and other bugs, but now we've got actual job runs that succeed on the images19:04
clarkbthats pretty cool and great progress on the underlying migration of nodepool into zuul19:05
clarkbI think there are still some quota problems with the latest buildset but some jobs ran19:05
clarkband the work to understand quotas has started in zuul-launcher so this should only get better19:05
clarkbnot sure if corvus has anything else to add, but ya good progress19:06
clarkb#topic Fixing known_hosts generation on bridge19:08
clarkbwhen deploying tracing02 last week I discovered that sometimes ssh known_hosts isn't updated when ansible runs in the infra-prod-base job19:09
clarkbEventually I was able to track that down to the code updating known_hosts running against system-config content that was already on bridge without updating it first19:09
clarkbsometimes ansible and ssh would work (known_hosts would update) because the load balancer jobs for zuul and gitea would sometimes run before the infra-prod-run job19:10
clarkb#link https://review.opendev.org/c/opendev/system-config/+/94230719:10
clarkbThis change aims to fix that by updating system-config as part of bridge bootstrapping and then we have to ensure the other jobs for load balancers (and the zuul db) don't run concurrently and contend for the system-config on disk19:11
clarkbtesting this is a bit difficult without just doing it. fungi and corvus have reviewed and +2'd the change. Do we want to proceed with the update nowish or wait for some better time? The hourly jobs do run the bootstrap bridge job so we should start to get feedback pretty quickly19:12
fungii'd be fine going ahead with it19:12
clarkbok after the meeting I've got lunch but then maybe we go for it afterwards if there are no objections or suggestions for a different approach? That would be at about 2100 UTC19:12
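[Editor's note: the fix discussed above boils down to refreshing system-config on bridge before anything reads it, then recording host keys from that updated checkout. A minimal sketch of the known_hosts half using Ansible's known_hosts module might look like the following; the hostname, key type, and path are illustrative assumptions, not the actual system-config playbook.]

```yaml
# Hypothetical sketch only; the real logic lives in system-config.
# Hostname, key type, and known_hosts path are assumptions.
- name: Ensure the new server's host key is present on bridge
  ansible.builtin.known_hosts:
    path: /root/.ssh/known_hosts
    name: tracing02.opendev.org
    key: "{{ lookup('pipe', 'ssh-keyscan -t ed25519 tracing02.opendev.org') }}"
```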
clarkb#topic Upgrading old servers19:14
clarkbNow that I'm running the meeting I realize this topic and the next one can be folded together so why don't I just do that19:14
clarkbI've continued to try and upgrade servers from focal to noble with the most recent one being tracing0219:14
clarkbeverything is switched over with zuul talking to the new server, dns is cleaned up/updated, the last step is to delete the old server which I'll try to do today19:15
clarkb#link https://etherpad.opendev.org/p/opendev-server-replacement-sprint has some high level details on the backlog todo list. Help is super appreciated19:15
clarkbtonyb not sure if you are awake yet, but anything to update from your side of things?19:15
clarkbanything we can do to be useful etc?19:15
tonybNope nothing from me19:16
clarkb#topic Redeploying raxflex resources19:16
clarkbsort of related but with different motivation is that we should redeploy our raxflex ci resources19:16
clarkbthere are two drivers for this. The first is that doing so in sjc3 will get us updated networking with 1500 mtus. The other is there is a new dfw3 region we can use but its tenants are different from the ones we are currently using in sjc3. The same tenants in dfw3 are available in sjc3 so if we redeploy we align with dfw3 and get updated networking19:17
fungiyeah, i was hoping to hear back from folks who work on it as to whether we're okay to direct attach server instances to the PUBLICNET network since that was working in sjc3 at one point, though now it errors in both regions19:18
fungiif i don't need to create all the additional network/router/et cetera boilerplate to handle floating-ip in the new projects, i'd rather not19:18
clarkbonce we sort out ^ we can deploy new mirrors then we can rollover the nodepool configs19:19
fungiyes19:19
clarkbhave we asked cardoe yet? cardoe seems good at running down those questions19:19
clarkbin any case I suspect ^ may be the next step if we haven't yet19:21
fungino, i was trying to get the attention of cloudnull or dan_with19:21
fungibut can try to see if cardoe is able to at least find out, even though he doesn't work on that environment19:22
clarkbya cardoe seems to know who to talk to and is often available on irc19:22
clarkba good combo for us if not the most efficient19:22
clarkbanything else on this topic?19:22
funginot from me19:22
clarkb#topic Running certcheck on bridge19:23
clarkbI think I just saw changes for this today19:23
fungiyeah, i split up my earlier change to add the git version to bridge first19:23
clarkb#link https://review.opendev.org/c/opendev/system-config/+/939187 and its parent19:23
fungibut up for discussion if we want a toggle based on platform version or something19:24
clarkbprobably just needs reviews at this point? Any other considerations to call out?19:24
fungialso not sure how (or if) we'd want to go about stopping certcheck on cacti once we're good with how it's working on bridge, ansible to remove the cronjob only? clean up the git checkout too?19:25
fungior just manually delete stuff?19:25
fungisimilar for cleaning up the git deployment on bridge if we go down this path to switch it to the distro package later19:25
clarkbI think we need to stop applying updates to cacti otherwise we don't get the benefit of switching to running this on bridge. And if we do that the lists will become stale over time so we should disable it19:26
clarkbI don't know that we need to do any cleanup beyond disabling it19:26
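[Editor's note: if the cleanup ends up being automated rather than manual, disabling the old cronjob is a one-task change with Ansible's cron module; the cron entry name and user in this sketch are assumptions.]

```yaml
# Hypothetical sketch; cron entry name and user are assumptions
- name: Remove the certcheck cronjob from cacti
  ansible.builtin.cron:
    name: certcheck
    user: root
    state: absent
```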
clarkbfungi: the other thing I notice is the change doesn't update our testing which is already doing certcheck stuff on bridge I think19:26
fungii realized that my earlier version of 939187 just did it all at once (moved off cacti onto bridge and switched from git to distro package), but it hadn't been getting reviews19:26
clarkbwe probably want to cleanup whatever special case code does ^ and rely on the production path in testing19:26
clarkbplaybooks/zuul/templates/gate-groups.yaml.j2 is the file I can leave a review after the meeting19:27
fungithanks19:28
clarkbanything else?19:29
fungianyway, if people hadn't reviewed the earlier version of 939187 and are actually in favor of going back to a big-bang switch for install method and server at the same time, i'm happy to revert today's split19:29
clarkback19:30
fungibut yes, that's all from me on this19:31
clarkb#topic Service Coordinator Election19:31
clarkbThe nomination period ended and as promised I nominated myself just before the time ran out as no one else had19:31
fungithanks!19:32
clarkbI don't see any other nominations here https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/19:32
clarkbso that means I'm it by default. If I've missed a nominee somehow please bring that up asap but I don't think I have19:32
fungiwelcome back!19:33
clarkbin six months we'll have another election and I'd be thrilled if someone else took on the role :)19:33
clarkbif you're interested feel free to reach out and we can talk. We can probably even do a more gradual transition if that works19:33
clarkb#topic Using more robust Gitea caches19:34
corvusclarkb: congratulations!19:34
clarkbAbout a week and a half ago or so mnaser pointed out that some requests to gitea13 were really slow. Then a refresh would be fast. This had us suspecting the gitea caches as that is pretty normal for uncached things. However, this was different because the time delta was huge like 20 seconds instead of half a second19:35
clarkband the same page would do it occasionally within a relatively short period of time which seemed at odds with the caching behavior (it evicts after 16 hours not half an hour)19:35
clarkbso I dug around in the gitea source code and found the memory cache is implemented as a single Go hashmap. It sounds like massive Go hashmaps can create problems for the Go GC system. My suspicion now is that AI crawlers (and other usage of gitea) are slowly causing that hashmap to grow to some crazy size and eventually impacting GC with pause-the-world behavior leading to19:36
clarkbthese long requests19:36
clarkbone thing in support of this is restarting the gitea service causes it to be happy again until it stops being happy some time later19:37
clarkbI've since dug into alternative options to the 'memory' cache adapter and there are three: redis, memcached, and twoqueue. Redis is no longer open source and twoqueue is implemented within gitea as another memory cache but using a more sophisticated implementation that allows you to configure a maximum entry count. I initially rejected these because redis isn't open source19:38
clarkb(but apparently valkey is an option) and twoqueue would still be running with Go GC. This led to a change implementing memcached as the cache system19:38
clarkb#link https://review.opendev.org/c/opendev/system-config/+/942650 Set up gitea with memcached for caching19:38
clarkbthe upside to memcached is it is a fairly simple service, you can limit the total memory consumption, and it is open source. The downsides are that it is another container image to pull from docker hub (I would mirror it before we go to production if we decide to go to production with it)19:39
clarkbso I guess I'm asking for reviews and feedback on this. I think we could also use twoqueue instead and see if we still have problems after setting a maximum cache entry limit19:39
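[Editor's note: for reference, switching gitea's cache adapter is a small app.ini change; this sketch assumes a memcached sidecar listening on localhost, which may not match the proposed change exactly.]

```ini
; Sketch of the proposed cache configuration; the memcached
; host/port are assumptions for a local sidecar container
[cache]
ADAPTER = memcache
HOST = 127.0.0.1:11211
```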
corvusi ❤️ memcached19:40
clarkbthat is probably simplest from an implementation perspective, and if it doesn't work we're in no worse position than today and could switch to memcached at that point19:40
fungiyeah, on its face this seems like a fine approach19:40
corvusthis is still distributed right, we'd put one memcached on each gitea server?19:41
clarkbcool I think that change is ready for review at this point as long as the new test case I added is functional19:41
clarkbcorvus: correct19:41
corvuswould centralized help?  let multiple gitea servers share cache and reduce io/cpu?19:41
corvus(at the cost of bandwidth and us needing to make that HA)19:41
clarkbcorvus: I'm not sure it would because we use hashed IPs for load balancing so we could have very different cache needs on one gitea vs another19:42
fungii guess that change is still meant to be wip since it seems to have some debugging enabled that may not make sense in production and also at least one todo comment19:42
clarkbso having separate caches makes sense to me19:42
corvusclarkb: yeah, i guess i was imagining that, like, lots of gitea servers might get a request for the nova README19:43
clarkbfungi: the debugging should be disabled in the latest patchset (-v really isn't debugging, it's not even that verbose) and the TODO is something I noticed adjacent to the change but not directly related to it19:43
fungii think lately it's that all the gitea servers get crawled constantly for every possible url they can serve, so caches on the whole are of questionable value19:43
clarkbcorvus: I'm also not sure if the keys they use are stable across instances19:43
clarkbcorvus: I think they should be because we use a consistent database (so project ids should be the same?) but I'm not positive of that19:44
corvusi definitely think that starting with unshared/distributed caching is best, since it's the smallest delta from our previously working system and is the smallest change to fix the current problem.  was mostly wondering if we should explore shared caching since that was the actual design goal of memcached and so it's natural to wonder if it's a good fit here.19:44
clarkbI think the safest thing for now is distributed caches19:44
clarkbah19:44
corvusyes agree, just can't help thinking another step ahead :)19:44
clarkbI think it would definitely be possible if/when we use a shared db backend19:44
clarkbit's not completely clear to me if we can until then19:45
corvusack19:45
clarkbfungi: we have to have a cache (there is no disabled caching option) and if my hunches are correct the default is actively making things worse so unfortunately we need to go with something that is less bad at least19:45
clarkbit does sound like we're happy to use memcached so maybe I should go ahead and propose a change to mirror that container image today and then update the change tomorrow to pull from the mirror19:46
fungiyeah, i just meant tuning the cache for performance may not yield much difference19:46
clarkbah19:47
corvus(even a basic LRU cache wouldn't be terrible for "crawl everything"; anything non-ai-crawler would still heat up part of the cache and keep it in memory)19:47
corvus(but the cs in me says there's almost certainly a better algorithm for that)19:47
clarkbevicting based on total requests since entry would probably work well here19:48
clarkbsince the AI things grab a page once before scanning again next week19:48
corvus++19:48
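[Editor's note: the eviction idea sketched above — drop whatever has gone the longest, measured in requests rather than wall time, without a hit, so that a crawler's one-shot fetches age out quickly — could be prototyped as something like the following toy cache. This is an illustration of the idea only, not gitea's actual implementation.]

```python
# Toy sketch of "evict based on requests since last hit": each
# request bumps a global counter; on overflow we drop the entry
# whose last hit is furthest back in request count. Pages a
# crawler fetched once are the first to go. Illustration only,
# not gitea's actual cache.
class RequestAgedCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.tick = 0            # global request counter
        self.data = {}           # key -> (value, last_hit_tick)

    def get(self, key):
        self.tick += 1
        if key in self.data:
            value, _ = self.data[key]
            self.data[key] = (value, self.tick)  # refresh on hit
            return value
        return None

    def put(self, key, value):
        self.tick += 1
        if key not in self.data and len(self.data) >= self.max_entries:
            # evict the entry with the most requests since its last hit
            stale = min(self.data, key=lambda k: self.data[k][1])
            del self.data[stale]
        self.data[key] = (value, self.tick)
```

A real implementation would also need TTLs and size accounting, but the request-count clock captures the crawler-resistance property being discussed.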
clarkbthanks for the feedback I think I see the path forward here19:49
clarkb#topic Working through our TODO list19:49
clarkbA reminder that I'm trying to keep a rough high level todo list on this etherpad19:49
clarkb#link https://etherpad.opendev.org/p/opendev-january-2025-meetup19:49
clarkba good place for anyone to check if they'd like to get involved or if any of us get bored19:49
clarkbI intend to edit that list to include parallelizing infra-prod-* jobs19:49
clarkb#topic Open Discussion19:49
fungii've started looking into mailman log rotation19:50
clarkbThe Matrix Foundation may be shutting down the OFTC irc bridge at the end of march if they don't get additional funding. Something for people who rely on that bridge for irc access to be aware of19:50
fungimailman-core specifically, since mailman-web is just django apps and django already handles rotation for those19:50
fungii ran across this bug, which worries me: https://gitlab.com/mailman/mailman/-/issues/93119:50
clarkbfungi: I think the main questions I had about log rotation was ensuring that mailman doesn't have some mechanism it wants to use for that and figuring out what retention should be. I feel like mail things tend to move more slowly than web things so longer retention might be appropriate?19:51
fungiseems like the lmtp handler logs to smtp.log and the command to tell mailman to reopen log files doesn't fully work19:51
clarkbfungi: copytruncate should address that right?19:52
clarkbiirc it exists because some services don't handle rotation gracefully?19:52
fungiyeah, looking in separate bug report https://gitlab.com/mailman/mailman/-/issues/1078 copytruncate is suggested19:52
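[Editor's note: a copytruncate-based rotation stanza like the one suggested in that issue might look like this; the log path and retention are assumptions, not the containers' actual layout.]

```
# Hypothetical logrotate stanza; path and retention are assumptions
/var/lib/mailman/var/logs/smtp.log {
    weekly
    rotate 12
    compress
    missingok
    copytruncate
}
```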
clarkbsounds like that may be everything?19:55
fungiit's not clear to me that it fully works even then, whether we also need to do a `mailman reopen` or just sighup, et cetera19:55
clarkbya there is a comment there indicating copytruncate may still be problematic19:55
clarkbthough that is surprising to me since that should keep the existing fd and path. It's the rotated logs that are new and detached from the process19:55
fungiwell, you still need the logging process to gracefully reorient to the start of the file after truncation and not, e.g., seek to the old end address19:57
fungibut following those discussions it seems like it's partly about python logging base behaviors and handling multiple streams19:57
fungiwhich probably handles that okay19:57
fungihandles the truncation okay i mean19:58
clarkbhrm if it's python logging in play then I wonder if we need a python logging config instead19:58
clarkbthat is what we typically do for services like zuul, nodepool, etc and set the rotation schedule in that config19:58
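[Editor's note: the zuul/nodepool pattern referenced here is a standalone logging config with the rotation schedule baked into the handler rather than into logrotate. A sketch of what that could look like if mailman's smtp logger were configured the same way — the logger name, file path, and schedule are all assumptions, not mailman's actual configuration.]

```python
# Hypothetical sketch: rotation handled by python logging itself,
# in the style used for zuul/nodepool. Logger name, file path, and
# rotation schedule are assumptions, not mailman's actual config.
import logging
import logging.config

LOG_CONFIG = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'plain': {'format': '%(asctime)s %(levelname)s %(name)s: %(message)s'},
    },
    'handlers': {
        'smtp_file': {
            'class': 'logging.handlers.TimedRotatingFileHandler',
            'filename': '/tmp/smtp.log',   # assumed path
            'when': 'midnight',
            'backupCount': 30,             # assumed retention
            'formatter': 'plain',
        },
    },
    'loggers': {
        'mail.smtp': {'handlers': ['smtp_file'], 'level': 'INFO'},
    },
}

logging.config.dictConfig(LOG_CONFIG)
logging.getLogger('mail.smtp').info('lmtp connection accepted')
```

This sidesteps the reopen-after-rotation problem entirely, since the handler rotates its own file, but it does depend on mailman loading such a config, which is the open question raised below.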
fungii want to say we've seen something like this before, where sighup to the process isn't propagating to logging from different libraries that get imported19:59
fungiat least the aiosmtpd logging that ends up in smtp.log seems to be via python logging19:59
clarkboh but mailman would have to support loading a logging config from disk right?20:00
clarkbpython logging doesn't have a way to do that straight out of the library iirc20:00
clarkbwe are at time. The logging thing probably deserves some testing which we can do via a held node living long enough to do rotations20:00
clarkbthank you everyone for your time!20:00
fungihttps://github.com/aio-libs/aiosmtpd/issues/278 has some further details20:01
fungiwill follow up in #opendev20:01
clarkbfeel free to continue discussion in #opendev and/or on the mailing list20:01
clarkb#endmeeting20:01
opendevmeetMeeting ended Tue Feb 25 20:01:11 2025 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:01
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.html20:01
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.txt20:01
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.log.html20:01
