Nick | Message | Time |
---|---|---|
clarkb | Just about meeting time | 18:59 |
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Feb 25 19:00:22 2025 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ZEEHBRE5DEZGXFXPGE4MYFH4NYGRUOIP/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | I didn't have anything to announce (there is service coordinator election stuff but I've given that a full agenda topic for later) | 19:01 |
fungi | #link https://lists.openinfra.org/archives/list/foundation@lists.openinfra.org/message/I2XP4T2C47TEODOH4JYVUZNEWK33R3PN/ draft governance documents and feedback calls for proposed OpenInfra/LF merge | 19:02 |
clarkb | thanks seems like that may be it | 19:03 |
clarkb | a good reminder to keep an eye on that whole thread as well | 19:03 |
clarkb | #topic Zuul-launcher image builds | 19:04 |
clarkb | as mentioned previously corvus has been attempting to dogfood the zuul-launcher system with a test change in zuul itself | 19:04 |
fungi | yeah, one thing i wish hyperkitty had was a good way to deep-link to a message within a thread view | 19:04 |
clarkb | #link https://review.opendev.org/c/zuul/zuul/+/940824 Somewhat successful dogfooding in this zuul change | 19:04 |
clarkb | previously there were issues with quota limits and other bugs, but now we've got actual job runs that succeed on the images | 19:04 |
clarkb | that's pretty cool and great progress on the underlying migration of nodepool into zuul | 19:05 |
clarkb | I think there are still some quota problems with the latest buildset but some jobs ran | 19:05 |
clarkb | and the work to understand quotas has started in zuul-launcher so this should only get better | 19:05 |
clarkb | not sure if corvus has anything else to add, but ya good progress | 19:06 |
clarkb | #topic Fixing known_hosts generation on bridge | 19:08 |
clarkb | when deploying tracing02 last week I discovered that sometimes ssh known_hosts isn't updated when ansible runs in the infra-prod-base job | 19:09 |
clarkb | Eventually I was able to track that down to the code updating known_hosts running against system-config content that was already on bridge without updating it first | 19:09 |
clarkb | sometimes ansible and ssh would work (known_hosts would update) because the load balancer jobs for zuul and gitea would sometimes run before the infra-prod-run job | 19:10 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942307 | 19:10 |
clarkb | This change aims to fix that by updating system-config as part of bridge bootstrapping and then we have to ensure the other jobs for load balancers (and the zuul db) don't run concurrently and contend for the system-config on disk | 19:11 |
clarkb | testing this is a bit difficult without just doing it. fungi and corvus have reviewed and +2'd the change. Do we want to proceed with the update nowish or wait for some better time? The hourly jobs do run the bootstrap bridge job so we should start to get feedback pretty quickly | 19:12 |
fungi | i'd be fine going ahead with it | 19:12 |
clarkb | ok after the meeting I've got lunch but then maybe we go for it afterwards if there are no objections or suggestions for a different approach? That would be at about 2100 UTC | 19:12 |
clarkb | #topic Upgrading old servers | 19:14 |
clarkb | Now that I'm running the meeting I realize this topic and the next one can be folded together so why don't I just do that | 19:14 |
clarkb | I've continued to try and upgrade servers from focal to noble with the most recent one being tracing02 | 19:14 |
clarkb | everything is switched over with zuul talking to the new server, dns is cleaned up/updated, the last step is to delete the old server which I'll try to do today | 19:15 |
clarkb | #link https://etherpad.opendev.org/p/opendev-server-replacement-sprint has some high level details on the backlog todo list. Help is super appreciated | 19:15 |
clarkb | tonyb not sure if you are awake yet, but anything to update from your side of things? | 19:15 |
clarkb | anything we can do to be useful etc? | 19:15 |
tonyb | Nope nothing from me | 19:16 |
clarkb | #topic Redeploying raxflex resources | 19:16 |
clarkb | sort of related but with different motivation is that we should redeploy our raxflex ci resources | 19:16 |
clarkb | there are two drivers for this. The first is that doing so in sjc3 will get us updated networking with 1500 mtus. The other is there is a new dfw3 region we can use but its tenants are different than the ones we are currently using in sjc3. The same tenants in dfw3 are available in sjc3 so if we redeploy we align with dfw3 and get updated networking | 19:17 |
fungi | yeah, i was hoping to hear back from folks who work on it as to whether we're okay to direct attach server instances to the PUBLICNET network since that was working in sjc3 at one point, though now it errors in both regions | 19:18 |
fungi | if i don't need to create all the additional network/router/et cetera boilerplate to handle floating-ip in the new projects, i'd rather not | 19:18 |
clarkb | once we sort out ^ we can deploy new mirrors, then we can roll over the nodepool configs | 19:19 |
fungi | yes | 19:19 |
clarkb | have we asked cardoe yet? cardoe seems good at running down those questions | 19:19 |
clarkb | in any case I suspect ^ may be the next step if we haven't yet | 19:21 |
fungi | no, i was trying to get the attention of cloudnull or dan_with | 19:21 |
fungi | but can try to see if cardoe is able to at least find out, even though he doesn't work on that environment | 19:22 |
clarkb | ya cardoe seems to know who to talk to and is often available on irc | 19:22 |
clarkb | a good combo for us if not the most efficient | 19:22 |
clarkb | anything else on this topic? | 19:22 |
fungi | not from me | 19:22 |
clarkb | #topic Running certcheck on bridge | 19:23 |
clarkb | I think I just saw changes for this today | 19:23 |
fungi | yeah, i split up my earlier change to add the git version to bridge first | 19:23 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/939187 and its parent | 19:23 |
fungi | but up for discussion if we want a toggle based on platform version or something | 19:24 |
clarkb | probably just needs reviews at this point? Any other considerations to call out? | 19:24 |
fungi | also not sure how (or if) we'd want to go about stopping certcheck on cacti once we're good with how it's working on bridge, ansible to remove the cronjob only? clean up the git checkout too? | 19:25 |
fungi | or just manually delete stuff? | 19:25 |
fungi | similar for cleaning up the git deployment on bridge if we go down this path to switch it to the distro package later | 19:25 |
clarkb | I think we need to stop applying updates to cacti otherwise we don't get the benefit of switching to running this on bridge. And if we do that the lists will become stale over time so we should disable it | 19:26 |
clarkb | I don't know that we need to do any cleanup beyond disabling it | 19:26 |
clarkb | fungi: the other thing I notice is the change doesn't update our testing, which is already doing certcheck stuff on bridge I think | 19:26 |
fungi | i realized that my earlier version of 939187 just did it all at once (moved off cacti onto bridge and switched from git to distro package), but it hadn't been getting reviews | 19:26 |
clarkb | we probably want to cleanup whatever special case code does ^ and rely on the production path in testing | 19:26 |
clarkb | playbooks/zuul/templates/gate-groups.yaml.j2 is the file. I can leave a review after the meeting | 19:27 |
fungi | thanks | 19:28 |
clarkb | anything else? | 19:29 |
fungi | anyway, if people hadn't reviewed the earlier version of 939187 and are actually in favor of going back to a big-bang switch for install method and server at the same time, i'm happy to revert today's split | 19:29 |
clarkb | ack | 19:30 |
fungi | but yes, that's all from me on this | 19:31 |
clarkb | #topic Service Coordinator Election | 19:31 |
clarkb | The nomination period ended and as promised I nominated myself just before the time ran out as no one else had | 19:31 |
fungi | thanks! | 19:32 |
clarkb | I don't see any other nominations here https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/ | 19:32 |
clarkb | so that means I'm it by default. If I've missed a nominee somehow please bring that up asap but I don't think I have | 19:32 |
fungi | welcome back! | 19:33 |
clarkb | in six months we'll have another election and I'd be thrilled if someone else took on the role :) | 19:33 |
clarkb | if you're interested feel free to reach out and we can talk. We can probably even do a more gradual transition if that works | 19:33 |
clarkb | #topic Using more robust Gitea caches | 19:34 |
corvus | clarkb: congratulations! | 19:34 |
clarkb | About a week and a half ago or so mnaser pointed out that some requests to gitea13 were really slow. Then a refresh would be fast. This had us suspecting the gitea caches as that is pretty normal for uncached things. However, this was different because the time delta was huge, like 20 seconds instead of half a second | 19:35 |
clarkb | and the same page would do it occasionally within a relatively short period of time, which seemed at odds with the caching behavior (it evicts after 16 hours not half an hour) | 19:35 |
clarkb | so I dug around in the gitea source code and found the memory cache is implemented as a single Go hashmap. It sounds like massive Go hashmaps can create problems for the Go GC system. My suspicion now is that AI crawlers (and other usage of gitea) are slowly causing that hashmap to grow to some crazy size and eventually impacting GC with pause-the-world behavior, leading to these long requests | 19:36 |
clarkb | one thing in support of this is restarting the gitea service causes it to be happy again until it stops being happy some time later | 19:37 |
clarkb | I've since dug into alternative options to the 'memory' cache adapter and there are three: redis, memcached, and twoqueue. Redis is no longer open source, and twoqueue is implemented within gitea as another in-memory cache but with a more sophisticated implementation that allows you to configure a maximum entry count. I initially rejected both: redis because it isn't open source (but apparently valkey is an option), and twoqueue because it would still be subject to Go GC. This led to a change implementing memcached as the cache system (see the bounded-cache sketch after the log) | 19:38 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/942650 Set up gitea with memcached for caching | 19:38 |
clarkb | the upside to memcached is it is a fairly simple service, you can limit the total memory consumption, and it is open source. The downsides are that it is another container image to pull from docker hub (I would mirror it before we go to production if we decide to go to production with it) | 19:39 |
clarkb | so I guess I'm asking for reviews and feedback on this. I think we could also use twoqueue instead and see if we still have problems after setting a maximum cache entry limit | 19:39 |
corvus | i ❤️ memcached | 19:40 |
clarkb | that is probably simplest from an implementation perspective, and if it doesn't work we're in no worse position than today and could switch to memcached at that point | 19:40 |
fungi | yeah, on its face this seems like a fine approach | 19:40 |
corvus | this is still distributed right, we'd put one memcached on each gitea server? | 19:41 |
clarkb | cool I think that change is ready for review at this point as long as the new test case I added is functional | 19:41 |
clarkb | corvus: correct | 19:41 |
corvus | would centralized help? let multiple gitea servers share cache and reduce io/cpu? | 19:41 |
corvus | (at the cost of bandwidth and us needing to make that HA) | 19:41 |
clarkb | corvus: I'm not sure it would because we use hashed IPs for load balancing so we could have very different cache needs on one gitea vs another | 19:42 |
fungi | i guess that change is still meant to be wip since it seems to have some debugging enabled that may not make sense in production and also at least one todo comment | 19:42 |
clarkb | so having separate caches makes sense to me | 19:42 |
corvus | clarkb: yeah, i guess i was imagining that, like, lots of gitea servers might get a request for the nova README | 19:43 |
clarkb | fungi: the debugging should be disabled in the latest patchset (-v really isn't debugging, it's not even that verbose) and the TODO is something I noticed adjacent to the change but not directly related to it | 19:43 |
fungi | i think lately it's that all the gitea servers get crawled constantly for every possible url they can serve, so caches on the whole are of questionable value | 19:43 |
clarkb | corvus: I'm also not sure if the keys they use are stable across instances | 19:43 |
clarkb | corvus: I think they should be because we use a consistent database (so project ids should be the same?) but I'm not positive of that | 19:44 |
corvus | i definitely think that starting with unshared/distributed caching is best, since it's the smallest delta from our previously working system and is the smallest change to fix the current problem. was mostly wondering if we should explore shared caching since that was the actual design goal of memcached and so it's natural to wonder if it's a good fit here. | 19:44 |
clarkb | I think the safest thing for now is distributed caches | 19:44 |
clarkb | ah | 19:44 |
corvus | yes agree, just can't help thinking another step ahead :) | 19:44 |
clarkb | I think it would definitely be possible if/when we use a shared db backend | 19:44 |
clarkb | it's not completely clear to me if we can until then | 19:45 |
corvus | ack | 19:45 |
clarkb | fungi: we have to have a cache (there is no disabled caching option) and if my hunches are correct the default is actively making things worse, so unfortunately we need to go with something that is less bad at least | 19:45 |
clarkb | it does sound like we're happy to use memcached so maybe I should go ahead and propose a change to mirror that container image today and then update the change tomorrow to pull from the mirror | 19:46 |
fungi | yeah, i just meant tuning the cache for performance may not yield much difference | 19:46 |
clarkb | ah | 19:47 |
corvus | (even a basic LRU cache wouldn't be terrible for "crawl everything"; anything non-ai-crawler would still heat up part of the cache and keep it in memory) | 19:47 |
corvus | (but the cs in me says there's almost certainly a better algorithm for that) | 19:47 |
clarkb | evicting based on total requests since entry would probably work well here | 19:48 |
clarkb | since the AI things grab a page once before scanning again next week | 19:48 |
corvus | ++ | 19:48 |
clarkb | thanks for the feedback I think I see the path forward here | 19:49 |
clarkb | #topic Working through our TODO list | 19:49 |
clarkb | A reminder that I'm trying to keep a rough high level todo list on this etherpad | 19:49 |
clarkb | #link https://etherpad.opendev.org/p/opendev-january-2025-meetup | 19:49 |
clarkb | a good place for anyone to check if they'd like to get involved or if any of us get bored | 19:49 |
clarkb | I intend on editing that list to include parallelize infra-prod-* jobs | 19:49 |
clarkb | #topic Open Discussion | 19:49 |
fungi | i've started looking into mailman log rotation | 19:50 |
clarkb | The Matrix Foundation may be shutting down the OFTC irc bridge at the end of march if they don't get additional funding. Something for people who rely on that bridge for irc access to be aware of | 19:50 |
fungi | mailman-core specifically, since mailman-web is just django apps and django already handles rotation for those | 19:50 |
fungi | i ran across this bug, which worries me: https://gitlab.com/mailman/mailman/-/issues/931 | 19:50 |
clarkb | fungi: I think the main questions I had about log rotation was ensuring that mailman doesn't have some mechanism it wants to use for that and figuring out what retention should be. I feel like mail things tend to move more slowly than web things so longer retention might be appropriate? | 19:51 |
fungi | seems like the lmtp handler logs to smtp.log and the command to tell mailman to reopen log files doesn't fully work | 19:51 |
clarkb | fungi: copytruncate should address that right? | 19:52 |
clarkb | iirc it exists because some services don't handle rotation gracefully? | 19:52 |
fungi | yeah, looking in a separate bug report https://gitlab.com/mailman/mailman/-/issues/1078 copytruncate is suggested | 19:52 |
clarkb | sounds like that may be everything? | 19:55 |
fungi | it's not clear to me that it fully works even then, whether we also need to do a `mailman reopen` or just sighup, et cetera | 19:55 |
clarkb | ya there is a comment there indicating copytruncate may still be problematic | 19:55 |
clarkb | though that is surprising to me since that should keep the existing fd and path. It's the rotated logs that are new and detached from the process | 19:55 |
fungi | well, you still need the logging process to gracefully reorient to the start of the file after truncation and not, e.g., seek to the old end address | 19:57 |
fungi | but following those discussions it seems like it's partly about python logging base behaviors and handling multiple streams | 19:57 |
fungi | which probably handles that okay | 19:57 |
fungi | handles the truncation okay i mean | 19:58 |
clarkb | hrm if it's python logging in play then I wonder if we need a python logging config instead (see the logging-config sketch after the log) | 19:58 |
clarkb | that is what we typically do for services like zuul, nodepool, etc and set the rotation schedule in that config | 19:58 |
fungi | i want to say we've seen something like this before, where sighup to the process isn't propagating to logging from different libraries that get imported | 19:59 |
fungi | at least the aiosmtpd logging that ends up in smtp.log seems to be via python logging | 19:59 |
clarkb | oh but mailman would have to support loading a logging config from disk right? | 20:00 |
clarkb | python logging doesn't have a way to do that straight out of the library iirc | 20:00 |
clarkb | we are at time. The logging thing probably deserves some testing which we can do via a held node living long enough to do rotations | 20:00 |
clarkb | thank you everyone for your time! | 20:00 |
fungi | https://github.com/aio-libs/aiosmtpd/issues/278 has some further details | 20:01 |
fungi | will follow up in #opendev | 20:01 |
clarkb | feel free to continue discussion in #opendev and/or on the mailing list | 20:01 |
clarkb | #endmeeting | 20:01 |
opendevmeet | Meeting ended Tue Feb 25 20:01:11 2025 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:01 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.html | 20:01 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.txt | 20:01 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2025/infra.2025-02-25-19.00.log.html | 20:01 |
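
A minimal Python sketch of the bounded-cache idea discussed in the gitea caching topic above (twoqueue's configurable maximum entry count, or corvus's basic-LRU suggestion). This is not gitea's implementation, which is Go code inside gitea itself; the class name, cap, and example keys are invented purely to illustrate why capping the entry count avoids the unbounded hashmap growth described above.

```python
# Rough illustration only: gitea's 'twoqueue' adapter is Go code inside gitea,
# not this Python class. This just shows what a maximum-entry-count cache buys
# over an unbounded map: old entries are evicted instead of accumulating forever.
from collections import OrderedDict


class BoundedLRUCache:
    def __init__(self, max_entries=50000):
        self.max_entries = max_entries  # cap chosen arbitrarily for the example
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used entry


# A crawler touching every page once only churns the tail of the cache;
# repeat visitors (humans, CI) keep their entries near the "recent" end.
cache = BoundedLRUCache(max_entries=3)
for page in ["nova/README", "zuul/README", "crawler-1", "crawler-2"]:
    cache.set(page, f"rendered {page}")
assert cache.get("nova/README") is None  # evicted; the cache stayed bounded
```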
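
And a sketch of what "a python logging config instead" could look like for the mailman smtp.log discussion, in the spirit of how services like zuul and nodepool are configured: a rotating handler inside the process manages the file, so no logrotate copytruncate or `mailman reopen` is needed. Whether mailman-core can actually be pointed at such a config is exactly the open question in the linked bugs; the logger name, path, and retention values below are assumptions for illustration only.

```python
# Hypothetical sketch: whether mailman-core can load a logging config like this
# for smtp.log is the open question in the linked bugs. The filename and
# retention are placeholders, not values from our deployment.
import logging
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "smtp_file": {
            # The handler itself rotates the file on schedule, so no external
            # logrotate or copytruncate is involved.
            "class": "logging.handlers.TimedRotatingFileHandler",
            "filename": "smtp.log",  # placeholder; a real deployment would use mailman's var dir
            "when": "midnight",
            "backupCount": 30,  # retention is a policy decision, not a given
            "formatter": "plain",
        },
    },
    "loggers": {
        # Assumption worth verifying: aiosmtpd's SMTP logging goes to the
        # 'mail.log' logger; routing it here means the process rotates its own file.
        "mail.log": {"handlers": ["smtp_file"], "level": "INFO"},
    },
}

logging.config.dictConfig(LOGGING)
logging.getLogger("mail.log").info("rotation handled by the logging handler itself")
```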