Tuesday, 2024-10-22

18:59 <clarkb> just about meeting time
18:59 <clarkb> I'm not sure how many people we'll get today with the PTG happening
19:00 <clarkb> I did want to make sure we had time to quickly go over things and catch up since we skipped last week's meeting
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Oct 22 19:00:15 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/QGD26LEKHTM3AI6HTETDZWG6NQVM7ALV/ Our Agenda
19:00 <clarkb> #topic Announcements
19:00 <clarkb> #link https://www.socallinuxexpo.org/scale/22x/events/open-infra-days CFP for Open Infra Days event at SCaLE is open until November 1
19:01 <clarkb> Sounds like the zuul presentation at the recent open infra days in indiana was well received. I've been told I should encourage all y'all with good ideas to propose presentations for the SCaLE event
19:01 <clarkb> also, as I just mentioned, the PTG is happening this week
19:02 * frickler is watching with one eye or so
19:02 <clarkb> please be careful making changes, particularly to meetpad or etherpad and ptgbot
19:02 <fungi> yeah, i didn't get a chance to attend any zuul talks at oid-na but was glad to see there were some
19:02 <corvus> o/
19:02 <clarkb> along those lines I put meetpad02 and jvb02 in the emergency file because jitsi meet just cut new releases and docker images. Having those servers in the emergency file should ensure that we don't update them when our daily infra-prod jobs run in ~6 hours
19:03 <clarkb> once the PTG is over we can remove those servers and let them upgrade normally. This just avoids any problems since meetpad has been working pretty well so far
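
For context, the emergency file is the exclusion list on the bastion host that the periodic infra-prod Ansible runs consult; hosts listed there are skipped, so a frozen service is not overwritten by the daily run. A minimal sketch of the idea, assuming a standard Ansible static inventory; the real path, group name, and layout on the bridge host may differ:

    # /etc/ansible/hosts/emergency.yaml (path and group name assumed
    # for illustration). Hosts in this inventory are excluded from
    # automated configuration runs until removed again.
    disabled:
      hosts:
        meetpad02.opendev.org:
        jvb02.opendev.org: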
19:03 <clarkb> #topic Zuul-launcher image builds
19:04 <clarkb> Diving right into the agenda I feel like I need to catch up on the state of things here. Anything new to report corvus?
19:04 <corvus> at this point i think we can say the image build/upload is successful
19:05 <corvus> i think there is an opportunity to improve the throughput of the download/upload cycle on the launcher, but we're missing some log detail to confirm that
19:05 <fungi> yay!
19:05 <corvus> i even went as far as to try launching a node from the uploaded image
19:05 <fungi> i mean yay-success, of course, not yay-missing-logs ;)
19:05 <corvus> that almost worked, but the launcher didn't provide enough info to the executor to actually use the node
19:06 <corvus> technically we did run a job on a zuul-launcher-launched node, it just failed.  :)
19:06 <corvus> we just (right before this meeting) merged changes to address both of those things
19:06 <clarkb> ok was going to ask if the problem was in our configs or in zuul itself
19:06 <corvus> so i will retry the upload to get better logs, and retry the job to see what is to be seen there
19:07 <corvus> 2 things aside from the above:
19:08 <corvus> 1) i have a suspicion that the x-delete-after is not working.  maybe that's not honored by the swift cli when it's doing segmented uploads, or maybe the cloud doesn't support that.  i still need to confirm that with the most recent uploads, and then triage which of those things it is.
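
For context, X-Delete-After is a standard Swift object header that tells the cluster to expire the object after the given number of seconds. A minimal sketch of an upload that sets it via the python-swiftclient CLI, with hypothetical container and object names; on a segmented upload the header is applied to the manifest object, so the segments themselves are one place the expiry could be getting lost:

    # upload a large image with a 24 hour expiry; -S splits it into segments
    swift upload --header "X-Delete-After: 86400" -S 1073741824 \
        images debian-bookworm.qcow2
    # check whether the expiry stuck (stat shows X-Delete-At if it did)
    swift stat images debian-bookworm.qcow2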
19:08 <corvus> 2) image build jobs for other images are still waiting for tonyb or someone to start on that (no rush, but it's ready for work to start whenever)
19:08 <corvus> oh bonus #3:
19:09 <corvus> 3) i don't think we're doing periodic builds yet; but we can; so i or someone should hook up the jobs to that pipeline (that's a simple zuul.yaml change)
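
A minimal sketch of the kind of zuul.yaml change corvus describes, attaching an image build job to the periodic pipeline; the job name here is a hypothetical stand-in for the real ones:

    # .zuul.yaml: run the image build on a timer as well as on demand
    - project:
        periodic:
          jobs:
            - build-debian-bookworm-cloud-image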
19:09 <clarkb> re 1) probably a good idea to debug before we add a lot of image builds (just to keep the total amount of data as small as possible)
19:10 <corvus> yep -- though to be clear, we can work on the jobs for the other platforms and not upload the images yet
19:10 <corvus> (so #1 is not a blocker for #2)
19:10 <clarkb> got it, upload is a distinct step and we can start with simply doing builds
19:10 <corvus> ++
19:11 <corvus> i think that's about it for updates (i could yak longer, but that's the critical path)
19:11 <clarkb> thank you for the update
19:11 * tonyb promises to write at least one job this week
19:11 <clarkb> #topic OpenStack OpenAPI spec publishing
19:11 <clarkb> #link https://review.opendev.org/921934
19:12 <clarkb> I kept this on the agenda to make sure we don't lose track of it and I was hoping to maybe catch up during the PTG but I'm not sure timing will work out for that
19:12 <clarkb> the sdk team met yesterday during TC time
19:12 <clarkb> so maybe we just need to follow up after the PTG and see what is next
19:13 <clarkb> (there aren't any new comments in response to frickler or myself on the change)
19:13 <clarkb> any other thoughts on this?
19:13 <fungi> i have none
19:13 <clarkb> #topic Backup Server Pruning
19:14 <clarkb> we discussed options for reducing disk consumption on the smaller of the two backup servers 2 weeks ago, then I went on a last minute international trip and haven't had a chance to do that
19:14 <clarkb> good news is tomorrow is a quiet day in my PTG schedule so I'm hoping I can sit down and carefully trim out the backup targets for old/ancient/gone servers
19:14 <clarkb> ask01, ethercalc02, etherpad01, gitea01, lists, review-dev01, and review01
19:15 <clarkb> that is the list I'll be looking at, probably ethercalc to start since it seems the least impactful
19:15 <fungi> i think we already had consensus to remove those, but just to reiterate, that list sounds good to me
19:15 <fungi> i'd volunteer to help but my dance card is full until at least mid-next week
19:16 <clarkb> ya between now and tomorrow is a good time to chime in if you think that we should replace the backing volume instead and keep those around or $otheridea
19:16 <tonyb> ++
19:16 <clarkb> but my intention is to simply clear those out and ensure we're recovering expected disk space to start
19:16 <clarkb> we should have server snapshots and the other backup server too so risk seems low
19:17 <clarkb> #topic Updating Gerrit Cache Sizes
19:17 <clarkb> last Friday we upgraded Gerrit to pick up some bugfixes
19:17 <clarkb> when gerrit started up it complained about a number of caches being oversized and needing pruning
19:17 <clarkb> it turns out that gerrit prunes them automatically at 0100 but also on startup
19:18 <clarkb> https://paste.opendev.org/show/bk4pTIuQLCsWaF3dVVF7/ is the relevant logged output which shows several related caches were much larger than their configured sizes (all defaults)
19:18 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/932763 increase sizes for four gerrit caches
19:18 <clarkb> I pushed this change to update the cache sizes based on the data in those logs and the documentation, to what I hope is a larger, more reasonable, and performant set of sizes
19:19 <clarkb> updating this config will require another gerrit restart so this isn't a rush. May be good to try and get it done after the PTG though, as dev work should ramp up and give us an idea of whether or not this is helpful
19:20 <clarkb> probably the main concern is that we're increasing the size of some memory caches too, but they seem clearly too small and this is likely impacting performance
19:20 <fungi> out of curiosity, i wonder if anyone has observed worse performance with the aggressively small cache target sizes
19:20 <fungi> but also no clue how recently this started complaining
19:20 <clarkb> fungi: I suspect that this is why we don't get diffs for a few minutes on gerrit startup. Gerrit is marking all of the cached data for those diffs as stale and it takes a while to repopulate
19:21 <fungi> does it persist caches over restarts? prune during startup?
19:21 <clarkb> fungi: the disk caches are persisted over restarts but it prunes them to the configured size on startup
19:21 <fungi> and also once a day
19:21 <clarkb> "Cache jdbc:h2:file:///var/gerrit/cache/gerrit_file_diff size (2.51g) is greater than maxSize (128.00m), pruning" basically all this content is marked invalid at startup
19:22 <clarkb> by increasing that cache size to 3g as proposed I suspect/hope that the next restart won't prune and we'll get diffs on startup
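
For reference, Gerrit's persistent caches are tuned per cache in gerrit.config, and the 128.00m in the quoted log line is the documented default disk limit. A minimal sketch of the kind of stanza the linked change proposes; the full set of cache names and sizes is in review 932763, so treat this one as illustrative:

    # etc/gerrit.config: let the h2-backed diff cache keep ~3g on disk
    # so startup pruning no longer throws the whole cache away
    [cache "gerrit_file_diff"]
        diskLimit = 3g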
19:22 <fungi> so maybe if people have been observing sluggishness after 01z daily that could be an explanation
19:22 <clarkb> or if it prunes it will do so minimally
19:22 <fungi> that sounds like a great test
19:23 <clarkb> anyway comments welcome and definitely open to suggestions on size if we have different interpretations of the docs or concerns about memory consumption
19:23 <clarkb> and if we can reach general consensus a restart early next week would be great
19:24 <frickler> I already +2d, early next week sgtm
19:24 <clarkb> #topic Upgrading old servers
19:25 <clarkb> tonyb: not sure if you are still around. Any updates on the wiki changes?
19:25 <clarkb> I don't see new patchsets. Any other updates?
19:26 <fungi> i ended up adding some ua filters to the existing set in order to hopefully get a handle on ai training scrapers overrunning it
19:26 <fungi> on the production server that is
19:26 <clarkb> oh ya tonyb mentioned those would need syncing as part of the redeployment
19:27 <fungi> tonyb mentioned adding those bots to the robots.txt in an update to his changes, since most of those bots should be well-behaved but the old server doesn't present a robots.txt at all
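
A minimal sketch of the robots.txt approach tonyb describes, using two real AI-crawler user agents as examples; the actual bot list in his change may differ:

    # robots.txt: ask well-behaved training crawlers to skip the wiki
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /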
19:27 <fungi> i think the load average was up around 50 when i was looking into the problem
19:27 <clarkb> I'm guessing tonyb managed to go on that run so we don't need to wait around
19:27 <clarkb> fungi: I'm guessing that your edits improved things based on my ability to edit the agenda yesterday :)
19:27 <fungi> well, i also fully rebooted the server
19:28 <fungi> load average is still pretty high, around 10 at the moment, but the reboot did seem to fix the inability to authenticate via openid
19:29 <fungi> anyway, the sooner we're able to move forward with the container replacement, the easier this all gets
19:29 <clarkb> and until the AI training wars subside we're likely to need to make continuous updates
19:31 <clarkb> #topic Docker compose plugin with podman service for servers
19:31 <clarkb> I don't think anyone has pushed up a change to start testing this with say paste/lodgeit but that is the current proposed plan
19:31 <clarkb> if I'm wrong about that please correct me and point out what needs reviewing or if there are any other questions
19:31 <corvus> i share that understanding
19:31 <fungi> i don't recall seeing a change yet
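
For background on the plan being discussed: the docker compose plugin can drive podman through podman's Docker-compatible API socket, so existing compose files keep working while the engine changes underneath. A minimal sketch assuming a systemd host and rootful podman; the socket unit and path are common defaults, not details confirmed in the meeting:

    # enable podman's Docker-compatible API socket
    systemctl enable --now podman.socket
    # point the docker compose plugin at podman instead of dockerd
    export DOCKER_HOST=unix:///run/podman/podman.sock
    docker compose up -d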
19:32 <clarkb> #topic Open Discussion
19:32 <clarkb> Anything else?
19:33 <fungi> i've got nothing
19:34 <clarkb> it may be worth mentioning that I'll be out around veterans day weekend. I can't remember if I'm back Tuesday or Wednesday though
19:34 <clarkb> also thanksgiving is about a month away for those of us in the US
19:34 <frickler> EU switches back from DST next sunday
19:35 <clarkb> looks like I'll be back tuesday so no missed meeting for me and I expect to be around tuesday before thanksgiving
19:35 <fungi> i think it's a couple of weeks out that the usa does the same
19:35 <clarkb> yes we're a week later than the EU
19:35 <fungi> november 3, yep
19:36 <clarkb> keep those date changes in mind and as far as I can tell we should have meetings for the next month and a half or so
19:36 <clarkb> s/date/timezone/
19:36 <clarkb> I'll give it a few more minutes but we can end early if there is nothing else
19:38 <clarkb> thank you for your time everyone! have a productive PTG and we'll see you back here next week
19:39 <clarkb> #endmeeting
19:39 <opendevmeet> Meeting ended Tue Oct 22 19:39:02 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
19:39 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-22-19.00.html
19:39 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-22-19.00.txt
19:39 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-10-22-19.00.log.html
19:39 <fungi> thanks clarkb!
19:39 <frickler> o/
19:39 <clarkb> now we can go find $meal a little early
19:42 <fungi> or $drink depending on where we are
