Tuesday, 2024-11-05

clarkbOur weekly team meeting will begin in ~5 minutes.18:55
clarkb#startmeeting infra19:00
opendevmeetMeeting started Tue Nov  5 19:00:12 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/TW6HC5TVBZNEAUWX6HBSATZKK7USHXAB/ Our Agenda19:00
clarkb#topic Announcements19:00
clarkbNo big announcements today19:00
clarkbI will be out Thursday morning local time for an appointment then I'm also out Friday and Monday (long holiday weekend with family)19:01
clarkbdid anyone else have anything to announce?19:02
clarkbsounds like no19:03
clarkb#topic Zuul-launcher image builds19:03
clarkbyesterday corvus reported that we ran a job on an image built and uploaded by nodepool-in-zuul19:03
clarkbtrying to find that link now19:03
corvus#link niz build https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de619:04
clarkb#link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de619:04
clarkb#undo19:04
opendevmeetRemoving item from minutes: #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de619:04
corvusstill some work to do on the implementation (we're missing important functionality), but we've got the basics in place19:04
clarkbI assume the bulk of that is on the nodepool in zuul side of things? On the opendev side of things we still need jobs to build the various images we have?19:05
corvusyep, still a bit of niz implementation to do19:05
corvuson the opendev side:19:05
corvus* need to fix the image upload to use the correct expire-after headers19:05
corvus(since all 3 ways we tried to do that failed; so something needs fixing)19:06
corvus* should probably test with a raw image upload for benchmarking19:06
clarkbdo we know if the problem is the header value itself or something in the client? I'm guessing we don't actually know yet?19:06
corvus(but that should wait until after some more niz performance improvements)19:06
corvus* need to add more image build jobs (no rush, but not blocked by anything)19:06
corvusclarkb: tim said that it needs to be set for both the individual parts and the manifest; i take that to mean that the swift CLI client is only setting it on one of those19:07
corvusso one of the fixes is to fix swiftclient to set both19:07
corvusanother fix could be to fix openstacksdk19:07
clarkbgot it19:07
corvusa third fix could be to do something custom in an ansible module19:08
clarkbside note: swift has been a thing for like 13 years? kind of amazing we're the first to hit this issue19:08
corvushonestly don't know the relative merits of those, so step 1 is to evaluate those choices :)19:08
corvusyeah... i guess people want to keep big files?  :)19:08
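For reference, a rough sketch of what setting the expiry on both the parts and the manifest could look like with python-swiftclient, assuming a Static Large Object upload; the endpoint, token, container name, and two-week lifetime here are illustrative placeholders rather than our actual job configuration:

```python
import json
from hashlib import md5

from swiftclient.client import Connection


def read_chunks(path, size=1024 * 1024 * 1024):
    """Yield fixed-size chunks of a local file (1 GiB segments here)."""
    with open(path, "rb") as f:
        while True:
            data = f.read(size)
            if not data:
                return
            yield data


# Pre-authenticated connection with placeholder endpoint/token.
conn = Connection(
    preauthurl="https://swift.example.com/v1/AUTH_images",
    preauthtoken="TOKEN",
)

expiry = {"X-Delete-After": str(14 * 24 * 3600)}  # e.g. two weeks, in seconds
segments = []

for index, chunk in enumerate(read_chunks("image.qcow2")):
    name = "segments/image.qcow2/%08d" % index
    # The expiry header goes on each individual part...
    conn.put_object("images", name, chunk, headers=expiry)
    segments.append({
        "path": "images/" + name,
        "etag": md5(chunk).hexdigest(),
        "size_bytes": len(chunk),
    })

# ...and again on the SLO manifest, since it is a separate object and
# expiring only one of the two leaves the other behind.
conn.put_object(
    "images", "image.qcow2", json.dumps(segments),
    headers=expiry, query_string="multipart-manifest=put",
)
```

Whichever layer gets fixed (swiftclient, openstacksdk, or a custom Ansible module), the point is the same: the header has to ride along with both the segment PUTs and the manifest PUT.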
clarkbanything else?19:09
corvusanyway, that's about it i think19:09
clarkbthank you for the update. It's cool to see it work end to end like that19:09
clarkb#topic Backup Server Pruning19:09
corvusyw; ++19:09
clarkbAs previously mentioned I did some manual cleanup of ethercalc02 backups19:10
clarkbI then documented this process19:10
clarkb#link https://review.opendev.org/c/opendev/system-config/+/933354 Documentation of cleanup process applied to ethercalc02 backups.19:10
clarkbCouple of things to note. The first is that there is still some discussion over whether or not this is the best approach19:10
clarkbin particular ianw has proposed an alternative which automates most of this process19:11
clarkb#link https://review.opendev.org/c/opendev/system-config/+/933700 Automate backup deletions for old servers/services19:11
clarkbReviewing this change is high on my todo list (hoping to look after lunch today) as I do think it would be an improvement on what I've done19:11
clarkbThe other thing to note is that the backup server checks noticed the backup dir for ethercalc02 was abnormal after the cleanup. This is true since the dir was removed but we should probably avoid having it emit the warning in the first place if we're intentionally cleaning things up19:12
clarkboh and fungi ran our normal pruning process on the vexxhost backup server as we were running low on space and we're still working through the process for clearing unneeded content19:12
clarkbthe good news is once we have a process for this and apply it I think we'll free up about 20% of the total disk space on the server which is a good chunk19:13
fungiyep, i did that19:13
fungiworth noting the ethercalc removal didn't really free up any observable space, but we didn't expect it to19:13
clarkbso anyway please review the two changes above and weigh in on whether or not you think either approach is worth pursuing further19:13
clarkb#topic Upgrading old servers19:14
clarkbI don't see any updates on tonyb's mediawiki stack19:14
clarkbtonyb: anything to note? I know there was a pile of comments on that stack so want to make sure that I was clear enough and also not off base on what I wrote19:15
clarkbseparately I decided with Gerrit that I would like to upgrade Gerrit to 3.10 first then figure out the server upgrade. The reason for that is the service update is more straightforward and I think a better candidate for sorting out over the next while of holidays and such19:16
clarkbonce that is done I'll have to look at server replacement19:16
clarkbmore on Gerrit 3.10 later in the meeting19:16
clarkbany other server upgrade notes to make?19:17
clarkb#topic Docker compose plugin with podman service for servers19:19
clarkblet's move on. I didn't expect any updates on this topic since last week but wanted to give a chance for anyone to chime in if there was movement19:19
clarkbok sounds like there wasn't anything new. We can move on19:20
clarkb#topic Enabling mailman3 bounce processing19:20
clarkband now for topics new to this week19:20
clarkbfrickler added this one and the question aiui is basically can we enable mailman3's automatic bounce processing which removes mailing list members after a certain number of bounces19:21
frickleryes, so basically I just stumbled about this when looking at the mailman list admin UI19:21
clarkbLast friday I logged into lists.opendev.org and looked at the options and tried to understand how it works. Basically there are two values we can change: the score threshold that, when exceeded, gets members removed, and the time period a score is valid for before being reset19:21
fricklerand since we do see a large number of bounces in our exim logs, I thought maybe give it a try19:22
clarkbby default the threshold is 5 (I think you get 1 score point for a hard bounce and half a point for a soft bounce; not sure what the difference between hard and soft bounces is) and that value is reset weekly19:22
frickleryes, the default looked pretty sane to me19:23
clarkbone thing that occurred to me is we can enable this on service-discuss first; since we only get about one email a week there, we should avoid removing anyone too quickly while we see if it works as expected19:23
clarkbbut otherwise it does seem like a good idea to remove all the old addresses that are no longer valid19:23
clarkbfungi: corvus: ^ any thoughts or concerns on enabling this on some/all of our lists? I think dmarc/dkim validation was one concern?19:23
frickleriirc one could also set it to report to the list admin instead of auto-remove addresses?19:24
clarkbfrickler: yes, you can also do both things.19:24
clarkbI figured we'd have it alert list owners as well as disabling/removing people19:24
corvuswe definitely should if we can; i'm not familiar with the dkim issues that apparently caused us to turn it off19:24
fungias mentioned earlier when we discussed it, spurious dmarc-related rejections are less of a concern with mm3 because it doesn't just score on bounced posts, it follows up with a verp probe and uses the results of that to increase the bounce score19:24
clarkbgot it so previous concerns are theoretically a non issue in mm3. In that case should I go ahead and enable it on service-discuss and see what happens from there?19:25
fungibasically mm3 tries to avoid disabling subscriptions in cases where the bounce could have been related to the message content rather than an actual delivery problem at the destination19:25
clarkboh also, if you log in to a list's member list page it shows the current score for all the members19:25
clarkbwhich is another way to keep track of how it is processing people19:26
clarkbthen maybe enabling it on busier lists next week if nothing goes haywire?19:26
fungisounds good to me19:26
clarkbcool I'll do that this afternoon for service-discuss too19:27
clarkbanything else on this topic?19:27
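For the record, a hedged sketch of what flipping this on for one list could look like through mailmanclient; the attribute names are assumptions based on Mailman 3's documented bounce settings (the same knobs are reachable from the web UI described above), and the weekly score reset is left at its default:

```python
from mailmanclient import Client

# Placeholder REST endpoint and credentials; the real values live in our
# private mailman configuration.
client = Client("http://localhost:8001/3.1", "restadmin", "restpass")
mlist = client.get_list("service-discuss@lists.opendev.org")

settings = mlist.settings
# Assumed attribute names, per Mailman 3's bounce processing documentation.
settings["process_bounces"] = True
settings["bounce_score_threshold"] = 5.0   # the default threshold discussed above
settings["bounce_notify_owner_on_disable"] = True
settings["bounce_notify_owner_on_removal"] = True
settings.save()
```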
clarkb#topic Failures of insecure registry19:28
clarkbRecently we had a bunch of issues with the insecure-ci-registry not communicating with container image push/pull clients and that resulted in job timeouts for jobs pushing images in particular19:28
clarkb#link https://zuul.opendev.org/t/openstack/builds?job_name=osc-build-image&result=POST_FAILURE&skip=0 Example build failures19:29
fricklerseems to have been resolved by the latest image update? or did I miss something?19:29
clarkbAfter restarting the container a few times I noticed that the tracebacks recorded in the logs were happening in cheroot which had a more recent release than our current container image build. I pushed up a change to zuul-registry which rebuilt with latest cheroot as well as updating other system libs like openssl19:29
clarkband yes that image update seems to have made things happier.19:30
clarkbLooking at logs it seems that some clients try to negotiate with invalid/unsupported tls versions and the registry is now properly rejecting them. But my theory is this wasn't working previously and we'd eat up threads or some other limited resource on the server19:30
clarkbone thing that seemed to back this up is if you listed connections on the server prior to the update there were many tcp connections hanging around19:31
clarkbbut now that isn't the case.19:31
clarkbNo concrete evidence that this was the problem but it seems to be much happier now19:31
clarkbif you notice things going unhappy again please say something and check the container logs for any tracebacks19:31
frickler+119:32
clarkb#topic Gerrit 3.10 Upgrade Planning19:33
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document19:33
clarkbI've been looking at upgrading Gerrit recently19:33
corvuswait19:33
corvus*** We should plan cleanup of some sort for the backing store. Either test the prune command or swap to a new container and cleanup the old one.19:33
clarkboh right19:33
clarkb#undo19:33
opendevmeetRemoving item from minutes: #link https://etherpad.opendev.org/p/gerrit-upgrade-3.1019:33
corvus^ should decide that19:33
clarkbthe other thing that was brought up last week was that our swift container backing the insecure ci registry is quite large19:34
clarkbthere is a zuul registry pruning command but corvus reports the last time we tried it things didn't go well19:34
corvuslast time we ran prune, it manifested weird errors.  i think zuul-registry is better now, but i don't know if whatever problem happened was specifically identified and fixed.19:34
clarkbI think our options are to try the prune command again or instead we can point the registry at a new container then work on deleting the old container in its entirety after the fact19:34
corvusso i think if we want to clean it up, we should prepare to replace the container (ie, be ready to make a new one and swap it in). then run the prune command.  if we see issues, swap to the new container.19:35
funginote that cleaning up the old container is likely to require asking the cloud operators to use admin privileges, since the swift bulk delete has to recursively delete each object in the container and i think is limited to something like 1k (maybe it was 10k?) objects per api call19:35
corvusfungi: well, we could run it in screen for weeks19:35
fungiyeah, i mean, that's also an option19:35
fricklerdoesn't rclone also work for that?19:36
corvusprune would be doing similar, so... run that in screen if you do :)19:36
clarkbsounds like there is probably some value in trying the prune command since that produces more feedback to the registry codebase and our fallback is the same either way (new container)19:36
fungidoing that in rackspace classic for some of our old build log containers would take years of continuous running to complete19:36
corvusregistry has fewer, larger, objects19:36
fungifrickler: i think rclone was one of the suggested tools to use for bulk deletes, but it's still limited by what the swift api supports19:37
corvusso hopefully lower value for "years" :)19:37
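As an aside, the long-running cleanup being described is roughly this shape with python-swiftclient; the endpoint, token, and container name are placeholders, and rclone or the swift CLI's bulk delete would do the same work subject to the same per-request limits:

```python
from swiftclient.client import Connection

# Placeholder pre-authenticated connection; real credentials would come from
# the cloud configuration.
conn = Connection(
    preauthurl="https://swift.example.com/v1/AUTH_registry",
    preauthtoken="TOKEN",
)

container = "old-registry-container"  # hypothetical name for the retired container
marker = ""
deleted = 0

while True:
    # Container listings are paginated, so walk them with a marker until empty.
    _headers, listing = conn.get_container(container, marker=marker, limit=1000)
    if not listing:
        break
    for obj in listing:
        conn.delete_object(container, obj["name"])
        deleted += 1
    marker = listing[-1]["name"]

print("deleted %d objects" % deleted)
conn.delete_container(container)
```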
clarkbneed a volunteer to 1) prep a fallback container in swift 2) start zuul-registry prune command in screen then if there are problems 3) swap registry to fallback container and delete old container one way or another19:37
corvusif we swap, promotes won't work, obviously.  so it's a little disruptive.19:38
clarkbright, people would need to recheck things to regenerate images and upload them again19:38
clarkbso maybe there is a 0) which is announce this to service-announce as a potential outcome (rechecks required)19:38
corvuswell, promote comes from gate, so there's typically no recourse other than "merge another change"19:39
clarkboh right since promote is running post merge19:39
clarkbI should be able to work through this sometime next week. Maybe we announce it today for a wednesday or thursday implementation next week? that gives everyone a week of notice19:39
clarkbbut also happy for someone else to drive it and set a schedule19:40
corvusmight be good to do on a friday/weekend.  but if we're going to prune, no idea when we'd actually see problems come up, if they do.  could be at any time, and if the duration is "months", then that's hard to announce/schedule.19:40
clarkbI see. So maybe a warning that the work is happening and we'd like to know if unexpected results occur19:41
clarkbI should be able to start it a week from this friday but not this friday19:41
corvusyeah, that sounds like it fits our expected outcomes :)19:41
fungiwfm19:41
clarkbok I'll work on a draft email and make sure corvus reviews it for accuracy before sending sometime this week with a plan to perform the work November 1519:42
fungithanks!19:42
clarkband happy for others to help or drive any point along the way :)19:42
corvusclarkb: ++19:42
clarkb#topic Gerrit 3.10 Upgrade Planning19:43
clarkboh shoot I'm just realizing I didn't undo enough items19:43
clarkboh well19:43
clarkbthe records will look weird but I think we can get by19:43
clarkb#link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document19:44
clarkbso I've been looking at Gerrit 3.10 upgrade planning19:44
clarkbper our usual process I've put a document together and tried to identify areas of concern, breaking changes, and then go through and decide what if any impact any of them have on us19:44
clarkbI've also manually tested the upgrade (and we automatically test it too)19:44
clarkboverall this seems like another straightforward change for us but there are a couple of things to note19:44
clarkbone is that gerrit 3.10 can delete old log files itself so I pushed a change to configure that and we can remove our cronjob which does so post upgrade19:45
clarkbanother is that robot comments are deprecated (I believe zuul uses this for inline comments)19:45
clarkband that, because project imports can lead to change number collisions, some searches may now require project+changenumber19:46
clarkbwe haven't imported any projects from another gerrit server so I don't think this last item is a real concern for us19:46
fungiwhat does gerrit recommend switching to, away from robot comments?19:46
clarkbfungi: using a checks api plugin I think19:46
fungiis there a new comment type for that purpose?19:46
fungihuh, i thought the checks api plugin was also deprecated19:47
clarkbsee this is where things get very very confusing19:47
corvusnope that's the "checks plugin" :)19:47
fungid'oh19:47
clarkbthe system where you register jobs with gerrit and it triggers them is deprecated19:47
corvus(which is now a deprecated backend for the checks api)19:47
fungii have a feeling i'm going to regret asking questions now ;)19:47
clarkbthere is a different system, where you just teach gerrit how to query your ci system for info, that isn't deprecated19:48
clarkbbut ya my understanding is that integration with CI systems directly in Gerrit is expected to go through this system, so things like robot comments are deprecated19:48
corvusdoes the checks api plugin system support line comments?19:48
clarkbworst case we send normal comments from zuul19:48
clarkbcorvus: I don't know but I guess that is a good question since that is what robot comments were doing19:48
clarkbbut also the deprecation is listed as a breaking change but as far as I can tell they have only deprecated them not actually broken them19:49
clarkbso this is a problem that doesn't need solving for 3.10 we just need to be aware of it eventually needing a solution19:49
corvussince there isn't currently a zuul checks api implementation, if/when they break we will probably just fall back to regular human-comments19:49
corvusshould be a 2-line change to zuul19:49
clarkb++19:50
clarkbanother thing I wanted to call out is that gerrit made reindexing faster if you start with an old index19:50
fungithere is no robot only zuul19:50
clarkbfor this reason our upgrade process is slightly modified in that document to backup the indexes then copy them back in place if we do a downgrade. In theory this will speed up our downgrade process19:50
corvusthis reads to me like a check result can point to a single line of code and that's it.  https://gerrit.googlesource.com/gerrit/+/master/polygerrit-ui/app/api/checks.ts#43519:50
corvusif that's correct, then i don't think the checks api is an adequate replacement, so falling back to human-comments is the best option for what zuul does.19:51
clarkbI also tested the downgrade on a held node with 3 changes so that process works but I can't really comment on whether or not it is faster19:51
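In concrete terms, the index backup/restore step in the planning document is just a directory copy; a minimal sketch, assuming our usual site layout (the paths are assumptions, adjust to the real review site directory):

```python
import shutil
from pathlib import Path

# Assumed deployment layout; the real site path may differ.
site = Path("/home/gerrit2/review_site")
backup = site / "index.pre-3.10"

# Before the upgrade: keep a copy of the 3.9 indexes around.
shutil.copytree(site / "index", backup)

# Only if we downgrade: put the old indexes back so gerrit can reuse them
# rather than rebuilding from scratch.
# shutil.rmtree(site / "index")
# shutil.copytree(backup, site / "index")
```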
clarkbwe are running out of time so I'll end this topic here19:52
clarkbbut please look over the document and the 3.10 release notes and call out any additional concerns that you feel need investigation19:52
clarkbotherwise I'm looking at December 6, 2024 as the upgrade date as that is after the big thanksgiving holiday19:52
clarkbshould be plenty of time to get prepared by then19:52
clarkb#topic RTD Build Trigger Requests19:52
clarkbReally quickly before we run out of time I wanted to call out the read the docs build trigger api request failures in ansible19:53
clarkbtl;dr seems to be that making the same request with curl or python requests works but having ansible's uri module do so fails with an unauthorized error?19:53
clarkbadditional debugging with a mitm proxy hasn't shown any clear reason for why this is happening19:53
fricklerbut only on some distros like bookworm or noble, not on f41 or trixie19:53
fricklerso very weird indeed19:53
clarkbI think we could probably rewrite the jobs to use python requests or curl or something and just move on19:54
fungisomething is causing http basic auth to return a 403 error from the zuul-executor containers, reproducible from some specific distro platforms but works from others19:54
clarkbbut it is also interesting from a "what is ansible even doing here" perspective that may warrant further debugging19:54
fungiit does seem likely to be related to a shared library which will be fixed when we eventually move zuul images from bookworm to trixie19:55
fungibut hard to say what exactly19:55
clarkbdo we think that just rebuilding the image may correct it too?19:55
clarkb(I don't think so since iirc we do update the platform when we build zuul images)19:55
clarkbanother option to try could be updating to python3.12 in the zuul images19:56
fungialso strange that we didn't change zuul's container base recently, but the problem only started in late september19:56
fricklerwell trixie is only in testing as of now19:56
fungiyes19:56
clarkbcorvus: is python3.12 something that we should consider generally for zuul or is that not in the cards yet?19:56
clarkbfor the container images I mean. We're already testing with 3.1219:56
fricklerand py3.12 on noble seems broken/affected, too, so that wouldn't help19:56
clarkback19:56
corvusnot in a rush to change in general :)19:57
corvus(but happy to if we think it's a good idea)19:57
corvusbut all things equal, i'd maybe just leave it for the next os upgrade?  unless there's something pressing that 3.12 would make better19:57
clarkbmore mitmproxy testing or setting up a server to log headers and then diffing between platforms is probably the next debugging step if anyone wants to do that19:57
clarkbcorvus: I don't think there is a pressing need. unittests on 3.11 are faster too...19:58
fricklereither that or try the curl solution19:58
clarkbya we could just give up for now and switch to a different working implementation then revert when trixie happens19:58
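If we do take the rewrite route, the fallback is small; a sketch with python requests, where the webhook URL and credentials are placeholders for whatever the existing job already passes to the uri module:

```python
import requests

# Hypothetical placeholders; the real URL and credentials come from the
# job's secrets.
url = "https://readthedocs.org/api/v2/webhook/<project-slug>/<webhook-id>/"

resp = requests.post(
    url,
    auth=("<rtd-username>", "<rtd-password>"),  # HTTP basic auth, as today
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```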
fungimy guess is that there was some regression which landed in bookworm 6 weeks ago, or in a change backported into an ansible point release maybe19:58
fricklerjust before we close I'd also want to mention the promote-openstack-manuals-developer failures https://zuul.opendev.org/t/openstack/build/618e3a431a2145afb4344809a9aa84fa/console19:58
fungior a python point release19:58
fricklerno idea yet what's different there compared to promote-openstack-manuals runs19:59
clarkbit's a different target so I suspect that fungi's original idea is the right path19:59
clarkbjust not complete yet? basically need to get the paths and destinations all in alignment?19:59
fungiyeah, but my change didn't seem to fix the error19:59
clarkb(side note: another reason why I think developer doesn't need a different target is that it makes the publications more complicated)20:00
clarkband we are at time. Thank you everyone20:00
clarkbwe can continue discussion in #opendev or on the mailing list as necessary but I don't want to keep anyone here longer than the prescribed hour20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Nov  5 20:00:33 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.log.html20:00
corvusclarkb:  thanks!20:00
fungithanks clarkb!20:01
