clarkb | Our weekly team meeting will begin in ~5 minutes. | 18:55 |
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Nov 5 19:00:12 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/TW6HC5TVBZNEAUWX6HBSATZKK7USHXAB/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | No big announcements today | 19:00 |
clarkb | I will be out Thursday morning local time for an appointment then I'm also out Friday and Monday (long holiday weekend with family) | 19:01 |
clarkb | did anyone else have anything to announce? | 19:02 |
clarkb | sounds like no | 19:03 |
clarkb | #topic Zuul-launcher image builds | 19:03 |
clarkb | yesterday corvus reported that we ran a job on an image that nodepool-in-zuul built and uploaded | 19:03 |
clarkb | trying to find that link now | 19:03 |
corvus | #link niz build https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6 | 19:04 |
clarkb | #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6 | 19:04 |
clarkb | #undo | 19:04 |
opendevmeet | Removing item from minutes: #link https://zuul.opendev.org/t/opendev/build/944e17ac26f841b5a91f07e37ac79de6 | 19:04 |
corvus | still some work to do on the implementation (we're missing important functionality), but we've got the basics in place | 19:04 |
clarkb | I assume the bulk of that is on the nodepool in zuul side of things? On the opendev side of things we still need jobs to build the various images we have? | 19:05 |
corvus | yep, still a bit of niz implementation to do | 19:05 |
corvus | on the opendev side: | 19:05 |
corvus | * need to fix the image upload to use the correct expire-after headers | 19:05 |
corvus | (since all 3 ways we tried to do that failed; so something needs fixing) | 19:06 |
corvus | * should probably test with a raw image upload for benchmarking | 19:06 |
clarkb | do we know if the problem is the header value itself or something in the client? I'm guessing we don't actually know yet? | 19:06 |
corvus | (but that should wait until after some more niz performance improvements) | 19:06 |
corvus | * need to add more image build jobs (no rush, but not blocked by anything) | 19:06 |
corvus | clarkb: tim said that it needs to be set for both the individual parts and the manifest; i take that to mean that swift cli client is only setting it on one of those | 19:07 |
corvus | so one of the fixes is fix swiftclient to set both | 19:07 |
corvus | another fix could be to fix openstacksdk | 19:07 |
clarkb | got it | 19:07 |
corvus | a third fix could be to do something custom in an ansible module | 19:08 |
clarkb | side note: swift has been a thing for like 13 years? kind of amazing we're the first to hit this issue | 19:08 |
corvus | honestly don't know the relative merits of those, so step 1 is to evaluate those choices :) | 19:08 |
corvus | yeah... i guess people want to keep big files? :) | 19:08 |
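To make the header fix concrete, here is a minimal sketch of what "set the expiry on both the parts and the manifest" could look like with python-swiftclient; the auth details, container names, segment size, and expiry value are all illustrative placeholders, not what our jobs actually use:

```python
# Hedged sketch (not OpenDev's actual job code) of setting X-Delete-After on
# both the segment objects and the DLO manifest when uploading a large image.
import swiftclient

conn = swiftclient.Connection(
    authurl='https://example.cloud:5000/v3',  # placeholder auth endpoint
    user='example-user', key='example-secret', auth_version='3')

expiry = {'X-Delete-After': str(14 * 24 * 3600)}  # e.g. expire after 14 days
segment_size = 1024 * 1024 * 1024  # 1GiB segments, illustrative

# Upload each segment with the expiry header set...
with open('image.qcow2', 'rb') as f:
    for i, chunk in enumerate(iter(lambda: f.read(segment_size), b'')):
        conn.put_object('images_segments', 'image.qcow2/%08d' % i,
                        contents=chunk, headers=expiry)

# ...and also set it on the manifest object, since per the discussion above
# the expiry apparently has to be present on both to take effect.
manifest_headers = dict(expiry,
                        **{'X-Object-Manifest': 'images_segments/image.qcow2/'})
conn.put_object('images', 'image.qcow2', contents=b'', headers=manifest_headers)
```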
clarkb | anything else? | 19:09 |
corvus | anyway, that's about it i think | 19:09 |
clarkb | thank you for the update. It's cool to see it work end to end like that | 19:09 |
clarkb | #topic Backup Server Pruning | 19:09 |
corvus | yw; ++ | 19:09 |
clarkb | As previously mentioned I did some manual cleanup of ethercalc02 backups | 19:10 |
clarkb | I then documented this process | 19:10 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/933354 Documentation of cleanup process applied to ethercalc02 backups. | 19:10 |
clarkb | Couple of things to note. The first is that there is still some discussion over whether or not this is the best approach | 19:10 |
clarkb | in particular ianw has proposed an alternative which automates most of this process | 19:11 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/933700 Automate backup deletions for old servers/services | 19:11 |
clarkb | Reviewing this change is high on my todo list (hoping to look after lunch today) as I do think it would be an improvement on what I've done | 19:11 |
clarkb | The other thing to note is that the backup server checks noticed the backup dir for ethercalc02 was abnormal after the cleanup. This is true since the dir was removed but we should probably avoid having it emit the warning in the first place if we're intentionally cleaning things up | 19:12 |
clarkb | oh and fungi ran our normal pruning process on the vexxhost backup server as we were running low on space and we're still working through the process for clearing unneeded content | 19:12 |
clarkb | the good news is once we have a process for this and apply it I think we'll free up about 20% of the total disk space on the server which is a good chunk | 19:13 |
fungi | yep, i did that | 19:13 |
fungi | worth noting the ethercalc removal didn't really free up any observable space, but we didn't expect it to | 19:13 |
clarkb | so anyway please review the two changes above and weigh in on whether or not you think either approach is worth pursuing further | 19:13 |
clarkb | #topic Upgrading old servers | 19:14 |
clarkb | I don't see any updates on tonyb's mediawiki stack | 19:14 |
clarkb | tonyb: anything to note? I know there was a pile of comments on that stack so want to make sure that I was clear enough and also not off base on what I wrote | 19:15 |
clarkb | separately, for Gerrit I decided I would like to upgrade to 3.10 first and then figure out the server upgrade. The reason for that is the service update is more straightforward and I think a better candidate for sorting out over the upcoming stretch of holidays | 19:16 |
clarkb | once that is done I'll have to look at server replacement | 19:16 |
clarkb | more on Gerrit 3.10 later in the meeting | 19:16 |
clarkb | any other server upgrade notes to make? | 19:17 |
clarkb | #topic Docker compose plugin with podman service for servers | 19:19 |
clarkb | let's move on. I didn't expect any updates on this topic since last week but wanted to give a chance for anyone to chime in if there was movement | 19:19 |
clarkb | ok sounds like there wasn't anything new. We can move on | 19:20 |
clarkb | #topic Enabling mailman3 bounce processing | 19:20 |
clarkb | and now for topics new to this week | 19:20 |
clarkb | frickler added this one and the question aiui is basically can we enable mailman3's automatic bounce processing which removes mailing list members after a certain number of bounces | 19:21 |
frickler | yes, so basically I just stumbled across this when looking at the mailman list admin UI | 19:21 |
clarkb | Last Friday I logged into lists.opendev.org and looked at the options and tried to understand how it works. Basically there are two values we can change: the score threshold that, when exceeded, gets a member removed, and the time period a score stays valid before being reset | 19:21 |
frickler | and since we do see a large number of bounces in our exim logs, I thought maybe give it a try | 19:22 |
clarkb | by default the threshold is 5 (I think you get 1 score point for a hard bounce and half a point for a soft bounce; not sure what the difference between hard and soft bounces is) and that value is reset weekly | 19:22 |
frickler | yes, the default looked pretty sane to me | 19:23 |
clarkb | one thing that occurred to me is we can enable this on service-discuss and since we only get about one email a week we should avoid removing anyone too quickly while we see if it works as expected | 19:23 |
clarkb | but otherwise it does seem like a good idea to remove all the old addresses that are no longer valid | 19:23 |
clarkb | fungi: corvus: ^ any thoughts or concerns on enabling this on some/all of our lists? I think dmarc/dkim validation was one concern? | 19:23 |
frickler | iirc one could also set it to report to the list admin instead of auto-remove addresses? | 19:24 |
clarkb | frickler: yes, you can also do both things. | 19:24 |
clarkb | I figured we'd have it alert list owners as well as disabling/removing people | 19:24 |
corvus | we definitely should if we can; i'm not familiar with the dkim issues that apparently caused us to turn it off | 19:24 |
fungi | as mentioned earlier when we discussed it, spurious dmarc-related rejections are less of a concern with mm3 because it doesn't just score on bounced posts, it follows up with a verp probe and uses the results of that to increase the bounce score | 19:24 |
clarkb | got it so previous concerns are theoretically a non issue in mm3. In that case should I go ahead and enable it on service-discuss and see what happens from there? | 19:25 |
fungi | basically mm3 tries to avoid disabling subscriptions in cases where the bounce could have been related to the message content rather than an actual delivery problem at the destination | 19:25 |
clarkb | oh also, if you log in and view a list's member list page, it shows the current score for all the members | 19:25 |
clarkb | which is another way to keep track of how it is processing people | 19:26 |
clarkb | then maybe enabling it on busier lists next week if nothing goes haywire? | 19:26 |
fungi | sounds good to me | 19:26 |
clarkb | cool I'll do that this afternoon for service-discuss too | 19:27 |
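For reference, enabling this through the Mailman 3 REST API looks roughly like the sketch below (a hedged example using mailmanclient; the REST URL and credentials are placeholders, and the exact setting names should be verified against our Mailman version before relying on them):

```python
# Hedged sketch: enable bounce processing on one list via mailmanclient.
# URL/credentials are placeholders; setting names assume Mailman 3.3-era core.
from mailmanclient import Client

client = Client('http://localhost:8001/3.1', 'restadmin', 'restpass')
mlist = client.get_list('service-discuss@lists.opendev.org')

settings = mlist.settings
settings['process_bounces'] = True                 # turn on automatic bounce handling
settings['bounce_score_threshold'] = 5             # the default discussed above
settings['bounce_notify_owner_on_disable'] = True  # alert list owners as well
settings.save()
```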
clarkb | anything else on this topic? | 19:27 |
clarkb | #topic Failures of insecure registry | 19:28 |
clarkb | Recently we had a bunch of issues with the insecure-ci-registry not communicating with container image push/pull clients and that resulted in job timeouts for jobs pushing images in particular | 19:28 |
clarkb | #link https://zuul.opendev.org/t/openstack/builds?job_name=osc-build-image&result=POST_FAILURE&skip=0 Some example build failures | 19:29 |
frickler | seems to have been resolved by the latest image update? or did I miss something? | 19:29 |
clarkb | After restarting the container a few times I noticed that the tracebacks recorded in the logs were happening in cheroot which had a more recent release than our current container image build. I pushed up a change to zuul-registry which rebuilt with latest cheroot as well as updating other system libs like openssl | 19:29 |
clarkb | and yes that image update seems to have made things happier. | 19:30 |
clarkb | Looking at the logs it seems that some clients try to negotiate with invalid/unsupported tls versions and the registry is properly rejecting them now. But my theory is this wasn't working previously and we'd eat up threads or some other limited resource on the server | 19:30 |
clarkb | one thing that seemed to back this up is if you listed connections on the server prior to the update there were many tcp connections hanging around | 19:31 |
clarkb | but now that isn't the case. | 19:31 |
clarkb | No concrete evidence that this was the problem but it seems to be much happier now | 19:31 |
clarkb | if you notice things going unhappy again please say something and check the container logs for any tracebacks | 19:31 |
frickler | +1 | 19:32 |
clarkb | #topic Gerrit 3.10 Upgrade Planning | 19:33 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document | 19:33 |
clarkb | I've been looking at upgrading Gerrit recently | 19:33 |
corvus | wait | 19:33 |
corvus | *** We should plan cleanup of some sort for the backing store. Either test the prune command or swap to a new container and clean up the old one. | 19:33 |
clarkb | oh right | 19:33 |
clarkb | #undo | 19:33 |
opendevmeet | Removing item from minutes: #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 | 19:33 |
corvus | ^ should decide that | 19:33 |
clarkb | the other thing that was brought up last week was that our swift container backing the insecure ci registry is quite large | 19:34 |
clarkb | there is a zuul registry pruning command but corvus reports the last time we tried it things didn't go well | 19:34 |
corvus | last time we ran prune, it manifested weird errors. i think zuul-registry is better now, but i don't know if whatever problem happened was specifically identified and fixed. | 19:34 |
clarkb | I think our options are to try the prune command again or instead we can point the registry at a new container then work on deleting the old container in its entirety after the fact | 19:34 |
corvus | so i think if we want to clean it up, we should prepare to replace the container (ie, be ready to make a new one and swap it in). then run the prune command. if we see issues, swap to the new container. | 19:35 |
fungi | note that cleaning up the old container is likely to require asking the cloud operators to use admin privileges, since the swift bulk delete has to recursively delete each object in the container and i think is limited to something like 1k (maybe it was 10k?) objects per api call | 19:35 |
corvus | fungi: well, we could run it in screen for weeks | 19:35 |
fungi | yeah, i mean, that's also an option | 19:35 |
frickler | doesn't rclone also work for that? | 19:36 |
corvus | prune would be doing similar, so... run that in screen if you do :) | 19:36 |
clarkb | sounds like there is probably some value in trying the prune command since that produces more feedback to the registry codebase and our fallback is the same either way (new container) | 19:36 |
fungi | doing that in rackspace classic for some of our old build log containers would take years of continuous running to complete | 19:36 |
corvus | registry has fewer, larger, objects | 19:36 |
fungi | frickler: i think rclone was one of the suggested tools to use for bulk deletes, but it's still limited by what the swift api supports | 19:37 |
corvus | so hopefully lower value for "years" :) | 19:37 |
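As a rough illustration of the "run it in screen for weeks" option, deleting a retired container object by object might look something like this with python-swiftclient (container name and credentials are placeholders, not our real configuration):

```python
# Hedged sketch: page through a retired Swift container and delete its contents.
import swiftclient

conn = swiftclient.Connection(
    authurl='https://example.cloud:5000/v3',  # placeholder auth endpoint
    user='example-user', key='example-secret', auth_version='3')

container = 'old-intermediate-registry'  # illustrative name only

while True:
    # Listings are paginated, so keep asking until the container is empty.
    _, objects = conn.get_container(container, limit=1000)
    if not objects:
        break
    for obj in objects:
        conn.delete_object(container, obj['name'])

conn.delete_container(container)  # only succeeds once the container is empty
```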
clarkb | need a volunteer to 1) prep a fallback container in swift, 2) start the zuul-registry prune command in screen, then if there are problems 3) swap the registry to the fallback container and delete the old container one way or another | 19:37 |
corvus | if we swap, promotes won't work, obviously. so it's a little disruptive. | 19:38 |
clarkb | right, people would need to recheck things to regenerate images and upload them again | 19:38 |
clarkb | so maybe there is a 0) which is announce this to service-announce as a potential outcome (rechecks required) | 19:38 |
corvus | well, promote comes from gate, so there's typically no recourse other than "merge another change" | 19:39 |
clarkb | oh right since promote is running post merge | 19:39 |
clarkb | I should be able to work through this sometime next week. Maybe we announce it today for a wednesday or thursday implementation next week? that gives everyone a week of notice | 19:39 |
clarkb | but also happy for someone else to drive it and set a schedule | 19:40 |
corvus | might be good to do on a friday/weekend. but if we're going to prune, no idea when we'd actually see problems come up, if they do. could be at any time, and if the duration is "months", then that's hard to announce/schedule. | 19:40 |
clarkb | I see. So maybe a warning that the work is happening and we'd like to know if unexpected results occur | 19:41 |
clarkb | I should be able to start it a week from this friday but not this friday | 19:41 |
corvus | yeah, that sounds like it fits our expected outcomes :) | 19:41 |
fungi | wfm | 19:41 |
clarkb | ok I'll work on a draft email and make sure corvus reviews it for accuracy before sending sometime this week with a plan to perform the work November 15 | 19:42 |
fungi | thanks! | 19:42 |
clarkb | and happy for others to help or drive any point along the way :) | 19:42 |
corvus | clarkb: ++ | 19:42 |
clarkb | #topic Gerrit 3.10 Upgrade Planning | 19:43 |
clarkb | oh shoot I'm just realizing I didn't undo enough items | 19:43 |
clarkb | oh well | 19:43 |
clarkb | the records will look weird but I think we can get by | 19:43 |
clarkb | #link https://etherpad.opendev.org/p/gerrit-upgrade-3.10 Gerrit upgrade planning document | 19:44 |
clarkb | so I've been looking at Gerrit 3.10 upgrade planning | 19:44 |
clarkb | per our usual process I've put a document together, tried to identify areas of concern and breaking changes, and then gone through and decided what impact, if any, each of them has on us | 19:44 |
clarkb | I've also manually tested the upgrade (and we automatically test it too) | 19:44 |
clarkb | overall this seems like another straightforward change for us but there are a couple of things to note | 19:44 |
clarkb | one is that gerrit 3.10 can delete old log files itself so I pushed a change to configure that and we can remove our cronjob which does so post upgrade | 19:45 |
clarkb | another is that robot comments are deprecated (I believe zuul uses this for inline comments) | 19:45 |
clarkb | and that, because project imports can lead to change number collisions, some searches may now require project+changenumber | 19:46 |
clarkb | we haven't imported any projects from another gerrit server so I don't think this last item is a real concern for us | 19:46 |
fungi | what does gerrit recommend switching to, away from robot comments? | 19:46 |
clarkb | fungi: using a checks api plugin I think | 19:46 |
fungi | is there a new comment type for that purpose? | 19:46 |
fungi | huh, i thought the checks api plugin was also deprecated | 19:47 |
clarkb | see this is where things get very very confusing | 19:47 |
corvus | nope that's the "checks plugin" :) | 19:47 |
fungi | d'oh | 19:47 |
clarkb | the system where you register jobs with gerrit and it triggers them is deprecated | 19:47 |
corvus | (which is now a deprecated backend for the checks api) | 19:47 |
fungi | i have a feeling i'm going to regret asking questions now ;) | 19:47 |
clarkb | there is a different system, where you just teach gerrit how to query your ci system for info, that isn't | 19:48 |
clarkb | but ya my understanding is that CI system integrations directly in Gerrit are expected to go through this system so things like robot comments are deprecated | 19:48 |
corvus | does the checks api plugin system support line comments? | 19:48 |
clarkb | worst case we send normal comments from zuul | 19:48 |
clarkb | corvus: I don't know but I guess that is a good question since that is what robot comments were doing | 19:48 |
clarkb | also, the deprecation is listed as a breaking change, but as far as I can tell they have only deprecated robot comments, not actually broken them | 19:49 |
clarkb | so this is a problem that doesn't need solving for 3.10 we just need to be aware of it eventually needing a solution | 19:49 |
corvus | since there isn't currently a zuul checks api implementation, if/when they break we will probably just fall back to regular human-comments | 19:49 |
corvus | should be a 2-line change to zuul | 19:49 |
clarkb | ++ | 19:50 |
clarkb | another thing I wanted to call out is that gerrit made reindexing faster if you start with an old index | 19:50 |
fungi | there is no robot only zuul | 19:50 |
clarkb | for this reason our upgrade process is slightly modified in that document to back up the indexes then copy them back in place if we do a downgrade. In theory this will speed up our downgrade process | 19:50 |
corvus | this reads to me like a check result can point to a single line of code and that's it. https://gerrit.googlesource.com/gerrit/+/master/polygerrit-ui/app/api/checks.ts#435 | 19:50 |
corvus | if that's correct, then i don't think the checks api is an adequate replacement, so falling back to human-comments is the best option for what zuul does. | 19:51 |
clarkb | I also tested the downgrade on a held node with 3 changes so that process works, but I can't really comment on whether or not it is faster | 19:51 |
clarkb | we are running out of time so I'll end this topic here | 19:52 |
clarkb | but please look over the document and the 3.10 release notes and call out any additional concerns that you feel need investigation | 19:52 |
clarkb | otherwise I'm looking at December 6, 2024 as the upgrade date as that is after the big thanksgiving holiday | 19:52 |
clarkb | should be plenty of time to get prepared by then | 19:52 |
clarkb | #topic RTD Build Trigger Requests | 19:52 |
clarkb | Really quickly before we run out of time I wanted to call out the read the docs build trigger api request failures in ansible | 19:53 |
clarkb | tl;dr seems to be that making the same request with curl or python requests works but having ansible's uri module do so fails with an unauthorized error? | 19:53 |
clarkb | additional debugging with a mitm proxy hasn't shown any clear reason for why this is happening | 19:53 |
frickler | but only on some distros like bookworm or noble, not on f41 or trixie | 19:53 |
frickler | so very weird indeed | 19:53 |
clarkb | I think we could probably rewrite the jobs to use python requests or curl or something and just move on | 19:54 |
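A sketch of what the python-requests version of that trigger could look like (the webhook URL and credentials below are placeholders, not the real values from the job's secrets):

```python
# Hedged sketch of the requests call that succeeds where the Ansible uri task fails.
import requests

resp = requests.post(
    'https://readthedocs.org/api/v2/webhook/<project-slug>/<webhook-id>/',  # placeholder endpoint
    auth=('<rtd-username>', '<rtd-password>'),  # HTTP basic auth, as the uri task does
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code, resp.text)
```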
fungi | something is causing http basic auth to return a 403 error from the zuul-executor containers, reproducible from some specific distro platforms but working from others | 19:54 |
clarkb | but it is also interesting from a "what is ansible even doing here" perspective that may warrant further debugging | 19:54 |
fungi | it does seem likely to be related to a shared library which will be fixed when we eventually move zuul images from bookworm to trixie | 19:55 |
fungi | but hard to say what exactly | 19:55 |
clarkb | do we think that just rebuilding the image may correct it too? | 19:55 |
clarkb | (I don't think so since iirc we do update the platform when we build zuul images) | 19:55 |
clarkb | another option to try could be updating to python3.12 in the zuul images | 19:56 |
fungi | also strange that we didn't change zuul's container base recently, but the problem only started in late september | 19:56 |
frickler | well trixie is only in testing as of now | 19:56 |
fungi | yes | 19:56 |
clarkb | corvus: is python3.12 something that we should consider generally for zuul or is that not in the cards yet? | 19:56 |
clarkb | for the container images I mean. We're already testing with 3.12 | 19:56 |
frickler | and py3.12 on noble seems broken/affected, too, so that wouldn't help | 19:56 |
clarkb | ack | 19:56 |
corvus | not in a rush to change in general :) | 19:57 |
corvus | (but happy to if we think it's a good idea) | 19:57 |
corvus | but all things equal, i'd maybe just leave it for the next os upgrade? unless there's something pressing that 3.12 would make better | 19:57 |
clarkb | more mitmproxy testing or setting up a server to log headers and then diffing between platforms is probably the next debugging step if anyone wants to do that | 19:57 |
clarkb | corvus: I don't think there is a pressing need. unittests on 3.11 are faster too... | 19:58 |
frickler | either that or try the curl solution | 19:58 |
clarkb | ya we could just give up for now and switch to a working different implementation then revert when trixie happens | 19:58 |
fungi | my guess is that there was some regression which landed in bookworm 6 weeks ago, or in a change backported into an ansible point release maybe | 19:58 |
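One way to do that header comparison is a throwaway endpoint like the sketch below: point both the Ansible uri task and plain python-requests at it from the different platforms, then diff what each client actually sends (the port and response are arbitrary choices here, not part of any existing setup):

```python
# Hedged sketch of a tiny header-logging server for debugging the uri failures.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HeaderLogger(BaseHTTPRequestHandler):
    def do_POST(self):
        # Dump every request header so runs from different clients can be diffed.
        for name, value in self.headers.items():
            print(f'{name}: {value}')
        print('---')
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'ok\n')

    do_GET = do_POST  # log GETs the same way

HTTPServer(('0.0.0.0', 8080), HeaderLogger).serve_forever()
```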
frickler | just before we close I'd also want to mention the promote-openstack-manuals-developer failures https://zuul.opendev.org/t/openstack/build/618e3a431a2145afb4344809a9aa84fa/console | 19:58 |
fungi | or a python point release | 19:58 |
frickler | no idea yet what's different there compared to promote-openstack-manuals runs | 19:59 |
clarkb | it's a different target so I suspect that fungi's original idea is the right path | 19:59 |
clarkb | just not complete yet? basically need to get the paths and destinations all in alignment? | 19:59 |
fungi | yeah, but my change didn't seem to fix the error | 19:59 |
clarkb | (side note: another reason why I think developer doesn't need a different target is that it makes the publications more complicated) | 20:00 |
clarkb | and we are at time. Thank you everyone | 20:00 |
clarkb | we can continue discussion in #opendev or on the mailing list as necessary but I don't want to keep anyone here longer than the prescribed hour | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Nov 5 20:00:33 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-11-05-19.00.log.html | 20:00 |
corvus | clarkb: thanks! | 20:00 |
fungi | thanks clarkb! | 20:01 |