Tuesday, 2023-09-12

19:01 <clarkb> hello everyone
19:01 <clarkb> meeting time
19:01 <clarkb> #startmeeting infra
19:01 <opendevmeet> Meeting started Tue Sep 12 19:01:12 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01 <opendevmeet> The meeting name has been set to 'infra'
19:01 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V23MFBBCANVK4YAK2IYOV5XNLFY64U3X/ Our Agenda
19:01 <clarkb> #topic Announcements
19:02 <clarkb> I will be out tomorrow
19:02 <fungi> hello to you too!
19:02 <clarkb> I'll be gone a little bit, fishing for hopefully not little fish, so won't really be around keyboards
19:02 <clarkb> but back thursday for the mailman fun
19:03 <clarkb> #topic Mailman 3
19:03 <clarkb> may as well jump straight into that
19:03 <fungi> nothing new here really. planning to migrate airship and kata on thursday
19:03 <clarkb> fungi: The plan is to migrate lists.katacontainers.io and lists.airshipit.org to mailman 3 on september 14
19:03 <clarkb> fungi: and you are starting around 1500 iirc?
19:03 <fungi> yep. will do a preliminary data sync tomorrow to prepare, so the sync during downtime will be shorter
19:04 <fungi> i notified the airship-discuss and kata-dev mailing lists since those are the most active ones for their respective domains
19:05 <clarkb> fungi: anything else we can do to help prepare?
19:05 <fungi> if thursday's maintenance goes well, i'll send similar notifications to the foundation and starlingx-discuss lists immediately thereafter about a similar migration for the lists.openinfra.dev and lists.starlingx.io sites thursday of next week
19:05 <fungi> i don't think we've got anything else that needs doing for mailman at the moment
19:06 <clarkb> excellent. Thank you for getting this together
19:06 <fungi> we did find out that the current lists.katacontainers.io server's ipv4 address was on the spamhaus pbl
19:06 <fungi> i put in a removal request for that in the meantime
19:06 <clarkb> the ipv6 address was listed too but we can't remove it because the entire /64 is listed and we have a /128 on the host?
19:07 <clarkb> that should improve after we migrate hosts anyway
19:07 <fungi> their css blocklist has the ipv6 /64 for that server listed, not much we can do about that yeah
19:07 <fungi> the "problem" traffic from that /64 is coming from another rackspace customer
19:07 <fungi> new server is in a different /64, so not affected at the moment
19:07 <clarkb> anything else mail(man) related?
19:08 <fungi> not from my end
19:08 <frickler> there was a spam mail on service-discuss earlier, just wondering whether that might be mailman3 related
19:09 <clarkb> oh right that is worth calling out
19:09 <clarkb> frickler: I think it is mailman3 related in that fungi suspects that maybe the web interface makes it easier for people to do?
19:09 <fungi> oh, did one get through? i didn't notice, but can delete it from the archive and block that sender
19:09 <frickler> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/message/JLGRB7TNXJK2W3ELRXMOTAK3NH5TNYI3/
19:09 <clarkb> fungi: is that process documented for mm3 yet? might be good to do so if not
19:10 <fungi> thanks, i'll take care of that right after the meeting
19:10 <clarkb> #topic Infra Root Google Account
19:11 <fungi> and yes, the supposition is that making it possible for people to post to mailing lists without a mail client (via their web browser) also makes it easier for spammers to do the same. the recommendation from the mm3 maintainers is to adjust the default list moderation policy so that every new subscriber is moderated by default until a moderator adds them to a whitelist
19:11 <clarkb> I did some investigating and learned things. tl;dr is that this account is used by zuul to talk to gerrit's gerrit
19:11 <clarkb> using the resulting gerrit token doesn't appear to count as something google can track as account activity (probably a good thing imo)
19:12 <clarkb> I logged into the account which I think is sufficient to reset the delete clock on the account. However, from reading elsewhere, some people suggest you take some action on top of logging in, like making a google search or watching a youtube video. I'll plan to log back in and do some searches then
19:12 <clarkb> unfortunately, this is clear as mud when it comes to what actually counts for not getting deleted
19:13 <fungi> worst case we get the gerrit admins to authorize a new account when that one gets deactivated by google
19:13 <clarkb> fungi: yup but that will require a new email address. I'm not sure if google will allow +suffixes
19:14 <clarkb> (my guess is not)
19:14 <corvus> but then we could use snarky email addresses :)
19:15 <corvus> (but still, who needs the trouble; thanks for logging in and searching for "timothee chalamet" or whatever)
19:15 <clarkb> I was going to search for zuul
19:15 <clarkb> make it super meta
19:16 <clarkb> anyway we should keep this account alive if we can and that was all i had
19:16 <corvus> ooh good one
19:16 <clarkb> #topic Server Upgrades
19:16 <clarkb> No updates from me on this and I haven't seen any other movement on it
19:16 <clarkb> #topic Nodepool image upload changes
19:16 <clarkb> We have removed fedora 35 and 36 images from nodepool which reduces the total number of image builds that need uploading
19:17 <clarkb> on top of that we reduced the build and upload rates. Has anyone dug in to see how that is going since we made the change?
19:17 <clarkb> glancing at https://grafana.opendev.org/d/f3089338b3/nodepool3a-dib-status?orgId=1 the builds themselves seem fine, but I'm not sure if we're keeping up with uploads particularly to rax iad
19:18 <corvus> we also added the ability to configure the upload timeout in nodepool; have we configured that in opendev yet?
19:18 <frickler> I don't think so, I missed that change
19:18 <clarkb> corvus: oh yes I put that in the agenda email even. I don't think we have configured it but we probably should
19:18 <corvus> it used to be hard-coded to 6 hours before nodepool switched to sdk
19:18 <corvus> i think that would be a fine value to start with :)
19:19 <clarkb> wfm
19:19 <fungi> yeah, i found in git history spelunking that we lost the 6-hour value when shade was split from nodepool
19:19 <fungi> so it's... been a while
19:19 <clarkb> maybe we set that value then check back in next week to see how rax iad is doing afterwards
19:19 <corvus> #link https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].image-upload-timeout
19:20 <frickler> one thing I notice about the 7d rebuild cycle is that it doesn't seem to work well with two images being kept
19:20 <clarkb> frickler: because we delete the old image after one day?
19:20 <frickler> yes and then upload another
19:21 <frickler> so the two images are just one day apart, not another 7
19:21 <frickler> but maybe that's as designed, kind of
19:22 <frickler> at least that's my interpretation for why we have images 6d and 7d old
19:22 <clarkb> oh hrm
19:22 <clarkb> I guess that still cuts down on total uploads but in a weird way
19:22 <clarkb> I think we can live with that for now
19:22 <fungi> makes sense, maybe we need to set a similarly long image expiration (can that also be set per image?)
19:23 <frickler> or maybe it is only because we started one week ago
19:23 <fungi> oh, possible
19:24 <fungi> so the older of the two was built on the daily rather than weekly cycle
19:25 <clarkb> ok any volunteers to push a change updating the timeout for uploads in each of the provider configs?
19:25 <clarkb> I think that goes in the two builder config files as the launcher nl0X files don't care about upload timeouts
19:25 <frickler> for bookworm we have 2d and 4d, that looks better
19:25 <frickler> I can do the timeout patch tomorrow
19:26 <clarkb> frickler: thanks!
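
[Editor's note: a minimal sketch of what that timeout patch could look like in a nodepool builder config, based on the provider attribute linked above. The provider and image names here are illustrative rather than OpenDev's actual entries; image-upload-timeout is in seconds, so 21600 matches the 6-hour value discussed above.]

    providers:
      - name: rax-iad                     # illustrative provider entry
        region-name: IAD
        cloud: rax
        image-upload-timeout: 21600       # 6 hours, the old hard-coded value
        diskimages:
          - name: ubuntu-jammy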
19:26 <clarkb> #topic Zuul PCRE deprecation
19:27 <fungi> we've merged most of the larger changes impacting our tenants, right? just a bunch of stragglers now?
19:27 <clarkb> wanted to check in on this as I haven't been able to follow the work that has happened super closely. I know changes to fix the regexes are landing and I haven't heard of any problems. But is there anything to be concerned about?
19:27 <corvus> i haven't drafted an email yet for this, but will soon
19:27 <corvus> i haven't heard of any issues
19:27 <fungi> nor have i
19:28 <fungi> at least not after the fixes for config changes on projects with warnings
19:28 <frickler> this list certainly is something to be concerned about https://zuul.opendev.org/t/openstack/config-errors?severity=warning
19:29 <fungi> yes, all the individual projects with one or two regexes in their configs (especially on stable branches) are going to be the long tail
19:30 <fungi> i think we just let them know when the removal is coming and either they fix it or we start removing projects from the tenant config after they've been broken for too long
19:31 <frickler> or we just keep backwards compatibility for once?
19:31 <corvus> we can't keep backwards compat once this is removed from zuul
19:31 <frickler> well that's on zuul then. or opendev can stop running latest zuul
19:32 <corvus> and the point of this migration is to prevent dos attacks against sites like opendev
19:32 <clarkb> the motivation behind the change is actually one that is useful to opendev though, so ya, that
19:32 <corvus> so it would be counterproductive for opendev to do that
19:32 <corvus> is that a serious consideration?  is this group interested in no longer running the latest zuul?  because of this?
19:33 <frickler> how many dos have we seen vs. how much trouble has the queue change brought?
19:33 <corvus> i have to admit i'm a little surprised to hear the suggestion
19:33 <clarkb> frickler: I think the queue changes expose underlying problems in various projects more than they are the actual problem
19:33 <clarkb> it isn't our fault that many pieces of software are largely unmaintained and in those cases I don't think not running CI is a huge deal
19:34 <clarkb> when the projects become maintained again they can update their configs and move forward
19:34 <clarkb> either by fixing existing problems in the older CI setups for the projects or by deleting what was there and starting over
19:34 <fungi> i guess it's a question of whether projects want a ci system that never breaks backwards compatibility, or one that fixes bugs and adds new features
19:34 <clarkb> I agree that this exposes a problem; I just don't see the zuul updates as the problem. They indicate deeper problems
19:35 <corvus> indeed, zuul's configuration error handling was specifically designed for this set of requirements from opendev -- that if there are problems in "leaf node" projects, it doesn't take out the whole system.  so it's also surprising to hear that is not helpful.
19:36 <fungi> also though, zuul relies on other software (notably python and ansible) which have relatively aggressive deprecation and eol schedules, so not breaking backward compatibility would quickly mean running with eol versions of those integral dependencies
19:36 <clarkb> my personal take on it is that openstack should do like opendev and aggressively prune what isn't sustainable. But I know that isn't the current direction of the project
19:37 <fungi> even with jenkins we had upgrades which broke some job configurations and required changes for projects to continue running things in it
19:37 <corvus> i thought opendev's policy of making sure people know about upcoming changes, telling them how to fix problems, and otherwise not being concerned if individual projects aren't interested in maintaining their use of the system is reasonable.  and in those cases, just letting the projects "error out" and then pruning them.  i don't feel like that warrants a change in the ci system.
19:38 <clarkb> corvus: any idea what zuul's deprecation period looks like for this change? I suspect this one might take some time due to the likelihood of impacting existing installs? More so than the ansible switch anyway
19:39 <corvus> i don't think it's been established, but my own feeling is the same, and i would be surprised if anyone had an appetite for it in less than, say, 6 months?
19:39 <corvus> yeah, ansible switches are short, and not by our choice.
19:39 <corvus> these other things can take as long as we want.
19:39 <corvus> (we = zuul community)
19:40 <clarkb> ya so this is going to be less urgent anyway and the bulk of that list is currently tempest (something that should be fixable) and tripleo-ci (something that is going away anyway)
19:40 <clarkb> I feel like this is doable with a reasonable outcome in opendev
19:40 <corvus> so far, the regexes we've looked at are super easy to fix too
19:41 <fungi> the named queue change was a year+ deprecation period, fwiw. but this time we have active reporting of deprecations which we didn't have back then
19:42 <frickler> the errors page and filtering are really helpful indeed
19:42 <fungi> unfortunately, that "long tail" of inactive projects who don't get around to fixing things until after they break (if ever) will be the same no matter how long of a deprecation period we give them
19:43 <corvus> to me, it seems like a reasonable path forward to send out an email announcing it along with suggestions on how to fix (that's my TODO), then if there's some kind of hardship in the future when zuul removes the backwards compat, bring that up.  maybe zuul delays or mitigates it.  or maybe opendev drops some more idle projects/branches.
19:44 <frickler> anyway 6 months is more than I expected, that gives some room for things to happen
19:44 <corvus> oh yeah, when we have a choice, zuul tends to have very long deprecation periods
19:44 <clarkb> ok let's check back in after the email notice is sent and people have had more time to address things
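
[Editor's note: for context, a sketch of the kind of regex fix being landed here. The assumption is that the deprecated constructs are PCRE-only features such as negative lookahead, which Zuul's re2-based matchers do not support, and that the replacement is the negate attribute on the matcher; the job and branch names are illustrative only.]

    # deprecated form: PCRE negative lookahead to exclude stable branches
    - job:
        name: example-job
        branches: ^(?!stable/).*$

    # replacement form: match the excluded branches and negate the matcher
    - job:
        name: example-job
        branches:
          regex: ^stable/.*$
          negate: true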
19:46 <clarkb> #topic Python Container Updates
19:46 <clarkb> The only current open change is for Gerrit. I decided to hold off on doing that with the zuul restart because we had weekend things and simplifying to just zuul streamlined things a bit
19:47 <clarkb> I'm thinking maybe this Friday we land the change and do a short gerrit outage. Fridays tend to be quiet.
19:47 <clarkb> Nodepool and Zuul both seem happy running on bookworm though which is a good sign since they are pretty non-trivial setups (nodepool builders in particular)
19:48 <fungi> i'm going to be mostly afk on friday, but don't let that stop you
19:49 <clarkb> ack
19:49 <clarkb> I suspect we've got a few more services that could be converted too, but I've got to look more closely at codesearch to find them
19:49 <clarkb> probably worth getting gerrit out of the way then starting with a new batch
19:49 <clarkb> #topic Open Discussion
19:49 <clarkb> We upgraded Gitea to 1.20.4
19:50 <clarkb> We removed fedora images as previously noted
19:50 <clarkb> Neither has produced any issues that I've seen
19:50 <clarkb> Anything else?
19:51 <fungi> i didn't have anything
19:52 <corvus> maybe the github stuff?
19:52 <frickler> maybe mention the inmotion rebuild idea
19:52 <clarkb> for github we can work around rate limits (partially) by adding a user token to our zuul config
19:52 <corvus> a couple years ago, we noted that the openstack tenant cannot complete a reconfiguration in zuul in less than an hour because of github api limits
19:53 <clarkb> but for that to work properly we need zuul to fall back from the app token to the user token to anonymous
19:53 <corvus> that problem still persists, and is up to >2 hours now
19:53 <clarkb> there is a change open for that https://review.opendev.org/c/zuul/zuul/+/794688 that needs updates and I've volunteered to revive it
19:53 <frickler> do we know if something changed on the github side?
19:53 <corvus> (and is what caused the full zuul restart to take a while)
19:53 <frickler> or just more load from zuul?
19:53 <clarkb> frickler: I think the changes are in the repos growing more branches which leads to more api requests
19:54 <clarkb> and then instead of hitting the limit once we now hit it multiple times and each time is an hour delay
19:54 <corvus> yeah, more repos + more branches since then, but already 2 years ago it was enough to be a problem
19:54 <clarkb> this is partially why I want to clean up those projects which can be cleaned up and/or are particularly bad about this
19:54 <clarkb> https://review.opendev.org/c/openstack/project-config/+/894814 and child should be safe to merge for that
19:54 <frickler> likely also we didn't do a real restart for a long time, so didn't notice that growth?
19:54 <clarkb> but the better fix is making zuul smarter about its github requests
19:55 <clarkb> frickler: correct we haven't done a full restart without cached data in a long time
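
[Editor's note: a rough sketch of what adding a user token alongside the existing GitHub app credentials in zuul.conf might look like. All values are placeholders, and per the discussion above the actual fallback from app token to user token to anonymous still depends on the open Zuul change linked earlier.]

    [connection github]
    driver=github
    app_id=12345
    app_key=/etc/zuul/github_app_key.pem
    # hypothetical personal access token for requests that would otherwise
    # exhaust the app installation's rate limit; placeholder value
    api_token=ghp_exampletoken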
19:55 <clarkb> for Inmotion the cloud we've got is older and was one of the first cloud as a service deployments they did. Since then they have reportedly improved their tooling and systems and can deploy newer openstack
19:56 <frickler> regarding inmotion, we could update the cluster to a recent openstack version and thus avoid having to fix stuck image deletions. and hopefully have fewer issues with stuck things going forward
19:56 <corvus> this does affect us while not restarting; tenant reconfigurations can take a long time.  it's just zuul is really good at masking that now.
19:56 <clarkb> corvus: ah fun
19:56 <clarkb> frickler: yup there are placement issues in particular that melwitt reports would be fixed if we upgraded
19:56 <clarkb> I can write an email to yuriys indicating we'd like to do that after the openstack release happens and get some guidance from their end before we start
19:57 <clarkb> but I suspect the rough process will be: make notes on how things are set up today (number of nodes and roles of nodes, networking setup for neutron, size of nodes, etc) then replicate that using newer openstack in a new cluster
19:57 <frickler> getting feedback on what versions they support would also be interesting
19:57 <clarkb> one open question is if we have to delete the existing cluster first or if we can have two clusters side by side for a short time. yuriys should be able to give feedback on that
19:58 <frickler> like switch to rocky as base os and have 2023.1? or maybe even 2023.2 by then?
19:58 <clarkb> I think they said they were doing stream
19:58 <clarkb> but ya we can clarify all that too
19:58 <frickler> I think two parallel clusters might be difficult in regard to IP space
19:59 <clarkb> good point particularly since that was a pain point previously. We probably do need to reclaim the old cloud or at least its IPs first
19:59 <fungi> yeah, they were reluctant to give us much ipv4 addressing, and still (last i heard) don't have ipv6 working
20:00 <clarkb> and that is our hour. Thank you for your time today and for all the help keeping these systems running
20:00 <clarkb> I'll take the todo to write the email to yuriys and see what they say
20:00 <frickler> I should point them to my IPv6 guide then ;)
20:00 <clarkb> #endmeeting
20:00 <opendevmeet> Meeting ended Tue Sep 12 20:00:47 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:00 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-12-19.01.html
20:00 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-12-19.01.txt
20:00 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-12-19.01.log.html
20:00 <fungi> thanks!
