clarkb | hello everyone | 19:01 |
---|---|---|
clarkb | meeting time | 19:01 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Sep 12 19:01:12 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/V23MFBBCANVK4YAK2IYOV5XNLFY64U3X/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | I will be out tomorrow | 19:02 |
fungi | hello to you too! | 19:02 |
clarkb | I'll be on a little boat fishing for hopefully not little fish so won't really be around keyboards | 19:02 |
clarkb | but back thursday for the mailman fun | 19:02 |
clarkb | #topic Mailman 3 | 19:03 |
clarkb | may as well jump straight into that | 19:03 |
fungi | nothing new here really. planning to migrate airship and kata on thursday | 19:03 |
clarkb | fungi: The plan is to migrate lists.katacontainers.io and lists.airshipit.org to mailman 3 on september 14 | 19:03 |
clarkb | fungi: and you are starting around 1500 iirc? | 19:03 |
fungi | yep. will do a preliminary data sync tomorrow to prepare, so the sync during downtime will be shorter | 19:03 |
fungi | i notified the airship-discuss and kata-dev mailing lists since those are the most active ones for their respective domains | 19:04 |
clarkb | fungi: anything else we can do to help prepare? | 19:05 |
fungi | if thursday's maintenance goes well, i'll send similar notifications to the foundation and starlingx-discuss lists immediately thereafter about a similar migration for the lists.openinfra.dev and lists.starlingx.io sites thursday of next week | 19:05 |
fungi | i don't think we've got anything else that needs doing for mailman at the moment | 19:05 |
clarkb | excellent. Thank you for getting this together | 19:06 |
fungi | we did find out that the current lists.katacontainers.io server's ipv4 address was on the spamhaus pbl | 19:06 |
fungi | i put in a removal request for that in the meantime | 19:06 |
clarkb | the ipv6 address was listed too but we can't remove it because the entire /64 is listed and we have a /128 on the host? | 19:06 |
clarkb | that should improve after we migrate hosts anyway | 19:07 |
fungi | their css blocklist has the ipv6 /64 for that server listed, not much we can do about that yeah | 19:07 |
fungi | the "problem" traffic from that /64 is coming from another rackspace customer | 19:07 |
fungi | new server is in a different /64, so not affected at the moment | 19:07 |
clarkb | anything else mail(man) related? | 19:07 |
fungi | not from my end | 19:08 |
frickler | there was a spam mail on service-discuss earlier, just wondering whether that might be mailman3 related | 19:08 |
clarkb | oh right that is worth calling out | 19:09 |
clarkb | frickler: I think it is mailman3 related in that fungi suspects that maybe the web interface makes it easier for people to do? | 19:09 |
fungi | oh, did one get through? i didn't notice, but can delete it from the archive and block that sender | 19:09 |
frickler | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/message/JLGRB7TNXJK2W3ELRXMOTAK3NH5TNYI3/ | 19:09 |
clarkb | fungi: is that process documented for mm3 yet? might be good to do so if not | 19:09 |
fungi | thanks, i'll take care of that right after the meeting | 19:10 |
clarkb | #topic Infra Root Google Account | 19:10 |
fungi | and yes, the supposition is that making it possible for people to post to mailing lists without a mail client (via their web browser) also makes it easier for spammers to do the same. the recommendation from the mm3 maintainers is to adjust the default list moderation policy so that every new subscriber is moderated by default until a moderator adds them to a whitelist | 19:11 |
clarkb | I did some investigating and learned things. tl;dr is that this account is used by zuul to talk to gerrit's gerrit | 19:11 |
clarkb | using the resulting gerrit token doesn't appear to count as something google can track as account activity (probably a good thing imo) | 19:11 |
clarkb | I logged into the account which I think is sufficient to reset the delete clock on the account. However, after reading elsewhere some people are suggesting you do some action on top of logging in like making a google search or watching a youtube video. I'll plan to log back in and do some searches then | 19:12 |
clarkb | unfortunately, this is clear as mud when it comes to what actually counts for not getting deleted | 19:12 |
fungi | worst case we get the gerrit admins to authorize a new account when that one gets deactivated by google | 19:13 |
clarkb | fungi: yup but that will require a new email address. I'm not sure if google will allow +suffixes | 19:13 |
clarkb | (my guess is not) | 19:14 |
corvus | but then we could use snarky email addresses :) | 19:14 |
corvus | (but still, who needs the trouble; thanks for logging in and searching for "timothee chalamet" or whatever) | 19:15 |
clarkb | I was going to search for zuul | 19:15 |
clarkb | make it super meta | 19:15 |
clarkb | anyway we should keep this account alive if we can and that was all i had | 19:16 |
corvus | ooh good one | 19:16 |
clarkb | #topic Server Upgrades | 19:16 |
clarkb | No updates from me on this and I haven't seen any other movement on it | 19:16 |
clarkb | #topic Nodepool image upload changes | 19:16 |
clarkb | We have removed fedora 35 and 36 images from nodepool which reduces the total number of image builds that need uploading | 19:16 |
clarkb | on top of that we reduced the build and upload rates. Has anyone dug in to see how that is going since we made the change? | 19:17 |
clarkb | glancing at https://grafana.opendev.org/d/f3089338b3/nodepool3a-dib-status?orgId=1 the builds themselves seem fine, but I'm not sure if we're keeping up with uploads particularly to rax iad | 19:17 |
corvus | we also added the ability to configure the upload timeout in nodepool; have we configured that in opendev yet? | 19:18 |
frickler | I don't think so, I missed that change | 19:18 |
clarkb | corvus: oh yes I put that in the agenda email even. I don't think we have configured it but we probably should | 19:18 |
corvus | it used to be hard-coded to 6 hours before nodepool switched to sdk | 19:18 |
corvus | i think that would be a fine value to start with :) | 19:18 |
clarkb | wfm | 19:19 |
fungi | yeah, i found in git history spelunking that we lost the 6-hour value when shade was split from nodepool | 19:19 |
fungi | so it's... been a while | 19:19 |
clarkb | maybe we set that value then check back in next week to see how rax iad is doing afterwards | 19:19 |
corvus | #link https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].image-upload-timeout | 19:19 |
frickler | one thing I notice about the 7d rebuild cycle is that it doesn't seem to work well with two images being kept | 19:20 |
clarkb | frickler: because we delete the old image after one day? | 19:20 |
frickler | yes and then upload another | 19:20 |
frickler | so the two images are just one day apart, not another 7 | 19:21 |
frickler | but maybe that's as designed, kind of | 19:21 |
frickler | at least that's my interpretation for why we have images 6d and 7d old | 19:22 |
clarkb | oh hrm | 19:22 |
clarkb | I guess that still cuts down on total uploads but in a weird way | 19:22 |
clarkb | I think we can live with that for now | 19:22 |
fungi | makes sense, maybe we need to set a similarly long image expiration (can that also be set per image?) | 19:22 |
frickler | or maybe it is only because we started one week ago | 19:23 |
fungi | oh, possible | 19:23 |
fungi | so the older of the two was built on the daily rather than weekly cycle | 19:24 |
clarkb | ok any volunteers to push a change updating the timeout for uploads in each of the provider configs? | 19:25 |
clarkb | I think that goes in the two builder config files as the launcher nl0X files don't care about upload timeouts | 19:25 |
frickler | for bookworm we have 2d and 4d, that looks better | 19:25 |
frickler | I can do the timeout patch tomorrow | 19:25 |
clarkb | frickler: thanks! | 19:26 |
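For context on what that timeout patch could look like: a minimal sketch of setting image-upload-timeout in a builder provider config, assuming the attribute takes a value in seconds (21600 would restore the old 6-hour limit discussed above) and using illustrative provider and image names rather than the actual opendev builder files.

```yaml
# Hypothetical excerpt of a nodepool builder provider config.
# Provider and image names are illustrative; the real opendev files differ.
providers:
  - name: rax-iad
    region-name: IAD
    # Assumed to be in seconds; 21600 restores the pre-sdk 6-hour upload
    # limit (see providers.[openstack].image-upload-timeout in the docs).
    image-upload-timeout: 21600
    diskimages:
      - name: ubuntu-jammy
```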
clarkb | #topic Zuul PCRE deprecation | 19:26 |
fungi | we've merged most of the larger changes impacting our tenants, right? just a bunch of stragglers now? | 19:27 |
clarkb | wanted to check in on this as I haven't been able to follow the work that has happened super closely. I know changes to fix the regexes are landing and I haven't heard of any problems. But is there anything to be concerned about? | 19:27 |
corvus | i haven't drafted an email yet for this, but will soon | 19:27 |
corvus | i haven't heard of any issues | 19:27 |
fungi | nor have i | 19:27 |
fungi | at least not after the fixes for config changes on projects with warnings | 19:28 |
frickler | this list certainly is something to be concerned about https://zuul.opendev.org/t/openstack/config-errors?severity=warning | 19:28 |
fungi | yes, all the individual projects with one or two regexes in their configs (especially on stable branches) are going to be the long tail | 19:29 |
fungi | i think we just let them know when the removal is coming and either they fix it or we start removing projects from the tenant config after they've been broken for too long | 19:30 |
frickler | or we just keep backwards compatibility for once? | 19:31 |
corvus | we can't keep backwards compat once this is removed from zuul | 19:31 |
frickler | well that's on zuul then. or opendev can stop running latest zuul | 19:31 |
corvus | and the point of this migration is to prevent dos attacks against sites like opendev | 19:32 |
clarkb | the motivation behind the change is actually one that is useful to opendev though, yeah, that | 19:32 |
corvus | so it would be counter productive for opendev to do that | 19:32 |
corvus | is that a serious consideration? is this group interested in no longer running the latest zuul? because of this? | 19:32 |
frickler | how many dos have we seen vs. how much trouble has the queue change brought? | 19:33 |
corvus | i have to admit i'm a little surprised to hear the suggestion | 19:33 |
clarkb | frickler: I think the queue changes expose underlying problems in various projects more than they are the actual problem | 19:33 |
clarkb | it isn't our fault that many pieces of software are largely unmaintained and in those cases I don't think not running CI is a huge deal | 19:33 |
clarkb | when the projects become maintained again they can update their configs and move forward | 19:34 |
clarkb | either to fix existing problems in the older CI setups for the projects or by deleting what was there and starting over | 19:34 |
fungi | i guess it's a question of whether projects want a ci system that never breaks backwards compatibility, or one that fixes bugs and adds new features | 19:34 |
clarkb | I agree that this exposes a problem I just don't see the zuul updates as the problem. They indicate deeper problems | 19:34 |
corvus | indeed, zuul's configuration error handling was specifically designed for this set of requirements from opendev -- that if there are problems in "leaf node" projects, it doesn't take out the whole system. so it's also surprising to hear that is not helpful. | 19:35 |
fungi | also though, zuul relies on other software (notably python and ansible) which have relatively aggressive deprecation and eol schedules, so not breaking backward compatibility would quickly mean running with eol versions of those integral dependencies | 19:36 |
clarkb | my personal take on it is that openstack should do like opendev and aggressively prune what isn't sustainable. But I know that isn't the current direction of the project | 19:36 |
fungi | even with jenkins we had upgrades which broke some job configurations and required changes for projects to continue running things in it | 19:37 |
corvus | i thought opendev's policy of making sure people know about upcoming changes, telling them how to fix problems, and otherwise not being concerned if individual projects aren't interested in maintaining their use of the system is reasonable. and in those cases, just letting the projects "error out" and then pruning them. i don't feel like that warrants a change in the ci system. | 19:37 |
clarkb | corvus: any idea what zuul's deprecation period looks like for this change? I suspect this one might take some time due to the likelihood of impacting existing installs? More so than the ansible switch anyway | 19:38 |
corvus | i don't think it's been established, but my own feeling is the same, and i would be surprised if anyone had an appetite for it in less than, say, 6 months? | 19:39 |
corvus | yeah, ansible switches are short, and not by our choice. | 19:39 |
corvus | these other things can take as long as we want. | 19:39 |
corvus | (we = zuul community) | 19:39 |
clarkb | ya so this is going to be less urgent anyway and the bulk of that list is currently tempest (something that should be fixable) and tripleo-ci (something that is going away anyway) | 19:40 |
clarkb | I feel like this is doable with a reasonable outcome in opendev | 19:40 |
corvus | so far, the regexes we've looked at are super easy to fix too | 19:40 |
fungi | the named queue change was a year+ deprecation period, fwiw. but this time we have active reporting of deprecations which we didn't have back then | 19:41 |
frickler | the errors page and filtering is really helpful indeed | 19:42 |
fungi | unfortunately, that "long tail" of inactive projects who don't get around to fixing things until after they break (if ever) will be the same no matter how long of a deprecation period we give them | 19:42 |
corvus | to me, it seems like a reasonable path forward to send out an email announcing it along with suggestions on how to fix (that's my TODO), then if there's some kind of hardship in the future when zuul removes the backwards compat, bring that up. maybe zuul delays or mitigates it. or maybe opendev drops some more idle projects/branches. | 19:43 |
frickler | anyway 6 months is more than I expected, that gives some room for things to happen | 19:44 |
corvus | oh yeah, when we have a choice, zuul tends to have very long deprecation periods | 19:44 |
clarkb | ok lets check back in after the email notice is sent and people have had more time to address things | 19:44 |
clarkb | #topic Python Container Updates | 19:46 |
clarkb | The only current open change is for Gerrit. I decided to hold off on doing that with the zuul restart because we had weekend things and simplifying down to just zuul streamlined things a bit | 19:46 |
clarkb | I'm thinking maybe this Friday we land the change and do a short gerrit outage. Fridays tend to be quiet. | 19:47 |
clarkb | Nodepool and Zuul both seem happy running on bookworm though which is a good sign since they are pretty non trivial setups (nodepool builders in particular) | 19:47 |
fungi | i'm going to be mostly afk on friday, but don't let that stop you | 19:48 |
clarkb | ack | 19:49 |
clarkb | I suspect we've got a few more services that could be converted too, but I've got to look more closely at codesearch to find them | 19:49 |
clarkb | probably worth getting gerrit out of the way then starting with a new batch | 19:49 |
clarkb | #topic Open Discussion | 19:49 |
clarkb | We upgraded Gitea to 1.20.4 | 19:49 |
clarkb | We removed fedora images as previously noted | 19:50 |
clarkb | Neither have produced any issues that I've seen | 19:50 |
clarkb | Anything else? | 19:50 |
fungi | i didn't have anything | 19:51 |
corvus | maybe the github stuff? | 19:52 |
frickler | maybe mention the inmotion rebuild idea | 19:52 |
clarkb | for github we can work around rate limits (partially) by adding a user token to our zuul config | 19:52 |
corvus | couple years ago, we noted that the openstack tenant can not complete a reconfiguration in zuul in less than an hour because of github api limits | 19:52 |
clarkb | but for that to work properly we need zuul to fallback from the app token to the user token to anonymous | 19:53 |
corvus | that problem still persists, and is up to >2 hours now | 19:53 |
clarkb | there is a change open for that https://review.opendev.org/c/zuul/zuul/+/794688 that needs updates and I've volunteered to revive it | 19:53 |
frickler | do we know if something changed on the github side? | 19:53 |
corvus | (and is what caused the full zuul restart to take a while) | 19:53 |
frickler | or just more load from zuul? | 19:53 |
clarkb | frickler: I think the changes are in the repos growing more branches which leads to more api requests | 19:53 |
clarkb | and then instead of hitting the limit once we now hit it multiple times and each time is an hour delay | 19:54 |
corvus | yeah, more repos + more branches since then, but already 2 years ago it was enough to be a problem | 19:54 |
clarkb | this is partially why I want to clean up those projects which can be cleaned up and/or are particularly bad about this | 19:54 |
clarkb | https://review.opendev.org/c/openstack/project-config/+/894814 and child should be safe to merge for that | 19:54 |
frickler | likely also we didn't do a real restart for a long time, so didn't notice that growth? | 19:54 |
clarkb | but the better fix is making zuul smarter about its github requests | 19:54 |
clarkb | frickler: correct we haven't done a full restart without cached data in a long time | 19:55 |
clarkb | for Inmotion the cloud we've got is older and was one of the first cloud as a service deployments they did. Since then they have reportedly improved their tooling and systems and can deploy newer openstack | 19:55 |
frickler | regarding inmotion, we could update the cluster to a recent openstack version and thus avoid having to fix stuck image deletions. and hopefully have less issues with stuck things going forward | 19:56 |
corvus | this does affect us while not restarting; tenant reconfigurations can take a long time. it's just zuul is really good at masking that now. | 19:56 |
clarkb | corvus: ah fun | 19:56 |
clarkb | frickler: yup there are placement issues in particular that melwitt reports would be fixed if we upgraded | 19:56 |
clarkb | I can write an email to yuriys indicating we'd like to do that after the openstack release happens and get some guidance from their end before we start | 19:56 |
clarkb | but I suspect the rough process will be: make notes on how things are set up today (number of nodes and their roles, networking setup for neutron, size of nodes, etc) then replicate that using newer openstack in a new cluster | 19:57 |
frickler | getting feedback what versions they support would also be interesting | 19:57 |
clarkb | one open question is if we have to delete the existing cluster first or if we can have two clusters side by side for a short time. yuriys should be able to give feedback on that | 19:57 |
frickler | like switch to rocky as base os and have 2023.1? or maybe even 2023.2 by then? | 19:58 |
clarkb | I think they said they were doing stream | 19:58 |
clarkb | but ya we can clarify all that too | 19:58 |
frickler | I think two parallel clusters might be difficult in regard to IP space | 19:58 |
clarkb | good point particularly since that was a pain point previously. We probably do need to reclaim the old cloud or at least its IPs first | 19:59 |
fungi | yeah, they were reluctant to give us much ipv4 addressing, and still (last i heard) don't have ipv6 working | 19:59 |
clarkb | and that is our hour. Thank you for your time today and for all the help keeping these systems running | 20:00 |
clarkb | I'll take the todo to write the email to yuriys and see what they say | 20:00 |
frickler | I should point them to my IPv6 guide then ;) | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Sep 12 20:00:47 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-12-19.01.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-12-19.01.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-12-19.01.log.html | 20:00 |
fungi | thanks! | 20:00 |