Tuesday, 2023-12-19

fungimeeting in about 10 minutes18:50
fungimeeting time!19:00
fungii've volunteered to chair this week since clarkb is feeling under the weather19:00
fungi#startmeeting infra19:00
opendevmeetMeeting started Tue Dec 19 19:00:52 2023 UTC and is due to finish in 60 minutes.  The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot.19:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:00
opendevmeetThe meeting name has been set to 'infra'19:00
fungi#link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting Our Agenda19:01
fungi#topic Announcements19:01
fungi#info The OpenDev weekly meeting is cancelled for the next two weeks owing to lack of availability for many participants; we're skipping December 26 and January 2, resuming as usual on January 9.19:02
fungii'm also skipping the empty boilerplate topics19:03
fungi#topic Upgrading Bionic servers to Focal/Jammy (clarkb)19:03
fungi#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades19:03
tonybmirrors are done and need to be cleaned up19:04
fungithere's a note here in the agenda about updating cnames and cleaning up old servers for mirror replacements19:04
fungiyeah, that19:04
fungiare there open changes for dns still?19:04
tonybI started doing this yesterday but wanted additional eyes as it's my first time19:04
fungior do we just need to delete servers/volumes?19:04
tonybno open changes ATM19:04
tonybthat one19:04
fungiwhat specifically do you want an extra pair of eyes on? happy to help19:04
tonybfungi: the server and volume deletes. I understand the process19:05
fungii'm around to help after the meeting if you want, or you can pick another better time19:06
tonybfungi: after the meeting is good for me19:06
fungisounds good, thanks!19:07
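(As an aside, the cleanup tonyb wanted a second pair of eyes on boils down to a few OpenStack CLI calls; a minimal sketch, with placeholder server and volume names rather than the real ones:)

    # confirm the old mirror is the instance we think it is, and note any attached volumes
    openstack server show old-mirror01.example.opendev.org

    # delete the server first, then any volumes left behind
    openstack server delete old-mirror01.example.opendev.org
    openstack volume list --status available
    openstack volume delete <volume-uuid>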
tonybI've started looking at jvb and meetpad for upgrades19:07
fungithat's a huge help19:07
tonybI'm thinking we'll bring up 3 new servers and then do a cname switch.19:07
fungithat should be fine. there's not a lot of utilization on them at this time of year anyway19:08
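(Once the replacement servers exist, the cname switch can be sanity-checked from any host before the old instances are deleted; a rough sketch, with the record names shown only as illustration:)

    # verify the service name now resolves through the new server's CNAME
    dig +short CNAME meetpad.opendev.org
    dig +short A meetpad.opendev.org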
fungi#topic DIB bionic support (ianw)19:08
tonybI was considering a more complex process for growing the jvb pool but I think that way is unneeded19:08
fungii think this got covered last week. was there any followup we needed to do?19:08
fungiseems like there was some work to fix the dib unit tests?19:09
fungii'm guessing this no longer needed to be on the agenda, just making sure19:09
fungi#topic Python container updates (tonyb)19:10
fungizuul-operator seems to still need addressing19:11
tonybno updates this week19:11
fungino worries, just checking. thanks!19:11
fungiGitea 1.21.1 Upgrade (clarkb)19:11
fungier...19:11
fungi#topic Gitea 1.21.1 Upgrade (clarkb)19:11
tonybYup I intend to update the roles to enhance container logging and then we'll have a good platform to understand the problem19:11
fungiwe were planning to do the gitea upgrade at the beginning of the week, but with lingering concerns after the haproxy incident over the weekend we decided to postpone19:12
tonybI think we're safe to remove the 2 LBs from emergency right?19:12
fungi#link https://review.opendev.org/903805 Downgrade haproxy image from latest to lts19:13
fungithat hasn't been approved yet19:13
fungiso not until it merges at least19:13
tonybAh19:13
fungibut upgrading gitea isn't necessarily blocked on the lb being updated19:14
fungidifferent system, separate software19:14
tonybFair point19:14
fungianyway, with people also vacationing and/or ill this is probably still not a good time for a gitea upgrade. if the situation changes later in the week we can decide to do it then, i think19:15
tonybOkay19:15
fungi#topic Updating Zuul's database server (clarkb)19:15
tonybI suspect there hasn't been much progress this week.19:16
fungii'm not sure where we ended up on this, there was research being done, but also an interest in temporarily dumping/importing on a replacement trove instance in the meantime19:16
fungiwe can revisit next year19:16
fungi#topic Annual Report Season (clarkb)19:17
fungi#link OpenDev's 2023 Annual Report Draft will live here: https://etherpad.opendev.org/p/2023-opendev-annual-report19:17
fungiwe need to get that to the foundation staff coordinator for the overall annual report by the end of the week, so we're about out of time for further edits if you wanted to check it over19:17
fungi#topic EMS discontinuing legacy/consumer hosting plans (fungi)19:18
fungiwe received a notice last week that element matrix services (ems) who hosts our opendev.org matrix homeserver for us is changing their pricing and eliminating the low-end plan we had the foundation paying for19:19
fungithe lowest "discounted" option they're offering us comes in at 10x what we've been paying, and has to be paid a year ahead in one lump sum19:20
fungi(we were paying monthly before)19:20
tonybwhen does the plan need to be purchased?19:20
fungiwe have until 2024-02-07 to upgrade to a business hosting plan or move elsewhere19:20
tonybphew19:20
fungiso ~1.5 months to decide on and execute a course of action19:21
tonybnot a lot of lead time but also some lead time19:21
corvusis the foundation interested in upgrading?19:22
fungii've so far not heard anyone say they're keen to work on deploying a matrix homeserver in our infrastructure, and i looked at a few (4 i think?) other hosting options but they were either as expensive or problematic in various ways, and also we'd have to find time to export/import our configuration and switch dns resulting in some downtime19:22
fungii've talked to the people who hold the pursestrings on the foundation staff and it sounds like we could go ahead and buy a year of business service from ems since we do have several projects utilizing it at this point19:23
fungiwhich would buy us more time to decide if we want to keep doing that or work on our own solution19:24
tonybA *very* quick look implies that hosting our own server wouldn't be too bad. The hardest part will be the export/import and downtime19:24
frickleranother option might be dropping the homeserver and moving the rooms to matrix.org?19:24
tonybI suspect that StarlingX will be the "most impacted"19:24
fricklerI tried running a homeserver privately some time ago but it was very opaque and not debuggable19:25
fungimaybe, but with as many channels as they have they're still not super active on them (i lurk in all their channels and they average a few messages a day tops)19:25
corvusfungi: does the business plan support more than one hostname?  the foundation may be able to eke out some more value if they can use the same plan to host internal comms.19:26
fungilooking at https://element.io/pricing it's not clear to me how that's covered exactly19:28
fungimaybe?19:28
corvusok.  just a thought  :)19:28
frickleralso, is that "discounted" option a special price or does that match the public pricing?19:28
fungithe "discounted" rate they offered us to switch is basically the normal business cloud option on that page, but with a reduced minimum user count of 20 instead of 5019:29
fungianyway, mostly wanted to put this on the agenda so folks know it's coming and have some time to think about options19:30
fungiwe can discuss again in the next meeting which will be roughly a month before the deadline19:31
corvusif the foundation is comfortable paying for it, i'd lean that direction19:31
fungiyeah, i'm feeling similarly. i don't think any of us has a ton of free time for another project just now19:31
corvus(i think there are good reasons to do so, including the value of the service provided compared to our time and materials cost of running it ourselves, and also supporting open source projects)19:32
fungiagreed, and while it's 10x what we've been paying, there wasn't a lot of surprise at a us$1.2k/yr+tax price tag19:32
fungihelps from a budget standpoint that it's due in the beginning of the year19:33
corvustbh i thought the original price was way too low for an org (i'm personally sad that it isn't an option for individuals any more though)19:33
fungiyeah, we went with it mainly because they didn't have any open source community discounts, which we'd have otherwise opted for19:34
fungiany other comments before we move to other topics?19:34
fungi#topic Followup on 20231216 incident (frickler)19:35
fungiyou have the floor19:35
fricklerwell I just collected some things that came to my mind on sunday19:35
fricklerfirst question: Do we want to pin external images like haproxy and only bump them after testing? (Not sure that would've helped for the current issue though)19:36
fungithere's a similar question from corvus in 903805 about whether we want to make the switch from "latest" to "lts" permanent19:36
fungitesting wouldn't have caught it though i don't think19:37
corvusyeah, unlike gerrit/gitea where there's stuff to test, i don't think we're going to catch haproxy bugs in advance19:37
fungibut maybe someone with a higher tolerance for the bleeding edge would have spotted it before latest became lts19:37
fungialso it's not like we use recent/advanced features of haproxy19:38
corvusfor me, i think maybe permanently switching to tracking the lts tag is the right balance of auto-upgrade with hopefully low probability of breakage19:38
fungiso i think the answer is "it depends, but we can be conservative on haproxy and similar components"19:38
fricklerare there other images we consume that could cause similar issues?19:38
fricklerand I'm fine with haproxy:lts as a middle ground for now19:39
fungii don't know off the top of my head, but if someone wants to `git grep :latest$` and do some digging, i'm happy to review a change19:39
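(To make the digging concrete, a sketch of what that grep and a digest check might look like from a system-config checkout; the paths and the haproxy example are illustrative, not an agreed procedure:)

    # find compose files and templates still tracking a floating "latest" tag
    git grep -n ':latest' -- playbooks/ docker/

    # record which digest a tag such as haproxy:lts currently resolves to,
    # so a known-good image can be re-pinned quickly if the tag moves
    docker pull haproxy:lts
    docker image inspect --format '{{index .RepoDigests 0}}' haproxy:lts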
fricklerok, second thing:  Use docker prune less aggressively for easier rollback?19:39
fricklerWe do so for some services, like https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea/tasks/main.yaml#L71-L76, might want to duplicate for all containers? Bump the hold time to 7d?19:39
corvus(also, honestly i think the fact that haproxy is usually rock solid is why it took us so long to diagnose it.  normally checking restart times would be near the top of the list of things to check)19:40
fungifwiw, when i switched gitea-lb's compose file from latest to lts and did a pull, nothing was downloaded, the image was still in the cache19:40
fricklerso "docker prune" doesn't clear the cache?19:41
tonybIIRC it did download on zuul-lb19:41
corvusi sort of wonder what we're trying to achieve there?  resiliency against upstream retroactively changing a tag?  or shortened download times?  or ability to guess what versions we were running by inspecting the cache?19:41
fricklerbeing able to have a fast revert of an image upgrade by just checking "docker images" locally19:42
fungii guess the concern is that we're tracking lts, upstream moves lts to a broken image, and we've pruned the image that lts used to point to so we have to redownload it when we change the tag?19:42
frickleralso I don't think we are having disk space issues that make fast pruning essential19:43
corvusif it's resiliency against changes, i agree that 7d is probably a good idea.  otherwise, 1-3 days is probably okay... if we haven't cared enough to look after 3 days, we can probably check logs or dockerhub, etc...19:43
fungibut also the download times are generally on the order of seconds, not minutes19:43
fungiit might buy us a little time but it's far from the most significant proportion of any related outage19:44
fricklerthe 3d is only in effect for gitea, most other images are pruned immediately after upgrading19:44
corvus(we're probably going to revert to a tag, which, since we can download in a few seconds, means the local cache isn't super important)19:44
fungii'm basically +/-0 on adjusting image cache times. i agree that we can afford the additional storage, but want to make sure it doesn't grow without bound19:45
fricklerok, so leave it at that for now19:45
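(If the longer hold frickler suggested is revisited later, the gitea-style prune generalizes to roughly the following; the 168h value is the 7-day window discussed, and the exact role wiring would differ:)

    # keep superseded images cached for a week instead of pruning immediately
    docker image prune --force --filter "until=168h"

    # quick look at what is still cached locally if a fast revert is needed
    docker images --digests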
fricklernext up: Add timestamps to zuul_reboot.log?19:45
fricklerhttps://opendev.org/opendev/system-config/src/branch/master/playbooks/service-bridge.yaml#L41-L55 Also this is running on Saturdays (weekday: 6), do we want to fix the comment or the dow?19:45
fungialso having too many images cached makes it a pain to dig through when you're looking for a recent-ish one19:45
fungiis zuul_reboot.log a file? on bridge?19:46
frickleryes19:46
fricklerthe code above shows how it is generated19:46
fungiaha, /var/log/ansible/zuul_reboot.log19:47
corvusadding timestamps sounds good to me; i like the current time so i'd say change the comment19:47
fungii have no objection to adding timestamps19:47
fungito, well, anything really19:47
fricklerok, so I'll look into that19:47
fungimore time-based context is preferable to less19:47
fungithanks!19:47
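(A sketch of one way to add the timestamps, piping the playbook output through ts from moreutils; the playbook path here is a guess and ts being available on bridge is an assumption, the real change would land in the cron entry in service-bridge.yaml:)

    # prepend an ISO-8601 timestamp to every line written to the log
    ansible-playbook /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml \
      2>&1 | ts '%Y-%m-%dT%H:%M:%S' >> /var/log/ansible/zuul_reboot.log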
fricklerfinal one: Do we want to document or implement a procedure for rolling back zuul upgrades? Or do we assume that issues can always be fixed in a forward going way?19:47
fungii think the challenge there is that "downgrading" may mean manually undoing database migrations19:48
fricklerlike what would we have done if we hadn't found a fast fix for the timer issue?19:48
fungithe details of which will differ from version to version19:48
fungifrickler: if the solution hadn't been obvious i was going to propose a revert of the offending change and try to get that fast-tracked19:49
fricklerok, what if no clear bad patch had been identified?19:49
frickleranyway, we don't need to discuss this at length right now, more something to think about medium term19:50
fungifor zuul we're in a special situation where several of us are maintainers, so we've generally been able to solve things like that quickly one way or another19:50
corvusi agree with fungi, any downgrade procedure is dependent on the revisions in scope, so i don't think there's a generic process we can do19:50
fungiit'll be an on-the-spot determination as to whether it's less work to roll forward or try to unwind things19:51
fricklerok, time's tight, so let's move to AFS?19:51
fungiyep!19:51
fungi#topic AFS quota issues (frickler)19:51
frickler     mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months19:52
frickler    ubuntu mirrors are also getting close, but we might have another couple of months there19:52
frickler    mirror.centos-stream has seen a steep increase in the last two months and might also run into quota limits soon19:52
frickler    project.zuul with the latest releases is getting close to its tight limit of 1GB (sic); I suggest simply doubling that19:52
fricklerthe last one is easy I think. for openeuler instead of bumping the quota someone may want to look into cleanup options first?19:52
fricklerthe others are more of something to keep an eye on19:53
fungibroken openeuler mirrors that nobody brought to our attention would indicate they're not being used, but yes it's possible we can filter out some things like we do for centos19:53
fungii'll try to figure out based on git blame who added the openeuler mirror and see if they can propose improvements before lunar new year19:53
fricklerwell they are being used in devstack, but being out of date for some weeks does not yet break jobs19:53
corvusfeel free to action me on the zuul thing if no one else wants to do it19:54
fungii agree just bumping the zuul quota is fine19:54
fungi#action fungi Reach out to someone about cleaning up OpenEuler mirroring19:54
fungi#action corvus Increase project.zuul AFS volume quota19:55
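(For the quota bump itself, the usual pattern is to check usage and then raise the -max value; a sketch assuming the volumes are mounted under the standard /afs/openstack.org paths, which may differ from the actual mount points:)

    # current usage and quota in kilobytes
    fs listquota /afs/openstack.org/project/zuul

    # double the quota from 1GB to 2GB (fs setquota takes kilobytes)
    fs setquota /afs/openstack.org/project/zuul -max 2097152

    # the openeuler mirror volume can be inspected the same way before
    # deciding between a quota bump and content cleanup
    fs listquota /afs/openstack.org/mirror/openeuler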
fungilet's move to the last topic19:55
fungi#topic Broken wheel build issues (frickler)19:55
fungicentos8 wheel builds are the only ones that are thoroughly broken currently?19:55
fungii'm pleasantly surprised if so19:56
fricklerfungi: https://review.opendev.org/c/openstack/devstack/+/900143 is the last patch on devstack that a quick search showed me for openeuler19:56
fricklerI think centos9 too?19:56
fungioh, >=819:56
fungigot it19:56
fricklerthough depends on what you mean by thoroughly (8 months vs. just 1)19:57
fungihow much centos testing is going on these days now that tripleo has basically closed up shop?19:57
fungiwondering how much time we're saving by not rebuilding some stuff from sdist in centos jobs19:57
fricklernot sure, I think some usage is still there for special reqs like FIPS19:58
tonybyup FIPS still needs it.19:58
fricklerat least people are still concerned enough about devstack global_venv being broken on centos19:58
fungifor 9 or 8 too?19:58
fricklerboth I think19:59
tonybI can work with ade_lee to verify what *exactly* is needed and fix or prune as appropriate19:59
fungiwe can quite easily stop running the wheel build jobs, if the resources for running those every day is a concern19:59
fungii guess we can discuss options in #opendev since we're past the end of the hour20:00
fricklerthe question is then do we want to keep the outdated builds or purge them too?20:00
fungikeeping them doesn't hurt anything, i don't think20:00
fungiit's just an extra index url for pypi20:00
fungiand either the desired wheel is there or it's not20:00
fungiand if it's not, the job grabs the sdist from pypi and builds it20:00
tonyband storage?20:00
fricklerit does mask build errors that can happen for people who do not have access to those wheels20:00
fricklerif the build was working 6 months ago but has broken since then20:01
fricklerbut anyway, not urgent, we can also continue next year20:01
fungiit's a good point, we considered that as a balance between using job resources continually building the same wheels over and over20:01
fungiand projects forgetting to list the necessary requirements for building the wheels for things they depend on that lack them20:01
fungiokay, let's continue in #opendev. thanks everyone!20:02
fungi#endmeeting20:02
opendevmeetMeeting ended Tue Dec 19 20:02:18 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:02
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.html20:02
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.txt20:02
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.log.html20:02
fricklerthx fungi 20:02
tonybThanks all20:02