Tuesday, 2021-12-14

clarkbAnyone else here for the meeting?19:00
fungiyo!19:00
fricklero/19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Dec 14 19:01:01 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-December/000309.html Our Agenda19:01
ianwo/19:01
clarkb#topic Announcements19:01
clarkbWe'll cancel next week's meeting and the meeting on January 4, 2022. I'll see on the 27th what the temperature is for having a meeting on the 28th. Though I half expect no one to be around for that one either :)19:02
clarkbHopefully we can all enjoy a bit of rest and holidays and so on19:02
clarkb#topic Actions from last meeting19:03
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-12-07-19.01.txt minutes from last meeting19:03
clarkbThere were no actions recorded so we'll just dive straight in19:03
clarkb#topic Topics19:03
clarkb#topic Log4j vulnerability19:03
clarkbLast week on Thursday afternoon (my time) a 0day RCE in a popular java library was disclosed19:04
clarkbThe vast majority of the java applications that we care about either use an older version of the library or don't use it at all, and were not vulnerable19:04
clarkbThe exception was meetpad, which we shut down in response19:05
fungithough i was surprised to realize just how much java we do have scattered throughout our services19:05
clarkbSince then the jitsi devs have patched and published updated docker images, which we have updated to, and the service is running again19:05
clarkbThe roughest part of this situation was that this was not a coordinated disclosure with well understood behaviors and parameters. Instead it was a fire drill with a lot of FUD and misinformation floating around19:06
clarkbI ended up doing a fair bit of RTFSing and reading between the lines of what others had said that night to gain confidence that the older version was not affected. Eventually the authors of the older code pushed the current log4j authors to update their statements to make them accurate and clear, confirming our analysis19:06
fungiand led to us having to do a lot of source code level auditing19:07
clarkbThank you to everyone that helped out digging into this and responding. I think a number of other organizations and individuals have had a much rougher go of it.19:07
fricklerand still have and will continue for some time19:08
clarkbReally just wanted to mention that it happened, that we were aware right about when it started to become public, and that the whole team ended up digging in and assessing our risk as well as responding in the case of jitsi. And that is deserving of thanks, so thank you all!19:08
clarkbIs there anything else to add on this subject?19:08
fungii'll drink to that!19:10
fungithanks everybody!19:10
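
For context on the kind of triage discussed above: a common quick check during the Log4Shell (CVE-2021-44228) fire drill was to look for jars that still ship the JndiLookup class, which only exists in the affected log4j 2.x core jars (log4j 1.x uses a different package path and was not affected by this CVE). The sketch below is purely illustrative of that approach, a minimal sketch assuming a local filesystem scan, not something the team is recorded as having run; the scan root is a placeholder.

    #!/usr/bin/env python3
    # Illustrative triage sketch (assumption, not the team's actual tooling):
    # walk a tree, find jars containing log4j 2.x classes, and flag the ones
    # that still ship the JndiLookup class abused by CVE-2021-44228.
    import sys
    import zipfile
    from pathlib import Path

    JNDI_CLASS = "org/apache/logging/log4j/core/lookup/JndiLookup.class"

    def scan(root: Path) -> None:
        for jar in root.rglob("*.jar"):
            try:
                with zipfile.ZipFile(jar) as zf:
                    names = set(zf.namelist())
            except (zipfile.BadZipFile, OSError):
                continue  # unreadable or not actually a zip; skip it
            if any(n.startswith("org/apache/logging/log4j/") for n in names):
                status = ("potentially affected (JndiLookup present)"
                          if JNDI_CLASS in names
                          else "log4j 2.x classes found, JndiLookup absent")
                print(f"{jar}: {status}")

    if __name__ == "__main__":
        scan(Path(sys.argv[1] if len(sys.argv) > 1 else "."))

A hit only means the class is present; patched 2.15+ releases still ship it with the lookup disabled by default, so flagged jars still need their versions checked by hand.
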
clarkb#topic Improving OpenDev's CD throughput19:10
clarkbSounds like that was it so we can move on19:10
clarkbianw has made good progress getting our serially run jobs all organized. Now that we are looking at running in parallel the next step is centralizing the git repo updates for system-config on bridge at the beginning of each buildset19:11
clarkbsince we don't want the jobs fighting over repo contents19:11
clarkbWhat this exposed is that bootstrapping the bridge currently requires human intervention19:11
clarkbianw is wondering if we should have zuul do a bare minimum of bootstrapping so that subsequent jobs can take it from there19:11
clarkbDoing this requires using zuul secrets19:12
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/821645 -- spec outlining some of the issues with secrets19:12
clarkb#link https://review.opendev.org/c/opendev/system-config/+/821155 -- sample of secret writing; more info in changelog19:12
clarkbianw: ^ feel free to dive into more detail or point out which next steps need help19:12
ianwyeah, in the abstract, i guess our decision is to think about moving secrets into zuul19:13
clarkband the spec link above is the best venue for that discussion?19:14
ianwi think this has quite a few advantages, particularly around running credential updates through gerrit as changes as "usual"19:14
ianwthe obvious disadvantages are that we have more foot-gun potential for publishing things, and more exposure to Zuul issues19:15
ianwthe spec -- i'm not sure if this has been discussed previously in the design of all this19:15
ianwi mean the spec is probably a good place for discussion19:16
clarkbya I think the previous implementation was largely "how do we combine what we had before with zuul with minimal effort"19:16
clarkbAnd having a spec to formally work through the design seems like a great idea.19:16
ianwif people are ok with 821155 i think it's worth a merge and revert when quiet just to confirm it works as we think it works19:16
clarkbconsidering the potential impact involved and the fast approaching holidays I don't think this is something we want to rush through, but infra-root review of that spec when able would be great19:17
clarkband ya 821155 seems low impact enough but good poc as input to the spec.19:17
clarkbAnd maybe if we end up meeting on the 28th we can discuss the spec a bit more19:18
ianwi'd also say it's not proposing a radical change to the CD pipeline19:18
clarkbthough I doubt we'll be able to do that synchronously so keeping discussion on the spec as much as possible is probably best19:18
ianwthe credentials are still on the bastion host, which is still running ansible independently19:18
ianwjust instead of admins updating them in git, Zuul would put them on the bastion host19:19
clarkbright19:19
ianwwe would also have to run with both models -- i'm not proposing we move everything wholesale19:19
ianwprobably new things could work from zuul, and, like puppet, as it makes sense as we migrate we can move bits19:20
clarkbthank you for writing the spec up. I'll do my best to get to it this week (though unsure if I'll get to it today)19:21
ianwnp -- i agree if nobody has preexisting thoughts on "oh, we considered that and ..." we can move on and leave it to the spec19:21
clarkb#topic Container Maintenance19:23
clarkb#link https://etherpad.opendev.org/p/opendev-container-maintenance19:23
clarkbI've been working on figuring out which images need bullseye updates and what containers need uid changes and so on and ended up throwing it all into this etherpad19:23
clarkbI've got changes up for all of the container images that need bullseye at this point and have been trying to approve them slowly one or two at a time as I can monitor them going in19:24
clarkbSo far the only real problem has been uWSGI's wheel not building reliably but jrosser dug into that and we think we identified the issue and are reasonably happy with the hacky workaround for now19:24
clarkbOnce the bullseye updates are done I'm hoping to look at the irc bot uid updates next19:25
clarkbWhen I did the audit I noticed that we should also plan to upgrade our zookeeper cluster and our mariadb containers to more recent versions. The current versions are still supported which means this isn't urgent, but keeping up and learning what that process is seems like a good idea19:25
clarkbFor zookeeper we'd go to 3.6.latest as 3.7 isn't fully released yet aiui. For mariadb it is 10.6 I think19:26
clarkbAnd another thing that occurred to me is I'm not sure if we are pruning the CI registry's old container contents19:26
clarkbcorvus: ^ do you know?19:26
clarkbI'm happy for people to help out with this stuff too, feel free to put your name next to an item on that etherpad and I'll look for changes19:27
clarkb#topic Mailman Ansible fixups19:28
clarkbfungi has a stack of changes to better test our ansible for mailman and in the process fix that ansible19:28
clarkb#link https://review.opendev.org/c/opendev/system-config/+/820900/ this stack should address newlist issues.19:28
clarkbIn particular we don't run newlist properly as it expects input. But there is an undocumented flag to make it stop looking for input, which we switch to. Then we also update firewall rules to block outbound smtp on the test nodes so we can verify it attempted to send email, rather than telling newlist not to send email19:29
clarkbfungi: do you need anything else to move that along, or is it largely a matter of you having time to approve changes and watch them?19:29
fungii'm working on the penultimate change in that series to move the service commands into a dedicated playbook rather than running them in testinfra19:29
clarkbah right that was suggested in review.19:30
fungithe earlier changes can merge any time19:30
fungii'm about done with the requested update to 82114419:30
clarkbfungi: were you planning to +A them when ready then?19:30
clarkbmostly want to make sure you aren't waiting on anything from someone else19:30
fungisure, i can if nobody beats me to it19:30
corvusclarkb: We are not19:31
clarkbcorvus: ok so that is something we should be doing. Is there anything preventing us from pruning those, or do we just need to do it?19:32
corvusSuspected corruption bug19:32
clarkbcorvus: thanks I've updated the etherpad with these notes19:32
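
As an aside on what "pruning" would start from: assuming the CI registry exposes the standard Docker registry v2 HTTP endpoints, building an inventory of old content is just a couple of GETs. The sketch below is a minimal, read-only illustration under that assumption, it lists repositories and tags and deletes nothing (deliberately, given the suspected corruption bug mentioned above); the registry URL is a placeholder.

    #!/usr/bin/env python3
    # Minimal sketch, assuming a registry that speaks the Docker v2 API:
    # enumerate repositories and their tags as the starting inventory for
    # any future pruning job.  Read-only; nothing is deleted.
    import requests

    REGISTRY = "https://registry.example.org:5000"  # placeholder URL

    def list_repositories(session: requests.Session) -> list:
        resp = session.get(f"{REGISTRY}/v2/_catalog", timeout=30)
        resp.raise_for_status()
        return resp.json().get("repositories", [])

    def list_tags(session: requests.Session, repo: str) -> list:
        resp = session.get(f"{REGISTRY}/v2/{repo}/tags/list", timeout=30)
        resp.raise_for_status()
        return resp.json().get("tags") or []

    if __name__ == "__main__":
        with requests.Session() as session:
            for repo in list_repositories(session):
                for tag in list_tags(session, repo):
                    print(f"{repo}:{tag}")

Actual pruning would then delete manifests by digest through the same protocol (or use whatever housekeeping the registry implementation provides), which is exactly the step the suspected corruption bug makes risky.
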
clarkb#topic Nodepool image cleanup.19:33
clarkb#link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html19:33
corvus(sorry for phone brevity)19:33
clarkbI'm realizing that the timelines I put in that email were maybe a bit aggressive with holidays fast approaching and other demands. I think this is fine. Better to say it is going away at the end of the month and take it away a week or two later than to take it away earlier than expected19:33
fricklerI would volunteer to try to keep gentoo running19:34
clarkbI expect that this cleanup can be broken down into subtasks if others are able to help (one person per image or similar)19:34
clarkbfrickler: cool good to hear. Maybe you can respond to the thread so that other gentoo interested parties know to reach out?19:34
ianw#link https://review.opendev.org/c/opendev/base-jobs/+/82164919:34
frickleryeah, I can do that19:35
ianwi've proposed the fedora-latest update, i think we can do that19:35
ianwalso there's some other missing labels there in the follow-on19:35
clarkbianw: frickler thanks19:35
fricklerianw: for fedora I think devstack is still running with f34?19:35
ianwyeah, we should update that -- it doesn't actually boot on most of our hosts19:36
clarkbya f34 seems like a dead end unfortunately.19:36
ianwor, actually, it was the other way, it doesn't boot on rax because they dropped some xen things from the initrd19:36
ianwso it can't find its root disk19:36
clarkbI'll try to start the tumbleweed cleanup after christmas if I've got time. I expect this is one of our least used images and it hasn't held up to its promise of being an early warning system19:36
clarkbAnd then I can help with the centos-8 cleanup in the new year19:37
ianwnote with f35 there's some intersection with -> https://review.opendev.org/c/openstack/diskimage-builder/+/82152619:37
ianwit looks like "grub2-mkconfig" isn't doing what we hoped on the centos 9-stream image (*not* centos-minimal) -- still need to investigate fully19:38
clarkbit's always something with every new release :)19:39
clarkb#topic Open Discussion19:40
clarkbThat was it for the published agenda. Anything else?19:40
fricklertwo things from me19:40
clarkbgo for it19:40
fungithe promised update for 821771 is up for review now19:40
fricklera) I finally finished the exim4 paniclog cleanup19:40
fungi(er, for 821144 i meant)19:41
fricklerall entries were from immediately after the nodes were set up19:41
fungiyes, it seems like exim gets into an unhappy state during bootstrapping19:41
fricklerand all were bionic nodes, so not something likely to repeat19:41
fungioh, doesn't happen on focal? that's great news19:41
fricklerat least I didn't see any incidents there19:42
fricklerb) this came up discussing zuul restarts: do we want to have some kind of freeze for the holidays?19:42
clarkbya I think we should do our best to avoid big central changes for sure. It is probably ok to do job updates and more leaf node things since they can be reverted easily19:43
fricklermy suggestion would be to avoid things like zuul or other updates if possible for some time, maybe from this friday eob to 3rd of jan?19:43
corvusI disagree19:43
clarkbI would add that if people are around to babysit then my concern goes down significantly19:44
corvuszuul is in a stabilization period; if we avoid restarts, we're only going to keep running buggy code and avoid fixes19:44
fungiif zuul development is speeding toward 5.0.0 during that time, i'd rather opendev didn't fall behind19:44
corvusthis is the best period to make infrastructure changes19:44
clarkbHistorically, what we have had issues with is people making changes (pbr release on christmas eve one year iirc) then disappearing19:44
corvusfewer people being around is the best time for zuul restarts19:44
corvusthat's why i do so many over weekends19:44
clarkbIf we avoid the "then disappearing" bit we should probably be ok19:45
corvusif we're going to avoid infra changes when people are busy with openstack releases and also avoid them when they aren't, then there's precious little time to actually do them.19:45
fungimaybe if people avoid doing things late in their day when they'll be falling unconscious soon after?19:45
clarkbalso with zuul specifically we can revert to a known good (or good enough)19:45
corvusi don't think i have a history of doing that?19:45
clarkbcorvus: no you don't19:45
fungicorrect19:46
clarkbI guess what I'm saying is it will often come down to a judgement call. If a change is revertable (say in the case of zuul) and the person driving it plans to pay attention (again say with zuul) then it is probably fine19:46
clarkbbut something like a gerrit 3.4 upgrade wouldn't fall into that bucket I don't think19:46
fungiand plenty of information on how to undo things helps in case issues crop up much later (a day or three even)19:47
clarkbin the past we've said we should be "slushy"19:47
clarkbI think that is what I'm advocating for here. Which still allows for changes to be made, while understanding how to revert and/or proceed if something goes wrong and ensuring someone is able to do that19:47
clarkband ya the next week and week after seem like the period of time where we'll want to take that into account19:48
ianwfwiw from the 23rd (.au time) i'm not going to be easily within reach of an interactive login till prob around jan 1019:49
clarkbI just want to avoid repeated PBR incidents, but I think we all want to avoid that and have a good understanding of what is likely to be safe or at least revertable19:49
clarkbthe reason I dig up things like image removals and various container image updates is that they both fit under "likely safe to do, with a revert path if necessary" :)19:50
clarkbBut are still important updates so not a waste of time19:50
fungii'll have family visiting from the 24th through the 31st but will try to be around some for that week19:50
clarkbAnyway I don't think we need to say no zuul updates. But if you are doing them monitoring the results and be prepared to revert to last known good seems reasonable (and so far that has been done with zuul so I'm not worried)19:51
fricklersounds like a reasonable compromise, ok19:52
fungiit seems like between now and 5.0.0 we should expect updates to come with more fixes than problems anyway19:52
clarkbfrickler: what if we say something like "Be aware that there may be minimal support from December 20-31, if you are making changes in OpenDev please consider what your revert path or path out of danger is before making the change and monitor to ensure this isn't necessary"19:52
clarkbAlso historically setuptools is due for a release that will break everything next week19:53
clarkbLet's hope they don't do that to us :)19:53
fungiright... most of the fire drills this time of year have traditionally not been of our own making19:53
clarkbalso credit to zuul, I don't see zuul upgrades as "big central changes" these days. Zuul is well tested and monitored and we have rollback plans19:54
clarkbSomething like Gerrit or our bridge ansible version or PBR are what I've got in mind19:54
clarkbSounds like that may be about it. Last call :)19:56
fungii'll go ahead and approve the smtp egress block for our deploy tests now19:56
fungiwill keep an eye on things while putting dinner together19:57
clarkbAlright sounds like that was it. Thank you everyone19:58
clarkb#endmeeting19:58
opendevmeetMeeting ended Tue Dec 14 19:58:34 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:58
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-14-19.01.html19:58
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-14-19.01.txt19:58
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-14-19.01.log.html19:58
fungithanks clarkb!19:58
