clarkb | Anyone else here for the meeting? | 19:00 |
---|---|---|
fungi | yo! | 19:00 |
frickler | o/ | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Dec 14 19:01:01 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000309.html Our Agenda | 19:01 |
ianw | o/ | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | We'll cancel next week's meeting and the meeting on January 4, 2022. I'll see what the temperature for having a meeting on the 28th is on the 27th. Though I half expect no one to be around for that one either :) | 19:02 |
clarkb | Hopefully we can all enjoy a bit of rest and holidays and so on | 19:02 |
clarkb | #topic Actions from last meeting | 19:03 |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-12-07-19.01.txt minutes from last meeting | 19:03 |
clarkb | There were no actions recorded so we'll just dive straight in | 19:03 |
clarkb | #topic Topics | 19:03 |
clarkb | #topic Log4j vulnerability | 19:03 |
clarkb | Last week on Thursday afternoon relative to me a 0day RCE in a popular java library was released | 19:04 |
clarkb | The vast majority of the java applications that we care about either use an older version of the library or don't use it at all and were not vulnerable | 19:04 |
clarkb | The exception was meetpad which we shut down in response | 19:05 |
fungi | though i was surprised to realize just how much java we do have scattered throughout our services | 19:05 |
clarkb | Since then jitsi devs have patched and updated docker images which we have updated to and the service is running again | 19:05 |
clarkb | The roughest part of this situation was this was not a coordinated disclosure with well understood behaviors and parameters. Instead it was a fire drill with a lot of FUD and misinformation floating around | 19:06 |
clarkb | I ended up doing a fair bit of RTFSing and reading between the lines of what others had said that night to gain confidence that the older version was not affected, and eventually the authors of the older code pushed for the current log4j authors to update their statements to make them accurate and clear, confirming our analysis | 19:06 |
fungi | and led to us having to do a lot of source code level auditing | 19:07 |
clarkb | Thank you to everyone that helped out digging into this and responding. I think a number of other organizations and individuals have had a much rougher go of it. | 19:07 |
frickler | and still have and will continue for some time | 19:08 |
clarkb | Really just wanted to mention it happened, we were aware right about when it started to become public, and the whole team ended up digging in and assessing our risk as well as responding in the case of jitsi. And that is deserving of thanks so thank you all! | 19:08 |
clarkb | Is there anything else to add on this subject? | 19:08 |
fungi | i'll drink to that! | 19:10 |
fungi | thanks everybody! | 19:10 |
clarkb | #topic Improving OpenDev's CD throughput | 19:10 |
clarkb | Sounds like that was it so we can move on | 19:10 |
clarkb | ianw has made good progress getting our serially run jobs all organized. Now that we are looking at running in parallel the next step is centralizing the git repo updates for system-config on bridge at the beginning of each buildset | 19:11 |
clarkb | since we don't want the jobs fighting over repo contents | 19:11 |
clarkb | What this exposed is that bootstrapping the bridge currently requires human intervention | 19:11 |
clarkb | ianw is wondering if we should have zuul do a bare minimum of bootstrapping so that subsequent jobs can take it from there | 19:11 |
clarkb | Doing this requires using zuul secrets | 19:12 |
clarkb | #link https://review.opendev.org/c/opendev/infra-specs/+/821645 -- spec outlining some of the issues with secrets | 19:12 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/821155 -- sample of secret writing; more info in changelog | 19:12 |
clarkb | ianw: ^ feel free to dive into more detail or point out what the next steps that need help are | 19:12 |
ianw | yeah, in the abstract, i guess our decision is to think about moving secrets into zuul | 19:13 |
clarkb | and the spec link above is the best venue for that discussion? | 19:14 |
ianw | i think this has quite a few advantages, particularly around running credential updates through gerrit as changes as "usual" | 19:14 |
ianw | the obvious disadvantages are that we have more foot-gun potential for publishing things, and more exposure to Zuul issues | 19:15 |
ianw | the spec -- i'm not sure if this has been discussed previously in the design of all this | 19:15 |
ianw | i mean the spec is probably a good place for discussion | 19:16 |
clarkb | ya I think the previous implementation was largely a "how do we combine what we had before with zuul with minimal effort" | 19:16 |
clarkb | And having a spec to formally work through the design seems like a great idea. | 19:16 |
ianw | if people are ok with 821155 i think it's worth a merge and revert when quiet just to confirm it works as we think it works | 19:16 |
clarkb | considering the potential impact involved and the fast approaching holidays I don't think this is something we want to rush through, but infra-root review of that spec when able would be great | 19:17 |
clarkb | and ya 821155 seems low impact enough but good poc as input to the spec. | 19:17 |
clarkb | And maybe if we end up meeting on the 28th we can discuss the spec a bit more | 19:18 |
ianw | i'd also say it's not proposing a radical change to the CD pipeline | 19:18 |
clarkb | though I doubt we'll be able to do that synchronously so keeping discussion on the spec as much as possible is probably best | 19:18 |
ianw | the credentials are still on the bastion host, which is still running ansible independently | 19:18 |
ianw | just instead of admins updating them in git, Zuul would put them on the bastion host | 19:19 |
clarkb | right | 19:19 |
ianw | we would also have to run with both models -- i'm not proposing we move everything wholesale | 19:19 |
ianw | probably new things could work from zuul, and, like puppet, as it makes sense as we migrate we can move bits | 19:20 |
clarkb | thank you for writing the spec up. I'll do my best to get to it this week (though unsure if I'll get to it today) | 19:21 |
ianw | np -- i agree if nobody has preexisting thoughts on "oh, we considered that and ..." we can move on and leave it to the spec | 19:21 |
clarkb | #topic Container Maintenance | 19:23 |
clarkb | #link https://etherpad.opendev.org/p/opendev-container-maintenance | 19:23 |
clarkb | I've been working on figuring out which images need bullseye updates and what containers need uid changes and so on and ended up throwing it all into this etherpad | 19:23 |
clarkb | I've got changes up for all of the container images that need bullseye at this point and have been trying to approve them slowly one or two at a time as I can monitor them going in | 19:24 |
clarkb | So far the only real problem has been uWSGI's wheel not building reliably but jrosser dug into that and we think we identified the issue and are reasonably happy with the hacky workaround for now | 19:24 |
clarkb | Once the bullseye updates are done I'm hoping to look at the irc bot uid updates next | 19:25 |
clarkb | When I did the audit I noticed that we should also plan to upgrade our zookeeper cluster and our mariadb containers to more recent versions. The current versions are still supported which means this isn't urgent, but keeping up and learning what that process is seems like a good idea | 19:25 |
clarkb | For zookeeper we'd go to 3.6.latest as 3.7 isn't fully released yet aiui. For mariadb it is 10.6 I think | 19:26 |
clarkb | And another thing that occurred to me is I'm not sure if we are pruning the CI registry's old container contents | 19:26 |
clarkb | corvus: ^ do you know? | 19:26 |
clarkb | I'm happy for people to help out with this stuff too, feel free to put your name next to an item on that etherpad and I'll look for changes | 19:27 |
clarkb | #topic Mailman Ansible fixups | 19:28 |
clarkb | fungi has a stack of changes to better test our ansible for mailman and in the process fix that ansible | 19:28 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/820900/ this stack should address newlist issues. | 19:28 |
clarkb | In particular we don't run newlist properly as it expects input. But there is an undocumented flag to make it stop looking for input which we switch to. Then we also update firewall rules to block outbound smtp on the test nodes so we can verify it attempted to send email rather than telling newlist to not send email | 19:29 |
clarkb | fungi: do you need anything else to move that along or is it largely a matter of you having time to approve changes and watch them? | 19:29 |
fungi | i'm working on the penultimate change in that series to move the service commands into a dedicated playbook rather than running them in testinfra | 19:29 |
clarkb | ah right that was suggested in review. | 19:30 |
fungi | the earlier changes can merge any time | 19:30 |
fungi | i'm about done with the requested update to 821144 | 19:30 |
clarkb | fungi: were you planning to +A them when ready then? | 19:30 |
clarkb | mostly want to make sure you aren't waiting on anything from someone else | 19:30 |
fungi | sure, i can if nobody beats me to it | 19:30 |
corvus | clarkb: We are not | 19:31 |
clarkb | corvus: ok so that is something we should be doing. Is there anything preventing us from pruning those? or just need to do it? | 19:32 |
corvus | Suspected corruption bug | 19:32 |
clarkb | corvus: thanks I've updated the etherpad with these notes | 19:32 |
clarkb | #topic Nodepool image cleanup. | 19:33 |
clarkb | #link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html | 19:33 |
corvus | (sorry for phone brevity) | 19:33 |
clarkb | I'm realizing that the timelines I put in that email were maybe a bit aggressive with holidays fast approaching and other demands. I think this is fine. Better to say it is going away at the end of the month and take it away a week or two later than to take it away earlier than expected | 19:33 |
frickler | I would volunteer to try to keep gentoo running | 19:34 |
clarkb | I expect that this cleanup can be broken down into subtasks if others are able to help (one person per image or similar) | 19:34 |
clarkb | frickler: cool good to hear. Maybe you can respond to the thread so that other gentoo interested parties know to reach out? | 19:34 |
ianw | #link https://review.opendev.org/c/opendev/base-jobs/+/821649 | 19:34 |
frickler | yeah, I can do that | 19:35 |
ianw | i've proposed the fedora-latest update, i think we can do that | 19:35 |
ianw | also there's some other missing labels there in the follow-on | 19:35 |
clarkb | ianw: frickler thanks | 19:35 |
frickler | ianw: for fedora I think devstack is still running with f34? | 19:35 |
ianw | yeah, we should update that -- it doesn't actually boot on most of our hosts | 19:36 |
clarkb | ya f34 seems like a dead end unfortunately. | 19:36 |
ianw | or, actually, it was the other way, it doesn't boot on rax because they dropped some xen things from the initrd | 19:36 |
ianw | so it can't find its root disk | 19:36 |
clarkb | I'll try to start the tumbleweed cleanup after christmas if I've got time. I expect this is one of our least used images and it hasn't held up to its promise of being an early warning system | 19:36 |
clarkb | And then I can help with the centos-8 cleanup in the new year | 19:37 |
ianw | note with f35 there's some intersection with -> https://review.opendev.org/c/openstack/diskimage-builder/+/821526 | 19:37 |
ianw | it looks like "grub2-mkconfig" isn't doing what we hoped on the centos 9-stream image (*not* centos-minimal) -- still need to investigate fully | 19:38 |
clarkb | it's always something with every new release :) | 19:39 |
clarkb | #topic Open Discussion | 19:40 |
clarkb | That was it for the published agenda. Anything else? | 19:40 |
frickler | two things from me | 19:40 |
clarkb | go for it | 19:40 |
fungi | the promised update for 821771 is up for review now | 19:40 |
frickler | a) I finally finished the exim4 paniclog cleanup | 19:40 |
fungi | (er, for 821144 i meant) | 19:41 |
frickler | all entries were from immediately after the nodes were set up | 19:41 |
fungi | yes, it seems like exim gets into an unhappy state during bootstrapping | 19:41 |
frickler | and all were bionic nodes, so not something likely to repeat | 19:41 |
fungi | oh, doesn't happen on focal? that's great news | 19:41 |
frickler | at least I didn't see any incidents there | 19:42 |
frickler | b) this came up discussing zuul restarts: do we want to have some kind of freeze for the holidays? | 19:42 |
clarkb | ya I think we should do our best to avoid big central changes for sure. It is probably ok to do job updates and more leaf node things since they can be reverted easily | 19:43 |
frickler | my suggestion would be to avoid things like zuul or other updates if possible for some time, maybe from this friday eob to 3rd of jan? | 19:43 |
corvus | I disagree | 19:43 |
clarkb | I would add that if people are around to babysit then my concern goes down significantly | 19:44 |
corvus | zuul is in a stabilization period; if we avoid restarts, we're only going to keep running buggy code and avoid fixes | 19:44 |
fungi | if zuul development is speeding toward 5.0.0 during that time, i'd rather opendev didn't fall behind | 19:44 |
corvus | this is the best period to make infrastructure changes | 19:44 |
clarkb | Historically, what we have had issues with is people making changes (pbr release on christmas eve one year iirc) then disappearing | 19:44 |
corvus | fewer people being around is the best time for zuul restarts | 19:44 |
corvus | that's why i do so many over weekends | 19:44 |
clarkb | If we avoid the "then disappearing" bit we should probably be ok | 19:45 |
corvus | if we're going to avoid infra changes when people are busy with openstack releases and also avoid them when they aren't, then there's precious little time to actually do them. | 19:45 |
fungi | maybe if people avoid doing things late in their day when they'll be falling unconscious soon after? | 19:45 |
clarkb | also with zuul specifically we can revert to a known good (or good enough) | 19:45 |
corvus | i don't think i have a history of doing that? | 19:45 |
clarkb | corvus: no you don't | 19:45 |
fungi | correct | 19:46 |
clarkb | I guess what I'm saying is it will often come down to a judgement call. If a change is revertable (say in the case of zuul) and the person driving it plans to pay attention (again say with zuul) then it is probably fine | 19:46 |
clarkb | but something like a gerrit 3.4 upgrade wouldn't fall into that bucket I don't think | 19:46 |
fungi | and plenty of information on how to undo things helps in case issues crop up much later (a day or three even) | 19:47 |
clarkb | in the past we've said we should be "slushy" | 19:47 |
clarkb | I think that is what I'm advocating for here. Which still allows for changes to be made but understanding how to revert and/or proceed if something goes wrong and ensuring someone is able to do that | 19:47 |
clarkb | and ya the next week and week after seem like the period of time where we'll want to take that into account | 19:48 |
ianw | fwiw from the 23rd (.au time) i'm not going to be easily within access of an interactive login till prob around jan 10 | 19:49 |
clarkb | I just want to avoid repeated PBR incidents, but I think we all want to avoid that and have a good understanding of what is likely to be safe or at least revertable | 19:49 |
clarkb | the reason I dig up things like image removals and various container image updates is they both fit under the likely safe to do with a revert path if necessary :) | 19:50 |
clarkb | But are still important updates so not a waste of time | 19:50 |
fungi | i'll have family visiting from the 24th through the 31st but will try to be around some for that week | 19:50 |
clarkb | Anyway I don't think we need to say no zuul updates. But if you are doing them, monitoring the results and being prepared to revert to last known good seems reasonable (and so far that has been done with zuul so I'm not worried) | 19:51 |
frickler | sounds like a reasonable compromise, ok | 19:52 |
fungi | it seems like between now and 5.0.0 we should expect updates to come with more fixes than problems anyway | 19:52 |
clarkb | frickler: what if we say something like "Be aware that there may be minimal support from December 20-31, if you are making changes in OpenDev please consider what your revert path or path out of danger is before making the change and monitor to ensure this isn't necessary" | 19:52 |
clarkb | Also historically setuptools is due for a release that will break everything next week | 19:53 |
clarkb | Let's hope they don't do that to us :) | 19:53 |
fungi | right... most of the fire drills this time of year have traditionally not been of our own making | 19:53 |
clarkb | also credit to zuul I don't see zuul upgrades as "big central changes" these days. Zuul is well tested and monitored and we have rollback plans | 19:54 |
clarkb | Something like Gerrit or our bridge ansible version or PBR are what I've got in mind | 19:54 |
clarkb | Sounds like that may be about it. Last call :) | 19:56 |
fungi | i'll go ahead and approve the smtp egress block for our deploy tests now | 19:56 |
fungi | will keep an eye on things while putting dinner together | 19:57 |
clarkb | Alright sounds like that was it. Thank you everyone | 19:58 |
clarkb | #endmeeting | 19:58 |
opendevmeet | Meeting ended Tue Dec 14 19:58:34 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:58 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-14-19.01.html | 19:58 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-14-19.01.txt | 19:58 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-14-19.01.log.html | 19:58 |
fungi | thanks clarkb! | 19:58 |