-opendevstatus- NOTICE: review.opendev.org (Gerrit) is currently down, we are working to restore service as soon as possible | 07:30 | |
-opendevstatus- NOTICE: review.opendev.org (Gerrit) is back online | 14:25 | |
clarkb | almost meeting time. We'll get started shortly | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Nov 1 19:01:41 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000376.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:02 |
clarkb | There were no announcements so we can dive right in | 19:02 |
clarkb | #topic Topics | 19:02 |
clarkb | #topic Bastion Host Updates | 19:02 |
clarkb | #link https://review.opendev.org/q/topic:prod-bastion-group | 19:02 |
clarkb | #link https://review.opendev.org/q/topic:bridge-ansible-venv | 19:03 |
clarkb | those are a couple of groups of changes to keep moving this along | 19:03 |
* frickler should finally review some of those | 19:03 | |
clarkb | frickler also discovered that the secrets management key is missing on the new host. Something that should be migrated over and tested before we remove the old one | 19:03 |
clarkb | but I think we're really close to being able to finish this up. ianw if you are around anything else to add? | 19:04 |
frickler | we should also agree when to move editing those from one host to the other | 19:04 |
ianw | o/ | 19:04 |
clarkb | ++ at this point I would probably say anything that can't be done on the new host is a bug and we should fix that as quickly as possible and use the new host | 19:04 |
ianw | yes please move over anything from your home directories, etc. that you want | 19:05 |
ianw | i've added a note on the secret key to | 19:05 |
ianw | #link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-10 | 19:05 |
ianw | thanks for that -- i will be writing something up on that | 19:06 |
clarkb | I also need to review the virtualenv management change since that will ensure we have a working openstackclient for rax and others | 19:06 |
ianw | yeah a couple of changes are out there just to clean up some final things | 19:07 |
clarkb | also the zuul reboot playbook ran successfully off the new bridge | 19:07 |
frickler | ianw: are you o.k. with rebooting bridge01 after the openssl updates or is there some blocker for that? | 19:08 |
frickler | (I ran the apt update earlier already) | 19:08 |
clarkb | one thing to consider when doing ^ is if we have any infra prod jobs that we don't want to conflict with | 19:08 |
clarkb | but I'm not aware of any urgent jobs at the moment | 19:08 |
ianw | the gist is I think that we can get testing to use "bridge99.opendev.org" -- which is a nice proof that we're not hard-coding in references | 19:08 |
ianw | i think it's fine to reboot -- sorry i've been out last two days and not caught up but i can babysit it soonish | 19:09 |
clarkb | sounds good we can coordinate further after the meeting. | 19:09 |
clarkb | Anything else on this topic? | 19:09 |
ianw | nope, thanks for the reviews and getting it this far! | 19:10 |
clarkb | #topic Upgrading Bionic Servers | 19:10 |
clarkb | at this point I think we've largely sorted out the jammy related issues and we should be good to boot just about anything on jammy | 19:10 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862835/ Disable phased package updates | 19:11 |
clarkb | that is one remaining item though. Basically it says don't do phased updates which will ensure that our jammy servers all get the same packages at the same time | 19:11 |
clarkb | rather than staggering them over time. I'm concerned the staggering will just lead to confusion about whether or not a package is related to unexpected behaviors | 19:11 |
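For reference, a minimal sketch of how phased updates can be disabled on an Ubuntu host via an apt configuration drop-in; the file name is arbitrary and this may differ from what the actual change in 862835 does:

```shell
# Hedged sketch: always include phased updates so every host picks up the
# same package versions at the same time instead of being staggered.
# The option below exists in apt on Ubuntu 21.04 and later; file name is arbitrary.
cat <<'EOF' | sudo tee /etc/apt/apt.conf.d/99-always-include-phased-updates
APT::Get::Always-Include-Phased-Updates "true";
EOF
```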
clarkb | https://review.opendev.org/c/opendev/zone-opendev.org/+/862941 and its Depends-On are related to gitea-lb02 being brought up as a jammy node too (this is cleanup of old nodes) | 19:12 |
clarkb | Otherwise nothing extra to say. Just that we can (and probably should) do this for new servers and replacing old servers with jammy is extra good too | 19:12 |
clarkb | I'm hoping I'll have time later this week to replace another server (maybe one of the gitea backends) | 19:13 |
clarkb | #topic Removing snapd | 19:13 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862834 Change to remove snapd from our servers | 19:13 |
clarkb | after we discussed this in our last meeting I poked around on snapcraft and in ubuntu package repositories and I think there isn't much reason for us to have snapd installed on our servers | 19:13 |
clarkb | This change can and will affect a number of servers though so worth double checking. I haven't done an audit to see which would be affected but we could do that if we think it is necessary | 19:14 |
clarkb | to do my filtering I looked for snaps maintained by canonical on snapcraft to see which ones were likely to be useful for us. And many of them continue to have actual packages or aren't useful to servers | 19:15 |
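As a rough illustration only (not necessarily what 862834 does), auditing and removing snapd on a single host could look something like this:

```shell
# Hedged sketch: check whether any snaps are actually installed, then
# purge the snapd daemon if nothing depends on it.
snap list || true        # prints installed snaps, or a notice if there are none
sudo apt-get purge -y snapd
```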
clarkb | Reviews very much welcome | 19:15 |
clarkb | #topic Mailman 3 | 19:15 |
clarkb | Since our last meeting the upstream for the mailman 3 docker images did land my change to add lynx to the images | 19:16 |
clarkb | No responses on the other issues I filed though. | 19:16 |
clarkb | Unfortunately, I think this makes the question of whether or not we should fork more confusing, not easier. I'm leaning more towards forking at this point simply because I'm not sure how responsive upstream will be. But feedback there continues to be welcome | 19:16 |
clarkb | When fungi is back we should make a decision and move forward | 19:18 |
clarkb | #topic Updating base python docker images to use pip wheel | 19:18 |
clarkb | Upstream seems to be moving slowly on my bugfix PR. Some of that slowness is because changes to git that impacted their CI setup happened at the same time | 19:18 |
clarkb | Either way I think we should strongly consider their suggestion of using pip wheel though | 19:18 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862152 | 19:18 |
*** diablo_rojo_phone is now known as Guest182 | 19:19 | |
clarkb | There is a nodepool and a dib change too which help illustrate that this change functions and doesn't regress features like siblings installs | 19:19 |
clarkb | It should be a noop for us, but makes us more resilient to pip changes if/when they happen in the future. Reviews very much welcome on this as well | 19:19 |
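For context, the "pip wheel" approach being suggested is roughly a two-step flow: build wheels for everything first, then install only from those wheels. A hedged sketch with placeholder paths (not the actual base-image layout):

```shell
# Hedged sketch: build wheels for all requirements into a local directory...
pip wheel -r requirements.txt --wheel-dir /tmp/wheels
# ...then install strictly from those wheels, never reaching back to the index.
pip install --no-index --find-links /tmp/wheels -r requirements.txt
```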
clarkb | #topic Etherpad service logging | 19:20 |
clarkb | ianw: did you have time to write the change to update etherpad logging to syslog yet? | 19:20 |
ianw | oh no, sorry, totally distracted on that | 19:21 |
ianw | will do | 19:21 |
clarkb | thanks | 19:22 |
clarkb | unrelated to the logging issue I had to reboot etherpad after its db volume got remounted RO due to errors | 19:22 |
clarkb | after reboot it mounted the volume just fine as far as I could tell and things have been happy since yesterday | 19:22 |
clarkb | (just a heads up I don't think any action is necessary there) | 19:22 |
clarkb | #topic Unexpected Gerrit Reboot | 19:23 |
clarkb | This happened around 06:00 UTC ish today | 19:23 |
clarkb | basically looks like review02.o.o rebooted and when it came back it had no networking until ~13:00 UTC | 19:23 |
clarkb | we suspect something on the cloud side which would explain the lack of networking for some time as well. But we haven't heard back on that yet | 19:23 |
frickler | do we have some support contact at vexxhost other than mnaser__ ? | 19:24 |
clarkb | frickler: mnaser__ has been the primary contact. There have been others in the past but I don't think they are at vexxhost anymore | 19:24 |
clarkb | If we think it is important I can ask if anyone at the foundation has contacts we might try | 19:25 |
frickler | there's also the (likely unrelated) IPv6 routing issue, which I think is more important | 19:25 |
clarkb | but at this point things seem stable and we're mostly just interested in confirmation of our assumptions? Might be ok to wait a day | 19:25 |
ianw | one thing i noticed was that corvus i think had to start the container? | 19:25 |
clarkb | ianw: yes, our docker-compose file doesn't specify a restart policy which mimics the old pre docker behavior of not starting automatically | 19:26 |
clarkb | frickler: re ipv6 thats a good point. | 19:26 |
frickler | regarding manual start we assumed that that was intentional and agreeable behavior | 19:27 |
corvus | i did perform some sanity checks to make sure the server looked okay before starting | 19:27 |
corvus | (which is one of the benefits of that) | 19:27 |
ianw | something to think about -- but also this is the first case i can think of since we migrated the review host to vexxhost that there's been what seems to be instability beyond our control | 19:27 |
ianw | so yeah, no need to make urgent changes to handle unscheduled reboots at this point :) | 19:28 |
clarkb | considering that we seemed to lack network access anyway I'm not sure it's super important to auto restart based on this event | 19:28 |
clarkb | we would've waited either way | 19:28 |
frickler | the other thing worth considering is whether we want to have some local account to allow debugging via vnc | 19:28 |
corvus | but i think honestly the main reason we didn't have it start on boot is so that if we stopped a service manually it didn't restart automatically. that can be achieved with a "restart: unless-stopped" policy. so really, there are 2 reasons not to start on boot, and we can evaluate whether we still like one, the other, both, or none of them. | 19:28 |
frickler | since my first assumption was lack of network device caused by a kernel update | 19:29 |
clarkb | frickler: the way we would normally handle that today is via a rescue instance | 19:29 |
clarkb | when you rescue an instance with nova it shuts down the instance then boots another image and attaches the broken instance's disk to it as a device, which allows you to mount the partitions | 19:30 |
clarkb | it's a little bit of work, but the cases where we've had to resort to it are few and it's probably worth keeping our images as simple as possible without user passwords? | 19:30 |
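For reference, the rescue workflow described above maps onto a couple of openstack CLI calls; a hedged sketch with a placeholder server name (boot-from-volume rescue may need a newer compute API, as noted just below):

```shell
# Hedged sketch: boot the broken server into a rescue image with its original
# disk attached as a secondary device, inspect it, then boot back normally.
openstack server rescue broken-server01
# ...log in via the rescue image, mount and inspect the partitions...
openstack server unrescue broken-server01
```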
frickler | except for boot-from-volume instances, which seem to be a bit more tricky? | 19:30 |
ianw | have we ever done that with vexxhost? | 19:30 |
clarkb | frickler: oh is bfv different? | 19:30 |
frickler | at least it needs a recent compute api (>=ussuri iirc) | 19:31 |
clarkb | ianw: I'm not sure about doing it specifically in vexxhost. Testing it is a good idea I suppose before I/we declare it is good enough | 19:31 |
clarkb | my concern with passwords on instances is that we don't have central auth so rotating/changing/managing them is more difficult | 19:31 |
corvus | i love not having local passwords. i hope it is good enough. | 19:31 |
clarkb | ya I'd much rather avoid it if possible | 19:31 |
frickler | I was also wondering why we chose boot from volume, was that intentional? | 19:32 |
clarkb | I've made a note to myself to test instance rescue in vexxhost. Both bfv and not | 19:32 |
corvus | i have a vague memory that it might be a flavor requirement, but i'm not sure | 19:32 |
ianw | frickler: i'd have to go back and research, but i feel like it was a requirement of vexxhost | 19:33 |
clarkb | yes, I think at the time the current set of flavors had no disk | 19:33 |
clarkb | their latest flavors do have disk and can be booted without bfv | 19:33 |
ianw | heh, i think that's three vague memories, maybe that makes 1 real one :) | 19:33 |
clarkb | I booted gitea-lb02 without bfv (but uploaded the jammy image to vexxhost as raw allowing us to do bfv as well) | 19:33 |
clarkb | #action clarkb test vexxhost instance rescues | 19:34 |
clarkb | why don't we do that and come back to the idea of passwords for recovery once we know if ^ works | 19:34 |
clarkb | anything else on this subject? | 19:34 |
ianw | ++ | 19:34 |
frickler | ack | 19:34 |
clarkb | also I'll use throwaway test instances not anything prod like :) | 19:35 |
clarkb | #topic OpenSSL v3 | 19:35 |
clarkb | As everyone is probably aware, openssl v3 had a big security release today. It turned out to be a bit less scary than the initially shared CRITICAL severity led everyone to believe (they downgraded it to high) | 19:35 |
clarkb | Since all but two of our servers are too old to have openssl v3 we are largely unaffected | 19:36 |
clarkb | all in all the impact is far more limited than feared which is great | 19:36 |
clarkb | Also ubuntu seems to think the way they compile openssl with stack protections mitigates the RCE and this is only a DoS | 19:36 |
clarkb | #topic Upgrading Zookeeper | 19:37 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/863089 | 19:38 |
clarkb | I would like to upgrade zookeeper tomorrow | 19:38 |
clarkb | at first I thought that we could just let automation do it (which is still likely fine) but all the docs I can find suggest upgrading the leader which our automation isn't aware of | 19:39 |
clarkb | That means my plan is to stop ansible via the emergency file on zk04-zk06 and do them one by one. Followers first then the leader (currently zk05). Then merge that change and finally remove the hosts from the emergency file | 19:39 |
clarkb | if I could get reviews on the change and any concerns for that plan I'd appreciate it. | 19:39 |
clarkb | That said it seems like zookeeper upgrades if you go release to release are meant to be uneventful | 19:40 |
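As a side note, one way to confirm which member is currently the leader before the rolling restart is ZooKeeper's four-letter `srvr` command (whitelisted by default); a hedged sketch, assuming the standard client port and opendev.org hostnames:

```shell
# Hedged sketch: print each cluster member's current role (leader/follower).
for host in zk04 zk05 zk06; do
  echo -n "$host: "
  echo srvr | nc "$host.opendev.org" 2181 | grep Mode
done
```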
corvus | (upgrading the leader last i think you missed a word) | 19:40 |
clarkb | yup leader last I mean | 19:40 |
frickler | the plan sounds fine to me and I'll try to review by your morning | 19:40 |
corvus | i'll be around to help | 19:41 |
clarkb | thanks! | 19:41 |
clarkb | #topic Gitea Rebuild | 19:41 |
clarkb | There are golang compiler updates today as well and it seems worthwhile to rebuild gitea under them | 19:42 |
clarkb | I'll have that change up as soon as the meeting ends | 19:42 |
clarkb | I should be able to monitor that change as it lands and gets deployed today. But we should coordinate that with the bridge reboot | 19:42 |
clarkb | #topic Open Discussion | 19:43 |
ianw | ++ | 19:43 |
clarkb | It is probably worth mentioning that gitea as an upstream is going through a bit of a rough time. Their community has disagreements over the handling of trademarks and some individuals have talked about forking | 19:44 |
ianw | :/ | 19:44 |
clarkb | I've been trying to follow along as well as I can to understand any potential impact to us and I'm not sure we're at a point where we need to take a stance or plan to change anything | 19:44 |
clarkb | but it is possible that we'll be in that position whether or not we like it in the future | 19:44 |
ianw | on the zuul-sphinx bug that started occurring with the latest sphinx -- might need to think about how that works including files per https://sourceforge.net/p/docutils/bugs/459/ | 19:44 |
clarkb | Sounds like that may be it? | 19:49 |
clarkb | Everyone can have 10 minutes for breakfast/lunch/dinner/sleep :) | 19:50 |
clarkb | thank you all for your time and we'll be back here same time and location next week | 19:50 |
clarkb | #endmeeting | 19:50 |
opendevmeet | Meeting ended Tue Nov 1 19:50:23 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:50 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.html | 19:50 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.txt | 19:50 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.log.html | 19:50 |
*** Guest182 is now known as diablo_rojo_phone | 20:34 |