-opendevstatus- NOTICE: review.opendev.org (Gerrit) is currently down, we are working to restore service as soon as possible | 07:30 | |
-opendevstatus- NOTICE: review.opendev.org (Gerrit) is back online | 14:25 | |
clarkb | almost meeting time. We'll get started shortly | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Nov 1 19:01:41 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000376.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:02 |
clarkb | There were no announcements so we can dive right in | 19:02 |
clarkb | #topic Topics | 19:02 |
clarkb | #topic Bastion Host Updates | 19:02 |
clarkb | #link https://review.opendev.org/q/topic:prod-bastion-group | 19:02 |
clarkb | #link https://review.opendev.org/q/topic:bridge-ansible-venv | 19:03 |
clarkb | those are a couple of groups of changes to keep moving this along | 19:03 |
* frickler should finally review some of those | 19:03 | |
clarkb | frickler also discovered that the secrets management key is missing on the new host. Something that should be migrated over and tested before we remove the old one | 19:03 |
clarkb | but I think we're really close to being able to finish this up. ianw if you are around anything else to add? | 19:04 |
frickler | we should also agree when to move editing those from one host to the other | 19:04 |
ianw | o/ | 19:04 |
clarkb | ++ at this point I would probably say anything that can't be done on the new host is a bug and we should fix that as quickly as possible and use the new host | 19:04 |
ianw | yes please move over anything from your home directories, etc. that you want | 19:05 |
ianw | i've added a note on the secret key to | 19:05 |
ianw | #link https://etherpad.opendev.org/p/bastion-upgrade-nodes-2022-10 | 19:05 |
ianw | thanks for that -- i will be writing something up on that | 19:06 |
clarkb | I also need to review the virtualenv management change since that will ensure we have a working openstackclient for rax and others | 19:06 |
ianw | yeah a couple of changes are out there just to clean up some final things | 19:07 |
clarkb | also the zuul reboot playbook ran successfully off the new bridge | 19:07 |
frickler | ianw: are you o.k. with rebooting bridge01 after the openssl updates or is there some blocker for that? | 19:08 |
frickler | (I ran the apt update earlier already) | 19:08 |
clarkb | one thing to consider when doing ^ is if we have any infra prod jobs that we don't want to conflict with | 19:08 |
clarkb | but I'm not aware of any urgent jobs at the moment | 19:08 |
ianw | the gist is I think that we can get testing to use "bridge99.opendev.org" -- which is a nice proof that we're not hard-coding in references | 19:08 |
ianw | i think it's fine to reboot -- sorry i've been out last two days and not caught up but i can babysit it soonish | 19:09 |
clarkb | sounds good we can coordinate further after the meeting. | 19:09 |
clarkb | Anything else on this topic? | 19:09 |
ianw | nope, thanks for the reviews and getting it this far! | 19:10 |
clarkb | #topic Upgrading Bionic Servers | 19:10 |
clarkb | at this point I think we've largely sorted out the jammy related issues and we should be good to boot just about anything on jammy | 19:10 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862835/ Disable phased package updates | 19:11 |
clarkb | that is one remaining item though. Basically it says don't do phased updates which will ensure that our jammy servers all get the same packages at the same time | 19:11 |
clarkb | rather than staggering them over time. I'm concerned the staggering will just lead to confusion about whether or not a package is related to unexpected behaviors | 19:11 |
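For reference, a minimal sketch of how phased updates can be disabled on an Ubuntu host via an apt configuration drop-in; the file name is arbitrary and this may differ from what the actual change in 862835 does:

```shell
# Hedged sketch: always include phased updates so every host picks up the
# same package versions at the same time instead of being staggered.
# The option below exists in apt on Ubuntu 21.04 and later; file name is arbitrary.
cat <<'EOF' | sudo tee /etc/apt/apt.conf.d/99-always-include-phased-updates
APT::Get::Always-Include-Phased-Updates "true";
EOF
```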
clarkb | https://review.opendev.org/c/opendev/zone-opendev.org/+/862941 and its Depends-On are related to gitea-lb02 being brought up as a jammy node too (this is cleanup of old nodes) | 19:12 |
clarkb | Otherwise nothing extra to say. Just that we can (and probably should) do this for new servers and replacing old servers with jammy is extra good too | 19:12 |
clarkb | I'm hoping I'll have time later this week to replace another server (maybe one of the gitea backends) | 19:13 |
clarkb | #topic Removing snapd | 19:13 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862834 Change to remove snapd from our servers | 19:13 |
clarkb | after we discussed this in our last meeting I poked around on snapcraft and in ubuntu package repositories and I think there isn't much reason for us to have snapd installed on our servers | 19:13 |
clarkb | This change can and will affect a number of servers though so worth double checking. I haven't done an audit to see which would be affected but we could do that if we think it is necessary | 19:14 |
clarkb | to do my filtering I looked for snaps maintained by canonical on snapcraft to see which ones were likely to be useful for us. And many of them continue to have actual packages or aren't useful to servers | 19:15 |
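As a rough illustration only (not necessarily what 862834 does), auditing and removing snapd on a single host could look something like this:

```shell
# Hedged sketch: check whether any snaps are actually installed, then
# purge the snapd daemon if nothing depends on it.
snap list || true        # prints installed snaps, or a notice if there are none
sudo apt-get purge -y snapd
```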
clarkb | Reviews very much welcome | 19:15 |
clarkb | #topic Mailman 3 | 19:15 |
clarkb | Since our last meeting the upstream for the mailman 3 docker images did land my change to add lynx to the images | 19:16 |
clarkb | No responses on the other issues I filed though. | 19:16 |
clarkb | Unfortunately, I think this makes the question of whether or not we should fork more confusing, not easier. I'm leaning more towards forking at this point simply because I'm not sure how responsive upstream will be. But feedback there continues to be welcome | 19:16 |
clarkb | When fungi is back we should make a decision and move forward | 19:18 |
clarkb | #topic Updating base python docker images to use pip wheel | 19:18 |
clarkb | Upstream seems to be moving slowly on my bugfix PR. Some of that slowness is because changes to git that impacted their CI setup happened at the same time | 19:18 |
clarkb | Either way I think we should strongly consider their suggestion of using pip wheel though | 19:18 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862152 | 19:18 |
*** diablo_rojo_phone is now known as Guest182 | 19:19 | |
clarkb | There is a nodepool and a dib change too which help illustrate that this change functions and doesn't regress features like siblings installs | 19:19 |
clarkb | It should be a noop for us, but makes us more resilient to pip changes if/when they happen in the future. Reviews very much welcome on this as well | 19:19 |
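For context, the "pip wheel" approach being suggested is roughly a two-step flow: build wheels for everything first, then install only from those wheels. A hedged sketch with placeholder paths (not the actual base-image layout):

```shell
# Hedged sketch: build wheels for all requirements into a local directory...
pip wheel -r requirements.txt --wheel-dir /tmp/wheels
# ...then install strictly from those wheels, never reaching back to the index.
pip install --no-index --find-links /tmp/wheels -r requirements.txt
```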
clarkb | #topic Etherpad service logging | 19:20 |
clarkb | ianw: did you have time to write the change to update etherpad logging to syslog yet? | 19:20 |
ianw | oh no, sorry, totally distracted on that | 19:21 |
ianw | will do | 19:21 |
clarkb | thanks | 19:22 |
clarkb | unrelated to the logging issue I had to reboot etherpad after its db volume got remounted RO due to errors | 19:22 |
clarkb | after reboot it mounted the volume just fine as far as I could tell and things have been happy since yesterday | 19:22 |
clarkb | (just a heads up I don't think any action is necessary there) | 19:22 |
clarkb | #topic Unexpected Gerrit Reboot | 19:23 |
clarkb | This happened around 06:00 UTC ish today | 19:23 |
clarkb | basically looks like review02.o.o rebooted and when it came back it had no networking until ~13:00 UTC | 19:23 |
clarkb | we suspect something on the cloud side which would explain the lack of networking for some time as well. But we haven't heard back on that yet | 19:23 |
frickler | do we have some support contact at vexxhost other than mnaser__ ? | 19:24 |
clarkb | frickler: mnaser__ has been the primary contact. There have been others in the past but I don't think they are at vexxhost anymore | 19:24 |
clarkb | If we think it is important I can ask if anyone at the foundation has contacts we might try | 19:25 |
frickler | there's also the (likely unrelated) IPv6 routing issue, which I think is more important | 19:25 |
clarkb | but at this point things seem stable and we're mostly just interested in confirmation of our assumptions? Might be ok to wait a day | 19:25 |
ianw | one thing i noticed was that corvus i think had to start the container? | 19:25 |
clarkb | ianw: yes, our docker-compose file doesn't specify a restart policy which mimics the old pre docker behavior of not starting automatically | 19:26 |
clarkb | frickler: re ipv6 thats a good point. | 19:26 |
frickler | regarding manual start we assumed that that was intentional and agreeable behavior | 19:27 |
corvus | i did perform some sanity checks to make sure the server looked okay before starting | 19:27 |
corvus | (which is one of the benefits of that) | 19:27 |
ianw | something to think about -- but also this is the first case i can think of since we migrated the review host to vexxhost that there's been what seems to be instability beyond our control | 19:27 |
ianw | so yeah, no need to make urgent changes to handle unscheduled reboots at this point :) | 19:28 |
clarkb | considering that we seemed to lack network access anyway I'm not sure it's super important to auto restart based on this event | 19:28 |
clarkb | we would've waited either way | 19:28 |
frickler | the other thing worth considering is whether we want to have some local account to allow debugging via vnc | 19:28 |
corvus | but i think honestly the main reason we didn't have it start on boot is so that if we stopped a service manually it didn't restart automatically. that can be achieved with a "restart: unless-stopped" policy. so really, there are 2 reasons not to start on boot, and we can evaluate whether we still like one, the other, both, or none of them. | 19:28 |
frickler | since my first assumption was lack of network device caused by a kernel update | 19:29 |
clarkb | frickler: the way we would normally handle that today is via a rescue instance | 19:29 |
clarkb | when you rescue an instance with nova it shuts down the instance then boots another image and attaches the broken instance's disk to it as a device, which allows you to mount the partitions | 19:30 |
clarkb | it's a little bit of work, but the cases where we've had to resort to it are few and it's probably worth keeping our images as simple as possible without user passwords? | 19:30 |
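For reference, the rescue workflow described above maps onto a couple of openstack CLI calls; a hedged sketch with a placeholder server name (boot-from-volume rescue may need a newer compute API, as noted just below):

```shell
# Hedged sketch: boot the broken server into a rescue image with its original
# disk attached as a secondary device, inspect it, then boot back normally.
openstack server rescue broken-server01
# ...log in via the rescue image, mount and inspect the partitions...
openstack server unrescue broken-server01
```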
frickler | except for boot-from-volume instances, which seem to be a bit more tricky? | 19:30 |
ianw | have we ever done that with vexxhost? | 19:30 |
clarkb | frickler: oh is bfv different? | 19:30 |
frickler | at least it needs a recent compute api (>=ussuri iirc) | 19:31 |
clarkb | ianw: I'm not sure about doing it specifically in vexxhost. Testing it is a good idea I suppose before I/we declare it is good enough | 19:31 |
clarkb | my concern with passwords on instances is that we don't have central auth so rotating/changing/managing them is more difficult | 19:31 |
corvus | i love not having local passwords. i hope it is good enough. | 19:31 |
clarkb | ya I'd much rather avoid it if possible | 19:31 |
frickler | I was also wondering why we chose boot from volume, was that intentional? | 19:32 |
clarkb | I've made a note to myself to test instance rescue in vexxhost. Both bfv and not | 19:32 |
corvus | i have a vague memory that it might be a flavor requirement, but i'm not sure | 19:32 |
ianw | frickler: i'd have to go back and research, but i feel like it was a requirement of vexxhost | 19:33 |
clarkb | yes, I think at the time the current set of flavors had no disk | 19:33 |
clarkb | their latest flavors do have disk and can be booted without bfv | 19:33 |
ianw | heh, i think that's three vague memories, maybe that makes 1 real one :) | 19:33 |
clarkb | I booted gitea-lb02 without bfv (but uploaded the jammy image to vexxhost as raw allowing us to do bfv as well) | 19:33 |
clarkb | #action clarkb test vexxhost instance rescues | 19:34 |
clarkb | why don't we do that and come back to the idea of passwords for recovery once we know if ^ works | 19:34 |
clarkb | anything else on this subject? | 19:34 |
ianw | ++ | 19:34 |
frickler | ack | 19:34 |
clarkb | also I'll use throwaway test instances not anything prod like :) | 19:35 |
clarkb | #topic OpenSSL v3 | 19:35 |
clarkb | As everyone is probably aware, openssl v3 had a big security release today. It turned out to be a bit less scary than the initially shared CRITICAL severity led everyone to believe (they downgraded it to high) | 19:35 |
clarkb | Since all but two of our servers are too old to have openssl v3 we are largely unaffected | 19:36 |
clarkb | all in all the impact is far more limited than feared which is great | 19:36 |
clarkb | Also ubuntu seems to think the way they compile openssl with stack protections mitigates the RCE and this is only a DoS | 19:36 |
clarkb | #topic Upgrading Zookeeper | 19:37 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/863089 | 19:38 |
clarkb | I would like to upgrade zookeeper tomorrow | 19:38 |
clarkb | at first I thought that we could just let automation do it (which is still likely fine) but all the docs I can find suggest upgrading the leader which our automation isn't aware of | 19:39 |
clarkb | That means my plan is to stop ansible via the emergency file on zk04-zk06 and do them one by one. Followers first then the leader (currently zk05). Then merge that change and finally remove the hosts from the emergency file | 19:39 |
clarkb | if I could get reviews on the change and any concerns for that plan I'd appreciate it. | 19:39 |
clarkb | That said it seems like zookeeper upgrades if you go release to release are meant to be uneventful | 19:40 |
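As a side note, one way to confirm which member is currently the leader before the rolling restart is ZooKeeper's four-letter `srvr` command (whitelisted by default); a hedged sketch, assuming the standard client port and opendev.org hostnames:

```shell
# Hedged sketch: print each cluster member's current role (leader/follower).
for host in zk04 zk05 zk06; do
  echo -n "$host: "
  echo srvr | nc "$host.opendev.org" 2181 | grep Mode
done
```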
corvus | (upgrading the leader last i think you missed a word) | 19:40 |
clarkb | yup leader last I mean | 19:40 |
frickler | the plan sounds fine to me and I'll try to review by your morning | 19:40 |
corvus | i'll be around to help | 19:41 |
clarkb | thanks! | 19:41 |
clarkb | #topic Gitea Rebuild | 19:41 |
clarkb | There are golang compiler updates today as well and it seems worthwhile to rebuild gitea under them | 19:42 |
clarkb | I'll have that change up as soon as the meeting ends | 19:42 |
clarkb | I should be able to monitor that change as it lands and gets deployed today. But we should coordinate that with the bridge reboot | 19:42 |
clarkb | #topic Open Discussion | 19:43 |
ianw | ++ | 19:43 |
clarkb | It is probably worth mentioning that gitea as an upstream is going through a bit of a rough time. Their community has disagreements over the handling of trademarks and some individuals have talked about forking | 19:44 |
ianw | :/ | 19:44 |
clarkb | I've been trying to follow along as well as I can to understand any potential impact to us and I'm not sure we're at a point where we need to take a stance or plan to change anything | 19:44 |
clarkb | but it is possible that we'll be in that position whether or not we like it in the future | 19:44 |
ianw | on the zuul-sphinx bug that started occurring with the latest sphinx -- might need to think about how that works including files per https://sourceforge.net/p/docutils/bugs/459/ | 19:44 |
clarkb | Sounds like that may be it? | 19:49 |
clarkb | Everyone can have 10 minutes for breakfast/lunch/dinner/sleep :) | 19:50 |
clarkb | thank you all for your time and we'll be back here same time and location next week | 19:50 |
clarkb | #endmeeting | 19:50 |
opendevmeet | Meeting ended Tue Nov 1 19:50:23 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:50 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.html | 19:50 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.txt | 19:50 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-11-01-19.01.log.html | 19:50 |
*** Guest182 is now known as diablo_rojo_phone | 20:34 |