clarkb | Just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue May 23 19:01:19 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | Hello everyone (I expect a small group today, that's fine) | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GMCSR45YSJJUK3DNJYTPUI52L4BDP3BM/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | I guess a friendly reminder that the Open Infra summit is coming up in a few weeks | 19:02 |
clarkb | less than a month away now (like 3 weeks?) | 19:02 |
* fungi | sticks his head in the sand | 19:02 |
corvus | 2 more weeks till time to panic | 19:02 |
clarkb | #topic Migrating container images to quay.io | 19:03 |
clarkb | After our discussion last week I dove in and started poking at running services with podman in system-config and leaned on Zuul jobs to give us info back | 19:03 |
clarkb | The good news is that podman and buildkit seem to solve the problem with speculative testing of container images | 19:04 |
clarkb | (this was expected but good to confirm) | 19:04 |
clarkb | The bad news is that switching to podman isn't super straightforward, due to a number of smaller issues that add up in my opinion | 19:04 |
clarkb | Hurdles include: packaging for podman and friends isn't great until you get to Ubuntu Jammy or newer; podman doesn't support syslog logging, so we need to switch all our logging over to journald; and beyond simply checking that services can run under podman, we need to transition to it, which means stopping docker services and starting podman services in some sort of coordinated fashion. There are also questions about whether or not we should change which user runs services when moving to podman | 19:06 |
clarkb | I have yet to find a single blocker that would prevent us from doing the move, but I'm not confident we can do it in a short period of time | 19:06 |
clarkb | For this reason I brought this up yesterday in #opendev and basically said I think we should go with the skopeo workaround or revert to docker hub. In both cases we could then work forward to move services onto podman and either remove the skopeo workaround or migrate to quay.io at a later date | 19:07 |
clarkb | During that discussion the consensus seemed to be that we preferred reverting back to docker hub. Then we can move to podman, then to quay.io, and everything should be happy with us, unlike the current state | 19:07 |
fungi | that matches my takeaway | 19:08 |
clarkb | I wanted to bring this up more formally in the meeting before proceeding with that plan. Has anyone changed their mind or have new input etc? | 19:08 |
clarkb | if we proceed with that plan I think it will look very similar to the quay.io move. We want to move the base images first so that when we move the other images back they rebuild and see the current base images | 19:09 |
clarkb | One thing that makes this tricky is the sync of container images back to docker hub since I don't think I can use a personal account for that like I did with quay.io. But that is solvable | 19:09 |
clarkb | I'll also draft up a document like I did for the move to quay.io so that we can have a fairly complete list of tasks and keep track of it all. | 19:10 |
clarkb | I'm not hearing any objections or new input. In that case I plan to start on this tomorrow (I'd like to be able to be pretty head down on it, at least to start, to make sure I don't miss anything important) | 19:11 |
corvus | clarkb: sounds good to me | 19:11 |
fungi | thanks! | 19:11 |
tonyb | sounds good. Thank you clarkb for pulling all that together | 19:11 |
corvus | clarkb: i think you can just manually use the opendev zuul creds on docker? | 19:11 |
clarkb | corvus: yup | 19:12 |
clarkb | should be able to docker login with them during the period of time I need them | 19:12 |
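For illustration, a rough sketch of what that one-time copy back to Docker Hub could look like, assuming skopeo is used for the transfer and driven from a small Python wrapper; the image names and credential placeholder below are hypothetical, not the real list or account:

```python
# Sketch only: copy a handful of images from quay.io back to Docker Hub with
# skopeo. Image names are placeholders; real credentials would be the opendev
# Zuul account's, used only for the duration of the sync.
import subprocess

IMAGES = [
    "opendevorg/python-base:3.11-bookworm",     # hypothetical
    "opendevorg/python-builder:3.11-bookworm",  # hypothetical
]

def copy_back(image: str, dest_creds: str) -> None:
    """Copy all architectures of one image from quay.io to docker.io."""
    subprocess.run(
        [
            "skopeo", "copy", "--all",
            "--dest-creds", dest_creds,  # "username:password"
            f"docker://quay.io/{image}",
            f"docker://docker.io/{image}",
        ],
        check=True,
    )

if __name__ == "__main__":
    creds = "REPLACE_WITH_DOCKERHUB_CREDS"
    for img in IMAGES:
        copy_back(img, creds)
```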
corvus | also, it seems like zuul can continue moving forward with quay.io | 19:12 |
corvus | since it's isolated from the opendev/openstack tenant | 19:12 |
corvus | so it can be the advance party | 19:12 |
clarkb | agreed, and the podman moves there are isolated to CI jobs which are a lot easier to transition. They either work or they don't (and as of a few minutes ago I think they are working) | 19:13 |
corvus | i'm in favor of zuul continuing with that, iff the plan is for opendev to eventually move. | 19:13 |
corvus | i'd like us to be aligned long-term, one way or the other | 19:13 |
clarkb | I'd personally like to see opendev move to podman and quay.io. I think for long term viability the extra functionality is going to be useful | 19:13 |
corvus | ++ | 19:14 |
fungi | same | 19:14 |
clarkb | both tools have extra features that enable things like per image access controls in quay.io, speculative gating etc | 19:14 |
clarkb | I'm just acknowledging it will take time | 19:14 |
corvus | also, we should be able to move services to podman one at a time, then once everything is on podman, make the quay switch | 19:14 |
clarkb | ++ | 19:15 |
clarkb | servers currently on Jammy would be a good place to start since jammy has podman packages built in | 19:16 |
clarkb | currently I think that is gitea, gitea-lb and etherpad? | 19:16 |
clarkb | alright I probably won't get to that today since I've got other stuff in flight already, but hope to focus on this tomorrow. I'll ping for reviews :) | 19:16 |
clarkb | #topic Bastion Host Changes | 19:17 |
clarkb | #link https://review.opendev.org/q/topic:bridge-backups | 19:17 |
clarkb | this topic still needs reviews if anyone has time | 19:17 |
clarkb | otherwise I am not aware of anything new | 19:17 |
fungi | lists01 is also jammy | 19:18 |
clarkb | ah, yup | 19:18 |
clarkb | #topic Mailman 3 | 19:18 |
clarkb | fungi: any new movement with mailman3 things, speaking of lists01? | 19:18 |
fungi | i'm fiddling with it now, current held node is 173.231.255.71 | 19:18 |
clarkb | anything we can do to help? | 19:19 |
fungi | with the current stack the default mailman hostname is no longer one of the individual list hostnames, which gives us some additional flexibility | 19:19 |
clarkb | fungi: we probably want to test the transition from our current state to that state as well? | 19:20 |
clarkb | (I'm not sure if that has been looked at yet) | 19:20 |
fungi | i'm still messing with the django commands to see how we might automate the additional django site to postorius/hyperkitty hostname mappings | 19:20 |
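As a rough sketch of the sort of Django-level mapping being discussed (assuming django_mailman3's MailDomain model is what ties a mail domain to a Django Site; hostnames here are illustrative and the real thing would be driven from our configuration management):

```python
# Sketch (e.g. via "manage.py shell" in the mailman-web container): give each
# list hostname its own Django Site and associate the mail domain with it, so
# Postorius/Hyperkitty can serve multiple vhosts. Assumes django_mailman3's
# MailDomain model; hostnames are illustrative only.
from django.contrib.sites.models import Site
from django_mailman3.models import MailDomain

LIST_HOSTS = ["lists.opendev.org", "lists.example.org"]  # illustrative

for host in LIST_HOSTS:
    site, _ = Site.objects.get_or_create(domain=host, defaults={"name": host})
    MailDomain.objects.get_or_create(mail_domain=host, defaults={"site": site})
```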
fungi | but yes, that should be fairly testable manually | 19:20 |
fungi | for a one-time transition it probably isn't all that useful to build automated testing | 19:21 |
fungi | but ultimately it's just a few db entries changing | 19:21 |
clarkb | ya not sure it needs to be automated. More just tested | 19:21 |
fungi | agreed | 19:22 |
clarkb | similar to how we've checked the gerrit upgrade rollback procedures | 19:22 |
clarkb | Sounds good. Let us know if we can help | 19:22 |
fungi | folks are welcome to override their /etc/hosts to point to the held node and poke around the webui, though this one needs data imported still | 19:22 |
fungi | and there are still things i haven't set, so that's probably premature anyway | 19:23 |
clarkb | ok I've scribbled a note to do that if I have time. Can't hurt anyway | 19:23 |
fungi | exactly | 19:23 |
fungi | thanks! | 19:23 |
clarkb | #topic Gerrit leaked replication task files | 19:24 |
clarkb | This is ongoing. No movement upstream in the issues I've filed. The number of files is growing at a steady but manageable rate | 19:24 |
clarkb | I'm honestly tempted to undo the bind mount for this directory and go back to potentially lossy replication though | 19:24 |
clarkb | I've been swamped with other stuff though and since the rate hasn't grown in a scary way I've been happy to deprioritize looking into this further | 19:25 |
clarkb | Maybe late this week / early next week I can put a java developer hat on and see about fixing it though | 19:25 |
clarkb | #topic Upgrading Old Servers | 19:26 |
clarkb | similar to the last item I haven't had time to look at this recently. I don't think anyone else has either based on what I've seen in gerrit/irc/email | 19:26 |
fungi | i have not, sorry | 19:26 |
clarkb | This should probably be a higher priority than playing java dev though so maybe I start here when I dig out of my hole | 19:27 |
clarkb | if anyone else ends up with time please jump on one of these. It is a big help | 19:27 |
clarkb | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes | 19:27 |
clarkb | #topic Fedora Cleanup | 19:27 |
clarkb | This topic was the old openafs utilization topic. But I pushed a change to remove unused fedora mirror content and that freed up about 200GB and now we are below 90% utilization | 19:28 |
tonyb | yay | 19:28 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IOYIYWGTZW3TM4TR2N47XY6X7EB2W2A6/ Proposed removal of fedora mirrors | 19:28 |
clarkb | No one has said please keep fedora we need it for X, Y or Z | 19:28 |
clarkb | corvus: ^ I actually recently remembered that the nodepool openshift testing uses fedora but I don't think it needs to | 19:28 |
clarkb | openshift itself is installed on centos and then nodepool is run on fedora. It could probably even be a single node job, or the fedora node could be replaced with anything else | 19:29 |
clarkb | either way I think the next step here is removing fedora mirror config from our jobs so that jobs on fedora talk upstream | 19:29 |
tonyb | I started looking at pulling the setup of f36 out of zuul, but that's a little bigger than expected so we don't break existing users outside of here | 19:29 |
clarkb | then we can remove all fedora mirroring and then we can work on clearing out fedora jobs and the nodes themselves | 19:30 |
fungi | ianw has a (failing/wip) change to remove our ci mirrors from fedora builds in dib | 19:30 |
clarkb | tonyb: oh cool | 19:30 |
fungi | #link https://review.opendev.org/883798 "fedora: don't use CI mirrors" | 19:30 |
clarkb | fungi: that change should pass as soon as the nodepool podman change merges | 19:30 |
corvus | clarkb: i agree, it should be able to be replaced | 19:30 |
clarkb | tonyb: is the issue that we assume all nodes should have a local mirror configured? I wonder how we are handling that with rocky. Maybe we just ignored rocky in that role? | 19:30 |
fungi | clarkb: there are fedora-specific tests in dib which are currently in need of removal for that change to pass, it looked like to me | 19:31 |
tonyb | yeah rocky is ignored | 19:31 |
tonyb | my initial plan was to just pull fedora but then I realised if there were non-OpenDev users of zuul that did care about fedora mirroring they'd get broken | 19:32 |
clarkb | fungi: ah | 19:32 |
tonyb | so now I'm working on adding a flag to the base zuul jobs to say ... skip fedora | 19:32 |
fungi | oh, never mind. now the failures i'm looking at are about finding files on disk, so maybe. (or maybe i'm looking at different errors now) | 19:33 |
clarkb | tonyb: got it. That does pose a problem. If you add flags it might be good to add flags for everything too for consistency | 19:33 |
tonyb | but trying to do that well and struggling with perfect being the enemy of good | 19:33 |
clarkb | tonyb: but getting something up so that the zuul community can weigh in is probably best before over engineering | 19:33 |
clarkb | ++ | 19:33 |
tonyb | clarkb: yeah doing it for everything was the plan | 19:33 |
fungi | but yeah, the dib-functests failure doesn't appear on the surface to be container-toolchain-related | 19:34 |
tonyb | kinda adding a distro-enable-mirror style flag | 19:34 |
tonyb | I'll post a WIP for a more concrete discussion | 19:34 |
clarkb | sounds good | 19:35 |
clarkb | #topic Quo Vadis Storyboard | 19:35 |
clarkb | Anything new here? | 19:35 |
clarkb | I don't have anything new. ETOOMANYTHINGS | 19:35 |
fungi | nor i | 19:36 |
fungi | well, the spamhaus blocklist thing | 19:36 |
fungi | apparently spamhaus css has listed the /64 containing storyboard01's ipv6 address | 19:36 |
clarkb | oh ya we should probably talk to openstack about that. tl;dr is cloud providers (particularly public cloud providers) should assign no less than a /64 to each instance | 19:36 |
fungi | due to some other tenant in rackspace sending spam from a different address in that same network | 19:37 |
fungi | they processed my removal request, but then immediately re-added it because of the problem machine that isn't ours | 19:37 |
fungi | #link https://storyboard.openstack.org/#!/story/2010689 "Improve email deliverabilty for storyboard@storyboard.openstack.org, some emails are considered SPAM" | 19:37 |
fungi | there was also a related ml thread on openstack-discuss, which one of my story comments links to a message in | 19:38 |
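For reference, a listing like this can be checked with a plain DNSBL lookup; a minimal sketch using dnspython (an assumption, it may not be installed on the server), with a placeholder address rather than storyboard01's real one:

```python
# Sketch of a Spamhaus DNSBL check. Note that Spamhaus reportedly does not
# answer queries routed through large public resolvers, so this would need to
# run against a resolver with its own egress address.
import ipaddress

import dns.resolver

def is_listed(addr: str, zone: str = "zen.spamhaus.org") -> bool:
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        nibbles = ip.exploded.replace(":", "")
        query = ".".join(reversed(nibbles)) + "." + zone
    else:
        query = ".".join(reversed(addr.split("."))) + "." + zone
    try:
        answers = dns.resolver.resolve(query, "A")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    # Listings come back as 127.0.0.x return codes (127.0.0.3 is the CSS list).
    return any(str(r).startswith("127.0.0.") for r in answers)

print(is_listed("2001:db8::1"))  # placeholder, not the real server address
```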
corvus | i know you can tell exim not to use v6 for outgoing; it may be possible to tell it to prefer v4... | 19:38 |
fungi | right, that's a variation of what i was thinking in my last comment on the story | 19:38 |
fungi | and less hacky | 19:39 |
fungi | though still definitely hacky | 19:39 |
fungi | anyway, that's all i really had storyboard-related this week | 19:39 |
clarkb | thanks! | 19:39 |
clarkb | #topic Open Discussion | 19:39 |
clarkb | Anything else? | 19:39 |
fungi | there were a slew of rackspace tickets early today about server outages... pulling up the details | 19:40 |
fungi | zm06, nl04, paste01 | 19:41 |
fungi | looks like they were from right around midnight utc | 19:41 |
fungi | er, no, 04:00 utc | 19:41 |
fungi | anyway, if anyone spots problems with those three servers, that's likely related | 19:42 |
clarkb | only paste01 is a singleton that would really show up as a problem. Might be worth double checking the other two | 19:43 |
fungi | all three servers have a ~16-hour uptime right now | 19:43 |
fungi | so they at least came back up and are responding to ssh | 19:43 |
fungi | host became unresponsive and was rebooted in all three cases (likely all on the same host) | 19:44 |
clarkb | zm06 did not restart its merger | 19:45 |
clarkb | nl04 did restart its launcher | 19:45 |
clarkb | I can restart zm06 after lunch | 19:45 |
clarkb | thank you for calling that out | 19:45 |
corvus | we have "restart: on-failure"... | 19:45 |
fungi | i'll go ahead and close out those tickets on the rackspace end | 19:46 |
corvus | and services normally restart after boots | 19:46 |
fungi | also they worked two tickets i opened yesterday about undeletable servers, most were in the iad region | 19:46 |
corvus | so i wonder why zm06 didn't? | 19:46 |
fungi | in total, freed up 118 nodes worth of capacity | 19:46 |
clarkb | corvus: maybe hard reboot under the host doesn't count as failure? | 19:46 |
fungi | in the twisted land of systemd | 19:47 |
corvus | May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.636033310Z" level=info msg="Removing stale sandbox 88813ad85aa8c751aa92b684e64e1ea7f2e9f2e9c8209ce79bfcf9fe18ee77e7 (05ce3156e52feef4e8e78ad8015aabba477f7d926635e8bf59534a8294d44559)" | 19:47 |
corvus | May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.638772820Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint f128a3c81341d323dfbcbb367224049ec2aac64f10af50e702b3f546f8f09a6c 6e0ac20c216668af07c98111a59723329046fab81fde48b45369a1f7d088ffeb], retrying...." | 19:47 |
corvus | i have no idea what either of those mean | 19:47 |
corvus | other than that, i don't see any logs about docker's decision making there | 19:48 |
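One way to dig further is to compare the container's restart policy with the state docker recorded for it; a quick sketch using docker inspect from Python, where the container name is a guess at the compose default rather than the verified one:

```python
# Sketch: dump a container's restart policy and last recorded state to see why
# it was not brought back after the host reboot. The container name below is a
# hypothetical docker-compose default, not confirmed.
import json
import subprocess

def inspect(container: str) -> dict:
    out = subprocess.run(
        ["docker", "inspect", container],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]

info = inspect("zuul-merger_merger_1")  # hypothetical name
policy = info["HostConfig"]["RestartPolicy"]
state = info["State"]
print("restart policy:", policy["Name"], policy["MaximumRetryCount"])
print("status:", state["Status"],
      "exit code:", state["ExitCode"],
      "finished at:", state["FinishedAt"])
```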
clarkb | probably fine to start things back up again then and see if it happens again in the future? I guess if we really want we can do a graceful reboot and see if it comes back up | 19:49 |
clarkb | Anything else? Last call; I can give you all 10 minutes back for breakfast/lunch/dinner :) | 19:49 |
fungi | i'm set | 19:50 |
clarkb | Thank you everyone. I expect we'll be back here next week and the week after. But then we'll skip on june 13th due to the summit | 19:50 |
clarkb | #endmeeting | 19:51 |
opendevmeet | Meeting ended Tue May 23 19:51:03 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:51 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.html | 19:51 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.txt | 19:51 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.log.html | 19:51 |
fungi | thanks clarkb! | 19:52 |