*** openstackstatus has quit IRC | 01:20 | |
*** openstack has joined #opendev-meeting | 01:22 | |
*** ChanServ sets mode: +o openstack | 01:22 | |
*** sboyron has joined #opendev-meeting | 07:58 | |
*** hashar has joined #opendev-meeting | 08:00 | |
*** hashar is now known as hasharAway | 11:44 | |
*** hasharAway is now known as hashar | 12:35 | |
*** hashar is now known as hasharAway | 15:27 | |
*** hasharAway is now known as hashar | 15:58 | |
*** hashar is now known as hasharAway | 18:23 | |
clarkb | Anyone else here for the meeting? we will get started shortly | 18:59 |
ianw | o/ | 19:00 |
fungi | yep | 19:01 |
clarkb | #startmeeting infra | 19:01 |
openstack | Meeting started Tue Feb 9 19:01:16 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
*** openstack changes topic to " (Meeting topic: infra)" | 19:01 | |
openstack | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000180.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
*** openstack changes topic to "Announcements (Meeting topic: infra)" | 19:01 | |
clarkb | I had no announcements | 19:01 |
clarkb | #topic Actions from last meeting | 19:01 |
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)" | 19:01 | |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-02-19.01.txt minutes from last meeting | 19:02 |
clarkb | I had an action to start writing down a xenial upgrade todo list. | 19:02 |
clarkb | #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades | 19:02 |
clarkb | I started there, it is incomplete, but figured starting with something that we can all put notes on was better than waiting for perfect | 19:02 |
clarkb | ianw also had an action to followup with wiki backups. Any update on that? | 19:03 |
ianw | yes, i am getting closer :) | 19:03 |
ianw | do you want to talk about pruning now? | 19:03 |
clarkb | lets pick that up later | 19:04 |
clarkb | #topic Priority Efforts | 19:04 |
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)" | 19:04 | |
clarkb | #topic OpenDev | 19:04 |
*** openstack changes topic to "OpenDev (Meeting topic: infra)" | 19:04 | |
clarkb | I have continued to make progress (though it feels slow) on the gerrit account situation | 19:04 |
clarkb | 11 more accounts with preferred emails lacking external ids have been cleaned up. The bulk of these were simply retired. But one example for tbachman's accounts was a good learning experience | 19:05 |
clarkb | With tbachman there were two accounts. An active one that had preferred email set and no external id for that email and another inactive account with the same preferred email and external ids to match | 19:06 |
clarkb | tbachman said the best thing for them was to update the preferred email to a current email address. We tested this on review-test and tbachman was able to fix things on their end. The update was then made on the prod server | 19:06 |
clarkb | To avoid confusion with the other unused account I set it inactive | 19:07 |
clarkb | The important bit of news here is that users can actually update things themselves within the web ui and don't need us to intervene for this situation. They just need to update their preferred email address to be one of the actual email addresses further down in the settings page | 19:07 |
clarkb | I have also begun looking at the external id email conflicts. This is where two or more different accounts have external ids for the same email address | 19:08 |
clarkb | The vast majority of these seem to be accounts where one is clearly the account that has been used and the other is orphaned | 19:08 |
clarkb | for these cases I think we retire the orphaned account then remove the external ids associated with that account that conflict. The order here is important to ensure we don't generate a bunch of new "preferred email doesn't have an external id" errors | 19:09 |
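A minimal sketch of that retire-then-clean-up order using Gerrit's SSH admin interface; the account id is a placeholder, and the conflicting external ids themselves are removed by committing edits to a checkout of the external-ids ref (as described later in the meeting) rather than with a single built-in command:

    # step 1: retire (deactivate) the orphaned account -- 12345 is an illustrative id
    ssh -p 29418 review.opendev.org gerrit set-account --inactive 12345
    # step 2: remove that account's conflicting external ids by editing a
    # checkout of the external-ids ref and committing the fix back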
clarkb | There are a few cases where both accounts have been used and we may need to use our judgement or perhaps disable both accounts and let the user come to us with problems if they are still around (however most of these seem to be from years ago) | 19:10 |
clarkb | I suspect that the vast majority of users who are active and have these problems have reached out to us to help fix them | 19:10 |
clarkb | Where I am struggling is that I am finding it hard to automate the classification aspects. I have automated a good chunk of the data pulling but there is a fair bit of judgement in "what do we do next" | 19:11 |
clarkb | if others get a chance maybe they can take a look at my notes on review-test and see if any improvements to process or information gathering stand out. I'd also be curious if people think I've proposed invalid solutions to the issues | 19:11 |
clarkb | we don't need to go through that here though, can do that outside of meetings | 19:12 |
clarkb | As a reminder the workaround in the short term is to make changes with gerrit offline then reindex accounts (and groups?) with gerrit offline | 19:12 |
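For reference, a sketch of that offline workaround; the paths assume the usual Gerrit site layout and are illustrative:

    # with the gerrit service stopped, rebuild the affected indexes
    java -jar /var/gerrit/bin/gerrit.war reindex --index accounts -d /var/gerrit
    java -jar /var/gerrit/bin/gerrit.war reindex --index groups -d /var/gerrit
    # then start gerrit again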
clarkb | I'm hoping we can fix all these issues without ever doing that, but that option is available if we run into a strong need for it | 19:13 |
clarkb | As far as next steps go I'll continue to classify things in my notes on -test and if others agree the proposed plans there seem valid I should make a checkout of the external ids on review and then start committing those fixes | 19:13 |
clarkb | then if we do have to take a downtime we can get as many fixes as are already prepared in too | 19:14 |
clarkb | Next up is a pointer to my gerrit 3.3 image build changes | 19:14 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/765021 Build 3.3 images | 19:14 |
clarkb | reviews appreciated. | 19:14 |
clarkb | And that takes us to the gitea OOM'ing from last week | 19:14 |
clarkb | we had to add richer logging to apache so that we had source connection port for the haproxy -> apache connections. We haven't seen the issue return so haven't really had any new data to debug aiui | 19:15 |
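The richer apache logging boils down to recording the client (haproxy) source port in the access log so entries can be matched against haproxy's connection logs; a hedged sketch, not the exact format that merged:

    # %{remote}p is the source port of the haproxy -> apache connection
    LogFormat "%h:%{remote}p %l %u %t \"%r\" %>s %b" combined_srcport
    CustomLog /var/log/apache2/gitea-ssl-access.log combined_srcport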
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/774023 Rate limiting framework change for haproxy. | 19:15 |
clarkb | I also put up an example of what haproxy tcp connection based rate limits might look like. I think the change as proposed would completely break users behind corporate NAT though | 19:15 |
clarkb | so the change is WIP | 19:15 |
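A rough sketch of the kind of per-source tcp connection rate limit that change explores; the table size, window, threshold, and backend name are illustrative assumptions rather than the proposed values, and as noted this approach would also throttle many users sharing one corporate NAT address:

    frontend balance_git_https
        bind :443
        # track each source ip's connection rate over a 60 second window
        stick-table type ip size 100k expire 120s store conn_rate(60s)
        tcp-request connection track-sc0 src
        # reject sources opening more than 100 tcp connections per minute
        tcp-request connection reject if { sc0_conn_rate gt 100 }
        # assumes the existing gitea backend is defined elsewhere
        default_backend gitea_servers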
clarkb | fungi: ianw anything else to add re Gitea OOMs? | 19:16 |
fungi | i'm already finding it hard to remember last week. that's not good | 19:16 |
ianw | yeah, i don't think we really found a smoking gun, it just sort of went away? | 19:17 |
clarkb | ya it went away and by the time we got better logging in place there wasn't much to look at | 19:17 |
clarkb | I guess we keep our eyes open and use better logging next time around if it happens again. Separately maybe take a look at haproxy rate limiting and decide if we want to implement some version of that? | 19:17 |
clarkb | (the trick is going to be figuring out what a valid bound is that doesn't just break all the corporate NAT users) | 19:18 |
clarkb | sounds like that may be it let's move on | 19:19 |
clarkb | #topic Update Config Management | 19:19 |
*** openstack changes topic to "Update Config Management (Meeting topic: infra)" | 19:19 | |
clarkb | There are OpenAFS and refstack ansible (and docker in the case of refstack) efforts underway. | 19:19 |
clarkb | I also saw mention that launch node may not be working? | 19:19 |
ianw | launch node was working for me yesterday (i launched a refstack) ... but openstack client on bridge isn't | 19:20 |
clarkb | oh I see I think I mixed up launch node and openstackclient | 19:20 |
fungi | problems with latest openstackclient (or sdk?) talking to rackspace's api | 19:20 |
ianw | well, it can't talk to rax anyway. i didn't let myself yak shave, fungi had a bit of a look too | 19:20 |
clarkb | ianw: I've got an older openstackclient in a venv in my homedir that I use to cross check against clouds when that happens | 19:20 |
clarkb | basically to answer the question of "does this work if we use old osc" | 19:21 |
ianw | yeah, same, and my older client works | 19:21 |
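A sketch of the older-client cross-check being described here, assuming an illustrative version pin and cloud name:

    # build a throwaway venv with a known-older openstackclient release
    python3 -m venv ~/old-osc
    ~/old-osc/bin/pip install 'python-openstackclient==5.4.0'
    # if this works while the current client fails, the regression is in the
    # newer client/sdk stack rather than in the cloud itself
    ~/old-osc/bin/openstack --os-cloud openstackci-rax server list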
fungi | problem is the exception isn't super helpful because it's masked by a retry | 19:21 |
fungi | so the exception is that the number of retries was exceeded | 19:21 |
fungi | and it (confusingly) complains about host lookup failing | 19:21 |
clarkb | did osc drop keystone api v2 support? | 19:22 |
clarkb | that might be something to check? | 19:22 |
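One hedged way to get past the retry-masked failure and see which keystone endpoint and auth version are actually being used (the cloud name is again illustrative):

    # --debug dumps the keystoneauth requests and the real exception hidden
    # behind the "retries exceeded" wrapper
    openstack --os-cloud openstackci-rax --debug token issue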
fungi | if mordred gets bored he might be interested in looking at that failure case | 19:22 |
clarkb | I can probably take a look later today after lunch and bike ride stuff. Would be a nice change of pace from staring at gerrit accounts :) | 19:22 |
clarkb | let me know if that would be helpful | 19:22 |
fungi | but it probably merits being brought up in #openstack-sdk if it hasn't been already | 19:22 |
mordred | fungi: what did I do? | 19:23 |
fungi | mordred: you totally broke rackspace ;) | 19:23 |
fungi | not really | 19:23 |
mordred | ah - joy | 19:23 |
fungi | just thought you might be interested that latest openstacksdk is failing to talk to rackspace's keystone | 19:24 |
mordred | that's exciting | 19:24 |
fungi | using older openstacksdk works, so that's how we got around it in the short term | 19:24 |
fungi | well, an older openstacksdk install, so also older dependencies. it could be any of a number of them | 19:25 |
clarkb | ianw: I've got openafs and refstack as separate agenda items, should we just go over them here or move on and catch up under proper topic headings? | 19:25 |
ianw | up to you | 19:25 |
clarkb | #topic General topics | 19:26 |
*** openstack changes topic to "General topics (Meeting topic: infra)" | 19:26 | |
clarkb | #topic OpenAFS Cluster Status | 19:26 |
mordred | fungi: I'll take a look - the only thing that would be likely to have an impact would be keystoneauth | 19:26 |
*** openstack changes topic to "OpenAFS Cluster Status (Meeting topic: infra)" | 19:26 | |
clarkb | I don't think I saw any movement on this but wanted to double check. The fileservers are upgraded to 1.8.6 but not the db servers? | 19:26 |
ianw | the openafs status is that all servers/db servers are running 1.8.6-5 | 19:26 |
clarkb | oh nice the db servers got upgraded too. Excellent. Thank you for working on that | 19:27 |
clarkb | next steps there are to do the server upgrades then? | 19:27 |
clarkb | I've got them on my initial pass of a list for server upgrades too | 19:27 |
ianw | yep; so next i'll try an in-place focal upgrade, probably on one of the db servers first as they're small, and start that process | 19:27 |
clarkb | great, thanks again | 19:28 |
clarkb | #topic Refstack upgrade and container deployment | 19:28 |
*** openstack changes topic to "Refstack upgrade and container deployment (Meeting topic: infra)" | 19:28 | |
ianw | i got started on this | 19:29 |
ianw | there's a couple of open reviews in https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) to add the production deployment jobs | 19:29 |
clarkb | is there a change to add a server to inventory yet? I suppose for this server we won't have dns changes as dns will be updated via rax | 19:29 |
ianw | yeah i merged that yesterday | 19:30 |
clarkb | #link https://review.opendev.org/q/topic:%22refstack%22+(status:open%20OR%20status:merged) Refstack changes that need review | 19:30 |
ianw | if we can just double check those jobs, i can babysit it today | 19:30 |
clarkb | cool I can take a look at those really quickly after the meeting I bet | 19:30 |
fungi | SotK: ^ that may also make a good example for doing the storyboard deployment | 19:30 |
clarkb | ++ | 19:30 |
ianw | then have to look at the db migration; the old server seemed to use a trove database while we're running it from a container now | 19:30 |
clarkb | ya I expect we'll restore from a dump for now to test that things work, then schedule a downtime so that we can stop refstack properly, do a dump, restore from that, then start on the new server with dns updates | 19:31 |
clarkb | and kopecmartin volunteered to test the service that has been newly deployed which will go a long way as I don't even know how to interact with it properly | 19:32 |
clarkb | Anything else to add on this topic? | 19:32 |
ianw | yep, there's terse notes at | 19:32 |
ianw | #link https://etherpad.opendev.org/p/refstack-docker | 19:32 |
ianw | other than that no | 19:32 |
clarkb | thank you everyone who helped move this along | 19:32 |
clarkb | #topic Bup and Borg Backups | 19:33 |
*** openstack changes topic to "Bup and Borg Backups (Meeting topic: infra)" | 19:33 | |
clarkb | ianw feel free to give us an update on borg db streaming and pruning and all other new info | 19:33 |
ianw | the streaming bit seems to be going well | 19:34 |
ianw | modulo of course mysqldump --all-databases stopping actually dumping all databases with a recent update | 19:34 |
clarkb | but it does still work if you specify the databases explicitly | 19:35 |
ianw | #link https://bugs.launchpad.net/ubuntu/+source/mysql-5.7/+bug/1914695 | 19:35 |
openstack | Launchpad bug 1914695 in mysql-5.7 (Ubuntu) "mysqldump --all-databases not dumping any databases with 5.7.33" [Undecided,New] | 19:35 |
clarkb | (which is the workaround we're going with?) | 19:35 |
fungi | also there was some unanticipated fallout from the bup removal | 19:35 |
ianw | nobody else has commented or mentioned anything in this bug, and i can't find anything in the mysql bug thing (though it's a bit of a mess) and i don't know how much more effort we want to spend on it, because it's talking to a 5.1 server in our case | 19:36 |
fungi | apparently apt-get thought bup was the only reason we wanted pymysql installed on the storyboard server, so when bup got uninstalled so did the python-pymysql package. hilarity ensued | 19:36 |
clarkb | mordred: ^ possible you may be interested? but ya I think our workaround is likely sufficient | 19:36 |
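The workaround amounts to enumerating the databases in the backup command rather than relying on --all-databases; a minimal sketch with placeholder database names and output path:

    # instead of: mysqldump --opt --all-databases
    mysqldump --opt --databases db_one db_two | gzip > /var/backups/mysql-backup.sql.gz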
ianw | I also realised some things about borg's append-only model and pruning that are explained in their docs, if you read them the right way | 19:38 |
ianw | i've put up some reviews at | 19:39 |
ianw | #link https://review.opendev.org/q/topic:%22backup-more-prune%22+status:open | 19:39 |
ianw | that provides a script to do manual prunes of the backups, and a cron job to warn us via email when the backup partitions are looking full | 19:39 |
ianw | i think that is the best way to manage things for now | 19:39 |
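A sketch of the shape of those two pieces; retention values, paths, threshold, and mail address are illustrative assumptions, not the exact scripts under review:

    #!/bin/bash
    # manual prune of one client's borg repository, run on the backup server;
    # note that in append-only repos space is only reclaimed once compaction runs
    borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 12 /opt/backups/borg-review01

    # cron-driven warning when the backup partition is getting full
    used=$(df --output=pcent /opt/backups | tail -1 | tr -d ' %')
    if [ "$used" -gt 90 ]; then
        echo "backup volume at ${used}%" | mail -s "backup server disk warning" infra-root@example.org
    fi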
clarkb | ianw: that seems like a good compromise, similar to how the certchecker reminded us to go buy new certs when we weren't using LE | 19:40 |
ianw | i think the *best* way would be to have rolling LVM snapshots implemented on the backup server | 19:40 |
ianw | but i think it's more important to just get running 100% with borg in a stable manner first | 19:41 |
clarkb | ++ | 19:41 |
ianw | so yeah, basically request for reviews on the ideas presented in those changes | 19:42 |
clarkb | thank you for sticking to this. Its never an easy thing to change, but helps enable upgrades to focal and beyond for a number of services | 19:42 |
ianw | but i think we've got it working at a stable working set. some things we can't avoid like the review backups being big diffs due to git pack file updates | 19:42 |
clarkb | we could stop packing but then gerrit would get slow | 19:42 |
clarkb | Anything else on this or should we move on? | 19:43 |
fungi | we could "backup" git repositories via replication rather than off the fs? | 19:43 |
fungi | though what does the replication in that case? | 19:43 |
clarkb | fungi: the risk with that is a corrupted repo wouldn't be able to roll back easily | 19:44 |
fungi | yeah | 19:44 |
clarkb | with proper backups we can go to an old state | 19:44 |
fungi | well, assuming the repository was not mid-write when we backed it up | 19:44 |
ianw | yep, and although the deltas take up a lot of space, the other side is they do prune well | 19:44 |
clarkb | I think git is pretty good about that | 19:44 |
clarkb | basically git does order of operations to make backups like that mostly work aiui | 19:45 |
clarkb | Alright lets move on as we have a few more topics to cover | 19:45 |
clarkb | #topic Xenial Server Upgrades | 19:46 |
*** openstack changes topic to "Xenial Server Upgrades (Meeting topic: infra)" | 19:46 | |
clarkb | #link https://etherpad.opendev.org/p/infra-puppet-conversions-and-xenial-upgrades | 19:46 |
clarkb | this has sort of been in continual progress over time, but as xenial eol approaches I think we should capture what remains and start prioritizing things | 19:46 |
clarkb | I've started to write down a partial list in that etherpad | 19:46 |
clarkb | I'm hoping that I might have time next week to start doing rolling replacements of zuul-mergers, zuul-executors, and nodepool-launchers | 19:47 |
clarkb | my idea there was to redeploy one of each on focal and we can check everything is happy with the switch, then roll through the others in each group | 19:47 |
clarkb | If you've got ideas on priorities or process/method/etc feel free to add notes to that etherpad | 19:48 |
clarkb | #topic Meetpad Audio Stopped Working | 19:48 |
*** openstack changes topic to "Meetpad Audio Stopped Working (Meeting topic: infra)" | 19:48 | |
clarkb | Late last week a few of us noticed that meetpad's audio wasn't working. By the time I got around to trying it again in order to look at it this week it was working | 19:49 |
fungi | yeah, it seems to be working fine today | 19:49 |
fungi | used it for a while | 19:49 |
clarkb | Last week I had actually tried using the main meet.jit.si service as well and had problems with it too. I suspect that we may have deployed a bug then deployed the fix all automatically | 19:49 |
clarkb | This reminds me that I think corvus has mentioned we should be able to unfork one of the images we are running too | 19:50 |
*** diablo_rojo has joined #opendev-meeting | 19:50 | |
clarkb | it is possible that having a more static image for one of the services could have contributed as well | 19:50 |
* diablo_rojo appears suuuuper late | 19:50 | |
clarkb | corvus: ^ is it just a matter of replacing the image in our docker-compose? | 19:50 |
corvus | ohai | 19:51 |
corvus | clarkb: everything except -web is unpinned i think | 19:52 |
corvus | -web is the fork | 19:52 |
clarkb | and to unfork web we just update our docker-compose file? maybe set some new settings? | 19:52 |
corvus | i don't think we'd be updating/restarting any of those automatically | 19:52 |
clarkb | corvus: I think we may do a docker-compose pull && docker-compose up -d regularly | 19:52 |
corvus | gimme a sec | 19:53 |
clarkb | similar to how gitea does it (and it finds new mariadb images) | 19:53 |
corvus | okay, yeah, looks like we do restart, last was 4/5 days ago | 19:54 |
corvus | to unfork web, we actually update our dockerfile and the docker-compose | 19:54 |
clarkb | ok, it wasn't clear to me if we had to keep building the image ourselves or if we can use theirs like we do for the other services | 19:55 |
corvus | (we're building the image from a github/jeblair source repo and deploying it; to unfork, change docker-compose to deploy from upstream and rm the dockerfile) -- but don't do that yet, upstream may not have updated the image. | 19:55 |
corvus | we should use theirs | 19:55 |
clarkb | got it | 19:55 |
corvus | https://hub.docker.com/r/jitsi/web | 19:55 |
corvus | https://hub.docker.com/layers/jitsi/web/latest/images/sha256-018f7407c2514b5eeb27f4bc4d887ae4cd38d8446a0958c5ca9cee3fa811f575?context=explore | 19:56 |
corvus | 4 days ago | 19:56 |
corvus | we should unfork now | 19:56 |
clarkb | excellent. Did you want to write that change? If not I'm sure we can find a volunteer | 19:56 |
corvus | their build of -web should now have the meetpad PR merge in it | 19:56 |
corvus | clarkb: i will do so | 19:56 |
clarkb | thank you | 19:56 |
corvus | #action corvus unfork jitsi-meet | 19:56 |
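A hedged sketch of what the unfork amounts to in the meetpad docker-compose file: drop the locally built image in favour of upstream's published one and let the existing periodic docker-compose pull && docker-compose up -d pick it up (service name, tag, and surrounding layout are illustrative):

    services:
      web:
        # previously built locally from the forked jitsi-meet source;
        # unforking means consuming upstream's image directly
        image: jitsi/web:latest
        restart: always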
clarkb | #topic InMotion OpenStack as a Service | 19:57 |
*** openstack changes topic to "InMotion OpenStack as a Service (Meeting topic: infra)" | 19:57 | |
clarkb | Really quickly before our hour is up: I have deployed a control plane for an inmotion openstack managed cloud | 19:57 |
clarkb | everything seems to work at first glance and we could bootstrap users and projects and then point cloud launcher at it. Except that none of the api endpoints have ssl | 19:57 |
clarkb | there is a VIP involved somehow that load balances requests across the three control plane nodes (it is "hyperconverged") | 19:58 |
clarkb | I need to figure out how to properly listen on that VIP and then can run a simple ssl terminating proxy with a self signed cert or LE cert that forwards to local services | 19:58 |
clarkb | I have not yet figured that out | 19:58 |
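The eventual shape of that proxy would be something along these lines once the VIP question is sorted out; every value here (VIP address, port, cert path, backend) is an illustrative assumption, shown for keystone only:

    frontend keystone_tls
        # terminate TLS on the VIP with an LE or self-signed cert (pem = cert + key)
        bind 203.0.113.10:5000 ssl crt /etc/haproxy/cloud-api.pem
        default_backend keystone_plain

    backend keystone_plain
        # forward to the plaintext keystone endpoint the kolla containers expose locally
        server keystone 127.0.0.1:5000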
clarkb | I've also tried to give this feedback back to inmotion as something that would be useful | 19:59 |
clarkb | another thing worth noting is that we have a /28 of ipv4 addresses there currently so the ability to expand our nodepool resources is minimal right now | 19:59 |
fungi | got it, so by default their cloud deployments don't provide a reachable rest api? | 19:59 |
clarkb | well they do but in plaintext | 19:59 |
corvus | clarkb: what's the vip attached to? | 19:59 |
fungi | oh! http just no https? | 20:00 |
clarkb | corvus: I have no idea. I tried looking and couldn't find it then ran out of time | 20:00 |
clarkb | a lot of things are in kolla containers and they all run the same exact command so it's been interesting poking around | 20:00 |
clarkb | (they run some sort of init that magically knows what other commands to run) | 20:00 |
clarkb | fungi: yup | 20:01 |
ianw | is it only ipv4 or also ipv6? | 20:01 |
clarkb | ianw: currently only ipv4 but ipv6 is something that they are looking at | 20:01 |
fungi | ipv6 is all the rage with the kids these days | 20:01 |
clarkb | (I expect that if we use this more properly it will be as an ipv6 "only" cloud then use the ipv4 /28 to do nat for outbound like limestone does) | 20:01 |
clarkb | but that is still theoretical right now | 20:01 |
clarkb | also we are now at time | 20:01 |
clarkb | thank you everyone! | 20:01 |
ianw | yeah, a /28 is ... 16? nodes? - control plane bits? | 20:01 |
fungi | thanks clarkb! | 20:02 |
fungi | ianw: correct | 20:02 |
clarkb | ianw: the control plane has a separate /28 or /29 | 20:02 |
clarkb | this /28 is for the neutron networking side so ya probably 14 usable and after neutron uses a couple 12? | 20:02 |
clarkb | We can continue conversations in #opendev | 20:02 |
clarkb | #endmeeting | 20:02 |
fungi | if the entire /28 is routed to the endpoint you could in theory use all 16 addresses | 20:02 |
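For the record, the arithmetic behind those numbers: a /28 is 2^(32-28) = 16 addresses; in a normal neutron subnet the network and broadcast addresses plus a gateway/router port consume a few of those, leaving roughly 12-14 for nodepool instances, whereas a /28 routed entirely to the endpoint could in principle use all 16.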
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev" | 20:02 | |
openstack | Meeting ended Tue Feb 9 20:02:33 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.html | 20:02 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.txt | 20:02 |
openstack | Log: http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-02-09-19.01.log.html | 20:02 |
*** hasharAway has quit IRC | 21:14 | |
kopecmartin | ianw: thank you for working on it | 21:21 |
*** gmann is now known as gmann_afk | 22:13 | |
*** sboyron has quit IRC | 23:18 |