Tuesday, 2020-07-14

*** SotK has quit IRC06:54
*** SotK has joined #opendev-meeting06:55
*** frickler is now known as frickler_pto09:44
*** frickler_pto is now known as frickler09:47
*** gouthamr has quit IRC10:22
*** gouthamr has joined #opendev-meeting10:23
*** frickler is now known as frickler_pto13:50
*** hamalq has joined #opendev-meeting15:23
*** hamalq has quit IRC15:29
*** hamalq has joined #opendev-meeting15:35
corvuso/18:59
clarkbhello!18:59
clarkbwe'll get started in a couple minutes18:59
ianwo/19:00
clarkb#startmeeting infra19:01
openstackMeeting started Tue Jul 14 19:01:05 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
*** openstack changes topic to " (Meeting topic: infra)"19:01
openstackThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2020-July/000056.html Our Agenda19:01
clarkb#topic Announcements19:01
*** openstack changes topic to "Announcements (Meeting topic: infra)"19:01
clarkbOpenDev virtual event #2 happening July 20-2219:01
*** zbr has joined #opendev-meeting19:01
clarkbcalling this out as they are using etherpad; the previous event didn't have any problems with etherpad, but I plan to be around to support the service if necessary19:01
zbro/19:01
clarkbalso if you are interested in baremetal management that is the topic and you are welcome to join19:02
clarkb#topic Actions from last meeting19:02
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)"19:02
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-07-19.00.txt minutes from last meeting19:03
clarkbianw: thank you for running last week's meeting when I was out. I didn't see any actions recorded in the minutes.19:03
clarkbianw: is there anything else to add or should we move on to today's topics?19:03
ianwnothing, i think move on19:03
clarkb#topic Specs approval19:04
*** openstack changes topic to "Specs approval (Meeting topic: infra)"19:04
clarkb#link https://review.opendev.org/#/c/731838/ Authentication broker service19:04
clarkbI believe this got a new patchset and I was going to review it, but then things got busy before I took a week off19:04
clarkbfungi: ^ other than needing reviews anything else to add?19:04
funginope, there's a minor sequence numbering problem in one of the lists in it, but no major revisions requested yet19:05
clarkbgreat, and a friendly reminder to the rest of us to try and review that spec19:05
fungior counting error i guess19:05
clarkb#topic Priority Efforts19:05
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)"19:06
clarkb#topic Update Config Management19:06
*** openstack changes topic to "Update Config Management (Meeting topic: infra)"19:06
clarkbze01 is running on containers again. We've vendored the gear lib into the ansible role that uses it19:06
fungino new issues seen since?19:06
clarkbother than a small hiccup with the vendoring I haven't seen any additional issues related to this19:06
clarkbmaybe give it another day or two then we should consider updating the remaining executors?19:07
fungiany feel for how long we should pilot it before redoing the other 11?19:07
fungiahh, yeah, another day or two sounds fine to me19:07
clarkbmost of the issues we've hit have been in jobs that don't run frequently, which is why giving it a few days to have those random jobs run on that executor seems like a good idea19:07
clarkbbut I don't think we need to wait for very long either19:07
ianwumm, there is one19:08
ianwhttps://review.opendev.org/#/c/740854/19:08
clarkbah that was related to the executor then?19:08
clarkb(I saw the failures were happening but hadn't followed it that closely)19:08
ianwyes, the executor writes out the job ssh key in the new openssh format, and it is more picky about whitespace19:08
clarkb#link https://review.opendev.org/#/c/740854/ fixes an issue with containerized ze01. Should be landed and confirmed happy before converting more executors19:09
fungiahh, right, specifically because the version of openssh in the container is newer19:09
clarkbI'll give that a review after the meeting if no one beats me to it19:09
fungifwiw the reasoning is sound and it's a very small patch, but disappointing default behavior from variable substitution19:10
fungii guess ansible or jinja assumes variables with trailing whitespace are a mistake unless you tell it otherwise19:10
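A minimal way to sanity-check that theory on an executor, assuming the job key lands in a file on disk (the path below is illustrative only): ssh-keygen will refuse to load an OPENSSH-format private key whose trailing whitespace or newline was mangled by templating.

    # hypothetical path to the job ssh key the executor wrote out
    keyfile=/tmp/job-ssh-key
    if ! ssh-keygen -y -f "$keyfile" >/dev/null 2>&1; then
        echo "key no longer parses; likely trailing-whitespace damage from templating" >&2
    fi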
clarkbas far as converting the other 11 goes, I'm not entirely sure what the exact process is there. I think it's something like stop zuul services, manually remove systemd units for zuul services, run ansible, start container, but we'll want to double check that if mordred isn't able to update us19:11
mordredohai - not really here - but here for a sec19:11
fungii'd also be cool waiting for mordred's return to move forward, in case he wants to be involved in the next steps19:11
mordredyeah - I think the story when we're happy with ze01 is for each remaining ze to shut down the executor, run ansible to update to docker19:12
mordredbut I can totally drive that when I'm back online for real19:12
clarkbcool that probably gives us a good burn in period for ze01 too19:12
mordredyah19:13
ianwyeah probably worth seeing if any other weird executor specific behaviour pops up19:13
fungisounds okay to me19:13
corvusmordred: eta for your return?19:13
mordredI'll be back online enough to work on this on Thursday19:14
fungi"chapter 23: the return of mordred"19:14
mordredI'll have electricians replacing mains power ... But I have a laptop and phone :)19:14
fungithat also fits with the "couple of days" suggestion19:14
corvuscool, no rush or anything, just thought if it was going to be > this week maybe i'd start to pick some stuff off, but "wait for mordred" sounds like it'll fit time-wise :)19:15
clarkbya I'm willing to help too, just let me know19:15
fungisame19:15
mordredCool. It should be straightforward at this point19:15
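Very roughly, the per-executor sequence mordred describes would look like the sketch below; the playbook name, paths, hostnames, and service names here are assumptions, not the actual runbook.

    # convert the remaining executors one at a time (hostnames/playbook are examples)
    for host in ze02.openstack.org ze03.openstack.org; do
        # stop and disable the old non-container service so it cannot race the container
        ssh "$host" 'sudo systemctl stop zuul-executor && sudo systemctl disable zuul-executor'
        # re-run the ansible that deploys the containerized executor (playbook path assumed)
        ansible-playbook -l "$host" playbooks/service-zuul.yaml
        # confirm the container actually came up before moving on
        ssh "$host" 'sudo docker ps --format "{{.Names}}" | grep -i executor'
    done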
corvusmeanwhile, i'll continue work on the (tangentially related) multi-arch container stuff19:15
clarkbcorvus: that was the next item on my list of notes related to config management updates19:15
corvuscool i'll summarize19:16
corvusdespite all of our reworking, we're still seeing the "container ends up with wrong arch" problem for the nodepool builder containers19:16
corvuswe managed to autohold a set of nodes exhibiting the problem reliably19:16
corvus(and by reliably, i mean, i can run the build over and over and get the same result)19:17
corvusso i should be able to narrow down the problem with that19:17
corvusat this point, it's unknown whether it's an artifact of buildx, zuul-registry, or something else19:17
clarkbis there any concern that if we were to restart nodepool builders right now they may fail due to a mismatch in the published artifacts?19:17
corvusclarkb: 1 sec19:18
corvusmordred, ianw: do you happen to have a link to the change?19:18
ianwhttps://review.opendev.org/#/c/726263/ the multi-arch python-builder you mean?19:19
corvusyep19:19
ianwthen yes :)19:19
corvus#link https://review.opendev.org/726263 failing multi arch change19:19
corvusthat's the change with the held nodes19:19
corvus(it was +3, but failed in gate with the error; i've modified it slightly to fail the buildset registry in order to complete the autohold)19:20
corvusclarkb: i *think* the failure is happening reliably in the build stage, so i think we're unlikely to have a problem restarting with anything published19:21
clarkbgotcha, basically if we make it to publishing things have succeeded which implies the arches are mapped properly?19:21
corvuswe do have published images for both arches, and, tbh, i'm not sure what's actually on them.19:21
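For reference, one way to see what actually got published per architecture is to read the manifest list and the per-image config; the image name and tag below are examples, not necessarily the real published tags.

    # list the architectures present in the (possibly multi-arch) manifest
    # (needs a docker CLI new enough to have "manifest inspect")
    docker manifest inspect docker.io/zuul/nodepool-builder:latest
    # or inspect the architecture recorded in a single image's config
    skopeo inspect docker://docker.io/zuul/nodepool-builder:latest | grep -i architecture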
* mordred is excited to learn what the issue is19:21
corvusare we running both arches in containers at this point?19:21
mordredno - arm is still running non-container19:22
mordredmulti-arch being finished here should let us run the arm builder in containers19:22
mordredand stop having differences19:22
corvusokay.  then my guess would be that there is a good chance the arm images published may not be arm images.  but i honestly don't know.19:22
corvuswe should certainly not proceed any further with arm until this is resolved19:23
clarkb++19:23
mordred++19:23
funginoted19:23
mordredwell19:23
mordredwe haven't built arm python-base images19:23
mordredso any existing arm layers for nodepool-builder are definitely bogus19:23
mordredso definitely should not proceed further :)19:23
clarkbmordred: even if those layers possibly don't do anything arch specific?19:23
mordredthey do19:24
clarkblike our python-base is just python and bash right?19:24
clarkbah ok19:24
mordredthey install dumb-init19:24
mordredwhich is arch-specific19:24
fungipython is arch-specific19:24
clarkbfungi: ya but cpython is on the lower layer19:24
mordredyah - but it comes from the base image19:24
mordredfrom docker.io/library/python19:24
mordredand is properly arched19:24
clarkb.py files in a layer won't care19:24
clarkbunless they link to c things19:24
mordredbut we install at least one arch-specific package in docker.io/opendevorg/python-base19:25
clarkbor we install dumb init19:25
mordredyah19:25
fungioh, okay, when you said "our python-base is just python and bash right" you meant python scripts, not compiled cpython19:25
ianwclarkb: but anything that builds does, i think that was where we saw some issues with gcc at least19:25
clarkbfungi: yup19:25
fungii misunderstood, sorry. thought you meant python the interpreter19:25
mordredianw: the gcc issue is actually a symptom of arch mismatch19:25
mordredthe builder image builds wheels so the base image doesn't have to - but the builder and base layers were out of sync arch-wise - so we saw the base image install trying to gcc something (and failing)19:26
mordred(yay for complex issues)19:26
ianwi thought so, and the random nature of the return was why it passed check but failed in gate (iirc?)19:27
mordredyup19:27
mordredthank goodness corvus managed to hold a reproducible env19:27
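As a quick spot check that the arch-specific pieces in a layer (like dumb-init) match the platform requested, something like the following works; the image tag, the dumb-init location, and the presence of the file utility in the image are all assumptions.

    # needs a docker new enough to support --platform on run
    docker run --rm --platform linux/arm64 docker.io/opendevorg/python-base:latest \
        sh -c 'uname -m; file "$(command -v dumb-init)"'
    # uname -m should report aarch64, and file should describe an aarch64 ELF binary,
    # if the layer really matches the requested architecture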
clarkbanything else on the topic of config management?19:27
ianwi have grafana.opendev.org rolled out19:28
ianwi'm still working on graphite02.opendev.org and migrating the settings and data correctly19:28
fungimight be worth touching on the backup server split, though we can cover that under a later topic if preferred19:29
clarkbyup its a separate topic19:30
fungicool, let's just lump it into that then19:30
clarkb#topic OpenDev19:30
*** openstack changes topic to "OpenDev (Meeting topic: infra)"19:30
clarkblets talk opendev things really quickly then we can get to some of the things that popped up recently19:30
clarkb#link https://review.opendev.org/#/c/740716/ Upgrade to v1.12.219:30
clarkbThat change upgrades us to latest gitea. Notably its changelog says it allows you to properly set the default branch on new projects to something other than manster19:31
* fungi changes all his default branches to manster19:31
clarkbthis isn't something we're using yet but figuring these things out was noted in https://etherpad.opendev.org/p/opendev-git-branches so upgrading sooner rather than later seems like a good idea19:31
fungiyeah, i think it's a good idea to have that in place soon19:32
clarkbmy fix for the repo list pagination did merge upstream and some future version of gitea should include it. That said the extra check we've got seems like good belts and suspenders19:32
fungisurprised nobody's asked for the option yet19:32
clarkbthat fix is not in v1.12.219:32
clarkband finally I need to send an email to all of our advisory board volunteers and ask them to sub to service-discuss and service-announce if they haven't already, then see if I can get them to agree on a comms method (I've suggested service-discuss for simplicity)19:33
clarkb#topic General Topics19:34
*** openstack changes topic to "General Topics (Meeting topic: infra)"19:34
clarkb#topic Dealing with Bup indexes and backup server volume migrations and our new backup server19:34
*** openstack changes topic to "Dealing with Bup indexes and backup server volume migrations and our new backup server (Meeting topic: infra)"19:34
clarkbthis is the catch all for backup related items. Maybe we should start with what led us into discovering things?19:34
clarkbMy understanding of it is that zuul01's root disk filled up and this was tracked back to bups local to zuul01 "git" indexes19:35
clarkbwe rm'ed that entire dir in /root/ but then bup stopped working on zuul0119:35
clarkbin investigating the fix for that we discovered our existing volume was nearing full capacity so we rotated out the oldest volume and made it latest on the old backup server19:35
fungiprobably the biggest things to discuss are that we've discovered it's safe to reinitialize ~root/.bup on backup clients, and that we're halfway through replacing the puppet-managed backups with ansible-managed backups but they use different servers (and redundancy would be swell)19:36
clarkbafter that ianw pointed out we have a newer backup server which is in use for some servers19:36
ianwi had a think about how it ended up like that ...19:36
corvusi kinda thought rm'ing the local index should not have caused a problem; it's not clear if it did or not; we didn't spend much time on that since it was time to roll the server side anyway19:36
clarkbcorvus: I think for review we may not have rotated its remote backup after rm'ing the local index because its remote server was the new server (not the old one). ianw and fungi can probably confirm that though19:37
clarkbianw: fungi ^ maybe lets sort that out first then talk about the new server?19:37
fungicorvus: i did see a (possibly related) problem when i did it on review01... specifically that it ran away with disk (spooling something i think but i couldn't tell where) on the first backup attempt and filled the rootfs and crashed19:37
corvusoh i missed the review01 issue19:38
corvusit filled up root01's rootfs?19:38
corvuser19:38
fungiand yeah, when i removed and reinitialized ~root/.bup on review01 i didn't realize we were backing it up to a different (newer) backup server19:38
corvusreview01's rootfs19:38
corvusfungi: what did you do to correct that?19:39
fungithen i started the backup without clearing its remote copy on the new backup server, and rootfs space available quickly drained to 0%19:39
ianwfungi: is that still the state of things?19:39
fungibup crashed with a python exception due to the enospc, but it immediately freed it all, leading me to suspect it was spooling in unlinked tempfiles19:39
fungiwhich would also explain why i couldn't find them19:40
clarkbya it basically fixed itself afterwards but cacti showed a spike during roughly the time bup was running19:40
clarkbthen a subsequent run of bup was fine19:40
fungiafter that, i ran it again, and it used a bit of space on / temporarily but eventually succeeded19:40
fungiso really not sure what to make of that19:41
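For the record, the client-side reset being described amounts to roughly the following; the exact index/save options should match whatever the existing backup cron job uses, so treat this as a sketch rather than the real commands.

    # keep the old local bup metadata around until the next backup succeeds
    mv /root/.bup /root/.bup.old-$(date +%F)
    # recreate the local repo/index; the actual backup data lives on the remote backup server
    bup init
    # the next bup index + bup save -r run will re-hash everything, so expect it to be slow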
clarkbit may be worth doing a test recovery off the current review01 backups (and zuul01?) just to be sure the removal of /root/.bup isn't a problem there19:41
fungiit did not exhibit whatever behavior led zuul01 to end up with two hung/running bup processes started on successive days19:41
ianwclarkb: ++ i can take an action item to confirm i can get some data out of those backups if we like19:43
clarkbianw: that would be great, thanks19:43
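A restore smoke test on the backup server could be as small as the following; the repo location and branch name are assumptions about how the backups are laid out, since bup stores each client's saves as a git-style branch.

    # on the backup server, point bup at the repo holding review01's backups (path assumed)
    export BUP_DIR=/opt/backups/bup-review01
    bup ls                      # list branches/snapshots to find the right name
    # pull one small file out of the latest snapshot into a scratch directory
    bup restore -C /tmp/restore-test review01/latest/etc/hostname
    cat /tmp/restore-test/hostname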
clarkband I think otherwise we continue to monitor it and see if we have disk issues?19:43
clarkbianw: what are we thinking for the server swap itself?19:44
ianwyeah, so some history19:44
ianwi wrote the ansible roles to install backup users and cron jobs etc in ansible, and basically iirc the idea was that as we got rid of puppet everything would pull that in, everything would be on the new server and the old could be retired19:44
ianwhowever, puppet clearly has a long tail ...19:45
ianwwhich is how we've ended up in a confusing situation for a long time19:45
ianwfirstly19:45
fungibut also we're already ansibling other stuff on all those servers, so the fact that some also get unrelated services configured by puppet should be irrelevant now as far as that goes19:45
ianwfungi: yes, that was my next point :)  i don't think that was true, or as true, at the time of the original backup roles19:46
ianwso, for now19:46
fungiif we can manage user accounts across all of them with ansible then seems like we could manage backups across all of them with ansible too19:46
ianw#link https://review.opendev.org/740824 add zuul to backup group19:46
fungiyeah, a year ago maybe not19:46
ianwwe should do that ^ ... zuul dropped the bup:: puppet bits, but didn't pick up the ansible bits19:46
ianwthen19:46
ianw#link https://review.opendev.org/740827 backup all hosts with ansible19:47
ianwthat *adds* the ansible backup roles to all extant backup hosts19:47
ianwso, they will be backing up to the vexxhost server (new, ansible roles) and the rax one (old, puppet roles)19:47
clarkbgotcha so we'll swap over the puppeted hosts too, that way it's less confusing19:47
ianwonce that is rolled out, we should clean up the puppet side, drop the bup:: bits from them and remove the cron job19:47
clarkboh we'll keep the puppet too? would it be better to have the ansible side configure both the old and new server?19:48
fungiand once we do that, build a second new backup server and add it to the role?19:48
clarkband simply remove the puppetry?19:48
ianw*then* we should add a second backup server in RAX, add that to the ansible side, and we'll have dual backups19:48
fungiyeah, all that sounds fine to me19:48
clarkbgotcha19:48
ianwyes ... sounds like we agree :)19:48
* fungi makes thumbs-up sign19:48
clarkbbasically add the second back in with ansible rather than worry too much about continuing to use the puppeted side of things19:48
clarkbwfm19:48
clarkbas a time check we have ~12 minutes and a few more items so I'll keep things moving here19:49
clarkb#topic Retiring openstack-infra ML July 1519:49
*** openstack changes topic to "Retiring openstack-infra ML July 15 (Meeting topic: infra)"19:49
clarkbfungi: I haven't seen any objections for this, are we still a go for that tomorrow?19:49
fungiyeah, that's the plan19:50
fungi#link https://review.opendev.org/739152 Forward openstack-infra ML to openstack-discuss19:51
fungii'll be approving that tomorrow, preliminary reviews appreciated19:51
fungii've also got a related issue19:51
fungiin working on a mechanism for analyzing mailing list volume/activity for our engagement statistics i've remembered that we'd never gotten around to coming up with a means of providing links to the archives for retired mailing lists19:52
fungiand mailman 2.x doesn't have a web api really19:53
fungior more specifically pipermail which does the archive presentation19:53
clarkbthe archives are still there if you know the urls though iirc. Maybe a basic index page we can link to somewhere?19:53
fungibasically once these are deleted, mailman no longer knows about the lists but pipermail-generated archives for them continue to exist and be served if you know the urls19:54
fungiat the moment there are 24 (25 tomorrow) retired mailing lists on domains we host, and they're all on the lists.openstack.org domain so far but eventually there will be others19:54
fungii don't know if we should just manually add links to retired list archives in the html templates for each site (there is a template editor in the webui though i've not really played with it)19:55
clarkbeach site == mailman list?19:55
fungior if we should run some cron/ansible/zuul automation to generate a static list of them and publish it somewhere discoverable19:55
fungisites are like lists.opendev.org, lists.zuul-ci.org, et cetera19:56
clarkbah19:56
clarkbthat seems reasonable to me because it is where people will go looking for it19:56
clarkbbut I'm not sure how automatable that is19:56
fungiyeah, i'm mostly just bringing this up now to say i'm open to suggestions outside the meeting (so as not to take up any more of the hour)19:56
clarkb++19:57
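One possible shape for the generate-and-publish option fungi mentions, with every path and the pipermail layout being assumptions (and list_lists being the mailman 2 tool that prints the lists mailman still knows about):

    archive_root=/srv/mailman/archives/public      # pipermail archive directory (assumed)
    out=/var/www/retired-lists.html                # wherever we choose to publish the index
    {
      echo "<html><body><h1>Archives of retired mailing lists</h1><ul>"
      for d in "$archive_root"/*/; do
        name=$(basename "$d")
        # only index archives whose list mailman no longer knows about
        if ! list_lists -b | grep -qx "$name"; then
          echo "<li><a href=\"/pipermail/$name/\">$name</a></li>"
        fi
      done
      echo "</ul></body></html>"
    } > "$out"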
clarkb#topic China telecom blocks19:57
*** openstack changes topic to "China telecom blocks (Meeting topic: infra)"19:57
fungii'll keep this short19:57
clarkbAIUI we removed the blocks and did not need to switch to ianw's UA based filtering?19:57
fungiwe dropped the temporary firewall rules (see opendev-announce ml archives for date and time) once the background activity dropped to safe levels19:57
fungiit could of course reoccur, or something like it, at any time. no guarantees it would be from the same networks/providers either19:58
clarkbwe've landed the apache filtration code now though right?19:58
fungiso i do still think ianw's solution is a good one to keep in our back pocket19:58
clarkbso our response in the future can be to switch to the apache port in haproxy configs?19:59
fungiyes, the plumbing is in place we just have to turn it on and configure it19:59
ianwyeah, i think it's probably good we have the proxy option up our sleeve if we need those layer 7 blocks19:59
ianwtouch wood, never need it19:59
clarkb++19:59
fungibut of course if we don't exercise it, then it's at risk of bitrot as well so we should be prepared to have to fix something with it19:59
clarkbare there any changes needed to finish that up so it is ready if we need it?19:59
clarkbor are we in the state where it's in our attic and good to go when necessary?20:00
ianwfungi: it is enabled and tested during the gate testing runs20:00
clarkb(we are at time now but have one last thing to bring up)20:00
ianwgate testing runs for gitea20:00
fungiyeah, hopefully that mitigates the bitrot risk then20:00
clarkb#topic Project Renames20:01
*** openstack changes topic to "Project Renames (Meeting topic: infra)"20:01
clarkbThere are a couple of renames requested now.20:01
clarkbI'm already feeling a bit swamped this week just catching up on things and making progress on items that I was pushing on20:01
clarkbmakes me think that July 24 may be a good option for rename outage20:02
clarkbif I can get at least one other set of eyeballs for that I'll go ahead and announce it. We're at time so don't need to have that answer right now but let me know if you can help20:02
fungithe opendev hardware automation conference finishes on the 22nd, so i can swing the 24th20:02
clarkb(we've largely automated that whole process now which is cool)20:02
clarkbfungi: thanks20:03
clarkbThanks everyone!20:03
clarkb#endmeeting20:03
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"20:03
openstackMeeting ended Tue Jul 14 20:03:17 2020 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:03
openstackMinutes:        http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.html20:03
openstackMinutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.txt20:03
openstackLog:            http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-07-14-19.01.log.html20:03
fungithanks clarkb!20:03
*** hamalq has quit IRC22:58
