*** diablo_rojo has quit IRC | 07:45 | |
clarkb | fungi maybe move the mailman discussion here. I think the IncomingRunner is the runner that processes our "GLOBAL_PIPELINE" config | 16:05 |
---|---|---|
clarkb | so it could be emails causing it to balloon | 16:05 |
clarkb | fungi: around the time of the memory balloon we have a lot of errors like Apr 24 11:28:33 2020 (4985) listinfo: No such list "openstack-dev": in the openstack vhost error log | 16:07 |
clarkb | looking in the exim log around that same time period I've grepped for mailman_router and I don't see anything headed to openstack-dev there | 16:09 |
clarkb | I don't think those errors are generated by external email as a result. But maybe we've got something still referencing openstack-dev internally and it not existing causes us to use more memory than we should? | 16:10 |
clarkb | (I' | 16:10 |
clarkb | er | 16:10 |
clarkb | (I'm mostly just pulling on the random thread I've found don't know how valid this is) | 16:11 |
clarkb | aha, mailman docs tell me listinfo: likely originates from the web ui | 16:12 |
clarkb | that explains why exim is unaware | 16:12 |
clarkb | yup I've generated that error by visiting the listinfo page for openstack-dev | 16:13 |
clarkb | and it didn't drastically change memory use. I think that is the thread fully pulled out | 16:14 |
clarkb | https://mail.python.org/pipermail/mailman-users/2012-November/074397.html | 16:16 |
clarkb | skimming the error and vette logs for all vhosts, the only one active with anything out of the ordinary is openstack, and that was a message that was 6kb too large | 16:26 |
clarkb | corvus: I seem to recall at one point we had swap issues due to a kernel bug on executors? where they thought they were running out of memory but they had plenty of cache and buffers that could be evicted | 16:32 |
clarkb | corvus: do you recall what the outcome of that was? looking at the cacti graphs here it almost seems like it may be similar (granted we are using memory unexpectedly too) | 16:33 |
clarkb | hrm I take that back I think we may be down to ~50MB of buffers and cache at the point OOMkiller is invoked | 16:34 |
fungi | okay, moving here | 16:41 |
corvus | clarkb: not off hand, but it looks like based on your latest message that may not be important? | 16:42 |
clarkb | corvus: ya I don't think it is important anymore | 16:42 |
fungi | right, cacti showed something spiked active memory and exhausted it, so not buffers/cache | 16:43 |
clarkb | corvus: I was off by an order of magnitude when looking at memory use | 16:43 |
fungi | clarkb: also it looks like these spikes happen around 1000-1300z daily, so could be some scheduled thing i suppose | 16:44 |
fungi | maybe digests going out? | 16:44 |
clarkb | aren't those monthly though? | 16:44 |
clarkb | (I don't digest so not sure how frequently they can be configured) | 16:44 |
corvus | daily is possible | 16:45 |
fungi | i think there's a periodic wakeup which sends them so long as the length is over a certain number of messages, but also no fewer than some timeframe | 16:45 |
corvus | (in fact, i think daily is typical for digest users) | 16:45 |
clarkb | ah ok | 16:45 |
clarkb | maybe we should try and spread the timing of those out over different vhosts (even if it isn't the problem that may be a healthy thing?) | 16:46 |
clarkb | (I'm not seeing where that might be configurable in the config though) | 16:47 |
clarkb | fungi: other thoughts, this server notes it could be restarted for updates. Might be worthwhile starting there just to rule out any bugs in other systems? | 16:47 |
clarkb | but otherwise I'm thinking restart dstat with cpu and memory consumption reporting for processes and see if that adds any new useful info | 16:49 |
fungi | yeah, can't hurt, it was last rebooted over a year ago | 16:50 |
clarkb | I think digests are configured via cron to go out at 12:00 UTC | 16:52 |
clarkb | which is slightly before we saw the OOM in this case | 16:52 |
clarkb | the thing prior to that runs at 9:00 UTC and is the disabled account reminder send | 16:52 |
clarkb | given that I don't think its digests but dstat should help us further rule that in or out | 16:54 |
fungi | and /etc/cron.d/mailman-senddigest-sites seems to fire each of the sites after 12:00 as well | 16:54 |
fungi | nothing in /etc/cron.d/mailman really correlates either | 16:55 |
clarkb | my hunch continues to be its some email that makes the IncomingRouter's pipeline modules unhappy | 16:56 |
clarkb | or a pipeline module | 16:56 |
clarkb | and we are seeing the openstack IncomingRunner (not router) in a half state due to that right now | 16:57 |
corvus | i have a vague memory we started logging emails at some point to try to catch something; is that still in place? | 17:02 |
fungi | logging e-mail content? i believe i remember doing something like that at one point but only for a specific list | 17:03 |
fungi | and technically we have logs of what messages arrived and were accepted courtesy of exim, though if the message winds up not getting processed because some fork handling it is killed in an oom, we might not have record of the message content | 17:04 |
fungi | or it may still be sitting in one of the incoming dirs or something | 17:04 |
fungi | but yeah, if memory serves we were recording posts for the openstack-discuss (or maybe it was long enough ago to be openstack-dev?) ml to have unadulterated copies for comparison while investigating dmarc validation failures | 17:07 |
clarkb | I'm making a top-mem plugin that shows pid | 17:07 |
clarkb | currently trying to sort out output width formatting :) | 17:08 |
fungi | and yeah, i agree the stuff i saw dstat logging didn't seem particularly more useful than what cacti is trending, just better granularity and more reliable recording | 17:15 |
fungi | but i'd never looked at what devstack's generating so wasn't really clear what we should expect out of dstat | 17:15 |
fungi | i assumed it was basically like a fancy version of sar | 17:15 |
clarkb | fungi: ~clarkb/.dstat/dstat_top_mem_adv.py is a working modification to the shipped top mem plugin to show pid | 17:16 |
clarkb | fungi: if you put it in the dstat process owner's homedir at .dstat/dstat_top_mem_adv.py you can use it with --top-mem-adv flag | 17:16 |
clarkb | fungi: ya the thing it does that is useful for us is tracking memory use up to the OOM I think | 17:17 |
clarkb | whereas with OOM output you only get that once oomkiller has run | 17:17 |
clarkb | (and cacti isn't really tracking with that same granularity) | 17:18 |
fungi | clarkb: like the command i have staged in the root screen session? i copied that file into place and added --top-mem-adv to the previous command, after copying the old dstat-csv.log out of the way | 17:19 |
clarkb | fungi: I don't think its compatible with csv output /me looks at root screen | 17:19 |
clarkb | fungi: ya that looks right. lets try it and see what csv looks like? | 17:20 |
clarkb | if it fails it fails and we figure out redirecting the table output to file instead | 17:20 |
fungi | dstat: option --top-mem-adv not recognized | 17:27 |
clarkb | fungi: where did you copy the file? | 17:27 |
clarkb | it should be ~root/.dstat/dstat_top_mem_adv.py | 17:28 |
fungi | d'oh, i did it into my homedir. scattered today | 17:28 |
fungi | sudo doesn't re-resolve ~ when your shell is already expanding it :/ | 17:28 |
fungi | that's better :) | 17:29 |
fungi | thanks | 17:29 |
clarkb | fungi: ok its not doing what I expect and I know why | 17:29 |
clarkb | fungi: let me hack on that plugin a bit more | 17:29 |
fungi | sure | 17:31 |
clarkb | fungi: ok recopy it from my homedir and restart it. It should have the pid, process name, and memory use now | 17:32 |
clarkb | the existing thing only has process name and memory use (no pid) because the csv and tabular output are different parts of the plugin | 17:32 |
clarkb | and hopefully that gives us better clues. Also do we want to reboot ? | 17:32 |
fungi | clarkb: it's running again with the plugin updated | 17:32 |
clarkb | fungi: yup that seems to be happy it has the info about memory in the csv | 17:33 |
fungi | looks like it does make it into the csv, yep | 17:33 |
fungi | should i also be running it with --swap? | 17:34 |
clarkb | fwiw making plugins like that seems pretty straightforward | 17:34 |
clarkb | we could probably write a mailman specific plugin to check queue lengths and such pretty easily | 17:34 |
fungi | yeah, looks like you just copied one and tweaked a few lines? | 17:34 |
clarkb | fungi: yup I copied top-mem and edited it to output pid in addtion to the other stuff | 17:35 |
fungi | cool | 17:35 |
clarkb | just thinking out loud here that maybe if we find it is mailman that is exploding in size the next step is instrumenting mailman and dstat may help us do that too via plugins | 17:35 |
fungi | yep | 17:35 |
clarkb | fwiw the vette log for mailman does report when it gets really large emails. And we only have a 46kb email showing up around that time | 17:36 |
clarkb | I suppose some list could have a really high message size set by admins and someone is sending really large emails? | 17:36 |
clarkb | we wouldn't see that in vette because it wouldn't error? | 17:36 |
clarkb | the existing plugins are in /usr/share/dstat if you want to see them | 17:36 |
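
(The modified plugin in clarkb's homedir isn't reproduced in this log. As a rough standalone sketch of the idea it implements — walk /proc, find the process with the largest resident set, report its name and pid — something like the following would do. The names here are illustrative; the real dstat plugin API instead subclasses dstat's plugin class and fills in self.val.)

```python
# Illustrative sketch only, not the actual ~clarkb/.dstat/dstat_top_mem_adv.py.
import os

def top_mem_process():
    """Return (pid, command, rss_bytes) for the process with the largest RSS."""
    pagesize = os.sysconf('SC_PAGESIZE')
    best = (None, '', 0)
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/statm' % pid) as f:
                rss_pages = int(f.read().split()[1])  # second field: resident pages
            with open('/proc/%s/comm' % pid) as f:
                name = f.read().strip()
        except (IOError, OSError):
            continue  # process exited while we were scanning
        rss = rss_pages * pagesize
        if rss > best[2]:
            best = (pid, name, rss)
    return best

if __name__ == '__main__':
    pid, name, rss = top_mem_process()
    print('%s (pid %s): %d MiB resident' % (name, pid, rss // (1024 * 1024)))
```
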
fungi | should i also be running it with --swap? how about --top-cpu-adv and --top-io-adv? | 17:37 |
clarkb | fungi: ya maybe we should just add in as much info as possible | 17:37 |
fungi | those are the other ones the foreground dstat uses in devstack | 17:37 |
clarkb | the cpu and io info could be useful if its also processing some email furiously while expanding in size | 17:37 |
clarkb | we'd probably see that info there and it would help us understand the behavior better | 17:38 |
fungi | KeyError: 'i/o process' | 17:38 |
fungi | hrm | 17:38 |
fungi | maybe raised at dstat_top_io_adv.py +75 | 17:38 |
clarkb | fungi: it only does that when run with --output | 17:39 |
clarkb | I think the plugin doesn't support csv output. Maybe just drop that on | 17:39 |
clarkb | *that one | 17:39 |
fungi | i've left out --top-io-adv for now | 17:39 |
fungi | yeah, that's what i figured too | 17:39 |
clarkb | and that key is definitely unused elsewhere in the plugin and not set anywhere, so it's a bug in dstat | 17:41 |
fungi | maybe that's why devstack only uses it in the foreground process | 17:41 |
clarkb | fungi: other than watching it to see what happens I think other things we can do are: reboot, move digest cron time, log all incoming email sizes (or complexity if we can do that somehow) | 17:42 |
clarkb | corvus: ^ you might know how to make exim logging richer? I don't see message sizes but maybe we can start there? | 17:43 |
clarkb | oh unless is S=\d+ a size? | 17:44 |
corvus | clarkb: it is | 17:45 |
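
(For reference: exim's main log records each accepted message on a "<=" arrival line that includes an S=<bytes> field, so sizes around the window of interest can be pulled with something like the one-liner below. The mainlog path is the Debian default and the timestamp pattern is an assumption.)

```sh
# hypothetical one-liner; adjust the date/hour prefix and log path as needed
awk '/^2020-04-24 11:2/ && / <= / {
  for (i = 1; i <= NF; i++) if ($i ~ /^S=/) print $1, $2, $i
}' /var/log/exim4/mainlog
```
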
corvus | fungi: yes! the dmarc debug thing is what i was thinking of | 17:45 |
corvus | is that still in place? | 17:46 |
fungi | checking... | 17:46 |
corvus | fungi: yes /var/mail/openstack-discuss | 17:47 |
corvus | we probably all have that content anyway, but that's the raw incoming messages to that list if we need them | 17:47 |
clarkb | ooh more debugging tools | 17:48 |
fungi | corvus: looks like we used a mailman_copy router for that one lisyt | 17:48 |
fungi | list | 17:48 |
clarkb | fwiw we are looking for something around 11:28UTC today that triggered the memory increase | 17:48 |
clarkb | (also how funny would it be if that is the source of the issue :P) | 17:48 |
clarkb | (I mean that file is quite large and if it has to be opened to write to...) | 17:49 |
fungi | if so it's something getting sent to the ml at around the same time every day, which would be weird | 17:49 |
fungi | or opening that file around the same time every day | 17:50 |
clarkb | fungi: it does make me wonder if that is why the incoming runner for openstack is 800MB in resident size right now | 17:50 |
corvus | clarkb: that's written by exim, mm doesn't know about that file | 17:50 |
clarkb | corvus: ah thanks | 17:50 |
fungi | also exim is appending the file, so probably just does a seek based on file length? | 17:52 |
clarkb | ok I think Imay have found something weird | 18:18 |
clarkb | Subject: Re: [openstack-dev] [Tacker] <- that thread received a message at 11:22UTC according to exim | 18:19 |
clarkb | that message isn't in the openstack-discuss mailman archive | 18:19 |
clarkb | it's possible that is just fallout but may also indicate a trigger? | 18:19 |
fungi | it may just be in the moderation queue, i haven't checked held messages yet today | 18:20 |
fungi | i'll go through them now | 18:20 |
clarkb | I also notice that at 01:04UTC april 23 we get a message with a fairly large zip attachment. It's clearly spam, but not sure if it ever gets to mailman or not (could be mailman has a sad dealing with attachments like that?) | 18:23 |
fungi | clarkb: yeah, there were several posts to that thread from a non-subscriber i'm approving momentarily (once i get through skimming the remaining spam) | 18:28 |
fungi | largest is 183kb, which is excessive but not significantly so | 18:29 |
clarkb | fungi: ya it comes in right before we have issues though. Also it has a bunch of utf8 in it | 18:29 |
clarkb | its got a few things that make it stand out but not necessarily point to it being the issue | 18:30 |
fungi | okay, all caught up on the 10 lists i moderate, nothing really jumped out at me as unusual | 18:31 |
fungi | those missing messages should show up soon | 18:32 |
clarkb | ya I see them now | 18:32 |
clarkb | fungi: do you think the zip spam yesterday is anything to be concerned about? | 18:33 |
clarkb | I guess at this point its probably better to stop looking without direction and wait for more dstat info | 18:33 |
clarkb | my rtt to the AP just got really bad. Need a local network reset I think | 18:35 |
clarkb | fungi: I guess we might also consider adding swap to the host as a stop gap? | 18:35 |
fungi | we get tons of spam with attachments, i doubt exim or mailman attempt to expand them | 18:36 |
clarkb | mordred: did you add the zuul user to the hosts? | 20:01 |
*** yoctozepto has joined #opendev-meeting | 20:01 | |
fungi | /etc/passwd on zm02 was last modified 25 minutes ago | 20:02 |
clarkb | in any case there is no /home/zuul/.ssh/id_rsa* | 20:02 |
mordred | so - we have a uid/gid conflict | 20:02 |
mordred | playbooks/group_vars/zuul.yaml:zuul_user_id: 10001 | 20:02 |
clarkb | I think what happened was you added a zuul user and that changed the zuul user's homedir | 20:02 |
clarkb | and now the homedir has no ssh key | 20:02 |
clarkb | this is why we added a zuulcd user | 20:02 |
mordred | no - wait | 20:02 |
clarkb | /var/lib/zuul/ssh/id_rsa is where the key is I Think | 20:03 |
mordred | we did not add the zuul user with the additional_users mechanism we used on bridge | 20:03 |
mordred | we added the zuul user in playbooks/roles/zuul/tasks/main.yaml explicitly | 20:03 |
mordred | and did, in fact, set the home dir to /home/zuul | 20:03 |
clarkb | ah I see | 20:04 |
mordred | it seems that was a mistake | 20:04 |
fungi | the zuul group seems to have maybe been changed from 3000 to 10001 though | 20:04 |
AJaeger | looks like we have the same problem now also on https://review.opendev.org/722309 | 20:04 |
mordred | and we should set it to /var/lib/zuul ? | 20:04 |
clarkb | mordred: I think /var/lib/zuul is what it was | 20:04 |
mordred | should that be zuul's home on all of the machines? | 20:04 |
clarkb | but I'm going to check puppet now | 20:04 |
mordred | thanks | 20:04 |
fungi | and yeah, there's stuff in /var/lib/zuul and /home/zuul group-owned by an anonymous gid 3000 | 20:05 |
clarkb | oh no it was /home/zuul before | 20:05 |
clarkb | I think its just that the uids changed? | 20:05 |
fungi | uid still seems the same as it was | 20:05 |
clarkb | so the issue is adding a new uid? | 20:05 |
clarkb | which orphaned the perms on those files? | 20:05 |
fungi | i don't see a new uid, just gid | 20:05 |
mordred | the ansible thinks it wants to set the uid and gid to 10001 | 20:06 |
fungi | looks like the old zuul group was probably 3000 but has been replaced by group 10001 | 20:06 |
clarkb | -r-------- 1 3000 10001 1675 Apr 24 19:40 id_rsa | 20:06 |
clarkb | it can't use the ssh key because it's owned by the old uid | 20:06 |
fungi | possible paramiko/openssh don't like incorrect group ownership, yeah | 20:06 |
clarkb | group doesn't matter there | 20:06 |
clarkb | its 400 | 20:06 |
clarkb | but note the uid is 3000 | 20:07 |
fungi | you keep saying the old uid, but the old uid and new uid still seem to be 3000 | 20:07 |
fungi | i only see a change in the gid | 20:07 |
clarkb | oh we didn't change the uid? | 20:07 |
fungi | not that i have seen evidence of yet | 20:07 |
clarkb | fungi: I was going off of "mordred | the ansible thinks it wants to set the uid and gid to 10001" | 20:07 |
mordred | that's what we have set in ansible | 20:07 |
fungi | the zuul user in /etc/passwd is still 3000 | 20:08 |
fungi | on ze02 anyway | 20:08 |
mordred | so - we told ansible to set uid and gid to 10001 but that only worked for the group it seems | 20:08 |
corvus | usermod refuses to change the uid of a user if it's running processes | 20:09 |
corvus | (that task should have failed) | 20:09 |
mordred | ok - so maybe let's change ansible back to 3000 for both of these? | 20:09 |
mordred | I thought 10001 was the correct value, but clearly it wasn't | 20:09 |
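
(The task in question isn't quoted in the log; reconstructed from the conversation, it is roughly an Ansible user task of the following shape. Field values other than the 10001 uid/gid mentioned above are assumptions.)

```yaml
# reconstructed sketch, not the actual playbooks/roles/zuul/tasks/main.yaml
- name: Add zuul user
  user:
    name: zuul
    uid: "{{ zuul_user_id }}"   # 10001 in playbooks/group_vars/zuul.yaml at this point
    group: zuul
    home: /home/zuul
    shell: /bin/bash
```

(Because the user module shells out to usermod, and usermod refuses to renumber an account that has running processes, only the group ended up changed on the live hosts — which is what corvus notes above.)
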
clarkb | why would ssh fail to read the file if its got the correct perms? maybe the process is the different uid? | 20:09 |
*** gouthamr has joined #opendev-meeting | 20:09 | |
corvus | it is the correct value for the container | 20:09 |
mordred | corvus: yeah - I thought we picked the value for the container to match opendev's prod | 20:10 |
clarkb | oh | 20:10 |
mordred | but clearly that's not correct | 20:10 |
fungi | clarkb: openssh and paramiko are both paranoid about permissions and ownership on key files, so it could just be that it doesn't like the random group id even though that group lacks read perms | 20:10 |
fungi | anyway, this is at least *a* problem, it may not be the only problem | 20:11 |
clarkb | fwiw there are no containers running | 20:12 |
clarkb | so we aren't having a mismatch between container and host uids | 20:12 |
corvus | are we sure the key didn't change? | 20:12 |
corvus | running this as zuul: ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org -vvv gerrit ls-projects | 20:12 |
fungi | corvus: i'm not sure, no | 20:12 |
corvus | that doesn't complain about perms at all | 20:13 |
fungi | /var/lib/zuul/ssh/id_rsa was modified 35 minutes ago | 20:13 |
fungi | so may also have been overwritten with incorrect content | 20:13 |
clarkb | the error complains about publickey too | 20:13 |
corvus | that seems like the theory we should work to confirm or discount right now | 20:13 |
mordred | clarkb: we set the ansible values to match the container values so that when we started the containers all of the permissions would match. we did this thinking that we'd picked the container values to match the production opendev values. that is why there is a numeric gid mismatch | 20:14 |
clarkb | drwxr-xr-x 5 zuul 3000 4096 Apr 24 20:10 zuul | 20:14 |
mordred | but clearly those were not actually our production values | 20:14 |
clarkb | is it possibly looking in /home/zuul/.ssh/known_hosts to verify host keys (or similar) and finding the homedir is owned by the wrong group and bailing out? | 20:15 |
fungi | `sudo -u zuul -- ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org` gives me "Permission denied (publickey)." | 20:15 |
mordred | so I'd say one thing we should do - that may have _zilch_ to do with this issue - is that we should reset the ansible uid and gid values to 3000, since that's what the files are on disk - and then add a user: 3000 to the docker-compose files so that when we start docker things will also work | 20:15 |
corvus | mordred: i disagree, but let's talk about that later | 20:15 |
corvus | i suspect we're rejecting every new patchset now | 20:16 |
*** tristanC has joined #opendev-meeting | 20:16 | |
corvus | in fact | 20:16 |
corvus | we may want to see if we can pause the mergers and executors | 20:16 |
corvus | that may reduce the carnage | 20:16 |
clarkb | corvus: should be able to just stop them entirely? | 20:16 |
clarkb | then the scheduler will maintain the state? | 20:16 |
corvus | clarkb: yeah, if we are okay throwing away the jobs | 20:16 |
corvus | do we have executor pause running yet? | 20:16 |
clarkb | I don't think we've restarted zuul any time recently | 20:17 |
fungi | status notice zuul is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress | 20:17 |
fungi | should we send something like that? ^ | 20:17 |
corvus | fungi: sounds good; clarkb you stop all mergers, i'll do something to the executors | 20:17 |
clarkb | ok stopping mergers now | 20:18 |
gouthamr | +1, helps me stop rechecking :) | 20:18 |
clarkb | systemctl stop zuul-merger doesn't seem to be working fwiw | 20:18 |
clarkb | did we remove the unit? | 20:18 |
fungi | #status notice The Zuul project gating service is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress | 20:18 |
openstackstatus | fungi: sending notice | 20:18 |
clarkb | oh there it goes, maybe just a delay | 20:18 |
mordred | don't think so - that was going to be cleanup | 20:18 |
clarkb | ok doing the other 7 now | 20:18 |
-openstackstatus- NOTICE: The Zuul project gating service is reporting new patches in merge conflict erroneously due to a configuration error, fix in progress | 20:19 | |
clarkb | zm01-08 should all be stopped or stopping now | 20:20 |
fungi | do the mergers need to be stopped as well? | 20:20 |
fungi | (the independent mergers i mean) | 20:20 |
clarkb | fungi: that is what I stopped | 20:20 |
mordred | fungi: clark has stopped the mergers - corvus is working on executors | 20:20 |
corvus | all the executors should be paused, meaning they aren't accepting new jobs, but a lot of them are in retry loops for git checkouts | 20:20 |
fungi | oh, right, i misread | 20:20 |
corvus | so they are still logging errors | 20:21 |
corvus | but i don't think stopping them would produce a different outcome | 20:21 |
mordred | ++ | 20:21 |
clarkb | the operating theory is that we've changed the private key so we no longer authenticate against the public key in gerrit? | 20:21 |
mordred | clarkb: corvus used the key to talk to gerrit and it worked | 20:21 |
corvus | mordred: no it failed | 20:21 |
fungi | clarkb: that seems likely if my reproducer quoted above is the correct way to test it | 20:21 |
mordred | oh! I misread | 20:22 |
fungi | `sudo -u zuul -- ssh -i /var/lib/zuul/ssh/id_rsa -p 29418 review.opendev.org` gives me "Permission denied (publickey)." | 20:22 |
openstackstatus | fungi: finished sending notice | 20:22 |
corvus | so yeah, i think that's the unconfirmed theory which we should confirm now, probably by untangling the old hiera values and comparing? | 20:22 |
clarkb | I've confirmed the key in brdige hiera is different than what we have on zm02 | 20:22 |
mordred | clarkb: which key? | 20:22 |
fungi | i also tried zuul@review.opendev.org too just to make sure | 20:23 |
mordred | zuul_ssh_private_key_contents is what ansible wants to install | 20:23 |
clarkb | bridge:/etc/ansible/hosts/group_vars/zuul-merger.yaml:zuulv3_ssh_private_key_contents != zm02:/var/lib/zuul/ssh/id_rsa | 20:23 |
mordred | ah - wrong hiera key | 20:23 |
clarkb | note I've not confirmed that is the correct key | 20:24 |
clarkb | but going to try and do that now | 20:24 |
corvus | $zuul_ssh_private_key = hiera('zuul_ssh_private_key_contents') | 20:24 |
clarkb | $zuul_ssh_private_key = hiera('zuulv3_ssh_private_key_contents') | 20:24 |
clarkb | my line is from the zm node block | 20:24 |
corvus | i think that's what puppet writes out to '/var/lib/zuul/ssh/id_rsa' | 20:25 |
corvus | clarkb: mine is from executor | 20:25 |
clarkb | we may have called it different things in different places? | 20:25 |
corvus | in site.pp | 20:25 |
clarkb | ya looks like it may be different key between merger and executor | 20:25 |
corvus | wow | 20:25 |
mordred | no wonder it broke then | 20:26 |
clarkb | well both have different values than what is on the merger right now | 20:26 |
mordred | so maybe we should just update the key name in the zuul merger hiera | 20:26 |
clarkb | and they are different from each other I think | 20:26 |
mordred | if zuul_ssh_private_key_contents is what we're using elsewhere? | 20:26 |
corvus | do we have both public keys set for that user in gerrit? | 20:27 |
mordred | looking | 20:27 |
mordred | oh wait - not sure how to look - do I need to log in as that user to gerrit? | 20:28 |
clarkb | we also have two different keys in used on executors | 20:28 |
corvus | i'll poke at the gerrit db | 20:28 |
fungi | i confirm the /etc/ansible/hosts/group_vars/zuul-merger.yaml:zuulv3_ssh_private_key_contents works with my reproducer command above | 20:28 |
clarkb | ok I think I've got it | 20:28 |
fungi | so that seems to confirm it's the one we want | 20:28 |
mordred | awesome | 20:28 |
clarkb | zuul_ssh_private_key might be for executor to test nodes | 20:28 |
clarkb | gerrit_ssh_private_key on the executor is ssh from executor to gerrit | 20:28 |
clarkb | and on zuul mergers we called that zuulv3_ssh_private_key | 20:29 |
clarkb | trying to compare them in hiera now | 20:29 |
clarkb | yes gerrit_ssh_private_key == zuulv3_ssh_private_key | 20:29 |
clarkb | between zuul-executor.yaml and zuul-merger.yaml | 20:29 |
fungi | it's definitely the correct file content anyway (or at least it's *a* key the zuul user in gerrit can authenticate with) | 20:29 |
mordred | wow | 20:29 |
mordred | so - in the current ansible we want that to be zuul_ssh_private_key_contents everywhere | 20:30 |
clarkb | mordred: and then you need to do something to write out the executor to test node key | 20:30 |
clarkb | (I mean thats probably alerady done, we just need to make sure we get it right too) | 20:30 |
corvus | there is one public key in gerrit. i could paste it but it doesn't seem necessary since that's easy to test. | 20:30 |
mordred | clarkb: yeah | 20:31 |
mordred | clarkb: I think we can followup with the executor-to-test-nodes key since I don't think we're touching it right now | 20:31 |
clarkb | I'll see if I can find it on ze01 and cross check against hiera | 20:31 |
fungi | well, like i said, i tested authenticating with it, and it does work | 20:32 |
mordred | but if we update hiera for zuul-to-gerrit-key to be called zuul_ssh_private_key_contents and then re-reun ansible, we should get the right keys back everywhere - right? | 20:32 |
fungi | oh, you mean the test node key | 20:32 |
fungi | yeah | 20:32 |
mordred | but I think we don't have to solve that key right now, because it shouldn't have changed from the last time puppet wrote it if we're not writing it | 20:32 |
clarkb | mordred: confirmed those other keys look ok | 20:33 |
clarkb | so its just the zuul to gerrit key that got mismatched | 20:33 |
clarkb | as well as gids? | 20:33 |
mordred | clarkb: cool. so - you've got your head wrapped around the key names in hiera - can you update the key names? | 20:33 |
clarkb | (and possibly uids if ansible manages to do its thing?) | 20:33 |
fungi | so far that's all i've seen wrong at least | 20:33 |
clarkb | mordred: I don't know where I need to update them | 20:34 |
mordred | clarkb: in hiera | 20:34 |
clarkb | mordred: right but in which group or host? | 20:34 |
mordred | all of them | 20:34 |
clarkb | mordred: sorry let me rephrase | 20:34 |
mordred | ansible wants the key to be zuul_ssh_private_key_contents | 20:34 |
clarkb | zuul-merger doesn't have the key you used and yet it still updated the contents | 20:34 |
clarkb | I don't know where it is finding those contents | 20:34 |
mordred | group_vars/zuul-merger.yaml:zuul_ssh_private_key_contents: | | 20:35 |
mordred | clarkb: that's where ansible got the contents from | 20:35 |
mordred | for the mergers | 20:35 |
clarkb | oh it is in there sorry | 20:35 |
mordred | *phew* | 20:35 |
fungi | zuul_ssh_private_key_contents seems to exist in zuul-executor.yaml, zuul-merger.yaml and zuul-scheduler.yaml according to git grep | 20:35 |
mordred | I was worried there was even more weirdness | 20:35 |
clarkb | ok now my concern is: if I change that, are we consuming it anywhere it is expected to be something else? | 20:36 |
clarkb | mordred: did you add that key to zuul-merger and zuul-scheduler recently? | 20:36 |
mordred | no - so - I think my error is that I expected we called the key the same thing in all three places | 20:36 |
mordred | but that is apparently not what we did | 20:36 |
clarkb | right so if I change the value of that key, anything consuming the key for its intended purpose may break | 20:37 |
mordred | nothing else is consuming that key | 20:37 |
mordred | the only thing consuming this is ansible now | 20:37 |
clarkb | ok | 20:37 |
fungi | codesearch turns up a couple of hits for zuul_ssh_private_key_contents in x/ci-cd-pipeline-app-murano and x/infra-ansible which i expect we can ignore | 20:37 |
corvus | should be able to grep system-config to confirm that | 20:37 |
clarkb | I'm taking the lock then | 20:37 |
mordred | and the only thing it's consuming it for is /var/lib/git/id_rsa | 20:37 |
fungi | aside from that, only hits are in system-config | 20:37 |
clarkb | I'll update zuul_ssh_private_key_contents to be the same value as zuulv3_whatever on zuul-merger | 20:37 |
fungi | playbooks/roles/zuul/tasks/main.yaml, playbooks/zuul/templates/group_vars/zuul-scheduler.yaml.j2 and playbooks/zuul/templates/group_vars/zuul.yaml.j2 | 20:37 |
mordred | yah | 20:38 |
mordred | so as long as we have the correct data referred to by that key, we should be good to go | 20:38 |
mordred | corvus: what do you think we should do about uid/gid? | 20:39 |
corvus | i think we should fix hiera and run ansible first; get that deployed and verify things are running (since it doesn't look like the id issue will be a problem right now) | 20:39 |
mordred | kk | 20:39 |
mordred | and agree | 20:39 |
corvus | then maybe take a breath and talk about that :) | 20:39 |
clarkb | I think zuul-merger.yaml is good now; I kept the old ssh key around under a new hiera key name | 20:40 |
clarkb | I'm going to do the same in the other two files now | 20:40 |
mordred | cool | 20:40 |
mordred | ++ \o/ | 20:40 |
clarkb | if you want to diff zuul-merger you'll see what I did | 20:40 |
fungi | yeah, odds are the changed gid may be purely cosmetic/immaterial, and ansible is failing to change the uid anyway | 20:40 |
corvus | clarkb: maybe go ahead and get rid of it? | 20:40 |
corvus | clarkb: we have git history :) | 20:40 |
corvus | clarkb: and the last thing we need is more keys in those files :) | 20:40 |
fungi | yeah, easy to revert changes to that file | 20:40 |
mordred | yeah - these files are ... busy | 20:40 |
corvus | (we're never going to think it's a better time than right now to make that cleanup) | 20:41 |
clarkb | corvus: ++ I'll remove them | 20:42 |
clarkb | I think I'm done except for that cleanup | 20:42 |
clarkb | zuul-scheduler was already correct | 20:42 |
clarkb | so we had 3 different names for the key in 3 different places. All three are now using the name and value that the scheduler was using | 20:42 |
mordred | awesome | 20:43 |
mordred | clarkb: and then zuul_ssh_private_key_contents_for_ssh_to_test_nodes is the executor key for talking to the test nodes, right? | 20:43 |
clarkb | mordred: yes I've removed that now though | 20:44 |
mordred | clarkb: +old_zuul_ssh_private_key_contents_for_test_node_ssh still seems to be in merger | 20:44 |
clarkb | mordred: should be gone from both now | 20:44 |
fungi | what a marvellously verbose variable name | 20:44 |
mordred | clarkb: I agree | 20:44 |
mordred | ok - so I should re-run the playbook now yes? | 20:44 |
clarkb | mordred: and yes you can confirm by looking at /var/lib/zuul/ssh/nodepool_id_rsa | 20:44 |
clarkb | ya I think hiera/group_vars are ready for it now | 20:44 |
mordred | cool | 20:44 |
fungi | also zuul_ssh_private_key_contents_for_ssh_to_test_nodes appears nowhere in our public repos per codesearch | 20:45 |
mordred | I've joined the screen session on bridge | 20:45 |
fungi | joined | 20:45 |
clarkb | fungi: I added it as a placeholder but then removed it per corvus' suggestion | 20:45 |
mordred | we will expect it to error out eventually at zuul-executor | 20:45 |
fungi | clarkb: oh, got it, so that was a made-up-on-the-spot one | 20:45 |
mordred | but that's ok | 20:45 |
clarkb | mordred: if we are happy with the output of that run we should git add zuul-executor.yaml and zuul-merger.yaml in group_vars and commit the change | 20:46 |
mordred | clarkb: ++ | 20:46 |
mordred | https://review.opendev.org/723023 <-- this is the fix for the executor error that will happen, fwiw | 20:46 |
clarkb | +2 on that change | 20:46 |
mordred | the mergers are stopped not paused, right? | 20:46 |
clarkb | mordred: correct systemctl stop zuul-merger'd | 20:47 |
mordred | ok. I think that means it's going to update the uid | 20:47 |
clarkb | so we may see ansible start them again | 20:47 |
clarkb | hrm that may break us too | 20:47 |
corvus | please carry on, but once this is done, we do need to talk about gid before we reinstate | 20:47 |
mordred | yeah | 20:47 |
mordred | I'm happy to talk about that while this is running | 20:47 |
corvus | if the mergers are stopped, i say we just let ansible do it's remapping | 20:47 |
mordred | kk | 20:48 |
corvus | but we don't want to start the mergers | 20:48 |
mordred | should I cancel ansible real quick? | 20:48 |
clarkb | ya we can chown /var/lib/zuul /var/run/zuul /home/zuul easily enough | 20:48 |
mordred | I don't think it's going to start anything | 20:48 |
corvus | did we land the change where we said "oh it's always safe to start mergers"? | 20:48 |
mordred | actually - it's DEFINITELY not going to start anything | 20:48 |
mordred | no, we did not | 20:48 |
corvus | k | 20:48 |
fungi | confirmed, it did switch up the uid on the mergers just now | 20:48 |
fungi | -r-------- 1 3000 zuul 1675 Apr 24 19:40 id_rsa | 20:49 |
fungi | so that's definitely going to need fixing before we can start them again | 20:49 |
mordred | ok. so the mergers now completely match the container uid/gid - but we'll need to run a chown on some directories | 20:49 |
mordred | yup | 20:49 |
corvus | ok, stable enough to talk about id stuff? | 20:49 |
mordred | I'm good to talk about it | 20:49 |
corvus | File "/usr/local/lib/python3.5/dist-packages/zuul/driver/bubblewrap/__init__.py", line 121, in getPopen | 20:49 |
corvus | group = grp.getgrgid(gid) | 20:49 |
corvus | KeyError: 'getgrgid(): gid not found: 3000' | 20:49 |
corvus | that's why we can't turn things on again until we fix that :/ | 20:50 |
corvus | i think we should just roll forward and re-id everything to the container ids | 20:50 |
clarkb | its using the current process gid I guess? | 20:50 |
fungi | that seems fine to me. we'll need to do the scheduler too right? | 20:50 |
mordred | corvus: I agree - since we're essentially down anyway | 20:50 |
corvus | yeah, that's basically my thinking | 20:50 |
corvus | fungi: maybe? | 20:51 |
fungi | clarkb: the running process is running with gid 3000 which doesn't have a corresponding group name, right | 20:51 |
corvus | i think this is easy for the executors (they'll be just like the mergers; maybe we have to do some chowns) | 20:51 |
corvus | but i'm guessing we do want to do the scheduler too, so maybe we just go ahead and save queues and do a full shutdown? | 20:51 |
mordred | corvus: yeah - and I mean - we needed to shutdown to start up from container - so while it's not ideal, again, we're down, so might as well | 20:52 |
corvus | yep; fungi, clarkb: sound good? if so, i can do the shutdown | 20:52 |
fungi | sounds fine, still here to help | 20:52 |
mordred | do we know which all dirs we need to chown? | 20:52 |
clarkb | ya I think rolling forward is our best choice | 20:52 |
mordred | /var/lib/zuul for sure | 20:52 |
clarkb | otherwise we'll just be chasing down weird bugs likely | 20:52 |
corvus | why don't folks divide up looking on the different machine classes for dirs to change? | 20:53 |
clarkb | mordred: /var/lib/zuul/ /var/run/zuul and /home/zuul I think | 20:53 |
clarkb | there may be others | 20:53 |
mordred | ok. I'm going to write a quick playbook | 20:53 |
* mnaser doesn’t have access but can provide extra hands or eyes doing other things if needed | 20:53 | |
fungi | i can check executors since clarkb's already been tackling mergers and mordred's ansibling | 20:54 |
mordred | https://etherpad.opendev.org/p/Egpih9sInMgzpz0DFF9m | 20:55 |
clarkb | mordred: also /etc/zuul | 20:55 |
mordred | how does that look so far? (mnaser - eyeballs ^^ ?) | 20:55 |
fungi | like originally on the mergers, the executors failed to update the zuul uid, which is still 3000, but have updated the zuul gid to 10001 | 20:55 |
mordred | clarkb: those are the same on both mergers and executors right? is it the same dirs on scheduler? | 20:55 |
clarkb | mordred: yes I believe its the same for mergers and executors | 20:55 |
clarkb | looking at scheduler now | 20:55 |
mnaser | mordred: looks good, i think that covers most folders | 20:55 |
mordred | thank you | 20:55 |
fungi | mordred: some of our files in those paths are root-owned | 20:56 |
fungi | not sure if that matters | 20:56 |
mordred | betcha ansible will happily set them back if it does | 20:56 |
mordred | fungi: got an example? | 20:56 |
fungi | possibly just because they're world-readable and configuration management wasn't told a uid/gid | 20:56 |
mordred | oh - I bet all the /etc/zuul stuff | 20:56 |
corvus | ya | 20:56 |
mordred | yeah | 20:56 |
clarkb | mordred: ya scheduler looks the same | 20:56 |
mnaser | mordred: maybe adding a -v for verbosity just in case to see what actually got changed | 20:57 |
fungi | /etc/zuul/executor-logging.conf for example | 20:57 |
mordred | cool. so that etherpad should take care of it | 20:57 |
mordred | mnaser: ++ | 20:57 |
clarkb | the scheduler has some old extra stuff in /var/run but those aren't used anymore | 20:57 |
clarkb | its just /var/run/zuul now | 20:57 |
corvus | deja vu, btw | 20:57 |
corvus | we did this the last time we landed this change :) | 20:57 |
corvus | maybe drop /etc/zuul and just chown /etc/zuul/github.key | 20:57 |
clarkb | mordred: ^ | 20:58 |
corvus | and zuul.conf | 20:58 |
mordred | so - I think we should a) shut down scheduler, executors, web and fingergw b) run current ansible c) run chown playbook | 20:58 |
mordred | corvus: kk | 20:58 |
corvus | d) run ansible again | 20:58 |
mordred | yes | 20:58 |
corvus | (to achieve steady state in case chown playbook != ansible) | 20:58 |
mordred | yup | 20:58 |
mnaser | corvus: gerrit.key too? or is that in /var/lib/zuul/.ssh/id_rsa ? | 20:58 |
clarkb | mnaser: its /var/lib/zuul/ssh/$files | 20:58 |
mordred | ok - I wrote that etherpad to chown-zuul.yaml | 20:59 |
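
(The etherpad contents aren't preserved here; the chown-zuul.yaml being drafted would have looked roughly like this first version. The paths are the ones named in the conversation, while the host groups and task names are assumptions. It was later run as `ansible-playbook -f 20 chown-zuul.yaml`, and github.key was subsequently dropped from the first task.)

```yaml
# hypothetical reconstruction of chown-zuul.yaml as first drafted
- hosts: zuul-scheduler:zuul-merger:zuul-executor
  become: true
  tasks:
    - name: Chown zuul state, run, and home directories recursively
      file:
        path: "{{ item }}"
        state: directory
        owner: zuul
        group: zuul
        recurse: true
      loop:
        - /var/lib/zuul
        - /var/run/zuul
        - /home/zuul
    - name: Chown specific zuul config files
      file:
        path: "{{ item }}"
        owner: zuul
        group: zuul
      loop:
        - /etc/zuul/zuul.conf
        - /etc/zuul/github.key   # later removed; the file only exists on the scheduler
```
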
corvus | yeah, our install is... weird and not recommended :) | 20:59 |
corvus | it's a zuulv0 install | 20:59 |
mnaser | hehe, also, the -v might have a _lot_ of output | 20:59 |
clarkb | ya the git repos have lots of files | 20:59 |
mnaser | esp when it hits the repo folders i guess | 20:59 |
clarkb | mordred: ^ | 20:59 |
fungi | there are a few files in /etc on executors with old gids but that may be because they're stale files? /etc/zuul/site-variables.yaml, /etc/zuul/ssl/{ca.pem,client.key,client.pem} | 20:59 |
mordred | oh - maybe we don't want the -v | 20:59 |
clarkb | or do a -v for everything but /var/lib/zuul | 21:00 |
clarkb | but definitely be careful with -v in /var/lib/zuul :) | 21:00 |
corvus | fungi: those files are important | 21:00 |
fungi | we *can* also use find to look for files/directories with old uids and gids | 21:00 |
mordred | clarkb: like that? | 21:01 |
mnaser | fungi: that's a good idea too, depending on how happy the iops on the system are :) | 21:01 |
corvus | we may just not have installed the site-variables file in this playbook, since it comes from system-config | 21:01 |
clarkb | mordred: the repos are in /var/lib/zuul/git iirc | 21:01 |
clarkb | mnaser: and then there is another set of hardlinked repos on the executors | 21:01 |
clarkb | er sorry mordred ^ | 21:01 |
corvus | clarkb: we shouldn't have any hardlinked repos | 21:01 |
mordred | ok. but I mean - does the playbook update look better? | 21:01 |
fungi | corvus: good point, those files may reflect things we're missing in ansible too | 21:01 |
corvus | let me make sure we have cleaned the build dir | 21:02 |
mordred | these: /etc/zuul/ssl/{ca.pem,client.key,client.pem} changed in the new rollout | 21:02 |
mnaser | find / -uid $olduid --- could very well be useful if the drives are fast enough | 21:02 |
mordred | they're /etc/zuul/ssl/gearman-{ca.pem,client.key,client.pem} | 21:02 |
clarkb | mordred: ya it looks good though corvus is mentioning we should clear out the build dirs | 21:02 |
mordred | ah - good point | 21:02 |
mnaser | as suggested by fungi | 21:02 |
fungi | mordred: i can confirm, those new filenames have the correct ownership | 21:02 |
corvus | derp, my shutdown was hung on an interactive step; i've resumed it but it's still proceeding | 21:03 |
fungi | it's the old filenames which do not | 21:03 |
mnaser | wait actually | 21:03 |
mordred | site-variables hasn't been written yet because of the executor bug | 21:03 |
corvus | mordred: wait for an all clear on shutdown from me before running | 21:03 |
mordred | corvus: yes | 21:03 |
clarkb | mordred: fungi we should check that the new files are used by zuul? | 21:03 |
mnaser | find / -uid old_zuul_uid -exec chown zuul:zuul {} \; | 21:03 |
mnaser | thoughts about that ^ ? | 21:03 |
mordred | clarkb: I did earlier - but please double-check that | 21:03 |
mordred | because this is one of those cascading failures sorts of days | 21:04 |
mnaser | (if we're traversing the entire file system, might as well as check if its the right uid and chown) | 21:04 |
clarkb | mnaser: I'ev checked on zm01 and the new files appear to be what is used | 21:04 |
mnaser | s/entire/most/ | 21:04 |
fungi | mnaser: i'd probably do without the -exec first to see what's found | 21:04 |
clarkb | mnaser: it will likely be incredibly slow too | 21:04 |
fungi | so that we can look into why they got skipped | 21:04 |
mordred | I think I'm ok with just the targeted chown - but maybe let's do that as a print later to make sure we didn't miss something | 21:04 |
fungi | but yeah, may want to exclude /var/lib/zuul from the find | 21:04 |
fungi | since we're chowning it all anyway | 21:05 |
fungi | and that'll be the bulk of the chew anyway | 21:05 |
mnaser | cool, fair - i figured since chown -Rv will probably scan all inodes, find was going to do the same, but that's probably an overoptimization | 21:05 |
mordred | fungi: the old filenames got skipped because ansible doesn't reference them at all | 21:05 |
corvus | fwiw, we could also just remove the git repos | 21:05 |
fungi | mordred: yeah, that's what i gathered when you said they'd been renamed in ansible | 21:05 |
corvus | bit slower on startup, but it's friday afternoon, going into a weekend, and they all need pruning anyway | 21:05 |
mordred | ah - nod | 21:05 |
mnaser | corvus: when was the last time an executor started with a clean slate? maybe that might take a while and introduce all other fun things too | 21:06 |
clarkb | mnaser: you don't need to get the old file stat to update its perms | 21:06 |
mordred | corvus: fine by me | 21:06 |
corvus | mnaser: a few months | 21:06 |
clarkb | mnaser: I expect it won't do a full stat | 21:06 |
mnaser | clarkb: ah yes good point | 21:06 |
corvus | mnaser: (they get terribly slow, and need periodic cleaning at opendev scale; we *just* merged a patch from tobiash to fix that, and i think we'll be restarting with it in place) | 21:07 |
mordred | corvus: looks like there's no python processes on scheduler | 21:07 |
corvus | not ready yet | 21:07 |
mordred | kk | 21:07 |
corvus | okay, everything is shutdown, and there are no files in builds | 21:08 |
corvus | i think now we decide if we want to delete the git repos or chown them | 21:08 |
mnaser | corvus: maybe start with chown and see how quickly its churning through them? | 21:09 |
corvus | sure, it's a second task anyway | 21:09 |
mordred | corvus: I think I can run the service-zuul playbook now while we discuss yeah? | 21:09 |
corvus | mordred: ++ | 21:09 |
mordred | ok. I'm going to run that now | 21:09 |
corvus | and restarting with the new code will immediately correct and gc them as soon as they are used (now *that* might be slow) | 21:09 |
corvus | but it'll be a good upgrade test :) | 21:10 |
mordred | ++ | 21:10 |
clarkb | are we starting on containers or old init? | 21:10 |
mordred | containers | 21:10 |
corvus | i think we're gonna be starting on containers (except executors) | 21:10 |
mordred | yeah | 21:10 |
fungi | so really really roll forward | 21:10 |
mordred | yup! we needed to restart everything to pick up the container change anyway - so, you know, here we are :) | 21:10 |
mnaser | if not using init then are the (assuming systemd) units disabled just in case? | 21:10 |
clarkb | mnaser: we've done that as a followup in other cases yes | 21:11 |
corvus | yeah. the upside of the 'unscheduled outage on friday' approach is this is going to go a lot faster than it would have otherwise :) | 21:11 |
mordred | corvus: jfdi ftw! | 21:11 |
clarkb | mnaser: basically systemctl stop foo && systemctl disable foo | 21:11 |
mnaser | ok cool | 21:11 |
mnaser | no one wants a reboot surprise :) | 21:11 |
mordred | yea - I think we need to write a cleanup playbook that does the disable and then deletes the units | 21:12 |
corvus | i would like a tequila sunrise | 21:12 |
mordred | corvus: now _I_ want a tequila sunrise | 21:12 |
fungi | nobody expects the reboot inquisition! | 21:13 |
clarkb | I might need to go make an okolehao something (https://www.islanddistillers.com/product/okolehao-100-proof-cork/) | 21:13 |
clarkb | you'll like the literal translation on that | 21:13 |
clarkb | but its the only alcohol I have left I Think | 21:13 |
mordred | clarkb: sandy thinks you should save that for disinfecting things | 21:14 |
corvus | not high enough proof :( | 21:14 |
fungi | only good for disinfecting your digestive tract | 21:14 |
mordred | or your iron butt | 21:14 |
mordred | ok. the playbook is done | 21:15 |
mordred | we ready for the chown playbook? | 21:15 |
corvus | ++ | 21:15 |
* fungi nods affirmatively | 21:15 | |
clarkb | I think all the things I was doing are long done | 21:15 |
mordred | cut/paste looks ok? | 21:15 |
corvus | ++ | 21:15 |
mordred | -f 20 is 20 forks right? | 21:16 |
fungi | that's my understanding | 21:16 |
mordred | I will now run that command | 21:16 |
fungi | fireball 20 | 21:16 |
mordred | ugh. what's page up in screen scrollback? | 21:17 |
fungi | the pgup key usually | 21:17 |
corvus | for me "page up" | 21:17 |
mordred | (I saw a bunch of error stream by) | 21:17 |
mordred | can one of you do that? I have a stupid keyboard | 21:17 |
corvus | we're at the top | 21:17 |
fungi | yeah, there's a limited buffer by default | 21:17 |
mordred | oh. well - let's run it and re look later then | 21:18 |
mnaser | mordred: if stupid key that being mac keyboard, fn+arrow up is page up :p | 21:18 |
mnaser | s/key/keyboard/ | 21:18 |
fungi | looks like it just worried about the missing file | 21:18 |
fungi | you told it to chown something which only exists on the scheduler | 21:18 |
mordred | ok. so it might have done everything ok | 21:18 |
mordred | /var/lib/zuul on the scheduler looks decent | 21:19 |
mordred | so - the answer is - recursive chown is quick | 21:20 |
fungi | /var/lib/zuul/ssh/{nodepool_id_rsa,static_id_rsa} on ze01 are 3000:3000 | 21:20 |
fungi | but maybe those are stale files? | 21:20 |
mordred | we still should have chowned them | 21:20 |
mordred | maybe let's pull github.key from the list | 21:20 |
clarkb | fungi: those are unmanaged by ansible currently but a chown -R should've gotten them | 21:21 |
clarkb | fungi: the nodepool_id_rsa file is how we ssh into nodes | 21:21 |
mordred | I'm re-running having taken the github.key file out of the first task | 21:21 |
fungi | makes sense | 21:22 |
mordred | because I'm pretty sure that caused it to bail and not run the second task :) | 21:22 |
clarkb | okolehau + club soda + lime and some ice is drinkable. Not really something you'd probably pay for though | 21:23 |
mordred | I would have expected zuul01 to take the longest | 21:26 |
mordred | done | 21:27 |
mordred | 6 minutes total fwiw - so still not terrible for a chown | 21:27 |
mordred | folks want to verify that stuff looks chown'd? | 21:27 |
mordred | then we can run the service-zuul playbook again | 21:27 |
fungi | those keyfiles in /var/lib/zuul/ssh/ on ze01 are correct now | 21:27 |
clarkb | zm01 looks good to me in /var/lib | 21:28 |
fungi | /home/dmsimard on ze01 has some files with old ownership | 21:29 |
fungi | which is fine | 21:29 |
mordred | if that breaks zuul something is horribly wrong :) | 21:29 |
fungi | indeed | 21:29 |
mordred | I will now re-run service-zuul - yeah? | 21:29 |
fungi | the aforementioned files in /etc/zuul still have old gids | 21:30 |
fungi | but they weren't included in the chown | 21:30 |
mordred | (should I fix the jemalloc line in the executor role?) | 21:30 |
fungi | /etc/zuul/site-variables.yaml /etc/zuul/ssl/ca.pem /etc/zuul/ssl/client.pem /etc/zuul/ssl/client.key | 21:30 |
corvus | mordred: oh, i guess if it's aborting due to that error we should fix it | 21:30 |
mordred | I think since we're rolling forward - we should go ahead and fix that | 21:30 |
fungi | those last three are not used by the container sounded like, as theyve been renamed | 21:30 |
mordred | yeah | 21:30 |
fungi | i'm unsure whether /etc/zuul/site-variables.yaml needs correcting though? | 21:31 |
mordred | let's see how it is after this ansible run | 21:31 |
mordred | since we won't abort on zuul-executor now (we hope) | 21:31 |
fungi | i guess it's managed by a different playbook we haven't run? or just missing from ansible entirely? | 21:31 |
mordred | we've been bombing out due to the jemalloc bug | 21:32 |
fungi | got it | 21:32 |
mordred | so didn't get to the scheduler role | 21:32 |
mordred | ok - I run playbook | 21:32 |
fungi | also /etc/zuul/site-variables.yaml is world-readable so if we're configuration-managing it then that's probably fine for the moment | 21:32 |
mordred | corvus: so - the real question I have is - why did I think 10001 was the uid of zuul in opendev production? | 21:33 |
clarkb | mordred: was that the uid that we used on static for the zuul user? | 21:34 |
clarkb | to run goaccess? | 21:34 |
mordred | no - I mean - we set 10001 in the zuul images a while ago | 21:34 |
mordred | and we picked it to make it match what we were already running | 21:34 |
mordred | except WOW was that wrong | 21:34 |
clarkb | oh got it | 21:35 |
mordred | also - this same issue exists with nodepool nodes | 21:35 |
clarkb | I have no idea in that case :) | 21:35 |
clarkb | I'm not in the screen, are we running the regular ansible now? | 21:35 |
mordred | yes | 21:35 |
mordred | it's very exciting | 21:36 |
fungi | also i've done a find on /etc, /home, /usr and /var for anything using -uid 3000 -o -gid 3000 | 21:36 |
fungi | no other hits | 21:36 |
fungi | (on ze01) | 21:36 |
corvus | mordred: i don't know, i think i just took your word for it when i asked about it :) | 21:37 |
mordred | corvus: :) | 21:37 |
mordred | corvus: so - how should we handle this for nodepool? | 21:37 |
fungi | mordred is a very believable fellow | 21:37 |
corvus | mordred: same way -- full shutdown and restart? | 21:37 |
corvus | (that's less disruptive) | 21:37 |
mordred | yeah | 21:37 |
mordred | we haven't landed the nodepool patch yet | 21:38 |
clarkb | if we do it in a rolling fashion it won't be an outage | 21:38 |
clarkb | and each nodepool launcher runs indepednently so should be ok? | 21:38 |
mordred | ok - the playbook is done and it looks like it was happy about itself | 21:40 |
mordred | I believe we are now at the place where we can consider running some docker-composes | 21:41 |
fungi | that still hasn't updated ownership for /etc/zuul/site-variables.yaml on the executors, btw | 21:41 |
clarkb | we need mergers for the scheduler to start up | 21:41 |
corvus | lets start a merger? | 21:41 |
mordred | is there some way to do that piecewise in a way that's less disruptive while we check stuff? | 21:41 |
clarkb | corvus: ++ I think merger then the scheduler then the others if they are happy? | 21:41 |
mordred | should we do a chown playbook for site-variables real quick? | 21:41 |
mordred | like that? | 21:42 |
fungi | maybe we should confirm we're ansibling that file at all | 21:42 |
mordred | oh - nope | 21:43 |
mordred | we point to /opt/project-config in conf now | 21:43 |
mordred | playbooks/roles/zuul/templates/zuul.conf.j2:variables=/opt/project-config/zuul/site-variables.yaml | 21:43 |
fungi | codesearch only turns up hits for that path in opendev/puppet-zuul | 21:43 |
fungi | so yeah, that's stale along with the other three files in /etc/ with old ownership | 21:43 |
fungi | we're all set in that case | 21:43 |
mordred | cool. I think we should start a merger | 21:44 |
corvus | let's rm that file then | 21:44 |
clarkb | well we use that file | 21:44 |
* fungi concurs | 21:44 | |
corvus | clarkb: ? | 21:44 |
mordred | clarkb: we do not | 21:44 |
clarkb | oh wait I get it now | 21:44 |
clarkb | we changed the config to consume it directly | 21:44 |
clarkb | rather than writing it in /etc/zuul | 21:44 |
clarkb | got it | 21:44 |
corvus | clarkb, fungi, mordred: ok to rm /etc/zuul/site-variables.yaml ? | 21:44 |
mordred | ++ | 21:45 |
fungi | yeah, we should be able to delete these: /etc/zuul/site-variables.yaml /etc/zuul/ssl/ca.pem /etc/zuul/ssl/client.pem /etc/zuul/ssl/client.key | 21:45 |
corvus | (that way we don't get confused) | 21:45 |
mordred | want me to playbook it? | 21:45 |
clarkb | ya I think so | 21:45 |
fungi | all are using new names or paths now | 21:45 |
corvus | fungi: agreed | 21:45 |
fungi | which is why they have old ownership still, no longer referenced anywhere | 21:45 |
mordred | like that? | 21:45 |
fungi | yep | 21:46 |
mordred | k. running real quick | 21:46 |
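
(A sketch of what that quick cleanup playbook plausibly looked like; the real one was only visible in the screen session, so the host groups and task name are assumptions.)

```yaml
# hypothetical reconstruction of the stale-file cleanup playbook
- hosts: zuul-executor:zuul-merger
  become: true
  tasks:
    - name: Remove stale puppet-era zuul config files
      file:
        path: "{{ item }}"
        state: absent
      loop:
        - /etc/zuul/site-variables.yaml
        - /etc/zuul/ssl/ca.pem
        - /etc/zuul/ssl/client.pem
        - /etc/zuul/ssl/client.key
```
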
mordred | great. worked on executors, not on mergers - which is probably correct | 21:46 |
clarkb | ya mergers won't have site-variables | 21:46 |
fungi | find no longer turns up files under /etc with -uid 3000 -o -gid 3000 | 21:46 |
fungi | lgtm | 21:46 |
clarkb | but will have zuul/ssl/ files | 21:47 |
mordred | yes - we should have /etc/zuul/ssl/gearman-* | 21:47 |
fungi | we do on ze01 | 21:47 |
mordred | and on zm01 | 21:47 |
clarkb | zm01 looks good | 21:47 |
mordred | cool | 21:47 |
mordred | who wants to start a merger? | 21:48 |
corvus | i will | 21:48 |
clarkb | I'm still on zm01 if you want me to do it | 21:48 |
clarkb | k /me lets corvus do it | 21:48 |
corvus | clarkb: you do it | 21:48 |
* mordred wants both clarkb and corvus to do it | 21:48 | |
corvus | clarkb: all you :) | 21:48 |
clarkb | alright starting momentarily | 21:48 |
clarkb | its sudo docker-compose up -d ? | 21:48 |
clarkb | in /etc/zuul-merger? | 21:49 |
mordred | yup | 21:49 |
clarkb | container is up | 21:49 |
clarkb | its waiting for gearman | 21:49 |
clarkb | I think that is about as far as it will get without the scheduler? | 21:49 |
corvus | yep | 21:50 |
corvus | let's start an executor? | 21:50 |
clarkb | wfm | 21:50 |
corvus | i'll do that one :) | 21:50 |
mordred | this is very exciting | 21:50 |
corvus | it's up and running, no complaints (of course it's not a container) | 21:51 |
mordred | good that it's running though :) | 21:51 |
corvus | mordred: can you write a playbook to start the rest of the mergers and executors? | 21:52 |
clarkb | whats startup for executor? same as before? | 21:52 |
corvus | clarkb: yeah systemctl | 21:52 |
mordred | corvus: yes | 21:52 |
clarkb | rgr | 21:52 |
corvus | mordred: i think both 'docker-compose up' and 'systemctl start' are idempotent enough we can just run that on all hosts | 21:52 |
mordred | how's that look? | 21:53 |
mordred | oh - piddle | 21:53 |
mordred | how's that? | 21:53 |
corvus | clarkb: i kind of want a unit file, because stopping/starting zuul is currently much easier than with docker-compose | 21:53 |
corvus | mordred: ++ | 21:53 |
corvus | (like, a unit file to run docker-compose) | 21:53 |
mordred | k. I'm going to run that | 21:53 |
clarkb | ooh ya maybe do that for all our docker-compose too? | 21:54 |
clarkb | we can also do it without -d and systemd will be happy I think | 21:54 |
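That idea -- a unit file wrapping docker-compose, run in the foreground so systemd owns the process -- might look roughly like this; the unit name and the docker-compose binary path are assumptions, not something that was actually deployed here:

```ini
# /etc/systemd/system/zuul-merger.service -- hypothetical sketch
[Unit]
Description=Zuul merger (docker-compose)
Requires=docker.service
After=docker.service network-online.target

[Service]
WorkingDirectory=/etc/zuul-merger
# Foreground (no -d) so systemd tracks the container lifecycle directly
ExecStart=/usr/local/bin/docker-compose up
ExecStop=/usr/local/bin/docker-compose down
Restart=on-failure

[Install]
WantedBy=multi-user.target
```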
corvus | mordred: "- shell" | 21:54 |
mordred | thanks | 21:55 |
clarkb | btw the sudo that failed from zm02 was me | 21:55 |
mordred | ok. they're all started | 21:55 |
clarkb | I was trying to check if it was running under containers yet or not and didn't realize I was zuul | 21:55 |
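The ad-hoc start playbook itself also isn't visible outside the screen session, but given the two idempotent commands corvus mentioned it was presumably shaped like this (group and service names assumed):

```yaml
# Hypothetical sketch; the actual playbook ran in the screen session.
- hosts: zuul-merger
  tasks:
    - name: Start the merger container
      shell: docker-compose up -d
      args:
        chdir: /etc/zuul-merger

- hosts: zuul-executor
  tasks:
    - name: Start the executor service
      command: systemctl start zuul-executor
```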
corvus | all right, i'll start the scheduler? | 21:56 |
clarkb | I guess so | 21:56 |
clarkb | can't think of much else we can do without gearman in the mix | 21:56 |
corvus | starting | 21:58 |
corvus | the scheduler is waiting for gearman | 21:58 |
corvus | oh it's in a restart loop | 21:59 |
clarkb | gearman is? | 21:59 |
corvus | no the scheduler container | 21:59 |
clarkb | got it | 21:59 |
corvus | problem with the zk config | 21:59 |
corvus | i'm thinking this may not be well tested in the gate | 21:59 |
corvus | [zookeeper] | 22:00 |
corvus | hosts=23.253.236.126:2888:3888,172.99.117.32:2888:3888,23.253.90.246:2888:3888session_timeout=40 | 22:00 |
corvus | that's on all hosts | 22:00 |
mordred | corvus: I agree - it's possible we're only testing that docker-compose up works (which it did) | 22:00 |
corvus | i think we should shut everything down, fix that manually, run ansible, then start again | 22:00 |
mordred | corvus: we're missing a \n yeah? | 22:00 |
corvus | yep | 22:01 |
clarkb | mordred: ya session_timeout needs its own line | 22:01 |
mordred | k. I'll stop the mergers and executors | 22:01 |
mordred | and edit the file - someone else want to capture it in a gerrit change? | 22:01 |
clarkb | I can take a look at gerrit | 22:01 |
corvus | scheduler is down | 22:01 |
mordred | corvus: what's syntax issue there? | 22:02 |
mordred | there's no -%} - it's just a %} | 22:02 |
corvus | mordred: i don't understand the question | 22:02 |
mordred | corvus: sorry - in the jinja template for zuul.conf.j2 - I don't understand why there's no \n in the output | 22:03 |
clarkb | mordred: ya I was looking at it and getting confused too | 22:03 |
mordred | I've got it up in vim in the screen | 22:03 |
clarkb | mordred: we can just add a newline between the two items probably | 22:03 |
clarkb | but that's hacky | 22:03 |
mordred | yeah - maybe do that for now? | 22:04 |
clarkb | k I'm pulling up jinja2 docs now to try and understand better | 22:05 |
mordred | also - let's look at the rest of the file and make sure nothing else is dumb | 22:05 |
mordred | the connection sections look ok to me | 22:05 |
corvus | could use some spaces between them | 22:05 |
corvus | some newlines before each connection header | 22:05 |
mordred | added. shall I run it with this? | 22:06 |
corvus | also the gearman headers | 22:06 |
clarkb | https://jinja.palletsprojects.com/en/2.11.x/templates/#whitespace-control | 22:06 |
clarkb | "a single trailing newline is stripped if present" is the default apparently? | 22:07 |
corvus | mordred: ++ | 22:08 |
mordred | k. I did the hacky version - let's see if we can't get the nicer version | 22:08 |
corvus | "By default, Jinja also removes trailing newlines. To keep single trailing newlines, configure Jinja to keep_trailing_newline." | 22:08 |
clarkb | reading that doc I think we have to do the hacky thing | 22:08 |
clarkb | corvus: ya so we'd have to change ansible to get what we want? | 22:08 |
corvus | so if there's a way to tell ansible to do that, that might be the less hacky thing. but if not, i agree with clarkb, hacky is the only option | 22:09 |
corvus | clarkb: i dunno if there's an argument to the template module? | 22:09 |
clarkb | corvus: let me read those docs | 22:09 |
clarkb | https://docs.ansible.com/ansible/latest/modules/template_module.html#parameter-trim_blocks | 22:09 |
clarkb | we can set that | 22:10 |
mordred | ok - the zuul.conf looks better now | 22:10 |
corvus | clarkb: i see a trim_blocks option but not a keep_trailing_newline | 22:10 |
corvus | clarkb: i read trim_blocks as being to remove the first newline inside a block | 22:10 |
corvus | but i'm not at all confident i'm interpreting that right | 22:10 |
clarkb | oh ya | 22:10 |
clarkb | keep_trailing_newline is different | 22:10 |
corvus | it's worth a local experiment | 22:10 |
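For context on that knob: if the hosts= line in the template ends with a Jinja block tag such as {% endfor %}, Ansible's default trim_blocks=yes removes the newline that follows it, which would match the symptom; whether that's actually what happened needs the local experiment. The parameter itself is set per-task, roughly like this (the src/dest/ownership values are illustrative, not the real role):

```yaml
- name: Write zuul.conf
  template:
    src: zuul.conf.j2
    dest: /etc/zuul/zuul.conf
    owner: zuul
    group: zuul
    mode: "0644"
    trim_blocks: false   # keep the newline after {% ... %} block tags
```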
mordred | corvus: I still didn't get a blank line on top of each [connection ... | 22:10 |
mordred | do we care? | 22:11 |
corvus | mordred: not now, but we should when we fix it in gerrit | 22:11 |
mordred | ++ | 22:11 |
corvus | cause it's pretty hard to read without | 22:11 |
mordred | ok. that's done - shall I restart the executors and mergers? | 22:12 |
corvus | yep | 22:12 |
mordred | done | 22:12 |
mordred | ready for a second stab at the scheduler | 22:12 |
corvus | it's still in a restart loop with the same error | 22:13 |
corvus | [zookeeper] | 22:13 |
corvus | hosts=23.253.236.126:2888:3888,172.99.117.32:2888:3888,23.253.90.246:2888:3888 | 22:13 |
corvus | what looks wrong with that? | 22:13 |
clarkb | corvus: the colons maybe? | 22:13 |
clarkb | is that how we range them? | 22:13 |
clarkb | thats my best guess. kazoo doesn't like foo:2888:3888 ? | 22:15 |
corvus | does anyone know where the config file reference is for zuul.conf now? | 22:15 |
corvus | because i can't seem to find it with our new improved navigation | 22:16 |
corvus | oh | 22:16 |
corvus | https://zuul-ci.org/docs/zuul/discussion/components.html | 22:17 |
corvus | hosts=zk1.example.com,zk2.example.com,zk3.example.com | 22:17 |
corvus | mordred: let's drop all the ports | 22:17 |
mordred | corvus: ok | 22:17 |
corvus | those are actually quorum ports | 22:17 |
clarkb | oh | 22:17 |
corvus | (we could do zk:2181 -- that's the client port) | 22:17 |
clarkb | we do that on the servers but not as clients | 22:17 |
corvus | but i think it's optional | 22:17 |
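So the target form, per the components doc corvus linked, is host names only, optionally with the 2181 client port; a sketch with placeholder hostnames:

```ini
# Illustrative only -- hostnames are placeholders; 2888/3888 are the ZooKeeper
# quorum/election ports and belong in the server config, not in a client host list.
[zookeeper]
hosts=zk01.example.com,zk02.example.com,zk03.example.com
session_timeout=40
```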
mordred | stopping executors and mergers | 22:17 |
corvus | scheduler is stopped | 22:18 |
mordred | re-running service-zuul | 22:18 |
mordred | ok. all done | 22:21 |
mordred | ready to start mergers and executors? | 22:22 |
corvus | yep | 22:22 |
corvus | scheduler is up | 22:22 |
mordred | corvus: is it happier this time? | 22:22 |
corvus | yep | 22:22 |
mordred | (not having those ports will make that jinja line able to use | join(',')) | 22:22 |
corvus | mordred: we may want ports for zk-tls | 22:23 |
corvus | (i'm not sure, we should test) | 22:23 |
clarkb | merger is resetting all the things | 22:23 |
mordred | corvus: nod | 22:23 |
mordred | corvus: we clearly need a better test for this in the gate | 22:23 |
clarkb | no failure yet that I see | 22:23 |
fungi | once we're sure this is working... | 22:23 |
fungi | status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:25 were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:23 |
fungi | does that ^ look reasonable? i tried to base the times there on when the keyfile was first overwritten, when the mergers were stopped, when the scheduler was stopped, and when the scheduler was started again | 22:23 |
mordred | turns out "does docker-compose work" is not sufficient | 22:23 |
clarkb | fungi: lgtm | 22:24 |
mordred | fungi: lgtm | 22:24 |
corvus | mordred: we can borrow a lot of stuff from quickstart | 22:24 |
corvus | fungi: ++ | 22:24 |
corvus | mordred: are we not running zuul-web and fingergw in containers? | 22:25 |
corvus | oh it's a separate docker-compose | 22:25 |
corvus | for zuul-web | 22:25 |
clarkb | probably want to consolidate that? | 22:25 |
corvus | web and fingergw are in the same docker-compose | 22:26 |
*** rkukura has joined #opendev-meeting | 22:26 | |
corvus | web is in a restart loop | 22:26 |
corvus | PermissionError: [Errno 13] Permission denied: '/var/log/zuul/web.log' | 22:26 |
corvus | we did not chown those | 22:26 |
fungi | yep, still has old uid/gid | 22:26 |
corvus | how is the scheduler writing to its log? | 22:26 |
mordred | corvus: its 666 | 22:27 |
fungi | world-writeable | 22:27 |
fungi | yep | 22:27 |
mordred | web.log is 644 | 22:27 |
corvus | that is amusing | 22:27 |
mordred | yeah | 22:27 |
fungi | we probably don't really want those logs world-writeable | 22:27 |
mordred | so - shall I do a quick cown of /var/log/zuul ? | 22:27 |
corvus | i'll do the chown | 22:27 |
mordred | ok | 22:28 |
mordred | we might need to do it everywhere? | 22:28 |
mordred | also - I just changed windows in screen I think by accident? | 22:28 |
fungi | logs on executors are world-readable too | 22:28 |
fungi | which is why they're also working | 22:28 |
fungi | er, world-writeable | 22:28 |
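A sketch of the ownership fix being applied here, assuming the containers run as a zuul user/group that exists on these hosts:

```shell
# Give the zuul user back its log directory and drop the accidental world-writeable bits.
sudo chown -R zuul:zuul /var/log/zuul
sudo chmod 0644 /var/log/zuul/*.log
```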
mordred | how do I switch windows in screen? | 22:29 |
fungi | ctrl-a,n | 22:29 |
fungi | for next window | 22:29 |
mordred | thanks | 22:29 |
fungi | p for previous | 22:29 |
corvus | web server is up now | 22:29 |
corvus | tenants loaded, will re-enqueue | 22:29 |
corvus | i think we're good to call this up and send the announcement? | 22:29 |
fungi | lmk when i should status notice that novella | 22:29 |
mordred | corvus: I think I should run the playbook I just wrote in screen | 22:29 |
fungi | maybe after reenqueuing, yeah | 22:29 |
clarkb | corvus: maybe we should check executors can ssh into test nodes? | 22:30 |
mordred | and yes - I think we can announce back u | 22:30 |
corvus | mordred: sure | 22:30 |
mordred | up | 22:30 |
clarkb | if we haven't confirmed that yet? simply because ssh was affected | 22:30 |
corvus | oh | 22:30 |
corvus | yeah | 22:30 |
corvus | like those retry_limits i'm seeing at https://zuul.opendev.org/t/openstack/status | 22:30 |
corvus | and there was one merger_failure | 22:30 |
mordred | 2020-04-24 22:31:13,629 DEBUG zuul.AnsibleJob.output: [build: ad8be97939f543289336ff87bf847f1e] Ansible output: b' "msg": "Data could not be sent to remote host \\"23.253.166.144\\". Make sure this host can be reached over ssh: Permission denied (publickey).\\r\\n",' | 22:31 |
mordred | I think that's a nope | 22:31 |
corvus | i'll shut down the scheduler | 22:31 |
clarkb | is that executor to test node? | 22:32 |
corvus | yeah | 22:32 |
fungi | looks like it | 22:32 |
clarkb | ok the nodepool_id_rsa file hasn't been changed recently | 22:33 |
clarkb | which makes me think maybe it's a config issue. possibly we are trying to use the gerrit key to ssh to test nodes? | 22:33 |
clarkb | trying to figure out where we configure that on executors now | 22:33 |
corvus | [executor] | 22:34 |
corvus | private_key_file=/var/lib/zuul/ssh/id_rsa | 22:34 |
mordred | so - that's wrong - that just need to point to nodepool_id_rsa | 22:35 |
clarkb | that should be /var/lib/zuul/ssh/nodepool_id_rsa | 22:35 |
clarkb | __ | 22:35 |
clarkb | er ++ | 22:35 |
mordred | yes - and we need to be sure to write that file out in the ansible | 22:35 |
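The corrected executor section, per the path clarkb and mordred agree on above:

```ini
[executor]
private_key_file=/var/lib/zuul/ssh/nodepool_id_rsa
```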
mordred | I'll make the local change real quick | 22:35 |
mordred | clarkb: so - we should make sure there is a hiera entry for that key content :) | 22:36 |
mordred | clarkb: I think that's one of the keys you were looking at earlier | 22:36 |
mordred | I don't think we're writing it at all in the ansible | 22:36 |
clarkb | mordred: correct I think ansible needs to pull it out of ansible | 22:37 |
clarkb | mordred: and it's the key content I deleted in zuul-executor | 22:37 |
clarkb | mordred: should I go ahead and add it under a new name? | 22:38 |
clarkb | how about nodepool_test_node_ssh_key_content or similar | 22:38 |
fungi | fine by me | 22:38 |
mordred | ++ | 22:39 |
corvus | mordred: restart all the executors? | 22:39 |
corvus | probably fine to do the mergers too if it's easy to reuse the playbook | 22:40 |
mordred | ++ | 22:40 |
mordred | I pushed up a broken patch to remind us to fix the nodepool key | 22:41 |
mordred | ok - executors and mergers are restarted | 22:41 |
clarkb | nodepool_test_node_ssh_private_key_contents <- that is the key I used | 22:41 |
clarkb | I've also annotated the keys to explain what they are for | 22:42 |
mordred | clarkb: awesome | 22:42 |
mordred | I pushed remote: https://review.opendev.org/723049 Add nodepool node key | 22:42 |
corvus | ready to restart sched? | 22:42 |
mordred | that's a much better key name | 22:42 |
mordred | corvus: yes | 22:42 |
corvus | starting | 22:43 |
corvus | k it's up | 22:47 |
corvus | re-enqueueing | 22:47 |
clarkb | jobs seem to be working on ze01 if I'm reading the logs correctly | 22:48 |
corvus | yeah, that's what i'm seeing | 22:48 |
mordred | PHEW | 22:48 |
clarkb | its configuring mirrors and stuff | 22:48 |
mordred | ok - we don't actually have that many changes in the local ansible repo on system-config - I think we've captured them all in gerrit changes | 22:48 |
clarkb | jemalloc, zuul.conf, and managing nodepool test node ssh key? | 22:49 |
fungi | status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:49 |
fungi | maybe after we've seen at least one build report success | 22:49 |
fungi | and all the reenqueuing is finished | 22:49 |
clarkb | mordred: https://review.opendev.org/#/c/723049/1 that needs a small update | 22:50 |
clarkb | (commented inline) | 22:51 |
mordred | clarkb: you didn't click send | 22:51 |
clarkb | https://review.opendev.org/#/c/723023/ https://review.opendev.org/#/c/723049/ and https://review.opendev.org/#/c/723046/ are the changes | 22:51 |
clarkb | mordred: gah sorry | 22:51 |
clarkb | mordred: now I have | 22:52 |
mordred | probably 0400 too yeah? | 22:52 |
mordred | should I do owner/group? | 22:52 |
clarkb | the others are all 0400 and owned by zuul:zuul | 22:53 |
clarkb | I think paramiko may complain if it isn't set that way so probably should | 22:53 |
fungi | yes, or 0600 would also be okay but since we're deploying them with configuration management 0400 is probably a better signal | 22:53 |
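Writing that key out of the group_vars entry might then look something like this; the variable name is the one clarkb added, while everything else (task placement, exact path handling) is an assumption rather than what 723049 actually does:

```yaml
- name: Install the nodepool test-node SSH private key on executors
  copy:
    content: "{{ nodepool_test_node_ssh_private_key_contents }}"
    dest: /var/lib/zuul/ssh/nodepool_id_rsa
    owner: zuul
    group: zuul
    mode: "0400"
  no_log: true   # keep private key material out of the ansible logs
```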
corvus | fungi: reenqueing is finished | 22:54 |
corvus | fungi: i see a successful build | 22:54 |
fungi | thanks corvus, now to find a success | 22:54 |
clarkb | maybe someone else can double check the group_vars git diff and we can git commit that? I'm happy to git commit if someone else confirms it looks good to them | 22:54 |
corvus | fungi: top of check (but the change hasn't reported yet, still has a job running) | 22:54 |
fungi | corvus: any reason to wait for a full buildset to report, or is that unlikely to be broken? | 22:54 |
corvus | fungi: i think you're good to send | 22:55 |
fungi | thanks, doing then | 22:55 |
mordred | clarkb: I updated 723049 with 0400 and also added a key to the test templates | 22:55 |
fungi | #status notice This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:55 |
openstackstatus | fungi: sending notice | 22:55 |
-openstackstatus- NOTICE: This Zuul outage was taken as an opportunity to perform an impromptu maintenance for changing our service deployment model; any merge failures received from Zuul between 19:40 and 20:20 UTC were likely in error and those changes should be rechecked; any patches uploaded between 20:55 and 22:45 UTC were missed entirely by Zuul and should also be rechecked to get fresh test results | 22:56 | |
clarkb | mordred: thanks lgtm | 22:56 |
corvus | am i clear to eod? i have no brains. | 22:58 |
clarkb | I think so. If there are any braincells left reviewing those changes might be a good thing. However, maybe better to do that when brains are ready | 22:58 |
clarkb | I think the three changes above cover all our debugging | 22:59 |
fungi | i will review with coffee in the morning | 22:59 |
openstackstatus | fungi: finished sending notice | 22:59 |
fungi | if they don't get approved sooner | 22:59 |
clarkb | I'm going to take a break then will check back in to see that things are still chugging along | 23:01 |
clarkb | lots of successful jobs though so things are looking goog | 23:01 |
clarkb | *good too | 23:01 |
corvus | mordred: oh i'm seeing a lot of cronspam | 23:01 |
corvus | looks like the backup cron may have a syntax error? | 23:02 |
corvus | the zuul queue backup cron | 23:02 |
corvus | # Puppet Name: zuul_scheduler_status_backup-openstack-zuul-tenant | 23:02 |
corvus | * * * * * timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status -o /var/lib/zuul/backup/openstack-zuul-tenant_status_$(date +\%s).json 2>/dev/null | 23:02 |
corvus | #Ansible: zuul-scheduler-status-openstack | 23:02 |
corvus | * * * * * timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status -o /var/lib/zuul/backup/openstack_status_$(date +\\%s).json 2>/dev/null | 23:02 |
corvus | looks like the wrong level of escaping | 23:03 |
mordred | yeah | 23:03 |
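In a crontab, % is special and has to reach cron escaped as a single \%, while the generated entry ended up with a doubled \\%. A hedged sketch of what the cron task probably needs to produce (the cron user and task name are assumptions, not necessarily what 723052 does):

```yaml
# The cron module defaults minute/hour/day/month/weekday to "*",
# matching the every-minute schedule in the crontab quoted above.
- name: Back up the openstack tenant status
  cron:
    name: zuul-scheduler-status-openstack
    user: zuul
    job: >-
      timeout -k 5 10 curl https://zuul.opendev.org/api/tenant/openstack/status
      -o /var/lib/zuul/backup/openstack_status_$(date +\%s).json 2>/dev/null
```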
corvus | do we have periodic crons running these playbooks yet? | 23:04 |
corvus | i've manually fixed the crontab on zuul01 | 23:04 |
mordred | corvus: we do | 23:05 |
corvus | that means we need to land all of the changes, plus this one now, right? | 23:05 |
mordred | https://review.opendev.org/#/c/723052/ | 23:05 |
mordred | yeah | 23:05 |
mordred | clarkb, corvus: https://review.opendev.org/#/c/723022/ ... actually, lemme free that from the stack | 23:06 |
mordred | corvus: should we direct enqueue some into the gate? | 23:07 |
clarkb | we can also put them in emergency | 23:07 |
mordred | clarkb: https://review.opendev.org/#/c/723052/ | 23:07 |
corvus | i like emergency | 23:07 |
corvus | mordred: mostly because https://review.opendev.org/723046 | 23:08 |
mordred | yeah | 23:08 |
corvus | (i don't want to land that without seeing the output) | 23:08 |
clarkb | I've +2'd 3052 but not approved it | 23:08 |
mordred | I think it's just executors we need in emergency yes? | 23:08 |
corvus | i think we can +w that one | 23:08 |
corvus | mordred: everything? | 23:08 |
mordred | kk. on it | 23:08 |
corvus | (i don't want a delta between the running config and what's on disk) | 23:09 |
mordred | ok. I have put all of zuul in emergency | 23:09 |
clarkb | corvus: if we approve it it will run ansible against zuul and change the configs I think? | 23:09 |
corvus | mordred: thanks! | 23:09 |
clarkb | I'll let others decide if that is safe and approve if so | 23:10 |
corvus | clarkb: hopefully not if they're in emergency? | 23:10 |
mordred | clarkb: no - we just put everything in emergency | 23:10 |
clarkb | corvus: ohya if we are in emergency now we can approve I'll do that | 23:10 |
clarkb | mordred what is a zl01? | 23:16 |
mordred | exactly | 23:16 |
clarkb | I'm trying to update the zuul run job to log the zuul conf and noticed that | 23:16 |
mordred | clarkb: fixed here: https://review.opendev.org/#/c/723023/3/.zuul.yaml | 23:16 |
fungi | our old ansible launchers were zl* | 23:17 |
fungi | for zuul v2.5 | 23:17 |
mordred | clarkb: which is why the jemalloc bug was able to exist | 23:17 |
clarkb | 'm k I'll rebase | 23:17 |
clarkb | ok popping out now for a bit | 23:21 |
mordred | same | 23:27 |