Friday, 2017-04-14

pabelanger	bos status afs01.ord.openstack.org	00:00
pabelanger	Auxiliary status is: file server shut down.	00:00
pabelanger	think we are ready for reboot	00:00
pabelanger	rebooting	00:00
clarkb	ok	00:01
clarkb	I'm watching boot of baremetal00 on console	00:02
pabelanger	okay, back online	00:02
pabelanger	bos status afs01.ord.openstack.org	00:02
pabelanger	Auxiliary status is: file server running.	00:02
pabelanger	kernel is also correct	00:02
clarkb	MLNX initializing devices	00:02
clarkb	it says Link: Down Link status: Not connected	00:03
clarkb	and the mac addr there matches what is in hiera :(	00:03
pabelanger	odd	00:04
clarkb	sorry I take that back	00:04
clarkb	the macaddr doesn't match	00:04
clarkb	it finally moved on to net1 which appears to be the one that matches and that one is dhcping	00:04
pabelanger	I think we can give afs02.dfw.openstack.org a shot now	00:05
clarkb	gah, missed whether it dhcped or not	00:06
clarkb	pabelanger: cool I say go for it then	00:06
clarkb	now just a bunch of spam about usb devices flapping and thats it	00:06
clarkb	I'm not sure its getting network	00:06
pabelanger	bos status afs02.dfw.openstack.org	00:07
pabelanger	Auxiliary status is: file server shut down.	00:07
pabelanger	rebooting	00:07
clarkb	and VSP isn't working for getting a login prompt	00:08
pabelanger	bos status afs02.dfw.openstack.org	00:09
pabelanger	Auxiliary status is: file server running.	00:09
clarkb	cool	00:09
clarkb	pabelanger: now how do we make afs02 the RW RO instead of just RO? or does it matter since we aren't going to do any writes with all the locks held. EG can we just do afs01 next?	00:10
clarkb	thinking about that I bet we could get away with no RW temporarily	00:10
pabelanger	right, if we are holding locks, we might be able to get away with it	00:10
pabelanger	unless somebody else does a write from some other location than mirror-update.o.o	00:11
clarkb	like one of the wheel builders?	00:11
clarkb	but otherwise that should be it ya?	00:11
clarkb	and in that case they should just fail and try again later I think	00:11
pabelanger	ya	00:11
pabelanger	or docs job	00:11
clarkb	oh right docs	00:12
pabelanger	let me read quickly how to move a RW volume	00:12
pabelanger	vos move seems to be the command	00:13
clarkb	and maybe just move docs?	00:14
pabelanger	ya, let me see if I can do that	00:14
clarkb	I'm about to call it a day on baremetal00 for now and maybe see if cmurphy or rcarillocruz can take a look in their morning time	00:16
clarkb	I don't know enough about the env to debug much further but it appears to be no networking on that nic	00:16
pabelanger	Volume is locked for a move operation	00:18
pabelanger	okay move looks to be happening	00:18
pabelanger	was sure to do it in screen too	00:18
clarkb	cool	00:18
pabelanger	vos move -id docs -fromserver afs01.dfw.openstack.org -frompartition vicepa -toserver afs01.ord.openstack.org -topartition vicepa	00:18
pabelanger	was syntax	00:18
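The move command quoted above can be assembled programmatically, which makes it harder to swap the source and destination by accident. A minimal sketch, assuming Python is available on the host running the move; the helper name `build_vos_move` is hypothetical, and only the flags that appear in this log are used.

```python
def build_vos_move(volume, from_server, from_partition,
                   to_server, to_partition, localauth=False):
    """Return the argv list for a `vos move`, using only flags seen in this log."""
    argv = [
        "vos", "move",
        "-id", volume,
        "-fromserver", from_server,
        "-frompartition", from_partition,
        "-toserver", to_server,
        "-topartition", to_partition,
    ]
    if localauth:
        # Same flag pabelanger passes to `bos status` later in the log.
        argv.append("-localauth")
    return argv


# Rebuild the command that was run for the docs volume.
cmd = build_vos_move(
    "docs",
    "afs01.dfw.openstack.org", "vicepa",
    "afs01.ord.openstack.org", "vicepa",
)
print(" ".join(cmd))
```

The list form can be handed to `subprocess.run` unchanged; running it inside screen, as was done here, is still advisable for a long move.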
clarkb	I take it it isn't an instantaneous promotion?	00:21
pabelanger	doesn't look like it	00:21
clarkb	still going?	00:31
pabelanger	yup	00:32
pabelanger	have screen running on afs01.dfw.o.o	00:33
clarkb	I'm trying mnaser's suggestion of trying to boot on old kernel on baremetal00 now	00:33
clarkb	waiting for it to boot back up again	00:33
pabelanger	k	00:35
clarkb	ok baremetal00 is back up on its old kernel	00:45
clarkb	gonna leave it alone until tomorrow	00:45
clarkb	pabelanger: with that out of the way do we want to try doing kdc01 or the db servers?	00:46
clarkb	I guess that may impact the move so probably not?	00:46
pabelanger	ya, I think we should wait until the move is done	00:46
pabelanger	http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2264&rra_id=all	00:47
pabelanger	looks like 11 Mb cap	00:47
clarkb	how big is the volume?	00:47
pabelanger	536870991	00:48
pabelanger	500GB?	00:48
clarkb	ya	00:48
clarkb	so this gonna take a while	00:48
clarkb	also that seems like a lot of docs	00:48
pabelanger	so, going to be a few hours	00:49
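How long "a few hours" turns out to be depends on whether the 11 on the graph is megabits or megabytes per second, which the log does not say. A rough sketch of the arithmetic, assuming the 500 GB guess from the conversation and a steady rate; `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to copy size_bytes at a steady bytes_per_second rate."""
    return size_bytes / bytes_per_second / 3600


SIZE_BYTES = 500 * 10**9  # the "500GB?" guess above, not a measured size

# Two readings of "11 Mb": megabytes per second and megabits per second.
as_megabytes = transfer_hours(SIZE_BYTES, 11 * 10**6)
as_megabits = transfer_hours(SIZE_BYTES, 11 * 10**6 / 8)

print(f"{as_megabytes:.1f} h at 11 MB/s, {as_megabits:.1f} h at 11 Mb/s")
```

Either way the move runs well past the end of the working day, which fits the decision to pick it up the next morning.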
clarkb	maybe pick this up tomorrow morning?	00:50
clarkb	I too have errands that need to be done	00:50
pabelanger	sounds like a great idea	00:50
clarkb	like new phone. My current one just died :(	00:50
clarkb	I am going to reattempt rebooting baremetal00. I diffed the linux packages on it against compute000 and found it was missing linux-image-generic and linux-image-extras-$version-generic so pulled those in and I think that will get us future kernel updates (and the extras package has kernel modules we likely need). The old kernel is still there so can fall back on it again via ilo + grub if I need to	15:45
clarkb	success! uname reports new kernel, ssh works, and I can ironic node list the ironic cluster	15:55
pabelanger	yay	15:57
pabelanger	also	15:58
pabelanger	Volume 536870991 moved from afs01.dfw.openstack.org /vicepa to afs01.ord.openstack.org /vicepa	15:58
clarkb	cool, I really need caffeine, but will be back shortly	15:59
clarkb	guessing next step is to reboot afs01.dfw?	15:59
pabelanger	ya, going to grab locks first again	15:59
pabelanger	okay, decided to remove crontab this time and place mirror-update into emergency file	16:01
pabelanger	so, I don't think we have done a vos release on npm for some time	16:05
pabelanger	so, once the current flocks expire, we can start bos shutdown	16:06
pabelanger	# bos status afs01.dfw.openstack.org	16:08
pabelanger	Auxiliary status is: file server shut down.	16:08
pabelanger	rebooting	16:08
clarkb	kk	16:09
pabelanger	Auxiliary status is: file server running.	16:10
clarkb	and on new kernel?	16:10
clarkb	before we allow things to run on mirror-update again lets decide on plan for the dbs and kdc	16:11
pabelanger	yup	16:11
clarkb	I think kdc can be done safely as long as there are no writers getting kerberos tokens	16:11
clarkb	which means I think we can do that one now?	16:11
clarkb	oh right docs	16:11
clarkb	hrm	16:11
clarkb	maybe just go for it and rerun any docs jobs that may fail?	16:12
pabelanger	docs is cron based	16:12
pabelanger	isn't it?	16:12
pabelanger	vos release is crontab	16:12
clarkb	the vos release is, but I think the jobs write to the RW volume directly	16:12
pabelanger	ya	16:12
clarkb	both things require kerberos	16:12
pabelanger	ya, guess we should setup redundant kerberos next :)	16:13
clarkb	we can stop the cron from doing releases	16:13
clarkb	then just rerun any jobs that fail	16:13
pabelanger	are the docs jobs on a static node?	16:13
pabelanger	we could stop zlstatic01 for now	16:13
clarkb	doesn't look like it at least going off of the project-config docs job running now (its in osic)	16:14
clarkb	mordred: ^ you set up this stuff, any thoughts ?	16:14
pabelanger	wheel mirror also	16:15
pabelanger	afsdb01 and afsdb02 we should be able to rotate them for shutdown	16:16
clarkb	ya I think the link you posted yesterday said it was same steps as FS too? bos shutdown then reboot?	16:17
clarkb	I'm taking notes on things we have to watch out for on the kdc01 restart on the etherpad	16:17
pabelanger	ya, afs docs say same process for db servers	16:17
clarkb	Why don't we go ahead with db0X rotations. Then we can quiesce as best as possible the afs writers then restart kdc01	16:17
pabelanger	I'm assuming openafs is setup to fail over between the databases	16:17
pabelanger	k	16:18
clarkb	pabelanger: lets hope thats why we have two of them :)	16:18
pabelanger	will do afsdb02.o.o first	16:18
clarkb	ok	16:18
clarkb	and are you running that as root? and does bos shutdown require kerberos auth?	16:18
pabelanger	using localauth	16:19
pabelanger	and root	16:19
pabelanger	# bos status afsdb02.openstack.org -localauth	16:19
pabelanger	Instance ptserver, temporarily disabled, currently shutdown.	16:19
pabelanger	Instance vlserver, temporarily disabled, currently shutdown.	16:19
pabelanger	rebooting	16:19
pabelanger	# bos status afsdb02.openstack.org -localauth	16:20
pabelanger	Instance ptserver, currently running normally.	16:20
pabelanger	Instance vlserver, currently running normally.	16:20
pabelanger	kernel good too	16:21
clarkb	yay	16:21
pabelanger	moving on to afsdb01	16:21
clarkb	ok	16:21
pabelanger	# bos status afsdb01.openstack.org -localauth	16:22
pabelanger	Instance ptserver, temporarily disabled, has core file, currently shutdown.	16:22
pabelanger	Instance vlserver, temporarily disabled, currently shutdown.	16:22
clarkb	I annotated etherpad with thoughts on how we can quiesce the various pieces for kdc01 restart	16:22
pabelanger	# bos status afsdb01.openstack.org -localauth	16:23
pabelanger	Instance ptserver, has core file, currently running normally.	16:23
pabelanger	Instance vlserver, currently running normally.	16:23
pabelanger	and kernel is good	16:23
clarkb	woot	16:23
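The before and after `bos status` checks above can be compared mechanically rather than by eye. A minimal sketch, assuming the output format is exactly what was pasted in this log; `instances_running` is a hypothetical helper and it only looks at `Instance ...` lines, not the file server's `Auxiliary status` line.

```python
def instances_running(bos_status_output):
    """Map each instance name to True if its line says it is running normally."""
    result = {}
    for line in bos_status_output.splitlines():
        line = line.strip()
        if not line.startswith("Instance "):
            continue  # skip anything that is not a per-instance status line
        name = line[len("Instance "):].split(",", 1)[0]
        result[name] = "currently running normally" in line
    return result


# Output pasted from afsdb01 before and after the reboot.
before = """Instance ptserver, temporarily disabled, has core file, currently shutdown.
Instance vlserver, temporarily disabled, currently shutdown."""
after = """Instance ptserver, has core file, currently running normally.
Instance vlserver, currently running normally."""

print(instances_running(before))
print(instances_running(after))
```

Note that "has core file" is ignored by this check, so the ptserver core file on afsdb01 would still need a separate look.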
clarkb	where does the docs vos release cron run?	16:25
* clarkb greps puppet		16:25
pabelanger	okay, I can graceful shutdown zlstatic	16:25
pabelanger	k, not sure myself	16:25
pabelanger	2017-04-14 16:25:44,636 DEBUG zuul.LaunchServer: Stopped	16:26
clarkb	looks like maybe afsdb01	16:26
pabelanger	ya	16:27
pabelanger	I see it	16:27
clarkb	pabelanger: you want me to put that host in the emergency file?	16:27
pabelanger	sure	16:27
pabelanger	I don't see a lock, so we'll have to remove crontab	16:28
clarkb	ya I just added afsdb01.openstack.org to emergency file so I think you can remove the cron entry now	16:28
pabelanger	crontab removed	16:28
clarkb	I'm double checking packages on kdc01 now	16:29
clarkb	looks good	16:29
pabelanger	kk	16:29
clarkb	shall I reboot it?	16:29
pabelanger	sure	16:29
clarkb	actually /me checks what services are running on it first so we can confirm they are up before releasing locks and things	16:30
clarkb	oh wait there is a kdc02	16:30
pabelanger	oh	16:31
clarkb	maybe	16:31
clarkb	I can't ssh to it	16:31
clarkb	pabelanger: are you able to get onto it?	16:31
pabelanger	same, not responding	16:31
pabelanger	http://www.tldp.org/HOWTO/Kerberos-Infrastructure-HOWTO/server-replication.html	16:32
pabelanger	seems straightforward	16:32
clarkb	kdc01 is trying to propagate the principals database to kdc02 using kprop right now	16:32
clarkb	thats how I noticed it	16:32
pabelanger	once we finish this, we can do ^	16:32
pabelanger	let me check cacti	16:32
clarkb	well I think we may already do ^ but kdc02 is not working. So maybe we first sort out kdc02, then do the propagation then do kdc01?	16:33
pabelanger	seems we lost it back in Oct	16:33
pabelanger	according to cacti	16:33
pabelanger	sure, we can try and bring it back online	16:34
clarkb	ya lets do that	16:34
clarkb	then I think we can do this using process above	16:34
pabelanger	okay, you grabbing console?	16:34
clarkb	I was tempted to just try reboot it with nova api... but we can try console first	16:35
clarkb	looks like its an ord server	16:35
pabelanger	okay, I'll let you drive :)	16:35
clarkb	oh right you can't use the normal console log api with rax?	16:35
clarkb	confirmed...	16:36
clarkb	nova api says server is active and running	16:37
clarkb	pabelanger: was it you saying there was an ssh option for this?	16:39
clarkb	I got the web thing working, there is a login prompt, but I can't login because ENOPASSWD	16:41
clarkb	should we go ahead and do a reboot via nova api?	16:43
pabelanger	clarkb: ya, ssh	16:44
pabelanger	you place server into emergency mode, it will use the same IP	16:44
pabelanger	you then get a new root password and can SSH	16:45
clarkb	ah I think we can just try a reboot before emergency	16:45
pabelanger	kk	16:45
clarkb	going to do that now via server reboot kdc02.openstack.org	16:45
clarkb	I see it booting in the console	16:46
pabelanger	ssh works	16:46
clarkb	the kprop runs every 15 minutes	16:47
pabelanger	we should do kdc02 first for kernel	16:48
pabelanger	looks like it might need an update	16:48
clarkb	pabelanger: lets let the 1700UTC kprop run, then update kdc02 kernel, then kprop again then do kdc01	16:48
clarkb	yup	16:48
pabelanger	wfm	16:48
clarkb	I'm gonna make sure packages on kdc02 are up to date since it may not have had networking for a while	16:49
pabelanger	maybe just kick it from puppetmaster.oo?	16:49
clarkb	package updates are run out of apt not puppet master	16:49
clarkb	I'm just manually doing an update and sure enough there are things that need updating	16:50
clarkb	apt updates around 0600UTC daily iirc	16:50
pabelanger	ah, right	16:50
clarkb	hrm got a message about setting up a kerberos realm.. I wonder if that happens regardless or if we haven't configured this server properly :/	16:50
pabelanger	ya, I am unsure actually	16:52
clarkb	/etc/krb5.conf says default realm is OPENSTACK.ORG so I think we are good once we kprop	16:53
pabelanger	ansible is also running on kdc02 now	16:54
clarkb	ok I didn't catch the kprop happen. maybe I should run it in the foreground just to make sure it is happy?	17:00
clarkb	doing that now	17:01
clarkb	Database propagation to kdc02.openstack.org: SUCCEEDED	17:01
clarkb	pabelanger: ready to reboot kdc02?	17:01
pabelanger	sure	17:01
clarkb	ok doing it now	17:01
clarkb	ok kdc02 is back up again. I'm going to rerun propagation manually	17:02
clarkb	then I think we can reboot 01?	17:03
pabelanger	think so	17:03
clarkb	kprop: Connection refused while connecting to server <-	17:03
pabelanger	docs say things will fail over	17:03
* clarkb waits patiently		17:03
pabelanger	oh	17:03
pabelanger	Apr 14 16:57:04 kdc02 puppet-user[18536]: (/Stage[main]/Kerberos::Server/Service[krb5-kpropd]/ensure) ensure changed 'stopped' to 'running'	17:03
pabelanger	think we need to wait for puppet to start it	17:04
clarkb	oh maybe	17:04
pabelanger	let me kick.sh	17:04
clarkb	kk	17:04
pabelanger	if it comes online, then we can patch puppet	17:04
pabelanger	kicking	17:05
pabelanger	clarkb: try now	17:06
clarkb	Database propagation to kdc02.openstack.org: SUCCEEDED	17:06
clarkb	that must've been it	17:06
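The refused connection followed by a successful run, once kpropd had been started on kdc02, suggests retrying the propagation for a while before treating it as a failure. A sketch of that idea, assuming the caller supplies a function that runs kprop and returns its output; `propagate_with_retry` and its parameters are hypothetical, and the real kprop invocation is deliberately left out.

```python
import time


def propagate_with_retry(run_kprop, attempts=5, delay_seconds=60, sleep=time.sleep):
    """Call run_kprop() until its output contains SUCCEEDED; return True on success."""
    for attempt in range(1, attempts + 1):
        output = run_kprop()
        if "SUCCEEDED" in output:
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the receiving side time to start
    return False


# Stand-in for the real command, replaying the two results seen above.
replies = iter([
    "kprop: Connection refused while connecting to server",
    "Database propagation to kdc02.openstack.org: SUCCEEDED",
])
ok = propagate_with_retry(lambda: next(replies), sleep=lambda seconds: None)
print(ok)
```

Retrying only papers over the gap; the puppet patch mentioned next is the change that makes kpropd start on boot.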
pabelanger	working on patch	17:06
clarkb	I'm gonna double check packages on kdc01 now	17:06
clarkb	says its up to date so ready to reboot it whenever you are	17:07
pabelanger	go for it	17:07
clarkb	ok doing it now	17:07
clarkb	its back	17:08
clarkb	and kernel is updated	17:08
pabelanger	Yay	17:08
pabelanger	clarkb: remove servers from emergency and bring zlstatic online?	17:11
clarkb	pabelanger: yes I think so	17:11
pabelanger	zuul started	17:12
pabelanger	servers removed from emergency	17:12
pabelanger	will make sure crontabs are recreated	17:12
clarkb	ok	17:12
pabelanger	mirror-update.o.o good	17:15
pabelanger	afsdb01.o.o good too	17:16
clarkb	so now we just monitor that things are updating as expected ya?	17:17
clarkb	also do we want to vos move the docs volume back to 01?	17:17
pabelanger	ya, I'm going to hold the lock on npm and figure that out	17:17
pabelanger	we haven't released in a month or so	17:17
clarkb	oh wow	17:17
clarkb	I'm tempted to try and write down the "how to reboot an entire afs cluster without downtime" in system-config	17:18
clarkb	let me start on that draft so that we don't forget	17:18
pabelanger	kk	17:21
pabelanger	oh, we should also vos move docs back	17:21
pabelanger	I can start that shortly	17:21
clarkb	ok	17:24
pabelanger	vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa -fromserver afs01.ord.openstack.org -frompartition vicepa --localauth	17:27
pabelanger	running now	17:27
clarkb	pabelanger: and you are running that on afsdb01?	17:28
pabelanger	yes	17:28
pabelanger	from screen	17:28
clarkb	pabelanger: fwiw next time I think we want to move to afs02 which is local to the same datacenter (will be faster)	17:45
clarkb	oh I thought we had a second server in dfw doesn't look like we do	17:45
pabelanger	clarkb: agree. I thought that afs01.ord.o.o was actually not used any more	17:45
clarkb	listvldb says its the same	17:46
pabelanger	we should confirm with jeblair next week	17:46
clarkb	er rather RW and RO are cohabitated on same server?	17:46
pabelanger	Ya, RW RO on the same server	17:47
pabelanger	with back up RO	17:47
clarkb	but backup RO is in ord right?	17:48
pabelanger	not any more	17:49
pabelanger	I thought we had stopped using ord, because network was bottleneck	17:49
clarkb	I'm looking at mirror specifically and I see afs01.dfw is RW and RO and afs01.ord is RO	17:49
clarkb	I thought we added a second server in dfw to be RO too?	17:49
clarkb	but not seeing that (could just be blind)	17:50
pabelanger	ya, I saw that too. I _think_ we need to fix some things next week. And make sure everything is setup for afs01.dfw and afs02.dfw	17:50
pabelanger	jeblair likely knows more	17:51
clarkb	++	17:51
