pabelanger | bos status afs01.ord.openstack.org | 00:00 |
---|---|---|
pabelanger | Auxiliary status is: file server shut down. | 00:00 |
pabelanger | think we are ready for reboot | 00:00 |
pabelanger | rebooting | 00:00 |
clarkb | ok | 00:01 |
clarkb | I'm watching boot of baremetal00 on console | 00:02 |
pabelanger | okay, back online | 00:02 |
pabelanger | bos status afs01.ord.openstack.org | 00:02 |
pabelanger | Auxiliary status is: file server running. | 00:02 |
pabelanger | kernel is also correct | 00:02 |
clarkb | MLNX initializing devices | 00:02 |
clarkb | it says Link: Down Link status: Not connected | 00:03 |
clarkb | and the mac addr there matches what is in hiera :( | 00:03 |
pabelanger | odd | 00:04 |
clarkb | sorry I take that back | 00:04 |
clarkb | the macaddr doesn't match | 00:04 |
clarkb | it finally moved on to net1 which appears to be the one that matches and that one is dhcping | 00:04 |
pabelanger | I think we can give afs02.dfw.openstack.org a shot now | 00:05 |
clarkb | gah, missed whether it dhcped or not | 00:06 |
clarkb | pabelanger: cool I say go for it then | 00:06 |
clarkb | now just a bunch of spam about usb devices flapping and thats it | 00:06 |
clarkb | I'm not sure its getting network | 00:06 |
pabelanger | bos status afs02.dfw.openstack.org | 00:07 |
pabelanger | Auxiliary status is: file server shut down. | 00:07 |
pabelanger | rebooting | 00:07 |
clarkb | and VSP isn't working for getting a login prompt | 00:08 |
pabelanger | bos status afs02.dfw.openstack.org | 00:09 |
pabelanger | Auxiliary status is: file server running. | 00:09 |
clarkb | cool | 00:09 |
clarkb | pabelanger: now how do we make afs02 the RW RO instead of just RO? or does it matter since we aren't going to do any writes with all the locks held. EG can we just do afs01 next? | 00:10 |
clarkb | thinking about that I bet we could get away with no RW temporarily | 00:10 |
pabelanger | right, if we are holding locks, we might be able to get away from it | 00:10 |
pabelanger | unless somebody else does a write from some other location than mirror-update.o.o | 00:11 |
clarkb | like one of the wheel builders? | 00:11 |
clarkb | but otherwise that should be it ya? | 00:11 |
clarkb | and in that case they should just fail and try again later I think | 00:11 |
pabelanger | ya | 00:11 |
pabelanger | or docs job | 00:11 |
clarkb | oh right docs | 00:12 |
pabelanger | let me read quickly how to move a RW volume | 00:12 |
pabelanger | vos move seems to be the command | 00:13 |
clarkb | and maybe just move docs? | 00:14 |
pabelanger | ya, let me see if I can do that | 00:14 |
clarkb | I'm about to call it a day on baremetal00 for now and maybe see if cmurphy or rcarillocruz can take a look in their morning time | 00:16 |
clarkb | I don't know enough about the env to debug much further but it appears to be no networking on that nic | 00:16 |
pabelanger | Volume is locked for a move operation | 00:18 |
pabelanger | okay move looks to be happening | 00:18 |
pabelanger | was sure to do it in screen too | 00:18 |
clarkb | cool | 00:18 |
pabelanger | vos move -id docs -fromserver afs01.dfw.openstack.org -frompartition vicepa -toserver afs01.ord.openstack.org -topartition vicepa | 00:18 |
pabelanger | was syntax | 00:18 |
clarkb | I take it it isn't an instantaneous promotion? | 00:21 |
pabelanger | doesn't look like it | 00:21 |
clarkb | still going? | 00:31 |
pabelanger | yup | 00:32 |
pabelanger | have screen running on afs01.dfw.o.o | 00:33 |
clarkb | I'm trying mnasers suggestion of trying to boot on old kernel on baremetal00 now | 00:33 |
clarkb | waiting for it to boot back up again | 00:33 |
pabelanger | k | 00:35 |
clarkb | ok baremetal00 is back up on its old kernel | 00:45 |
clarkb | gonna leave it alone until tomorrow | 00:45 |
clarkb | pabelanger: with that out of the way do we want to try doing kdc01 or the db servers? | 00:46 |
clarkb | I guess that may impact the move so probably not? | 00:46 |
pabelanger | ya, I think we should wait until the move is done | 00:46 |
pabelanger | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=2264&rra_id=all | 00:47 |
pabelanger | looks like 11 Mb cap | 00:47 |
clarkb | how big is the volume? | 00:47 |
pabelanger | 536870991 | 00:48 |
pabelanger | 500GB? | 00:48 |
clarkb | ya | 00:48 |
clarkb | so this gonna take a while | 00:48 |
clarkb | also that seems like a lot of docs | 00:48 |
pabelanger | so, going to be a few hours | 00:49 |
clarkb | maybe pick this up tomorrow morning? | 00:50 |
clarkb | I too have errands that need to be done | 00:50 |
pabelanger | sounds like a great idea | 00:50 |
clarkb | like new phone. My current one just died :( | 00:50 |
clarkb | I am going to reattempt rebooting baremetal00. I diffed the linux packages on it against compute000 and found it was missing linux-image-generic and linux-image-extras-$version-generic so pulled those in and I think that will get us future kernel updates (and the extras package has kernel modules we likely need). The old kernel is still there so can fall back on it again via ilo + grub if I need to | 15:45 |
clarkb | success! uname reports new kernel, ssh works, and I can ironic node list the ironic cluster | 15:55 |
pabelanger | yay | 15:57 |
pabelanger | also | 15:58 |
pabelanger | Volume 536870991 moved from afs01.dfw.openstack.org /vicepa to afs01.ord.openstack.org /vicepa | 15:58 |
clarkb | cool, I really need caffeine, but will be back shortly | 15:59 |
clarkb | guessing next step is to reboot afs01.dfw? | 15:59 |
pabelanger | ya, going to grab locks first again | 15:59 |
pabelanger | okay, decided to remove crontab this time and place mirror-update into emergency file | 16:01 |
pabelanger | so, I don't think we have done a vos release on npm for some time | 16:05 |
pabelanger | so, once the current flocks expire, we can start bos shutdown | 16:06 |
pabelanger | # bos status afs01.dfw.openstack.org | 16:08 |
pabelanger | Auxiliary status is: file server shut down. | 16:08 |
pabelanger | rebooting | 16:08 |
clarkb | kk | 16:09 |
pabelanger | Auxiliary status is: file server running. | 16:10 |
clarkb | and on new kernel? | 16:10 |
clarkb | before we allow things to run on mirror-update again lets decide on plan for the dbs and kdc | 16:11 |
pabelanger | yup | 16:11 |
clarkb | I think kdc can be done safely as long as there are no writers getting kerberos tokens | 16:11 |
clarkb | which means I think we can do that one now? | 16:11 |
clarkb | oh right docs | 16:11 |
clarkb | hrm | 16:11 |
clarkb | maybe just go for it and rerun any docs jobs that may fail? | 16:12 |
pabelanger | docs is cron based | 16:12 |
pabelanger | isn't it? | 16:12 |
pabelanger | vos release is crontab | 16:12 |
clarkb | the vos release is, but I think the job writes to the RW volume directly | 16:12 |
pabelanger | ya | 16:12 |
clarkb | both things require kerberos | 16:12 |
pabelanger | ya, guess we should setup redundant kerberos next :) | 16:13 |
clarkb | we can stop the cron from doing releases | 16:13 |
clarkb | then just rerun any jobs that fail | 16:13 |
pabelanger | are docs jobs on a static node? | 16:13 |
pabelanger | we could stop zlstatic01 for now | 16:13 |
clarkb | doesn't look like it at least going off of the project-config docs job running now (its in osic) | 16:14 |
clarkb | mordred: ^ you set up this stuff, any thoughts ? | 16:14 |
pabelanger | wheel mirror also | 16:15 |
pabelanger | afsdb01 and afsdb02 we should be able to rotate them for shutdown | 16:16 |
clarkb | ya I think the link you posted yesterday said it was same steps as FS too? bos shutdown then reboot? | 16:17 |
clarkb | I'm taking notes on things we have to watch out for on the kdc01 restart on the etherpad | 16:17 |
pabelanger | ya, afs docs say same process for db servers | 16:17 |
clarkb | Why don't we go ahead with db0X rotations. Then we can quiesce as best as possible the afs writers then restart kdc01 | 16:17 |
pabelanger | I'm assuming openafs is setup to fail over between the databases | 16:17 |
pabelanger | k | 16:18 |
clarkb | pabelanger: lets hope thats why we have two of them :) | 16:18 |
pabelanger | will do afsdb02.o.o first | 16:18 |
clarkb | ok | 16:18 |
clarkb | and are you running that as root? and does bos shutdown require kerberos auth? | 16:18 |
pabelanger | using localauth | 16:19 |
pabelanger | and root | 16:19 |
pabelanger | # bos status afsdb02.openstack.org -localauth | 16:19 |
pabelanger | Instance ptserver, temporarily disabled, currently shutdown. | 16:19 |
pabelanger | Instance vlserver, temporarily disabled, currently shutdown. | 16:19 |
pabelanger | rebooting | 16:19 |
pabelanger | # bos status afsdb02.openstack.org -localauth | 16:20 |
pabelanger | Instance ptserver, currently running normally. | 16:20 |
pabelanger | Instance vlserver, currently running normally. | 16:20 |
pabelanger | kernel good too | 16:21 |
clarkb | yay | 16:21 |
pabelanger | moving on to afsdb01 | 16:21 |
clarkb | ok | 16:21 |
pabelanger | # bos status afsdb01.openstack.org -localauth | 16:22 |
pabelanger | Instance ptserver, temporarily disabled, has core file, currently shutdown. | 16:22 |
pabelanger | Instance vlserver, temporarily disabled, currently shutdown. | 16:22 |
clarkb | I annotated etherpad with thoughts on how we can quiesce the various pieces for kdc01 restart | 16:22 |
pabelanger | # bos status afsdb01.openstack.org -localauth | 16:23 |
pabelanger | Instance ptserver, has core file, currently running normally. | 16:23 |
pabelanger | Instance vlserver, currently running normally. | 16:23 |
pabelanger | and kernel is good | 16:23 |
clarkb | woot | 16:23 |
clarkb | where does the docs vos release cron run? | 16:25 |
* clarkb | greps puppet | 16:25 |
pabelanger | okay, I can graceful shutdown zlstatic | 16:25 |
pabelanger | k, not sure myself | 16:25 |
pabelanger | 2017-04-14 16:25:44,636 DEBUG zuul.LaunchServer: Stopped | 16:26 |
clarkb | looks like maybe afsdb01 | 16:26 |
pabelanger | ya | 16:27 |
pabelanger | I see it | 16:27 |
clarkb | pabelanger: you want me to put that host in the emergency file? | 16:27 |
pabelanger | sure | 16:27 |
pabelanger | I don't see a lock, so we'll have to remove crontab | 16:28 |
clarkb | ya I just added afsdb01.openstack.org to emergency file so I think you can remove the cron entry now | 16:28 |
pabelanger | crontab removed | 16:28 |
clarkb | I'm double checking packages on kdc01 now | 16:29 |
clarkb | looks good | 16:29 |
pabelanger | kk | 16:29 |
clarkb | shall I reboot it? | 16:29 |
pabelanger | sure | 16:29 |
clarkb | actually /me checks what services are running on it first so we can confirm they are up before releasing locks and things | 16:30 |
clarkb | oh wait there is a kdc02 | 16:30 |
pabelanger | oh | 16:31 |
clarkb | maybe | 16:31 |
clarkb | I can't ssh to it | 16:31 |
clarkb | pabelanger: are you able to get onto it? | 16:31 |
pabelanger | same, not responding | 16:31 |
pabelanger | http://www.tldp.org/HOWTO/Kerberos-Infrastructure-HOWTO/server-replication.html | 16:32 |
pabelanger | seems straightforward | 16:32 |
clarkb | kdc01 is trying to propagate the principals database to kdc02 using kprop right now | 16:32 |
clarkb | thats how I noticed it | 16:32 |
pabelanger | once we finish this, we can do ^ | 16:32 |
pabelanger | let me check cacti | 16:32 |
clarkb | well I think we may already do ^ but kdc02 is unworking. So maybe we first sort out kdc02, then do the propagation, then do kdc01? | 16:33 |
pabelanger | seems we lost it back in Oct | 16:33 |
pabelanger | according to cacti | 16:33 |
pabelanger | sure, we can try and bring it back online | 16:34 |
clarkb | ya lets do that | 16:34 |
clarkb | then I think we can do this using process above | 16:34 |
pabelanger | okay, you grabbing console? | 16:34 |
clarkb | I was tempted to just try reboot it with nova api... but we can try console first | 16:35 |
clarkb | looks like its an ord server | 16:35 |
pabelanger | okay, I'll let you drive :) | 16:35 |
clarkb | oh right you can't use the normal console log api with rax? | 16:35 |
clarkb | confirmed... | 16:36 |
clarkb | nova api says server is active and running | 16:37 |
clarkb | pabelanger: was it you saying there was an ssh option for this? | 16:39 |
clarkb | I got the web thing working, there is a login prompt, but I can't login because ENOPASSWD | 16:41 |
clarkb | should we go ahead and do a reboot via nova api? | 16:43 |
pabelanger | clarkb: ya, ssh | 16:44 |
pabelanger | you place server into emergency mode, it will use the same IP | 16:44 |
pabelanger | you then get a new root password and can SSH | 16:45 |
clarkb | ah I think we can just try a reboot before emergency | 16:45 |
pabelanger | kk | 16:45 |
clarkb | going to do that now via server reboot kdc02.openstack.org | 16:45 |
clarkb | I see it booting in the console | 16:46 |
pabelanger | ssh works | 16:46 |
clarkb | the kprop runs every 15 minutes | 16:47 |
pabelanger | we should do kdc02 first for kernel | 16:48 |
pabelanger | looks like it might need an update | 16:48 |
clarkb | pabelanger: lets let the 1700UTC kprop run, then update kdc02 kernel, then kprop again then do kdc01 | 16:48 |
clarkb | yup | 16:48 |
pabelanger | wfm | 16:48 |
clarkb | I'm gonna make sure packages on kdc02 are up to date since it may not have had networking for a while | 16:49 |
pabelanger | maybe just kick it from puppetmaster.o.o? | 16:49 |
clarkb | package updates are run out of apt not puppet master | 16:49 |
clarkb | I'm just manually doing an update and sure enough there are things that need updating | 16:50 |
clarkb | apt updates around 0600UTC daily iirc | 16:50 |
pabelanger | ah, right | 16:50 |
clarkb | hrm got a message about setting up a kerberos realm.. I wonder if that happens regardless or if we haven't configured this server properly :/ | 16:50 |
pabelanger | ya, I am unsure actually | 16:52 |
clarkb | /etc/krb5.conf says default realm is OPENSTACK.ORG so I think we are good once we kprop | 16:53 |
pabelanger | ansible is also running on kdc02 now | 16:54 |
clarkb | ok I didn't catch the kprop happen. maybe I should run it in the foreground just to make sure it is happy? | 17:00 |
clarkb | doing that now | 17:01 |
clarkb | Database propagation to kdc02.openstack.org: SUCCEEDED | 17:01 |
clarkb | pabelanger: ready to reboot kdc02? | 17:01 |
pabelanger | sure | 17:01 |
clarkb | ok doing it now | 17:01 |
clarkb | ok kdc02 is back up again. I'm going to rerun propagation manually | 17:02 |
clarkb | then I think we can reboot 01? | 17:03 |
pabelanger | think so | 17:03 |
clarkb | kprop: Connection refused while connecting to server <- | 17:03 |
pabelanger | docs say things will fail over | 17:03 |
* clarkb | waits patiently | 17:03 |
pabelanger | oh | 17:03 |
pabelanger | Apr 14 16:57:04 kdc02 puppet-user[18536]: (/Stage[main]/Kerberos::Server/Service[krb5-kpropd]/ensure) ensure changed 'stopped' to 'running' | 17:03 |
pabelanger | think we need to wait for puppet to start it | 17:04 |
clarkb | oh maybe | 17:04 |
pabelanger | let me kick.sh | 17:04 |
clarkb | kk | 17:04 |
pabelanger | if it comes online, then we can patch puppet | 17:04 |
pabelanger | kicking | 17:05 |
pabelanger | clarkb: try now | 17:06 |
clarkb | Database propagation to kdc02.openstack.org: SUCCEEDED | 17:06 |
clarkb | that must've been it | 17:06 |
pabelanger | working on patch | 17:06 |
clarkb | I'm gonna double check packages on kdc01 now | 17:06 |
clarkb | says its up to date so ready to reboot it whenever you are | 17:07 |
pabelanger | go for it | 17:07 |
clarkb | ok doing it now | 17:07 |
clarkb | its back | 17:08 |
clarkb | and kernel is updated | 17:08 |
pabelanger | Yay | 17:08 |
pabelanger | clarkb: remove servers from emergency and bring zlstatic online? | 17:11 |
clarkb | pabelanger: yes I think so | 17:11 |
pabelanger | zuul started | 17:12 |
pabelanger | servers removed from emergency | 17:12 |
pabelanger | will make sure crontabs are recreated | 17:12 |
clarkb | ok | 17:12 |
pabelanger | mirror-update.o.o good | 17:15 |
pabelanger | afsdb01.o.o good too | 17:16 |
clarkb | so now we just monitor that things are updating as expected ya? | 17:17 |
clarkb | also do we want to vos move the docs volume back to 01? | 17:17 |
pabelanger | ya, I'm going to hold the lock on npm and figure that out | 17:17 |
pabelanger | we haven't released in a month or so | 17:17 |
clarkb | oh wow | 17:17 |
clarkb | I'm tempted to try and write down the "how to reboot an entire afs cluster without downtime" in system-config | 17:18 |
clarkb | let me start on that draft so that we don't forget | 17:18 |
pabelanger | kk | 17:21 |
pabelanger | oh, we should also vos move docs back | 17:21 |
pabelanger | I can start that shortly | 17:21 |
clarkb | ok | 17:24 |
pabelanger | vos move -id docs -toserver afs01.dfw.openstack.org -topartition vicepa -fromserver afs01.ord.openstack.org -frompartition vicepa --localauth | 17:27 |
pabelanger | running now | 17:27 |
clarkb | pabelanger: and you are running that on afsdb01? | 17:28 |
pabelanger | yes | 17:28 |
pabelanger | from screen | 17:28 |
clarkb | pabelanger: fwiw next time I think we want to move to afs02 which is local to the same datacenter (will be faster) | 17:45 |
clarkb | oh I thought we had a second server in dfw doesn't look like we do | 17:45 |
pabelanger | clarkb: agree. I thought that afs01.ord.o.o was actually not used any more | 17:45 |
clarkb | listvldb says its the same | 17:46 |
pabelanger | we should confirm with jeblair next week | 17:46 |
clarkb | er rather RW and RO are cohabitated on same server? | 17:46 |
pabelanger | Ya, RW RO on the same server | 17:47 |
pabelanger | with back up RO | 17:47 |
clarkb | but backup RO is in ord right? | 17:48 |
pabelanger | not any more | 17:49 |
pabelanger | I thought we had stopped using ord, because network was bottleneck | 17:49 |
clarkb | I'm looking at mirror specifically and I see afs01.dfw is RW and RO and afs01.ord is RO | 17:49 |
clarkb | I thought we added a second server in dfw to be RO too? | 17:49 |
clarkb | but not seeing that (could just be blind) | 17:50 |
pabelanger | ya, I seen that too. I _think_ we need to fix some things next week. And make sure everything is setup for afs01.dfw and afs02.dfw | 17:50 |
pabelanger | jeblair likely knows more | 17:51 |
clarkb | ++ | 17:51 |
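
The per-server routine repeated through the log (bos shutdown, check bos status, reboot, check bos status again, confirm the kernel) can be written down as a small plan generator. This is only a sketch of what the log shows, not a tested runbook: the function name and `ORDER` list are hypothetical, the commands are printed rather than executed, and using `-localauth` on every host is an assumption (the log shows it only for the db servers).

```python
from typing import List


def afs_reboot_steps(server: str, localauth: bool = True) -> List[str]:
    """Return the reboot steps seen in the log as command strings (nothing is executed)."""
    auth = " -localauth" if localauth else ""
    return [
        f"bos shutdown {server}{auth}",  # stop the AFS server processes cleanly
        f"bos status {server}{auth}",    # expect a "shut down" / "currently shutdown" status
        "reboot",                        # run on the server itself
        f"bos status {server}{auth}",    # expect the instances to be running again
        "uname -r",                      # confirm the new kernel is in use
    ]


# Order followed in the log: the ord and afs02 file servers first, then afs01.dfw
# (after the docs volume had been moved off it), then the db servers one at a time.
ORDER = [
    "afs01.ord.openstack.org",
    "afs02.dfw.openstack.org",
    "afs01.dfw.openstack.org",
    "afsdb02.openstack.org",
    "afsdb01.openstack.org",
]

if __name__ == "__main__":
    for host in ORDER:
        print(f"# {host}")
        for step in afs_reboot_steps(host):
            print(step)
```

The plan deliberately leaves out the parts that needed judgment in the log: holding the mirror-update locks, putting hosts in the emergency file, the `vos move` of docs, and the kdc02/kdc01 kprop sequence.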
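
The "going to be a few hours" estimate for the docs `vos move` can be checked with quick arithmetic. The sketch below assumes a 500 GB volume (a guess in the chat; the number 536870991 later appears as the volume ID in the `vos move` output, so the real size is not confirmed) and reads the "11 Mb cap" from the cacti graph both ways, as megabytes and as megabits per second, since the log does not say which unit was meant.

```python
def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to copy size_bytes at a constant rate_bytes_per_s."""
    return size_bytes / rate_bytes_per_s / 3600


SIZE_BYTES = 500 * 10**9            # assumed volume size: 500 GB
RATE_MBYTES = 11 * 10**6            # "11 Mb" read as 11 megabytes per second
RATE_MBITS = 11 * 10**6 / 8         # "11 Mb" read as 11 megabits per second

if __name__ == "__main__":
    print(f"{transfer_hours(SIZE_BYTES, RATE_MBYTES):.1f} h at 11 MB/s")    # 12.6 h
    print(f"{transfer_hours(SIZE_BYTES, RATE_MBITS):.1f} h at 11 Mbit/s")   # 101.0 h
```

The move started around 00:18 and was reported as finished by 15:58, which would fit the megabytes reading better than the megabits one, if the 500 GB guess is anywhere near right.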