*** anteaya has quit IRC | 05:26 | |
*** anteaya has joined #openstack-infra-incident | 05:29 | |
*** jhesketh has quit IRC | 10:25 | |
*** jhesketh has joined #openstack-infra-incident | 10:31 | |
*** fungi has joined #openstack-infra-incident | 21:12 | |
*** ChanServ changes topic to "Incident in progress: https://etherpad.openstack.org/p/venom-reboots" | 21:12 | |
*** clarkb has joined #openstack-infra-incident | 21:15 | |
fungi | same bat time, different bat channel | 21:15 |
---|---|---|
clarkb | how do we want to do this, just claim nodes off the etherpad? | 21:16 |
fungi | probably, after a little planning | 21:16 |
pleia2 | clarkb: unrelated to dns, gid woes mean I still don't have an account on git-fe01 & 02 | 21:16 |
clarkb | pleia2: oh darn | 21:16 |
pleia2 | we'll fix that up later | 21:17 |
fungi | oh, right, we've not cleaned up the uids/gids on those two have we? | 21:17 |
clarkb | apparently not, which makes it hard for pleia2 to do the load balancing dance for git* safe reboots | 21:17 |
pleia2 | right | 21:17 |
clarkb | why don't I go take a look and see how bad the gid situation is right now | 21:18 |
clarkb | maybe we can fix that real quick | 21:18 |
mordred | hey all | 21:18 |
pleia2 | welcome to the party, mordred | 21:18 |
lifeless | so what is venom ? I can't read the ticket | 21:18 |
mordred | lifeless: it's the latest marketing-named CVE | 21:19 |
clarkb | lifeless: it's a bug where VMs can break into the hypervisor via floppy drive code in qemu | 21:19 |
pleia2 | lifeless: guest-executable buffer overflow of the kernel floppy thing | 21:19 |
lifeless | clarkb: LOOOOOOOL | 21:19 |
*** zaro has joined #openstack-infra-incident | 21:19 | |
anteaya | http://seclists.org/oss-sec/2015/q2/425 | 21:19 |
clarkb | lifeless: so we can reboot gracefully or we get rebooted forcefully in ~24 hours | 21:19 |
lifeless | anteaya: thanks | 21:19 |
anteaya | welcome | 21:19 |
lifeless | much brilliant, such wow | 21:19 |
pleia2 | apparently floppy stuff is hard to remove from the kernel (I was surprised it was included at all in base systems) | 21:20 |
fungi | clarkb: yeah, remapping the uids/gids on those should be relatively trivial (i hope) | 21:20 |
fungi | pretty sure she's just conflicting with some unused account called "admin" | 21:21 |
pleia2 | admin is the former sudo group on ubuntu, only kept for backwards compatibility | 21:21 |
*** SpamapS has joined #openstack-infra-incident | 21:21 | |
pleia2 | unlikely that we're using it | 21:21 |
clarkb | fungi: group admin has gid 2015 which is pleia2's group gid | 21:21 |
clarkb | fungi: but sudo group needs to be moved too | 21:22 |
clarkb | so basically I need to find where group owner is 2015 or 2016, make sure I am root so I can chown files after the re-gidding (as I may break sudo), then run puppet | 21:22 |
fungi | k. neither of those should actually own any files, i don't think | 21:22 |
clarkb | fungi: I am going to find out shortly :) | 21:22 |
*** nibalizer has joined #openstack-infra-incident | 21:22 | |
fungi | that sounds like a good plan | 21:23 |
* mordred isn't QUITE back online yet - just got back from taking sandy to the airport - will be useful in a few ... | 21:23 | |
fungi | if nobody disagrees that the entries marked with * can be rebooted now, i'll go ahead and start in on those. take note that we need to not just reboot them from the command line. we need to halt them and then nova reboot --hard | 21:24 |
clarkb | sudo find / -gid 2015 and sudo find / -gid 2016 don't find any files on git-fe01, so I guess it's time to become root, change gid, then puppet | 21:25 |
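A rough sketch of the re-gidding sequence clarkb outlines above, assuming the gids mentioned (2015/2016); the replacement gids and the exact puppet invocation are illustrative, not what was actually run:

```bash
# stay root for the whole operation, since moving the sudo group around
# can break sudo mid-change
sudo -i

# confirm nothing on disk is owned by the gids about to move
find / -xdev \( -gid 2015 -o -gid 2016 \) -print

# shift the conflicting groups onto unused gids (3015/3016 are made up)
groupmod -g 3015 admin
groupmod -g 3016 sudo

# if the find above had turned anything up, those files would need
# re-owning, e.g.: find / -xdev -gid 2015 -exec chgrp -h 3015 {} +

# then let puppet recreate the per-user groups with the expected gids
puppet agent --test
```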
clarkb | fungi: go for it, maybe put rebooting process on etherpad for ease of finding | 21:26 |
pleia2 | I think stackalytics.openstack.org should be safe to reboot too | 21:26 |
fungi | will do | 21:26 |
fungi | oh, right. that's not actually production anything | 21:26 |
jeblair | so for design-summit-prep, i'm not sure what to do... i think we can/should reboot it, but none of us actually knows anything about the app on there, so i don't know what happens when we reboot it and it doesn't come up | 21:28 |
jeblair | we should really stop just handing root to people | 21:29 |
mordred | ++ | 21:29 |
*** greghaynes has joined #openstack-infra-incident | 21:29 | |
fungi | that one's been in a transitional state waiting on ttx to work with someone on writing puppet modules for the apps on there | 21:30 |
clarkb | pleia2: can you hop on git-fe01 and check that it works for you? sudo too? | 21:30 |
jeblair | tbh, my inclination is to reboot it and since there is nothing described by puppet running on it, nor any documentation, call our work done. | 21:30 |
pleia2 | mordred: is test-mordred-config-drive deletable? (see pad) | 21:30 |
lifeless | live migration should mitigate it... if they had that working | 21:30 |
fungi | i've pinged him in #-infra in case he's around | 21:30 |
jeblair | it's pretty late for ttx | 21:30 |
fungi | lifeless: yep! too bad | 21:30 |
pleia2 | clarkb: I'm in, thanks! | 21:30 |
pleia2 | (with sudo, yay) | 21:31 |
jeblair | i'm equally okay with "do nothing and let rax take care of it" | 21:31 |
clarkb | pleia2: awesome, I am working on fe02 now | 21:31 |
fungi | i'm starting down the easy reboots list in alpha order | 21:31 |
clarkb | pleia2: so first thing we need to do is take fe01 or fe02 out of the DNS round robin, then we can take one backend out of haproxy at a time on the other frontend and reboot the backend, put it back into service, rinse and repeat | 21:32 |
clarkb | pleia2: then when that is all done add fe02 back to DNS round robin, remove 01, reboot 01 | 21:32 |
clarkb | pleia2: and the only people that should see any downtime are those that hardcode a git frontend | 21:32 |
pleia2 | clarkb: ok, how are we interacting with dns for this? | 21:32 |
* mordred is going to start on the easy reboots in reverse alpha order | 21:33 | |
mordred | will meet fungi in the middle | 21:33 |
fungi | thanks mordred | 21:33 |
fungi | pleia2: clarkb: there is a rax dns client, but probably webui is easier for this | 21:33 |
pleia2 | fungi: if you could toss the exact instructions you're using for actually-effective reboots in the pad, it would help us be consistent | 21:33 |
clarkb | mordred: keep in mind a normal reboot is not good enough | 21:34 |
fungi | well, maybe not actually. since we just need to delete and then create a and aaaa records. cli client may be easier | 21:34 |
clarkb | (so lets all know how to reboot before we reboot) | 21:34 |
mordred | yah | 21:34 |
mordred | we apparently need to halt. then reboot --hard | 21:34 |
mordred | yah? | 21:34 |
jeblair | how can you do anything after halting? | 21:34 |
clarkb | mordred: thats what fungi said, but ^ | 21:34 |
clarkb | jeblair: I think you have to nova reboot it | 21:35 |
mordred | yah | 21:35 |
clarkb | so in instance do shutdown -h now | 21:35 |
jeblair | oh, "nova reboot --hard" | 21:35 |
mordred | yah | 21:35 |
clarkb | then go to nova client and reboot it | 21:35 |
mordred | sorry | 21:35 |
lifeless | list(parse_requirements('foo==0.1.dev0'))[0].specifier.contains(parse_version('0.1.dev0')) | 21:35 |
lifeless | True | 21:35 |
lifeless | man, copypasta all over the place today | 21:35 |
mordred | lifeless: wrong channel | 21:35 |
clarkb | pleia2: git-fe02 is ready for you to test there | 21:35 |
fungi | pleia2: i logged into puppetmaster, sourced our nova creds for openstackci in dfw, then `sudo ssh $hostname` and run `halt`, then when i get kicked out start pinging it, and then `nova reboot --hard $hostname` | 21:35 |
lifeless | mordred: I know, it was my belly on my mouse pad | 21:35 |
fungi | once it no longer responds to ping | 21:35 |
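Fungi's procedure, collected into one hedged sketch (the credentials file and hostname are placeholders); the nova-side hard reboot matters because only it gives the guest a fresh qemu process, which a reboot issued inside the guest would not:

```bash
# on puppetmaster, with the openstackci DFW credentials sourced
source ~/openstackci-dfw.rc          # placeholder path
host=example01.openstack.org         # placeholder hostname

# halt inside the guest so services stop cleanly
sudo ssh "$host" halt

# wait until the instance stops answering pings
while ping -c1 -W2 "$host" >/dev/null 2>&1; do sleep 5; done

# then hard reboot from the nova side so the instance comes back under a
# new (patched) qemu process
nova reboot --hard "$host"
```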
mordred | oh good | 21:37 |
mordred | we have 2 zuul-devs | 21:37 |
mordred | should I delete the one that is not the one dns resolves to? | 21:37 |
fungi | yeah, should be safe at this point | 21:38 |
jeblair | mordred: yes | 21:38 |
mordred | and we have more than one translate-dev | 21:40 |
mordred | pleia2: ^^ delete the one that is not in DNS? or reboot it? | 21:40 |
jeblair | is the old one the pootle one? | 21:40 |
mordred | maybe? | 21:40 |
jeblair | mordred: paste both ips? | 21:41 |
pleia2 | the old pootle one is deleteable | 21:41 |
mordred | afd4a8d9-98a7-4a21-a827-33106abeeb8a | translate-dev.openstack.org | ACTIVE | - | Running | public=104.130.243.78, 2001:4800:7819:105:be76:4eff:fe04:4758; private=10.209.160.236 | | 21:41 |
mordred | | f1103432-ae29-4ec2-87e0-39920429ac50 | translate-dev.openstack.org | ACTIVE | - | Running | public=23.253.231.198, 2001:4800:7817:103:be76:4eff:fe04:545a; private=10.208.170.81 | | 21:41 |
pleia2 | new translate-dev server is 104.130.243.78 | 21:41 |
mordred | 104. is the one in dns | 21:41 |
pleia2 | yeah, can kill the 23. one afaic | 21:41 |
mordred | also wiki.o.o is not on the list - I think it's a "can reboot any time" yeah? | 21:41 |
jeblair | 104.130.243.78 does not respond for me | 21:42 |
fungi | mordred: it's likely not pvhvm | 21:42 |
mordred | fungi: ah - we only have to delete pvhvm? | 21:42 |
mordred | reboot? | 21:42 |
fungi | mordred: pvhvm is affected, pv is not | 21:42 |
mordred | gotcha | 21:42 |
* mordred removes from list | 21:42 | |
fungi | basically this is the list which rax put in the ticket | 21:42 |
mordred | ah | 21:43 |
mordred | so - translate.openstack.org is not pvhvm? | 21:43 |
pleia2 | jeblair: oh dear, maybe zanata went sideways when I wasn't looking | 21:43 |
mordred | translate is a standard - that's ok - we can think about that later - if we're happy with performance, then it's likely fine :) | 21:44 |
fungi | puppetmaster root has cached the wrong ssh host key for ci-backup-rs-ord | 21:47 |
clarkb | pleia2: I think we should remove git-fe02 records from the git.o.o name to start. git.o.o A 23.253.252.15 and git.o.o AAAA 2001:4800:7818:104:be76:4eff:fe04:7072 | 21:47 |
fungi | should i correct it, or work around it (to avoid ansible doing things to it)? | 21:47 |
jeblair | fungi: work around it for now; not sure what the state is with that | 21:47 |
pleia2 | clarkb: yep, sounds good | 21:47 |
fungi | jeblair: thanks, will do | 21:47 |
clarkb | pleia2: also note the TTL (I think it's the minimum of 5 minutes for round robining but we will have to add that back in when we make the record for it again) | 21:48 |
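The record edits themselves went through Rackspace DNS, but a quick outside check of the round robin and its TTL is just dig (a sketch; the ~300 second figure is the TTL clarkb mentions):

```bash
# current A/AAAA answers for the round robin, TTL in the second column
dig +noall +answer git.openstack.org A git.openstack.org AAAA

# after removing git-fe02's records, watch until it drops out of the
# answers, then wait out roughly one more TTL (~300s) for caches
watch -n 30 'dig +noall +answer git.openstack.org A git.openstack.org AAAA'
```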
fungi | mordred: i'll skip hound since i'm not sure what your plan is with that or if it needs special care | 21:48 |
pleia2 | my favorite part is how their interface doesn't show me the whole ipv6 address in the list | 21:48 |
clarkb | pleia2: :( | 21:48 |
fungi | rax dns is a special critter, to be sure | 21:48 |
pleia2 | modify record lets me see it and cancel ;) | 21:49 |
clarkb | pleia2: once you have those records removed you will want to hop on git-fe02 and sudo tail -f /var/log/haproxy.log and wait for connections to stop coming in | 21:50 |
clarkb | pleia2: at that point we can do the reboot procedures | 21:50 |
mordred | AND - in a fit of consistency - we have 2 review-dev | 21:50 |
clarkb | mordred: is one trusty the other precise? | 21:51 |
clarkb | mordred: if so you can probably remove the precise node | 21:51 |
clarkb | pleia2: once you are about at that point let me know and I can dig up my haproxy knowledge | 21:51 |
mordred | yup | 21:51 |
pleia2 | clarkb: busy servers these ones | 21:51 |
fungi | jeblair: there are a jeblairtest2 and jeblairtest3 as well which aren't on the list but shall i delete them while i'm here? | 21:52 |
mordred | well, that's exciting | 21:53 |
mordred | I halted review dev. it's not returning pings - BUT - nova won't reboot it because it's in state "powering off" | 21:53 |
jeblair | fungi: please | 21:53 |
fungi | will do | 21:53 |
anteaya | can someone with ops spare a moment to kick a spammer from -infra? | 21:56 |
mordred | fungi: hound done. no special care - it's all happy and normal | 21:57 |
fungi | i've also been sshing back into each and checking uptime to make sure it really rebooted | 21:58 |
clarkb | pleia2: it's been > TTL now ya? | 21:58 |
pleia2 | clarkb: down to a trickle, so should be ready soon | 21:58 |
mordred | so - anybody have any ideas what to do about a vm stuck in nova powering-off state? | 21:58 |
clarkb | pleia2: ya and if these don't go away after another minute or two I think we blame their bad DNS resolving | 21:58 |
pleia2 | clarkb: nods | 21:59 |
clarkb | mordred: I want to say in hpcloud when that happened you had to contact support, unsure if rax is different | 21:59 |
mordred | if there isn't a quick answer from john - I'm going to leave it - because it's review-dev and it'll get hard-rebooted tomorrow | 22:03 |
clarkb | pleia2: found the magic socat commands for haproxy control in my fe02 scrollback | 22:03 |
clarkb | mordred: wfm | 22:03 |
pleia2 | clarkb: cool, I'll have a peek | 22:03 |
fungi | okay, got the all-clear from reed to reboot groups and ask so doing those next | 22:03 |
clarkb | pleia2: getting a paste up for you | 22:03 |
pleia2 | clarkb: thanks | 22:03 |
mordred | fungi: I am not using puppetmaster - so happy for you to reboot it any time | 22:03 |
clarkb | pleia2: http://paste.openstack.org/show/222205/ | 22:04 |
clarkb | pleia2: basically haproxy is organized as frontends (e.g. balance_git_daemon) and backends for each frontend, so we have to disable the full set of pairs for all of those per backend | 22:05 |
clarkb | pleia2: once that is done it should be safe to reboot the backend | 22:05 |
pleia2 | clarkb: ok, so this series of commands + confirm they come back for all the git0x servers? | 22:06 |
clarkb | pleia2: then off to the next backend, when all backends are done we add git-fe02 back to dns, remove git-fe01 and then reboot git-fe01 | 22:06 |
clarkb | pleia2: yup and run that on git-fe01 since it's the only haproxy balancing traffic right now | 22:06 |
pleia2 | clarkb: sounds good, on it | 22:07 |
clarkb | pleia2: but basically go host by host, rebooting then reenabling, and we shouldn't have any downtime | 22:07 |
* pleia2 nods | 22:07 | |
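The paste above isn't preserved here, but draining one backend over the haproxy admin socket generally looks like the sketch below; the socket path and every pool name except balance_git_daemon are assumptions:

```bash
SOCK=/var/lib/haproxy/stats   # assumed admin socket path

# list proxy/server pairs so the exact names for the host being drained
# can be confirmed first
echo "show stat" | sudo socat stdio "UNIX-CONNECT:$SOCK" | cut -d, -f1,2 | grep git01

# take git01 out of every pool before rebooting it
for pool in balance_git_http balance_git_https balance_git_daemon; do
    echo "disable server $pool/git01.openstack.org" | sudo socat stdio "UNIX-CONNECT:$SOCK"
done

# ...halt git01, nova reboot --hard it, wait for it to come back, then:
for pool in balance_git_http balance_git_https balance_git_daemon; do
    echo "enable server $pool/git01.openstack.org" | sudo socat stdio "UNIX-CONNECT:$SOCK"
done
```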
clarkb | we should probably ansible this for the general case, not sure we want to ansible it for the hard reboot case since one already broke on us | 22:07 |
pleia2 | heh, right | 22:07 |
jeblair | thinking ahead to the harder servers; we have enough SPOFS that we may just want to take an outage for all of them at once; maybe i'll start working on identifying what we should group together | 22:09 |
mordred | jeblair: ++ | 22:09 |
clarkb | jeblair: sounds good | 22:09 |
mordred | clarkb: yah - I was thinking we should maybe have some ansibles to handle "we need to reboot the world" - I'm sure it'll come up again | 22:09 |
SpamapS | kernel updates come to mind | 22:10 |
fungi | mordred: are you rebooting openstackid.org? if not, i'll get it next | 22:10 |
mordred | fungi: yeah - I got it | 22:10 |
fungi | thanks | 22:11 |
*** reed has joined #openstack-infra-incident | 22:11 | |
jeblair | fungi, clarkb: er, i see the chat in etherpad; what should we do with the ES machines? | 22:12 |
clarkb | jeblair: fungi we can roll through them right now if we want, we can have rax just reboot all of them, or we can reboot all of them | 22:12 |
fungi | i'm inclined to halt and hard reboot them ourselves since no idea if rax halts these or just blip! | 22:13 |
clarkb | jeblair: fungi: I am tempted to go ahead and reboot one then see how long recovery takes, then based on that either reboot the rest all at once or reboot them one by one | 22:13 |
pleia2 | clarkb: 01 down, 4 to go! | 22:13 |
mordred | clarkb: sounds reasonable | 22:13 |
fungi | also we have some opportunity to look carefully at the systems as they come back up and identify obvious issues immediately rather than when we happen to notice later | 22:13 |
clarkb | fungi: ++ | 22:13 |
mordred | subunit-worker01 should be able to just be rebooted too, no? | 22:13 |
jeblair | clarkb: okay, we'll be taking "the ci system" down anyway, so if it's easier to do all at once, we will have that opportunity | 22:13 |
clarkb | mordred: yup | 22:13 |
* mordred does subunit-worker | 22:14 | |
jeblair | well, it may lose info | 22:14 |
clarkb | jeblair: well recovery from reboot all at once is a many hours thing too I think | 22:14 |
jeblair | which is why i marked it with [1] | 22:14 |
jeblair | mordred: ^ | 22:14 |
clarkb | jeblair: and logstash.o.o will just queue up the work for us so ES can be in that state whenever and we should be fine | 22:14 |
clarkb | jeblair: biggest impact is the delay on indexing until we catch back up again | 22:14 |
jeblair | clarkb: the system is not busy and slowing, so we may be in luck there | 22:15 |
fungi | okay, so logstash workers and elasticsearch group together with [1] as well? | 22:15 |
clarkb | wfm | 22:16 |
jeblair | oh, logstash workers can go at any time, right? | 22:16 |
clarkb | jeblair: actually ya, let me just go do those right now | 22:16 |
fungi | they used to go at any time they wanted, so i suppose so ;) | 22:16 |
jeblair | i vote we keep them out of [1] if that's the case; there are 16 others that we aren't rebooting anyway | 22:16 |
fungi | agreed | 22:16 |
clarkb | yup yup I am doing logstash workers now | 22:16 |
jeblair | so probably can actually just do all 4 right now | 22:16 |
jeblair | (at once) | 22:16 |
mordred | jeblair: sorry. IRC race-condition subunit-worker01 rebooted | 22:17 |
jeblair | mordred: i did have the [1] in the etherpad before you did that | 22:17 |
clarkb | pleia2: I think you can reboot git-fe02 whenever you have time, it's down to like 2 requests per minute | 22:17 |
fungi | is the plan to put zuul into graceful shutdown long enough to quiesce jobs and then stop the zuul mergers and bring zuul back up? | 22:18 |
fungi | that would avoid running jobs while we do the pypi mirrors too | 22:18 |
fungi | then once the rest of the [1] group is done, bring the mergers back up | 22:18 |
pleia2 | git02 seems stuck after the halt, not pingable, but nova shows it as running; for git01 I waited to run the nova reboot --hard until it was in shutdown mode | 22:18 |
jeblair | fungi: we have to take review.o.o down, i think we should just do it without quiescing | 22:19 |
fungi | jeblair: oh, right i forgot it was in that set | 22:19 |
fungi | pleia2: i guess give it a few minutes. if we have to open a ticket for it, we can limp along down one member until rax fixes it for us | 22:19 |
jeblair | yeah, i'd think about a way to do it more gracefully, but there's no hiding that, so we might as well make it easy on ourselves and just reboot it all at once | 22:20 |
fungi | the joys of active redundancy | 22:20 |
jeblair | we can save the zuul queues | 22:20 |
fungi | fair enough | 22:20 |
clarkb | pleia2: I have nothing better than what fungi suggests | 22:20 |
pleia2 | fungi: oh good, looks like all it needed was 5 minutes to die, on to 03! | 22:20 |
fungi | heh | 22:20 |
clarkb | doing logstash-worker17 now | 22:20 |
fungi | sometimes rax likes to just sit on api calls too. it's an openstack thing | 22:21 |
mordred | ooh! review-dev finally rebooted | 22:22 |
clarkb | they are just teaching us patience | 22:22 |
mordred | awesome | 22:22 |
fungi | i have a feeling rax didn't tune their environment to expect all customers to want to reboot everything in one day | 22:22 |
pleia2 | indeed | 22:23 |
mordred | weird | 22:23 |
fungi | pleia2: clarkb: let me know when you expect to be idle on the puppetmaster for a few minutes and i'll reboot it too | 22:23 |
clarkb | fungi: let me get through the logstash workers, will let you know | 22:23 |
fungi | there's no hurry | 22:24 |
pleia2 | fungi: will do, just going to finish up this git fun | 22:24 |
jeblair | okay, i laid out a plan for the group[1] reboots | 22:26 |
clarkb | after halting 18,19,20 they all show active according to nova show, are we waiting for them to be shutdown before hard rebooting? | 22:26 |
mordred | jeblair: I agree with your plan | 22:27 |
fungi | yeah, lgtm | 22:27 |
mordred | clarkb: I have been | 22:27 |
clarkb | ok I will wait then | 22:27 |
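A small polling sketch of the "wait for nova to catch up" approach clarkb opts for here (the hostname is a placeholder; as the later discussion shows, skipping the wait also worked out):

```bash
host=logstash-worker18.openstack.org   # placeholder

# poll until nova stops reporting the halted guest as ACTIVE, then
# issue the hard reboot
while nova show "$host" | grep -qw ACTIVE; do
    sleep 10
done
nova reboot --hard "$host"
```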
fungi | might want to also wait for the pypi mirrors to boot before starting zuul? | 22:27 |
fungi | otherwise there could be job carnage | 22:28 |
clarkb | ++ | 22:28 |
fungi | also waiting to bring up the zuul mergers until most worker types have registered in zuul again could avoid some not_registered results? | 22:28 |
jeblair | fungi: true, i was mostly thinking of waiting for the services though; we should probably wait until the reboots have been issued for all of the vms | 22:28 |
fungi | though i suppose just not readding the saved queues for a bit would work as well | 22:29 |
fungi | worst case a couple of changes get uploaded to gerrit and their jobs are kicked back because it was too soon | 22:30 |
jeblair | or waiting until nodepool has ready nodes of each type (which it may immediately since the system is not busy) | 22:30 |
fungi | yeah, i expect that to be quick | 22:31 |
jeblair | 200 concurrent jobs running is "not busy" | 22:31 |
fungi | heh | 22:32 |
jeblair | do folks want to claim nodes in the group[1] by adding their names to the ep? | 22:32 |
fungi | also cinder seems pretty broken | 22:32 |
fungi | yep, will do | 22:32 |
jeblair | i'll do the gerrit/zuul/nodepool bits | 22:32 |
mordred | I'll just watch I guess | 22:33 |
jeblair | (and btw, not stopping nodepool is intentional; it'll reduce number of aliens) | 22:33 |
jeblair | mordred: want the zuul mergers? | 22:33 |
mordred | oh - sure! | 22:34 |
jeblair | status alert Gerrit and Zuul are going offline for reboots to fix a security vulnerability. | 22:35 |
jeblair | ? | 22:35 |
fungi | halting nodepool.o.o will likely send a term to nodepoold as init tries to gracefully stop things | 22:35 |
jeblair | fungi: yup | 22:35 |
fungi | jeblair: that looks good | 22:35 |
jeblair | clarkb, fungi, mordred: are you standing by? | 22:36 |
mordred | jeblair: yup | 22:36 |
fungi | on hand | 22:36 |
clarkb | I am here | 22:37 |
clarkb | almost done with the last logstash worker it took forever to stop | 22:37 |
jeblair | i'll send the announcement then wait a bit, then give you the go ahead when i've stopped zuul and gerrit | 22:37 |
jeblair | #status alert Gerrit and Zuul are going offline for reboots to fix a security vulnerability. | 22:37 |
openstackstatus | jeblair: sending alert | 22:37 |
*** reed has left #openstack-infra-incident | 22:37 | |
-openstackstatus- NOTICE: Gerrit and Zuul are going offline for reboots to fix a security vulnerability. | 22:39 | |
*** ChanServ changes topic to "Gerrit and Zuul are going offline for reboots to fix a security vulnerability." | 22:39 | |
clarkb | that's exciting, the logstash indexer refuses to keep running on 19 and doesn't log why it failed | 22:40 |
clarkb | but we are super redundant there so I can switch to the [1] group whenever | 22:40 |
openstackstatus | jeblair: finished sending alert | 22:42 |
jeblair | clarkb, fungi, mordred: gerrit is stopped, clear to reboot | 22:42 |
mordred | rebooting | 22:42 |
clarkb | all the ES's are halted | 22:43 |
clarkb | waiting for them to not be ACTIVE in nova show before rebooting | 22:44 |
fungi | my 4 are back up, i'm checking their services now | 22:44 |
clarkb | fungi: did you wait for them to not be active? | 22:44 |
clarkb | I am trying to decide if that is necessary | 22:44 |
fungi | i did not | 22:44 |
fungi | zuul is already stopped | 22:45 |
clarkb | ok I won't wait either then | 22:45 |
fungi | so it's not like job results will matter at this point | 22:45 |
fungi | none that are running will report | 22:45 |
mordred | mine are all rebooted and services are running | 22:45 |
jeblair | (nodepool is up and running) | 22:46 |
fungi | yep, all mine are looking good | 22:46 |
fungi | rebooting gerrit now | 22:46 |
pleia2 | clarkb: rebooting git-fe02 now, once that's back up I'll readd to dns, remove git-fe01 and wait for that to trickle out (and tell fungi to reboot puppetmaster) | 22:46 |
mordred | hour and a half isn't terrible for an emergency reboot the world | 22:48 |
clarkb | elasticsearch is semi up, its red and 5/6 nodes are present | 22:48 |
fungi | gerrit's on its way back up | 22:49 |
pleia2 | fungi: I'm done with puppetmaster for now, just have a git-fe01 reboot to do, but will wait on dns for that so reboot away | 22:49 |
fungi | review.o.o looks like it's working | 22:49 |
mordred | I agree that review.o.o looks like it's running | 22:49 |
fungi | jeblair: should be set to start zuul? | 22:49 |
anteaya | fungi: slow but working... | 22:50 |
jeblair | [2015-05-13 22:50:02,782] WARN org.eclipse.jetty.io.nio : Dispatched Failed! SCEP@34ee175d{l(/127.0.0.1:58159)<->r(/127.0.0.1:8081),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@242369e7,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} to org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@10fdcf3a | 22:50 |
clarkb | es02 ran out of disk again | 22:50 |
clarkb | pleia2: ^ same issue as last time, /me makes a note to logrotate for it better | 22:50 |
jeblair | wow there were a lot of those errors in the gerrit log | 22:50 |
jeblair | but it seems happier now | 22:50 |
fungi | it was probably getting slammed by people desperately reloading while it was starting up | 22:51 |
jeblair | okay i will restart zuul | 22:51 |
clarkb | wait hrm | 22:52 |
fungi | hrm? | 22:52 |
jeblair | clarkb: i've already started zuul and begun re-enqueing | 22:52 |
clarkb | can someone else log into es02 | 22:52 |
fungi | on it | 22:52 |
clarkb | jeblair: sorry I think you are ok to start zuul | 22:52 |
jeblair | ok, will continue | 22:52 |
clarkb | fungi: tell me if mount looks funny | 22:52 |
clarkb | jeblair: but it looks like es02 came up without a / | 22:53 |
fungi | um... wow! | 22:53 |
clarkb | I wonder if this is what hit worker19 | 22:53 |
fungi | i mean, that can happen, technically, as / is mounted at boot and might not be in mtab | 22:53 |
clarkb | in any case we should probably do an audit of this | 22:53 |
pleia2 | clarkb: oops, I think I made a bug/story about that but never managed to cycle back to it | 22:53 |
jeblair | okay, zuul is up and there are a very small handful of not_registered; i think we can ignore them | 22:53 |
fungi | overflow on /tmp type tmpfs (rw,size=1048576,mode=1777) | 22:53 |
clarkb | pleia2: np | 22:53 |
jeblair | i will status ok? | 22:54 |
fungi | whazzat | 22:54 |
mordred | jeblair: ++ | 22:54 |
fungi | jeblair: go for it | 22:54 |
clarkb | jeblair: do we want to check the mount table everywhere first? | 22:54 |
clarkb | we may need to reboot more if we don't see those devices within a VM? | 22:54 |
jeblair | clarkb: review looks ok (it has a / and an /opt) | 22:55 |
clarkb | jeblair: cool thats probably the most important one | 22:55 |
clarkb | fungi: so short of rebooting, any ideas on fixing/debugging this? | 22:55 |
fungi | clarkb: i'm looking at dmesg. just a sec | 22:55 |
jeblair | #status ok Gerrit and Zuul are back online. | 22:55 |
openstackstatus | jeblair: sending ok | 22:55 |
jeblair | nodepool looks sane | 22:56 |
clarkb | ES is yellow despite this trouble | 22:56 |
clarkb | so it is recovering in the right direction (started red) | 22:57 |
fungi | dmesg looks like it properly remounted xvda1 rw... maybe mtab is corrupt | 22:57 |
jhesketh | Morning | 22:57 |
*** ChanServ changes topic to "Incident in progress: https://etherpad.openstack.org/p/venom-reboots" | 22:57 | |
-openstackstatus- NOTICE: Gerrit and Zuul are back online. | 22:57 | |
* jhesketh catches up on incident(s) | 22:58 | |
anteaya | jhesketh: https://etherpad.openstack.org/p/venom-reboots if you haven't found it yet | 22:58 |
fungi | also what is this tmpfs called "overflow" mounted at /tmp? it's 1mb in size, which seems dysfunctional | 22:58 |
Clint | fungi: that's a "feature" for when / is full on bootup | 22:59 |
fungi | anteaya: i also put it in the channel topic | 22:59 |
clarkb | and / was full | 22:59 |
fungi | Clint: thank you! | 22:59 |
anteaya | fungi: thank you | 22:59 |
clarkb | because ES rotates logs but doesn't have a limit on how far back to keep | 22:59 |
fungi | so, yes, that's interesting and i think explains some of this then | 22:59 |
fungi | i wonder if that is why root isn't actually showing up mounted then | 23:00 |
openstackstatus | jeblair: finished sending ok | 23:00 |
clarkb | fungi: seems likely | 23:00 |
fungi | clarkb: i guess just clean up the root filesystem and reboot again and it will probably go back to "normal" | 23:01 |
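For reference, the condition Clint describes is quick to confirm from a shell (a sketch; the mount line is the one fungi pasted above):

```bash
# a (nearly) full root filesystem at boot is what triggers the fallback
df -h /

# the tell-tale 1MB emergency tmpfs Ubuntu mounts over /tmp in that case
mount | grep overflow
# overflow on /tmp type tmpfs (rw,size=1048576,mode=1777)

# per fungi: free space on /, reboot once more, and the normal /tmp returns
```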
fungi | Clint: you gave me something new to read about | 23:01 |
clarkb | fungi: ya I have already killed the log files taking up all the disk | 23:01 |
Clint | fungi: enjoy | 23:01 |
clarkb | so I will go ahead and reboot it | 23:01 |
fungi | sounds good | 23:01 |
clarkb | then make it more of a priority to have logrotate clean up after ES | 23:01 |
jhesketh | Looks like good progress on the reboots has been made | 23:01 |
jhesketh | Let me know if I can help | 23:01 |
fungi | but that being the case, i think this probably is not some mysterious endemic issue we should go hunting on our other servers looking for | 23:02 |
jeblair | jhesketh: thanks, i think we're nearing the end of it | 23:02 |
fungi | it's gone pleasantly quickly | 23:02 |
jhesketh | Great stuff | 23:02 |
fungi | mordred: clarkb: so you're idle on puppetmaster for a bit. looks like i'm clear to reboot that. speak soon if not | 23:03 |
pleia2 | fungi: git-fe01 just slowed down enough for me to reboot it, if you can wait 5 minutes I might as well finish this up | 23:03 |
clarkb | fungi: I am done | 23:03 |
mordred | fungi: I am doing nothing there | 23:03 |
fungi | pleia2: no problem, go for it | 23:03 |
fungi | i'll wait | 23:03 |
clarkb | pleia2: fe01 or fe02? | 23:03 |
pleia2 | clarkb: fe02 is all done | 23:04 |
clarkb | pleia2: woot | 23:04 |
clarkb | es02 is rebooting now | 23:04 |
clarkb | looks like the logstash workers may be leaking logs too | 23:04 |
clarkb | I find it somewhat :( that our logging system is bad at logging (most likely my fault) | 23:04 |
fungi | logs are hard (and made of wood) | 23:05 |
clarkb | and its up, cluster is still recovering and yellow so that all looks good | 23:06 |
mordred | fungi: more witches! | 23:07 |
pleia2 | fungi: all clear puppetmaster | 23:07 |
fungi | pleia2: thanks! restarting it next | 23:07 |
fungi | puppet master is back up now | 23:08 |
pleia2 | updated dns & load is coming back to git-fe01, so all is looking good | 23:09 |
clarkb | ok confirmed that es doesn't do rotations with deletes properly until release 1.5 | 23:10 |
jeblair | are all servers complete now? | 23:10 |
clarkb | but I think logrotate can delete based on age so will just set it up to kill anything >2 weeks old or whatever | 23:10 |
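Clarkb mentions logrotate, but since elasticsearch already rotates its own files and only the deletion is missing, an equally small sketch is a daily find by age (the log path, schedule, and two-week cutoff are all assumptions):

```bash
# one-off: remove anything under the ES log dir older than 14 days
sudo find /var/log/elasticsearch -type f -mtime +14 -delete

# or the same thing as a daily cron job
echo '17 3 * * * root find /var/log/elasticsearch -type f -mtime +14 -delete' |
    sudo tee /etc/cron.d/es-log-cleanup
```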
pleia2 | jeblair: I think so | 23:10 |
clarkb | the list seems to make it look that way | 23:10 |
fungi | gate-heat-python26 NOT_REGISTERED | 23:11 |
fungi | same for gate-pbr-python26 | 23:11 |
jeblair | there are bare-centos6 nodes ready, maybe we should delete them | 23:12 |
fungi | looks like a lot of jobs expecting bare-precise and bare-centos6 are not registered, so yeah | 23:12 |
fungi | let's | 23:12 |
jeblair | doing | 23:12 |
fungi | thanks | 23:12 |
jeblair | oh, sorry, my query was wrong; no ready nodes there, only hold, building, and used (possibly from before reboot) | 23:13 |
jeblair | building might be before reboot too | 23:13 |
jeblair | anyone using those held nodes? | 23:14 |
fungi | i am no longer | 23:14 |
clarkb | jeblair: I have one I would like to keep | 23:14 |
fungi | just finished yesterday | 23:14 |
fungi | i'll delete mine | 23:14 |
clarkb | 2536877 would be nice for me to keep | 23:14 |
fungi | i kept a list | 23:14 |
clarkb | been using it as part of the nodepool + devstack work | 23:14 |
jeblair | okay, i deleted building/used; leaving hold to you | 23:15 |
jeblair | that should be enough to correct the registrations once those are built | 23:15 |
jeblair | should we clear out of this room now? | 23:15 |
pleia2 | sounds good, see you on the other side | 23:16 |
*** clarkb has left #openstack-infra-incident | 23:16 | |
fungi | adios | 23:17 |
*** fungi has left #openstack-infra-incident | 23:17 | |
*** ChanServ changes topic to "Discussion of OpenStack project infrastructure incidents | No current incident" | 23:17 | |
* jeblair puts the sheets back over the control consoles | 23:17 | |
*** nibalizer has left #openstack-infra-incident | 23:18 |