*** AJaeger is now known as AJaeger_ | 06:08 | |
*** hjensas has joined #openstack-infra-incident | 07:30 | |
*** hjensas has quit IRC | 08:23 | |
-openstackstatus- NOTICE: zuul was restarted due to an unrecoverable disconnect from gerrit. If your change is missing a CI result and isn't listed in the pipelines on http://status.openstack.org/zuul/ , please recheck | 08:51 | |
*** hjensas has joined #openstack-infra-incident | 09:34 | |
*** Daviey has quit IRC | 12:28 | |
*** Daviey has joined #openstack-infra-incident | 12:55 | |
jroll | https://nvd.nist.gov/vuln/detail/CVE-2016-10229#vulnDescriptionTitle "udp.c in the Linux kernel before 4.5 allows remote attackers to execute arbitrary code via UDP traffic that triggers an unsafe second checksum calculation during execution of a recv system call with the MSG_PEEK flag." | 13:00 |
jroll | not sure if infra listens for UDP off the top of my head, but thought I'd drop that here | 13:00 |
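[editor's note: for reference, a quick way to see which UDP ports a host is actually listening on; these are standard Linux commands and are illustrative, not taken from the log]

    # Listening UDP sockets, numeric ports, no DNS lookups
    ss -lun
    # Equivalent on older hosts without ss
    netstat -lun
    # Cross-check against the firewall rules that mention UDP
    iptables -L -n -v | grep udp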
*** hjensas has quit IRC | 14:57 | |
*** lifeless_ has joined #openstack-infra-incident | 15:07 | |
*** mordred has quit IRC | 15:10 | |
*** lifeless has quit IRC | 15:10 | |
*** EmilienM has quit IRC | 15:10 | |
*** mordred1 has joined #openstack-infra-incident | 15:10 | |
*** 21WAAA2JF has joined #openstack-infra-incident | 15:13 | |
*** mordred1 is now known as mordred | 15:46 | |
clarkb | jroll: thanks, we have an open snmp port we might want to close | 16:10 |
clarkb | pabelanger: fungi ^ | 16:10 |
pabelanger | ack | 16:10 |
pabelanger | pbx might be one | 16:12 |
pabelanger | since we use UDP for RTP | 16:12 |
*** hjensas has joined #openstack-infra-incident | 16:12 | |
jroll | clarkb: np | 16:13 |
clarkb | actually snmp is source-specific so fairly safe | 16:13 |
clarkb | afs | 16:13 |
clarkb | is udp | 16:13 |
clarkb | mordred: ^ | 16:13 |
pabelanger | ya, AFS might be our largest exposure | 16:15 |
mordred | oy. that's awesome | 16:17 |
pabelanger | https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html | 16:17 |
pabelanger | Linux afs01.dfw.openstack.org 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | 16:18 |
pabelanger | so ya, might need a new kernel and reboot? | 16:18 |
fungi | ugh | 16:19 |
fungi | any reports it's actively exploited in the wild? | 16:19 |
pabelanger | I am not sure | 16:21 |
pabelanger | looks like android is getting the brunt of it however | 16:22 |
fungi | keep in mind source filtering is still a lot less effective for udp than tcp | 16:23 |
fungi | easier to spoof (mainly just need to guess a source address and active ephemeral port) | 16:23 |
*** openstack has joined #openstack-infra-incident | 16:33 | |
*** openstackstatus has joined #openstack-infra-incident | 16:34 | |
*** ChanServ sets mode: +v openstackstatus | 16:34 | |
*** 21WAAA2JF is now known as EmilienM | 17:05 | |
*** EmilienM has joined #openstack-infra-incident | 17:05 | |
clarkb | looking at https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html says xenial is not affected? | 18:57 |
clarkb | I also don't see a ubuntu security notice for it yet | 18:59 |
clarkb | it looks like we may be patched in many places? | 19:06 |
clarkb | trying to figure out what exactly is required, but if I read that correctly xenial is fine despite being 4.4? trusty needs kernel >=3.13.0-79.123 | 19:06 |
clarkb | pabelanger: fungi ^ that sound right to you? if so maybe next step is generate a list of kernels on all our hosts via ansible then produce a needs to be rebooted list | 19:12 |
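[editor's note: a sketch of the per-host check being proposed, assuming the trusty fix level quoted above (kernel package >= 3.13.0-79.123); the exact commands are illustrative, not from the log]

    # Running kernel release, e.g. 3.13.0-85-generic
    uname -r
    # Package version of the running kernel, e.g. 3.13.0-85.129
    ver=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")
    # Compare against the first fixed trusty kernel
    if dpkg --compare-versions "$ver" ge 3.13.0-79.123; then
        echo "installed kernel is new enough"
    else
        echo "needs a kernel upgrade and reboot"
    fi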
pabelanger | clarkb: ya, xenial isn't affected from what I read. | 19:13 |
pabelanger | ++ to ansible run | 19:13 |
clarkb | pabelanger: Linux review 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | 19:48 |
pabelanger | ++ | 19:48 |
clarkb | on review.o.o which is newer than 3.13.0-79.123 | 19:48 |
clarkb | so I think just restart the service? | 19:48 |
pabelanger | ya, looks like just a restart then | 19:49 |
-openstackstatus- NOTICE: The Gerrit service on http://review.openstack.org is being restarted to address hung remote replication tasks. | 19:51 | |
fungi | sorry for not being around... kernel update sounds right, too bad we didn't take the gerrit restart as an opportunity to reboot | 19:58 |
clarkb | fungi: we don't need to reboot it | 20:00 |
clarkb | fungi: gerrit's kernel is new enough I think ^ you can double check above. | 20:00 |
fungi | oh | 20:00 |
fungi | yep, read those backwards | 20:00 |
clarkb | pabelanger: puppetmaster.o.o:/home/clarkb/collect_facts.yaml has a small playbook thing to collect kernel info, want to check that out? | 20:00 |
clarkb | that is incredibly verbose, is there a better way to introspect facts? | 20:02 |
pabelanger | clarkb: sure | 20:05 |
clarkb | pabelanger: if it looks good to you I will run it against all the hosts and stop using my --limit commands to test | 20:06 |
clarkb | its verbose but works so just going to go with it I think | 20:06 |
pabelanger | clarkb: looks good | 20:06 |
clarkb | pabelanger: ok I will run it and redirect output into ~clarkb/kernels.txt | 20:07 |
clarkb | its running | 20:08 |
pabelanger | clarkb: only seeing the ok: hostname bits | 20:10 |
clarkb | pabelanger: ya its gathering all the facts before running the task I think | 20:10 |
pabelanger | Ha, ya | 20:11 |
pabelanger | gather_facts: false | 20:11 |
clarkb | well we need the facts | 20:11 |
pabelanger | but, we need them | 20:11 |
pabelanger | ya | 20:11 |
clarkb | I guess I could've uname -a 'd it instead :) | 20:11 |
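[editor's note: the collect_facts.yaml playbook itself isn't shown in the log; an ad-hoc sketch of roughly the same check run from puppetmaster, where the command forms are assumptions rather than the actual playbook]

    # Report only the kernel fact instead of the full, very verbose fact dump
    ansible '*' -m setup -a 'filter=ansible_kernel'
    # Or skip fact gathering entirely and just run uname, as suggested above
    ansible '*' -m command -a 'uname -r'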
pabelanger | okay, let me get some coffee | 20:11 |
pabelanger | also, forgot about infracloud | 20:12 |
pabelanger | that will be fun | 20:12 |
clarkb | hrm why does the mtl01 internap mirror show up, I thought I cleaned that host up a while back | 20:16 |
mgagne | mtl01 is the active region, nyj01 is the one that is now unused | 20:16 |
* mgagne didn't read backlog | 20:16 | |
clarkb | oh I got them mixed up, thanks | 20:18 |
clarkb | looks like right now ansible is timing out trying to get to things like jeblairtest | 20:27 |
clarkb | I'm just gonna let it timeout on its own, does anyone know how long that will take? | 20:27 |
fungi | maybe 60 minutes in my experience | 20:36 |
clarkb | fwiw most of my spot checking shows our kernels are new enough | 20:54 |
clarkb | so don't expect to need to reboot much once ansible gets back to me | 20:54 |
clarkb | its been an hour and they haven't timed out yet... | 21:17 |
clarkb | done waiting going to kill ssh processes and hope that doesn't prevent play from finishing | 21:20 |
clarkb | pabelanger: https://etherpad.openstack.org/p/infra-reboots-old-kernel | 21:26 |
clarkb | I'm just going to start picking off some of the easy ones | 21:28 |
clarkb | mordred: are you around? how do we do the afs reboots? make afs01 rw for all volumes, reboot afs02, make 02 rw, reboot 01? | 21:28 |
clarkb | then do each of the db hosts one at a time? what about kdc01? | 21:28 |
clarkb | doing etherpad now so the etherpad will be temporarily unavailable | 21:34 |
clarkb | for proposal.slave.openstack.org and others, is the zuul launcher going to gracefully pick those back up again after a reboot or will we have to restart the launcher too? | 21:37 |
clarkb | I guess I'm going to find out? | 21:37 |
clarkb | rebooting proposal.slave.o.o now as its not doing anything | 21:38 |
clarkb | I'm going to try grabbing all the mirror update locks on mirror-update so that I can reboot it next | 22:09 |
clarkb | pabelanger: gem mirroring appears to have been stalled since april 2nd but there is a process holding the lock. Safe to grab it and then reboot? | 22:26 |
pabelanger | clarkb: ya, we'll need to grab lock after reboot | 22:33 |
pabelanger | just finishing up dinner, and need to run an errand | 22:34 |
pabelanger | I can help with reboots once I get back in about 90mins | 22:34 |
clarkb | pabelanger: sounds good we can leave mirror-update and afs for then. I will keep working on the others | 22:34 |
clarkb | afs didn't survive reboot on gra1 mirror. working to fix now | 22:39 |
clarkb | oh maybe it did and it's just slow, things are cd-able now | 22:40 |
clarkb | pabelanger: for when you get back, grafana updated on the grafana server, not sure if it matters or not? hopefully I didn't derp anything | 22:43 |
clarkb | my apologies if it does :/ | 22:44 |
clarkb | the web ui seems to be working though so going to reboot server now | 22:44 |
clarkb | and its up and happy again | 22:50 |
clarkb | now for the baremetal00 host for infra clouds running bifrost | 22:50 |
pabelanger | looks like errands are pushed back a few hours | 23:00 |
pabelanger | clarkb: looks like we might have upgraded grafana.o.o too | 23:00 |
pabelanger | checking logs to see if there are any errors | 23:00 |
pabelanger | but so far, seems okay | 23:00 |
clarkb | pabelanger: yes it upgraded, sorry I didn't think to not do that until it was done | 23:00 |
clarkb | but ya service seems to work | 23:00 |
pabelanger | 2.6.0 | 23:01 |
pabelanger | should be fine | 23:01 |
clarkb | I have been running apt-get update and dist-upgrade before reboots to make sure we get the new stuff | 23:01 |
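[editor's note: the pre-reboot update step described here as a minimal sketch; the -y flag and image glob are illustrative]

    # Refresh package lists and pull in the fixed kernel before rebooting
    apt-get update
    apt-get dist-upgrade -y
    # Confirm the newer kernel image is installed, then reboot into it
    dpkg -l 'linux-image-3.13.0-*' | grep '^ii'
    reboot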
pabelanger | we'll find out soon if grafyaml has issues | 23:01 |
clarkb | :) | 23:01 |
clarkb | baremetal00 and puppetmaster are the two I want to do next | 23:01 |
clarkb | then we are just left with afs things | 23:01 |
pabelanger | k | 23:01 |
clarkb | I think baremetal should be fine to just reboot | 23:01 |
pabelanger | ya | 23:02 |
clarkb | for puppetmaster we should grab the puppet run lock and then do it so we don't interrupt a bunch of ansible/puppet | 23:02 |
pabelanger | okay | 23:02 |
pabelanger | which do you want me to do | 23:02 |
clarkb | I want you to do mirror-update if you can | 23:02 |
pabelanger | k | 23:02 |
clarkb | since you have the gem lock? I should have all the other locks at this point; you can ps -elf | grep k5 to make sure nothing else is running | 23:02 |
clarkb | I'm logged in but don't worry about it, the only processes I have are holding locks | 23:03 |
clarkb | I'm going to reboot baremetal00 now | 23:03 |
* clarkb crosses fingers | 23:03 | |
pabelanger | ya, k5start process are not running | 23:03 |
pabelanger | so think mirror-update.o.o is ready | 23:03 |
clarkb | pabelanger: cool go for it | 23:04 |
clarkb | then grab whatever locks you need | 23:04 |
clarkb | since they shouldn't survive a reboot | 23:04 |
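[editor's note: a sketch of holding one of the mirror-update locks by hand so the sync jobs stay paused across the reboot; the npm lock path appears later in the log, and the fd/flock pattern is an assumption about how the scripts coordinate]

    # Take the npm mirror lock and hold it until this shell exits
    exec 9>/var/run/npm/npm.lock
    flock -n 9 || { echo "another process holds the npm lock"; exit 1; }
    # Verify no k5start-wrapped mirror jobs are still running
    ps -elf | grep '[k]5start'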
pabelanger | rebooting | 23:04 |
clarkb | then when thats done and baremetal comes back lets figure out puppetmaster, then figure out afs servers | 23:05 |
clarkb | baremetal still not back. Real hardware is annoying :) | 23:06 |
pabelanger | mirror-update.o.o good, locks grabbed again | 23:08 |
clarkb | and still not up | 23:08 |
pabelanger | ya, will take a few minutes | 23:08 |
clarkb | pabelanger: can you start poking at puppetmaster maybe, see about grabbing lock for the puppet/ansible rotation? | 23:08 |
clarkb | remember there are two now iirc | 23:08 |
pabelanger | yup | 23:09 |
clarkb | tyty | 23:09 |
clarkb | at what point do I worry the hardware for baremetal00 is not coming back? :/ | 23:10 |
clarkb | oh it just started pinging | 23:10 |
clarkb | \o/ | 23:10 |
pabelanger | okay, have both locks on puppetmaster.o.o | 23:12 |
pabelanger | and ansible is not running | 23:12 |
clarkb | I don't need puppetmaster if you want to go for it | 23:12 |
pabelanger | k, rebooting | 23:12 |
clarkb | baremetal is up now. and ironic node-list works | 23:13 |
clarkb | well thats interesting | 23:13 |
clarkb | it's running its old kernel | 23:13 |
pabelanger | puppetmaster.o.o online | 23:14 |
clarkb | I'm going to keep debugging baremetal and may have to reboot it again :( | 23:14 |
pabelanger | ansible now running | 23:15 |
clarkb | as for afs, can we reboot the kdc01 server safely ? we just won't be able to get kerberos tokens while its down? | 23:15 |
clarkb | and can we reboot the db servers one at a time without impacting the service? | 23:16 |
clarkb | then we just have to do the fileservers in a synchronized manner ya? | 23:16 |
clarkb | mordred: corvus ^ | 23:16 |
clarkb | I manually installed linux-image-3.13.0-116-generic on baremetal00, I do not know why a dist-upgrade was not pulling that in | 23:18 |
clarkb | but its in there and in grub so thinking I will do a second reboot now | 23:18 |
clarkb | pabelanger: ^ any ideas on that or concerns? | 23:18 |
pabelanger | nope, go for it. We have access to iLo if needed | 23:19 |
clarkb | we don't have to go through baremetal00 to get ilo? we can go through any of the other hosts ya? | 23:20 |
clarkb | thats my biggest concern | 23:20 |
pabelanger | I think we can do any now | 23:20 |
pabelanger | they are all on same network | 23:20 |
clarkb | ok rebooting | 23:20 |
clarkb | I put some notes about doing the afs related servers on the etherpad. Does that look right to everyone? | 23:23 |
clarkb | pabelanger: maybe you can check if one file server is already rw for all volumes and we can reboot the other one? | 23:23 |
* clarkb goes to grab a drink while waiting for baremetal00 | 23:23 | |
pabelanger | clarkb: vos listvldb shows everything in sync | 23:25 |
pabelanger | clarkb: http://docs.openafs.org/AdminGuide/HDRWQ139.html | 23:26 |
pabelanger | we might want to follow that? | 23:26 |
clarkb | vldb is what we run on afsdb0X? | 23:29 |
pabelanger | I did it from afsdb01 | 23:30 |
pabelanger | I think we use bos shutdown | 23:30 |
pabelanger | to rotate things out | 23:30 |
clarkb | gotcha thats the way you signal the other server to take over duties? | 23:30 |
pabelanger | I think so | 23:30 |
clarkb | definitely seems like what you are supposed to do according to the guide | 23:30 |
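[editor's note: a sketch of the sequence from the OpenAFS admin guide linked above, using the server names from this discussion; flags and ordering should be checked against the guide and the actual cell configuration before running anything]

    # See which volumes the file server still holds before taking it down
    vos listvldb -server afs01.ord.openstack.org -localauth
    # Cleanly stop the AFS server processes on that machine (per the guide)
    bos shutdown afs01.ord.openstack.org -wait -localauth
    # ...reboot the host; bosserver normally brings the processes back at boot...
    # Afterwards, confirm the processes and volume locations look sane again
    bos status afs01.ord.openstack.org -localauth
    vos listvldb -server afs01.ord.openstack.org -localauth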
pabelanger | maybe start with afs02.dfw.openstack.org | 23:31 |
clarkb | does afs02 only have ro volumes? | 23:31 |
pabelanger | yes | 23:31 |
clarkb | and afs01 is all rw? if so then ya I think we do that one first | 23:31 |
pabelanger | right | 23:31 |
clarkb | (still waiting on baremetal00) | 23:31 |
pabelanger | afs01 has rw and ro | 23:31 |
pabelanger | err | 23:31 |
pabelanger | afs01.dfw.openstack.org RW RO | 23:32 |
pabelanger | afs01.dfw.openstack.org RO | 23:32 |
pabelanger | afs02.dfw.openstack.org RO | 23:32 |
pabelanger | afs01.ord.openstack.org | 23:32 |
pabelanger | is still online, but not used | 23:32 |
pabelanger | maybe we do afs01.ord.openstack.org first | 23:32 |
clarkb | right ok. Then once afs02 is back up again we transition the volumes so the RW and RO are swapped | 23:32 |
pabelanger | npm volume locked atm | 23:33 |
clarkb | ord's kernel is old too but not in my list | 23:33 |
clarkb | maybe we skipped things in the emergency file? may need to double check that after we are done | 23:34 |
clarkb | (I used hosts: '*' to try and get them all and not !disabled) | 23:34 |
pabelanger | odd, okay we'll need to do 3 servers it seems. afs01.ord.openstack.org is still used by a few volumes | 23:34 |
clarkb | still waiting on baremetal00 :/ | 23:35 |
pabelanger | okay, so which do we want to shutdown first? | 23:35 |
clarkb | I really don't know :( my feeling is the non file servers may be the lowest impact? | 23:36 |
pabelanger | right, afs01.ord.openstack.org is the least used FS | 23:36 |
clarkb | ok lets do that one first of the fileservers | 23:37 |
clarkb | then question is do we want to do kdc and afsdb before fileservers or after? | 23:37 |
clarkb | also still no baremetal pinging. This is much longer than the last time | 23:37 |
clarkb | pabelanger: does the ord fs have any RW volumes? | 23:38 |
pabelanger | mirror.npm still is locked | 23:38 |
pabelanger | clarkb: no | 23:39 |
pabelanger | just RO | 23:39 |
clarkb | ok so what we want to do then maybe is grab all the flocks on mirror-update so that things stop updating volumes (like npm) | 23:40 |
clarkb | then reboot ord fileserver first? | 23:40 |
clarkb | see how that goes? | 23:40 |
pabelanger | sure | 23:40 |
clarkb | ok why don't you grab the flocks. I am working on getting ilo access to baremetal00 | 23:41 |
pabelanger | ha | 23:42 |
pabelanger | puppet needs to create the files in /var/run still | 23:42 |
pabelanger | since they are deleted on boot | 23:42 |
clarkb | the lock files are deleted? | 23:42 |
pabelanger | /var/run is tmpfs | 23:43 |
pabelanger | so /var/run/npm/npm.lock | 23:43 |
pabelanger | will fail, until puppet create /var/run/npm | 23:43 |
clarkb | I can't seem to hit the ipmi addrs with ssh | 23:44 |
clarkb | I am able to hit the compute host's own ipmi but it's slow, maybe baremetal is slower /me tries harder | 23:45 |
clarkb | ok I'm on the ilo now. I guess being persistent after multiple connection timeouts is the trick | 23:49 |
pabelanger | k, have all the locks on mirror-update | 23:50 |
clarkb | so I can see the text console. The server is running | 23:56 |
clarkb | but no ssh | 23:56 |
clarkb | I think I am going to reboot with the text console up | 23:57 |
pabelanger | k | 23:57 |
pabelanger | like I'm ready to bos shutdown afs01.ord.openstack.org | 23:58 |
clarkb | pabelanger: I think if you are ready and willing then lets do it :) | 23:58 |