clarkb | so we don't have any hypervisors (other than 012) that are unpatched and actually being used in vanilla | 00:00 |
---|---|---|
clarkb | I executed what I think was a reboot via ironic to compute005 we will see if it comes back | 00:00 |
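For reference, the ironic-driven reboot mentioned above amounts to something like the following (a sketch; the node name is illustrative and the exact command depends on which client is installed on the baremetal host):

```shell
# Find the ironic node backing the hypervisor, then power-cycle it
openstack baremetal node list
openstack baremetal node reboot compute005
```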
clarkb | but I think we may just have to disable these in nova so that if they do mysteriously manage to return we don't run anything on them | 00:01 |
fungi | wfm | 00:02 |
fungi | my test probe to 012 did eventually return so i have the upgrade running against it now | 00:02 |
clarkb | ok I've got to take care of kids for a bit | 00:03 |
fungi | dist-upgrade on 012 finished so rebooting now | 00:05 |
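The per-hypervisor patching being run here is, roughly, the standard dist-upgrade-and-reboot cycle (a sketch; it assumes the KPTI kernel is already published in the configured apt repositories):

```shell
# Pull in the updated kernel with page-table isolation, then boot onto it
apt-get update
apt-get -y dist-upgrade
reboot
```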
clarkb | fungi: when that's done I think we can also patch and reboot controller00.vanilla | 00:07 |
clarkb | this will make nodepool unhappy for a short bit but it's used to clouds going away | 00:07 |
fungi | especially this one, probably | 00:07 |
clarkb | I'm mostly not able to computer now that kids are awake from nap | 00:07 |
fungi | i'm nearing end of steam myself and about to go relax | 00:08 |
fungi | but will see compute012 through first at least | 00:08 |
pabelanger | and back | 00:08 |
pabelanger | afs01.dfw is waiting for mirror-update.o.o to stop running bandersnatch | 00:09 |
pabelanger | checking now | 00:09 |
clarkb | pabelanger: it's done | 00:09 |
clarkb | pabelanger: I rebooted it and removed mirror update from emergency file | 00:09 |
pabelanger | great | 00:09 |
clarkb | we are doing infracloud vanilla now | 00:10 |
clarkb | I cross checked nova-manage service list against the unreachable servers | 00:10 |
clarkb | and tried rebooting compute005 to see if it comes back | 00:10 |
clarkb | but I'm guessing we just disable the unreachable servers in nova and remove them from our inventory | 00:11 |
pabelanger | yah, we likely should | 00:11 |
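A sketch of what disabling an unreachable hypervisor looks like with the nova tooling of that era (the host name and reason text here are illustrative):

```shell
# Keep the scheduler from placing instances on the host even if it comes back
nova service-disable --reason "unreachable, failing hardware" \
    compute005.vanilla.ic.openstack.org nova-compute

# Cross-check which compute services nova still sees heartbeats from
# (dead services show XXX in the State column)
nova-manage service list
```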
fungi | clarkb: i have scoured my chat logs and can't find where we might have asked ttx about odsreg.o.o, so i'll just try to remember to ask him tomorrow once he's around | 00:12 |
fungi | how long should i wait for this to boot back up before i get worried it's not coming back? | 00:14 |
clarkb | fungi: 10 minutes | 00:15 |
fungi | k | 00:15 |
pabelanger | okay, so anything I should focus on, or just slow process of infra-cloud | 00:15 |
clarkb | ~5 minutes seems to be the short end, with somewhere under 10 being common | 00:15 |
fungi | it's probably coming close to 10 minutes at this point, but i'll give it a while longer | 00:15 |
clarkb | pabelanger: I think just slow process of infracloud. maybe you want to do controller00.vanilla.ic.openstack.org? | 00:15 |
clarkb | I'm going to start doing the chocolate status check | 00:15 |
clarkb | I guess the worry here is controller00 might not come back | 00:16 |
pabelanger | sure | 00:16 |
clarkb | maybe that would be a good thing >_> | 00:16 |
pabelanger | Oh, is the HDD on the way out? | 00:16 |
clarkb | pabelanger: last night compute017 and compute027 had RO / filesystems and dmesg complained about medium errors | 00:17 |
clarkb | not sure about controller00 but the other hosts in the group are definitely on their last legs I think | 00:17 |
pabelanger | I do see some HDD messages in dmesg on controller00 | 00:18 |
clarkb | pabelanger: and medium error? | 00:18 |
clarkb | or sd read error | 00:18 |
pabelanger | [Wed Jan 10 07:08:22 2018] end_request: critical medium error, dev sda, sector 308872 | 00:18 |
pabelanger | for example | 00:18 |
clarkb | ya, controller00 may not come back in that case | 00:18 |
clarkb | 017 and 027 I couldn't patch due to the RO fs so I rebooted them and they never came back :) | 00:18 |
fungi | compute012 _did_ eventually come back online. i can ssh into it pretty quickly, so i think the super slow ansible results are likely slowness with fact gathering? | 00:19 |
clarkb | fungi: the network bw is also bad | 00:19 |
pabelanger | apt-get is still working | 00:19 |
clarkb | fungi: there is one last step you need to do post boot and that is creating the network interfaces for neutron | 00:19 |
pabelanger | guess we try a reboot and if it fails, rebuild from working compute? | 00:19 |
fungi | sure, but my point was i can ssh into it and check dmesg in a matter of a few seconds, yet trying to get ansible to do the same takes minutes | 00:19 |
clarkb | pabelanger: considering controller00's fs is not RO, I say we try it, and honestly if it doesn't come back we home the vanilla computes to chocolate and move on | 00:19 |
clarkb | fungi: yup I think it's network problems | 00:20 |
pabelanger | clarkb: sure | 00:20 |
fungi | but specifically ansible seems to be doing many orders of magnitude more network things than just ssh'ing into the server, running my command and regurgitating the results | 00:20 |
clarkb | fungi: ansible -i /etc/ansible/hosts/infracloud 'compute*.vanilla.ic.openstack.org' -m shell -a '/sbin/ip link add veth1 type veth peer name veth2 && /sbin/brctl addif br-vlan2551 veth1 && /sbin/ip link set dev veth1 up && /sbin/ip link set dev veth2 up' | 00:20 |
clarkb | those are the steps puppet runs to make networking go but puppet will happen too slowly | 00:20 |
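Broken out with comments, those are the same commands as the one-liner above, just run step by step on a compute host:

```shell
# Create a veth pair for neutron's linuxbridge setup on these hosts
/sbin/ip link add veth1 type veth peer name veth2
# Attach one end to the provider bridge carrying VLAN 2551
/sbin/brctl addif br-vlan2551 veth1
# Bring both ends up
/sbin/ip link set dev veth1 up
/sbin/ip link set dev veth2 up
```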
fungi | thanks, running that against 012 now | 00:21 |
fungi | grr... | 00:21 |
fungi | Failed to connect to the host via ssh: Connection timed out during banner exchange | 00:21 |
fungi | oh, is that a result of the command maybe? | 00:22 |
fungi | since it fiddles with networking | 00:22 |
fungi | doesn't seem to have worked | 00:23 |
fungi | `ip link show dev veth2` returns 'Device "veth2" does not exist.' | 00:23 |
fungi | same for veth1 | 00:23 |
clarkb | maybe it didn't manage to run? | 00:24 |
fungi | `brctl show` indicates there is a br-vlan2551 but only has eth2.2551 in it | 00:24 |
fungi | so, yeah, i think it disconnected before running | 00:25 |
fungi | trying again | 00:25 |
fungi | unreachable again | 00:25 |
fungi | interesting... i can't ssh to it from puppetmaster but i can from home | 00:25 |
clarkb | fungi: ya I had that problem yesterday too | 00:26 |
fungi | didn't you have another one like this, clarkb? | 00:26 |
clarkb | that's why I think the networking there is super hosed | 00:26 |
clarkb | fungi: compute000.vanilla | 00:26 |
clarkb | like broken lacp or an asymmetric route from puppetmaster? | 00:26 |
clarkb | something funny going on | 00:26 |
fungi | oh, i did eventually get through from puppetmaster with a straight ssh | 00:26 |
fungi | just took a while | 00:26 |
fungi | more wondering whether there's a peering problem. i can get to it quite quickly from home but it takes forever from puppetmaster | 00:27 |
clarkb | [71381.503893] EDAC MC0: 25661 CE error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0) | 00:27 |
fungi | seeing some packet loss from puppetmaster but not terrible | 00:27 |
clarkb | oh it does eventually work? interesting | 00:27 |
fungi | yeah, did `sudo ssh compute012.vanilla.ic.openstack.org` on puppetmaster and after a couple minutes it rendered a root prompt | 00:28 |
clarkb | this hardware is really on its last legs. Don't be surprised if an item on next weeks meeting agenda is turn off infracloud | 00:28 |
clarkb | fungi: huh | 00:28 |
fungi | i wonder if ansible just isn't waiting long enough | 00:28 |
fungi | it gives up after about 10 seconds | 00:29 |
* fungi times it | 00:29 | |
clarkb | [18371.678636] end_request: I/O error, dev sdc, sector 8089728 | 00:29 |
fungi | yeah, ansible gives up just under 11 seconds, so probably a 10-second timeout on the ssh call | 00:29 |
fungi | a straight ssh to it from puppetmaster takes just under 51 seconds to return | 00:30 |
fungi | i'll run the commands by hand, not via ansible | 00:31 |
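Ansible's default connection timeout is indeed 10 seconds, which matches what fungi measured. A hedged alternative to running the commands by hand would be raising that timeout for this slow environment, roughly:

```shell
# -T raises the SSH connection timeout (seconds); 60 covers the ~51s seen above
ansible -i /etc/ansible/hosts/infracloud 'compute012.vanilla.ic.openstack.org' \
    -T 60 -m shell -a 'uptime'
# or set it for the whole session
export ANSIBLE_TIMEOUT=60
```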
clarkb | http://paste.openstack.org/show/642525/ is chocolate. 2 hypervisors there need patching | 00:31 |
clarkb | I will work on 061 in chocolate | 00:31 |
pabelanger | cannot compute MD5 hash for file '/etc/ldap/ldap.conf': failed to read (Input/output error) | 00:32 |
pabelanger | looks like a bad file | 00:32 |
pabelanger | was able to use the version in dpkg-new for now | 00:33 |
fungi | ick | 00:33 |
clarkb | we don't use ldap so that should be fine | 00:33 |
pabelanger | dist-upgrade done | 00:33 |
clarkb | other than the fact that the host is about to die | 00:33 |
pabelanger | cross fingers, rebooting | 00:33 |
fungi | okay, neutron networking has been set back up on compute012.vanilla.ic | 00:33 |
fungi | anything else it should need? | 00:34 |
clarkb | fungi: that should be it | 00:34 |
clarkb | just confirm kpti is on | 00:35 |
fungi | yeah, did that earlier | 00:35 |
fungi | it's good | 00:35 |
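For completeness, a couple of ways to confirm KPTI is active on a rebooted host (a sketch; which check works depends on the exact kernel build):

```shell
# Kernels new enough to expose the sysfs vulnerability files
cat /sys/devices/system/cpu/vulnerabilities/meltdown
# Otherwise look for the page-table-isolation line in the boot log
dmesg | grep -i 'page table'
```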
clarkb | I am rebooting compute064 and 061 in chocolate now to pick up the changes | 00:35 |
fungi | i think i'm gonna drop off for the evening in that case and check back in tomorrow morning to see if there's anything left | 00:35 |
clarkb | cool thanks for all the help. I think all that is left will be cleaning up odsreg and disabling any nova hypervisors we can't patch | 00:36 |
fungi | yup, i'll try to remember to ask ttx about the odsreg server tomorrow | 00:36 |
clarkb | and maybe doing chocolate controller and baremetal00 as I am doing my best to computer with kids but it's not easy :) | 00:36 |
fungi | sure | 00:36 |
fungi | maybe someone in later timezones will pick up some of what's left for chocolate too | 00:37 |
pabelanger | still waiting for server to come back online from reboot | 00:37 |
fungi | if they're feeling up to it | 00:37 |
pabelanger | it does take a bit to boot however | 00:37 |
clarkb | pabelanger: ya ~10 minutes or so is normal | 00:37 |
clarkb | as fungi just demonstrated with his server | 00:38 |
fungi | that's about how long compute012.vanilla.ic took | 00:38 |
fungi | yeah | 00:38 |
pabelanger | yah, sounds right | 00:38 |
* fungi fades into the aether | 00:38 | |
pabelanger | hey, it is back | 00:41 |
pabelanger | checking nova logs now | 00:42 |
clarkb | 064 is back up and I added the veth interfaces | 00:42 |
clarkb | and now 061 is back and veth interfaces added | 00:43 |
pabelanger | oh right, linux-bridge | 00:43 |
clarkb | ya puppet configures it but that's slow | 00:43 |
clarkb | the controller doesn't need it iirc | 00:43 |
clarkb | only the computes | 00:43 |
pabelanger | going to kick.sh it | 00:44 |
clarkb | hrm chocolate does have it | 00:44 |
clarkb | *has veths on controller | 00:44 |
clarkb | chocolate unreachable nodes are not all in the nova-manage service list with XXX entries | 00:46 |
clarkb | so going to double check those from here to see if I managed to patch them | 00:46 |
pabelanger | welp, that's not good | 00:47 |
pabelanger | [ 302.415533] EXT4-fs (sda1): Remounting filesystem read-only | 00:47 |
pabelanger | just happened | 00:47 |
pabelanger | I think neutron / rabbitmq spamming logs didn't help | 00:48 |
pabelanger | clarkb: ^ | 00:48 |
pabelanger | so, I can reboot again, stop neutron / rabbit, get veth back online | 00:48 |
pabelanger | but, since HDD is dying, maybe we should just let it | 00:49 |
clarkb | fun | 00:49 |
clarkb | ya I mean you can try a reboot but if it does that again I think we just disable infracloud vanilla in nodepool | 00:50 |
pabelanger | kk | 00:50 |
clarkb | ok I'm fairly confident that chocolate computes are all patched now | 00:54 |
pabelanger | cool | 00:55 |
clarkb | the unreachable computes that don't show up as XXX in nova-manage service list don't show up at all :) | 00:55 |
clarkb | considering vanilla's controller is afk maybe we save chocolate's controller and baremetal for when we aren't about to afk ourselves :) | 00:56 |
pabelanger | k, got pings, kicking now | 00:56 |
pabelanger | well, trying to SSH | 00:57 |
pabelanger | doesn't look good :( | 00:57 |
clarkb | womp womp | 00:58 |
clarkb | so that's at least 3 dead machines in vanilla from patching | 00:58 |
pabelanger | port 22 open, no reply | 00:58 |
clarkb | controller00 and compute017 and compute027 | 00:58 |
pabelanger | I'm guess HDD died again | 00:58 |
clarkb | chocolate actually does seem to be in much better shape | 00:58 |
pabelanger | guessing* | 00:58 |
pabelanger | yah | 00:58 |
clarkb | unfortunately our baremetal controller is on the vanilla hardware :/ | 00:58 |
pabelanger | okay, so propose patch to disable vanilla in nodepool, then deal with migrating compute nodes | 00:59 |
clarkb | ya I think so | 00:59 |
clarkb | we can do the migrating of computes later | 00:59 |
pabelanger | sure | 00:59 |
pabelanger | remote: https://review.openstack.org/532705 Set max-server to 0 for infracloud-vanilla | 01:02 |
pabelanger | clarkb: ^ tomorrow I'll clean up images, so we don't try to upload DIBs to it | 01:02 |
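The shape of that change is roughly the following (a sketch assuming the nodepool v3 provider/pool layout; the real stanza carries many more settings):

```yaml
# nodepool.yaml excerpt, illustrative only
providers:
  - name: infracloud-vanilla
    pools:
      - name: main
        max-servers: 0  # stop launching new nodes here; cleanup of existing ones continues
```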
*** rlandy|bbl is now known as rlandy | 03:50 | |
*** rosmaita has quit IRC | 04:34 | |
*** rlandy has quit IRC | 04:53 | |
*** rosmaita has joined #openstack-infra-incident | 10:32 | |
*** panda|rover|afk has quit IRC | 11:24 | |
*** panda has joined #openstack-infra-incident | 11:38 | |
*** panda has quit IRC | 11:59 | |
*** panda has joined #openstack-infra-incident | 12:13 | |
*** rosmaita has quit IRC | 12:34 | |
*** panda is now known as panda|ruck | 13:29 | |
*** rlandy has joined #openstack-infra-incident | 13:36 | |
*** rosmaita has joined #openstack-infra-incident | 13:41 | |
*** rosmaita has quit IRC | 14:34 | |
*** Shrews has joined #openstack-infra-incident | 14:45 | |
Shrews | FYI, from what I can determine from nodepool-launcher logs, looks like we temporarily lost connection to our ZK server around 22:27 UTC yesterday. I think it's around that time we began seeing znodes in ZK in the 'deleting' state and locked. The few I manually checked did not actually exist in the provider, yet the znode remained. After the launcher restart, seems to be that most did not exist in the provider, but we could now delete the znodes. | 14:47 |
Shrews | I can't find how this might happen in the launcher code, but will keep looking. | 14:47 |
Shrews | https://review.openstack.org/532823 might help debug this in the future | 14:48 |
Shrews | We've lost the ZK connection before and nodepool recovered fine, so this is new. | 14:49 |
Shrews | maybe kazoo itself got confused | 14:49 |
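For anyone following along, the stuck znodes can be inspected with the stock ZooKeeper CLI (a sketch; the /nodepool/nodes path follows nodepool v3's layout, the zkCli.sh path is the Ubuntu package default, and the server address is assumed from the discussion below that ZK runs on nodepool.o.o):

```shell
# List node znodes, then dump one to see its recorded state and provider
/usr/share/zookeeper/bin/zkCli.sh -server nodepool.openstack.org:2181 ls /nodepool/nodes
/usr/share/zookeeper/bin/zkCli.sh -server nodepool.openstack.org:2181 get /nodepool/nodes/<id>
```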
clarkb | Shrews: not certain but that is probably when we rebooted the server running zk | 16:27 |
*** rosmaita has joined #openstack-infra-incident | 16:31 | |
fungi | Shrews: 22:27 is roughly when we were restarting everything, so losing connections to stuff is expected | 16:37 |
fungi | specifically, nodepool.o.o was rebooted | 16:37 |
clarkb | I've just checked dmesg on baremetal00.vanilla.ic.openstack.org and there are no disk or memory errors | 16:50 |
clarkb | I think baremetal00.vanilla.ic.openstack.org and controller00.chocolate.ic.openstack.org are the last infracloud nodes that need patching | 16:50 |
clarkb | I think before we patch and reboot controller00 we should try to understand if that will break nodepool again | 16:55 |
pabelanger | agree | 16:59 |
fungi | looks like odsreg was down due to a botched migration. i issued a ctrl-alt-del at the oob console and it came back online | 17:00 |
fungi | okay, after confirming with ttx that it really could go, it's gone now | 17:03 |
fungi | our openstackci tenant in rax is pretty thoroughly cleaned up at this point | 17:04 |
fungi | well, at least as far as server instances are concerned. we could likely still stand to look through trove instances, snapshots, cinder volumes, swift containers... | 17:04 |
-openstackstatus- NOTICE: Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets | 17:47 | |
clarkb | knowing what I now know after reading logs and having shrews bring me up to speed I don't think restarting controller00 is currently safe (I mean if it comes up quickly we probably won't notice but if it doesn't we likely will have node_failures again) | 17:49 |
*** weshay is now known as weshay_interview | 18:14 | |
clarkb | ptg flights are quite a bit more expensive than I was expecting | 18:50 |
clarkb | I guess that's the downside to flying somewhere with cheap lodging and food, it's harder to get to | 18:50 |
* clarkb books anyway | 18:50 | |
clarkb | er ww | 18:50 |
clarkb | but hey now you know :) | 18:50 |
clarkb | also if flying through london beware the 3 airport problem | 18:50 |
corvus | clarkb: try staying across a saturday. could be cheaper even including the extra hotel. :) | 19:11 |
corvus | sfo<->dub was < $700 for me | 19:12 |
clarkb | corvus: ya the google flights breakdown says leaving sunday and monday is roughly the same so doing monday flight back | 19:14 |
clarkb | but still >900 | 19:14 |
*** weshay_interview is now known as weshay | 19:16 | |
*** panda|ruck is now known as panda|ruck|afk | 19:20 | |
pabelanger | can confirm staying across a saturday was cheaper, fly out on the Friday night | 19:37 |
pabelanger | flying in I mean | 19:38 |
*** rlandy is now known as rlandy|bbl | 22:52 | |
clarkb | nodepool launchers are now running code that will handle max-servers:0 properly | 23:43 |
clarkb | I don't think I'll do it today because I feel like I'm running out of steam early but will try to finish up infracloud patching tomorrow (and set chocolate's max-servers to 0 first) | 23:43 |