clarkb | ugh ubuntu has kernels up for xenial now but not for trusty unless you use trusty with hardware enablement kernels | 00:01 |
---|---|---|
clarkb | which btw is not their default trusty server kernel | 00:01 |
clarkb | infra-root ^ | 00:02 |
mordred | clarkb: AWESOME | 00:03 |
clarkb | I'm going to patch my local fileserver then will probably start doing logstash workers | 00:04 |
clarkb | as those are probably a good canary and easily rebuilt | 00:04 |
clarkb | (I had hoped to start with infracloud, maybe we update infracloud to the hwe kernel?) | 00:04 |
corvus | clarkb: where's the info about hwe? | 00:06 |
clarkb | corvus: https://wiki.ubuntu.com/Kernel/LTSEnablementStack is the generic doc on it | 00:07 |
corvus | clarkb: right, i mean where did you see that they aren't doing stuff for hwe? | 00:08 |
corvus | er for not-hwe? | 00:08 |
clarkb | oh | 00:08 |
clarkb | https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown and the usn site | 00:09 |
clarkb | currently they have only published for the 14.04 + hwe kernel | 00:09 |
clarkb | well that and linux-aws | 00:09 |
clarkb | heh now https://usn.ubuntu.com/usn/usn-3524-1/ is there | 00:11 |
clarkb | so maybe just lag involved | 00:11 |
clarkb | in that case I will likely do infracloud next to pick up ^ | 00:11 |
clarkb | my local fileserver doesn't seem to see the new kernels yet | 00:14 |
clarkb | going to sort that out | 00:15 |
corvus | clarkb: yeah, on a trusty machine i have, i only see linux-image-3.13.0-137-generic not 139 | 00:17 |
clarkb | I'm pointed at security.ubuntu.com and I see the package in the package list if I browse to it | 00:19 |
* clarkb tries again | 00:19 |
clarkb | ya apt not seeing it for some reason | 00:20 |
*** rlandy has quit IRC | 00:24 | |
clarkb | logstash-worker01 does see the new kernel though | 00:24 |
clarkb | and I have the same set of xenial-security entries in my apt sources list so this has to be cdn or caching etc | 00:26 |
clarkb | corvus: I think the InRelease and Release files I am being served are older than the packages in the repo | 00:31 |
corvus | i need to go offline for a bit to patch my workstation and the system my irc client is on... i may be gone for a bit, but will rejoin asap | 00:32 |
clarkb | ok | 00:33 |
clarkb | someone completely unrelated to openstack says xenial's 4.4.0-108.131 kernel panics on their workstation | 00:34 |
clarkb | so uh be prepared for that I guess :/ | 00:34 |
corvus | that's great | 00:34 |
*** rosmaita_ has quit IRC | 00:36 | |
clarkb | I might just turn my fileserver back off and patch it when things get better :/ | 00:37 |
corvus | clarkb: so... permanently off? | 00:43 |
corvus | okay, really signing off now | 00:43 |
*** corvus has quit IRC | 00:44 | |
clarkb | I'm going to do logstash-worker01 by hand really quickly to get a xenial canary up | 00:46 |
clarkb | aha I think I sorted out my local fileserver issue | 00:50 |
clarkb | I was using the old hwe | 00:50 |
clarkb | due to chipset issues | 00:50 |
clarkb | ok logstash-worker01 came up | 00:53 |
clarkb | it does not have the bugs: cpu_insecure entry in the cpuinfo file but does have [ 0.000000] Kernel/User page tables isolation: enabled in dmesg | 00:53 |
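For reference, a minimal verification sketch along the lines clarkb describes above (assuming an Ubuntu guest that has booted the patched kernel; the exact cpuinfo bug entry varies by kernel version):

```bash
# Positive confirmation that KPTI is active comes from the boot log:
dmesg | grep 'Kernel/User page tables isolation'
# Some kernels also list the CPU bug in /proc/cpuinfo, e.g. cpu_insecure:
grep -m1 '^bugs' /proc/cpuinfo
```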
clarkb | https://etherpad.openstack.org/p/infra-meltdown-patching | 01:02 |
clarkb | using compute000.vanilla.ic.openstack.org as infracloud canary | 01:09 |
clarkb | just manually going to do that one too to figure out how to check if kpti is enabled and if instances will even boot up | 01:09 |
fungi | you should see it mentioned in dmesg at least | 01:10 |
clarkb | ya on xenial it was in dmesg | 01:11 |
clarkb | details like this going into the etherpad above | 01:12 |
fungi | Kernel/User page tables isolation: enabled | 01:12 |
fungi | that's what i was looking for on my systems | 01:12 |
clarkb | yup | 01:14 |
clarkb | arg sudoers contents changed on infracloud nodes | 01:14 |
clarkb | fungi: http://paste.openstack.org/show/641745/ should be safe to accept the new version there ya? | 01:19 |
*** tristanC has joined #openstack-infra-incident | 01:21 | |
clarkb | I'm going ahead and choosing the new version so this problem doesn't persist as I believe it to be largely equivalent to the old version | 01:23 |
fungi | yeah, it shouldn't cause any problems | 01:25 |
fungi | and puppet will undo it anyway | 01:25 |
clarkb | oh do we manage the top level sudoers file? I kept our unattended upgrades config because I know puppet manages that one | 01:26 |
clarkb | and am keeping the nova conf | 01:26 |
clarkb | arg sudoers update did end up breaking sudo for me | 01:28 |
clarkb | I'm going to kick.sh compute000.vanilla.ic.openstack.org so I can reboot it | 01:28 |
clarkb | so I think we want to keep our versions of all those config files on trusty nodes | 01:28 |
*** rosmaita has joined #openstack-infra-incident | 01:29 | |
fungi | your account was in the sudo group, right? | 01:30 |
ianw | so sorry do we need repos for xenial, or is it just update & reboot? | 01:30 |
clarkb | yes I am in the sudo group | 01:30 |
ianw | i can go through the ze nodes and do that, since they're tolerant to dying | 01:30 |
clarkb | ianw: it should be just update and reboot. see https://etherpad.openstack.org/p/infra-meltdown-patching | 01:31 |
clarkb | fungi: but it wants my password now | 01:31 |
fungi | weird that it broke sudo for you. it's not clear to me what in that diff would have | 01:31 |
fungi | oh | 01:31 |
fungi | OH | 01:31 |
fungi | it dropped NOPASSWD | 01:31 |
clarkb | and puppet master is having a hard time ssh'ing to compute000.vanilla.ic.openstack.org now (I am ssh'd in though) | 01:32 |
fungi | i totally overlooked that since the %sudo stanza also moved | 01:32 |
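For context, the line being lost is the passwordless entry for the sudo group; a representative before/after sketch (not the exact infra sudoers file):

```
# What the infra accounts rely on (representative sketch):
%sudo   ALL=(ALL:ALL) NOPASSWD:ALL
# The package maintainer's proposed sudoers instead ships the stock stanza:
%sudo   ALL=(ALL:ALL) ALL
```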
clarkb | ya oh well | 01:32 |
clarkb | as soon as I can get puppetmaster to ssh in it should fix it | 01:32 |
clarkb | but uh thats not working | 01:33 |
ianw | ok, trying on ze01 and see what happens | 01:33 |
fungi | was there a sudo package upgrade or something? | 01:34 |
fungi | i guess you're upgrading more than just kernel and microcode packages | 01:34 |
clarkb | fungi: yes | 01:34 |
clarkb | fungi: ya I was doing full dist-upgrades | 01:34 |
clarkb | assuming unattended upgrades were mostly keeping up with the other stuff | 01:34 |
clarkb | which seems to be the case on xenial at least | 01:34 |
clarkb | oh wow compute000 doesn't permit root ssh | 01:35 |
clarkb | so I may have toasted it and it needs to be rebuilt :/ | 01:35 |
clarkb | sshd_config was not something I was asked to update | 01:35 |
fungi | i bet another package upgraded (openssh-server?) and disabled root login | 01:35 |
clarkb | oh wait there is a specific whitelist for puppetmaster on compute000 | 01:36 |
clarkb | in sshd_config so why isn't this working | 01:36 |
clarkb | ssh -vvv seems to indicate it is a tcp problem | 01:37 |
clarkb | oh wait no that was just slow but it connected via tcp | 01:37 |
clarkb | it might just be slow as molasses | 01:38 |
ianw | how long does the dkms take ... | 01:38 |
ianw | ohh, it's doing afs | 01:38 |
clarkb | yup confirmed just slow as can be | 01:40 |
clarkb | I'm going to manually replace the sudoers file as I doubt ansible + puppet will ever be happy with this slow connection | 01:40 |
ianw | ok, ze01 done | 01:42 |
ianw | i will let it run for a while as i get some lunch, and if all good, i'll go through the rest and update | 01:42 |
clarkb | also based on how slow this connection is I'm not entirely convinced we should be using infracloud in its current state | 01:42 |
clarkb | ianw: sounds good, thanks | 01:42 |
ianw | i'll also stop the executor and clear out /var/lib/zuul/builds on each host, as it seems there's a little cruft in there | 01:43 |
clarkb | compute000 is rebooting now | 01:43 |
clarkb | it will probably take 10 minutes to come back up so I can check on it | 01:44 |
clarkb | fungi: does dist-upgrade -y imply responding N (eg keep old version of conf file) when there are conflicts in packages? | 01:44 |
clarkb | fungi: since I think we do want N in all cases, but -y is yes not no, and N is the default | 01:45 |
clarkb | I'm going to put the elasticsearch cluster in the "don't worry about rebalancing indexes" mode then reboot the whole cluster at once | 01:46 |
clarkb | that's the best way of ripping off that bandaid I think | 01:46 |
fungi | clarkb: not really sure. if you use the noninteractive mode it will keep old configs | 01:46 |
clarkb | ok rebooting elasticsearch cluster now | 01:51 |
clarkb | elasticsearch is recovering shards now | 01:55 |
clarkb | fungi: --force-confold as a dpkg option is the magic there apparently | 01:59 |
fungi | yeah, you can do it that way too | 01:59 |
clarkb | fungi: `export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get -o Dpkg::Options::="--force-confold" dist-upgrade -y` how does that look? | 02:01 |
fungi | clarkb: remarkably similar to the syntax we're using somewhere in our automation | 02:02 |
fungi | and yes, should do what you want | 02:02 |
clarkb | looks like archive.ubuntu.com has the new kernel now | 02:03 |
clarkb | so don't need to update sources.list on infracloud | 02:03 |
clarkb | if compute001 comes out of this looking happy I'm going to ansible the above command across infracloud compute nodes and I guess start doing reboots? | 02:04 |
clarkb | I figure control plane is lesser concern and probably needs more eyeballs on it | 02:04 |
fungi | sounds good. i should be able to help tomorrow but it's getting late here | 02:05 |
clarkb | I also confirmed that trusty has the same dmesg check as xenial | 02:06 |
fungi | excellent | 02:08 |
clarkb | hrm rebooting broke neutron networking | 02:15 |
fungi | ouch | 02:16 |
clarkb | puppet is what sets that up | 02:16 |
clarkb | and since puppetmaster can't ssh to half these nodes... | 02:16 |
clarkb | ok 000 and 001 in vanilla are working with the manually run steps that puppet would normally run | 02:23 |
clarkb | apt-get update is running on those vanilla compute hosts that puppetmaster can ssh into | 02:23 |
clarkb | I think I will go ahead and upgrade those, reboot them, then fix networking, then use the dmesg check for finding those that were missed due to ssh issues | 02:24 |
*** mordred has quit IRC | 02:42 | |
*** mordred has joined #openstack-infra-incident | 02:44 | |
*** rosmaita has quit IRC | 02:58 | |
ianw | doing the executors now | 03:05 |
clarkb | I'm tracking the various states of infracloud things and it's not very pretty... | 03:06 |
clarkb | bunch of nodes can't be hit, at least one node has a ro / that appears to be due to medium errors | 03:06 |
clarkb | I'm gonna continue working through this but I'm beginning to think we might want to seriously consider not infraclouding anymore | 03:07 |
clarkb | I expect I'll be in a position to do mass reboots in an hour or so | 03:08 |
* clarkb finds dinner | 03:08 |
clarkb | that's cool, the bad disk/fs happened just now-ish | 03:19 |
clarkb | I guess I'll give it a reboot and see if it comes up and is patchable | 03:19 |
*** rosmaita has joined #openstack-infra-incident | 03:42 | |
clarkb | doing mass reboot of chocolate now | 03:59 |
clarkb | (it updated quicker than vanilla) | 03:59 |
clarkb | all the gate jobs are failing on a tox siblings thing anyways so I figure just go for it | 04:00 |
clarkb | (also zuul should retry anyways) | 04:00 |
clarkb | chocolate nodes are coming back and I am applying their network config | 04:07 |
clarkb | I expect that I will have all of the reachable chocolate compute hosts patched shortly | 04:07 |
clarkb | vanilla compute hosts are rebooting now as well | 04:10 |
*** rosmaita has quit IRC | 04:21 | |
clarkb | I think all infracloud computes that are reachable are patched | 04:40 |
clarkb | the chocolate cloud appears to be functioning too | 04:40 |
clarkb | but the vanilla cloud is somewhat ambiguous going by grafana data | 04:40 |
clarkb | TL;DR for those of you catching up in the morning. I think a good next step would be to generate some proper node lists and we can start going through them. What ianw and I did this evening was mostly using representative sets like logstash-workers and elasticsearch* and infracloud to get the ball rolling. I've tried to capture my notes on what I've done to get things updated properly | 04:55 |
clarkb | https://etherpad.openstack.org/p/infra-meltdown-patching has more infos | 04:56 |
clarkb | and with that I'm off to bed | 04:57 |
*** jeblair has joined #openstack-infra-incident | 05:21 | |
*** jeblair is now known as corvus | 05:21 | |
*** frickler has joined #openstack-infra-incident | 08:53 | |
*** rosmaita has joined #openstack-infra-incident | 13:04 | |
*** rlandy has joined #openstack-infra-incident | 13:34 | |
*** pabelanger has quit IRC | 14:56 | |
*** pabelanger has joined #openstack-infra-incident | 14:56 | |
-openstackstatus- NOTICE: Gerrit is being restarted due to slowness and to apply kernel patches | 14:58 | |
fungi | to also track in here, review.o.o is running the updated kernel as of a few minutes ago (thanks pabelanger!) | 15:13 |
pabelanger | Yes, seems things are working | 15:14 |
fungi | and frickler noticed the mirror in ic-choco was broken. apparently got shutdown when the hosts were rebooted for newer kernels, and i guess nova doesn't start instances automagically when the host comes back online? | 15:14 |
fungi | i should check vanilla too, now that i think about it | 15:14 |
fungi | i can't ssh to it either. not a good sign | 15:15 |
pabelanger | yah, I think you need a nove.conf setting to turn instances back on | 15:15 |
pabelanger | nova.conf* | 15:15 |
fungi | confirmed, it too was in a shutoff state | 15:15 |
pabelanger | ack | 15:16 |
*** dmsimard has joined #openstack-infra-incident | 15:19 | |
frickler | I haven't dug into how we set up infra cloud yet, but it should be "resume_guests_state_on_host_boot = true" in nova.conf/DEFAULT | 15:35 |
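A sketch of the change frickler describes, assuming it lands in the compute hosts' nova.conf (option name as given above):

```ini
[DEFAULT]
# Restart guests that were running before the hypervisor rebooted
resume_guests_state_on_host_boot = true
```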
corvus | oh, it looks like we're up to 4.4.0-109 this morning | 15:38 |
corvus | it was 108 yesterday | 15:38 |
corvus | (for xenial) | 15:39 |
corvus | and trusty+hwe | 15:41 |
corvus | xenial: https://usn.ubuntu.com/usn/usn-3522-3/ | 15:41 |
corvus | apparently fixes a regression that caused "a few systems failed to boot successfully" | 15:42 |
corvus | so we *probably* don't have to redo the 108 hosts -- if they booted. | 15:42 |
clarkb | fungi: oh sorry I completely spaced on the mirrors | 16:36 |
fungi | no problem. they're working now | 16:43 |
clarkb | is review.o.o the only thing that has been patched since last night? | 16:44 |
fungi | corvus: yeah, i say we just make sure the updated kernels get installed so the next time we reboot they'll also hopefully not fail to reboot | 16:44 |
clarkb | I'll probably work on generating server lists as next step and adding that into the etherpad | 16:44 |
fungi | clarkb: yes, we missed the opportunity to patch zuulv3.o.o when it got restarted for excessive memory use | 16:44 |
dmsimard | btw I mentioned yesterday a playbook to get an inventory of patched and unpatched nodes, I cleaned what I had and pushed it here as WIP: https://github.com/dmsimard/ansible-meltdown-spectre-inventory | 16:45 |
dmsimard | need to afk | 16:45 |
fungi | clarkb: oh, also i saw that debian oldstable/jessie got meltdown patches if you want to update your personal system | 16:45 |
clarkb | fungi: oh thanks for that heads up. I may actually start with that and patch my irc client server | 16:46 |
clarkb | ya gonna get that taken care of, will be afk for a bit | 16:48 |
*** clarkb has quit IRC | 16:53 | |
*** clarkb1 has joined #openstack-infra-incident | 16:54 | |
fungi | helo clarkb1 | 16:56 |
clarkb1 | hello, now to figure out my nick situation | 16:56 |
fungi | indeed! | 16:56 |
fungi | i see webkitgtk+ just released mitigation patches along with an advisory, so maybe chrome and safari will be a little safer shortly? | 16:56 |
*** rosmaita has quit IRC | 16:58 | |
*** clarkb1 is now known as clarkb | 16:58 | |
clarkb | ok I think that's irc mostly sorted out. Need to join a couple dozen more channels but that can happen later | 17:00 |
clarkb | dmsimard: you had to afk, but how work in progress is that playbook? should we avoid running it against say infracloud or the rest of infra? | 17:01 |
clarkb | does sshing to compute030.vanilla.ic.openstack.org close the connection just as soon as you actually login? | 17:02 |
clarkb | or rather does it do that for anyone else? | 17:02 |
corvus | clarkb: yep | 17:02 |
clarkb | `ansible -i /etc/ansible/hosts/openstack '*:!git*.openstack.org' -m shell -a "dmesg | grep 'Kernel/User page tables isolation: enabled'"` is what I'm going to start with to generate a list of what is and what isn't patched | 17:04 |
clarkb | that excludes infracloud and excludes the centos git servers which should already be patched | 17:04 |
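A hypothetical variant of the ad-hoc check above that splits hosts into patched/unpatched output for list generation; the timeout and PATCHED/UNPATCHED markers are assumptions, not the exact commands used here:

```bash
ansible -i /etc/ansible/hosts/openstack '*:!git*.openstack.org' -T 30 -m shell \
  -a "dmesg | grep -q 'Kernel/User page tables isolation: enabled' \
      && echo PATCHED || echo UNPATCHED" | tee /tmp/meltdown-check.log
grep -B1 '^UNPATCHED' /tmp/meltdown-check.log   # hosts still needing attention
```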
clarkb | corvus: completely unreachable from local and puppetmaster, unreachable from puppetmaster but reachable from local, reachable but connection gets killed like for compute030, and sad hard drives plus RO / seems to be the rough set of different ways things are broken in infracloud | 17:06 |
pabelanger | yah, sad HDDs in infracloud are happening more often | 17:15 |
clarkb | vanilla also appears to be in worse shape | 17:16 |
clarkb | I think the chocolate servers are newer | 17:17 |
clarkb | I'm going to reorganize the etherpad a bit as I don't think the trusty vs xenial distinction matters much | 17:25 |
clarkb | ok I think https://etherpad.openstack.org/p/infra-meltdown-patching is fairly well organized now | 17:33 |
clarkb | I'm going to continue to pick off some of the easy ones like translate-dev etherpad-dev logstash.o.o subunit workers | 17:36 |
clarkb | but then after breakfast I probably should context switch back to infracloud and finish that up | 17:36 |
pabelanger | I can start picking up some hosts too | 17:38 |
fungi | i'm just about caught up with other stuff to the point where i can as well | 17:39 |
pabelanger | will do kerberos servers, since they fail over | 17:40 |
*** rosmaita has joined #openstack-infra-incident | 17:41 | |
pabelanger | okay, rebooting kcd01, run-kprop.sh worked as expected | 17:42 |
pabelanger | actually, let me confirm we have right kernel first | 17:42 |
pabelanger | linux-image-3.13.0-139-generic | 17:44 |
pabelanger | rebooting now | 17:44 |
clarkb | on trusty unattended upgrades pulled in the latest kernel | 17:45 |
clarkb | but ubuntu released a newer kernel for xenial 109 instead of 108 that addressed some booting issues that we should install on unpatched servers | 17:45 |
pabelanger | [ 0.000000] Kernel/User page tables isolation: enabled | 17:46 |
clarkb | (I'm still running an apt-get update and dist-upgrade per the etherpad on the servers I'm patching regardless) | 17:47 |
pabelanger | yah, just doing that on kdc04.o.o now | 17:47 |
pabelanger | which is xenial and got latest 109 kernel | 17:47 |
pabelanger | lists-upgrade-test.openstack.org | FAILED | 17:49 |
pabelanger | can we just delete that now? | 17:49 |
pabelanger | clarkb: corvus: ^ | 17:50 |
pabelanger | logstash-worker-test.openstack.org | FAILED I guess too | 17:50 |
clarkb | I have no idea what logstash-worker-test.o.o is | 17:52 |
clarkb | so I think it can be cleaned up | 17:52 |
pabelanger | working nb03.o.o and nb04.o.o | 17:52 |
clarkb | lists-upgrade-test was used to test the inplace upgrade of lists.o.o I don't think we need the server but corvus should confirm | 17:52 |
clarkb | fungi: maybe you want to do the wiki hosts as I think you understand their current situation best | 17:54 |
corvus | clarkb: confirmed -- lists- | 17:55 |
corvus | ha | 17:56 |
corvus | upgrade-test is not needed :) | 17:56 |
fungi | clarkb: yup, i'll get wiki-upgrade-test (a.k.a. wiki.o.o) and wiki-dev now | 17:57 |
pabelanger | ack, I'll clean up both in a moment | 17:58 |
pabelanger | should we wait until later this evening for nodepool-launchers or fine to reboot now? | 17:59 |
clarkb | I'm kinda leaning towards ripping the bandaid off on this one | 18:00 |
clarkb | and nodepool should handle that sanely | 18:00 |
pabelanger | Yah, launcher should bring nodes online again. nodepool.o.o (zookeeper) we'll need to do when we stop zuul | 18:01 |
clarkb | speaking of bandaids, just going to do codesearch since there isn't a good way of doing that one without an outage | 18:03 |
clarkb | maybe I should send email to the dev list first though | 18:03 |
clarkb | ya writing an email now | 18:04 |
pabelanger | I didn't see anybody on pbx.o.o, so I've rebooted it | 18:05 |
pabelanger | moving to nl01.o.o | 18:06 |
pabelanger | clarkb: okay, we ready for reboot of nl01? | 18:08 |
clarkb | I think so, that should be zero outage from a user perspective | 18:08 |
clarkb | I'm almost done with this email too | 18:09 |
pabelanger | okay, nl01 rebooted | 18:10 |
pabelanger | and back online | 18:10 |
clarkb | email sent | 18:11 |
clarkb | oh hey arm64 email I should read too | 18:11 |
fungi | the production wiki server is taking a while to boot | 18:12 |
fungi | may be doing a fsck... checking console now | 18:12 |
pabelanger | okay, both nodepool-launchers are done | 18:14 |
fungi | yeah, fsck in progress, ~1/3 complete now | 18:14 |
clarkb | codesearch is reindexing | 18:15 |
pabelanger | clarkb: how do we want to handle mirrors? No impact would be to launch new mirrors or disable provider in nodepool.yaml | 18:15 |
clarkb | pabelanger: ya considering how painful the last day or so has been for jobs we may want to be extra careful there | 18:16 |
fungi | replacing mirrors loses their warm cache | 18:16 |
clarkb | either boot new instances or do them when we do zuul | 18:16 |
pabelanger | fungi: yah, that too | 18:16 |
fungi | i vote do them when we do zuul scheduler | 18:16 |
pabelanger | okay, that works | 18:16 |
fungi | one outage to rule them all | 18:16 |
clarkb | wfm | 18:17 |
fungi | also, we may not have sufficient quota in some providers to stand up a second mirror without removing the first? | 18:17 |
*** ChanServ changes topic to "Meltdown patching | https://etherpad.openstack.org/p/infra-meltdown-patching | OpenStack Infra team incident handling channel" | 18:17 | |
clarkb | pabelanger: what we can do now though is run the update and dist upgrade steps | 18:18 |
clarkb | so that all we have to do when zuul is ready is reboot them | 18:18 |
pabelanger | fungi: I think we do, at least when I've built them we've had 2 online at once | 18:19 |
pabelanger | clarkb: sure | 18:19 |
pabelanger | k, doing grafana.o.o now | 18:21 |
corvus | what do you think about pulling the list of systems that need patching into an ansible inventory file, then run apt-get update && apt-get dist-upgrade on all of them? | 18:21 |
clarkb | that's annoying, hound wasn't actually started on codesearch01 (I've started it) | 18:22 |
clarkb | corvus: not a bad idea. I can regenerate my list so that it is up to date | 18:22 |
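A sketch of what corvus is proposing, reusing the dist-upgrade invocation worked out earlier; the inventory path and group name are hypothetical:

```bash
# /tmp/unpatched.ini: an [unpatched] group listing the hosts still on old kernels
ansible -i /tmp/unpatched.ini unpatched -b -m shell -a \
  "export DEBIAN_FRONTEND=noninteractive && apt-get update && \
   apt-get -o Dpkg::Options::='--force-confold' dist-upgrade -y"
```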
fungi | 3.13.0-139-generic is booted on wiki-upgrade-test (and the wiki is back online again) but dmesg doesn't indicate that page table isolation is in effect | 18:23 |
corvus | rebooting cacti02 | 18:23 |
clarkb | fungi: weird, does it show up in the kernel config as an enabled option? | 18:24 |
pabelanger | Hmm, grafana is trying to migrate something | 18:24 |
fungi | CONFIG_PAGE_TABLE_ISOLATION=y is in /boot/config-3.13.0-139-generic | 18:25 |
pabelanger | ah, new grafana package was pulled in | 18:25 |
corvus | meeting channels are idle, i will do eavesdrop now | 18:25 |
fungi | good call | 18:25 |
*** openstack has joined #openstack-infra-incident | 18:30 | |
*** ChanServ sets mode: +o openstack | 18:30 | |
corvus | supybot is not set to start on boot. and we call it meetbot-openstack in some places and openstack-meetbot others. :| | 18:30 |
corvus | i'm guessing something was missed in the systemd unit conversion. | 18:30 |
fungi | eep, last time i restarted it i used `sudo service openstack-meetbot restart` | 18:31 |
fungi | which _seemed_ to work | 18:31 |
corvus | fungi: i think they all go to the same place :) | 18:31 |
fungi | fun | 18:32 |
corvus | any reason why etherpad-dev was done but not etherpad? | 18:32 |
fungi | i'll do planet next | 18:32 |
corvus | hrm. apt-get dist-upgrade on etherpad only sees -137, not 139 | 18:33 |
fungi | pointed at an out-of-date mirror? | 18:34 |
clarkb | corvus: yes | 18:34 |
clarkb | corvus: we are using the etherpad to coordindate so didn't want to just reboot it | 18:34 |
clarkb | but maybe it is best to get that out of the way | 18:34 |
clarkb | I'm going to update the node list first though | 18:34 |
corvus | fungi: deb http://security.ubuntu.com/ubuntu trusty-security main restricted | 18:34 |
fungi | strange | 18:34 |
fungi | that's after apt-get update? | 18:34 |
corvus | anyone sucessfully done a trusty upgrade to 139 today? | 18:34 |
corvus | fungi: yep | 18:34 |
corvus | the ones i picked up were all xenial | 18:35 |
clarkb | corvus: yes etherpad-dev was trusty too and worked | 18:35 |
fungi | i just did wiki and wiki-dev and they're trusty | 18:35 |
clarkb | is it saying 137 is no longer needed? | 18:35 |
corvus | oh wait... | 18:35 |
clarkb | the trusty nodes largely got the new kernel from unattended upgrades last night | 18:35 |
corvus | clarkb: i think that's the case | 18:35 |
clarkb | and then the message says 137 is not needed anymore | 18:35 |
corvus | sorry :) | 18:36 |
clarkb | np | 18:36 |
fungi | right, unattended-upgrades will install new kernel packages and then e-mail you to let you know you need to reboot | 18:36 |
pabelanger | sinc we nolonger do HTTP on zuul-mergers, those could be rebooted with no impact? | 18:36 |
clarkb | everyone ok if I just delete the current list (you'll lose your annotations) and replace with current list? | 18:36 |
clarkb | or current as of a few minutes ago | 18:36 |
corvus | clarkb: wfm | 18:36 |
fungi | clarkb: fine with me | 18:36 |
fungi | though wiki-upgrade-test will show back up | 18:36 |
fungi | unless you're filtering out amd cpus | 18:37 |
pabelanger | corvus: ok here | 18:37 |
clarkb | fungi: I am not, but can make a note on that one | 18:37 |
corvus | clarkb: i'm ready to reboot when you're done etherpadding. | 18:37 |
clarkb | corvus: ready now | 18:37 |
corvus | pabelanger: why didn't your zk hosts show up as success? | 18:38 |
fungi | clarkb: more pointing out that there could be other instances in the same boat (however unlikely at this point) | 18:38 |
pabelanger | clarkb: I just did them, maybe clarkb's list is outdated? | 18:38 |
clarkb | ya could be a race in ansible running and me filtering | 18:39 |
clarkb | it takes a bit of time for ansible to go through the whole list | 18:39 |
clarkb | I haven't gotten a fully automated solution yet because ansible hangs on a couple hosts so never exits in order to do things like sorting | 18:39 |
corvus | etherpad is back up | 18:40 |
corvus | for that matter, the 2 i just did also showed up as failed | 18:40 |
corvus | any special handling for files02, or should we just rip off the bandaid? | 18:41 |
clarkb | we could deploy a new 01 and then delete 02 | 18:42 |
clarkb | but I'm not sure that is necessary, server should be up quickly and start serving data again | 18:42 |
clarkb | (also we lose all the cache data if we do that) | 18:42 |
corvus | clarkb: if we feel that a 1 minute outage is too much, i'd suggest we deploy 01 and *not* delete 02 :) | 18:43 |
corvus | i'll ask in #openstack-doc | 18:43 |
clarkb | sounds good | 18:43 |
clarkb | I got a response to my we are patching email pointing to a thing that indicates centos may not actually kpti if your cpu's don't have pcid | 18:44 |
clarkb | so I'm going to track that down | 18:44 |
corvus | also *jeepers* the dkms build is slow | 18:44 |
clarkb | "The performance impact of needing to fully flush the TLB on each transition is apparently high enough that at least some of the Meltdown-fixing variants I've read through (e.g. the KAISER variant in RHEL7/RHEL6 and their CentOS brethren) are not willing to take it. Instead, some of those variants appear to implicitly turn off the dual-page-table-per-process security measure if the processor | 18:46 |
clarkb | they are running on does not have PCID capability." | 18:46 |
clarkb | uhm | 18:46 |
fungi | huh | 18:46 |
corvus | how do we git a console these days? | 18:46 |
clarkb | git.o.o does not have pcid in its cpuinfo flags like my laptop does | 18:46 |
clarkb | corvus: log into the rax website, go to servers, open up the specific server page, then under actions there is open remote console iirc | 18:47 |
clarkb | pabelanger: corvus dmsimard this may explain why centos dmesg doesn't say kpti is enabled | 18:47 |
clarkb | because it isn't? the sysfs entry definitely seems to say it is there though | 18:47 |
corvus | okay, files02 is back up. i got nervous because it took about 2 minutes to reboot, which is very long. | 18:48 |
corvus | clarkb: also, we don't necessarily lose the afs cache -- it shouldn't be any worse than a typical volume release (which happens every 5min) | 18:49 |
clarkb | oh right docs releases often | 18:49 |
corvus | and it caches the content, and just checks that the files are up to date after invalidation. | 18:50 |
corvus | clarkb: link to what you're looking at? | 18:52 |
clarkb | corvus: https://groups.google.com/forum/m/#!topic/mechanical-sympathy/L9mHTbeQLNU | 18:53 |
clarkb | reading other things on the internet the best check for kpti enablement is the message in dmesg | 18:54 |
clarkb | that is a clear indication it was enabled and is being used post boot | 18:54 |
clarkb | but ya apparently red hat did the sysfs thing which isn't standard so harder to get info on that and determine if that is a clear "it is on" message | 18:54 |
clarkb | so I'm fairly convinced that it is working as expected on the ubuntu hosts we are patching (because we are checking dmesg) | 18:55 |
clarkb | dmsimard: do you have red hat machines where the cpu does have pcid in the cpu flags? if so does dmesg output the isolation enabled message in that case? | 18:56 |
clarkb | pabelanger: ^ | 18:56 |
clarkb | I'm going to do review-dev now | 18:56 |
fungi | as soon as i confirm paste is okay, i'm going to do openstackid-dev and then openstackid.org | 18:57 |
fungi | though i'll give the foundation staff and tipit devs a heads up on the latter | 18:58 |
fungi | lodgeit on paste01 seems unhappy | 18:59 |
pabelanger | clarkb: I'll look and see on some internal things | 19:01 |
pabelanger | but don't have much access myself | 19:01 |
pabelanger | once bandersnatch is done running, I'll reboot mirror-update.o.o | 19:04 |
fungi | looks like the openstack-paste systemd unit raced unbound at boot and failed to start because it couldn't resolve the dns name of the remote database. do we have a good long-term solution to that? | 19:04 |
clarkb | anyone know why apps-dev and stackalytics still show up in our inventory? | 19:07 |
clarkb | I'm going to make a list of services that hsould happen with zuul's reboot | 19:09 |
pabelanger | have they been deleted? | 19:09 |
clarkb | pabelanger: haven't checked yet, I guess if the instances are still up but have no dns that could explain it | 19:09 |
pabelanger | ah, I wasn't aware we deleted stackalytics.o.o | 19:10 |
clarkb | well I don't know that we did but both are unreachable | 19:13 |
clarkb | do the zuul mergers need to happen with zuul scheduler? | 19:13 |
clarkb | or can we do them one at a time and get them done ahead of time? | 19:13 |
clarkb | corvus: ^ | 19:13 |
corvus | clarkb: stopping them while running a job may cause a merger failure to be reported. other than that, it's okay. | 19:14 |
pabelanger | maybe we can add graceful stop for next time :) | 19:16 |
pabelanger | corvus: also, ns01 and ns02, anything specific we need to do for nameservers? Assume rolling restarts are okay | 19:17 |
clarkb | I'm inclined to do them as part of zuul restart then since we've alrady had a rough day with test stability yesterday | 19:17 |
corvus | pabelanger: rolling should be fine | 19:17 |
pabelanger | okay, I'll look at them now | 19:17 |
corvus | pabelanger: a graceful stop for mergers would be great :) | 19:18 |
clarkb | translate.o.o is failing to update package indexes, something related to a nova agent repo? I wonder if we made this xenial instance during a period of funny rax images | 19:22 |
clarkb | I'm looking into it | 19:22 |
fungi | ick | 19:22 |
pabelanger | clarkb: question asked about PCID on internal chat, hopefully know more soon | 19:23 |
clarkb | deb http://ppa.launchpad.net/josvaz/nova-agent-rc/ubuntu xenial main | 19:23 |
clarkb | is in a sources.list.d file | 19:24 |
clarkb | that looks like a use customers as guinea pigs solution | 19:24 |
clarkb | cloud-support-ubuntu-rax-devel-xenial.list is a different file name in /etc/apt/sources.list.d but is empty | 19:25 |
clarkb | I'm thinking I will just comment out the deb line there and move on | 19:25 |
clarkb | any objections to that? | 19:25 |
pabelanger | clarkb: a PCID cpu with RHEL7, doesn't show isolation in dmesg | 19:27 |
clarkb | pabelanger: ok, thanks for checking | 19:27 |
clarkb | so I guess other than rtfs'ing we treat the sysfs entry saying pti is enabled as good enough? | 19:28 |
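For the RHEL/CentOS hosts, a check sketch based on the debugfs knobs Red Hat documented for these kernels (paths are an assumption here; requires debugfs mounted and root):

```bash
# On RHEL/CentOS 7 kernels of this vintage, PTI state was exposed through
# debugfs rather than a dmesg line; 1 means page table isolation is active.
cat /sys/kernel/debug/x86/pti_enabled
# Companion Spectre knobs shipped in the same kernels, where present:
cat /sys/kernel/debug/x86/ibrs_enabled /sys/kernel/debug/x86/ibpb_enabled 2>/dev/null
```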
clarkb | commenting out the ppa allowed translate01 to update. I am rebooting it now | 19:29 |
pabelanger | okay, all mirrors upgraded, just need reboots when we are ready | 19:30 |
pabelanger | I don't think we have any clients connected to firehose01.o.o yet? | 19:32 |
clarkb | nothing production like. I think mtreinish uses it randomly | 19:33 |
pabelanger | kk, will reboot now then | 19:33 |
clarkb | zuul-dev can just be rebooted right? | 19:33 |
clarkb | corvus: ^ | 19:33 |
pabelanger | or even deleted? | 19:33 |
corvus | clarkb, pabelanger: either of those | 19:34 |
clarkb | well if it can be deleted then that is probably preferable | 19:34 |
clarkb | doing ask-staging now | 19:34 |
pabelanger | okay, I can propose patches to remove zuul-dev from system-config | 19:35 |
fungi | clarkb: yes please to be commenting out the rax nova-agent guinea pig sources.list entry | 19:39 |
pabelanger | remote: https://review.openstack.org/511986 Remove zuulv3-dev.o.o | 19:40 |
pabelanger | remote: https://review.openstack.org/532615 Remove zuul-dev.o.o | 19:40 |
pabelanger | should be able to land both of them | 19:40 |
fungi | openstackid server restarts are done, after coordinating with foundation/tipit people | 19:40 |
fungi | going to do groups-dev and groups next | 19:41 |
pabelanger | graphite.o.o is ready for a reboot, if we want to do it | 19:41 |
clarkb | fungi: ask-staging looked happy anything I should know about before doing ask.o.o? | 19:41 |
fungi | pabelanger: after restarting firehose01, please take a moment to test connecting to it per the instructions in the system-config docs to confirm it's streaming again | 19:41 |
pabelanger | fungi: sure, I can do that now | 19:42 |
fungi | clarkb: assuming the webui is up, i think it's safe to proceed with prod | 19:42 |
clarkb | pabelanger: if we do graphite with zuul then we won't lose job stats or nodepool stats | 19:42 |
clarkb | pabelanger: but I'm not sure that is mission critical its probably fine to have a small gap in the data and reboot it | 19:42 |
clarkb | fungi: ok doing ask now then | 19:42 |
*** rlandy has quit IRC | 19:43 | |
pabelanger | fungi: firehose.o.o looks good | 19:44 |
pabelanger | clarkb: okay, will reboot now | 19:45 |
clarkb | ask.o.o is rebooting now | 19:45 |
fungi | cool | 19:45 |
fungi | thanks for checking pabelanger! | 19:45 |
pabelanger | np! | 19:45 |
*** rlandy has joined #openstack-infra-incident | 19:48 | |
pabelanger | clarkb: what about health.o.o, reboot when we do zuul? | 19:49 |
clarkb | health should be fine to do beforehand | 19:49 |
clarkb | since its mostly decoupled from zuul (it reads the subunit2sql db) | 19:49 |
pabelanger | k | 19:49 |
pabelanger | rebooting now | 19:50 |
clarkb | ask.o.o is up and patched but not serving data properly yet | 19:50 |
clarkb | I see there are some manage processes running for askbot though | 19:50 |
clarkb | [Wed Jan 10 19:50:48.367125 2018] [mpm_event:error] [pid 2155:tid 140109368838016] AH00485: scoreboard is full, not at MaxRequestWorkers | 19:50 |
clarkb | any idea what that means? | 19:50 |
pabelanger | I'd have to google | 19:51 |
clarkb | looks like it may indicate that apache worker processes are out to lunch | 19:52 |
clarkb | and it won't start more to handle incoming connections | 19:52 |
clarkb | I am going to try restarting apache | 19:52 |
pabelanger | going to see how we can do rolling updates on AFS servers | 19:53 |
pabelanger | I think we just need to do one at a time | 19:53 |
clarkb | seems to be working now | 19:54 |
clarkb | pabelanger: ya I think I documented the process in the system config docs | 19:54 |
pabelanger | Yup | 19:54 |
pabelanger | https://docs.openstack.org/infra/system-config/afs.html#no-outage-server-maintenance | 19:54 |
pabelanger | going to start with db servers | 19:54 |
fungi | got another one... groups-dev is AMD Opteron(tm) Processor 4170 HE | 19:55 |
clarkb | adns1.openstack.org can just be rebooted right? its not internet facing | 19:56 |
clarkb | corvus: ^ you good for me to do that one? | 19:56 |
fungi | i agree it should be safe to reboot at will, but will defer to corvus | 19:57 |
clarkb | I've pinged jbryce about kata's mail list server to make sure there isn't any crazy conflict with rebooting it | 19:58 |
fungi | groups-dev is looking good after its reboot, so moving on to groups.o.o | 19:58 |
fungi | groups.o.o is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz | 19:58 |
fungi | so groups-dev won't be a great validation for prod, but them's the breaks | 19:59 |
pabelanger | both AFS db servers are done | 20:01 |
clarkb | I'm actually going to take a break and get lunch | 20:02 |
clarkb | I think we are close to being ready to doing zuul though which is exciting | 20:02 |
AJaeger | clarkb: kata-dev mailing list is dead silent ;/ | 20:02 |
pabelanger | AFS file servers need a little more work, we'll have to stop mirror-update for sure, and likely wait until zuul is shutdown to be safe | 20:02 |
pabelanger | otherwise, we can migrate volumes | 20:02 |
clarkb | jbryce also confirms we can do kata list server whenever we are ready | 20:02 |
pabelanger | which takes time | 20:02 |
pabelanger | actually, afs02.dfw.openstack.org looks clear of RW volumes, so I can start work on that | 20:06 |
* clarkb lunches | 20:06 |
fungi | groups servers are done and looking functional in their webuis | 20:08 |
fungi | hrm... we have status and status01. is the former slated for deletion? | 20:09 |
fungi | i'll coordinate with the refstack/interop team on the refstack.o.o reboot | 20:11 |
pabelanger | fungi: yes, i believe status.o.o can be deleted but it was there even before we created the new status01.o.o server. I'm not sure why | 20:12 |
corvus | fungi, clarkb: reboot adns at will | 20:13 |
pabelanger | moving on to afs01.ord.openstack.org it also has not RW volumes | 20:13 |
pabelanger | no* | 20:14 |
corvus | pabelanger: i would just gracefully stop mirror-update before doing afs01.dfw. i wouldn't worry about zuul. | 20:14 |
pabelanger | k | 20:14 |
corvus | the wheel build jobs should be fine if interrupted. the only reason to take care with mirror-update is so we don't accidentally get in a state where we need to restart bandersnatch. but even that should be okay. | 20:15 |
corvus | (i mean, it's run out of space twice in the past few months and has not needed a full restart) | 20:15 |
pabelanger | yah, that is true | 20:16 |
pabelanger | AFK for a few to get some coffee before doing afs01.dfw | 20:18 |
fungi | i'm taking a quick look in rax to see if i can suss out what the situation with stackalytics.o.o is | 20:18 |
fungi | i want to say we decided to delete it and redeploy with xenial if/when we were ready to work on it further | 20:19 |
clarkb | ya some of the unreachable servers may have had failed migrations | 20:19 |
fungi | looks like stackalytics is in that boat | 20:20 |
fungi | nothing in its oob console, but i issued a ctrl-alt-del through there | 20:21 |
fungi | i have at times seen a similar issue i've also guessed may be migration-related, where the ip addresses for some instances cease getting routed until they're rebooted, but they're otherwise fine | 20:21 |
fungi | even happened to one of my personal debian systems in rax | 20:22 |
fungi | in that case i was able to work out a login through the oob console and inspect from the guest side | 20:22 |
fungi | and the interface was up and configured but tcpdump showed no packets arriving on it | 20:23 |
fungi | (my personal instances have password login through the console enabled, just not via ssh) | 20:24 |
clarkb | general warning kids have been sick the last couple days and larissa now feels sick so I will be doing patching from couch while entertaining 2 year olds | 20:24 |
dmsimard | did someone forget to delete the old eavesdrops machine ? | 20:24 |
dmsimard | fatal: [eavesdrop.openstack.org]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 2001:4800:7818:101:be76:4eff:fe05:31bf port 22: No route to host\r\n", "unreachable": true} | 20:24 |
fungi | dmsimard: i bet it was shutdown but not deleted until we were sure the replacement was good? | 20:24 |
fungi | i think stackalytics is exhibiting the same broken network behavior. console indicated it wasn't able to configure its network interface and eventually timed that out and booted anyway... remotely the instance is still unreachable. i'll move on to rebooting it through the api next | 20:27 |
fungi | huh, even after a nova reboot of the instance, stackalytics.o.o seems unable to bring up its network | 20:32 |
dmsimard | clarkb, fungi: the playbook works after some fiddling, the inventory is in /tmp/meltdown-spectre-inventory | 20:33 |
dmsimard | ~60 unpatched hosts still | 20:34 |
dmsimard | and 66 patched hosts | 20:34 |
fungi | dmsimard: does it check whether they're intel or amd? | 20:35 |
fungi | i've already turned up two amd hosts which won't show patched because kpti isn't enabled for amd cpus | 20:35 |
dmsimard | For some reason the facts gathering would hang at some point under the system-installed ansible version (2.2.1.0), I installed latest (2.4.2.0) in a venv and used that | 20:35 |
dmsimard | fungi: it doesn't but I can add that in -- can you show me an example ? | 20:36 |
fungi | dmsimard: so far wiki-upgrade-test and groups-dev | 20:37 |
fungi | dmsimard: you could check for GenuineIntel or not AuthenticAMD in /proc/cpuinfo | 20:37 |
fungi | whichever seems better | 20:37 |
dmsimard | hmm, my key isn't on wiki-upgrade-test | 20:37 |
dmsimard | looks like I can use the ansible_processor fact (since we're gathering facts anyway) let me fix that | 20:40 |
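A minimal sketch of the vendor check dmsimard describes (hypothetical task, not the actual playbook):

```yaml
# AMD CPUs are not vulnerable to Meltdown, so the kernel never enables KPTI
# there; record the vendor so those hosts aren't reported as unpatched.
- name: Flag Intel hosts for the KPTI check
  set_fact:
    is_intel: "{{ 'GenuineIntel' in ansible_processor }}"
```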
fungi | dmsimard: yeah, just don't apply puppet on it. it's basically frozen in time from before we started the effort to puppetize our mediawiki installation (which is what's on wiki-dev) | 20:45 |
fungi | and since it's perpetually in the emergency file until we get the puppet-deployed mediawiki viable, new shell accounts aren't getting created | 20:45 |
clarkb | dmsimard: its because ssh fails to a couple nodes I think. Newer ansible must handle that better and timeout | 20:46 |
pabelanger | I've added mirror-update.o.o to emergency file so I can edit crontab | 20:55 |
fungi | stackalytics.o.o isn't coming back no matter how hard i reboot it. we likely need to delete and rebuild anyway | 20:59 |
dmsimard | clarkb: that's what I thought as well. | 21:00 |
pabelanger | fungi: yah, sounds fair. needs to be moved to xenial anyways | 21:03 |
pabelanger | eavesdrop.o.o was likely me, yah it was shutdown and never deleted after eavesdrop01.o.o was good | 21:03 |
clarkb | ok back to computer now | 21:18 |
clarkb | I'm going to do adns now if it isn't already done | 21:18 |
clarkb | dmsimard: the storyboard hosts have been patched for ~2 hours but they show up in your unpatched list | 21:22 |
clarkb | dmsimard: was it run more than two hours ago or maybe its not quite accurate? | 21:22 |
clarkb | (I just checked both by hand and they are patched) | 21:22 |
clarkb | adns1 rebooting now | 21:24 |
clarkb | ok its up and reports being patched and named is running | 21:26 |
clarkb | anyone know what is up with status.o.o and status01.o.o? | 21:28 |
clarkb | looks like they are different hosts | 21:28 |
fungi | my bet is a not-yet-completed xenial replacement | 21:29 |
fungi | status01 is not in dns | 21:29 |
fungi | and status is still in production and on trusty | 21:29 |
clarkb | any objections to me patching and rebooting both of them? | 21:29 |
fungi | none from me. i'll see if i can figure out who was working on the status01 build | 21:30 |
clarkb | status hosts elasticsearch and bots | 21:30 |
clarkb | otherwise its mostly a redirect to zuulv3 status I think | 21:30 |
fungi | yeah | 21:30 |
clarkb | ok doing that now | 21:30 |
fungi | seems to have been ianw working on status01 during the sprint, according to channel logs | 21:31 |
clarkb | er not elasticsearch, elastic-recheck | 21:31 |
fungi | http://eavesdrop.openstack.org/irclogs/%23openstack-sprint/%23openstack-sprint.2017-12-12.log.html#t2017-12-12T00:41:19 | 21:32 |
clarkb | I think most of what is left is the zuul group, lists, puppetmaster and backup server | 21:33 |
clarkb | status* rebooting now | 21:34 |
clarkb | should probably do puppetmaster last | 21:34 |
clarkb | so that if it hiccups it does so after everything else is updated | 21:34 |
clarkb | oh and infracloud still needs updating on control plane | 21:35 |
fungi | zuul-dev and zuul.o.o also need updating but we should take care not to accidentally start services on the latter | 21:37 |
fungi | maybe it's time to talk again about deleting them | 21:37 |
clarkb | ya earlier we said we could delete zuul-dev | 21:38 |
clarkb | I'm doing the backup server now | 21:38 |
fungi | zuulv3.o.o has only one vhost so apache directs all requests there as the default vhost, and as such just making zuul a cname to zuulv3 will work without needing the old zuul serving a redirect | 21:38 |
clarkb | I'm up for deleting zuul.o.o too | 21:38 |
clarkb | but first backup server | 21:39 |
fungi | i tested with an /etc/hosts entry and my browser earlier, worked fine | 21:39 |
corvus | there is no zuul, only zuulv3 | 21:40 |
clarkb | fungi: I would say make the dns record update then and lets delete both old servers | 21:41 |
corvus | ++ | 21:41 |
clarkb | backup server rebooting now | 21:42 |
clarkb | ok backups server came back happy and has both filesystems (old and current) mounted | 21:44 |
fungi | i'll do the cname dance now | 21:44 |
fungi | for zuul->zuulv3 | 21:44 |
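The record change in zone-file notation, purely illustrative since this DNS is managed through the provider's interface:

```
; drop the old A/AAAA records for zuul and point the name at the v3 host
zuul.openstack.org.    300  IN  CNAME  zuulv3.openstack.org.
```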
clarkb | I guess I'm up now for lists reboots | 21:45 |
clarkb | I'm going to start with kata because lower traffic | 21:45 |
clarkb | lists.o.o is actually amd | 21:48 |
clarkb | so patching less urgent for it but we may as well | 21:48 |
fungi | ttl on zuul.o.o a and aaaa were 5 minutes | 21:49 |
fungi | cname from zuul to zuulv3 now added | 21:49 |
clarkb | lists.katacontainers.io is back up and happy. At least it has exim, mailman, and apache running | 21:50 |
clarkb | going to do lists.openstack.org now | 21:50 |
fungi | #status log deleted old zuul-dev.openstack.org instance | 21:51 |
openstackstatus | fungi: finished logging | 21:51 |
fungi | will delete the zuul.o.o instance shortly once dns changes have a chance to propagate | 21:51 |
clarkb | after zuul is cleaned up I'm going to rerun my check for what is patched | 21:51 |
clarkb | but I think we will be down to the zuulv3 group | 21:52 |
clarkb | (and infracloud) | 21:52 |
clarkb | oh and puppetmaster (but again do this one after zuulv3 group) | 21:52 |
clarkb | lists.o.o rebooting now | 21:53 |
clarkb | and is back, services look good, but no kpti because it is AMD | 21:54 |
clarkb | I'm going to make sure zm01-zm08 are patched now but not reboot them | 21:55 |
fungi | i think that makes three we know about now with amd cpus | 21:56 |
fungi | (wiki-upgrade-test, groups-dev and lists) | 21:56 |
clarkb | I think it has to do with the age of the server | 21:57 |
clarkb | since rax was all amd before they added the performance flavors | 21:57 |
clarkb | fungi: let me know when zuul.o.o is gone and I am gonna regen our list to make sure only the zuulv3 set is left | 21:58 |
clarkb | patching nodepool.o.o now too but not rebooting | 22:01 |
clarkb | and now patching static.o.o | 22:06 |
corvus | i just popped back to pick up another server and don't see one available. it looks like we're at the end of the list, where we reboot the zuul system all at once? | 22:09 |
fungi | yeah, i'm about to delete old zuul.o.o now | 22:09 |
fungi | which i think is the end | 22:10 |
fungi | we've made short work of all this | 22:10 |
fungi | (not me so much, but the rest of you) | 22:10 |
clarkb | corvus: yes I am generating a new list from ansible output to compare now | 22:10 |
clarkb | will be a sec | 22:10 |
corvus | cool, count me in for helping with that. i'd like to handle zuul.o.o itself, and patch in the repl in case i have time to debug memory stuff later in the week. | 22:11 |
clarkb | ok | 22:11 |
fungi | i hope you mean zuulv3.o.o | 22:11 |
fungi | since i'm about to push the button on deleting the old zuul.o.o | 22:11 |
corvus | fungi: yes. thinking ahead. :) | 22:11 |
fungi | dns propagation has had plenty of time now | 22:12 |
fungi | and http://zuul.openstack.org/ is giving me the status page just fine | 22:12 |
ianw | status01 is waiting for us to finish the node puppet stuff | 22:13 |
fungi | ianw: cool, thanks. i thought it might be something like that after looking at the log from the sprint channel | 22:14 |
fungi | #status log deleted old zuul.openstack.org instance | 22:14 |
clarkb | http://paste.openstack.org/show/642507/ up to date list | 22:14 |
openstackstatus | fungi: finished logging | 22:14 |
clarkb | I sorted twice so we can split up by status too | 22:15 |
clarkb | pabelanger: you still doing afs01? | 22:15 |
clarkb | I guess that one can happen out of band | 22:15 |
pabelanger | clarkb: yup, just waiting for mirror-update stop bandersnatch | 22:15 |
fungi | i'll go ahead and delete the stackalytics server too, it's not going to be recoverable without a trouble ticket, and that's a bridge too far for something we know is inherently broken and unused anyway | 22:16 |
clarkb | fungi: sounds good | 22:16 |
fungi | #status log deleted old stackalytics.openstack.org instance | 22:16 |
openstackstatus | fungi: finished logging | 22:16 |
pabelanger | I'll have to step away shortly for supper with family, are we thinking of doing zuulv3.o.o reboot shortly? | 22:17 |
clarkb | pabelanger: yes I think so | 22:17 |
clarkb | you still good to do the mirrors if we do it in the next few minutes? | 22:17 |
pabelanger | clarkb: yah, I can do reboot them | 22:17 |
clarkb | line 151 has the zuulv3 set. I've given corvus zuulv3.o.o and I've taken zuul mergers | 22:18 |
clarkb | I'll put pabelanger on the mirrors | 22:18 |
clarkb | that leaves static and nodepool | 22:18 |
fungi | i'm around to help for the next little bit, but it's going to be lateish here soon so i likely won't be around later if something crops up | 22:18 |
clarkb | fungi: ^ | 22:18 |
fungi | i'll take static | 22:18 |
fungi | and nodepool unless someone else wants to grab that | 22:18 |
corvus | this one we should announce -- should we go ahead and put statusbot on that? | 22:18 |
fungi | big concern with static.o.o is making sure it doesn't fsck /srv/static/logs | 22:19 |
fungi | does touching /fastboot still work in this day and age? | 22:19 |
clarkb | corvus: ya why don't we do that | 22:19 |
clarkb | fungi: I'm not sure, it is a trusty node right? | 22:19 |
fungi | also, full agree on status alert this | 22:19 |
fungi | yeah, static.o.o is trusty | 22:20 |
corvus | status notice The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:21 |
corvus | ^? | 22:21 |
clarkb | corvus: +1 | 22:21 |
fungi | lgtm | 22:21 |
corvus | #status notice The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:21 |
*** corvus is now known as jeblair | 22:21 | |
jeblair | #status notice The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:22 |
openstackstatus | jeblair: sending notice | 22:22 |
clarkb | fungi: supposedly setting the sixth column in fstab to 0 will prevent fscking | 22:22 |
*** jeblair is now known as corvus | 22:22 | |
pabelanger | ansible -i /etc/ansible/hosts/openstack 'mirror01*' -m command -a "reboot" | 22:22 |
pabelanger | is the command I'll use to reboot mirrors | 22:22 |
corvus | oh, now i spot the typo in that | 22:22 |
clarkb | fungi: http://www.man7.org/linux/man-pages/man5/fstab.5.html | 22:22 |
fungi | clarkb: thanks, done. that's a more durable solution anyway, i should have thought of it | 22:22 |
corvus | re-equeued. something about a horse i assume. | 22:22 |
-openstackstatus- NOTICE: The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:23 | |
fungi | historically anyway, creating a /fastboot file only exempted it from running fsck for the next reboot, and then the secure or single-user runlevel initscripts would delete it at boot | 22:23 |
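A sketch of the fstab change being described; the device and filesystem here are illustrative, only the final pass field matters:

```
# <device>     <mountpoint>       <type>  <options>  <dump>  <pass>
/dev/xvdb1     /srv/static/logs   ext4    defaults   0       0
# a pass value of 0 exempts the filesystem from boot-time fsck
```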
clarkb | ok I've double checked the zm0* servers have up to date packages. I think I am ready | 22:24 |
clarkb | (everyone else should probably double check too before we turn off zuul) | 22:24 |
fungi | checking mine now | 22:24 |
openstackstatus | jeblair: finished sending notice | 22:24 |
corvus | i'm ready and waiting on fungi and pabelanger to indicate they are ready | 22:25 |
pabelanger | yes, ready | 22:25 |
corvus | waiting on fungi | 22:25 |
clarkb | I think order is grab queues, stop zuul, reboot everything but zuul, reboot zuul after static, etc come back up | 22:25 |
fungi | okay, all set to reboot mine | 22:25 |
corvus | clarkb: yep. after fungi is ready, i will grab + stop, then let you know to proceed | 22:26 |
fungi | and yes, that plan sounds correct | 22:26 |
fungi | corvus: start at will | 22:26 |
corvus | zuul is stopped -- you may proceed | 22:26 |
fungi | thanks, rebooting static and nodepool now | 22:26 |
pabelanger | okay, doing mirrors | 22:26 |
clarkb | zm0* have been rebooted, waiting for them to return to us | 22:27 |
fungi | will the launchers need to have anything done to reconnect to zk on nodepool.o.o? | 22:27 |
clarkb | fungi: I don't think so | 22:27 |
clarkb | it should retry if it fails iirc | 22:27 |
clarkb | similar to how gear works | 22:28 |
clarkb | (but we should check) | 22:28 |
clarkb | fungi: oh also we may need to make sure the old nodepool process doesn't start? | 22:28 |
clarkb | fungi: should just have zk running on nodepool.o.o | 22:28 |
fungi | nodepool has booted already | 22:28 |
fungi | i'll go check what's running | 22:28 |
clarkb | all zuul mergers report kpti, checking them for daemons | 22:28 |
dmsimard | clarkb: looking for storyboard re: playbook | 22:29 |
fungi | zookeeper is running, nodepool is not | 22:29 |
fungi | so should be okay | 22:29 |
clarkb | all 8 zuul mergers are running a zuul merger process I think my side is good | 22:29 |
fungi | static.o.o is up and reporting correct kernel security | 22:29 |
dmsimard | clarkb: ok I need to fix ubuntu 14.04 vs 16.04 | 22:30 |
dmsimard | storyboard have not yet been updated | 22:30 |
clarkb | dmsimard: they should be the same | 22:30 |
clarkb | dmsimard: and they have been updated | 22:30 |
dmsimard | 14.04 doesn't have the kaiser flag in /proc/cpuinfo, 16.04 does | 22:30 |
pabelanger | okay, just waiting for inap, all other mirrors show good for reboot. Checking apache now | 22:30 |
pabelanger | inap is also good for reboot | 22:31 |
clarkb | dmsimard: ansible -i /etc/ansible/hosts/openstack 'logstash-worker0*' -m shell -a "dmesg | grep 'Kernel/User page tables isolation: enabled'" is apparently the most reliable way to check | 22:31 |
clarkb | dmsimard: as all the flag in cpuinfo means is that the kernel has detected the insecure cpu not that it has necessarily enabled pti | 22:31 |
dmsimard | clarkb: yeah but that doesn't work for centos/rhel :( | 22:31 |
clarkb | pabelanger: good for reboot meaning reboot is complete and the dmesg content is there? | 22:31 |
clarkb | dmsimard: ya I know rhel is the only distro it doesn't work on from the ones I have sampled | 22:31 |
clarkb | but checking cpuinfo doesn't tell you if pti is enabled | 22:32 |
clarkb | it only tells you that the kernel knows its cpu is insecure | 22:32 |
pabelanger | clarkb: yes, just validating AFS is working properly now | 22:32 |
dmsimard | /boot/config-3.13.0-139-generic:CONFIG_PAGE_TABLE_ISOLATION=y is probably safe | 22:32 |
pabelanger | so far, issue with vexxhost mirror | 22:32 |
clarkb | dmsimard: that also doesn't tell you pti is enabled | 22:32 |
clarkb | dmsimard: only that support for it was compiled in | 22:32 |
dmsimard | damn it | 22:32 |
clarkb | (this is why the dmesg check is important it is positive confirmation from the kernel that it is ptiing) | 22:32 |
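Pulling this exchange together, a small sketch of what actually confirms PTI versus what only hints at it (the inventory path and host pattern are the ones quoted above; adjust for the hosts being checked):

```
# Positive confirmation: the kernel prints this only when page table isolation is active
ansible -i /etc/ansible/hosts/openstack 'logstash-worker0*' -m shell \
  -a "dmesg | grep 'Kernel/User page tables isolation: enabled'"

# Weaker signals -- useful context, but not proof that PTI is enabled:
grep -o cpu_insecure /proc/cpuinfo                             # kernel sees a vulnerable CPU
grep CONFIG_PAGE_TABLE_ISOLATION "/boot/config-$(uname -r)"    # support compiled in, could still be off
```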
mgagne | pabelanger: what is the reboot about? kernel for meltdown? | 22:32 |
clarkb | mgagne: yes | 22:33 |
corvus | clarkb, fungi: i believe i'm waiting only on pabelanger at this point, correct? | 22:33 |
clarkb | corvus: that is my understanding yes | 22:33 |
fungi | yes, everything good on my end | 22:33 |
corvus | pabelanger: i'm idle if you need me to jump on a mirror | 22:33 |
pabelanger | corvus: yes, vexxhost please | 22:33 |
pabelanger | it is not serving up AFS | 22:33 |
pabelanger | I am checking others still | 22:33 |
corvus | ack | 22:34 |
corvus | pabelanger: seems up now: http://mirror01.ca-ymq-1.vexxhost.openstack.org/ubuntu/lists/ | 22:34 |
clarkb | mgagne: we are doing all of our VM kernels today (and if I don't run out of time the control plane of infracloud) | 22:35 |
fungi | `ls /afs/openstack.org/docs` returns content for me on mirror01.ca-ymq-1.vexxhost | 22:35 |
mgagne | clarkb: cool, just wanted to check if it was for spectre | 22:35 |
clarkb | mgagne: I'm not aware of any spectre patches yet | 22:35 |
pabelanger | corvus: yes, confirmed. Thanks | 22:35 |
fungi | 329 entries to be exact | 22:35 |
clarkb | mgagne: unfortunately | 22:35 |
mgagne | clarkb: ok, we are on the same page then | 22:36 |
pabelanger | okay, all mirrors are rebooted, dmesg confirmed and apache running | 22:36 |
corvus | mgagne: from what i read from gkh, that's probably next week. and the week after. and so on forever. :| | 22:36 |
clarkb | corvus: we'll get really good at rebooting :) | 22:36 |
fungi | until we get redesigned cpus deployed everywhere | 22:36 |
mgagne | ¯\_(ツ)_/¯ | 22:36 |
clarkb | corvus: I think that means you are good to patch zuulv3 and reboot | 22:36 |
fungi | and discover whatever new class of bugs they introduce | 22:36 |
corvus | cool, proceeding with zuulv3.o.o. i expect it to start on boot. | 22:37 |
clarkb | corvus: note I didn't prepatch zuulv3 | 22:37 |
corvus | clarkb: i did | 22:37 |
clarkb | since python had been crashing there I didn't want to do anything early | 22:37 |
clarkb | cool | 22:37 |
corvus | host is up | 22:38 |
corvus | zuul is querying all gerrit projects | 22:38 |
corvus | zuul-web did not start | 22:38 |
corvus | or rather, it seems to have started and crashed without error? | 22:39 |
pabelanger | clarkb: okay if I step away now? Have some guests over this evening | 22:39 |
clarkb | pabelanger: yup | 22:39 |
pabelanger | great, good luck all | 22:39 |
clarkb | pabelanger: enjoy, and thanks for the help! | 22:39 |
fungi | thanks pabelanger! | 22:40 |
clarkb | pabelanger: oh can you tldr what needs to be done with afs? | 22:40 |
clarkb | pabelanger: we can finish that up while you have dinner :) | 22:40 |
corvus | submitting cat jobs now | 22:40 |
clarkb | I guess it's wait for mirror-update to stop doing things | 22:40 |
clarkb | then reboot the afs server | 22:40 |
corvus | mergers seem to be running them | 22:40 |
fungi | and after that, puppetmaster | 22:40 |
clarkb | fungi: ++ | 22:40 |
corvus | i've restarted zuul-web | 22:41 |
clarkb | looks like a couple changes have enqueued according to status | 22:42 |
corvus | according to grafana we're at 10 executors and 18 mergers which is the full complement | 22:42 |
clarkb | and jobs are running | 22:42 |
corvus | okay, i'll re-enqueue now | 22:43 |
fungi | 43fa686e-12a4-4c51-ad3b-d613e2417ff3 claims to be named "ethercalc01" | 22:43 |
fungi | but is not the real slim shady | 22:43 |
ianw | fungi: that could be our in-progress ethercalc ... again waiting for nodejs | 22:43 |
clarkb | fungi: oh I don't think the real one is in my listings either | 22:43 |
ianw | i think fricker had one up for testing last year | 22:43 |
fungi | 93b2b91f-7d01-442b-8dff-96a53088654a is actual ethercalc01 | 22:44 |
clarkb | fungi: it might be a good idea to regenerate the openstack inventory cache file? | 22:44 |
clarkb | since a few servers have been deleted that are showing up in there | 22:44 |
fungi | so it's in the inventory, just tracked by uuid since there are two of them | 22:44 |
clarkb | corvus: console streaming is working | 22:44 |
fungi | should we delete the in-progress one before clearing and regenerating the inventory cache? | 22:44 |
ianw | fungi / clarkb: see https://review.openstack.org/#/c/529186/ ... may be related? | 22:45 |
clarkb | fungi: the in progress ethercalc? it's probably fine to keep it but just patch it? | 22:45 |
fungi | should we delete status01 and the nonproduction ethercalc01 duplicate for now, since we'd want to test bootstrapping them from scratch again anyway? | 22:45 |
ianw | fungi: yep | 22:46 |
clarkb | fungi: thinking about it though, that may be simplest | 22:46 |
mordred | infra-root: oh, well. I somehow missed that we were doing work over here in incident (and quite literally spent the day wondering why everyone was so quiet) | 22:46 |
clarkb | ++ to cleaning up | 22:46 |
fungi | mordred: i hope you got a lot done in the quiet! | 22:46 |
clarkb | mordred: maybe you want to help mwhahaha debug jobs? | 22:46 |
mordred | what, if anything, can I do to make myself useful? | 22:46 |
mordred | clarkb: kk. will do | 22:46 |
ianw | fungi: do you want me to do that, if you have other things in flight? | 22:46 |
clarkb | mordred: the bulk of patching is done and unless you really want to do reboots on what's left we probably have it under control? | 22:47 |
fungi | ianw: feel free but i'm basically idle for the moment which is why i got to looking at the odd entries | 22:47 |
clarkb | infracloud was mostly ignored today but I think I got most/all of the hypervisors done there last night | 22:47 |
ianw | fungi: ok, well if you've got all the id's there ready to go might be quicker for you | 22:47 |
clarkb | fungi: mirror-update looks idle now, maybe you want to do the remaining afs server? | 22:47 |
fungi | will do | 22:48 |
clarkb | in that case I can do afs server :) | 22:48 |
fungi | go for it, i'll work on ethercalc01 upgrade and the deletions for dupes | 22:48 |
clarkb | cool | 22:48 |
clarkb | corvus: and to confirm afs01.dfw can just be rebooted now that mirror-update is happy? | 22:49 |
corvus | clarkb: yep, should be fine. | 22:51 |
clarkb | mordred: http://logs.openstack.org/80/532680/1/check/openstack-tox-py27/2b32f22/ara/result/1771aec0-9c03-40bd-837e-4aca16e1ec88/ the fail there in run.yaml caused a post failure | 22:51 |
clarkb | mordred: so that may be part of it | 22:51 |
fungi | i've deleted status01 and the shadow of ethercalc01, cleared the inventory cache, and am updating the real ethercalc01 now | 22:51 |
corvus | clients should fail over to afs02.dfw | 22:51 |
clarkb | updating packages on afs01.dfw now | 22:52 |
clarkb | and rebooting it now | 22:53 |
fungi | ethercalc01 is done and seems to be none the worse | 22:53 |
clarkb | afs01.dfw is back up and running now and has the patch in place according to dmesg | 22:56 |
* clarkb reads docs on how to check its volume are happy | 22:56 | |
fungi | i'll make sure the updated packages are on puppetmaster (very carefully taking note first of what it wants to upgrade) | 22:56 |
corvus | all changes re-enqueued | 22:57 |
clarkb | corvus: vos listvldb looks good to me | 22:57 |
clarkb | anything else I should check before reenabling puppet on mirror-update? | 22:58 |
corvus | should we send a second status notice, or was the first sufficient? | 22:58 |
fungi | upgraded packages on puppetmaster, keeping old configs | 22:58 |
fungi | i think the first was fine, that was brief enough of an outage | 22:58 |
fungi | puppetmaster is ready for its reboot when we're all ready | 22:58 |
corvus | clarkb: 'listvol' is probably better for this -- i think it consults the fileserver itself, not just the database | 22:58 |
clarkb | corvus: cool will do that now | 22:58 |
mordred | clarkb: that's a normal test failure - so yah, we probably need to make our post-run things a little more resilient | 22:59 |
clarkb | and they all show online | 22:59 |
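A sketch of the two AFS checks being compared here: `vos listvldb` reports what the volume location database believes, while `vos listvol` asks the fileserver itself, which is the stronger check after a fileserver reboot (the server name and `-localauth` are assumptions about how it was run):

```
# Database view: volumes the VLDB thinks live on this fileserver
vos listvldb -server afs01.dfw.openstack.org -localauth

# Fileserver view: volumes actually attached; healthy ones show as On-line
vos listvol -server afs01.dfw.openstack.org -localauth
```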
clarkb | mordred: ya not sure if mwhahaha's things were all due to that weird error reporting though | 22:59 |
mordred | clarkb: I'm guessing that'll be "zomg there is no .testrepository" - which is normal when a patch causes tests to not be able to import | 22:59 |
clarkb | mordred: just noticed it may be a cause of lots of post failures | 22:59 |
mordred | clarkb: indeed. | 22:59 |
clarkb | fungi: I think I'm ready for puppetmaster when everyone else is | 22:59 |
mordred | clarkb: I'll work up a fix for it in any case - it's definitely sub-optimal | 22:59 |
clarkb | fungi: i can reenable puppet on mirror-update after the reboot | 23:00 |
fungi | sure | 23:00 |
clarkb | ze10 is apparently not patched? | 23:00 |
clarkb | (that can happen after puppetmaster) | 23:01 |
fungi | okay, last call for objections before i reboot puppetmaster.o.o | 23:01 |
fungi | and here goes | 23:01 |
fungi | i can ssh into puppetmaster.o.o again now | 23:02 |
clarkb | as can I | 23:03 |
fungi | Kernel/User page tables isolation: enabled | 23:03 |
fungi | should be all set | 23:03 |
clarkb | I'm going to remove mirror-update from the emergency file now | 23:03 |
fungi | sounds good | 23:03 |
clarkb | that's done | 23:03 |
clarkb | now to look into ze10 | 23:03 |
clarkb | ok it looks like the dmesg buffer rolled over | 23:04 |
clarkb | we don't have the initial boot messages, but the kernel version lgtm so I'm going to trust that it was done | 23:04 |
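With the dmesg ring buffer rolled over, the fallback used here is comparing the running kernel against what is installed on disk; a rough sketch assuming Ubuntu's linux-image packaging:

```
uname -r                                              # kernel actually running
dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'  # kernel images installed
# If the running version matches the patched package and the host rebooted
# after it was installed, it is reasonable to trust the patch is in effect.
```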
fungi | yeah, that'll happen | 23:05 |
clarkb | going to generate a new list then I think it likely infracloud is what is left | 23:05 |
fungi | i guess the noise from bwrap floods dmesg pretty thoroughly | 23:06 |
fungi | ahh, oom killer running wild on ze10 actually | 23:07 |
clarkb | corvus: jeblairtest01 is the only server that is reachable and not patched | 23:09 |
clarkb | infra-root http://paste.openstack.org/show/642516/ | 23:10 |
* clarkb looks into some of those unreachable servers | 23:10 | |
fungi | hopefully there are fewer unreachables after the stuff i deleted earlier | 23:13 |
clarkb | yup just 5 now. | 23:13 |
fungi | oh, apps-dev should be able to go away | 23:13 |
fungi | i'll delete it | 23:13 |
fungi | eavesdrop is in a stopped state, pending deletion after we're sure eavesdrop01 is good (per pabelanger) | 23:13 |
clarkb | ianw: https://etherpad.openstack.org/p/infra-meltdown-patching line 173 is I think the old backup server? | 23:14 |
clarkb | ianw: is that something you are able to confirm (and did you want to delete it?) | 23:14 |
fungi | #status log deleted old apps-dev.openstack.org server | 23:14 |
openstackstatus | fungi: finished logging | 23:14 |
fungi | infra-root: any objections to deleting the old (trusty) eavesdrop.o.o server? | 23:15 |
dmsimard | fungi: pabelanger mentioned earlier it was him who kept it undeleted just in case | 23:15 |
dmsimard | I don't believe there's anything wrong with the new eavesdrop | 23:15 |
fungi | data was on a cinder volume, so unless you stuck something in your homedir it's not like there's anything on there anyway | 23:15 |
mordred | no objections here | 23:16 |
clarkb | fungi: if you are confident it can go away then I'm fine with it | 23:16 |
ianw | clarkb: yeah, the old one, think it's fine to delete it now, will do | 23:16 |
clarkb | I am putting these servers on the etherpad fwiw | 23:16 |
fungi | #status log deleted old eavesdrop.openstack.org server | 23:16 |
openstackstatus | fungi: finished logging | 23:16 |
fungi | kdc02 was supplanted by kdc04 right? | 23:17 |
clarkb | fungi: ya there was a change for that too | 23:17 |
fungi | pretty sure i reviewed (maybe approved?) that | 23:17 |
clarkb | looks like it's already been deleted so stale inventory cache? | 23:17 |
clarkb | unless it's not in dfw | 23:17 |
fungi | shouldn't be stale inventory cache | 23:17 |
clarkb | ah yup it's in ORD | 23:18 |
fungi | ord | 23:18 |
fungi | just looked in the cache | 23:18 |
clarkb | its state is shut off | 23:18 |
fungi | which makes sense, and explains the unreachable | 23:18 |
fungi | so we're safe to delete that instance as well? | 23:19 |
clarkb | I think so | 23:19 |
fungi | and odsreg, i'm pretty sure we checked with ttx and he said it was clear for deletion too | 23:19 |
clarkb | ya I seem to recall that | 23:19 |
fungi | #status log deleted old kdc02.openstack.org server | 23:19 |
openstackstatus | fungi: finished logging | 23:20 |
dmsimard | clarkb: retrying the inventory after fixing 14.04/16.04 and adding AMD in | 23:20 |
clarkb | fungi: you don't happen to have that logged in your irc logs do you? I can't find it (I don't keep long term logs and just rebooted) | 23:20 |
clarkb | dmsimard: thanks | 23:20 |
fungi | clarkb: that's what i'm hunting down now | 23:20 |
dmsimard | sorry about what little I could do today, I've been fighting other things | 23:20 |
dmsimard | I'll help after dinner | 23:21 |
clarkb | I think we are just about to the real fun. INFRACLOUD! | 23:21 |
dmsimard | should probably drain nodepool | 23:23 |
dmsimard | and then yolo it | 23:23 |
ianw | so my test xenial build is -109 generic, what got added over 108? | 23:24 |
clarkb | ianw: they fixed booting problems :) | 23:24 |
fungi | ianw: now with less crashing | 23:24 |
clarkb | ianw: so anything that is already on 108 is probably fine as long as unattended upgrades is working and updates to 109 before the next reboot | 23:24 |
clarkb | 109 is not required to be secure aiui | 23:24 |
ianw | ok, cool, thanks | 23:25 |
clarkb | dmsimard: I'm not even sure we have to drain nodepool. I got the hypervisors done last night (the ones I could access) and now it's just the control plane and making sure we've done due diligence to do any other hypervisors that might still be alive but not responding to puppetmaster.o.o | 23:26 |
clarkb | compute000.vanilla.ic.openstack.org for example I can connect to from home but not puppetmaster | 23:26 |
clarkb | compute030.vanilla.ic.openstack.org accepts ssh connections then immediately kills them | 23:27 |
clarkb | alright I'm going to try collecting data on what needs help in infracloud | 23:30 |
*** rlandy is now known as rlandy|bbl | 23:37 | |
clarkb | infra-root http://paste.openstack.org/show/642518/ that is infracloud | 23:45 |
clarkb | we need to figure out which hypervisor is running the mirror but we should be able to reboot any other node | 23:45 |
clarkb | let's save baremetal00 for last as it's the bastion for that env | 23:45 |
clarkb | (similar to how we did puppetmaster last in our control plane) | 23:45 |
* clarkb figures out where the mirror is running | 23:46 | |
clarkb | compute039.vanilla.ic.openstack.org is where mirror01.regionone.vanilla.openstack.org is running and it has been patched so we don't have to worry about that | 23:49 |
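One way to map an instance to its hypervisor with admin credentials, roughly the lookup this step required (exact column names and client behaviour can vary by release):

```
# Run with infracloud admin credentials sourced
openstack server show mirror01.regionone.vanilla.openstack.org \
  -c 'OS-EXT-SRV-ATTR:hypervisor_hostname' -f value
```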
clarkb | anyone want to do compute012.vanilla.ic.openstack.org? I'm going to start digging into compute005.vanilla.ic.openstack.org and why it is not working | 23:49 |
*** SergeyLukjanov has quit IRC | 23:53 | |
*** SergeyLukjanov has joined #openstack-infra-incident | 23:54 | |
fungi | pretty sure the hypervisor hosts for both mirror instances were already done, because i had to explicitly boot them both earlier today | 23:56 |
fungi | i can give compute012.vanilla.ic a shot at an upgrade | 23:56 |
clarkb | looking at the nova-manage service list and the ironic node-list I think we have diverged a bit between what is working and what is expected to be working | 23:57 |
clarkb | I think what we should likely do is do our best to recover nodes like compute005 but if they don't come back we can disable them in nova then remove them from the inventory file | 23:57 |
clarkb | I'm going to quickly make sure all of the patched hypervisors + 012 are the only ones that nova thinks it can use | 23:57 |
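A sketch of the "disable them in nova" cleanup for hypervisors that never come back, so the scheduler stops considering them; the host name and disable reason are illustrative:

```
# Mark the dead compute node's service as disabled
openstack compute service set --disable \
  --disable-reason "meltdown patching: host unreachable" \
  compute005.vanilla.ic.openstack.org nova-compute

# Then confirm what nova still believes is alive (XXX marks unreachable services)
nova-manage service list
```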
fungi | compute012.vanilla.ic is taking a while to reach, seems like | 23:58 |
clarkb | the networking there is so screwed up | 23:59 |
clarkb | compute000 for example, puppetmaster can apparently hit it now | 23:59 |
clarkb | the XXX nodes in nova-manage service list are the ones that are unreachable | 23:59 |