clarkb | ugh ubuntu has kernels up for xenial now but not for trusty unless you use trusty with hardware enablement kernels | 00:01 |
---|---|---|
clarkb | which btw is not their default trusty server kernel | 00:01 |
clarkb | infra-root ^ | 00:02 |
mordred | clarkb: AWESOME | 00:03 |
clarkb | I'm going to patch my local fileserver then will probably start doing logstash workers | 00:04 |
clarkb | as those are probably a good canary and easily rebuilt | 00:04 |
clarkb | (I had hoped to start with infracloud, maybe we update infracloud to the hwe kernel?) | 00:04 |
corvus | clarkb: where's the info about hwe? | 00:06 |
clarkb | corvus: https://wiki.ubuntu.com/Kernel/LTSEnablementStack is the generic doc on it | 00:07 |
corvus | clarkb: right, i mean where did you see that they aren't doing stuff for hwe? | 00:08 |
corvus | er for not-hwe? | 00:08 |
clarkb | oh | 00:08 |
clarkb | https://wiki.ubuntu.com/SecurityTeam/KnowledgeBase/SpectreAndMeltdown and the usn site | 00:09 |
clarkb | currently they have only published for the 14.04 + hwe kernel | 00:09 |
clarkb | well that and linux-aws | 00:09 |
clarkb | heh now https://usn.ubuntu.com/usn/usn-3524-1/ is there | 00:11 |
clarkb | so maybe just lag involved | 00:11 |
clarkb | in that case I will likely do infracloud next to pick up ^ | 00:11 |
clarkb | my local fileserver doesn't seem to see the new kernels yet | 00:14 |
clarkb | going to sort that out | 00:15 |
corvus | clarkb: yeah, on a trusty machine i have, i only see linux-image-3.13.0-137-generic not 139 | 00:17 |
clarkb | I'm pointed at security.ubuntu.com and I see the package in the package list if I browse to it | 00:19 |
* clarkb tries again | 00:19 |
clarkb | ya apt not seeing it for some reason | 00:20 |
*** rlandy has quit IRC | 00:24 | |
clarkb | logstash-worker01 does see the new kernel though | 00:24 |
clarkb | and I have the same set of xenial-security entries in my apt sources list so this has to be cdn or caching etc | 00:26 |
clarkb | corvus: I think the InRelease and Release files I am being served are older than the packages in the repo | 00:31 |
corvus | i need to go offline for a bit to patch my workstation and the system my irc client is on... i may be gone for a bit, but will rejoin asap | 00:32 |
clarkb | ok | 00:33 |
clarkb | someone completely unrelated to openstack says xenial's 4.4.0-108.131 kernel panics on their workstation | 00:34 |
clarkb | so uh be prepared for that I guess :/ | 00:34 |
corvus | that's great | 00:34 |
*** rosmaita_ has quit IRC | 00:36 | |
clarkb | I might just turn my fileserver back off and patch it when things get better :/ | 00:37 |
corvus | clarkb: so... permanently off? | 00:43 |
corvus | okay, really signing off now | 00:43 |
*** corvus has quit IRC | 00:44 | |
clarkb | I'm going to do logstash-worker01 by hand really quickly to get a xenial canary up | 00:46 |
clarkb | aha I think I sorted out my local fileserver issue | 00:50 |
clarkb | I was using the old hwe | 00:50 |
clarkb | due to chipset issues | 00:50 |
clarkb | ok logstash-worker01 came up | 00:53 |
clarkb | it does not have the bugs: cpu_insecure entry in the cpuinfo file but does have [ 0.000000] Kernel/User page tables isolation: enabled in dmesg | 00:53 |
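For reference, a minimal verification sketch along the lines clarkb describes above (assuming an Ubuntu guest that has booted the patched kernel; the exact cpuinfo bug entry varies by kernel version):

```bash
# Positive confirmation that KPTI is active comes from the boot log:
dmesg | grep 'Kernel/User page tables isolation'
# Some kernels also list the CPU bug in /proc/cpuinfo, e.g. cpu_insecure:
grep -m1 '^bugs' /proc/cpuinfo
```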
clarkb | https://etherpad.openstack.org/p/infra-meltdown-patching | 01:02 |
clarkb | using compute000.vanilla.ic.openstack.org as infracloud canary | 01:09 |
clarkb | just manually going to do that one too to figure out how to check if kpti is enabled and if instances will even boot up | 01:09 |
fungi | you should see it mentioned in dmesg at least | 01:10 |
clarkb | ya on xenial it was in dmesg | 01:11 |
clarkb | details like this going into the etherpad above | 01:12 |
fungi | Kernel/User page tables isolation: enabled | 01:12 |
fungi | that's what i was looking for on my systems | 01:12 |
clarkb | yup | 01:14 |
clarkb | arg sudoers contents changed on infracloud nodes | 01:14 |
clarkb | fungi: http://paste.openstack.org/show/641745/ should be safe to accept the new version there ya? | 01:19 |
*** tristanC has joined #openstack-infra-incident | 01:21 | |
clarkb | I'm going ahead and choosing the new version so this problem doesn't persist as I believe it to be largely equivalent to the old version | 01:23 |
fungi | yeah, it shouldn't cause any problems | 01:25 |
fungi | and puppet will undo it anyway | 01:25 |
clarkb | oh do we manage the top level sudoers file? I kept our unattended upgrades config because I know puppet manages that one | 01:26 |
clarkb | and am keeping the nova conf | 01:26 |
clarkb | arg sudoers update did end up breaking sudo for me | 01:28 |
clarkb | I'm going to kick.sh compute000.vanilla.ic.openstack.org so I can reboot it | 01:28 |
clarkb | so I think we want to keep our versions of all those config files on trusty nodes | 01:28 |
*** rosmaita has joined #openstack-infra-incident | 01:29 | |
fungi | your account was in the sudo group, right? | 01:30 |
ianw | so sorry do we need repos for xenial, or is it just update & reboot? | 01:30 |
clarkb | yes I am in the sudo group | 01:30 |
ianw | i can go through the ze nodes and do that, since they're tolerant to dying | 01:30 |
clarkb | ianw: it should be just update and reboot. see https://etherpad.openstack.org/p/infra-meltdown-patching | 01:31 |
clarkb | fungi: but it wants my password now | 01:31 |
fungi | weird that it broke sudo for you. it's not clear to me what in that diff would have | 01:31 |
fungi | oh | 01:31 |
fungi | OH | 01:31 |
fungi | it dropped NOPASSWD | 01:31 |
clarkb | and puppet master is having a hard time ssh'ing to compute000.vanilla.ic.openstack.org now (I am ssh'd in though) | 01:32 |
fungi | i totally overlooked that since the %sudo stanza also moved | 01:32 |
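For context, the line being lost is the passwordless entry for the sudo group; a representative before/after sketch (not the exact infra sudoers file):

```
# What the infra accounts rely on (representative sketch):
%sudo   ALL=(ALL:ALL) NOPASSWD:ALL
# The package maintainer's proposed sudoers instead ships the stock stanza:
%sudo   ALL=(ALL:ALL) ALL
```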
clarkb | ya oh well | 01:32 |
clarkb | as soon as I can get puppetmaster to ssh in it should fix it | 01:32 |
clarkb | but uh thats not working | 01:33 |
ianw | ok, trying on ze01 and see what happens | 01:33 |
fungi | was there a sudo package upgrade or something? | 01:34 |
fungi | i guess you're upgrading more than just kernel and microcode packages | 01:34 |
clarkb | fungi: yes | 01:34 |
clarkb | fungi: ya I was doing full dist-upgrades | 01:34 |
clarkb | assuming unattended upgrades were mostly keeping up with the other stuff | 01:34 |
clarkb | which seems to be the case on xenial at least | 01:34 |
clarkb | oh wow compute000 doesn't permit root ssh | 01:35 |
clarkb | so I may have toasted it and it needs to be rebuilt :/ | 01:35 |
clarkb | sshd_config was not something I was asked to update | 01:35 |
fungi | i bet another package upgraded (openssh-server?) and disabled root login | 01:35 |
clarkb | oh wait there is a specific whitelist for puppetmaster on compute000 | 01:36 |
clarkb | in sshd_config so why isn't this working | 01:36 |
clarkb | ssh -vvv seems to indicate it is a tcp problem | 01:37 |
clarkb | oh wait no that was just slow but it connected via tcp | 01:37 |
clarkb | it might just be slow as molasses | 01:38 |
ianw | how long does the dkms take ... | 01:38 |
ianw | ohh, it's doing afs | 01:38 |
clarkb | yup confirmed just slow as can be | 01:40 |
clarkb | I'm going to manually replace the sudoers file as I doubt ansible + puppet will ever be happy with this slow connection | 01:40 |
ianw | ok, ze01 done | 01:42 |
ianw | i will let it run for a while as i get some lunch, and if all good, i'll go through the rest and update | 01:42 |
clarkb | also based on how slow this connection is I'm not entirely convinced we should be using infracloud in its current state | 01:42 |
clarkb | ianw: sounds good, thanks | 01:42 |
ianw | i'll also stop the executor and clear out /var/lib/zuul/builds on each host, as it seems there's a little cruft in there | 01:43 |
clarkb | compute000 is rebooting now | 01:43 |
clarkb | it will probably take 10 minutes to come back up so I can check on it | 01:44 |
clarkb | fungi: does dist-upgrade -y imply responding N (eg keep old version of conf file) when there are conflicts in packages? | 01:44 |
clarkb | fungi: since I think we do want N in all cases, but -y is yes not no, and N is the default | 01:45 |
clarkb | I'm going to put the elasticsearch cluster in the "don't worry about rebalancing indexes" mode then reboot the whole cluster at once | 01:46 |
clarkb | that's the best way of ripping off that bandaid I think | 01:46 |
fungi | clarkb: not really sure. if you use the noninteractive mode it will keep old configs | 01:46 |
clarkb | ok rebooting elasticsearch cluster now | 01:51 |
clarkb | elasticsearch is recovering shards now | 01:55 |
clarkb | fungi: --force-confold as a dpkg option is the magic there apparently | 01:59 |
fungi | yeah, you can do it that way too | 01:59 |
clarkb | fungi: `export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get -o Dpkg::Options::="--force-confold" dist-upgrade -y` how does that look? | 02:01 |
fungi | clarkb: remarkably similar to the syntax we're using somewhere in our automation | 02:02 |
fungi | and yes, should do what you want | 02:02 |
clarkb | looks like archive.ubuntu.com has the new kernel now | 02:03 |
clarkb | so don't need to update sources.list on infracloud | 02:03 |
clarkb | if compute001 comes out of this looking happy I'm going to ansible the above command across infracloud compute nodes and I guess start doing reboots? | 02:04 |
clarkb | I figure control plane is lesser concern and probably needs more eyeballs on it | 02:04 |
fungi | sounds good. i should be able to help tomorrow but it's getting late here | 02:05 |
clarkb | I also confirmed that trusty has the same dmesg check as xenial | 02:06 |
fungi | excellent | 02:08 |
clarkb | hrm rebooting broke neutron networking | 02:15 |
fungi | ouch | 02:16 |
clarkb | puppet is what sets that up | 02:16 |
clarkb | and since puppetmaster can't ssh to half these nodes... | 02:16 |
clarkb | ok 000 and 001 in vanilla are working with the manually run steps that puppet would normally run | 02:23 |
clarkb | apt-get update is running on those vanilla compute hosts that puppetmaster can ssh into | 02:23 |
clarkb | I think I will go ahead and upgrade those, reboot them, then fix networking, then use the dmesg check for finding those that were missed due to ssh issues | 02:24 |
*** mordred has quit IRC | 02:42 | |
*** mordred has joined #openstack-infra-incident | 02:44 | |
*** rosmaita has quit IRC | 02:58 | |
ianw | doing the executors now | 03:05 |
clarkb | I'm tracking the various states of infracloud things and it's not very pretty... | 03:06 |
clarkb | bunch of nodes can't be hit, at least one node has a ro / that appears to be due to medium errors | 03:06 |
clarkb | I'm gonna continue working through this but I'm beginning to think we might want to seriously consider not infraclouding anymore | 03:07 |
clarkb | I expect I'll be in a position to do mass reboots in an hour or so | 03:08 |
* clarkb finds dinner | 03:08 |
clarkb | that's cool, the bad disk/fs happened just now-ish | 03:19 |
clarkb | I guess I'll give it a reboot and see if it comes up and is patchable | 03:19 |
*** rosmaita has joined #openstack-infra-incident | 03:42 | |
clarkb | doing mass reboot of chocolate now | 03:59 |
clarkb | (it updated quicker than vanilla) | 03:59 |
clarkb | all the gate jobs are failing on a tox siblings thing anyways so I figure just go for it | 04:00 |
clarkb | (also zuul should retry anyways) | 04:00 |
clarkb | chocolate nodes are coming back and I am applying their network config | 04:07 |
clarkb | I expect that I will have all of the reachable chocolate compute hosts patched shortly | 04:07 |
clarkb | vanilla compute hosts are rebooting now as well | 04:10 |
*** rosmaita has quit IRC | 04:21 | |
clarkb | I think all infracloud computes that are reachable are patched | 04:40 |
clarkb | the chocolate cloud appears to be functioning too | 04:40 |
clarkb | but the vanilla cloud is somewhat ambiguous going by grafana data | 04:40 |
clarkb | TL;DR for those of you catching up in the morning. I think a good next step would be to generate some proper node lists and we can start going through them. What ianw and I did this evening was mostly using representative sets like logstash-workers and elasticsearch* and infracloud to get the ball rolling. I've tried to capture my notes on what I've done to get things updated properly | 04:55 |
clarkb | https://etherpad.openstack.org/p/infra-meltdown-patching has more infos | 04:56 |
clarkb | and with that I'm off to bed | 04:57 |
*** jeblair has joined #openstack-infra-incident | 05:21 | |
*** jeblair is now known as corvus | 05:21 | |
*** frickler has joined #openstack-infra-incident | 08:53 | |
*** rosmaita has joined #openstack-infra-incident | 13:04 | |
*** rlandy has joined #openstack-infra-incident | 13:34 | |
*** pabelanger has quit IRC | 14:56 | |
*** pabelanger has joined #openstack-infra-incident | 14:56 | |
-openstackstatus- NOTICE: Gerrit is being restarted due to slowness and to apply kernel patches | 14:58 | |
fungi | to also track in here, review.o.o is running the updated kernel as of a few minutes ago (thanks pabelanger!) | 15:13 |
pabelanger | Yes, seems things are working | 15:14 |
fungi | and frickler noticed the mirror in ic-choco was broken. apparently got shutdown when the hosts were rebooted for newer kernels, and i guess nova doesn't start instances automagically when the host comes back online? | 15:14 |
fungi | i should check vanilla too, now that i think about it | 15:14 |
fungi | i can't ssh to it either. not a good sign | 15:15 |
pabelanger | yah, I think you need a nove.conf setting to turn instances back on | 15:15 |
pabelanger | nova.conf* | 15:15 |
fungi | confirmed, it too was in a shutoff state | 15:15 |
pabelanger | ack | 15:16 |
*** dmsimard has joined #openstack-infra-incident | 15:19 | |
frickler | I haven't dug into how we set up infra cloud yet, but it should be "resume_guests_state_on_host_boot = true" in nova.conf/DEFAULT | 15:35 |
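A sketch of the change frickler describes, assuming it lands in the compute hosts' nova.conf (option name as given above):

```ini
[DEFAULT]
# Restart guests that were running before the hypervisor rebooted
resume_guests_state_on_host_boot = true
```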
corvus | oh, it looks like we're up to 4.4.0-109 this morning | 15:38 |
corvus | it was 108 yesterday | 15:38 |
corvus | (for xenial) | 15:39 |
corvus | and trusty+hwe | 15:41 |
corvus | xenial: https://usn.ubuntu.com/usn/usn-3522-3/ | 15:41 |
corvus | apparently fixes a regression that caused "a few systems failed to boot successfully" | 15:42 |
corvus | so we *probably* don't have to redo the 108 hosts -- if they booted. | 15:42 |
clarkb | fungi: oh sorry I completely spaced on the mirrors | 16:36 |
fungi | no problem. they're working now | 16:43 |
clarkb | is review.o.o the only thing that has been patched since last night? | 16:44 |
fungi | corvus: yeah, i say we just make sure the updated kernels get installed so the next time we reboot they'll also hopefully not fail to reboot | 16:44 |
clarkb | I'll probably work on generating server lists as next step and adding that into the etherpad | 16:44 |
fungi | clarkb: yes, we missed the opportunity to patch zuulv3.o.o when it got restarted for excessive memory use | 16:44 |
dmsimard | btw I mentioned yesterday a playbook to get an inventory of patched and unpatched nodes, I cleaned what I had and pushed it here as WIP: https://github.com/dmsimard/ansible-meltdown-spectre-inventory | 16:45 |
dmsimard | need to afk | 16:45 |
fungi | clarkb: oh, also i saw that debian oldstable/jessie got meltdown patches if you want to update your personal system | 16:45 |
clarkb | fungi: oh thanks for that heads up. I may actually start with that and patch my irc client server | 16:46 |
clarkb | ya gonna get that taken care of, will be afk for a bit | 16:48 |
*** clarkb has quit IRC | 16:53 | |
*** clarkb1 has joined #openstack-infra-incident | 16:54 | |
fungi | helo clarkb1 | 16:56 |
clarkb1 | hello, now to figure out my nick situation | 16:56 |
fungi | indeed! | 16:56 |
fungi | i see webkitgtk+ just released mitigation patches along with an advisory, so maybe chrome and safari will be a little safer shortly? | 16:56 |
*** rosmaita has quit IRC | 16:58 | |
*** clarkb1 is now known as clarkb | 16:58 | |
clarkb | ok I think that's irc mostly sorted out. Need to join a couple dozen more channels but that can happen later | 17:00 |
clarkb | dmsimard: you had to afk, but how work in progress is that playbook? should we avoid running it against say infracloud or the rest of infra? | 17:01 |
clarkb | does sshing to compute030.vanilla.ic.openstack.org close the connection just as soon as you actually login? | 17:02 |
clarkb | or rather does it do that for anyone else? | 17:02 |
corvus | clarkb: yep | 17:02 |
clarkb | `ansible -i /etc/ansible/hosts/openstack '*:!git*.openstack.org' -m shell -a "dmesg | grep 'Kernel/User page tables isolation: enabled'"` is what I'm going to start with to generate a list of what is and what isn't patched | 17:04 |
clarkb | that excludes infracloud and excludes the centos git servers which should already be patched | 17:04 |
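A hypothetical variant of the ad-hoc check above that splits hosts into patched/unpatched output for list generation; the timeout and PATCHED/UNPATCHED markers are assumptions, not the exact commands used here:

```bash
ansible -i /etc/ansible/hosts/openstack '*:!git*.openstack.org' -T 30 -m shell \
  -a "dmesg | grep -q 'Kernel/User page tables isolation: enabled' \
      && echo PATCHED || echo UNPATCHED" | tee /tmp/meltdown-check.log
grep -B1 '^UNPATCHED' /tmp/meltdown-check.log   # hosts still needing attention
```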
clarkb | corvus: completely unreachable from local and puppetmaster, unreachable from puppetmaster but reachable from local, reachable but connection gets killed like for compute030, and sad hard drives plus RO / seems to be the rough set of different ways things are broken in infracloud | 17:06 |
pabelanger | yah, sad HDDs in infracloud are happening more often | 17:15 |
clarkb | vanilla also appears to be in worse shape | 17:16 |
clarkb | I think the chocolate servers are newer | 17:17 |
clarkb | I'm going to reorganize the etherpad a bit as I don't think the trusty vs xenial distinction matters much | 17:25 |
clarkb | ok I think https://etherpad.openstack.org/p/infra-meltdown-patching is fairly well organized now | 17:33 |
clarkb | I'm going to continue to pick off some of the easy ones like translate-dev etherpad-dev logstash.o.o subunit workers | 17:36 |
clarkb | but then after breakfast I probably should context switch back to infracloud and finish that up | 17:36 |
pabelanger | I can start picking up some hosts too | 17:38 |
fungi | i'm just about caught up with other stuff to the point where i can as well | 17:39 |
pabelanger | will do kerberos servers, since they fail over | 17:40 |
*** rosmaita has joined #openstack-infra-incident | 17:41 | |
pabelanger | okay, rebooting kcd01, run-kprop.sh worked as expected | 17:42 |
pabelanger | actually, let me confirm we have right kernel first | 17:42 |
pabelanger | linux-image-3.13.0-139-generic | 17:44 |
pabelanger | rebooting now | 17:44 |
clarkb | on trusty unattended upgrades pulled in the latest kernel | 17:45 |
clarkb | but ubuntu released a newer kernel for xenial 109 instead of 108 that addressed some booting issues that we should install on unpatched servers | 17:45 |
pabelanger | [ 0.000000] Kernel/User page tables isolation: enabled | 17:46 |
clarkb | (I'm still running an apt-get update and dist-upgrade per the etherpad on the servers I'm patching regardless) | 17:47 |
pabelanger | yah, just doing that on kdc04.o.o now | 17:47 |
pabelanger | which is xenial and got latest 109 kernel | 17:47 |
pabelanger | lists-upgrade-test.openstack.org | FAILED | 17:49 |
pabelanger | can we just delete that now? | 17:49 |
pabelanger | clarkb: corvus: ^ | 17:50 |
pabelanger | logstash-worker-test.openstack.org | FAILED I guess too | 17:50 |
clarkb | I have no idea what logstash-worker-test.o.o is | 17:52 |
clarkb | so I think it can be cleaned up | 17:52 |
pabelanger | working nb03.o.o and nb04.o.o | 17:52 |
clarkb | lists-upgrade-test was used to test the inplace upgrade of lists.o.o I don't think we need the server but corvus should confirm | 17:52 |
clarkb | fungi: maybe you want to do the wiki hosts as I think you understand their current situation best | 17:54 |
corvus | clarkb: confirmed -- lists- | 17:55 |
corvus | ha | 17:56 |
corvus | upgrade-test is not needed :) | 17:56 |
fungi | clarkb: yup, i'll get wiki-upgrade-test (a.k.a. wiki.o.o) and wiki-dev now | 17:57 |
pabelanger | ack, I'll clean up both in a moment | 17:58 |
pabelanger | should we wait until later this evening for nodepool-launchers or fine to reboot now? | 17:59 |
clarkb | I'm kinda leaning towards ripping the bandaid off on this one | 18:00 |
clarkb | and nodepool should handle that sanely | 18:00 |
pabelanger | Yah, launcher should bring nodes online again. nodepool.o.o (zookeeper) we'll need to do when we stop zuul | 18:01 |
clarkb | speaking of bandaids, just going to do codesearch since there isn't a good way of doing that one without an outage | 18:03 |
clarkb | maybe I should send email to the dev list first though | 18:03 |
clarkb | ya writing an email now | 18:04 |
pabelanger | I didn't see anybody on pbx.o.o, so I've rebooted it | 18:05 |
pabelanger | moving to nl01.o.o | 18:06 |
pabelanger | clarkb: okay, we ready for reboot of nl01? | 18:08 |
clarkb | I think so, that should be zero outage from a user perspective | 18:08 |
clarkb | I'm almost done with this email too | 18:09 |
pabelanger | okay, nl01 rebooted | 18:10 |
pabelanger | and back online | 18:10 |
clarkb | email sent | 18:11 |
clarkb | oh hey arm64 email I should read too | 18:11 |
fungi | the production wiki server is taking a while to boot | 18:12 |
fungi | may be doing a fsck... checking console now | 18:12 |
pabelanger | okay, both nodepool-launchers are done | 18:14 |
fungi | yeah, fsck in progress, ~1/3 complete now | 18:14 |
clarkb | codesearch is reindexing | 18:15 |
pabelanger | clarkb: how do we want to handle mirrors? No impact would be to launch new mirrors or disable provider in nodepool.yaml | 18:15 |
clarkb | pabelanger: ya considering how painful the last day or so has been for jobs we may want to be extra careful there | 18:16 |
fungi | replacing mirrors loses their warm cache | 18:16 |
clarkb | either boot new instances or do them when we do zuul | 18:16 |
pabelanger | fungi: yah, that too | 18:16 |
fungi | i vote do them when we do zuul scheduler | 18:16 |
pabelanger | okay, that works | 18:16 |
fungi | one outage to rule them all | 18:16 |
clarkb | wfm | 18:17 |
fungi | also, we may not have sufficient quota in some providers to stand up a second mirror without removing the first? | 18:17 |
*** ChanServ changes topic to "Meltdown patching | https://etherpad.openstack.org/p/infra-meltdown-patching | OpenStack Infra team incident handling channel" | 18:17 | |
clarkb | pabelanger: what we can do now though is run the update and dist upgrade steps | 18:18 |
clarkb | so that all we have to do when zuul is ready is reboot them | 18:18 |
pabelanger | fungi: I think we do, at least when I've built them we've had 2 online at once | 18:19 |
pabelanger | clarkb: sure | 18:19 |
pabelanger | k, doing grafana.o.o now | 18:21 |
corvus | what do you think about pulling the list of systems that need patching into an ansible inventory file, then run apt-get update && apt-get dist-upgrade on all of them? | 18:21 |
clarkb | that's annoying, hound wasn't actually started on codesearch01 (I've started it) | 18:22 |
clarkb | corvus: not a bad idea. I can regenerate my list so that it is up to date | 18:22 |
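A sketch of what corvus is proposing, reusing the dist-upgrade invocation worked out earlier; the inventory path and group name are hypothetical:

```bash
# /tmp/unpatched.ini: an [unpatched] group listing the hosts still on old kernels
ansible -i /tmp/unpatched.ini unpatched -b -m shell -a \
  "export DEBIAN_FRONTEND=noninteractive && apt-get update && \
   apt-get -o Dpkg::Options::='--force-confold' dist-upgrade -y"
```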
fungi | 3.13.0-139-generic is booted on wiki-upgrade-test (and the wiki is back online again) but dmesg doesn't indicate that page table isolation is in effect | 18:23 |
corvus | rebooting cacti02 | 18:23 |
clarkb | fungi: weird, does it show up in the kernel config as an enabled option? | 18:24 |
pabelanger | Hmm, grafana is trying to migrate something | 18:24 |
fungi | CONFIG_PAGE_TABLE_ISOLATION=y is in /boot/config-3.13.0-139-generic | 18:25 |
pabelanger | ah, new grafana package was pulled in | 18:25 |
corvus | meeting channels are idle, i will do eavesdrop now | 18:25 |
fungi | good call | 18:25 |
*** openstack has joined #openstack-infra-incident | 18:30 | |
*** ChanServ sets mode: +o openstack | 18:30 | |
corvus | supybot is not set to start on boot. and we call it meetbot-openstack in some places and openstack-meetbot others. :| | 18:30 |
corvus | i'm guessing something was missed in the systemd unit conversion. | 18:30 |
fungi | eep, last time i restarted it i used `sudo service openstack-meetbot restart` | 18:31 |
fungi | which _seemed_ to work | 18:31 |
corvus | fungi: i think they all go to the same place :) | 18:31 |
fungi | fun | 18:32 |
corvus | any reason why etherpad-dev was done but not etherpad? | 18:32 |
fungi | i'll do planet next | 18:32 |
corvus | hrm. apt-get dist-upgrade on etherpad only sees -137, not 139 | 18:33 |
fungi | pointed at an out-of-date mirror? | 18:34 |
clarkb | corvus: yes | 18:34 |
clarkb | corvus: we are using the etherpad to coordindate so didn't want to just reboot it | 18:34 |
clarkb | but maybe it is best to get that out of the way | 18:34 |
clarkb | I'm going to update the node list first though | 18:34 |
corvus | fungi: deb http://security.ubuntu.com/ubuntu trusty-security main restricted | 18:34 |
fungi | strange | 18:34 |
fungi | that's after apt-get update? | 18:34 |
corvus | anyone sucessfully done a trusty upgrade to 139 today? | 18:34 |
corvus | fungi: yep | 18:34 |
corvus | the ones i picked up were all xenial | 18:35 |
clarkb | corvus: yes etherpad-dev was trusty too and worked | 18:35 |
fungi | i just did wiki and wiki-dev and they're trusty | 18:35 |
clarkb | is it saying 137 is no longer needed? | 18:35 |
corvus | oh wait... | 18:35 |
clarkb | the trusty nodes largely got the new kernel from unattended upgrades last night | 18:35 |
corvus | clarkb: i think that's the case | 18:35 |
clarkb | and then the message says 137 is not needed anymore | 18:35 |
corvus | sorry :) | 18:36 |
clarkb | np | 18:36 |
fungi | right, unattended-upgrades will install new kernel packages and then e-mail you to let you know you need to reboot | 18:36 |
pabelanger | sinc we nolonger do HTTP on zuul-mergers, those could be rebooted with no impact? | 18:36 |
clarkb | everyone ok if I just delete the current list (you'll lose your annotations) and replace with current list? | 18:36 |
clarkb | or current as of a few minutes ago | 18:36 |
corvus | clarkb: wfm | 18:36 |
fungi | clarkb: fine with me | 18:36 |
fungi | though wiki-upgrade-test will show back up | 18:36 |
fungi | unless you're filtering out amd cpus | 18:37 |
pabelanger | corvus: ok here | 18:37 |
clarkb | fungi: I am not, but can make a note on that one | 18:37 |
corvus | clarkb: i'm ready to reboot when you're done etherpadding. | 18:37 |
clarkb | corvus: ready now | 18:37 |
corvus | pabelanger: why didn't your zk hosts show up as success? | 18:38 |
fungi | clarkb: more pointing out that there could be other instances in the same boat (however unlikely at this point) | 18:38 |
pabelanger | clarkb: I just did them, maybe clarkb's list is outdated? | 18:38 |
clarkb | ya could be a race in ansible running and me filtering | 18:39 |
clarkb | it takes a bit of time for ansible to go through the whole list | 18:39 |
clarkb | I haven't gotten a fully automated solution yet because ansible hangs on a couple hosts so never exits in order to do things like sorting | 18:39 |
corvus | etherpad is back up | 18:40 |
corvus | for that matter, the 2 i just did also showed up as failed | 18:40 |
corvus | any special handling for files02, or should we just rip off the bandaid? | 18:41 |
clarkb | we could deploy a new 01 and then delete 02 | 18:42 |
clarkb | but I'm not sure that is necessary, server should be up quickly and start serving data again | 18:42 |
clarkb | (also we lose all the cache data if we do that) | 18:42 |
corvus | clarkb: if we feel that a 1 minute outage is too much, i'd suggest we deploy 01 and *not* delete 02 :) | 18:43 |
corvus | i'll ask in #openstack-doc | 18:43 |
clarkb | sounds good | 18:43 |
clarkb | I got a response to my we are patching email pointing to a thing that indicates centos may not actually kpti if your cpu's don't have pcid | 18:44 |
clarkb | so I'm going to track that down | 18:44 |
corvus | also *jeepers* the dkms build is slow | 18:44 |
clarkb | "The performance impact of needing to fully flush the TLB on each transition is apparently high enough that at least some of the Meltdown-fixing variants I've read through (e.g. the KAISER variant in RHEL7/RHEL6 and their CentOS brethren) are not willing to take it. Instead, some of those variants appear to implicitly turn off the dual-page-table-per-process security measure if the processor | 18:46 |
clarkb | they are running on does not have PCID capability." | 18:46 |
clarkb | uhm | 18:46 |
fungi | huh | 18:46 |
corvus | how do we git a console these days? | 18:46 |
clarkb | git.o.o does not have pcid in its cpuinfo flags like my laptop does | 18:46 |
clarkb | corvus: log into the rax website, go to servers, open up the specific server page, then under actions there is open remote console iirc | 18:47 |
clarkb | pabelanger: corvus dmsimard this may explain why centos dmesg doesn't say kpti is enabled | 18:47 |
clarkb | because it isn't? the sysfs entry definitely seems to say it is there though | 18:47 |
corvus | okay, files02 is back up. i got nervous because it took about 2 minutes to reboot, which is very long. | 18:48 |
corvus | clarkb: also, we don't necessarily lose the afs cache -- it shouldn't be any worse than a typical volume release (which happens every 5min) | 18:49 |
clarkb | oh right docs releases often | 18:49 |
corvus | and it caches the content, and just checks that the files are up to date after invalidation. | 18:50 |
corvus | clarkb: link to what you're looking at? | 18:52 |
clarkb | corvus: https://groups.google.com/forum/m/#!topic/mechanical-sympathy/L9mHTbeQLNU | 18:53 |
clarkb | reading other things on the internet the best check for kpti enablement is the message in dmesg | 18:54 |
clarkb | that is a clear indication it was enabled and is being used post boot | 18:54 |
clarkb | but ya apparently red hat did the sysfs thing which isn't standard so harder to get info on that and determine if that is a clear "it is on" message | 18:54 |
clarkb | so I'm fairly convinced that it is working as expected on the ubuntu hosts we are patching (because we are checking dmesg) | 18:55 |
clarkb | dmsimard: do you have red hat machines where the cpu does have pcid in the cpu flags? if so does dmesg output the isolation enabled message in that case? | 18:56 |
clarkb | pabelanger: ^ | 18:56 |
clarkb | I'm going to do review-dev now | 18:56 |
fungi | as soon as i confirm paste is okay, i'm going to do openstackid-dev and then openstackid.org | 18:57 |
fungi | though i'll give the foundation staff and tipit devs a heads up on the latter | 18:58 |
fungi | lodgeit on paste01 seems unhappy | 18:59 |
pabelanger | clarkb: I'll look and see on some internal things | 19:01 |
pabelanger | but don't have much access myself | 19:01 |
pabelanger | once bandersnatch is done running, I'll reboot mirror-update.o.o | 19:04 |
fungi | looks like the openstack-paste systemd unit raced unbound at boot and failed to start because it couldn't resolve the dns name of the remote database. do we have a good long-term solution to that? | 19:04 |
clarkb | anyone know why apps-dev and stackalytics still show up in our inventory? | 19:07 |
clarkb | I'm going to make a list of services that hsould happen with zuul's reboot | 19:09 |
pabelanger | have they been deleted? | 19:09 |
clarkb | pabelanger: haven't checked yet, I guess if the instances are still up but have no dns that could explain it | 19:09 |
pabelanger | ah, I wasn't aware we deleted stackalytics.o.o | 19:10 |
clarkb | well I don't know that we did but both are unreachable | 19:13 |
clarkb | do the zuul mergers need to happen with zuul scheduler? | 19:13 |
clarkb | or can we do them one at a time and get them done ahead of time? | 19:13 |
clarkb | corvus: ^ | 19:13 |
corvus | clarkb: stopping them while running a job may cause a merger failure to be reported. other than that, it's okay. | 19:14 |
pabelanger | maybe we can add graceful stop for next time :) | 19:16 |
pabelanger | corvus: also, ns01 and ns02, anything specific we need to do for nameservers? Assume rolling restarts are okay | 19:17 |
clarkb | I'm inclined to do them as part of zuul restart then since we've alrady had a rough day with test stability yesterday | 19:17 |
corvus | pabelanger: rolling should be fine | 19:17 |
pabelanger | okay, I'll look at them now | 19:17 |
corvus | pabelanger: a graceful stop for mergers would be great :) | 19:18 |
clarkb | translate.o.o is failing to update package indexes, something related to a nova agent repo? I wonder if we made this xenial instance during a period of funny rax images | 19:22 |
clarkb | I'm looking into it | 19:22 |
fungi | ick | 19:22 |
pabelanger | clarkb: question asked about PCID on internal chat, hopefully know more soon | 19:23 |
clarkb | deb http://ppa.launchpad.net/josvaz/nova-agent-rc/ubuntu xenial main | 19:23 |
clarkb | is in a sources.list.d file | 19:24 |
clarkb | that looks like a use customers as guinea pigs solution | 19:24 |
clarkb | cloud-support-ubuntu-rax-devel-xenial.list is a different file name in /etc/apt/sources.list.d but is empty | 19:25 |
clarkb | I'm thinking I will just comment out the deb line there and move on | 19:25 |
clarkb | any objections to that? | 19:25 |
pabelanger | clarkb: a PCID cpu with RHEL7, doesn't show isolation in dmesg | 19:27 |
clarkb | pabelanger: ok, thanks for checking | 19:27 |
clarkb | so I guess other than rtfs'ing we treat the sysfs entry saying pti is enabled as good enough? | 19:28 |
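For the RHEL/CentOS hosts, a check sketch based on the debugfs knobs Red Hat documented for these kernels (paths are an assumption here; requires debugfs mounted and root):

```bash
# On RHEL/CentOS 7 kernels of this vintage, PTI state was exposed through
# debugfs rather than a dmesg line; 1 means page table isolation is active.
cat /sys/kernel/debug/x86/pti_enabled
# Companion Spectre knobs shipped in the same kernels, where present:
cat /sys/kernel/debug/x86/ibrs_enabled /sys/kernel/debug/x86/ibpb_enabled 2>/dev/null
```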
clarkb | commenting out the ppa allowed translate01 to update. I am rebooting it now | 19:29 |
pabelanger | okay, all mirrors upgraded, just need reboots when we are ready | 19:30 |
pabelanger | I don't think we have any clients connected to firehose01.o.o yet? | 19:32 |
clarkb | nothing production like. I think mtreinish uses it randomly | 19:33 |
pabelanger | kk, will reboot now then | 19:33 |
clarkb | zuul-dev can just be rebooted right? | 19:33 |
clarkb | corvus: ^ | 19:33 |
pabelanger | or even deleted? | 19:33 |
corvus | clarkb, pabelanger: either of those | 19:34 |
clarkb | well if it can be deleted then that is probably preferable | 19:34 |
clarkb | doing ask-staging now | 19:34 |
pabelanger | okay, I can propose patches to remove zuul-dev from system-config | 19:35 |
fungi | clarkb: yes please to be commenting out the rax nova-agent guinea pig sources.list entry | 19:39 |
pabelanger | remote: https://review.openstack.org/511986 Remove zuulv3-dev.o.o | 19:40 |
pabelanger | remote: https://review.openstack.org/532615 Remove zuul-dev.o.o | 19:40 |
pabelanger | should be able to land both of them | 19:40 |
fungi | openstackid server restarts are done, after coordinating with foundation/tipit people | 19:40 |
fungi | going to do groups-dev and groups next | 19:41 |
pabelanger | graphite.o.o is ready for a reboot, if we want to do it | 19:41 |
clarkb | fungi: ask-staging looked happy anything I should know about before doing ask.o.o? | 19:41 |
fungi | pabelanger: after restarting firehose01, please take a moment to test connecting to it per the instructions in the system-config docs to confirm it's streaming again | 19:41 |
pabelanger | fungi: sure, I can do that now | 19:42 |
fungi | clarkb: assuming the webui is up, i think it's safe to proceed with prod | 19:42 |
clarkb | pabelanger: if we do graphite with zuul then we won't lose job stats or nodepool stats | 19:42 |
clarkb | pabelanger: but I'm not sure that is mission critical its probably fine to have a small gap in the data and reboot it | 19:42 |
clarkb | fungi: ok doing ask now then | 19:42 |
*** rlandy has quit IRC | 19:43 | |
pabelanger | fungi: firehose.o.o looks good | 19:44 |
pabelanger | clarkb: okay, will reboot now | 19:45 |
clarkb | ask.o.o is rebooting now | 19:45 |
fungi | cool | 19:45 |
fungi | thanks for checking pabelanger! | 19:45 |
pabelanger | np! | 19:45 |
*** rlandy has joined #openstack-infra-incident | 19:48 | |
pabelanger | clarkb: what about health.o.o, reboot when we do zuul? | 19:49 |
clarkb | health should be fine to do beforehand | 19:49 |
clarkb | since its mostly decoupled from zuul (it reads the subunit2sql db) | 19:49 |
pabelanger | k | 19:49 |
pabelanger | rebooting now | 19:50 |
clarkb | ask.o.o is up and patched but not serving data properly yet | 19:50 |
clarkb | I see there are some manage processes running for askbot though | 19:50 |
clarkb | [Wed Jan 10 19:50:48.367125 2018] [mpm_event:error] [pid 2155:tid 140109368838016] AH00485: scoreboard is full, not at MaxRequestWorkers | 19:50 |
clarkb | any idea what that means? | 19:50 |
pabelanger | I'd have to google | 19:51 |
clarkb | looks like it may indicate that apache worker processes are out to lunch | 19:52 |
clarkb | and it won't start more to handle incoming connections | 19:52 |
clarkb | I am going to try restarting apache | 19:52 |
pabelanger | going to see how we can do rolling updates on AFS servers | 19:53 |
pabelanger | I think we just need to do one at a time | 19:53 |
clarkb | seems to be working now | 19:54 |
clarkb | pabelanger: ya I think I documented the process in the system config docs | 19:54 |
pabelanger | Yup | 19:54 |
pabelanger | https://docs.openstack.org/infra/system-config/afs.html#no-outage-server-maintenance | 19:54 |
pabelanger | going to start with db servers | 19:54 |
fungi | got another one... groups-dev is AMD Opteron(tm) Processor 4170 HE | 19:55 |
clarkb | adns1.openstack.org can just be rebooted right? its not internet facing | 19:56 |
clarkb | corvus: ^ you good for me to do that one? | 19:56 |
fungi | i agree it should be safe to reboot at will, but will defer to corvus | 19:57 |
clarkb | I've pinged jbryce about kata's mail list server to make sure there isn't any crazy conflict with rebooting it | 19:58 |
fungi | groups-dev is looking good after its reboot, so moving on to groups.o.o | 19:58 |
fungi | groups.o.o is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz | 19:58 |
fungi | so groups-dev won't be a great validation for prod, but them's the breaks | 19:59 |
pabelanger | both AFS db servers are done | 20:01 |
clarkb | I'm actually going to take a break and get lunch | 20:02 |
clarkb | I think we are close to being ready to doing zuul though which is exciting | 20:02 |
AJaeger | clarkb: kata-dev mailing list is dead silent ;/ | 20:02 |
pabelanger | AFS file servers need a little more work, we'll have to stop mirror-update for sure, and likely wait until zuul is shutdown to be safe | 20:02 |
pabelanger | otherwise, we can migrate volumes | 20:02 |
clarkb | jbryce also confirms we can do kata list server whenever we are ready | 20:02 |
pabelanger | which takes time | 20:02 |
pabelanger | actually, afs02.dfw.openstack.org looks clear of RW volumes, so I can start work on that | 20:06 |
* clarkb lunches | 20:06 |
fungi | groups servers are done and looking functional in their webuis | 20:08 |
fungi | hrm... we have status and status01. is the former slated for deletion? | 20:09 |
fungi | i'll coordinate with the refstack/interop team on the refstack.o.o reboot | 20:11 |
pabelanger | fungi: yes, i believe status.o.o can be deleted but it was there even before we created the new status01.o.o server. I'm not sure why | 20:12 |
corvus | fungi, clarkb: reboot adns at will | 20:13 |
pabelanger | moving on to afs01.ord.openstack.org it also has not RW volumes | 20:13 |
pabelanger | no* | 20:14 |
corvus | pabelanger: i would just gracefully stop mirror-update before doing afs01.dfw. i wouldn't worry about zuul. | 20:14 |
pabelanger | k | 20:14 |
corvus | the wheel build jobs should be fine if interrupted. the only reason to take care with mirror-update is so we don't accidentally get in a state where we need to restart bandersnatch. but even that should be okay. | 20:15 |
corvus | (i mean, it's run out of space twice in the past few months and has not needed a full restart) | 20:15 |
pabelanger | yah, that is true | 20:16 |
pabelanger | AFK for a few to get some coffee before doing afs01.dfw | 20:18 |
fungi | i'm taking a quick look in rax to see if i can suss out what the situation with stackalytics.o.o is | 20:18 |
fungi | i want to say we decided to delete it and redeploy with xenial if/when we were ready to work on it further | 20:19 |
clarkb | ya some of the unreachable servers may have had failed migrations | 20:19 |
fungi | looks like stackalytics is in that boat | 20:20 |
fungi | nothing in its oob console, but i issued a ctrl-alt-del through there | 20:21 |
fungi | i have at times seen a similar issue i've also guessed may be migration-related, where the ip addresses for some instances cease getting routed until they're rebooted, but they're otherwise fine | 20:21 |
fungi | even happened to one of my personal debian systems in rax | 20:22 |
fungi | in that case i was able to work out a login through the oob console and inspect from the guest side | 20:22 |
fungi | and the interface was up and configured but tcpdump showed no packets arriving on it | 20:23 |
fungi | (my personal instances have password login through the console enabled, just not via ssh) | 20:24 |
clarkb | general warning kids have been sick the last couple days and larissa now feels sick so I will be doing patching from couch while entertaining 2 year olds | 20:24 |
dmsimard | did someone forget to delete the old eavesdrops machine ? | 20:24 |
dmsimard | fatal: [eavesdrop.openstack.org]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 2001:4800:7818:101:be76:4eff:fe05:31bf port 22: No route to host\r\n", "unreachable": true} | 20:24 |
fungi | dmsimard: i bet it was shutdown but not deleted until we were sure the replacement was good? | 20:24 |
fungi | i think stackalytics is exhibiting the same broken network behavior. console indicated it wasn't able to configure its network interface and eventually timed that out and booted anyway... remotely the instance is still unreachable. i'll move on to rebooting it through the api next | 20:27 |
fungi | huh, even after a nova reboot of the instance, stackalytics.o.o seems unable to bring up its network | 20:32 |
dmsimard | clarkb, fungi: the playbook works after some fiddling, the inventory is in /tmp/meltdown-spectre-inventory | 20:33 |
dmsimard | ~60 unpatched hosts still | 20:34 |
dmsimard | and 66 patched hosts | 20:34 |
fungi | dmsimard: does it check whether they're intel or amd? | 20:35 |
fungi | i've already turned up two amd hosts which won't show patched because kpti isn't enabled for amd cpus | 20:35 |
dmsimard | For some reason the facts gathering would hang at some point under the system-installed ansible version (2.2.1.0), I installed latest (2.4.2.0) in a venv and used that | 20:35 |
dmsimard | fungi: it doesn't but I can add that in -- can you show me an example ? | 20:36 |
fungi | dmsimard: so far wiki-upgrade-test and groups-dev | 20:37 |
fungi | dmsimard: you could check for GenuineIntel or not AuthenticAMD in /proc/cpuinfo | 20:37 |
fungi | whichever seems better | 20:37 |
dmsimard | hmm, my key isn't on wiki-upgrade-test | 20:37 |
dmsimard | looks like I can use the ansible_processor fact (since we're gathering facts anyway) let me fix that | 20:40 |
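A minimal sketch of the vendor check dmsimard describes (hypothetical task, not the actual playbook):

```yaml
# AMD CPUs are not vulnerable to Meltdown, so the kernel never enables KPTI
# there; record the vendor so those hosts aren't reported as unpatched.
- name: Flag Intel hosts for the KPTI check
  set_fact:
    is_intel: "{{ 'GenuineIntel' in ansible_processor }}"
```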
fungi | dmsimard: yeah, just don't apply puppet on it. it's basically frozen in time from before we started the effort to puppetize our mediawiki installation (which is what's on wiki-dev) | 20:45 |
fungi | and since it's perpetually in the emergency file until we get the puppet-deployed mediawiki viable, new shell accounts aren't getting created | 20:45 |
clarkb | dmsimard: its because ssh fails to a couple nodes I think. Newer ansible must handle that better and timeout | 20:46 |
pabelanger | I've added mirror-update.o.o to emergency file so I can edit crontab | 20:55 |
fungi | stackalytics.o.o isn't coming back no matter how hard i reboot it. we likely need to delete and rebuild anyway | 20:59 |
dmsimard | clarkb: that's what I thought as well. | 21:00 |
pabelanger | fungi: yah, sounds fair. needs to be moved to xenial anyways | 21:03 |
pabelanger | eavesdrop.o.o was likely me, yah it was shutdown and never deleted after eavesdrop01.o.o was good | 21:03 |
clarkb | ok back to computer now | 21:18 |
clarkb | I'm going to do adns now if it isn't already done | 21:18 |
clarkb | dmsimard: the storyboard hosts have been patched for ~2 hours but they show up in your unpatched list | 21:22 |
clarkb | dmsimard: was it run more than two hours ago or maybe its not quite accurate? | 21:22 |
clarkb | (I just checked both by hand and they are patched) | 21:22 |
clarkb | adns1 rebooting now | 21:24 |
clarkb | ok its up and reports being patched and named is running | 21:26 |
clarkb | anyone know what is up with status.o.o and status01.o.o? | 21:28 |
clarkb | looks like they are different hosts | 21:28 |
fungi | my bet is a not-yet-completed xenial replacement | 21:29 |
fungi | status01 is not in dns | 21:29 |
fungi | and status is still in production and on trusty | 21:29 |
clarkb | any objections to me patching and rebooting both of them? | 21:29 |
fungi | none from me. i'll see if i can figure out who was working on the status01 build | 21:30 |
clarkb | status hosts elasticsearch and bots | 21:30 |
clarkb | otherwise its mostly a redirect to zuulv3 status I think | 21:30 |
fungi | yeah | 21:30 |
clarkb | ok doing that now | 21:30 |
fungi | seems to have been ianw working on status01 during the sprint, according to channel logs | 21:31 |
clarkb | er not elasticsearch, elastic-recheck | 21:31 |
fungi | http://eavesdrop.openstack.org/irclogs/%23openstack-sprint/%23openstack-sprint.2017-12-12.log.html#t2017-12-12T00:41:19 | 21:32 |
clarkb | I think most of what is left is the zuul group, lists, puppetmaster and backup server | 21:33 |
clarkb | status* rebooting now | 21:34 |
clarkb | should probably do puppetmaster last | 21:34 |
clarkb | so that if it hiccups it does so after everything else is updated | 21:34 |
clarkb | oh and infracloud still needs updating on control plane | 21:35 |
fungi | zuul-dev and zuul.o.o also need updating but we should take care not to accidentally start services on the latter | 21:37 |
fungi | maybe it's time to talk again about deleting them | 21:37 |
clarkb | ya earlier we said we could delete zuul-dev | 21:38 |
clarkb | I'm doing the backup server now | 21:38 |
fungi | zuulv3.o.o has only one vhost so apache directs all requests there as the default vhost, and as such just making zuul a cname to zuulv3 will work without needing the old zuul serving a redirect | 21:38 |
clarkb | I'm up for deleting zuul.o.o too | 21:38 |
clarkb | but first backup server | 21:39 |
fungi | i tested with an /etc/hosts entry and my browser earlier, worked fine | 21:39 |
corvus | there is no zuul, only zuulv3 | 21:40 |
clarkb | fungi: I would say make the dns record update then and lets delete both old servers | 21:41 |
corvus | ++ | 21:41 |
clarkb | backup server rebooting now | 21:42 |
clarkb | ok backups server came back happy and has both filesystems (old and current) mounted | 21:44 |
fungi | i'll do the cname dance now | 21:44 |
fungi | for zuul->zuulv3 | 21:44 |
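The record change in zone-file notation, purely illustrative since this DNS is managed through the provider's interface:

```
; drop the old A/AAAA records for zuul and point the name at the v3 host
zuul.openstack.org.    300  IN  CNAME  zuulv3.openstack.org.
```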
clarkb | I guess I'm up now for lists reboots | 21:45 |
clarkb | I'm going to start with kata because lower traffic | 21:45 |
clarkb | lists.o.o is actually amd | 21:48 |
clarkb | so patching less urgent for it but we may as well | 21:48 |
fungi | ttl on zuul.o.o a and aaaa were 5 minutes | 21:49 |
fungi | cname from zuul to zuulv3 now added | 21:49 |
clarkb | lists.katacontainers.io is back up and happy. At least it has exim, mailman, and apache running | 21:50 |
clarkb | going to do lists.openstack.org now | 21:50 |
fungi | #status log deleted old zuul-dev.openstack.org instance | 21:51 |
openstackstatus | fungi: finished logging | 21:51 |
fungi | will delete the zuul.o.o instance shortly once dns changes have a chance to propagate | 21:51 |
clarkb | after zuul is cleaned up I'm going to rerun my check for what is patched | 21:51 |
clarkb | but I think we will be down to the zuulv3 group | 21:52 |
clarkb | (and infracloud) | 21:52 |
clarkb | oh and puppetmaster (but again do this one after zuulv3 group) | 21:52 |
clarkb | lists.o.o rebooting now | 21:53 |
clarkb | and is back, services look good, but no kpti because it is AMD | 21:54 |
clarkb | I'm going to make sure zm01-zm08 are patched now but not reboot them | 21:55 |
fungi | i think that makes three we know about now with amd cpus | 21:56 |
fungi | (wiki-upgrade-test, groups-dev and lists) | 21:56 |
clarkb | I think it has to do with the age of the server | 21:57 |
clarkb | since rax was all amd before they added the performance flavors | 21:57 |
clarkb | fungi: let me know when zuul.o.o is gone and I am gonna regen our list to make sure only the zuulv3 set is left | 21:58 |
clarkb | patching nodepool.o.o now too but not rebooting | 22:01 |
clarkb | and now patching static.o.o | 22:06 |
corvus | i just popped back to pick up another server and don't see one available. it looks like we're at the end of the list, where we reboot the zuul system all at once? | 22:09 |
fungi | yeah, i'm about to delete old zuul.o.o now | 22:09 |
fungi | which i think is the end | 22:10 |
fungi | we've made short work of all this | 22:10 |
fungi | (not me so much, but the rest of you) | 22:10 |
clarkb | corvus: yes I am generating a new list from ansible output to compare now | 22:10 |
clarkb | will be a sec | 22:10 |
corvus | cool, count me in for helping with that. i'd like to handle zuul.o.o itself, and patch in the repl in case i have time to debug memory stuff later in the week. | 22:11 |
clarkb | ok | 22:11 |
fungi | i hope you mean zuulv3.o.o | 22:11 |
fungi | since i'm about to push the button on deleting the old zuul.o.o | 22:11 |
corvus | fungi: yes. thinking ahead. :) | 22:11 |
fungi | dns propagation has had plenty of time now | 22:12 |
fungi | and http://zuul.openstack.org/ is giving me the status page just fine | 22:12 |
ianw | status01 is waiting for us to finish the node puppet stuff | 22:13 |
fungi | ianw: cool, thanks. i thought it might be something like that after looking at the log from the sprint channel | 22:14 |
fungi | #status log deleted old zuul.openstack.org instance | 22:14 |
clarkb | http://paste.openstack.org/show/642507/ up to date list | 22:14 |
openstackstatus | fungi: finished logging | 22:14 |
clarkb | I sorted twice so we can split up by status too | 22:15 |
clarkb | pabelanger: you still doing afs01? | 22:15 |
clarkb | I guess that one can happen out of band | 22:15 |
pabelanger | clarkb: yup, just waiting for mirror-update stop bandersnatch | 22:15 |
fungi | i'll go ahead and delete the stackalytics server too, it's not going to be recoverable without a trouble ticket, and that's a bridge too far for something we know is inherently broken and unused anyway | 22:16 |
clarkb | fungi: sounds good | 22:16 |
fungi | #status log deleted old stackalytics.openstack.org instance | 22:16 |
openstackstatus | fungi: finished logging | 22:16 |
pabelanger | I'll have to step away shortly for supper with family, are we thinking of doing zuulv3.o.o reboot shortly? | 22:17 |
clarkb | pabelanger: yes I think so | 22:17 |
clarkb | you still good to do the mirrors if we do it in the next few minutes? | 22:17 |
pabelanger | clarkb: yah, I can do reboot them | 22:17 |
clarkb | line 151 has the zuulv3 set. I've given corvus zuulv3.o.o and I've taken zuul mergers | 22:18 |
clarkb | I'll put pabelanger on the mirrors | 22:18 |
clarkb | that leaves static and nodepool | 22:18 |
fungi | i'm around to help for the next little bit, but it's going to be lateish here soon so i likely won't be around later if something crops up | 22:18 |
clarkb | fungi: ^ | 22:18 |
fungi | i'll take static | 22:18 |
fungi | and nodepool unless someone else wants to grab that | 22:18 |
corvus | this one we should announce -- should we go ahead and put statusbot on that? | 22:18 |
fungi | big concern with static.o.o is making sure it doesn't fsck /srv/static/logs | 22:19 |
fungi | does touching /fastboot still work in this day and age? | 22:19 |
clarkb | corvus: ya why don't we do that | 22:19 |
clarkb | fungi: I'm not sure, it is a trusty node right? | 22:19 |
fungi | also, full agree on status alert this | 22:19 |
fungi | yeah, static.o.o is trusty | 22:20 |
corvus | status notice The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:21 |
corvus | ^? | 22:21 |
clarkb | corvus: +1 | 22:21 |
fungi | lgtm | 22:21 |
corvus | #status notice The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:21 |
*** corvus is now known as jeblair | 22:21 | |
jeblair | #status notice The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:22 |
openstackstatus | jeblair: sending notice | 22:22 |
clarkb | fungi: supposedly setting the sixth column in fstab to 0 will prevent fscking | 22:22 |
*** jeblair is now known as corvus | 22:22 | |
pabelanger | ansible -i /etc/ansible/hosts/openstack 'mirror01*' -m command -a "reboot" | 22:22 |
pabelanger | is the command I'll use to reboot mirrors | 22:22 |
corvus | oh, now i spot the typo in that | 22:22 |
clarkb | fungi: http://www.man7.org/linux/man-pages/man5/fstab.5.html | 22:22 |
fungi | clarkb: thanks, done. that's a more durable solution anyway, i should have thought of it | 22:22 |
corvus | re-equeued. something about a horse i assume. | 22:22 |
-openstackstatus- NOTICE: The zuul system is being restarted to apply security updates and will be offline for several minutes. It will be restarted and changes re-equeued; changes approved during the downtime will need to be rechecked or re-approved. | 22:23 | |
fungi | historically anyway, creating a /fastboot file only exempted it from running fsck for the next reboot, and then the secure or single-user runlevel initscripts would delete it at boot | 22:23 |
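A sketch of the fstab change being described; the device and filesystem here are illustrative, only the final pass field matters:

```
# <device>     <mountpoint>       <type>  <options>  <dump>  <pass>
/dev/xvdb1     /srv/static/logs   ext4    defaults   0       0
# a pass value of 0 exempts the filesystem from boot-time fsck
```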
clarkb | ok I've double checked the zm0* servers have up to date packages. I think I am ready | 22:24 |
clarkb | (everyone else should probably double check too before we turn off zuul) | 22:24 |
fungi | checking mine now | 22:24 |
openstackstatus | jeblair: finished sending notice | 22:24 |
corvus | i'm ready and waiting on fungi and pabelanger to indicate they are ready | 22:25 |
pabelanger | yes, ready | 22:25 |
corvus | waiting on fungi | 22:25 |
clarkb | I think order is grab queues, stop zuul, reboot everything but zuul, reboot zuul after static, etc come back up | 22:25 |
fungi | okay, all set to reboot mine | 22:25 |
corvus | clarkb: yep. after fungi is ready, i will grab + stop, then let you know to proceed | 22:26 |
fungi | and yes, that plan sounds correct | 22:26 |
fungi | corvus: start at will | 22:26 |
corvus | zuul is stopped -- you may proceed | 22:26 |
fungi | thanks, rebooting static and nodepool now | 22:26 |
pabelanger | okay, doing mirrors | 22:26 |
clarkb | zm0* have been rebooted, waiting for them to return to us | 22:27 |
fungi | will the launchers need to have anything done to reconnect to zk on nodepool.o.o? | 22:27 |
clarkb | fungi: I don't think so | 22:27 |
clarkb | it should retry if it fails iirc | 22:27 |
clarkb | similar to how gear works | 22:28 |
clarkb | (but we should check) | 22:28 |
clarkb | fungi: oh also we may need to make sure the old nodepool process doesn't start? | 22:28 |
clarkb | fungi: should just have zk running on nodepool.o.o | 22:28 |
fungi | nodepool has booted already | 22:28 |
fungi | i'll go check what's running | 22:28 |
clarkb | all zuul mergers report kpti, checking them for daemons | 22:28 |
dmsimard | clarkb: looking for storyboard re: playbook | 22:29 |
fungi | zookeeper is running, nodepool is not | 22:29 |
fungi | so should be okay | 22:29 |
clarkb | all 8 zuul mergers are running a zuul merger process I think my side is good | 22:29 |
fungi | static.o.o is up and reporting correct kernel security | 22:29 |
dmsimard | clarkb: ok I need to fix ubuntu 14.04 vs 16.04 | 22:30 |
dmsimard | storyboard have not yet been updated | 22:30 |
clarkb | dmsimard: they should be the same | 22:30 |
clarkb | dmsimard: and they have been updated | 22:30 |
dmsimard | 14.04 doesn't have the kaiser flag in /proc/cpuinfo, 16.04 does | 22:30 |
pabelanger | okay, just waiting for inap, all other mirrors show good for reboot. Checking apache now | 22:30 |
pabelanger | inap is also good for reboot | 22:31 |
clarkb | dmsimard: ansible -i /etc/ansible/hosts/openstack 'logstash-worker0*' -m shell -a "dmesg | grep 'Kernel/User page tables isolation: enabled'" is apparently the most reliable way to check | 22:31 |
clarkb | dmsimard: as all the flag in cpuinfo means is that the kernel has detected the insecure cpu not that it has necessarily enabled pti | 22:31 |
dmsimard | clarkb: yeah but that doesn't work for centos/rhel :( | 22:31 |
clarkb | pabelanger: good for reboot meaning reboot is complete and the dmesg content is there? | 22:31 |
clarkb | dmsimard: ya I know rhel is the only distro it doesn't work on from the ones I have sampled | 22:31 |
clarkb | but checking cpuinfo doesn't tell you if pti is enabled | 22:32 |
clarkb | it only tells you that the kernel knows its cpu is insecure | 22:32 |
pabelanger | clarkb: yes, just validating AFS is working properly now | 22:32 |
dmsimard | /boot/config-3.13.0-139-generic:CONFIG_PAGE_TABLE_ISOLATION=y is probably safe | 22:32 |
pabelanger | so far, issue with vexxhost mirror | 22:32 |
clarkb | dmsimard: that also doesn't tell you pti is enabled | 22:32 |
clarkb | dmsimard: only that support for it was compiled in | 22:32 |
dmsimard | damn it | 22:32 |
clarkb | (this is why the dmesg check is important it is positive confirmation from the kernel that it is ptiing) | 22:32 |
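Pulling this exchange together, a small sketch of what actually confirms PTI versus what only hints at it (the inventory path and host pattern are the ones quoted above; adjust for the hosts being checked):

```
# Positive confirmation: the kernel prints this only when page table isolation is active
ansible -i /etc/ansible/hosts/openstack 'logstash-worker0*' -m shell \
  -a "dmesg | grep 'Kernel/User page tables isolation: enabled'"

# Weaker signals -- useful context, but not proof that PTI is enabled:
grep -o cpu_insecure /proc/cpuinfo                             # kernel sees a vulnerable CPU
grep CONFIG_PAGE_TABLE_ISOLATION "/boot/config-$(uname -r)"    # support compiled in, could still be off
```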
mgagne | pabelanger: what is the reboot about? kernel for meltdown? | 22:32 |
clarkb | mgagne: yes | 22:33 |
corvus | clarkb, fungi: i believe i'm waiting only on pabelanger at this point, correct? | 22:33 |
clarkb | corvus: that is my understanding yes | 22:33 |
fungi | yes, everything good on my end | 22:33 |
corvus | pabelanger: i'm idle if you need me to jump on a mirror | 22:33 |
pabelanger | corvus: yes, vexxhost please | 22:33 |
pabelanger | it is not serving up AFS | 22:33 |
pabelanger | I am checking others still | 22:33 |
corvus | ack | 22:34 |
corvus | pabelanger: seems up now: http://mirror01.ca-ymq-1.vexxhost.openstack.org/ubuntu/lists/ | 22:34 |
clarkb | mgagne: we are doing all of our VM kernels today (and if I don't run out of time the control plane of infracloud) | 22:35 |
fungi | `ls /afs/openstack.org/docs` returns content for me on mirror01.ca-ymq-1.vexxhost | 22:35 |
mgagne | clarkb: cool, just wanted to check if it was for spectre | 22:35 |
clarkb | mgagne: I'm not aware of any spectre patches yet | 22:35 |
pabelanger | corvus: yes, confirmed. Thanks | 22:35 |
fungi | 329 entries to be exact | 22:35 |
clarkb | mgagne: unfortunately | 22:35 |
mgagne | clarkb: ok, we are on the same page then | 22:36 |
pabelanger | okay, all mirrors are rebooted, dmesg confirmed and apache running | 22:36 |
corvus | mgagne: from what i read from gkh, that's probably next week. and the week after. and so on forever. :| | 22:36 |
clarkb | corvus: we'll get really good at rebooting :) | 22:36 |
fungi | until we get redesigned cpus deployed everywhere | 22:36 |
mgagne | ¯\_(ツ)_/¯ | 22:36 |
clarkb | corvus: I think that means you are good to patch zuulv3 and reboot | 22:36 |
fungi | and discover whatever new class of bugs they introduce | 22:36 |
corvus | cool, proceeding with zuulv3.o.o. i expect it to start on boot. | 22:37 |
clarkb | corvus: note I didn't prepatch zuulv3 | 22:37 |
corvus | clarkb: i did | 22:37 |
clarkb | since python had been crashing there I didn't want to do anything early | 22:37 |
clarkb | cool | 22:37 |
corvus | host is up | 22:38 |
corvus | zuul is querying all gerrit projects | 22:38 |
corvus | zuul-web did not start | 22:38 |
corvus | or rather, it seems to have started and crashed without error? | 22:39 |
pabelanger | clarkb: okay if I step away now? Have some guests over this evening | 22:39 |
clarkb | pabelanger: yup | 22:39 |
pabelanger | great, good luck all | 22:39 |
clarkb | pabelanger: enjoy, and thanks for the help! | 22:39 |
fungi | thanks pabelanger! | 22:40 |
clarkb | pabelanger: oh can you tldr what needs to be done with afs? | 22:40 |
clarkb | pabelanger: we can finish that up while you have dinner :) | 22:40 |
corvus | submitting cat jobs now | 22:40 |
clarkb | I guess it's wait for mirror-update to stop doing things | 22:40 |
clarkb | then reboot the afs server | 22:40 |
corvus | mergers seem to be running them | 22:40 |
fungi | and after that, puppetmaster | 22:40 |
clarkb | fungi: ++ | 22:40 |
corvus | i've restarted zuul-web | 22:41 |
clarkb | looks like a couple changes have enqueued according to status | 22:42 |
corvus | according to grafana we're at 10 executors and 18 mergers which is the full complement | 22:42 |
clarkb | and jobs are running | 22:42 |
corvus | okay, i'll re-enqueue now | 22:43 |
fungi | 43fa686e-12a4-4c51-ad3b-d613e2417ff3 claims to be named "ethercalc01" | 22:43 |
fungi | but is not the real slim shady | 22:43 |
ianw | fungi: that could be our in-progress ethercalc ... again waiting for nodejs | 22:43 |
clarkb | fungi: oh I don't think the real one is in my listings either | 22:43 |
ianw | i think fricker had one up for testing last year | 22:43 |
fungi | 93b2b91f-7d01-442b-8dff-96a53088654a is actual ethercalc01 | 22:44 |
clarkb | fungi: it might be a good idea to regenerate the openstack inventory cache file? | 22:44 |
clarkb | since a few servers have been deleted that are showing up in there | 22:44 |
fungi | so it's in the inventory, just tracked by uuid since there are two of them | 22:44 |
clarkb | corvus: console streaming is working | 22:44 |
fungi | should we delete the in-progress one before clearing and regenerating the inventory cache? | 22:44 |
ianw | fungi / clarkb: see https://review.openstack.org/#/c/529186/ ... may be related? | 22:45 |
clarkb | fungi: the in progress ethercalc? it's probably fine to keep it but just patch it? | 22:45 |
fungi | should we delete status01 and the nonproduction ethercalc01 duplicate for now, since we'd want to test bootstrapping them from scratch again anyway? | 22:45 |
ianw | fungi: yep | 22:46 |
clarkb | fungi: thinking about it though, that may be simplest | 22:46 |
mordred | infra-root: oh, well. I somehow missed that we were doing work over here in incident (and quite literally spent the day wondering why everyone was so quiet) | 22:46 |
clarkb | ++ to cleaning up | 22:46 |
fungi | mordred: i hope you got a lot done in the quiet! | 22:46 |
clarkb | mordred: maybe you want to help mwhahaha debug jobs? | 22:46 |
mordred | what, if anything, can I do to make myself useful? | 22:46 |
mordred | clarkb: kk. will do | 22:46 |
ianw | fungi: do you want me to do that, if you have other things in flight? | 22:46 |
clarkb | mordred: the bulk of patching is done and unless you really want to do reboots on what's left we probably have it under control? | 22:47 |
fungi | ianw: feel free but i'm basically idle for the moment which is why i got to looking at the odd entries | 22:47 |
clarkb | infracloud was mostly ignored today but I think I got most/all of the hypervisors done there last night | 22:47 |
ianw | fungi: ok, well if you've got all the id's there ready to go might be quicker for you | 22:47 |
clarkb | fungi: mirror-update looks idle now, maybe you want to do the remaining afs server? | 22:47 |
fungi | will do | 22:48 |
clarkb | in that case I can do afs server :) | 22:48 |
fungi | go for it, i'll work on ethercalc01 upgrade and the deletions for dupes | 22:48 |
clarkb | cool | 22:48 |
clarkb | corvus: and to confirm afs01.dfw can just be rebooted now that mirror-update is happy? | 22:49 |
corvus | clarkb: yep, should be fine. | 22:51 |
clarkb | mordred: http://logs.openstack.org/80/532680/1/check/openstack-tox-py27/2b32f22/ara/result/1771aec0-9c03-40bd-837e-4aca16e1ec88/ the fail there in run.yaml caused a post failure | 22:51 |
clarkb | mordred: so that may be part of it | 22:51 |
fungi | i've deleted status01 and the shadow of ethercalc01, cleared the inventory cache, and am updating the real ethercalc01 now | 22:51 |
corvus | clients should fail over to afs02.dfw | 22:51 |
clarkb | updating packages on afs01.dfw now | 22:52 |
clarkb | and rebooting it now | 22:53 |
fungi | ethercalc01 is done and seems to be none the worse | 22:53 |
clarkb | afs01.dfw is back up and running now and has the patch in place according to dmesg | 22:56 |
* clarkb reads docs on how to check its volume are happy | 22:56 | |
fungi | i'll make sure the updated packages are on puppetmaster (very carefully taking note first of what it wants to upgrade) | 22:56 |
corvus | all changes re-enqueued | 22:57 |
clarkb | corvus: vos listvldb looks good to me | 22:57 |
clarkb | anything else I should check before reenabling puppet on mirror-update? | 22:58 |
corvus | should we send a second status notice, or was the first sufficient? | 22:58 |
fungi | upgraded packages on puppetmaster, keeping old configs | 22:58 |
fungi | i think the first was fine, that was brief enough of an outage | 22:58 |
fungi | puppetmaster is ready for its reboot when we're all ready | 22:58 |
corvus | clarkb: 'listvol' is probably better for this -- i think it consults the fileserver itself, not just the database | 22:58 |
clarkb | corvus: cool will do that now | 22:58 |
mordred | clarkb: that's a normal test failure - so yah, we probably need to make our post-run things a little more resilient | 22:59 |
clarkb | and they all show online | 22:59 |
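A sketch of the two AFS checks being compared here: `vos listvldb` reports what the volume location database believes, while `vos listvol` asks the fileserver itself, which is the stronger check after a fileserver reboot (the server name and `-localauth` are assumptions about how it was run):

```
# Database view: volumes the VLDB thinks live on this fileserver
vos listvldb -server afs01.dfw.openstack.org -localauth

# Fileserver view: volumes actually attached; healthy ones show as On-line
vos listvol -server afs01.dfw.openstack.org -localauth
```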
clarkb | mordred: ya not sure if mwhahaha's things were all due to that weird error reporting though | 22:59 |
mordred | clarkb: I'm guessing that'll be "zomg there is no .testrepository" - which is normal when a patch causes tests to not be able to import | 22:59 |
clarkb | mordred: just noticed it may be a cause of lots of post failures | 22:59 |
mordred | clarkb: indeed. | 22:59 |
clarkb | fungi: I think I'm ready for puppetmaster when everyone else is | 22:59 |
mordred | clarkb: I'll work up a fix for it in any case - it's definitely sub-optimal | 22:59 |
clarkb | fungi: i can reenable puppet on mirror-update after the reboot | 23:00 |
fungi | sure | 23:00 |
clarkb | ze10 is apparently not patched? | 23:00 |
clarkb | (that can happen after puppetmaster) | 23:01 |
fungi | okay, last call for objections before i reboot puppetmaster.o.o | 23:01 |
fungi | and here goes | 23:01 |
fungi | i can ssh into puppetmaster.o.o again now | 23:02 |
clarkb | as can I | 23:03 |
fungi | Kernel/User page tables isolation: enabled | 23:03 |
fungi | should be all set | 23:03 |
clarkb | I'm going to remove mirror-update from the emergency file now | 23:03 |
fungi | sounds good | 23:03 |
clarkb | that's done | 23:03 |
clarkb | now to look into ze10 | 23:03 |
clarkb | ok it looks like the dmesg buffer rolled over | 23:04 |
clarkb | we don't have the initial boot messages, but the kernel version lgtm so I'm going to trust that it was done | 23:04 |
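With the dmesg ring buffer rolled over, the fallback used here is comparing the running kernel against what is installed on disk; a rough sketch assuming Ubuntu's linux-image packaging:

```
uname -r                                              # kernel actually running
dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'  # kernel images installed
# If the running version matches the patched package and the host rebooted
# after it was installed, it is reasonable to trust the patch is in effect.
```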
fungi | yeah, that'll happen | 23:05 |
clarkb | going to generate a new list then I think it likely infracloud is what is left | 23:05 |
fungi | i guess the noise from bwrap floods dmesg pretty thoroughly | 23:06 |
fungi | ahh, oom killer running wild on ze10 actually | 23:07 |
clarkb | corvus: jeblairtest01 is the only server that is reachable and not patched | 23:09 |
clarkb | infra-root http://paste.openstack.org/show/642516/ | 23:10 |
* clarkb looks into some of those unreachable servers | 23:10 | |
fungi | hopefully there are fewer unreachables after the stuff i deleted earlier | 23:13 |
clarkb | yup just 5 now. | 23:13 |
fungi | oh, apps-dev should be able to go away | 23:13 |
fungi | i'll delete it | 23:13 |
fungi | eavesdrop is in a stopped state, pending deletion after we're sure eavesdrop01 is good (per pabelanger) | 23:13 |
clarkb | ianw: https://etherpad.openstack.org/p/infra-meltdown-patching line 173 is I think the old backup server? | 23:14 |
clarkb | ianw: is that something you are able to confirm (and did you want to delete it?) | 23:14 |
fungi | #status log deleted old apps-dev.openstack.org server | 23:14 |
openstackstatus | fungi: finished logging | 23:14 |
fungi | infra-root: any objections to deleting the old (trusty) eavesdrop.o.o server? | 23:15 |
dmsimard | fungi: pabelanger mentioned earlier it was him who kept it undeleted just in case | 23:15 |
dmsimard | I don't believe there's anything wrong with the new eavesdrop | 23:15 |
fungi | data was on a cinder volume, so unless you stuck something in your homedir it's not like there's anything on there anyway | 23:15 |
mordred | no objections here | 23:16 |
clarkb | fungi: if you are confident it can go away then I'm fine with it | 23:16 |
ianw | clarkb: yeah, the old one, think it's fine to delete it now, will do | 23:16 |
clarkb | I am putting these servers on the etherpad fwiw | 23:16 |
fungi | #status log deleted old eavesdrop.openstack.org server | 23:16 |
openstackstatus | fungi: finished logging | 23:16 |
fungi | kdc02 was supplanted by kdc04 right? | 23:17 |
clarkb | fungi: ya there was a change for that too | 23:17 |
fungi | pretty sure i reviewed (maybe approved?) that | 23:17 |
clarkb | looks like it's already been deleted so stale inventory cache? | 23:17 |
clarkb | unless it's not in dfw | 23:17 |
fungi | shouldn't be stale inventory cache | 23:17 |
clarkb | ah yup it's in ORD | 23:18 |
fungi | ord | 23:18 |
fungi | just looked in the cache | 23:18 |
clarkb | its state is shut off | 23:18 |
fungi | which makes sense, and explains the unreachable | 23:18 |
fungi | so we're safe to delete that instance as well? | 23:19 |
clarkb | I think so | 23:19 |
fungi | and odsreg, i'm pretty sure we checked with ttx and he said it was clear for deletion too | 23:19 |
clarkb | ya I seem to recall that | 23:19 |
fungi | #status log deleted old kdc02.openstack.org server | 23:19 |
openstackstatus | fungi: finished logging | 23:20 |
dmsimard | clarkb: retrying the inventory after fixing 14.04/16.04 and adding AMD in | 23:20 |
clarkb | fungi: you don't happen to have that logged in your irc logs do you? I can't find it (I don't keep long term logs and just rebooted) | 23:20 |
clarkb | dmsimard: thanks | 23:20 |
fungi | clarkb: that's what i'm hunting down now | 23:20 |
dmsimard | sorry about what little I could do today, I've been fighting other things | 23:20 |
dmsimard | I'll help after dinner | 23:21 |
clarkb | I think we are just about to the real fun. INFRACLOUD! | 23:21 |
dmsimard | should probably drain nodepool | 23:23 |
dmsimard | and then yolo it | 23:23 |
ianw | so my test xenial build is -109 generic, what got added over 108? | 23:24 |
clarkb | ianw: they fixed booting problems :) | 23:24 |
fungi | ianw: now with less crashing | 23:24 |
clarkb | ianw: so anything that is already on 108 is probably fine as long as unattended upgrades is working and updates to 109 before the next reboot | 23:24 |
clarkb | 109 is not required to be secure aiui | 23:24 |
ianw | ok, cool, thanks | 23:25 |
clarkb | dmsimard: I'm not even sure we have to drain nodepool. I got the hypervisors done last night (the ones I could access) and now it's just the control plane and making sure we've done due diligence to do any other hypervisors that might still be alive but not responding to puppetmaster.o.o | 23:26 |
clarkb | compute000.vanilla.ic.openstack.org for example I can connect to from home but not puppetmaster | 23:26 |
clarkb | compute030.vanilla.ic.openstack.org accepts ssh connections then immediately kills them | 23:27 |
clarkb | alright I'm going to try collecting data on what needs help in infracloud | 23:30 |
*** rlandy is now known as rlandy|bbl | 23:37 | |
clarkb | infra-root http://paste.openstack.org/show/642518/ that is infracloud | 23:45 |
clarkb | we need to figure out which hypervisor is running the mirror but we should be able to reboot any other node | 23:45 |
clarkb | let's save baremetal00 for last as it's the bastion for that env | 23:45 |
clarkb | (similar to how we did puppetmaster last in our control plane) | 23:45 |
* clarkb figures out where the mirror is running | 23:46 | |
clarkb | compute039.vanilla.ic.openstack.org is where mirror01.regionone.vanilla.openstack.org is running and it has been patched so we don't have to worry about that | 23:49 |
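One way to map an instance to its hypervisor with admin credentials, roughly the lookup this step required (exact column names and client behaviour can vary by release):

```
# Run with infracloud admin credentials sourced
openstack server show mirror01.regionone.vanilla.openstack.org \
  -c 'OS-EXT-SRV-ATTR:hypervisor_hostname' -f value
```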
clarkb | anyone want to do compute012.vanilla.ic.openstack.org? I'm going to start digging into compute005.vanilla.ic.openstack.org and why it is not working | 23:49 |
*** SergeyLukjanov has quit IRC | 23:53 | |
*** SergeyLukjanov has joined #openstack-infra-incident | 23:54 | |
fungi | pretty sure the hypervisor hosts for both mirror instances were already done, because i had to explicitly boot them both earlier today | 23:56 |
fungi | i can give compute012.vanilla.ic a shot at an upgrade | 23:56 |
clarkb | looking at the nova-manage service list and the ironic node-list I think we have diverged a bit between what is working and what is expected to be working | 23:57 |
clarkb | I think what we should likely do is do our best to recover nodes like compute005 but if they don't come back we can disable them in nova then remove them from the inventory file | 23:57 |
clarkb | I'm going to quickly make sure all of the patched hypervisors + 012 are the only ones that nova thinks it can use | 23:57 |
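A sketch of the "disable them in nova" cleanup for hypervisors that never come back, so the scheduler stops considering them; the host name and disable reason are illustrative:

```
# Mark the dead compute node's service as disabled
openstack compute service set --disable \
  --disable-reason "meltdown patching: host unreachable" \
  compute005.vanilla.ic.openstack.org nova-compute

# Then confirm what nova still believes is alive (XXX marks unreachable services)
nova-manage service list
```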
fungi | compute012.vanilla.ic is taking a while to reach, seems like | 23:58 |
clarkb | the networking there is so screwed up | 23:59 |
clarkb | compute000 for example, puppetmaster can apparently hit it now | 23:59 |
clarkb | the XXX nodes in nova-manage service list are the ones that are unreachable | 23:59 |