clarkb | so we don't have any hypervisors (other than 012) that are unpatched and actually being used in vanilla | 00:00 |
---|---|---|
clarkb | I executed what I think was a reboot via ironic to compute005 we will see if it comes back | 00:00 |
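For reference, the ironic-driven reboot mentioned above amounts to something like the following (a sketch; the node name is illustrative and the exact command depends on which client is installed on the baremetal host):

```shell
# Find the ironic node backing the hypervisor, then power-cycle it
openstack baremetal node list
openstack baremetal node reboot compute005
```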
clarkb | but I think we may just have to disable these in nova so that if they do mysteriously manage to return we don't run anything on them | 00:01 |
fungi | wfm | 00:02 |
fungi | my test probe to 012 did eventually return so i have the upgrade running against it now | 00:02 |
clarkb | ok I've got to take care of kids for a bit | 00:03 |
fungi | dist-upgrade on 012 finished so rebooting now | 00:05 |
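The per-hypervisor patching being run here is, roughly, the standard dist-upgrade-and-reboot cycle (a sketch; it assumes the KPTI kernel is already published in the configured apt repositories):

```shell
# Pull in the updated kernel with page-table isolation, then boot onto it
apt-get update
apt-get -y dist-upgrade
reboot
```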
clarkb | fungi: when that's done I think we can also patch and reboot controller00.vanilla | 00:07 |
clarkb | this will make nodepool unhappy for a short bit but it's used to clouds going away | 00:07 |
fungi | especially this one, probably | 00:07 |
clarkb | I'm mostly not able to computer now that kids are awake from nap | 00:07 |
fungi | i'm nearing end of steam myself and about to go relax | 00:08 |
fungi | but will see compute012 through first at least | 00:08 |
pabelanger | and back | 00:08 |
pabelanger | afs01.dfw is waiting for mirror-update.o.o to stop running bandersnatch | 00:09 |
pabelanger | checking now | 00:09 |
clarkb | pabelanger: it's done | 00:09 |
clarkb | pabelanger: I rebooted it and removed mirror update from emergency file | 00:09 |
pabelanger | great | 00:09 |
clarkb | we are doing infracloud vanilla now | 00:10 |
clarkb | I cross checked nova-manage service list against the unreachable servers | 00:10 |
clarkb | and tried rebooting compute005 to see if it comes back | 00:10 |
clarkb | but I'm guessing we just disable the unreachable servers in nova and remove them from our inventory | 00:11 |
pabelanger | yah, we likely should | 00:11 |
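A sketch of what disabling an unreachable hypervisor looks like with the nova tooling of that era (the host name and reason text here are illustrative):

```shell
# Keep the scheduler from placing instances on the host even if it comes back
nova service-disable --reason "unreachable, failing hardware" \
    compute005.vanilla.ic.openstack.org nova-compute

# Cross-check which compute services nova still sees heartbeats from
# (dead services show XXX in the State column)
nova-manage service list
```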
fungi | clarkb: i have scoured my chat logs and can't find where we might have asked ttx about odsreg.o.o, so i'll just try to remember to ask him tomorrow once he's around | 00:12 |
fungi | how long should i wait for this to boot back up before i get worried it's not coming back? | 00:14 |
clarkb | fungi: 10 minutes | 00:15 |
fungi | k | 00:15 |
pabelanger | okay, so anything I should focus on, or just slow process of infra-cloud | 00:15 |
clarkb | ~5 minutes seems to be the short end, with somewhere under 10 being common | 00:15 |
fungi | it's probably coming close to 10 minutes at this point, but i'll give it a while longer | 00:15 |
clarkb | pabelanger: I think just slow process of infracloud. maybe you want to do controller00.vanilla.ic.openstack.org? | 00:15 |
clarkb | I'm going to start doing the chocolate status check | 00:15 |
clarkb | I guess the worry here is controller00 might not come back | 00:16 |
pabelanger | sure | 00:16 |
clarkb | maybe that would be a good thing >_> | 00:16 |
pabelanger | Oh, is the HDD on the way out? | 00:16 |
clarkb | pabelanger: last night compute017 and compute027 had RO / filesystems and dmesg complained about medium errors | 00:17 |
clarkb | not sure about controller00 but the other hosts in the group are definitely on their last legs I think | 00:17 |
pabelanger | I do see some HDD messages in dmesg on controller00 | 00:18 |
clarkb | pabelanger: and medium error? | 00:18 |
clarkb | or sd read error | 00:18 |
pabelanger | [Wed Jan 10 07:08:22 2018] end_request: critical medium error, dev sda, sector 308872 | 00:18 |
pabelanger | for example | 00:18 |
clarkb | ya, controller00 may not come back in that case | 00:18 |
clarkb | 017 and 027 I couldn't patch due to the RO fs so I rebooted them and they never came back :) | 00:18 |
fungi | compute012 _did_ eventually come back online. i can ssh into it pretty quickly, so i think the super slow ansible results are likely slowness with fact gathering? | 00:19 |
clarkb | fungi: the network bw is also bad | 00:19 |
pabelanger | apt-get is still working | 00:19 |
clarkb | fungi: there is one last step you need to do post boot and that is creating the network interfaces for neutron | 00:19 |
pabelanger | guess we try a reboot and if it fails, rebuild from working compute? | 00:19 |
fungi | sure, but my point was i can ssh into it and check dmesg in a matter of a few seconds, yet trying to get ansible to do the same takes minutes | 00:19 |
clarkb | pabelanger: considering controller00's fs is not RO, I say we try it, and honestly if it doesn't come back we home the vanilla computes to chocolate and move on | 00:19 |
clarkb | fungi: yup I think it's network problems | 00:20 |
pabelanger | clarkb: sure | 00:20 |
fungi | but specifically ansible seems to be doing many orders of magnitude more network things than just ssh'ing into the server, running my command and regurgitating the results | 00:20 |
clarkb | fungi: ansible -i /etc/ansible/hosts/infracloud 'compute*.vanilla.ic.openstack.org' -m shell -a '/sbin/ip link add veth1 type veth peer name veth2 && /sbin/brctl addif br-vlan2551 veth1 && /sbin/ip link set dev veth1 up && /sbin/ip link set dev veth2 up' | 00:20 |
clarkb | those are the steps puppet runs to make networking go but puppet will happen too slowly | 00:20 |
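Broken out with comments, those are the same commands as the one-liner above, just run step by step on a compute host:

```shell
# Create a veth pair for neutron's linuxbridge setup on these hosts
/sbin/ip link add veth1 type veth peer name veth2
# Attach one end to the provider bridge carrying VLAN 2551
/sbin/brctl addif br-vlan2551 veth1
# Bring both ends up
/sbin/ip link set dev veth1 up
/sbin/ip link set dev veth2 up
```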
fungi | thanks, running that against 012 now | 00:21 |
fungi | grr... | 00:21 |
fungi | Failed to connect to the host via ssh: Connection timed out during banner exchange | 00:21 |
fungi | oh, is that a result of the command maybe? | 00:22 |
fungi | since it fiddles with networking | 00:22 |
fungi | doesn't seem to have worked | 00:23 |
fungi | `ip link show dev veth2` returns 'Device "veth2" does not exist.' | 00:23 |
fungi | same for veth1 | 00:23 |
clarkb | maybe it didn't manage to run? | 00:24 |
fungi | `brctl show` indicates there is a br-vlan2551 but only has eth2.2551 in it | 00:24 |
fungi | so, yeah, i think it disconnected before running | 00:25 |
fungi | trying again | 00:25 |
fungi | unreachable again | 00:25 |
fungi | interesting... i can't ssh to it from puppetmaster but i can from home | 00:25 |
clarkb | fungi: ya I had that problem yesterday too | 00:26 |
fungi | didn't you have another one like this, clarkb? | 00:26 |
clarkb | that's why I think the networking there is super hosed | 00:26 |
clarkb | fungi: compute000.vanilla | 00:26 |
clarkb | like broken lacp or an asymmetric route from puppetmaster? | 00:26 |
clarkb | something funny going on | 00:26 |
fungi | oh, i did eventually get through from puppetmaster with a straight ssh | 00:26 |
fungi | just took a while | 00:26 |
fungi | more wondering whether there's a peering problem. i can get to it quite quickly from home but it takes forever from puppetmaster | 00:27 |
clarkb | [71381.503893] EDAC MC0: 25661 CE error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0) | 00:27 |
fungi | seeing some packet loss from puppetmaster but not terrible | 00:27 |
clarkb | oh it does eventually work? interesting | 00:27 |
fungi | yeah, did `sudo ssh compute012.vanilla.ic.openstack.org` on puppetmaster and after a couple minutes it rendered a root prompt | 00:28 |
clarkb | this hardware is really on its last legs. Don't be surprised if an item on next weeks meeting agenda is turn off infracloud | 00:28 |
clarkb | fungi: huh | 00:28 |
fungi | i wonder if ansible just isn't waiting long enough | 00:28 |
fungi | it gives up after about 10 seconds | 00:29 |
* fungi times it | 00:29 | |
clarkb | [18371.678636] end_request: I/O error, dev sdc, sector 8089728 | 00:29 |
fungi | yeah, ansible gives up just under 11 seconds, so probably a 10-second timeout on the ssh call | 00:29 |
fungi | a straight ssh to it from puppetmaster takes just under 51 seconds to return | 00:30 |
fungi | i'll run the commands by hand, not via ansible | 00:31 |
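Ansible's default connection timeout is indeed 10 seconds, which matches what fungi measured. A hedged alternative to running the commands by hand would be raising that timeout for this slow environment, roughly:

```shell
# -T raises the SSH connection timeout (seconds); 60 covers the ~51s seen above
ansible -i /etc/ansible/hosts/infracloud 'compute012.vanilla.ic.openstack.org' \
    -T 60 -m shell -a 'uptime'
# or set it for the whole session
export ANSIBLE_TIMEOUT=60
```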
clarkb | http://paste.openstack.org/show/642525/ is chocolate. 2 hypervisors there need patching | 00:31 |
clarkb | I will work on 061 in chocolate | 00:31 |
pabelanger | cannot compute MD5 hash for file '/etc/ldap/ldap.conf': failed to read (Input/output error) | 00:32 |
pabelanger | looks like a bad file | 00:32 |
pabelanger | was able to use the version in dpkg-new for now | 00:33 |
fungi | ick | 00:33 |
clarkb | we don't use ldap so that should be fine | 00:33 |
pabelanger | dist-upgrade done | 00:33 |
clarkb | other than the fact that the host is about to die | 00:33 |
pabelanger | cross fingers, rebooting | 00:33 |
fungi | okay, neutron networking has been set back up on compute012.vanilla.ic | 00:33 |
fungi | anything else it should need? | 00:34 |
clarkb | fungi: that should be it | 00:34 |
clarkb | just confirm kpti is on | 00:35 |
fungi | yeah, did that earlier | 00:35 |
fungi | it's good | 00:35 |
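For completeness, a couple of ways to confirm KPTI is active on a rebooted host (a sketch; which check works depends on the exact kernel build):

```shell
# Kernels new enough to expose the sysfs vulnerability files
cat /sys/devices/system/cpu/vulnerabilities/meltdown
# Otherwise look for the page-table-isolation line in the boot log
dmesg | grep -i 'page table'
```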
clarkb | I am rebooting compute064 and 061 in chocolate now to pick up the changes | 00:35 |
fungi | i think i'm gonna drop off for the evening in that case and check back in tomorrow morning to see if there's anything left | 00:35 |
clarkb | cool thanks for all the help. I think all that is left will be cleaning up odsreg and disabling any nova hypervisors we can't patch | 00:36 |
fungi | yup, i'll try to remember to ask ttx about the odsreg server tomorrow | 00:36 |
clarkb | and maybe doing chocolate controller and baremetal00 as I am doing my best to computer with kids but it's not easy :) | 00:36 |
fungi | sure | 00:36 |
fungi | maybe someone in later timezones will pick up some of what's left for chocolate too | 00:37 |
pabelanger | still waiting for server to come back online from reboot | 00:37 |
fungi | if they're feeling up to it | 00:37 |
pabelanger | it does take a bit to boot however | 00:37 |
clarkb | pabelanger: ya ~10 minutes or so is normal | 00:37 |
clarkb | as fungi just demonstrated with his server | 00:38 |
fungi | that's about how long compute012.vanilla.ic took | 00:38 |
fungi | yeah | 00:38 |
pabelanger | yah, sounds right | 00:38 |
* fungi fades into the aether | 00:38 | |
pabelanger | hey, it is back | 00:41 |
pabelanger | checking nova logs now | 00:42 |
clarkb | 064 is back up and I added the veth interfaces | 00:42 |
clarkb | and now 061 is back and veth interfaces added | 00:43 |
pabelanger | oh right, linux-bridge | 00:43 |
clarkb | ya puppet configures it but that's slow | 00:43 |
clarkb | the controller doesn't need it iirc | 00:43 |
clarkb | only the computes | 00:43 |
pabelanger | going to kick.sh it | 00:44 |
clarkb | hrm chocolate does have it | 00:44 |
clarkb | *has veths on controller | 00:44 |
clarkb | chocolate unreachable nodes are not all in the nova-manage service list with XXX entries | 00:46 |
clarkb | so going to double check those from here to see if I managed to patch them | 00:46 |
pabelanger | welp, that's not good | 00:47 |
pabelanger | [ 302.415533] EXT4-fs (sda1): Remounting filesystem read-only | 00:47 |
pabelanger | just happened | 00:47 |
pabelanger | I think neutron / rabbitmq spamming logs didn't help | 00:48 |
pabelanger | clarkb: ^ | 00:48 |
pabelanger | so, I can reboot again, stop neutron / rabbit, get veth back online | 00:48 |
pabelanger | but, since HDD is dying, maybe we should just let it | 00:49 |
clarkb | fun | 00:49 |
clarkb | ya I mean you can try a reboot but if it does that again I think we just disable infracloud vanilla in nodepool | 00:50 |
pabelanger | kk | 00:50 |
clarkb | ok I'm fairly confident that chocolate computes are all patched now | 00:54 |
pabelanger | cool | 00:55 |
clarkb | the unreachable computes that don't show up as XXX in nova-manage service list don't show up at all :) | 00:55 |
clarkb | considering vanilla's controller is afk maybe we save chocolate's controller and baremetal for when we aren't about to afk ourselves :) | 00:56 |
pabelanger | k, got pings, kicking now | 00:56 |
pabelanger | well, trying to SSH | 00:57 |
pabelanger | doesn't look good :( | 00:57 |
clarkb | womp womp | 00:58 |
clarkb | so that's at least 3 dead machines in vanilla from patching | 00:58 |
pabelanger | port 22 open, no reply | 00:58 |
clarkb | controller00 and compute017 and compute027 | 00:58 |
pabelanger | I'm guess HDD died again | 00:58 |
clarkb | chocolate actually does seem to be in much better shape | 00:58 |
pabelanger | guessing* | 00:58 |
pabelanger | yah | 00:58 |
clarkb | unfortunately our baremetal controller is on the vanilla hardware :/ | 00:58 |
pabelanger | okay, so propose patch to disable vanilla in nodepool, then deal with migrating compute nodes | 00:59 |
clarkb | ya I think so | 00:59 |
clarkb | we can do the migrating of computes later | 00:59 |
pabelanger | sure | 00:59 |
pabelanger | remote: https://review.openstack.org/532705 Set max-server to 0 for infracloud-vanilla | 01:02 |
pabelanger | clarkb: ^ tomorrow I'll clean up images, so we don't try to upload DIBs to it | 01:02 |
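The shape of that change is roughly the following (a sketch assuming the nodepool v3 provider/pool layout; the real stanza carries many more settings):

```yaml
# nodepool.yaml excerpt, illustrative only
providers:
  - name: infracloud-vanilla
    pools:
      - name: main
        max-servers: 0  # stop launching new nodes here; cleanup of existing ones continues
```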
*** rlandy|bbl is now known as rlandy | 03:50 | |
*** rosmaita has quit IRC | 04:34 | |
*** rlandy has quit IRC | 04:53 | |
*** rosmaita has joined #openstack-infra-incident | 10:32 | |
*** panda|rover|afk has quit IRC | 11:24 | |
*** panda has joined #openstack-infra-incident | 11:38 | |
*** panda has quit IRC | 11:59 | |
*** panda has joined #openstack-infra-incident | 12:13 | |
*** rosmaita has quit IRC | 12:34 | |
*** panda is now known as panda|ruck | 13:29 | |
*** rlandy has joined #openstack-infra-incident | 13:36 | |
*** rosmaita has joined #openstack-infra-incident | 13:41 | |
*** rosmaita has quit IRC | 14:34 | |
*** Shrews has joined #openstack-infra-incident | 14:45 | |
Shrews | FYI, from what I can determine from nodepool-launcher logs, looks like we temporarily lost connection to our ZK server around 22:27 UTC yesterday. I think it's around that time we began seeing znodes in ZK in the 'deleting' state and locked. The few I manually checked did not actually exist in the provider, yet the znode remained. After the launcher restart, seems to be that most did not exist in the provider, but we could now delete the znodes. | 14:47 |
Shrews | I can't find how this might happen in the launcher code, but will keep looking. | 14:47 |
Shrews | https://review.openstack.org/532823 might help debug this in the future | 14:48 |
Shrews | We've lost the ZK connection before and nodepool recovered fine, so this is new. | 14:49 |
Shrews | maybe kazoo itself got confused | 14:49 |
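For anyone following along, the stuck znodes can be inspected with the stock ZooKeeper CLI (a sketch; the /nodepool/nodes path follows nodepool v3's layout, the zkCli.sh path is the Ubuntu package default, and the server address is assumed from the discussion below that ZK runs on nodepool.o.o):

```shell
# List node znodes, then dump one to see its recorded state and provider
/usr/share/zookeeper/bin/zkCli.sh -server nodepool.openstack.org:2181 ls /nodepool/nodes
/usr/share/zookeeper/bin/zkCli.sh -server nodepool.openstack.org:2181 get /nodepool/nodes/<id>
```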
clarkb | Shrews: not certain but that is probably when we rebooted the server running zk | 16:27 |
*** rosmaita has joined #openstack-infra-incident | 16:31 | |
fungi | Shrews: 22:27 is roughly when we were restarting everything, so losing connections to stuff is expected | 16:37 |
fungi | specifically, nodepool.o.o was rebooted | 16:37 |
clarkb | I've just checked dmesg on baremetal00.vanilla.ic.openstack.org and there are no disk or memory errors | 16:50 |
clarkb | I think baremetal00.vanilla.ic.openstack.org and controller00.chocolate.ic.openstack.org are the last infracloud nodes that need patching | 16:50 |
clarkb | I think before we patch and reboot controller00 we should try to understand if that will break nodepool again | 16:55 |
pabelanger | agree | 16:59 |
fungi | looks like odsreg was down due to a botched migration. i issued a ctrl-alt-del at the oob console and it came back online | 17:00 |
fungi | okay, after confirming with ttx that it really could go, it's gone now | 17:03 |
fungi | our openstackci tenant in rax is pretty thoroughly cleaned up at this point | 17:04 |
fungi | well, at least as far as server instances are concerned. we could likely still stand to look through trove instances, snapshots, cinder volumes, swift containers... | 17:04 |
-openstackstatus- NOTICE: Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets | 17:47 | |
clarkb | knowing what I now know after reading logs and having shrews bring me up to speed I don't think restarting controller00 is currently safe (I mean if it comes up quickly we probably won't notice but if it doesn't we likely will have node_failures again) | 17:49 |
*** weshay is now known as weshay_interview | 18:14 | |
clarkb | ptg flights are quite a bit more expensive than I was expecting | 18:50 |
clarkb | I guess that's the downside to flying somewhere with cheap lodging and food, it's harder to get to | 18:50 |
* clarkb books anyway | 18:50 | |
clarkb | er ww | 18:50 |
clarkb | but hey now you know :) | 18:50 |
clarkb | also if flying through london beware the 3 airport problem | 18:50 |
corvus | clarkb: try staying across a saturday. could be cheaper even including the extra hotel. :) | 19:11 |
corvus | sfo<->dub was < $700 for me | 19:12 |
clarkb | corvus: ya the google flights breakdown says leaving sunday and monday is roughly the same so doing monday flight back | 19:14 |
clarkb | but still >900 | 19:14 |
*** weshay_interview is now known as weshay | 19:16 | |
*** panda|ruck is now known as panda|ruck|afk | 19:20 | |
pabelanger | can confirm staying across a saturday was cheaper, fly out on the Friday night | 19:37 |
pabelanger | flying in I mean | 19:38 |
*** rlandy is now known as rlandy|bbl | 22:52 | |
clarkb | nodepool launchers are now running code that will handle max-servers:0 properly | 23:43 |
clarkb | I don't think I'll do it today because I feel like I'm running out of steam early but will try to finish up infracloud patching tomorrow (and set chocolate's max-servers to 0 first) | 23:43 |