Thursday, 2018-01-11

clarkbso we don't have any hypervisors (other than 012) that are unpatched and actually being used in vanilla00:00
clarkbI executed what I think was a reboot via ironic to compute005 we will see if it comes back00:00
clarkbbut I think we may just have to disable these in nova so that if they do mysteriously manage to return we don't run anything on them00:01
fungiwfm00:02
fungimy test probe to 012 did eventually return so i have the upgrade running against it now00:02
clarkbok I've got to take care of kids for a bit00:03
fungidist-upgrade on 012 finished so rebooting now00:05
clarkbfungi: when that's done I think we can also patch and reboot controller00.vanilla00:07
clarkbthis will make nodepool unhappy for a short bit but its used to clouds going away00:07
fungiespecially this one, probably00:07
clarkbI'm mostly not able to computer now that kids are awake from nap00:07
fungii'm nearing end of steam myself and about to go relax00:08
fungibut will see compute012 through first at least00:08
pabelangerand back00:08
pabelangerafs01.dfw is waiting for mirror-update.o.o to stop running bandersnatch00:09
pabelangerchecking now00:09
clarkbpabelanger: its done00:09
clarkbpabelanger: I rebooted it and removed mirror update from emergency file00:09
pabelangergreat00:09
clarkbwe are doing infracloud vanilla now00:10
clarkbI cross-checked nova-manage service list against the unreachable servers00:10
clarkband tried rebooting compute005 to see if it comes back00:10
clarkbbut I'm guessing we just disable the unreachable servers in nova and remove them from our inventory00:11
pabelangeryah, we likely should00:11
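
What "disable these in nova" would look like in practice, as a hedged sketch rather than the exact commands anyone ran here: from a host with admin credentials for the vanilla cloud, mark the dead compute services disabled so the scheduler ignores them even if they come back. The credential path and host names below are assumptions based on the machines mentioned in this log.

    # Sketch only: assumes the legacy python-novaclient CLI and an admin openrc
    # at ~/openrc on the controller; host names are placeholders.
    source ~/openrc

    # See which nova-compute services the controller currently thinks are down
    nova service-list --binary nova-compute

    # Stop the scheduler from placing anything on the unreachable hypervisors,
    # even if they unexpectedly return
    for host in compute005 compute017 compute027; do
        nova service-disable "$host" nova-compute --reason "unreachable after Meltdown patching"
    done
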
fungiclarkb: i have scoured my chat logs and can't find where we might have asked ttx about odsreg.o.o, so i'll just try to remember to ask him tomorrow once he's around00:12
fungihow long should i wait for this to boot back up before i get worried it's not coming back?00:14
clarkbfungi: 10 minutes00:15
fungik00:15
pabelangerokay, so anything I should focus on, or just slow process of infra-cloud00:15
clarkb~5 minutes seems to be the short end, with somewhere under 10 as common00:15
fungiit's probably coming close to 10 minutes at this point, but i'll give it a while longer00:15
clarkbpabelanger: I think just slow process of infracloud. maybe you want to do controller00.vanilla.ic.openstack.org?00:15
clarkbI'm going to start doing the chocolate status check00:15
clarkbI guess the worry here is controller00 could not come back00:16
pabelangersure00:16
clarkbmaybe that would be a good thing >_>00:16
pabelangerOh, is the HDD on the way out?00:16
clarkbpabelanger: last night compute017 and compute027 had RO / filesystems and dmesg complained about medium errors00:17
clarkbnot sure about controller00 but the other hosts in the group are definitely on their last legs I think00:17
pabelangerI do see some HDD messages in dmesg on controller0000:18
clarkbpabelanger: and medium error?00:18
clarkbor sd read error00:18
pabelanger[Wed Jan 10 07:08:22 2018] end_request: critical medium error, dev sda, sector 30887200:18
pabelangerfor example00:18
clarkbya, controller00 may not come back in that case00:18
clarkb017 and 027 I couldn't patch due to the RO fs so I rebooted them and they never came back :)00:18
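
Before rebooting more of these, one way to see which vanilla computes are already reporting failing disks, along the lines of the checks being done by hand above. This is a rough sketch that assumes the /etc/ansible/hosts/infracloud inventory used for the ad-hoc command later in this log:

    # Hedged sketch: scan dmesg on all vanilla computes for the kinds of disk
    # errors seen on 017/027 (inventory path assumed from later in this log)
    ansible -i /etc/ansible/hosts/infracloud 'compute*.vanilla.ic.openstack.org' -m shell \
        -a "dmesg | egrep -i 'medium error|i/o error|remounting filesystem read-only' | tail -n5"
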
fungicompute012 _did_ eventually come back online. i can ssh into it pretty quickly, so i think the super slow ansible results are likely slowness with fact gathering?00:19
clarkbfungi: the network bw is also bad00:19
pabelangerapt-get is still working00:19
clarkbfungi: there is one last step you need to do post boot and that is create the network interfaces for neutron00:19
pabelangerguess we try a reboot and if it fails, rebuild from working compute?00:19
fungisure, but my point was i can ssh into it and check dmesg in a matter of a few seconds, yet trying to get ansible to do the same takes minutes00:19
clarkbpabelanger: considering controller00 doesn't have a RO fs I say we try it, and honestly if it doesn't come back we re-home the vanilla computes to chocolate and move on00:19
clarkbfungi: yup I think its network problems00:20
pabelangerclarkb: sure00:20
fungibut specifically ansible seems to be doing many orders of magnitude more network things than just ssh'ing into the server, running my command and regurgitating the results00:20
clarkbfungi: ansible -i /etc/ansible/hosts/infracloud 'compute*.vanilla.ic.openstack.org' -m shell -a '/sbin/ip link add veth1 type veth peer name veth2 && /sbin/brctl addif br-vlan2551 veth1 && /sbin/ip link set dev veth1 up && /sbin/ip link set dev veth2 up'00:20
clarkbthose are the steps puppet runs to make networking go but puppet will happen too slowly00:20
fungithanks, running that against 012 now00:21
fungigrr...00:21
fungiFailed to connect to the host via ssh: Connection timed out during banner exchange00:21
fungioh, is that a result of the command maybe?00:22
fungisince it fiddles with networking00:22
fungidoesn't seem to have worked00:23
fungi`ip link show dev veth2` returns 'Device "veth2" does not exist.'00:23
fungisame for veth100:23
clarkbmaybe it didn't manage to run?00:24
fungi`brctl show` indicates there is a br-vlan2551 but only has eth2.2551 in it00:24
fungiso, yeah, i think it disconnected before running00:25
fungitrying again00:25
fungiunreachable again00:25
fungiinteresting... i can't ssh to it from puppetmaster but i can from home00:25
clarkbfungi: ya I had that problem yesterday too00:26
fungididn't you have another one like this, clarkb?00:26
clarkbthat's why I think the networking there is super hosed00:26
clarkbfungi: compute000.vanilla00:26
clarkblike broken lacp or asymmetric route from puppetmaster?00:26
clarkbsomething funny going on00:26
fungioh, i did eventually get through from puppetmaster with a straight ssh00:26
fungijust took a while00:26
fungimore wondering whether there's a peering problem. i can get to it quite quickly from home but it takes forever from puppetmaster00:27
clarkb[71381.503893] EDAC MC0: 25661 CE error on CPU#0Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)00:27
fungiseeing some packet loss from puppetmaster but not terrible00:27
clarkboh it does eventually work? interesting00:27
fungiyeah, did `sudo ssh compute012.vanilla.ic.openstack.org` on puppetmaster and after a couple minutes it rendered a root prompt00:28
clarkbthis hardware is really on its last legs. Don't be surprised if an item on next weeks meeting agenda is turn off infracloud00:28
clarkbfungi: huh00:28
fungii wonder if ansible just isn't waiting long enough00:28
fungiit gives up after about 10 seconds00:29
* fungi times it00:29
clarkb[18371.678636] end_request: I/O error, dev sdc, sector 808972800:29
fungiyeah, ansible gives up just under 11 seconds, so probably a 10-second timeout on the ssh call00:29
fungia straight ssh to it from puppetmaster takes just under 51 seconds to return00:30
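
The just-under-11-second give-up lines up with Ansible's default 10-second SSH connection timeout, so one alternative (a sketch, not something run here) is to stretch the timeout for these slow hosts instead of falling back to plain ssh:

    # Bump the connection timeout for a single ad-hoc run (-T/--timeout, default 10s)
    ansible -i /etc/ansible/hosts/infracloud compute012.vanilla.ic.openstack.org \
        -T 60 -m shell -a 'uptime'

    # Or set it persistently under [defaults] in ansible.cfg:
    #   timeout = 60
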
fungii'll run the commands by hand, not via ansible00:31
clarkbhttp://paste.openstack.org/show/642525/ is chocolate. 2 hypervisors there need patching00:31
clarkbI will work on 061 in chocolate00:31
pabelanger cannot compute MD5 hash for file '/etc/ldap/ldap.conf': failed to read (Input/output error)00:32
pabelangerlooks like a bad file00:32
pabelangerwas able to use the version in dpkg-new for now00:33
fungiick00:33
clarkbwe don't ldap so that should be fine00:33
pabelangerdist-upgrade done00:33
clarkbother than the fact that the host is about to die00:33
pabelangercross fingers, rebooting00:33
fungiokay, neutron networking has been set back up on compute012.vanilla.ic00:33
fungianything else it should need?00:34
clarkbfungi: that should be it00:34
clarkbjust confirm kpti is on00:35
fungiyeah, did that earlier00:35
fungiit's good00:35
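
For reference, a hedged way to do that KPTI check on each host after it reboots; exact kernel message wording varies across the 2018-era patched kernels, so treat the strings here as assumptions:

    # Kernel log on patched kernels reads along the lines of
    # "Kernel/User page tables isolation: enabled"
    dmesg | grep -i 'page table'

    # Many patched kernels also advertise a "pti" CPU flag
    # (not universal across the early backports)
    grep -qw pti /proc/cpuinfo && echo 'pti flag present'
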
clarkbI am rebooting compute064 and 061 in chocolate now to pick up the changes00:35
fungii think i'm gonna drop off for the evening in that case and check back in tomorrow morning to see if there's anything left00:35
clarkbcool thanks for all the help. I think all that is left will be cleaning up odsreg and disabling any nova hypervisors we can't patch00:36
fungiyup, i'll try to remember to ask ttx about the odsreg server tomorrow00:36
clarkband maybe doing chocolate controller and baremetal00 as I am doing my best to computer with kids but its not easy :)00:36
fungisure00:36
fungimaybe someone in later timezones will pick up some of what's left for chocolate too00:37
pabelangerstill waiting for server to come back online from reboot00:37
fungiif they're feeling up to it00:37
pabelangerit does take a bit to boot however00:37
clarkbpabelanger: ya ~10 minutes or so is normal00:37
clarkbas fungi just demonstrated with his server00:38
fungithat's about how long compute012.vanilla.ic took00:38
fungiyeah00:38
pabelangeryah, sounds right00:38
* fungi fades into the aether00:38
pabelangerhey, it is back00:41
pabelangerchecking nova logs now00:42
clarkb064 is back up and I added the veth interfaces00:42
clarkband now 061 is back and veth interfaces added00:43
pabelangeroh right, linux-bridge00:43
clarkbya puppet configures it but that's slow00:43
clarkbthe controller doesn't need it iirc00:43
clarkbonly the computes00:43
pabelangergoing to kick.sh it00:44
clarkbhrm chocolate does have it00:44
clarkb*has veths on controller00:44
clarkbchocolate unreachable nodes are not all in the nova-manage service list with XXX entries00:46
clarkbso going to double check those from here to see if I managed to patch them00:46
pabelangerwelp, that's not good00:47
pabelanger[  302.415533] EXT4-fs (sda1): Remounting filesystem read-only00:47
pabelangerjust happened00:47
pabelangerI think neutron / rabbitmq spamming logs didn't help00:48
pabelangerclarkb: ^00:48
pabelangerso, I can reboot again, stop neutron / rabbit, get veth back online00:48
pabelangerbut, since HDD is dying, maybe we should just let it00:49
clarkbfun00:49
clarkbya I mean you can try a reboot but if it does that again I think we just disable infracloud vanilla in nodepool00:50
pabelangerkk00:50
clarkbok I'm fairly confident that chocolate computes are all patched now00:54
pabelangercool00:55
clarkbthe unreachable computes that don't show up as XXX in nova-manage service list don't show up at all :)00:55
clarkbconsidering vanilla's controller is afk maybe we save chocolate's controller and baremetal for when we aren't about to afk ourselves :)00:56
pabelangerk, got pings, kicking now00:56
pabelangerwell, trying to SSH00:57
pabelangerdoesn't look good :(00:57
clarkbwomp womp00:58
clarkbso thats at least 3 dead machines in vanilla from patching00:58
pabelangerport 22 open, no reply00:58
clarkbcontroller00 and compute017 and compute02700:58
pabelangerI'm guessing HDD died again00:58
clarkbchocolate actually does seem to be in much better shape00:58
pabelangeryah00:58
clarkbunfortunately our baremetal controller is on the vanilla hardware :/00:58
pabelangerokay, so propose patch to disable vanilla in nodepool, then deal with migrating compute nodes00:59
clarkbya I think so00:59
clarkbwe can do the migrating of computes later00:59
pabelangersure00:59
pabelangerremote:   https://review.openstack.org/532705 Set max-server to 0 for infracloud-vanilla01:02
pabelangerclarkb: ^ tomorrow I'll clean up images, so we don't try to upload DIBs to it01:02
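
For the image side, a hedged sketch of what that cleanup tends to look like with the v3 nodepool CLI; the image name and IDs below are placeholders, not values from this incident:

    # List provider uploads and local DIB builds still referencing infracloud-vanilla
    nodepool image-list | grep infracloud-vanilla
    nodepool dib-image-list

    # Delete a specific upload from the provider (image name and IDs are placeholders)
    nodepool image-delete --provider infracloud-vanilla \
        --image ubuntu-xenial --build-id 0000000001 --upload-id 0000000001
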
*** rlandy|bbl is now known as rlandy03:50
*** rosmaita has quit IRC04:34
*** rlandy has quit IRC04:53
*** rosmaita has joined #openstack-infra-incident10:32
*** panda|rover|afk has quit IRC11:24
*** panda has joined #openstack-infra-incident11:38
*** panda has quit IRC11:59
*** panda has joined #openstack-infra-incident12:13
*** rosmaita has quit IRC12:34
*** panda is now known as panda|ruck13:29
*** rlandy has joined #openstack-infra-incident13:36
*** rosmaita has joined #openstack-infra-incident13:41
*** rosmaita has quit IRC14:34
*** Shrews has joined #openstack-infra-incident14:45
ShrewsFYI, from what I can determine from nodepool-launcher logs, looks like we temporarily lost connection to our ZK server around 22:27 UTC yesterday. I think it's around that time we began seeing znodes in ZK in the 'deleting' state and locked. The few I manually checked did not actually exist in the provider, yet the znode remained. After the launcher restart, seems to be that most did not exist in the provider, but we could now delete the znodes.14:47
ShrewsI can't find how this might happen in the launcher code, but will keep looking.14:47
Shrewshttps://review.openstack.org/532823  might help debug this in the future14:48
ShrewsWe've lost the ZK connection before and nodepool recovered fine, so this is new.14:49
Shrewsmaybe kazoo itself got confused14:49
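
A hedged way to poke at those stuck znodes directly, assuming nodepool's usual /nodepool/nodes tree and the zkCli.sh shipped with the zookeeper package; the node ID shown is a placeholder:

    # Four-letter-word health check against the ZK server
    echo stat | nc localhost 2181 | head -n5

    # Inspect the node znodes and a suspect node's lock children
    # (node ID 0000012345 is a placeholder)
    /usr/share/zookeeper/bin/zkCli.sh -server localhost:2181 ls /nodepool/nodes
    /usr/share/zookeeper/bin/zkCli.sh -server localhost:2181 get /nodepool/nodes/0000012345
    /usr/share/zookeeper/bin/zkCli.sh -server localhost:2181 ls /nodepool/nodes/0000012345/lock
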
clarkbShrews: not certain but that is probably when we rebooted the server running zk16:27
*** rosmaita has joined #openstack-infra-incident16:31
fungiShrews: 22:27 is roughly when we were restarting everything, so losing connections to stuff is expected16:37
fungispecifically, nodepool.o.o was rebooted16:37
clarkbI've just checked dmesg on baremetal00.vanilla.ic.openstack.org and there are no disk or memory errors16:50
clarkbI think baremetal00.vanilla.ic.openstack.org and controller00.chocolate.ic.openstack.org are the last infracloud nodes that need patching16:50
clarkbI think before we patch and reboot controller00 we should try to understand if that will break nodepool again16:55
pabelangeragree16:59
fungilooks like odsreg was down due to a botched migration. i issued a ctrl-alt-del at the oob console and it came back online17:00
fungiokay, after confirming with ttx that it really could go, it's gone now17:03
fungiour openstackci tenant in rax is pretty thoroughly cleaned up at this point17:04
fungiwell, at least as far as server instances are concerned. we could likely still stand to look through trove instances, snapshots, cinder volumes, swift containers...17:04
-openstackstatus- NOTICE: Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:47
clarkbknowing what I now know after reading logs and having shrews bring me up to speed I don't think restarting controller00 is currently safe (I mean if it comes up quickly we probably won't notice but if it doesn't we likely will have node_failures again)17:49
*** weshay is now known as weshay_interview18:14
clarkbptg flights are quite a bit more expensive than I was expecting18:50
clarkbI guess thats the downside to flying somewhere with cheap lodging and food, its harder to get to18:50
* clarkb books anyway18:50
clarkber ww18:50
clarkbbut hey now you know :)18:50
clarkbalso if flying through london beware the 3 airport problem18:50
corvusclarkb: try staying across a saturday.  could be cheaper even including the extra hotel.  :)19:11
corvussfo<->dub was < $700 for me19:12
clarkbcorvus: ya the google flights breakdown says leaving sunday and monday is roughly the same so doing monday flight back19:14
clarkbbut still >90019:14
*** weshay_interview is now known as weshay19:16
*** panda|ruck is now known as panda|ruck|afk19:20
pabelangercan confirm staying across a saturday was cheaper, fly out on the Friday night19:37
pabelangerflying in I mean19:38
*** rlandy is now known as rlandy|bbl22:52
clarkbnodepool launchers are now running code that will handle max-servers:0 properly23:43
clarkbI don't think I'll do it today because I feel like I'm running out of steam early, but will try to finish up infracloud patching tomorrow (and set chocolate's max-server to 0 first)23:43
