*** anteaya has quit IRC | 05:26 | |
*** anteaya has joined #openstack-infra-incident | 05:29 | |
*** jhesketh has quit IRC | 10:25 | |
*** jhesketh has joined #openstack-infra-incident | 10:31 | |
*** fungi has joined #openstack-infra-incident | 21:12 | |
*** ChanServ changes topic to "Incident in progress: https://etherpad.openstack.org/p/venom-reboots" | 21:12 | |
*** clarkb has joined #openstack-infra-incident | 21:15 | |
fungi | same bat time, different bat channel | 21:15 |
---|---|---|
clarkb | how do we want to do this, just claim nodes off the etherpad? | 21:16 |
fungi | probably, after a little planning | 21:16 |
pleia2 | clarkb: unrelated to dns, gid woes mean I still don't have an account on git-fe01 & 02 | 21:16 |
clarkb | pleia2: oh darn | 21:16 |
pleia2 | we'll fix that up later | 21:17 |
fungi | oh, right, we've not cleaned up the uids/gids on those two have we? | 21:17 |
clarkb | apparently not, which makes it hard for pleia2 to do the load balancing dance for git* safe reboots | 21:17 |
pleia2 | right | 21:17 |
clarkb | why don't I go take a look and see how bad the gid situation is right now | 21:18 |
clarkb | maybe we can fix that real quick | 21:18 |
mordred | hey all | 21:18 |
pleia2 | welcome to the party, mordred | 21:18 |
lifeless | so what is venom ? I can't read the ticket | 21:18 |
mordred | lifeless: it's the latest marketing-named CVE | 21:19 |
clarkb | lifeless: it's a bug where VMs can break into the hypervisor via floppy drive code in qemu | 21:19 |
pleia2 | lifeless: guest-executable buffer overflow of the kernel floppy thing | 21:19 |
lifeless | clarkb: LOOOOOOOL | 21:19 |
*** zaro has joined #openstack-infra-incident | 21:19 | |
anteaya | http://seclists.org/oss-sec/2015/q2/425 | 21:19 |
clarkb | lifeless: so we can reboot gracefully or we get rebooted forcefully in ~24 hours | 21:19 |
lifeless | anteaya: thanks | 21:19 |
anteaya | welcome | 21:19 |
lifeless | much brilliant, such wow | 21:19 |
pleia2 | apparently floppy stuff is hard to remove from the kernel (I was surprised it was included at all in base systems) | 21:20 |
fungi | clarkb: yeah, remapping the uids/gids on those should be relatively trivial (i hope) | 21:20 |
fungi | pretty sure she's just conflicting with some unused account called "admin" | 21:21 |
pleia2 | admin is the former sudo group on ubuntu, only kept for backwards compatibility | 21:21 |
*** SpamapS has joined #openstack-infra-incident | 21:21 | |
pleia2 | unlikely that we're using it | 21:21 |
clarkb | fungi: group admin has gid 2015 which is pleia2's group gid | 21:21 |
clarkb | fungi: but sudo group needs to be moved too | 21:22 |
clarkb | so basically I need to find where group owner is 2015 or 2016, make sure I am root so I can chown files after the re-gidding (as I may break sudo), then run puppet | 21:22 |
fungi | k. neither of those should actually own any files, i don't think | 21:22 |
clarkb | fungi: I am going to find out shortly :) | 21:22 |
*** nibalizer has joined #openstack-infra-incident | 21:22 | |
fungi | that sounds like a good plan | 21:23 |
* mordred isn't QUITE back online yet - just got back from taking sandy to the airport - will be useful in a few ... | 21:23 | |
fungi | if nobody disagrees that the entries marked with * can be rebooted now, i'll go ahead and start in on those. take note that we need to not just reboot them from the command line. we need to halt them and then nova reboot --hard | 21:24 |
clarkb | sudo find / -gid 2015 and sudo find / -gid 2016 don't find any files on git-fe01, so I guess it's time to become root, change gid, then puppet | 21:25 |
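A rough sketch of the re-gidding sequence clarkb outlines above, assuming the gids mentioned (2015/2016); the replacement gids and the exact puppet invocation are illustrative, not what was actually run:

```bash
# stay root for the whole operation, since moving the sudo group around
# can break sudo mid-change
sudo -i

# confirm nothing on disk is owned by the gids about to move
find / -xdev \( -gid 2015 -o -gid 2016 \) -print

# shift the conflicting groups onto unused gids (3015/3016 are made up)
groupmod -g 3015 admin
groupmod -g 3016 sudo

# if the find above had turned anything up, those files would need
# re-owning, e.g.: find / -xdev -gid 2015 -exec chgrp -h 3015 {} +

# then let puppet recreate the per-user groups with the expected gids
puppet agent --test
```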
clarkb | fungi: go for it, maybe put rebooting process on etherpad for ease of finding | 21:26 |
pleia2 | I think stackalytics.openstack.org should be safe to reboot too | 21:26 |
fungi | will do | 21:26 |
fungi | oh, right. that's not actually production anything | 21:26 |
jeblair | so for design-summit-prep, i'm not sure what to do... i think we can/should reboot it, but none of us actually knows anything about the app on there, so i don't know what happens when we reboot it and it doesn't come up | 21:28 |
jeblair | we should really stop just handing root to people | 21:29 |
mordred | ++ | 21:29 |
*** greghaynes has joined #openstack-infra-incident | 21:29 | |
fungi | that one's been in a transitional state waiting on ttx to work with someone on writing puppet modules for the apps on there | 21:30 |
clarkb | pleia2: can you hop on git-fe01 and check that it works for you? sudo too? | 21:30 |
jeblair | tbh, my inclination is to reboot it and since there is nothing described by puppet running on it, nor any documentation, call our work done. | 21:30 |
pleia2 | mordred: is test-mordred-config-drive deletable? (see pad) | 21:30 |
lifeless | live migration should mitigate it... if they had that working | 21:30 |
fungi | i've pinged him in #-infra in case he's around | 21:30 |
jeblair | it's pretty late for ttx | 21:30 |
fungi | lifeless: yep! too bad | 21:30 |
pleia2 | clarkb: I'm in, thanks! | 21:30 |
pleia2 | (with sudo, yay) | 21:31 |
jeblair | i'm equally okay with "do nothing and let rax take care of it" | 21:31 |
clarkb | pleia2: awesome, I am working on fe02 now | 21:31 |
fungi | i'm starting down the easy reboots list in alpha order | 21:31 |
clarkb | pleia2: so first thing we need to do is take fe01 or fe02 out of the DNS round robin, then we can take one backend out of haproxy at a time on the other frontend and reboot the backend, put it back into service, rinse and repeat | 21:32 |
clarkb | pleia2: then when that is all done add fe02 back to DNS round robin, remove 01, reboot 01 | 21:32 |
clarkb | pleia2: and the only people that should see any downtime are those that hardcode a git frontend | 21:32 |
pleia2 | clarkb: ok, how are we interacting with dns for this? | 21:32 |
* mordred is going to start on the easy reboots in reverse alpha order | 21:33 | |
mordred | will meet fungi in the middle | 21:33 |
fungi | thanks mordred | 21:33 |
fungi | pleia2: clarkb: there is a rax dns client, but probably webui is easier for this | 21:33 |
pleia2 | fungi: if you could toss the exact instructions you're using for actually-effective reboots in the pad, it would help us be consistent | 21:33 |
clarkb | mordred: keep in mind a normal reboot is not good enough | 21:34 |
fungi | well, maybe not actually. since we just need to delete and then create a and aaaa records. cli client may be easier | 21:34 |
clarkb | (so lets all know how to reboot before we reboot) | 21:34 |
mordred | yah | 21:34 |
mordred | we apparently need to halt. then reboot --hard | 21:34 |
mordred | yah? | 21:34 |
jeblair | how can you do anything after halting? | 21:34 |
clarkb | mordred: thats what fungi said, but ^ | 21:34 |
clarkb | jeblair: I think you have to nova reboot it | 21:35 |
mordred | yah | 21:35 |
clarkb | so in instance do shutdown -h now | 21:35 |
jeblair | oh, "nova reboot --hard" | 21:35 |
mordred | yah | 21:35 |
clarkb | then go to nova client and reboot it | 21:35 |
mordred | sorry | 21:35 |
lifeless | list(parse_requirements('foo==0.1.dev0'))[0].specifier.contains(parse_version('0.1.dev0')) | 21:35 |
lifeless | True | 21:35 |
lifeless | man, copypasta all over the place today | 21:35 |
mordred | lifeless: wrong channel | 21:35 |
clarkb | pleia2: git-fe02 is ready for you to test there | 21:35 |
fungi | pleia2: i logged into puppetmaster, sourced our nova creds for openstackci in dfw, then `sudo ssh $hostname` and run `halt`, then when i get kicked out start pinging it, and then `nova reboot --hard $hostname` | 21:35 |
lifeless | mordred: I know, it was my belly on my mouse pad | 21:35 |
fungi | once it no longer responds to ping | 21:35 |
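Fungi's procedure, collected into one hedged sketch (the credentials file and hostname are placeholders); the nova-side hard reboot matters because only it gives the guest a fresh qemu process, which a reboot issued inside the guest would not:

```bash
# on puppetmaster, with the openstackci DFW credentials sourced
source ~/openstackci-dfw.rc          # placeholder path
host=example01.openstack.org         # placeholder hostname

# halt inside the guest so services stop cleanly
sudo ssh "$host" halt

# wait until the instance stops answering pings
while ping -c1 -W2 "$host" >/dev/null 2>&1; do sleep 5; done

# then hard reboot from the nova side so the instance comes back under a
# new (patched) qemu process
nova reboot --hard "$host"
```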
mordred | oh good | 21:37 |
mordred | we have 2 zuul-devs | 21:37 |
mordred | should I delete the one that is not the one dns resolves to? | 21:37 |
fungi | yeah, should be safe at this point | 21:38 |
jeblair | mordred: yes | 21:38 |
mordred | and we have more than one translate-dev | 21:40 |
mordred | pleia2: ^^ delete the one that is not in DNS? or reboot it? | 21:40 |
jeblair | is the old one the pootle one? | 21:40 |
mordred | maybe? | 21:40 |
jeblair | mordred: paste both ips? | 21:41 |
pleia2 | the old pootle one is deleteable | 21:41 |
mordred | afd4a8d9-98a7-4a21-a827-33106abeeb8a | translate-dev.openstack.org | ACTIVE | - | Running | public=104.130.243.78, 2001:4800:7819:105:be76:4eff:fe04:4758; private=10.209.160.236 | | 21:41 |
mordred | | f1103432-ae29-4ec2-87e0-39920429ac50 | translate-dev.openstack.org | ACTIVE | - | Running | public=23.253.231.198, 2001:4800:7817:103:be76:4eff:fe04:545a; private=10.208.170.81 | | 21:41 |
pleia2 | new translate-dev server is 104.130.243.78 | 21:41 |
mordred | 104. is the one in dns | 21:41 |
pleia2 | yeah, can kill the 23. one afaic | 21:41 |
mordred | also wiki.o.o is not on the list - I think it's a "can reboot any time" yeah? | 21:41 |
jeblair | 104.130.243.78 does not respond for me | 21:42 |
fungi | mordred: it's likely not pvhvm | 21:42 |
mordred | fungi: ah - we only have to delete pvhvm? | 21:42 |
mordred | reboot? | 21:42 |
fungi | mordred: pvhvm is affected, pv is not | 21:42 |
mordred | gotcha | 21:42 |
* mordred removes from list | 21:42 | |
fungi | basically this is the list which rax put in the ticket | 21:42 |
mordred | ah | 21:43 |
mordred | so - translate.openstack.org is not pvhvm? | 21:43 |
pleia2 | jeblair: oh dear, maybe zanata went sideways when I wasn't looking | 21:43 |
mordred | translate is a standard - that's ok - we can think about that later - if we're happy with performance, then it's likely fine :) | 21:44 |
fungi | puppetmaster root has cached the wrong ssh host key for ci-backup-rs-ord | 21:47 |
clarkb | pleia2: I think we should remove git-fe02 records from the git.o.o name to start. git.o.o A 23.253.252.15 and git.o.o AAAA 2001:4800:7818:104:be76:4eff:fe04:7072 | 21:47 |
fungi | should i correct it, or work around it (to avoid ansible doing things to it)? | 21:47 |
jeblair | fungi: work around it for now; not sure what the state is with that | 21:47 |
pleia2 | clarkb: yep, sounds good | 21:47 |
fungi | jeblair: thanks, will do | 21:47 |
clarkb | pleia2: also note the TTL (I think it's the minimum of 5 minutes for round robining but we will have to add that back in when we make the record for it again) | 21:48 |
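The record edits themselves went through Rackspace DNS, but a quick outside check of the round robin and its TTL is just dig (a sketch; the ~300 second figure is the TTL clarkb mentions):

```bash
# current A/AAAA answers for the round robin, TTL in the second column
dig +noall +answer git.openstack.org A git.openstack.org AAAA

# after removing git-fe02's records, watch until it drops out of the
# answers, then wait out roughly one more TTL (~300s) for caches
watch -n 30 'dig +noall +answer git.openstack.org A git.openstack.org AAAA'
```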
fungi | mordred: i'll skip hound since i'm not sure what your plan is with that or if it needs special care | 21:48 |
pleia2 | my favorite part is how their interface doesn't show me the whole ipv6 address in the list | 21:48 |
clarkb | pleia2: :( | 21:48 |
fungi | rax dns is a special critter, to be sure | 21:48 |
pleia2 | modify record lets me see it and cancel ;) | 21:49 |
clarkb | pleia2: once you have those records removed you will want to hop on git-fe02 and sudo tail -f /var/log/haproxy.log and wait for connections to stop coming in | 21:50 |
clarkb | pleia2: at that point we can do the reboot procedures | 21:50 |
mordred | AND - in a fit of consistency - we have 2 review-dev | 21:50 |
clarkb | mordred: is one trusty the other precise? | 21:51 |
clarkb | mordred: if so you can probably remove the precise node | 21:51 |
clarkb | pleia2: once you are about at that point let me know and I can dig up my haproxy knowledge | 21:51 |
mordred | yup | 21:51 |
pleia2 | clarkb: busy servers these ones | 21:51 |
fungi | jeblair: there are a jeblairtest2 and jeblairtest3 as well which aren't on the list but shall i delete them while i'm here? | 21:52 |
mordred | well, that's exciting | 21:53 |
mordred | I halted review dev. it's not returning pings - BUT - nova won't reboot it because it's in state "powering off" | 21:53 |
jeblair | fungi: please | 21:53 |
fungi | will do | 21:53 |
anteaya | can someone with ops spare a moment to kick a spammer from -infra? | 21:56 |
mordred | fungi: hound done. no special care - it's all happy and normal | 21:57 |
fungi | i've also been sshing back into each and checking uptime to make sure it really rebooted | 21:58 |
clarkb | pleia2: it's been > TTL now ya? | 21:58 |
pleia2 | clarkb: down to a trickle, so should be ready soon | 21:58 |
mordred | so - anybody have any ideas what to do about a vm stuck in nova powering-off state? | 21:58 |
clarkb | pleia2: ya and if these don't go away after another minute or two I think we blame their bad DNS resolving | 21:58 |
pleia2 | clarkb: nods | 21:59 |
clarkb | mordred: I want to say in hpcloud when that happened you had to contact support, unsure if rax is different | 21:59 |
mordred | if there isn't a quick answer from john - I'm going to leave it - because it's review-dev and it'll get hard-rebooted tomorrow | 22:03 |
clarkb | pleia2: found the magic socat commands for haproxy control in my fe02 scrollback | 22:03 |
clarkb | mordred: wfm | 22:03 |
pleia2 | clarkb: cool, I'll have a peek | 22:03 |
fungi | okay, got the all-clear from reed to reboot groups and ask so doing those next | 22:03 |
clarkb | pleia2: getting a paste up for you | 22:03 |
pleia2 | clarkb: thanks | 22:03 |
mordred | fungi: I am not using puppetmaster - so happy for you to reboot it any time | 22:03 |
clarkb | pleia2: http://paste.openstack.org/show/222205/ | 22:04 |
clarkb | pleia2: basically haproxy is organized as frontends (e.g. balance_git_daemon) and backends for each frontend, so we have to disable the full set of pairs for all of those per backend | 22:05 |
clarkb | pleia2: once that is done it should be safe to reboot the backend | 22:05 |
pleia2 | clarkb: ok, so this series of commands + confirm they come back for all the git0x servers? | 22:06 |
clarkb | pleia2: then off to the next backend, when all backends are done we add git-fe02 back to dns, remove git-fe01 and then reboot git-fe01 | 22:06 |
clarkb | pleia2: yup and run that on git-fe01 since it's the only haproxy balancing traffic right now | 22:06 |
pleia2 | clarkb: sounds good, on it | 22:07 |
clarkb | pleia2: but basically go host by host, rebooting then reenabling, and we shouldn't have any downtime | 22:07 |
* pleia2 nods | 22:07 | |
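The paste above isn't preserved here, but draining one backend over the haproxy admin socket generally looks like the sketch below; the socket path and every pool name except balance_git_daemon are assumptions:

```bash
SOCK=/var/lib/haproxy/stats   # assumed admin socket path

# list proxy/server pairs so the exact names for the host being drained
# can be confirmed first
echo "show stat" | sudo socat stdio "UNIX-CONNECT:$SOCK" | cut -d, -f1,2 | grep git01

# take git01 out of every pool before rebooting it
for pool in balance_git_http balance_git_https balance_git_daemon; do
    echo "disable server $pool/git01.openstack.org" | sudo socat stdio "UNIX-CONNECT:$SOCK"
done

# ...halt git01, nova reboot --hard it, wait for it to come back, then:
for pool in balance_git_http balance_git_https balance_git_daemon; do
    echo "enable server $pool/git01.openstack.org" | sudo socat stdio "UNIX-CONNECT:$SOCK"
done
```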
clarkb | we should probably ansible this for the general case, not sure we want to ansible it for the hard reboot case since one already broke on us | 22:07 |
pleia2 | heh, right | 22:07 |
jeblair | thinking ahead to the harder servers; we have enough SPOFS that we may just want to take an outage for all of them at once; maybe i'll start working on identifying what we should group together | 22:09 |
mordred | jeblair: ++ | 22:09 |
clarkb | jeblair: sounds good | 22:09 |
mordred | clarkb: yah - I was thinking we should maybe have some ansibles to handle "we need to reboot the world" - I'm sure it'll come up again | 22:09 |
SpamapS | kernel updates come to mind | 22:10 |
fungi | mordred: are you rebooting openstackid.org? if not, i'll get it next | 22:10 |
mordred | fungi: yeah - I got it | 22:10 |
fungi | thanks | 22:11 |
*** reed has joined #openstack-infra-incident | 22:11 | |
jeblair | fungi, clarkb: er, i see the chat in etherpad; what should we do with the ES machines? | 22:12 |
clarkb | jeblair: fungi we can roll through them right now if we want, we can have rax just reboot all of them, or we can reboot all of them | 22:12 |
fungi | i'm inclined to halt and hard reboot them ourselves since no idea if rax halts these or just blip! | 22:13 |
clarkb | jeblair: fungi: I am tempted to go ahead and reboot one then see how long recovery takes, then based on that either reboot the rest all at once or reboot them one by one | 22:13 |
pleia2 | clarkb: 01 down, 4 to go! | 22:13 |
mordred | clarkb: sounds reasonable | 22:13 |
fungi | also we have some opportunity to look carefully at the systems as they come back up and identify obvious issues immediately rather than when we happen to notice later | 22:13 |
clarkb | fungi: ++ | 22:13 |
mordred | subunit-worker01 should be able to just be rebooted too, no? | 22:13 |
jeblair | clarkb: okay, we'll be taking "the ci system" down anyway, so if it's easier to do all at once, we will have that opportunity | 22:13 |
clarkb | mordred: yup | 22:13 |
* mordred does subunit-worker | 22:14 | |
jeblair | well, it may lose info | 22:14 |
clarkb | jeblair: well recovery from reboot all at once is a many hours thing too I think | 22:14 |
jeblair | which is why i marked it with [1] | 22:14 |
jeblair | mordred: ^ | 22:14 |
clarkb | jeblair: and logstash.o.o will just queue up the work for us so ES can be in that state whenever and we should be fine | 22:14 |
clarkb | jeblair: biggest impact is the delay on indexing until we catch back up again | 22:14 |
jeblair | clarkb: the system is not busy and slowing, so we may be in luck there | 22:15 |
fungi | okay, so logstash workers and elasticsearch group together with [1] as well? | 22:15 |
clarkb | wfm | 22:16 |
jeblair | oh, logstash workers can go at any time, right? | 22:16 |
clarkb | jeblair: actually ya, let me just go do those right now | 22:16 |
fungi | they used to go at any time they wanted, so i suppose so ;) | 22:16 |
jeblair | i vote we keep them out of [1] if that's the case; there are 16 others that we aren't rebooting anyway | 22:16 |
fungi | agreed | 22:16 |
clarkb | yup yup I am doing logstash workers now | 22:16 |
jeblair | so probably can actually just do all 4 right now | 22:16 |
jeblair | (at once) | 22:16 |
mordred | jeblair: sorry. IRC race-condition subunit-worker01 rebooted | 22:17 |
jeblair | mordred: i did have the [1] in the etherpad before you did that | 22:17 |
clarkb | pleia2: I think you can reboot git-fe02 whenever you have time, it's down to like 2 requests per minute | 22:17 |
fungi | is the plan to put zuul into graceful shutdown long enough to quiesce jobs and then stop the zuul mergers and bring zuul back up? | 22:18 |
fungi | that would avoid running jobs while we do the pypi mirrors too | 22:18 |
fungi | then once the rest of the [1] group is done, bring the mergers back up | 22:18 |
pleia2 | git02 seems stuck after the halt, not pingable, but nova shows it as running; for git01 I waited to run the nova reboot --hard until it was in shutdown mode | 22:18 |
jeblair | fungi: we have to take review.o.o down, i think we should just do it without quiescing | 22:19 |
fungi | jeblair: oh, right i forgot it was in that set | 22:19 |
fungi | pleia2: i guess give it a few minutes. if we have to open a ticket for it, we can limp along down one member until rax fixes it for us | 22:19 |
jeblair | yeah, i'd think about a way to do it more gracefully, but there's no hiding that, so we might as well make it easy on ourselves and just reboot it all at once | 22:20 |
fungi | the joys of active redundancy | 22:20 |
jeblair | we can save the zuul queues | 22:20 |
fungi | fair enough | 22:20 |
clarkb | pleia2: I have nothing better than what fungi suggests | 22:20 |
pleia2 | fungi: oh good, looks like all it needed was 5 minutes to die, on to 03! | 22:20 |
fungi | heh | 22:20 |
clarkb | doing logstash-worker17 now | 22:20 |
fungi | sometimes rax likes to just sit on api calls too. it's an openstack thing | 22:21 |
mordred | ooh! review-dev finally rebooted | 22:22 |
clarkb | they are just teaching us patience | 22:22 |
mordred | awesome | 22:22 |
fungi | i have a feeling rax didn't tune their environment to expect all customers to want to reboot everything in one day | 22:22 |
pleia2 | indeed | 22:23 |
mordred | weird | 22:23 |
fungi | pleia2: clarkb: let me know when you expect to be idle on the puppetmaster for a few minutes and i'll reboot it too | 22:23 |
clarkb | fungi: let me get through the logstash workers, will let you know | 22:23 |
fungi | there's no hurry | 22:24 |
pleia2 | fungi: will do, just going to finish up this git fun | 22:24 |
jeblair | okay, i laid out a plan for the group[1] reboots | 22:26 |
clarkb | after halting 18,19,20 they all show active according to nova show, are we waiting for them to be shutdown before hard rebooting? | 22:26 |
mordred | jeblair: I agree with your plan | 22:27 |
fungi | yeah, lgtm | 22:27 |
mordred | clarkb: I have been | 22:27 |
clarkb | ok I will wait then | 22:27 |
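A small polling sketch of the "wait for nova to catch up" approach clarkb opts for here (the hostname is a placeholder; as the later discussion shows, skipping the wait also worked out):

```bash
host=logstash-worker18.openstack.org   # placeholder

# poll until nova stops reporting the halted guest as ACTIVE, then
# issue the hard reboot
while nova show "$host" | grep -qw ACTIVE; do
    sleep 10
done
nova reboot --hard "$host"
```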
fungi | might want to also wait for the pypi mirrors to boot before starting zuul? | 22:27 |
fungi | otherwise there could be job carnage | 22:28 |
clarkb | ++ | 22:28 |
fungi | also waiting to bring up the zuul mergers until most worker types have registered in zuul again could avoid some not_registered results? | 22:28 |
jeblair | fungi: true, i was mostly thinking of waiting for the services though; we should probably wait until the reboots have been issued for all of the vms | 22:28 |
fungi | though i suppose just not readding the saved queues for a bit would work as well | 22:29 |
fungi | worst case a couple of changes get uploaded to gerrit and their jobs are kicked back because it was too soon | 22:30 |
jeblair | or waiting until nodepool has ready nodes of each type (which it may immediately since the system is not busy) | 22:30 |
fungi | yeah, i expect that to be quick | 22:31 |
jeblair | 200 concurrent jobs running is "not busy" | 22:31 |
fungi | heh | 22:32 |
jeblair | do folks want to claim nodes in the group[1] by adding their names to the ep? | 22:32 |
fungi | also cinder seems pretty broken | 22:32 |
fungi | yep, will do | 22:32 |
jeblair | i'll do the gerrit/zuul/nodepool bits | 22:32 |
mordred | I'll just watch I guess | 22:33 |
jeblair | (and btw, not stopping nodepool is intentional; it'll reduce number of aliens) | 22:33 |
jeblair | mordred: want the zuul mergers? | 22:33 |
mordred | oh - sure! | 22:34 |
jeblair | status alert Gerrit and Zuul are going offline for reboots to fix a security vulnerability. | 22:35 |
jeblair | ? | 22:35 |
fungi | halting nodepool.o.o will likely send a term to nodepoold as init tries to gracefully stop things | 22:35 |
jeblair | fungi: yup | 22:35 |
fungi | jeblair: that looks good | 22:35 |
jeblair | clarkb, fungi, mordred: are you standing by? | 22:36 |
mordred | jeblair: yup | 22:36 |
fungi | on hand | 22:36 |
clarkb | I am here | 22:37 |
clarkb | almost done with the last logstash worker it took forever to stop | 22:37 |
jeblair | i'll send the announcement then wait a bit, then give you the go ahead when i've stopped zuul and gerrit | 22:37 |
jeblair | #status alert Gerrit and Zuul are going offline for reboots to fix a security vulnerability. | 22:37 |
openstackstatus | jeblair: sending alert | 22:37 |
*** reed has left #openstack-infra-incident | 22:37 | |
-openstackstatus- NOTICE: Gerrit and Zuul are going offline for reboots to fix a security vulnerability. | 22:39 | |
*** ChanServ changes topic to "Gerrit and Zuul are going offline for reboots to fix a security vulnerability." | 22:39 | |
clarkb | that's exciting, the logstash indexer refuses to keep running on 19 and doesn't log why it failed | 22:40 |
clarkb | but we are super redundant there so I can switch to the [1] group whenever | 22:40 |
openstackstatus | jeblair: finished sending alert | 22:42 |
jeblair | clarkb, fungi, mordred: gerrit is stopped, clear to reboot | 22:42 |
mordred | rebooting | 22:42 |
clarkb | all the ES's are halted | 22:43 |
clarkb | waiting for them to not be ACTIVE in nova show before rebooting | 22:44 |
fungi | my 4 are back up, i'm checking their services now | 22:44 |
clarkb | fungi: did you wait for them to not be active? | 22:44 |
clarkb | I am trying to decide if that is necessary | 22:44 |
fungi | i did not | 22:44 |
fungi | zuul is already stopped | 22:45 |
clarkb | ok I won't wait either then | 22:45 |
fungi | so it's not like job results will matter at this point | 22:45 |
fungi | none that are running will report | 22:45 |
mordred | mine are all rebooted and services are running | 22:45 |
jeblair | (nodepool is up and running) | 22:46 |
fungi | yep, all mine are looking good | 22:46 |
fungi | rebooting gerrit now | 22:46 |
pleia2 | clarkb: rebooting git-fe02 now, once that's back up I'll readd to dns, remove git-fe01 and wait for that to trickle out (and tell fungi to reboot puppetmaster) | 22:46 |
mordred | hour and a half isn't terrible for an emergency reboot the world | 22:48 |
clarkb | elasticsearch is semi up, its red and 5/6 nodes are present | 22:48 |
fungi | gerrit's on its way back up | 22:49 |
pleia2 | fungi: I'm done with puppetmaster for now, just have a git-fe01 reboot to do, but will wait on dns for that so reboot away | 22:49 |
fungi | review.o.o looks like it's working | 22:49 |
mordred | I agree that review.o.o looks like it's running | 22:49 |
fungi | jeblair: should be set to start zuul? | 22:49 |
anteaya | fungi: slow but working... | 22:50 |
jeblair | [2015-05-13 22:50:02,782] WARN org.eclipse.jetty.io.nio : Dispatched Failed! SCEP@34ee175d{l(/127.0.0.1:58159)<->r(/127.0.0.1:8081),d=false,open=true,ishut=false,oshut=false,rb=false,wb=false,w=true,i=1r}-{AsyncHttpConnection@242369e7,g=HttpGenerator{s=0,h=-1,b=-1,c=-1},p=HttpParser{s=-14,l=0,c=0},r=0} to org.eclipse.jetty.server.nio.SelectChannelConnector$ConnectorSelectorManager@10fdcf3a | 22:50 |
clarkb | es02 ran out of disk again | 22:50 |
clarkb | pleia2: ^ same issue as last time, /me makes a note to logrotate for it better | 22:50 |
jeblair | wow there were a lot of those errors in the gerrit log | 22:50 |
jeblair | but it seems happier now | 22:50 |
fungi | it was probably getting slammed by people desperately reloading while it was starting up | 22:51 |
jeblair | okay i will restart zuul | 22:51 |
clarkb | wait hrm | 22:52 |
fungi | hrm? | 22:52 |
jeblair | clarkb: i've already started zuul and begun re-enqueing | 22:52 |
clarkb | can someone else log into es02 | 22:52 |
fungi | on it | 22:52 |
clarkb | jeblair: sorry I think you are ok to start zuul | 22:52 |
jeblair | ok, will continue | 22:52 |
clarkb | fungi: tell me if mount looks funny | 22:52 |
clarkb | jeblair: but it looks like es02 came up without a / | 22:53 |
fungi | um... wow! | 22:53 |
clarkb | I wonder if this is what hit worker19 | 22:53 |
fungi | i mean, that can happen, technically, as / is mounted at boot and might not be in mtab | 22:53 |
clarkb | in any case we should probably do an audit of this | 22:53 |
pleia2 | clarkb: oops, I think I made a bug/story about that but never managed to cycle back to it | 22:53 |
jeblair | okay, zuul is up and there are a very small handful of not_registered; i think we can ignore them | 22:53 |
fungi | overflow on /tmp type tmpfs (rw,size=1048576,mode=1777) | 22:53 |
clarkb | pleia2: np | 22:53 |
jeblair | i will status ok? | 22:54 |
fungi | whazzat | 22:54 |
mordred | jeblair: ++ | 22:54 |
fungi | jeblair: go for it | 22:54 |
clarkb | jeblair: do we want to check the mount table everywhere first? | 22:54 |
clarkb | we may need to reboot more if we don't see those devices within a VM? | 22:54 |
jeblair | clarkb: review looks ok (it has a / and an /opt) | 22:55 |
clarkb | jeblair: cool thats probably the most important one | 22:55 |
clarkb | fungi: so short of rebooting, any ideas on fixing/debugging this? | 22:55 |
fungi | clarkb: i'm looking at dmesg. just a sec | 22:55 |
jeblair | #status ok Gerrit and Zuul are back online. | 22:55 |
openstackstatus | jeblair: sending ok | 22:55 |
jeblair | nodepool looks sane | 22:56 |
clarkb | ES is yellow despite this trouble | 22:56 |
clarkb | so it is recovering in the right direction (started red) | 22:57 |
fungi | dmesg looks like it properly remounted xvda1 rw... maybe mtab is corrupt | 22:57 |
jhesketh | Morning | 22:57 |
*** ChanServ changes topic to "Incident in progress: https://etherpad.openstack.org/p/venom-reboots" | 22:57 | |
-openstackstatus- NOTICE: Gerrit and Zuul are back online. | 22:57 | |
* jhesketh catches up on incident(s) | 22:58 | |
anteaya | jhesketh: https://etherpad.openstack.org/p/venom-reboots if you haven't found it yet | 22:58 |
fungi | also what is this tmpfs called "overflow" mounted at /tmp? it's 1mb in size, which seems dysfunctional | 22:58 |
Clint | fungi: that's a "feature" for when / is full on bootup | 22:59 |
fungi | anteaya: i also put it in the channel topic | 22:59 |
clarkb | and / was full | 22:59 |
fungi | Clint: thank you! | 22:59 |
anteaya | fungi: thank you | 22:59 |
clarkb | because ES rotates logs but doesn't have a limit on how far back to keep | 22:59 |
fungi | so, yes, that's interesting and i think explains some of this then | 22:59 |
fungi | i wonder if that is why root isn't actually showing up mounted then | 23:00 |
openstackstatus | jeblair: finished sending ok | 23:00 |
clarkb | fungi: seems likely | 23:00 |
fungi | clarkb: i guess just clean up the root filesystem and reboot again and it will probably go back to "normal" | 23:01 |
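For reference, the condition Clint describes is quick to confirm from a shell (a sketch; the mount line is the one fungi pasted above):

```bash
# a (nearly) full root filesystem at boot is what triggers the fallback
df -h /

# the tell-tale 1MB emergency tmpfs Ubuntu mounts over /tmp in that case
mount | grep overflow
# overflow on /tmp type tmpfs (rw,size=1048576,mode=1777)

# per fungi: free space on /, reboot once more, and the normal /tmp returns
```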
fungi | Clint: you gave me something new to read about | 23:01 |
clarkb | fungi: ya I have already killed the log files taking up all the disk | 23:01 |
Clint | fungi: enjoy | 23:01 |
clarkb | so I will go ahead and reboot it | 23:01 |
fungi | sounds good | 23:01 |
clarkb | then make it more of a priority to have logrotate clean up after ES | 23:01 |
jhesketh | Looks like good progress on the reboots has been made | 23:01 |
jhesketh | Let me know if I can help | 23:01 |
fungi | but that being the case, i think this probably is not some mysterious endemic issue we should go hunting on our other servers looking for | 23:02 |
jeblair | jhesketh: thanks, i think we're nearing the end of it | 23:02 |
fungi | it's gone pleasantly quickly | 23:02 |
jhesketh | Great stuff | 23:02 |
fungi | mordred: clarkb: so you're idle on puppetmaster for a bit. looks like i'm clear to reboot that. speak soon if not | 23:03 |
pleia2 | fungi: git-fe01 just slowed down enough for me to reboot it, if you can wait 5 minutes I might as well finish this up | 23:03 |
clarkb | fungi: I am done | 23:03 |
mordred | fungi: I am doing nothing there | 23:03 |
fungi | pleia2: no problem, go for it | 23:03 |
fungi | i'll wait | 23:03 |
clarkb | pleia2: fe01 or fe02? | 23:03 |
pleia2 | clarkb: fe02 is all done | 23:04 |
clarkb | pleia2: woot | 23:04 |
clarkb | es02 is rebooting now | 23:04 |
clarkb | looks like the logstash workers may be leaking logs too | 23:04 |
clarkb | I find it somewhat :( that our logging system is bad at logging (most likely my fault) | 23:04 |
fungi | logs are hard (and made of wood) | 23:05 |
clarkb | and its up, cluster is still recovering and yellow so that all looks good | 23:06 |
mordred | fungi: more witches! | 23:07 |
pleia2 | fungi: all clear puppetmaster | 23:07 |
fungi | pleia2: thanks! restarting it next | 23:07 |
fungi | puppet master is back up now | 23:08 |
pleia2 | updated dns & load is coming back to git-fe01, so all is looking good | 23:09 |
clarkb | ok confirmed that es doesn't do rotations with deletes properly until release 1.5 | 23:10 |
jeblair | are all servers complete now? | 23:10 |
clarkb | but I think logrotate can delete based on age so will just set it up to kill anything >2 weeks old or whatever | 23:10 |
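Clarkb mentions logrotate, but since elasticsearch already rotates its own files and only the deletion is missing, an equally small sketch is a daily find by age (the log path, schedule, and two-week cutoff are all assumptions):

```bash
# one-off: remove anything under the ES log dir older than 14 days
sudo find /var/log/elasticsearch -type f -mtime +14 -delete

# or the same thing as a daily cron job
echo '17 3 * * * root find /var/log/elasticsearch -type f -mtime +14 -delete' |
    sudo tee /etc/cron.d/es-log-cleanup
```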
pleia2 | jeblair: I think so | 23:10 |
clarkb | the list seems to make it look that way | 23:10 |
fungi | gate-heat-python26 NOT_REGISTERED | 23:11 |
fungi | same for gate-pbr-python26 | 23:11 |
jeblair | there are bare-centos6 nodes ready, maybe we should delete them | 23:12 |
fungi | looks like a lot of jobs expecting bare-precise and bare-centos6 are not registered, so yeah | 23:12 |
fungi | let's | 23:12 |
jeblair | doing | 23:12 |
fungi | thanks | 23:12 |
jeblair | oh, sorry, my query was wrong; no ready nodes there, only hold, building, and used (possibly from before reboot) | 23:13 |
jeblair | building might be before reboot too | 23:13 |
jeblair | anyone using those held nodes? | 23:14 |
fungi | i am no longer | 23:14 |
clarkb | jeblair: I have one I would like to keep | 23:14 |
fungi | just finished yesterday | 23:14 |
fungi | i'll delete mine | 23:14 |
clarkb | 2536877 would be nice for me to keep | 23:14 |
fungi | i kept a list | 23:14 |
clarkb | been using it as part of the nodepool + devstack work | 23:14 |
jeblair | okay, i deleted building/used; leaving hold to you | 23:15 |
jeblair | that should be enough to correct the registrations once those are built | 23:15 |
jeblair | should we clear out of this room now? | 23:15 |
pleia2 | sounds good, see you on the other side | 23:16 |
*** clarkb has left #openstack-infra-incident | 23:16 | |
fungi | adios | 23:17 |
*** fungi has left #openstack-infra-incident | 23:17 | |
*** ChanServ changes topic to "Discussion of OpenStack project infrastructure incidents | No current incident" | 23:17 | |
* jeblair puts the sheets back over the control consoles | 23:17 | |
*** nibalizer has left #openstack-infra-incident | 23:18 |