*** AJaeger is now known as AJaeger_ | 06:08 | |
*** hjensas has joined #openstack-infra-incident | 07:30 | |
*** hjensas has quit IRC | 08:23 | |
-openstackstatus- NOTICE: zuul was restarted due to an unrecoverable disconnect from gerrit. If your change is missing a CI result and isn't listed in the pipelines on http://status.openstack.org/zuul/ , please recheck | 08:51 | |
*** hjensas has joined #openstack-infra-incident | 09:34 | |
*** Daviey has quit IRC | 12:28 | |
*** Daviey has joined #openstack-infra-incident | 12:55 | |
jroll | https://nvd.nist.gov/vuln/detail/CVE-2016-10229#vulnDescriptionTitle "udp.c in the Linux kernel before 4.5 allows remote attackers to execute arbitrary code via UDP traffic that triggers an unsafe second checksum calculation during execution of a recv system call with the MSG_PEEK flag." | 13:00 |
jroll | not sure if infra listens for UDP off the top of my head, but thought I'd drop that here | 13:00 |
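[editor's note: for reference, a quick way to see which UDP ports a host is actually listening on; these are standard Linux commands and are illustrative, not taken from the log]

    # Listening UDP sockets, numeric ports, no DNS lookups
    ss -lun
    # Equivalent on older hosts without ss
    netstat -lun
    # Cross-check against the firewall rules that mention UDP
    iptables -L -n -v | grep udp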
*** hjensas has quit IRC | 14:57 | |
*** lifeless_ has joined #openstack-infra-incident | 15:07 | |
*** mordred has quit IRC | 15:10 | |
*** lifeless has quit IRC | 15:10 | |
*** EmilienM has quit IRC | 15:10 | |
*** mordred1 has joined #openstack-infra-incident | 15:10 | |
*** 21WAAA2JF has joined #openstack-infra-incident | 15:13 | |
*** mordred1 is now known as mordred | 15:46 | |
clarkb | jroll: thanks, we have an open snmp port we might want to close | 16:10 |
clarkb | pabelanger: fungi ^ | 16:10 |
pabelanger | ack | 16:10 |
pabelanger | pbx might be one | 16:12 |
pabelanger | since we use UDP for RTP | 16:12 |
*** hjensas has joined #openstack-infra-incident | 16:12 | |
jroll | clarkb: np | 16:13 |
clarkb | actually snmp is source-specific so fairly safe | 16:13 |
clarkb | afs | 16:13 |
clarkb | is udp | 16:13 |
clarkb | mordred: ^ | 16:13 |
pabelanger | ya, AFS might be our largest exposure | 16:15 |
mordred | oy. that's awesome | 16:17 |
pabelanger | https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html | 16:17 |
pabelanger | Linux afs01.dfw.openstack.org 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | 16:18 |
pabelanger | so ya, might need a new kernel and reboot? | 16:18 |
fungi | ugh | 16:19 |
fungi | any reports it's actively exploited in the wild? | 16:19 |
pabelanger | I am not sure | 16:21 |
pabelanger | looks like android is getting the brunt of it however | 16:22 |
fungi | keep in mind source filtering is still a lot less effective for udp than tcp | 16:23 |
fungi | easier to spoof (mainly just need to guess a source address and active ephemeral port) | 16:23 |
*** openstack has joined #openstack-infra-incident | 16:33 | |
*** openstackstatus has joined #openstack-infra-incident | 16:34 | |
*** ChanServ sets mode: +v openstackstatus | 16:34 | |
*** 21WAAA2JF is now known as EmilienM | 17:05 | |
*** EmilienM has joined #openstack-infra-incident | 17:05 | |
clarkb | looking at https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html says xenial is not affected? | 18:57 |
clarkb | I also don't see a ubuntu security notice for it yet | 18:59 |
clarkb | it looks like we may be patched in many places? | 19:06 |
clarkb | trying to figure out what exactly is required, but if I read that correctly xenial is fine despite being 4.4? trusty needs kernel >=3.13.0-79.123 | 19:06 |
clarkb | pabelanger: fungi ^ that sound right to you? if so maybe next step is generate a list of kernels on all our hosts via ansible then produce a needs to be rebooted list | 19:12 |
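[editor's note: a sketch of the per-host check being proposed, assuming the trusty fix level quoted above (kernel package >= 3.13.0-79.123); the exact commands are illustrative, not from the log]

    # Running kernel release, e.g. 3.13.0-85-generic
    uname -r
    # Package version of the running kernel, e.g. 3.13.0-85.129
    ver=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")
    # Compare against the first fixed trusty kernel
    if dpkg --compare-versions "$ver" ge 3.13.0-79.123; then
        echo "installed kernel is new enough"
    else
        echo "needs a kernel upgrade and reboot"
    fi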
pabelanger | clarkb: ya, xenial isn't affected from what I read. | 19:13 |
pabelanger | ++ to ansible run | 19:13 |
clarkb | pabelanger: Linux review 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | 19:48 |
pabelanger | ++ | 19:48 |
clarkb | on review.o.o which is newer than 3.13.0-79.123 | 19:48 |
clarkb | so I think just restart the service? | 19:48 |
pabelanger | ya, looks like just a restart then | 19:49 |
-openstackstatus- NOTICE: The Gerrit service on http://review.openstack.org is being restarted to address hung remote replication tasks. | 19:51 | |
fungi | sorry for not being around... kernel update sounds right, too bad we didn't take the gerrit restart as an opportunity to reboot | 19:58 |
clarkb | fungi: we don't need to reboot it | 20:00 |
clarkb | fungi: gerrit's kernel is new enough I think ^ you can double check above. | 20:00 |
fungi | oh | 20:00 |
fungi | yep, read those backwards | 20:00 |
clarkb | pabelanger: puppetmaster.o.o:/home/clarkb/collect_facts.yaml has a small playbook thing to collect kernel info, want to check that out? | 20:00 |
clarkb | that is incredibly verbose, is there a better way to introspect facts? | 20:02 |
pabelanger | clarkb: sure | 20:05 |
clarkb | pabelanger: if it looks good to you I will run it against all the hosts and stop using my --limit commands to test | 20:06 |
clarkb | its verbose but works so just going to go with it I think | 20:06 |
pabelanger | clarkb: looks good | 20:06 |
clarkb | pabelanger: ok I will run it and redirect output into ~clarkb/kernels.txt | 20:07 |
clarkb | its running | 20:08 |
pabelanger | clarkb: only seeing the ok: hostname bits | 20:10 |
clarkb | pabelanger: ya its gathering all the facts before running the task I think | 20:10 |
pabelanger | Ha, ya | 20:11 |
pabelanger | gather_facts: false | 20:11 |
clarkb | well we need the facts | 20:11 |
pabelanger | but, we need them | 20:11 |
pabelanger | ya | 20:11 |
clarkb | I guess I could've uname -a 'd it instead :) | 20:11 |
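[editor's note: the collect_facts.yaml playbook itself isn't shown in the log; an ad-hoc sketch of roughly the same check run from puppetmaster, where the command forms are assumptions rather than the actual playbook]

    # Report only the kernel fact instead of the full, very verbose fact dump
    ansible '*' -m setup -a 'filter=ansible_kernel'
    # Or skip fact gathering entirely and just run uname, as suggested above
    ansible '*' -m command -a 'uname -r'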
pabelanger | okay, let me get some coffee | 20:11 |
pabelanger | also, forgot about infracloud | 20:12 |
pabelanger | that will be fun | 20:12 |
clarkb | hrm why does the mtl01 internap mirror show up, I thought I cleaned that host up a while back | 20:16 |
mgagne | mtl01 is the active region, nyj01 is the one that is now unused | 20:16 |
* mgagne didn't read backlog | 20:16 | |
clarkb | oh I got them mixed up, thanks | 20:18 |
clarkb | looks like right now ansible is timing out trying to get to things like jeblairtest | 20:27 |
clarkb | I'm just gonna let it timeout on its own, does anyone know how long that will take? | 20:27 |
fungi | maybe 60 minutes in my experience | 20:36 |
clarkb | fwiw most of my spot checking shows our kernels are new enough | 20:54 |
clarkb | so don't expect to need to reboot much once ansible gets back to me | 20:54 |
clarkb | its been an hour and they haven't timed out yet... | 21:17 |
clarkb | done waiting going to kill ssh processes and hope that doesn't prevent play from finishing | 21:20 |
clarkb | pabelanger: https://etherpad.openstack.org/p/infra-reboots-old-kernel | 21:26 |
clarkb | I'm just going to start picking off some of the easy ones | 21:28 |
clarkb | mordred: are you around? how do we do the afs reboots? make afs01 rw for all volumes, reboot afs02, make 02 rw, reboot 01? | 21:28 |
clarkb | then do each of the db hosts one at a time? what about kdc01? | 21:28 |
clarkb | doing etherpad now so the etherpad will be temporarily unavailable | 21:34 |
clarkb | for proposal.slave.openstack.org and others, is the zuul launcher going to gracefully pick those back up again after a reboot or will we have to restart the launcher too? | 21:37 |
clarkb | I guess I'm going to find out? | 21:37 |
clarkb | rebooting proposal.slave.o.o now as its not doing anything | 21:38 |
clarkb | I'm going to try grabbing all the mirror update locks on mirror-update so that I can reboot it next | 22:09 |
clarkb | pabelanger: gem mirroring appears to have been stalled since april 2nd but there is a process holding the lock. Safe to grab it and then reboot? | 22:26 |
pabelanger | clarkb: ya, we'll need to grab lock after reboot | 22:33 |
pabelanger | just finishing up dinner, and need to run an errand | 22:34 |
pabelanger | I can help with reboots once I get back in about 90mins | 22:34 |
clarkb | pabelanger: sounds good we can leave mirror-update and afs for then. I will keep working on the others | 22:34 |
clarkb | afs didn't survive reboot on gra1 mirror. working to fix now | 22:39 |
clarkb | oh maybe it did and it's just slow, things are cd-able now | 22:40 |
clarkb | pabelanger: for when you get back, grafana updated on the grafana server, not sure if it matters or not? hopefully I didn't derp anything | 22:43 |
clarkb | my apologies if it does :/ | 22:44 |
clarkb | the web ui seems to be working though so going to reboot server now | 22:44 |
clarkb | and its up and happy again | 22:50 |
clarkb | now for the baremetal00 host for infra clouds running bifrost | 22:50 |
pabelanger | looks like errands are pushed back a few hours | 23:00 |
pabelanger | clarkb: looks like we might have upgraded grafana.o.o too | 23:00 |
pabelanger | checking logs to see if there are any errors | 23:00 |
pabelanger | but so far, seems okay | 23:00 |
clarkb | pabelanger: yes it upgraded, sorry I didn't think to not do that until it was done | 23:00 |
clarkb | but ya service seems to work | 23:00 |
pabelanger | 2.6.0 | 23:01 |
pabelanger | should be fine | 23:01 |
clarkb | I have been running apt-get update and dist-upgrade before reboots to make sure we get the new stuff | 23:01 |
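[editor's note: the pre-reboot update step described here as a minimal sketch; the -y flag and image glob are illustrative]

    # Refresh package lists and pull in the fixed kernel before rebooting
    apt-get update
    apt-get dist-upgrade -y
    # Confirm the newer kernel image is installed, then reboot into it
    dpkg -l 'linux-image-3.13.0-*' | grep '^ii'
    reboot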
pabelanger | we'll find out soon if grafyaml has issues | 23:01 |
clarkb | :) | 23:01 |
clarkb | baremetal00 and puppetmaster are the two I want to do next | 23:01 |
clarkb | then we are just left with afs things | 23:01 |
pabelanger | k | 23:01 |
clarkb | I think baremetal should be fine to just reboot | 23:01 |
pabelanger | ya | 23:02 |
clarkb | for puppetmaster we should grab the puppet run lock and then do it so we don't interrupt a bunch of ansible/puppet | 23:02 |
pabelanger | okay | 23:02 |
pabelanger | which do you want me to do | 23:02 |
clarkb | I want you to do mirror-update if you can | 23:02 |
pabelanger | k | 23:02 |
clarkb | since you have the gem lock? I should have all the other locks at this point; you can ps -elf | grep k5 to make sure nothing else is running | 23:02 |
clarkb | I'm logged in but don't worry about it, the only processes I have are holding locks | 23:03 |
clarkb | I'm going to reboot baremetal00 now | 23:03 |
* clarkb crosses fingers | 23:03 | |
pabelanger | ya, k5start process are not running | 23:03 |
pabelanger | so think mirror-update.o.o is ready | 23:03 |
clarkb | pabelanger: cool go for it | 23:04 |
clarkb | then grab whatever locks you need | 23:04 |
clarkb | since they shouldn't survive a reboot | 23:04 |
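[editor's note: a sketch of holding one of the mirror-update locks by hand so the sync jobs stay paused across the reboot; the npm lock path appears later in the log, and the fd/flock pattern is an assumption about how the scripts coordinate]

    # Take the npm mirror lock and hold it until this shell exits
    exec 9>/var/run/npm/npm.lock
    flock -n 9 || { echo "another process holds the npm lock"; exit 1; }
    # Verify no k5start-wrapped mirror jobs are still running
    ps -elf | grep '[k]5start'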
pabelanger | rebooting | 23:04 |
clarkb | then when thats done and baremetal comes back lets figure out puppetmaster, then figure out afs servers | 23:05 |
clarkb | baremetal still not back. Real hardware is annoying :) | 23:06 |
pabelanger | mirror-update.o.o good, locks grabbed again | 23:08 |
clarkb | and still not up | 23:08 |
pabelanger | ya, will take a few minutes | 23:08 |
clarkb | pabelanger: can you start poking at puppetmaster maybe, see about grabbing lock for the puppet/ansible rotation? | 23:08 |
clarkb | remember there are two now iirc | 23:08 |
pabelanger | yup | 23:09 |
clarkb | tyty | 23:09 |
clarkb | at what point do I worry the hardware for baremetal00 is not coming back? :/ | 23:10 |
clarkb | oh it just started pinging | 23:10 |
clarkb | \o/ | 23:10 |
pabelanger | okay, have both locks on puppetmaster.o.o | 23:12 |
pabelanger | and ansible is not running | 23:12 |
clarkb | I don't need puppetmaster if you want to go for it | 23:12 |
pabelanger | k, rebooting | 23:12 |
clarkb | baremetal is up now. and ironic node-list works | 23:13 |
clarkb | well thats interesting | 23:13 |
clarkb | it's running its old kernel | 23:13 |
pabelanger | puppetmaster.o.o online | 23:14 |
clarkb | I'm going to keep debugging baremetal and may have to reboot it again :( | 23:14 |
pabelanger | ansible now running | 23:15 |
clarkb | as for afs, can we reboot the kdc01 server safely ? we just won't be able to get kerberos tokens while its down? | 23:15 |
clarkb | and can we reboot the db servers one at a time without impacting the service? | 23:16 |
clarkb | then we just have to do the fileservers in a synchronized manner ya? | 23:16 |
clarkb | mordred: corvus ^ | 23:16 |
clarkb | I manually installed linux-image-3.13.0-116-generic on baremetal00, I do not know why a dist-upgrade was not pulling that in | 23:18 |
clarkb | but its in there and in grub so thinking I will do a second reboot now | 23:18 |
clarkb | pabelanger: ^ any ideas on that or concerns? | 23:18 |
pabelanger | nope, go for it. We have access to iLo if needed | 23:19 |
clarkb | we don't have to go through baremetal00 to get ilo? we can go through any of the other hosts ya? | 23:20 |
clarkb | thats my biggest concern | 23:20 |
pabelanger | I think we can do any now | 23:20 |
pabelanger | they are all on same network | 23:20 |
clarkb | ok rebooting | 23:20 |
clarkb | I put some notes about doing the afs related servers on the etherpad. Does that look right to everyone? | 23:23 |
clarkb | pabelanger: maybe you can check if one file server is already rw for all volumes and we can reboot the other one? | 23:23 |
* clarkb goes to grab a drink while waiting for baremetal00 | 23:23 | |
pabelanger | clarkb: vos listvldb shows everything in sync | 23:25 |
pabelanger | clarkb: http://docs.openafs.org/AdminGuide/HDRWQ139.html | 23:26 |
pabelanger | we might want to follow that? | 23:26 |
clarkb | vldb is what we run on afsdb0X? | 23:29 |
pabelanger | I did it from afsdb01 | 23:30 |
pabelanger | I think we use bos shutdown | 23:30 |
pabelanger | to rotate things out | 23:30 |
clarkb | gotcha thats the way you signal the other server to take over duties? | 23:30 |
pabelanger | I think so | 23:30 |
clarkb | definitely seems like what you are supposed to do according to the guide | 23:30 |
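[editor's note: a sketch of the sequence from the OpenAFS admin guide linked above, using the server names from this discussion; flags and ordering should be checked against the guide and the actual cell configuration before running anything]

    # See which volumes the file server still holds before taking it down
    vos listvldb -server afs01.ord.openstack.org -localauth
    # Cleanly stop the AFS server processes on that machine (per the guide)
    bos shutdown afs01.ord.openstack.org -wait -localauth
    # ...reboot the host; bosserver normally brings the processes back at boot...
    # Afterwards, confirm the processes and volume locations look sane again
    bos status afs01.ord.openstack.org -localauth
    vos listvldb -server afs01.ord.openstack.org -localauth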
pabelanger | maybe start with afs02.dfw.openstack.org | 23:31 |
clarkb | does afs02 only have ro volumes? | 23:31 |
pabelanger | yes | 23:31 |
clarkb | and afs01 is all rw? if so then ya I think we do that one first | 23:31 |
pabelanger | right | 23:31 |
clarkb | (still waiting on baremetal00) | 23:31 |
pabelanger | afs01 has rw and ro | 23:31 |
pabelanger | err | 23:31 |
pabelanger | afs01.dfw.openstack.org RW RO | 23:32 |
pabelanger | afs01.dfw.openstack.org RO | 23:32 |
pabelanger | afs02.dfw.openstack.org RO | 23:32 |
pabelanger | afs01.ord.openstack.org | 23:32 |
pabelanger | is still online, but not used | 23:32 |
pabelanger | maybe we do afs01.ord.openstack.org first | 23:32 |
clarkb | right ok. Then once afs02 is back up again we transition the volumes so the RW and RO are swapped | 23:32 |
pabelanger | npm volume locked atm | 23:33 |
clarkb | ord's kernel is old too but not in my list | 23:33 |
clarkb | maybe we skipped things in the emergency file? may need to double check that after we are done | 23:34 |
clarkb | (I used hosts: '*' to try and get them all and not !disabled) | 23:34 |
pabelanger | odd, okay we'll need to do 3 servers it seems. afs01.ord.openstack.org is still used by a few volumes | 23:34 |
clarkb | still waiting on baremetal00 :/ | 23:35 |
pabelanger | okay, so which do we want to shutdown first? | 23:35 |
clarkb | I really don't know :( my feeling is the non file servers may be the lowest impact? | 23:36 |
pabelanger | right, afs01.ord.openstack.org is the least used FS | 23:36 |
clarkb | ok lets do that one first of the fileservers | 23:37 |
clarkb | then question is do we want to do kdc and afsdb before fileservers or after? | 23:37 |
clarkb | also still no baremetal pinging. This is much longer than the last time | 23:37 |
clarkb | pabelanger: does the ord fs have any RW volumes? | 23:38 |
pabelanger | mirror.npm still is locked | 23:38 |
pabelanger | clarkb: no | 23:39 |
pabelanger | just RO | 23:39 |
clarkb | ok so what we want to do then maybe is grab all the flocks on mirror-update so that things stop updating volumes (like npm) | 23:40 |
clarkb | then reboot ord fileserver first? | 23:40 |
clarkb | see how that goes? | 23:40 |
pabelanger | sure | 23:40 |
clarkb | ok why don't you grab the flocks. I am working on getting ilo access to baremetal00 | 23:41 |
pabelanger | ha | 23:42 |
pabelanger | puppet needs to create the files in /var/run still | 23:42 |
pabelanger | since they are deleted on boot | 23:42 |
clarkb | the lock files are deleted? | 23:42 |
pabelanger | /var/run is tmpfs | 23:43 |
pabelanger | so /var/run/npm/npm.lock | 23:43 |
pabelanger | will fail, until puppet create /var/run/npm | 23:43 |
clarkb | I can't seem to hit the ipmi addrs with ssh | 23:44 |
clarkb | I am able to hit the compute host's own ipmi but it's slow, maybe baremetal is slower /me tries harder | 23:45 |
clarkb | ok I'm on the ilo now. I guess being persistent after multiple connection timeouts is the trick | 23:49 |
pabelanger | k, have all the locks on mirror-update | 23:50 |
clarkb | so I can see the text console. The server is running | 23:56 |
clarkb | but no ssh | 23:56 |
clarkb | I think I am going to reboot with the text console up | 23:57 |
pabelanger | k | 23:57 |
pabelanger | like I'm ready to bos shutdown afs01.ord.openstack.org | 23:58 |
clarkb | pabelanger: I think if you are ready and willing then lets do it :) | 23:58 |