*** irclogbot_0 has quit IRC | 00:16 | |
*** irclogbot_2 has joined #openstack-lbaas | 00:19 | |
*** ccamposr has quit IRC | 00:46 | |
*** ccamposr has joined #openstack-lbaas | 00:46 | |
*** armax has quit IRC | 01:15 | |
*** sapd1 has joined #openstack-lbaas | 02:34 | |
*** rcernin has quit IRC | 03:02 | |
*** rcernin has joined #openstack-lbaas | 03:21 | |
*** rcernin has quit IRC | 03:25 | |
*** rcernin has joined #openstack-lbaas | 03:26 | |
*** armax has joined #openstack-lbaas | 04:18 | |
*** vishalmanchanda has joined #openstack-lbaas | 04:51 | |
*** gcheresh has joined #openstack-lbaas | 04:59 | |
*** sapd1 has quit IRC | 05:17 | |
*** gcheresh has quit IRC | 05:38 | |
sorrison | Wondering if anyone is around who can help me with an Octavia issue we're experiencing in production? | 05:48 |
sorrison | Ahh it's still Sunday in the states | 05:50 |
*** gcheresh has joined #openstack-lbaas | 05:51 | |
sorrison | I figured out the issue; it was some silly local thing, but I needed to update the DB to get things back | 06:31 |
*** maciejjozefczyk has joined #openstack-lbaas | 06:43 | |
*** born2bake has joined #openstack-lbaas | 07:04 | |
*** sapd1 has joined #openstack-lbaas | 07:11 | |
*** halali_ has joined #openstack-lbaas | 07:16 | |
*** rcernin has quit IRC | 07:24 | |
*** halali_ has quit IRC | 07:37 | |
*** rcernin has joined #openstack-lbaas | 07:38 | |
*** rcernin has quit IRC | 07:44 | |
*** rcernin has joined #openstack-lbaas | 07:48 | |
*** rcernin has quit IRC | 07:52 | |
*** wuchunyang has joined #openstack-lbaas | 08:03 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia master: Allow amphorav2 to run without jobboard https://review.opendev.org/739053 | 08:05 |
*** wuchunyang has quit IRC | 08:31 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia master: Allow amphorav2 to run without jobboard https://review.opendev.org/739053 | 08:38 |
openstackgerrit | Carlos Goncalves proposed openstack/octavia-lib master: Add missing assertIsInstance checks https://review.opendev.org/738090 | 08:43 |
*** sapd1 has quit IRC | 08:54 | |
*** sapd1 has joined #openstack-lbaas | 09:01 | |
*** rcernin has joined #openstack-lbaas | 09:10 | |
*** ccamposr__ has joined #openstack-lbaas | 09:14 | |
*** rcernin has quit IRC | 09:15 | |
*** ccamposr has quit IRC | 09:16 | |
*** ccamposr has joined #openstack-lbaas | 09:24 | |
*** ccamposr__ has quit IRC | 09:25 | |
openstackgerrit | Merged openstack/octavia stable/ussuri: add the verify for the session https://review.opendev.org/738403 | 09:36 |
openstackgerrit | Merged openstack/octavia-lib master: Add missing assertIsInstance checks https://review.opendev.org/738090 | 09:36 |
*** also_stingrayza is now known as stingrayza | 09:58 | |
*** sapd1 has quit IRC | 10:01 | |
*** sapd1 has joined #openstack-lbaas | 10:05 | |
*** irclogbot_2 has quit IRC | 10:08 | |
*** johnthetubaguy has quit IRC | 10:08 | |
*** vesper11 has quit IRC | 10:08 | |
*** irclogbot_2 has joined #openstack-lbaas | 10:09 | |
*** johnthetubaguy has joined #openstack-lbaas | 10:09 | |
*** vesper11 has joined #openstack-lbaas | 10:09 | |
*** tkajinam has quit IRC | 10:29 | |
devfaz | hi, just ran into a failed tempest test octavia_tempest_plugin.tests.api.v2.test_member.MemberAPITest.test_member_ipv4_create - it seems a missing qos extension in neutron caused the failure http://paste.openstack.org/show/796108/ | 10:31 |
devfaz | wait... maybe I'm wrong. | 10:32 |
*** trident has quit IRC | 10:35 | |
*** trident has joined #openstack-lbaas | 10:36 | |
devfaz | according to the neutron log, the network still existed (requests returned 200) and got removed by tempest during cleanup | 10:36 |
devfaz | http://paste.openstack.org/show/796109/ - maybe the 404 from the qos-ext is (somehow?) being used to return a 404 for the network request? | 10:37 |
*** stingrayza has quit IRC | 10:40 | |
*** wuchunyang has joined #openstack-lbaas | 10:57 | |
*** wuchunyang has quit IRC | 11:23 | |
*** stingrayza has joined #openstack-lbaas | 11:36 | |
*** sapd1 has quit IRC | 11:38 | |
*** also_stingrayza has joined #openstack-lbaas | 12:35 | |
*** stingrayza has quit IRC | 12:37 | |
*** also_stingrayza is now known as stingrayza | 12:37 | |
*** rcernin has joined #openstack-lbaas | 13:07 | |
*** rcernin has quit IRC | 13:12 | |
*** TrevorV has joined #openstack-lbaas | 13:35 | |
*** sapd1 has joined #openstack-lbaas | 13:51 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia master: Add amphora image tag capability to Octavia flavors https://review.opendev.org/737528 | 14:04 |
*** numans_ has joined #openstack-lbaas | 14:21 | |
*** sapd1 has quit IRC | 14:24 | |
devfaz | nope, had some more time; even with the qos-ext enabled the network is "not found". Sorry for the noise. | 14:38 |
openstackgerrit | Corey Bryant proposed openstack/octavia master: Drop diskimage-builder from root requirements.txt https://review.opendev.org/741960 | 14:43 |
*** rcernin has joined #openstack-lbaas | 14:55 | |
johnsom | devfaz Those are DEBUG messages, so typically not actually causing a problem. Octavia does extension discovery against neutron to check which optional features are enabled, depending on the deployment. | 14:58 |
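For reference, that discovery is essentially a query of neutron's enabled-extension list. A minimal sketch with openstacksdk (the cloud name and printed aliases are illustrative, not Octavia's actual code):

    import openstack

    # Credentials come from the OS_* environment variables.
    conn = openstack.connect(cloud="envvars")

    # Octavia probes the enabled neutron extensions to decide which
    # optional features (e.g. qos, allowed-address-pairs) it can use.
    aliases = {ext.alias for ext in conn.network.extensions()}
    print("qos enabled:", "qos" in aliases)
    print("allowed-address-pairs enabled:", "allowed-address-pairs" in aliases)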
*** rcernin has quit IRC | 15:00 | |
johnsom | devfaz I will note that, until a recently merged patch on octavia-tempest-plugin, there were some bad tests around availability zones that would cause the API tests to fail. https://review.opendev.org/737191 | 15:00 |
devfaz | johnsom: I will cherry-pick it and give it a try | 15:01 |
devfaz | johnsom: already tested with the latest (master) octavia-tempest-plugin, which already contained the above change; it didn't help. | 15:11 |
devfaz | I'm just wondering why neutron is answering "subnet not found" when the subnet still exists, because the later _cleanup is getting a 200 on delete. | 15:12 |
johnsom | Hmm, ok, can you paste the test failure information from the tempest run? | 15:12 |
devfaz | sure | 15:12 |
devfaz | johnsom: http://paste.openstack.org/show/796131/ | 15:13 |
devfaz | johnsom: don't want to waste your time. I can build the current head/master of octavia and retry my tests | 15:13 |
johnsom | Yeah, this rings a bell. I know there were a few API tests that got broken by accident. The AZ was one, I remember there was a "missing subnet" as well, but I can't remember what the story/issue was with that one. | 15:13 |
devfaz | if it helps, it seems the issue was introduced in the "stein" branch during maintenance of the branch, because my old branch didn't have the issue (afaik) - just trying to upgrade to the latest stein because of the issue we talked about on the list | 15:14 |
johnsom | Yeah, the tempest plugin is branchless, so if there is an issue it will most likely impact all of the branches. | 15:15 |
devfaz | well, I'm using a forked version of tempest (to avoid issues like this), and it seems the issue was introduced just by upgrading octavia from an older version of stein to the latest version of stein. | 15:16 |
johnsom | Yeah, I think it's related to the new "invisible resource" check we do on create calls. | 15:17 |
johnsom | Oh, I know what it is. It's a check ordering issue in the octavia code. One second, let me track that down. | 15:19 |
johnsom | devfaz It was this patch on master: https://review.opendev.org/#/c/737084/ | 15:20 |
johnsom | let me check stein and see if that still needs this patch | 15:20 |
johnsom | Yeah, ok, so that patch still needs to be backported. | 15:21 |
johnsom | The test is failing because it's getting a different exception than is expected due to the checks being out of order in the API. | 15:22 |
johnsom | I will work on getting that backported | 15:22 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/ussuri: Prioritize policy validation https://review.opendev.org/741967 | 15:25 |
*** spatel has joined #openstack-lbaas | 15:29 | |
johnsom | devfaz If that is the only set of errors you are getting (the "subnet not found"), you are probably good to go. It is functional, just running validation checks out of order, so you may get a "subnet not found" error (because you don't have permissions to it) instead of the Forbidden that you should get. The backport will fix that, but functionally it still denies the user without permissions. | 15:33 |
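The ordering problem boils down to this shape (hypothetical names, not Octavia's actual handlers; the real fix is https://review.opendev.org/#/c/737084/):

    class Forbidden(Exception):
        pass

    class NotFound(Exception):
        pass

    def create_member(user, member_spec, db):
        # Authorize first: callers without permission get Forbidden,
        # regardless of whether the referenced subnet exists.
        if not user.is_authorized("member:post"):
            raise Forbidden()
        # Only after the policy check do we resolve resources, so a 404
        # really means the subnet is missing or invisible to the caller.
        if db.find_subnet(member_spec["subnet_id"]) is None:
            raise NotFound("subnet not found")
        return db.create_member(member_spec)

With the two checks reversed, an unauthorized caller is told "subnet not found" instead of Forbidden, which is exactly the mismatch the tempest test tripped over.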
devfaz | johnsom: good to know - thx. I just tried to backport the change to my stein branch (I will not use it, just checking) | 15:38 |
johnsom | Sure, NP. I'm working on the backport patches now. They should all be posted in the next half hour or so | 15:39 |
devfaz | thx a lot. Leaving in the next few minutes (CEST), so I won't be using them in the next 12h :) | 15:40 |
devfaz | just waiting for the tempest result of my quick-and-dirty-cherry-pick. | 15:40 |
devfaz | made some typos while resolving the merge conflict, so tempest failed. I will use your patch tomorrow, or try it myself if you have more important issues to solve. Nevertheless, thanks a lot | 15:45 |
johnsom | Sure, no problem | 15:46 |
*** gcheresh has quit IRC | 15:48 | |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/train: Prioritize policy validation https://review.opendev.org/741974 | 15:51 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/stein: Prioritize policy validation https://review.opendev.org/741975 | 15:52 |
devfaz | 2020-07-20 15:52:58.703 822 INFO tempest-verifier [-] {0} octavia_tempest_plugin.tests.api.v2.test_member.MemberAPITest.test_member_ipv4_create ... success [8.310s] | 15:53 |
devfaz | so, now I can go :) | 15:53 |
devfaz | have a nice day. | 15:53 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/train: Prioritize policy validation https://review.opendev.org/741974 | 16:23 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/stein: Prioritize policy validation https://review.opendev.org/741975 | 16:26 |
rm_work | johnsom: could you throw another review at https://review.opendev.org/#/c/740815/ ? the other side (the control-plane side) is finished | 16:26 |
johnsom | Yes, sometime today, probably | 16:31 |
openstackgerrit | Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code https://review.opendev.org/718192 | 16:35 |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Clarify the current status of Octavia in README https://review.opendev.org/741988 | 16:37 |
johnsom | Someone posted on ask.o.o that they were confused by the "Under development" statements, so I'm trying to clarify some of that. | 16:39 |
*** laerling is now known as Guest68516 | 16:55 | |
*** rcernin has joined #openstack-lbaas | 16:56 | |
*** Guest68516 has quit IRC | 17:00 | |
*** rcernin has quit IRC | 17:01 | |
*** ccamposr__ has joined #openstack-lbaas | 17:45 | |
*** ccamposr has quit IRC | 17:48 | |
*** spatel has quit IRC | 18:42 | |
openstackgerrit | Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code https://review.opendev.org/718192 | 19:00 |
johnsom | rm_work done | 19:27 |
*** maciejjozefczyk has quit IRC | 19:33 | |
*** vishalmanchanda has quit IRC | 19:41 | |
*** gthiemonge has quit IRC | 19:42 | |
*** gthiemonge has joined #openstack-lbaas | 19:42 | |
aannuusshhkkaa | hey johnsom, do we need to create a new message version (4) when we add Response Time (rtime) to the health message? | 20:20 |
johnsom | Yes, you would | 20:20 |
aannuusshhkkaa | okay thanks! | 20:22 |
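A loose sketch of what that version bump looks like on the sender side (illustrative field names and JSON/HMAC framing; the real amphora health message uses its own binary packing):

    import hashlib
    import hmac
    import json

    HEALTH_KEY = b"insecure-demo-key"  # placeholder shared secret

    def build_health_message(amphora_id, listeners, rtime_ms):
        msg = {
            "ver": 4,            # bumped: version 3 receivers don't know rtime
            "id": amphora_id,
            "listeners": listeners,
            "rtime": rtime_ms,   # the new response-time field
        }
        payload = json.dumps(msg, sort_keys=True).encode()
        digest = hmac.new(HEALTH_KEY, payload, hashlib.sha256).digest()
        # The receiver verifies the HMAC, then dispatches on "ver" so old
        # and new amphorae can coexist during an upgrade.
        return payload + digest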
*** shtepanie has joined #openstack-lbaas | 20:31 | |
*** rcernin has joined #openstack-lbaas | 20:58 | |
*** rcernin has quit IRC | 21:02 | |
openstackgerrit | Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code https://review.opendev.org/718192 | 21:02 |
*** TrevorV has quit IRC | 22:19 | |
sorrison | Got an issue in prod that I'm struggling to fix: some LBs are stuck in PENDING_UPDATE and have heaps of amphora in ERROR status. E.g. http://paste.openstack.org/show/796141/ | 22:41 |
*** shtepanie has quit IRC | 22:41 | |
*** born2bake has quit IRC | 22:46 | |
johnsom | sorrison What version are you running? | 22:49 |
sorrison | Train with the AZ patches on top | 22:50 |
sorrison | Been trying to find a way to just blow all the amphora away and rebuild them for the affected LBs | 22:50 |
sorrison | have 5 out of about 30 LBs in this state | 22:50 |
johnsom | You will be very interested in this patch chain: https://review.opendev.org/739772 | 22:51 |
johnsom | But let's see what is going on and where you are at now. | 22:51 |
johnsom | So, the LB in pending create, did someone/something kill -9 one of the controller processes? | 22:52 |
sorrison | luckily one of the LBs is one of mine so can experiment with it, the others are customers | 22:52 |
sorrison | they're in pending_update | 22:52 |
johnsom | Does the log show it still trying to repair it? | 22:52 |
johnsom | Ah, yeah, pending_update, that is what I meant | 22:52 |
sorrison | it looks like it keeps trying to build more and they fail | 22:53 |
*** tkajinam has joined #openstack-lbaas | 22:53 | |
johnsom | What is the error code you get when they fail? I assume this is the HM log | 22:53 |
sorrison | but I've spun some up manually and with tempest and they're all good | 22:53 |
sorrison | I see octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out errors | 22:55 |
sorrison | but those look like attempts to contact the dead amphora | 22:55 |
sorrison | It spins up new ones and puts them into ALLOCATED | 22:55 |
sorrison | and I can ping them etc. but it seems to be stuck trying to talk to the dead one | 22:56 |
johnsom | Yeah, it should do that. Ok, let's pick one of the amphora in ERROR, preferably one on your LB. | 22:56 |
sorrison | the ones in ERROR are dead | 22:58 |
sorrison | I can ping ones in ALLOCATED state but they have no role | 22:58 |
johnsom | There are no nova instances behind them? | 22:58 |
*** rcernin has joined #openstack-lbaas | 22:58 | |
sorrison | no, they are gone; I assume that's the source of the issue, though I haven't figured out why yet | 22:58 |
*** rcernin has quit IRC | 22:59 | |
johnsom | No, that is ok, that just means the cleanup worked correctly which is good. | 22:59 |
sorrison | the broken LBs are in different states | 22:59 |
*** rcernin has joined #openstack-lbaas | 22:59 | |
sorrison | I think I have some broken ones where either the master or the backup is still contactable | 22:59 |
johnsom | Is load balancer f784f824-64a5-4f13-8d54-fda54b528282 yours? | 22:59 |
johnsom | Yeah, Octavia will fail safe. When parts of the cloud are broken we stop before we would break the actual load balancer tenant traffic flow | 23:00 |
sorrison | no that LB isn't mine | 23:00 |
sorrison | but looks similar | 23:01 |
johnsom | Ok, if there is one in "ALLOCATED" on your LB, let's do an "openstack console log show" on it | 23:01 |
johnsom | We are looking for the net device info box that looks like this: | 23:02 |
johnsom | https://www.irccloud.com/pastebin/dTx8M0aU/ | 23:02 |
johnsom | Ha, well, with better formatting. | 23:02 |
johnsom | We want to confirm that the first line in that first box has a valid IP address | 23:02 |
johnsom | for the AZ this amp lives in | 23:03 |
sorrison | http://paste.openstack.org/show/796143/ | 23:03 |
sorrison | looks good | 23:04 |
sorrison | I can ssh in etc too | 23:04 |
johnsom | Oh, ok, good. | 23:04 |
sorrison | when I restarted amphora-agent on one of them last night, the nova instance just got deleted | 23:04 |
johnsom | Yeah, you have to be careful with that. | 23:05 |
johnsom | So, on that instance, in syslog, can you check that there are no errors from the agent? | 23:05 |
johnsom | They will not be in the agent log on that old version | 23:05 |
sorrison | I don't see any errors | 23:06 |
johnsom | Well, ok, it was in train, so they might be in the agent log. | 23:06 |
johnsom | Ok, yeah, I wouldn't expect any, given what I have seen so far. I think the amp is healthy. | 23:07 |
sorrison | last line is a PUT /1.0/config HTTP/1.1" 202 16 "-" | 23:07 |
johnsom | Ok, can you take a quick look at the haproxy log and make sure there are no errors there? | 23:07 |
johnsom | It should be clear (other than the file descriptor warnings, which are normal). | 23:07 |
sorrison | haproxy isn't running | 23:08 |
sorrison | nor is there a netns | 23:08 |
johnsom | Ok, so it didn't get far enough to send a config over. Fair enough. | 23:08 |
johnsom | Yeah, the amp is fine, so we need to turn our attention to the controllers. | 23:08 |
johnsom | The timeout message you pasted, it was of level ERROR or WARN? | 23:09 |
sorrison | ERROR | 23:09 |
johnsom | Hmmm, are you running the controllers in containers? | 23:09 |
sorrison | no | 23:10 |
johnsom | Ok. And when you ssh to the amphora, are you doing that from one of the controller hosts? I.e., is it going over o-hm0? | 23:11 |
sorrison | I didn't ssh, I vnc consoled | 23:11 |
johnsom | Sorry, I'm going to ask a lot of questions to narrow down where things are not behaving. We will get to an answer.... | 23:11 |
sorrison | but I can ssh in if required | 23:11 |
sorrison | just a bit more effort | 23:11 |
johnsom | Oh, yeah, very different | 23:12 |
johnsom | If they don't have a keypair that is ok, I just need to know | 23:12 |
sorrison | they have a SSH CA keypair | 23:12 |
sorrison | I just need to reissue myself a new key | 23:13 |
johnsom | Really what I want to do is verify that the controllers, the health manager in particular, can actually reach the amps over o-hm0. It seems the amps can send, but we haven't shown that the controllers can reach them. | 23:14 |
sorrison | in the amp logs I see HTTP requests from the controller though | 23:15 |
johnsom | Try doing this from the controller: openssl s_client -connect 192.168.1.79:9443 | 23:15 |
johnsom | Yeah, but that could be from one controller while another may not be able to reach it, assuming you have multiple controllers | 23:16 |
johnsom | You should get a response something like this (and a bunch more information): | 23:18 |
johnsom | https://www.irccloud.com/pastebin/acJh9hYl/ | 23:18 |
johnsom | The CN should equal the amphora ID | 23:18 |
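The same check can be scripted if there are several controllers to test from. A sketch using Python's ssl module plus the cryptography package (the IP is from this conversation; the client-cert path is a hypothetical example):

    import socket
    import ssl

    from cryptography import x509
    from cryptography.x509.oid import NameOID

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # we only want to read the CN here
    # If the amphora agent rejects handshakes without a client cert,
    # load the controller's client certificate first, e.g.:
    # ctx.load_cert_chain("/etc/octavia/certs/client.pem")

    with socket.create_connection(("192.168.1.79", 9443), timeout=5) as sock:
        with ctx.wrap_socket(sock) as tls:
            der = tls.getpeercert(binary_form=True)

    cert = x509.load_der_x509_certificate(der)
    cn = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value
    print("amphora CN:", cn)  # should match the amphora ID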
sorrison | Certificate chain | 23:19 |
sorrison | 0 s:CN = 73fc6855-70e2-4103-8f9d-23953762cad6 | 23:19 |
sorrison | i:C = AU, ST = Victoria, L = Melbourne, O = Nectar Research Cloud, OU = Octavia, CN = Server CA | 23:19 |
johnsom | You might have to control-c out of that connection too. It should boot you, but | 23:19 |
johnsom | Yeah, ok, so that controller can reach it fine. | 23:19 |
johnsom | If you can do a quick check of the other controller instances, that would rule out that one controller is net split | 23:19 |
sorrison | yeah tried all 3, all good there | 23:20 |
johnsom | Ok, so for sure the network isn't the problem. | 23:20 |
johnsom | Ok, for your load balancer, let's see if the controller got killed while it was working on the LB or if it is still trying. | 23:21 |
johnsom | We need to tail the worker and health manager log files and check for repeated messages that it is trying to connect to "192.168.1.79" | 23:22 |
johnsom | I hope we see that it is not still retrying and someone just killed a controller with -9 (systemd is good at doing this for you when it should not) | 23:23 |
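A quick way to count those retries (the log path is a common default and an assumption; adjust for your deployment):

    # Scan the health-manager log for attempts against the dead amp's IP.
    with open("/var/log/octavia/health-manager.log") as f:
        hits = [line for line in f if "192.168.1.79" in line]

    print(len(hits), "log lines mention the amphora's IP; last few:")
    print("".join(hits[-5:]))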
*** armax has quit IRC | 23:23 | |
sorrison | Nothing happening with the worker logs at all | 23:24 |
sorrison | we don't have debug on though | 23:24 |
johnsom | Ok, that is fine | 23:25 |
johnsom | How about HM? | 23:25 |
sorrison | ahh | 23:26 |
sorrison | looks like some issues on the hosts that run HM | 23:26 |
sorrison | no route to host | 23:27 |
sorrison | somehow I missed that | 23:27 |
sorrison | HM is on different hosts | 23:27 |
sorrison | Let me fix that up as I think I know what is going on | 23:27 |
johnsom | +1 | 23:28 |
sorrison | I needed to restart the HM processes, but it should be better now I think | 23:32 |
sorrison | Should I just wait a bit and it might fix itself up? | 23:33 |
johnsom | No, by now it has probably hit the "fail safe" mode. I would try a manual load balancer failover once this is fixed. | 23:34 |
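The manual failover can be driven from the CLI (openstack loadbalancer failover <lb-id>) or, as a sketch, via openstacksdk (assuming a recent SDK release that exposes the failover call; the LB ID is a placeholder):

    import openstack

    conn = openstack.connect(cloud="envvars")  # credentials from OS_* env vars
    # Rebuilds the LB's amphorae; the LB goes PENDING_UPDATE while it runs.
    conn.load_balancer.failover_load_balancer("<lb-id>")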
*** armax has joined #openstack-lbaas | 23:36 | |