Monday, 2020-07-20

*** irclogbot_0 has quit IRC00:16
*** irclogbot_2 has joined #openstack-lbaas00:19
*** ccamposr has quit IRC00:46
*** ccamposr has joined #openstack-lbaas00:46
*** armax has quit IRC01:15
*** sapd1 has joined #openstack-lbaas02:34
*** rcernin has quit IRC03:02
*** rcernin has joined #openstack-lbaas03:21
*** rcernin has quit IRC03:25
*** rcernin has joined #openstack-lbaas03:26
*** armax has joined #openstack-lbaas04:18
*** vishalmanchanda has joined #openstack-lbaas04:51
*** gcheresh has joined #openstack-lbaas04:59
*** sapd1 has quit IRC05:17
*** gcheresh has quit IRC05:38
<sorrison> Wondering if anyone is around who can help me with an Octavia issue we're experiencing in production?  05:48
<sorrison> Ahh, it's still Sunday in the States  05:50
*** gcheresh has joined #openstack-lbaas05:51
<sorrison> I figured out the issue; it was some silly local thing, but I needed to update the DB to get things back  06:31
*** maciejjozefczyk has joined #openstack-lbaas06:43
*** born2bake has joined #openstack-lbaas07:04
*** sapd1 has joined #openstack-lbaas07:11
*** halali_ has joined #openstack-lbaas07:16
*** rcernin has quit IRC07:24
*** halali_ has quit IRC07:37
*** rcernin has joined #openstack-lbaas07:38
*** rcernin has quit IRC07:44
*** rcernin has joined #openstack-lbaas07:48
*** rcernin has quit IRC07:52
*** wuchunyang has joined #openstack-lbaas08:03
<openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Allow amphorav2 to run without jobboard  https://review.opendev.org/739053  08:05
*** wuchunyang has quit IRC08:31
<openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Allow amphorav2 to run without jobboard  https://review.opendev.org/739053  08:38
<openstackgerrit> Carlos Goncalves proposed openstack/octavia-lib master: Add missing assertIsInstance checks  https://review.opendev.org/738090  08:43
*** sapd1 has quit IRC08:54
*** sapd1 has joined #openstack-lbaas09:01
*** rcernin has joined #openstack-lbaas09:10
*** ccamposr__ has joined #openstack-lbaas09:14
*** rcernin has quit IRC09:15
*** ccamposr has quit IRC09:16
*** ccamposr has joined #openstack-lbaas09:24
*** ccamposr__ has quit IRC09:25
<openstackgerrit> Merged openstack/octavia stable/ussuri: add the verify for the session  https://review.opendev.org/738403  09:36
<openstackgerrit> Merged openstack/octavia-lib master: Add missing assertIsInstance checks  https://review.opendev.org/738090  09:36
*** also_stingrayza is now known as stingrayza09:58
*** sapd1 has quit IRC10:01
*** sapd1 has joined #openstack-lbaas10:05
*** irclogbot_2 has quit IRC10:08
*** johnthetubaguy has quit IRC10:08
*** vesper11 has quit IRC10:08
*** irclogbot_2 has joined #openstack-lbaas10:09
*** johnthetubaguy has joined #openstack-lbaas10:09
*** vesper11 has joined #openstack-lbaas10:09
*** tkajinam has quit IRC10:29
<devfaz> hi, just ran into a failed tempest test, octavia_tempest_plugin.tests.api.v2.test_member.MemberAPITest.test_member_ipv4_create - it seems like a missing qos extension in neutron caused the failure http://paste.openstack.org/show/796108/  10:31
<devfaz> wait.. maybe I'm wrong.  10:32
*** trident has quit IRC10:35
*** trident has joined #openstack-lbaas10:36
<devfaz> according to the neutron log, the network still existed (requests returned 200) and got removed by tempest during cleanup  10:36
<devfaz> http://paste.openstack.org/show/796109/ - maybe the 404 of the qos-ext is (somehow?) used to return a 404 for the network request?  10:37
*** stingrayza has quit IRC10:40
*** wuchunyang has joined #openstack-lbaas10:57
*** wuchunyang has quit IRC11:23
*** stingrayza has joined #openstack-lbaas11:36
*** sapd1 has quit IRC11:38
*** also_stingrayza has joined #openstack-lbaas12:35
*** stingrayza has quit IRC12:37
*** also_stingrayza is now known as stingrayza12:37
*** rcernin has joined #openstack-lbaas13:07
*** rcernin has quit IRC13:12
*** TrevorV has joined #openstack-lbaas13:35
*** sapd1 has joined #openstack-lbaas13:51
<openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Add amphora image tag capability to Octavia flavors  https://review.opendev.org/737528  14:04
*** numans_ has joined #openstack-lbaas14:21
*** sapd1 has quit IRC14:24
<devfaz> nope - had some more time, and even with the qos-ext enabled the network is "not found" - sorry for the noise.  14:38
<openstackgerrit> Corey Bryant proposed openstack/octavia master: Drop diskimage-builder from root requirements.txt  https://review.opendev.org/741960  14:43
*** rcernin has joined #openstack-lbaas14:55
<johnsom> devfaz Those are DEBUG messages, so typically not actually causing a problem. Octavia does extension discovery against neutron to check whether a few different features are enabled, depending on the deployment.  14:58
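The discovery johnsom describes can be sketched roughly as below. The helper is hypothetical, not Octavia's actual implementation, though "qos" and "allowed-address-pairs" are real neutron extension aliases:

```python
# Hypothetical sketch of neutron extension discovery (not Octavia's real
# code): given the extension aliases neutron reports as enabled, turn
# optional features on or off instead of treating a missing extension as
# an error.

def discover_features(enabled_aliases):
    """Map neutron extension aliases to optional feature flags."""
    aliases = set(enabled_aliases)
    return {
        # QoS policies on the VIP port need the "qos" extension.
        "qos": "qos" in aliases,
        # VRRP VIPs rely on allowed-address-pairs on the amphora ports.
        "allowed_address_pairs": "allowed-address-pairs" in aliases,
    }
```

In a real deployment the alias list would come from neutron's extensions API; a missing extension then simply disables the matching feature, which is why the qos-ext 404 devfaz saw is only logged at DEBUG.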
*** rcernin has quit IRC15:00
<johnsom> devfaz I will note that, until a recently merged patch on octavia-tempest-plugin, there were some bad tests around availability zones that would cause the API tests to fail. https://review.opendev.org/737191  15:00
<devfaz> johnsom: I will cherry-pick and give it a try  15:01
<devfaz> johnsom: already tested with the latest (master) octavia-tempest-plugin, which already contained the above change; it didn't help.  15:11
<devfaz> I'm just wondering why neutron is answering "subnet not found" when the subnet still exists, because the later _cleanup is getting a 200 on delete.  15:12
<johnsom> Hmm, ok, can you paste the test failure information from the tempest run?  15:12
<devfaz> sure  15:12
<devfaz> johnsom: http://paste.openstack.org/show/796131/  15:13
<devfaz> johnsom: I don't want to waste your time. I can build the current head/master of octavia and retry my tests  15:13
<johnsom> Yeah, this rings a bell. I know there were a few API tests that got broken by accident. The AZ one was one of them; I remember there was a "missing subnet" as well, but I can't remember what the story/issue was with that one.  15:13
<devfaz> if it helps, it seems like the issue was introduced in the "stein" branch during maintenance of the branch, because my old branch didn't have the issue (afaik) - just trying to upgrade to the latest stein because of the issue we talked about on the list  15:14
<johnsom> Yeah, the tempest plugin is branchless, so if there is an issue it will most likely impact all of them.  15:15
<devfaz> well, I'm using a forked version of tempest (to avoid issues like this) and it seems the issue was introduced just by upgrading octavia from an older version of stein to the latest version of stein.  15:16
<johnsom> Yeah, I think it's related to the new "invisible resource" check we do on create calls.  15:17
<johnsom> Oh, I know what it is. It's a check ordering issue in the octavia code. One second, let me track that down.  15:19
<johnsom> devfaz It was this patch on master: https://review.opendev.org/#/c/737084/  15:20
<johnsom> let me check stein and see if that still needs this patch  15:20
<johnsom> Yeah, ok, so that patch still needs to be backported.  15:21
<johnsom> The test is failing because it's getting a different exception than expected, due to the checks being out of order in the API.  15:22
<johnsom> I will work on getting that backported  15:22
<openstackgerrit> Michael Johnson proposed openstack/octavia stable/ussuri: Prioritize policy validation  https://review.opendev.org/741967  15:25
*** spatel has joined #openstack-lbaas15:29
<johnsom> devfaz If that is the only set of errors you are getting, the "subnet not found", you are probably good to go. It is functional, just running validation checks out of order, so you may get a "subnet not found" error (because you don't have permissions to it) instead of the Forbidden that you should get. The backport will fix that, but functionally it still denies the user without permissions.  15:33
<devfaz> johnsom: good to know - thx. I just tried to backport the change to my stein branch (I will not use it, just to check)  15:38
<johnsom> Sure, NP. I'm working on the backport patches now. They should all be posted in the next half hour or so  15:39
<devfaz> thx a lot. Leaving in the next few minutes (CEST), so I won't use them in the next 12h :)  15:40
<devfaz> just waiting for the tempest result of my quick-and-dirty cherry-pick.  15:40
<devfaz> made some typos while solving the merge conflict, so tempest failed. I will use your patch tomorrow, or try it myself if you have more important issues to solve. Nevertheless - thanks a lot  15:45
<johnsom> Sure, no problem  15:46
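The ordering bug johnsom describes can be illustrated with a minimal sketch. This is hypothetical code, not Octavia's real API handler, and the exception names are illustrative: if the resource lookup runs before the policy check, an unauthorized caller sees "subnet not found" instead of Forbidden.

```python
# Hypothetical illustration of the "checks out of order" issue; not
# Octavia's actual handler.

class Forbidden(Exception):
    pass

class NotFound(Exception):
    pass

def create_member(is_authorized, subnet_exists):
    """Correct ordering: policy validation first, resource lookup second."""
    if not is_authorized:
        # An unauthorized caller must get Forbidden, never a hint about
        # whether the subnet exists.
        raise Forbidden()
    if not subnet_exists:
        raise NotFound("subnet not found")
    return "member created"
```

With the two checks swapped, the unauthorized case would raise NotFound instead, which is exactly the unexpected exception that broke the tempest test; the "Prioritize policy validation" backports restore the order above, so the request is still denied, just with the expected exception.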
*** gcheresh has quit IRC15:48
<openstackgerrit> Michael Johnson proposed openstack/octavia stable/train: Prioritize policy validation  https://review.opendev.org/741974  15:51
<openstackgerrit> Michael Johnson proposed openstack/octavia stable/stein: Prioritize policy validation  https://review.opendev.org/741975  15:52
<devfaz> 2020-07-20 15:52:58.703 822 INFO tempest-verifier [-] {0} octavia_tempest_plugin.tests.api.v2.test_member.MemberAPITest.test_member_ipv4_create ... success [8.310s]  15:53
<devfaz> so, now I can go :)  15:53
<devfaz> have a nice day.  15:53
<openstackgerrit> Michael Johnson proposed openstack/octavia stable/train: Prioritize policy validation  https://review.opendev.org/741974  16:23
<openstackgerrit> Michael Johnson proposed openstack/octavia stable/stein: Prioritize policy validation  https://review.opendev.org/741975  16:26
<rm_work> johnsom: could you throw another review at https://review.opendev.org/#/c/740815/ ? the other side (the control-plane side) is finished  16:26
<johnsom> Yes, sometime today, probably  16:31
<openstackgerrit> Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code  https://review.opendev.org/718192  16:35
<openstackgerrit> Michael Johnson proposed openstack/octavia master: Clarify the current status of Octavia in README  https://review.opendev.org/741988  16:37
<johnsom> Someone posted on ask.o.o that they were confused by the "Under development" statements, so I'm trying to clarify some of that.  16:39
*** laerling is now known as Guest6851616:55
*** rcernin has joined #openstack-lbaas16:56
*** Guest68516 has quit IRC17:00
*** rcernin has quit IRC17:01
*** ccamposr__ has joined #openstack-lbaas17:45
*** ccamposr has quit IRC17:48
*** spatel has quit IRC18:42
openstackgerritBrian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code  https://review.opendev.org/71819219:00
<johnsom> rm_work done  19:27
*** maciejjozefczyk has quit IRC19:33
*** vishalmanchanda has quit IRC19:41
*** gthiemonge has quit IRC19:42
*** gthiemonge has joined #openstack-lbaas19:42
<aannuusshhkkaa> hey johnsom, do we need to create a new message version (4) when we add Response Time (rtime) to the health message?  20:20
<johnsom> Yes, you would  20:20
<aannuusshhkkaa> okay thanks!  20:22
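The reason a new version is needed can be sketched as follows. This is an illustrative sketch only: the real amphora heartbeat is a packed, HMAC-signed UDP payload rather than a plain dict, and the field names here are assumptions.

```python
# Illustrative sketch (not Octavia's wire format): adding a field to the
# health message changes its layout, so the version number must be bumped
# and the receiver must branch on it to stay compatible with old amphorae.

def build_health_message(amphora_id, rtime=None):
    if rtime is None:
        return {"ver": 3, "id": amphora_id}          # old-style message
    return {"ver": 4, "id": amphora_id, "rtime": rtime}  # new field => new version

def parse_health_message(msg):
    # Older versions carry no rtime; only trust the field on version >= 4.
    rtime = msg["rtime"] if msg["ver"] >= 4 else None
    return msg["id"], rtime
```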
*** shtepanie has joined #openstack-lbaas20:31
*** rcernin has joined #openstack-lbaas20:58
*** rcernin has quit IRC21:02
<openstackgerrit> Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code  https://review.opendev.org/718192  21:02
*** TrevorV has quit IRC22:19
<sorrison> Got an issue in prod which I'm struggling to fix: some LBs are stuck in PENDING_UPDATE and they have heaps of amphorae in ERROR status. E.g. http://paste.openstack.org/show/796141/  22:41
*** shtepanie has quit IRC22:41
*** born2bake has quit IRC22:46
<johnsom> sorrison What version are you running?  22:49
<sorrison> Train, with the AZ patches on top  22:50
<sorrison> Been trying to find a way to just blow all the amphorae away and rebuild them for the affected LBs  22:50
<sorrison> have 5 out of about 30 LBs in this state  22:50
<johnsom> You will be very interested in this patch chain: https://review.opendev.org/739772  22:51
<johnsom> But let's see what is going on and where you are at now.  22:51
<johnsom> So, the LB in pending create - did someone/something kill -9 one of the controller processes?  22:52
<sorrison> luckily one of the LBs is one of mine, so I can experiment with it; the others are customers'  22:52
<sorrison> they're in PENDING_UPDATE  22:52
<johnsom> Does the log show it still trying to repair it?  22:52
<johnsom> Ah, yeah, pending_update, that is what I meant  22:52
<sorrison> it looks like it keeps trying to build more and they fail  22:53
*** tkajinam has joined #openstack-lbaas22:53
<johnsom> What is the error code you get when they fail? I assume this is the HM log  22:53
<sorrison> but I've spun some up manually and with tempest and they're all good  22:53
<sorrison> I see "octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out" errors  22:55
<sorrison> but they look like they're trying to contact the dead amphora  22:55
<sorrison> It spins up new ones and puts them into ALLOCATED  22:55
<sorrison> and I can ping them etc., but it seems to be stuck trying to talk to the dead one  22:56
<johnsom> Yeah, it should do that. Ok, let's pick one of the amphorae in ERROR, preferably one on your LB.  22:56
<sorrison> the ones in ERROR are dead  22:58
<sorrison> I can ping the ones in ALLOCATED state, but they have no role  22:58
<johnsom> There are no nova instances behind them?  22:58
*** rcernin has joined #openstack-lbaas22:58
<sorrison> no, they're gone; I assume that's the source of the issue, though I haven't figured out why yet  22:58
*** rcernin has quit IRC22:59
<johnsom> No, that is ok; that just means the cleanup worked correctly, which is good.  22:59
<sorrison> the broken LBs are in different states  22:59
*** rcernin has joined #openstack-lbaas22:59
<sorrison> I think I have some broken ones where either the master or the backup is still contactable  22:59
<johnsom> Is load balancer f784f824-64a5-4f13-8d54-fda54b528282 yours?  22:59
<johnsom> Yeah, Octavia will fail safe. When parts of the cloud are broken, we stop before we would break the actual load balancer tenant traffic flow  23:00
<sorrison> no, that LB isn't mine  23:00
<sorrison> but it looks similar  23:01
<johnsom> Ok, if there is one "ALLOCATED" there on your LB, let's do an "openstack console log show" on it  23:01
<johnsom> We are looking for the net device info box that looks like this:  23:02
<johnsom> https://www.irccloud.com/pastebin/dTx8M0aU/  23:02
<johnsom> Ha, well, with better formatting.  23:02
<johnsom> We want to confirm that the first line in that first box has a valid IP address  23:02
<johnsom> for the AZ this amp lives in  23:03
<sorrison> http://paste.openstack.org/show/796143/  23:03
<sorrison> looks good  23:04
<sorrison> I can ssh in etc. too  23:04
<johnsom> Oh, ok, good.  23:04
<sorrison> when I restarted the amphora-agent on one of them last night, the nova instance just got deleted  23:04
<johnsom> Yeah, you have to be careful with that.  23:05
<johnsom> So, on that instance, in syslog, can you check that there are no errors from the agent?  23:05
<johnsom> They will not be in the agent log on that old version  23:05
<sorrison> I don't see any errors  23:06
<johnsom> Well, ok, it was in train, so they might be in the agent log.  23:06
<johnsom> Ok, yeah, I wouldn't expect to, given what I have seen so far. I think the amp is healthy.  23:07
<sorrison> the last line is a PUT /1.0/config HTTP/1.1" 202 16 "-"  23:07
<johnsom> Ok, can you take a quick look at the haproxy log and make sure there are no errors there?  23:07
<johnsom> It should be clear (other than the file descriptor warnings, which are normal).  23:07
<sorrison> haproxy isn't running  23:08
<sorrison> nor is there a netns  23:08
<johnsom> Ok, so it didn't get far enough to send a config over. Fair enough.  23:08
<johnsom> Yeah, the amp is fine, so we need to turn our attention to the controllers.  23:08
<johnsom> The timeout message you pasted - was it of level ERROR or WARN?  23:09
<sorrison> ERROR  23:09
<johnsom> Hmmm, are you running the controllers in containers?  23:09
<sorrison> no  23:10
<johnsom> Ok. And when you ssh to the amphora, are you doing that from one of the controller hosts? I.e., is it going over the o-hm0?  23:11
<sorrison> I didn't ssh, I used the vnc console  23:11
<johnsom> Sorry, I'm going to ask a lot of questions to narrow down where things are not behaving. We will get to an answer....  23:11
<sorrison> but I can ssh in if required  23:11
<sorrison> just a bit more effort  23:11
<johnsom> Oh, yeah, very different  23:12
<johnsom> If they don't have a keypair that is ok, I just need to know  23:12
<sorrison> they have an SSH CA keypair  23:12
<sorrison> I just need to reissue myself a new key  23:13
<johnsom> Really, what I want to do is verify that the controllers - the health manager in particular - can actually reach the amps over the o-hm0. It seems the amps can send, but we haven't shown that the controllers can reach them.  23:14
<sorrison> in the amp logs I see HTTP requests from the controller though  23:15
<johnsom> Try doing this from the controller: openssl s_client -connect 192.168.1.79:9443  23:15
<johnsom> Yeah, but that could be from one controller while another may not be able to reach it - assuming you have multiple controllers  23:16
<johnsom> You should get a response something like this (and a bunch more information):  23:18
<johnsom> https://www.irccloud.com/pastebin/acJh9hYl/  23:18
<johnsom> The CN should equal the amphora ID  23:18
<sorrison> Certificate chain  23:19
<sorrison>  0 s:CN = 73fc6855-70e2-4103-8f9d-23953762cad6  23:19
<sorrison>    i:C = AU, ST = Victoria, L = Melbourne, O = Nectar Research Cloud, OU = Octavia, CN = Server CA  23:19
<johnsom> You might have to control-c out of that connection too. It should boot you, but...  23:19
<johnsom> Yeah, ok, so that controller can reach it fine.  23:19
<johnsom> If you can do a quick check of the other controller instances, that would rule out that one controller is net-split  23:19
<sorrison> yeah, tried all 3, all good there  23:20
<johnsom> Ok, so for sure the network isn't the problem.  23:20
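The same reachability check can be scripted from Python. This is a rough sketch, not Octavia tooling: the amphora agent normally requires a client certificate, so a full application-level exchange may still fail even when the port is reachable; this only confirms something is listening and speaking TLS. The CN helper assumes the dict layout returned by `ssl.SSLSocket.getpeercert()`.

```python
import socket
import ssl

def tls_port_open(host, port=9443, timeout=5.0):
    """Return True if a TCP connection succeeds and a TLS handshake starts."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # probing reachability, not trust
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock) as tls:
                return tls.version() is not None
    except (OSError, ssl.SSLError):
        return False

def cert_common_name(cert):
    """Extract the CN from a getpeercert()-style dict; per the transcript,
    it should equal the amphora ID."""
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```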
<johnsom> Ok, for your load balancer, let's see if the controller got killed while it was working on the LB or if it is still trying.  23:21
<johnsom> We need to tail the worker and health manager log files and check for repeated messages that it is trying to connect to "192.168.1.79"  23:22
<johnsom> I hope we see that it is not still retrying and someone just killed a controller with -9 (systemd is good at doing this for you when it should not)  23:23
*** armax has quit IRC23:23
<sorrison> Nothing happening with the worker logs at all  23:24
<sorrison> we don't have debug on though  23:24
<johnsom> Ok, that is fine  23:25
<johnsom> How about the HM?  23:25
<sorrison> ahh  23:26
<sorrison> looks like some issues on the hosts that run the HM  23:26
<sorrison> no route to host  23:27
<sorrison> somehow I missed that  23:27
<sorrison> the HM is on different hosts  23:27
<sorrison> Let me fix that up, as I think I know what is going on  23:27
<johnsom> +1  23:28
<sorrison> I needed to restart the HM processes, but it should be better now, I think  23:32
<sorrison> Should I just wait a bit and it might fix itself up?  23:33
<johnsom> No, by now it has probably hit the "fail safe" mode. I would try a manual load balancer failover once this is fixed.  23:34
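Octavia's failover API rejects load balancers in an immutable PENDING_* provisioning state, which is why stuck PENDING_UPDATE rows have to be repaired in the DB before a manual failover can run. A minimal sketch of that guard (a hypothetical helper, not Octavia code; the status names are Octavia's):

```python
# Hypothetical guard mirroring the rule that a manual failover only
# proceeds from a stable provisioning status; PENDING_* states are
# immutable, so a stuck PENDING_UPDATE row must be reset (e.g. to ERROR)
# in the DB first.

IMMUTABLE = {"PENDING_CREATE", "PENDING_UPDATE", "PENDING_DELETE"}

def can_trigger_failover(provisioning_status):
    return provisioning_status not in IMMUTABLE
```

The failover itself would then be triggered via the Octavia admin API (for example through python-openstackclient or openstacksdk; the exact call name is not shown in this transcript).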
*** armax has joined #openstack-lbaas23:36

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!