*** irclogbot_0 has quit IRC | 00:16 | |
*** irclogbot_2 has joined #openstack-lbaas | 00:19 | |
*** ccamposr has quit IRC | 00:46 | |
*** ccamposr has joined #openstack-lbaas | 00:46 | |
*** armax has quit IRC | 01:15 | |
*** sapd1 has joined #openstack-lbaas | 02:34 | |
*** rcernin has quit IRC | 03:02 | |
*** rcernin has joined #openstack-lbaas | 03:21 | |
*** rcernin has quit IRC | 03:25 | |
*** rcernin has joined #openstack-lbaas | 03:26 | |
*** armax has joined #openstack-lbaas | 04:18 | |
*** vishalmanchanda has joined #openstack-lbaas | 04:51 | |
*** gcheresh has joined #openstack-lbaas | 04:59 | |
*** sapd1 has quit IRC | 05:17 | |
*** gcheresh has quit IRC | 05:38 | |
sorrison | Wondering if anyone is around who can help me with an Octavia issue we're experiencing in production? | 05:48 |
sorrison | Ahh it's still Sunday in the states | 05:50 |
*** gcheresh has joined #openstack-lbaas | 05:51 | |
sorrison | I figured out the issue; it was some silly local thing, but I needed to update the DB to get things back | 06:31 |
*** maciejjozefczyk has joined #openstack-lbaas | 06:43 | |
*** born2bake has joined #openstack-lbaas | 07:04 | |
*** sapd1 has joined #openstack-lbaas | 07:11 | |
*** halali_ has joined #openstack-lbaas | 07:16 | |
*** rcernin has quit IRC | 07:24 | |
*** halali_ has quit IRC | 07:37 | |
*** rcernin has joined #openstack-lbaas | 07:38 | |
*** rcernin has quit IRC | 07:44 | |
*** rcernin has joined #openstack-lbaas | 07:48 | |
*** rcernin has quit IRC | 07:52 | |
*** wuchunyang has joined #openstack-lbaas | 08:03 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia master: Allow amphorav2 to run without jobboard https://review.opendev.org/739053 | 08:05 |
*** wuchunyang has quit IRC | 08:31 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia master: Allow amphorav2 to run without jobboard https://review.opendev.org/739053 | 08:38 |
openstackgerrit | Carlos Goncalves proposed openstack/octavia-lib master: Add missing assertIsInstance checks https://review.opendev.org/738090 | 08:43 |
*** sapd1 has quit IRC | 08:54 | |
*** sapd1 has joined #openstack-lbaas | 09:01 | |
*** rcernin has joined #openstack-lbaas | 09:10 | |
*** ccamposr__ has joined #openstack-lbaas | 09:14 | |
*** rcernin has quit IRC | 09:15 | |
*** ccamposr has quit IRC | 09:16 | |
*** ccamposr has joined #openstack-lbaas | 09:24 | |
*** ccamposr__ has quit IRC | 09:25 | |
openstackgerrit | Merged openstack/octavia stable/ussuri: add the verify for the session https://review.opendev.org/738403 | 09:36 |
openstackgerrit | Merged openstack/octavia-lib master: Add missing assertIsInstance checks https://review.opendev.org/738090 | 09:36 |
*** also_stingrayza is now known as stingrayza | 09:58 | |
*** sapd1 has quit IRC | 10:01 | |
*** sapd1 has joined #openstack-lbaas | 10:05 | |
*** irclogbot_2 has quit IRC | 10:08 | |
*** johnthetubaguy has quit IRC | 10:08 | |
*** vesper11 has quit IRC | 10:08 | |
*** irclogbot_2 has joined #openstack-lbaas | 10:09 | |
*** johnthetubaguy has joined #openstack-lbaas | 10:09 | |
*** vesper11 has joined #openstack-lbaas | 10:09 | |
*** tkajinam has quit IRC | 10:29 | |
devfaz | hi, just ran into a failed tempest test octavia_tempest_plugin.tests.api.v2.test_member.MemberAPITest.test_member_ipv4_create - it seems a missing qos extension in neutron caused the failure http://paste.openstack.org/show/796108/ | 10:31 |
devfaz | wait... maybe I'm wrong. | 10:32 |
*** trident has quit IRC | 10:35 | |
*** trident has joined #openstack-lbaas | 10:36 | |
devfaz | according to the neutron log, the network still existed (requests returned 200) and got removed by tempest during cleanup | 10:36 |
devfaz | http://paste.openstack.org/show/796109/ - maybe the 404 from the qos-ext is (somehow?) being used to return a 404 for the network request? | 10:37 |
*** stingrayza has quit IRC | 10:40 | |
*** wuchunyang has joined #openstack-lbaas | 10:57 | |
*** wuchunyang has quit IRC | 11:23 | |
*** stingrayza has joined #openstack-lbaas | 11:36 | |
*** sapd1 has quit IRC | 11:38 | |
*** also_stingrayza has joined #openstack-lbaas | 12:35 | |
*** stingrayza has quit IRC | 12:37 | |
*** also_stingrayza is now known as stingrayza | 12:37 | |
*** rcernin has joined #openstack-lbaas | 13:07 | |
*** rcernin has quit IRC | 13:12 | |
*** TrevorV has joined #openstack-lbaas | 13:35 | |
*** sapd1 has joined #openstack-lbaas | 13:51 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia master: Add amphora image tag capability to Octavia flavors https://review.opendev.org/737528 | 14:04 |
*** numans_ has joined #openstack-lbaas | 14:21 | |
*** sapd1 has quit IRC | 14:24 | |
devfaz | nope, had some more time; even with the qos-ext enabled the network is "not found". Sorry for the noise. | 14:38 |
openstackgerrit | Corey Bryant proposed openstack/octavia master: Drop diskimage-builder from root requirements.txt https://review.opendev.org/741960 | 14:43 |
*** rcernin has joined #openstack-lbaas | 14:55 | |
johnsom | devfaz Those are DEBUG messages, so typically not actually causing a problem. Octavia does extension discovery against neutron to check which optional features are enabled, depending on the deployment. | 14:58 |
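For reference, that discovery is essentially a query of neutron's enabled-extension list. A minimal sketch with openstacksdk (the cloud name and printed aliases are illustrative, not Octavia's actual code):

    import openstack

    # Credentials come from the OS_* environment variables.
    conn = openstack.connect(cloud="envvars")

    # Octavia probes the enabled neutron extensions to decide which
    # optional features (e.g. qos, allowed-address-pairs) it can use.
    aliases = {ext.alias for ext in conn.network.extensions()}
    print("qos enabled:", "qos" in aliases)
    print("allowed-address-pairs enabled:", "allowed-address-pairs" in aliases)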
*** rcernin has quit IRC | 15:00 | |
johnsom | devfaz I will note that, until a recently merged patch on octavia-tempest-plugin, there were some bad tests around availability zones that would cause the API tests to fail. https://review.opendev.org/737191 | 15:00 |
devfaz | johnsom: I will cherry-pick it and give it a try | 15:01 |
devfaz | johnsom: already tested with the latest (master) octavia-tempest-plugin, which already contained the above change; it didn't help. | 15:11 |
devfaz | I'm just wondering why neutron is answering "subnet not found" when the subnet still exists, because the later _cleanup is getting a 200 on delete. | 15:12 |
johnsom | Hmm, ok, can you paste the test failure information from the tempest run? | 15:12 |
devfaz | sure | 15:12 |
devfaz | johnsom: http://paste.openstack.org/show/796131/ | 15:13 |
devfaz | johnsom: don't want to waste your time. I can build the current head/master of octavia and retry my tests | 15:13 |
johnsom | Yeah, this rings a bell. I know there were a few API tests that got broken by accident. The AZ was one, I remember there was a "missing subnet" as well, but I can't remember what the story/issue was with that one. | 15:13 |
devfaz | if it helps, it seems the issue was introduced in the "stein" branch during maintenance of the branch, because my old branch didn't have the issue (afaik) - just trying to upgrade to the latest stein because of the issue we talked about on the list | 15:14 |
johnsom | Yeah, the tempest plugin is branchless, so if there is an issue it will most likely impact all of the branches. | 15:15 |
devfaz | well, I'm using a forked version of tempest (to avoid issues like this), and it seems the issue was introduced just by upgrading octavia from an older version of stein to the latest version of stein. | 15:16 |
johnsom | Yeah, I think it's related to the new "invisible resource" check we do on create calls. | 15:17 |
johnsom | Oh, I know what it is. It's a check ordering issue in the octavia code. One second, let me track that down. | 15:19 |
johnsom | devfaz It was this patch on master: https://review.opendev.org/#/c/737084/ | 15:20 |
johnsom | let me check stein and see if that still needs this patch | 15:20 |
johnsom | Yeah, ok, so that patch still needs to be backported. | 15:21 |
johnsom | The test is failing because it's getting a different exception than is expected due to the checks being out of order in the API. | 15:22 |
johnsom | I will work on getting that backported | 15:22 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/ussuri: Prioritize policy validation https://review.opendev.org/741967 | 15:25 |
*** spatel has joined #openstack-lbaas | 15:29 | |
johnsom | devfaz If that is the only set of errors you are getting (the "subnet not found"), you are probably good to go. It is functional, just running validation checks out of order, so you may get a "subnet not found" error (because you don't have permissions to it) instead of the Forbidden that you should get. The backport will fix that, but functionally it still denies the user without permissions. | 15:33 |
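The ordering problem boils down to this shape (hypothetical names, not Octavia's actual handlers; the real fix is https://review.opendev.org/#/c/737084/):

    class Forbidden(Exception):
        pass

    class NotFound(Exception):
        pass

    def create_member(user, member_spec, db):
        # Authorize first: callers without permission get Forbidden,
        # regardless of whether the referenced subnet exists.
        if not user.is_authorized("member:post"):
            raise Forbidden()
        # Only after the policy check do we resolve resources, so a 404
        # really means the subnet is missing or invisible to the caller.
        if db.find_subnet(member_spec["subnet_id"]) is None:
            raise NotFound("subnet not found")
        return db.create_member(member_spec)

With the two checks reversed, an unauthorized caller is told "subnet not found" instead of Forbidden, which is exactly the mismatch the tempest test tripped over.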
devfaz | johnsom: good to know - thx. I just tried to backport the change to my stein branch (I will not use it, just checking) | 15:38 |
johnsom | Sure, NP. I'm working on the backport patches now. They should all be posted in the next half hour or so | 15:39 |
devfaz | thx a lot. Leaving in the next few minutes (CEST), so I won't be using them in the next 12h :) | 15:40 |
devfaz | just waiting for the tempest result of my quick-and-dirty-cherry-pick. | 15:40 |
devfaz | made some typos while resolving the merge conflict, so tempest failed. I will use your patch tomorrow, or try it myself if you have more important issues to solve. Nevertheless, thanks a lot | 15:45 |
johnsom | Sure, no problem | 15:46 |
*** gcheresh has quit IRC | 15:48 | |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/train: Prioritize policy validation https://review.opendev.org/741974 | 15:51 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/stein: Prioritize policy validation https://review.opendev.org/741975 | 15:52 |
devfaz | 2020-07-20 15:52:58.703 822 INFO tempest-verifier [-] {0} octavia_tempest_plugin.tests.api.v2.test_member.MemberAPITest.test_member_ipv4_create ... success [8.310s] | 15:53 |
devfaz | so, now I can go :) | 15:53 |
devfaz | have a nice day. | 15:53 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/train: Prioritize policy validation https://review.opendev.org/741974 | 16:23 |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/stein: Prioritize policy validation https://review.opendev.org/741975 | 16:26 |
rm_work | johnsom: could you throw another review at https://review.opendev.org/#/c/740815/ ? the other side (the control-plane side) is finished | 16:26 |
johnsom | Yes, sometime today, probably | 16:31 |
openstackgerrit | Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code https://review.opendev.org/718192 | 16:35 |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Clarify the current status of Octavia in README https://review.opendev.org/741988 | 16:37 |
johnsom | Someone posted on ask.o.o that they were confused by the "Under development" statements, so I'm trying to clarify some of that. | 16:39 |
*** laerling is now known as Guest68516 | 16:55 | |
*** rcernin has joined #openstack-lbaas | 16:56 | |
*** Guest68516 has quit IRC | 17:00 | |
*** rcernin has quit IRC | 17:01 | |
*** ccamposr__ has joined #openstack-lbaas | 17:45 | |
*** ccamposr has quit IRC | 17:48 | |
*** spatel has quit IRC | 18:42 | |
openstackgerrit | Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code https://review.opendev.org/718192 | 19:00 |
johnsom | rm_work done | 19:27 |
*** maciejjozefczyk has quit IRC | 19:33 | |
*** vishalmanchanda has quit IRC | 19:41 | |
*** gthiemonge has quit IRC | 19:42 | |
*** gthiemonge has joined #openstack-lbaas | 19:42 | |
aannuusshhkkaa | hey johnsom, do we need to create a new message version (4) when we add Response Time (rtime) to the health message? | 20:20 |
johnsom | Yes, you would | 20:20 |
aannuusshhkkaa | okay thanks! | 20:22 |
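A loose sketch of what that version bump looks like on the sender side (illustrative field names and JSON/HMAC framing; the real amphora health message uses its own binary packing):

    import hashlib
    import hmac
    import json

    HEALTH_KEY = b"insecure-demo-key"  # placeholder shared secret

    def build_health_message(amphora_id, listeners, rtime_ms):
        msg = {
            "ver": 4,            # bumped: version 3 receivers don't know rtime
            "id": amphora_id,
            "listeners": listeners,
            "rtime": rtime_ms,   # the new response-time field
        }
        payload = json.dumps(msg, sort_keys=True).encode()
        digest = hmac.new(HEALTH_KEY, payload, hashlib.sha256).digest()
        # The receiver verifies the HMAC, then dispatches on "ver" so old
        # and new amphorae can coexist during an upgrade.
        return payload + digest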
*** shtepanie has joined #openstack-lbaas | 20:31 | |
*** rcernin has joined #openstack-lbaas | 20:58 | |
*** rcernin has quit IRC | 21:02 | |
openstackgerrit | Brian Haley proposed openstack/octavia master: Remove Neutron SDN-specific code https://review.opendev.org/718192 | 21:02 |
*** TrevorV has quit IRC | 22:19 | |
sorrison | Got an issue in prod that I'm struggling to fix: some LBs are stuck in PENDING_UPDATE and have heaps of amphora in ERROR status. E.g. http://paste.openstack.org/show/796141/ | 22:41 |
*** shtepanie has quit IRC | 22:41 | |
*** born2bake has quit IRC | 22:46 | |
johnsom | sorrison What version are you running? | 22:49 |
sorrison | Train with the AZ patches on top | 22:50 |
sorrison | Been trying to find a way to just blow all the amphora away and rebuild them for the affected LBs | 22:50 |
sorrison | have 5 out of about 30 LBs in this state | 22:50 |
johnsom | You will be very interested in this patch chain: https://review.opendev.org/739772 | 22:51 |
johnsom | But let's see what is going on and where you are at now. | 22:51 |
johnsom | So, the LB in pending create, did someone/something kill -9 one of the controller processes? | 22:52 |
sorrison | luckily one of the LBs is one of mine so can experiment with it, the others are customers | 22:52 |
sorrison | they're in pending_update | 22:52 |
johnsom | Does the log show it still trying to repair it? | 22:52 |
johnsom | Ah, yeah, pending_update, that is what I meant | 22:52 |
sorrison | it looks like it keeps trying to build more and they fail | 22:53 |
*** tkajinam has joined #openstack-lbaas | 22:53 | |
johnsom | What is the error code you get when they fail? I assume this is the HM log | 22:53 |
sorrison | but I've spun some up manually and with tempest and they're all good | 22:53 |
sorrison | I see octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out errors | 22:55 |
sorrison | but those look like attempts to contact the dead amphora | 22:55 |
sorrison | It spins up new ones and puts them into ALLOCATED | 22:55 |
sorrison | and I can ping them etc. but it seems to be stuck trying to talk to the dead one | 22:56 |
johnsom | Yeah, it should do that. Ok, let's pick one of the amphora in ERROR, preferably one on your LB. | 22:56 |
sorrison | the ones in ERROR are dead | 22:58 |
sorrison | I can ping ones in ALLOCATED state but they have no role | 22:58 |
johnsom | There are no nova instances behind them? | 22:58 |
*** rcernin has joined #openstack-lbaas | 22:58 | |
sorrison | no, they are gone; I assume that's the source of the issue, though I haven't figured out why yet | 22:58 |
*** rcernin has quit IRC | 22:59 | |
johnsom | No, that is ok, that just means the cleanup worked correctly which is good. | 22:59 |
sorrison | the broken LBs are in different states | 22:59 |
*** rcernin has joined #openstack-lbaas | 22:59 | |
sorrison | I think I have some broken ones where either the master or the backup is still contactable | 22:59 |
johnsom | Is load balancer f784f824-64a5-4f13-8d54-fda54b528282 yours? | 22:59 |
johnsom | Yeah, Octavia will fail safe. When parts of the cloud are broken we stop before we would break the actual load balancer tenant traffic flow | 23:00 |
sorrison | no that LB isn't mine | 23:00 |
sorrison | but looks similar | 23:01 |
johnsom | Ok, if there is one in "ALLOCATED" on your LB, let's do an "openstack console log show" on it | 23:01 |
johnsom | We are looking for the net device info box that looks like this: | 23:02 |
johnsom | https://www.irccloud.com/pastebin/dTx8M0aU/ | 23:02 |
johnsom | Ha, well, with better formatting. | 23:02 |
johnsom | We want to confirm that the first line in that first box has a valid IP address | 23:02 |
johnsom | for the AZ this amp lives in | 23:03 |
sorrison | http://paste.openstack.org/show/796143/ | 23:03 |
sorrison | looks good | 23:04 |
sorrison | I can ssh in etc too | 23:04 |
johnsom | Oh, ok, good. | 23:04 |
sorrison | when I restarted amphora-agent on one of them last night, the nova instance just got deleted | 23:04 |
johnsom | Yeah, you have to be careful with that. | 23:05 |
johnsom | So, on that instance, in syslog, can you check that there are no errors from the agent? | 23:05 |
johnsom | They will not be in the agent log on that old version | 23:05 |
sorrison | I don't see any errors | 23:06 |
johnsom | Well, ok, it was in train, so they might be in the agent log. | 23:06 |
johnsom | Ok, yeah, I wouldn't expect any, given what I have seen so far. I think the amp is healthy. | 23:07 |
sorrison | last line is a PUT /1.0/config HTTP/1.1" 202 16 "-" | 23:07 |
johnsom | Ok, can you take a quick look at the haproxy log and make sure there are no errors there? | 23:07 |
johnsom | It should be clear (other than the file descriptor warnings, which are normal). | 23:07 |
sorrison | haproxy isn't running | 23:08 |
sorrison | nor is there a netns | 23:08 |
johnsom | Ok, so it didn't get far enough to send a config over. Fair enough. | 23:08 |
johnsom | Yeah, the amp is fine, so we need to turn our attention to the controllers. | 23:08 |
johnsom | The timeout message you pasted, it was of level ERROR or WARN? | 23:09 |
sorrison | ERROR | 23:09 |
johnsom | Hmmm, are you running the controllers in containers? | 23:09 |
sorrison | no | 23:10 |
johnsom | Ok. And when you ssh to the amphora, are you doing that from one of the controller hosts? I.e., is it going over o-hm0? | 23:11 |
sorrison | I didn't ssh, I vnc consoled | 23:11 |
johnsom | Sorry, I'm going to ask a lot of questions to narrow down where things are not behaving. We will get to an answer.... | 23:11 |
sorrison | but I can ssh in if required | 23:11 |
sorrison | just a bit more effort | 23:11 |
johnsom | Oh, yeah, very different | 23:12 |
johnsom | If they don't have a keypair that is ok, I just need to know | 23:12 |
sorrison | they have a SSH CA keypair | 23:12 |
sorrison | I just need to reissue myself a new key | 23:13 |
johnsom | Really what I want to do is verify that the controllers, the health manager in particular, can actually reach the amps over o-hm0. It seems the amps can send, but we haven't shown that the controllers can reach them. | 23:14 |
sorrison | in the amp logs I see HTTP requests from the controller though | 23:15 |
johnsom | Try doing this from the controller: openssl s_client -connect 192.168.1.79:9443 | 23:15 |
johnsom | Yeah, but that could be from one controller while another may not be able to reach it, assuming you have multiple controllers | 23:16 |
johnsom | You should get a response something like this (and a bunch more information): | 23:18 |
johnsom | https://www.irccloud.com/pastebin/acJh9hYl/ | 23:18 |
johnsom | The CN should equal the amphora ID | 23:18 |
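The same check can be scripted if there are several controllers to test from. A sketch using Python's ssl module plus the cryptography package (the IP is from this conversation; the client-cert path is a hypothetical example):

    import socket
    import ssl

    from cryptography import x509
    from cryptography.x509.oid import NameOID

    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # we only want to read the CN here
    # If the amphora agent rejects handshakes without a client cert,
    # load the controller's client certificate first, e.g.:
    # ctx.load_cert_chain("/etc/octavia/certs/client.pem")

    with socket.create_connection(("192.168.1.79", 9443), timeout=5) as sock:
        with ctx.wrap_socket(sock) as tls:
            der = tls.getpeercert(binary_form=True)

    cert = x509.load_der_x509_certificate(der)
    cn = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)[0].value
    print("amphora CN:", cn)  # should match the amphora ID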
sorrison | Certificate chain | 23:19 |
sorrison | 0 s:CN = 73fc6855-70e2-4103-8f9d-23953762cad6 | 23:19 |
sorrison | i:C = AU, ST = Victoria, L = Melbourne, O = Nectar Research Cloud, OU = Octavia, CN = Server CA | 23:19 |
johnsom | You might have to control-c out of that connection too. It should boot you, but | 23:19 |
johnsom | Yeah, ok, so that controller can reach it fine. | 23:19 |
johnsom | If you can do a quick check of the other controller instances, that would rule out that one controller is net split | 23:19 |
sorrison | yeah tried all 3, all good there | 23:20 |
johnsom | Ok, so for sure the network isn't the problem. | 23:20 |
johnsom | Ok, for your load balancer, let's see if the controller got killed while it was working on the LB or if it is still trying. | 23:21 |
johnsom | We need to tail the worker and health manager log files and check for repeated messages that it is trying to connect to "192.168.1.79" | 23:22 |
johnsom | I hope we see that it is not still retrying and someone just killed a controller with -9 (systemd is good at doing this for you when it should not) | 23:23 |
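A quick way to count those retries (the log path is a common default and an assumption; adjust for your deployment):

    # Scan the health-manager log for attempts against the dead amp's IP.
    with open("/var/log/octavia/health-manager.log") as f:
        hits = [line for line in f if "192.168.1.79" in line]

    print(len(hits), "log lines mention the amphora's IP; last few:")
    print("".join(hits[-5:]))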
*** armax has quit IRC | 23:23 | |
sorrison | Nothing happening with the worker logs at all | 23:24 |
sorrison | we don't have debug on though | 23:24 |
johnsom | Ok, that is fine | 23:25 |
johnsom | How about HM? | 23:25 |
sorrison | ahh | 23:26 |
sorrison | looks like some issues on the hosts that run HM | 23:26 |
sorrison | no route to host | 23:27 |
sorrison | somehow I missed that | 23:27 |
sorrison | HM is on different hosts | 23:27 |
sorrison | Let me fix that up as I think I know what is going on | 23:27 |
johnsom | +1 | 23:28 |
sorrison | I needed to restart the HM processes, but it should be better now I think | 23:32 |
sorrison | Should I just wait a bit and it might fix itself up? | 23:33 |
johnsom | No, by now it has probably hit the "fail safe" mode. I would try a manual load balancer failover once this is fixed. | 23:34 |
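The manual failover can be driven from the CLI (openstack loadbalancer failover <lb-id>) or, as a sketch, via openstacksdk (assuming a recent SDK release that exposes the failover call; the LB ID is a placeholder):

    import openstack

    conn = openstack.connect(cloud="envvars")  # credentials from OS_* env vars
    # Rebuilds the LB's amphorae; the LB goes PENDING_UPDATE while it runs.
    conn.load_balancer.failover_load_balancer("<lb-id>")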
*** armax has joined #openstack-lbaas | 23:36 | |