Monday, 2022-01-10

renich	Hello! 0/	19:23
renich	I was at #openstack-octavia and they told me to ask you. :D	19:23
renich	I'm having issues with octavia on RHOSP 16.2. When I create a load balancer, it hangs in "PENDING_CREATE" while the amphorae stay on "BOOTING" status. I've checked the console of the latter and it shows they did boot.	19:23
renich	Also, I couldn't delete it. I had to manually modify the DB in order to change the state of the loadbalancer to "ERROR" in order to do so.	19:24
renich	I had created the loadbalancer with the following command: openstack loadbalancer create --name lb1 --vip-subnet-id public-subnet	19:25
renich	I can do it again if someone lends a hand.	19:25
johnsom	renich Hi. Please do not modify the database states. Most likely one of the controllers was still trying to retry/resolve the problem with nova. By changing the state you can end up with lots of conflicts and abandoned resources. The timeouts before a controller gives up can be quite long on some deployments. RHOSP 16.2 is at least 25 minutes.	19:29
renich	johnsom: OK. I left it there all weekend.	19:29
johnsom	renich Once the controllers timeout trying to resolve the issue with nova/neutron, the object will move back to ERROR and be able to be deleted via the CLI/API	19:30
renich	johnsom: understood.	19:30
johnsom	renich Did someone kill -9 the controller or container?	19:30
renich	johnsom: I don't think so. I'm the only one doing these things for now.	19:31
renich	I might've deleted the amphorae's instance	19:31
johnsom	To see what was wrong with nova, you can look in the worker log file. It should highlight with ERROR logs as to why it was still in BOOTING	19:31
johnsom	Nah, deleting the amp would have immediately caused the LB create to go into ERROR status	19:32
johnsom	When the amp is in BOOTING we are polling nova and/or the amp to see when the VM fully boots.	19:33
renich	johnsom: OK, let me see if I can find the log. Kind of difficult in RHOSP since they're all containers. Will look for it.	19:33
johnsom	RHOSP centralizes the logs, so look under the container logs, worker	19:33
renich	johnsom: found it at controller-01; exactly where you said.	19:35
johnsom	You are looking for ERROR or WARNING level messages related to the LB you were creating	19:36
renich	I see a warning but it's not related.	19:47
renich	So, I've generated a new load-balancer: lb2. It might be a quota issue, right? https://paste.openstack.org/show/812006/	21:38
renich	I created it like this: https://paste.openstack.org/show/812007/	21:38
johnsom	No, that log snippet looks ok given you changed the status of a load balancer in the DB.	21:40
johnsom	That is basically saying the quota was inconsistent to what it should be and it corrected it.	21:41
renich	johnsom: I deleted that one and created a new one.	21:41
renich	Oh, OK.	21:41
renich	Still, when I use: openstack quota show <admin-id> I see load_balancers: 0	21:42
renich	https://paste.openstack.org/show/812008/	21:42
renich	Sorry; not zero but "None"	21:43
johnsom	Yeah, that quota was for old neutron-lbaas. It was mistakenly added to that command and is removed in later versions of OpenStack. Octavia quota is managed through "openstack loadbalancer quota <...>".	21:43
johnsom	openstack quota show "load_balancers" will never be anything different than 0 because there is no code behind it. Just someone's CLI mistake	21:44
johnsom	https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#quota	21:44
renich	OK. Thank you for clarifying.	21:45
renich	All are -1	21:45
johnsom	This log message is odd: "Invalid state PENDING_CREATE of loadbalancer resource 45ec9ff5-ec2e-4c86-87ba-c41b1781cd54" Do you know what resource that UUID points to?	21:45
renich	let me check.	21:46
johnsom	-1 is unlimited	21:46
renich	Oh, OK.	21:46
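[Editor's note] A minimal sketch of the quota semantics described above, assuming the values are read by hand from the Octavia quota command. `describe_quota` is a hypothetical helper, not part of any OpenStack client.

```shell
#!/bin/sh
# describe_quota VALUE
# Hypothetical helper: Octavia reports -1 for an unlimited quota;
# any other value is treated here as a hard cap.
describe_quota() {
    if [ "$1" = "-1" ]; then
        echo "unlimited"
    else
        echo "limit: $1"
    fi
}

# The values themselves come from the Octavia quota command, e.g.:
#   openstack loadbalancer quota show <project-id>
# (not "openstack quota show", whose load_balancers field is unused).
describe_quota -1
describe_quota 10
```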
renich	It might've been one of the amphorae? I mean, I dunno because they have disappeared and my loadbalancer is in ERROR state now.	21:48
johnsom	No, those were different UUIDs (https://paste.openstack.org/show/812007/) and that message was at the API level, before those are even started to be created.	21:48
johnsom	Yeah, so if the LB is in error, it cleaned up the resources that were created. There should be log messages in the worker log of why it went to ERROR	21:50
renich	I have created lb3 and isolated the logs. I'll share them with you when it finishes failing.	21:51
johnsom	Excellent, happy to take a look at them	21:51
johnsom	Oh, ok, found that message. It was probably when you tried to delete the LB when it was still being retried.	21:52
renich	johnsom: you're very kind.	21:53
renich	OK	21:53
renich	johnsom: it seems it failed. https://paste.openstack.org/show/812009/	22:03
renich	and this is how it looked when I created it: https://paste.openstack.org/show/812010/	22:04
johnsom	renich And there are no messages in the worker log? That is going to be the important log for this issue.	22:06
renich	It doesn't seem so. I used tail -f /var/log/containers/octavia/.log	22:08
renich	I'll cat it just to make sure.	22:09
johnsom	Do you have more than one controller? Any one of them could be getting the create message	22:09
renich	https://paste.openstack.org/show/812011/	22:10
renich	johnsom: yes. I have 3 of them.	22:10
renich	Oh!	22:10
renich	OK, let me check all of them then.	22:10
johnsom	Yeah, it distributes the work for HA/load balancing reasons. The logs we need are probably on one of the other nodes	22:11
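[Editor's note] Since any of the three controllers may have handled the create request, the same search has to be repeated on each worker log. The following is a rough sketch, assuming the usual oslo.log layout where the level appears as a space-delimited word. `find_lb_log_issues` and the file names in the usage comment are hypothetical placeholders, not RHOSP paths.

```shell
#!/bin/sh
# find_lb_log_issues LOG_FILE LB_ID
# Prints ERROR/WARNING lines from LOG_FILE that mention LB_ID.
# Exits non-zero when nothing matches (grep's own exit status).
# Assumption: the log level is a space-delimited word on each line;
# adjust the pattern if your log format differs.
find_lb_log_issues() {
    grep -E ' (ERROR|WARNING) ' "$1" | grep -F -- "$2"
}

# Hypothetical usage, run against a copy of the worker log taken from
# each controller (file names are placeholders):
# for f in controller1-worker.log controller2-worker.log controller3-worker.log; do
#     echo "== $f"
#     find_lb_log_issues "$f" "<lb-id>"
# done
```

Matching on the load balancer ID keeps unrelated warnings, such as the one seen earlier on the first controller, out of the output.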
renich	Had to paste it in another pastebin. It was too long for the OpenStack one. https://paste.centos.org/view/2bed5ca1	22:12
renich	That's controller2.	22:12
renich	You've been seeing controller1 so far.	22:12
renich	Let me fetch controller3	22:12
johnsom	Amphora compute instance failed to become reachable. This either means the compute driver failed to fully boot the instance inside the timeout interval or the instance is not reachable via the lb-mgmt-net.: octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out	22:13
renich	johnsom: controller3: https://paste.centos.org/view/8a55244f	22:13
renich	OK, the amphora has booted. I can provide the console log. That said, it has never responded to ping.	22:14
johnsom	So, this could be caused by a few things. Most typically, the VM isn't getting an IP address.	22:14
johnsom	Yeah, if you have a console log it would be great	22:14
renich	sure thing.	22:14
renich	johnsom: had to create a new lb: https://paste.centos.org/view/517ab8dc	22:16
renich	Some info on the server johnsom https://paste.centos.org/view/b6170d47	22:16
renich	The network (the lb network) is type geneve. I don't understand why we cannot reach it.	22:17
renich	johnsom: network description: https://paste.centos.org/view/aa7e5d05	22:18
renich	I mention this because we have several others type vLAN.	22:18
renich	I see a ton of certificate verification failures. Maybe we got our certs wrong? We're using our own CA.	22:20
johnsom	Yeah, ok. Those all look ok, but custom CA is likely the problem.	22:22
johnsom	Here is a detailed guide on how it is supposed to work: https://docs.openstack.org/octavia/latest/admin/guides/certificates.html	22:22
johnsom	However, for RHOSP it is different.	22:23
renich	Ah! So, let me double-check that in our deployment. We're using a signed "sub-CA" so we might've gotten it wrong.	22:23
renich	johnsom: yeah, well, I'll take a look at the OpenSSL commands and figure it out.	22:23
johnsom	Also, there was a known bug in TripleO/RHOSP where certificates were not getting copied to all of the controllers correctly. Let me see if I can find that.	22:23
renich	johnsom: you think the CA is the issue then?	22:23
renich	johnsom: oh!	22:23
renich	huh? Certs should be on the controllers? OK. I need to look for those.	22:24
johnsom	Well, we use mutual TLS authentication when talking with the amphora. This means the controller validates the amphora, and the amphora validates the controllers. If any part of that CA setup is wrong, they won't be able to talk to each other.	22:24
johnsom	TripleO/Director will install them for you based on the environment file settings.	22:25
renich	OK, so, most probably, we got it wrong.	22:27
renich	You helped me so much, johnsom. Thank you!	22:27
johnsom	Sure, NP. I will paste the bug link here when I find that copy issue.	22:27
renich	johnsom: thanks. I really appreciate it.	22:32
johnsom	renich https://bugzilla.redhat.com/show_bug.cgi?id=1985999	22:32
johnsom	If that is the case, you can get a hotfix from support.	22:33
renich	Thanks!	22:47
renich	so, I've found the cert files (CA, private and client certs) within the containers. Is there a practical way to check them? I know they're valid. At least not expired.	23:31
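[Editor's note] The log ends before this question is answered. One possible way to sanity-check such files is with standard OpenSSL commands, sketched below. The helper names and the file names in the usage comment are hypothetical; nothing here is taken from the RHOSP deployment or the Octavia documentation.

```shell
#!/bin/sh
# verify_chain CA_FILE CERT_FILE
# Succeeds if CERT_FILE chains to the CA certificate(s) in CA_FILE.
verify_chain() {
    openssl verify -CAfile "$1" "$2" >/dev/null 2>&1
}

# not_expired CERT_FILE
# Succeeds if CERT_FILE is still valid 24 hours (86400 s) from now.
not_expired() {
    openssl x509 -in "$1" -noout -checkend 86400 >/dev/null 2>&1
}

# key_matches_cert KEY_FILE CERT_FILE
# Succeeds if the private key and the certificate carry the same public key.
key_matches_cert() {
    key_pub=$(openssl pkey -in "$1" -pubout 2>/dev/null) || return 1
    cert_pub=$(openssl x509 -in "$2" -noout -pubkey 2>/dev/null) || return 1
    [ "$key_pub" = "$cert_pub" ]
}

# Hypothetical usage (file names are placeholders):
# verify_chain ca.pem client.pem        && echo "chain ok"
# not_expired client.pem                && echo "not expiring within 24h"
# key_matches_cert client.key client.pem && echo "key matches cert"
```

With a signed sub-CA, the intermediate certificate also has to be available to the check, for example by passing it to `openssl verify` with `-untrusted` alongside the root in `-CAfile`. Running the same checks on every controller would also show whether the files differ between nodes.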
