Monday, 2022-01-10

19:23 <renich> Hello! o/
19:23 <renich> I was at #openstack-octavia and they told me to ask you. :D
19:23 <renich> I'm having issues with Octavia on RHOSP 16.2. When I create a load balancer, it hangs in "PENDING_CREATE" while the amphorae stay in "BOOTING" status. I've checked the console of the latter and it shows they did boot.
19:24 <renich> Also, I couldn't delete it; I had to manually modify the DB to set the load balancer's state to "ERROR" before it would let me.
19:25 <renich> I had created the load balancer with the following command: openstack loadbalancer create --name lb1 --vip-subnet-id public-subnet
19:25 <renich> I can do it again if someone lends a hand.
19:29 <johnsom> renich Hi. Please do not modify the database states. Most likely one of the controllers was still trying to retry/resolve the problem with nova. By changing the state you can end up with lots of conflicts and abandoned resources. The timeouts before a controller gives up can be quite long on some deployments; on RHOSP 16.2 it is at least 25 minutes.
19:29 <renich> johnsom: OK. I left it there all weekend.
19:30 <johnsom> renich Once the controllers time out trying to resolve the issue with nova/neutron, the object will move back to ERROR and can then be deleted via the CLI/API
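    A minimal sketch of the safer workflow johnsom describes (poll the API instead of editing the database), assuming a load balancer named lb1:
        # poll until provisioning_status leaves PENDING_CREATE (it ends in ACTIVE or ERROR)
        watch openstack loadbalancer show lb1 -c provisioning_status
        # once it reaches ERROR, delete it through the API; --cascade also removes child objects
        openstack loadbalancer delete lb1 --cascade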
19:30 <renich> johnsom: understood.
19:30 <johnsom> renich Did someone kill -9 the controller or container?
19:31 <renich> johnsom: I don't think so. I'm the only one doing these things for now.
19:31 <renich> I might've deleted the amphora's instance
19:31 <johnsom> To see what was wrong with nova, you can look in the worker log file. It should contain ERROR-level messages explaining why it was still in BOOTING
19:32 <johnsom> Nah, deleting the amp would have immediately caused the LB create to go into ERROR status
19:33 <johnsom> When the amp is in BOOTING we are polling nova and/or the amp to see when the VM fully boots.
19:33 <renich> johnsom: OK, let me see if I can find the log. Kind of difficult in RHOSP since they're all containers. Will look for it.
19:33 <johnsom> RHOSP centralizes the logs, so look under the container logs for the worker service
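    A short sketch of where to look; the path below is the usual TripleO/RHOSP container-log location and is an assumption, adjust it if your deployment differs:
        # run on each controller node
        sudo grep -E "ERROR|WARNING" /var/log/containers/octavia/worker.log
        # or follow it live while re-creating the load balancer
        sudo tail -f /var/log/containers/octavia/worker.log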
19:35 <renich> johnsom: found it at controller-01; exactly where you said.
19:36 <johnsom> You are looking for ERROR or WARNING level messages related to the LB you were creating
19:47 <renich> I see a warning but it's not related.
21:38 <renich> So, I've generated a new load balancer: lb2. It might be a quota issue, right? https://paste.openstack.org/show/812006/
21:38 <renich> I created it like this: https://paste.openstack.org/show/812007/
21:40 <johnsom> No, that log snippet looks ok given you changed the status of a load balancer in the DB.
21:41 <johnsom> That is basically saying the quota was inconsistent with what it should be, and it corrected it.
21:41 <renich> johnsom: I deleted that one and created a new one.
21:41 <renich> Oh, OK.
21:42 <renich> Still, when I use: openstack quota show <admin-id> I see load_balancers: 0
21:42 <renich> https://paste.openstack.org/show/812008/
21:43 <renich> Sorry; not zero but "None"
21:43 <johnsom> Yeah, that quota was for the old neutron-lbaas. It was mistakenly added to that command and is removed in later versions of OpenStack. Octavia quota is managed through "openstack loadbalancer quota <...>".
21:44 <johnsom> openstack quota show "load_balancers" will never be anything other than 0 because there is no code behind it. Just someone's CLI mistake
21:44 <johnsom> https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#quota
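    A minimal sketch of the Octavia-side quota commands johnsom points at, assuming the project is called "admin" (-1 means unlimited):
        # show the effective Octavia quota for a project
        openstack loadbalancer quota show admin
        # e.g. explicitly allow unlimited load balancers for that project
        openstack loadbalancer quota set --loadbalancer -1 admin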
21:45 <renich> OK. Thank you for clarifying.
21:45 <renich> All are -1
21:45 <johnsom> This log message is odd: "Invalid state PENDING_CREATE of loadbalancer resource 45ec9ff5-ec2e-4c86-87ba-c41b1781cd54" Do you know what resource that UUID points to?
21:46 <renich> let me check.
21:46 <johnsom> -1 is unlimited
21:46 <renich> Oh, OK.
21:48 <renich> It might've been one of the amphorae? I mean, I dunno, because they have disappeared and my load balancer is in ERROR state now.
21:48 <johnsom> No, those were different UUIDs (https://paste.openstack.org/show/812007/) and that message was at the API level, before those resources even start getting created.
21:50 <johnsom> Yeah, so if the LB is in ERROR, it cleaned up the resources that were created. There should be log messages in the worker log explaining why it went to ERROR
21:51 <renich> I have created lb3 and isolated the logs. I'll share them with you when it finishes failing.
21:51 <johnsom> Excellent, happy to take a look at them
21:52 <johnsom> Oh, ok, found that message. It was probably from when you tried to delete the LB while it was still being retried.
21:53 <renich> johnsom: you're very kind.
21:53 <renich> OK
22:03 <renich> johnsom: it seems it failed. https://paste.openstack.org/show/812009/
22:04 <renich> and this is how it looked when I created it: https://paste.openstack.org/show/812010/
22:06 <johnsom> renich And there are no messages in the worker log? That is going to be the important log for this issue.
22:08 <renich> It doesn't seem so. I used tail -f /var/log/containers/octavia*/*.log
22:09 <renich> I'll cat it just to make sure.
22:09 <johnsom> Do you have more than one controller? Any one of them could be getting the create message
22:10 <renich> https://paste.openstack.org/show/812011/
22:10 <renich> johnsom: yes. I have 3 of them.
22:10 <renich> Oh!
22:10 <renich> OK, let me check all of them then.
22:11 <johnsom> Yeah, it distributes the work for HA/load balancing reasons. The logs we need are probably on one of the other nodes
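    A quick sketch of checking all three controllers at once; the hostnames, the heat-admin SSH user, and the log path are illustrative TripleO defaults, adjust to your environment:
        for host in controller-1 controller-2 controller-3; do
            echo "== $host =="
            ssh heat-admin@$host 'sudo grep ERROR /var/log/containers/octavia/worker.log'
        done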
22:12 <renich> Had to paste it in another pastebin. It was too long for the OpenStack one. https://paste.centos.org/view/2bed5ca1
22:12 <renich> That's controller2.
22:12 <renich> You've been seeing controller1 so far.
22:12 <renich> Let me fetch controller3
22:13 <johnsom> Amphora compute instance failed to become reachable. This either means the compute driver failed to fully boot the instance inside the timeout interval or the instance is not reachable via the lb-mgmt-net.: octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out
22:13 <renich> johnsom: controller3: https://paste.centos.org/view/8a55244f
22:14 <renich> OK, the amphora has booted. I can provide the console log. That said, it has never responded to ping.
22:14 <johnsom> So, this could be caused by a few things. Most typically, the VM isn't getting an IP address.
22:14 <johnsom> Yeah, if you have a console log it would be great
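    A minimal sketch of pulling that console log through the APIs, assuming you know the load balancer's UUID; the amphora record links back to the nova instance:
        # list the amphorae for the load balancer and note their compute_id
        openstack loadbalancer amphora list --loadbalancer <lb-uuid>
        # dump the boot console from nova and check whether it picked up an lb-mgmt-net address
        openstack console log show <compute-id>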
22:14 <renich> sure thing.
22:16 <renich> johnsom: had to create a new lb: https://paste.centos.org/view/517ab8dc
22:16 <renich> Some info on the server, johnsom: https://paste.centos.org/view/b6170d47
22:17 <renich> The network (the lb network) is type geneve. I don't understand why we cannot reach it.
22:18 <renich> johnsom: network description: https://paste.centos.org/view/aa7e5d05
22:18 <renich> I mention this because we have several others of type VLAN.
22:20 <renich> I see a ton of certificate verification failures. Maybe we got our certs wrong? We're using our own CA.
22:22 <johnsom> Yeah, ok. Those all look ok, but the custom CA is likely the problem.
22:22 <johnsom> Here is a detailed guide on how it is supposed to work: https://docs.openstack.org/octavia/latest/admin/guides/certificates.html
22:23 <johnsom> However, for RHOSP it is different.
22:23 <renich> Ah! So, let me double-check that in our deployment. We're using a signed "sub-CA", so we might've gotten it wrong.
22:23 <renich> johnsom: yeah, well, I'll take a look at the OpenSSL commands and figure it out.
22:23 <johnsom> Also, there was a known bug in TripleO/RHOSP where certificates were not getting copied to all of the controllers correctly. Let me see if I can find that.
22:23 <renich> johnsom: you think the CA is the issue then?
22:23 <renich> johnsom: oh!
22:24 <renich> huh? Certs should be on the controllers? OK. I need to look for those.
22:24 <johnsom> Well, we use mutual TLS authentication when talking with the amphora. This means the controller validates the amphora, and the amphora validates the controllers. If any part of that CA setup is wrong, they won't be able to talk to each other.
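    A hedged sketch of checking that trust in both directions with OpenSSL; the file names follow the upstream guide linked above (client_ca, server_ca, client.cert-and-key.pem) and may differ on RHOSP:
        # the client cert the controllers present must chain to the CA the amphorae are told to trust
        openssl verify -CAfile client_ca.cert.pem client.cert-and-key.pem
        # and the CA the controllers trust should be the one issuing the amphora server certs
        openssl x509 -in server_ca.cert.pem -noout -subject -issuer -enddate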
22:25 <johnsom> TripleO/Director will install them for you based on the environment file settings.
22:27 <renich> OK, so, most probably, we got it wrong.
22:27 <renich> You helped me so much, johnsom. Thank you!
22:27 <johnsom> Sure, NP. I will paste the bug link here when I find that copy issue.
22:32 <renich> johnsom: thanks. I really appreciate it.
22:32 <johnsom> renich https://bugzilla.redhat.com/show_bug.cgi?id=1985999
22:33 <johnsom> If that is the case, you can get a hotfix from support.
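    To check whether that copy bug applies, one hedged approach is to compare certificate digests across the controllers; the hostnames, the heat-admin user, and the config-data path are illustrative, use whatever octavia.conf actually points at in your deployment:
        # the hashes must be identical on all three controllers
        for host in controller-1 controller-2 controller-3; do
            echo "== $host =="
            ssh heat-admin@$host 'sudo sha256sum /var/lib/config-data/puppet-generated/octavia/etc/octavia/certs/*'
        done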
22:47 <renich> Thanks!
23:31 <renich> so, I've found the cert files (CA, private and client certs) within the containers. Is there a practical way to check them? I know they're valid. At least not expired.
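    One practical, hedged way to check them with OpenSSL (file names are illustrative, substitute the ones found in the containers): confirm the dates and issuer, and confirm the private key actually belongs to the certificate.
        openssl x509 -in client.cert-and-key.pem -noout -dates -subject -issuer
        # the two public-key digests below must match, otherwise the key and cert are not a pair
        openssl x509 -in client.cert-and-key.pem -noout -pubkey | openssl sha256
        openssl pkey -in client.cert-and-key.pem -pubout | openssl sha256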
