renich | Hello! 0/ | 19:23 |
---|---|---|
renich | I was in #openstack-octavia and they told me to ask you. :D | 19:23 |
renich | I'm having issues with octavia on RHOSP 16.2. When I create a load balancer, it hangs in "PENDING_CREATE" while the amphorae stay on "BOOTING" status. I've checked the console of the latter and it shows they did boot. | 19:23 |
renich | Also, I couldn't delete it. I had to manually modify the DB in order to change the state of the loadbalancer to "ERROR" in order to do so. | 19:24 |
renich | I had created the loadbalancer with the following command: openstack loadbalancer create --name lb1 --vip-subnet-id public-subnet | 19:25 |
renich | I can do it again if someone lends a hand. | 19:25 |
johnsom | renich Hi. Please do not modify the database states. Most likely one of the controllers was still trying to retry/resolve the problem with nova. By changing the state you can end up with lots of conflicts and abandoned resources. The timeouts before a controller gives up can be quite long on some deployments. RHOSP 16.2 is at least 25 minutes. | 19:29 |
renich | johnsom: OK. I left it there all weekend. | 19:29 |
johnsom | renich Once the controllers time out trying to resolve the issue with nova/neutron, the object will move back to ERROR and can then be deleted via the CLI/API | 19:30 |
renich | johnsom: understood. | 19:30 |
johnsom | renich Did someone kill -9 the controller or container? | 19:30 |
renich | johnsom: I don't think so. I'm the only one doing these things for now. | 19:31 |
renich | I might've deleted the amphorae's instance | 19:31 |
johnsom | To see what was wrong with nova, you can look in the worker log file. The ERROR-level messages there should show why it was still in BOOTING | 19:31 |
johnsom | Nah, deleting the amp would have immediately caused the LB create to go into ERROR status | 19:32 |
johnsom | When the amp is in BOOTING we are polling nova and/or the amp to see when the VM fully boots. | 19:33 |
renich | johnsom: OK, let me see if I can find the log. Kind of difficult in RHOSP since they're all containers. Will look for it. | 19:33 |
johnsom | RHOSP centralizes the logs, so look under the container logs, worker | 19:33 |
renich | johnsom: found it at controller-01; exactly where you said. | 19:35 |
johnsom | You are looking for ERROR or WARNING level messages related to the LB you were creating | 19:36 |
renich | I see a warning but it's not related. | 19:47 |
renich | So, I've generated a new load-balancer: lb2. It might be a quota issue, right? https://paste.openstack.org/show/812006/ | 21:38 |
renich | I created it like this: https://paste.openstack.org/show/812007/ | 21:38 |
johnsom | No, that log snippet looks ok given you changed the status of a load balancer in the DB. | 21:40 |
johnsom | That is basically saying the quota was inconsistent with what it should be and it corrected it. | 21:41 |
renich | johnsom: I deleted that one and created a new one. | 21:41 |
renich | Oh, OK. | 21:41 |
renich | Still, when I use: openstack quota show <admin-id> I see load_balancers: 0 | 21:42 |
renich | https://paste.openstack.org/show/812008/ | 21:42 |
renich | Sorry; not zero but "None" | 21:43 |
johnsom | Yeah, that quota was for old neutron-lbaas. It was mistakenly added to that command and is removed in later versions of OpenStack. Octavia quota is managed through "openstack loadbalancer quota <...>". | 21:43 |
johnsom | openstack quota show "load_balancers" will never be anything different than 0 because there is no code behind it. Just someone's CLI mistake | 21:44 |
johnsom | https://docs.openstack.org/python-octaviaclient/latest/cli/index.html#quota | 21:44 |
renich | OK. Thank you for clarifying. | 21:45 |
renich | All are -1 | 21:45 |
johnsom | This log message is odd: "Invalid state PENDING_CREATE of loadbalancer resource 45ec9ff5-ec2e-4c86-87ba-c41b1781cd54" Do you know what resource that UUID points to? | 21:45 |
renich | let me check. | 21:46 |
johnsom | -1 is unlimited | 21:46 |
renich | Oh, OK. | 21:46 |
renich | It might've been one of the amphorae? I mean, I dunno because they have disappeared and my loadbalancer is in ERROR state now. | 21:48 |
johnsom | No, those were different UUIDs (https://paste.openstack.org/show/812007/) and that message was at the API level, before those even start being created. | 21:48 |
johnsom | Yeah, so if the LB is in error, it cleaned up the resources that were created. There should be log messages in the worker log of why it went to ERROR | 21:50 |
renich | I have created lb3 and isolated the logs. I'll share them with you when it finishes failing. | 21:51 |
johnsom | Excellent, happy to take a look at them | 21:51 |
johnsom | Oh, ok, found that message. It was probably when you tried to delete the LB when it was still being retried. | 21:52 |
renich | johnsom: you're very kind. | 21:53 |
renich | OK | 21:53 |
renich | johnsom: it seems it failed. https://paste.openstack.org/show/812009/ | 22:03 |
renich | and this is how it looked when I created it: https://paste.openstack.org/show/812010/ | 22:04 |
johnsom | renich And there are no messages in the worker log? That is going to be the important log for this issue. | 22:06 |
renich | It doesn't seem so. I used tail -f /var/log/containers/octavia*/*.log | 22:08 |
renich | I'll cat it just to make sure. | 22:09 |
johnsom | Do you have more than one controller? Any one of them could be getting the create message | 22:09 |
renich | https://paste.openstack.org/show/812011/ | 22:10 |
renich | johnsom: yes. I have 3 of them. | 22:10 |
renich | Oh! | 22:10 |
renich | OK, let me check all of them then. | 22:10 |
johnsom | Yeah, it distributes the work for HA/load balancing reasons. The logs we need are probably on one of the other nodes | 22:11 |
renich | Had to paste it in another pastebin. It was too long for the OpenStack one. https://paste.centos.org/view/2bed5ca1 | 22:12 |
renich | That's controller2. | 22:12 |
renich | You've been seeing controller1 so far. | 22:12 |
renich | Let me fetch controller3 | 22:12 |
johnsom | Amphora compute instance failed to become reachable. This either means the compute driver failed to fully boot the instance inside the timeout interval or the instance is not reachable via the lb-mgmt-net.: octavia.amphorae.driver_exceptions.exceptions.TimeOutException: contacting the amphora timed out | 22:13 |
renich | johnsom: controller3: https://paste.centos.org/view/8a55244f | 22:13 |
renich | OK, the amphora has booted. I can provide the console log. That said, it has never responded to ping. | 22:14 |
johnsom | So, this could be caused by a few things. Most typically, the VM isn't getting an IP address. | 22:14 |
johnsom | Yeah, if you have a console log it would be great | 22:14 |
renich | sure thing. | 22:14 |
renich | johnsom: had to create a new lb: https://paste.centos.org/view/517ab8dc | 22:16 |
renich | Some info on the server johnsom https://paste.centos.org/view/b6170d47 | 22:16 |
renich | The network (the lb network) is type geneve. I don't understand why we cannot reach it. | 22:17 |
renich | johnsom: network description: https://paste.centos.org/view/aa7e5d05 | 22:18 |
renich | I mention this because we have several others of type VLAN. | 22:18 |
renich | I see a ton of certificate verification failures. Maybe we got our certs wrong? We're using our own CA. | 22:20 |
johnsom | Yeah, ok. Those all look ok, but custom CA is likely the problem. | 22:22 |
johnsom | Here is a detailed guide on how it is supposed to work: https://docs.openstack.org/octavia/latest/admin/guides/certificates.html | 22:22 |
johnsom | However, for RHOSP it is different. | 22:23 |
renich | Ah! So, let me double-check that in our deployment. We're using a signed "sub-CA" so we might've gotten it wrong. | 22:23 |
renich | johnsom: yeah, well, I'll take a look at the OpenSSL commands and figure it out. | 22:23 |
johnsom | Also, there was a known bug in TripleO/RHOSP where certificates were not getting copied to all of the controllers correctly. Let me see if I can find that. | 22:23 |
renich | johnsom: you think the CA is the issue then? | 22:23 |
renich | johnsom: oh! | 22:23 |
renich | huh? Certs should be on the controllers? OK. I need to look for those. | 22:24 |
johnsom | Well, we use mutual TLS authentication when talking with the amphora. This means the controller validates the amphora, and the amphora validates the controllers. If any part of that CA setup is wrong, they won't be able to talk to each other. | 22:24 |
johnsom | TripleO/Director will install them for you based on the environment file settings. | 22:25 |
renich | OK, so, most probably, we got it wrong. | 22:27 |
renich | You helped me so much, johnsom. Thank you! | 22:27 |
johnsom | Sure, NP. I will paste the bug link here when I find that copy issue. | 22:27 |
renich | johnsom: thanks. I really appreciate it. | 22:32 |
johnsom | renich https://bugzilla.redhat.com/show_bug.cgi?id=1985999 | 22:32 |
johnsom | If that is the case, you can get a hotfix from support. | 22:33 |
renich | Thanks! | 22:47 |
renich | so, I've found the cert files (CA, private and client certs) within the containers. Is there a practical way to check them? I know they're valid. At least not expired. | 23:31 |
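
Editor's note on the log search discussed above: because any of the three controllers may pick up the create message, the search for ERROR or WARNING messages related to the LB has to cover the worker log on every controller. The following is only a minimal sketch of that search, not an Octavia tool. It assumes the log level appears as a plain word on each line and that the relevant lines mention the load balancer ID; the function names `matching_lines` and `scan_files` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import Iterable, Iterator, Sequence, Tuple

# Levels johnsom suggests looking for in the worker log.
LEVELS = ("ERROR", "WARNING")


def matching_lines(
    lines: Iterable[str],
    resource_id: str,
    levels: Sequence[str] = LEVELS,
) -> Iterator[str]:
    """Yield lines that mention resource_id and carry one of the levels.

    Assumption: the level shows up as a standalone word in the line.
    Traceback continuation lines that omit the ID are not matched.
    """
    level_re = re.compile(r"\b(?:" + "|".join(map(re.escape, levels)) + r")\b")
    for line in lines:
        if resource_id in line and level_re.search(line):
            yield line.rstrip("\n")


def scan_files(
    paths: Iterable[str],
    resource_id: str,
    levels: Sequence[str] = LEVELS,
) -> Iterator[Tuple[str, str]]:
    """Run matching_lines over several log files, yielding (path, line)."""
    for path in paths:
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in matching_lines(handle, resource_id, levels):
                yield path, line
```

On each controller this could be pointed at the files matched by the glob renich used, `/var/log/containers/octavia*/*.log`, passing the ID of the load balancer being created. Since a traceback usually spans several lines, a hit is best treated as a pointer to the place in the log to read in full.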