*** yamamoto has joined #openstack-lbaas | 00:15 | |
*** armax has joined #openstack-lbaas | 00:35 | |
*** abaindur has joined #openstack-lbaas | 00:58 | |
*** yamamoto has quit IRC | 01:19 | |
johnsom | FYI, I have found that nova anti-affinity appears to be broken: https://bugs.launchpad.net/nova/+bug/1863190 | 01:33 |
openstack | Launchpad bug 1863190 in OpenStack Compute (nova) "Server group anti-affinity no longer works" [Undecided,New] | 01:33 |
johnsom | I was testing that failover puts the amp back in the server group correctly (which works BTW). | 01:34 |
johnsom | On that fine note, catch you all tomorrow. | 01:35 |
johnsom | #canary | 01:35 |
*** vishalmanchanda has joined #openstack-lbaas | 01:54 | |
*** armax has quit IRC | 02:54 | |
*** abaindur has quit IRC | 03:02 | |
*** abaindur has joined #openstack-lbaas | 03:03 | |
*** armax has joined #openstack-lbaas | 03:07 | |
*** abaindur has quit IRC | 03:08 | |
*** yamamoto has joined #openstack-lbaas | 03:25 | |
*** armax has quit IRC | 03:55 | |
*** andy_ has quit IRC | 05:07 | |
*** andy_ has joined #openstack-lbaas | 05:08 | |
*** nicolasbock has quit IRC | 05:09 | |
*** psachin has joined #openstack-lbaas | 05:37 | |
*** ramishra has joined #openstack-lbaas | 05:46 | |
*** abaindur has joined #openstack-lbaas | 05:59 | |
*** goldyfruit has quit IRC | 06:00 | |
*** goldyfruit has joined #openstack-lbaas | 06:00 | |
*** ramishra has quit IRC | 06:30 | |
*** abaindur has quit IRC | 06:32 | |
*** abaindur has joined #openstack-lbaas | 06:32 | |
*** ivve has joined #openstack-lbaas | 07:56 | |
*** maciejjozefczyk has joined #openstack-lbaas | 07:58 | |
*** abaindur_ has joined #openstack-lbaas | 08:05 | |
*** abaindur has quit IRC | 08:08 | |
*** abaindur_ has quit IRC | 08:11 | |
*** abaindur has joined #openstack-lbaas | 08:12 | |
*** gcheresh_ has joined #openstack-lbaas | 08:14 | |
cgoncalves | hah! | 08:17 |
cgoncalves | #canary neutron-dhcp-agent seems to be broken. fails to spawn dnsmasq | 08:18 |
cgoncalves | the bot should start collecting these messages like it does for #success | 08:18 |
cgoncalves | #success Octavia is the canary | 08:18 |
openstackstatus | cgoncalves: Added success to Success page (https://wiki.openstack.org/wiki/Successes) | 08:18 |
*** ccamposr has quit IRC | 08:19 | |
*** ccamposr has joined #openstack-lbaas | 08:20 | |
*** ramishra has joined #openstack-lbaas | 08:21 | |
*** tesseract has joined #openstack-lbaas | 08:21 | |
*** ccamposr__ has joined #openstack-lbaas | 08:39 | |
*** ccamposr has quit IRC | 08:41 | |
*** tkajinam has quit IRC | 08:42 | |
ivve | question: wouldn't it be great if octavia timed out of the "pending update" immutable state after some (configurable) time and moved it back into the last state, with an error on the last request, rather than being stuck in the immutable state forever (forcing an admin to hack it out of existence with db commands)? | 08:48 |
cgoncalves | ivve, lucky us there is a timeout setting ;) | 08:49 |
ivve | great! | 08:49 |
ivve | whats the default and what is the actual timeout? | 08:49 |
*** gcheresh_ has quit IRC | 08:49 | |
ivve | because currently everything just hangs indefinitely | 08:50 |
ivve | (with default value) | 08:50 |
cgoncalves | default is 25 minutes IIRC | 08:50 |
ivve | oh, then either a) it is not, OR b) it doesn't work, OR c) it is refreshed if someone retries a delete command on the entire loadbalancer | 08:51 |
cgoncalves | ivve, it hangs indefinitely in PENDING_* if you restart the controller service while it was still processing the request | 08:51 |
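For context on the 25-minute figure: it most likely corresponds to the worker's amphora connection retries (300 retries x 5 s = 1500 s). A minimal, hedged sketch of the relevant octavia.conf knobs, assuming that is indeed the timeout being discussed:

```bash
# Hedged sketch: shorten how long a flow retries before it reverts and the LB
# leaves PENDING_*. Assumes the ~25 min default comes from
# [haproxy_amphora] connection_max_retries (300) * connection_retry_interval (5s).
cat >> /etc/octavia/octavia.conf <<'EOF'
[haproxy_amphora]
connection_max_retries = 120     # fewer retries -> faster revert to ERROR
connection_retry_interval = 5    # seconds between retries
EOF
# Restart the Octavia worker/health-manager services to pick up the change.
```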
ivve | i see | 08:51 |
ivve | or it lost network connection during that time, i suppose? | 08:51 |
cgoncalves | I don't think that would have an impact on the revert | 08:52 |
cgoncalves | do you have service logs from the start of the request? | 08:53 |
ivve | i have a debug command from the last request only | 08:54 |
ivve | it was in available state, everything was ok. tried to delete it | 08:54 |
ivve | went into delete, then update pending | 08:54 |
ivve | then stuck | 08:54 |
ivve | for a day | 08:54 |
cgoncalves | can you confirm the octavia services were not restarted during that period? | 08:55 |
ivve | yea | 08:55 |
cgoncalves | no octavia worker logs we could check? | 08:55 |
ivve | users create k8s clusters and then create out-of-stack loadbalancer items for them (ingress objects etc) | 08:56 |
ivve | when they're done testing, they lazily remove the stack | 08:56 |
ivve | this same thing happens with cinder objects | 08:56 |
ivve | i guess | 08:57 |
ivve | the logs could be long, but i have an appointment now. i will check when i come back! | 08:57 |
cgoncalves | is Heat involved in creating/deleting Octavia resources? | 08:58 |
cgoncalves | ok. talk to you later | 08:58 |
*** xakaitetoia has joined #openstack-lbaas | 08:59 | |
*** rcernin has quit IRC | 09:12 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia-tempest-plugin master: Add tests for allowed CIDRs in listeners https://review.opendev.org/702629 | 09:19 |
*** baffle has joined #openstack-lbaas | 09:32 | |
*** ccamposr__ has quit IRC | 09:33 | |
*** ccamposr__ has joined #openstack-lbaas | 09:34 | |
*** yamamoto has quit IRC | 09:57 | |
*** rcernin has joined #openstack-lbaas | 09:59 | |
*** yamamoto has joined #openstack-lbaas | 10:04 | |
*** rcernin has quit IRC | 10:17 | |
*** abaindur has quit IRC | 10:18 | |
*** abaindur has joined #openstack-lbaas | 10:18 | |
*** abaindur has quit IRC | 10:23 | |
*** psachin has quit IRC | 10:25 | |
*** yamamoto has quit IRC | 10:28 | |
*** vishalmanchanda has quit IRC | 10:28 | |
*** psachin has joined #openstack-lbaas | 10:33 | |
*** yamamoto has joined #openstack-lbaas | 10:39 | |
*** gcheresh_ has joined #openstack-lbaas | 10:52 | |
*** abaindur has joined #openstack-lbaas | 11:05 | |
*** abaindur has quit IRC | 11:10 | |
*** psachin has quit IRC | 11:10 | |
*** psachin has joined #openstack-lbaas | 11:12 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia-tempest-plugin master: Add tests for allowed CIDRs in listeners https://review.opendev.org/702629 | 11:15 |
*** gcheresh_ has quit IRC | 11:16 | |
*** yamamoto has quit IRC | 11:21 | |
ivve | cgoncalves: back again. so heat created the initial lb, but then the k8s ingress controller creates a bunch of other stuff and probably more loadbalancers | 11:41 |
ivve | then, when the stack is removed, those loadbalancers got stuck (some of those resources are not included in the stack) | 11:41 |
ivve | so in order to get rid of them i set them as active/online in the db and quickly remove them | 11:43 |
cgoncalves | ivve, I can't think of any other way to troubleshoot this than checking the logs :/ | 11:44 |
ivve | yea i get it | 11:45 |
ivve | it's just that this issue spans multiple scenarios | 11:45 |
cgoncalves | we've also had customers reporting that their LB resources got stuck while deleting a k8s cluster on top of openstack via Heat | 11:45 |
ivve | restarting controllers, even when doing it one by one | 11:45 |
cgoncalves | so your case seems similar | 11:45 |
ivve | if a network failure occurs in parts of the datacenter/infrastructure | 11:46 |
ivve | then this happens again | 11:46 |
ivve | and then manual db labour to fix tens upon tens of loadbalancers | 11:46 |
ivve | what im saying is: could the general state recovery be improved, even if the service gets a restart? | 11:47 |
ivve | this is the easiest way to fail it | 11:48 |
ivve | stop the octavia mgmt net | 11:48 |
ivve | and then hell breaks loose | 11:49 |
ivve | and the api calls in existence don't help to reset the loadbalancers when they are in the immutable state, there is no way other than DB hacking and restarting things manually | 11:50 |
cgoncalves | there's work ongoing now in master that will mitigate resources getting stuck in PENDING_* | 11:51 |
cgoncalves | https://review.opendev.org/#/c/647406/ | 11:51 |
ivve | oh okay cool | 11:51 |
*** yamamoto has joined #openstack-lbaas | 11:54 | |
ivve | cgoncalves: one last question: if i have a loadbalancer that ends up in an error or immutable state, what is the best approach to recreate it (even if it takes some db hacking / sending api commands as admin)? | 11:56 |
ivve | like this one, for example. it's still working, but no matter my approach, octavia tries to solve the issue but is unable to, and it now looks like this: | 11:57 |
cgoncalves | ivve, if in PENDING_*, set it to ERROR rather than ACTIVE. then, issue a loadbalancer failover | 11:57 |
ivve | https://hastebin.com/esoyitefaj.rb | 11:57 |
cgoncalves | $ openstack loadbalancer failover $LB_ID | 11:57 |
ivve | failover the error one im assuming? | 11:58 |
*** tkajinam has joined #openstack-lbaas | 11:58 | |
ivve | sometimes they come back, but assume role: standalone | 11:58 |
cgoncalves | correct, failover the LB in ERROR provisioning_status | 11:58 |
ivve | not amphora failover? | 11:59 |
*** yamamoto has quit IRC | 11:59 | |
cgoncalves | better failover LB | 11:59 |
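A hedged sketch of the recovery path described above (flip the stuck record to ERROR in the database, then fail over the whole load balancer). The table and column names are assumptions based on the standard Octavia schema:

```bash
# Sketch only: assumes direct access to the Octavia MySQL database and the
# usual schema (load_balancer.provisioning_status).
LB_ID=bb5fe733-82c3-4156-b26d-7735b9a8c7dc   # example LB ID from this log

# 1. Move the load balancer stuck in PENDING_* to ERROR.
mysql octavia -e "UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '${LB_ID}';"

# 2. Fail over the load balancer so Octavia rebuilds its amphorae.
openstack loadbalancer failover ${LB_ID}
```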
ivve | :( | 12:01 |
ivve | it got worse | 12:01 |
ivve | https://hastebin.com/pulupayuha.rb | 12:02 |
ivve | keeps trying to create that new one | 12:02 |
ivve | then fails, then gives up | 12:03 |
cgoncalves | show me the logs .... :) | 12:03 |
ivve | ill grab logs | 12:03 |
cgoncalves | heh | 12:03 |
ivve | LB bb5fe733-82c3-4156-b26d-7735b9a8c7dc failover exception: port not found (port id: a9b9ce28-bf7f-4d81-b08a-cf7ab554149e).: PortNotFound: port not found (port id: a9b9ce28-bf7f-4d81-b08a-cf7ab554149e). | 12:04 |
ivve | im guessing this is the problem | 12:04 |
cgoncalves | oh, either the vrrp or the vip port got deleted :/ | 12:05 |
ivve | aye | 12:05 |
cgoncalves | johnsom helped a couple of users with this problem by recreating the ports manually | 12:06 |
cgoncalves | I'm not 100% sure I'd know the whole process to do so | 12:06 |
cgoncalves | he may be able to help you better than I can once he's online | 12:07 |
cgoncalves | also, johnsom has been working on fixing the failover flow which if I recall correctly will also address this scenario, i.e. recreate the port if missing | 12:07 |
ivve | yeah, i guess im just looking to be taught how to fish as this happens from time to time in a prod env | 12:07 |
cgoncalves | https://review.opendev.org/#/c/705317/ | 12:08 |
cgoncalves | if you're in a hurry and that LB can be re-created from scratch, you can do openstack loadbalancer delete | 12:09 |
ivve | yea so this is where the next issue comes up | 12:12 |
ivve | it's not possible | 12:12 |
ivve | :) | 12:12 |
ivve | the only way i can delete it is to fake it to be all healthy and active in the db before octavia notices, and then delete it before its health is checked | 12:12 |
*** nicolasbock has joined #openstack-lbaas | 12:13 | |
cgoncalves | setting it to ERROR and deleting immediately after should work | 12:13 |
ivve | yeah or that | 12:13 |
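And a similarly hedged sketch of the quick-delete workaround above, for when the load balancer can simply be recreated from scratch (same schema assumption as before):

```bash
# Flip to ERROR, then delete right away; --cascade also removes the listeners,
# pools and members owned by the load balancer.
mysql octavia -e "UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '${LB_ID}';"
openstack loadbalancer delete --cascade ${LB_ID}
```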
ivve | it would rock if we could get "openstack loadbalancer set --state <states> <lb>" like cinder | 12:14 |
*** psachin has quit IRC | 12:15 | |
cgoncalves | we lost count of how many people have asked for that :) | 12:15 |
ivve | hahah | 12:15 |
ivve | :D | 12:15 |
ivve | "another one" :P | 12:15 |
ivve | the issue comes when i have tons of users which in turn have tons of loadbalancers | 12:16 |
ivve | and we have some kind of outage | 12:16 |
ivve | and as admin you do that dreadful openstack loadbalancer list and see all those errors :( | 12:17 |
ivve | and then go through them one by one and fix them | 12:18 |
cgoncalves | the refactor failover patch will help you in failing over broken LBs, including when ports got deleted, like in your case now | 12:18 |
ivve | also maybe an openstack loadbalancer re-create <lb> | 12:18 |
cgoncalves | the jobboard patch will help when the controller handling the CUD request gets killed halfway through the flow | 12:18 |
ivve | just delete the whole thing and recreate it with identical uuids/ips. that would probably lead to even more issues in the end, i guess, if resources half-exist or can't be removed | 12:19 |
cgoncalves | how would recreate be different than failover? | 12:19 |
ivve | it would not mend an lb, it would delete the resource and recreate it, i guess | 12:20 |
ivve | well in my case (and i guess in a lot of cases) i want an active/passive topology | 12:21 |
*** maciejjozefczyk has quit IRC | 12:21 | |
ivve | in a standalone topology i guess the failover is exactly that, and im guessing that the failover procedure for a standalone works more often than not compared to the active/passive | 12:21 |
cgoncalves | ivve, that is what failover does. it recreates the amphora (delete + create) | 12:21 |
ivve | yea but not if a part of them is functioning, right? | 12:22 |
ivve | so i have an active and a backup; the backup is in error but the active is fine. it only fixes the backup if i failover, correct or not? | 12:22 |
cgoncalves | failover in active-standby topology will recreate the amps one at a time. this is to avoid data plane outages | 12:22 |
ivve | oh, so it does recreate both? (should?) | 12:22 |
ivve | i haven't seen that | 12:23 |
cgoncalves | ivve, in that case you could do "openstack loadbalancer amphora failover $AMP_ID" | 12:23 |
ivve | yea thats what i have been using most of the time | 12:23 |
cgoncalves | if you failover the loadbalancer, it will failover all amps associated to it | 12:23 |
ivve | got it | 12:23 |
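A hedged sketch of the targeted amphora failover discussed above, for when only one amphora of an active-standby pair is broken (assumes python-octaviaclient supports the --loadbalancer filter on amphora list):

```bash
# List the amphorae belonging to the load balancer and note the one in ERROR.
openstack loadbalancer amphora list --loadbalancer ${LB_ID}

# Fail over just the broken amphora; the surviving one keeps serving traffic.
openstack loadbalancer amphora failover ${AMP_ID}
```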
ivve | like here is the result of an attempted fix. eventually it started working, but now it looks like an abomination and nobody dares touch it for fear of halting production https://hastebin.com/ucesumaduy.rb | 12:30 |
ivve | this is probably from setting it to the error state & using amphora failover after a network outage between controller nodes | 12:32 |
*** maciejjozefczyk has joined #openstack-lbaas | 12:37 | |
ivve | cgoncalves: while im at it asking questions, will there ever be support for using multiple images at the same time? today we use haproxy with some OS (tagging the image with amphora) | 12:38 |
ivve | i mean flavor is one thing, but what about images :) | 12:40 |
cgoncalves | ivve, support to set the amphora image in the flavor is on the to-do list. just needs someone to go and do it. should be trivial, I guess | 12:41 |
ivve | cool | 12:42 |
ivve | it's a request from my users; myself, im just thinking of testing new images before putting them in "prod" | 12:42 |
*** yamamoto has joined #openstack-lbaas | 12:43 | |
ivve | but they want multiple types of OS in the background (don't ask me why) | 12:43 |
cgoncalves | yeah, I understand | 12:44 |
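For reference, a hedged sketch of how amphora image selection works today, until per-LB image selection via flavors is implemented: the controller boots new amphorae from the newest Glance image carrying the tag configured in [controller_worker] amp_image_tag (assumed here to be "amphora"):

```bash
# Upload and tag a new amphora image; Octavia uses the newest image with the
# configured tag for any amphora it boots from now on.
openstack image create --disk-format qcow2 --container-format bare \
    --file amphora-x64-haproxy.qcow2 --tag amphora amphora-haproxy-new
# Existing amphorae are untouched, so a test LB can be failed over first to
# vet the new image before it reaches production load balancers.
```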
ivve | okay well, i will handle this and wait for coming updates | 12:45 |
ivve | thanks a bunch for answering my questions, much appreciated | 12:45 |
*** yamamoto has quit IRC | 12:45 | |
ivve | i will also try to do the full loadbalancer failover next time, setting it to error first if it isn't already. not sure if i have done exactly that before | 12:46 |
cgoncalves | sorry about the trouble. the team is working hard to fix these issues | 12:47 |
cgoncalves | having users report issues and help us troubleshoot is great | 12:47 |
*** yamamoto has joined #openstack-lbaas | 13:06 | |
*** tkajinam has quit IRC | 13:17 | |
ataraday | johnsom, Hi! Sorry for bothering you, but could you take a look at points 7-9 in https://etherpad.openstack.org/p/octavia-worker-v2-issue-tracker and leave some comments on what you think | 13:23 |
*** ivve has quit IRC | 13:28 | |
*** psachin has joined #openstack-lbaas | 13:37 | |
*** TrevorV has joined #openstack-lbaas | 14:18 | |
*** psachin has quit IRC | 14:48 | |
johnsom | Just a comment on the above thread. Restart of the controller will not hang a flow, but kill -9 will. You must use graceful shutdown until our jobboard work is done. | 14:56 |
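A hedged illustration of the graceful-shutdown point above; the systemd unit name is an assumption (distro packages commonly call it octavia-worker, devstack uses devstack@o-cw):

```bash
# Graceful stop: SIGTERM lets the worker finish or revert in-flight flows, so
# load balancers are not stranded in PENDING_*.
systemctl stop octavia-worker

# Avoid this until the jobboard work lands: SIGKILL aborts flows mid-way and
# leaves PENDING_* records behind.
# kill -9 <octavia-worker-pid>
```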
johnsom | ataraday: will do | 14:56 |
johnsom | Sounds like the root cause of ivve's problem was nova, or not being able to reach the database from the controllers. | 15:05 |
*** abaindur has joined #openstack-lbaas | 15:16 | |
*** abaindur has quit IRC | 15:21 | |
xakaitetoia | i've seen a lot of communication issues and generally a good thing is to always check rabbit. | 15:35 |
cgoncalves | johnsom, why do you say that? one problem was the vrrp port was gone in neutron | 15:35 |
johnsom | cgoncalves exactly that, based on looking at the logs and what he said. | 15:36 |
johnsom | A VRRP port gone either means nova didn't release it, or we couldn't access the database. Also, stuck in PENDING_* could mean that the controllers gave up trying to write "ERROR" or "ACTIVE" to the database. It only retries for so long | 15:37 |
cgoncalves | one can also delete a port even if it is attached, I tried that the other day | 15:38 |
johnsom | Yes, but not when nova is hung | 15:39 |
johnsom | I.e. it is having DB issues, or the compute instance is not reachable by nova | 15:39 |
cgoncalves | ah, right. k, didn't try that | 15:39 |
johnsom | This is the open nova bug (that also affects cinder as he mentioned) | 15:40 |
*** maciejjozefczyk has quit IRC | 15:42 | |
johnsom | cgoncalves Just to clarify, heat stack deletes cannot cause a stuck PENDING_* state. That is not an RCA reason. | 15:42 |
*** armax has joined #openstack-lbaas | 15:53 | |
*** TrevorV has quit IRC | 16:09 | |
*** ramishra has quit IRC | 16:36 | |
*** yamamoto has quit IRC | 17:05 | |
*** yamamoto has joined #openstack-lbaas | 17:06 | |
*** yamamoto has quit IRC | 17:06 | |
*** yamamoto has joined #openstack-lbaas | 17:06 | |
*** yamamoto has quit IRC | 17:11 | |
*** xakaitetoia has quit IRC | 17:11 | |
*** yamamoto has joined #openstack-lbaas | 17:26 | |
*** tesseract has quit IRC | 17:29 | |
*** gmann is now known as gmann_afk | 17:42 | |
openstackgerrit | Merged openstack/octavia stable/queens: Use stable upper-constraints.txt in Amphora builds https://review.opendev.org/706052 | 17:47 |
johnsom | Wahooo! Thanks cgoncalves for your persistence. lol | 17:47 |
cgoncalves | happy to help. 23 rechecks | 17:54 |
johnsom | 🤦 | 17:54 |
*** gcheresh_ has joined #openstack-lbaas | 18:14 | |
*** gcheresh_ has quit IRC | 18:32 | |
*** gcheresh_ has joined #openstack-lbaas | 18:42 | |
openstackgerrit | Brian Haley proposed openstack/octavia master: Allow multiple VIPs per LB https://review.opendev.org/660239 | 19:16 |
*** gcheresh_ has quit IRC | 19:22 | |
openstackgerrit | Michael Johnson proposed openstack/octavia master: WIP - Refactor the failover flows https://review.opendev.org/705317 | 19:26 |
cgoncalves | cores: gate fix https://review.opendev.org/#/c/706051/ | 19:43 |
johnsom | +2 | 19:45 |
cgoncalves | thanks | 19:52 |
*** gmann_afk is now known as gmann | 20:12 | |
haleyb | johnsom: since you're approving gate fixes, https://review.opendev.org/#/c/707687/ :) | 20:22 |
*** abaindur has joined #openstack-lbaas | 21:08 | |
*** abaindur has quit IRC | 21:08 | |
*** abaindur has joined #openstack-lbaas | 21:09 | |
johnsom | haleyb +W | 21:38 |
*** nicolasbock has quit IRC | 21:54 | |
*** yamamoto has quit IRC | 22:01 | |
*** ccamposr has joined #openstack-lbaas | 22:02 | |
*** ccamposr__ has quit IRC | 22:03 | |
*** yamamoto has joined #openstack-lbaas | 22:37 | |
*** yamamoto has quit IRC | 22:45 | |
openstackgerrit | Merged openstack/octavia stable/stein: Fix pep8 failures on stable/stein branch https://review.opendev.org/707687 | 23:20 |
*** armax has quit IRC | 23:33 | |
*** armax has joined #openstack-lbaas | 23:33 | |
*** abaindur has quit IRC | 23:52 | |
*** spatel has joined #openstack-lbaas | 23:58 |