*** celebdor has quit IRC | 01:23 | |
*** hongbin has joined #openstack-lbaas | 01:55 | |
*** bzhao__ has joined #openstack-lbaas | 02:20 | |
*** ramishra has joined #openstack-lbaas | 02:22 | |
*** abaindur has quit IRC | 02:24 | |
*** bzhao__ has left #openstack-lbaas | 02:25 | |
*** bzhao__ has joined #openstack-lbaas | 02:25 | |
*** bzhao__ has left #openstack-lbaas | 02:26 | |
*** bzhao__ has joined #openstack-lbaas | 02:26 | |
*** AlexStaf has quit IRC | 02:26 | |
*** bzhao__ has quit IRC | 02:27 | |
*** bzhao__ has joined #openstack-lbaas | 02:28 | |
*** bzhao__ has quit IRC | 02:33 | |
*** bzhao__ has joined #openstack-lbaas | 02:34 | |
openstackgerrit | ZhaoBo proposed openstack/python-octaviaclient master: Add 2 new options to Pool for support backend certificates validation https://review.openstack.org/620211 | 03:50 |
*** hongbin has quit IRC | 04:06 | |
*** jarodwl has quit IRC | 04:17 | |
*** ivve has joined #openstack-lbaas | 04:31 | |
*** velizarx has joined #openstack-lbaas | 06:28 | |
*** AlexStaf has joined #openstack-lbaas | 06:33 | |
*** rcernin has quit IRC | 06:58 | |
*** ccamposr has joined #openstack-lbaas | 07:08 | |
*** velizarx has quit IRC | 07:11 | |
*** pcaruana has joined #openstack-lbaas | 07:23 | |
*** velizarx has joined #openstack-lbaas | 07:27 | |
*** abaindur has joined #openstack-lbaas | 07:57 | |
*** celebdor has joined #openstack-lbaas | 08:04 | |
*** yboaron has joined #openstack-lbaas | 08:06 | |
*** celebdor has quit IRC | 08:06 | |
*** celebdor has joined #openstack-lbaas | 08:06 | |
*** abaindur has quit IRC | 08:17 | |
*** abaindur has joined #openstack-lbaas | 08:19 | |
*** abaindur has quit IRC | 08:20 | |
*** irclogbot_1 has quit IRC | 08:44 | |
*** irclogbot_1 has joined #openstack-lbaas | 08:47 | |
*** yboaron_ has joined #openstack-lbaas | 08:47 | |
*** yboaron has quit IRC | 08:51 | |
*** irclogbot_1 has quit IRC | 08:53 | |
*** irclogbot_1 has joined #openstack-lbaas | 08:56 | |
*** yboaron_ has quit IRC | 09:16 | |
*** jackivanov has joined #openstack-lbaas | 09:16 | |
*** yboaron_ has joined #openstack-lbaas | 09:17 | |
*** ramishra has quit IRC | 10:26 | |
*** rpittau has joined #openstack-lbaas | 10:33 | |
*** ramishra has joined #openstack-lbaas | 10:49 | |
*** dosaboy has quit IRC | 11:24 | |
*** dosaboy has joined #openstack-lbaas | 11:31 | |
*** salmankhan has joined #openstack-lbaas | 11:35 | |
*** salmankhan1 has joined #openstack-lbaas | 11:38 | |
*** salmankhan has quit IRC | 11:40 | |
*** salmankhan1 is now known as salmankhan | 11:40 | |
*** salmankhan has quit IRC | 11:42 | |
*** yamamoto has quit IRC | 12:16 | |
*** yamamoto has joined #openstack-lbaas | 12:16 | |
*** rpittau_ has joined #openstack-lbaas | 12:16 | |
*** rpittau has quit IRC | 12:19 | |
*** rpittau has joined #openstack-lbaas | 12:24 | |
*** rpittau_ has quit IRC | 12:28 | |
*** salmankhan has joined #openstack-lbaas | 12:34 | |
*** velizarx has quit IRC | 12:51 | |
*** velizarx has joined #openstack-lbaas | 12:57 | |
*** takamatsu has quit IRC | 12:58 | |
*** velizarx has quit IRC | 12:58 | |
*** velizarx has joined #openstack-lbaas | 13:00 | |
*** rpittau_ has joined #openstack-lbaas | 13:08 | |
*** rpittau has quit IRC | 13:11 | |
*** salmankhan has quit IRC | 13:12 | |
*** salmankhan has joined #openstack-lbaas | 13:18 | |
*** rpittau_ is now known as rpittau | 13:20 | |
*** takamatsu has joined #openstack-lbaas | 13:21 | |
*** salmankhan has quit IRC | 13:24 | |
*** takamatsu has quit IRC | 13:37 | |
*** velizarx has quit IRC | 13:41 | |
*** takamatsu has joined #openstack-lbaas | 13:43 | |
*** velizarx has joined #openstack-lbaas | 13:43 | |
openstackgerrit | Merged openstack/octavia-dashboard master: Imported Translations from Zanata https://review.openstack.org/619451 | 14:09 |
*** velizarx has quit IRC | 14:25 | |
*** salmankhan has joined #openstack-lbaas | 14:58 | |
*** salmankhan has quit IRC | 14:58 | |
*** ianychoi_ is now known as ianychoi | 15:06 | |
*** salmankhan has joined #openstack-lbaas | 15:17 | |
*** rpittau has quit IRC | 15:21 | |
*** salmankhan has quit IRC | 15:29 | |
*** fnaval has joined #openstack-lbaas | 16:00 | |
*** salmankhan has joined #openstack-lbaas | 16:11 | |
*** salmankhan has quit IRC | 16:12 | |
*** witek has joined #openstack-lbaas | 16:22 | |
*** yboaron_ has quit IRC | 16:35 | |
*** ccamposr has quit IRC | 16:42 | |
*** salmankhan has joined #openstack-lbaas | 17:03 | |
*** salmankhan has quit IRC | 18:00 | |
*** salmankhan has joined #openstack-lbaas | 18:54 | |
*** salmankhan has quit IRC | 18:54 | |
*** salmankhan1 has joined #openstack-lbaas | 18:54 | |
*** salmankhan1 is now known as salmankhan | 18:56 | |
*** salmankhan has quit IRC | 19:11 | |
*** salmankhan has joined #openstack-lbaas | 19:15 | |
*** salmankhan has quit IRC | 19:15 | |
*** salmankhan has joined #openstack-lbaas | 19:16 | |
*** salmankhan has quit IRC | 19:17 | |
*** salmankhan has joined #openstack-lbaas | 19:17 | |
*** celebdor has quit IRC | 19:26 | |
*** salmankhan has quit IRC | 19:28 | |
openstackgerrit | Michael Johnson proposed openstack/octavia stable/queens: Healthmanager shouldn't update NO_MONITOR members https://review.openstack.org/612280 | 19:34 |
*** salmankhan has joined #openstack-lbaas | 19:51 | |
*** xgerman has joined #openstack-lbaas | 20:42 | |
xgerman | mnaser: https://storyboard.openstack.org/#!/story/2004435 | 20:42 |
mnaser | xgerman: cool, thank you | 20:42 |
* mnaser is dealing with another octavia fire | 20:42 | |
*** salmankhan has quit IRC | 20:52 | |
*** ivve has quit IRC | 21:21 | |
*** yamamoto has quit IRC | 21:29 | |
openstackgerrit | Mohammed Naser proposed openstack/octavia master: failover: add negative test for failover without amps https://review.openstack.org/620411 | 21:35 |
johnsom | mnaser What version are you running that you saw that? | 21:41 |
*** salmankhan has joined #openstack-lbaas | 21:45 | |
*** pcaruana has quit IRC | 21:46 | |
*** salmankhan has quit IRC | 21:46 | |
*** salmankhan1 has joined #openstack-lbaas | 21:46 | |
*** salmankhan1 is now known as salmankhan | 21:49 | |
openstackgerrit | Mohammed Naser proposed openstack/octavia master: failover: add negative test for failover without amps https://review.openstack.org/620411 | 21:49 |
xgerman | We probably need to rethink failover… right now it’s focused on replacing a broken amp with a healthy one and if that fails we throw our hands up… but we lack a process which monitors LBs and brings them to a desired state, e.g. in ACTIVE-ACTIVE scale up, etc. | 21:55 |
mnaser | johnsom: xgerman queens=>rocky upgrade, but upgrade wasnt done yet | 22:04 |
mnaser | it was really really bad | 22:04 |
mnaser | the problem is if the amp gets deleted | 22:04 |
mnaser | failover is useless | 22:04 |
xgerman | yeah, often we lose the port in that case | 22:05 |
openstackgerrit | Mohammed Naser proposed openstack/octavia master: failover: add negative test for failover without amps https://review.openstack.org/620411 | 22:05 |
openstackgerrit | Mohammed Naser proposed openstack/octavia master: failover: create new amphora if none exist https://review.openstack.org/620414 | 22:05 |
mnaser | xgerman: but we lose the useless port technically | 22:05 |
mnaser | we still have the port which vrrp runs on | 22:05 |
mnaser | which is really the important one at the end of the day | 22:05 |
*** yboaron_ has joined #openstack-lbaas | 22:06 | |
mnaser | and honestly, we have enough information to rebuild if we need to | 22:06 |
xgerman | yeah, I started working on making that more “robust”... | 22:06 |
mnaser | is there a good way to test that second patch | 22:06 |
mnaser | surely its not that straight forward | 22:06 |
* mnaser will dedicate full resources for that | 22:06 | |
mnaser | i know this might be complex but i need this fixed before the end of this week, this was *really* *really* bad | 22:07 |
johnsom | What triggered your failover | 22:07 |
*** yamamoto has joined #openstack-lbaas | 22:07 | |
mnaser | galera being restarted during upgrade | 22:07 |
mnaser | took too long | 22:07 |
mnaser | health manager couldnt update | 22:07 |
johnsom | Ahh, yeah, ok, we fixed that | 22:07 |
mnaser | i saw an extra time.sleep() | 22:07 |
xgerman | mnaser: this addresses a “similar" case https://review.openstack.org/#/c/585864/ | 22:08 |
mnaser | but still, i mean, the issue is you end up in a state where amps dont exist but lbs do | 22:08 |
xgerman | yep, gotcha | 22:08 |
mnaser | yeah that might address it | 22:08 |
johnsom | This is the patch that fixed the DB outage thing: https://review.openstack.org/600876 | 22:08 |
mnaser | GetAmphoraeNetworkConfigs was failing obviously | 22:08 |
mnaser | yeah but that only really buys you 60 seconds | 22:09 |
johnsom | No, it waits forever for the DB to come back online, then waits another 60 seconds | 22:09 |
johnsom | I also agree, the current state of the failover API code is in question. It should be able to recover from that situation. | 22:10 |
mnaser | yeah, but 60 seconds might not be enough for whatever reasons | 22:10 |
*** yboaron_ has quit IRC | 22:10 | |
johnsom | No, that is the right amount if you are running with the default configs. | 22:10 |
mnaser | its just how bad the situation becomes after that state | 22:11 |
johnsom | The part that is new in that set of patches is we wait forever testing the DB *before* we start the 60 second timer | 22:11 |
mnaser | yeah. i agree in the usefulness of that, but if $reasons don't fix that, the failover can really get messed up | 22:11 |
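For readers following along, here is a minimal sketch of the behaviour johnsom describes for the DB-outage patch (https://review.openstack.org/600876): block until the database answers at all, then hold off for one heartbeat interval before judging amphora health. The helper names (wait_for_db, resume_health_checks) and the plain SQLAlchemy ping are illustrative assumptions, not Octavia's actual code.

```python
# Sketch only -- names and structure are assumptions, not Octavia's API.
import time

from sqlalchemy import create_engine, text


def wait_for_db(db_url, retry_interval=5):
    """Block indefinitely until the database answers a trivial query."""
    engine = create_engine(db_url)
    while True:
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            return
        except Exception:
            # DB still down (e.g. galera mid-restart); keep waiting.
            time.sleep(retry_interval)


def resume_health_checks(db_url, heartbeat_timeout=60):
    # First wait forever for the DB to come back...
    wait_for_db(db_url)
    # ...then give every amphora one full heartbeat interval to report
    # in before treating any of them as failed.
    time.sleep(heartbeat_timeout)
```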
mnaser | i've pretty much had to ask a customer to rebuild 60+ prod load balancers | 22:12 |
mnaser | that was the only way to bring it back to life | 22:12 |
mnaser | xgerman: also in your case it actually wouldn't do anything | 22:12 |
mnaser | because there is no amps allocated, so failover will do nothing. | 22:12 |
mnaser | https://review.openstack.org/#/c/620414/1/octavia/controller/worker/controller_worker.py | 22:13 |
mnaser | see loop here | 22:13 |
xgerman | ah, duh | 22:13 |
mnaser | im just not sure how i can test that behaviour | 22:14 |
mnaser | what happens if we call create_load_balancer on an already created load balancer | 22:14 |
xgerman | not sure... | 22:15 |
johnsom | Yeah, not sure that is the right answer as opposed to going down the create amphora flow. I also need to look at the context there. | 22:16 |
mnaser | well i was just wondering if there is any testing happening on create amphora flow that i can hook into | 22:16 |
xgerman | yeah, I am a bit perplexed too, since it seems we deleted the amp, marked it deleted, and then never moved on | 22:16 |
mnaser | and see what happens | 22:16 |
mnaser | i was at a state where like i had 17 amphoras | 22:17 |
mnaser | and 60 load balancers | 22:17 |
xgerman | k | 22:17 |
mnaser | also, another bug is that octavia health manager polls for amphoras | 22:17 |
mnaser | rather than polling for load balancers and matching for amphoras | 22:17 |
mnaser | so if those amphoras disappear | 22:17 |
mnaser | nothing ever rebuilds them | 22:17 |
xgerman | yeah, it’s very amphora centric | 22:17 |
johnsom | Here is the flow: https://docs.openstack.org/octavia/latest/_images/LoadBalancerFlows-get_create_load_balancer_flow.svg | 22:17 |
johnsom | mnaser that is not true | 22:18 |
xgerman | we likely throw the LB in error and leave it there | 22:18 |
*** yamamoto has quit IRC | 22:18 | |
mnaser | https://github.com/openstack/octavia/blob/master/octavia/controller/healthmanager/health_manager.py#L79-L155 | 22:18 |
mnaser | nothing scans LBs | 22:18 |
mnaser | it just gets amphora healths | 22:19 |
johnsom | We query the amphora health table, yes, but don't poll the amps. | 22:19 |
mnaser | but if the amphora health has only 17 records because they disappeared and never came back | 22:19 |
mnaser | those load balancers are now in ACTIVE state and nothing is setting them to ERROR | 22:19 |
mnaser | because when you're polling your 17 "healthy" amphoras, you ignore the remaining ones | 22:19 |
johnsom | Nope, that is not how the logic works | 22:20 |
johnsom | If the failover flow deleted the amphora health record, it is in the flow already. If that flow fails the LB gets marked error because nova or something failed in the middle of the failover | 22:21 |
xgerman | yeah, there must have been some shutdown in the middle, DB failure, or DB tampering to get there | 22:21 |
johnsom | Plus it would not be in the active state if a failover is occurring | 22:22 |
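To make the "amphora-centric" point concrete, here is a rough sketch of the selection logic being discussed, assuming the amphora_health table shape in Octavia's schema (amphora_id, last_update, busy). The table definition below is an illustration rather than the real health_manager code linked above; the 60-second timeout is the default mentioned earlier in the conversation.

```python
# Sketch of the stale-heartbeat query; column names are assumptions.
import datetime

import sqlalchemy as sa

metadata = sa.MetaData()
amphora_health = sa.Table(
    'amphora_health', metadata,
    sa.Column('amphora_id', sa.String(36), primary_key=True),
    sa.Column('last_update', sa.DateTime),
    sa.Column('busy', sa.Boolean),
)


def get_stale_amphorae(conn, heartbeat_timeout=60):
    """Return IDs of amphorae whose heartbeat is older than the timeout.

    An LB whose amphora and health rows have vanished entirely never
    shows up here, which is the gap mnaser is pointing at.
    """
    expired = (datetime.datetime.utcnow()
               - datetime.timedelta(seconds=heartbeat_timeout))
    query = sa.select(amphora_health.c.amphora_id).where(
        sa.and_(amphora_health.c.last_update < expired,
                amphora_health.c.busy.is_(False)))
    return [row.amphora_id for row in conn.execute(query)]
```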
mnaser | openstack ansible does rolling reboots | 22:22 |
mnaser | of galera | 22:22 |
mnaser | it might be somewhat related | 22:22 |
johnsom | That should be fine as long as one of the instances was still responding and the db connection string is setup for those multiple instances | 22:22 |
mnaser | regardless, it was in a state where LBs were ACTIVE (3 were ERROR only), 17 amphoras existed in the table (and in nova) and things are just stuck | 22:22 |
mnaser | and failing over didn't work | 22:23 |
xgerman | yeah, we should fix failover for that case | 22:23 |
mnaser | imho the best solution at least is to figure out a way to let octavia create a new amp | 22:23 |
mnaser | somehow | 22:23 |
xgerman | yeah, if we still have the vip port that shouldn’t be a problem | 22:23 |
johnsom | right, it should be doing that. | 22:23 |
mnaser | i mean i can probably easily reproduce locally, create an lb, delete the amphora record and now you're stuck | 22:23 |
mnaser | but i dont know how to best build a test case around that | 22:24 |
xgerman | delete the amp record? | 22:24 |
mnaser | like if you just go to the db | 22:24 |
mnaser | delete from amphora where load_balancer_id=thebadone; | 22:24 |
johnsom | he means the amphora health record | 22:24 |
mnaser | you'll end up in that state | 22:24 |
mnaser | yeah, delete the amphora health record *and* amphora record and you're stuck, unable to manually failover too | 22:25 |
xgerman | no, the amp table: | 22:25 |
xgerman | https://review.openstack.org/#/c/620414/1/octavia/controller/worker/controller_worker.py | 22:25 |
johnsom | You should get a relational error from the DB and that would be denied | 22:25 |
xgerman | apparently if we don't have records of the amps we skip the failover | 22:25 |
mnaser | ^ | 22:25 |
johnsom | Ok, yes, delete amp and amp health at the same time, that would get you into a state where we no longer "know" of the amphora | 22:26 |
johnsom | You skip the API failover, yes, I see that bug | 22:26 |
xgerman | yep, so we should have a check if the amps we get back are consistent with the LB setup and then recreate amps or so | 22:26 |
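To make the skipped-failover case concrete, here is a sketch of the control flow being described, not the real controller_worker.failover_loadbalancer code; get_amphorae_for_lb and create_replacement_amphora are hypothetical helpers standing in for the proposed fix.

```python
# Sketch of the bug: if the LB has no amphora rows at all, the
# per-amphora loop is a no-op and the failover silently does nothing.
def failover_loadbalancer(lb_id, db_repo, worker):
    amps = db_repo.get_amphorae_for_lb(lb_id)  # hypothetical helper

    if not amps:
        # Without this branch the call returns "successfully" having
        # repaired nothing -- the case mnaser's patch targets by
        # building a fresh amphora instead.
        worker.create_replacement_amphora(lb_id)  # hypothetical helper
        return

    for amp in amps:
        worker.failover_amphora(amp.id)
```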
johnsom | Yeah, but at that point you have blown away all record of how many amps that LB had (thinking Act/Act here) | 22:27 |
mnaser | fwiw this is a standalone scenario | 22:27 |
mnaser | arent we able to recreate the amp stuff anyways? | 22:28 |
mnaser | we assume that the old amps are missing | 22:29 |
mnaser | if we have no record of anything | 22:29 |
johnsom | Yes, absolutely, that is what the failover flows do | 22:29 |
johnsom | No records of things, that gets a bit more dicey | 22:29 |
mnaser | so is the issue here that things are getting deleted before they should be? | 22:30 |
mnaser | i.e. we should associate to a new amphora first, and if we don't, we stop. | 22:30 |
johnsom | Yeah, I think part of the key is what deleted both those records and didn't rollback on failures | 22:30 |
mnaser | assuming a functional db might be the issue here | 22:30 |
johnsom | I mean the amphora records are marked "DELETED" and not actually deleted until the purge interval, which is a week by default | 22:31 |
johnsom | Well, with the above patch, we no longer assume functional DBs.... | 22:31 |
mnaser | uh i def dont see that behaviour.. | 22:31 |
mnaser | when was this added | 22:32 |
johnsom | Which is on top of the 50 retries in oslo_db, etc.... | 22:32 |
xgerman | but if it’s DELETED we ignore them for the query I posted | 22:32 |
mnaser | i dont see a single deleted amphora in the amphora table | 22:32 |
mnaser | they're all ready or allocated | 22:32 |
johnsom | You must have changed your purge interval then | 22:32 |
mnaser | havent touched it unless openstack ansible overrides the defaults | 22:33 |
mnaser | https://github.com/openstack/openstack-ansible-os_octavia/blob/stable/queens/templates/octavia.conf.j2#L253-L254 | 22:33 |
mnaser | its commented out | 22:33 |
johnsom | https://github.com/openstack/openstack-ansible-os_octavia/blob/stable/queens/templates/octavia.conf.j2#L255-L256 | 22:34 |
johnsom | Commented out means it's running defaults | 22:34 |
mnaser | yeah we havent touched that.. | 22:34 |
mnaser | let me double check | 22:34 |
mnaser | yep, commented out here | 22:35 |
mnaser | let me see the logs | 22:36 |
mnaser | ok so only 3 managed to revert to ERROR | 22:39 |
mnaser | 2018-11-27 18:02:26.414 109359 WARNING octavia.controller.worker.controller_worker [-] Task 'octavia.controller.worker.tasks.database_tasks.MarkAmphoraPendingDeleteInDB' (3574ee5a-451e-4f8f-9513-c48ea7332fa6) transitioned into state 'REVERTED' from state 'REVERTING' | 22:39 |
mnaser | which explains that we had 3 that reverted | 22:39 |
johnsom | mnaser Would you mind tarring up the cw/hm/hk logs from the controllers and putting them somewhere I can get them? Private is probably best. | 22:42 |
mnaser | unfortunately it wasnt running in debug and there's 3 controllers so it might be a bit harder, but i can try and do that after somewhat stripping them | 22:42 |
johnsom | Ok, if it's too much hassle, don't worry about it. | 22:43 |
*** abaindur has joined #openstack-lbaas | 22:43 | |
mnaser | so i guess whats strange is how they all disappeared | 22:43 |
*** abaindur has quit IRC | 22:43 | |
johnsom | I just wanted to have some light reading to piece together the chain of events that led to the amp records being deleted | 22:43 |
mnaser | it should have rolled back to ERROR and the amp would still be recoverable | 22:43 |
mnaser | because it would just have a db field in ERROR state | 22:44 |
johnsom | Right | 22:44 |
*** abaindur has joined #openstack-lbaas | 22:44 | |
mnaser | i guess it would be good to know why nothing exists in the amphora table | 22:44 |
johnsom | I'm wondering if you didn't have the DB outage fix and didn't have the "dual down" fix | 22:44 |
johnsom | but still, not sure what could/would have deleted the amp record. They should at least be in "DELETED" | 22:45 |
johnsom | or ERROR | 22:45 |
mnaser | so weird | 22:48 |
mnaser | i dont see anything in logs under 'Deleted Amphora id' | 22:48 |
mnaser | based on https://github.com/openstack/octavia/blob/stable/queens/octavia/controller/housekeeping/house_keeping.py#L86 | 22:48 |
johnsom | LOG.info('Purged db record for Amphora ID: %s', amp.id) | 22:49 |
mnaser | oh hold on | 22:49 |
mnaser | there's a whole bunch of those | 22:49 |
mnaser | in another ctl | 22:49 |
johnsom | That is what I would look for as you weren't running debug | 22:49 |
mnaser | in queens its different message johnsom (link above) | 22:49 |
johnsom | Ah, I see that | 22:50 |
mnaser | oh man. | 22:50 |
mnaser | https://github.com/openstack/octavia/blob/stable/queens/octavia/controller/housekeeping/house_keeping.py#L69-L86 | 22:53 |
mnaser | https://github.com/openstack/octavia/blob/stable/queens/octavia/db/repositories.py#L1098-L1119 | 22:54 |
mnaser | it removed 73 records | 22:54 |
mnaser | at the time of that issue | 22:54 |
*** rcernin has joined #openstack-lbaas | 22:57 | |
johnsom | So the health records were gone, which probably means it was failed over twice in a row | 23:04 |
mnaser | johnsom: so it looks like it did mark it as ERROR | 23:05 |
mnaser | 2018-11-27 18:06:51.593 109359 WARNING octavia.controller.worker.tasks.database_tasks [req-95a0f1d3-5fee-4bbf-bcf8-40a1b4a5aac2 - 8fdb4e6f09aa481ab157863dbfe2227f - - -] Reverting mark amphora pending delete in DB for amp id bed14822-aa31-480e-96b8-88f6119d7022 and compute id 6500b551-2211-47c1-8f0a-f70d06d4f841 | 23:05 |
mnaser | 2018-11-27 18:06:51.619 109359 ERROR octavia.controller.worker.controller_worker [req-95a0f1d3-5fee-4bbf-bcf8-40a1b4a5aac2 - 8fdb4e6f09aa481ab157863dbfe2227f - - -] Failover exception: Error plugging amphora (compute_id: 3d4c578d-0458-4b5c-9253-e90ab86823d5) into port 8bb6e55f-3414-4032-9c97-031ce71e4f89.: PlugNetworkException: Error plugging amphora (compute_id: 3d4c578d-0458-4b5c-9253-e90ab86823d5) into port | 23:05 |
mnaser | 8bb6e55f-3414-4032-9c97-031ce71e4f89. | 23:05 |
mnaser | which means at that state, i believe amphora status was ERROR | 23:06 |
mnaser | which means even clean up would have not deleted it | 23:06 |
mnaser | i wonder what deletes the amphora_health record | 23:07 |
*** salmankhan has quit IRC | 23:11 | |
johnsom | The failover flow will on a successful failover. | 23:16 |
mnaser | johnsom: thats what im seeing too. im wondering if there's a race condition where the amphora_health is missing, which then makes health manager wipe it | 23:18 |
johnsom | Yeah, the queens code doesn't check if it's expired first, it goes straight into the amp health check | 23:19 |
johnsom | So, if the amp is DELETED and the amp health is gone, it will purge the record | 23:19 |
mnaser | it looks like DisableAmphoraHealthMonitoring task is the only one that removes it | 23:19 |
johnsom | Rocky looks first and shoots later: https://github.com/openstack/octavia/blob/stable/rocky/octavia/controller/housekeeping/house_keeping.py#L75 | 23:20 |
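A hedged sketch of the two purge orderings johnsom is comparing; get_all_deleted, the cutoff arithmetic, and the one-week default are assumptions used for illustration, not the actual house_keeping.py code.

```python
# Not the real housekeeping code -- just the ordering difference.
import datetime


def purge_deleted_amphorae(amp_repo, session,
                           expiry_age_seconds=604800):  # ~1 week default
    cutoff = (datetime.datetime.utcnow()
              - datetime.timedelta(seconds=expiry_age_seconds))
    for amp in amp_repo.get_all_deleted(session):  # hypothetical helper
        # Rocky-style: check the expiry age first ("look first, shoot
        # later"), so a row touched by an in-flight failover survives.
        if amp.updated_at and amp.updated_at < cutoff:
            amp_repo.delete(session, id=amp.id)
        # The Queens-era ordering went straight to "is the health record
        # gone?", which is how a DELETED amp whose health row had just
        # been dropped could be purged immediately.
```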
mnaser | johnsom: is taskflow code linear in terms of order or does it all happen in parallel according to providers= and requires= ? | 23:21 |
mnaser | https://github.com/openstack/octavia/blob/stable/queens/octavia/controller/worker/flows/amphora_flows.py#L463-L466 | 23:22 |
johnsom | Some parts of the flow are linear (serial) some parts are parallel | 23:22 |
mnaser | shouldnt it wait for more things to require than AMPHORA ? | 23:22 |
johnsom | Basically we are passing in the failed amp object to that task, so it likely pulls the amp ID out of that object. | 23:23 |
mnaser | johnsom: right but shouldnt it wait much more before removing it from health? | 23:24 |
mnaser | https://docs.openstack.org/octavia/latest/_images/AmphoraFlows-get_failover_flow.svg | 23:24 |
mnaser | well i guess according to this its ok | 23:24 |
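Since the serial-vs-parallel question keeps coming up, here is a tiny standalone taskflow example (not Octavia's failover flow; the task names are made up) showing what johnsom means: tasks inside a linear_flow run in order, while tasks inside an unordered_flow may run concurrently under a parallel engine, gated only by their requires/provides dependencies.

```python
# Minimal taskflow sketch; task names are invented for illustration.
from taskflow import engines, task
from taskflow.patterns import linear_flow, unordered_flow


class MarkAmpPendingDelete(task.Task):
    default_provides = 'amp_id'

    def execute(self, failed_amphora):
        print('marking %s pending delete' % failed_amphora)
        return failed_amphora


class DisableHealthMonitoring(task.Task):
    def execute(self, amp_id):
        print('dropping health record for %s' % amp_id)


class DeleteCompute(task.Task):
    def execute(self, amp_id):
        print('deleting nova instance for %s' % amp_id)


flow = linear_flow.Flow('failover-sketch')
flow.add(MarkAmpPendingDelete())
# Both cleanup tasks only need amp_id, so once it exists they can run
# in parallel (with a parallel engine) or in any order (serial engine).
cleanup = unordered_flow.Flow('cleanup')
cleanup.add(DisableHealthMonitoring(), DeleteCompute())
flow.add(cleanup)

engines.run(flow, store={'failed_amphora': 'amp-1234'})
```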
mnaser | i mean the only theory is that *something* removed amphora_health causing it to be deleted, i cant imagine any other theory, time is ok on that server and we monitor ntp | 23:25 |
johnsom | I think you will find the LB with deleted amps went through two failover flows.... | 23:26 |
mnaser | okay so | 23:26 |
mnaser | this might be a rough question but | 23:26 |
mnaser | is it possible because we run multiple octavia health managers? | 23:26 |
mnaser | or are you saying two consecutive failovers | 23:27 |
johnsom | No, consecutive | 23:27 |
johnsom | like one flow to fix amp 1, another to fix amp 2 | 23:27 |
johnsom | or attempt 1 failed, and it ran an attempt 2 | 23:27 |
*** fnaval has quit IRC | 23:28 | |
mnaser | thats possible | 23:28 |
mnaser | i dont have enough logs to be 100% sure | 23:28 |
johnsom | So, I think the issue is that the housekeeping fix didn't get backported to queens.... | 23:30 |
johnsom | This one: https://review.openstack.org/#/c/548989/ | 23:31 |
mnaser | johnsom: im struggling to understand the second part | 23:34 |
*** yamamoto has joined #openstack-lbaas | 23:34 | |
mnaser | on how/why it would end up in that state | 23:35 |
johnsom | We tried to fail over the amp because the DB fix isn't in your build of Octavia, this failover failed, probably due to the DB outage, we revert the failover and put stuff back, but don't rebuild the amp health record. We then attempt to fail it over again (it's still broke, give it another go), however the amp is gone and there is no health record. Along comes housekeeping, which without that patch, checks all | 23:41 |
johnsom | amphora, doesn't find the health record, and nukes the amp record | 23:41 |
johnsom | It's a pretty darn hit-it-perfect bug, but I can see it happen. | 23:42 |
mnaser | johnsom: but do we ever remove the amp health record? it doesn't look like it actually gets removed until the failover was successfully completed | 23:42 |
johnsom | Queens, amphora_flows, line 332 | 23:43 |
johnsom | We were doing it as part of the cleanup of the failed amp | 23:43 |
johnsom | https://review.openstack.org/#/c/548989/13/octavia/controller/worker/flows/amphora_flows.py | 23:43 |
mnaser | ah shit | 23:44 |
mnaser | i have queens checked out | 23:44 |
mnaser | and i dont have that same code | 23:44 |
mnaser | so that must have been backported but i dont have that change | 23:44 |
johnsom | No, I just noticed we forgot to backport that patch.... It's only in Rocky | 23:45 |
mnaser | johnsom: https://review.openstack.org/#/q/Ief97ddda8261b5bbc54c6824f90ae9c7a2d81701 | 23:45 |
mnaser | https://review.openstack.org/#/c/587195/ | 23:45 |
johnsom | Ah, gerrit search failed me | 23:46 |
johnsom | Ok, so it's there | 23:46 |
mnaser | let me check if we have it | 23:46 |
mnaser | and we dont | 23:47 |
mnaser | that explains it all now | 23:47 |
mnaser | sigh | 23:47 |
johnsom | We have two queens backports pending merge and we will cut another queens release soon. But that one was in 2.0.2 | 23:47 |
mnaser | man, we were one commit away from that | 23:47 |
mnaser | well the one before was a year ago | 23:48 |
xgerman | yeah, things move quickly on our end -) | 23:48 |
johnsom | Yeah, we did poorly on cutting backport releases for a while. I now have a core that is focused on that | 23:48 |
johnsom | Trying to improve our practices with cutting releases.... | 23:48 |