Tuesday, 2018-11-27

*** celebdor has quit IRC  01:23
*** hongbin has joined #openstack-lbaas  01:55
*** bzhao__ has joined #openstack-lbaas  02:20
*** ramishra has joined #openstack-lbaas  02:22
*** abaindur has quit IRC  02:24
*** bzhao__ has left #openstack-lbaas  02:25
*** bzhao__ has joined #openstack-lbaas  02:25
*** bzhao__ has left #openstack-lbaas  02:26
*** bzhao__ has joined #openstack-lbaas  02:26
*** AlexStaf has quit IRC  02:26
*** bzhao__ has quit IRC  02:27
*** bzhao__ has joined #openstack-lbaas  02:28
*** bzhao__ has quit IRC  02:33
*** bzhao__ has joined #openstack-lbaas  02:34
<openstackgerrit> ZhaoBo proposed openstack/python-octaviaclient master: Add 2 new options to Pool for support backend certificates validation  https://review.openstack.org/620211  03:50
*** hongbin has quit IRC  04:06
*** jarodwl has quit IRC  04:17
*** ivve has joined #openstack-lbaas  04:31
*** velizarx has joined #openstack-lbaas  06:28
*** AlexStaf has joined #openstack-lbaas  06:33
*** rcernin has quit IRC  06:58
*** ccamposr has joined #openstack-lbaas  07:08
*** velizarx has quit IRC  07:11
*** pcaruana has joined #openstack-lbaas  07:23
*** velizarx has joined #openstack-lbaas  07:27
*** abaindur has joined #openstack-lbaas  07:57
*** celebdor has joined #openstack-lbaas  08:04
*** yboaron has joined #openstack-lbaas  08:06
*** celebdor has quit IRC  08:06
*** celebdor has joined #openstack-lbaas  08:06
*** abaindur has quit IRC  08:17
*** abaindur has joined #openstack-lbaas  08:19
*** abaindur has quit IRC  08:20
*** irclogbot_1 has quit IRC  08:44
*** irclogbot_1 has joined #openstack-lbaas  08:47
*** yboaron_ has joined #openstack-lbaas  08:47
*** yboaron has quit IRC  08:51
*** irclogbot_1 has quit IRC  08:53
*** irclogbot_1 has joined #openstack-lbaas  08:56
*** yboaron_ has quit IRC  09:16
*** jackivanov has joined #openstack-lbaas  09:16
*** yboaron_ has joined #openstack-lbaas  09:17
*** ramishra has quit IRC  10:26
*** rpittau has joined #openstack-lbaas  10:33
*** ramishra has joined #openstack-lbaas  10:49
*** dosaboy has quit IRC  11:24
*** dosaboy has joined #openstack-lbaas  11:31
*** salmankhan has joined #openstack-lbaas  11:35
*** salmankhan1 has joined #openstack-lbaas  11:38
*** salmankhan has quit IRC  11:40
*** salmankhan1 is now known as salmankhan  11:40
*** salmankhan has quit IRC  11:42
*** yamamoto has quit IRC  12:16
*** yamamoto has joined #openstack-lbaas  12:16
*** rpittau_ has joined #openstack-lbaas  12:16
*** rpittau has quit IRC  12:19
*** rpittau has joined #openstack-lbaas  12:24
*** rpittau_ has quit IRC  12:28
*** salmankhan has joined #openstack-lbaas  12:34
*** velizarx has quit IRC  12:51
*** velizarx has joined #openstack-lbaas  12:57
*** takamatsu has quit IRC  12:58
*** velizarx has quit IRC  12:58
*** velizarx has joined #openstack-lbaas  13:00
*** rpittau_ has joined #openstack-lbaas  13:08
*** rpittau has quit IRC  13:11
*** salmankhan has quit IRC  13:12
*** salmankhan has joined #openstack-lbaas  13:18
*** rpittau_ is now known as rpittau  13:20
*** takamatsu has joined #openstack-lbaas  13:21
*** salmankhan has quit IRC  13:24
*** takamatsu has quit IRC  13:37
*** velizarx has quit IRC  13:41
*** takamatsu has joined #openstack-lbaas  13:43
*** velizarx has joined #openstack-lbaas  13:43
<openstackgerrit> Merged openstack/octavia-dashboard master: Imported Translations from Zanata  https://review.openstack.org/619451  14:09
*** velizarx has quit IRC  14:25
*** salmankhan has joined #openstack-lbaas  14:58
*** salmankhan has quit IRC  14:58
*** ianychoi_ is now known as ianychoi  15:06
*** salmankhan has joined #openstack-lbaas  15:17
*** rpittau has quit IRC  15:21
*** salmankhan has quit IRC  15:29
*** fnaval has joined #openstack-lbaas  16:00
*** salmankhan has joined #openstack-lbaas  16:11
*** salmankhan has quit IRC  16:12
*** witek has joined #openstack-lbaas  16:22
*** yboaron_ has quit IRC  16:35
*** ccamposr has quit IRC  16:42
*** salmankhan has joined #openstack-lbaas  17:03
*** salmankhan has quit IRC  18:00
*** salmankhan has joined #openstack-lbaas  18:54
*** salmankhan has quit IRC  18:54
*** salmankhan1 has joined #openstack-lbaas  18:54
*** salmankhan1 is now known as salmankhan  18:56
*** salmankhan has quit IRC  19:11
*** salmankhan has joined #openstack-lbaas  19:15
*** salmankhan has quit IRC  19:15
*** salmankhan has joined #openstack-lbaas  19:16
*** salmankhan has quit IRC  19:17
*** salmankhan has joined #openstack-lbaas  19:17
*** celebdor has quit IRC  19:26
*** salmankhan has quit IRC  19:28
<openstackgerrit> Michael Johnson proposed openstack/octavia stable/queens: Healthmanager shouldn't update NO_MONITOR members  https://review.openstack.org/612280  19:34
*** salmankhan has joined #openstack-lbaas  19:51
*** xgerman has joined #openstack-lbaas  20:42
<xgerman> mnaser: https://storyboard.openstack.org/#!/story/2004435  20:42
<mnaser> xgerman: cool, thank you  20:42
* mnaser is dealing with another octavia fire  20:42
*** salmankhan has quit IRC  20:52
*** ivve has quit IRC  21:21
*** yamamoto has quit IRC  21:29
<openstackgerrit> Mohammed Naser proposed openstack/octavia master: failover: add negative test for failover without amps  https://review.openstack.org/620411  21:35
<johnsom> mnaser What version are you running that you saw that?  21:41
*** salmankhan has joined #openstack-lbaas  21:45
*** pcaruana has quit IRC  21:46
*** salmankhan has quit IRC  21:46
*** salmankhan1 has joined #openstack-lbaas  21:46
*** salmankhan1 is now known as salmankhan  21:49
<openstackgerrit> Mohammed Naser proposed openstack/octavia master: failover: add negative test for failover without amps  https://review.openstack.org/620411  21:49
<xgerman> We probably need to rethink failover… right now it’s focused on replacing a broken amp with a healthy one and if that fails we throw our hands up… but we lack a process which monitors LBs and brings them to a desired state, e.g. in ACTIVE-ACTIVE scale up, etc.  21:55
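A rough sketch of the reconciliation process xgerman is describing — nothing like this exists in Octavia today, and every name below (get_amphorae, desired_amp_count, trigger_failover) is purely illustrative:

```python
# Hypothetical reconciliation pass: compare each load balancer's desired
# amphora count with what actually exists, and trigger a repair when they
# diverge, instead of only reacting to heartbeats from known amphorae.
def reconcile(load_balancers, get_amphorae, desired_amp_count, trigger_failover):
    for lb in load_balancers:
        healthy = [a for a in get_amphorae(lb) if a.status == 'ALLOCATED']
        if len(healthy) < desired_amp_count(lb):
            trigger_failover(lb)
```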
<mnaser> johnsom: xgerman: queens=>rocky upgrade, but the upgrade wasn't done yet  22:04
<mnaser> it was really, really bad  22:04
<mnaser> the problem is if the amp gets deleted  22:04
<mnaser> failover is useless  22:04
<xgerman> yeah, often we lose the port in that case  22:05
<openstackgerrit> Mohammed Naser proposed openstack/octavia master: failover: add negative test for failover without amps  https://review.openstack.org/620411  22:05
<openstackgerrit> Mohammed Naser proposed openstack/octavia master: failover: create new amphora if none exist  https://review.openstack.org/620414  22:05
<mnaser> xgerman: but we lose the useless port, technically  22:05
<mnaser> we still have the port which vrrp runs on  22:05
<mnaser> which is really the important one at the end of the day  22:05
*** yboaron_ has joined #openstack-lbaas  22:06
<mnaser> and honestly, we have enough information to rebuild if we need to  22:06
<xgerman> yeah, I started working on making that more “robust”...  22:06
<mnaser> is there a good way to test that second patch?  22:06
<mnaser> surely it's not that straightforward  22:06
* mnaser will dedicate full resources for that  22:06
<mnaser> i know this might be complex but i need this fixed before the end of this week, this was *really* *really* bad  22:07
<johnsom> What triggered your failover?  22:07
*** yamamoto has joined #openstack-lbaas  22:07
<mnaser> galera being restarted during the upgrade  22:07
<mnaser> took too long  22:07
<mnaser> health manager couldn't update  22:07
<johnsom> Ahh, yeah, ok, we fixed that  22:07
<mnaser> i saw an extra time.sleep()  22:07
<xgerman> mnaser: this addresses a “similar” case: https://review.openstack.org/#/c/585864/  22:08
<mnaser> but still, i mean, the issue is you end up in a state where amps don't exist but lbs do  22:08
<xgerman> yep, gotcha  22:08
<mnaser> yeah that might address it  22:08
<johnsom> This is the patch that fixed the DB outage thing: https://review.openstack.org/600876  22:08
<mnaser> GetAmphoraeNetworkConfigs was failing, obviously  22:08
<mnaser> yeah but that only really buys you 60 seconds  22:09
<johnsom> No, it waits forever for the DB to come back online, then waits another 60 seconds  22:09
<johnsom> I also agree, the current state of the failover API code is in question. It should be able to recover from that situation.  22:10
<mnaser> yeah, but 60 seconds might not be enough for whatever reason  22:10
*** yboaron_ has quit IRC  22:10
<johnsom> No, that is the right amount if you are running with the default configs.  22:10
<mnaser> it's just how bad the situation becomes after that state  22:11
<johnsom> The part that is new in that set of patches is we wait forever testing the DB *before* we start the 60 second timer  22:11
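A rough sketch of the ordering johnsom describes (wait indefinitely for the database to answer, and only then arm the heartbeat grace timer); the real logic lives in the Octavia health manager, and db_is_reachable() here is a hypothetical helper:

```python
import time


def wait_for_db_then_arm_timer(db_is_reachable, grace_seconds=60, poll=3):
    # Wait "forever" for the database to come back; a DB outage alone
    # should not start the failover countdown.
    while not db_is_reachable():
        time.sleep(poll)
    # Only now start the (default 60 second) heartbeat grace period.
    return time.monotonic() + grace_seconds
```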
<mnaser> yeah. i agree on the usefulness of that, but if $reasons don't fix that, the failover can really get messed up  22:11
<mnaser> i've pretty much had to ask a customer to rebuild 60+ prod load balancers  22:12
<mnaser> that was the only way to bring it back to life  22:12
<mnaser> xgerman: also in your case it actually wouldn't do anything  22:12
<mnaser> because there are no amps allocated, so failover will do nothing.  22:12
<mnaser> https://review.openstack.org/#/c/620414/1/octavia/controller/worker/controller_worker.py  22:13
<mnaser> see the loop here  22:13
<xgerman> ah, duh  22:13
<mnaser> i'm just not sure how i can test that behaviour  22:14
<mnaser> what happens if we call create_load_balancer on an already created load balancer?  22:14
<xgerman> not sure...  22:15
<johnsom> Yeah, not sure that is the right answer as opposed to going down the create amphora flow. I also need to look at the context there.  22:16
<mnaser> well i was just wondering if there is any testing happening on the create amphora flow that i can hook into  22:16
<xgerman> yeah, I am a bit perplexed too, since it seems we deleted the amp, marked it deleted, and then never moved on  22:16
<mnaser> and see what happens  22:16
<mnaser> i was in a state where i had like 17 amphoras  22:17
<mnaser> and 60 load balancers  22:17
<xgerman> k  22:17
<mnaser> also, another bug is that the octavia health manager polls for amphoras  22:17
<mnaser> rather than polling for load balancers and matching them to amphoras  22:17
<mnaser> so if those amphoras disappear  22:17
<mnaser> nothing ever rebuilds them  22:17
<xgerman> yeah, it's very amphora centric  22:17
<johnsom> Here is the flow: https://docs.openstack.org/octavia/latest/_images/LoadBalancerFlows-get_create_load_balancer_flow.svg  22:17
<johnsom> mnaser: that is not true  22:18
<xgerman> we likely throw the LB into ERROR and leave it there  22:18
*** yamamoto has quit IRC  22:18
<mnaser> https://github.com/openstack/octavia/blob/master/octavia/controller/healthmanager/health_manager.py#L79-L155  22:18
<mnaser> nothing scans LBs  22:18
<mnaser> it just gets amphora healths  22:19
<johnsom> We query the amphora health table, yes, but don't poll the amps.  22:19
<mnaser> but if the amphora health table has only 17 records because they disappeared and never came back  22:19
<mnaser> those load balancers are now in ACTIVE state and nothing is setting them to ERROR  22:19
<mnaser> because when you're polling your 17 "healthy" amphoras, you ignore the remaining ones  22:19
<johnsom> Nope, that is not how the logic works  22:20
<johnsom> If the failover flow deleted the amphora health record, it is in the flow already. If that flow fails, the LB gets marked ERROR because nova or something failed in the middle of the failover  22:21
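For illustration, a toy version of the scan mnaser is pointing at: the stale-amphora check iterates rows of the amphora_health table, so a record that has been deleted outright is invisible to it. Field names are illustrative, not the real schema:

```python
import datetime


def find_stale_amphorae(amphora_health_rows, heartbeat_timeout):
    # Only rows still present in amphora_health can be flagged as stale;
    # with 60 LBs but only 17 health rows left, a scan like this reports
    # everything as healthy.
    now = datetime.datetime.utcnow()
    return [
        row['amphora_id']
        for row in amphora_health_rows
        if not row['busy'] and now - row['last_update'] > heartbeat_timeout
    ]
```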
<xgerman> yeah, there must have been some shutdown in the middle, DB failure, or DB tampering to get there  22:21
<johnsom> Plus it would not be in the ACTIVE state if a failover is occurring  22:22
<mnaser> openstack ansible does rolling reboots  22:22
<mnaser> of galera  22:22
<mnaser> it might be somewhat related  22:22
<johnsom> That should be fine as long as one of the instances was still responding and the db connection string is set up for those multiple instances  22:22
<mnaser> regardless, it was in a state where LBs were ACTIVE (only 3 were ERROR), 17 amphoras existed in the table (and in nova), and things were just stuck  22:22
<mnaser> and failing over didn't work  22:23
<xgerman> yeah, we should fix failover for that case  22:23
<mnaser> imho the best solution at least is to figure out a way to let octavia create a new amp  22:23
<mnaser> somehow  22:23
<xgerman> yeah, if we still have the vip port that shouldn't be a problem  22:23
<johnsom> right, it should be doing that.  22:23
<mnaser> i mean i can probably easily reproduce locally: create an lb, delete the amphora record and now you're stuck  22:23
<mnaser> but i don't know how to best build a test case around that  22:24
<xgerman> delete the amp record?  22:24
<mnaser> like if you just go to the db  22:24
<mnaser> delete from amphora where load_balancer_id=thebadone;  22:24
<johnsom> he means the amphora health record  22:24
<mnaser> you'll end up in that state  22:24
<mnaser> yeah, delete the amphora health record *and* the amphora record and you're stuck, unable to manually failover too  22:25
<xgerman> no, the amp table:  22:25
<xgerman> https://review.openstack.org/#/c/620414/1/octavia/controller/worker/controller_worker.py  22:25
<johnsom> You should get a relational error from the DB and that would be denied  22:25
<xgerman> apparently if we don't have records of the amps we skip the failover  22:25
<mnaser> ^  22:25
<johnsom> Ok, yes, delete amp and amp health at the same time, that would get you into a state where we no longer "know" of the amphora  22:26
<johnsom> You skip the API failover, yes, I see that bug  22:26
<xgerman> yep, so we should have a check whether the amps we get back are consistent with the LB setup and then recreate amps or so  22:26
<johnsom> Yeah, but at that point you have blown away all record of how many amps that LB had (thinking Act/Act here)  22:27
<mnaser> fwiw this is a standalone scenario  22:27
<mnaser> aren't we able to recreate the amp stuff anyways?  22:28
<mnaser> we assume that the old amps are missing  22:29
<mnaser> if we have no record of anything  22:29
<johnsom> Yes, absolutely, that is what the failover flows do  22:29
<johnsom> No records of things, that gets a bit more dicey  22:29
<mnaser> so is the issue here that things are getting deleted before they should be?  22:30
<mnaser> i.e. we should associate to a new amphora first, and if we don't, we stop.  22:30
<johnsom> Yeah, I think part of the key is what deleted both those records and didn't roll back on failures  22:30
<mnaser> assuming a functional db might be the issue here  22:30
<johnsom> I mean the amphora records are marked "DELETED" and not actually deleted until the purge interval, which is a week by default  22:31
<johnsom> Well, with the above patch, we no longer assume functional DBs....  22:31
<mnaser> uh, i def don't see that behaviour..  22:31
<mnaser> when was this added?  22:32
<johnsom> Which is on top of the 50 retries in oslo_db, etc....  22:32
<xgerman> but if it's DELETED we ignore them for the query I posted  22:32
<mnaser> i don't see a single deleted amphora in the amphora table  22:32
<mnaser> they're all ready or allocated  22:32
<johnsom> You must have changed your purge interval then  22:32
<mnaser> haven't touched it unless openstack ansible overrides the defaults  22:33
<mnaser> https://github.com/openstack/openstack-ansible-os_octavia/blob/stable/queens/templates/octavia.conf.j2#L253-L254  22:33
<mnaser> it's commented out  22:33
<johnsom> https://github.com/openstack/openstack-ansible-os_octavia/blob/stable/queens/templates/octavia.conf.j2#L255-L256  22:34
<johnsom> Commented out means it's running defaults  22:34
<mnaser> yeah, we haven't touched that..  22:34
<mnaser> let me double check  22:34
<mnaser> yep, commented out here  22:35
<mnaser> let me see the logs  22:36
<mnaser> ok, so only 3 managed to revert to ERROR  22:39
<mnaser> 2018-11-27 18:02:26.414 109359 WARNING octavia.controller.worker.controller_worker [-] Task 'octavia.controller.worker.tasks.database_tasks.MarkAmphoraPendingDeleteInDB' (3574ee5a-451e-4f8f-9513-c48ea7332fa6) transitioned into state 'REVERTED' from state 'REVERTING'  22:39
<mnaser> which explains why we had 3 that reverted  22:39
<johnsom> mnaser: Would you mind tarring up the cw/hm/hk logs from the controllers and putting them somewhere I can get them? Private is probably best.  22:42
<mnaser> unfortunately it wasn't running in debug and there are 3 controllers, so it might be a bit harder, but i can try and do that after somewhat stripping them  22:42
<johnsom> Ok, if it's too much hassle, don't worry about it.  22:43
*** abaindur has joined #openstack-lbaas  22:43
<mnaser> so i guess what's strange is how they all disappeared  22:43
*** abaindur has quit IRC  22:43
<johnsom> I just wanted to have some light reading to piece together the chain of events that led to the amp records being deleted  22:43
<mnaser> it should have rolled back to ERROR and the amp would still be recoverable  22:43
<mnaser> because it would just have a db field in ERROR state  22:44
<johnsom> Right  22:44
*** abaindur has joined #openstack-lbaas  22:44
<mnaser> i guess it would be good to know why nothing exists in the amphora table  22:44
<johnsom> I'm wondering if you didn't have the DB outage fix and didn't have the "dual down" fix  22:44
<johnsom> but still, not sure what could/would have deleted the amp record. They should at least be in "DELETED"  22:45
<johnsom> or ERROR  22:45
<mnaser> so weird  22:48
<mnaser> i don't see anything in the logs under 'Deleted Amphora id'  22:48
<mnaser> based on https://github.com/openstack/octavia/blob/stable/queens/octavia/controller/housekeeping/house_keeping.py#L86  22:48
<johnsom> LOG.info('Purged db record for Amphora ID: %s', amp.id)  22:49
<mnaser> oh, hold on  22:49
<mnaser> there's a whole bunch of those  22:49
<mnaser> in another ctl  22:49
<johnsom> That is what I would look for as you weren't running debug  22:49
<mnaser> in queens it's a different message, johnsom (link above)  22:49
<johnsom> Ah, I see that  22:50
<mnaser> oh man.  22:50
<mnaser> https://github.com/openstack/octavia/blob/stable/queens/octavia/controller/housekeeping/house_keeping.py#L69-L86  22:53
<mnaser> https://github.com/openstack/octavia/blob/stable/queens/octavia/db/repositories.py#L1098-L1119  22:54
<mnaser> it removed 73 records  22:54
<mnaser> at the time of that issue  22:54
*** rcernin has joined #openstack-lbaas  22:57
<johnsom> So the health records were gone, which probably means it was failed over twice in a row  23:04
<mnaser> johnsom: so it looks like it did mark it as ERROR  23:05
<mnaser> 2018-11-27 18:06:51.593 109359 WARNING octavia.controller.worker.tasks.database_tasks [req-95a0f1d3-5fee-4bbf-bcf8-40a1b4a5aac2 - 8fdb4e6f09aa481ab157863dbfe2227f - - -] Reverting mark amphora pending delete in DB for amp id bed14822-aa31-480e-96b8-88f6119d7022 and compute id 6500b551-2211-47c1-8f0a-f70d06d4f841  23:05
<mnaser> 2018-11-27 18:06:51.619 109359 ERROR octavia.controller.worker.controller_worker [req-95a0f1d3-5fee-4bbf-bcf8-40a1b4a5aac2 - 8fdb4e6f09aa481ab157863dbfe2227f - - -] Failover exception: Error plugging amphora (compute_id: 3d4c578d-0458-4b5c-9253-e90ab86823d5) into port 8bb6e55f-3414-4032-9c97-031ce71e4f89.: PlugNetworkException: Error plugging amphora (compute_id: 3d4c578d-0458-4b5c-9253-e90ab86823d5) into port 8bb6e55f-3414-4032-9c97-031ce71e4f89.  23:05
<mnaser> which means at that state, i believe the amphora status was ERROR  23:06
<mnaser> which means even the cleanup would not have deleted it  23:06
<mnaser> i wonder what deletes the amphora_health record  23:07
*** salmankhan has quit IRC  23:11
<johnsom> The failover flow will, on a successful failover.  23:16
<mnaser> johnsom: that's what i'm seeing too. i'm wondering if there's a race condition where the amphora_health is missing, which then makes the health manager wipe it  23:18
<johnsom> Yeah, the queens code doesn't check if it's expired first, it goes straight into the amp health check  23:19
<johnsom> So, if the amp is DELETED and the amp health is gone, it will purge the record  23:19
<mnaser> it looks like the DisableAmphoraHealthMonitoring task is the only one that removes it  23:19
<johnsom> Rocky looks first and shoots later: https://github.com/openstack/octavia/blob/stable/rocky/octavia/controller/housekeeping/house_keeping.py#L75  23:20
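In rough terms, the ordering difference being discussed: the safer (Rocky-style) housekeeping only purges an amphora row that is both soft-deleted and older than the purge interval, rather than purging as soon as its health row is missing. A sketch with illustrative field names, not the actual repository code:

```python
import datetime


def rows_safe_to_purge(amphora_rows, expiry_age=datetime.timedelta(days=7)):
    # "Look first, shoot later": require the record to be marked DELETED
    # *and* to be past the purge interval (a week by default).
    cutoff = datetime.datetime.utcnow() - expiry_age
    return [
        row['id']
        for row in amphora_rows
        if row['status'] == 'DELETED' and row['updated_at'] < cutoff
    ]
```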
<mnaser> johnsom: is taskflow code linear in terms of order, or does it all happen in parallel according to provides= and requires=?  23:21
<mnaser> https://github.com/openstack/octavia/blob/stable/queens/octavia/controller/worker/flows/amphora_flows.py#L463-L466  23:22
<johnsom> Some parts of the flow are linear (serial), some parts are parallel  23:22
<mnaser> shouldn't it wait for more things to require than AMPHORA?  23:22
<johnsom> Basically we are passing in the failed amp object to that task, so it likely pulls the amp ID out of that object.  23:23
<mnaser> johnsom: right, but shouldn't it wait much longer before removing it from health?  23:24
<mnaser> https://docs.openstack.org/octavia/latest/_images/AmphoraFlows-get_failover_flow.svg  23:24
<mnaser> well, i guess according to this it's ok  23:24
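For anyone following along, a toy TaskFlow example (not Octavia's real failover flow) of how provides= and requires= drive ordering: a task runs once everything it requires is available, and tasks in an unordered sub-flow have no ordering relative to each other, so they may run concurrently with a parallel engine:

```python
from taskflow import engines, task
from taskflow.patterns import linear_flow, unordered_flow


class LookupFailedAmp(task.Task):
    # Whatever execute() returns is stored under this name...
    default_provides = 'failed_amphora'

    def execute(self, amphora_id):
        return {'id': amphora_id}


class DisableHealthMonitoring(task.Task):
    # ...and tasks that list it as an execute() argument implicitly require it.
    def execute(self, failed_amphora):
        print('dropping health row for %s' % failed_amphora['id'])


class DeleteCompute(task.Task):
    def execute(self, failed_amphora):
        print('deleting VM for %s' % failed_amphora['id'])


flow = linear_flow.Flow('failover-sketch').add(
    LookupFailedAmp(),
    # Both cleanup tasks only need 'failed_amphora', so there is no
    # ordering between them inside the unordered sub-flow.
    unordered_flow.Flow('cleanup').add(
        DisableHealthMonitoring(),
        DeleteCompute(),
    ),
)

engines.run(flow, store={'amphora_id': 'amp-1'})
```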
<mnaser> i mean the only theory is that *something* removed amphora_health, causing it to be deleted, i can't imagine any other theory, time is ok on that server and we monitor ntp  23:25
<johnsom> I think you will find the LB with deleted amps went through two failover flows....  23:26
<mnaser> okay so  23:26
<mnaser> this might be a rough question but  23:26
<mnaser> is it possible because we run multiple octavia health managers?  23:26
<mnaser> or are you saying two consecutive failovers?  23:27
<johnsom> No, consecutive  23:27
<johnsom> like one flow to fix amp 1, another to fix amp 2  23:27
<johnsom> or attempt 1 failed, and it ran an attempt 2  23:27
*** fnaval has quit IRC  23:28
<mnaser> that's possible  23:28
<mnaser> i don't have enough logs to be 100% sure  23:28
<johnsom> So, I think the issue is that the housekeeping fix didn't get backported to queens....  23:30
<johnsom> This one: https://review.openstack.org/#/c/548989/  23:31
<mnaser> johnsom: i'm struggling to understand the second part  23:34
*** yamamoto has joined #openstack-lbaas  23:34
<mnaser> on how/why it would end up in that state  23:35
<johnsom> We tried to fail over the amp because the DB fix isn't in your build of Octavia; this failover failed, probably due to the DB outage; we revert the failover and put stuff back, but don't rebuild the amp health record. We then attempt to fail it over again (it's still broken, give it another go), however the amp is gone and there is no health record. Along comes housekeeping, which without that patch checks all amphorae, doesn't find the health record, and nukes the amp record  23:41
<johnsom> It's a pretty darn hit-it-perfect bug, but I can see it happen.  23:42
<mnaser> johnsom: but do we ever remove the amp health record? it doesn't look like it actually gets removed until the failover has successfully completed  23:42
<johnsom> Queens, amphora_flows, line 332  23:43
<johnsom> We were doing it as part of the cleanup of the failed amp  23:43
<johnsom> https://review.openstack.org/#/c/548989/13/octavia/controller/worker/flows/amphora_flows.py  23:43
<mnaser> ah shit  23:44
<mnaser> i have queens checked out  23:44
<mnaser> and i don't have that same code  23:44
<mnaser> so that must have been backported, but i don't have that change  23:44
<johnsom> No, I just noticed we forgot to backport that patch.... It's only in Rocky  23:45
<mnaser> johnsom: https://review.openstack.org/#/q/Ief97ddda8261b5bbc54c6824f90ae9c7a2d81701  23:45
<mnaser> https://review.openstack.org/#/c/587195/  23:45
<johnsom> Ah, gerrit search failed me  23:46
<johnsom> Ok, so it's there  23:46
<mnaser> let me check if we have it  23:46
<mnaser> and we don't  23:47
<mnaser> that explains it all now  23:47
<mnaser> sigh  23:47
<johnsom> We have two queens backports pending merge and we will cut another queens release soon. But that one was in 2.0.2  23:47
<mnaser> man, we were one commit away from that  23:47
<mnaser> well, the one before was a year ago  23:48
<xgerman> yeah, things move quickly on our end :-)  23:48
<johnsom> Yeah, we did poorly on cutting backport releases for a while. I now have a core who is focused on that  23:48
<johnsom> Trying to improve our practices with cutting releases....  23:48
