*** chkumar246 has quit IRC | 01:04 | |
*** Alex_Staf has quit IRC | 01:15 | |
*** chandankumar has joined #openstack-lbaas | 01:22 | |
*** fnaval has quit IRC | 01:38 | |
*** annp has joined #openstack-lbaas | 02:34 | |
*** yamamoto has joined #openstack-lbaas | 03:27 | |
*** ianychoi has joined #openstack-lbaas | 03:29 | |
*** yamamoto has quit IRC | 03:31 | |
*** rcernin has quit IRC | 03:55 | |
*** rcernin has joined #openstack-lbaas | 03:55 | |
*** yamamoto has joined #openstack-lbaas | 04:28 | |
nmagnezi | johnsom, o/ | 06:12 |
nmagnezi | johnsom, still around? | 06:12 |
*** links has joined #openstack-lbaas | 06:19 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 06:19 | |
*** AlexeyAbashkin has quit IRC | 06:24 | |
*** ianychoi_ has joined #openstack-lbaas | 06:26 | |
*** ianychoi has quit IRC | 06:29 | |
*** Alex_Staf has joined #openstack-lbaas | 06:34 | |
*** kobis has joined #openstack-lbaas | 06:53 | |
*** pcaruana has joined #openstack-lbaas | 07:08 | |
*** rcernin has quit IRC | 07:10 | |
*** aojea has joined #openstack-lbaas | 07:13 | |
*** aojea has quit IRC | 07:22 | |
*** tesseract has joined #openstack-lbaas | 07:23 | |
*** aojea has joined #openstack-lbaas | 07:29 | |
*** kobis1 has joined #openstack-lbaas | 07:31 | |
*** kobis1 has quit IRC | 07:31 | |
*** aojea has quit IRC | 07:35 | |
*** aojea has joined #openstack-lbaas | 07:42 | |
*** aojea has quit IRC | 07:49 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 07:50 | |
*** kobis has quit IRC | 08:12 | |
*** velizarx has joined #openstack-lbaas | 08:35 | |
openstackgerrit | Ganpat Agarwal proposed openstack/octavia master: Active-Active: ExaBGP amphora L3 distributor driver https://review.openstack.org/537842 | 08:39 |
*** voelzmo has joined #openstack-lbaas | 09:10 | |
*** voelzmo has quit IRC | 09:16 | |
*** salmankhan has joined #openstack-lbaas | 09:21 | |
*** Alex_Staf has quit IRC | 09:29 | |
*** Alex_Staf has joined #openstack-lbaas | 09:38 | |
*** Alex_Staf has quit IRC | 09:51 | |
*** salmankhan has quit IRC | 09:57 | |
*** salmankhan has joined #openstack-lbaas | 09:57 | |
*** salmankhan has quit IRC | 10:50 | |
*** salmankhan has joined #openstack-lbaas | 11:08 | |
*** AlexeyAbashkin has quit IRC | 11:14 | |
*** Alex_Staf has joined #openstack-lbaas | 11:16 | |
*** yamamoto has quit IRC | 11:27 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 11:29 | |
*** yamamoto has joined #openstack-lbaas | 11:36 | |
*** salmankhan has quit IRC | 11:40 | |
*** yamamoto has quit IRC | 11:45 | |
*** salmankhan has joined #openstack-lbaas | 11:48 | |
*** atoth has joined #openstack-lbaas | 11:59 | |
*** yamamoto has joined #openstack-lbaas | 12:05 | |
*** velizarx has quit IRC | 12:20 | |
*** yamamoto has quit IRC | 12:30 | |
*** Alex_Staf has quit IRC | 13:02 | |
*** yamamoto has joined #openstack-lbaas | 13:20 | |
*** yamamoto has quit IRC | 13:20 | |
*** yamamoto has joined #openstack-lbaas | 13:20 | |
*** cpusmith_ has joined #openstack-lbaas | 13:49 | |
*** cpusmith has joined #openstack-lbaas | 13:50 | |
*** fnaval has joined #openstack-lbaas | 13:52 | |
*** cpusmith_ has quit IRC | 13:54 | |
*** yamamoto has quit IRC | 14:03 | |
*** kong has quit IRC | 14:13 | |
*** salmankhan has quit IRC | 14:21 | |
*** links has quit IRC | 14:32 | |
*** salmankhan has joined #openstack-lbaas | 14:37 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting https://review.openstack.org/556549 | 14:37 |
rm_work | johnsom / nmagnezi / xgerman_ ^^ | 14:37 |
rm_work | I *think* I did that correctly? | 14:38 |
rm_work | nope lol | 14:38 |
rm_work | maybe not | 14:39 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting https://review.openstack.org/556549 | 14:39 |
rm_work | how about this | 14:39 |
rm_work | hmmm no | 14:40 |
rm_work | i don't understand what i did wrong on the first one | 14:40 |
rm_work | that should have been right | 14:40 |
rm_work | i guess i have to select branches in order to make it non-voting? | 14:40 |
johnsom | rm_work The semi-colon was missing in the first try | 15:01 |
rm_work | wut | 15:02 |
rm_work | ohhh | 15:02 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Switch multinode tests to non-voting https://review.openstack.org/556549 | 15:03 |
*** yamamoto has joined #openstack-lbaas | 15:03 | |
*** chandankumar has quit IRC | 15:06 | |
xgerman_ | k, they have been failing forever | 15:10 |
rm_work | yes | 15:10 |
xgerman_ | rm_work: also my HM ate up 10G RAM over the weekend. Anywhere in the DB to look for bottlenecks? | 15:11 |
rm_work | it's not the DB | 15:11 |
rm_work | i mean, not really | 15:11 |
xgerman_ | well, I know - but it's not like HM tells me how long its queries/inserts take | 15:11
rm_work | cutting down the query count helps a lot (see: https://review.openstack.org/#/c/555471/) but you need the rest of those patches | 15:12 |
rm_work | that one will help *a lot* though | 15:12 |
xgerman_ | mmh, I'd like to have numbers to base that on. There should be a setting where HM spews diagnostic info | 15:13
rm_work | as will https://review.openstack.org/#/c/555473/ | 15:13 |
rm_work | xgerman_: it was something like ... many hundreds to 10a | 15:13 |
rm_work | *to 10s | 15:13 |
rm_work | johnsom had some numbers | 15:13 |
rm_work | but, an order of magnitude reduction | 15:13 |
xgerman_ | yeah, I guess we should have the measurement code submitted - since other operators need that for diagnostics | 15:14
rm_work | you REALLY need to get those first three | 15:14 |
rm_work | it will improve things significantly | 15:14 |
rm_work | but to REALLY fix it you need the last three, which ... i don't know if we'll be able to get in | 15:14 |
rm_work | maybe you can just apply those patches locally | 15:14 |
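A minimal sketch of the query-count point being made here (not the linked patch; the table and column names are invented, and plain SQLAlchemy against in-memory SQLite stands in for the real schema). It contrasts one UPDATE per listener with a single batched executemany() per heartbeat:

```python
# Toy illustration only: why per-row health updates explode the query count,
# and how batching them into one statement per heartbeat helps.
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
meta = sa.MetaData()
listener = sa.Table(
    "listener", meta,
    sa.Column("id", sa.String, primary_key=True),
    sa.Column("operating_status", sa.String),
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(listener.insert(),
                 [{"id": "l%d" % i, "operating_status": "OFFLINE"} for i in range(100)])

statuses = {"l%d" % i: "ONLINE" for i in range(100)}  # one heartbeat's worth of results

# Naive: one UPDATE per listener, so N round trips for every heartbeat.
with engine.begin() as conn:
    for lid, status in statuses.items():
        conn.execute(listener.update()
                     .where(listener.c.id == lid)
                     .values(operating_status=status))

# Batched: a single executemany() per heartbeat covers all listeners at once.
stmt = (listener.update()
        .where(listener.c.id == sa.bindparam("lid"))
        .values(operating_status=sa.bindparam("new_status")))
with engine.begin() as conn:
    conn.execute(stmt, [{"lid": lid, "new_status": status}
                        for lid, status in statuses.items()])
```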
*** yamamoto has quit IRC | 15:15 | |
xgerman_ | well, if Pike doesn’t work we need to tell people | 15:15 |
rm_work | i had a hard time proving it, i tried, but i'm just not that good at performance diagnostics | 15:16 |
rm_work | i can show the symptoms in a variety of ways | 15:16 |
rm_work | via like, debug logging | 15:16 |
rm_work | or system performance graphs | 15:16 |
xgerman_ | yes, but if I am some operator I need something in the logs that tells me what is going on, other than my OpenStack cluster crashing because we have a runaway health manager | 15:17
rm_work | but i worked quite a bit with zzzeek on the sqlalchemy bits that seemed to be taking a long time, and could never find a smoking gun | 15:17 |
rm_work | yes | 15:17 |
rm_work | so see my third patch there | 15:17 |
rm_work | https://review.openstack.org/#/c/555473/ | 15:17 |
xgerman_ | because if I didn't need Octavia, after one or two of those episodes I would lose all faith | 15:17
rm_work | https://review.openstack.org/#/c/555473/2/octavia/controller/healthmanager/update_db.py@123 | 15:18 |
rm_work | if you start seeing those messages | 15:18 |
rm_work | it's bad news bears time | 15:18 |
*** pcaruana has quit IRC | 15:18 | |
rm_work | i put that in specifically for what you're talking about | 15:18 |
rm_work | to warn operators that something is going wrong | 15:18 |
rm_work | and they need to be aware | 15:18 |
rm_work | The language is as strong as I could make it and stay "PC" | 15:19 |
rm_work | saying "your shit is fucked" in a log message isn't the best, but i wanted to, lol | 15:19 |
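A rough sketch of the kind of operator warning being described (assumed names and threshold, not the code in the review linked above): time each health message and warn when processing takes longer than the heartbeat interval, since that means the health manager is falling behind.

```python
# Sketch only: warn operators when a single health message takes longer to
# process than the heartbeat interval, i.e. the health manager cannot keep up.
import logging
import time

LOG = logging.getLogger(__name__)
HEARTBEAT_INTERVAL = 10  # seconds; assumed config value


def process_health_message(health, handler):
    start = time.time()
    handler(health)  # the real DB update work would happen here
    elapsed = time.time() - start
    if elapsed > HEARTBEAT_INTERVAL:
        LOG.warning(
            "Health update for amphora %s took %.2fs, which is longer than "
            "the heartbeat interval (%ss). The health manager is falling "
            "behind; failovers may be triggered for healthy amphorae.",
            health.get("id"), elapsed, HEARTBEAT_INTERVAL)


# Example call with a dummy handler:
process_health_message({"id": "amphora-1234"}, handler=lambda msg: None)
```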
xgerman_ | well, we can also try to keep track of jobs and cancel the ones that get overtaken… to keep the queue at a ~constant size | 15:21
*** chkumar246 has joined #openstack-lbaas | 15:23 | |
rm_work | xgerman_: that's what that code does | 15:24 |
rm_work | less tracking, more just cancelling | 15:24 |
rm_work | you're better off letting the original job finish than restarting | 15:25 |
rm_work | if you cancelled the first job and tried to do the new one, once you got behind almost nothing would ever actually update, and it'd be catastrophic | 15:25 |
rm_work | which brings me back to, you *absolutely* need that third patch https://review.openstack.org/#/c/555473/ | 15:25 |
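The "let the original job finish" behaviour can be sketched roughly like this (assumed structure, not the actual patch): drop a heartbeat if an update for the same amphora is still in flight, so duplicate work cannot pile up behind a slow update.

```python
# Rough illustration: skip a newly arrived heartbeat when a previous update
# for the same amphora is still being processed.
import threading

_in_progress = set()
_lock = threading.Lock()


def maybe_process(health, handler):
    amp_id = health["id"]
    with _lock:
        if amp_id in _in_progress:
            # A previous heartbeat from this amphora is still being handled;
            # let it finish rather than restarting or queueing duplicates.
            return False
        _in_progress.add(amp_id)
    try:
        handler(health)
        return True
    finally:
        with _lock:
            _in_progress.discard(amp_id)
```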
johnsom | Yeah, we need to stop the panic here and figure out why this system is so much different than everyone else's | 15:25 |
rm_work | johnsom: regardless of whatever xgerman_'s issue is, I firmly believe we need to backport those first three | 15:26 |
*** imacdonn has joined #openstack-lbaas | 15:26 | |
rm_work | (only three because the middle one prevents a ton of merge errors, so i threw it in there, it's test-only) | 15:26 |
rm_work | this is based entirely on my assessment of the state of the pike codebase, and cgoncalves's plea to more actively backport stuff that we think is necessary for good operation | 15:27 |
rm_work | i didn't realize before that we hadn't fixed any of that stuff in Pike yet | 15:27 |
rm_work | health management is completely out of control without at least your patch (the first one in the chain) and mine (the third one) prevents a lot of headache and gives at least some heads-up that operators would need to even realize the issue | 15:28 |
rm_work | i think if we merge those three, we'll see more people coming forward saying "i'm getting this warning message from my HMs..." | 15:28 |
rm_work | as it is, people may not even realize until it's too late, and then it takes a good deal of forensics to realize what happened | 15:29 |
rm_work | I should have asked cgoncalves to do the backports so I could +2 them myself <_< | 15:29 |
xgerman_ | ok, let’s see if my system is really that special | 15:30 |
johnsom | Well, I'm going to have a look at this environment and see what is up. The context switches are near zero there and the load is too high for the # of amps. So something else is going on there | 15:30 |
rm_work | the first three have nothing to do with that | 15:31 |
rm_work | the first one is yours that objectively makes the number of DB operations less ridiculous, and the third is mine that gives operators at least a chance to see something is wrong and possibly head off a catastrophic failure | 15:31 |
rm_work | IMO not merging at least those would be irresponsible | 15:32 |
rm_work | I should have proposed them earlier honestly | 15:32 |
rm_work | but i hadn't really been thinking about backports / remembering that people actually will try to run Pike | 15:32 |
xgerman_ | My 2ct: We run an unbounded service that can chew up resources. We need to constrain that somehow | 15:32
xgerman_ | rm_work: people are just now getting on newton ;-) | 15:33 |
rm_work | xgerman_: yes, well, the next set fixes all of that | 15:33 |
rm_work | if we merge those it isn't an issue | 15:33 |
rm_work | and since we'd have no reason to make a patch to master to do constraints (since it isn't an issue), we're not going to be able to backport anything | 15:33 |
rm_work | anyway, i don't understand why we'd make and backport a workaround when the *resolution* is already merged in master | 15:34 |
xgerman_ | there are also rumours that the queue in the ProcessPool is better constrained | 15:34
xgerman_ | yeah, I'd like to work off one code base as well | 15:34
rm_work | xgerman_: did you get a chance to *try* this patch chain in your env | 15:35 |
xgerman_ | no, we wanted to test the exception handling fix over the weekend | 15:36 |
johnsom | Looks like the environment reverted to stock pike, no exception fix, etc. in it | 15:38 |
rm_work | :( | 15:40 |
johnsom | Hmm, is there a way to find the sha from a pip-installed app? The version in pip list is a "dev" version # | 15:40
rm_work | ah you don't do -e ? | 15:42 |
rm_work | is the dev version number not like... a shorthash? | 15:42 |
johnsom | Apparently here, they do dev. I think I found one in the egg under pbr. looking now | 15:42 |
johnsom | There are so many amps reporting in that don't have listeners in the DB.... | 15:44 |
rm_work | errr | 15:46 |
rm_work | because they didn't get their configs updated correctly? | 15:47 |
johnsom | yeah, ok, it has failed over 83 amps in 10 minutes | 15:47 |
johnsom | I think we are ping ponging on these corrupt amps/records | 15:47 |
johnsom | Remember this isn't a normal environment, folks have messed in the DB, it was running neutron-lbaas, etc. | 15:48 |
johnsom | Yeah, they are looping on the amp ids.... Interesting. | 15:49 |
rm_work | yeah so | 15:49 |
rm_work | that is a cascade failure | 15:49 |
rm_work | i saw the same thing | 15:49 |
rm_work | you need to shut it all down | 15:49 |
johnsom | Yeah, this is immediate after a fresh restart too | 15:49 |
rm_work | yes | 15:49 |
rm_work | gotta shut down the HMs | 15:49 |
rm_work | mark all amps busy in health | 15:50 |
rm_work | bring the HMs up | 15:50 |
rm_work | wait for the health messages to stabilize | 15:50 |
rm_work | then un-busy and failover one at a time | 15:50 |
rm_work | to get them back to "actually correct" state | 15:50 |
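A hedged sketch of the "mark all amps busy" step in the procedure above, assuming the Pike-era amphora_health table with its busy flag; the connection URL is a placeholder, and this would only be run with the health managers stopped.

```python
# Sketch of the recovery step described above (adapt to your deployment).
import sqlalchemy as sa

# Placeholder URL; point this at the real Octavia database.
engine = sa.create_engine("mysql+pymysql://octavia:secret@127.0.0.1/octavia")

with engine.begin() as conn:
    # With the HMs stopped: mark every amphora busy so restarted HMs ignore
    # their heartbeats instead of mass-triggering failovers.
    conn.execute(sa.text("UPDATE amphora_health SET busy = 1"))

    # Later, once heartbeats have stabilized, un-busy one amphora at a time
    # and fail it over deliberately, e.g.:
    # conn.execute(sa.text("UPDATE amphora_health SET busy = 0 "
    #                      "WHERE amphora_id = :aid"), {"aid": "<amphora-uuid>"})
```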
xgerman_ | johnsom: infra01 has the monkey patched version | 15:52 |
johnsom | No, I am on infra01, it's running clean d9e24e83bbf7d12f8bec16072b28c6ca655690ac | 15:53 |
johnsom | Our changes are gone | 15:53 |
xgerman_ | mmh | 15:53 |
xgerman_ | I only restarted the hm | 15:53 |
johnsom | Me too | 15:53 |
xgerman_ | ok | 15:53 |
johnsom | I haven't touched this since we did. Let me see if I can get a timestamp when it was redeployed | 15:54 |
johnsom | Yeah, no idea, the files all say jan 4 | 15:55 |
johnsom | xgerman_ oh, somehow that IP you gave me took me to infra3 this time... | 15:56 |
xgerman_ | mmh, no idea | 15:56 |
rm_work | lol | 15:57 |
johnsom | Ok, yeah, I think we have it figured out, now just why.... It's failing over slightly more than 8 a minute, with a failover queue of 10, so... That is the queue growth, and why the CPU is high. Now digging into the why.... | 16:03 |
*** salmankhan has quit IRC | 16:03 | |
*** salmankhan has joined #openstack-lbaas | 16:07 | |
*** salmankhan has quit IRC | 16:12 | |
johnsom | Yeah, ok so it's amps we show as deleted reporting in. | 16:15 |
johnsom | I think we need to add some zombie protection... lol | 16:17 |
*** salmankhan has joined #openstack-lbaas | 16:31 | |
rm_work | please please PLEASE take full advantage of the possible commit messages | 16:34 |
rm_work | i will -1 for not properly taking advantage of subject matter | 16:34 |
*** pcaruana has joined #openstack-lbaas | 16:44 | |
rm_work | i wish we had a way to figure out if the IP that reports in is actually an old amp compute, and if so, figure out what its compute-id is and delete it | 16:47 |
*** salmankhan has quit IRC | 16:51 | |
johnsom | So I am tracking by the "reports 1 listeners when 0 expected" which gives the amp ID, which is the name of the amp in nova | 17:04 |
*** AlexeyAbashkin has quit IRC | 17:15 | |
*** salmankhan has joined #openstack-lbaas | 17:16 | |
*** salmankhan has quit IRC | 17:20 | |
rm_work | ummm | 17:36 |
rm_work | "amp id"? | 17:36 |
rm_work | what if it's not in the DB? | 17:37 |
rm_work | we're talking about zombies right? | 17:37 |
rm_work | amps that were "deleted" but the compute instance survived somehow? | 17:37 |
johnsom | Right | 17:43 |
johnsom | Yeah, looks like nova was trashed here for a while and some non-graceful shutdowns | 17:45 |
xgerman_ | mmh | 17:46 |
xgerman_ | I think nova foo-baring is a fact of life | 17:47 |
johnsom | Agreed, and we handled it appropriately | 17:48 |
johnsom | retried and then marked them ERROR | 17:48 |
xgerman_ | well, and then we dropped the ball on zombie vms | 17:48 |
johnsom | Yeah, it looks like we are not handling amps we told nova to delete, nova said success, but they didn't actually delete. We can improve this. | 17:49 |
xgerman_ | +1 | 17:51 |
*** tesseract has quit IRC | 18:10 | |
rm_work | yeah so how to get the compute-id of them | 18:12 |
rm_work | the only thing i could think of is their address <_< | 18:13 |
rm_work | which ... is maybe not 100% reliable? | 18:13 |
johnsom | They come into the HM with their IDs | 18:13 |
johnsom | We just need to check if the amp is DELETED during a failover and re-delete if so | 18:13 |
rm_work | umm | 18:14 |
rm_work | so that's if they're still in the DB | 18:14 |
rm_work | [02:37:12] <rm_work>what if it's not in the DB? | 18:14 |
rm_work | [02:37:49] <rm_work>we're talking about zombies right? | 18:14 |
rm_work | [02:38:02] <rm_work>amps that were "deleted" but the compute instance survived somehow? | 18:14 |
johnsom | Ok, or missing I guess, but yeah, normally they will be in "DELETED" | 18:14 |
rm_work | i don't know about you, but mine go away | 18:14 |
johnsom | at least for some period of time | 18:15 |
rm_work | so there is literally nothing | 18:15 |
rm_work | i usually have to do lookups by IP | 18:15 |
rm_work | because there's this thing coming in with an amp-id that has no record in the amphora table | 18:15 |
johnsom | Even if you have your HK DB cleanup too low, we get in an amp ID, it's failed because it claims listeners that we show as 0, we check DB if DELETED or missing, call nova delete "amphora-<ID>" | 18:16 |
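Roughly what that check could look like (a sketch with assumed helper plumbing, not the merged fix): if the amphora record is DELETED or already purged, look the instance up in nova by its "amphora-<ID>" name and delete it again.

```python
# Sketch of the zombie-handling idea described above; db_get_amphora and the
# keystoneauth session are assumed plumbing, not real Octavia code.
from novaclient import client as nova_client


def handle_zombie(amp_id, db_get_amphora, nova_session):
    amp = db_get_amphora(amp_id)  # assumed helper; returns None if purged
    if amp is not None and amp.status != "DELETED":
        return  # not a zombie; normal health processing applies

    nova = nova_client.Client("2.1", session=nova_session)
    for server in nova.servers.list(search_opts={"name": "amphora-%s" % amp_id}):
        # Nova previously reported the delete as successful, so retry it.
        server.delete()
```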
johnsom | Do you happen to remember the columns we need to fill in to re-create an amp record to trigger a failover? | 18:18 |
johnsom | I have a MASTER healthy but no BACKUP | 18:18 |
rm_work | ah, yes i think so, one sec | 18:21 |
rm_work | and also: it has nothing to do with "HK cleanup time" | 18:21 |
rm_work | see: https://review.openstack.org/#/c/548989/ | 18:21 |
rm_work | until y'all review that | 18:21
rm_work | ... i will admit that i need to really give it a good testing in devstack | 18:22 |
*** fnaval has quit IRC | 18:23 | |
rm_work | also, to distract even more, I have one more feature that I'm going to be whipping through today and hopefully have a patch up shortly -- an admin API call that will give usage statistics for the whole system (total number of LBs, amps, listeners, members, etc | 18:24 |
rm_work | ) | 18:24 |
rm_work | drawing up a mock output first to run past people | 18:24 |
*** fnaval has joined #openstack-lbaas | 18:26 | |
rm_work | I would do a spec but | 18:27 |
rm_work | then it'd take a month to get to code | 18:28 |
rm_work | rather just get some basic agreement | 18:28 |
*** AlexeyAbashkin has joined #openstack-lbaas | 18:42 | |
*** pcaruana has quit IRC | 18:46 | |
rm_work | johnsom / xgerman_ / nmagnezi / cgoncalves / anyone really: when you have a minute: https://etherpad.openstack.org/p/octavia-admin-usage | 18:51 |
rm_work | proposing *something* like that | 18:51 |
rm_work | please to comment / propose alternatives | 18:51 |
rm_work | need to have a way to expose some stats about how utilized the system is | 18:52 |
rm_work | ah maybe it should be *utilization* | 18:52 |
*** AlexeyAbashkin has quit IRC | 18:54 | |
*** voelzmo has joined #openstack-lbaas | 18:54 | |
*** voelzmo has quit IRC | 19:00 | |
*** atoth has quit IRC | 19:00 | |
nmagnezi | o/ | 19:06 |
rm_work | o/ | 19:07 |
nmagnezi | rm_work, looking now | 19:07 |
nmagnezi | :) | 19:07 |
nmagnezi | rm_work, a question in regards to the amphorae section | 19:08 |
nmagnezi | rm_work, since we plan to implement providers support in Rocky, will it support an equivalent of that for other providers? | 19:08 |
nmagnezi | rm_work, would you like me to comment on this in the etherpad itself? | 19:09
nmagnezi | johnsom, o/ | 19:09 |
rm_work | nmagnezi: i mean i am thinking yes | 19:10 |
rm_work | but | 19:10 |
rm_work | well | 19:10 |
rm_work | i'm not sure | 19:10 |
rm_work | i think that might be digging into the internals of other providers | 19:10 |
rm_work | and i think they have their own stats endpoints | 19:10 |
*** AlexeyAbashkin has joined #openstack-lbaas | 19:14 | |
xgerman_ | well, I think we could have a more common API call listing how many LB are in the system and such… | 19:14 |
*** voelzmo has joined #openstack-lbaas | 19:15 | |
xgerman_ | yeah, both lb and listeners could be part of lbaasv2 | 19:15 |
xgerman_ | ^^ rm_work | 19:15 |
xgerman_ | so I would make it two calls - one on the main API for LBs, listeners, members - and one on the octavia-API for amps and other things we can think of | 19:16
rm_work | hmmm k | 19:17 |
rm_work | i mean for amps technically you could just ... call the amphora api I guess | 19:17 |
rm_work | since it already spits back everything | 19:17 |
rm_work | and the total is "len(result)" | 19:17 |
rm_work | and the statuses you can make a map of | 19:18 |
rm_work | it's this other stuff that's a bit finicky | 19:18
rm_work | since you can't get a list of members, for example, without querying by pool | 19:18 |
*** AlexeyAbashkin has quit IRC | 19:18 | |
rm_work | so yeah, i could at least remove amphora listing from it | 19:18 |
rm_work | though i figured it would be fine to include, and just have it be None or 0 in the case that it's unused | 19:19 |
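The len()/status-map idea is trivial to express; the response shape below is assumed for illustration only.

```python
# Deriving the amphora totals client-side from an (assumed) amphora API
# listing: the total is just len(result), the breakdown is a Counter.
from collections import Counter

amphorae = [
    {"id": "a1", "status": "ALLOCATED"},
    {"id": "a2", "status": "ALLOCATED"},
    {"id": "a3", "status": "ERROR"},
]

total = len(amphorae)
by_status = Counter(a["status"] for a in amphorae)
print(total, dict(by_status))  # 3 {'ALLOCATED': 2, 'ERROR': 1}
```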
*** aojea has joined #openstack-lbaas | 19:29 | |
johnsom | nmagnezi Hi | 19:29 |
xgerman_ | rm_work: I think we need to do a bit more with the amphora API to get a utilization view. After all we report out mem and load stats | 19:32 |
rm_work | err, we do? | 19:32 |
johnsom | Let's not go too far with controller health stuff. There is a cross project proposal coming out for that | 19:33 |
johnsom | I linked to it in one of the last two meetings | 19:33 |
rm_work | this isn't controller health | 19:33 |
rm_work | this is dataplane health | 19:33 |
xgerman_ | +1 | 19:33 |
johnsom | Ok, just thought xgerman_ was implying controller stats | 19:34 |
rm_work | at least, what i'm proposing | 19:34 |
rm_work | i thought he meant amp load/mem | 19:34 |
xgerman_ | yes, amp-specific things | 19:34
xgerman_ | the /info | 19:36 |
rm_work | but we don't *push* that | 19:36 |
rm_work | i'm not looking to have something that calls out to every amp when you call it :/ | 19:36 |
rm_work | this is a proposal for something very light-weight, DB only | 19:36 |
*** harlowja has joined #openstack-lbaas | 19:37 | |
*** voelzmo_ has joined #openstack-lbaas | 19:38 | |
xgerman_ | ok | 19:39 |
rm_work | just basic utilization statistics that could be fetched for a dashboard | 19:42 |
*** voelzmo has quit IRC | 19:42 | |
xgerman_ | sounds good | 19:42 |
*** yamamoto has joined #openstack-lbaas | 19:45 | |
rm_work | so, plz comments on etherpad :P | 19:45 |
rm_work | i asked a ton of questions to my own example, lol | 19:45 |
*** voelzmo_ has quit IRC | 19:46 | |
*** yamamoto has quit IRC | 19:50 | |
*** voelzmo has joined #openstack-lbaas | 19:51 | |
*** voelzmo has quit IRC | 19:51 | |
*** voelzmo has joined #openstack-lbaas | 19:51 | |
*** voelzmo_ has joined #openstack-lbaas | 19:52 | |
*** voelzmo has quit IRC | 19:55 | |
*** voelzmo_ has quit IRC | 20:02 | |
*** voelzmo has joined #openstack-lbaas | 20:02 | |
nmagnezi | rm_work, placed some comments in the etherpad | 20:02 |
nmagnezi | johnsom, o/ | 20:02 |
rm_work | coolcool | 20:02 |
nmagnezi | johnsom, a question re https://docs.openstack.org/octavia/queens/configuration/policy.html | 20:03 |
johnsom | nmagnezi Hi | 20:03 |
nmagnezi | johnsom, so if I understand this correctly, do we expect the operator to assign the load-balancer_member role for each newly created user/tenant? | 20:04 |
nmagnezi | johnsom, otherwise, that user can't actually interact with o-api | 20:04 |
johnsom | project ID, yes, I think there is a hierarchical thing that lets you set it for everyone or a group too. | 20:05 |
nmagnezi | hmm | 20:05 |
nmagnezi | okay | 20:05 |
johnsom | Or deploy our policy.json and switch it back to the old way | 20:05 |
nmagnezi | do we know of any other projects who do the same? what was our incentive for that? | 20:05 |
johnsom | nmagnezi If I remember right nova does this. It was prompted by their work and a general consensus that we need finer-grained control in OpenStack. | 20:06
nmagnezi | indeed, but I was told that projects are moving away from policy.json | 20:07 |
johnsom | nmagnezi Yes, like we did.... | 20:07
johnsom | policy.json is only for overrides now | 20:07 |
nmagnezi | so moving back to it won't be a good idea for new deployments :) | 20:07 |
johnsom | nmagnezi No, policy.json will always be an option for deployments. That is not going away, as it is the standard way to override the defaults. However, going back to the old "owner or admin" RBAC rules may not be the best, since the rules are changing going forward. Let me find some references for you... | 20:09
nmagnezi | johnsom, thank you! | 20:10 |
johnsom | nmagnezi So this is the spec for "make everyone do this and make it the same" https://review.openstack.org/#/c/523973/ | 20:13
johnsom | nmagnezi Still digging for the older stuff about why we did it this way. | 20:14
johnsom | nmagnezi Geez, that was hard to find: https://review.openstack.org/#/c/427872/19/specs/pike/approved/additional-default-policy-roles.rst | 20:25
johnsom | And this: https://review.openstack.org/#/c/245629/8/specs/common-default-policy.rst | 20:26 |
rm_work | nmagnezi: responded | 20:33 |
rm_work | anyone else? johnsom? | 20:33 |
nmagnezi | johnsom, thanks a lot. really. | 20:33 |
nmagnezi | rm_work, cool :) | 20:35 |
xgerman_ | ok, I think I added my 2ct | 20:40 |
rm_work | replied | 20:41 |
*** rm_mobile has joined #openstack-lbaas | 20:48 | |
johnsom | rm_work So, done cleaning up that zombie mess | 21:00 |
johnsom | I think if we are going to do stats about the objects, it should be in the main API (admin), then have another call for amps and driver-specific stuff | 21:01
openstackgerrit | Merged openstack/neutron-lbaas-dashboard master: Show last updated timestamp for docs https://review.openstack.org/539074 | 21:01 |
*** voelzmo has quit IRC | 21:01 | |
*** voelzmo has joined #openstack-lbaas | 21:02 | |
*** sshank has joined #openstack-lbaas | 21:02 | |
*** voelzmo has quit IRC | 21:02 | |
*** voelzmo has joined #openstack-lbaas | 21:02 | |
openstackgerrit | Merged openstack/neutron-lbaas-dashboard master: Update links in README https://review.openstack.org/551829 | 21:03 |
openstackgerrit | Merged openstack/neutron-lbaas-dashboard master: i18n: Do not include html directives in translation strings https://review.openstack.org/538214 | 21:03 |
*** voelzmo has quit IRC | 21:03 | |
*** voelzmo has joined #openstack-lbaas | 21:04 | |
*** voelzmo has quit IRC | 21:04 | |
*** voelzmo has joined #openstack-lbaas | 21:04 | |
*** voelzmo has quit IRC | 21:05 | |
*** voelzmo has joined #openstack-lbaas | 21:05 | |
*** voelzmo has quit IRC | 21:05 | |
*** kong has joined #openstack-lbaas | 21:08 | |
xgerman_ | johnsom: +1 | 21:24 |
rm_mobile | Hmm | 21:26 |
rm_mobile | Did you comment on the spec etherpad? :P | 21:26 |
*** rm_mobile has quit IRC | 21:28 | |
xgerman_ | yes | 21:30 |
rm_work | i meant johnsom | 21:32 |
rm_work | but i can check now | 21:32 |
rm_work | ... great. he said that | 21:33 |
*** sshank has quit IRC | 21:50 | |
rm_work | johnsom: A little more detail? Like ... just remove the amphora keyword and we'd be good to merge something like option#1? | 22:04 |
johnsom | Ha, I was pulling an rm_work. paper-minimization | 22:05
rm_work | lol is that my signature move? :P | 22:05 |
rm_work | also -- still no one has addressed any of my other comments really | 22:06 |
johnsom | I think option one is fine, if it has value for someone. I would change the path to not be octavia obviously. | 22:06 |
rm_work | k | 22:06 |
rm_work | do you agree with xgerman_ to split out tls and non-tls listeners? | 22:06 |
johnsom | Just now digging myself out of zombie land. Still one stuck in nova "deleting" | 22:07 |
rm_work | T_T | 22:07 |
johnsom | Since the load on the system is higher with TLS it might be nice, but could be a follow on IMO | 22:08 |
johnsom | I don't have a real need for this, so kinda flexible. I mean I just query the DB by hand if I need these counts. | 22:09 |
johnsom | But I realize not everyone is as comfortable with that | 22:10 |
johnsom | Also, for the by-project stuff, I think the quota API and the upcoming quota work may provide that answer | 22:10
xgerman_ | it’s not all about our needs but … | 22:10 |
rm_work | k | 22:11 |
rm_work | johnsom: yeah i was doing that too | 22:11 |
rm_work | then talking it over with my team here, we realized that's ... dumb | 22:11 |
rm_work | that should not be a thing we have to do | 22:11 |
johnsom | Yeah, fair | 22:11 |
*** AlexeyAbashkin has joined #openstack-lbaas | 22:12 | |
*** AlexeyAbashkin has quit IRC | 22:16 | |
*** aojea has quit IRC | 22:18 | |
*** rcernin has joined #openstack-lbaas | 22:20 | |
*** cpusmith has quit IRC | 22:26 | |
johnsom | Hmm, what do you folks think? If we get a DELETED amp submitted to failover (HM or API), should we log and mark it busy to exclude it from future health checks, or should we try to go delete it in nova? | 22:27
xgerman_ | both | 22:28 |
johnsom | rm_work? | 22:28 |
rm_work | hmmmmm | 22:43 |
rm_work | how does that even happen | 22:44 |
johnsom | german's lab had 30+ | 22:44 |
rm_work | wut | 22:44 |
xgerman_ | mmh | 22:44 |
johnsom | His nova seems to be brokenish | 22:45 |
xgerman_ | well, it probably didn’t help we were recycling like mad | 22:45 |
johnsom | Well, it's going to be a bit of work if we want to have it try another delete | 22:47 |
johnsom | I need to add a compute driver interface for get compute id by name | 22:48 |
*** fnaval has quit IRC | 22:50 | |
xgerman_ | don’t we already have that — after all we failover… | 22:51 |
johnsom | We have the compute ID in failover cases. In this case we may not have that record anymore | 22:52 |
xgerman_ | johnsom: wait - we have the compute-id in the DB on the amphora — is that one no-good? | 22:52 |
johnsom | If they set an aggressive DELETED purge it will be gone. | 22:53 |
xgerman_ | ah, ok - | 22:53 |
johnsom | You had a few zombies that had been purged | 22:54 |
johnsom | So, if we are going to do deletes, we might as well do it all, not just those with records | 22:54 |
xgerman_ | makes sense | 22:55 |
*** chkumar246 has quit IRC | 22:57 | |
rm_work | johnsom: yes to both BTW above | 22:59 |
rm_work | which ... if it IS gone already, need to just delete the health record | 22:59 |
rm_work | of course it may come back | 22:59 |
johnsom | rm_work the question is, if it's DELETED in our DB, do we retry delete with nova | 23:00 |
rm_work | i think it's probably happening because german's health stuff is delayed by so many minutes that the records are still being saved well after the vms are gone | 23:00 |
xgerman_ | you extended that to “if we can’t find it in our DB, let’s delete, too” | 23:00 |
rm_work | which is why he REALLY needs the third patch i backported | 23:00 |
rm_work | >_< | 23:00 |
johnsom | No, he had no delay at all | 23:00 |
rm_work | wtf | 23:00 |
rm_work | well, anyway, yes | 23:00 |
*** fnaval has joined #openstack-lbaas | 23:01 | |
xgerman_ | so, the zombie vms were bringing us down — that *should* not happen | 23:03 |
johnsom | Yeah, it turned out to be all of these zombies coming in and triggering failovers because they say they have listeners but we say they shouldn't have any. So, the failover queue was getting flooded. That was the memory growth issues. After cleanup and a clean deploy the CPU is down and things are better aside from the amp that nova is stuck "deleting" | 23:03 |
xgerman_ | but zombie vms shouldn’t trigger failovers? They should just be ignored… | 23:03 |
johnsom | Right, that is what I am working on. Either we ignore them or we try to kill them and ignore them | 23:04 |
xgerman_ | ah, ok, got it | 23:04 |
johnsom | I think I will do one patch that just marks it busy so they don't attempt to failover. This will be clean for backports as it's a simple change. | 23:11 |
xgerman_ | +1 | 23:11 |
johnsom | We can revisit the "kill the zombies" in another patch if we decide it's a good idea. | 23:12 |
xgerman_ | yeah, I can live with that. | 23:12 |
*** AlexeyAbashkin has joined #openstack-lbaas | 23:12 | |
xgerman_ | +1 | 23:12
xgerman_ | Thanks. Good sleuthing. I really was under the impression we already ignored zombies and didn't make them fail over every time | 23:13
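A sketch of the "just mark it busy" guard agreed on here (names and repository calls are assumptions, not the actual change in the patch proposed a few lines below): when a failover is requested for an amphora the DB already shows as DELETED, mark its health record busy and skip the failover instead of looping on it.

```python
# Sketch only: skip failover for amphorae that are already DELETED (or gone)
# in the DB, and mark them busy so they stop triggering new failovers.
def maybe_failover(amp, health_repo, do_failover, log):
    if amp is None or amp.status == "DELETED":
        log.info("Ignoring failover request for deleted amphora %s; marking "
                 "its health record busy so it is skipped.",
                 getattr(amp, "id", "<unknown>"))
        if amp is not None:
            health_repo.update(amp.id, busy=True)  # assumed repository helper
        return
    do_failover(amp)
```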
*** chandankumar has joined #openstack-lbaas | 23:14 | |
*** AlexeyAbashkin has quit IRC | 23:16 | |
rm_work | k | 23:20 |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Fix health manager edge case with zombie amphora https://review.openstack.org/556682 | 23:45 |
*** chandankumar has quit IRC | 23:53 |