*** mordred has quit IRC | 00:06 | |
*** zigo has quit IRC | 00:06 | |
*** zigo_ has joined #openstack-lbaas | 00:06 | |
*** KeithMnemonic has quit IRC | 00:08 | |
*** mordred has joined #openstack-lbaas | 00:16 | |
*** leitan has quit IRC | 00:18 | |
*** leitan has joined #openstack-lbaas | 00:18 | |
*** sshank has quit IRC | 00:20 | |
*** leitan has quit IRC | 00:22 | |
rm_work | ummmmm | 00:31 |
rm_work | https://review.openstack.org/#/c/570485/4 | 00:31 |
rm_work | failed one each | 00:31 |
rm_work | i've seen similar failures before in noops | 00:32 |
rm_work | not sure how stuff gets stuck in PENDING statuses?? | 00:32 |
rm_work | but it seems to | 00:32 |
rm_work | RMQ failures? | 00:32 |
johnsom | No RMQ with noop.... | 00:32 |
rm_work | digging through o-cw log | 00:32 |
rm_work | err ... yes? | 00:33 |
rm_work | we still go through to worker | 00:33 |
rm_work | and it just noops the compute/net/amp parts | 00:33 |
johnsom | It's not running noop Octavia driver? | 00:33 |
rm_work | ??? | 00:33 |
rm_work | this is an octavia tempest test | 00:33 |
johnsom | Yeah, right, I think that is going through | 00:33 |
johnsom | Looks broken: http://logs.openstack.org/85/570485/4/check/octavia-v2-dsvm-noop-py35-api/5f232ca/controller/logs/screen-o-cw.txt.gz?level=ERROR | 00:34 |
rm_work | this doesn't look good tho: http://logs.openstack.org/85/570485/4/check/octavia-v2-dsvm-noop-py35-api/5f232ca/controller/logs/screen-o-cw.txt.gz#_May_30_00_02_12_274349 | 00:34 |
*** atoth has quit IRC | 00:35 | |
rm_work | hmmm yeah | 00:35 |
rm_work | interesting | 00:35 |
rm_work | is this catching an octavia bug somehow? lol | 00:35 |
rm_work | but why only sometimes | 00:35 |
rm_work | maybe due to the test order / parallel runs? | 00:35 |
johnsom | Oye, I hope not.... | 00:36 |
*** longkb has joined #openstack-lbaas | 00:36 | |
rm_work | i mean it's here: http://logs.openstack.org/85/570485/4/check/octavia-v2-dsvm-noop-py35-api/5f232ca/controller/logs/screen-o-cw.txt.gz?#_May_30_00_03_35_679321 | 00:36 |
rm_work | how would that happen O_o | 00:36 |
*** keithmnemonic[m] has quit IRC | 00:37 | |
rm_work | it tries to fetch the pool info from the DB and it isn't there O_o | 00:38 |
rm_work | how, what | 00:38 |
johnsom | We need to fix that quota log message, it's not printing the object | 00:38 |
rm_work | yeah | 00:39 |
rm_work | i wonder if this is an issue with the new provider stuff ordering the db commit wrong possibly? grasping at straws | 00:39 |
rm_work | cause yeah... you do the "call_provider" before the commit | 00:40 |
rm_work | so that's my guess | 00:40 |
rm_work | we manage to make it through the RMQ call and into the worker and do the processing, before the commit happens | 00:40 |
rm_work | in some instances | 00:40 |
rm_work | :/ | 00:40 |
rm_work | sad panda | 00:40 |
rm_work | https://github.com/openstack/octavia/blob/master/octavia/api/v2/controllers/pool.py#L201-L207 | 00:41 |
rm_work | is it OK to move the driver call *after* the commit? | 00:41 |
rm_work | actually, it should definitely be after, right? because if the driver call fails, we don't want to rollback still IMO | 00:42 |
rm_work | it'd be direct-to-error? | 00:42 |
rm_work | or is that not the design | 00:42 |
johnsom | Hmmm, well, we would have to go back to the "ERROR" status thing. | 00:42 |
rm_work | so... yeah. we have to IMO | 00:42 |
rm_work | since we really can't call out to the provider without having the db updated yet T_T | 00:42 |
johnsom | I switched this around to try to enroll the provider call in the DB transaction, so if the driver failed we would roll back the whole request | 00:43 |
rm_work | we're going to have this race | 00:43 |
rm_work | :/ | 00:43 |
johnsom | We, as in Octavia driver.... | 00:43 |
rm_work | how can a driver that relies on the DB function then | 00:43 |
rm_work | amphora driver, yeah | 00:43 |
johnsom | No driver should rely on the DB | 00:44 |
rm_work | hmmm | 00:44 |
rm_work | so amp driver needs to be rewritten then <_< | 00:44 |
rm_work | because it does | 00:44 |
johnsom | They should get everything from the provider call | 00:44 |
rm_work | i guess we need to rewrite the amp driver to use the interface? | 00:44 |
johnsom | Yes, ultimately, that is true | 00:44 |
rm_work | that means passing that data over rmq | 00:44 |
rm_work | instead of just an ID | 00:44 |
johnsom | I had hoped to avoid that | 00:45 |
rm_work | well, you see the problem with avoiding that <_< | 00:45 |
johnsom | At least for now, just because that is a lot of change too | 00:45 |
rm_work | ok... so. | 00:45 |
rm_work | we can't leave it like this :( | 00:45 |
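The race being described, in miniature: the API controller opens a DB transaction, creates the pool row, calls the provider driver (which, for the amphora driver, casts over RMQ to octavia-worker), and only then commits, so the worker can occasionally query for the pool before it is visible. The snippet below is a toy, self-contained reproduction of that ordering, not Octavia code; it only illustrates how an uncommitted row is invisible to a second connection.

```python
import sqlite3
import threading
import time

DB = 'race_demo.sqlite'

setup = sqlite3.connect(DB)
setup.execute('CREATE TABLE IF NOT EXISTS pool (id TEXT)')
setup.commit()
setup.close()


def worker(pool_id):
    # Stands in for octavia-worker handling the RMQ cast from the API.
    conn = sqlite3.connect(DB)
    row = conn.execute('SELECT id FROM pool WHERE id = ?',
                       (pool_id,)).fetchone()
    print('worker sees:', row)  # None -> the "pool not found" failure
    conn.close()


api = sqlite3.connect(DB)
api.execute("INSERT INTO pool (id) VALUES ('pool-1')")  # transaction open
t = threading.Thread(target=worker, args=('pool-1',))
t.start()             # the "call_provider" step happens before the commit
time.sleep(0.1)
api.commit()          # the pool only becomes visible to the worker here
t.join()
api.close()
```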
johnsom | Add sleeps? lol | 00:47 |
rm_work | what do you suggest BESIDES that | 00:47 |
* rm_work slaps johnsom | 00:47 | |
rm_work | no hacks :P | 00:47 |
johnsom | Poll the DB? | 00:47 |
johnsom | Sorry, still wacky from being sick | 00:47 |
rm_work | ugh, retry decorator? | 00:47 |
rm_work | you got sick? :( | 00:47 |
rm_work | i mean we could do a retry for like 5s or something dumb | 00:48 |
rm_work | temporarily | 00:48 |
*** leitan has joined #openstack-lbaas | 00:48 | |
rm_work | erugh | 00:48 |
rm_work | *dudhadhuwoefhwf | 00:48 |
johnsom | Yeah, I got a sinus infection from the trip. Burned my whole weekend | 00:48 |
rm_work | i said "temporarily", which means shitty hack workaround | 00:48 |
rm_work | sad :( | 00:48 |
rm_work | I drove to Boise and back, which burned a lot of mine :P | 00:48 |
johnsom | Yep, hope the wedding was fun | 00:48 |
rm_work | nah, not a wedding. just meeting up with some folks | 00:49 |
johnsom | So..... Yeah, I think the driver code is *right*, it's our amphora side that is *wrong* | 00:49 |
johnsom | Ah, I thought you were going to a wedding for some reason. | 00:49 |
rm_work | but whatever meds you're on, I want some | 00:49 |
johnsom | None sadly | 00:49 |
rm_work | lol | 00:50 |
rm_work | yes, I guess I agree then | 00:50 |
rm_work | so | 00:50 |
* johnsom sees the problem | 00:50 | |
rm_work | i don't see a solution besides passing stuff through RMQ | 00:50 |
rm_work | the whole kaboodle | 00:50 |
johnsom | Well, it's either that or do the DB polling on the controller side. | 00:50 |
rm_work | that's really shitty :/ | 00:51 |
rm_work | it's impossible to *prove* it works | 00:51 |
johnsom | Right, long term, it should be using the data it gets in the provider call | 00:53 |
*** annp has quit IRC | 00:53 | |
*** annp has joined #openstack-lbaas | 00:54 | |
rm_work | so you think... what? we temporary-hack-workaround it to just do a retry on all the DB loads? | 00:54 |
rm_work | if it's not found during a create? | 00:54 |
rm_work | I GUESS that's kinda ok? since for a create, we really should be able to assume that it's got to be there sooner or later | 00:55 |
johnsom | It really comes down to who has time to do the long term solution vs. slapping in a workaround. | 00:56 |
rm_work | k well, at the moment this is scary | 00:57 |
rm_work | i couldn't put this in prod, and our gates are going to be unreliable | 00:57 |
*** leitan has quit IRC | 00:58 | |
*** leitan has joined #openstack-lbaas | 00:58 | |
*** harlowja has quit IRC | 01:00 | |
*** leitan has quit IRC | 01:00 | |
*** leitan has joined #openstack-lbaas | 01:00 | |
*** keithmnemonic[m] has joined #openstack-lbaas | 01:05 | |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Implement provider drivers - Cleanup https://review.openstack.org/567431 | 01:07 |
johnsom | rm_work the gotcha with the long term solution is the code expects a DB model there, just as that failure was calling, which doesn't exist with the provider driver method. We don't have a separate DB for the amphora driver yet. | 01:16 |
rm_work | yeah so it's kinda a rewrite | 01:18 |
johnsom | Tomorrow I will cook up the polling thing. | 01:21 |
rm_work | k i mean | 01:22 |
rm_work | i assume it's just a DB retry | 01:22 |
rm_work | using like | 01:22 |
rm_work | tenacity | 01:22 |
rm_work | if it throws the DB error | 01:22 |
rm_work | i can prolly do it too | 01:22 |
johnsom | Right, a decorator that checks the repo get result to see if it is empty, if so, try again. like up to 30 seconds or something | 01:22 |
johnsom | It won't throw an exception, it's just a None object back | 01:23 |
rm_work | yeah, i mean, I could make it throw one :P | 01:23 |
johnsom | Then go down the controller_worker, and wrap those first "get" calls | 01:23 |
johnsom | That would be nasty as tons of things use those repo get calls | 01:24 |
rm_work | not in the repo-get | 01:24 |
rm_work | in the controller_worker create_* | 01:24 |
johnsom | Oh | 01:24 |
johnsom | Yeah. | 01:24 |
johnsom | And it would only be the create calls, the others should be fine. I think | 01:25 |
johnsom | maybe | 01:25 |
johnsom | Ok, I need to run sadly | 01:25 |
rm_work | kk | 01:26 |
rm_work | yes | 01:26 |
rm_work | updates will be fine | 01:27 |
rm_work | and deletes obv | 01:27 |
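Sketching the workaround just described, assuming tenacity is used (it already is elsewhere in Octavia); the helper name is made up and this is not the actual controller_worker code. The idea is simply to retry the initial repo get while it keeps returning None, for a bounded time.

```python
import tenacity


# Retry while the repo get returns None, once a second, for up to ~30s.
# If the row never appears, tenacity raises RetryError after the stop
# condition, which is where the "database is broken" log would go.
@tenacity.retry(
    retry=tenacity.retry_if_result(lambda result: result is None),
    wait=tenacity.wait_fixed(1),
    stop=tenacity.stop_after_delay(30))
def _get_created_object(repo, session, obj_id):
    # Returns None until the API side's transaction has committed.
    return repo.get(session, id=obj_id)
```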
*** mordred has quit IRC | 01:31 | |
*** hongbin has joined #openstack-lbaas | 01:38 | |
*** mordred has joined #openstack-lbaas | 01:45 | |
*** leitan has quit IRC | 02:38 | |
*** leitan has joined #openstack-lbaas | 02:39 | |
*** leitan has quit IRC | 02:43 | |
*** harlowja has joined #openstack-lbaas | 03:29 | |
*** hongbin has quit IRC | 03:53 | |
*** links has joined #openstack-lbaas | 04:11 | |
*** annp has quit IRC | 04:13 | |
*** harlowja has quit IRC | 04:14 | |
*** blake has joined #openstack-lbaas | 04:23 | |
*** JudeC has joined #openstack-lbaas | 05:10 | |
*** eandersson has joined #openstack-lbaas | 05:10 | |
*** kobis has joined #openstack-lbaas | 05:19 | |
*** JudeC has quit IRC | 05:22 | |
*** JudeC has joined #openstack-lbaas | 05:24 | |
*** kobis has quit IRC | 05:37 | |
*** kobis has joined #openstack-lbaas | 05:40 | |
*** kobis has quit IRC | 05:41 | |
*** eandersson has quit IRC | 05:48 | |
*** eandersson has joined #openstack-lbaas | 05:49 | |
*** kobis has joined #openstack-lbaas | 06:11 | |
*** blake has quit IRC | 06:13 | |
*** imacdonn has quit IRC | 06:16 | |
*** imacdonn has joined #openstack-lbaas | 06:16 | |
rm_work | ugh a lot of our docstrings are hilariously wrong in controller_worker | 06:29 |
*** AlexeyAbashkin has joined #openstack-lbaas | 06:32 | |
*** pcaruana has joined #openstack-lbaas | 06:33 | |
*** JudeC__ has joined #openstack-lbaas | 06:40 | |
*** JudeC_ has quit IRC | 06:41 | |
openstackgerrit | Kobi Samoray proposed openstack/octavia master: Octavia devstack plugin API mode https://review.openstack.org/570924 | 06:48 |
*** annp has joined #openstack-lbaas | 06:52 | |
*** apple01 has joined #openstack-lbaas | 06:54 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 07:00 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for healthmonitors https://review.openstack.org/567688 | 07:01 |
*** apple01 has quit IRC | 07:17 | |
*** tesseract has joined #openstack-lbaas | 07:20 | |
*** rcernin has quit IRC | 07:27 | |
*** kobis has quit IRC | 07:34 | |
*** kobis has joined #openstack-lbaas | 07:38 | |
*** yboaron has joined #openstack-lbaas | 07:39 | |
*** apple01 has joined #openstack-lbaas | 07:46 | |
openstackgerrit | ZhaoBo proposed openstack/octavia master: UDP for [3][5][6] https://review.openstack.org/571120 | 07:46 |
*** apple01 has quit IRC | 07:51 | |
*** JudeC__ has quit IRC | 08:10 | |
*** kobis has quit IRC | 08:12 | |
*** kobis has joined #openstack-lbaas | 08:13 | |
*** jiteka has quit IRC | 08:13 | |
*** nmanos has joined #openstack-lbaas | 08:19 | |
*** Alexey_Abashkin has joined #openstack-lbaas | 08:50 | |
*** Alexey_Abashkin has quit IRC | 08:51 | |
*** AlexeyAbashkin has quit IRC | 08:51 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 08:52 | |
*** apple01 has joined #openstack-lbaas | 08:56 | |
*** apple01 has quit IRC | 09:01 | |
*** salmankhan has joined #openstack-lbaas | 09:14 | |
*** yamamoto has quit IRC | 09:21 | |
*** salmankhan has quit IRC | 09:25 | |
*** kobis has quit IRC | 09:26 | |
*** JudeC_ has joined #openstack-lbaas | 09:28 | |
*** salmankhan has joined #openstack-lbaas | 09:35 | |
*** salmankhan has quit IRC | 09:42 | |
*** salmankhan has joined #openstack-lbaas | 09:47 | |
*** yamamoto has joined #openstack-lbaas | 09:50 | |
*** zigo_ is now known as zigo | 09:56 | |
*** annp has quit IRC | 10:01 | |
*** apple01 has joined #openstack-lbaas | 10:06 | |
*** kobis has joined #openstack-lbaas | 10:14 | |
*** apple01 has quit IRC | 10:14 | |
*** JudeC_ has quit IRC | 10:15 | |
*** AlexeyAbashkin has quit IRC | 10:30 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 10:31 | |
*** AlexeyAbashkin has quit IRC | 10:36 | |
*** apple01 has joined #openstack-lbaas | 11:04 | |
*** yboaron has quit IRC | 11:04 | |
*** apple01 has quit IRC | 11:08 | |
*** apple01 has joined #openstack-lbaas | 11:20 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 11:21 | |
*** yamamoto has quit IRC | 11:29 | |
*** longkb has quit IRC | 11:40 | |
*** apple01 has quit IRC | 11:42 | |
*** apple01 has joined #openstack-lbaas | 11:49 | |
*** yamamoto has joined #openstack-lbaas | 12:07 | |
*** leitan has joined #openstack-lbaas | 12:09 | |
*** atoth has joined #openstack-lbaas | 12:16 | |
*** amuller has joined #openstack-lbaas | 12:23 | |
*** apple01 has quit IRC | 12:44 | |
*** yboaron has joined #openstack-lbaas | 12:45 | |
*** sajjadg has joined #openstack-lbaas | 12:59 | |
*** samccann has joined #openstack-lbaas | 13:00 | |
*** yamamoto has quit IRC | 13:04 | |
*** leitan has quit IRC | 13:11 | |
*** leitan has joined #openstack-lbaas | 13:26 | |
*** leitan has quit IRC | 13:26 | |
*** links has quit IRC | 13:29 | |
*** links has joined #openstack-lbaas | 13:30 | |
*** yamamoto has joined #openstack-lbaas | 13:35 | |
*** fnaval has joined #openstack-lbaas | 13:36 | |
*** yamamoto has quit IRC | 13:39 | |
*** links has quit IRC | 13:44 | |
*** AlexeyAbashkin has quit IRC | 14:02 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 14:04 | |
*** kobis has quit IRC | 14:06 | |
*** yboaron_ has joined #openstack-lbaas | 14:25 | |
*** yboaron has quit IRC | 14:28 | |
*** yboaron_ has quit IRC | 14:37 | |
*** yboaron_ has joined #openstack-lbaas | 14:38 | |
*** yboaron_ has quit IRC | 14:53 | |
*** kobis has joined #openstack-lbaas | 14:55 | |
*** apple01 has joined #openstack-lbaas | 15:07 | |
*** rpittau has joined #openstack-lbaas | 15:27 | |
*** sajjadg has quit IRC | 15:30 | |
*** pcaruana has quit IRC | 15:33 | |
amuller | Random Q of the day - When issuing the admin failover command to a loadbalancer in active_standby topology, I see that both amphorae IDs changed. Is this expected? | 15:44 |
*** jiteka has joined #openstack-lbaas | 15:45 | |
*** jiteka has quit IRC | 15:47 | |
*** jiteka has joined #openstack-lbaas | 15:48 | |
amuller | it seems like both the active and standby amphorae were killed and new ones were spawned | 15:48 |
*** apple01 has quit IRC | 15:51 | |
xgerman_ | yep, that is expected | 15:53 |
xgerman_ | the idea is to recycle both amps in order to update the amphora image for example | 15:54 |
xgerman_ | the amphora API has a more granular failover | 15:54 |
xgerman_ | ^^ amuller | 15:54 |
amuller | mhmm | 15:54 |
amuller | so how do I know that a keepalived based failover happened, via the API? | 15:55 |
xgerman_ | yeah, failover on lb might mean something different for F5 :-) | 15:55 |
amuller | (thinking of testing) | 15:55 |
amuller | I mean, how do I know that the old backup is now the active? | 15:55 |
johnsom | amuller If you want to failover only one of the pair, you can use the amphora failover API. | 15:56 |
xgerman_ | well, I think you mean the active dies and the passive takes over | 15:56 |
amuller | right | 15:56 |
xgerman_ | you can “simulate” that with a nova delete on the ACTIVE | 15:56 |
xgerman_ | or a port down, or… | 15:57 |
amuller | admin state down on the neutron port of the active? | 15:57 |
johnsom | As for which one is currently "MASTER" in the VRRP pair, we don't currently expose that outside the amphora. They manage that completely autonomously to the control plane. | 15:57 |
*** nmanos has quit IRC | 15:57 | |
xgerman_ | amuller: yes, the vrrp port | 15:57 |
amuller | johnsom: ack | 15:57 |
johnsom | It also comes down to keepalived not reliably exposing the status of a given instance. | 15:58 |
amuller | so I'm trying to write a scenario test for failover with active_standby topology | 15:58 |
xgerman_ | yeah, I think in the grander scheme of things killing the amp you think is master will definitely have the other amp take over | 15:59 |
amuller | so I'm trying to find an easy way (using the API) to see that a failover happened | 15:59 |
xgerman_ | then you check via the nova or amp api if there is a new amp | 15:59 |
johnsom | Adam had started one: https://review.openstack.org/#/c/501559/ | 15:59 |
johnsom | Have you looked at that. | 15:59 |
xgerman_ | while you check that traffic kept flowing | 15:59 |
amuller | oh | 15:59 |
amuller | I had not | 15:59 |
johnsom | Yeah, the amphora IDs will change, which is queriable via the amphora admin API. That is one clue | 16:00 |
johnsom | I think that patch is pretty out of date now, but could be a starting place. | 16:00 |
amuller | mhmm, that patch has a larger scope than I was hoping to put up for review | 16:03 |
amuller | I was hoping for a simpler patch, leaving traffic flow out of it | 16:04 |
amuller | just using the API to see that a failover happened | 16:04 |
amuller | xgerman_: if I find the nova VM for the active amphora and kill it, I imagine octavia will spawn a new amphora, but will the old standby become the new active, without octavia spawning a new 2nd amphora? | 16:09 |
amuller | in other words if I kill the nova VM for the active amphora, and I wait a bit and issue an amphorae list for the LB, will I see two new amphorae or only one new one? | 16:10 |
johnsom | Only one new one | 16:10 |
xgerman_ | +1 | 16:11 |
xgerman_ | passive->active runs “local” on each amp pair and independently we replace broken/deleted amps with the control plane | 16:11 |
johnsom | When you kill one amp by nova delete, the other will automatically assume MASTER if it wasn't. Then health manager will rebuild the other amphora and it will assume the BACKUP role. | 16:11 |
johnsom | MASTER and BACKUP should not be confused with the database "ROLE" which is for build time settings and not an indication of which amp is playing what role. | 16:12 |
xgerman_ | +1 — once the pair is live it can do whatever it wants | 16:13 |
johnsom | I actually had a demo video of doing this via horizon I had in case I ran short on my presentations at the summit | 16:13 |
amuller | oh, ok good note on the database role | 16:13 |
johnsom | Yeah, that confuses people. It's for settings, not for current status | 16:14 |
amuller | yeah it's confusing =p | 16:14 |
amuller | so to double check there's no way to use the admin API to find out who is the active one (per keepalived)? | 16:14 |
johnsom | Correct. Keepalived doesn't provide a way to ask this that is reliable enough to use. | 16:15 |
johnsom | Unless they have added it in the last year or two. | 16:15 |
amuller | when I did the l3 HA work in neutron | 16:15 |
amuller | because keepalived didn't at the time allow you to query it for its status | 16:15 |
amuller | (I don't know if that changed) | 16:15 |
amuller | I wrote keepalived-state-change-monitor | 16:15 |
amuller | which is a little python process that uses 'ip monitor' to detect IP address changes | 16:16 |
xgerman_ | mmh, maybe we can incorporate that | 16:16 |
amuller | and when it does, it notifies the L3 agent via a unix domain socket | 16:16 |
amuller | and the L3 agent then notifies neutron-server via RPC | 16:16 |
amuller | which updates the DB | 16:16 |
amuller | so then operators can use an admin API to see where the active router replica is | 16:16 |
xgerman_ | yeah, we can extend the health messages to stream it into the DB | 16:17 |
johnsom | Hmm, well, we have a status/diagnostic API on the amps, I think that would be fine to query and not have yet another status to update in the DB. | 16:17 |
amuller | yeah, I was just commenting on a way to figure out keepalived state changes | 16:18 |
johnsom | Also, I think you will find there are some oddities in keepalived that on initial startup both amps will see the VIP IP but keepalived will only be GARPing from one of them. | 16:18 |
amuller | not how you'd model it in the API | 16:18 |
johnsom | This is one of the "reliable" issues we hit | 16:18 |
amuller | hmm, I didn't see that in L3 HA | 16:19 |
amuller | I know that we configure keepalived differently | 16:19 |
amuller | in neutron and in octavia | 16:19 |
johnsom | Yeah, I think it might be how we are bringing up the interfaces. It wasn't a "problem" so we never went back and tried to fix it | 16:19 |
*** kobis has quit IRC | 16:19 | |
johnsom | We considered log scraping, but that proved to not be accurate. We considered doing the status dump to /tmp using the signal, but that required a newer version than some distros have. etc. | 16:20 |
johnsom | Just wasn't a priority to run to ground as it doesn't provide a lot of useful information, just some nice to have info for failover. | 16:21 |
johnsom | I think Ryan brought up that the right answer was likely listening on the dbus, but again, a good chunk of work | 16:21 |
amuller | I'm not familiar with the status dump using a signal | 16:22 |
xgerman_ | so far the amp (diagnostic) API is not exposed in the best way — that’s why I was thinking health messages… I think one day I will write a way to proxy amp-api->diagnostic-api | 16:22 |
*** pcaruana has joined #openstack-lbaas | 16:23 | |
amuller | so it seems like right now there's no way to test failover in active_standby topology, if I want to stay in the realm of the API | 16:23 |
johnsom | What do you mean it's not exposed in the best way? It works just fine. I have no idea why you would need a proxy | 16:23 |
johnsom | Sure there is, use the amphora admin API to check the amphora IDs | 16:24 |
xgerman_ | you need to curl the amp directly or not? | 16:25 |
johnsom | not for the amphora admin API | 16:25 |
johnsom | https://developer.openstack.org/api-ref/load-balancer/v2/index.html#show-amphora-details | 16:25 |
johnsom | I guess list would be more useful: https://developer.openstack.org/api-ref/load-balancer/v2/index.html#list-amphora | 16:26 |
johnsom | Filter by LB ID | 16:26 |
johnsom | I am pretty sure this is how Adam's failover patch is doing it | 16:26 |
amuller | that tells me that octavia spawned new amphoras, does it answer the question of 'did the old backup become the new active'? | 16:27 |
johnsom | You can test that by making sure traffic still flows during the failover | 16:27 |
amuller | I don't really wanna do that =p | 16:28 |
amuller | it makes the test more complicated than it needs to be | 16:28 |
amuller | so, even with active_standby you'll see some downtime | 16:28 |
johnsom | VRRP is designed to be autonomous to the control plane. It can switch at any time for failures the control plane doesn't know about. | 16:28 |
amuller | so how long of a downtime would you set to be tolerable? | 16:28 |
amuller | it needs to be shorter than a test with standalone topology | 16:28 |
johnsom | Yeah, depending on how it's tuned it's around a second, usually less | 16:29 |
amuller | it's gonna be difficult to write it in a way that's reliable at the gate and in loaded environments | 16:29 |
amuller | but with a low enough timeout so that the test is still meaningful | 16:29 |
*** salmankhan has quit IRC | 16:29 | |
xgerman_ | johnsom: I was thinking proxying https://docs.openstack.org/octavia/latest/contributor/api/haproxy-amphora-api.html — but you probably guessed that :-) | 16:30 |
johnsom | Yeah, the hierarchy is: systemd recovery, sub second; active/standby, around a second; failover, under a minute (depends on the cloud infrastructure and your octavia config, typically 30 seconds). | 16:30 |
amuller | so let's say we do this by pinging the LB | 16:32 |
johnsom | amuller Here is a demo I showed at the tokyo summit: https://youtu.be/8n7FGhtOiXk?t=23m52s | 16:32 |
amuller | with some timeout | 16:33 |
amuller | how do I know which VM to kill | 16:33 |
amuller | because I don't know who the active amphora is | 16:33 |
johnsom | Right, you won't, you have to cycle through killing them both, one at a time, to prove an active/standby transition | 16:33 |
xgerman_ | they could switch in between — so you are probably left with “good enough” | 16:34 |
amuller | so I pick one at random, kill it, assert < timeout outage | 16:34 |
amuller | I could have killed either the active or the standby | 16:34 |
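For reference, the detection approach described above can be sketched as follows. The endpoint path and response shape follow the amphora admin API reference linked earlier, but the endpoint/token plumbing here is hypothetical: list the amphorae for the load balancer before and after deleting one VM, and exactly one ID should have changed once the health manager has rebuilt the failed amphora.

```python
import requests


def get_amphora_ids(octavia_endpoint, token, lb_id):
    # GET /v2/octavia/amphorae?loadbalancer_id=<id>  (amphora admin API)
    resp = requests.get(
        octavia_endpoint + '/v2/octavia/amphorae',
        params={'loadbalancer_id': lb_id},
        headers={'X-Auth-Token': token})
    resp.raise_for_status()
    return {amp['id'] for amp in resp.json()['amphorae']}

# before = get_amphora_ids(endpoint, token, lb_id)
# ... nova delete one of the two amphora VMs, wait for the LB to go ACTIVE ...
# after = get_amphora_ids(endpoint, token, lb_id)
# assert len(before - after) == 1 and len(after - before) == 1
```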
rm_work | amuller: i am mid failover-scenario right now (started on it yesterday) | 16:34 |
rm_work | updating my old patch | 16:35 |
amuller | oh hah ok | 16:35 |
xgerman_ | sweet | 16:35 |
rm_work | about to push up the amphora client patch | 16:35 |
amuller | well, good thing I asked =p | 16:35 |
rm_work | just didn't get the testing quite right last night | 16:35 |
amuller | alrighty well | 16:36 |
amuller | is there another area where there are glaring testing coverage issues? | 16:36 |
amuller | one thing that pops to mind is the amphora show and list API | 16:36 |
amuller | I don't think I saw tests for that in the api subdir | 16:36 |
amuller | is there a doc or some such ya'all are using to track progress on test coverage in the octavia tempest plugin? | 16:37 |
rm_work | amuller: that's what i meant | 16:38 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for l7policies https://review.openstack.org/570482 | 16:39 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for l7rules https://review.openstack.org/570485 | 16:39 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for amphora https://review.openstack.org/571251 | 16:40 |
rm_work | ^^ there | 16:40 |
rm_work | not done but i'll just push it so it'll be obvious what i mean | 16:40 |
amuller | +1 | 16:41 |
amuller | rm_work: https://review.openstack.org/#/c/571251/1/octavia_tempest_plugin/tests/api/v2/test_amphora.py@32 you prolly want a different class name =p | 16:42 |
amuller | anyhow | 16:42 |
rm_work | yes :P | 16:42 |
rm_work | i wasn't really ready to push yet but ;P | 16:43 |
rm_work | one of the tests is still just an unmodified clone, lol | 16:43 |
amuller | still curious if there's an effort to track test coverage in the octavia tempest plugin | 16:44 |
amuller | on my way out for lunch, we can sync later :) | 16:44 |
rm_work | yes | 16:44 |
rm_work | https://storyboard.openstack.org/#!/story/2001387 | 16:44 |
*** jcarpentier has joined #openstack-lbaas | 16:44 | |
rm_work | i think it is missing the amp piece | 16:45 |
amuller | ah ha :) | 16:45 |
amuller | that's quite a story hehe | 16:45 |
*** kobis has joined #openstack-lbaas | 16:45 | |
*** JudeC_ has joined #openstack-lbaas | 16:46 | |
rm_work | johnsom: did you see my attempts at the db retry patch? | 16:52 |
johnsom | Haven't had a chance yet this morning | 16:53 |
rm_work | https://review.openstack.org/#/c/571107/ | 16:53 |
* rm_work shrugs | 16:54 | |
rm_work | i can tweak that stuff a bunch if we want to do like | 16:55 |
rm_work | wait 2, retry count = 5 | 16:55 |
rm_work | instead of wait 1, retry until 10s | 16:55 |
*** kobis has quit IRC | 16:59 | |
*** tesseract has quit IRC | 17:10 | |
johnsom | rm_work So this looks fine. I would make those numbers either config or a constant somewhere (could be at the top of the file). I think 1 is the right answer, but maybe set the top end at 30. Just in case there is a DB that takes a crazy long time to commit. | 17:28 |
*** AlexeyAbashkin has quit IRC | 17:29 | |
johnsom | It doesn't really handle the case of walking off the end of the retries in a nice way. It's not like we could do much about that (can't set ERROR obviously), but a log message with details would be nice. | 17:30 |
*** AlexeyAbashkin has joined #openstack-lbaas | 17:31 | |
johnsom | FYI, someone on the dev mailing list was asking about backend re-encryption | 17:32 |
*** jcarpentier has left #openstack-lbaas | 17:35 | |
*** AlexeyAbashkin has quit IRC | 17:36 | |
rm_work | johnsom: right, if it fails right now the exception makes just as little sense, i think | 17:39 |
rm_work | at least this one is representative of the issue | 17:39 |
rm_work | but in no case is there anything to be done :( | 17:40 |
rm_work | I guess I could put in a log like "retrying" | 17:40 |
rm_work | one sec | 17:40 |
johnsom | I probably would have put the retry loop inside and had a log message that we "lost this object" and that the database is broken | 17:40 |
rm_work | I mean, I could just catch the existing error, but it's not *necessarily* as explicit | 17:41 |
rm_work | the AttributeError | 17:42 |
rm_work | but that'd also be easier | 17:42 |
rm_work | what about: | 17:45 |
rm_work | wait=tenacity.wait_incrementing(2, 2) | 17:46 |
rm_work | stop=tenacity.stop_after_attempt(10) | 17:46 |
johnsom | rm_work Hmm, there are some nice backoff options in the tenacity docs (trying to figure out what increment does, which of course isn't documented). It also has an "after failed" log option that would work here | 17:49 |
johnsom | https://github.com/jd/tenacity | 17:49 |
*** kobis has joined #openstack-lbaas | 17:49 | |
rm_work | yes | 17:49 |
rm_work | i don't really want to go exponential | 17:50 |
johnsom | yeah, agreed | 17:50 |
rm_work | ok actually, maybe I should just do it like... | 17:51 |
johnsom | It should land quickly under normal operation, so adding too much delay between attempts just slows the whole thing down. | 17:51 |
rm_work | retry on the natural AttributeError, and then reraise after the last retry | 17:52 |
rm_work | and not increment the delay much | 17:52 |
rm_work | but meh... | 17:54 |
rm_work | hmmm | 17:54 |
*** AlexeyAbashkin has joined #openstack-lbaas | 17:57 | |
*** Alexey_Abashkin has joined #openstack-lbaas | 18:00 | |
*** AlexeyAbashkin has quit IRC | 18:02 | |
*** Alexey_Abashkin is now known as AlexeyAbashkin | 18:02 | |
*** pcaruana has quit IRC | 18:09 | |
rm_work | lol, retry_error_callback is SUPER NEW | 18:11 |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Implement provider drivers - Cleanup https://review.openstack.org/567431 | 18:12 |
rm_work | so not sure it's safe to use yet | 18:12 |
johnsom | ok | 18:12 |
johnsom | ^^^ just a docs update to clarify some questions Kobi raised this morning | 18:12 |
rm_work | yeah k | 18:13 |
*** kobis has quit IRC | 18:18 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 18:21 |
rm_work | johnsom: revised ^^ | 18:21 |
rm_work | erk off by one | 18:21 |
rm_work | i was aiming for 1+2+3+4+5*11 = 60 | 18:21 |
rm_work | but i need 15 not 14 for that | 18:22 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 18:23 |
rm_work | as this is a *workaround*, I really wanted to avoid adding a bunch more config (that we'd have to deprecate, to boot) | 18:23 |
johnsom | Yeah, just throw a constant at the top of the file | 18:26 |
rm_work | ah, sure | 18:27 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 18:29 |
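Not the actual patch, but the shape being converged on reads roughly like this: module-level constants rather than new config options, retrying on the AttributeError that a missing DB object naturally triggers, an incrementing wait, and reraise so the original exception surfaces if every attempt fails. tenacity's wait_incrementing grows the delay by `increment` per attempt and caps it at `max`, so 1s growing by 1s, capped at 5s, over 15 attempts gives roughly the 60 seconds mentioned above. All names and values here are illustrative.

```python
import tenacity

# Hypothetical constants standing in for the values discussed above.
CREATE_RETRY_INITIAL_DELAY = 1   # first wait, in seconds
CREATE_RETRY_BACKOFF = 1         # grow the wait by this much per attempt
CREATE_RETRY_MAX_WAIT = 5        # cap on the per-attempt wait
CREATE_RETRY_ATTEMPTS = 15       # 14 waits: 1+2+3+4 + 5*10 = 60 seconds

# Decorator applied to the controller_worker create_* methods; the create
# flows raise AttributeError when the initial DB get returns None and its
# attributes are accessed, which is the error being retried here.
db_create_retry = tenacity.retry(
    retry=tenacity.retry_if_exception_type(AttributeError),
    wait=tenacity.wait_incrementing(
        CREATE_RETRY_INITIAL_DELAY,
        CREATE_RETRY_BACKOFF,
        CREATE_RETRY_MAX_WAIT),
    stop=tenacity.stop_after_attempt(CREATE_RETRY_ATTEMPTS),
    reraise=True)   # surface the original AttributeError if we give up
```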
*** AlexeyAbashkin has quit IRC | 18:40 | |
*** salmankhan has joined #openstack-lbaas | 18:53 | |
*** atoth has quit IRC | 18:55 | |
*** salmankhan has quit IRC | 18:57 | |
johnsom | One minor comment about the log message, otherwise good for me | 19:17 |
johnsom | And I bet lower constraints is wrong | 19:18 |
openstackgerrit | Jan Zerebecki proposed openstack/neutron-lbaas master: Remove eager loading of Listener relations https://review.openstack.org/570596 | 19:19 |
*** salmankhan has joined #openstack-lbaas | 19:46 | |
johnsom | #startmeeting Octavia | 20:00 |
openstack | Meeting started Wed May 30 20:00:07 2018 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot. | 20:00 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 20:00 |
*** openstack changes topic to " (Meeting topic: Octavia)" | 20:00 | |
openstack | The meeting name has been set to 'octavia' | 20:00 |
johnsom | Hi folks | 20:00 |
johnsom | Pretty light agenda today | 20:00 |
cgoncalves | hi | 20:01 |
nmagnezi | o/ | 20:01 |
johnsom | #topic Announcements | 20:01 |
*** openstack changes topic to "Announcements (Meeting topic: Octavia)" | 20:01 | |
johnsom | I don't have much for announcements. The summit was last week. There are lots of good sessions up for viewing on the openstack site. | 20:02 |
johnsom | Octavia was demoed and called out in a Keynote, so yay for that! | 20:02 |
nmagnezi | :-) | 20:02 |
johnsom | We also came up in a number of sessions I attended, so some good buzz | 20:02 |
johnsom | Nir, I think you have an announcement today.... | 20:03 |
nmagnezi | :) | 20:03 |
nmagnezi | I'm happy to announce that TripleO now fully supports Octavia as a One click install. | 20:03 |
nmagnezi | That includes: | 20:03 |
nmagnezi | 1. Octavia services in Docker containers. | 20:03 |
nmagnezi | 2. Creation of the mgmt subnet for amphorae | 20:03 |
nmagnezi | 3. If the user is using RHEL/CentOS based amphora - it will automatically pull an image and load it to glance. | 20:04 |
nmagnezi | Additionally, the SELinux policies for the amphora image are now ready and tested internally. Those policies are available as a part of openstack-selinux package. | 20:04 |
nmagnezi | some pointers (partial list): | 20:04 |
nmagnezi | 1. I'll find the related TripleO docs and provide them next week (what we currently have is release notes ready for Rocky) | 20:04 |
nmagnezi | 2. SELinux: | 20:04 |
nmagnezi | #link https://github.com/redhat-openstack/openstack-selinux/blob/master/os-octavia.te | 20:04 |
xgerman_ | o/ | 20:05 |
nmagnezi | Many people were involved with this effort ( cgoncalves amuller myself bcafarel and more) . And now we can fully support Octavia as an OSP13 component (Based on Queens). | 20:05 |
johnsom | Ok, cool! So glad to see SELinux enabled for the amps | 20:05 |
xgerman_ | +1 | 20:05 |
johnsom | +1000 There are people looking for it. I also mentioned OSP 13 a number of times at the summit | 20:06 |
amuller | I do expect this to drive up Octavia adoption pretty significantly | 20:06 |
xgerman_ | :-) | 20:06 |
nmagnezi | yup. I'm sure operators that are still using the legacy n-lbaas with haproxy will migrate | 20:07 |
cgoncalves | it takes less than 2 people to deploy octavia with tripleo now | 20:07 |
johnsom | Way to go RH/TripleO folks. I know it's been a journey, but it is great to have the capability. | 20:07 |
johnsom | 2 people? I thought it was one-click? One to click the mouse and one to open the beverages? | 20:07 |
johnsom | grin | 20:08 |
nmagnezi | haha | 20:08 |
cgoncalves | johnsom, I said "less than" ;) one is enough to do both jobs :) | 20:08 |
johnsom | Nice. Any other announcements today? | 20:08 |
*** amuller has quit IRC | 20:09 | |
johnsom | #topic Brief progress reports / bugs needing review | 20:09 |
*** openstack changes topic to "Brief progress reports / bugs needing review (Meeting topic: Octavia)" | 20:09 | |
johnsom | I was obviously a bit distracted with the summit and presentation prep. | 20:10 |
nmagnezi | yup :) | 20:10 |
johnsom | We have started merging the provider driver code and the tempest plugin code. We are co-testing the patches as we go. | 20:10 |
rm_work | o/ | 20:11 |
johnsom | rm_work Did point out a race condition in my amp driver last night, so he is working on fixing that. | 20:11 |
johnsom | It's kind of a migration issue, as I wanted to incrementally migrate the amphora driver over to the new provider driver ways. | 20:12 |
johnsom | Today my focus is on adding the driver library support (update stats/status). | 20:12 |
johnsom | I have been chatting with Kobi in the mornings about the VMWare NSX driver, which it sounds like is in-progress. | 20:12 |
nmagnezi | Nice! | 20:13 |
johnsom | He has been giving feedback that I have been including in the cleanup patch. | 20:13 |
johnsom | Along the lines of drivers, it's not clear when we will get an F5 driver. They have had some re-orgs from what I gather, so it may delay that work. | 20:14 |
nmagnezi | good to know. at least we have feedback from one vendor for now. | 20:15 |
johnsom | Yeah, sad that the vendor that was the key author of the spec may not be able to create their driver right away | 20:15 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 20:15 |
johnsom | Any other progress updates today? | 20:16 |
nmagnezi | yes | 20:17 |
nmagnezi | I had some cycles, so I finished my Rally patch to add support for Octavia. it is ready for feedback now and worked okay for me: https://review.openstack.org/#/c/554228/ | 20:17 |
johnsom | In case you are really bored, here is the link to my project update. Feedback welcome. | 20:17 |
johnsom | #link https://youtu.be/woPaywKYljE | 20:17 |
nmagnezi | johnsom, i watched it. you did great work with this. | 20:18 |
johnsom | Thanks | 20:18 |
johnsom | Rally, cool. I need to look at that and refresh my memory of how those gates work | 20:19 |
nmagnezi | #link https://review.openstack.org/#/c/554228/ | 20:19 |
nmagnezi | johnsom, yeah. the scenario i'm adding now is a port of the existing n-lbaas scenario to Octavia | 20:19 |
johnsom | I'm guessing it is the "rally-task-load-balancing" gate I should look at? | 20:19 |
nmagnezi | next up we can add more stuff | 20:20 |
nmagnezi | johnsom, yup | 20:20 |
johnsom | Cool, I will check it out | 20:20 |
nmagnezi | thanks! | 20:21 |
johnsom | Any other updates? | 20:21 |
johnsom | cgoncalves I saw the grenade gate was failing, but didn't have much time to dig into why. Are you looking into that? | 20:22 |
cgoncalves | johnsom, I submitted a new patch set today (yesterday?) to check what's going on when we curl. it fails post-upgrade | 20:22 |
johnsom | Yeah, odd | 20:23 |
cgoncalves | not sure why yet. it successfully passes the same curl pre and during upgrade | 20:23 |
cgoncalves | it started failing all of a sudden | 20:23 |
johnsom | Well, let us know if we can provide a second set of eyes to look into it. | 20:24 |
johnsom | I really want to get that merged and voting. | 20:24 |
cgoncalves | thanks! | 20:24 |
johnsom | So we can start the climb on upgrade tags | 20:24 |
johnsom | Also, we discussed fast-forward upgrades a bit at the summit. | 20:24 |
cgoncalves | yes! | 20:24 |
johnsom | I'm thinking we need to setup grenade starting with Pike (1.0 release) and have gates for Pike->Queens, Queens->Rocky, etc. to prove we can do a fast-forward upgrade | 20:25 |
johnsom | I guess I am jumping ahead... grin | 20:26 |
johnsom | #topic Open Discussion | 20:26 |
*** openstack changes topic to "Open Discussion (Meeting topic: Octavia)" | 20:26 | |
*** mstrohl has joined #openstack-lbaas | 20:28 | |
johnsom | fast-forward is running each upgrade sequentially to move from an older release to current. This is different from leap-frog which is a direct jump Pike->Rocky. It sounds like fast-forward is going to be the supported plan for OpenStack upgrades | 20:28 |
rm_work | fast forward sounds like... a normal upgrade process | 20:29 |
johnsom | Right, just chained together | 20:29 |
nmagnezi | and... fast :) | 20:29 |
rm_work | unless they add stuff like "you don't have to start/stop the control plane at each stage" | 20:29 |
johnsom | Eh, as long as we have a plan and a test I will be happy. | 20:29 |
rm_work | but yeah | 20:29 |
johnsom | I think it's the script of what is required to do that. | 20:30 |
johnsom | Other topics today? | 20:30 |
cgoncalves | IIUC, we could try that now in the grenade patch by changing the base version from queens to pike: https://review.openstack.org/#/c/549654/33/devstack/upgrade/settings@4 | 20:30 |
johnsom | cgoncalves that would be a leap frog though, right? | 20:31 |
cgoncalves | if there are upgrade issues (e.g. deprecated configs), we create a per version directory with upgrade instructions | 20:31 |
cgoncalves | johnsom, ah right | 20:32 |
johnsom | yeah, we need to write up an upgrade doc that lays out the steps. Maybe from there link to any upgrade issues or just link to the release notes | 20:32 |
* johnsom notices the room going quiet.... | 20:33 | |
johnsom | lol | 20:33 |
nmagnezi | everyone likes docs.. | 20:34 |
nmagnezi | :) | 20:34 |
johnsom | In fairness, I think we can just pull the grenade back to the Queens branch and set it up. This would give us the FFU gates we need. | 20:34 |
johnsom | Once we get it stable on master | 20:34 |
cgoncalves | +1 | 20:35 |
nmagnezi | yup. sounds right | 20:36 |
johnsom | Ok, if there aren't any more topics, have a good week and we will chat next Wednesday! | 20:38 |
rm_work | something is nagging at me about that... but sure, probably | 20:38 |
rm_work | o/ REVIEW STUFF | 20:38 |
johnsom | Yes, please. I really hope to do a client release soon | 20:38 |
johnsom | Would love to get some reviews on that | 20:39 |
johnsom | #endmeeting | 20:40 |
*** openstack changes topic to "Discussion of OpenStack Load Balancing (Octavia) | https://etherpad.openstack.org/p/octavia-priority-reviews" | 20:40 | |
openstack | Meeting ended Wed May 30 20:40:00 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:40 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/octavia/2018/octavia.2018-05-30-20.00.html | 20:40 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/octavia/2018/octavia.2018-05-30-20.00.txt | 20:40 |
openstack | Log: http://eavesdrop.openstack.org/meetings/octavia/2018/octavia.2018-05-30-20.00.log.html | 20:40 |
rm_work | ah yeah i need to look at client again | 20:40 |
johnsom | I think status and timeouts are ready for review. | 20:41 |
johnsom | UDP might be, there were some updates pending on the controller side | 20:41 |
rm_work | i need to get scenario tests done on the l7 stuff so we can merge those and then merge yours | 20:42 |
johnsom | +1 | 20:42 |
rm_work | re: the comments earlier about amps not showing who is master -- i could implement it to update our DB at least the way that I do it in my env | 20:43 |
rm_work | which is, on takeover, keepalived can run a script -- and that script can trigger a health message with the takeover notification | 20:43 |
rm_work | so we could update the db | 20:43 |
* rm_work shrugs | 20:43 | |
johnsom | Well, I have a bunch of concerns about that frankly | 20:43 |
rm_work | it's just a third type of health message, which is like "hey, i'm a master now, do with that what you will" | 20:44 |
rm_work | and the default driver can just be like "ok, updating amps in the db" | 20:44 |
rm_work | it's not guaranteed to be super accurate | 20:45 |
johnsom | Yeah, plus new DB columns, a bunch of latency, etc. Just is it really that useful of info? | 20:45 |
rm_work | but it's better than what it does now | 20:45 |
rm_work | no new db columns | 20:45 |
rm_work | we already have a column for role | 20:45 |
johnsom | Oh yes! You CANNOT change the ROLE column | 20:45 |
rm_work | lol k | 20:45 |
* rm_work does | 20:45 | |
johnsom | That would be a big problem | 20:45 |
rm_work | that's how it is in my prod | 20:45 |
rm_work | have yet to see any problems | 20:45 |
johnsom | That dictates the settings pushed to the amps, like priorities and peers | 20:46 |
rm_work | https://review.openstack.org/#/c/435612/132/octavia/amphorae/drivers/health/flip_utils.py@85 | 20:47 |
johnsom | My guess is your failovers aren't happening right and likely pre-empt the master when a failed amp is brought back in | 20:47 |
johnsom | That column was never intended to be a status column | 20:47 |
rm_work | it wouldn't push new configs to keepalived | 20:47 |
rm_work | unless one of the amps actually gets updated | 20:47 |
johnsom | Right, but that column dictates the settings deployed into the keepalived config file when we build replacements | 20:48 |
rm_work | at which point... it then pushes to both | 20:48 |
rm_work | yeah, i build a replacement BACKUP immediately actually | 20:48 |
rm_work | which gets the new backup settings | 20:48 |
johnsom | We have had this conversation, if I remember right you are ok with the downtime | 20:49 |
rm_work | what downtime? | 20:49 |
rm_work | I mean, we already have some downtime just for the FLIP to move, on a fail | 20:49 |
rm_work | but there's no other downtime involved | 20:49 |
johnsom | right | 20:49 |
johnsom | That eats the keepalived downtime | 20:49 |
rm_work | we always reload keepalived on amp configs | 20:50 |
rm_work | not sure how there would be any different downtime | 20:50 |
xgerman_ | I would rather have https://docs.openstack.org/octavia/latest/contributor/api/haproxy-amphora-api.html#get-amphora-topology working and then relay back to the amp API - my 2cts | 20:50 |
johnsom | xgerman_ I don't think we want anymore "proxy" code right now | 20:51 |
rm_work | eugh, public API endpoint that can trigger an actual callout to dataplane -> not something i want to see happen | 20:51 |
xgerman_ | I said relay — not proxy | 20:51 |
johnsom | <xgerman_> German Eichberger johnsom: I was thinking proxying https://docs.openstack.org/octavia/latest/contributor/api/haproxy-amphora-api.html — but you probably guessed that :-) | 20:52 |
johnsom | That was from this morning. | 20:52 |
xgerman_ | yes, that’s why I said relay — I am learning | 20:53 |
rm_work | i get the idea -- it's not exactly a proxy I think -- but it involves synchronously reaching out to the dataplane | 20:53 |
rm_work | which is meh | 20:53 |
rm_work | i would rather not expose a direct path from user->dataplane | 20:53 |
rm_work | though, it is just admins | 20:54 |
johnsom | yeah, I struggle with the value of it. To me it is a cattle detail hidden from view that can change at any time autonomously from the control plane. | 20:54 |
rm_work | it's essentially useful just for testing | 20:54 |
rm_work | which is at least 3/4 of why i even wrote the amp API | 20:54 |
xgerman_ | if it’s just tetsing they can hit the amp directlly | 20:55 |
rm_work | so i could do stuff in tempest without reaching into the DB | 20:55 |
xgerman_ | I was more thinking for some automation stuff - e.g. change roles (there is also a set) | 20:55 |
rm_work | yes, for pre-fails that would be nice | 20:55 |
rm_work | for the failover call and such | 20:55 |
rm_work | and my evac | 20:56 |
johnsom | It's not like you can force a role change outside of a failure. There is no keepalived call to do that (thank goodness) | 20:56 |
xgerman_ | also seeing amps flap between active-passive is useful | 20:56 |
johnsom | Is it? | 20:57 |
xgerman_ | yes, if it happens every 3s | 20:57 |
johnsom | User wouldn't know | 20:57 |
xgerman_ | the network is wonky | 20:57 |
rm_work | the latency +/- on that would be a good deal of that 3s lol | 20:57 |
xgerman_ | yes, it’s useful for admin/ monitoring | 20:57 |
*** KeithMnemonic has joined #openstack-lbaas | 20:57 | |
xgerman_ | users don’t care… | 20:58 |
johnsom | I think that is a poor metric. It doesn't tell you anything about why or if it is even a bad thing. | 20:58 |
johnsom | If you want that level of deep monitoring, you need the logs that would actually tell you why in addition to if it changed | 20:59 |
johnsom | But again, this is treating them like pets, not cattle | 20:59 |
xgerman_ | I see monitoring as detecting potential error conditions and then investigating through logs | 21:00 |
xgerman_ | but if I would see a ton of flapping and heard that networking did something… | 21:01 |
rm_work | the biggest thing i see it being useful for is tempest failover testing | 21:03 |
johnsom | Again, I think role changes are noise and not an actionable event | 21:03 |
*** sapd_ has joined #openstack-lbaas | 21:03 | |
johnsom | Right, I think the only real use case is testing IMO. Which could be done other ways, like ssh in and dump the status file | 21:04 |
johnsom | Or just do a dual failure to guarantee the initial state | 21:04 |
rm_work | hmm | 21:05 |
xgerman_ | now, to what I actually wanted to do: check zuul for ending jobs — didn’t they have a page I could check by Change-id? | 21:06 |
johnsom | http://zuul.openstack.org/builds.html | 21:06 |
johnsom | Change ID, not so sure | 21:07 |
*** sapd has quit IRC | 21:07 | |
*** samccann has quit IRC | 21:12 | |
*** samccann has joined #openstack-lbaas | 21:14 | |
*** salmankhan has quit IRC | 21:29 | |
*** harlowja has joined #openstack-lbaas | 21:52 | |
lxkong | morning, guys, I have a question, is it possible that the package data sent from amphora to health-manager is in non-ascii format? | 21:57 |
lxkong | we met with some error message in health-manager log: `Health Manager experienced an exception processing a heartbeat message from ('100.64.1.21', 31538). Ignoring this packet. Exception: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128): UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)` | 21:57 |
lxkong | so i made a patch to oslo_utils here https://review.openstack.org/#/c/570826/ | 21:58 |
lxkong | seems the method hmac.compare_digest doesn't support non-ascii | 21:58 |
lxkong | but our issue was solved using this patch | 21:59 |
rm_work | hmmm | 21:59 |
rm_work | that's weird | 21:59 |
johnsom | Yeah, I think it is always non-ascii... It should be raw bytes | 22:00 |
rm_work | ah Ben's link is interesting | 22:00 |
rm_work | `a and b must both be of the same type: either str (ASCII only, as e.g. returned by HMAC.hexdigest()), or a bytes-like object` | 22:00 |
lxkong | yeah, that's from the doc of hmac module | 22:03 |
johnsom | lxkong Do you have the traceback? | 22:04 |
lxkong | johnsom: i don't have traceback now, but according to my debugging, the error happens here: https://github.com/openstack/oslo.utils/blob/0fb1b0aabe100bb36d0e4ad6d5a9f96dd8eb6ff6/oslo_utils/secretutils.py#L39 | 22:05 |
lxkong | and i also printed the data received, it's non-ascii | 22:05 |
johnsom | Yeah, so the data it's comparing is generated here: https://docs.python.org/2/library/hmac.html#hmac.HMAC.digest | 22:07 |
johnsom | Which is clear, it is not ASCII | 22:08 |
lxkong | yeah | 22:08 |
johnsom | We probably should switch to using hmac.compare_digest | 22:08 |
johnsom | () | 22:08 |
rm_work | err | 22:11 |
lxkong | the method has different explanations in Python 2.7.7 and 3.3 | 22:11 |
johnsom | Yeah, I see that | 22:12 |
rm_work | right so | 22:12 |
rm_work | the point of that is because we can't use hmac.compare_digest :P | 22:12 |
lxkong | yeah | 22:12 |
rm_work | it does use it, if it can | 22:12 |
johnsom | Oh, yeah, I see it | 22:13 |
lxkong | do you think we should fix that in octavia? | 22:13 |
rm_work | so it looks like we SHOULD have been using hexdigest() instead of digest() ? | 22:13 |
johnsom | So, right, maybe for the decode we use hexdigest and for the encode use digest? | 22:14 |
rm_work | err | 22:14 |
johnsom | So, don't we still want to pass the hmac in binary over the wire (half the size) but use the hex dump for the compare? | 22:16 |
rm_work | yeah i am trying to find where we actually do the encode/decode | 22:16 |
johnsom | https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py | 22:17 |
lxkong | octavia/amphorae/backends/health_daemon/status_message.py | 22:17 |
johnsom | No, that won't work, it has to be hexdigest all the way around | 22:17 |
rm_work | ah right, same file for both | 22:17 |
rm_work | erm | 22:17 |
rm_work | so should we even be encoding in utf-8 then? | 22:18 |
rm_work | hmac.new(key.encode("utf-8"), payload, hashlib.sha256) | 22:18 |
rm_work | oh that's just the key | 22:18 |
johnsom | That is the key, so yes | 22:18 |
rm_work | err | 22:18 |
rm_work | we also do the payload tho | 22:18 |
rm_work | https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py#L37 | 22:19 |
rm_work | ah and then it becomes binary | 22:19 |
johnsom | Before it's compressed, so again, doesn't matter | 22:19 |
rm_work | yeah | 22:19 |
johnsom | I guess we just have to double the size of the hmac sig on the packets by making them hex | 22:20 |
rm_work | so... yeah we just switch to hexdigest maybe | 22:20 |
rm_work | hmmm | 22:20 |
rm_work | how big is that | 22:20 |
johnsom | 32 bin, 64 hex | 22:20 |
rm_work | bits? bytes? | 22:21 |
rm_work | I assume bytes | 22:21 |
johnsom | bytes | 22:21 |
rm_work | so ... ok, not ideal, but meh | 22:21 |
rm_work | we've got .... 65507 bytes to work with? :P | 22:22 |
rm_work | though that may not function on all networks? | 22:22 |
*** rcernin has joined #openstack-lbaas | 22:22 | |
johnsom | Unless there is some way to make those both "bytes like objects" | 22:25 |
rm_work | i mean, that's what they should be | 22:25 |
rm_work | i believe that IS what we send | 22:26 |
rm_work | essentially just a bytefield | 22:26 |
johnsom | So, it must be this that is hosing it over to python "what is a string" hell: envelope[-hash_len:] | 22:27 |
rm_work | can debug through that and see | 22:28 |
rm_work | one sec | 22:28 |
rm_work | nope | 22:29 |
rm_work | still bytes objects | 22:30 |
rm_work | when we pass to | 22:30 |
rm_work | secretutils.constant_time_compare(expected_hmc, calculated_hmc) | 22:30 |
rm_work | both expected and calculated are bytes() | 22:30 |
rm_work | one sec let me see which python i'm on | 22:30 |
rm_work | ok in py2 it's all str() | 22:31 |
rm_work | in py3 it's all bytes() | 22:31 |
rm_work | but even the original envelope is str() in py2 | 22:31 |
rm_work | we DO test this in our unit tests | 22:32 |
rm_work | it's possible it's an issue with py3 on amps and py2 on controlplane, or vice versa? | 22:32 |
rm_work | lxkong: what version of python does your HM run on? | 22:32 |
lxkong | let me see | 22:33 |
lxkong | the hm is on 2.7.6, the python inside amphora is 2.7.12 | 22:34 |
lxkong | using hexdigest() can also solve the issue. What are you talking about? | 22:36 |
lxkong | length? | 22:36 |
*** mstrohl has quit IRC | 22:36 | |
*** fnaval has quit IRC | 22:37 | |
johnsom | lxkong BTW, you might want to add a docs page in Octavia that talks about using Octavia with K8s and your ingress work. Just so people can find it, etc. | 22:41 |
lxkong | i don't understand what's the potential problem it will bring if we replace digest() with hexdigest()? | 22:41 |
lxkong | johnsom: sure, i will | 22:42 |
johnsom | lxkong Kind of like I did for the SDKs: https://docs.openstack.org/octavia/latest/user/sdks.html | 22:42 |
lxkong | yeah, i will find an appropriate place to advertise that work :-) | 22:43 |
johnsom | Well, it doubles the size of the hmac data being sent. We want our overall data size to fit in a single UDP packet, so trying to keep the size down is good practice | 22:43 |
lxkong | ah, ok yeah, 32 should be 64 | 22:43 |
johnsom | Yeah, if we switch to hexdigest it will become 64 | 22:44 |
lxkong | but reducing from 64 to 32 is just a mitigation | 22:44 |
lxkong | any other options do we have to solve the problem? | 22:45 |
johnsom | It's an optimization to have it stay 32 bytes like it is now, true. | 22:45 |
lxkong | or is anything i can help? | 22:45 |
johnsom | That is a question for rm_work, I think he is poking at this one | 22:46 |
rm_work | eh, i looked, but | 22:47 |
rm_work | i don't really have time to | 22:47 |
lxkong | or we could keep a copy of oslo_utils/secretutils.py but change things i proposed :-) | 22:47 |
rm_work | nor have i seen this issue? | 22:47 |
rm_work | well, I think we might have to do the hexdigest thing | 22:47 |
rm_work | to be actually legit | 22:47 |
johnsom | Yeah, so maybe just do that for now. I think I have seen this, but found out it really was some other issue that was triggering it. Not sure. It's vaguely familiar | 22:48 |
lxkong | so do you guys have time to fix that atm? i can definitely do that, because maybe we are the only ones affected | 22:50 |
johnsom | Go for it, my vote is the hexdigest approach. We need to handle backwards compatibility though... I.e. operator updates the HM controller, but still has digest() amps | 22:52 |
rm_work | eugh yeah | 22:53 |
johnsom | It also might be worth debugging if you can reproduce, see what types we have and the payload | 22:53 |
lxkong | johnsom: do you mean we should both check hmc.digest and hmc.hexdigest to make sure we don't break the old amps? | 22:54 |
johnsom | Yes, we will have to fall back in some way | 22:55 |
lxkong | ok | 22:55 |
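A rough sketch of the compatibility approach agreed on here, using simplified helpers rather than the real status_message.py functions: sign new heartbeats with the ASCII-safe hexdigest (64 bytes on the wire instead of 32) and, on the health manager side, accept the hex signature first but fall back to the old binary digest so amphorae that predate the change keep working.

```python
import hashlib
import hmac

from oslo_utils import secretutils

_HEX_LEN = hashlib.sha256().digest_size * 2   # 64
_BIN_LEN = hashlib.sha256().digest_size       # 32


def sign(payload, key):
    # payload: compressed message bytes; key: the heartbeat key string
    digest = hmac.new(key.encode('utf-8'), payload, hashlib.sha256)
    return payload + digest.hexdigest().encode('utf-8')


def check_hmac(envelope, key):
    # Newer amps: 64-byte hex signature.
    data, sig = envelope[:-_HEX_LEN], envelope[-_HEX_LEN:]
    expected = hmac.new(key.encode('utf-8'), data, hashlib.sha256)
    if secretutils.constant_time_compare(
            sig, expected.hexdigest().encode('utf-8')):
        return data
    # Older amps: fall back to the 32-byte binary digest.
    data, sig = envelope[:-_BIN_LEN], envelope[-_BIN_LEN:]
    expected = hmac.new(key.encode('utf-8'), data, hashlib.sha256)
    if secretutils.constant_time_compare(sig, expected.digest()):
        return data
    raise ValueError('HMAC verification failed')
```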
openstackgerrit | Lingxian Kong proposed openstack/octavia master: [WIP] Use HMAC.hexdigest to avoid non-ascii characters for package data https://review.openstack.org/571333 | 23:46 |
lxkong | johnsom, rm_work: when you are available, please take a look at https://review.openstack.org/571333, I want to make sure that's the way you expected before I start adding unit tests | 23:50 |
lxkong | it's already verified in our preprod | 23:51 |
johnsom | One comment, but otherwise yeah, I think that is the path | 23:55 |
openstackgerrit | Lingxian Kong proposed openstack/octavia master: [WIP] Use HMAC.hexdigest to avoid non-ascii characters for package data https://review.openstack.org/571333 | 23:59 |
lxkong | johnsom: thanks | 23:59 |