*** mordred has quit IRC | 00:06 | |
*** zigo has quit IRC | 00:06 | |
*** zigo_ has joined #openstack-lbaas | 00:06 | |
*** KeithMnemonic has quit IRC | 00:08 | |
*** mordred has joined #openstack-lbaas | 00:16 | |
*** leitan has quit IRC | 00:18 | |
*** leitan has joined #openstack-lbaas | 00:18 | |
*** sshank has quit IRC | 00:20 | |
*** leitan has quit IRC | 00:22 | |
rm_work | ummmmm | 00:31 |
rm_work | https://review.openstack.org/#/c/570485/4 | 00:31 |
rm_work | failed one each | 00:31 |
rm_work | i've seen similar failures before in noops | 00:32 |
rm_work | not sure how stuff gets stuck in PENDING statuses?? | 00:32 |
rm_work | but it seems to | 00:32 |
rm_work | RMQ failures? | 00:32 |
johnsom | No RMQ with noop.... | 00:32 |
rm_work | digging through o-cw log | 00:32 |
rm_work | err ... yes? | 00:33 |
rm_work | we still go through to worker | 00:33 |
rm_work | and it just noops the compute/net/amp parts | 00:33 |
johnsom | It's not running noop Octavia driver? | 00:33 |
rm_work | ??? | 00:33 |
rm_work | this is an octavia tempest test | 00:33 |
johnsom | Yeah, right, I think that is going through | 00:33 |
johnsom | Looks broken: http://logs.openstack.org/85/570485/4/check/octavia-v2-dsvm-noop-py35-api/5f232ca/controller/logs/screen-o-cw.txt.gz?level=ERROR | 00:34 |
rm_work | this doesn't look good tho: http://logs.openstack.org/85/570485/4/check/octavia-v2-dsvm-noop-py35-api/5f232ca/controller/logs/screen-o-cw.txt.gz#_May_30_00_02_12_274349 | 00:34 |
*** atoth has quit IRC | 00:35 | |
rm_work | hmmm yeah | 00:35 |
rm_work | interesting | 00:35 |
rm_work | is this catching an octavia bug somehow? lol | 00:35 |
rm_work | but why only sometimes | 00:35 |
rm_work | maybe due to the test order / parallel runs? | 00:35 |
johnsom | Oye, I hope not.... | 00:36 |
*** longkb has joined #openstack-lbaas | 00:36 | |
rm_work | i mean it's here: http://logs.openstack.org/85/570485/4/check/octavia-v2-dsvm-noop-py35-api/5f232ca/controller/logs/screen-o-cw.txt.gz?#_May_30_00_03_35_679321 | 00:36 |
rm_work | how would that happen O_o | 00:36 |
*** keithmnemonic[m] has quit IRC | 00:37 | |
rm_work | it tries to fetch the pool info from the DB and it isn't there O_o | 00:38 |
rm_work | how, what | 00:38 |
johnsom | We need to fix that quota log message, it's not printing the object | 00:38 |
rm_work | yeah | 00:39 |
rm_work | i wonder if this is an issue with the new provider stuff ordering the db commit wrong possibly? grasping at straws | 00:39 |
rm_work | cause yeah... you do the "call_provider" before the commit | 00:40 |
rm_work | so that's my guess | 00:40 |
rm_work | we manage to make it through the RMQ call and into the worker and do the processing, before the commit happens | 00:40 |
rm_work | in some instances | 00:40 |
rm_work | :/ | 00:40 |
rm_work | sad panda | 00:40 |
rm_work | https://github.com/openstack/octavia/blob/master/octavia/api/v2/controllers/pool.py#L201-L207 | 00:41 |
rm_work | is it OK to move the driver call *after* the commit? | 00:41 |
rm_work | actually, it should definitely be after, right? because if the driver call fails, we don't want to rollback still IMO | 00:42 |
rm_work | it'd be direct-to-error? | 00:42 |
rm_work | or is that not the design | 00:42 |
johnsom | Hmmm, well, we would have to go back to the "ERROR" status thing. | 00:42 |
rm_work | so... yeah. we have to IMO | 00:42 |
rm_work | since we really can't call out to the provider without having the db updated yet T_T | 00:42 |
johnsom | I switched this around to try to enroll the provider call in the DB transaction, so if the driver failed we would roll back the whole request | 00:43 |
rm_work | we're going to have this race | 00:43 |
rm_work | :/ | 00:43 |
johnsom | We, as in Octavia driver.... | 00:43 |
rm_work | how can a driver that relies on the DB function then | 00:43 |
rm_work | amphora driver, yeah | 00:43 |
johnsom | No driver should rely on the DB | 00:44 |
rm_work | hmmm | 00:44 |
rm_work | so amp driver needs to be rewritten then <_< | 00:44 |
rm_work | because it does | 00:44 |
johnsom | They should get everything from the provider call | 00:44 |
rm_work | i guess we need to rewrite the amp driver to use the interface? | 00:44 |
johnsom | Yes, ultimately, that is true | 00:44 |
rm_work | that means passing that data over rmq | 00:44 |
rm_work | instead of just an ID | 00:44 |
johnsom | I had hoped to avoid that | 00:45 |
rm_work | well, you see the problem with avoiding that <_< | 00:45 |
johnsom | At least for now, just because that is a lot of change too | 00:45 |
rm_work | ok... so. | 00:45 |
rm_work | we can't leave it like this :( | 00:45 |
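The race being described, in miniature: the API controller opens a DB transaction, creates the pool row, calls the provider driver (which, for the amphora driver, casts over RMQ to octavia-worker), and only then commits, so the worker can occasionally query for the pool before it is visible. The snippet below is a toy, self-contained reproduction of that ordering, not Octavia code; it only illustrates how an uncommitted row is invisible to a second connection.

```python
import sqlite3
import threading
import time

DB = 'race_demo.sqlite'

setup = sqlite3.connect(DB)
setup.execute('CREATE TABLE IF NOT EXISTS pool (id TEXT)')
setup.commit()
setup.close()


def worker(pool_id):
    # Stands in for octavia-worker handling the RMQ cast from the API.
    conn = sqlite3.connect(DB)
    row = conn.execute('SELECT id FROM pool WHERE id = ?',
                       (pool_id,)).fetchone()
    print('worker sees:', row)  # None -> the "pool not found" failure
    conn.close()


api = sqlite3.connect(DB)
api.execute("INSERT INTO pool (id) VALUES ('pool-1')")  # transaction open
t = threading.Thread(target=worker, args=('pool-1',))
t.start()             # the "call_provider" step happens before the commit
time.sleep(0.1)
api.commit()          # the pool only becomes visible to the worker here
t.join()
api.close()
```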
johnsom | Add sleeps? lol | 00:47 |
rm_work | what do you suggest BESIDES that | 00:47 |
* rm_work slaps johnsom | 00:47 | |
rm_work | no hacks :P | 00:47 |
johnsom | Poll the DB? | 00:47 |
johnsom | Sorry, still wacky from being sick | 00:47 |
rm_work | ugh, retry decorator? | 00:47 |
rm_work | you got sick? :( | 00:47 |
rm_work | i mean we could do a retry for like 5s or something dumb | 00:48 |
rm_work | temporarily | 00:48 |
*** leitan has joined #openstack-lbaas | 00:48 | |
rm_work | erugh | 00:48 |
rm_work | *dudhadhuwoefhwf | 00:48 |
johnsom | Yeah, I got a sinus infection from the trip. Burned my whole weekend | 00:48 |
rm_work | i said "temporarily", which means shitty hack workaround | 00:48 |
rm_work | sad :( | 00:48 |
rm_work | I drove to Boise and back, which burned a lot of mine :P | 00:48 |
johnsom | Yep, hope the wedding was fun | 00:48 |
rm_work | nah, not a wedding. just meeting up with some folks | 00:49 |
johnsom | So..... Yeah, I think the driver code is *right*, it's our amphora side that is *wrong* | 00:49 |
johnsom | Ah, I thought you were going to a wedding for some reason. | 00:49 |
rm_work | but whatever meds you're on, I want some | 00:49 |
johnsom | None sadly | 00:49 |
rm_work | lol | 00:50 |
rm_work | yes, I guess I agree then | 00:50 |
rm_work | so | 00:50 |
* johnsom sees the problem | 00:50 | |
rm_work | i don't see a solution besides passing stuff through RMQ | 00:50 |
rm_work | the whole kaboodle | 00:50 |
johnsom | Well, it's either that or do the DB polling on the controller side. | 00:50 |
rm_work | that's really shitty :/ | 00:51 |
rm_work | it's impossible to *prove* it works | 00:51 |
johnsom | Right, long term, it should be using the data it gets in the provider call | 00:53 |
*** annp has quit IRC | 00:53 | |
*** annp has joined #openstack-lbaas | 00:54 | |
rm_work | so you think... what? we temporary-hack-workaround it to just do a retry on all the DB loads? | 00:54 |
rm_work | if it's not found during a create? | 00:54 |
rm_work | I GUESS that's kinda ok? since for a create, we really should be able to assume that it's got to be there sooner or later | 00:55 |
johnsom | It really comes down to who has time to do the long term solution vs. slapping in a workaround. | 00:56 |
rm_work | k well, at the moment this is scary | 00:57 |
rm_work | i couldn't put this in prod, and our gates are going to be unreliable | 00:57 |
*** leitan has quit IRC | 00:58 | |
*** leitan has joined #openstack-lbaas | 00:58 | |
*** harlowja has quit IRC | 01:00 | |
*** leitan has quit IRC | 01:00 | |
*** leitan has joined #openstack-lbaas | 01:00 | |
*** keithmnemonic[m] has joined #openstack-lbaas | 01:05 | |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Implement provider drivers - Cleanup https://review.openstack.org/567431 | 01:07 |
johnsom | rm_work the gotcha with the long term solution is the code expects a DB model there, just as that failure was calling, which doesn't exist with the provider driver method. We don't have a separate DB for the amphora driver yet. | 01:16 |
rm_work | yeah so it's kinda a rewrite | 01:18 |
johnsom | Tomorrow I will cook up the polling thing. | 01:21 |
rm_work | k i mean | 01:22 |
rm_work | i assume it's just a DB retry | 01:22 |
rm_work | using like | 01:22 |
rm_work | tenacity | 01:22 |
rm_work | if it throws the DB error | 01:22 |
rm_work | i can prolly do it too | 01:22 |
johnsom | Right, a decorator that checks the repo get result to see if it is empty, if so, try again. like up to 30 seconds or something | 01:22 |
johnsom | It won't throw an exception, it's just a None object back | 01:23 |
rm_work | yeah, i mean, I could make it throw one :P | 01:23 |
johnsom | Then go down the controller_worker, and wrap those first "get" calls | 01:23 |
johnsom | That would be nasty as tons of things use those repo get calls | 01:24 |
rm_work | not in the repo-get | 01:24 |
rm_work | in the controller_worker create_* | 01:24 |
johnsom | Oh | 01:24 |
johnsom | Yeah. | 01:24 |
johnsom | And it would only be the create calls, the others should be fine. I think | 01:25 |
johnsom | maybe | 01:25 |
johnsom | Ok, I need to run sadly | 01:25 |
rm_work | kk | 01:26 |
rm_work | yes | 01:26 |
rm_work | updates will be fine | 01:27 |
rm_work | and deletes obv | 01:27 |
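Sketching the workaround just described, assuming tenacity is used (it already is elsewhere in Octavia); the helper name is made up and this is not the actual controller_worker code. The idea is simply to retry the initial repo get while it keeps returning None, for a bounded time.

```python
import tenacity


# Retry while the repo get returns None, once a second, for up to ~30s.
# If the row never appears, tenacity raises RetryError after the stop
# condition, which is where the "database is broken" log would go.
@tenacity.retry(
    retry=tenacity.retry_if_result(lambda result: result is None),
    wait=tenacity.wait_fixed(1),
    stop=tenacity.stop_after_delay(30))
def _get_created_object(repo, session, obj_id):
    # Returns None until the API side's transaction has committed.
    return repo.get(session, id=obj_id)
```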
*** mordred has quit IRC | 01:31 | |
*** hongbin has joined #openstack-lbaas | 01:38 | |
*** mordred has joined #openstack-lbaas | 01:45 | |
*** leitan has quit IRC | 02:38 | |
*** leitan has joined #openstack-lbaas | 02:39 | |
*** leitan has quit IRC | 02:43 | |
*** harlowja has joined #openstack-lbaas | 03:29 | |
*** hongbin has quit IRC | 03:53 | |
*** links has joined #openstack-lbaas | 04:11 | |
*** annp has quit IRC | 04:13 | |
*** harlowja has quit IRC | 04:14 | |
*** blake has joined #openstack-lbaas | 04:23 | |
*** JudeC has joined #openstack-lbaas | 05:10 | |
*** eandersson has joined #openstack-lbaas | 05:10 | |
*** kobis has joined #openstack-lbaas | 05:19 | |
*** JudeC has quit IRC | 05:22 | |
*** JudeC has joined #openstack-lbaas | 05:24 | |
*** kobis has quit IRC | 05:37 | |
*** kobis has joined #openstack-lbaas | 05:40 | |
*** kobis has quit IRC | 05:41 | |
*** eandersson has quit IRC | 05:48 | |
*** eandersson has joined #openstack-lbaas | 05:49 | |
*** kobis has joined #openstack-lbaas | 06:11 | |
*** blake has quit IRC | 06:13 | |
*** imacdonn has quit IRC | 06:16 | |
*** imacdonn has joined #openstack-lbaas | 06:16 | |
rm_work | ugh a lot of our docstrings are hilariously wrong in controller_worker | 06:29 |
*** AlexeyAbashkin has joined #openstack-lbaas | 06:32 | |
*** pcaruana has joined #openstack-lbaas | 06:33 | |
*** JudeC__ has joined #openstack-lbaas | 06:40 | |
*** JudeC_ has quit IRC | 06:41 | |
openstackgerrit | Kobi Samoray proposed openstack/octavia master: Octavia devstack plugin API mode https://review.openstack.org/570924 | 06:48 |
*** annp has joined #openstack-lbaas | 06:52 | |
*** apple01 has joined #openstack-lbaas | 06:54 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 07:00 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for healthmonitors https://review.openstack.org/567688 | 07:01 |
*** apple01 has quit IRC | 07:17 | |
*** tesseract has joined #openstack-lbaas | 07:20 | |
*** rcernin has quit IRC | 07:27 | |
*** kobis has quit IRC | 07:34 | |
*** kobis has joined #openstack-lbaas | 07:38 | |
*** yboaron has joined #openstack-lbaas | 07:39 | |
*** apple01 has joined #openstack-lbaas | 07:46 | |
openstackgerrit | ZhaoBo proposed openstack/octavia master: UDP for [3][5][6] https://review.openstack.org/571120 | 07:46 |
*** apple01 has quit IRC | 07:51 | |
*** JudeC__ has quit IRC | 08:10 | |
*** kobis has quit IRC | 08:12 | |
*** kobis has joined #openstack-lbaas | 08:13 | |
*** jiteka has quit IRC | 08:13 | |
*** nmanos has joined #openstack-lbaas | 08:19 | |
*** Alexey_Abashkin has joined #openstack-lbaas | 08:50 | |
*** Alexey_Abashkin has quit IRC | 08:51 | |
*** AlexeyAbashkin has quit IRC | 08:51 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 08:52 | |
*** apple01 has joined #openstack-lbaas | 08:56 | |
*** apple01 has quit IRC | 09:01 | |
*** salmankhan has joined #openstack-lbaas | 09:14 | |
*** yamamoto has quit IRC | 09:21 | |
*** salmankhan has quit IRC | 09:25 | |
*** kobis has quit IRC | 09:26 | |
*** JudeC_ has joined #openstack-lbaas | 09:28 | |
*** salmankhan has joined #openstack-lbaas | 09:35 | |
*** salmankhan has quit IRC | 09:42 | |
*** salmankhan has joined #openstack-lbaas | 09:47 | |
*** yamamoto has joined #openstack-lbaas | 09:50 | |
*** zigo_ is now known as zigo | 09:56 | |
*** annp has quit IRC | 10:01 | |
*** apple01 has joined #openstack-lbaas | 10:06 | |
*** kobis has joined #openstack-lbaas | 10:14 | |
*** apple01 has quit IRC | 10:14 | |
*** JudeC_ has quit IRC | 10:15 | |
*** AlexeyAbashkin has quit IRC | 10:30 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 10:31 | |
*** AlexeyAbashkin has quit IRC | 10:36 | |
*** apple01 has joined #openstack-lbaas | 11:04 | |
*** yboaron has quit IRC | 11:04 | |
*** apple01 has quit IRC | 11:08 | |
*** apple01 has joined #openstack-lbaas | 11:20 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 11:21 | |
*** yamamoto has quit IRC | 11:29 | |
*** longkb has quit IRC | 11:40 | |
*** apple01 has quit IRC | 11:42 | |
*** apple01 has joined #openstack-lbaas | 11:49 | |
*** yamamoto has joined #openstack-lbaas | 12:07 | |
*** leitan has joined #openstack-lbaas | 12:09 | |
*** atoth has joined #openstack-lbaas | 12:16 | |
*** amuller has joined #openstack-lbaas | 12:23 | |
*** apple01 has quit IRC | 12:44 | |
*** yboaron has joined #openstack-lbaas | 12:45 | |
*** sajjadg has joined #openstack-lbaas | 12:59 | |
*** samccann has joined #openstack-lbaas | 13:00 | |
*** yamamoto has quit IRC | 13:04 | |
*** leitan has quit IRC | 13:11 | |
*** leitan has joined #openstack-lbaas | 13:26 | |
*** leitan has quit IRC | 13:26 | |
*** links has quit IRC | 13:29 | |
*** links has joined #openstack-lbaas | 13:30 | |
*** yamamoto has joined #openstack-lbaas | 13:35 | |
*** fnaval has joined #openstack-lbaas | 13:36 | |
*** yamamoto has quit IRC | 13:39 | |
*** links has quit IRC | 13:44 | |
*** AlexeyAbashkin has quit IRC | 14:02 | |
*** AlexeyAbashkin has joined #openstack-lbaas | 14:04 | |
*** kobis has quit IRC | 14:06 | |
*** yboaron_ has joined #openstack-lbaas | 14:25 | |
*** yboaron has quit IRC | 14:28 | |
*** yboaron_ has quit IRC | 14:37 | |
*** yboaron_ has joined #openstack-lbaas | 14:38 | |
*** yboaron_ has quit IRC | 14:53 | |
*** kobis has joined #openstack-lbaas | 14:55 | |
*** apple01 has joined #openstack-lbaas | 15:07 | |
*** rpittau has joined #openstack-lbaas | 15:27 | |
*** sajjadg has quit IRC | 15:30 | |
*** pcaruana has quit IRC | 15:33 | |
amuller | Random Q of the day - When issuing the admin failover command to a loadbalancer in active_standby topology, I see that both amphorae IDs changed. Is this expected? | 15:44 |
*** jiteka has joined #openstack-lbaas | 15:45 | |
*** jiteka has quit IRC | 15:47 | |
*** jiteka has joined #openstack-lbaas | 15:48 | |
amuller | it seems like both the active and standby amphorae were killed and new ones were spawned | 15:48 |
*** apple01 has quit IRC | 15:51 | |
xgerman_ | yep, that is expected | 15:53 |
xgerman_ | the idea is to recycle both amps in order to update the amphora image for example | 15:54 |
xgerman_ | the amphora API has a more granular failover | 15:54 |
xgerman_ | ^^ amuller | 15:54 |
amuller | mhmm | 15:54 |
amuller | so how do I know that a keepalived based failover happened, via the API? | 15:55 |
xgerman_ | yeah, failover on lb might mean something different for F5 :-) | 15:55 |
amuller | (thinking of testing) | 15:55 |
amuller | I mean, how do I know that the old backup is now the active? | 15:55 |
johnsom | amuller If you want to failover only one of the pair, you can use the amphora failover API. | 15:56 |
xgerman_ | well, I think you mean the active dies and the passive takes over | 15:56 |
amuller | right | 15:56 |
xgerman_ | you can “simulate” that with a nova delete on the ACTIVE | 15:56 |
xgerman_ | or a port down, or… | 15:57 |
amuller | admin state down on the neutron port of the active? | 15:57 |
johnsom | As for which one is currently "MASTER" in the VRRP pair, we don't currently expose that outside the amphora. They manage that completely autonomously to the control plane. | 15:57 |
*** nmanos has quit IRC | 15:57 | |
xgerman_ | amuller: yes, the vrrp port | 15:57 |
amuller | johnsom: ack | 15:57 |
johnsom | It also comes down to keepalived not reliably exposing the status of a given instance. | 15:58 |
amuller | so I'm trying to write a scenario test for failover with active_standby topology | 15:58 |
xgerman_ | yeah, I think in the grander scheme of things killing the amp you think is master will definitely have the other amp take over | 15:59 |
amuller | so I'm trying to find an easy way (using the API) to see that a failover happened | 15:59 |
xgerman_ | then you check via the nova or amp api if there is a new amp | 15:59 |
johnsom | Adam had started one: https://review.openstack.org/#/c/501559/ | 15:59 |
johnsom | Have you looked at that. | 15:59 |
xgerman_ | while you check that traffic kept flowing | 15:59 |
amuller | oh | 15:59 |
amuller | I had not | 15:59 |
johnsom | Yeah, the amphora IDs will change, which is queriable via the amphora admin API. That is one clue | 16:00 |
johnsom | I think that patch is pretty out of date now, but could be a starting place. | 16:00 |
amuller | mhmm, that patch has a larger scope than I was hoping to put up for review | 16:03 |
amuller | I was hoping for a simpler patch, leaving traffic flow out of it | 16:04 |
amuller | just using the API to see that a failover happened | 16:04 |
amuller | xgerman_: if I find the nova VM for the active amphora and kill it, I imagine octavia will spawn a new amphora, but will the old standby become the new active, without octavia spawning a new 2nd amphora? | 16:09 |
amuller | in other words if I kill the nova VM for the active amphora, and I wait a bit and issue an amphorae list for the LB, will I see two new amphorae or only one new one? | 16:10 |
johnsom | Only one new one | 16:10 |
xgerman_ | +1 | 16:11 |
xgerman_ | passive->active runs “local” on each amp pair and independently we replace broken/deleted amps with the control plane | 16:11 |
johnsom | When you kill one amp by nova delete, the other will automatically assume MASTER if it wasn't. Then health manager will rebuild the other amphora and it will assume the BACKUP role. | 16:11 |
johnsom | MASTER and BACKUP should not be confused with the database "ROLE" which is for build time settings and not an indication of which amp is playing what role. | 16:12 |
xgerman_ | +1 — once the pair is live it can do whatever it wants | 16:13 |
johnsom | I actually had a demo video of doing this via horizon I had in case I ran short on my presentations at the summit | 16:13 |
amuller | oh, ok good note on the database role | 16:13 |
johnsom | Yeah, that confuses people. It's for settings, not for current status | 16:14 |
amuller | yeah it's confusing =p | 16:14 |
amuller | so to double check there's no way to use the admin API to find out who is the active one (per keepalived)? | 16:14 |
johnsom | Correct. Keepalived doesn't provide a way to ask this that is reliable enough to use. | 16:15 |
johnsom | Unless they have added it in the last year or two. | 16:15 |
amuller | when I did the l3 HA work in neutron | 16:15 |
amuller | because keepalived didn't at the time allow you to query it for its status | 16:15 |
amuller | (I don't know if that changed) | 16:15 |
amuller | I wrote keepalived-state-change-monitor | 16:15 |
amuller | which is a little python process that uses 'ip monitor' to detect IP address changes | 16:16 |
xgerman_ | mmh, maybe we can incorporate that | 16:16 |
amuller | and when it does, it notifies the L3 agent via a unix domain socket | 16:16 |
amuller | and the L3 agent then notifies neutron-server via RPC | 16:16 |
amuller | which updates the DB | 16:16 |
amuller | so then operators can use an admin API to see where the active router replica is | 16:16 |
xgerman_ | yeah, we can extend the health messages to stream it into the DB | 16:17 |
johnsom | Hmm, well, we have a status/diagnostic API on the amps, I think that would be fine to query and not have yet another status to update in the DB. | 16:17 |
amuller | yeah, I was just commenting on a way to figure out keepalived state changes | 16:18 |
johnsom | Also, I think you will find there are some oddities in keepalived that on initial startup both amps will see the VIP IP but keepalived will only be GARPing from one of them. | 16:18 |
amuller | not how you'd model it in the API | 16:18 |
johnsom | This is one of the "reliable" issues we hit | 16:18 |
amuller | hmm, I didn't see that in L3 HA | 16:19 |
amuller | I know that we configure keepalived differently | 16:19 |
amuller | in neutron and in octavia | 16:19 |
johnsom | Yeah, I think it might be how we are bringing up the interfaces. It wasn't a "problem" so we never went back and tried to fix it | 16:19 |
*** kobis has quit IRC | 16:19 | |
johnsom | We considered log scraping, but that proved to not be accurate. We considered doing the status dump to /tmp using the signal, but that required a newer version than some distros have. etc. | 16:20 |
johnsom | Just wasn't a priority to run to ground as it doesn't provide a lot of useful information, just some nice to have info for failover. | 16:21 |
johnsom | I think Ryan brought up that the right answer was likely listening on the dbus, but again, a good chunk of work | 16:21 |
amuller | I'm not familiar with the status dump using a signal | 16:22 |
xgerman_ | so far the amp (diagnostic) API is not exposed in the best way — that’s why I was thinking health messages… I think one day I will write a way to proxy amp-api->diagnostic-api | 16:22 |
*** pcaruana has joined #openstack-lbaas | 16:23 | |
amuller | so it seems like right now there's no way to test failover in active_standby topology, if I want to stay in the realm of the API | 16:23 |
johnsom | What do you mean it's not exposed in the best way? It works just fine. I have no idea why you would need a proxy | 16:23 |
johnsom | Sure there is, use the amphora admin API to check the amphora IDs | 16:24 |
xgerman_ | you need to curl the amp directly or not? | 16:25 |
johnsom | not for the amphora admin API | 16:25 |
johnsom | https://developer.openstack.org/api-ref/load-balancer/v2/index.html#show-amphora-details | 16:25 |
johnsom | I guess list would be more useful: https://developer.openstack.org/api-ref/load-balancer/v2/index.html#list-amphora | 16:26 |
johnsom | Filter by LB ID | 16:26 |
johnsom | I am pretty sure this is how Adam's failover patch is doing it | 16:26 |
amuller | that tells me that octavia spawned new amphoras, does it answer the question of 'did the old backup become the new active'? | 16:27 |
johnsom | You can test that by making sure traffic still flows during the failover | 16:27 |
amuller | I don't really wanna do that =p | 16:28 |
amuller | it makes the test more complicated than it needs to be | 16:28 |
amuller | so, even with active_standby you'll see some downtime | 16:28 |
johnsom | VRRP is designed to be autonomous to the control plane. It can switch at any time for failures the control plane doesn't know about. | 16:28 |
amuller | so how long of a downtime would you set to be tolerable? | 16:28 |
amuller | it needs to be shorter than a test with standalone topology | 16:28 |
johnsom | Yeah, depending on how it's tuned it's around a second, usually less | 16:29 |
amuller | it's gonna be difficult to write it in a way that's reliable at the gate and in loaded environments | 16:29 |
amuller | but with a low enough timeout so that the test is still meaningful | 16:29 |
*** salmankhan has quit IRC | 16:29 | |
xgerman_ | johnsom: I was thinking proxying https://docs.openstack.org/octavia/latest/contributor/api/haproxy-amphora-api.html — but you probably guessed that :-) | 16:30 |
johnsom | Yeah, the hierarchy is: systemd recovery, sub second; active/standby, around a second; failover, under a minute (depends on the cloud infrastructure and your octavia config, typically 30 seconds). | 16:30 |
amuller | so let's say we do this by pinging the LB | 16:32 |
johnsom | amuller Here is a demo I showed at the tokyo summit: https://youtu.be/8n7FGhtOiXk?t=23m52s | 16:32 |
amuller | with some timeout | 16:33 |
amuller | how do I know which VM to kill | 16:33 |
amuller | because I don't know who the active amphora is | 16:33 |
johnsom | Right, you won't, you have to cycle through killing them both, one at a time, to prove an active/standby transition | 16:33 |
xgerman_ | they could switch in between — so you are probably left with “good enough” | 16:34 |
amuller | so I pick one at random, kill it, assert < timeout outage | 16:34 |
amuller | I could have killed either the active or the standby | 16:34 |
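For reference, the detection approach described above can be sketched as follows. The endpoint path and response shape follow the amphora admin API reference linked earlier, but the endpoint/token plumbing here is hypothetical: list the amphorae for the load balancer before and after deleting one VM, and exactly one ID should have changed once the health manager has rebuilt the failed amphora.

```python
import requests


def get_amphora_ids(octavia_endpoint, token, lb_id):
    # GET /v2/octavia/amphorae?loadbalancer_id=<id>  (amphora admin API)
    resp = requests.get(
        octavia_endpoint + '/v2/octavia/amphorae',
        params={'loadbalancer_id': lb_id},
        headers={'X-Auth-Token': token})
    resp.raise_for_status()
    return {amp['id'] for amp in resp.json()['amphorae']}

# before = get_amphora_ids(endpoint, token, lb_id)
# ... nova delete one of the two amphora VMs, wait for the LB to go ACTIVE ...
# after = get_amphora_ids(endpoint, token, lb_id)
# assert len(before - after) == 1 and len(after - before) == 1
```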
rm_work | amuller: i am mid failover-scenario right now (started on it yesterday) | 16:34 |
rm_work | updating my old patch | 16:35 |
amuller | oh hah ok | 16:35 |
xgerman_ | sweet | 16:35 |
rm_work | about to push up the amphora client patch | 16:35 |
amuller | well, good thing I asked =p | 16:35 |
rm_work | just didn't get the testing quite right last night | 16:35 |
amuller | alrighty well | 16:36 |
amuller | is there another area where there are glaring testing coverage issues? | 16:36 |
amuller | one thing that pops to mind is the amphora show and list API | 16:36 |
amuller | I don't think I saw tests for that in the api subdir | 16:36 |
amuller | is there a doc or some such ya'all are using to track progress on test coverage in the octavia tempest plugin? | 16:37 |
rm_work | amuller: that's what i meant | 16:38 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for l7policies https://review.openstack.org/570482 | 16:39 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for l7rules https://review.openstack.org/570485 | 16:39 |
openstackgerrit | Adam Harwell proposed openstack/octavia-tempest-plugin master: Create api+scenario tests for amphora https://review.openstack.org/571251 | 16:40 |
rm_work | ^^ there | 16:40 |
rm_work | not done but i'll just push it so it'll be obvious what i mean | 16:40 |
amuller | +1 | 16:41 |
amuller | rm_work: https://review.openstack.org/#/c/571251/1/octavia_tempest_plugin/tests/api/v2/test_amphora.py@32 you prolly want a different class name =p | 16:42 |
amuller | anyhow | 16:42 |
rm_work | yes :P | 16:42 |
rm_work | i wasn't really ready to push yet but ;P | 16:43 |
rm_work | one of the tests is still just an unmodified clone, lol | 16:43 |
amuller | still curious if there's an effort to track test coverage in the octavia tempest plugin | 16:44 |
amuller | on my way out for lunch, we can sync later :) | 16:44 |
rm_work | yes | 16:44 |
rm_work | https://storyboard.openstack.org/#!/story/2001387 | 16:44 |
*** jcarpentier has joined #openstack-lbaas | 16:44 | |
rm_work | i think it is missing the amp piece | 16:45 |
amuller | ah ha :) | 16:45 |
amuller | that's quite a story hehe | 16:45 |
*** kobis has joined #openstack-lbaas | 16:45 | |
*** JudeC_ has joined #openstack-lbaas | 16:46 | |
rm_work | johnsom: did you see my attempts at the db retry patch? | 16:52 |
johnsom | Haven't had a chance yet this morning | 16:53 |
rm_work | https://review.openstack.org/#/c/571107/ | 16:53 |
* rm_work shrugs | 16:54 | |
rm_work | i can tweak that stuff a bunch if we want to do like | 16:55 |
rm_work | wait 2, retry count = 5 | 16:55 |
rm_work | instead of wait 1, retry until 10s | 16:55 |
*** kobis has quit IRC | 16:59 | |
*** tesseract has quit IRC | 17:10 | |
johnsom | rm_work So this looks fine. I would make those numbers either config or a constant somewhere (could be at the top of the file). I think 1 is the right answer, but maybe set the top end at 30. Just in case there is a DB that takes a crazy long time to commit. | 17:28 |
*** AlexeyAbashkin has quit IRC | 17:29 | |
johnsom | It doesn't really handle the case of walking off the end of the retries in a nice way. It's not like we could do much about that (can't set ERROR obviously), but a log message with details would be nice. | 17:30 |
*** AlexeyAbashkin has joined #openstack-lbaas | 17:31 | |
johnsom | FYI, someone on the dev mailing list was asking about backend re-encryption | 17:32 |
*** jcarpentier has left #openstack-lbaas | 17:35 | |
*** AlexeyAbashkin has quit IRC | 17:36 | |
rm_work | johnsom: right, if it fails right now the exception makes just as little sense, i think | 17:39 |
rm_work | at least this one is representative of the issue | 17:39 |
rm_work | but in no case is there anything to be done :( | 17:40 |
rm_work | I guess I could put in a log like "retrying" | 17:40 |
rm_work | one sec | 17:40 |
johnsom | I probably would have put the retry loop inside and had a log message that we "lost this object" and that the database is broken | 17:40 |
rm_work | I mean, I could just catch the existing error, but it's not *necessarily* as explicit | 17:41 |
rm_work | the AttributeError | 17:42 |
rm_work | but that'd also be easier | 17:42 |
rm_work | what about: | 17:45 |
rm_work | wait=tenacity.wait_incrementing(2, 2) | 17:46 |
rm_work | stop=tenacity.stop_after_attempt(10) | 17:46 |
johnsom | rm_work Hmm, there are some nice backoff options in the tenacity docs (trying to figure out what increment does, which of course isn't documented). It also has an "after failed" log option that would work here | 17:49 |
johnsom | https://github.com/jd/tenacity | 17:49 |
*** kobis has joined #openstack-lbaas | 17:49 | |
rm_work | yes | 17:49 |
rm_work | i don't really want to go exponential | 17:50 |
johnsom | yeah, agreed | 17:50 |
rm_work | ok actually, maybe I should just do it like... | 17:51 |
johnsom | It should land quickly under normal operation, so adding too much delay between attempts just slows the whole thing down. | 17:51 |
rm_work | retry on the natural AttributeError, and then reraise after the last retry | 17:52 |
rm_work | and not increment the delay much | 17:52 |
rm_work | but meh... | 17:54 |
rm_work | hmmm | 17:54 |
*** AlexeyAbashkin has joined #openstack-lbaas | 17:57 | |
*** Alexey_Abashkin has joined #openstack-lbaas | 18:00 | |
*** AlexeyAbashkin has quit IRC | 18:02 | |
*** Alexey_Abashkin is now known as AlexeyAbashkin | 18:02 | |
*** pcaruana has quit IRC | 18:09 | |
rm_work | lol, retry_error_callback is SUPER NEW | 18:11 |
openstackgerrit | Michael Johnson proposed openstack/octavia master: Implement provider drivers - Cleanup https://review.openstack.org/567431 | 18:12 |
rm_work | so not sure it's safe to use yet | 18:12 |
johnsom | ok | 18:12 |
johnsom | ^^^ just a docs update to clarify some questions Kobi raised this morning | 18:12 |
rm_work | yeah k | 18:13 |
*** kobis has quit IRC | 18:18 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 18:21 |
rm_work | johnsom: revised ^^ | 18:21 |
rm_work | erk off by one | 18:21 |
rm_work | i was aiming for 1+2+3+4+5*11 = 60 | 18:21 |
rm_work | but i need 15 not 14 for that | 18:22 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 18:23 |
rm_work | as this is a *workaround*, I really wanted to avoid adding a bunch more config (that we'd have to deprecate, to boot) | 18:23 |
johnsom | Yeah, just throw a constant at the top of the file | 18:26 |
rm_work | ah, sure | 18:27 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 18:29 |
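Not the actual patch, but the shape being converged on reads roughly like this: module-level constants rather than new config options, retrying on the AttributeError that a missing DB object naturally triggers, an incrementing wait, and reraise so the original exception surfaces if every attempt fails. tenacity's wait_incrementing grows the delay by `increment` per attempt and caps it at `max`, so 1s growing by 1s, capped at 5s, over 15 attempts gives roughly the 60 seconds mentioned above. All names and values here are illustrative.

```python
import tenacity

# Hypothetical constants standing in for the values discussed above.
CREATE_RETRY_INITIAL_DELAY = 1   # first wait, in seconds
CREATE_RETRY_BACKOFF = 1         # grow the wait by this much per attempt
CREATE_RETRY_MAX_WAIT = 5        # cap on the per-attempt wait
CREATE_RETRY_ATTEMPTS = 15       # 14 waits: 1+2+3+4 + 5*10 = 60 seconds

# Decorator applied to the controller_worker create_* methods; the create
# flows raise AttributeError when the initial DB get returns None and its
# attributes are accessed, which is the error being retried here.
db_create_retry = tenacity.retry(
    retry=tenacity.retry_if_exception_type(AttributeError),
    wait=tenacity.wait_incrementing(
        CREATE_RETRY_INITIAL_DELAY,
        CREATE_RETRY_BACKOFF,
        CREATE_RETRY_MAX_WAIT),
    stop=tenacity.stop_after_attempt(CREATE_RETRY_ATTEMPTS),
    reraise=True)   # surface the original AttributeError if we give up
```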
*** AlexeyAbashkin has quit IRC | 18:40 | |
*** salmankhan has joined #openstack-lbaas | 18:53 | |
*** atoth has quit IRC | 18:55 | |
*** salmankhan has quit IRC | 18:57 | |
johnsom | One minor comment about the log message, otherwise good for me | 19:17 |
johnsom | And I bet lower constraints is wrong | 19:18 |
openstackgerrit | Jan Zerebecki proposed openstack/neutron-lbaas master: Remove eager loading of Listener relations https://review.openstack.org/570596 | 19:19 |
*** salmankhan has joined #openstack-lbaas | 19:46 | |
johnsom | #startmeeting Octavia | 20:00 |
openstack | Meeting started Wed May 30 20:00:07 2018 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot. | 20:00 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 20:00 |
*** openstack changes topic to " (Meeting topic: Octavia)" | 20:00 | |
openstack | The meeting name has been set to 'octavia' | 20:00 |
johnsom | Hi folks | 20:00 |
johnsom | Pretty light agenda today | 20:00 |
cgoncalves | hi | 20:01 |
nmagnezi | o/ | 20:01 |
johnsom | #topic Announcements | 20:01 |
*** openstack changes topic to "Announcements (Meeting topic: Octavia)" | 20:01 | |
johnsom | I don't have much for announcements. The summit was last week. There are lots of good sessions up for viewing on the openstack site. | 20:02 |
johnsom | Octavia was demoed and called out in a Keynote, so yay for that! | 20:02 |
nmagnezi | :-) | 20:02 |
johnsom | We also came up in a number of sessions I attended, so some good buzz | 20:02 |
johnsom | Nir, I think you have an announcement today.... | 20:03 |
nmagnezi | :) | 20:03 |
nmagnezi | I'm happy to announce that TripleO now fully supports Octavia as a One click install. | 20:03 |
nmagnezi | That includes: | 20:03 |
nmagnezi | 1. Octavia services in Docker containers. | 20:03 |
nmagnezi | 2. Creation of the mgmt subnet for amphorae | 20:03 |
nmagnezi | 3. If the user is using RHEL/CentOS based amphora - it will automatically pull an image and load it to glance. | 20:04 |
nmagnezi | Additionally, the SELinux policies for the amphora image are now ready and tested internally. Those policies are available as a part of openstack-selinux package. | 20:04 |
nmagnezi | some pointers (partial list): | 20:04 |
nmagnezi | 1. I'll find the related TripleO docs and provide them next week (what we currently have is release notes ready for Rocky) | 20:04 |
nmagnezi | 2. SELinux: | 20:04 |
nmagnezi | #link https://github.com/redhat-openstack/openstack-selinux/blob/master/os-octavia.te | 20:04 |
xgerman_ | o/ | 20:05 |
nmagnezi | Many people were involved with this effort ( cgoncalves amuller myself bcafarel and more) . And now we can fully support Octavia as an OSP13 component (Based on Queens). | 20:05 |
johnsom | Ok, cool! So glad to see SELinux enabled for the amps | 20:05 |
xgerman_ | +1 | 20:05 |
johnsom | +1000 There are people looking for it. I also mentioned OSP 13 a number of times at the summit | 20:06 |
amuller | I do expect this to drive up Octavia adoption pretty significantly | 20:06 |
xgerman_ | :-) | 20:06 |
nmagnezi | yup. I'm sure operators that are still using the legacy n-lbaas with haproxy will migrate | 20:07 |
cgoncalves | it takes less than 2 people to deploy octavia with tripleo now | 20:07 |
johnsom | Way to go RH/TripleO folks. I know it's been a journey, but it is great to have the capability. | 20:07 |
johnsom | 2 people? I thought it was one-click? One to click the mouse and one to open the beverages? | 20:07 |
johnsom | grin | 20:08 |
nmagnezi | haha | 20:08 |
cgoncalves | johnsom, I said "less than" ;) one is enough to do both jobs :) | 20:08 |
johnsom | Nice. Any other announcements today? | 20:08 |
*** amuller has quit IRC | 20:09 | |
johnsom | #topic Brief progress reports / bugs needing review | 20:09 |
*** openstack changes topic to "Brief progress reports / bugs needing review (Meeting topic: Octavia)" | 20:09 | |
johnsom | I was obviously a bit distracted with the summit and presentation prep. | 20:10 |
nmagnezi | yup :) | 20:10 |
johnsom | We have started merging the provider driver code and the tempest plugin code. We are co-testing the patches as we go. | 20:10 |
rm_work | o/ | 20:11 |
johnsom | rm_work Did point out a race condition in my amp driver last night, so he is working on fixing that. | 20:11 |
johnsom | It's kind of a migration issue, as I wanted to incrementally migrate the amphora driver over to the new provider driver ways. | 20:12 |
johnsom | Today my focus is on adding the driver library support (update stats/status). | 20:12 |
johnsom | I have been chatting with Kobi in the mornings about the VMWare NSX driver, which it sounds like is in-progress. | 20:12 |
nmagnezi | Nice! | 20:13 |
johnsom | He has been giving feedback that I have been including in the cleanup patch. | 20:13 |
johnsom | Along the lines of drivers, it's not clear when we will get an F5 driver. They have had some re-orgs from what I gather, so it may delay that work. | 20:14 |
nmagnezi | good to know. at least we have feedback from one vendor for now. | 20:15 |
johnsom | Yeah, sad that the vendor that was the key author of the spec may not be able to create their driver right away | 20:15 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow DB retries on controller_worker creates https://review.openstack.org/571107 | 20:15 |
johnsom | Any other progress updates today? | 20:16 |
nmagnezi | yes | 20:17 |
nmagnezi | I had some cycles, so I finished my Rally patch to add support for Octavia. it is ready for feedback now and worked okay for me: https://review.openstack.org/#/c/554228/ | 20:17 |
johnsom | In case you are really bored, here is the link to my project update. Feedback welcome. | 20:17 |
johnsom | #link https://youtu.be/woPaywKYljE | 20:17 |
nmagnezi | johnsom, i watched it. you did great work with this. | 20:18 |
johnsom | Thanks | 20:18 |
johnsom | Rally, cool. I need to look at that and refresh my memory of how those gates work | 20:19 |
nmagnezi | #link https://review.openstack.org/#/c/554228/ | 20:19 |
nmagnezi | johnsom, yeah. the scenario i'm adding now is a port of the existing n-lbaas scenario to Octavia | 20:19 |
johnsom | I'm guessing it is the "rally-task-load-balancing" gate I should look at? | 20:19 |
nmagnezi | next up we can add more stuff | 20:20 |
nmagnezi | johnsom, yup | 20:20 |
johnsom | Cool, I will check it out | 20:20 |
nmagnezi | thanks! | 20:21 |
johnsom | Any other updates? | 20:21 |
johnsom | cgoncalves I saw the grenade gate was failing, but didn't have much time to dig into why. Are you looking into that? | 20:22 |
cgoncalves | johnsom, I submitted a new patch set today (yesterday?) to check what's going on when we curl. it fails post-upgrade | 20:22 |
johnsom | Yeah, odd | 20:23 |
cgoncalves | not sure why yet. it successfully passes the same curl pre and during upgrade | 20:23 |
cgoncalves | it started failing all of a sudden | 20:23 |
johnsom | Well, let us know if we can provide a second set of eyes to look into it. | 20:24 |
johnsom | I really want to get that merged and voting. | 20:24 |
cgoncalves | thanks! | 20:24 |
johnsom | So we can start the climb on upgrade tags | 20:24 |
johnsom | Also, we discussed fast-forward upgrades a bit at the summit. | 20:24 |
cgoncalves | yes! | 20:24 |
johnsom | I'm thinking we need to setup grenade starting with Pike (1.0 release) and have gates for Pike->Queens, Queens->Rocky, etc. to prove we can do a fast-forward upgrade | 20:25 |
johnsom | I guess I am jumping ahead... grin | 20:26 |
johnsom | #topic Open Discussion | 20:26 |
*** openstack changes topic to "Open Discussion (Meeting topic: Octavia)" | 20:26 | |
*** mstrohl has joined #openstack-lbaas | 20:28 | |
johnsom | fast-forward is running each upgrade sequentially to move from an older release to current. This is different from leap-frog which is a direct jump Pike->Rocky. It sounds like fast-forward is going to be the supported plan for OpenStack upgrades | 20:28 |
rm_work | fast forward sounds like... a normal upgrade process | 20:29 |
johnsom | Right, just chained together | 20:29 |
nmagnezi | and... fast :) | 20:29 |
rm_work | unless they add stuff like "you don't have to start/stop the control plane at each stage" | 20:29 |
johnsom | Eh, as long as we have a plan and a test I will be happy. | 20:29 |
rm_work | but yeah | 20:29 |
johnsom | I think it's the script of what is required to do that. | 20:30 |
johnsom | Other topics today? | 20:30 |
cgoncalves | IIUC, we could try that now in the grenade patch by changing the base version from queens to pike: https://review.openstack.org/#/c/549654/33/devstack/upgrade/settings@4 | 20:30 |
johnsom | cgoncalves that would be a leap frog though, right? | 20:31 |
cgoncalves | if there are upgrade issues (e.g. deprecated configs), we create a per version directory with upgrade instructions | 20:31 |
cgoncalves | johnsom, ah right | 20:32 |
johnsom | yeah, we need to write up an upgrade doc that lays out the steps. Maybe from there link to any upgrade issues or just link to the release notes | 20:32 |
* johnsom notices the room going quiet.... | 20:33 | |
johnsom | lol | 20:33 |
nmagnezi | everyone likes docs.. | 20:34 |
nmagnezi | :) | 20:34 |
johnsom | In fairness, I think we can just pull the grenade back to the Queens branch and set it up. This would give us the FFU gates we need. | 20:34 |
johnsom | Once we get it stable on master | 20:34 |
cgoncalves | +1 | 20:35 |
nmagnezi | yup. sounds right | 20:36 |
johnsom | Ok, if there aren't any more topics, have a good week and we will chat next Wednesday! | 20:38 |
rm_work | something is nagging at me about that... but sure, probably | 20:38 |
rm_work | o/ REVIEW STUFF | 20:38 |
johnsom | Yes, please. I really hope to do a client release soon | 20:38 |
johnsom | Would love to get some reviews on that | 20:39 |
johnsom | #endmeeting | 20:40 |
*** openstack changes topic to "Discussion of OpenStack Load Balancing (Octavia) | https://etherpad.openstack.org/p/octavia-priority-reviews" | 20:40 | |
openstack | Meeting ended Wed May 30 20:40:00 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:40 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/octavia/2018/octavia.2018-05-30-20.00.html | 20:40 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/octavia/2018/octavia.2018-05-30-20.00.txt | 20:40 |
openstack | Log: http://eavesdrop.openstack.org/meetings/octavia/2018/octavia.2018-05-30-20.00.log.html | 20:40 |
rm_work | ah yeah i need to look at client again | 20:40 |
johnsom | I think status and timeouts are ready for review. | 20:41 |
johnsom | UDP might be, there were some updates pending on the controller side | 20:41 |
rm_work | i need to get scenario tests done on the l7 stuff so we can merge those and then merge yours | 20:42 |
johnsom | +1 | 20:42 |
rm_work | re: the comments earlier about amps not showing who is master -- i could implement it to update our DB at least the way that I do it in my env | 20:43 |
rm_work | which is, on takeover, keepalived can run a script -- and that script can trigger a health message with the takeover notification | 20:43 |
rm_work | so we could update the db | 20:43 |
* rm_work shrugs | 20:43 | |
johnsom | Well, I have a bunch of concerns about that frankly | 20:43 |
rm_work | it's just a third type of health message, which is like "hey, i'm a master now, do with that what you will" | 20:44 |
rm_work | and the default driver can just be like "ok, updating amps in the db" | 20:44 |
rm_work | it's not guaranteed to be super accurate | 20:45 |
johnsom | Yeah, plus new DB columns, a bunch of latency, etc. Just is it really that useful of info? | 20:45 |
rm_work | but it's better than what it does now | 20:45 |
rm_work | no new db columns | 20:45 |
rm_work | we already have a column for role | 20:45 |
johnsom | Oh yes! You CANNOT change the ROLE column | 20:45 |
rm_work | lol k | 20:45 |
* rm_work does | 20:45 | |
johnsom | That would be a big problem | 20:45 |
rm_work | that's how it is in my prod | 20:45 |
rm_work | have yet to see any problems | 20:45 |
johnsom | That dictates the settings pushed to the amps, like priorities and peers | 20:46 |
rm_work | https://review.openstack.org/#/c/435612/132/octavia/amphorae/drivers/health/flip_utils.py@85 | 20:47 |
johnsom | My guess is your failovers aren't happening right and likely pre-empt the master when a failed amp is brought back in | 20:47 |
johnsom | That column was never intended to be a status column | 20:47 |
rm_work | it wouldn't push new configs to keepalived | 20:47 |
rm_work | unless one of the amps actually gets updated | 20:47 |
johnsom | Right, but that column dictates the settings deployed into the keepalived config file when we build replacements | 20:48 |
rm_work | at which point... it then pushes to both | 20:48 |
rm_work | yeah, i build a replacement BACKUP immediately actually | 20:48 |
rm_work | which gets the new backup settings | 20:48 |
johnsom | We have had this conversation, if I remember right you are ok with the downtime | 20:49 |
rm_work | what downtime? | 20:49 |
rm_work | I mean, we already have some downtime just for the FLIP to move, on a fail | 20:49 |
rm_work | but there's no other downtime involved | 20:49 |
johnsom | right | 20:49 |
johnsom | That eats the keepalived downtime | 20:49 |
rm_work | we always reload keepalived on amp configs | 20:50 |
rm_work | not sure how there would be any different downtime | 20:50 |
xgerman_ | I would rather have https://docs.openstack.org/octavia/latest/contributor/api/haproxy-amphora-api.html#get-amphora-topology working and then relay back to the amp API - my 2cts | 20:50 |
johnsom | xgerman_ I don't think we want anymore "proxy" code right now | 20:51 |
rm_work | eugh, public API endpoint that can trigger an actual callout to dataplane -> not something i want to see happen | 20:51 |
xgerman_ | I said relay — not proxy | 20:51 |
johnsom | <xgerman_> German Eichberger johnsom: I was thinking proxying https://docs.openstack.org/octavia/latest/contributor/api/haproxy-amphora-api.html — but you probably guessed that :-) | 20:52 |
johnsom | That was from this morning. | 20:52 |
xgerman_ | yes, that’s why I said relay — I am learning | 20:53 |
rm_work | i get the idea -- it's not exactly a proxy I think -- but it involves synchronously reaching out to the dataplane | 20:53 |
rm_work | which is meh | 20:53 |
rm_work | i would rather not expose a direct path from user->dataplane | 20:53 |
rm_work | though, it is just admins | 20:54 |
johnsom | yeah, I struggle with the value of it. To me it is a cattle detail hidden from view that can change at any time autonomously from the control plane. | 20:54 |
rm_work | it's essentially useful just for testing | 20:54 |
rm_work | which is at least 3/4 of why i even wrote the amp API | 20:54 |
xgerman_ | if it’s just tetsing they can hit the amp directlly | 20:55 |
rm_work | so i could do stuff in tempest without reaching into the DB | 20:55 |
xgerman_ | I was more thinking for some automation stuff - e.g. change roles (there is also a set) | 20:55 |
rm_work | yes, for pre-fails that would be nice | 20:55 |
rm_work | for the failover call and such | 20:55 |
rm_work | and my evac | 20:56 |
johnsom | It's not like you can force a role change outside of a failure. There is no keepalived call to do that (thank goodness) | 20:56 |
xgerman_ | also seeing amps flap between active-passive is useful | 20:56 |
johnsom | Is it? | 20:57 |
xgerman_ | yes, if it happens every 3s | 20:57 |
johnsom | User wouldn't know | 20:57 |
xgerman_ | the network is wonky | 20:57 |
rm_work | the latency +/- on that would be a good deal of that 3s lol | 20:57 |
xgerman_ | yes, it’s useful for admin/ monitoring | 20:57 |
*** KeithMnemonic has joined #openstack-lbaas | 20:57 | |
xgerman_ | users don’t care… | 20:58 |
johnsom | I think that is a poor metric. It doesn't tell you anything about why or if it is even a bad thing. | 20:58 |
johnsom | If you want that level of deep monitoring, you need the logs that would actually tell you why in addition to if it changed | 20:59 |
johnsom | But again, this is treating them like pets, not cattle | 20:59 |
xgerman_ | I see monitoring as detecting potential error conditions and then investigating through logs | 21:00 |
xgerman_ | but if I would see a ton of flapping and heard that networking did something… | 21:01 |
rm_work | the biggest thing i see it being useful for is tempest failover testing | 21:03 |
johnsom | Again, I think role changes are noise and not an actionable event | 21:03 |
*** sapd_ has joined #openstack-lbaas | 21:03 | |
johnsom | Right, I think the only real use case is testing IMO. Which could be done other ways, like ssh in and dump the status file | 21:04 |
johnsom | Or just do a dual failure to guarantee the initial state | 21:04 |
rm_work | hmm | 21:05 |
xgerman_ | now, to what I actually wanted to do: check zuul for ending jobs — didn’t they have a page I could check by Change-id? | 21:06 |
johnsom | http://zuul.openstack.org/builds.html | 21:06 |
johnsom | Change ID, not so sure | 21:07 |
*** sapd has quit IRC | 21:07 | |
*** samccann has quit IRC | 21:12 | |
*** samccann has joined #openstack-lbaas | 21:14 | |
*** salmankhan has quit IRC | 21:29 | |
*** harlowja has joined #openstack-lbaas | 21:52 | |
lxkong | morning, guys, I have a question, is it possible that the package data sent from amphora to health-manager is in non-ascii format? | 21:57 |
lxkong | we met with some error message in health-manager log: `Health Manager experienced an exception processing a heartbeat message from ('100.64.1.21', 31538). Ignoring this packet. Exception: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128): UnicodeDecodeError: 'ascii' codec can't decode byte 0xc1 in position 0: ordinal not in range(128)` | 21:57 |
lxkong | so i made a patch to oslo_utils here https://review.openstack.org/#/c/570826/ | 21:58 |
lxkong | seems the method hmac.compare_digest doesn't support non-ascii | 21:58 |
lxkong | but our issue was solved using this patch | 21:59 |
rm_work | hmmm | 21:59 |
rm_work | that's weird | 21:59 |
johnsom | Yeah, I think it is always non-ascii... It should be raw bytes | 22:00 |
rm_work | ah Ben's link is interesting | 22:00 |
rm_work | `a and b must both be of the same type: either str (ASCII only, as e.g. returned by HMAC.hexdigest()), or a bytes-like object` | 22:00 |
lxkong | yeah, that's from the doc of hmac module | 22:03 |
johnsom | lxkong Do you have the traceback? | 22:04 |
lxkong | johnsom: i don't have traceback now, but according to my debugging, the error happens here: https://github.com/openstack/oslo.utils/blob/0fb1b0aabe100bb36d0e4ad6d5a9f96dd8eb6ff6/oslo_utils/secretutils.py#L39 | 22:05 |
lxkong | and i also printed the data received, it's non-ascii | 22:05 |
johnsom | Yeah, so the data it's comparing is generated here: https://docs.python.org/2/library/hmac.html#hmac.HMAC.digest | 22:07 |
johnsom | Which is clear, it is not ASCII | 22:08 |
lxkong | yeah | 22:08 |
johnsom | We probably should switch to using hmac.compare_digest | 22:08 |
johnsom | () | 22:08 |
rm_work | err | 22:11 |
lxkong | the method has different explanations in Python 2.7.7 and 3.3 | 22:11 |
johnsom | Yeah, I see that | 22:12 |
rm_work | right so | 22:12 |
rm_work | the point of that is because we can't use hmac.compare_digest :P | 22:12 |
lxkong | yeah | 22:12 |
rm_work | it does use it, if it can | 22:12 |
johnsom | Oh, yeah, I see it | 22:13 |
lxkong | do you think we should fix that in octavia? | 22:13 |
rm_work | so it looks like we SHOULD have been using hexdigest() instead of digest() ? | 22:13 |
johnsom | So, right, maybe for the decode we use hexdigest and for the encode use digest? | 22:14 |
rm_work | err | 22:14 |
johnsom | So, don't we still want to pass the hmac in binary over the wire (half the size) but use the hex dump for the compare? | 22:16 |
rm_work | yeah i am trying to find where we actually do the encode/decode | 22:16 |
johnsom | https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py | 22:17 |
lxkong | octavia/amphorae/backends/health_daemon/status_message.py | 22:17 |
johnsom | No, that won't work, it has to be hexdigest all the way around | 22:17 |
rm_work | ah right, same file for both | 22:17 |
rm_work | erm | 22:17 |
rm_work | so should we even be encoding in utf-8 then? | 22:18 |
rm_work | hmac.new(key.encode("utf-8"), payload, hashlib.sha256) | 22:18 |
rm_work | oh that's just the key | 22:18 |
johnsom | That is the key, so yes | 22:18 |
rm_work | err | 22:18 |
rm_work | we also do the payload tho | 22:18 |
rm_work | https://github.com/openstack/octavia/blob/master/octavia/amphorae/backends/health_daemon/status_message.py#L37 | 22:19 |
rm_work | ah and then it becomes binary | 22:19 |
johnsom | Before it's compressed, so again, doesn't matter | 22:19 |
rm_work | yeah | 22:19 |
johnsom | I guess we just have to double the size of the hmac sig on the packets by making them hex | 22:20 |
rm_work | so... yeah we just switch to hexdigest maybe | 22:20 |
rm_work | hmmm | 22:20 |
rm_work | how big is that | 22:20 |
johnsom | 32 bin, 64 hex | 22:20 |
rm_work | bits? bytes? | 22:21 |
rm_work | I assume bytes | 22:21 |
johnsom | bytes | 22:21 |
rm_work | so ... ok, not ideal, but meh | 22:21 |
rm_work | we've got .... 65507 bytes to work with? :P | 22:22 |
rm_work | though that may not function on all networks? | 22:22 |
*** rcernin has joined #openstack-lbaas | 22:22 | |
johnsom | Unless there is some way to make those both "bytes like objects" | 22:25 |
rm_work | i mean, that's what they should be | 22:25 |
rm_work | i believe that IS what we send | 22:26 |
rm_work | essentially just a bytefield | 22:26 |
johnsom | So, it must be this that is hosing it over to python "what is a string" hell: envelope[-hash_len:] | 22:27 |
rm_work | can debug through that and see | 22:28 |
rm_work | one sec | 22:28 |
rm_work | nope | 22:29 |
rm_work | still bytes objects | 22:30 |
rm_work | when we pass to | 22:30 |
rm_work | secretutils.constant_time_compare(expected_hmc, calculated_hmc) | 22:30 |
rm_work | both expected and calculated are bytes() | 22:30 |
rm_work | one sec let me see which python i'm on | 22:30 |
rm_work | ok in py2 it's all str() | 22:31 |
rm_work | in py3 it's all bytes() | 22:31 |
rm_work | but even the original envelope is str() in py2 | 22:31 |
rm_work | we DO test this in our unit tests | 22:32 |
rm_work | it's possible it's an issue with py3 on amps and py2 on controlplane, or vice versa? | 22:32 |
rm_work | lxkong: what version of python does your HM run on? | 22:32 |
lxkong | let me see | 22:33 |
lxkong | the hm is on 2.7.6, the python inside amphora is 2.7.12 | 22:34 |
lxkong | using hexdigest() can also solve the issue. What are you talking about? | 22:36 |
lxkong | length? | 22:36 |
*** mstrohl has quit IRC | 22:36 | |
*** fnaval has quit IRC | 22:37 | |
johnsom | lxkong BTW, you might want to add a docs page in Octavia that talks about using Octavia with K8s and your ingress work. Just so people can find it, etc. | 22:41 |
lxkong | i don't understand what's the potential problem it will bring if we replace digest() with hexdigest()? | 22:41 |
lxkong | johnsom: sure, i will | 22:42 |
johnsom | lxkong Kind of like I did for the SDKs: https://docs.openstack.org/octavia/latest/user/sdks.html | 22:42 |
lxkong | yeah, i will find an appropriate place to advertise that work :-) | 22:43 |
johnsom | Well, it doubles the size of the hmac data being sent. We want our overall data size to fit in a single UDP packet, so trying to keep the size down is good practice | 22:43 |
lxkong | ah, ok yeah, 32 should be 64 | 22:43 |
johnsom | Yeah, if we switch to hexdigest it will become 64 | 22:44 |
lxkong | but reducing from 64 to 32 is just a mitigation | 22:44 |
lxkong | any other options do we have to solve the problem? | 22:45 |
johnsom | It's an optimization to have it stay 32 bytes like it is now, true. | 22:45 |
lxkong | or is anything i can help? | 22:45 |
johnsom | That is a question for rm_work, I think he is poking at this one | 22:46 |
rm_work | eh, i looked, but | 22:47 |
rm_work | i don't really have time to | 22:47 |
lxkong | or we could keep a copy of oslo_utils/secretutils.py but change things i proposed :-) | 22:47 |
rm_work | nor have i seen this issue? | 22:47 |
rm_work | well, I think we might have to do the hexdigest thing | 22:47 |
rm_work | to be actually legit | 22:47 |
johnsom | Yeah, so maybe just do that for now. I think I have seen this, but found out it really was some other issue that was triggering it. Not sure. It's vaguely familiar | 22:48 |
lxkong | so do you guys have time to fix that atm? i can definitely do that, because maybe we are the only ones affected | 22:50 |
johnsom | Go for it, my vote is the hexdigest approach. We need to handle backwards compatibility though... I.e. operator updates the HM controller, but still has digest() amps | 22:52 |
rm_work | eugh yeah | 22:53 |
johnsom | It also might be worth debugging if you can reproduce, see what types we have and the payload | 22:53 |
lxkong | johnsom: do you mean we should both check hmc.digest and hmc.hexdigest to make sure we don't break the old amps? | 22:54 |
johnsom | Yes, we will have to fall back in some way | 22:55 |
lxkong | ok | 22:55 |
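A rough sketch of the compatibility approach agreed on here, using simplified helpers rather than the real status_message.py functions: sign new heartbeats with the ASCII-safe hexdigest (64 bytes on the wire instead of 32) and, on the health manager side, accept the hex signature first but fall back to the old binary digest so amphorae that predate the change keep working.

```python
import hashlib
import hmac

from oslo_utils import secretutils

_HEX_LEN = hashlib.sha256().digest_size * 2   # 64
_BIN_LEN = hashlib.sha256().digest_size       # 32


def sign(payload, key):
    # payload: compressed message bytes; key: the heartbeat key string
    digest = hmac.new(key.encode('utf-8'), payload, hashlib.sha256)
    return payload + digest.hexdigest().encode('utf-8')


def check_hmac(envelope, key):
    # Newer amps: 64-byte hex signature.
    data, sig = envelope[:-_HEX_LEN], envelope[-_HEX_LEN:]
    expected = hmac.new(key.encode('utf-8'), data, hashlib.sha256)
    if secretutils.constant_time_compare(
            sig, expected.hexdigest().encode('utf-8')):
        return data
    # Older amps: fall back to the 32-byte binary digest.
    data, sig = envelope[:-_BIN_LEN], envelope[-_BIN_LEN:]
    expected = hmac.new(key.encode('utf-8'), data, hashlib.sha256)
    if secretutils.constant_time_compare(sig, expected.digest()):
        return data
    raise ValueError('HMAC verification failed')
```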
openstackgerrit | Lingxian Kong proposed openstack/octavia master: [WIP] Use HMAC.hexdigest to avoid non-ascii characters for package data https://review.openstack.org/571333 | 23:46 |
lxkong | johnsom, rm_work: when you are available, please take a look at https://review.openstack.org/571333, I want to make sure that's the way you expected before I start adding unit tests | 23:50 |
lxkong | it's already verified in our preprod | 23:51 |
johnsom | One comment, but otherwise yeah, I think that is the path | 23:55 |
openstackgerrit | Lingxian Kong proposed openstack/octavia master: [WIP] Use HMAC.hexdigest to avoid non-ascii characters for package data https://review.openstack.org/571333 | 23:59 |
lxkong | johnsom: thanks | 23:59 |