Friday, 2019-02-22

00:41 *** yamamoto has quit IRC
00:44 *** yamamoto has joined #openstack-lbaas
00:48 *** yamamoto has quit IRC
01:26 *** takamatsu_ has joined #openstack-lbaas
01:26 *** takamatsu has quit IRC
01:41 *** yamamoto has joined #openstack-lbaas
01:46 <openstackgerrit> Michael Johnson proposed openstack/octavia master: L7rule support client certificate cases  https://review.openstack.org/612271
02:04 <openstackgerrit> caoyuan proposed openstack/octavia-tempest-plugin master: Update json module to jsonutils  https://review.openstack.org/638332
02:04 *** Dinesh_Bhor has joined #openstack-lbaas
02:08 <eandersson> johnsom, that gate error is an odd one for stable/rocky
02:08 <eandersson> http://logs.openstack.org/54/638454/1/check/octavia-grenade/54ceb48/logs/screen-o-cw.txt.gz
02:10 <eandersson> pretty sure the rocky gate is just broken
02:10 <rm_work> ah, i remember there were some issues we were seeing in the grenade test, might be the same
02:11 <rm_work> not sure if cgoncalves actually fixed that yet (or was working on it)
02:11 <eandersson> Out of interest, where is that Internal Error coming from?
02:11 <eandersson> > self.client.plug_network(amphora, port_info)
02:11 <rm_work> it's the 500 from the api API
02:11 <eandersson> Reading the code I couldn't understand which api it is hitting
02:11 <rm_work> *amp API
02:11 <cgoncalves> rm_work, I failed to reproduce it locally. ran grenade and all went fine :S
02:11 <cgoncalves> eandersson, that's the million dollar question :)
02:12 <rm_work> the amp API is exploding, usually it's because something in the networking is broken
02:12 <eandersson> Where can I see the amp API logs?
02:12 <rm_work> you using the same versions of everything locally?
02:12 <cgoncalves> rm_work, I guess...
02:12 <rm_work> eandersson: i don't think we ever successfully got those shipped off and stored in test results :/
02:12 <eandersson> awh :'(
02:12 <cgoncalves> rm_work, ubuntu ctrl and amp
02:12 <rm_work> cgoncalves: i mean, is your build getting exactly the same ubuntu image
02:13 <rm_work> they could have broken stuff in their networking
02:13 <rm_work> basically, when i see bit-rot in stable gates, i see "stuff changing that's outside our version control"
02:13 <rm_work> which means ubuntu or a few random other python libs
02:13 <eandersson> I think both Trove and Ironic allow you to access logs from the api, right? Maybe that would be nice to have.
02:14 <rm_work> yes
02:14 <rm_work> though if the API is broken... >_>
02:14 <eandersson> haha yea
02:14 <eandersson> I think Ironic pushes the logs
02:14 <rm_work> honestly that's not the worst idea, we have been looking at shipping via SSH but
02:14 <rm_work> it might be ok to just ... uhh
02:14 <cgoncalves> rm_work, yes. I mean, why wouldn't it? I didn't change the default amp settings so it picks ubuntu xenial or whatever
02:15 <rm_work> after we get a 500, immediately hit the "log" API and have it give us back a snapshot of the last 100 lines of the agent log or something
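
The "hit a log endpoint after a 500" idea above, sketched out. This is purely illustrative: the amphora agent has no /log endpoint today, so the URL, parameter, and helper below are all hypothetical.

    # Hypothetical sketch: after the amphora agent returns a 5xx, fetch a
    # tail of its log so the gate run records *why* the agent exploded.
    # The /log endpoint is assumed; it does not exist in the real agent API.
    import requests

    def plug_network_with_log_capture(base_url, port_info, lines=100):
        resp = requests.post('%s/plug/network' % base_url,
                             json=port_info, timeout=30)
        if resp.status_code >= 500:
            try:
                log_tail = requests.get('%s/log?lines=%d' % (base_url, lines),
                                        timeout=10).text
            except requests.RequestException:
                log_tail = '<agent log unavailable>'
            raise RuntimeError('amphora agent returned %d; log tail:\n%s'
                               % (resp.status_code, log_tail))
        resp.raise_for_status()
        return resp
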
02:15 <rm_work> rofl
02:15 <cgoncalves> also ran xenial on controller
02:15 <rm_work> cgoncalves: i mean, you JUST built that image? not sure if you are an "always clean env" kind of person or not LP
02:15 <rm_work> :P
02:16 <rm_work> i never use the same devstack twice, but a lot of people on my old teams thought i was weird
02:17 <cgoncalves> rm_work, always clean
02:17 <cgoncalves> grenade has a cleanup script that deletes everything
02:17 <rm_work> ;)
02:17 <johnsom> So, I think we need to enable the dib log output for that job. I think there is an image issue. We need the SHA for the amphora-agent version inside that image
02:17 <cgoncalves> I even pip uninstall all
02:19 <johnsom> We are close to having log shipping from the amp, it just needs some more thought/work. We talked about it two or so meetings ago.
02:21 <johnsom> But, first and easy things: get the DIB log captured.
02:22 <eandersson> btw, another topic - would it be possible to move some of the biz logic in the health driver out of UpdateHealthDb? Ideally something like UpdateStatsDb
02:23 <eandersson> That call is so expensive on large load balancers with lots of members
02:24 <johnsom> I have to run tonight, but we can chat about that tomorrow
02:24 <eandersson> Sounds good
02:29 *** fnaval has quit IRC
02:48 <rm_work> eandersson: the main problem is that the stats data *comes from* the health call, so we have to do SOMETHING with it there
02:48 <rm_work> it's possible we could like... dump one giant string somewhere and have another process come along and actually process it (HK?)
02:48 <rm_work> but that seems a bit messy
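
For what it's worth, the "dump it somewhere and process it later" pattern rm_work is describing usually looks something like the sketch below. Every name in it is invented for illustration; this is not how Octavia's health manager is structured, and real heartbeat packets are binary and HMAC-signed rather than plain JSON.

    # Illustrative only: keep the UDP receive path trivially cheap and push
    # the expensive health/stats persistence into separate worker processes.
    import json
    import multiprocessing  # workers would be multiprocessing.Process

    def receiver(sock, queue):
        # Hot path: read a packet, enqueue it, immediately go back to recv.
        while True:
            data, addr = sock.recvfrom(65535)
            queue.put((data, addr))

    def worker(queue, update_health_db, update_stats_db):
        # Cold path: a separate process absorbs the DB round trips.
        while True:
            data, addr = queue.get()
            message = json.loads(data)  # real packets need HMAC checks first
            update_health_db(message, addr)
            update_stats_db(message, addr)
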
02:48 <eandersson> For sure
02:49 <eandersson> But at the moment getting a lb with 400+ members takes like 40s
02:49 <rm_work> i thought we passed that off to another set of threads at least?
02:49 <rm_work> augh, what LB is that
02:49 <eandersson> Worst case scenario of course
02:49 <rm_work> i feel like once you have 400 members you need more than one LB and like, RR-DNS or something
02:49 <eandersson> but even with, let's say, 50 members the call takes ~10s
02:50 <rm_work> yeah, that's not great
02:50 <rm_work> the time is in the DB fetch?
02:50 <eandersson> Yea
02:50 <eandersson> Building that db object is super expensive
02:50 <rm_work> I still feel like it SHOULDN'T take that long
02:50 <rm_work> eugh
02:50 <rm_work> we could avoid objects if we hate ourselves
02:51 <rm_work> we've done it in one other place, I thought
02:51 <rm_work> actually, possibly in the health stuff?
02:51 <rm_work> I remember johnsom had some raw querying going on
02:51 <eandersson> Interesting - I might have missed something
02:51 <rm_work> because it just SHOULD NOT take that long to deal with 40 or even 1000 members
02:51 <rm_work> DBs are super quick
02:51 <rm_work> that's the point
02:52 <rm_work> (well, part of the point)
02:52 <eandersson> https://github.com/openstack/octavia/blob/22df2e0f484b3560cbb47d33b7f2f62cb57519fd/octavia/db/repositories.py#L1252
02:52 <rm_work> whenever i see DB loading taking that long, i remember running queries on a 1 billion+ row table that returned 100s of rows of results, and that taking subsecond times at my first job
02:52 <rm_work> but that was MSSQL so
02:53 <eandersson> I based my stats on just getting a lb in general, not how long it takes in the health manager call
02:53 <eandersson> So it might not be as bad as 10s for 50 members
02:53 <eandersson> I'll probably need to profile that whole function
02:53 <rm_work> ah yeah, there it is, you found it
02:54 <eandersson> https://github.com/openstack/octavia/blob/master/octavia/controller/healthmanager/health_drivers/update_db.py#L128
02:54 <eandersson> It's... so massive
02:54 <rm_work> no objects for us there :P
02:54 <eandersson> Yea, might have been a bad assumption by me
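
Profiling "that whole function", as eandersson suggests, is cheap to do with the standard library. The module path below comes from the link above; the update_health(health, srcaddr) call signature is an assumption and may differ between releases.

    # Quick profile of the health update path to see where the time goes.
    import cProfile
    import pstats

    from octavia.controller.healthmanager.health_drivers import update_db

    def profile_update(health_message, srcaddr='192.0.2.10'):
        driver = update_db.UpdateHealthDb()
        prof = cProfile.Profile()
        prof.enable()
        driver.update_health(health_message, srcaddr)  # signature assumed
        prof.disable()
        # Sort by cumulative time so the DB round trips float to the top.
        pstats.Stats(prof).sort_stats('cumulative').print_stats(20)
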
02:56 <johnsom> Can't talk tonight, but would be interested in exactly your use case. That query is super optimized and fast. Plus I can simulate configs for hm testing.
02:56 <johnsom> Already getting very bad looks from the spouse.
02:56 <rm_work> lolol
02:56 <rm_work> put down the phone
02:57 *** yamamoto has quit IRC
02:58 <eandersson> We are just worried about the usage we are seeing at the moment, I think colin- mentioned this already
02:59 <eandersson> and we are not really using octavia much yet
03:00 <eandersson> It's not the end of the world, but just something we are worried about.
03:02 *** yamamoto has joined #openstack-lbaas
03:04 *** psachin has joined #openstack-lbaas
03:06 <rm_work> yeah, i agree with you about the HM usage; any way we can possibly get that to be more efficient, we should pursue
03:06 <rm_work> having it run too slowly can have catastrophic results (i know because i have dealt with related issues, in real production environments, more than once)
03:07 *** abaindur has quit IRC
03:08 *** abaindur has joined #openstack-lbaas
03:08 *** abaindur has quit IRC
03:09 *** abaindur has joined #openstack-lbaas
03:10 *** abaindur has quit IRC
03:55 *** Dinesh_Bhor has quit IRC
03:56 *** yamamoto has quit IRC
03:59 *** yamamoto has joined #openstack-lbaas
04:00 *** Dinesh_Bhor has joined #openstack-lbaas
04:05 *** yamamoto has quit IRC
04:23 <cgoncalves> LB with 400+ members but "not really using octavia much yet". that's quite an understatement
04:47 *** Dinesh_Bhor has quit IRC
04:48 *** ramishra has joined #openstack-lbaas
04:49 *** abaindur has joined #openstack-lbaas
04:54 *** Dinesh_Bhor has joined #openstack-lbaas
04:56 <openstackgerrit> Carlos Goncalves proposed openstack/octavia master: WIP: Add RHEL 8 amphora support  https://review.openstack.org/638581
04:56 <cgoncalves> RHEL8-based LB is ACTIVE :)
05:12 *** yamamoto has joined #openstack-lbaas
05:13 *** pck has joined #openstack-lbaas
05:16 *** yamamoto has quit IRC
05:26 *** abaindur has quit IRC
05:37 *** yamamoto has joined #openstack-lbaas
06:50 *** yamamoto has quit IRC
06:50 *** yamamoto has joined #openstack-lbaas
07:13 *** Dinesh_Bhor has quit IRC
07:14 *** ivve has joined #openstack-lbaas
07:15 *** Dinesh_Bhor has joined #openstack-lbaas
07:35 <openstackgerrit> Jacky Hu proposed openstack/octavia-dashboard master: WIP: Add load balancer flavor support  https://review.openstack.org/638365
07:44 *** ccamposr has joined #openstack-lbaas
07:56 *** yamamoto has quit IRC
07:58 *** yamamoto has joined #openstack-lbaas
08:02 *** yamamoto has quit IRC
08:09 *** yamamoto has joined #openstack-lbaas
08:11 *** celebdor has joined #openstack-lbaas
08:13 *** yamamoto has quit IRC
08:19 *** celebdor has quit IRC
08:22 *** yamamoto has joined #openstack-lbaas
08:27 *** abaindur has joined #openstack-lbaas
08:34 <dayou> Cool
09:28 *** yamamoto has quit IRC
09:34 *** ivve has quit IRC
09:43 *** celebdor has joined #openstack-lbaas
09:46 *** abaindur has quit IRC
10:04 *** takamatsu_ has quit IRC
10:05 *** takamatsu has joined #openstack-lbaas
10:08 *** yamamoto has joined #openstack-lbaas
10:13 *** yamamoto has quit IRC
10:23 *** Dinesh_Bhor has quit IRC
10:23 *** takamatsu_ has joined #openstack-lbaas
10:24 *** takamatsu has quit IRC
10:25 *** Dinesh_Bhor has joined #openstack-lbaas
10:31 *** rcernin has quit IRC
10:48 *** takamatsu_ has quit IRC
10:49 *** takamatsu has joined #openstack-lbaas
10:54 *** takamatsu has quit IRC
10:58 *** takamatsu has joined #openstack-lbaas
11:30 *** yamamoto has joined #openstack-lbaas
11:34 *** yamamoto has quit IRC
11:38 *** Dinesh_Bhor has quit IRC
11:55 *** Dinesh_Bhor has joined #openstack-lbaas
11:57 *** Dinesh_Bhor has quit IRC
12:16 *** yamamoto has joined #openstack-lbaas
12:35 *** ivve has joined #openstack-lbaas
12:43 <openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Add RHEL 8 amphora support  https://review.openstack.org/638581
12:48 <openstackgerrit> Carlos Goncalves proposed openstack/octavia-tempest-plugin master: DNM: octavia-v2-dsvm-scenario-rhel-8 job  https://review.openstack.org/638656
13:03 *** trown|outtypewww is now known as trown
13:35 *** psachin has quit IRC
13:43 *** yamamoto has quit IRC
13:43 *** yamamoto has joined #openstack-lbaas
13:47 <openstackgerrit> Carlos Goncalves proposed openstack/octavia-tempest-plugin master: DNM: octavia-v2-dsvm-scenario-rhel-8 job  https://review.openstack.org/638656
14:28 <openstackgerrit> Carlos Goncalves proposed openstack/octavia-tempest-plugin master: DNM: octavia-v2-dsvm-scenario-rhel-8 job  https://review.openstack.org/638656
15:09 <openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Add Python 3.7 support  https://review.openstack.org/635236
15:51 *** yamamoto has quit IRC
16:28 *** yamamoto has joined #openstack-lbaas
16:32 *** yamamoto has quit IRC
17:17 *** gcheresh has joined #openstack-lbaas
17:18 *** trown is now known as trown|lunch
17:18 *** celebdor has quit IRC
17:24 *** yamamoto has joined #openstack-lbaas
17:29 *** yamamoto has quit IRC
17:29 *** cranges has joined #openstack-lbaas
17:48 *** takamatsu_ has joined #openstack-lbaas
17:48 *** takamatsu has quit IRC
17:51 *** ccamposr has quit IRC
18:00 *** cranges has quit IRC
18:03 *** takamatsu_ has quit IRC
18:06 *** takamatsu_ has joined #openstack-lbaas
18:12 <xgerman> well, we can get amp logs if we merge my log patch
18:12 <xgerman> (scrollback reading)
18:12 *** gcheresh has quit IRC
18:14 <johnsom> First step is to turn on the dib log collection there....
18:25 <openstackgerrit> Michael Johnson proposed openstack/octavia master: L7rule support client certificate cases  https://review.openstack.org/612271
18:37 <xgerman> yep, jumping ahead
18:40 *** ramishra has quit IRC
18:40 *** trown|lunch is now known as trown
18:42 <openstackgerrit> Michael Johnson proposed openstack/octavia master: L7rule support client certificate cases  https://review.openstack.org/612271
18:47 <openstackgerrit> Michael Johnson proposed openstack/python-octaviaclient master: Add 'client_ca_tls_container_ref' in Listener on client side  https://review.openstack.org/616158
18:59 <openstackgerrit> Michael Johnson proposed openstack/python-octaviaclient master: Add 'client_authentication' in Listener on client  https://review.openstack.org/616879
19:06 <openstackgerrit> Michael Johnson proposed openstack/python-octaviaclient master: Add client_crl_container_ref for Listener API in CLI  https://review.openstack.org/617619
19:09 <openstackgerrit> Michael Johnson proposed openstack/python-octaviaclient master: Add 4 l7rule types into Octavia CLI  https://review.openstack.org/618716
19:12 *** yamamoto has joined #openstack-lbaas
19:16 *** yamamoto has quit IRC
19:22 <openstackgerrit> Michael Johnson proposed openstack/python-octaviaclient master: Add client_crl_container_ref for Listener API in CLI  https://review.openstack.org/617619
19:23 <openstackgerrit> Michael Johnson proposed openstack/python-octaviaclient master: Add 4 l7rule types into Octavia CLI  https://review.openstack.org/618716
19:52 *** yamamoto has joined #openstack-lbaas
19:57 *** yamamoto has quit IRC
20:06 <johnsom> rm_work Are you around? I have a certs question
20:14 <eandersson> xgerman, what patch is that?
20:15 <johnsom> eandersson It's this one: https://review.openstack.org/624835 But we are talking about doing it a bit differently
20:16 <johnsom> eandersson Are you going to be around in ~30 minutes? I need to go make lunch, but would like to talk about HM and your setup.
20:18 <eandersson> Getting lunch, but should be back around then
20:19 <rm_work> Yeah johnsom
20:44 <johnsom> rm_work Doh, bad timing on my part, sorry about that. So the question is about the CA cert for client authentication.
20:44 <johnsom> Should we make people stuff that in a pkcs12, or should we store it separately?
20:45 <johnsom> Right now, the cert parser stuff pukes on a pkcs12 with no main cert in it.
20:56 *** hyang has joined #openstack-lbaas
20:59 *** hyang has quit IRC
21:04 *** yamamoto has joined #openstack-lbaas
21:06 *** notcolin has joined #openstack-lbaas
21:08 *** notcolin has quit IRC
21:09 *** yamamoto has quit IRC
21:09 *** notcolin has joined #openstack-lbaas
21:16 *** notcolin has quit IRC
21:20 *** henriqueof has quit IRC
21:37 <johnsom> eandersson Available to chat about HM?
21:37 <eandersson> Sure
21:38 <eandersson> colin- is here as well
21:38 <johnsom> Excellent. So if I remember right, you have ~650 amphorae, and some or all have 400 members?
21:38 <eandersson> So the 400-member case is not on octavia at the moment
21:38 <johnsom> And you are running one instance of the HM process
21:39 *** yamamoto has joined #openstack-lbaas
21:39 <eandersson> I think 8
21:39 <johnsom> Oh, ok, 8 instances of the HM
21:40 <johnsom> I'm just trying to get a feel for the scale. I can then use the simulator to reproduce your performance concern.
21:40 <johnsom> And you said the status update section of the HM processing is taking seconds?
21:41 <eandersson> Sorry, those were actually assumptions, not facts.
21:42 <eandersson> My concern is the load it's generating
21:42 <eandersson> So with 8 workers they are constantly at 40% cpu usage (each)
21:42 <eandersson> And this is without the larger use cases (e.g. 100+ members)
21:43 <eandersson> My concern is really that this increase almost looks exponential
21:43 *** yamamoto has quit IRC
21:44 <eandersson> The good thing here is that this seems to be horizontally scalable? so we can just keep adding more HMs
21:44 <johnsom> What do you have set for these settings? [health_manager] health_update_threads and stats_update_threads
21:45 <eandersson> 8
21:45 <johnsom> So you have 8 instances, each capped at 8 processes?
21:45 <eandersson> When you say instances, do you mean amps?
21:46 <johnsom> No, systemd-started instances of HM
21:46 <eandersson> or talking about "control nodes"
21:46 <eandersson> 2x8
21:47 <eandersson> So 16
21:47 *** ivve has quit IRC
21:47 <eandersson> Are stats and health updates sent at the same frequency?
21:48 <johnsom> It is all in one UDP packet
21:48 <eandersson> ah
21:48 <johnsom> So you have 2 systemd-started HMs, each configured to have up to 8 subprocesses.
21:49 <eandersson> Yea, so I see that hm has 2x8 per control node
21:49 <eandersson> and in this case we have two control nodes
21:50 <eandersson> So, just adding some information: we were trying to figure out why octavia was capping out 20+ cores
21:51 <eandersson> We assumed it was the HMs, but in reality it was the oslo connection leak causing most of this load
21:51 <johnsom> Ok. And you don't have sync_provisioning_status or event_streamer_driver enabled, right?
21:51 <eandersson> Yea - that is disabled
21:51 <johnsom> What is your heartbeat_interval and health_check_interval setting?
21:51 <eandersson> It's set to 10, but... defaults to 3, right?
21:52 <eandersson> and that hasn't replicated yet
21:52 <eandersson> Since all the amps need to be re-created, right?
21:53 <johnsom> Well, these two settings have very different meanings.
21:53 <johnsom> health check interval is how often the HM hits the database looking for missing amps. This can change at any time as it is controller side.
21:54 <eandersson> ah sorry, heartbeat_interval is probably set to default
21:54 <johnsom> That has a default of 3 seconds
21:55 <johnsom> heartbeat interval is how often an amp sends that UDP packet with the stats/status in it. This defaults to once every 10 seconds.
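
For reference, the settings discussed here all live under [health_manager] in octavia.conf. The values below only restate what this conversation establishes (thread counts as deployed by eandersson, interval defaults per johnsom); they are not a tuning recommendation.

    [health_manager]
    # Worker process counts per HM instance for status and stats updates.
    health_update_threads = 8
    stats_update_threads = 8
    # Controller side: how often the HM scans the DB for missing amps
    # (default 3 seconds; can be changed at any time).
    health_check_interval = 3
    # Amphora side: how often each amp sends its UDP heartbeat (default
    # 10 seconds; baked into existing amps until they are rebuilt or,
    # in Stein, updated via the new agent config API).
    heartbeat_interval = 10
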
21:57 <rm_work> Probably a lone cert is fine, it doesn't really meet the qualifications for a pkcs12
21:57 <johnsom> That one is stored in the amphora and would require a rebuild. However, Stein will have a new API that lets you update it without a rebuild
21:57 <eandersson> So johnsom, increasing the sleep etc. will obviously lower cpu time used by the HM
21:57 <eandersson> That is awesome
21:58 <rm_work> Yeah, but it increases failure detection time
21:58 <eandersson> My only concern there is that it would make health checks less responsive, right?
21:58 <rm_work> It's a trade-off
21:58 <eandersson> Yea
21:58 <johnsom> Yeah, I'm just trying to dive into the situation. I wouldn't really expect it to be a lot of load
21:58 <rm_work> Ah, are you using py2 or py3
21:58 <eandersson> tbh whatever data we have right now is invalid until we have pushed the oslo fix out
21:58 <eandersson> 2
21:59 <rm_work> I noticed the sleeps in that code in native python are shitty, it's fixed in py3 to be less shit
21:59 <rm_work> So that might also help
21:59 <eandersson> because at first I made the assumption about SQL based on the load on the box (only reason I could see something generating that much load)
21:59 <johnsom> Ok, so the 40% was for all of the controller processes and not just the HM?
21:59 <rm_work> But that's theoretical, I'm just postulating
21:59 <eandersson> cpu is 40% per process, but load on the box did not add up
22:00 <eandersson> (basically load was 20+, but cpu usage per process was lower and did not add up)
22:00 <rm_work> (thread sleeps)
22:00 <johnsom> So 40% of one core?
22:00 <eandersson> Yea
22:00 <eandersson> If you added that up, load should have been at 4
22:01 <eandersson> but having 6-20 million connections established in the background threw those numbers off
22:01 <johnsom> Yeah, that is only on the API process though
22:01 <johnsom> There is no rabbit in HM
22:01 <eandersson> api process usage per core was like 20% :p
22:01 <eandersson> so nothing added up
22:02 <johnsom> Ok, so maybe we need to come back to this discussion once you have deployed the rabbit fix.
22:02 <eandersson> Yea, I think so
22:03 <eandersson> I mean, there are for sure improvements that can be done, like you said
22:03 <eandersson> but I wouldn't mind making sure there are no outside factors first
22:03 <johnsom> Ok. I spent some time last year optimizing the HM status update process. It should be pretty lean, assuming you are not failing over a bunch of amps all the time.
22:04 <johnsom> Under load, if I remember right, it was about 0.006 seconds per UDP packet. I think that was with a few thousand amps
22:07 <colin-> nice, that's a good metric for us to judge against
22:08 <johnsom> rm_work On the cert thing, so no pkcs12 wrapper for the ca cert. That means I need to extend the cert manager again so we have a basic secret api.
22:09 <johnsom> I did this for the CRL: https://review.openstack.org/#/c/612269/13/octavia/certificates/manager/barbican.py
22:09 <johnsom> I probably should just change that to get_secret
22:17 <rm_work> Hmm
22:18 <johnsom> I mean, if I had time and could get it reviewed, I would probably refactor the whole manager to be more generic and move the PKCS12 stuff outside the manager.
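
A minimal sketch of the get_secret generalization johnsom is proposing, modeled on the CRL-specific method in the review linked above. The auth-driver plumbing and method shape here are assumptions based on that patch, not the code that eventually merged.

    # Sketch only: a generic "fetch raw secret" call on the Barbican cert
    # manager, e.g. for a lone client CA cert with no pkcs12 wrapper.
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    class SecretFetchMixin(object):
        """Illustrative; in-tree this would live on BarbicanCertManager."""

        def get_secret(self, context, secret_ref):
            # The self.auth.get_barbican_client() plumbing is assumed
            # from the existing manager code.
            connection = self.auth.get_barbican_client(context.project_id)
            LOG.info('Loading secret %s from Barbican.', secret_ref)
            try:
                return connection.secrets.get(secret_ref=secret_ref).payload
            except Exception as e:
                LOG.error('Error getting secret %s: %s', secret_ref, e)
                raise
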
22:54 *** celebdor has joined #openstack-lbaas
22:56 *** yamamoto has joined #openstack-lbaas
23:00 *** yamamoto has quit IRC
23:13 <eandersson> The code path is pretty fast johnsom, at least without actual db calls
23:14 <johnsom> Yeah, typically it's the DB round trips that are the slowest part. That is why I hand-optimized that DB query, so we limit the number of DB calls to only those necessary.
23:15 <rm_work> yeah
23:16 <rm_work> the DB is super important there
23:16 <rm_work> which is why we originally intended to put that stuff in redis or similar
23:16 <rm_work> I wonder if next cycle would be a good time to revisit that
23:16 <rm_work> we may finally be at a point where supporting that would be good
23:16 <rm_work> and maybe we could get cycles from eandersson and crew to help? :P
23:16 <rm_work> discussion at denver?
23:17 <eandersson> I think you have to convince Jude :p
23:17 <johnsom> I'm still not sure we actually *need* that. I didn't have enough tester horsepower to really see where the tipping point is.
23:17 <rm_work> yeah, but if we can tell the DB is the bottleneck
23:18 <rm_work> and we want MORE POWER (hur hur hur), that is where we'd make the improvement
23:18 <johnsom> But if we don't hit that until 100,000 LBs????
23:19 <johnsom> Or 1,000,000.... I don't know. I needed more tester CPU cores than I have.
23:19 <eandersson> https://www.openstack.org/summit/denver-2019/summit-schedule/events/23379/how-blizzard-entertainment-uses-autoscaling-with-overwatch
23:20 <johnsom> I bet you are really interested in the active/active work.... lol
23:24 <rm_work> it is cool I can say the project I work on powers Overwatch :P
23:25 <rm_work> I think I've mentioned that a few times when friends ask what I do and then are like "lolwat k"
23:25 <rm_work> makes things more relatable :P
23:25 <rm_work> too bad i'm horrible at overwatch >_<
23:25 <johnsom> I haven't played it
23:26 <rm_work> it's too fast for me, I prefer shooter games where people move at like... normal human-ish speeds
23:26 <rm_work> CS:GO, PUBG/RoE, etc
23:26 <rm_work> Fortnite, Apex Legends, Overwatch are all way too fast
23:26 <rm_work> I can't track targets <_<
23:26 <johnsom> Yeah, the last game I played was Sniper Elite 4
23:29 <eandersson> I like BR... but Fortnite is wayyy too fast
23:29 <eandersson> I can play Apex, but seriously, watching kids play Fortnite is just nuts
23:30 <eandersson> Makes me feel old
23:41 <openstackgerrit> Carlos Goncalves proposed openstack/octavia master: Fix LB failover when in ERROR  https://review.openstack.org/638790
23:42 <johnsom> cgoncalves I think there is a story for that
23:42 <cgoncalves> johnsom, cool!
23:42 <cgoncalves> I was about to ask you folks if you were okay with this change
23:43 <cgoncalves> I think this was discussed before. sadly I don't remember what the conclusion was
23:45 <cgoncalves> can someone teach me how to search on storyboard? :S
23:45 <johnsom> Yeah, I have been looking for the story too. Storyboard has gotten even slower....
23:47 <cgoncalves> guessing you're not referring to https://storyboard.openstack.org/#!/story/2002634
23:48 <johnsom> No
23:49 <johnsom> Well, I can't find it, so maybe it wasn't a story
23:51 <cgoncalves> are you generally okay with allowing LBs in ERROR to be failed over?
23:51 <cgoncalves> if so, I'll open a story
23:52 <cgoncalves> I have received numerous requests on how to recover LBs in ERROR
23:53 <johnsom> Yes, the original intent of the failover API was to recover ERROR LBs. That is a bug.
23:53 <johnsom> Though I just commented on the patch
23:53 <johnsom> Ugh, I have to stop looking at bugs or we are not going to get a bunch of features into stein....
23:56 <cgoncalves> ok, great
23:56 <johnsom> Oh, hmm, maybe that isn't doing what I thought at first glance
23:56 <johnsom> my comment might be bogus
23:56 <cgoncalves> I think so
23:59 <johnsom> Yeah, my bad.
23:59 <johnsom> Ok, back to working on getting these features in. Bugs are in two weeks.
