Wednesday, 2020-09-09

*** tsmith2 has quit IRC		00:33
*** ricolin_ has joined #openstack-meeting-3		01:31
*** dconde has joined #openstack-meeting-3		02:34
*** dconde has quit IRC		02:35
*** njohnston has quit IRC		04:22
*** belmoreira has joined #openstack-meeting-3		04:23
*** ralonsoh has joined #openstack-meeting-3		06:34
*** e0ne has joined #openstack-meeting-3		06:35
*** e0ne has quit IRC		06:39
*** slaweq has joined #openstack-meeting-3		06:51
*** tosky has joined #openstack-meeting-3		07:57
*** slaweq has quit IRC		08:20
*** apetrich has joined #openstack-meeting-3		08:26
*** slaweq has joined #openstack-meeting-3		08:29
*** apetrich has quit IRC		08:39
*** ianychoi__ has quit IRC		08:40
*** apetrich has joined #openstack-meeting-3		08:44
*** slaweq has quit IRC		08:49
*** e0ne has joined #openstack-meeting-3		09:32
*** ricolin_ has quit IRC		09:38
*** ralonsoh has quit IRC		10:01
*** ralonsoh has joined #openstack-meeting-3		10:02
*** raildo has joined #openstack-meeting-3		11:30
*** lkoranda has joined #openstack-meeting-3		11:36
*** lkoranda has quit IRC		11:52
*** lkoranda has joined #openstack-meeting-3		11:52
*** njohnston has joined #openstack-meeting-3		12:08
*** slaweq has joined #openstack-meeting-3		14:03
*** lkoranda has quit IRC		14:08
*** ricolin_ has joined #openstack-meeting-3		14:11
*** ricolin_ has quit IRC		14:13
*** lkoranda has joined #openstack-meeting-3		15:03
*** slaweq has quit IRC		15:04
*** tsmith2 has joined #openstack-meeting-3		15:11
*** lpetrut has joined #openstack-meeting-3		15:12
*** mlavalle has joined #openstack-meeting-3		15:17
*** tsmith2 has quit IRC		15:20
*** tsmith2 has joined #openstack-meeting-3		15:22
*** tsmith_ has joined #openstack-meeting-3		15:25
*** tsmith2 has quit IRC		15:27
*** tsmith_ is now known as tsmith2		15:27
*** belmoreira has quit IRC		15:40
*** mdelavergne has joined #openstack-meeting-3		15:51
*** penick has joined #openstack-meeting-3		15:51
*** mdelavergne has quit IRC		15:53
*** mdelavergne has joined #openstack-meeting-3		15:54
*** lpetrut has quit IRC		15:59
ttx	#startmeeting large_scale_sig	16:00
openstack	Meeting started Wed Sep 9 16:00:07 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.	16:00
openstack	Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.	16:00
ttx	#topic Rollcall	16:00
*** openstack changes topic to " (Meeting topic: large_scale_sig)"		16:00
openstack	The meeting name has been set to 'large_scale_sig'	16:00
*** openstack changes topic to "Rollcall (Meeting topic: large_scale_sig)"		16:00
ttx	Who is here for the Large Scale SIG meeting ?	16:00
mdelavergne	Hi o/	16:00
*** eandersson has joined #openstack-meeting-3		16:01
ttx	I see penick is in the channel list	16:01
amorin	hey!	16:01
ttx	and amorin	16:01
* penick waves		16:01
ttx	and eandersson	16:02
eandersson	o/	16:02
ttx	Alright, let's get started	16:02
ttx	Our agenda for today is at:	16:02
ttx	#link https://etherpad.openstack.org/p/large-scale-sig-meeting	16:02
ttx	#topic Welcome newcomers	16:02
*** openstack changes topic to "Welcome newcomers (Meeting topic: large_scale_sig)"		16:02
ttx	Following the Opendev event on Large scale deployments, we had several people expressing interest in joining the SIG	16:03
ttx	Several of them in the US and not interested in our original meeting time at 8utc	16:03
*** priteau has joined #openstack-meeting-3		16:03
ttx	So we decided some time ago to rotate between US+EU / APAC+EU times, as the majority of the group is from EU	16:03
ttx	This is our second US+EU meet... but the first was not really successful at attracting new participants	16:03
ttx	so I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in	16:03
ttx	so we can shape the direction of the SIG accordingly	16:03
ttx	penick: not sure you need any intro, but go	16:04
penick	Heh, hey folks. I work for Verizon Media and act as the director/architect for our private cloud.	16:04
ttx	Anything specific you're interested in within this SIG, beyond sharing your experience?	16:05
penick	I'm here to offer some perspective and feedback on our needs as a large scale deployer, and learn other use cases. Ideally i'd like to align our upstream development efforts to help make the product better for large and small deployers	16:05
ttx	great! You're in the right place	16:05
ttx	eandersson: care to quickly introduce yourself?	16:06
amorin	welcome	16:06
eandersson	Sure!	16:06
eandersson	I work at Blizzard Entertainment as a Technical Lead and I am responsible for the Private Cloud here.	16:07
ttx	Anything specific you're interested in within this SIG, beyond sharing your experience?	16:07
ttx	(yeah, i copy pasted that)	16:08
mdelavergne	Welcome to the both of you :)	16:08
eandersson	Very similar to penick, here to share perspective and provide feedback.	16:08
ttx	Perfect. I'll let amorin and mdelavergne quickly introduce themselves... I'm working for the OSF, facilitating this group's operations	16:09
ttx	Not much of a first-hand experience on large scale deployments, lots of second hand accounts though :)	16:09
ttx	Other regular SIG members include belmoreira of CERN fame, and masahito from LINE, who hopefully is sleeping at this hour.	16:10
amorin	hey, i work for ovh, i am mostly involved in deploying openstack for our public cloud offer	16:10
mdelavergne	Hi, I'm a PhD student working on a way to geo-distribute applications, so I mostly do experiments on large-scale	16:11
amorin	and sorry, i am on my phone :( not easy to type here	16:11
ttx	heh	16:11
penick	i'm glad to meet you all :)	16:11
ttx	Alright, let's jump right into our current workstreams... don't hesitate to interrupt me to ask questions	16:11
ttx	#topic Progress on "Documenting large scale operations" goal	16:11
*** openstack changes topic to "Progress on "Documenting large scale operations" goal (Meeting topic: large_scale_sig)"		16:11
ttx	#link https://etherpad.openstack.org/p/large-scale-sig-documentation	16:11
ttx	So this is one of our current goals for the SIG - produce better documentation to help operators setting up large deployments.	16:12
ttx	One work stream is around collecting articles, howto and tips & tricks around large scale published over the years	16:12
ttx	You can add any you know of on that etherpad I just linked to	16:12
ttx	Another workstream is around documenting better configuration values when you start hitting issues with default values	16:12
ttx	amorin is leading that work at https://wiki.openstack.org/wiki/Large_Scale_Configuration_Guidelines but could use help	16:13
ttx	Basically the default settings only carry you so far	16:13
ttx	and if we can agree on commonly-tweaked settings at scale, we should find a way to document them	16:13
amorin	yes i'd love to be able to push that forward, but i lack some time mostly	16:13
ttx	We had another work item on collecting metrics/billing stories	16:14
ttx	That points to one critical activity of this SIG:	16:14
ttx	It's all about sharing your experience operating large scale deployments of OpenStack	16:14
ttx	so that we can derive best practices and/or fix common issues	16:14
ttx	Only amorin contributed the story for OVH on the etherpad, so please add to that if you have time	16:15
ttx	Maybe there is no common pattern there, but my guess is there is, and it's not necessarily aligned with what upstream provides	16:15
ttx	So I'll push again an action to contribute to that:	16:15
ttx	#action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation	16:15
ttx	And finally, we have a workstream on tooling, prompted by OVH's interest in pushing osarchiver upstream	16:16
ttx	The current status there is that the "OSOps" effort is being revived, under the 'Operation Docs and Tooling' SIG	16:16
ttx	belmoreira and I signed up to make sure that was moving forward, but smcginnis has been active leading it	16:16
ttx	The setup is still in progress at https://review.opendev.org/#/c/749834/	16:16
penick	I can ask my team to see if they can distill some of the useful configuration changes we've made. One trick is there's so many ways to deploy OpenStack that there are a lot of different contexts for when to use some settings vs others. For example, we focus on fewer, larger openstack deployments. With thousands of hypervisors per VM cluster and tens of thousands of baremetal nodes per baremetal cluster. So some things	16:16
penick	we've done will not really apply to a large scale deployer who might prefer many smaller clusters.	16:16
ttx	penick: I think the info will be useful anyway. You're right that in some cases there won't be a common practice and it will be all over the map	16:17
amorin	agree, but still useful to know how people are doing scale	16:18
ttx	But from some early discussions at the SIG it was apparent that some of the pain points are still common	16:18
ttx	To finish on osarchiver, once OSOps repo is set up we'll start working on pushing osarchiver in it	16:18
ttx	Anything else on this topic? Any other direction you'd like this broad goal to take?	16:19
amorin	great	16:19
ttx	(the SIG mostly goes where its members push it, so if you're interested in something specific, please let the group know)	16:19
ttx	Otherwise I'll keep on exposing current status	16:20
ttx	#topic Progress on "Scaling within one cluster" goal	16:20
*** openstack changes topic to "Progress on "Scaling within one cluster" goal (Meeting topic: large_scale_sig)"		16:20
ttx	#link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling	16:20
ttx	This is the other initial goal of the SIG - identify, measure and push back common scaling limitations within one cluster	16:20
ttx	Basically, as you add nodes to a single OpenStack cluster, at one point you reach scaling limits.	16:20
ttx	How do we push that limit back, and reduce the need to create multiple clusters ?	16:21
ttx	Again there may not be a common pattern, so...	16:21
ttx	First task here is to collect those scaling stories. You have a large scale deployment, what happens if you add too many nodes ?	16:21
ttx	We try to collect those stories at https://etherpad.openstack.org/p/scaling-stories	16:21
ttx	and then move them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories once edited	16:21
ttx	So please add to that if you have a story to share! (can be old)	16:22
eandersson	Neutron has been a big pain point for us, with memory usage being excessive, to the point where we ended up running Neutron in VMs.	16:22
ttx	Neutron failed first? Usually people point to Rabbit first, but maybe you optimized that already	16:22
eandersson	Rabbit has been good to us, besides recovering after a crash.	16:22
penick	eandersson: interesting, was neutron running in containers prior to moving to VMs?	16:23
eandersson	Yea - we containerized all of our deployments when moving to Rocky about ~2 years ago.	16:23
ttx	amorin: you're not (yet) running components containerized, right?	16:24
amorin	nop, not yet	16:25
amorin	we are working on it	16:25
penick	eandersson good to know, thanks! We're working on containerizing all of our stuff now.. I'll make sure my team is aware	16:25
ttx	So as I said, one common issue when adding nodes is around RabbitMQ falling down, even if I expect most people in this group to have moved past that	16:26
ttx	And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them	16:26
amorin	is your database and rabbit cluster running in containers as well?	16:26
ttx	Based on what LINE used internally to solve that	16:26
eandersson	Not yet.	16:27
ttx	This resulted in https://opendev.org/openstack/oslo.metrics	16:27
ttx	Next step there is to add basic tests, so that we are reasonably confident we do not introduce regressions	16:27
ttx	I had an action item I did not finish to evaluate that, let me push it back	16:27
ttx	#action ttx to look into a basic test framework for oslo.metrics	16:27
ttx	Also masahito was planning to push the latest patches to it, but I haven't seen anything posted yet	16:27
ttx	#action masahito to push latest patches to oslo.metrics	16:28
ttx	amorin: did you check about applicability of oslo.metrics within OVH?	16:28
amorin	not yet, still in my todo	16:28
ttx	alright pushing that back too	16:28
ttx	#action amorin to see if oslo.metrics could be tested at OVH	16:28
ttx	Finally we recently helped with OVH's patch to add a basic ping to oslo.messaging	16:28
ttx	That can be useful to monitor a rabbitMQ setup and detect weird failure cases	16:29
ttx	(there were threads on the ML about it)	16:29
ttx	Happy to report the patch finally landed at https://review.opendev.org/#/c/749834/	16:29
ttx	And shipped in oslo.messaging 12.4.0!	16:29
amorin	yay!	16:29
mdelavergne	congrats!	16:29
eandersson	I believe a lot of the critical recovery issues with RabbitMQ are fixed in Ussuri. Especially the excessive number of exchanges created by RPC calls that I believe caused all our recovery issues.	16:30
ttx	eandersson: yes I think there were lots of improvements there	16:31
ttx	Does that sound like a good goal for you penick and eandersson, or do you think we'll also fail to extract common patterns and low-hanging fruit improvements to raise scale	16:31
penick	Sorry, not understanding.. Is what a good goal?	16:32
ttx	oh the "Scaling within one cluster" goal	16:32
ttx	Trying to push back when you need to scale out to multiple clusters	16:33
eandersson	Yea - that is a good goal.	16:33
penick	Ah, yeah. That's good for me.	16:34
ttx	ok, moving on	16:34
ttx	#topic PTG/Summit plans	16:34
*** openstack changes topic to "PTG/Summit plans (Meeting topic: large_scale_sig)"		16:34
ttx	For the PTG we decided last meeting to ask for a PTG room around our usual meeting times, to serve as our regular meeting while recruiting potential new members	16:34
ttx	So I requested Wednesday 7UTC-8UTC and 16UTC-17UTC (Oct 28)	16:35
ttx	(first one is a bit early but there was no slot scheduled at our normal time)	16:35
ttx	For the summit we discussed proposing one Forum session around collecting scaling stories, which I still have to file	16:35
ttx	The idea is to get people to talk and we can document their story afterwards	16:36
ttx	One learning from Opendev is that to get a virtual discussion going, it's good to prime the pump by having 2-3 people signed up to discuss the topic	16:36
ttx	So... anyone interested in sharing their scaling story in the context of a Forum session?	16:36
ttx	I just fear unless we start talking nobody will dare expose their case	16:37
ttx	I can moderate but I don't have any first-hand scaling story to share	16:37
amorin	i can maybe do a small one	16:37
penick	I can share something, or ask someone from my team to	16:38
ttx	great, thanks! I think that will help	16:38
ttx	We'll try to actively promote that session in the openstack ops community, so hopefully it will work, both as a recruiting mechanism and a way to collect data	16:39
amorin	ok	16:39
ttx	#action ttx to file Scaling Stories forum session, with amorin and someone from penick's team to help get it off the ground	16:40
ttx	#topic Next meeting	16:40
*** openstack changes topic to "Next meeting (Meeting topic: large_scale_sig)"		16:40
ttx	If we continue on the same rhythm:	16:40
ttx	Next meeting will be EU-APAC on Sept 23, 8utc.	16:40
ttx	Then next US-EU meeting will be Oct 7, 16utc.	16:40
ttx	How does that sound?	16:41
ttx	Any objection to continuing to hold those on IRC?	16:41
amorin	good	16:41
eandersson	Sounds good to me.	16:41
mdelavergne	it's fine	16:41
ttx	Feel free to invite others from your teams to join if they can help, or fellow openstack ops you accidentally discuss with	16:42
ttx	#info next meetings: Sep 23, 8:00UTC; Oct 7, 16:00UTC	16:42
ttx	That is all I had, opening the floor for comments, questions, desires	16:42
ttx	#topic Open discussion	16:42
*** openstack changes topic to "Open discussion (Meeting topic: large_scale_sig)"		16:42
penick	16utc is fine for me, i'm ok with either IRC or video chat	16:42
ttx	Anything else this SIG should tackle? We are trying to have reasonable goals as we all have a lot of work besides the SIG	16:43
ttx	and not sweat it if things go slow	16:43
ttx	but still expose and share knowledge and tools that are distributed in the large deployment ops community	16:44
eandersson	Meaningful monitoring of the control plane has always been difficult for me.	16:44
penick	I'd be interested to learn how many large scale deployers have dedicated time for upstream contributions, and if there's any interest in collaborating on things.	16:44
penick	For example, producing reference deployment documentation, similar to what we did for the edge deployments.	16:44
eandersson	We focus heavily on anything that we think is critical and is currently lacking the support (e.g. our contributions to Senlin)	16:45
penick	ah, meaningful monitoring is a good one	16:45
eandersson	We don't necessarily have the time to dedicate someone to a project like Nova, Neutron just because of the sheer complexity of learning that. So we focus where we can make the most impact.	16:46
eandersson	But we do contribute up smaller bug fixes to all projects (or report them at the very least) the moment we find them.	16:46
ttx	I like the idea of meaningful monitoring. It really is in a sorry state	16:47
ttx	I think the quest for exact metrics (usable for billing) has killed the "meaningful" out of our monitoring	16:48
eandersson	Yep!	16:48
eandersson	It's difficult to find a good balance.	16:49
ttx	Maybe we should collect experiences on how people currently do monitoring, not even talking about metrics/billing	16:49
eandersson	I would also like to see if someone has successfully enabled tracing.	16:50
penick	splunk monitoring/search examples for troubleshooting problems might also be useful	16:51
ttx	I'll give some thought on how to drive that, maybe we should just add a goal on "meaningful monitoring" (I like that term) but it might be a bit too ambitious for the group right now	16:51
eandersson	I would like to know what monitoring is the first to alert you when something goes wrong (e.g. rabbit goes down, control node goes down)	16:51
penick	also, relevant, i'm on two meetings at once here, but the other meeting is what we're going to do to get off of Ceilometer and Gnocchi to move to something active and supported	16:51
*** dosaboy has quit IRC		16:51
ttx	relevant indeed	16:52
eandersson	We have a lot of synthetic monitoring scenarios running continuously	16:52
eandersson	or scenario monitoring (e.g. a VM is created and tested every 5 minutes)	16:52
ttx	penick: would love to know where you end up	16:52
ttx	OK that was a great first meeting y'all, but I need to run	16:53
penick	I'll share what we come up with	16:53
penick	cool, thanks y'all!	16:53
ttx	So thanks everyone, let's continue the discussion next time you can drop by. I'll add a section to continue discussing meaningful monitoring in the EU+APAC meeting in two weeks	16:54
ttx	#endmeeting	16:54
*** openstack changes topic to "OpenStack Meetings \|\| https://wiki.openstack.org/wiki/Meetings/"		16:54
openstack	Meeting ended Wed Sep 9 16:54:45 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)	16:54
openstack	Minutes: http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.html	16:54
openstack	Minutes (text): http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.txt	16:54
openstack	Log: http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.log.html	16:54
*** eandersson has left #openstack-meeting-3		16:54
amorin	bye	16:55
ttx	I'll post summary of the meeting, but tomorrow since I need to run	16:55
mdelavergne	Thanks everyone, see you!	16:55
*** mdelavergne has quit IRC		16:55
*** slaweq has joined #openstack-meeting-3		16:57
*** penick has quit IRC		17:00
*** slaweq has quit IRC		17:12
*** belmoreira has joined #openstack-meeting-3		17:35
*** e0ne has quit IRC		17:37
*** ralonsoh has quit IRC		18:05
*** priteau has quit IRC		18:57
*** apetrich has quit IRC		19:11
*** slaweq has joined #openstack-meeting-3		19:17
*** tosky has quit IRC		19:37
*** slaweq has quit IRC		20:51
*** slaweq has joined #openstack-meeting-3		20:55
*** slaweq has quit IRC		20:59
*** slaweq has joined #openstack-meeting-3		21:26
*** slaweq has quit IRC		21:31
*** raildo has quit IRC		22:22
*** dosaboy has joined #openstack-meeting-3		23:28
