*** tsmith2 has quit IRC | 00:33 | |
*** ricolin_ has joined #openstack-meeting-3 | 01:31 | |
*** dconde has joined #openstack-meeting-3 | 02:34 | |
*** dconde has quit IRC | 02:35 | |
*** njohnston has quit IRC | 04:22 | |
*** belmoreira has joined #openstack-meeting-3 | 04:23 | |
*** ralonsoh has joined #openstack-meeting-3 | 06:34 | |
*** e0ne has joined #openstack-meeting-3 | 06:35 | |
*** e0ne has quit IRC | 06:39 | |
*** slaweq has joined #openstack-meeting-3 | 06:51 | |
*** tosky has joined #openstack-meeting-3 | 07:57 | |
*** slaweq has quit IRC | 08:20 | |
*** apetrich has joined #openstack-meeting-3 | 08:26 | |
*** slaweq has joined #openstack-meeting-3 | 08:29 | |
*** apetrich has quit IRC | 08:39 | |
*** ianychoi__ has quit IRC | 08:40 | |
*** apetrich has joined #openstack-meeting-3 | 08:44 | |
*** slaweq has quit IRC | 08:49 | |
*** e0ne has joined #openstack-meeting-3 | 09:32 | |
*** ricolin_ has quit IRC | 09:38 | |
*** ralonsoh has quit IRC | 10:01 | |
*** ralonsoh has joined #openstack-meeting-3 | 10:02 | |
*** raildo has joined #openstack-meeting-3 | 11:30 | |
*** lkoranda has joined #openstack-meeting-3 | 11:36 | |
*** lkoranda has quit IRC | 11:52 | |
*** lkoranda has joined #openstack-meeting-3 | 11:52 | |
*** njohnston has joined #openstack-meeting-3 | 12:08 | |
*** slaweq has joined #openstack-meeting-3 | 14:03 | |
*** lkoranda has quit IRC | 14:08 | |
*** ricolin_ has joined #openstack-meeting-3 | 14:11 | |
*** ricolin_ has quit IRC | 14:13 | |
*** lkoranda has joined #openstack-meeting-3 | 15:03 | |
*** slaweq has quit IRC | 15:04 | |
*** tsmith2 has joined #openstack-meeting-3 | 15:11 | |
*** lpetrut has joined #openstack-meeting-3 | 15:12 | |
*** mlavalle has joined #openstack-meeting-3 | 15:17 | |
*** tsmith2 has quit IRC | 15:20 | |
*** tsmith2 has joined #openstack-meeting-3 | 15:22 | |
*** tsmith_ has joined #openstack-meeting-3 | 15:25 | |
*** tsmith2 has quit IRC | 15:27 | |
*** tsmith_ is now known as tsmith2 | 15:27 | |
*** belmoreira has quit IRC | 15:40 | |
*** mdelavergne has joined #openstack-meeting-3 | 15:51 | |
*** penick has joined #openstack-meeting-3 | 15:51 | |
*** mdelavergne has quit IRC | 15:53 | |
*** mdelavergne has joined #openstack-meeting-3 | 15:54 | |
*** lpetrut has quit IRC | 15:59 | |
ttx | #startmeeting large_scale_sig | 16:00 |
---|---|---|
openstack | Meeting started Wed Sep 9 16:00:07 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot. | 16:00 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 16:00 |
ttx | #topic Rollcall | 16:00 |
*** openstack changes topic to " (Meeting topic: large_scale_sig)" | 16:00 | |
openstack | The meeting name has been set to 'large_scale_sig' | 16:00 |
*** openstack changes topic to "Rollcall (Meeting topic: large_scale_sig)" | 16:00 | |
ttx | Who is here for the Large Scale SIG meeting ? | 16:00 |
mdelavergne | Hi o/ | 16:00 |
*** eandersson has joined #openstack-meeting-3 | 16:01 | |
ttx | I see penick is in the channel list | 16:01 |
amorin | hey! | 16:01 |
ttx | and amorin | 16:01 |
* penick waves | 16:01 | |
ttx | and eandersson | 16:02 |
eandersson | o/ | 16:02 |
ttx | Alright, let's get started | 16:02 |
ttx | Our agenda for today is at: | 16:02 |
ttx | #link https://etherpad.openstack.org/p/large-scale-sig-meeting | 16:02 |
ttx | #topic Welcome newcomers | 16:02 |
*** openstack changes topic to "Welcome newcomers (Meeting topic: large_scale_sig)" | 16:02 | |
ttx | Following the Opendev event on Large scale deployments, we had several people expressing interest in joining the SIG | 16:03 |
ttx | Several of them in the US and not interested in our original meeting time at 8utc | 16:03 |
*** priteau has joined #openstack-meeting-3 | 16:03 | |
ttx | So we decided some time ago to rotate between US+EU / APAC+EU times, as the majority of the group is from EU | 16:03 |
ttx | This is our second US+EU meet... but the first was not really successful at attracting new participants | 16:03 |
ttx | so I'd like to take the time to welcome new attendees, and spend some time discussing what they are interested in | 16:03 |
ttx | so we can shape the direction of the SIG accordingly | 16:03 |
ttx | penick: not sure you need any intro, but go | 16:04 |
penick | Heh, hey folks. I work for Verizon Media and act as the director/architect for our private cloud. | 16:04 |
ttx | Anything specific you're interested in within this SIG, beyond sharing your experience? | 16:05 |
penick | I'm here to offer some perspective and feedback on our needs as a large scale deployer, and learn other use cases. Ideally i'd like to align our upstream development efforts to help make the product better for large and small deployers | 16:05 |
ttx | great! You're in the right place | 16:05 |
ttx | eandersson: care to quickly introduce yourself? | 16:06 |
amorin | welcome | 16:06 |
eandersson | Sure! | 16:06 |
eandersson | I work at Blizzard Entertainment as a Technical Lead and I am responsible for the Private Cloud here. | 16:07 |
ttx | Anything specific you're interested in within this SIG, beyond sharing your experience? | 16:07 |
ttx | (yeah, i copy pasted that) | 16:08 |
mdelavergne | Welcome to the both of you :) | 16:08 |
eandersson | Very similar to penick, here to share perspective and provide feedback. | 16:08 |
ttx | Perfect. I'll let amorin and mdelavergne quickly introduce themselves... I'm working for the OSF, facilitating this group's operations | 16:09 |
ttx | Not much of a first-hand experience on large scale deployments, lots of second hand accounts though :) | 16:09 |
ttx | Other regular SIG members include belmoreira of CERN fame, and masahito from LINE, who hopefully is sleeping at this hour. | 16:10 |
amorin | hey, i work for OVH, i am mostly involved in deploying openstack for our public cloud offer | 16:10 |
mdelavergne | Hi, I'm a PhD student working on a way to geo-distribute applications, so I mostly do experiments on large-scale | 16:11 |
amorin | and sorry, i am on my phone :( not easy to type here | 16:11 |
ttx | heh | 16:11 |
penick | i'm glad to meet you all :) | 16:11 |
ttx | Alright, let's jump right into our current workstreams... don't hesitate to interrupt me to ask questions | 16:11 |
ttx | #topic Progress on "Documenting large scale operations" goal | 16:11 |
*** openstack changes topic to "Progress on "Documenting large scale operations" goal (Meeting topic: large_scale_sig)" | 16:11 | |
ttx | #link https://etherpad.openstack.org/p/large-scale-sig-documentation | 16:11 |
ttx | So this is one of our current goals for the SIG - produce better documentation to help operators setting up large deployments. | 16:12 |
ttx | One work stream is around collecting articles, howto and tips & tricks around large scale published over the years | 16:12 |
ttx | You can add any you know of on that etherpad I just linked to | 16:12 |
ttx | Another workstream is around documenting better configuration values when you start hitting issues with default values | 16:12 |
ttx | amorin is leading that work at https://wiki.openstack.org/wiki/Large_Scale_Configuration_Guidelines but could use help | 16:13 |
ttx | Basically the default settings only carry you so far | 16:13 |
ttx | and if we can agree on commonly-tweaked settings at scale, we should find a way to document them | 16:13 |
amorin | yes i'd love to be able to push that forward, but i lack some time mostly | 16:13 |
ttx | We had another work item on collecting metrics/billing stories | 16:14 |
ttx | That points to one critical activity of this SIG: | 16:14 |
ttx | It's all about sharing your experience operating large scale deployments of OpenStack | 16:14 |
ttx | so that we can derive best practices and/or fix common issues | 16:14 |
ttx | Only amorin contributed the story for OVH on the etherpad, so please add to that if you have time | 16:15 |
ttx | Maybe there is no common pattern there, but my guess is there is, and it's not necessarily aligned with what upstream provides | 16:15 |
ttx | So I'll push again an action to contribute to that: | 16:15 |
ttx | #action all to describe briefly how you solved metrics/billing in your deployment in https://etherpad.openstack.org/p/large-scale-sig-documentation | 16:15 |
ttx | And finally, we have a workstream on tooling, prompted by OVH's interest in pushing osarchiver upstream | 16:16 |
ttx | The current status there is that the "OSOps" effort is being revived, under the 'Operation Docs and Tooling' SIG | 16:16 |
ttx | belmoreira and I signed up to make sure that was moving forward, but smcginnis has been active leading it | 16:16 |
ttx | The setup is still in progress at https://review.opendev.org/#/c/749834/ | 16:16 |
penick | I can ask my team to see if they can distill some of the useful configuration changes we've made. One trick is there's so many ways to deploy OpenStack that there are a lot of different contexts for when to use some settings vs others. For example, we focus on fewer, larger openstack deployments. With thousands of hypervisors per VM cluster and tens of thousands of baremetal nodes per baremetal cluster. So some things | 16:16 |
penick | we've done will not really apply to a large scale deployer who might prefer many smaller clusters. | 16:16 |
ttx | penick: I think the info will be useful anyway. You're right that in some cases there won't be a common practice and it will be all over the map | 16:17 |
amorin | agree, but still useful to know how people are doing scale | 16:18 |
ttx | But from some early discussions at the SIG it was apparent that some of the pain points are still common | 16:18 |
ttx | To finish on osarchiver, once OSOps repo is set up we'll start working on pushing osarchiver in it | 16:18 |
ttx | Anything else on this topic? Any other direction you'd like this broad goal to take? | 16:19 |
amorin | great | 16:19 |
ttx | (the SIG mostly goes where its members push it, so if you're interested in something specific, please let the group know) | 16:19 |
ttx | Otherwise I'll keep on exposing current status | 16:20 |
ttx | #topic Progress on "Scaling within one cluster" goal | 16:20 |
*** openstack changes topic to "Progress on "Scaling within one cluster" goal (Meeting topic: large_scale_sig)" | 16:20 | |
ttx | #link https://etherpad.openstack.org/p/large-scale-sig-cluster-scaling | 16:20 |
ttx | This is the other initial goal of the SIG - identify, measure and push back common scaling limitations within one cluster | 16:20 |
ttx | Basically, as you add nodes to a single OpenStack cluster, at one point you reach scaling limits. | 16:20 |
ttx | How do we push that limit back, and reduce the need to create multiple clusters ? | 16:21 |
ttx | Again there may not be a common pattern, so... | 16:21 |
ttx | First task here is to collect those scaling stories. You have a large scale deployment, what happens if you add too many nodes ? | 16:21 |
ttx | We try to collect those stories at https://etherpad.openstack.org/p/scaling-stories | 16:21 |
ttx | and then move them to https://wiki.openstack.org/wiki/Large_Scale_Scaling_Stories once edited | 16:21 |
ttx | So please add to that if you have a story to share! (can be old) | 16:22 |
eandersson | Neutron has been a big pain point for us, with memory usage being excessive, to the point where we ended up running Neutron in VMs. | 16:22 |
ttx | Neutron failed first? Usually people point to Rabbit first, but maybe you optimized that already | 16:22 |
eandersson | Rabbit has been good to us, besides recovering after a crash. | 16:22 |
penick | eandersson: interesting, was neutron running in containers prior to moving to VMs? | 16:23 |
eandersson | Yea - we containerized all of our deployments when moving to Rocky about ~2 years ago. | 16:23 |
ttx | amorin: you're not (yet) running components containerized, right? | 16:24 |
amorin | nop, not yet | 16:25 |
amorin | we are working on it | 16:25 |
penick | eandersson good to know, thanks! We're working on containerizing all of our stuff now.. I'll make sure my team is aware | 16:25 |
ttx | So as I said, one common issue when adding nodes is around RabbitMQ falling down, even if I expect most people in this group to have moved past that | 16:26 |
ttx | And so this SIG has worked to produce code to instrument oslo.messaging calls and get good metrics from them | 16:26 |
amorin | is your database and rabbit cluster running in containers as well? | 16:26 |
ttx | Based on what LINE used internally to solve that | 16:26 |
eandersson | Not yet. | 16:27 |
ttx | This resulted in https://opendev.org/openstack/oslo.metrics | 16:27 |
ttx | Next step there is to add basic tests, so that we are reasonably confident we do not introduce regressions | 16:27 |
ttx | I had an action item I did not finish to evaluate that, let me push it back | 16:27 |
ttx | #action ttx to look into a basic test framework for oslo.metrics | 16:27 |
ttx | Also masahito was planning to push the latest patches to it, but I haven't seen anything posted yet | 16:27 |
ttx | #action masahito to push latest patches to oslo.metrics | 16:28 |
ttx | amorin: did you check about applicability of oslo.metrics within OVH? | 16:28 |
amorin | not yet, still in my todo | 16:28 |
ttx | alright pushing that back too | 16:28 |
ttx | #action amorin to see if oslo.metrics could be tested at OVH | 16:28 |
ttx | Finally we recently helped with OVH's patch to add a basic ping to oslo.messaging | 16:28 |
ttx | That can be useful to monitor a rabbitMQ setup and detect weird failure cases | 16:29 |
ttx | (there were threads on the ML about it) | 16:29 |
ttx | Happy to report the patch finally landed at https://review.opendev.org/#/c/749834/ | 16:29 |
ttx | And shipped in oslo.messaging 12.4.0! | 16:29 |
amorin | yay! | 16:29 |
mdelavergne | congrats! | 16:29 |
eandersson | I believe a lot of the critical recovery issues with RabbitMQ are fixed in Ussuri. Especially the excessive number of exchanges created by RPC calls that I believe caused all our recovery issues. | 16:30 |
ttx | eandersson: yes I think there were lots of improvements there | 16:31 |
ttx | Does that sound like a good goal for you penick and eandersson, or do you think we'll also fail to extract common patterns and low-hanging fruit improvements to raise scale | 16:31 |
penick | Sorry, not understanding.. Is what a good goal? | 16:32 |
ttx | oh the "Scaling within one cluster" goal | 16:32 |
ttx | Trying to push back when you need to scale out to multiple clusters | 16:33 |
eandersson | Yea - that is a good goal. | 16:33 |
penick | Ah, yeah. That's good for me. | 16:34 |
ttx | ok, moving on | 16:34 |
ttx | #topic PTG/Summit plans | 16:34 |
*** openstack changes topic to "PTG/Summit plans (Meeting topic: large_scale_sig)" | 16:34 | |
ttx | For the PTG we decided last meeting to ask for a PTG room around our usual meeting times, to serve as our regular meeting while recruiting potential new members | 16:34 |
ttx | So I requested Wednesday 7UTC-8UTC and 16UTC-17UTC (Oct 28) | 16:35 |
ttx | (first one is a bit early but there was no slot scheduled at our normal time) | 16:35 |
ttx | For the summit we discussed proposing one Forum session around collecting scaling stories, which I still have to file | 16:35 |
ttx | The idea is to get people to talk and we can document their story afterwards | 16:36 |
ttx | One learning from Opendev is that to get a virtual discussion going, it's good to prime the pump by having 2-3 people signed up to discuss the topic | 16:36 |
ttx | So... anyone interested in sharing their scaling story in the context of a Forum session? | 16:36 |
ttx | I just fear unless we start talking nobody will dare expose their case | 16:37 |
ttx | I can moderate but I don't have any first-hand scaling story to share | 16:37 |
amorin | i can maybe do a small one | 16:37 |
penick | I can share something, or ask someone from my team to | 16:38 |
ttx | great, thanks! I think that will help | 16:38 |
ttx | We'll try to actively promote that session in the openstack ops community, so hopefully it will work, both as a recruiting mechanism and a way to collect data | 16:39 |
amorin | ok | 16:39 |
ttx | #action ttx to file Scaling Stories forum session, with amorin and someone from penick's team to help get it off the ground | 16:40 |
ttx | #topic Next meeting | 16:40 |
*** openstack changes topic to "Next meeting (Meeting topic: large_scale_sig)" | 16:40 | |
ttx | If we continue on the same rhythm: | 16:40 |
ttx | Next meeting will be EU-APAC on Sept 23, 8utc. | 16:40 |
ttx | Then next US-EU meeting will be Oct 7, 16utc. | 16:40 |
ttx | How does that sound? | 16:41 |
ttx | Any objection to continuing to hold those on IRC? | 16:41 |
amorin | good | 16:41 |
eandersson | Sounds good to me. | 16:41 |
mdelavergne | it's fine | 16:41 |
ttx | Feel free to invite others from your teams to join if they can help, or fellow openstack ops you happen to talk with | 16:42 |
ttx | #info next meetings: Sep 23, 8:00UTC; Oct 7, 16:00UTC | 16:42 |
ttx | That is all I had, opening the floor for comments, questions, desires | 16:42 |
ttx | #topic Open discussion | 16:42 |
*** openstack changes topic to "Open discussion (Meeting topic: large_scale_sig)" | 16:42 | |
penick | 16utc is fine for me, i'm ok with either IRC or video chat | 16:42 |
ttx | Anything else this SIG should tackle? We are trying to have reasonable goals as we all have a lot of work besides the SIG | 16:43 |
ttx | and not sweat it if things go slow | 16:43 |
ttx | but still expose and share knowledge and tools that are distributed in the large deployment ops community | 16:44 |
eandersson | Meaningful monitoring of the control plane has always been difficult to me. | 16:44 |
penick | I'd be interested to learn how many large scale deployers have dedicated time for upstream contributions, and if there's any interest in collaborating on things. | 16:44 |
penick | For example, producing reference deployment documentation, similar to what we did for the edge deployments. | 16:44 |
eandersson | We focus heavily on anything that we think is critical and is currently lacking the support (e.g. our contributions to Senlin) | 16:45 |
penick | ah, meaningful monitoring is a good one | 16:45 |
eandersson | We don't necessarily have the time to dedicate someone to a project like Nova, Neutron just because of the sheer complexity of learning that. So we focus where we can make the most impact. | 16:46 |
eandersson | But we do contribute up smaller bug fixes to all projects (or report them at the very least) the moment we find them. | 16:46 |
ttx | I like the idea of meaningful monitoring. It really is in a sorry state | 16:47 |
ttx | I think the quest for exact metrics (usable for billing) has killed the "meaningful" out of our monitoring | 16:48 |
eandersson | Yep! | 16:48 |
eandersson | It's difficult to find a good balance. | 16:49 |
ttx | Maybe we should collect experiences on how people currently do monitoring, not even talking about metrics/billing | 16:49 |
eandersson | I would also like to see if someone has successfully enabled tracing. | 16:50 |
penick | splunk monitoring/search examples for troubleshooting problems might also be useful | 16:51 |
ttx | I'll give some thought on how to drive that, maybe we should just add a goal on "meaningful monitoring" (I like that term) but it might be a bit too ambitious for the group right now | 16:51 |
eandersson | I would like to know what monitoring is the first to alert you when something goes wrong (e.g. rabbit goes down, control node goes down) | 16:51 |
penick | also, relevant, i'm on two meetings at once here, but the other meeting is what we're going to do to get off of Ceilometer and Gnocchi to move to something active and supported | 16:51 |
*** dosaboy has quit IRC | 16:51 | |
ttx | relevant indeed | 16:52 |
eandersson | We have a lot of synthetic monitoring scenarios running continuously | 16:52 |
eandersson | or scenario monitoring (e.g. a VM is created and tested every 5 minutes) | 16:52 |
ttx | penick: would love to know where you end up | 16:52 |
ttx | OK that was a great first meeting y'all, but I need to run | 16:53 |
penick | I'll share what we come up with | 16:53 |
penick | cool, thanks y'all! | 16:53 |
ttx | So thanks everyone, let's continue the discussion next time you can drop by. I'll add a section to continue discussing meaningful monitoring in the EU+APAC meeting in two weeks | 16:54 |
ttx | #endmeeting | 16:54 |
*** openstack changes topic to "OpenStack Meetings || https://wiki.openstack.org/wiki/Meetings/" | 16:54 | |
openstack | Meeting ended Wed Sep 9 16:54:45 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:54 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.html | 16:54 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.txt | 16:54 |
openstack | Log: http://eavesdrop.openstack.org/meetings/large_scale_sig/2020/large_scale_sig.2020-09-09-16.00.log.html | 16:54 |
*** eandersson has left #openstack-meeting-3 | 16:54 | |
amorin | bye | 16:55 |
ttx | I'll post summary of the meeting, but tomorrow since I need to run | 16:55 |
mdelavergne | Thanks everyone, see you! | 16:55 |
*** mdelavergne has quit IRC | 16:55 | |
*** slaweq has joined #openstack-meeting-3 | 16:57 | |
*** penick has quit IRC | 17:00 | |
*** slaweq has quit IRC | 17:12 | |
*** belmoreira has joined #openstack-meeting-3 | 17:35 | |
*** e0ne has quit IRC | 17:37 | |
*** ralonsoh has quit IRC | 18:05 | |
*** priteau has quit IRC | 18:57 | |
*** apetrich has quit IRC | 19:11 | |
*** slaweq has joined #openstack-meeting-3 | 19:17 | |
*** tosky has quit IRC | 19:37 | |
*** slaweq has quit IRC | 20:51 | |
*** slaweq has joined #openstack-meeting-3 | 20:55 | |
*** slaweq has quit IRC | 20:59 | |
*** slaweq has joined #openstack-meeting-3 | 21:26 | |
*** slaweq has quit IRC | 21:31 | |
*** raildo has quit IRC | 22:22 | |
*** dosaboy has joined #openstack-meeting-3 | 23:28 |