Wednesday, 2024-04-24

ttxo/09:00
amorino/09:01
ttx#startmeeting large_scale_sig09:01
opendevmeetMeeting started Wed Apr 24 09:01:18 2024 UTC and is due to finish in 60 minutes.  The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.09:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.09:01
opendevmeetThe meeting name has been set to 'large_scale_sig'09:01
ttx#topic Rollcall09:01
ttxamorin: o/09:01
amorinhey there09:01
ttxAnyone else here for the large scale sig meeting?09:01
amorinmaybe felixhuettner[m] ?09:02
ttxOur agenda for today's meeting is09:02
ttx#link https://etherpad.opendev.org/p/large-scale-sig-meeting09:02
amorinI added info on this etherpad about our recent rabbitmq improvements09:04
ttxOK I'll read09:04
ttxPinging felix.huettner 09:04
amorinwe can talk about it during open discussion I think09:05
ttx#topic Ops Deep Dive proposals review09:05
ttxSo we did not get anything, which makes me wonder if we should change our strategy here09:05
felixhuettner[m]o/ sorry09:05
amorinI agree09:05
ttxMaybe plan to do one ourselves in the future to keep it hot and running09:05
felixhuettner[m]we can do that, it's just a question of topics09:06
ttxWould be great to hit something trendy... like a GPU-oriented cloud, or VMWare transitioning09:06
ttxMaybe we can think about topics between now and next meeting and come back with proposals ?09:06
felixhuettner[m]we have a lot of people asking about confidential VMs/GPUs09:07
felixhuettner[m]might also be an interesting topic 09:07
felixhuettner[m]although it is quite snakeoily09:07
amorinok, I can try to reach our internal AI/GPU team09:07
ttxyeah, would be well aligned with the Foundation's 2024 messaging09:07
amorin#action amorin reach OVHcloud AI team about openstack GPU integration09:08
ttx#action everyone to think about a hot theme for our next episode, to be discussed at May meeting09:08
ttxfeel free to email me if you have something urgent earlier09:09
ttxAny other idea before we move on?09:09
amorinnop09:10
ttx#topic Large scale doc09:10
ttxNo open reviews...09:10
ttxamorin: do you want to discuss your rabbitMQ points ? Not sure large scale doc is the right outlet09:11
amorinI can do it here09:11
ttxWondering if a technical blog post would not be better for it09:11
amorinit's not only doc09:11
amorina blog post where?09:12
ttxCould be on an OVH platform that we would promote... or maybe directly on https://www.openstack.org/blog/09:13
amorinyes09:14
amorinI'll talk about it internally then and let you know which one is best09:14
amorinwe will talk about it on OpenInfra Day France also09:14
ttxOK, in the meantime I will ask marketing here if they would be interested in featuring it on the openstack blog09:14
ttxyeah, that's great (OID France talk)09:15
amorinthe thing is that not all of the stuff is upstream / merged yet09:15
ttxcould be on superuser instead09:15
ttxopenstack blog is more about things that landed, but superuser can be more opinionated09:16
amorinand for some of them, I'd like to switch the default behavior on the oslo.messaging side, but I don't know how to proceed; maybe the large-scale group could help me push it09:16
ttxI'll ask and let you know09:16
amorinok!09:16
ttxI think it's a good story, between what you already contributed and what you would like to change in the future09:17
ttxin addition to being good large scale advice09:17
amorinyup09:17
amorinShould I expose some of the points in this discussion now for the record?09:17
ttxyes please09:17
amorinLet's go, stop me if needed :)09:18
amorinBoth nova and neutron heavily rely on rabbitmq for intra communication (between agents running on computes and API running on control plane).09:18
amorinRabbitMQ clustering is a must have to let operators manage the lifecycle of rabbitMQ09:18
amorinSome recent improvements have been made to oslo.messaging to allow better scaling and management of rabbitmq queues09:18
amorinHere is a list of what we did on OVH side to achieve better stability at large scale09:18
amorin- Better eventlet / green thread management09:18
amorinOn that part, OVH basically finished the work to keep green threads from failing to send heartbeats09:19
amorinpatches are merged upstream and available by default.09:19
amorin- Replace classic HA with quorum09:19
amorinQuorum queues are the future for rabbitmq, replacing classic HA09:19
amorinOVH did a patch to finish this implementation09:20
amorinUsing quorum queues is not yet the default and we would like to enable this by default, we may need help from large-scale about this09:20
amorin- Consistent queue naming09:20
amorinoslo.messaging was relying on random queue naming.09:20
amorinWhile this does not seem to be a problem on small deployments, it has two bad side effects:09:20
amorin- it's harder to figure out which service created a specific queue09:20
amorin- as soon as you restart your services, new random queues are created, leaving a lot of orphaned queues in rabbitmq09:20
amorinThese side effects are highly visible at large scale, and even more visible when using quorum queues.09:20
amorinWe did a patch on oslo.messaging to stop using random names.09:20
amorinThis is now merged upstream, but disabled by default.09:20
amorinWe would like to enable this by default in the future.09:21
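[Editor's sketch] The two side effects of random queue names described above can be illustrated with a small simulation. This is not oslo.messaging code: the name formats, the host name "compute-42" and the helper functions are all hypothetical, and the "broker" is just a Python set that never cleans up after a restart.

```python
import uuid


def random_queue_name(topic):
    """Hypothetical format: a fresh random suffix on every service start."""
    return f"{topic}_{uuid.uuid4().hex}"


def stable_queue_name(topic, hostname, process):
    """Hypothetical format: derived only from the stable identity of the service."""
    return f"{topic}_{hostname}_{process}"


def simulate_restarts(make_name, restarts):
    """Return the queue names left on the broker after `restarts` service starts,
    assuming nothing ever deletes the queues of the previous run."""
    broker = set()
    for _ in range(restarts):
        broker.add(make_name())
    return broker


random_q = simulate_restarts(lambda: random_queue_name("reply"), 5)
stable_q = simulate_restarts(
    lambda: stable_queue_name("reply", "compute-42", "nova-compute"), 5
)
print(len(random_q), "queues left with random names")
print(len(stable_q), "queue left with a stable name:", next(iter(stable_q)))
```

With random names every restart leaves one more orphan behind, and none of the names says which host owns it. With a stable name the same queue is reused and the owner is readable from the name itself.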
ttxquestion09:21
amorinThis is very important for us, and it helped a lot identifying which compute is affected by a rabbit issue09:21
amorinyes?09:21
ttxShould we document the setting that enables quorum queue in the large scale doc first?09:21
ttxas a step toward making it default09:22
amorinyes, we should09:22
amorinthe bad thing is that it's not easy to switch09:22
amorinbecause it needs all queues to be destroyed and recreated09:22
amorinoslo.messaging is not able to do it09:22
ttxFor all the settings that you suggest should be default we should first document them in large-scale doc09:23
amorinwhat we did on our side: we stopped all openstack services, destroyed the queues, switched the flag, and started everything back up09:23
ttxmaybe there is a way to make it default for new deployments only09:23
amorinI agree about pushing it on doc09:23
ttx(not affecting existing ones)09:23
amorin#action push documentation about enabling quorum queues and other rabbitmq improvement settings for large scale09:24
amorinok, let's continue09:25
amorin- Reduce the number of queues09:25
amorinRabbitMQ is a message broker, not a queue broker.09:25
amorinWith a high number of queues, rabbitmq does not work correctly (timeouts / cpu usage / network usage / etc.).09:25
amorinOVH did some patches to reduce the number of queues created by neutron by patching oslo.messaging and neutron code (we divided neutron's number of queues by 5).09:25
amorinThis is not yet upstream, we would like to push it09:25
amorinWithout those patches, our biggest regions (2k computes) were consuming a lot of rabbitmq resources and it was hard to keep it working correctly09:26
amorinNext thing is:09:26
amorin- Replace classic fanouts with streams09:26
amorinOVH did a patch to rely on "stream" queues to replace classic fanouts.09:27
amorinthis is reducing the number of queues (a lot) and also the number of messages transiting through rabbit09:27
amorinrabbit is now in charge of deduplicating the identical messages to all computes09:28
amorinThose patches are merged upstream but disabled by default09:28
amorinWe would like to enable this by default.09:28
amorin(will do doc about it)09:28
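[Editor's sketch] A rough count of why a shared stream is cheaper than classic fanout at the 2k-compute scale mentioned above. The model is a simplifying assumption, not a measurement: classic fanout is taken as one queue per consumer with one stored copy of each broadcast per queue, a stream as one shared log that every consumer reads by offset. Replication and consumer-side traffic are ignored, and both function names are hypothetical.

```python
def classic_fanout_cost(consumers, broadcasts):
    """One queue per consumer; each broadcast is copied into every queue."""
    return {"queues": consumers, "stored_messages": consumers * broadcasts}


def stream_fanout_cost(consumers, broadcasts):
    """One shared stream; each broadcast is stored once, whatever the consumer count."""
    return {"queues": 1, "stored_messages": broadcasts}


computes = 2000   # region size quoted in the discussion
broadcasts = 10   # arbitrary number of fanout messages for the example

classic = classic_fanout_cost(computes, broadcasts)
stream = stream_fanout_cost(computes, broadcasts)
print("classic fanout:", classic)
print("stream fanout: ", stream)
```

Under this model the classic side grows linearly with the number of computes, while the stream side stays constant.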
amorinlast point is:09:28
amorin- Get rid of 'transient' queues09:28
amorinoslo.messaging is distinguishing 'transient' queues from other queues but it makes no sense09:28
amorinneutron and nova do not rely on transient queues09:28
amorinE.G. a reply_xyz queue is transient in oslo, but it needs to be fully replicated for nova/neutron to work correctly09:29
amorinin the past, such queues were not HA, so if you had to switch off one server of your rabbitmq cluster, you would affect part of your deployment09:30
amorinlosing messages, leading to inconsistency, etc.09:30
amorinby making those queues HA, you can operate your rabbitmq cluster correctly09:30
amorinTLDR, transient queues are nonsense09:31
ttxThat one might be the trickiest to convince people out of, due to history09:31
amorinyup09:31
ttxbut an article can be the first step09:31
amorinbut removing this concept from oslo would greatly simplify the code and the configuration, and make the whole thing easier to understand09:31
amorinthat's all! thank you for reading :)09:32
ttxand together with other actions show that you did your homework and speak from a good expertise position09:32
amorinI am proud of the team on that part, we gained a lot of experience on the rabbit side recently09:32
ttxagreed! It's why it's good material for "superuser" which is user-oriented09:33
ttxbut I'll defer to Allison on where is best09:33
amorinthanks!09:33
ttxanything else on this topic?09:33
amorinI am done09:33
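[Editor's sketch] The settings discussed in this topic could be collected into one config fragment along these lines. The option names are not given in the meeting; they are the editor's recollection of the `[oslo_messaging_rabbit]` group in recent oslo.messaging releases and should be checked against the configuration reference of the release actually deployed. As noted above, changing the queue type on an existing deployment also requires the queues to be destroyed and recreated.

```python
import configparser
import io


def large_scale_rabbit_options():
    """Build an example [oslo_messaging_rabbit] section.

    Option names are assumptions to verify against your oslo.messaging release.
    """
    cfg = configparser.ConfigParser()
    cfg["oslo_messaging_rabbit"] = {
        # Quorum queues instead of classic HA queues.
        "rabbit_quorum_queue": "true",
        # Treat 'transient' queues (reply_*, fanout) as quorum queues too.
        "rabbit_transient_quorum_queue": "true",
        # Consistent (non-random) queue naming.
        "use_queue_manager": "true",
        # Stream queues instead of classic fanout queues.
        "rabbit_stream_fanout": "true",
    }
    return cfg


buf = io.StringIO()
large_scale_rabbit_options().write(buf)
fragment = buf.getvalue()
print(fragment)
```

The printed fragment is meant to be merged by hand into a service configuration file such as nova.conf or neutron.conf, not applied automatically.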
ttx#topic Next meeting(s)09:34
ttxNext meeting is May 22, which conflicts with OID France09:34
ttxSo I was thinking moving the meeting to the 3rd wednesday instead, which would be May 15, June 19...09:34
ttxrather than skipping it09:35
ttxthoughts on that?09:35
amorinthat works for me09:35
ttxfelix.huettner: ?09:35
felixhuettner[m]i'm not there on the 15th May09:36
felixhuettner[m]but otherwise that is fine with me09:36
ttxHmm, alternatively we could move it back one week exceptionally in May (May 29, June 19), but that is trickier/impossible to specify in the booking system09:38
ttxErr, that would be May 29, June 2609:38
ttxAlso I generally have conflicts on the last week of months so would not mind moving up09:39
ttxwhat should we do?09:40
felixhuettner[m]but we can just move it to the 3rd wednesday09:41
ttxok09:41
ttx#action ttx to move meeting to 3rd wednesday every month09:41
felixhuettner[m]then i'm just not there for one session09:41
amorinyes09:41
ttxOK, if you have episode ideas in the meantime, let us know by email :)09:42
ttx#topic Open discussion09:42
ttxAnything else, anyone?09:42
amorinnop09:42
ttxalright then09:43
ttx#endmeeting09:43
opendevmeetMeeting ended Wed Apr 24 09:43:28 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)09:43
opendevmeetMinutes:        https://meetings.opendev.org/meetings/large_scale_sig/2024/large_scale_sig.2024-04-24-09.01.html09:43
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/large_scale_sig/2024/large_scale_sig.2024-04-24-09.01.txt09:43
opendevmeetLog:            https://meetings.opendev.org/meetings/large_scale_sig/2024/large_scale_sig.2024-04-24-09.01.log.html09:43
felixhuettner[m]<amorin> "what we did on our side: we..." <- we did this with a little less downtime:09:43
felixhuettner[m]1. patching oslo.messaging/kombu to not care, when connecting to an existing queue, if it is of a different type, and rolling this out to all machines09:44
felixhuettner[m]2. kill all queues one-by-one09:44
felixhuettner[m]3. punch a few openstack services that did not recreate queues correctly09:44
amorinThat's nice09:46
amorinMaybe we could patch Oslo to have a config param for this09:46
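[Editor's sketch] Step 2 of the procedure above (killing queues one by one) could be prepared with a helper like this, which only selects the queues to delete and builds the matching RabbitMQ management API URLs, without sending any request. Assumptions: the management plugin is enabled, the queue listing exposes a `type` field (recent RabbitMQ versions do), and the host name and sample queue names are made up for the example.

```python
from urllib.parse import quote


def queues_to_recreate(queues, wanted_type="quorum"):
    """Pick the queues whose type differs from the one we want to end up with."""
    return [q for q in queues if q.get("type", "classic") != wanted_type]


def delete_url(base_url, vhost, name):
    """URL for DELETE /api/queues/<vhost>/<name>; the default vhost "/" becomes %2F."""
    return f"{base_url}/api/queues/{quote(vhost, safe='')}/{quote(name, safe='')}"


# Hypothetical sample, shaped like the output of GET /api/queues.
sample = [
    {"name": "reply_abc", "vhost": "/", "type": "classic"},
    {"name": "q-plugin", "vhost": "/", "type": "quorum"},
]

pending = queues_to_recreate(sample)
for q in pending:
    print("DELETE", delete_url("http://rabbit.example:15672", q["vhost"], q["name"]))
```

Deleting one queue at a time and letting the patched services recreate it with the new type is what keeps the downtime lower than a full stop, wipe and restart.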
