Monday, 2021-11-22

bauzasgood morning Nova08:50
gibigood morning08:50
* bauzas is about to go back to bed as it looks like it's still night outside my window08:51
bauzasyou appreciate winter time when you have to turn the lights on08:51
*** simondodsley_ is now known as simondodsley09:31
*** erlon_ is now known as erlon09:32
*** TheJulia_ is now known as TheJulia09:32
*** EugenMayer3 is now known as EugenMayer10:31
gibibauzas: I think this spec is a quick +A https://review.opendev.org/c/openstack/nova-specs/+/810868 if you have a minute13:16
gibi:)13:16
bauzasgibi: I'll need to taxi my daughter in a few mins but I'll look at it after13:17
sean-k-mooneygibi: i kind of agree13:19
gibibauzas: OK, sure13:19
sean-k-mooneyi can take a look but ill leave the +w to bauzas 13:19
HenriqueofAre there any articles/guides on how overcommitting CPU/RAM degrades performance?13:26
sean-k-mooneyHenriqueof: well, overcommitting RAM is easy to understand: once you actually start overcommitting the in-use RAM it will start swapping to disk13:34
sean-k-mooneyusing something like zram as first-level swap and an actual swap partition as second-level swap, with 1 swap partition per NUMA node, can in some cases help, but only if the storage is also NUMA-aligned13:35
sean-k-mooneyHenriqueof: for CPUs it is really just a matter of contention and context-switching overhead13:36
sean-k-mooneyif your VMs are mostly idle it will be fine to overcommit them, but once the host load starts to exceed the number of CPUs the performance will degrade13:36
sean-k-mooneyyou will likely hit memory bandwidth, disk IO or network IO bottlenecks too, depending on your workloads13:37
sean-k-mooneyHenriqueof: my recommendations are: never overcommit CPUs more than 4:1, always reserve at least 1 core per NUMA node for the host, and if you are using hyperthreading, reserve the hyperthread sibling of each host core too.13:38
sean-k-mooneyHenriqueof: hyperthreading only gives you about a 1.4x increase in throughput, by the way, so don't expect to actually be able to service a load = nproc if you are using HT13:39
sean-k-mooneyHenriqueof: for memory i normally recommend never overcommitting and using hugepages, but if you must overcommit, allocate swap equal to total memory * overcommit ratio13:40
sean-k-mooneyi would not really overcommit more than about 2-4x your RAM either. but as i said, i recommend keeping memory overcommit at 1.0, so no overcommit in most cases.13:41
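That swap-sizing rule of thumb (swap = total memory * overcommit ratio) can be sketched as a quick calculation; the function name and host numbers below are illustrative, not from any nova tooling:

```python
def swap_needed_gib(total_ram_gib: float, ram_overcommit_ratio: float) -> float:
    """Swap to provision per the rule of thumb above:
    total memory * overcommit ratio."""
    return total_ram_gib * ram_overcommit_ratio

# e.g. a hypothetical 256 GiB host overcommitted 1.5:1 wants 384 GiB of swap
print(swap_needed_gib(256, 1.5))  # -> 384.0
```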
Henriqueofsean-k-mooney: You actually answered most of my questions, thank you!14:00
HenriqueofI find it odd that the OpenStack docs say CPU and RAM are overcommitted by default, but kolla-ansible doesn't seem to do that.14:02
sean-k-mooneywe have our defaults set to overcommit cpu by 16:1 and ram by 1.5:114:03
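For context, those ratios map to the long-standing nova.conf allocation-ratio options; a minimal sketch with the old defaults being discussed (newer releases changed how unset values are interpreted, so treat this as illustrative):

```ini
[DEFAULT]
# historical default: schedule up to 16 vCPUs per physical CPU
cpu_allocation_ratio = 16.0
# historical default: allow instances to claim 1.5x the physical RAM
ram_allocation_ratio = 1.5
```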
kashyapHenriqueof: Who are the users of 'kolla-ansible'?14:03
kashyap(Do people use it manually, or do tools use it mostly?)14:03
sean-k-mooneythey are old defaults from when openstack was used by nasa and rackspace mainly for webhosting/data storage14:04
sean-k-mooneykashyap: its one of the more popular installers its often used via kayobe which is supported by stackhpc https://www.stackhpc.com/pages/kayobe.html14:05
sean-k-mooneykashyap: the company johnthetubaguy[m] works at if he has not moved on.14:06
kashyapsean-k-mooney: Right; I vaguely know the tool is an installer.  Didn't know how much it is actually used in production14:06
kashyapI see, noted.14:06
sean-k-mooneykashyap: so most of the users are HPC or scientific users or government/university installations, i believe14:06
sean-k-mooneykashyap: the highest profile use is probably SKA, the Square Kilometre Array telescope14:07
kashyapCool; good to know :)14:08
Henriqueofsean-k-mooney: Really? Until now I thought kolla-ansible was one of the most popular deployment tools.14:08
sean-k-mooneyHenriqueof: it is yes14:10
sean-k-mooneyHenriqueof: im not sure how much market share it has vs tripleo, openstack charms and openstack ansible14:11
sean-k-mooneybut those are the big 4 deployment tools14:11
sean-k-mooneylooking at https://www.openstack.org/analytics14:14
sean-k-mooneyif you go to Deployment Decisions14:14
HenriqueofYeah, it is a very straightforward and stable tool, so I never felt the need to experiment with the others.14:14
sean-k-mooney29% of respondents used kolla-ansible14:14
sean-k-mooneywhich is about the same as juju/tripleo/OSA combined14:15
sean-k-mooneythat does not tell you how big the deployments are, however14:15
sean-k-mooneyso there might be more respondents using kolla-ansible, but that does not mean there are more servers managed by it; it at least gives some indication of its popularity14:16
kashyapstephenfin: Hey, have you ever used this? - sphinxcontrib-spelling14:21
kashyap[https://sphinxcontrib-spelling.readthedocs.io/en/latest/]14:21
sean-k-mooneykashyap: i would expect it would have issues with the terms we use like extra-spec14:23
kashyapRight; but still I wonder is it overall a net win or not14:24
sean-k-mooneywe likely could include a dictionary with those but that might get tedious14:24
kashyapsean-k-mooney: Yes, it can use project-specific dictionaries14:24
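For reference, wiring sphinxcontrib-spelling into a Sphinx conf.py with a project-specific word list is a small change; a sketch (the word-list filename is just a conventional choice, not anything nova ships today):

```python
# docs/source/conf.py (fragment)
extensions = [
    'sphinxcontrib.spelling',  # adds the "spelling" builder
]
# project-specific terms (e.g. extra-spec, hugepages) go one per line here
spelling_word_list_filename = 'spelling_wordlist.txt'
```

The spelling check then runs as a separate builder, e.g. `sphinx-build -b spelling`, rather than on every docs build.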
opendevreviewMerged openstack/nova-specs master: Repropose Add libvirt support for flavor and image defined ephemeral encryption  https://review.opendev.org/c/openstack/nova-specs/+/81086815:25
opendevreviewDan Smith proposed openstack/nova master: Allow per-context rule in error messages  https://review.opendev.org/c/openstack/nova/+/81686515:38
opendevreviewDan Smith proposed openstack/nova master: Revert project-specific APIs for servers  https://review.opendev.org/c/openstack/nova/+/81620615:38
dansmithgmann: johnthetubaguy[m]: Removed the WIPs from these ^ as I'm assuming there are no more fundamental concerns15:38
gmanndansmith: ack, I will check today. thanks 15:39
dansmithcan we get this merged? https://review.opendev.org/c/openstack/nova/+/81703015:48
dansmithit's already being used to debug gate and real VIF plugging event failures15:48
gibidansmith: done15:49
dansmithgibi: thanks15:51
kashyapIn CirrOS latest 0.5.2, where is this file?  /etc/cirros-init/config?16:51
kashyapIs it moved to somewhere else?  /me didn't find it in a quick libguestfs inspection16:51
kashyapActually, ignore me.  It's still there.16:57
opendevreviewMerged openstack/nova master: Log instance event wait times  https://review.opendev.org/c/openstack/nova/+/81703017:35
*** lucasagomes_ is now known as lucasagomes18:22
opendevreviewMerged openstack/nova master: nova-manage: Always get BDMs using get_by_volume_and_instance  https://review.opendev.org/c/openstack/nova/+/81171618:39
mnaserhi y'all18:56
mnaserhas anyone run into an issue where the api stops responding if the notification transport is failing?18:56
mnaseri.e. oslo_messaging_notifications/transport_url = rabbit://foobar, where foobar goes down, and the DEFAULT/transport_url is still up, but i guess the threads all get blocked till it grinds to a halt?18:57
mnaseri've repro'd on a customer environment that is deployed by OSA but i'm trying to get a devstack up right now and get a GMR to see how it hangs18:58
sean-k-mooney it might be related to the heartbeat18:59
sean-k-mooneyor the wsgi server18:59
sean-k-mooneyif you are using mod_wsgi under apache each worker will only ever service 1 api request at a time19:00
sean-k-mooneywe may monkey patch the api but that will never allow the apache process to service a second request in parallel as that is managed by apache19:01
sean-k-mooneyif all the api workers are trying to do something that needs rabbit then it will stop responding until the request or rpc timeout fires and it returns an error19:01
sean-k-mooneyi dont know if uwsgi is better in that regard19:02
mnasersean-k-mooney: OSA deploys with uwsgi19:03
mnasersean-k-mooney: i'm still doing my research, but also, i suspect this affects n-cond too19:03
mnaserand anything rabbit related, it seems like the notification blocks the main process19:03
mnaseror maybe when the queue of unsent messages gets so big, the whole process bogs down19:04
mnaseror it has a limit of threads it will bubble up to and then the whole process stops responding19:05
sean-k-mooneyits possible that the eventlet thread pool will fill up eventually19:05
sean-k-mooneyhopefully this is something i can detect as part of the health check work19:06
mnaserafaik i think the default timeout or retry is set to 0 with notifications19:06
sean-k-mooneywell notifications are off by default19:06
sean-k-mooneyor rather we use the noop driver19:06
mnaserright yes, but if you turn them on, retries=0 so retry forever19:06
sean-k-mooneyi would expect 0 to be retry never19:07
sean-k-mooneyand -1 to be retry forever19:07
mnaser0 is retry forever in notifier i think, let me double check19:07
mnasersean-k-mooney: btw i suggest looking at how we do health checks in openstack-helm, it has some neat things where it actually makes an rpc call to the local instance and makes sure we get an error back saying "not valid call"19:07
mnaserthere's some neat stuff there that might draw inspiration19:07
mnasersean-k-mooney: https://opendev.org/openstack/openstack-helm/src/branch/master/neutron/templates/bin/_health-probe.py.tpl19:07
sean-k-mooneymnaser: i wanted to do active probes but the direction at the ptg was that was not ok19:08
mnaserthis one pretty much runs the check when it's asked19:08
sean-k-mooneymaybe after the initial work is done we can add a probe endpoint but it will initially be based on cached state19:08
sean-k-mooneymnaser: ya that is what i was going to do but it was rejected when i proposed it19:08
mnaserhttps://opendev.org/openstack/openstack-helm/src/branch/master/nova/templates/bin/_health-probe.py.tpl is how its done for nova19:08
sean-k-mooneymnaser: well that is writing to the nova message bus19:09
sean-k-mooneyso that is not allowed by anything that is not a part of nova19:09
mnaseryes it might not be very clean but it works(tm)19:09
sean-k-mooneysure and will void any downstream support you have with your vendor19:10
mnaserfair enough, my downstream support is me =P19:10
sean-k-mooneybut ya, probing the queue was one of the things i wanted to do19:10
sean-k-mooneywe might add a way to do that at some point19:11
mnaserbtw, you were right, -1 is indefinite, and it defaults to that => https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/notify/notifier.py#L55-L5819:11
sean-k-mooneyack19:11
mnaseri guess oslo messaging doesnt have a timeout19:11
sean-k-mooneyya im not sure19:19
opendevreviewArtom Lifshitz proposed openstack/nova master: api-ref: Adjust BFV rescue non-support note.  https://review.opendev.org/c/openstack/nova/+/81882319:19
sean-k-mooneylikely you should change the default to be, say, 10 or similar19:19
sean-k-mooneyin OSA19:19
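As a sketch, capping the notification retry in the oslo.messaging config would look like this; the retry value of 10 is just an illustrative choice (per the notifier code linked earlier, the default of -1 retries indefinitely):

```ini
[oslo_messaging_notifications]
driver = messagingv2
transport_url = rabbit://foobar
# fail after 10 attempts instead of retrying forever (-1, the default),
# so a dead notification bus cannot wedge the service indefinitely
retry = 10
```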
opendevreviewMerged openstack/nova stable/xena: Add a WA flag waiting for vif-plugged event during reboot  https://review.opendev.org/c/openstack/nova/+/81851520:04
mnasersean-k-mooney: well it sounds like maybe that's not a great default value i guess20:16
sean-k-mooneymnaser: i assume notifications are off in OSA by default. if it's enabled and no default is specified for retry i would probably default to 0, 1 or 3, but not -120:18
sean-k-mooneyor just make it an error20:18
mnasersean-k-mooney: yeah, im thinking more of saner oslo.messaging defaults20:18
sean-k-mooneyrequire it to be set20:18
sean-k-mooneywell again it depends on your setup, you might rely on notifications20:19
sean-k-mooneybut if you do then you also need to have monitoring in place to know when there are rabbit issues20:19
sean-k-mooneyand correct that20:19
mnasersean-k-mooney: yeah but to me it sounds like notifications failing should not result in nova falling apart20:20
sean-k-mooneywell it should not, but that might just mean that -1 for retry is not a valid value20:21
sean-k-mooney-1 presumably means you must keep every notification in memory20:21
sean-k-mooneyuntil it's sent20:21
sean-k-mooneywith a cooperative threading model like eventlet, if you have enough notification eventlets pending it will eventually degrade the performance of the service20:22
mnaseryeah im trying to repro right now20:23
opendevreviewStanislav Dmitriev proposed openstack/nova master: Retry image download if it's corrupted  https://review.opendev.org/c/openstack/nova/+/81850321:21
opendevreviewDmitrii Shcherbakov proposed openstack/nova master: [yoga] Add PCI VPD Capability Handling  https://review.opendev.org/c/openstack/nova/+/80819922:06
opendevreviewDmitrii Shcherbakov proposed openstack/nova master: [yoga] Support remote-managed SmartNIC DPU ports  https://review.opendev.org/c/openstack/nova/+/81211122:06

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!