Monday, 2021-11-22

bauzasgood morning Nova08:50
gibigood morning08:50
* bauzas is about to go back to bed as it looks like it's still night outside my window08:51
bauzasyou appreciate winter time when you have to turn the lights on08:51
*** simondodsley_ is now known as simondodsley09:31
*** erlon_ is now known as erlon09:32
*** TheJulia_ is now known as TheJulia09:32
*** EugenMayer3 is now known as EugenMayer10:31
gibibauzas: I think this spec is a quick +A https://review.opendev.org/c/openstack/nova-specs/+/810868 if you have a minute13:16
gibi:)13:16
bauzasgibi: I'll need to taxi my daughter in a few mins but I'll look at it after13:17
sean-k-mooneygibi: i kind of agree13:19
gibibauzas: OK, sure13:19
sean-k-mooneyi can take a look but ill leave the +w to bauzas 13:19
HenriqueofAre there any articles/guides on how overcommitting CPU/RAM degrades performance?13:26
sean-k-mooneyHenriqueof: well, overcommitting RAM is easy to understand: once you actually start overcommitting the in-use RAM it will start swapping to disk13:34
sean-k-mooneyusing something like zram as first-level swap and an actual swap partition as second-level swap, with 1 swap partition per NUMA node, can in some cases help, but only if the storage is also NUMA-aligned13:35
sean-k-mooneyHenriqueof: for CPUs it is really just a matter of contention and context-switching overhead13:36
sean-k-mooneyif your VMs are mostly idle it will be fine to overcommit them, but once the host load starts to exceed the number of CPUs the performance will degrade13:36
sean-k-mooneyyou will likely hit memory bandwidth, disk IO or network IO bottlenecks too, depending on your workloads13:37
sean-k-mooneyHenriqueof: my recommendations are: never overcommit CPUs more than 4:1, always reserve at least 1 core per NUMA node for the host, and if you are using hyperthreading, reserve the hyperthread sibling of each host core too.13:38
sean-k-mooneyHenriqueof: hyperthreading only gives you about a 1.4x increase in throughput, by the way, so don't expect to actually be able to service a load = nproc if you are using HT13:39
sean-k-mooneyHenriqueof: for memory i normally recommend never overcommitting and using hugepages, but if you must overcommit, allocate swap equal to total memory * overcommit ratio13:40
sean-k-mooneyi would not really overcommit more than about 2-4x your RAM either. but as i said, i recommend keeping memory overcommit at 1.0, so no overcommit in most cases.13:41
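That swap-sizing rule of thumb (swap = total memory * overcommit ratio) can be sketched as a quick calculation; the function name and host numbers below are illustrative, not from any nova tooling:

```python
def swap_needed_gib(total_ram_gib: float, ram_overcommit_ratio: float) -> float:
    """Swap to provision per the rule of thumb above:
    total memory * overcommit ratio."""
    return total_ram_gib * ram_overcommit_ratio

# e.g. a hypothetical 256 GiB host overcommitted 1.5:1 wants 384 GiB of swap
print(swap_needed_gib(256, 1.5))  # -> 384.0
```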
Henriqueofsean-k-mooney: You actually answered most of my questions, thank you!14:00
HenriqueofI find it odd that the OpenStack docs say CPU and RAM are overcommitted by default, but kolla-ansible doesn't seem to do that.14:02
sean-k-mooneywe have our defaults set to overcommit cpu by 16:1 and ram by 1.5:114:03
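For context, those ratios map to the long-standing nova.conf allocation-ratio options; a minimal sketch with the old defaults being discussed (newer releases changed how unset values are interpreted, so treat this as illustrative):

```ini
[DEFAULT]
# historical default: schedule up to 16 vCPUs per physical CPU
cpu_allocation_ratio = 16.0
# historical default: allow instances to claim 1.5x the physical RAM
ram_allocation_ratio = 1.5
```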
kashyapHenriqueof: Who are the users of 'kolla-ansible'?14:03
kashyap(Do people use it manually, or do tools use it mostly?)14:03
sean-k-mooneythey are old defaults from when openstack was used by nasa and rackspace mainly for webhosting/data storage14:04
sean-k-mooneykashyap: its one of the more popular installers its often used via kayobe which is supported by stackhpc https://www.stackhpc.com/pages/kayobe.html14:05
sean-k-mooneykashyap: the company johnthetubaguy[m] works at if he has not moved on.14:06
kashyapsean-k-mooney: Right; I vaguely know the tool is an installer.  Didn't know how much it is actually used in production14:06
kashyapI see, noted.14:06
sean-k-mooneykashyap: so most of the users are HPC or scientific users or government/university installations, i believe14:06
sean-k-mooneykashyap: the highest profile use is probably SKA, the Square Kilometre Array telescope14:07
kashyapCool; good to know :)14:08
Henriqueofsean-k-mooney: Really? Until now I thought kolla-ansible was one of the most popular deployment tools.14:08
sean-k-mooneyHenriqueof: it is yes14:10
sean-k-mooneyHenriqueof: im not sure how much market share it has vs tripleo, openstack charms and openstack ansible14:11
sean-k-mooneybut those are the big 4 deployment tools14:11
sean-k-mooneylooking at https://www.openstack.org/analytics14:14
sean-k-mooneyif you go to Deployment Decisions14:14
HenriqueofYeah, it is a very straightforward and stable tool, so I never felt the need to experiment with the others.14:14
sean-k-mooney29% of respondents used kolla-ansible14:14
sean-k-mooneywhich is about the same as juju/tripleo/OSA combined14:15
sean-k-mooneythat does not tell you how big the deployments are, however14:15
sean-k-mooneyso there might be more respondents using kolla-ansible, but that does not mean there are more servers managed by it; it at least gives some indication of its popularity14:16
kashyapstephenfin: Hey, have you ever used this? - sphinxcontrib-spelling14:21
kashyap[https://sphinxcontrib-spelling.readthedocs.io/en/latest/]14:21
sean-k-mooneykashyap: i would expect it would have issues with the terms we use like extra-spec14:23
kashyapRight; but still I wonder is it overall a net win or not14:24
sean-k-mooneywe likely could include a dictionary with those but that might get tedious14:24
kashyapsean-k-mooney: Yes, it can use project-specific dictionaries14:24
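For reference, wiring sphinxcontrib-spelling into a Sphinx conf.py with a project-specific word list is a small change; a sketch (the word-list filename is just a conventional choice, not anything nova ships today):

```python
# docs/source/conf.py (fragment)
extensions = [
    'sphinxcontrib.spelling',  # adds the "spelling" builder
]
# project-specific terms (e.g. extra-spec, hugepages) go one per line here
spelling_word_list_filename = 'spelling_wordlist.txt'
```

The spelling check then runs as a separate builder, e.g. `sphinx-build -b spelling`, rather than on every docs build.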
opendevreviewMerged openstack/nova-specs master: Repropose Add libvirt support for flavor and image defined ephemeral encryption  https://review.opendev.org/c/openstack/nova-specs/+/81086815:25
opendevreviewDan Smith proposed openstack/nova master: Allow per-context rule in error messages  https://review.opendev.org/c/openstack/nova/+/81686515:38
opendevreviewDan Smith proposed openstack/nova master: Revert project-specific APIs for servers  https://review.opendev.org/c/openstack/nova/+/81620615:38
dansmithgmann: johnthetubaguy[m]: Removed the WIPs from these ^ as I'm assuming there are no more fundamental concerns15:38
gmanndansmith: ack, I will check today. thanks 15:39
dansmithcan we get this merged? https://review.opendev.org/c/openstack/nova/+/81703015:48
dansmithit's already being used to debug gate and real VIF plugging event failures15:48
gibidansmith: done15:49
dansmithgibi: thanks15:51
kashyapIn CirrOS latest 0.5.2, where is this file?  /etc/cirros-init/config?16:51
kashyapIs it moved to somewhere else?  /me didn't find it in a quick libguestfs inspection16:51
kashyapActually, ignore me.  It's still there.16:57
opendevreviewMerged openstack/nova master: Log instance event wait times  https://review.opendev.org/c/openstack/nova/+/81703017:35
*** lucasagomes_ is now known as lucasagomes18:22
opendevreviewMerged openstack/nova master: nova-manage: Always get BDMs using get_by_volume_and_instance  https://review.opendev.org/c/openstack/nova/+/81171618:39
mnaserhi y'all18:56
mnaserhas anyone run into an issue where the api stops responding if the notification transport is failing?18:56
mnaseri.e. oslo_messaging_notifications/transport_url = rabbit://foobar, where foobar goes down, and the DEFAULT/transport_url is still up, but i guess the threads all get blocked till it grinds to a halt?18:57
mnaseri've repro'd on a customer environment that is deployed by OSA but i'm trying to get a devstack up right now and get a GMR to see how it hangs18:58
sean-k-mooney it might be related to the heartbeat18:59
sean-k-mooneyor the wsgi server18:59
sean-k-mooneyif you are using mod_wsgi under apache each worker will only ever service 1 api request at a time19:00
sean-k-mooneywe may monkey patch the api but that will never allow the apache process to service a second request in parallel as that is managed by apache19:01
sean-k-mooneyif all the api workers are trying to do something that needs rabbit then it will stop responding until the request or rpc timeout fires and it returns an error19:01
sean-k-mooneyi dont know if uwsgi is better in that regard19:02
mnasersean-k-mooney: OSA deploys with uwsgi19:03
mnasersean-k-mooney: i'm still doing my research, but also, i suspect this affects n-cond too19:03
mnaserand anything rabbit related, it seems like the notification blocks the main process19:03
mnaseror maybe when the queue of unsent messages gets so big, the whole process bogs down19:04
mnaseror it has a limit of threads it will bubble up to and then the whole process stops responding19:05
sean-k-mooneyits possible that the eventlet thread pool will fill up eventually19:05
sean-k-mooneyhopefully this is something i can detect as part of the health check work19:06
mnaserafaik i think the default timeout or retry is set to 0 with notifications19:06
sean-k-mooneywell notifications are off by default19:06
sean-k-mooneyor rather we use the noop driver19:06
mnaserright yes, but if you turn them on, retries=0 so retry forever19:06
sean-k-mooneyi would expect 0 to be retry never19:07
sean-k-mooneyand -1 to be retry forever19:07
mnaser0 is retry forever in notifier i think, let me double check19:07
mnasersean-k-mooney: btw i suggest looking at how we do health checks in openstack-helm, it has some neat things where it actually makes an rpc call to the local instance and makes sure we get an error back saying "not valid call"19:07
mnaserthere's some neat stuff there that might draw inspiration19:07
mnasersean-k-mooney: https://opendev.org/openstack/openstack-helm/src/branch/master/neutron/templates/bin/_health-probe.py.tpl19:07
sean-k-mooneymnaser: i wanted to do active probes but the direction at the ptg was that was not ok19:08
mnaserthis one pretty much runs the check when it's asked19:08
sean-k-mooneymaybe after the initial work is done we can add a probe endpoint but it will initially be based on cached state19:08
sean-k-mooneymnaser: ya that is what i was going to do but it was rejected when i proposed it19:08
mnaserhttps://opendev.org/openstack/openstack-helm/src/branch/master/nova/templates/bin/_health-probe.py.tpl is how its done for nova19:08
sean-k-mooneymnaser: well that is writing to the nova message bus19:09
sean-k-mooneyso that is not allowed by anything that is not a part of nova19:09
mnaseryes it might not be very clean but it works(tm)19:09
sean-k-mooneysure and will void any downstream support you have with your vendor19:10
mnaserfair enough, my downstream support is me =P19:10
sean-k-mooneybut ya, probing the queue was one of the things i wanted to do19:10
sean-k-mooneywe might add a way to do that at some point19:11
mnaserbtw, you were right, -1 is indefinite, and it defaults to that => https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/notify/notifier.py#L55-L5819:11
sean-k-mooneyack19:11
mnaseri guess oslo messaging doesnt have a timeout19:11
sean-k-mooneyya im not sure19:19
opendevreviewArtom Lifshitz proposed openstack/nova master: api-ref: Adjust BFV rescue non-support note.  https://review.opendev.org/c/openstack/nova/+/81882319:19
sean-k-mooneylikely you should change the default to be, say, 10 or similar19:19
sean-k-mooneyin OSA19:19
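As a sketch, capping the notification retry in the oslo.messaging config would look like this; the retry value of 10 is just an illustrative choice (per the notifier code linked earlier, the default of -1 retries indefinitely):

```ini
[oslo_messaging_notifications]
driver = messagingv2
transport_url = rabbit://foobar
# fail after 10 attempts instead of retrying forever (-1, the default),
# so a dead notification bus cannot wedge the service indefinitely
retry = 10
```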
opendevreviewMerged openstack/nova stable/xena: Add a WA flag waiting for vif-plugged event during reboot  https://review.opendev.org/c/openstack/nova/+/81851520:04
mnasersean-k-mooney: well it sounds like maybe that's not a great default value i guess20:16
sean-k-mooneymnaser: i assume notifications are off in OSA by default. if it's enabled and no default is specified for retry i would probably default to 0, 1 or 3, but not -120:18
sean-k-mooneyor just make it an error20:18
mnasersean-k-mooney: yeah, im thinking more of saner oslo.messaging defaults20:18
sean-k-mooneyrequire it to be set20:18
sean-k-mooneywell again it depends on your setup, you might rely on notifications20:19
sean-k-mooneybut if you do then you also need to have monitoring in place to know when there are rabbit issues20:19
sean-k-mooneyand correct that20:19
mnasersean-k-mooney: yeah but to me it sounds like notifications failing should not result in nova falling apart20:20
sean-k-mooneywell it should not, but that might just mean that -1 for retry is not a valid value20:21
sean-k-mooney-1 presumably means you must keep every notification in memory20:21
sean-k-mooneyuntil it's sent20:21
sean-k-mooneywith a cooperative threading model like eventlet, if you have enough notification eventlets pending it will eventually degrade the performance of the service20:22
mnaseryeah im trying to repro right now20:23
opendevreviewStanislav Dmitriev proposed openstack/nova master: Retry image download if it's corrupted  https://review.opendev.org/c/openstack/nova/+/81850321:21
opendevreviewDmitrii Shcherbakov proposed openstack/nova master: [yoga] Add PCI VPD Capability Handling  https://review.opendev.org/c/openstack/nova/+/80819922:06
opendevreviewDmitrii Shcherbakov proposed openstack/nova master: [yoga] Support remote-managed SmartNIC DPU ports  https://review.opendev.org/c/openstack/nova/+/81211122:06

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!