Thursday, 2023-11-02

opendevreview	Sahid Orentino Ferdjaoui proposed openstack/nova master: scheduler: AggregateMultitenancyIsolation to support unlimited tenant https://review.opendev.org/c/openstack/nova/+/896512	08:41
opendevreview	Pavlo Shchelokovskyy proposed openstack/nova master: Add debug logging to can_fit_pagesize https://review.opendev.org/c/openstack/nova/+/899942	10:11
opendevreview	Alexey Stupnikov proposed openstack/nova stable/2023.1: Translate VF network capabilities to port binding https://review.opendev.org/c/openstack/nova/+/898945	10:20
opendevreview	sean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/nova/+/899753	11:10
opendevreview	sean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/nova/+/899753	11:14
opendevreview	sean mooney proposed openstack/os-vif master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/os-vif/+/899946	11:21
opendevreview	sean mooney proposed openstack/placement master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/placement/+/899947	11:23
opendevreview	sean mooney proposed openstack/os-traits master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/os-traits/+/899948	11:24
opendevreview	sean mooney proposed openstack/os-resource-classes master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/os-resource-classes/+/899949	11:25
opendevreview	sean mooney proposed openstack/python-novaclient master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/python-novaclient/+/899950	11:27
*** ozzzo1 is now known as ozzzo		11:50
*** eandersson0 is now known as eandersson		11:50
opendevreview	sean mooney proposed openstack/osc-placement master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/osc-placement/+/899955	11:54
greatgatsby	Good day. Having trouble converting a `nova boot` command to the `openstack server create` equivalent. With `nova boot` I can have a one-liner that creates and boots from a volume that also has the delete on instance termination set by using --block-device and shutdown=remove. However, with `openstack server create` the only thing I can get to work is first doing a --boot-from-volume and then using `openstack volume update	12:40
greatgatsby	--delete-on-termination`.	12:40
opendevreview	sean mooney proposed openstack/nova master: Use centos-9-stream for functional-py39 job https://review.opendev.org/c/openstack/nova/+/899959	12:44
greatgatsby	If I try to use a similar --block-device arg for `openstack server create` I get an error saying that one of --image --image-property --volume --snapshot is required	12:44
sean-k-mooney	i are you trying to add a data volume or do boot form volume	12:46
greatgatsby	sean-k-mooney: thanks for the reply. Boot from volume, where the volume is created from an image. I seem to be able to do it as a one-liner with `nova boot` but not `openstack server create`	12:50
frickler	the "one of ... is required" is a bug that should be fixed in recent osc iirc	12:50
sean-k-mooney	frickler: its not a bug	12:50
sean-k-mooney	so if you want to do it in one line	12:50
sean-k-mooney	you need to use --block-device-mapping	12:51
greatgatsby	that says its deprecated in the help	12:51
sean-k-mooney	well you can actully use --block-device i think as well that is the replacement for --block-device-mappings	12:52
sean-k-mooney	but you have to specify more then just the delete on terminate flag	12:52
greatgatsby	that's what I tried, basically the 1:1 --block-device I as using from nova boot, but then I get the error about those other flags being required	12:52
sean-k-mooney	right os i dont think this was intened to be a 1:1 port	12:53
greatgatsby	I'll test with a newer version of the openstack CLI. We're using explicit versions specified by our kolla (yoga) version	12:53
sean-k-mooney	it may have been a osc bug	12:53
greatgatsby	1:1 in that I use the equivalent params	12:53
sean-k-mooney	we would expect osc to map to the api partmerts but im not sure if novaclient was doing some magic internally	12:54
sean-k-mooney	greatgatsby: i think --block-device-mapping will workaround the limiation on older osc version an hopefully new version of osc will also work	12:55
greatgatsby	nice, thanks a lot for the help, I'll test both those scenarios	12:56
sean-k-mooney	frickler: was this what you were thinking of? https://github.com/openstack/python-openstackclient/commit/89e5d67a16bd86af7bbe8480252686e2eef55095	13:01
sean-k-mooney	because that is detecting fi you forget to pass --image with --boot-from-volume	13:01
sean-k-mooney	greatgatsby: this is where your error is comming form	13:04
sean-k-mooney	https://github.com/openstack/python-openstackclient/blob/master/openstackclient/compute/v2/server.py#L1714-L1725	13:04
sean-k-mooney	at least i think so	13:04
sean-k-mooney	oh no its https://github.com/openstack/python-openstackclient/blob/master/openstackclient/compute/v2/server.py#L1762-L1769	13:05
sean-k-mooney	greatgatsby: so i think what is missing is you likely dont have boot_index=0 set in the the --block-device argumetns?	13:06
greatgatsby	if I do `openstack server create --block-device source_type=image,uuid=<uuid>,destination_type=volume,volume_size=10,delete_on_termination=true I still get an error about one of those other flags being required, OCS 5.8.0	13:06
greatgatsby	I tried with boot_index as well	13:07
sean-k-mooney	can you past the output of the command somewhere so we can review	13:08
sean-k-mooney	since that does not have boot_index that should raise but i want to see if i can find the exact error message in code	13:09
greatgatsby	I'm on two different networks (openstack for work, home network for IRC) but I'll pastebin what I'm entering... couple seconds to type it out. Really appreciate the help.	13:09
greatgatsby	if you just want the output, it's literally just the openstack server create help followed by:	13:09
greatgatsby	openstack server create: error: one of the arguments --image --image-property --volume --snapshot is required	13:10
greatgatsby	my command is:	13:10
sean-k-mooney	ok let me see if i can find that as the code i linked to has a slight diffent message so its failign somewhere else	13:10
greatgatsby	openstack --os-cloud foo server create --flavor medium --block-device source_type=image,uuid=<image-uuid>,destination_type=volume,volume_size=10,delete_on_termination=true,boot_index=0 test-vm	13:11
sean-k-mooney	ya that looks valid to me	13:12
greatgatsby	OCS 5.8.0	13:12
sean-k-mooney	https://github.com/openstack/python-openstackclient/commit/91277e7e51849d197554b633a579c92116a5afc4	13:14
sean-k-mooney	greatgatsby: so ^ is the fix frickler was refering too and its in 6.1.0+	13:15
greatgatsby	nice, thanks a lot. Hopefully we'll be upgrading off yoga soon and will also upgrade our recommended OCS version.	13:15
sean-k-mooney	well in general i recommend pepoel to install osc in a python venv form pip	13:16
greatgatsby	we try to go by the version matrix for all the python utils and dependencies	13:16
sean-k-mooney	so that that they can use the latest indepent of the cloud	13:16
greatgatsby	that's exactly what we do	13:16
greatgatsby	we still use the versions according to the docs... maybe we should reconsider that?	13:16
sean-k-mooney	ya so in general osc shoudl work with any release	13:17
sean-k-mooney	so i generally suggest using the latest version	13:17
sean-k-mooney	the cli should be stable and generally only have additive changes	13:17
sean-k-mooney	but if you have a lot of internal docs using a fixed verion might make sense	13:18
greatgatsby	https://releases.openstack.org/yoga/index.html#service-client-projects	13:18
sean-k-mooney	that just documenting the version that was released in yoga	13:19
sean-k-mooney	its not a recomendation to use that version to interact with a yoga cloud	13:19
greatgatsby	oh, that's great to know, thanks, I'll pass that on!	13:19
greatgatsby	again, really appreciate getting help in here, can be tough banging ones head against the desk for too long	13:20
sean-k-mooney	the reason tha tis good to refence is for co installablity	13:20
sean-k-mooney	if you stay to the version in that list	13:20
sean-k-mooney	then we are commiting to keepign them co installable	13:20
sean-k-mooney	if you deviate form that you might have compitbality issues	13:20
sean-k-mooney	but i f you are using osc on your laptop in its own venv	13:21
sean-k-mooney	you dont need to really consider that	13:21
greatgatsby	sorry what do you mean by co installability? Just want to pass this info on and not get pushback	13:21
sean-k-mooney	so we maintain that list mainly for distro packagers	13:21
sean-k-mooney	the ablity to install all the release artifacts in the same python envionment	13:21
greatgatsby	I see, ok, thanks again	13:22
opendevreview	Merged openstack/nova master: Fix rebuild compute RPC API exception for rolling-upgrades https://review.opendev.org/c/openstack/nova/+/899235	13:22
sean-k-mooney	we test for co-installablity in our ci system and partly enforce that with upper-constraints.txt in the requirements repo for external deps	13:23
opendevreview	sean mooney proposed openstack/nova master: Use centos-9-stream for functional-py39 job https://review.opendev.org/c/openstack/nova/+/899959	13:32
opendevreview	sean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/nova/+/899753	13:32
opendevreview	sean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/nova/+/899753	13:34
dansmith	bauzas: are you around to explain something to me about the RP patch	13:36
dansmith	?	13:36
bauzas	dansmith: yup, I'm here now	13:36
bauzas	thanks for having started to look at it, btw.	13:36
bauzas	(me also needs to look at your spec)	13:37
dansmith	yeah sorry I forgot you were off yesterday until after I pinged	13:37
bauzas	so, shoot your questions	13:37
bauzas	have you seen the bug report btw.?	13:37
bauzas	I tried to summarize the problem in there	13:37
dansmith	my questions are actually about the stuff in the bug :)	13:37
bauzas	https://bugs.launchpad.net/nova/+bug/2041519	13:37
bauzas	okay	13:37
bauzas	and why I did that, I guess ?	13:38
dansmith	so I get the inventory before, then after one allocation, one RP goes to zero	13:38
bauzas	more than one RP but yeah	13:38
dansmith	no, only one according to your report	13:38
bauzas	and this is not exacly always after	13:38
bauzas	the fact is that we automatically create a mdev if none of them are not used yet	13:39
dansmith	but what i don't get is why after the second allocation they all go to zero.. that would imply we're using inventory wrong	13:39
bauzas	okay, gonna clarify	13:39
bauzas	we are not counting mdevs for inventories	13:39
bauzas	what we count as VGPU resources are either unused mdevs or* available_instances*	13:40
dansmith	oh, hang on	13:40
bauzas	which means something you can allocate to a guest	13:40
dansmith	re-reading this with fresh eyes, I think I was confusing the very similar sysfs and placement dumps	13:41
dansmith	so it's the mdev availability that goes off a cliff	13:41
bauzas	https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8090-L8133	13:42
bauzas	that's where we calculate inventories	13:42
dansmith	yep	13:42
bauzas	available_instances is a value that the kernel driver gives	13:43
bauzas	but,	13:43
dansmith	right	13:43
bauzas	once you create a mdev, the value decreases by 1	13:43
bauzas	that doesn't mean that the mdev will be used by a guest	13:43
sean-k-mooney	bauzas: yep shich is why you are ment to be internally tracking which devices the mdev are created form	13:43
bauzas	if you basically precreate all mdev resources, then available_instances is 0 for all the gpus	13:43
sean-k-mooney	and using that to compute the inventoy total as (aviablie_instnace + mdevs form that device)	13:44
dansmith	does nvidia-472 imply that more than 1/total share of the card is used?	13:44
bauzas	but the inventory still tracks that you can bind N vGPUs	13:44
bauzas	dansmith: that's the crux of the problem	13:44
bauzas	the number of vGPUs can differ per type	13:45
sean-k-mooney	bauzas: that problem is sovled by enforcing that the device adress	13:45
dansmith	that feels like we're using placement incorrectly to have inventory of unequal resources	13:45
bauzas	nvidia-472 can create up to two vGPUs	13:45
bauzas	so there are three possibilities	13:45
bauzas	either available_instances = 2	13:45
bauzas	or, available_instances = 1 and one mdev exists that'not used by a guest	13:45
bauzas	or, there are two mdevs available, which are both unused	13:46
bauzas	with SR-IOV, things change	13:46
sean-k-mooney	bauzas: the decreasing inventory behavoir is a nvidia specirci issue with vfs form teh same card	13:46
bauzas	now, with SR-IOV, instead of having one single pci device that have available_instances = 2 (and then only one RP),	13:46
bauzas	then we have two VFs, each of them having available_instances = 1	13:47
bauzas	but,n	13:47
bauzas	since this is type-dependent, nvidia doesn't create only 2 VFs but 16	13:47
bauzas	(for that single GPU)	13:47
sean-k-mooney	yes	13:47
bauzas	(if you have A2, IIRC this will create 32 virtual functions)	13:48
dansmith	can someone answer the question: that's a function of nvidia-472 right?	13:48
bauzas	so, say you define nvidia-472, this means that only 2 virtual functions have to be used	13:48
sean-k-mooney	dansmith: the VF can support many mdev types	13:48
dansmith	and if they are slicing differently with (nvidia-xxx right)? then the math would be different/	13:48
bauzas	if you use other type (say nvidia-473), then you could use four virtual functions	13:48
bauzas	dansmith: correct	13:49
dansmith	yeah	13:49
sean-k-mooney	dansmith: yep the math change depenign on which mdev type you create	13:49
sean-k-mooney	but the number of VF is fixed	13:49
bauzas	the fact is	13:49
sean-k-mooney	so not all of them can actully be allcoated	13:49
bauzas	given the type,	13:49
sean-k-mooney	for all types	13:49
bauzas	once you create mdevs for all the VFs corresponding to the type,	13:49
dansmith	right, and could it be some combination of two types? or only one type of slicing per card?	13:49
bauzas	then other VFs start magically having available_instances=0	13:49
sean-k-mooney	dansmith: that depend on the card	13:50
bauzas	you can't mix types per GPU, nvidia doesn't support it	13:50
bauzas	well,	13:50
sean-k-mooney	actully it does	13:50
bauzas	for time-sliced GPUs	13:50
bauzas	for MIG GPUs, you can, but that's a waaaaay different matter	13:50
sean-k-mooney	but only in s specific mode for sepcific type on specific cards	13:50
sean-k-mooney	yep	13:50
bauzas	let's not overcomplicate this	13:50
sean-k-mooney	lol	13:50
sean-k-mooney	do you have a time machine	13:50
dansmith	well, my point is,	13:50
dansmith	it seems really uncool to expose all these single-inventory RPs in general	13:51
dansmith	like it feels as though we modeled it wrong	13:51
sean-k-mooney	ya so we shoudl be grouping per PF	13:51
sean-k-mooney	as each phsical devce shoudl be its own RP	13:52
sean-k-mooney	but since we never actully added offical supprot for sriov mode	13:52
sean-k-mooney	what we did instead was tell people	13:52
bauzas	dansmith: the problem is that we don't know preemptively which type consumes how many VFs	13:52
sean-k-mooney	jsut use the pci adress of the vf instead of hte pf	13:52
bauzas	so the simpliest solution is to tell people to statically allocate mdevs	13:52
sean-k-mooney	bauzas: thats not really the probelm but it is one of the issues	13:53
bauzas	sean-k-mooney: that IS the problem in our case	13:53
dansmith	bauzas: yeah, I think there are ways we could handle that, but yeah it's not ideal for sure	13:53
dansmith	unless we make people annotate the config in that way	13:53
sean-k-mooney	well we agreed in the ptg to make them do that ^	13:53
bauzas	if I was able to get a clue of how many VFs are available for a type, I could prevent nova to create RPs for the 16 of them	13:53
sean-k-mooney	we agreed to make device_adress required	13:54
bauzas	sean-k-mooney: no we said at the PTG that allocating a mdev is cheap	13:54
bauzas	so we could continue to do this	13:54
sean-k-mooney	bauzas: yes we can that not what i said above	13:54
sean-k-mooney	we agreed that going forward then need to specificy which devices we create inventoies form	13:54
sean-k-mooney	by requiring the device_adress	13:54
bauzas	sean-k-mooney: sure	13:54
sean-k-mooney	nova can continue to od the mdev creation	13:54
bauzas	but this doesn't prevent operator fat fingers, right?	13:54
dansmith	sean-k-mooney: agreed to make who do what?	13:55
sean-k-mooney	the key is just not list more vf then can be created	13:55
sean-k-mooney	dansmith: https://docs.openstack.org/nova/latest/configuration/config.html#devices	13:55
bauzas	sean-k-mooney: also, with SR-IOV, this will be hard for operators to know which VFs to allocate btw.	13:55
bauzas	the fucking nvidia docs never tell about nvidia-XXX	13:56
sean-k-mooney	for legacy reasons (only for the first mdev type) specifiv device_address is optional	13:56
dansmith	sean-k-mooney: meaning in [mdev_nvidia-35]/this_takes_n_vfs=10 right?	13:56
dansmith	that sounds good to me	13:56
bauzas	the only way for an operator to know what a nvidia type is doing is by getting the /name value and match it with docs	13:56
dansmith	default to 1 and if they don't fill it in, then the math will be wrong	13:56
sean-k-mooney	dansmith: not quite	13:56
bauzas	dansmith: no, sean-k-mooney is mentioning to explicitely define in nova.conf all the pci addresses of the VFs	13:57
sean-k-mooney	dansmith: you would look at the mdev type and list n vfs	13:57
bauzas	no new option	13:57
dansmith	sean-k-mooney: isn't that what I just said?	13:57
sean-k-mooney	we could do it diffently however	13:57
bauzas	[devices] enabled_mdev_types = nvidia-35, nvidia-36 [mdev_nvidia-35] device_addresses = 0000:84:00.0,0000:85:00.0 [vgpu_nvidia-36] device_addresses = 0000:86:00.0	13:57
bauzas	you would specify then on nova.conf all the VFs	13:57
bauzas	at least the ones you gonna allocate	13:58
sean-k-mooney	dansmith: so we could do something better	13:58
bauzas	sean-k-mooney: so sean, you gonna prefer to -1 my fix and say "please rather statically allocate all VFs" ?	13:58
bauzas	https://review.opendev.org/c/openstack/nova/+/899625	13:59
dansmith	-1 to static, I can't imagine that's really going to make people happy	13:59
bauzas	not sure our operators will appreciate	13:59
bauzas	yeah	13:59
sean-k-mooney	dansmith: what we could do is per mdev type declare how many that can have	13:59
sean-k-mooney	and then specify that device_address shoudl be the PF	13:59
bauzas	that's a risky math	14:00
sean-k-mooney	so then we would have one device address (and placment inventoy) per phsyical gpu	14:00
dansmith	I still don't understand how that's not what I said above and why that's not a good solution	14:00
bauzas	because nothing in the sysfs structure is telling us which parent PF is the VF	14:00
sean-k-mooney	and we would just not use aviable_isntance at all for the placment total	14:00
sean-k-mooney	bauzas: it really should	14:00
dansmith	suuuurely it does	14:00
bauzas	sean-k-mooney: if you don't use available_instances, then you're asking people to precreate the mdevs	14:00
sean-k-mooney	that might be missing in libvirt but it will be in sysfs	14:00
sean-k-mooney	bauzas: no im not	14:01
sean-k-mooney	im asking them to read the nvidia docs	14:01
dansmith	I have to do another meeting for a bit	14:01
sean-k-mooney	and tell us how many it can allcoate	14:01
dansmith	surely we could also have a generic listing for the common types	14:01
bauzas	sean-k-mooney: I don't get then what you said	14:01
bauzas	(15:00:31) sean-k-mooney: and we would just not use aviable_isntance at all for the placment total	14:01
bauzas	dansmith: that's like boiling the ocean	14:02
bauzas	types are hardware-dependent	14:02
sean-k-mooney	bauzas: im saying for the placment inventory we read total form the config option not form aviable_instnace	14:02
dansmith	nvidia-472 differs in size on different cards?	14:02
bauzas	https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu	14:02
sean-k-mooney	dansmith: nvidia are creating a mdev type per card	14:02
bauzas	dansmith: count every line in each tab, you'll get the number of mdev types	14:02
dansmith	ack I understand	14:03
bauzas	dansmith: no, the other way, nvidia-472 only stands for A100-20C which is only a type supported by A100 cards	14:03
dansmith	yup, I got it	14:03
sean-k-mooney	bauzas: lets park this for a bit and create an etherpad or something to collaberate on	14:04
bauzas	formerly, before the magic world of SRIOV GPUs, available_instances was giving us the number of VGPUs a PCI device was able to create	14:04
sean-k-mooney	its simpler to discuss when we have some of the constriat written down liek a spec	14:04
bauzas	sean-k-mooney: there is already the bug report	14:05
sean-k-mooney	i would like somethign we can edit	14:05
bauzas	I'm not a big fan of writing a spec for fixing this	14:05
sean-k-mooney	bauzas: i think you have too	14:05
sean-k-mooney	depending on the fix	14:05
sean-k-mooney	there are ways to do the minimal fix as a bug	14:06
dansmith	I think it's clear we need to document this more than just some sysfs dumps in a bug :)	14:06
bauzas	sean-k-mooney: have you seen my patch already ?	14:06
sean-k-mooney	but if we want to fix this properly we need to change how we are doing mdev managment	14:06
bauzas	as a reminder, mdev inventory isn't a problem with non-SRIOV GPUs	14:06
bauzas	so whatever we decide will necessarly be opt-in	14:06
sean-k-mooney	its not a problem for all sriov devices	14:06
bauzas	and keeping the legacy behaviour too	14:07
sean-k-mooney	its litrally aproblem just for how nvidia did it	14:07
bauzas	"its litrally aproblem just for how nvidia did it" I should quote it, print it and put it on my front wall	14:07
bauzas	the way that available_instances isn't a static number is also a terrible driver implementation, if you want MHO	14:08
sean-k-mooney	before i left intel, i used a protoytp intel QAT impelemtnation wtch had sriov VF for each enging type and they supprot mdevs per VF with indepenent aviableity	14:08
dansmith	is it really a problem though? I mean it kinda seems like this is the point of mdevs, no? to be allocatable in more dynamic ways than just static slices of a card?	14:08
bauzas	true	14:09
sean-k-mooney	dansmith: mevs from diffent pci endpoitn are intented to not affect each other	14:09
sean-k-mooney	that is what nvida started breaking	14:09
sean-k-mooney	the framework allows them to be dynmaic	14:09
dansmith	there are multiple pci devices per card? or you mean the vfs?	14:09
bauzas	oh wait	14:10
bauzas	[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/0000\:41\:01.0/mdev_supported_types/nvidia-472/description	14:10
bauzas	num_heads=1, frl_config=60, framebuffer=20480M, max_resolution=4096x2400, max_instance=2	14:10
bauzas	we can get the number of VGPUs in the description	14:10
sean-k-mooney	bauzas: the desciptoin is not a standard format	14:10
bauzas	this is horribly driver-specific	14:10
bauzas	yeah	14:11
sean-k-mooney	yes it is its litrally just a text filed	14:11
dansmith	but it's something	14:11
sean-k-mooney	its not somehting nova should read	14:11
dansmith	it's something an operator could read in order to put the math in the config :)	14:12
sean-k-mooney	grep . /sys/class/mdev_bus/0000\:41\:01.0/mdev_supported_types/nvidia-472/*	14:12
bauzas	operators aldray read :	14:12
bauzas	[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/0000\:41\:01.0/mdev_supported_types/nvidia-472/name	14:12
bauzas	GRID A100-20C	14:12
sean-k-mooney	bauzas: can you run ^	14:12
dansmith	I also wouldn't say nova shouldn't read it.. I think we could read and parse it as long as we don't fail if it has changed	14:12
bauzas	and then Ctrl-F that string into https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu	14:12
sean-k-mooney	dansmith: yes the operator can read it	14:13
sean-k-mooney	dansmith: so would you be ok with adding max_instances to the mdev_type config section	14:13
sean-k-mooney	or is that still too static	14:14
bauzas	sean-k-mooney: if we go this way, this wouldn't be backportable, right?	14:14
dansmith	well, I'm not sure that's a solution entirely, but I think making the operator tell us some of the math so we can do it better on the backend makes sense	14:14
bauzas	I really tried to propose https://review.opendev.org/c/openstack/nova/+/899625 as a backportable mechanism	14:14
dansmith	I'd need to hear the whole story before "being okay" with it, but that seems like a good direction to me	14:14
bauzas	dansmith: finish your call and then we could do a gmeet	14:15
dansmith	sure	14:15
sean-k-mooney	bauzas: i dont really like that because it implies your still doing the math incorrect	14:15
bauzas	I have the PTG summary to write	14:15
dansmith	I don't love the "quick, delete the provider" approach either, FWIW	14:15
bauzas	sean-k-mooney: it implies people were lazy enough to not explicitely tell which PCI devices to allocate	14:15
bauzas	dansmith: that's understandable, but then you're pro-statically defining a list of VFs or an integer counting allocatable VFs	14:16
sean-k-mooney	bauzas: and we are not internally trackign which device the mdev is allcoated form meaning we are not mapping the allcoations correctly in placment	14:16
bauzas	(just clarifying the options)	14:16
sean-k-mooney	removing the rp works if and only if we gurentte that the mdevs are created by packing the RPs	14:17
bauzas	sean-k-mooney: if you want to have my frank opinion, I'm not a big fan of changing all the allocations we make	14:17
dansmith	bauzas: I'm for doing better math, and it sounds like some hints for that math (in config) are necessary	14:17
bauzas	sean-k-mooney: I very humbly think you're wrong on the fact we miscount	14:17
bauzas	we very correcly count which mdevs against RPs	14:18
dansmith	still would need to understand the whole picture and how we migrate to it, impact to existing people, and whether or not it's worth it if mdevs are going away	14:18
bauzas	the latter conditional is hard to guess	14:18
bauzas	but that's fair assumptions	14:19
bauzas	and I'm not particularly attached to the bugfix itself	14:19
bauzas	but again, I have my wills : a simple approach not redesigning our placement tracking which would be backportable	14:20
sean-k-mooney	bauzas: in sriov mode is aviable_instances alway 1	14:20
sean-k-mooney	becasue for intels implemation (which they never released) it was not	14:20
bauzas	see the bug report, yah until all mdevs are created	14:20
bauzas	I literally wrote that in the bug description	14:20
sean-k-mooney	what i dont like about that is that is a vendor specific detail	14:21
bauzas	hell yeah, hence why I wanted to make the fix agnostic	14:21
bauzas	we're just doing some math in some method that say 'oh look, that inventory is now 0, but meh, I don't care updating it'	14:22
sean-k-mooney	well your solution does not work if the VF have aviable_instance>1 initally	14:22
bauzas	this is honestly to me a wrong approach, and at least a LOG warning should tell the operator something weird happened	14:22
bauzas	beg your pardon ?	14:22
sean-k-mooney	simple case	14:23
sean-k-mooney	we have one mdev capable device	14:23
sean-k-mooney	it has an mdev type that allow 2 allcations total	14:23
sean-k-mooney	it creats n VF	14:23
sean-k-mooney	each vf reporst aviabel_invetnory=2	14:23
bauzas	(assuming then you're not talking of the nvidia driver then)	14:24
sean-k-mooney	i allcoate one mdev from two different vfs	14:24
sean-k-mooney	you can delete the other vf rps	14:24
sean-k-mooney	but we will would be miss reporting the 2 remain rps	14:24
bauzas	are you talking of something inexistent now but as a future possibility ?	14:25
sean-k-mooney	im talkign generically	14:25
bauzas	again, I don't understand your logic, but let's imagine your scenario for a sec	14:26
bauzas	say I have 16 VFs, each of them with available_instances=2	14:26
bauzas	say now I create one mdev twice, each on a distinct VF	14:26
bauzas	for 2 VFs, available_instances would be equal to 1	14:27
bauzas	two mdevs would exist	14:27
bauzas	so the inventory wouldn't change for those	14:27
sean-k-mooney	right but only 2 can be created	14:27
bauzas	we would still count : total=2	14:27
sean-k-mooney	i gues what would happen is that when i allcoate the first mdev	14:27
sean-k-mooney	the aviable_insstance on all vfs would go to 1	14:27
sean-k-mooney	and the second one it would go to 0	14:27
bauzas	okay, then it's the driver's responsibility to change available_instances to 0 for all the VFs	14:28
bauzas	including the two ones we use	14:28
sean-k-mooney	ya	14:28
bauzas	total would be one for 2 RP inventories then, right?	14:28
bauzas	total=1 I mean	14:28
sean-k-mooney	so the only mdev hardware i have used that supprotes sriov vf did not have that probelm	14:28
bauzas	anyway, this isn't possible with nvidia SR-IOV, all VFs have available_instances=1	14:29
bauzas	you lost me, honestly	14:29
bauzas	if you're talking of T4s	14:29
bauzas	what we had was a basically non-SRIOV accountancy	14:30
sean-k-mooney	no im talking about hardware i used at intel and trying to think if it would break	14:30
sean-k-mooney	i think the delete your are proposing would work	14:30
bauzas	tbc, I've experienced a race condition	14:30
sean-k-mooney	but i dont like the idea of us creating and deleting rps in genreal based on workload usage	14:30
bauzas	since we update resource providers on a periodic	14:31
bauzas	I was able to create 3 VMs (and allocate three VGPU allocations)	14:31
bauzas	that's why I added the claim stuff	14:31
bauzas	so, see, again, I'm not particularly attached to do this	14:32
bauzas	if we go with statically defining which VFs are going to be used, fair	14:32
sean-k-mooney	bauzas: we normally run the update avaiabel reosuces call elsewhere	14:32
bauzas	I have checked	14:32
sean-k-mooney	bauzas: we expcitly do it form the comptue manager in move claims	14:32
bauzas	we do run the update before the spawn	14:32
bauzas	not after :)	14:32
dansmith	bauzas: agreed, deleting RPs as a result of allocation feels very wrong	14:33
sean-k-mooney	dansmith: the only mitigation is we dont delete the mdev after the instance is moved/deleted	14:33
sean-k-mooney	so provide the compute node is not restated	14:33
sean-k-mooney	we will do the delete once	14:33
dansmith	could we set reserved=total instead of deleting maybe?	14:33
sean-k-mooney	but the extra RPs will get recated after a host reboot	14:33
bauzas	dansmith: we would need to implement the reverse logic then	14:34
sean-k-mooney	dansmith: reserved=total is what i was orgianlly thinking but the issue is total is going to 0	14:34
sean-k-mooney	and total cant be 0	14:34
sean-k-mooney	so we would need to do resrved=total=1	14:34
bauzas	and when to unreserve ?	14:34
bauzas	say I delete my mdev ?	14:34
bauzas	I'm a stubborn operator that deletes the mdevs, y'know	14:35
sean-k-mooney	or you do a host reboot	14:35
bauzas	honestly, I like the config approach	14:35
sean-k-mooney	so ya the reserved=total approch is more complext then delete	14:35
dansmith	sean-k-mooney: right I meant reserved=old_total	14:35
bauzas	if you specific the max VGPUs per type, then we could do htis	14:36
bauzas	I feel a disturbance in the force when I tell about deleting RPs	14:36
sean-k-mooney	bauzas: its slightly more complex then that	14:36
bauzas	I used the term "strawman", I think it was the right wording	14:37
sean-k-mooney	max mdevs need nova to understand which phsyical deviec the mdevs are form	14:37
bauzas	that is correct	14:37
bauzas	we know the parent device	14:37
bauzas	for a mdev	14:37
bauzas	we already use this information to track	14:38
bauzas	in nova today	14:38
sean-k-mooney	well in the nvidia case the max mdev per type applies to the grandparent device	14:38
bauzas	what we don't track now is to know the parent of a mdev capable device	14:38
bauzas	and we can't rely on libvirt, it gives you the PCI-E root PCI address (I tested :) )	14:38
bauzas	sean-k-mooney: that's what libvirt is giving us https://paste.opendev.org/show/b3MHPT8VoIOBWw1tim2o/	14:40
bauzas	we know the parent addr tho	14:40
bauzas	<capability type='phys_function'> <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>	14:40
sean-k-mooney	it tell you the part device	14:41
bauzas	so you were right, we could use this	14:41
sean-k-mooney	<parent>pci_0000_40_01_1</parent>	14:41
bauzas	sean-k-mooney: no, you're wrong, that's not the PF address :)	14:41
bauzas	[stack@lenovo-sr655-01 ~]$ lspci \| grep 40:01.1	14:42
bauzas	40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge	14:42
sean-k-mooney	ok can you give me access to that system so i can take a look	14:42
bauzas	that's the PCI bridge address :)	14:42
bauzas	but I told you, we have the parent PF address in the device XML	14:42
sean-k-mooney	i orgianlly said we could get the info from sysfs because i knwo how to look it up there	14:42
sean-k-mooney	we can get it form libvirt but i dont rememebr the element fo the top of my head	14:43
sean-k-mooney	if we have it great	14:43
sean-k-mooney	its <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>	14:43
sean-k-mooney	form the capability type='phys_function' YES	14:44
bauzas	yeah, and thru sysfs, we have access to the physical function in /sys/class/mdev_bus/0000\:41\:00.4/physfn/	14:46
dansmith	okay done with my other meeting	14:46
* dansmith reads back		14:46
bauzas	but that's not a trivial change	14:46
bauzas	we need to implement some libvirt config change	14:46
sean-k-mooney	we can and should use the info form libvirt since its there	14:46
sean-k-mooney	bauzas: we already have parsing logic for this in the pci tracker code in the libvirt driver	14:46
bauzas	we already get information of the mdev capable devices thru libvirt	14:46
bauzas	sean-k-mooney: oh you're right	14:47
bauzas	dansmith: take time to digest	14:47
bauzas	sounds less bugfix-y, more feature-y than my patch but I can do it	14:48
bauzas	sean-k-mooney: fwiw, this is where we get information for a mdev cap device https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8284	14:49
sean-k-mooney	bauzas: we should seperate the back protable fix form the final fix	14:49
sean-k-mooney	bauzas: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1321 is the functilalyt that extracts the pci info form te node dev	14:49
bauzas	https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8263 is where we get the capacity	14:50
bauzas	so sounds easier than originally thought	14:50
sean-k-mooney	https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1351-L1363	14:50
sean-k-mooney	that actully where we are parsing it	14:50
dansmith	so max_mdevs_per_device works only because once you allocate one mdev type from a card, that card is locked into that type for the rest of the allocations right?	14:50
bauzas	dansmith: you're correct, there is no heterogenous support	14:50
artom	sean-k-mooney, we never updated the Neutron docs for the socket PCI NUMA affinity policy: https://docs.openstack.org/api-ref/network/v2/index.html#numa-affinity-policy	14:50
sean-k-mooney	dansmith: in the simple case yes. but we also only support one mdev_type per card in nova	14:50
bauzas	yup, there is no behavioural change in support	14:51
sean-k-mooney	artom: we never updated the neutron extention for it	14:51
sean-k-mooney	artom: its not a docs issue i belive we need an api change	14:51
dansmith	sean-k-mooney: wait, is that the case even before we allocate?	14:51
artom	sean-k-mooney, so we can't actually set it on the port at all?	14:51
sean-k-mooney	dansmith: yes which means we break if tye allocate the wrong type manually	14:52
bauzas	are we still talking of GPUs or PCI NUMA affinity now ?	14:52
sean-k-mooney	dansmith: we have had customer do that	14:52
artom	Anyways, I interrupted a convo with my dumb questions, carry on, I'll go look at the code	14:52
sean-k-mooney	bauzas: im talkign to artom about something else	14:52
artom	bauzas, ignore me and PCI policies	14:52
bauzas	artom: no worries, we're reaching to some agreement soon I guess	14:52
bauzas	dansmith: assigning a mdev type to a GPU is statically defined in nova	14:53
sean-k-mooney	artom: https://github.com/openstack/neutron-lib/blob/master/neutron_lib/constants.py#L676-L681	14:53
dansmith	okay I'm confused	14:53
dansmith	I thought ya'll said the opposite earlier	14:53
bauzas	changing the value once you create mdevs is a breaking change that we don't support	14:53
sean-k-mooney	artom: so the api will not accpet it and we are missing the osc support too	14:53
dansmith	right I understand that part	14:54
dansmith	are we going to gmeet?	14:54
bauzas	sure I can fire one	14:54
dansmith	just saying, since artom showed up we	14:54
dansmith	are all muddy here :)	14:54
artom	Yep, I barged in like a rhino on a sugar rush, apologies	14:55
* dansmith blinks		14:55
bauzas	https://meet.google.com/hqy-bgjb-muk for people fancy discussing nvidia nightmare	14:55
sean-k-mooney	dansmith: for what its worth the patch bauzas proposed should work given the hardware that is aviabel to buy today	14:55
sean-k-mooney	its jsut depening on some asusmintions we have in nova and in how the current hardware works	14:55
opendevreview	Will Szumski proposed openstack/nova stable/yoga: Ironic nodes with instance reserved in placement https://review.opendev.org/c/openstack/nova/+/867912	16:49
jovial	Is the nova-emulation job busted on stable/yoga? Was interested in progressing this patch: https://review.opendev.org/c/openstack/nova/+/898554	17:41
bauzas	sean-k-mooney: dansmith: so, if you don't mind, I'm gonna work on the VF filtering case with max_instances config option	17:42
bauzas	and then we could nitpick on what we want	17:42
dansmith	yep	17:44
bauzas	dansmith: fwiw, there is also https://review.opendev.org/c/openstack/nova/+/899406	17:44
bauzas	with that patch, pas-ha[m] is forcing operators to explicitely define all the device_addresses	17:45
bauzas	but I personnally think that adding a max_instances config option is painless for ops	17:45
bauzas	so we could mix both	17:45
bauzas	meaning that we would force ops to either explicitelyt specify the device addresses or give the number of VFs that nova needs to create	17:46
bauzas	this sounds very consistent to me	17:46
dansmith	max_instances seems much more useful to me so people don't have to encode pci device addresses in their config just to avoid the broken behavior	17:50
bauzas	yeah so we will keep the existing behaviour but add another option for automatic discovery	17:51
bauzas	dansmith: I definitely appreciate your help on that one	17:51
bauzas	and I really want to give you review credits back on your own series	17:52
sean-k-mooney	dansmith: i have some concerns with allow config with out pci adresses	17:57
sean-k-mooney	i.e. if you allow some of your gpus to be used for pci passthogu and other for vgpu	17:58
dansmith	right, that's a good reason to do it, but not a good reason to require it, IMHO	17:58
sean-k-mooney	if you dont use device_address in that case we coudl passthough a device (that was allowed in the pci devspec) but not disalloed in the mdev config section	17:58
sean-k-mooney	dansmith: i dont know i always felt that its a psudo security isseu to not explcitly list devices we can expose to a guest	17:59
sean-k-mooney	i get that it sucks to need to be explict	17:59
sean-k-mooney	but that the trade off	17:59
dansmith	but in the mdev case you're not just blindly exposing them,	18:00
sean-k-mooney	to be clear we have this tradeoff in the pci tracker too	18:00
dansmith	you've already opted into "all devices with this mdev type" right?	18:00
sean-k-mooney	yep so its the same as just putting vendor_id and product_id in the pci dev_spec	18:01
sean-k-mooney	aka the pci whitelist	18:01
dansmith	I don't think that's the same really	18:01
opendevreview	Stephen Finucane proposed openstack/nova master: docs: Add documentation on server groups https://review.opendev.org/c/openstack/nova/+/899979	18:01
dansmith	well, if you do both vendor and product I guess it is	18:01
sean-k-mooney	to me there is no diffence betwen mdev type and *vendor_id and product_id)	18:01
dansmith	but not like worrying that 8086 is going to allow your root bridge to go through or something	18:01
dansmith	but I really don't see the problem with that level of specificity	18:02
stephenfin	bauzas: melwitt: https://review.opendev.org/c/openstack/nova/+/899979 came up as a result of an OpenShift question. I figure you'd both be interested in that addition to our docs	18:02
sean-k-mooney	dansmith: i think its more an issue with how the config options are arranged more then anything esle	18:03
stephenfin	I've also proposed helper alias for OSC to make requesting server groups slightly nicer https://review.opendev.org/c/openstack/python-openstackclient/+/899975	18:03
stephenfin	*a helper	18:03
sean-k-mooney	dansmith: we split the mdevs into multipel dynmaic config sections to influcne this	18:03
bauzas	sean-k-mooney: if we can ease your pain, we can document the fact that max_instances automatically passes all GPUs to nova	18:03
sean-k-mooney	which to me make it harder to know if a device will be allowed or not	18:03
bauzas	sean-k-mooney: and we could mention to use device_addresses as an alternative for people afraid of passing the whole GPUs	18:04
sean-k-mooney	bauzas: just make -1 or None the default and use it as a sentanical to mean all.	18:04
bauzas	lemme provide the patch first	18:04
sean-k-mooney	bauzas: well i thing we shoudl be recommending device_address as the way peopel shoudl do it in general	18:05
dansmith	sean-k-mooney: that doesn't work right? that gives us the wrong behavior we have now	18:05
sean-k-mooney	dansmith: it doesn but i think bauzas was going to have it be wrong out of the box too	18:05
sean-k-mooney	unless i missunderstood	18:06
bauzas	tbc I was documenting in the config doc helper that max_instances should be set to the number of mediated devices a device can create	18:06
sean-k-mooney	but where you going to recommend usign device_addresses as what peopel shoudl do by default?	18:07
bauzas	we could	18:07
sean-k-mooney	personally i woudl prefer reuqiring you do do "device_adddress=*" to have use any device with this mdev type behavior by the way	18:08
sean-k-mooney	or saying you must set	18:08
sean-k-mooney	either device_addres or max_instances	18:08
bauzas	yeah those options are conflicting, for sre	18:10
bauzas	I can mark those this way with oslo.config	18:10
sean-k-mooney	but can you also make the mutally exclive group required	18:11
sean-k-mooney	and the dynmic config section requried	18:11
bauzas	we're on same page	18:12
sean-k-mooney	well there is one thing im not sure you and i agree on	18:12
bauzas	[mdev_nvidia-XXX] would become mandatory if we merge https://review.opendev.org/c/openstack/nova/+/899406	18:12
sean-k-mooney	ah	18:12
sean-k-mooney	ya that	18:12
bauzas	what's inside this section is mutually exclusive	18:12
bauzas	either device_addresses or max_instances	18:13
bauzas	wfy ?	18:13
sean-k-mooney	so your ok with requiring that if an mdev type is in enabled_mdev_types	18:13
sean-k-mooney	we must have the config section	18:13
bauzas	that's what we agreed at the PTG	18:13
sean-k-mooney	[mdev_<mdev name>]	18:13
sean-k-mooney	ok then ya with that condition i think im good with your propsoal	18:13
bauzas	in order to ease people's lifes, I'm just gonna add max_instances as a mutually exclusive option	18:13
bauzas	if you're SR-IOV based, we'll pick N VFs for you	18:14
bauzas	if you're not SRIOV, we'll just pick N devices for you	18:14
melwitt	stephenfin: cool, thanks. I'll check it out	18:15
sean-k-mooney	bauzas: we could just make max_instances apply to sriov case	18:15
sean-k-mooney	since the non sriov case just works	18:16
bauzas	with the non-sriov gpus, you need to specify all the cards	18:16
sean-k-mooney	in the device list	18:16
bauzas	giving them a way to say "I don't care"	18:16
bauzas	seems good to me	18:16
sean-k-mooney	ya so just add * to device_addreses	18:17
sean-k-mooney	well im flexible on how we do that	18:17
bauzas	we need implementation patches and then we could do anything we want	18:18
sean-k-mooney	im ok to have a away to say "expose all devices of this type" i just woudl like it to be a little more explict then it is today	18:18
bauzas	btw. I'm on PTO tomorrow	18:18
sean-k-mooney	right os if we want to defer the details of how we do that for now	18:18
sean-k-mooney	that works for me	18:18
sean-k-mooney	ya im on pto tomorrow and monday	18:18
bauzas	cool	18:18
bauzas	I'll try if I can draft something before I leave	18:19
bauzas	now that I have a machine, this is waaaay easier than before	18:19
sean-k-mooney	yep hardware both sucks and make life simpler	18:19
bauzas	sean-k-mooney: I made the wrong assumption, for non-sriov, max_instances is indeed meaningless	18:36
sean-k-mooney	cool	18:36
bauzas	so yeah we maybe need a wilcard	18:38
bauzas	that said, we captured the opposite https://etherpad.opendev.org/p/nova-caracal-ptg#L669	18:38
bauzas	but I recently discovered that every single day is full of adventures and news when it comes to nvidia	18:38
opendevreview	sean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1 https://review.opendev.org/c/openstack/nova/+/899753	19:44
pas-ha[m]	<bauzas> "dansmith: fwiw, there is also..." <- bauzas: I don't think I force anything with this patch, it should be 100% backward compatible - if no explicit devices are defined together with the single enabled mdev type, everything should work as before.	20:23
pas-ha[m]	at least that was my intention, not sure I actually tested for that :-)	20:23
sean-k-mooney	with a single mdev type bu tif you had 2 or more then the config section would be requried for each correct	20:24
* sean-k-mooney has not looked at the patch in a few hours and lost context		20:25
sean-k-mooney	hum so https://review.opendev.org/c/openstack/nova/+/899406/1/nova/virt/libvirt/driver.py#7933	20:26
sean-k-mooney	is just always regestring the config options	20:26
sean-k-mooney	but not nessiarly enforicing it	20:26
sean-k-mooney	because you still defalt it here https://review.opendev.org/c/openstack/nova/+/899406/1/nova/virt/libvirt/driver.py#7958	20:26
sean-k-mooney	pas-ha[m]: so ya i think your version is tecnically still backward compatible	20:27
sean-k-mooney	pas-ha[m]: the fact your chagne has not unit test chagne means either its backward compatiable or we have no test coverage for this	20:28
pas-ha[m]	sean-k-mooney: I bet the latter :-D	20:28
sean-k-mooney	unfortunetly you might be right	20:29

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!