Thursday, 2023-11-02

opendevreviewSahid Orentino Ferdjaoui proposed openstack/nova master: scheduler: AggregateMultitenancyIsolation to support unlimited tenant  https://review.opendev.org/c/openstack/nova/+/89651208:41
opendevreviewPavlo Shchelokovskyy proposed openstack/nova master: Add debug logging to can_fit_pagesize  https://review.opendev.org/c/openstack/nova/+/89994210:11
opendevreviewAlexey Stupnikov proposed openstack/nova stable/2023.1: Translate VF network capabilities to port binding  https://review.opendev.org/c/openstack/nova/+/89894510:20
opendevreviewsean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/nova/+/89975311:10
opendevreviewsean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/nova/+/89975311:14
opendevreviewsean mooney proposed openstack/os-vif master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/os-vif/+/89994611:21
opendevreviewsean mooney proposed openstack/placement master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/placement/+/89994711:23
opendevreviewsean mooney proposed openstack/os-traits master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/os-traits/+/89994811:24
opendevreviewsean mooney proposed openstack/os-resource-classes master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/os-resource-classes/+/89994911:25
opendevreviewsean mooney proposed openstack/python-novaclient master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/python-novaclient/+/89995011:27
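(For context, the pyproject.toml these reviews add is essentially just a PEP 517 build-system declaration; a minimal sketch, assuming the plain setuptools backend — the actual patches may pin different versions or use pbr's own backend:

    [build-system]
    requires = ["pbr>=5.8.0", "setuptools>=64.0.0", "wheel"]
    build-backend = "setuptools.build_meta"
)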
*** ozzzo1 is now known as ozzzo11:50
*** eandersson0 is now known as eandersson11:50
opendevreviewsean mooney proposed openstack/osc-placement master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/osc-placement/+/89995511:54
greatgatsbyGood day.  Having trouble converting a `nova boot` command to the `openstack server create` equivalent.  With `nova boot` I can have a one-liner that creates and boots from a volume that also has the delete on instance termination set by using --block-device and shutdown=remove.  However, with `openstack server create` the only thing I can get to work is first doing a --boot-from-volume and then using `openstack volume update 12:40
greatgatsby--delete-on-termination`.  12:40
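(For reference, the nova boot one-liner being described is roughly of this shape — a sketch from memory of the novaclient --block-device syntax; flavor, image UUID and size are placeholders:

    nova boot --flavor m1.medium \
      --block-device source=image,id=<image-uuid>,dest=volume,size=10,shutdown=remove,bootindex=0 \
      test-vm
)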
opendevreviewsean mooney proposed openstack/nova master: Use centos-9-stream for functional-py39 job  https://review.opendev.org/c/openstack/nova/+/89995912:44
greatgatsbyIf I try to use a similar --block-device arg for `openstack server create` I get an error saying that one of --image --image-property --volume --snapshot is required12:44
sean-k-mooneyare you trying to add a data volume or to boot from volume?12:46
greatgatsbysean-k-mooney: thanks for the reply.  Boot from volume, where the volume is created from an image.  I seem to be able to do it as a one-liner with `nova boot` but not `openstack server create`12:50
fricklerthe "one of ... is required" is a bug that should be fixed in recent osc iirc12:50
sean-k-mooneyfrickler: its not a bug12:50
sean-k-mooneyso if you want to do it in one line12:50
sean-k-mooneyyou need to use --block-device-mapping12:51
greatgatsbythat says its deprecated in the help12:51
sean-k-mooneywell you can actually use --block-device i think as well, that is the replacement for --block-device-mapping12:52
sean-k-mooneybut you have to specify more than just the delete on terminate flag12:52
greatgatsbythat's what I tried, basically the 1:1 --block-device I was using from nova boot, but then I get the error about those other flags being required12:52
sean-k-mooneyright so i dont think this was intended to be a 1:1 port12:53
greatgatsbyI'll test with a newer version of the openstack CLI.  We're using explicit versions specified by our kolla (yoga) version12:53
sean-k-mooneyit may have been a osc bug12:53
greatgatsby1:1 in that I use the equivalent params12:53
sean-k-mooneywe would expect osc to map to the api parameters but im not sure if novaclient was doing some magic internally12:54
sean-k-mooneygreatgatsby: i think --block-device-mapping will work around the limitation on older osc versions and hopefully newer versions of osc will also work 12:55
greatgatsbynice, thanks a lot for the help, I'll test both those scenarios12:56
sean-k-mooneyfrickler: was this what you were thinking of? https://github.com/openstack/python-openstackclient/commit/89e5d67a16bd86af7bbe8480252686e2eef5509513:01
sean-k-mooneybecause that is detecting if you forget to pass --image with --boot-from-volume13:01
sean-k-mooneygreatgatsby: this is where your error is coming from 13:04
sean-k-mooneyhttps://github.com/openstack/python-openstackclient/blob/master/openstackclient/compute/v2/server.py#L1714-L172513:04
sean-k-mooneyat least i think so13:04
sean-k-mooneyoh no its https://github.com/openstack/python-openstackclient/blob/master/openstackclient/compute/v2/server.py#L1762-L176913:05
sean-k-mooneygreatgatsby: so i think what is missing is you likely dont have boot_index=0 set in the --block-device arguments?13:06
greatgatsbyif I do `openstack server create --block-device source_type=image,uuid=<uuid>,destination_type=volume,volume_size=10,delete_on_termination=true` I still get an error about one of those other flags being required, OSC 5.8.013:06
greatgatsbyI tried with boot_index as well13:07
sean-k-mooneycan you past the output of the command somewhere so we can review13:08
sean-k-mooneysince that does not have boot_index that should raise but i want to see if i can find the exact error message in code13:09
greatgatsbyI'm on two different networks (openstack for work, home network for IRC) but I'll pastebin what I'm entering... couple seconds to type it out.  Really appreciate the help.13:09
greatgatsbyif you just want the output, it's literally just the openstack server create help followed by:13:09
greatgatsbyopenstack server create: error: one of the arguments --image --image-property --volume --snapshot is required13:10
greatgatsbymy command is:13:10
sean-k-mooneyok let me see if i can find that as the code i linked to has a slightly different message so its failing somewhere else13:10
greatgatsbyopenstack --os-cloud foo server create --flavor medium --block-device source_type=image,uuid=<image-uuid>,destination_type=volume,volume_size=10,delete_on_termination=true,boot_index=0 test-vm13:11
sean-k-mooneyya that looks valid to me13:12
greatgatsbyOSC 5.8.013:12
sean-k-mooneyhttps://github.com/openstack/python-openstackclient/commit/91277e7e51849d197554b633a579c92116a5afc413:14
sean-k-mooneygreatgatsby: so ^ is the fix frickler was referring to and its in 6.1.0+13:15
greatgatsbynice, thanks a lot.  Hopefully we'll be upgrading off yoga soon and will also upgrade our recommended OSC version.13:15
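(With that fix, i.e. OSC 6.1.0 or newer, the one-liner pasted above should be accepted as-is, something like:

    openstack --os-cloud foo server create --flavor medium \
      --block-device source_type=image,uuid=<image-uuid>,destination_type=volume,volume_size=10,delete_on_termination=true,boot_index=0 \
      test-vm
)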
sean-k-mooneywell in general i recommend people install osc in a python venv from pip13:16
greatgatsbywe try to go by the version matrix for all the python utils and dependencies13:16
sean-k-mooneyso that they can use the latest independent of the cloud13:16
greatgatsbythat's exactly what we do13:16
greatgatsbywe still use the versions according to the docs... maybe we should reconsider that?13:16
sean-k-mooneyya so in general osc should work with any release13:17
sean-k-mooneyso i generally suggest using the latest version13:17
sean-k-mooneythe cli should be stable and generally only have additive changes13:17
sean-k-mooneybut if you have a lot of internal docs using a fixed version might make sense13:18
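(A minimal sketch of the venv approach being recommended; the paths are arbitrary:

    python3 -m venv ~/.venvs/osc
    ~/.venvs/osc/bin/pip install --upgrade python-openstackclient
    ~/.venvs/osc/bin/openstack --version
)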
greatgatsbyhttps://releases.openstack.org/yoga/index.html#service-client-projects13:18
sean-k-mooneythat's just documenting the versions that were released in yoga13:19
sean-k-mooneyits not a recommendation to use those versions to interact with a yoga cloud13:19
greatgatsbyoh, that's great to know, thanks, I'll pass that on!13:19
greatgatsbyagain, really appreciate getting help in here, can be tough banging ones head against the desk for too long13:20
sean-k-mooneythe reason that list is good to reference is for co-installability13:20
sean-k-mooneyif you stick to the versions in that list13:20
sean-k-mooneythen we are committing to keeping them co-installable13:20
sean-k-mooneyif you deviate from that you might have compatibility issues13:20
sean-k-mooneybut if you are using osc on your laptop in its own venv13:21
sean-k-mooneyyou dont need to really consider that13:21
greatgatsbysorry what do you mean by co installability?  Just want to pass this info on and not get pushback13:21
sean-k-mooneyso we maintain that list mainly for distro packagers13:21
sean-k-mooneythe ability to install all the release artifacts in the same python environment13:21
greatgatsbyI see, ok, thanks again13:22
opendevreviewMerged openstack/nova master: Fix rebuild compute RPC API exception for rolling-upgrades  https://review.opendev.org/c/openstack/nova/+/89923513:22
sean-k-mooneywe test for co-installability in our ci system and partly enforce that with upper-constraints.txt in the requirements repo for external deps13:23
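(If you do want to stay on the co-installable set for a given release, pip can apply that release's published upper constraints, e.g. for yoga, assuming the usual constraints URL layout:

    pip install -c https://releases.openstack.org/constraints/upper/yoga python-openstackclient
)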
opendevreviewsean mooney proposed openstack/nova master: Use centos-9-stream for functional-py39 job  https://review.opendev.org/c/openstack/nova/+/89995913:32
opendevreviewsean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/nova/+/89975313:32
opendevreviewsean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/nova/+/89975313:34
dansmithbauzas: are you around to explain something to me about the RP patch13:36
dansmith?13:36
bauzasdansmith: yup, I'm here now13:36
bauzasthanks for having started to look at it, btw.13:36
bauzas(me also needs to look at your spec)13:37
dansmithyeah sorry I forgot you were off yesterday until after I pinged13:37
bauzasso, shoot your questions13:37
bauzashave you seen the bug report btw.?13:37
bauzasI tried to summarize the problem in there13:37
dansmithmy questions are actually *about* the stuff in the bug :)13:37
bauzashttps://bugs.launchpad.net/nova/+bug/204151913:37
bauzasokay13:37
bauzasand why I did that, I guess ?13:38
dansmithso I get the inventory before, then after one allocation, one RP goes to zero13:38
bauzasmore than one RP but yeah13:38
dansmithno, only one according to your report13:38
bauzasand this is not exactly *always* after13:38
bauzasthe fact is that we automatically create an mdev if there isn't already an unused one13:39
dansmithbut what i don't get is why after the second allocation they *all* go to zero.. that would imply we're using inventory wrong13:39
bauzasokay, gonna clarify13:39
bauzaswe are not counting mdevs for inventories13:39
bauzaswhat we count as VGPU resources are *either* unused mdevs *or* available_instances13:40
dansmithoh, hang on13:40
bauzaswhich means something you can allocate to a guest13:40
dansmithre-reading this with fresh eyes, I think I was confusing the very similar sysfs and placement dumps13:41
dansmithso it's the mdev availability that goes off a cliff13:41
bauzashttps://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8090-L813313:42
bauzasthat's where we calculate inventories13:42
dansmithyep13:42
bauzasavailable_instances is a value that the kernel driver gives13:43
bauzasbut,13:43
dansmithright13:43
bauzasonce you create a mdev, the value decreases by 113:43
bauzasthat doesn't mean that the mdev will be used by a guest13:43
sean-k-mooneybauzas: yep which is why you are meant to be internally tracking which devices the mdevs are created from13:43
bauzasif you basically precreate all mdev resources, then available_instances is 0 for all the gpus 13:43
sean-k-mooneyand using that to compute the inventory total as (available_instances + mdevs from that device)13:44
dansmithdoes nvidia-472 imply that more than 1/total share of the card is used?13:44
bauzasbut the inventory still tracks that you can bind N vGPUs13:44
bauzasdansmith: that's the crux of the problem13:44
bauzasthe number of vGPUs can differ per type13:45
sean-k-mooneybauzas: that problem is solved by enforcing that the device address13:45
dansmiththat feels like we're using placement incorrectly to have inventory of unequal resources13:45
bauzasnvidia-472 can create up to two vGPUs13:45
bauzasso there are three possibilities13:45
bauzaseither available_instances = 2 13:45
bauzasor, available_instances = 1 and one mdev exists that's not used by a guest13:45
bauzasor, there are two mdevs available, which are both unused13:46
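(Those states are visible directly in sysfs; a sketch using the device address from the example host later in this log:

    # remaining capacity the driver reports for this type
    cat /sys/class/mdev_bus/0000:41:01.0/mdev_supported_types/nvidia-472/available_instances
    # mdevs already created from this type, whether a guest uses them or not
    ls /sys/class/mdev_bus/0000:41:01.0/mdev_supported_types/nvidia-472/devices/
)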
bauzaswith SR-IOV, things change13:46
sean-k-mooneybauzas: the decreasing inventory behaviour is an nvidia-specific issue with vfs from the same card13:46
bauzasnow, with SR-IOV, instead of having one single pci device that has available_instances = 2 (and then only one RP),13:46
bauzasthen we have *two* VFs, each of them having available_instances = 113:47
bauzasbut,13:47
bauzassince this is type-dependent, nvidia doesn't create only 2 VFs but 1613:47
bauzas(for that single GPU)13:47
sean-k-mooneyyes13:47
bauzas(if you have A2, IIRC this will create 32 virtual functions)13:48
dansmithcan someone answer the question: that's a function of nvidia-472 right?13:48
bauzasso, say you define nvidia-472, this means that only 2 virtual functions have to be used13:48
sean-k-mooneydansmith: the VF can support many mdev types13:48
dansmithand if they are slicing differently (with nvidia-xxx right)? then the math would be different?13:48
bauzasif you use other type (say nvidia-473), then you could use *four* virtual functions13:48
bauzasdansmith: correct13:49
dansmithyeah13:49
sean-k-mooneydansmith: yep the math changes depending on which mdev type you create13:49
sean-k-mooneybut the number of VF is fixed13:49
bauzasthe fact is13:49
sean-k-mooneyso not all of them can actually be allocated13:49
bauzasgiven the type,13:49
sean-k-mooneyfor all types13:49
bauzasonce you create mdevs for all the VFs corresponding to the type,13:49
dansmithright, and could it be some combination of two types? or only one type of slicing per card?13:49
bauzasthen other VFs start magically having available_instances=013:49
sean-k-mooneydansmith: that depend on the card13:50
bauzasyou can't mix types per GPU, nvidia doesn't support it13:50
bauzaswell,13:50
sean-k-mooneyactually it does13:50
bauzasfor time-sliced GPUs13:50
bauzasfor MIG GPUs, you can, but that's a waaaaay different matter13:50
sean-k-mooneybut only in a specific mode for specific types on specific cards13:50
sean-k-mooneyyep13:50
bauzaslet's not overcomplicate this 13:50
sean-k-mooneylol13:50
sean-k-mooneydo you have a time machine13:50
dansmithwell, my point is,13:50
dansmithit seems really uncool to expose all these single-inventory RPs in general13:51
dansmithlike it feels as though we modeled it wrong13:51
sean-k-mooneyya so we should be grouping per PF13:51
sean-k-mooneyas each physical device should be its own RP13:52
sean-k-mooneybut since we never actually added official support for sriov mode13:52
sean-k-mooneywhat we did instead was tell people13:52
bauzasdansmith: the problem is that we don't know preemptively which type consumes how many VFs13:52
sean-k-mooneyjust use the pci address of the vf instead of the pf13:52
bauzasso the simplest solution is to tell people to statically allocate mdevs13:52
sean-k-mooneybauzas: that's not really the problem but it is one of the issues13:53
bauzassean-k-mooney: that *IS* the problem in our case13:53
dansmithbauzas: yeah, I think there are ways we could handle that, but yeah it's not ideal for sure13:53
dansmithunless we make people annotate the config in that way13:53
sean-k-mooneywell we agreed in the ptg to make them do that ^13:53
bauzasif I was able to get a clue of how many VFs are available for a type, I could prevent nova from creating RPs for all 16 of them13:53
sean-k-mooneywe agreed to make device_address required13:54
bauzassean-k-mooney: no we said at the PTG that allocating a mdev is cheap13:54
bauzasso we could continue to do this13:54
sean-k-mooneybauzas: yes we can, that's not what i said above13:54
sean-k-mooneywe agreed that going forward they need to specify which devices we create inventories from13:54
sean-k-mooneyby requiring the device_address13:54
bauzassean-k-mooney: sure13:54
sean-k-mooneynova can continue to do the mdev creation13:54
bauzasbut this doesn't prevent operator fat fingers, right?13:54
dansmithsean-k-mooney: agreed to make who do what?13:55
sean-k-mooneythe key is to just not list more vfs than can be created13:55
sean-k-mooneydansmith: https://docs.openstack.org/nova/latest/configuration/config.html#devices13:55
bauzassean-k-mooney: also, with SR-IOV, this will be hard for operators to know which VFs to allocate btw.13:55
bauzasthe fucking nvidia docs never tell about nvidia-XXX13:56
sean-k-mooneyfor legacy reasons (only for the first mdev type) specifying device_address is optional13:56
dansmithsean-k-mooney: meaning in [mdev_nvidia-35]/this_takes_n_vfs=10 right?13:56
dansmiththat sounds good to me13:56
bauzasthe only way for an operator to know what a nvidia type is doing is by getting the /name value and match it with docs13:56
dansmithdefault to 1 and if they don't fill it in, then the math will be wrong13:56
sean-k-mooneydansmith: not quite13:56
bauzasdansmith: no, sean-k-mooney is suggesting explicitly defining in nova.conf all the pci addresses of the VFs13:57
sean-k-mooneydansmith: you would look at the mdev type and list n vfs13:57
bauzasno new option13:57
dansmithsean-k-mooney: isn't that what I just said?13:57
sean-k-mooneywe could do it differently however13:57
bauzas[devices] enabled_mdev_types = nvidia-35, nvidia-36  [mdev_nvidia-35] device_addresses = 0000:84:00.0,0000:85:00.0  [vgpu_nvidia-36] device_addresses = 0000:86:00.0 13:57
bauzasyou would specify then on nova.conf all the VFs 13:57
bauzasat least the ones you gonna allocate13:58
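(Reflowed for readability, the configuration bauzas pasted above is:

    [devices]
    enabled_mdev_types = nvidia-35, nvidia-36

    [mdev_nvidia-35]
    device_addresses = 0000:84:00.0,0000:85:00.0

    [vgpu_nvidia-36]
    device_addresses = 0000:86:00.0
)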
sean-k-mooneydansmith: so we could do something better13:58
bauzassean-k-mooney: so sean, you gonna prefer to -1 my fix and say "please rather statically allocate all VFs" ?13:58
bauzashttps://review.opendev.org/c/openstack/nova/+/89962513:59
dansmith-1 to static, I can't imagine that's really going to make people happy13:59
bauzasnot sure our operators will appreciate 13:59
bauzasyeah13:59
sean-k-mooneydansmith: what we could do is per mdev type declare how many mdevs it can have13:59
sean-k-mooneyand then specify that device_address should be the PF13:59
bauzasthat's a risky math14:00
sean-k-mooneyso then we would have one device address (and placement inventory) per physical gpu14:00
dansmithI still don't understand how that's not what I said above and why that's not a good solution14:00
bauzasbecause nothing in the sysfs structure is telling us which PF is the parent of the VF14:00
sean-k-mooneyand we would just not use available_instances at all for the placement total14:00
sean-k-mooneybauzas: it really should14:00
dansmithsuuuurely it does14:00
bauzassean-k-mooney: if you don't use available_instances, then you're asking people to precreate the mdevs14:00
sean-k-mooneythat might be missing in libvirt but it will be in sysfs14:00
sean-k-mooneybauzas: no im not14:01
sean-k-mooneyim asking them to read the nvidia docs14:01
dansmithI have to do another meeting for a bit14:01
sean-k-mooneyand tell us how many it can allocate14:01
dansmithsurely we could also have a generic listing for the common types14:01
bauzassean-k-mooney: I don't get then what you said14:01
bauzas(15:00:31) sean-k-mooney: and we would just not use available_instances at all for the placement total14:01
bauzasdansmith: that's like boiling the ocean14:02
bauzastypes are hardware-dependent14:02
sean-k-mooneybauzas: im saying for the placement inventory we read total from the config option not from available_instances14:02
dansmithnvidia-472 differs in size on different cards?14:02
bauzashttps://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu14:02
sean-k-mooneydansmith: nvidia are creating a mdev type per card14:02
bauzasdansmith: count every line in each tab, you'll get the number of mdev types14:02
dansmithack I understand14:03
bauzasdansmith: no, the other way, nvidia-472 only stands for A100-20C which is only a type supported by A100 cards14:03
dansmithyup, I got it14:03
sean-k-mooneybauzas: lets park this for a bit and create an etherpad or something to collaborate on14:04
bauzasformerly, before the magic world of SRIOV GPUs, available_instances was giving us the number of VGPUs a PCI device was able to create14:04
sean-k-mooneyits simpler to discuss when we have some of the constraints written down, like in a spec14:04
bauzassean-k-mooney: there is already the bug report14:05
sean-k-mooneyi would like something we can edit14:05
bauzasI'm not a big fan of writing a spec for fixing this14:05
sean-k-mooneybauzas: i think you have to14:05
sean-k-mooneydepending on the fix14:05
sean-k-mooneythere are ways to do the minimal fix as a bug14:06
dansmithI think it's clear we need to document this more than just some sysfs dumps in a bug :)14:06
bauzassean-k-mooney: have you seen my patch already ?14:06
sean-k-mooneybut if we want to fix this properly we need to change how we are doing mdev management14:06
bauzasas a reminder, mdev inventory isn't a problem with non-SRIOV GPUs14:06
bauzasso whatever we decide will necessarily be opt-in14:06
sean-k-mooneyits not a problem for all sriov devices14:06
bauzasand keeping the legacy behaviour too14:07
sean-k-mooneyits literally a problem just for how nvidia did it14:07
bauzas"its literally a problem just for how nvidia did it" I should quote it, print it and put it on my front wall14:07
bauzasthe way that available_instances isn't a static number is also a terrible driver implementation, if you want MHO14:08
sean-k-mooneybefore i left intel, i used a prototype intel QAT implementation which had an sriov VF for each engine type and they supported mdevs per VF with independent availability14:08
dansmithis it really a problem though? I mean it kinda seems like this is the point of mdevs, no? to be allocatable in more dynamic ways than just static slices of a card?14:08
bauzastrue14:09
sean-k-mooneydansmith: mdevs from different pci endpoints are intended to not affect each other14:09
sean-k-mooneythat is what nvidia started breaking14:09
sean-k-mooneythe framework allows them to be dynamic14:09
dansmiththere are multiple pci devices per card? or you mean the vfs?14:09
bauzasoh wait14:10
bauzas[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/0000\:41\:01.0/mdev_supported_types/nvidia-472/description 14:10
bauzasnum_heads=1, frl_config=60, framebuffer=20480M, max_resolution=4096x2400, max_instance=214:10
bauzaswe can get the number of VGPUs in the description14:10
sean-k-mooneybauzas: the description is not in a standard format14:10
bauzasthis is horribly driver-specific14:10
bauzasyeah14:11
sean-k-mooneyyes it is, its literally just a text field 14:11
dansmithbut it's something14:11
sean-k-mooneyits not something nova should read14:11
dansmithit's something an operator could read in order to put the math in the config :)14:12
sean-k-mooneygrep . /sys/class/mdev_bus/0000\:41\:01.0/mdev_supported_types/nvidia-472/*14:12
bauzasoperators already read:14:12
bauzas[stack@lenovo-sr655-01 ~]$ cat /sys/class/mdev_bus/0000\:41\:01.0/mdev_supported_types/nvidia-472/name14:12
bauzasGRID A100-20C14:12
sean-k-mooneybauzas: can you run ^14:12
dansmithI also wouldn't say nova *shouldn't* read it.. I think we could read and parse it as long as we don't fail if it has changed14:12
bauzasand then Ctrl-F that string into https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#supported-gpus-grid-vgpu14:12
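(Putting that together, identifying the type on this host looks like the following; the values are the ones pasted earlier in this log:

    $ cat /sys/class/mdev_bus/0000:41:01.0/mdev_supported_types/nvidia-472/name
    GRID A100-20C
    $ cat /sys/class/mdev_bus/0000:41:01.0/mdev_supported_types/nvidia-472/description
    num_heads=1, frl_config=60, framebuffer=20480M, max_resolution=4096x2400, max_instance=2

The name is the string to look up in the nvidia documentation linked above, and max_instance in the description is the per-type limit the discussion below keeps coming back to.)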
sean-k-mooneydansmith: yes the operator can read it14:13
sean-k-mooneydansmith: so would you be ok with adding max_instances to the mdev_type config section14:13
sean-k-mooneyor is that still too static14:14
bauzassean-k-mooney: if we go this way, this wouldn't be backportable, right?14:14
dansmithwell, I'm not sure that's a solution entirely, but I think making the operator tell us some of the math so we can do it better on the backend makes sense14:14
bauzasI really tried to propose https://review.opendev.org/c/openstack/nova/+/899625 as a backportable mechanism14:14
dansmithI'd need to hear the whole story before "being okay" with it, but that seems like a good direction to me14:14
bauzasdansmith: finish your call and then we could do a gmeet14:15
dansmithsure14:15
sean-k-mooneybauzas: i dont really like that because it implies you're still doing the math incorrectly14:15
bauzasI have the PTG summary to write14:15
dansmithI don't love the "quick, delete the provider" approach either, FWIW14:15
bauzassean-k-mooney: it implies people were lazy enough to not explicitely tell which PCI devices to allocate14:15
bauzasdansmith: that's understandable, but then you're pro-statically defining a list of VFs or an integer counting allocatable VFs14:16
sean-k-mooneybauzas: and we are not internally tracking which device the mdev is allocated from, meaning we are not mapping the allocations correctly in placement14:16
bauzas(just clarifying the options)14:16
sean-k-mooneyremoving the rp works if and only if we guarantee that the mdevs are created by packing the RPs14:17
bauzassean-k-mooney: if you want to have my frank opinion, I'm not a big fan of changing all the allocations we make14:17
dansmithbauzas: I'm for doing better math, and it sounds like some hints for that math (in config) are necessary14:17
bauzassean-k-mooney: I very humbly think you're wrong on the fact we miscount14:17
bauzaswe do correctly count which mdevs go against which RPs14:18
dansmithstill would need to understand the whole picture *and* how we migrate to it, impact to existing people, and whether or not it's worth it if mdevs are going away14:18
bauzasthe latter conditional is hard to guess14:18
bauzasbut that's fair assumptions14:19
bauzasand I'm not particularly attached to the bugfix itself14:19
bauzasbut again, I have my wishes: a simple approach, not redesigning our placement tracking, which would be backportable14:20
sean-k-mooneybauzas: in sriov mode is available_instances always 1?14:20
sean-k-mooneybecause for intel's implementation (which they never released) it was not14:20
bauzassee the bug report, yeah, until all mdevs are created14:20
bauzasI literally wrote that in the bug description14:20
sean-k-mooneywhat i dont like about that is that it's a vendor-specific detail14:21
bauzashell yeah, hence why I wanted to make the fix agnostic14:21
bauzaswe're just doing some math in some method that say 'oh look, that inventory is now 0, but meh, I don't care updating it'14:22
sean-k-mooneywell your solution does not work if the VFs have available_instances>1 initially14:22
bauzasthis is honestly to me a wrong approach, and at least a LOG warning should tell the operator something weird happened14:22
bauzasbeg your pardon ?14:22
sean-k-mooneysimple case14:23
sean-k-mooneywe have one mdev capable device14:23
sean-k-mooneyit has an mdev type that allows 2 allocations total14:23
sean-k-mooneyit creates n VFs14:23
sean-k-mooneyeach vf reports available_instances=214:23
bauzas(assuming then you're not talking of the nvidia driver then)14:24
sean-k-mooneyi allocate one mdev each from two different vfs14:24
sean-k-mooneyyou can delete the other vf rps14:24
sean-k-mooneybut we would be misreporting the 2 remaining rps14:24
bauzasare you talking of something that doesn't exist now but might be a future possibility ?14:25
sean-k-mooneyim talking generically14:25
bauzasagain, I don't understand your logic, but let's imagine your scenario for a sec14:26
bauzassay I have 16 VFs, each of them with available_instances=214:26
bauzassay now I create one mdev twice, each on a distinct VF14:26
bauzasfor 2 VFs, available_instances would be equal to 114:27
bauzastwo mdevs would exist14:27
bauzasso the inventory wouldn't change for those14:27
sean-k-mooneyright but only 2 can be created14:27
bauzaswe would still count : total=214:27
sean-k-mooneyi guess what would happen is that when i allocate the first mdev 14:27
sean-k-mooneythe available_instances on all vfs would go to 114:27
sean-k-mooneyand with the second one it would go to 014:27
bauzasokay, then it's the driver's responsibility to change available_instances to 0 for all the VFs14:28
bauzasincluding the two ones we use14:28
sean-k-mooneyya14:28
bauzastotal would be one for 2 RP inventories then, right?14:28
bauzastotal=1 I mean14:28
sean-k-mooneyso the only mdev hardware i have used that supports sriov vfs did not have that problem14:28
bauzasanyway, this isn't possible with nvidia SR-IOV, all VFs *have* available_instances=114:29
bauzasyou lost me, honestly14:29
bauzasif you're talking of T4s14:29
bauzaswhat we had was a basically non-SRIOV accountancy14:30
sean-k-mooneyno im talking about hardware i used at intel and trying to think if it would break14:30
sean-k-mooneyi think the delete you are proposing would work14:30
bauzastbc, I've experienced a race condition14:30
sean-k-mooneybut i dont like the idea of us creating and deleting rps in general based on workload usage14:30
bauzassince we update resource providers on a periodic14:31
bauzasI was able to create 3 VMs (and allocate three VGPU allocations)14:31
bauzasthat's why I added the claim stuff14:31
bauzasso, see, again, I'm not particularly attached to do this14:32
bauzasif we go with statically defining which VFs are going to be used, fair14:32
sean-k-mooneybauzas: we normally run the update available resources call elsewhere14:32
bauzasI have checked14:32
sean-k-mooneybauzas: we explicitly do it from the compute manager in move claims14:32
bauzaswe do run the update *before* the spawn14:32
bauzasnot *after* :)14:32
dansmithbauzas: agreed, deleting RPs as a result of allocation feels very wrong14:33
sean-k-mooneydansmith: the only mitigation is we dont delete the mdev after the instance is moved/deleted14:33
sean-k-mooneyso provided the compute node is not restarted14:33
sean-k-mooneywe will do the delete once 14:33
dansmithcould we set reserved=total instead of deleting maybe?14:33
sean-k-mooneybut the extra RPs will get recreated after a host reboot14:33
bauzasdansmith: we would need to implement the reverse logic then14:34
sean-k-mooneydansmith: reserved=total is what i was originally thinking but the issue is total is going to 014:34
sean-k-mooneyand total cant be 014:34
sean-k-mooneyso we would need to do reserved=total=114:34
bauzasand when to unreserve ?14:34
bauzassay I delete my mdev ?14:34
bauzasI'm a stubborn operator that deletes the mdevs, y'know14:35
sean-k-mooneyor you do a host reboot14:35
bauzashonestly, I like the config approach14:35
sean-k-mooneyso ya the reserved=total approach is more complex than delete14:35
dansmithsean-k-mooney: right I meant reserved=old_total14:35
bauzasif you specify the max VGPUs per type, then we could do this14:36
bauzasI feel a disturbance in the force when I talk about deleting RPs14:36
sean-k-mooneybauzas: its slightly more complex than that14:36
bauzasI used the term "strawman", I think it was the right wording14:37
sean-k-mooneymax mdevs needs nova to understand which physical device the mdevs are from14:37
bauzasthat is correct14:37
bauzaswe know the parent device14:37
bauzasfor a mdev14:37
bauzaswe already use this information to track14:38
bauzasin nova today14:38
sean-k-mooneywell in the nvidia case the max mdev per type applies to the grandparent device14:38
bauzaswhat we don't track *now* is to know the parent of a mdev capable device14:38
bauzasand we can't rely on libvirt, it gives you the PCI-E root PCI address (I tested :) )14:38
bauzassean-k-mooney: that's what libvirt is giving us https://paste.opendev.org/show/b3MHPT8VoIOBWw1tim2o/14:40
bauzaswe know the parent addr tho14:40
bauzas     <capability type='phys_function'>       <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>14:40
sean-k-mooneyit tells you the parent device14:41
bauzasso you were right, we could use this14:41
sean-k-mooney<parent>pci_0000_40_01_1</parent>14:41
bauzassean-k-mooney: no, you're wrong, that's not the PF address :)14:41
bauzas[stack@lenovo-sr655-01 ~]$ lspci | grep 40:01.114:42
bauzas40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge14:42
sean-k-mooneyok can you give me access to that system so i can take a look14:42
bauzasthat's the PCI bridge address :)14:42
bauzasbut I told you, we have the parent PF address in the device XML14:42
sean-k-mooneyi originally said we could get the info from sysfs because i know how to look it up there14:42
sean-k-mooneywe can get it from libvirt but i dont remember the element off the top of my head14:43
sean-k-mooneyif we have it great14:43
sean-k-mooneyits <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>14:43
sean-k-mooneyfrom the capability type='phys_function' YES 14:44
bauzasyeah, and thru sysfs, we have access to the physical function in /sys/class/mdev_bus/0000\:41\:00.4/physfn/14:46
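(A quick way to confirm the VF-to-PF relationship from sysfs — physfn is a symlink pointing at the parent PF — is something like the following; the output shape assumes the standard SR-IOV sysfs layout and the addresses are the ones from this host:

    $ readlink /sys/bus/pci/devices/0000:41:00.4/physfn
    ../0000:41:00.0
)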
dansmithokay done with my other meeting14:46
* dansmith reads back14:46
bauzasbut that's not a trivial change14:46
bauzaswe need to implement some libvirt config change14:46
sean-k-mooneywe can and should use the info from libvirt since its there14:46
sean-k-mooneybauzas: we already have parsing logic for this in the pci tracker code in the libvirt driver14:46
bauzaswe already get information of the mdev capable devices thru libvirt14:46
bauzassean-k-mooney: oh you're right14:47
bauzasdansmith: take time to digest 14:47
bauzassounds less bugfix-y, more feature-y than my patch but I can do it14:48
bauzassean-k-mooney: fwiw, this is where we get information for a mdev cap device https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L828414:49
sean-k-mooneybauzas: we should separate the backportable fix from the final fix14:49
sean-k-mooneybauzas: https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1321 is the functionality that extracts the pci info from the node dev 14:49
bauzashttps://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L8263 is where we get the capacity14:50
bauzasso sounds easier than originally thought14:50
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L1351-L136314:50
sean-k-mooneythat's actually where we are parsing it14:50
dansmithso max_mdevs_per_device works only because once you allocate one mdev type from a card, that card is locked into that type for the rest of the allocations right?14:50
bauzasdansmith: you're correct, there is no heterogenous support14:50
artomsean-k-mooney, we never updated the Neutron docs for the socket PCI NUMA affinity policy: https://docs.openstack.org/api-ref/network/v2/index.html#numa-affinity-policy14:50
sean-k-mooneydansmith: in the simple case yes. but we also only support one mdev_type per card in nova14:50
bauzasyup, there is no behavioural change in support14:51
sean-k-mooneyartom: we never updated the neutron extension for it14:51
sean-k-mooneyartom: its not a docs issue, i believe we need an api change14:51
dansmithsean-k-mooney: wait, is that the case even before we allocate?14:51
artomsean-k-mooney, so we can't actually set it on the port at all?14:51
sean-k-mooneydansmith: yes which means we break if they allocate the wrong type manually14:52
bauzasare we still talking of GPUs or PCI NUMA affinity now ?14:52
sean-k-mooneydansmith: we have had customer do that14:52
artomAnyways, I interrupted a convo with my dumb questions, carry on, I'll go look at the code14:52
sean-k-mooneybauzas: im talking to artom about something else14:52
artombauzas, ignore me and PCI policies14:52
bauzasartom: no worries, we're reaching to some agreement soon I guess14:52
bauzasdansmith: assigning a mdev type to a GPU is statically defined in nova14:53
sean-k-mooneyartom: https://github.com/openstack/neutron-lib/blob/master/neutron_lib/constants.py#L676-L68114:53
dansmithokay I'm confused14:53
dansmithI thought ya'll said the opposite earlier14:53
bauzaschanging the value once you create mdevs is a breaking change that we don't support14:53
sean-k-mooneyartom: so the api will not accept it and we are missing the osc support too14:53
dansmithright I understand that part14:54
dansmithare we going to gmeet?14:54
bauzassure I can fire one14:54
dansmithjust saying, since artom showed up we14:54
dansmithare all muddy here :)14:54
artomYep, I barged in like a rhino on a sugar rush, apologies14:55
* dansmith blinks14:55
bauzashttps://meet.google.com/hqy-bgjb-muk for people fancy discussing nvidia nightmare14:55
sean-k-mooneydansmith: for what its worth the patch bauzas proposed should work given the hardware that is available to buy today14:55
sean-k-mooneyits just depending on some assumptions we have in nova and in how the current hardware works14:55
opendevreviewWill Szumski proposed openstack/nova stable/yoga: Ironic nodes with instance reserved in placement  https://review.opendev.org/c/openstack/nova/+/86791216:49
jovialIs the nova-emulation job busted on stable/yoga? Was interested in progressing this patch: https://review.opendev.org/c/openstack/nova/+/89855417:41
bauzassean-k-mooney: dansmith: so, if you don't mind, I'm gonna work on the VF filtering case with max_instances config option17:42
bauzasand then we could nitpick on what we want17:42
dansmithyep17:44
bauzasdansmith: fwiw, there is also https://review.opendev.org/c/openstack/nova/+/899406 17:44
bauzaswith that patch, pas-ha[m] is forcing operators to explicitly define all the device_addresses17:45
bauzasbut I personally think that adding a max_instances config option is painless for ops 17:45
bauzasso we could mix both17:45
bauzasmeaning that we would force ops to either explicitly specify the device addresses *or* give the number of VFs that nova needs to create17:46
bauzasthis sounds very consistent to me17:46
dansmithmax_instances seems much more useful to me so people don't have to encode pci device addresses in their config *just* to avoid the broken behavior17:50
bauzasyeah so we will keep the existing behaviour but add another option for automatic discovery17:51
bauzasdansmith: I definitely appreciate your help on that one17:51
bauzasand I really want to give you review credits back on your own series17:52
sean-k-mooneydansmith: i have some concerns with allowing config without pci addresses17:57
sean-k-mooneyi.e. if you allow some of your gpus to be used for pci passthrough and others for vgpu17:58
dansmithright, that's a good reason *to* do it, but not a good reason to *require* it, IMHO17:58
sean-k-mooneyif you dont use device_address in that case we could pass through a device (that was allowed in the pci dev_spec) but not disallowed in the mdev config section17:58
sean-k-mooneydansmith: i dont know, i always felt that its a pseudo security issue to not explicitly list devices we can expose to a guest17:59
sean-k-mooneyi get that it sucks to need to be explicit17:59
sean-k-mooneybut that's the trade off17:59
dansmithbut in the mdev case you're not just blindly exposing them,18:00
sean-k-mooneyto be clear we have this tradeoff in the pci tracker too18:00
dansmithyou've already opted into "all devices with this mdev type" right?18:00
sean-k-mooneyyep so its the same as just putting vendor_id and product_id in the pci dev_spec18:01
sean-k-mooneyaka the pci whitelist18:01
dansmithI don't think that's the same really18:01
opendevreviewStephen Finucane proposed openstack/nova master: docs: Add documentation on server groups  https://review.opendev.org/c/openstack/nova/+/89997918:01
dansmithwell, if you do both vendor *and* product I guess it is18:01
sean-k-mooneyto me there is no difference between mdev type and (vendor_id and product_id)18:01
dansmithbut not like worrying that 8086 is going to allow your root bridge to go through or something18:01
dansmithbut I really don't see the problem with that level of specificity18:02
stephenfinbauzas: melwitt: https://review.opendev.org/c/openstack/nova/+/899979 came up as a result of an OpenShift question. I figure you'd both be interested in that addition to our docs18:02
sean-k-mooneydansmith: i think its more an issue with how the config options are arranged than anything else18:03
stephenfinI've also proposed helper alias for OSC to make requesting server groups slightly nicer https://review.opendev.org/c/openstack/python-openstackclient/+/89997518:03
stephenfin*a helper18:03
sean-k-mooneydansmith: we split the mdevs into multiple dynamic config sections to influence this18:03
bauzassean-k-mooney: if we can ease your pain, we can document the fact that max_instances automatically passes *all* GPUs to nova18:03
sean-k-mooneywhich to me make it harder to know if a device will be allowed or not18:03
bauzassean-k-mooney: and we could mention using device_addresses as an alternative for people afraid of passing all the GPUs 18:04
sean-k-mooneybauzas: just make -1 or None the default and use it as a sentinel to mean all.18:04
bauzaslemme provide the patch first18:04
sean-k-mooneybauzas: well i think we should be recommending device_addresses as the way people should do it in general18:05
dansmithsean-k-mooney: that doesn't work right? that gives us the wrong behavior we have now18:05
sean-k-mooneydansmith: it does but i think bauzas was going to have it be wrong out of the box too18:05
sean-k-mooneyunless i misunderstood18:06
bauzastbc I was documenting in the config doc helper that max_instances should be set to the number of mediated devices a device can create18:06
sean-k-mooneybut were you going to recommend using device_addresses as what people should do by default?18:07
bauzaswe could18:07
sean-k-mooneypersonally i would prefer requiring you to do "device_addresses=*" to get the "use any device with this mdev type" behavior by the way18:08
sean-k-mooneyor saying you must set 18:08
sean-k-mooneyeither device_addresses or max_instances18:08
bauzasyeah those options are conflicting, for sure18:10
bauzasI can mark those this way with oslo.config18:10
sean-k-mooneybut can you also make the mutually exclusive group required18:11
sean-k-mooneyand the dynamic config section required18:11
bauzaswe're on same page18:12
sean-k-mooneywell there is one thing im not sure you and i agree on18:12
bauzas[mdev_nvidia-XXX] would become mandatory if we merge https://review.opendev.org/c/openstack/nova/+/89940618:12
sean-k-mooneyah18:12
sean-k-mooneyya that18:12
bauzaswhat's inside this section is mutually exclusive18:12
bauzaseither device_addresses or max_instances18:13
bauzaswfy ?18:13
sean-k-mooneyso your ok with requiring that if an mdev type is in enabled_mdev_types18:13
sean-k-mooneywe must have the config section18:13
bauzasthat's what we agreed at the PTG18:13
sean-k-mooney[mdev_<mdev name>]18:13
sean-k-mooneyok then ya with that condition i think im good with your proposal18:13
bauzasin order to ease people's lives, I'm just gonna add max_instances as a mutually exclusive option18:13
bauzasif you're SR-IOV based, we'll pick N VFs for you18:14
bauzasif you're not SRIOV, we'll just pick N devices for you18:14
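(As a rough sketch, the nova.conf being discussed would end up looking something like this — max_instances is the option proposed in this conversation, not an existing setting, and the type and values are illustrative:

    [devices]
    enabled_mdev_types = nvidia-472

    [mdev_nvidia-472]
    # either list the devices/VFs explicitly ...
    #device_addresses = 0000:41:00.4,0000:41:00.5
    # ... or (proposed) state how many mdevs of this type the hardware supports
    max_instances = 2
)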
melwittstephenfin: cool, thanks. I'll check it out18:15
sean-k-mooneybauzas: we could just make max_instances apply to sriov case18:15
sean-k-mooneysince the non sriov case just works18:16
bauzaswith the non-sriov gpus, you need to specify all the cards18:16
sean-k-mooneyin the device list18:16
bauzasgiving them a way to say "I don't care" 18:16
bauzasseems good to me18:16
sean-k-mooneyya so just add * to device_addresses18:17
sean-k-mooneywell im flexible on how we do that18:17
bauzaswe need implementation patches and then we could do anything we want18:18
sean-k-mooneyim ok to have a way to say "expose all devices of this type" i just would like it to be a little more explicit than it is today18:18
bauzasbtw. I'm on PTO tomorrow18:18
sean-k-mooneyright so if we want to defer the details of how we do that for now18:18
sean-k-mooneythat works for me18:18
sean-k-mooneyya im on pto tomorrow and monday18:18
bauzascool18:18
bauzasI'll see if I can draft something before I leave 18:19
bauzasnow that I have a machine, this is waaaay easier than before18:19
sean-k-mooneyyep hardware both sucks and make life simpler18:19
bauzassean-k-mooney: I made the wrong assumption, for non-sriov, max_instances is indeed meaningless18:36
sean-k-mooneycool18:36
bauzasso yeah we maybe need a wildcard18:38
bauzasthat said, we captured the opposite https://etherpad.opendev.org/p/nova-caracal-ptg#L66918:38
bauzasbut I recently discovered that every single day is full of adventures and news when it comes to nvidia18:38
opendevreviewsean mooney proposed openstack/nova master: add pyproject.toml to support pip 23.1  https://review.opendev.org/c/openstack/nova/+/89975319:44
pas-ha[m]<bauzas> "dansmith: fwiw, there is also..." <- bauzas: I don't think I force anything with this patch, it should be 100% backward compatible - if no explicit devices are defined together with the single enabled mdev type, everything should work as before.20:23
pas-ha[m]at least that was my intention, not sure I actually tested for that :-) 20:23
sean-k-mooneywith a single mdev type, but if you had 2 or more then the config section would be required for each, correct?20:24
* sean-k-mooney has not looked at the patch in a few hours and lost context20:25
sean-k-mooneyhum so https://review.opendev.org/c/openstack/nova/+/899406/1/nova/virt/libvirt/driver.py#793320:26
sean-k-mooneyis just always registering the config options20:26
sean-k-mooneybut not necessarily enforcing it20:26
sean-k-mooneybecause you still default it here https://review.opendev.org/c/openstack/nova/+/899406/1/nova/virt/libvirt/driver.py#795820:26
sean-k-mooneypas-ha[m]: so ya i think your version is technically still backward compatible20:27
sean-k-mooneypas-ha[m]: the fact your change has no unit test changes means either its backward compatible or we have no test coverage for this20:28
pas-ha[m]sean-k-mooney: I bet the latter :-D20:28
sean-k-mooneyunfortunately you might be right20:29

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!