Wednesday, 2025-03-12

*** JohnnyW557 is now known as JohnnyW5504:13
opendevreviewAmit Uniyal proposed openstack/nova master: DNM: enable NFS backend job  https://review.opendev.org/c/openstack/nova/+/94400806:27
opendevreviewRajesh Tailor proposed openstack/nova master: Add support for showing instance-action finish_time  https://review.opendev.org/c/openstack/nova/+/92893306:51
opendevreviewMichael Still proposed openstack/nova master: libvirt: Add extra spec for sound device.  https://review.opendev.org/c/openstack/nova/+/92612608:05
opendevreviewMichael Still proposed openstack/nova master: Protect older compute managers from sound model requests.  https://review.opendev.org/c/openstack/nova/+/94077008:05
opendevreviewMichael Still proposed openstack/nova master: libvirt: Add extra specs for USB redirection.  https://review.opendev.org/c/openstack/nova/+/92735408:05
gibisean-k-mooney: I have a feeling that the pool matching logic in stats.py also intentionally handles the multi spec case. See https://github.com/openstack/nova/blob/0d484ce37d86e989c8abdf57aec5e334f68206ef/nova/pci/utils.py#L75 it basically implements an OR between the multiple specs 08:22
sean-k-mooney[m]it did in the really early days08:22
sean-k-mooney[m]but i tough twe removed this like 10+ years ago08:23
gibiappreanlty we did not08:23
gibiPCI in Placement never supported it though I just broke the error reporting leading to http 50008:23
sean-k-mooney[m]so the vfio-variant dirver live migration depend on 1:1 alias mappings08:23
sean-k-mooney[m]at least when not using pci in placement08:24
gibisean-k-mooney: do you mean that the live migration will only work if the same exact device is on both sides and therefore having specs in OR relationship with different product id would not work?08:24
sean-k-mooney[m]that is my understanding08:25
gibiyeah I agree08:25
sean-k-mooney[m]i dont se how it could work any other way when the deviece is directly exposed to the guest kernel as a vf08:25
sean-k-mooney[m]i think we need to  at least restict when oring can happen08:26
gibithe unplug/plug live migration for neutron requested PCI probably works as that is not real live migration of the device, but the variant driver based will lead to very sad guest kernels08:26
sean-k-mooney[m]yep08:27
sean-k-mooney[m]so i know why they orginally did this which is they were trying to make differnt configurations of intels QAT device look the same\08:28
sean-k-mooney[m]and allow you to use ask for QAT08:28
sean-k-mooney[m]without careing which one you got08:28
sean-k-mooney[m]that only works if you never have to deal with live migration08:29
sean-k-mooney[m]so i think we need to disable this when live_migratable is set to true or you use pci_in_placment08:29
sean-k-mooney[m]either should result in a config error on startup08:30
sean-k-mooney[m]we can allow it for non llive migratbale device or when not using PCI in placement for backward compatibality08:31
opendevreviewRajesh Tailor proposed openstack/nova master: Add support for showing instance-action finish_time  https://review.opendev.org/c/openstack/nova/+/92893308:58
sean-k-mooneyit is day 2 or alt world where my fundemental core belife that "their should only be 1 7:30 in a day and it should happen after noon" is being chllanged by the concept of 7 am...09:12
sean-k-mooneyits currntly 09:13 which is still an hour before im useually awake09:13
opendevreviewribaudr proposed openstack/nova master: FUP improve comment accuracy and variable naming for tag removal  https://review.opendev.org/c/openstack/nova/+/94312411:43
opendevreviewribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check  https://review.opendev.org/c/openstack/nova/+/94410511:43
opendevreviewribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers  https://review.opendev.org/c/openstack/nova/+/94410611:43
andrewbonneysean-k-mooney: in case you have time today for another look, https://review.opendev.org/c/openstack/nova/+/919961 seems fine in CI11:52
sean-k-mooneyandrewbonney: ill review it propelry shortly. if you need to respin it might be nice to add a fixes release note12:13
sean-k-mooneybut thats not strictly required12:13
andrewbonneythanks12:13
sean-k-mooneyso ya ill take a look shortly12:13
sean-k-mooneyandrewbonney: +2 form me12:23
sean-k-mooneyandrewbonney: this proably wont merge for RC1 but it can merge and be backported as normal after12:24
andrewbonneysure, just trying to get it ticked off before I head off for extended leave12:24
sean-k-mooneyack12:24
opendevreviewMerged openstack/nova master: Fix serial console for ironic  https://review.opendev.org/c/openstack/nova/+/94257512:26
*** elodilles is now known as elodilles_pto13:22
gibifolks, Why do we only use the move claim during live migration if the VM is NUMA aware? context https://github.com/openstack/nova/blob/f71a0a6204744b7774caf85c16f27964a4f851db/nova/compute/manager.py#L9057-L9066 13:44
gibibased on my local testing that src_supports_numa_live_migration check there is only true if the VM is NUMA aware 13:45
gibiso it seems like we doesn't do any claim for the instance during live migration if the instance is not NUMA aware (except claiming things outside of the Claim object like PCI)13:45
Uggla@artom, can you have a look at gibi's question above.13:48
artomgibi, I don't remember :(13:50
artomBut I'm pretty sure it's not _just_ NUMA-aware VMs13:50
artomOther code added later started using that too13:50
artomVPMEM I think?13:50
artomAnd src_supports_numa_live_migration was there to handle the rolling upgrade case at the time of implementation13:51
gibiif you look at the code linked above there is a check that looks for src_supports_numa_live_migration in the migrate data, that condition is false for me locally for a non NUMA aware VM13:51
artomWe can _probably_ remove it now, with the usual caveats around RPC13:51
Uggla@artom, look at https://review.opendev.org/c/openstack/nova/+/944105  you left a comment that may remember you things.13:51
artomOh yeah, there was the SRIOV live migration thing as well13:52
artomIt's handled by a separate code path13:52
Ugglayep below in the manager13:53
gibiyeah the PCI stuff claimed differently, but what about the rest of the resources?13:53
artomYou mean non-Placement resources?13:53
Uggla@gibi this is claimed at scheduler step before no ?13:55
*** whoami-rajat_ is now known as whoami-rajat13:55
gibiartom: yea13:55
gibiall the normal things we do during an instance claim or during a move claim in other ops13:56
artomI _think_ as live migration stuff got added they all stated using the MovaClaim?13:56
gibiobviously what is not done today does not casue any highly visible issue so maybe those things does not need to be claimed during live migration, but it is very strange13:57
gibiyeah using the MoveClaim would be logical during live migration, but my observation is that for a non NUMA aware VM the code does not use any claim13:57
gibi(except the PCI one outside of the Claim obj)13:58
artomI agree with you that it's strange13:59
artomBut it's been so long (Train!) that I've lost most context for the decisions that were made13:59
gibiyeah I understand you don't have a context. I will try to charachterize the my feelings more13:59
artomI talk about it here: https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/numa-aware-live-migration.html#resource-claims14:00
gibibest would be to find a way how this could break our resource tracking and the move from there14:00
gibiI need to jump on a call, I will be back in 30 mins (hopefully)14:00
artomOK14:01
gibithat section reads for me like we wanted to do the claim regardless of NUMA awareness14:02
artomYeah, I dunno anymore :(14:17
artomMaybe let's come at it from a different angle - what led you to notice this, and what are you trying to do?14:17
dansmithah, the old "let me answer a different question" deflection :D14:20
artomHah14:21
artomI'd _like_ to answer the original question14:22
artomI just don't remember14:22
dansmithI recognize the strategy well14:24
*** svinota is now known as svinota[afk]14:45
dansmithgibi: so, there's a race happening, which maybe/probably affects other stuff.. any chance you have a few minutes to gmeet for me to describe and see if you're aware already?14:52
dansmithspecifically a provider_tree update race14:53
sean-k-mooneyi dont think we update the provier tree resereved value in nova for anything else, although i know neutron does for l3 routed networks14:55
sean-k-mooneyso they might have seen whatever the race is with that14:56
dansmithit's not a problem specific to reserved, that's just how I'm seeing it14:58
dansmithalthough perhaps it doesn't get exposed because of the nature of the external update in this case, although that might mean things like neutron/cinder do apply14:58
sean-k-mooneyare you getting a generation conflcit?14:59
* sean-k-mooney was not following the disucsion before14:59
dansmithno15:00
dansmithsean-k-mooney: there was no discussion of this before, I just discovered this while testing locally15:00
* gibi is reading back...15:00
dansmithgibi: there's nothing to read back except up to my ask for a gmeet consult :D15:01
gibiartom: so this come up when testing the vgpu live migration series 15:01
gibiuntil now all our testing was done with non NUMA aware VMs15:01
gibiand then Uggla did the first NUMA aware testing and hit by the check in the move claim that was not hit before15:02
gibithis check https://review.opendev.org/c/openstack/nova/+/944105/1/nova/compute/claims.py15:02
gibiso we started to check why we did not hit this code before in our testing15:02
gibiand it turns out because nova does not do a move claim during live migration if the VM is not NUMA aware15:03
gibiwhich sounds very strange to me15:03
gibiso I try to figure out if this is an old bug, or it is intentional15:04
gibithe code read likes the condition around the move claim wanted to handle rolling upgrade but somehow but actually it does more than that in reality as it disables the move claim for non NUMA aware VMs15:05
gibidansmith: sure we can have a gmeet. I'm free until the top of the hour15:07
dansmithmeet.google.com/jad-dzbb-nkb15:07
dansmithsean-k-mooney if you're free and interested ^15:08
opendevreviewRajesh Tailor proposed openstack/nova master: Add support for showing instance-action finish_time  https://review.opendev.org/c/openstack/nova/+/92893315:19
Uggla@gibi @dansmith interested by the chat, but I have noticed it too late. However I'd like to know your findings.15:24
opendevreviewribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers  https://review.opendev.org/c/openstack/nova/+/94410616:06
opendevreviewribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier  https://review.opendev.org/c/openstack/nova/+/94413316:06
dansmithgibi: first quick stab at invalidating the cached RP shows that we immediately refresh it when we go to do the placement sync17:10
dansmithso I think that's the ticket out of the problem17:10
*** bauzas is now known as bauzas__17:12
*** bauzas_ is now known as bauzas17:13
*** bauzas__ is now known as bauzas_17:14
bauzasfolks, I'm just moving to thelounger IRC bouncer and client after 12 years of ZNC, but I continue to run ZNC with the bauzas_ nick17:14
bauzasjust in case you don't get a reply from me, ping both of us :)17:14
gibiUggla: we discovered that there is a provider tree cache in the resouce tracker that causes a race condition if an OTU device is allocated, reserverd, deallocated, externally unreserved, and allocated again. It is a cache invalidation problem between placement and nova, and Dan is now adding a way to invalidate the caches explicitly when the OTU device is reseved by nova17:16
gibiUggla: we also touched on the non NUMA aware live migration move claim question as well. We think the move claim was added to that code path when NUMA aware live migration was added, and before that only the periodic update_available_resources healed the allocation in the resouce tracker 17:18
gibiso probably that is what happens today for non NUMA aware VMs when live migrated17:18
gibithis gives me enough certanty that I can comment on you patch removing the PCI check from the live migration claim. That check happens in the conductor in any case so I think that was just a dupicated check. I will comment on the patch tomorrow17:20
gibicc artom ^^17:20
Uggla@gibi, may I rephrase to ensure I got it: non numa vm will not claimed resources explicitly they are spawned and then a preriodic adjusts the allocations in the RT ?17:25
artom>  move claim was added to that code path when NUMA aware live migration was added, and before that only the periodic update_available_resources healed the allocation in the resouce tracker17:25
artomCorrect17:25
Ugglaso there is a short amount of time where the RT allocations are not the correct one ?17:26
gibiyeah that is my current understanding17:34
gibiwe are probably not hit by this in real life as most of the non NUMA aware VM resources are tracked in placement and the allocation there is correct and independent form the move claim17:34
gibialso the PCI claim happens outside of the move claim for live migration therefore that always happen correctly even for non NUMA aware VMs and for non PCI in Placement enabled17:35
gibithis is a bit of a mess but a well behaving mess :)17:36
Ugglain French we have an expression for this: "C'est tombé en marche". :)17:42
sean-k-mooneygibi: i belive the check in the move claim for pci devicve is for cold migration17:50
sean-k-mooneyoh maybe im wrong17:52
sean-k-mooneyok no it was checking but it was inherited form the base claim17:53
sean-k-mooneyand it was modifed for the live migration case17:54
sean-k-mooneyUggla: gibi  so ya we can likely drop https://github.com/openstack/nova/blob/master/nova/compute/claims.py#L209-L233 now17:55
sean-k-mooneywell17:55
sean-k-mooneyif we rely on the conducotr to validate that all teh request from the flavor are live migrtable17:56
Ugglasean-k-mooney, it is what I have done. But we were a bit afraid to not hit that code before.17:56
sean-k-mooneyyou have ad a patch up for that for a long time no?17:57
sean-k-mooneyi tought i looked at a patch for this like 2 weeks ago17:57
Ugglasean-k-mooney, fyi https://review.opendev.org/c/openstack/nova/+/94410517:57
Ugglasean-k-mooney, no created it only today.17:58
sean-k-mooneyi guess im thinking fo the conducotr change17:58
sean-k-mooneythis https://review.opendev.org/c/openstack/nova/+/942146/7/nova/conductor/tasks/live_migrate.py#24417:59
Ugglaoh yes sorry. 17:59
sean-k-mooneyhow were you teting this without that change in the move claims?17:59
sean-k-mooneyoh18:01
UgglaThis was tested with a unit tests and the fn tests were ok because all the live migration tests were done without numa VM.18:01
sean-k-mooneyno but you tested this manually18:01
sean-k-mooneywith real hardware18:01
sean-k-mooneyoh....18:02
Ugglanot real hardware was fine too, because I use a non numa vm.18:02
sean-k-mooneythat why gibi was askign about numa18:02
sean-k-mooneythat move claim code was orgianl ment to work for non numa instance too...18:02
sean-k-mooneyi guess that was just alway broken then18:02
Ugglayep, I discovered that this morning, then "pass" the ball to gibi.18:02
sean-k-mooneygot it18:03
sean-k-mooneyfor the record i still hate everyting about claims and still want to burn them with fire18:03
Ugglabecause I could not understand why it was working fine without numa and failed with numa.18:03
sean-k-mooneyyou should see this fail in the funtional test if you enabled hugepages or cpu pinning18:05
sean-k-mooneywe can add test coverage for that18:05
Uggla@sean-k-mooney, ha ha, openstack claims game of thrones.18:05
UgglaI saw the issue with the new fn tests this morning, because I enabled cpu pinning.18:05
sean-k-mooneyi strongly objected to extending claims when we added numa live migration which is why i didnt use them when adding sriov live migration in the same cycle18:06
sean-k-mooneyUggla: oh ok cool18:06
Ugglasean-k-mooney, in fact I leverage "your" test_live_migrate_server_with_neutron to do a mix test with port request and flavor based request.18:09
bauzascores, I need your help for paperwork changes RC1 : https://review.opendev.org/c/openstack/nova/+/94394818:10
bauzasditto for https://review.opendev.org/c/openstack/nova/+/94395218:10
sean-k-mooneyUggla: shoudl we be keepign this https://review.opendev.org/c/openstack/nova/+/944105/1/nova/tests/unit/compute/test_claims.py#b375 but updating it to show it nolonger raise18:10
sean-k-mooneybauzas: im going to finish soon but ill take a look now18:11
bauzasalso, I'd like cores to review our cycle highlights https://review.opendev.org/c/openstack/releases/+/94380118:11
bauzasand nothing urgent but I'd appreciate https://review.opendev.org/c/openstack/nova-specs/+/94401918:11
bauzasand we need a second core for https://review.opendev.org/c/openstack/nova/+/942871 (libvirt/qemu bump)18:13
Ugglabauzas, reminder about 942871: Bump MIN_{LIBVIRT,QEMU} for "Epoxy" | https://review.opendev.org/c/openstack/nova/+/942871  to not forget it. 18:13
bauzasUggla: jinxed !18:13
sean-k-mooneybauzas: you havnt updated https://review.opendev.org/c/openstack/releases/+/943801/2/deliverables/epoxy/nova.yaml based on what already there18:13
bauzassean-k-mooney: that's litterally my next action item18:14
sean-k-mooneyah18:14
bauzaseven at 7:15pm18:14
sean-k-mooneybauzas: i have reviewd everything except the libvirt bump ill take a look at that now18:15
bauzas<318:15
bauzasI'll write the prelude patch tomorrow morning18:16
Ugglabauzas, keep a little bandwith tomorrow morning to review the fup for Vf live migration.18:17
sean-k-mooneyso the bump patch is split in two right, the first does the bump and then https://review.opendev.org/c/openstack/nova/+/942872/2 is a follow up to drop the supprot for the older versions18:18
sean-k-mooneyso we can merge the bump and come back to the WIP patch next cycle yes18:18
Ugglasean-k-mooney, yes.18:19
sean-k-mooneyok i have appoved that as well and rechecked it for the timeouts18:19
Ugglasean-k-mooney, I was running out of time for doing the cleanup this cycle.18:21
sean-k-mooneythats ok. once we have signaled the new min version and updated the constants18:21
sean-k-mooneythe agent will check for the new min verison and report it18:21
sean-k-mooneyso we shoudl be fine18:21
UgglaI hope too18:22
opendevreviewMerged openstack/nova-specs master: Move Epoxy implemented specs  https://review.opendev.org/c/openstack/nova-specs/+/94401918:26
bauzasupdated the cycle highlights : https://review.opendev.org/c/openstack/releases/+/94380118:47
opendevreviewDan Smith proposed openstack/nova master: Support "one-time-use" PCI devices  https://review.opendev.org/c/openstack/nova/+/94381618:49
opendevreviewDan Smith proposed openstack/nova master: Extend invalidate_rp to only invalidate cache  https://review.opendev.org/c/openstack/nova/+/94414818:49
opendevreviewDan Smith proposed openstack/nova master: Invalidate PCI-in-placement cached RPs during claim  https://review.opendev.org/c/openstack/nova/+/94414918:49
-opendevstatus- NOTICE: One of our Zuul job log storage providers is experiencing errors. We have removed that storage target from base jobs. You should be able to safely recheck changes now.20:23
opendevreviewribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation  https://review.opendev.org/c/openstack/nova/+/94415320:42
opendevreviewSteve Baker proposed openstack/nova master: Add VNC console support for the Ironic driver  https://review.opendev.org/c/openstack/nova/+/94252821:02

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!