*** JohnnyW557 is now known as JohnnyW55 | 04:13 | |
opendevreview | Amit Uniyal proposed openstack/nova master: DNM: enable NFS backend job https://review.opendev.org/c/openstack/nova/+/944008 | 06:27 |
---|---|---|
opendevreview | Rajesh Tailor proposed openstack/nova master: Add support for showing instance-action finish_time https://review.opendev.org/c/openstack/nova/+/928933 | 06:51 |
opendevreview | Michael Still proposed openstack/nova master: libvirt: Add extra spec for sound device. https://review.opendev.org/c/openstack/nova/+/926126 | 08:05 |
opendevreview | Michael Still proposed openstack/nova master: Protect older compute managers from sound model requests. https://review.opendev.org/c/openstack/nova/+/940770 | 08:05 |
opendevreview | Michael Still proposed openstack/nova master: libvirt: Add extra specs for USB redirection. https://review.opendev.org/c/openstack/nova/+/927354 | 08:05 |
gibi | sean-k-mooney: I have a feeling that the pool matching logic in stats.py also intentionally handles the multi spec case. See https://github.com/openstack/nova/blob/0d484ce37d86e989c8abdf57aec5e334f68206ef/nova/pci/utils.py#L75 it basically implements an OR between the multiple specs | 08:22 |
sean-k-mooney[m] | it did in the really early days | 08:22 |
sean-k-mooney[m] | but i tough twe removed this like 10+ years ago | 08:23 |
gibi | appreanlty we did not | 08:23 |
gibi | PCI in Placement never supported it though I just broke the error reporting leading to http 500 | 08:23 |
sean-k-mooney[m] | so the vfio-variant dirver live migration depend on 1:1 alias mappings | 08:23 |
sean-k-mooney[m] | at least when not using pci in placement | 08:24 |
gibi | sean-k-mooney: do you mean that the live migration will only work if the same exact device is on both sides and therefore having specs in OR relationship with different product id would not work? | 08:24 |
sean-k-mooney[m] | that is my understanding | 08:25 |
gibi | yeah I agree | 08:25 |
sean-k-mooney[m] | i dont se how it could work any other way when the deviece is directly exposed to the guest kernel as a vf | 08:25 |
sean-k-mooney[m] | i think we need to at least restict when oring can happen | 08:26 |
gibi | the unplug/plug live migration for neutron requested PCI probably works as that is not real live migration of the device, but the variant driver based will lead to very sad guest kernels | 08:26 |
sean-k-mooney[m] | yep | 08:27 |
sean-k-mooney[m] | so i know why they orginally did this which is they were trying to make differnt configurations of intels QAT device look the same\ | 08:28 |
sean-k-mooney[m] | and allow you to use ask for QAT | 08:28 |
sean-k-mooney[m] | without careing which one you got | 08:28 |
sean-k-mooney[m] | that only works if you never have to deal with live migration | 08:29 |
sean-k-mooney[m] | so i think we need to disable this when live_migratable is set to true or you use pci_in_placment | 08:29 |
sean-k-mooney[m] | either should result in a config error on startup | 08:30 |
sean-k-mooney[m] | we can allow it for non llive migratbale device or when not using PCI in placement for backward compatibality | 08:31 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Add support for showing instance-action finish_time https://review.opendev.org/c/openstack/nova/+/928933 | 08:58 |
sean-k-mooney | it is day 2 or alt world where my fundemental core belife that "their should only be 1 7:30 in a day and it should happen after noon" is being chllanged by the concept of 7 am... | 09:12 |
sean-k-mooney | its currntly 09:13 which is still an hour before im useually awake | 09:13 |
opendevreview | ribaudr proposed openstack/nova master: FUP improve comment accuracy and variable naming for tag removal https://review.opendev.org/c/openstack/nova/+/943124 | 11:43 |
opendevreview | ribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check https://review.opendev.org/c/openstack/nova/+/944105 | 11:43 |
opendevreview | ribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers https://review.opendev.org/c/openstack/nova/+/944106 | 11:43 |
andrewbonney | sean-k-mooney: in case you have time today for another look, https://review.opendev.org/c/openstack/nova/+/919961 seems fine in CI | 11:52 |
sean-k-mooney | andrewbonney: ill review it propelry shortly. if you need to respin it might be nice to add a fixes release note | 12:13 |
sean-k-mooney | but thats not strictly required | 12:13 |
andrewbonney | thanks | 12:13 |
sean-k-mooney | so ya ill take a look shortly | 12:13 |
sean-k-mooney | andrewbonney: +2 form me | 12:23 |
sean-k-mooney | andrewbonney: this proably wont merge for RC1 but it can merge and be backported as normal after | 12:24 |
andrewbonney | sure, just trying to get it ticked off before I head off for extended leave | 12:24 |
sean-k-mooney | ack | 12:24 |
opendevreview | Merged openstack/nova master: Fix serial console for ironic https://review.opendev.org/c/openstack/nova/+/942575 | 12:26 |
*** elodilles is now known as elodilles_pto | 13:22 | |
gibi | folks, Why do we only use the move claim during live migration if the VM is NUMA aware? context https://github.com/openstack/nova/blob/f71a0a6204744b7774caf85c16f27964a4f851db/nova/compute/manager.py#L9057-L9066 | 13:44 |
gibi | based on my local testing that src_supports_numa_live_migration check there is only true if the VM is NUMA aware | 13:45 |
gibi | so it seems like we doesn't do any claim for the instance during live migration if the instance is not NUMA aware (except claiming things outside of the Claim object like PCI) | 13:45 |
Uggla | @artom, can you have a look at gibi's question above. | 13:48 |
artom | gibi, I don't remember :( | 13:50 |
artom | But I'm pretty sure it's not _just_ NUMA-aware VMs | 13:50 |
artom | Other code added later started using that too | 13:50 |
artom | VPMEM I think? | 13:50 |
artom | And src_supports_numa_live_migration was there to handle the rolling upgrade case at the time of implementation | 13:51 |
gibi | if you look at the code linked above there is a check that looks for src_supports_numa_live_migration in the migrate data, that condition is false for me locally for a non NUMA aware VM | 13:51 |
artom | We can _probably_ remove it now, with the usual caveats around RPC | 13:51 |
Uggla | @artom, look at https://review.opendev.org/c/openstack/nova/+/944105 you left a comment that may remember you things. | 13:51 |
artom | Oh yeah, there was the SRIOV live migration thing as well | 13:52 |
artom | It's handled by a separate code path | 13:52 |
Uggla | yep below in the manager | 13:53 |
gibi | yeah the PCI stuff claimed differently, but what about the rest of the resources? | 13:53 |
artom | You mean non-Placement resources? | 13:53 |
Uggla | @gibi this is claimed at scheduler step before no ? | 13:55 |
*** whoami-rajat_ is now known as whoami-rajat | 13:55 | |
gibi | artom: yea | 13:55 |
gibi | all the normal things we do during an instance claim or during a move claim in other ops | 13:56 |
artom | I _think_ as live migration stuff got added they all stated using the MovaClaim? | 13:56 |
gibi | obviously what is not done today does not casue any highly visible issue so maybe those things does not need to be claimed during live migration, but it is very strange | 13:57 |
gibi | yeah using the MoveClaim would be logical during live migration, but my observation is that for a non NUMA aware VM the code does not use any claim | 13:57 |
gibi | (except the PCI one outside of the Claim obj) | 13:58 |
artom | I agree with you that it's strange | 13:59 |
artom | But it's been so long (Train!) that I've lost most context for the decisions that were made | 13:59 |
gibi | yeah I understand you don't have a context. I will try to charachterize the my feelings more | 13:59 |
artom | I talk about it here: https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/numa-aware-live-migration.html#resource-claims | 14:00 |
gibi | best would be to find a way how this could break our resource tracking and the move from there | 14:00 |
gibi | I need to jump on a call, I will be back in 30 mins (hopefully) | 14:00 |
artom | OK | 14:01 |
gibi | that section reads for me like we wanted to do the claim regardless of NUMA awareness | 14:02 |
artom | Yeah, I dunno anymore :( | 14:17 |
artom | Maybe let's come at it from a different angle - what led you to notice this, and what are you trying to do? | 14:17 |
dansmith | ah, the old "let me answer a different question" deflection :D | 14:20 |
artom | Hah | 14:21 |
artom | I'd _like_ to answer the original question | 14:22 |
artom | I just don't remember | 14:22 |
dansmith | I recognize the strategy well | 14:24 |
*** svinota is now known as svinota[afk] | 14:45 | |
dansmith | gibi: so, there's a race happening, which maybe/probably affects other stuff.. any chance you have a few minutes to gmeet for me to describe and see if you're aware already? | 14:52 |
dansmith | specifically a provider_tree update race | 14:53 |
sean-k-mooney | i dont think we update the provier tree resereved value in nova for anything else, although i know neutron does for l3 routed networks | 14:55 |
sean-k-mooney | so they might have seen whatever the race is with that | 14:56 |
dansmith | it's not a problem specific to reserved, that's just how I'm seeing it | 14:58 |
dansmith | although perhaps it doesn't get exposed because of the nature of the external update in this case, although that might mean things like neutron/cinder do apply | 14:58 |
sean-k-mooney | are you getting a generation conflcit? | 14:59 |
* sean-k-mooney was not following the disucsion before | 14:59 | |
dansmith | no | 15:00 |
dansmith | sean-k-mooney: there was no discussion of this before, I just discovered this while testing locally | 15:00 |
* gibi is reading back... | 15:00 | |
dansmith | gibi: there's nothing to read back except up to my ask for a gmeet consult :D | 15:01 |
gibi | artom: so this come up when testing the vgpu live migration series | 15:01 |
gibi | until now all our testing was done with non NUMA aware VMs | 15:01 |
gibi | and then Uggla did the first NUMA aware testing and hit by the check in the move claim that was not hit before | 15:02 |
gibi | this check https://review.opendev.org/c/openstack/nova/+/944105/1/nova/compute/claims.py | 15:02 |
gibi | so we started to check why we did not hit this code before in our testing | 15:02 |
gibi | and it turns out because nova does not do a move claim during live migration if the VM is not NUMA aware | 15:03 |
gibi | which sounds very strange to me | 15:03 |
gibi | so I try to figure out if this is an old bug, or it is intentional | 15:04 |
gibi | the code read likes the condition around the move claim wanted to handle rolling upgrade but somehow but actually it does more than that in reality as it disables the move claim for non NUMA aware VMs | 15:05 |
gibi | dansmith: sure we can have a gmeet. I'm free until the top of the hour | 15:07 |
dansmith | meet.google.com/jad-dzbb-nkb | 15:07 |
dansmith | sean-k-mooney if you're free and interested ^ | 15:08 |
opendevreview | Rajesh Tailor proposed openstack/nova master: Add support for showing instance-action finish_time https://review.opendev.org/c/openstack/nova/+/928933 | 15:19 |
Uggla | @gibi @dansmith interested by the chat, but I have noticed it too late. However I'd like to know your findings. | 15:24 |
opendevreview | ribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers https://review.opendev.org/c/openstack/nova/+/944106 | 16:06 |
opendevreview | ribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier https://review.opendev.org/c/openstack/nova/+/944133 | 16:06 |
dansmith | gibi: first quick stab at invalidating the cached RP shows that we immediately refresh it when we go to do the placement sync | 17:10 |
dansmith | so I think that's the ticket out of the problem | 17:10 |
*** bauzas is now known as bauzas__ | 17:12 | |
*** bauzas_ is now known as bauzas | 17:13 | |
*** bauzas__ is now known as bauzas_ | 17:14 | |
bauzas | folks, I'm just moving to thelounger IRC bouncer and client after 12 years of ZNC, but I continue to run ZNC with the bauzas_ nick | 17:14 |
bauzas | just in case you don't get a reply from me, ping both of us :) | 17:14 |
gibi | Uggla: we discovered that there is a provider tree cache in the resouce tracker that causes a race condition if an OTU device is allocated, reserverd, deallocated, externally unreserved, and allocated again. It is a cache invalidation problem between placement and nova, and Dan is now adding a way to invalidate the caches explicitly when the OTU device is reseved by nova | 17:16 |
gibi | Uggla: we also touched on the non NUMA aware live migration move claim question as well. We think the move claim was added to that code path when NUMA aware live migration was added, and before that only the periodic update_available_resources healed the allocation in the resouce tracker | 17:18 |
gibi | so probably that is what happens today for non NUMA aware VMs when live migrated | 17:18 |
gibi | this gives me enough certanty that I can comment on you patch removing the PCI check from the live migration claim. That check happens in the conductor in any case so I think that was just a dupicated check. I will comment on the patch tomorrow | 17:20 |
gibi | cc artom ^^ | 17:20 |
Uggla | @gibi, may I rephrase to ensure I got it: non numa vm will not claimed resources explicitly they are spawned and then a preriodic adjusts the allocations in the RT ? | 17:25 |
artom | > move claim was added to that code path when NUMA aware live migration was added, and before that only the periodic update_available_resources healed the allocation in the resouce tracker | 17:25 |
artom | Correct | 17:25 |
Uggla | so there is a short amount of time where the RT allocations are not the correct one ? | 17:26 |
gibi | yeah that is my current understanding | 17:34 |
gibi | we are probably not hit by this in real life as most of the non NUMA aware VM resources are tracked in placement and the allocation there is correct and independent form the move claim | 17:34 |
gibi | also the PCI claim happens outside of the move claim for live migration therefore that always happen correctly even for non NUMA aware VMs and for non PCI in Placement enabled | 17:35 |
gibi | this is a bit of a mess but a well behaving mess :) | 17:36 |
Uggla | in French we have an expression for this: "C'est tombé en marche". :) | 17:42 |
sean-k-mooney | gibi: i belive the check in the move claim for pci devicve is for cold migration | 17:50 |
sean-k-mooney | oh maybe im wrong | 17:52 |
sean-k-mooney | ok no it was checking but it was inherited form the base claim | 17:53 |
sean-k-mooney | and it was modifed for the live migration case | 17:54 |
sean-k-mooney | Uggla: gibi so ya we can likely drop https://github.com/openstack/nova/blob/master/nova/compute/claims.py#L209-L233 now | 17:55 |
sean-k-mooney | well | 17:55 |
sean-k-mooney | if we rely on the conducotr to validate that all teh request from the flavor are live migrtable | 17:56 |
Uggla | sean-k-mooney, it is what I have done. But we were a bit afraid to not hit that code before. | 17:56 |
sean-k-mooney | you have ad a patch up for that for a long time no? | 17:57 |
sean-k-mooney | i tought i looked at a patch for this like 2 weeks ago | 17:57 |
Uggla | sean-k-mooney, fyi https://review.opendev.org/c/openstack/nova/+/944105 | 17:57 |
Uggla | sean-k-mooney, no created it only today. | 17:58 |
sean-k-mooney | i guess im thinking fo the conducotr change | 17:58 |
sean-k-mooney | this https://review.opendev.org/c/openstack/nova/+/942146/7/nova/conductor/tasks/live_migrate.py#244 | 17:59 |
Uggla | oh yes sorry. | 17:59 |
sean-k-mooney | how were you teting this without that change in the move claims? | 17:59 |
sean-k-mooney | oh | 18:01 |
Uggla | This was tested with a unit tests and the fn tests were ok because all the live migration tests were done without numa VM. | 18:01 |
sean-k-mooney | no but you tested this manually | 18:01 |
sean-k-mooney | with real hardware | 18:01 |
sean-k-mooney | oh.... | 18:02 |
Uggla | not real hardware was fine too, because I use a non numa vm. | 18:02 |
sean-k-mooney | that why gibi was askign about numa | 18:02 |
sean-k-mooney | that move claim code was orgianl ment to work for non numa instance too... | 18:02 |
sean-k-mooney | i guess that was just alway broken then | 18:02 |
Uggla | yep, I discovered that this morning, then "pass" the ball to gibi. | 18:02 |
sean-k-mooney | got it | 18:03 |
sean-k-mooney | for the record i still hate everyting about claims and still want to burn them with fire | 18:03 |
Uggla | because I could not understand why it was working fine without numa and failed with numa. | 18:03 |
sean-k-mooney | you should see this fail in the funtional test if you enabled hugepages or cpu pinning | 18:05 |
sean-k-mooney | we can add test coverage for that | 18:05 |
Uggla | @sean-k-mooney, ha ha, openstack claims game of thrones. | 18:05 |
Uggla | I saw the issue with the new fn tests this morning, because I enabled cpu pinning. | 18:05 |
sean-k-mooney | i strongly objected to extending claims when we added numa live migration which is why i didnt use them when adding sriov live migration in the same cycle | 18:06 |
sean-k-mooney | Uggla: oh ok cool | 18:06 |
Uggla | sean-k-mooney, in fact I leverage "your" test_live_migrate_server_with_neutron to do a mix test with port request and flavor based request. | 18:09 |
bauzas | cores, I need your help for paperwork changes RC1 : https://review.opendev.org/c/openstack/nova/+/943948 | 18:10 |
bauzas | ditto for https://review.opendev.org/c/openstack/nova/+/943952 | 18:10 |
sean-k-mooney | Uggla: shoudl we be keepign this https://review.opendev.org/c/openstack/nova/+/944105/1/nova/tests/unit/compute/test_claims.py#b375 but updating it to show it nolonger raise | 18:10 |
sean-k-mooney | bauzas: im going to finish soon but ill take a look now | 18:11 |
bauzas | also, I'd like cores to review our cycle highlights https://review.opendev.org/c/openstack/releases/+/943801 | 18:11 |
bauzas | and nothing urgent but I'd appreciate https://review.opendev.org/c/openstack/nova-specs/+/944019 | 18:11 |
bauzas | and we need a second core for https://review.opendev.org/c/openstack/nova/+/942871 (libvirt/qemu bump) | 18:13 |
Uggla | bauzas, reminder about 942871: Bump MIN_{LIBVIRT,QEMU} for "Epoxy" | https://review.opendev.org/c/openstack/nova/+/942871 to not forget it. | 18:13 |
bauzas | Uggla: jinxed ! | 18:13 |
sean-k-mooney | bauzas: you havnt updated https://review.opendev.org/c/openstack/releases/+/943801/2/deliverables/epoxy/nova.yaml based on what already there | 18:13 |
bauzas | sean-k-mooney: that's litterally my next action item | 18:14 |
sean-k-mooney | ah | 18:14 |
bauzas | even at 7:15pm | 18:14 |
sean-k-mooney | bauzas: i have reviewd everything except the libvirt bump ill take a look at that now | 18:15 |
bauzas | <3 | 18:15 |
bauzas | I'll write the prelude patch tomorrow morning | 18:16 |
Uggla | bauzas, keep a little bandwith tomorrow morning to review the fup for Vf live migration. | 18:17 |
sean-k-mooney | so the bump patch is split in two right, the first does the bump and then https://review.opendev.org/c/openstack/nova/+/942872/2 is a follow up to drop the supprot for the older versions | 18:18 |
sean-k-mooney | so we can merge the bump and come back to the WIP patch next cycle yes | 18:18 |
Uggla | sean-k-mooney, yes. | 18:19 |
sean-k-mooney | ok i have appoved that as well and rechecked it for the timeouts | 18:19 |
Uggla | sean-k-mooney, I was running out of time for doing the cleanup this cycle. | 18:21 |
sean-k-mooney | thats ok. once we have signaled the new min version and updated the constants | 18:21 |
sean-k-mooney | the agent will check for the new min verison and report it | 18:21 |
sean-k-mooney | so we shoudl be fine | 18:21 |
Uggla | I hope too | 18:22 |
opendevreview | Merged openstack/nova-specs master: Move Epoxy implemented specs https://review.opendev.org/c/openstack/nova-specs/+/944019 | 18:26 |
bauzas | updated the cycle highlights : https://review.opendev.org/c/openstack/releases/+/943801 | 18:47 |
opendevreview | Dan Smith proposed openstack/nova master: Support "one-time-use" PCI devices https://review.opendev.org/c/openstack/nova/+/943816 | 18:49 |
opendevreview | Dan Smith proposed openstack/nova master: Extend invalidate_rp to only invalidate cache https://review.opendev.org/c/openstack/nova/+/944148 | 18:49 |
opendevreview | Dan Smith proposed openstack/nova master: Invalidate PCI-in-placement cached RPs during claim https://review.opendev.org/c/openstack/nova/+/944149 | 18:49 |
-opendevstatus- NOTICE: One of our Zuul job log storage providers is experiencing errors. We have removed that storage target from base jobs. You should be able to safely recheck changes now. | 20:23 | |
opendevreview | ribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation https://review.opendev.org/c/openstack/nova/+/944153 | 20:42 |
opendevreview | Steve Baker proposed openstack/nova master: Add VNC console support for the Ironic driver https://review.opendev.org/c/openstack/nova/+/942528 | 21:02 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!