Wednesday, 2024-02-07

01:33 <opendevreview> melanie witt proposed openstack/nova master: WIP Support encrypted backing files for qcow2  https://review.opendev.org/c/openstack/nova/+/907961
01:33 <opendevreview> melanie witt proposed openstack/nova master: libvirt: Introduce support for raw with LUKS  https://review.opendev.org/c/openstack/nova/+/884313
01:33 <opendevreview> melanie witt proposed openstack/nova master: libvirt: Introduce support for rbd with LUKS  https://review.opendev.org/c/openstack/nova/+/889912
02:13 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
02:15 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
02:17 <opendevreview> Takashi Kajinami proposed openstack/os-resource-classes master: Update bug tracker url  https://review.opendev.org/c/openstack/os-resource-classes/+/908219
02:18 <opendevreview> Takashi Kajinami proposed openstack/os-traits master: Update bug tracker url  https://review.opendev.org/c/openstack/os-traits/+/908220
02:24 <opendevreview> Takashi Kajinami proposed openstack/osc-placement master: Update bug tracker url  https://review.opendev.org/c/openstack/osc-placement/+/908221
06:02 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
06:03 <melwitt> sean-k-mooney: I did the respin, it seems like things are working except stable rescue in some cases. I'm debugging that
06:24 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
10:27 <bauzas> okay, so I can officially say that nova-lvm has an issue :-(
11:21 <whoami-rajat> bauzas, hey, do you mean nova-lvm or nova-cinder-lvm? not sure if nova directly uses lvm without cinder
11:27 <frickler> iiuc that's the nova-lvm job, which does NOVA_BACKEND="lvm", so not involving cinder
11:28 <whoami-rajat> okay, i remember the nova team mentioning some issues with the cinder-lvm backend, so good to know it's not that
11:55 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
12:08 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD statless firmware support  https://review.opendev.org/c/openstack/nova-specs/+/908297
12:08 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: Statless firmware support  https://review.opendev.org/c/openstack/nova-specs/+/908297
12:24 <greatgatsby_> Hello.  Is it possible to modify OS-EXT-SVR-ATTR:host via the nova command or OSC if it's incorrect (after compute crash)?  Trying to avoid modifying via the database.
12:36 <bauzas> whoami-rajat: frickler: yeah was at lunch, correct this is the nova-lvm job
13:26 <opendevreview> Sylvain Bauza proposed openstack/nova master: Fix verifying all the alloc requests from a multi-create  https://review.opendev.org/c/openstack/nova/+/846786
13:54 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
14:01 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: Statless firmware support  https://review.opendev.org/c/openstack/nova-specs/+/908297
14:02 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: AMD SEV-ES support  https://review.opendev.org/c/openstack/nova-specs/+/907702
14:03 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: Statless firmware support  https://review.opendev.org/c/openstack/nova-specs/+/908297
14:44 <opendevreview> Takashi Kajinami proposed openstack/nova-specs master: libvirt: Stateless firmware support  https://review.opendev.org/c/openstack/nova-specs/+/908297
14:57 *** d34dh0r5- is now known as d34dh0r53
16:09 <bauzas> dansmith: saw your comment on https://review.opendev.org/c/openstack/nova/+/904209/14/nova/virt/libvirt/driver.py#9838
16:09 <bauzas> dansmith: you're right, we could have problems, that's why I wrote this in the spec:
16:10 <bauzas> "will persist that list of target mediated devices in some internal dictionary field of the LibvirtDriver instance, keyed by the instance UUID." https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/libvirt-mdev-live-migrate.html#proposed-change
16:10 <bauzas> but yeah, in case the operator restarts the compute service, then the dict could be wrong
16:11 <bauzas> or the other way around, if we have a rabbit issue
16:11 <bauzas> so I don't know what to say here
16:11 <bauzas> that's why we discussed this at the PTG, we said we would try to remember those via a dict
16:12 <bauzas> https://etherpad.opendev.org/p/nova-caracal-ptg#L585
16:13 <bauzas> anyway, let's discuss this after our meeting
16:19 * zigo has just finished setting up an arm64 host as a compute (80 Ampere cores)! :)
16:19 <zigo> It felt super fast on the host, not so much in the VMs...
16:23 <dansmith> bauzas: yeah, and L576 says "Dan Smith not very convinced" :)
16:23 <dansmith> so I think this is well in bounds for being concerned now that we're looking at the implementation
16:24 <bauzas> dansmith: let's find an alternative after our meeting :)
16:24 <bauzas> let's try* to find (tbc)
16:24 <dansmith> I'm certainly not trying to cause you trouble, I'm just expressing concern
16:24 <dansmith> ack
17:09 <bauzas> dansmith: I'm swamped in another meeting, but my point is that I wonder if we should persist the internal dict
17:09 <bauzas> in case of a rabbit issue, people just restart their computes
17:10 <bauzas> that said, if we have a live-migration in flight, it would eventually be stopped, right?
17:10 <dansmith> right, but if they do, they leak resources in placement right?
17:10 <bauzas> or not ?
17:10 <bauzas> because of the target allocations ?
17:11 <dansmith> yeah
17:40 <bauzas> dansmith: sorry, was in a meeting
17:41 <bauzas> so, what I can do is provide some logs, for sure
17:42 <bauzas> then, I'll try to test in my hardware environment what would happen if the operator restarts a compute while a live-migration is running
17:43 <bauzas> for the allocations, it would be like any other feature
17:54 <dansmith> bauzas: what if we had the compute service look for and clean up any allocations against its RP for instances that don't exist on it? basically "things that were reserved here, but are gone now"?
17:54 <dansmith> maybe we already do that and I'm missing it?
17:54 <dansmith> but if so, then a restart would *actually* clean up everything because the in-memory bit would be dumped and we'd clean up the stale reservations
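Roughly the reconciliation dansmith is suggesting: on restart, drop allocations held against this compute's resource provider by consumers that no longer exist here. The placement-client helpers below are hypothetical names, not nova's actual startup code:

    # Hypothetical sketch of the startup reconciliation suggested above;
    # the placement client helpers are assumed names, not real nova APIs.
    def cleanup_stale_allocations(placement, rp_uuid, local_instance_uuids):
        # consumers that hold allocations against this compute node's
        # resource provider in placement
        for consumer_uuid in placement.get_consumers_for_provider(rp_uuid):
            if consumer_uuid not in local_instance_uuids:
                # "things that were reserved here, but are gone now"
                placement.delete_allocations_for_consumer(consumer_uuid)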
17:54 <bauzas> dansmith: there are two things
17:55 <bauzas> there are VGPU allocations and there are mdev "reservations"
17:56 <bauzas> the internal dict is just here for making sure we don't pass a mdev to a new instance if that mdev is currently used by a live-migrating instance
17:56 <bauzas> so we just 'reserve' it
17:57 <bauzas> that said, and that's something I can verify, if the target domain is created just when we start the live-migration, then we don't need to 'reserve' the mdev
17:58 <bauzas> maybe the context you're missing is that a compute only knows which mdevs are used by instances by looking at the guest XMLs
17:59 <bauzas> my concern is that if we don't have a domain for a target guest yet while live-migrating, then the compute would not know that the related mdevs are currently used for that one
17:59 <bauzas> hence the internal dict
18:09 <dansmith> bauzas: thanks, I understand the difference between the dict reservation and the allocations
18:09 <dansmith> however, the allocations are not cleaned up by a restart (right?) even though the dict reservations are
18:09 <dansmith> I don't really like the latter requiring a restart, but the former seems very unfortunate to me
18:10 <bauzas> that's the same problem with any live-migration allocation, right?
18:11 <bauzas> dansmith: one question I have, and you might know the answer: do we error out running live-migrations if we restart a target?
18:13 <dansmith> bauzas: It seems to me like this is the first place where we're doing that allocation in the pre-check, no?
18:13 <dansmith> we grab pci stuff, but I don't think we persist it until later, if I'm reading correctly
18:13 <bauzas> I don't know *when* we create a target allocation
18:14 <bauzas> I need to look at the code
18:16 <bauzas> https://github.com/openstack/nova/blob/master/nova/conductor/tasks/live_migrate.py#L558
18:16 <dansmith> okay I just found that
18:16 <dansmith> so the conductor nukes any allocations that we have against the proposed node if we abort, yeah?
18:17 <bauzas> after pre-livemigrate, sure
18:17 <bauzas> but then I wonder if we delete the allocations when we monitor the running live-migration
18:17 <bauzas> looking at the compute now
18:17 <dansmith> yeah, that's only if we decide the destination isn't a good fit, but doesn't address the failure later
18:20 <bauzas> https://github.com/openstack/nova/blob/master/nova/conductor/tasks/live_migrate.py#L558
18:20 <bauzas> here we error a migration if we have any exception
18:22 <dansmith> right, but after we start the migration, nothing will clean up the destination node's allocations right?
18:22 <bauzas> https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9692
18:22 <bauzas> nah, found it
18:22 <bauzas> we delete the allocation when calling _rollback_live_migration()
18:23 <bauzas> and then we call the drivers' methods for cleaning up the residues... which then unreserve the mdev from the dict :)
18:23 <bauzas> so we're good
18:24 <bauzas> but I agree on adding more logs
18:24 <dansmith> the dict is on the destination, right?
18:24 <dansmith> so we don't clean that up
18:24 <bauzas> I didn't do that in the past when allocating mdevs, and it was needed
18:24 <bauzas> dansmith: no, we call both the source and dest virt drivers
18:24 <bauzas> source is here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9708-L9709
18:25 <dansmith> bauzas: not if rabbit is down
18:25 <bauzas> and dest is there https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L9736-L9738
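Condensed shape of the rollback path those links point at, simplified and paraphrased from manager.py; the method names are approximations, not a verbatim copy:

    # Simplified, paraphrased shape of the rollback path linked above;
    # names are approximations of the real manager.py code.
    def _rollback_live_migration(self, context, instance, dest, migration,
                                 migrate_data):
        # drop the destination-node allocations created by conductor
        self._revert_allocation(context, instance, migration)
        # source-side driver cleanup runs locally (the L9708-L9709 call)
        self.driver.rollback_live_migration_at_source(
            context, instance, migrate_data)
        # dest-side cleanup (L9736-L9738) goes over RPC, so it is the step
        # that is lost if rabbit is down: the dest driver never gets to
        # unreserve the mdev from its in-memory dict
        self.compute_rpcapi.rollback_live_migration_at_destination(
            context, instance, host=dest, destroy_disks=True,
            migrate_data=migrate_data)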
18:25 <bauzas> dansmith: hah, I get your point now
18:26 <dansmith> just saying it seems like we need more post-disaster recovery bits, be it on compute restart or some periodic
18:26 <bauzas> there could be a situation with rabbit down that would hit the live-migration but we wouldn't unreserve the mdev
18:26 <bauzas> gotcha
18:26 <dansmith> but, given that we do clean these allocations if rabbit is not down and we fail, then fair enough,
18:26 <dansmith> but I think the logging of the internal private data that won't be reset until restart is important, at least so it's discoverable why something is reserved but there's no record of it
18:27 <bauzas> yeah good point
18:28 <bauzas> my kids are starving, so I need to leave or they will run around in the street like zombies
18:28 <bauzas> but I'll try to make it a bit more resilient
18:28 <bauzas> at least the first thing is to log
18:28 <bauzas> then I wonder whether we need some periodic for the cleaning case
18:29 <bauzas> (but I just wonder which conditional to write in that periodic :) )
18:29 <bauzas> that should be "give me all the migrations that were errored and make sure we don't have a reserved mdev for those"
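That conditional, roughly, as a periodic-task sketch; unreserve_mdevs() is the hypothetical helper from the earlier sketch and the filter values are illustrative, not actual nova code:

    # Hypothetical periodic-task sketch of the conditional described above.
    from oslo_service import periodic_task

    from nova import objects


    class ComputeManager:
        @periodic_task.periodic_task(spacing=300)
        def _cleanup_stale_mdev_reservations(self, context):
            # "give me all the migrations that were errored..."
            errored = objects.MigrationList.get_by_filters(
                context, {'dest_compute': self.host, 'status': 'error'})
            for migration in errored:
                # "...and make sure we don't have a reserved mdev for those"
                self.driver.unreserve_mdevs(migration.instance_uuid)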
18:40 <dansmith> something like that
18:40 <sean-k-mooney> melwitt: i think i need to restack but i got https://paste.opendev.org/show/bRUFWOFqOa8kSeXyH12I/ when i tried to delete a vm
18:42 <melwitt> sean-k-mooney: oh, ok that's because I added a db migration (that I forgot to mention) to add a column (for backing file secret uuid). so you'd have to nova-manage db sync. it's complaining about the missing column
18:43 <sean-k-mooney> oh ok
18:43 <sean-k-mooney> i just did a git checkout
18:43 <sean-k-mooney> and restarted things, so yeah i can do a db sync
18:43 <melwitt> yeah, I did the same thing. just forgot about the db sync, sorry 😑
18:43 <sean-k-mooney> i was really confused because the way this fails in the openstack client is really terrible
18:44 <sean-k-mooney> (venv) ubuntu@disk-encrypt-1:~/repos/devstack$ openstack server delete "test-vm"
18:44 <sean-k-mooney> Resource.get() takes 1 positional argument but 2 were given
18:44 <sean-k-mooney> that is the error you get
18:44 <melwitt> yeah, I remember thinking that too
18:44 <sean-k-mooney> and if you use debug mode it does not help
18:45 <sean-k-mooney> ya so i went back to the nova client and it told me there was a 500, and then it was clear that i was missing something
18:45 <melwitt> guess that's another thing for the osc backlog: figure out what's with that error message and how to improve it
18:46 <sean-k-mooney> this is its internal traceback https://paste.opendev.org/show/bJQt6QO5X1pdKbL7dnm7/
18:46 <melwitt> oh, and another thing that tripped me up is the db sync doesn't fan out to cells (!) and you have to use --local_cell to make it sync the nova_cell1 db
18:47 <sean-k-mooney> but the issue is it's trying to use the resource before it's verified the return code, i think
18:47 <sean-k-mooney> good to know, cause i totally did not add --local_cell
18:47 <melwitt> hm
18:48 <sean-k-mooney> it's still not working for me but it did do the upgrade
18:48 <sean-k-mooney> do i need to also restart?
18:48 <sean-k-mooney> Running upgrade 13863f4e1612 -> 2a7173b820a6, Add backing_encryption_secret_uuid to block_device_mapping
18:48 <melwitt> it took me a bit to figure out wtf was wrong too. been that long since I db sync'ed on the fly I guess 😶
18:49 <melwitt> I don't think you should need to restart
18:50 <melwitt> and make sure you had 'db sync' without --local_cell too bc that's what syncs the nova_cell0 db 😛
18:51 <melwitt> another thing for the backlog ... do a fanout. I'm not sure if it was deliberate not to, or if it's just that no one has gotten around to it yet
18:51 <sean-k-mooney> nova-manage --config-dir /etc/nova --config-file /etc/nova/nova_cell1.conf  db sync
18:51 <sean-k-mooney> that worked
18:52 <sean-k-mooney> oh you know what, this instance was in error so it was buried in cell0
18:52 <melwitt> hm ok. I thought I tried that but it didn't work without --local_cell
18:52 <melwitt> *didn't work to sync nova_cell1
18:53 <sean-k-mooney> oh actually no, it got past the scheduler so it was in cell1
18:53 <sean-k-mooney> i say "was" because it's deleted now, so i guess it does not matter
18:53 <sean-k-mooney> also i just booted a vm with your latest code so that's cool too
18:54 <sean-k-mooney> ok it's on disk-encrypt-2 so in theory i could live migrate it to disk-encrypt-1
18:54 <melwitt> that should work, but since I said that it probably won't
18:56 <sean-k-mooney> it did not, but i haven't tried live migration without encrypted volumes so i don't know if i missed a step or not
18:57 <sean-k-mooney> ah ok
18:57 <sean-k-mooney> libvirt.libvirtError: Cannot recv data: ssh: Could not resolve hostname disk-encrypt-1: Name or service not known: Connection reset by peer
18:57 <sean-k-mooney> i just need to update /etc/hosts
18:58 <melwitt> 😅
19:01 <sean-k-mooney> Live Migration failure: Cannot recv data: Host key verification failed.: Connection reset by peer: libvirt.libvirtError: Cannot recv data: Host key verification failed.: Connection reset by peer. ok, i need to check what user it's using
19:01 <sean-k-mooney> i set up ubuntu but maybe it's running as root
19:10 <sean-k-mooney> still complaining about host key verification
19:10 <sean-k-mooney> i think ill just turn that off
19:22 <sean-k-mooney> melwitt: finally, i had to swap to using root for the ssh connection for some reason
19:22 <sean-k-mooney> but in any case, yes, live migration in one direction worked
19:23 <sean-k-mooney> im going to check if the instance dirs etc. actually got cleaned up, but looking good so far
19:24 <sean-k-mooney> im so out of practice doing this with devstack by hand
19:32 <melwitt> sean-k-mooney: yeah, same (I had been)
19:38 <melwitt> sean-k-mooney: oh, with the host key verification, in my case what I had to do was both ssh to and from each host as root ("ssh root@blah"), but also ssh as the other user, maybe the stack user? while passing the root user like "ssh root@blah"
19:38 <melwitt> it would not stop complaining until I did both of those
20:07 <sean-k-mooney> so if i use live_migration_uri = qemu+ssh://root@%s/system
20:07 <sean-k-mooney> and ssh as root to each host
20:08 <sean-k-mooney> then it works fine
20:08 <sean-k-mooney> but i was trying to use live_migration_uri = qemu+ssh://ubuntu@%s/system
20:08 <opendevreview> Ghanshyam proposed openstack/nova master: Remove HyperV: cleanup doc/code ref  https://review.opendev.org/c/openstack/nova/+/906629
20:08 <sean-k-mooney> there is a way to use a non-root user to do that but it's not worth the hassle
20:10 <opendevreview> Ghanshyam proposed openstack/nova master: HyperV: Remove HyperVLiveMigrateData object  https://review.opendev.org/c/openstack/nova/+/906636
20:16 <melwitt> ah I see
20:19 <opendevreview> Ghanshyam proposed openstack/nova master: HyperV: Remove RDP console connection information API  https://review.opendev.org/c/openstack/nova/+/906991
20:20 <sean-k-mooney> it's still unhappy for cold migration so i need to add the root user's ssh key to the ubuntu user's authorized keys
20:20 <sean-k-mooney> i normally just use one key and put it in both ubuntu and root, with the same public/private key on all hosts
20:20 <opendevreview> Ghanshyam proposed openstack/nova master: HyperV: Remove RDP console connection information API  https://review.opendev.org/c/openstack/nova/+/906991
20:21 <sean-k-mooney> this time i was trying to be fancy and use separate keys and minimal privileges
20:21 <opendevreview> Ghanshyam proposed openstack/nova master: HyperV: Remove RDP console API  https://review.opendev.org/c/openstack/nova/+/906809
20:22 <melwitt> 🙂
20:22 <opendevreview> Ghanshyam proposed openstack/nova master: HyperV: Remove extra specs of HyperV driver  https://review.opendev.org/c/openstack/nova/+/906992
20:28 <sean-k-mooney> melwitt: ill try this again tomorrow when i can make sure i set this up correctly end to end
20:29 <sean-k-mooney> so far i have got good coverage of what happens if your ssh keys don't work and we roll back the cold migrate early :)
20:29 <melwitt> haha nice

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!