Wednesday, 2022-11-23

opendevreviewyangzhipeng proposed openstack/nova master: add delete Signed-off-by: yangzhipeng  <yangzhipeng@cmss.chinamobile.com>  https://review.opendev.org/c/openstack/nova/+/86536202:31
opendevreviewyangzhipeng proposed openstack/nova master: Remove all tag if instance has beed hard deleted. Signed-off-by: yangzhipeng  <yangzhipeng@cmss.chinamobile.com>  https://review.opendev.org/c/openstack/nova/+/86536202:33
opendevreviewyangzhipeng proposed openstack/nova master: Remove all tag if instance has beed hard deleted.  https://review.opendev.org/c/openstack/nova/+/86536202:42
*** ministry is now known as __ministry03:31
gmanndansmith: gibi: bauzas: need one more review on the placement RBAC spec, please check https://review.opendev.org/c/openstack/placement/+/86438503:31
*** akekane is now known as abhishekk05:16
gokhaniGood Morning Folks, when I try to reboot my instance, it destroys my instance and I get the error "glanceclient.exc.HTTPNotFound: HTTP 404 Not Found: No image found with ID xxxx..". After the reboot action I didn't find my instance with the "virsh list --all" command. I didn't understand why nova tries to find the glance image and why nova destroyed my instance. Logs are in https://paste.openstack.org/show/by8oCDVLo29Wt612Z1tT/. What causes this 06:59
gokhanisituation?06:59
sean-k-mooneygokhani: when you reboot an instance we undefine the domain and recreate it, however we will not look up the glance image as part of a normal hard or soft reboot09:02
sean-k-mooneyso that implies that the instance was in an inconsistent state before the reboot09:03
sean-k-mooneyFile "/openstack/venvs/nova-22.1.0/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 9930, in _create_images_and_backing09:04
sean-k-mooneyso this should only be invoked if the vm is started due to the resume_state_on_host_boot config option after a host reboot09:07
sean-k-mooneyhttps://github.com/openstack/nova/blob/stable/victoria/nova/virt/libvirt/driver.py#L3368-L337709:07
sean-k-mooneygokhani: based on the traceback the instance disk does not exist anymore 09:08
sean-k-mooneysince it took this branch https://github.com/openstack/nova/blob/3224ceb3fffc57d2375e5163d8ffbbb77529bc38/nova/virt/libvirt/driver.py#L9949-L995109:08
sean-k-mooneyactually no, sorry https://github.com/openstack/nova/blob/3224ceb3fffc57d2375e5163d8ffbbb77529bc38/nova/virt/libvirt/driver.py#L9980-L9986 is the path its taking09:10
sean-k-mooneyso info['backing_file'] is equivalent to true09:11
sean-k-mooneyand to take the else branch that means that this is not the swap or ephemeral disk09:11
sean-k-mooneyanyway as the comment suggests https://github.com/openstack/nova/blob/stable/victoria/nova/virt/libvirt/driver.py#L3368-L337709:16
sean-k-mooneythat code path is there to ensure that any backing files that are shared between the vms are still present on the host09:17
sean-k-mooneyso this implies that the backing file is not present on the host and it has been deleted from glance09:17
sean-k-mooneyin a normal hard reboot triggered by a human that code path should not be taken and we will just redefine the domain09:18
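A minimal sketch of the branch sean-k-mooney is describing (illustrative only, not the actual nova _create_images_and_backing code; disk_infos, image_cache_dir and fetch_from_glance are assumed names):

```python
# Illustrative sketch only -- not the real nova code.  disk_infos,
# image_cache_dir and fetch_from_glance are assumed inputs.
import os


def ensure_backing_files(disk_infos, image_cache_dir, fetch_from_glance):
    """Recreate missing backing files for a domain's disks.

    On a normal hard/soft reboot every backing file already exists,
    so this never goes back to glance.
    """
    for info in disk_infos:
        backing = info.get('backing_file')
        if not backing:
            # Raw disk or qcow without a backing file: nothing to do.
            continue
        if info.get('disk_type') in ('swap', 'ephemeral'):
            # Swap/ephemeral disks are rebuilt locally, not from glance.
            continue
        cached = os.path.join(image_cache_dir, backing)
        if not os.path.exists(cached):
            # Only when the backing file has gone missing does the reboot
            # path contact glance -- and a 404 is raised if the image was
            # deleted there, which is the failure seen in the paste.
            fetch_from_glance(info['image_id'], cached)
```

On a healthy host every backing file is already in place, so a plain hard reboot never reaches the glance call, which is the point being made above.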
kashyapsean-k-mooney: Morning.  Do we document any of this anywhere, at a high-level?09:36
sean-k-mooneythat hard reboot does not redownload the image09:40
sean-k-mooneyi dont think so but thats more of an implementation detail 09:40
sean-k-mooneyit should not be required in a normal workflow since we are simulating rebooting a physical server09:41
sean-k-mooneyand you would not reinstall the os on a physical server09:41
sean-k-mooneythe fact we have code to do it after a host reboot is a little odd09:41
sean-k-mooneybut i would guess it is there because of a bug09:41
kashyapYeah, that's fair.  (I'm not saying impl detail should be documented.  Just maybe somewhere in a debugging guide.  I admit I can't find an appropriate place for it, though)09:42
sean-k-mooneyi.e. to work around a bug where perhaps at some point in the past the image cache got lost or something09:42
kashyap"Image cache"  ... /me runs for the hills09:42
kashyap(Except no hills here in this part of the low lands :P)09:42
sean-k-mooneyya ill admit im kind of surprised we have that code path to try and heal broken/missing backing files09:43
kashyapTIL, too09:43
sean-k-mooneythe fallback to using it from the image cache if its not in glance is interesting...09:44
sean-k-mooneyit makes sense i guess09:44
sean-k-mooneybut this is an edge case of an edge case09:44
kashyapYeah, it definitely does to me.09:44
sean-k-mooneyi.e. after a host reboot the backing file needs to have been deleted and the image needs to have been deleted from glance but still exist in the host image cache for that to work09:45
sean-k-mooneyim sure thats the exact case someone hit and they added this to fix it however09:46
* sean-k-mooney s/however//09:47
* kashyap nods09:48
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Refactor volume connection cleanup out of _post_live_migration  https://review.opendev.org/c/openstack/nova/+/86467009:57
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Move pre-3.44 Cinder post live migration test to test_compute_mgr  https://review.opendev.org/c/openstack/nova/+/86467109:57
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86380609:57
opendevreviewAmit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86405509:57
opendevreviewAmit Uniyal proposed openstack/nova stable/train: func: Add _live_migrate helper to InstanceHelperMixin  https://review.opendev.org/c/openstack/nova/+/86538109:57
opendevreviewAmit Uniyal proposed openstack/nova stable/train: func: Introduce a server_expected_state kwarg to InstanceHelperMixin._live_migrate  https://review.opendev.org/c/openstack/nova/+/86538209:57
auniyalHi sean-k-mooney 10:20
auniyalcan you please review these patches  - https://review.opendev.org/c/openstack/nova/+/86405510:21
opendevreviewAnton Kurbatov proposed openstack/nova master: Fix VMs sorting fail in case of comparison with None  https://review.opendev.org/c/openstack/nova/+/86503710:23
sean-k-mooneyi can look quickly but i dont really have time for nova work this week outside of the vdpa rebases.10:23
sean-k-mooneyoh that's the train backport10:23
auniyalyeah, the last patch is failing before migration10:24
gokhanisean-k-mooney, you are right, the instance is in an inconsistent state and I tried to reboot it. The glance image is already deleted and so the base image is also deleted. I didn't activate the resume_state_on_host_boot option, it is false. 10:42
sean-k-mooneythis code path is only taken if you dont have an authtoken10:43
sean-k-mooneyso you should not be able to get here manually10:43
sean-k-mooneygokhani: how did you trigger the hard reboot10:43
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Refactor volume connection cleanup out of _post_live_migration  https://review.opendev.org/c/openstack/nova/+/86467010:46
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Move pre-3.44 Cinder post live migration test to test_compute_mgr  https://review.opendev.org/c/openstack/nova/+/86467110:46
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86380610:46
opendevreviewAmit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86405510:46
gokhanisean-k-mooney, firstly run "nova reset-state --active 3be526a1-664c-4166-8212-a0b979259ddf" and after that "openstack server reboot --hard 3be526a1-664c-4166-8212-a0b979259ddf"10:47
sean-k-mooneythe second command should not have taken this code path10:47
sean-k-mooneyits guarded by "if context.auth_token is not None:"10:48
sean-k-mooneyif hard_reboot was triggered from the api then there should always be an authtoken10:49
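For illustration, the guard quoted above has roughly this shape (a sketch, not the actual nova hard-reboot code; recreate_backing_files is a placeholder for the repair step being discussed):

```python
# Sketch of the guard quoted above, not the actual nova hard-reboot code.
# recreate_backing_files is a placeholder for the backing-file repair step.
def hard_reboot(context, instance, instance_dir, recreate_backing_files):
    # ...the libvirt domain is undefined and redefined around here...
    if context.auth_token is not None:
        # Only a request context that carries an auth token can reach
        # glance, so the backing-file repair step sits behind this check.
        recreate_backing_files(context, instance, instance_dir)
```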
bauzassean-k-mooney: can you drop your -1 on https://review.opendev.org/c/openstack/nova/+/864418 now that I flipped the patches in the series ?10:49
sean-k-mooneyis that the gpu stuff10:50
sean-k-mooneyyep10:50
bauzasyes10:50
* bauzas needs to take her daughter10:50
sean-k-mooneyill swap it to a review priority +1 but i wont get to it today10:50
sean-k-mooneyill see if i can loop back to it later in the week10:51
gokhanisean-k-mooney, I will try to reproduce this problem by trying to reboot another instance 10:52
sean-k-mooneygokhani:  ack if you can reproduce it with a known good test instance then that would help10:53
sean-k-mooneygokhani: you're using victoria, correct?10:53
sean-k-mooneyi noticed nova 22.something in the trace10:53
gokhanisean-k-mooney, yes our env is victoria10:54
sean-k-mooneyi didnt check the same section of code in master but its possible that this was a bug that was fixed in between10:54
sean-k-mooneyno that function is the same on master10:55
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L393110:55
sean-k-mooneyso if there is a bug it should happen there too.10:55
gokhanisean-k-mooney, I tried to soft reboot an instance and it throws the error again10:58
gokhanihttps://paste.openstack.org/show/b7FSkdeQOxcuWu296tfq/10:58
gokhanirebooting is triggered with horizon 10:59
sean-k-mooneyso the initial error Cannot access backing file '/var/lib/nova/instances/_base/9d36e2c635ce070d95805f64f4b34655f3eae96b' of storage file '/var/lib/nova/instances/4a34a2f4-98a4-4cea-8d80-96d28f12edd5/disk'10:59
sean-k-mooneyindicates that something deleted the disk image from behind nova's back11:00
sean-k-mooneysoft reboot escalates to hard reboot if it fails11:00
opendevreviewAmit Uniyal proposed openstack/nova stable/train: func: Introduce a server_expected_state kwarg to InstanceHelperMixin._live_migrate  https://review.opendev.org/c/openstack/nova/+/86538211:00
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Refactor volume connection cleanup out of _post_live_migration  https://review.opendev.org/c/openstack/nova/+/86467011:00
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Move pre-3.44 Cinder post live migration test to test_compute_mgr  https://review.opendev.org/c/openstack/nova/+/86467111:00
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86380611:00
opendevreviewAmit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86405511:00
sean-k-mooneygokhani: we might not be passing the auth token properly in the context when we do that11:01
sean-k-mooneygokhani: the code path that it is taking would repair the issue that it detected in soft reboot11:01
sean-k-mooneyif the image still existed in glance11:01
sean-k-mooneyif its deleted there is no way to fix the vm without copying the backing file from somewhere else in the cloud that still has it11:02
sean-k-mooneyif /var/lib/nova/instances/_base/9d36e2c635ce070d95805f64f4b34655f3eae96b exists somewhere else on your cloud and you copy it to that location on this host11:02
sean-k-mooneyand ensure it has the correct user permissions as the other images there11:03
sean-k-mooneythen you might be able to recover the vm11:03
sean-k-mooneywith another hard/soft reboot11:03
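A rough operator-side check for the situation described above (assumptions: a stock qcow2 layout under /var/lib/nova/instances and qemu-img installed on the host; this is not a nova API):

```python
# Rough operator-side check, assuming a stock qcow2 layout under
# /var/lib/nova/instances and qemu-img on the host.  Not a nova API.
import json
import os
import subprocess


def check_backing_file(instance_uuid, instances_dir='/var/lib/nova/instances'):
    disk = os.path.join(instances_dir, instance_uuid, 'disk')
    info = json.loads(subprocess.check_output(
        ['qemu-img', 'info', '--output=json', disk]))
    backing = info.get('backing-filename')
    if backing is None:
        return 'no backing file (flat image)'
    if os.path.exists(backing):
        return 'backing file present: ' + backing
    # The situation hit here: copy the identically named _base file from
    # a host that still has it, match the ownership/permissions of the
    # other cached images, then retry the hard/soft reboot.
    return 'backing file MISSING: ' + backing
```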
gokhanisean-k-mooney, I deleted image of this instance in glance. 11:04
sean-k-mooneysure that should be fine11:04
sean-k-mooneybut something also deleted the image backing file on the compute node11:04
sean-k-mooneythat is not ok and should not happen if there is an instance based on that image on the host11:04
sean-k-mooneyyou have configured your deployment to use qcow images for the vm with a backing file11:06
gokhanisean-k-mooney, what can be the reason for deleting the image backing file? Is there any config on nova?11:06
sean-k-mooneythis will only be deleted by nova if there are no vms on this host based on that glance image11:06
sean-k-mooneyyou are not mounting this on an nfs share or something like that?11:07
gokhanisean-k-mooney, yes I have netapp storage and Instance disks are on nfs share11:08
sean-k-mooneynfsv3...?11:08
gokhaninfs411:09
sean-k-mooneynfs v3 has consistency/locking issues and we strongly discourage using it. ideally you would use v4.2+ with nova.11:09
gokhaninfsv411:09
sean-k-mooneyok11:09
sean-k-mooneyso the only thing that comes to mind is if you are not using a separate share or directory per host11:10
sean-k-mooneythen if the /var/lib/nova/instances/_base is shared one of the other nova computes could have deleted it if it did not detect its on a shared file system properly11:11
gokhanisean-k-mooney, I am using same share for all compute nodes11:11
sean-k-mooneyin general we discourage putting the instance directory on nfs by the way.11:11
sean-k-mooneyyou can do that and we know operators do but its not well tested and there are definitely more bugs in that config.11:12
sean-k-mooneycan you check if you have a file for me11:12
sean-k-mooneyone sec while i look for it11:13
gokhanisean-k-mooney, I also have the instance disk under /var/lib/nova/instances/xxxxxxxxx/disk11:13
sean-k-mooneyyes thats where they are stored by default11:14
sean-k-mooneyyou have  /var/lib/nova/instances/compute_nodes11:14
sean-k-mooney*do you have11:14
gokhaniyes I have11:14
sean-k-mooneydoes it have multiple entries11:15
gokhaniI will check it 11:15
sean-k-mooneymine looks like this11:15
sean-k-mooney(nova-compute)[nova@cloud instances]$ cat /var/lib/nova/instances/compute_nodes 11:15
sean-k-mooney{"cloud": 1669200576.0961165}11:15
sean-k-mooneyim not sure if that should have multiple entries for nfs11:16
sean-k-mooneybut that is created by the image cache code and i think its related to deletion on shared filesystems11:16
sean-k-mooneyim just wondering if it exists and if it has multiple entries; the content itself is not really that important11:16
gokhanisean-k-mooney, https://paste.openstack.org/show/bldgeGSIq8BoLjp0u6sx/11:18
sean-k-mooneyack that is what i was expecting to see11:18
sean-k-mooneythat's generated here https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/storage_users.py#L45-L7211:19
sean-k-mooneyso each compute on shared storage adds themselves to that list11:19
sean-k-mooneyhttps://opendev.org/openstack/nova/src/branch/master/nova/compute/manager.py#L10901-L1091611:21
sean-k-mooneywe use that file to get the instances for all hosts on the shared storage11:21
sean-k-mooneyand then we clean the cache based on that11:21
sean-k-mooneyon a normal non-shared deployment like mine that only has one host11:22
sean-k-mooneyin your case it has many11:22
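A hedged sketch of the bookkeeping being described: every compute on the shared storage registers itself in the compute_nodes file, and the cache cleanup only treats a cached base image as removable when no registered host still has an instance built from it. The real logic lives in nova/virt/storage_users.py and the image cache manager; the function names below are illustrative assumptions:

```python
# Hedged sketch of the bookkeeping described above; the real logic lives
# in nova/virt/storage_users.py and the image cache manager, and the
# function names here are illustrative.
import json
import os
import time


def register_storage_user(instances_dir, hostname):
    """Add this compute host to the shared compute_nodes file."""
    path = os.path.join(instances_dir, 'compute_nodes')
    users = {}
    if os.path.exists(path):
        with open(path) as f:
            users = json.load(f)
    users[hostname] = time.time()
    with open(path, 'w') as f:
        json.dump(users, f)
    return sorted(users)


def base_images_in_use(storage_users, list_instances_on_host):
    """Collect the glance image ids still backing an instance somewhere.

    A cached _base file may only be removed once no host registered in
    compute_nodes has an instance built from it -- which is why a
    mismatched conf.host / hostname can lead to premature deletion.
    """
    in_use = set()
    for host in storage_users:
        for inst in list_instances_on_host(host):
            in_use.add(inst['image_ref'])
    return in_use
```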
sean-k-mooneyi see prod-compute3 but not compute03 in that list11:23
gokhanisean-k-mooney, we need to find out how the image backing file was deleted. I have this problem on instances whose images are deleted on glance 11:23
sean-k-mooneythe delete probably happened because of the periodic cleanup11:24
sean-k-mooneybut can you verify if the compute node is compute03 or prod-compute3 in the hypervisor api or compute_node db table11:24
sean-k-mooneywe are using conf.host to populate that file11:25
sean-k-mooneyim wondering if you perhaps changed that at some point?11:25
sean-k-mooneyor changed the hostname11:25
sean-k-mooneyto remove the prod- prefix11:26
sean-k-mooneymy guess is that currently either the [DEFAULT]/host value does not match the value in the compute_nodes file or its unset and your hostname no longer matches because it was changed somehow11:27
sean-k-mooneythat file is updated before every image cache cleanup11:28
gokhanisean-k-mooney, https://paste.openstack.org/show/bVzhWTyWsA3f1Umb5YGM/11:28
sean-k-mooneyso the compute agents think they should be using the prod- prefix11:28
sean-k-mooneyok11:28
sean-k-mooneythat at least aligns with whats in the file11:28
sean-k-mooneyalthough 11:28
sean-k-mooneythats showing the hypervisor hostname11:29
gokhanisean-k-mooney, do I need to also verify in the db? 11:29
sean-k-mooneywhat i actually need is the host value rather than the hypervisor hostname i think11:29
sean-k-mooneythat would be the host value in the compute service entry for example although its also in the compute nodes table11:30
sean-k-mooneyim just trying to verify that they match what is in the file11:30
sean-k-mooneyoh never mind11:31
sean-k-mooneyits fine11:31
sean-k-mooneyhttps://paste.openstack.org/show/b7FSkdeQOxcuWu296tfq/11:31
sean-k-mooneyi had scrolled that over and the start of the lines was cut off11:31
sean-k-mooneyso its prod-compute0311:31
gokhaniyes11:31
sean-k-mooneydo you see this warning in any of the logs https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/libvirt/imagecache.py#L331-L33411:36
gokhanisean-k-mooney, I dont see any warning like that11:41
gokhanithere are periodic checks like that https://paste.openstack.org/show/bGWC1FfgR5R7nXqiCOuQ/11:43
sean-k-mooneyya that's what we expect to see11:48
sean-k-mooneywhen its working correctly11:48
gokhanisean-k-mooney, for rescuing these vms it seems there is only one option: create glance images from /var/lib/nova/instances/xxx/disk11:49
sean-k-mooneyglance does not allow you to create an image with a specific uuid11:50
sean-k-mooneybriefly looking at the code i dont see where the bug is11:50
sean-k-mooneyhttps://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/virt/imagecache.py#L4311:50
sean-k-mooneythe list_running_instances code seems to take into account local and remote instances11:51
sean-k-mooneyi have a meeting now so unfortunately i cant continue to look at this now11:57
sean-k-mooneythe root cause so far is something deleted the backing file11:57
sean-k-mooneysince its deleted in glance and its deleted from the nfs share im not sure there is a way to correct this11:58
gokhanithanks sean-k-mooney for your help :)11:59
*** dasm|off is now known as dasm13:59
opendevreviewJorge San Emeterio proposed openstack/nova-specs master: Review usage of oslo-privsep library on Nova  https://review.opendev.org/c/openstack/nova-specs/+/86543214:12
opendevreviewJorge San Emeterio proposed openstack/nova-specs master: Review usage of oslo-privsep library on Nova  https://review.opendev.org/c/openstack/nova-specs/+/86543214:14
opendevreviewJorge San Emeterio proposed openstack/nova-specs master: Review usage of oslo-privsep library on Nova  https://review.opendev.org/c/openstack/nova-specs/+/86543215:10
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86380616:02
opendevreviewAmit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86405516:02
*** akekane is now known as abhishekk16:09
bauzasman, the gate is super flakey these days16:58
bauzaslots of rechecks on networking and volume detaches :(16:58
gibiprobably the switch to jammy17:11
gibibut this time I have not time to look at what changed with the detach logic again17:11
gibi*I don't have time17:12
sean-k-mooneygibi: im not sure that it really has changed18:28
sean-k-mooneygibi: i think they still have not fixed it18:28
sean-k-mooneygibi: bauzas  by the way if either of ye can approve https://review.opendev.org/c/openstack/nova/+/865031 it will simplify ralonsoh's life and help with the trunk port issue18:30
sean-k-mooneyhttps://review.opendev.org/c/openstack/neutron/+/837780 currently needs a config option because of our min version18:31
sean-k-mooneybut that can be removed if we increase the min version of os-vif18:31
opendevreviewMerged openstack/nova master: Reproducer for bug 1951656  https://review.opendev.org/c/openstack/nova/+/85067319:59
*** dasm is now known as dasm|off23:49

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!