Wednesday, 2024-01-31

tonybI have access to a cloud that, due to a rabbitmq failure, is in an inconsistent state. `openstack server list` returns a bunch of instances, some [most] return `No server with a name or ID of '$UUID' exists.` when trying to get more information/delete the nodes.00:55
tonybAny advice on how to reconcile the list with the instances that are still alive?00:55
*** jph7 is now known as jph01:40
tonybPicking one instance as an example, I see it in nova_api.instance_mappings, and looking in nova_cell0.instances it shows as deleted.06:03
* tonyb reads the cells docs06:25
*** zigo_ is now known as zigo09:43
sean-k-mooney[m]@tonyb so cell0 is only used for instances that fail to boot on any host but pass the initial API checks for whether the request is valid11:30
sean-k-mooney[m]cell0 should never have any compute nodes in it11:31
tonybI did more digging and found a number (but not all) of the instances showing up in 'server list' have data in the build_requests table.11:38
tonybI *suspect* I can delete them from the build_requests table and they'll "go away"11:40
tonybI also expect that will take care of most of the problems and then I can work out what the next stage is11:42
sean-k-mooney[m]that could happen in older OpenStack releases, but I believe we fixed that at some point11:51
sean-k-mooney[m]has anyone worked with the SovereignCloudStack folks on their standards for OpenStack clouds?11:52
sean-k-mooney[m]https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0100-v3-flavor-naming.md11:52
sean-k-mooney[m]that is interesting and I agree with most of it, but it also implies some functionality that nova does not support natively11:53
sean-k-mooney[m]at least not without a security risk11:53
sean-k-mooney[m]their disk section https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0100-v3-flavor-naming.md#baseline-211:54
sean-k-mooney[m]they mention that if the flavor name does not include a disk capacity, that means the cloud will provide at least image.min_disk space11:54
sean-k-mooney[m]that either needs the operator to configure root_gb=0 and allow non-BFV instances with a 0 root disk11:55
sean-k-mooney[m]that was considered a CVE in the past11:55
sean-k-mooney[m]or they need to carry downstream changes to support that11:55
sean-k-mooney[m]I feel it's inappropriate to propose a standard behavior that is not supported upstream or that depends on a security vulnerability11:56
sean-k-mooney[m]they also have11:57
sean-k-mooney[m]Multi-provisioned Disk11:57
sean-k-mooney[m]The disk size can be prefixed with Mx prefix, where M is an integer specifying that the disk is provisioned M times. Multiple disks provided this way should be independent storage media, so users can expect some level of parallelism and independence.11:57
sean-k-mooney[m]nova only supports multiple disks for the same instance if you use both root_gb and ephemeral_gb11:57
sean-k-mooney[m]and nova does not support putting them on different storage media11:57
sean-k-mooney[m]so again that would require downstream-only modifications to nova11:57
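[Editor's note: to make the constraint above concrete, the SCS disk field quoted earlier can carry an "Mx" count prefix (e.g. "3x10" for three independent 10 GB disks), while a nova flavor can express at most a root disk (root_gb) plus one ephemeral disk (ephemeral_gb). The sketch below only parses the count prefix as described in the quoted text; the full naming grammar is in the linked standard, and this is purely an illustration, not code from either project.]

    import re

    # Simplified parse of the "Mx<size>" disk field described in the quoted
    # SCS text; illustrates why more than two independent disks cannot be
    # expressed by a single nova flavor (root_gb + ephemeral_gb only).
    def parse_scs_disk(field: str) -> tuple[int, int]:
        m = re.fullmatch(r"(?:(\d+)x)?(\d+)", field)
        if not m:
            raise ValueError(f"unrecognised disk field: {field}")
        count = int(m.group(1) or 1)
        size_gb = int(m.group(2))
        return count, size_gb

    print(parse_scs_disk("3x10"))  # (3, 10) -- three disks, not representable in one flavor
    print(parse_scs_disk("20"))    # (1, 20) -- maps onto root_gb=20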
tonybsean-k-mooney[m]: This is victoria11:57
sean-k-mooney[m]@tonyb ya, I don't think it was fixed then11:58
sean-k-mooney[m]there was an edge case where, if an error happened during delete, sometimes the build request didn't get cleaned up properly11:58
tonybOkay, rabbitmq died on this cloud (I don't know why) so I can see events going missing.11:59
sean-k-mooney[m]did DB access fail?12:00
sean-k-mooney[m]I don't think it happened because of rabbit, but perhaps12:00
tonyb"safe" is the wrong word, but do you think it'll be "safe" to just delete those entries from build_requests?12:00
tonybI don't think the DB access failed12:01
sean-k-mooney[m]if the instances are already in cell1 or marked as deleted then ya, it should be “safe”12:01
tonybsean-k-mooney[m]: Thanks, I'll try a couple tomorrow and see what happens12:03
tonybIt'd be nice to clean this up as it's one of our nodepool providers12:03
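[Editor's note: the reconciliation tonyb describes boils down to: for each instance that `openstack server list` still shows but that cannot be looked up, check whether it has a row in nova_api.build_requests and whether the instance is already deleted (or absent) in its cell database, and only then remove the build request. Below is a minimal, read-only sketch of that check, assuming direct SQLAlchemy access with placeholder connection URLs and that the stale instances live in nova_cell0; instances mapped to a real cell would need the same check against that cell's database instead.]

    from sqlalchemy import create_engine, text

    # Placeholder URLs -- point these at the real nova_api and cell databases.
    api_db = create_engine("mysql+pymysql://nova:secret@dbhost/nova_api")
    cell_db = create_engine("mysql+pymysql://nova:secret@dbhost/nova_cell0")

    with api_db.connect() as api, cell_db.connect() as cell:
        rows = api.execute(text("SELECT instance_uuid FROM build_requests")).fetchall()
        for (uuid,) in rows:
            inst = cell.execute(
                text("SELECT deleted FROM instances WHERE uuid = :u"),
                {"u": uuid}).fetchone()
            if inst is None or inst.deleted:
                # Candidate only: verify by hand before deleting, e.g.
                #   DELETE FROM build_requests WHERE instance_uuid = '<uuid>';
                print("stale build request:", uuid)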
auniyalHi gibi, can you please have a look at this some time https://review.opendev.org/q/topic:%22bp/2048184%2212:15
auniyalthanks sean-k-mooney[m]++12:34
sean-k-mooneyno worries. I'm mostly done with upstream code review for today, but I also reviewed your other patch this morning12:35
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/85733912:35
auniyalyeah, I'll update this as well12:36
auniyalif you still have some time, can you please see this as well - https://review.opendev.org/q/topic:%22refactor-vf-profile-update%22 12:38
auniyalit will help this - https://review.opendev.org/c/openstack/nova/+/899253/412:39
fricklersean-k-mooney: I'll forward your comments regarding SCS, seems no one is around in this channel12:43
sean-k-mooneyfrickler: cool. I don't dislike the idea of having standards around this12:46
sean-k-mooneybut I would hope those standards would start from a point of what is secure and supported in vanilla OpenStack12:46
sean-k-mooneyfrickler: the multi-disk capability could work if you're using ironic12:46
sean-k-mooneyi.e. you could have separate storage media for storage provided by the flavor in that case12:47
sean-k-mooneybut I don't think it's possible for any other virt driver, and it definitely is not for libvirt12:47
Ugglagibi, bauzas question for unmounting a share. There is no polling in that case but as we are calling manila, I guess it should be asynchronous too. Right ?13:26
bauzaswhen would you unmount the share ?13:27
Ugglabauzas, it depends. :)13:27
bauzason the snow level?13:27
bauzas:)13:27
Ugglabauzas, if it is not used by another VM yes.13:28
bauzasso when deleting the instance ?13:28
Ugglabauzas, hum not sure I get your point. But if you delete a VM with a share, it goes to the same code and unmounts the share if it is not used elsewhere.13:30
gibiUggla: yeah, as we are calling other APIs I would make the detach async as well. But as you pointed out that will be faster as no polling there.13:40
bauzasso yeah, it should be an async call13:40
bauzasby design, we never ever stop a deletion13:41
bauzaswe first accept the deletion request and only after we try to delete the resource13:41
gibibauzas: having the unmount at detach means VM deletion does not need to talk to manila13:42
gibiscratch that13:42
gibiif we delete a VM with an attached share then the unmount needs to happen during VM deletion13:43
gibibut VM delete is already async so I don't see an issue there13:43
Ugglagibi, ++13:43
bauzasgibi: you're right on the fact that the delete is already async13:47
bauzaswhen thinking about that, I also wonder why we need to unmount the share when deleting the resource in the compute13:47
Ugglagibi, I would say the polling is more to ensure manila has applied the policy before mounting than really waiting for manila. I set a timeout here mainly to avoid an infinite loop if manila is not working.13:47
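[Editor's note: a minimal sketch of the bounded polling Uggla describes — wait for manila to report the access rule as applied before mounting, but give up after a timeout rather than looping forever. The helper name `get_access_status` is a hypothetical placeholder; the real series defines its own client calls.]

    import time

    def wait_for_access_applied(get_access_status, timeout=300, interval=5):
        """Poll until the access rule is applied, erroring out on timeout."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = get_access_status()
            if status == "active":
                return
            if status == "error":
                raise RuntimeError("manila reported an error applying the rule")
            time.sleep(interval)
        raise TimeoutError("manila did not apply the access rule in time")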
bauzasit could be a periodic13:48
bauzas(if we have correct statuses)13:48
Ugglabauzas, yep, but it is better to clean the room yourself rather than have your mom (the periodic) do it. :)13:49
gibiI see periodics as fallbacks for when something bad happens during normal operations.13:49
bauzasmy only concern is the polling 13:50
bauzasif we don't need to call manila, sure13:50
gibiwe need to call manila to unlock the share13:50
bauzashence my concern13:50
gibibut during delete we don't need to poll manila13:50
gibias unlock is sync13:50
gibiI think13:50
bauzasthere are two stages13:51
bauzas1/ user asks to delete the instance13:51
bauzas2/ compute eventually deletes it13:51
bauzassurely, for 1/, it's async13:51
bauzasthe compute could be not running13:51
bauzasbut for 2/13:51
bauzasonce the compute gets the fact that it needs to delete a guest, then it runs 13:52
bauzasand that's my concern13:52
bauzasI don't want this compute run to be waiting on manila to unlock the share13:52
gibibut we do the same with port unbinding and probably for unreserving the volume in glance13:53
gibiin cinder13:53
bauzashmm ok13:54
gibialso we tend to say that nova compute works, sort of, with all periodics disabled13:54
bauzaswdym ?13:56
sean-k-mooneyim in a meeting right now but in general we should be able to run nova without periodics enabled, modulo bugs13:57
gibi:)13:57
sean-k-mooneyso falling back to a periodic, sure13:57
gibiyeah I heard this from Sean13:57
gibibauzas: we do the volume detach in cinder API during terminate instance here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L328413:57
sean-k-mooneythe only "features" that need periodics to function are things like local_delete13:58
sean-k-mooneywhere we will process any deletes that were done while the agent was stopped13:58
sean-k-mooneyand all of those run in init_host13:58
bauzasif people are okay with waiting for manila by polling, fine by me then13:59
bauzasit will be simpler for Uggla :)14:00
sean-k-mooneyI don't think it's safe to consider the instance deleted without at least unmounting the share from the host14:00
sean-k-mooneythe unlock, I think, should be done too14:00
sean-k-mooneybut I could see not failing the delete if that does not happen14:00
sean-k-mooneyand trying again via a periodic14:00
bauzasthat's my concern14:00
sean-k-mooneybut doing it only in a periodic would not be ok in my view14:00
bauzasI don't want the delete to raise an exception if we are not able to call Manila14:00
gibiwhat happens today when a volume cannot be detached during VM delete?14:01
bauzaswe should rather delete the guest and eventually try to unmount14:01
sean-k-mooneyno the unmount on the host has to happen as part of the delete14:01
sean-k-mooneyit's a security issue if it does not14:01
bauzaswhy ?14:01
sean-k-mooneythe unlock of the share could be deferred14:01
sean-k-mooneybauzas: because you have a tenant's data volume present on the host14:02
bauzasif we don't delete the share in nova, it's just a file14:02
bauzasthat none of the other instances could use14:02
sean-k-mooneyit's not just a file, it's an NFS/CephFS share that is mounted on the host14:02
sean-k-mooneythat anyone with host access could read with the correct permissions14:03
bauzasanyone with host access would have more possibilities than using a share14:03
sean-k-mooneyi.e. if you break out of a VM and get host access, you can read any share mounted on the host14:03
sean-k-mooneysure, to gibi's point we should mirror what we do for cinder volumes14:04
bauzasI think we said that it's not a security issue if you need to ssh to the host14:04
bauzasthe problem we had with cinder is that we were able to get *that* bdm for other instances14:04
sean-k-mooneythis is very close, but not the same as the cinder issue with deleting the attachment, in my view14:04
bauzaswe discussed that a bit this morning during our meeting14:05
sean-k-mooneybauzas: not bdm, host block device14:05
bauzascorrect14:05
bauzasanyway, I'm doing other things by now, I can't discuss it more now14:05
sean-k-mooneyya so I agree you won't have sharing, because we will be mounting the share under the instance dir, right?14:05
bauzasUggla: are you OK ?14:05
bauzasno, we don't mount the share on the instance dir14:06
bauzasbut we have share mappings telling which share to use with what instance14:06
sean-k-mooneyit should be either configurable or under the instance dir14:06
bauzasthere is a spec14:06
bauzasand within this spec, this is explained14:06
Ugglabauzas, yep for me it is ok. I have my answers14:07
sean-k-mooneybauzas: I'm pretty sure the spec did not say it's okay to defer the unmount to a periodic14:07
bauzasindeed, I was just talking about the share path14:08
bauzasanyway, /me going back to work given Uggla got his answers14:08
sean-k-mooneyUggla: so you're going to do the unmount when deleting the instance, as part of delete, with a fallback to a periodic14:09
sean-k-mooneyand not have manila block the delete if unlocking fails?14:10
Ugglasean-k-mooney, yep, but atm I have no periodic.14:10
sean-k-mooneyright, so without that we would have to make the failure to unlock put the instance in ERROR14:10
sean-k-mooneygibi: how do you feel about that ^14:10
sean-k-mooneyI do not have the full context so I'm mostly relying on you and bauzas to ensure the correct behavior. If we just ignored the failure to unlock the manila share with no recovery method, I would think that deserves a -214:12
gibiThe cheaper way is to put the instance in ERROR during delete. I suggest checking what we do with cinder volumes during delete if cinder is unavailable14:14
sean-k-mooneywe need to be careful of two things: 1) we are not locking the share indefinitely if we delete the instance and the manila call fails, which has billing implications; 2) we are not exporting the share to a host it should not be on, which has security implications and possibly regulatory compliance issues.14:15
gibiI agree about these goals ^^14:15
sean-k-mooneyI would have been ok with putting the instance in ERROR14:16
sean-k-mooneyso you can just try again14:16
sean-k-mooneywhen manila is fixed14:16
sean-k-mooneyand I suspect that is what would happen for a BFV guest with delete_on_termination, but I'm not sure14:17
sean-k-mooneyit looks like in terminate we do self._cleanup_volumes(context, instance, bdms, raise_exc=False, detach=False)14:18
sean-k-mooneythat implies we allow the delete to proceed14:18
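[Editor's note: sean-k-mooney is quoting the existing cinder behaviour in terminate (_cleanup_volumes called with raise_exc=False, detach=False). Below is a hedged sketch of the analogous shape being debated for shares — unmount on the host as a mandatory part of delete, keep the manila unlock best-effort, and record failures for a later retry such as a periodic. Function names here are illustrative, not nova's actual API.]

    import logging

    LOG = logging.getLogger(__name__)

    def cleanup_share_on_delete(instance, share, host_unmount, manila_unlock,
                                record_for_retry):
        # Host-side unmount is treated as mandatory: leaving the share mounted
        # leaves tenant data exposed on the host (the security concern above).
        host_unmount(instance, share)
        try:
            # Unlocking in manila is best-effort: a manila outage should not
            # block the delete, but a leaked lock has billing implications,
            # so it is queued for a retry instead of being silently dropped.
            manila_unlock(share)
        except Exception:
            LOG.exception("Failed to unlock share %s for instance %s; "
                          "scheduling retry", share, instance)
            record_for_retry(instance, share)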
bauzasI also agree with the two bullet points (just lurking)14:19
* bauzas is on and off, juggling both a devstack 2-node install and some series rebases14:19
bauzastbh, I don't really like us copying the cinder workflow14:20
sean-k-mooneyi am about to enter 3 hours of meetings in 3.5 hours14:20
bauzasrecent issues led me to think there are some corner cases I'd like to avoid :)14:20
sean-k-mooneyso I'll also be "stepping away shortly"14:20
bauzasbut agreed on the14:21
sean-k-mooneyright, I think gibi and I would prefer to be stricter and fail the delete14:21
bauzasDON'Ts and the DOs14:21
sean-k-mooneybut let's loop back to this after Uggla has updated things14:21
sean-k-mooneyand we have more time14:21
bauzasI think we said we should never try to fail on an instance delete14:22
bauzasas there are also some capacity and billing implications14:22
sean-k-mooneywe can fail on instance delete for other reasons, I'm pretty sure14:22
sean-k-mooneybut you are correct on the capacity and billing implications if we do, too14:22
gibiit is a trade-off: if we let the VM delete happen but leak resources in the meantime, you still get billing issues :D14:22
bauzasas an operator, I'd be more angry if my instance continued to exist because of a cheap open file that wasn't deleted14:23
bauzasbut I could be wrong14:23
gibibauzas: e.g. we delete your VM happily but fail to unlock the share in manila. So you are billed for a share you cannot use :)14:23
bauzashence the periodic :)14:24
gibiwhich can be turned off :14:24
gibi:)14:24
bauzasI'm not saying we should leave the pipe leaking14:24
bauzaswhat I'm saying is that there are bigger things to do than fixing the pipe14:24
sean-k-mooneybauzas: would that same operator be ok if customer data was leaked via persistent memory, or if vGPUs or PCI devices were never made available again?14:24
gibiat least if my VM is in ERROR and I cannot delete it I will notice and raise a ticket14:24
sean-k-mooneyso there is a way to fix this14:25
sean-k-mooneywe could ask manila to allow the share deletion if the nova instance no longer exists14:25
sean-k-mooneythat could be a follow-up enhancement on their side14:25
sean-k-mooneyso if it's locked by nova but nova says the instance is gone, allow the user to delete it even with the lock14:26
bauzasgibi: sean-k-mooney: standing as a user for a sec,14:26
bauzasif I delete my instance and it goes into ERROR14:26
bauzasthe second after, I'm gonna try to delete again14:26
sean-k-mooneyyep which might work14:27
bauzasso I could be okay14:27
gibi:)14:27
bauzasprovided we ensure that the instance deletion path is idempotent14:27
sean-k-mooneyyes but we already need to do that14:27
sean-k-mooneyI believe one path where instance delete ends with an instance in ERROR14:28
sean-k-mooneyis if the connection to libvirt is down14:28
sean-k-mooneyor the ironic/hyperv/vmware API is not accessible14:29
sean-k-mooneyassuming the compute agent is fine, it will have to put it in ERROR when it can't talk to the hypervisor14:29
sean-k-mooneywell, that or leave the instance in ACTIVE14:30
sean-k-mooneyand put the instance action in error14:30
sean-k-mooneyok meetings starting14:30
*** d34dh0r5| is now known as d34dh0r5315:01
bauzasevery time I try to bring up a devstack on a RHEL machine, I get funny stories15:11
bauzasFailed to discover available identity versions when contacting http://10.73.116.68/identity. Attempting to parse version from URL.15:24
bauzasCould not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Service Unavailable (HTTP 503)15:24
bauzasbut when looking at keystone logs in the journal, lgtm15:25
bauzaswtf...15:25
bauzasI did set GLOBAL_VENV to False fwiw15:27
sean-k-mooneywell, it's technically not a supported operating system15:29
sean-k-mooneyi can likely take a look in a while if you are still having issues15:30
bauzasSELinux is preventing /usr/sbin/httpd from write access on the sock_file keystone-wsgi-public.socket15:35
bauzasdoh15:35
* bauzas goes disabling selinux for a sec :D15:36
sean-k-mooneydevstack disables it on CentOS and Fedora15:39
sean-k-mooneyso it probably is missing a check for RHEL15:39
sean-k-mooneythat is probably a simple fix in devstack, but you can also do it locally15:39
bauzasI think I need to do some kind of install notes for kashyap and others wanting to test vGPUs with devstack15:40
bauzasthat's always good to reinstall your devstack 15:40
bauzasand I think I got the same problem 6 months ago but I forgot15:40
bauzasso I'm literally hitting the same bumps without learning15:41
*** d34dh0r53 is now known as d34dh0r5|16:01
*** d34dh0r5| is now known as d34dh0r5316:01
melwittsean-k-mooney: thanks for the reviews! I will be going through and addressing the comments. and I agree re: functional tests, I have been working on them and other issues16:23
sean-k-mooneyack, I was talking to stephen and he made a point that we could do the functional tests at the end of the series16:24
sean-k-mooneyI was going to chat with you later and see how you felt about doing them inline or at the end16:24
melwittI thought about that too and I think ideally I would put them inline (so that one can look at the code alongside the func test showing it's working)16:25
opendevreviewMerged openstack/nova master: Revert "[pwmgmt]ignore missin governor when cpu_state used"  https://review.opendev.org/c/openstack/nova/+/90567116:37
melwittsean-k-mooney: re: your comment about image backends reporting support ... currently I have each patch adding one action (trying to make review not a nightmare) but I'm not flipping the image backend flag to reporting support until the end. in real life it won't be runnable but in func tests it would. are you saying you would prefer to see the flag reporting support from the first patch and guarding all other actions to reject at the19:16
melwitt API instead? and lift each gate with each subsequent patch?19:17
sean-k-mooneyI would have done it with the API gating approach and lifted it.19:17
sean-k-mooneybut given the code appears to be working19:17
sean-k-mooneyand we can test it with functional tests19:18
sean-k-mooneyI'm hesitant to actually ask you to flip it19:18
melwittsean-k-mooney: ack. wanted to make sure I understood what you meant19:18
sean-k-mooneyso I would like to be able to test each "feature" patch either with tempest or functional tests, ideally19:18
sean-k-mooneyso if we can add in a few functional tests along the way, then I'm fine with the current order19:19
melwittyeah, I see what you mean19:19
sean-k-mooneyI can be convinced to keep the functional tests at the end and keep the current order19:20
sean-k-mooneyI just want to actually make sure this works, meaning testing it myself or having enough automated coverage19:20
sean-k-mooneyI like unit tests but I don't trust unit tests for complex things19:20
melwittfor the next rev I won't change the way the flag gets flipped but will add all the func tests inline and see how that looks. if it's not convincing enough I could move to the API guarding pattern19:20
sean-k-mooneyack, let's start with that and I'll update my devstack VM to your latest revision and do some testing from the tip of the series19:21
sean-k-mooneyI meant to ask you about one of the patches that is WIP19:22
melwittyeah, I agree with you. I've been testing using the massive DNM patch at the end (though I broke something recently and am fixing it). just saying that's how I have any confidence in it :P that and my local devstack testing19:22
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/905512/819:22
sean-k-mooneyis that WIP because it does not work? or because you need more coverage or something else19:22
melwittyeah, that needs a lot more test coverage (unit and func) so I left it WIP19:22
sean-k-mooneyack ok 19:23
sean-k-mooneydo you think you can complete that before FF19:23
melwittI think so19:23
sean-k-mooneyI'm wondering if that should go to the end with an API check, or if you're confident we can merge that19:23
sean-k-mooneyok, then let's try and minimise the code churn by keeping the order and adding functional tests either inline or as follow-ups at the end, where you feel it fits best19:24
melwittit could. it's pretty simple really; the thing migrations need in order to work is to have the libvirt secrets available on the destination, and until that patch it wasn't creating them on the dest (and deleting them on the source)19:24
sean-k-mooneyah ok19:25
melwittI've been trying to cut things down as small as possible because this stuff is complicated imho19:25
sean-k-mooneyI ran out of steam this morning, but my goal is to have reviewed each patch at least once by the end of the week19:26
melwittack. I'm trying to get the next rev pushed sometime today/tonight19:27
sean-k-mooneyok, if you don't it's not the end of the world. much of the content will be the same so I can still review to some degree19:30
melwittack19:30
sean-k-mooneyok, I think it's time for me to finish for today20:00
sean-k-mooneyo/20:01
melwitto/20:04
*** blarnath is now known as d34dh0r5322:17
