tonyb | I have access to a cloud that, due to a rabbitmq failure, is in an inconsistent state. `openstack server list` returns a bunch of instances, some [most] return `No server with a name or ID of '$UUID' exists.` when trying to get more information/delete the nodes. | 00:55 |
tonyb | Any advice on how to reconcile the list with the instances that are still alive? | 00:55 |
*** jph7 is now known as jph | 01:40 | |
tonyb | Picking one instance as an example, I see it in nova_api.instance_mappings, and looking in nova_cell0.instances it shows as deleted. | 06:03
* tonyb reads the cells docs | 06:25 | |
*** zigo_ is now known as zigo | 09:43 | |
sean-k-mooney[m] | @tonyb so cell0 is only used for instances that fail to boot to any host but pass the initial api checks for whether the request is valid | 11:30
sean-k-mooney[m] | cell0 should never have any compute nodes in it | 11:31 |
tonyb | I did more digging and found a number (but not all) of the instances showing up in 'server list' have data in the build_requests table. | 11:38 |
tonyb | I *suspect* I can delete them from the build_requests table and they'll "go away" | 11:39
tonyb | I also expect that will take care of most of the problems and then I can work out what the next stage is | 11:42 |
sean-k-mooney[m] | that could happen in older openstack releases but i believe we fixed that at some point | 11:51
sean-k-mooney[m] | has anyone worked with the SovereignCloudStack folks on their standards for openstack clouds | 11:52
sean-k-mooney[m] | https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0100-v3-flavor-naming.md | 11:52 |
sean-k-mooney[m] | that is interesting and i agree with most of it but it also implies some functionality that nova does not support natively | 11:53
sean-k-mooney[m] | at least not without a security risk | 11:53
sean-k-mooney[m] | their disk section https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0100-v3-flavor-naming.md#baseline-2 | 11:54
sean-k-mooney[m] | they mention that if the flavor name does not include a disk capacity, the cloud will provide at least image.min_disk space | 11:54
sean-k-mooney[m] | that either needs the operator to configure root_gb=0 and allow non-bfv instances with a 0 root disk | 11:55
sean-k-mooney[m] | that was considered a cve in the past | 11:55
sean-k-mooney[m] | or they need downstream changes to support that | 11:55
sean-k-mooney[m] | i feel it's inappropriate to propose a standard behavior that is not supported upstream or that depends on a security vulnerability | 11:56
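For illustration, the root_gb=0 case described above maps onto a flavor like the minimal sketch below (the flavor name is made up); with `--disk 0`, a non-BFV guest's root disk would have to be sized from image.min_disk, which is exactly the behavior upstream nova does not allow without the security concern mentioned.

```shell
# Minimal sketch of a zero-root-disk flavor; "example.nodisk" is a made-up name.
# With --disk 0, a non-BFV guest would need its root disk sized from
# image.min_disk, which upstream nova treats as unsafe.
openstack flavor create example.nodisk \
    --ram 4096 \
    --vcpus 2 \
    --disk 0
```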
sean-k-mooney[m] | they also have | 11:57
sean-k-mooney[m] | Multi-provisioned Disk | 11:57 |
sean-k-mooney[m] | The disk size can be prefixed with Mx prefix, where M is an integer specifying that the disk is provisioned M times. Multiple disks provided this way should be independent storage media, so users can expect some level of parallelism and independence. | 11:57 |
sean-k-mooney[m] | nova only supports multiple disks for the same instance if you use both root_gb and ephemeral_gb | 11:57
sean-k-mooney[m] | and nova does not support putting them on different storage media | 11:57
sean-k-mooney[m] | so again that would require downstream only modifications to nova | 11:57 |
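For comparison, the only multi-disk layout nova supports natively is a root disk plus an ephemeral disk defined on the flavor, both carved from the single storage backend configured on the compute host; a minimal sketch with a made-up flavor name:

```shell
# Minimal sketch: root disk plus ephemeral disk defined on one flavor.
# Both disks come from the compute host's single configured image/storage
# backend, so they are not independent storage media as the SCS text implies.
openstack flavor create example.twodisk \
    --ram 8192 \
    --vcpus 4 \
    --disk 40 \
    --ephemeral 20
```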
tonyb | sean-k-mooney[m]: This is victoria | 11:57 |
sean-k-mooney[m] | @tonyb ya i dont think it was fixed then | 11:58 |
sean-k-mooney[m] | there was an edge case where if an error happened during delete, sometimes the build request didn't get cleaned up properly | 11:58
tonyb | Okay, rabbitmq died on this cloud (I don't know why) so I can see events going missing. | 11:59
sean-k-mooney[m] | did db access fail | 12:00 |
sean-k-mooney[m] | i dont think it happened because of rabbit but perhaps | 12:00
tonyb | "safe" is the wrong word but do you think it'll be "safe" to just delete those entries from build_requests | 12:00 |
tonyb | I don't think the DB access failed | 12:01 |
sean-k-mooney[m] | if the instances are already in cell1 or marked as deleted then ya it should be “safe” | 12:01
tonyb | sean-k-mooney[m]: Thanks I'll try a couple tomorrow and see what happens | 12:03 |
tonyb | It'd be nice to clean this up as it's one of our nodepool providers | 12:03 |
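A rough sketch of the kind of cross-check discussed above, assuming the default API and cell database names (nova_api, nova_cell0) and a privileged mysql client on the controller; the UUID is a placeholder and this is not a supported procedure, so back up the databases first.

```shell
# Rough sketch only: back up the databases before touching anything.
UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee   # placeholder instance UUID

# Does the instance only exist as a leftover build request?
mysql nova_api   -e "SELECT instance_uuid, created_at FROM build_requests    WHERE instance_uuid='$UUID';"
mysql nova_api   -e "SELECT instance_uuid, cell_id    FROM instance_mappings WHERE instance_uuid='$UUID';"
mysql nova_cell0 -e "SELECT uuid, vm_state, deleted   FROM instances         WHERE uuid='$UUID';"

# Only if the instance is deleted (or absent) in its cell, drop the stale
# build request so it stops showing up in 'openstack server list':
mysql nova_api -e "DELETE FROM build_requests WHERE instance_uuid='$UUID';"
```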
auniyal | Hi gibi, can you please have a look at this some time https://review.opendev.org/q/topic:%22bp/2048184%22 | 12:15
auniyal | thanks sean-k-mooney[m]++ | 12:34 |
sean-k-mooney | no worries. im mostly done with upstream code review for today but i also reviewed your other patch this morning | 12:35
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/857339 | 12:35 |
auniyal | yeah, I'll update this as well | 12:36
auniyal | if you still have some time, can you please see this as well - https://review.opendev.org/q/topic:%22refactor-vf-profile-update%22 | 12:38 |
auniyal | it will help this - https://review.opendev.org/c/openstack/nova/+/899253/4 | 12:39 |
frickler | sean-k-mooney: I'll forward your comments regarding SCS, seems no one is around in this channel | 12:43
sean-k-mooney | frickler: cool. i dont dislike the idea of having standards around this | 12:46
sean-k-mooney | but i would hope those standards would start from a point of what is secure and supported in vanilla openstack | 12:46
sean-k-mooney | frickler: the multi disk capability could work if you're using ironic | 12:46
sean-k-mooney | i.e. you could have separate storage media for storage provided by the flavor in that case | 12:47
sean-k-mooney | but i dont think its possible for any other virt driver and it definitely is not for libvirt | 12:47
Uggla | gibi, bauzas question for unmounting a share. There is no polling in that case but as we are calling manila, I guess it should be asynchronous too. Right ? | 13:26 |
bauzas | when would you unmount the share ? | 13:27 |
Uggla | bauzas, it depends. :) | 13:27 |
bauzas | on the snow level ? | 13:27
bauzas | :) | 13:27 |
Uggla | bauzas, if it is not used by another VM yes. | 13:28 |
bauzas | so when deleting the instance ? | 13:28 |
Uggla | bauzas, hmm not sure I get your point. But if you delete a VM with a share it goes to the same code and unmounts the share if it is not used elsewhere. | 13:30
gibi | Uggla: yeah, as we are calling other APIs I would make the detach async as well. But as you pointed out that will be faster as no polling there. | 13:40 |
bauzas | so yeah, it should be an async call | 13:40 |
bauzas | by design, we never ever stop a deletion | 13:41 |
bauzas | we first accept the deletion request and only after we try to delete the resource | 13:41 |
gibi | bauzas: having the unmount at detach means VM deletion does not need to talk to manila | 13:42 |
gibi | scratch that | 13:42 |
gibi | if we delete a VM with an attached share then the umount need to happen during VM deletion | 13:43 |
gibi | but VM delete is already async so I don't see an issue there | 13:43
Uggla | gibi, ++ | 13:43 |
bauzas | gibi: you're right on the fact that the delete is already async | 13:47 |
bauzas | when thinking about that, I also wonder why we need to unmount the share when deleting the resource in the compute | 13:47 |
Uggla | gibi, I would say the polling is more to ensure manila has applied the policy before mounting than really wait for manila. I set a timeout here mainly to avoid an infinite loop if manila is not working. | 13:47
bauzas | it could be a periodic | 13:48 |
bauzas | (if we have correct statuses) | 13:48 |
Uggla | bauzas, yep but it is better to clean the room yourself than have your mom (the periodic) do it. :) | 13:49
gibi | I see periodics as fallbacks when something bad happens during normal operations. | 13:49
bauzas | my only concern is the polling | 13:50 |
bauzas | if we don't need to call manila, sure | 13:50 |
gibi | we need to call manila to unlock the share | 13:50
bauzas | hence my concern | 13:50 |
gibi | but during delete we don't need to poll manila | 13:50 |
gibi | as unlock is sync | 13:50 |
gibi | I think | 13:50 |
bauzas | there are two stages | 13:51 |
bauzas | 1/ user asks to delete the instance | 13:51 |
bauzas | 2/ compute eventually deletes it | 13:51 |
bauzas | surely, for 1/, it's async | 13:51 |
bauzas | the compute could be not running | 13:51 |
bauzas | but for 2/ | 13:51 |
bauzas | once the compute gets the fact that it needs to delete a guest, then it runs | 13:52 |
bauzas | and that's my concern | 13:52 |
bauzas | I don't want this compute run to be waiting to manila to unlock the share | 13:52 |
gibi | but we do the same with port unbinding and probably for unreserving the volume in glance | 13:53 |
gibi | in cinder | 13:53 |
bauzas | hmm ok | 13:54 |
gibi | also we tend to say that nova compute works, sort of, with all periodics disabled | 13:54
bauzas | wdym ? | 13:56 |
sean-k-mooney | im in a meeting right now but in general we should be able to run nova without periodics enabled, modulo bugs | 13:57
gibi | :) | 13:57 |
sean-k-mooney | so falling back to a periodic, sure | 13:57
gibi | yeah I heard this from Sean | 13:57 |
gibi | bauzas: we do the volume detach in cinder API during terminate instance here https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L3284 | 13:57 |
sean-k-mooney | the only "features" that need perodics to fucntion are things like local_delete | 13:58 |
sean-k-mooney | where we will process any delete that were done while the agent was stopped | 13:58 |
sean-k-mooney | and all of those run in init host | 13:58 |
bauzas | if people are okay with waiting for manila by polling, fine by me then | 13:59 |
bauzas | it will be simpler for Uggla :) | 14:00
sean-k-mooney | i dont think its safe to consider the instance deleted without at least unmounting the share from the host | 14:00
sean-k-mooney | the unlock i think should be done too | 14:00
sean-k-mooney | but i could see not failing the delete if that does not happen | 14:00
sean-k-mooney | and trying again via a periodic | 14:00
bauzas | that's my concern | 14:00 |
sean-k-mooney | but doing it only in a periodic would not be ok in my view | 14:00
bauzas | I don't want the delete to have an exception if we are not able to call Manila | 14:00 |
gibi | what happens today when a volume cannot be detached during VM delete? | 14:01 |
bauzas | we should rather delete the guest and eventually try to unmount | 14:01 |
sean-k-mooney | no the unmount on the host has to happen as part of the delete | 14:01 |
sean-k-mooney | its a security issue if it does not | 14:01
bauzas | why ? | 14:01 |
sean-k-mooney | the unlock of the share could be deferred | 14:01
sean-k-mooney | bauzas: because you have a tenant's data volume present on the host | 14:02
bauzas | if we don't delete the share in nova, it's just a file | 14:02 |
bauzas | that none of other instances could use | 14:02 |
sean-k-mooney | its not just a file, its an nfs/cephfs share that is mounted on the host | 14:02
sean-k-mooney | that anyone with host access could read with the correct permissions | 14:03
bauzas | anyone with host access would have more possibilities than using a share | 14:03
sean-k-mooney | i.e. if you break out of a vm and get host access you can read any share mounted on the host | 14:03
sean-k-mooney | sure, to gibi's point we should mirror what we do for cinder volumes | 14:04 |
bauzas | I think we said that it's not a security issue if you need to ssh to the host | 14:04 |
bauzas | the problem we had with cinder is that we were able to get *that* bdm for other instances | 14:04 |
sean-k-mooney | this is very close but not the same as the cinder issue with deleting the attachment in my view | 14:04
bauzas | we discussed that a bit this morning during our meeting | 14:05 |
sean-k-mooney | bauzas: not bdm, host block device | 14:05 |
bauzas | correct | 14:05 |
bauzas | anyway, I'm doing other things by now, I can't discuss it more now | 14:05 |
sean-k-mooney | ya so i agree you wont have sharing because we will be mounting the share under the instance dir, right | 14:05
bauzas | Uggla: are you OK ? | 14:05 |
bauzas | no, we don't mount the share on the instance dir | 14:06 |
bauzas | but we have share mappings telling which share to use with what instance | 14:06 |
sean-k-mooney | it should be either configurable or under the instance dir | 14:06
bauzas | there is a spec | 14:06 |
bauzas | and within this spec, this is explained | 14:06 |
Uggla | bauzas, yep for me it is ok. I have my answers | 14:07 |
sean-k-mooney | bauzas: im pretty sure the spec did not say it's ok to defer the unmount to a periodic | 14:07
bauzas | that, indeed; I was just talking about the share path | 14:08
bauzas | anyway, /me going back to work given Uggla got his answers | 14:08 |
sean-k-mooney | Uggla: so you're going to do the unmount when deleting the instance as part of delete, with a fallback to a periodic | 14:09
sean-k-mooney | and not have manila block the delete if unlocking fails? | 14:10
Uggla | sean-k-mooney, yep but atm I have no periodic. | 14:10
sean-k-mooney | right so without that we would have to make the failure to unlock put the instance in error | 14:10
sean-k-mooney | gibi: how do you feel about that ^ | 14:10 |
sean-k-mooney | i do not have the full context so im mostly relying on you and bauzas to ensure the correct behavior. if we just ignored the failure to unlock the manila share with no recovery method i would think that deserves a -2 | 14:12
gibi | The cheaper way is to put the instance in ERROR during delete. I suggest to check what we do with cinder volumes during delete if cinder is unavailable | 14:14 |
sean-k-mooney | we need to be careful of two things: 1.) we are not locking the share indefinitely if we delete the instance and the manila call fails, that has billing implications. 2.) we are not exporting the share to a host it should not be on, that has security implications, and possible regulatory compliance issues. | 14:15
gibi | I agree about these goals ^^ | 14:15 |
sean-k-mooney | i would have been ok with putting the instance in error | 14:16
sean-k-mooney | so you can just try again | 14:16 |
sean-k-mooney | when manilla is fixed | 14:16 |
sean-k-mooney | and i suspect that is what would happen for a BFV guest with delete_on_termination but im not sure | 14:17
sean-k-mooney | it looks like in terminate we do self._cleanup_volumes(context, instance, bdms, raise_exc=False, detach=False) | 14:18
sean-k-mooney | that implies we allow the delete to proceed | 14:18
bauzas | I also agree with the two bullet points (just lurking) | 14:19 |
* bauzas is on and off, juggling both a devstack 2-node install and some series rebase | 14:19 | |
bauzas | tbh, I don't really like us copying the cinder workflow | 14:20 |
sean-k-mooney | i am about to enter 3 hours of meetings in 3.5 hours | 14:20 |
bauzas | recent issues led me to think there are some corner cases I'd like to avoid :) | 14:20
sean-k-mooney | so ill also be "stepping away shortly" | 14:20
bauzas | but agreed on the | 14:21 |
sean-k-mooney | right i think gibi and i would prefer to be stricter and fail the delete | 14:21
bauzas | DONTs and the DOs | 14:21
sean-k-mooney | but lets loop back to this after Uggla has updated things | 14:21 |
sean-k-mooney | and we have more time | 14:21 |
bauzas | I think we said we should never try to fail on an instance delete | 14:22 |
bauzas | as there are also some capacity and billing implications | 14:22 |
sean-k-mooney | we can fail on instance delete for other reasons im pretty sure | 14:22
sean-k-mooney | but you are correct on the capacity and billing implications if we do too | 14:22
gibi | it is a trade off: if we let the VM delete happen but leak resources, in the meantime you still get billing issues :D | 14:22
bauzas | as an operator, I'd be more angry if my instance continued to exist because of a cheap open file that wasn't deleted | 14:23
bauzas | but I could be wrong | 14:23 |
gibi | bauzas: e.g. we delete your VM happily but fail to unlock the share in manila. So you are billed for a share you cannot use :) | 14:23 |
bauzas | hence the periodic :) | 14:24 |
gibi | which can be turned off :) | 14:24
bauzas | I'm not saying we should leave the pipe leaking | 14:24 |
bauzas | what I'm saying is that there are bigger things to do than fixing the pipe | 14:24 |
sean-k-mooney | bauzas: would that same operator be ok if customer data was leaked via persistent memory, or if vgpus or pci devices were never made available again | 14:24
gibi | at least if my VM is in ERROR and I cannot delete it I will notice and raise a ticket | 14:24 |
sean-k-mooney | so there is a way to fix this | 14:25 |
sean-k-mooney | we could ask manila to allow the share deletion if the nova instance no longer exists | 14:25
sean-k-mooney | that could be a followup enhancement on their side | 14:25
sean-k-mooney | so if its locked by nova but nova says the instance is gone, allow the user to delete it even with the lock | 14:26
bauzas | gibi: sean-k-mooney: standing as a user for a sec, | 14:26 |
bauzas | if I delete my instance and it goes into ERROR | 14:26 |
bauzas | the second after, I'm gonna try to delete again | 14:26 |
sean-k-mooney | yep which might work | 14:27 |
bauzas | so I could be okay | 14:27 |
gibi | :) | 14:27 |
bauzas | provided we ensure that the instance deletion path is idempotent | 14:27 |
sean-k-mooney | yes but we already need to do that | 14:27 |
sean-k-mooney | i believe one path where instance delete ends with an instance in error | 14:28
sean-k-mooney | is if the connection to libvirt is down | 14:28 |
sean-k-mooney | or the ironic/hyperv/vmware api is not accessible | 14:29
sean-k-mooney | assuming the compute agent is fine it will have to put it in error when it can't talk to the hypervisor | 14:29
sean-k-mooney | well that or leave the instance in active | 14:30
sean-k-mooney | and put the instance action in error | 14:30
sean-k-mooney | ok meetings starting | 14:30 |
*** d34dh0r5| is now known as d34dh0r53 | 15:01 | |
bauzas | every time I'm trying to pull a devstack on a RHEL machine, I'm getting funny stories | 15:11 |
bauzas | Failed to discover available identity versions when contacting http://10.73.116.68/identity. Attempting to parse version from URL. | 15:24 |
bauzas | Could not find versioned identity endpoints when attempting to authenticate. Please check that your auth_url is correct. Service Unavailable (HTTP 503) | 15:24 |
bauzas | but when looking at keystone logs in the journal, lgtm | 15:25 |
bauzas | wtf... | 15:25 |
bauzas | I did set GLOBAL_VENV to False fwiw | 15:27 |
sean-k-mooney | well its technically not a supported operating system | 15:29
sean-k-mooney | i can likely take a look in a while if you are still having issues | 15:30 |
bauzas | SELinux is preventing /usr/sbin/httpd from write access on the sock_file keystone-wsgi-public.socket | 15:35 |
bauzas | doh | 15:35 |
* bauzas goes disabling selinux for a sec :D | 15:36 | |
sean-k-mooney | devstack disables it on centos and fedora | 15:39
sean-k-mooney | so it probably is missing a check for rhel | 15:39
sean-k-mooney | that is probably a simple fix in devstack but also you can do it locally | 15:39
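A quick local workaround for the AVC denial above, roughly what devstack already does on centos and fedora (assuming a throwaway test box where permissive SELinux is acceptable):

```shell
getenforce                      # confirm SELinux is currently Enforcing
sudo setenforce 0               # switch to permissive until the next reboot
# make it persistent so a devstack re-run after a reboot is not blocked either
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
```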
bauzas | I think I need to do some kind of install notes for kashyap and others wanting to test vGPUs with devstack | 15:40 |
bauzas | that's always good to reinstall your devstack | 15:40 |
bauzas | and I think I got the same problem 6 months ago but I forgot | 15:40 |
bauzas | so I'm literally hitting the same bumps without learning | 15:41
*** d34dh0r53 is now known as d34dh0r5| | 16:01 | |
*** d34dh0r5| is now known as d34dh0r53 | 16:01 | |
melwitt | sean-k-mooney: thanks for the reviews! I will be going through and addressing the comments. and I agree re: functional tests, I have been working on them and other issues | 16:23 |
sean-k-mooney | ack i was talking to stephen and he made a point that we could do the functional tests at the end of the series | 16:24
sean-k-mooney | i was going to chat to you later and see how you felt about doing it inline or at the end | 16:24
melwitt | I thought about that too and I think ideally I would put them inline (so that one can look at the code alongside the func test showing it's working) | 16:25 |
opendevreview | Merged openstack/nova master: Revert "[pwmgmt]ignore missin governor when cpu_state used" https://review.opendev.org/c/openstack/nova/+/905671 | 16:37 |
melwitt | sean-k-mooney: re: your comment about image backends reporting support ... currently I have each patch adding one action (trying to make review not a nightmare) but I'm not flipping the image backend flag to reporting support until the end. in real life it won't be runnable but in func tests it would. are you saying you would prefer to see the flag reporting support from the first patch and guarding all other actions to reject at the API instead? and lift each gate with each subsequent patch? | 19:16
sean-k-mooney | i would have done it with the api gating approach and lifted it. | 19:17
sean-k-mooney | but given the code appears to be working | 19:17 |
sean-k-mooney | and we can test it with functional tests | 19:18 |
sean-k-mooney | im hesitant to actually ask you to flip it | 19:18
melwitt | sean-k-mooney: ack. wanted to make sure I understood what you meant | 19:18 |
sean-k-mooney | so i would like to be able to test each "feature" patch either with tempest or functional ideally | 19:18
sean-k-mooney | so if we can add in a few functional tests along the way then im fine with the current order | 19:19 |
melwitt | yeah, I see what you mean | 19:19 |
sean-k-mooney | i can be convinced to keep the functional tests at the end and keep the current order | 19:20
sean-k-mooney | i just want to actually make sure this works, meaning testing it myself or having enough automated coverage | 19:20
sean-k-mooney | i like unit tests but i dont trust unit tests for complex things | 19:20
melwitt | for the next rev I'll not change the way the flag gets flipped but add all the func tests inline and see how that looks. if it's not convincing enough I could move to the API guarding pattern | 19:20
sean-k-mooney | ack lets start with that and ill update my devstack vm to your latest revision and do some testing from the tip of the series | 19:21
sean-k-mooney | i meant to ask you about one of the patches that is WIP | 19:22
melwitt | yeah, I agree with you. I've been testing using the massive DNM patch at the end (though I broke something recently and am fixing it). just saying that's how I have any confidence in it :P that and my local devstack testing | 19:22 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/905512/8 | 19:22 |
sean-k-mooney | is that WIP because it does not work? or because you need more coverage or something else | 19:22 |
melwitt | yeah, that needs a lot more test coverage (unit and func) so I left it WIP | 19:22 |
sean-k-mooney | ack ok | 19:23 |
sean-k-mooney | do you think you can complete that before FF | 19:23 |
melwitt | I think so | 19:23 |
sean-k-mooney | im wondering if that should go to the end with an api check? or if you're confident we can merge that | 19:23
sean-k-mooney | ok then lets try and minimise the code churn by keeping the order and adding functional tests either inline or as followups at the end, where you feel it fits best | 19:24
melwitt | it could. it's pretty simple really, the thing migrations needed to work is to have the libvirt secrets available on the destination and until that patch it wasn't creating them on the dest (and deleting them on the source) | 19:24 |
sean-k-mooney | ah ok | 19:25 |
melwitt | I've been trying to cut things down as small as possible because this stuff is complicated imho | 19:25 |
sean-k-mooney | i ran out of steam this morning but my goal is to have reviewed each patch at least once by the end of the week | 19:26
melwitt | ack. I'm trying to get the next rev pushed sometime today/tonight | 19:27 |
sean-k-mooney | ok if you dont its not the end of the world. much of the content will be the same so i can still review to some degree | 19:30 |
melwitt | ack | 19:30 |
sean-k-mooney | ok i think its time for me to finish for today | 20:00 |
sean-k-mooney | o/ | 20:01 |
melwitt | o/ | 20:04 |
*** blarnath is now known as d34dh0r53 | 22:17 |