Thursday, 2021-01-28

00:17 <openstackgerrit> Merged openstack/glance master: Change database migration version to wallaby  https://review.opendev.org/c/openstack/glance/+/771907
00:29 *** k_mouza has joined #openstack-glance
00:33 *** k_mouza has quit IRC
00:55 *** lifeless has quit IRC
00:57 *** lifeless has joined #openstack-glance
01:16 *** priteau has quit IRC
01:37 *** baojg has joined #openstack-glance
01:59 *** gyee has quit IRC
02:21 *** nicolasbock_ has joined #openstack-glance
02:21 *** nicolasbock has quit IRC
02:21 *** irclogbot_2 has quit IRC
02:21 *** CeeMac has quit IRC
02:21 *** nicolasbock_ is now known as nicolasbock
02:22 *** jdillaman has quit IRC
02:22 *** jdillaman has joined #openstack-glance
02:22 *** CeeMac has joined #openstack-glance
02:23 *** irclogbot_2 has joined #openstack-glance
02:34 *** rcernin has quit IRC
02:36 *** rcernin has joined #openstack-glance
04:36 *** lifeless has quit IRC
04:38 *** lifeless has joined #openstack-glance
04:59 *** whoami-rajat__ has joined #openstack-glance
05:24 *** ratailor has joined #openstack-glance
05:28 *** udesale has joined #openstack-glance
05:49 *** baojg has quit IRC
05:49 *** baojg has joined #openstack-glance
05:49 *** baojg has quit IRC
05:50 *** baojg has joined #openstack-glance
05:50 *** baojg has quit IRC
05:50 *** udesale_ has joined #openstack-glance
05:51 *** baojg has joined #openstack-glance
05:51 *** baojg has quit IRC
05:51 *** baojg has joined #openstack-glance
05:52 *** baojg has quit IRC
05:52 *** baojg has joined #openstack-glance
05:52 *** baojg has quit IRC
05:53 *** baojg has joined #openstack-glance
05:53 *** udesale has quit IRC
05:53 *** baojg has quit IRC
05:54 *** baojg has joined #openstack-glance
05:54 *** baojg has quit IRC
05:55 *** baojg has joined #openstack-glance
05:55 *** baojg has quit IRC
05:55 *** baojg has joined #openstack-glance
05:56 *** baojg has quit IRC
05:56 *** baojg has joined #openstack-glance
05:56 *** baojg has quit IRC
05:57 *** baojg has joined #openstack-glance
05:57 *** baojg has quit IRC
05:57 *** k_mouza has joined #openstack-glance
05:58 *** baojg has joined #openstack-glance
05:58 *** baojg has quit IRC
05:58 *** baojg has joined #openstack-glance
05:59 *** baojg has quit IRC
05:59 *** baojg has joined #openstack-glance
06:00 *** baojg has quit IRC
06:00 *** baojg has joined #openstack-glance
06:00 *** baojg has quit IRC
06:02 *** k_mouza has quit IRC
06:07 *** baojg has joined #openstack-glance
06:18 *** rcernin has quit IRC
06:22 *** rcernin has joined #openstack-glance
06:45 *** baojg has quit IRC
07:05 *** udesale__ has joined #openstack-glance
07:05 *** m75abrams has joined #openstack-glance
07:07 *** udesale_ has quit IRC
07:12 *** udesale__ has quit IRC
07:42 *** rcernin has quit IRC
08:01 *** ralonsoh has joined #openstack-glance
08:13 * abhishekk will be back around 1400 UTC
08:15 *** rcernin has joined #openstack-glance
08:58 *** tkajinam has quit IRC
10:20 *** udesale has joined #openstack-glance
10:24 *** rcernin has quit IRC
10:46 *** udesale has quit IRC
10:47 *** baojg has joined #openstack-glance
10:48 *** udesale has joined #openstack-glance
10:48 *** k_mouza has joined #openstack-glance
10:50 *** rcernin has joined #openstack-glance
11:08 *** rcernin has quit IRC
11:18 *** udesale has quit IRC
11:48 *** k_mouza has quit IRC
11:48 *** k_mouza_ has joined #openstack-glance
11:51 *** ratailor has quit IRC
12:23 <openstackgerrit> Felix Huettner proposed openstack/glance master: Fix missing backend deletion of disabled images  https://review.opendev.org/c/openstack/glance/+/772872
12:42 *** priteau has joined #openstack-glance
13:02 <whoami-rajat__> hi dansmith abhishekk rosmaita, can you review these functional tests for cinder multiple stores, it's been sitting there for a long time because of different reasons but now finally the CI is passing. https://review.opendev.org/c/openstack/glance/+/750144
13:05 <rosmaita> whoami-rajat__: ack
13:12 <whoami-rajat__> thanks!
13:14 *** udesale has joined #openstack-glance
13:35 *** baojg has quit IRC
13:37 <abhishekk> whoami-rajat__, ack
13:43 *** k_mouza has joined #openstack-glance
13:46 *** k_mouza_ has quit IRC
13:56 <abhishekk> jokke, rosmaita, dansmith, smcginnis weekly meeting in 5 minutes at #openstack-meeting
14:05 *** rajivmucheli has joined #openstack-glance
14:06 *** rajivmucheli has quit IRC
14:07 <openstackgerrit> Pranali Deore proposed openstack/glance-tempest-plugin master: Add some test jobs for glance-tempest-plugin  https://review.opendev.org/c/openstack/glance-tempest-plugin/+/772883
14:07 *** rajivmucheli has joined #openstack-glance
14:42 *** k_mouza has quit IRC
14:43 *** k_mouza has joined #openstack-glance
14:45 *** k_mouza has quit IRC
14:46 *** rcernin has joined #openstack-glance
14:46 *** k_mouza has joined #openstack-glance
15:00 *** rcernin has quit IRC
15:09 *** k_mouza has quit IRC
15:09 *** k_mouza_ has joined #openstack-glance
15:14 *** rajivmucheli has quit IRC
15:23 * abhishekk going for dinner
15:42 <dansmith> gmann: this rbac thing actually broke the copy permissions that the job is setting out to test: https://review.opendev.org/c/openstack/glance/+/764074
15:43 <gmann> dansmith: yeah, i opened it for debugging but forgot. thanks
15:43 <dansmith> okay, cool, just wasn't sure if it was obvious where the problem was
15:45 <gmann> yeah, something very suspicious there as it is breaking the existing policy override, which it should not do
15:45 <dansmith> maybe because the job needs to enable that perm and it's doing it the old way?
15:45 <dansmith> well, it's good that we have at least one job with a policy override then :)
15:46 <gmann> yeah
15:46 <dansmith> echo $'{"copy_image": ""}' > /etc/glance/policy.json
15:46 <dansmith> is what it's doing
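As an aside on what that one-line override means: in oslo.policy an empty check string parses to an always-true rule, so the file above opens copy_image up to any caller. A quick standalone check of that behaviour, using only the public oslo.policy API (nothing glance-specific):

    from oslo_policy import policy

    rules = policy.Rules.load('{"copy_image": ""}')
    print(rules['copy_image'])   # prints "@", oslo.policy's always-allowed rule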
15:47 <gmann> ah yeah
15:47 <gmann> but policy.json should be picked up if it exists, even though we changed the default
15:48 <dansmith> yeah, I haven't looked deeply at the patch so I dunno why
15:48 <dansmith> maybe because it's empty?
15:48 <dansmith> the perm?
15:48 <gmann> https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f39/764074/9/check/nova-ceph-multistore/f39db38/controller/logs/etc/glance/policy.json
15:49 <gmann> so the file is there with the expected policy
15:49 <dansmith> that's not json
15:49 <dansmith> oh wait,
15:49 <dansmith> maybe firefox is rendering it for me
15:49 <dansmith> yeah, heh
15:49 <gmann> {"copy_image": ""} is there
15:50 <dansmith> yeah, firefox parsed it for me when I loaded it initially, so I wasn't seeing actual json
15:51 <gmann> hope devstack is not setting it in config
15:51 <gmann> i mean yaml
15:52 <dansmith> meaning creating an empty yaml file so this gets ignored?
15:52 <dansmith> I don't see one in etc/glance
15:52 <gmann> yeah, but nothing here so all good until this - https://zuul.opendev.org/t/openstack/build/f39db383d98941cc998a0f25bfb0c15e/log/controller/logs/etc/glance/glance-api_conf.txt
15:54 <gmann> https://zuul.opendev.org/t/openstack/build/f39db383d98941cc998a0f25bfb0c15e/log/controller/logs/screen-g-api.txt#392
15:54 <gmann> dansmith: ^^ it is looking for policy.yaml
15:55 *** m75abrams has quit IRC
16:01 <dansmith> gmann: so not falling back to json as it should?
16:01 <gmann> yeah, I think I need to add more logging there, as there are a lot of if/else conditions for picking the file: https://github.com/openstack/oslo.policy/blob/master/oslo_policy/policy.py#L363
16:02 <gmann> let me add more logging and see why the fallback is not working
16:02 * dansmith nods
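To see which file the enforcer actually picks, a minimal reproduction sketch along these lines can be run on a node that has the job's /etc/glance/glance-api.conf and policy.json; it only uses public oslo.config/oslo.policy APIs and is not the logging patch gmann is writing:

    from oslo_config import cfg
    from oslo_policy import policy

    conf = cfg.ConfigOpts()
    conf(["--config-file", "/etc/glance/glance-api.conf"], project="glance")

    enforcer = policy.Enforcer(conf)
    enforcer.load_rules()             # runs the policy-file lookup being debugged
    print(enforcer.policy_path)       # which file was found: policy.yaml, policy.json, or None
    print(enforcer.rules.get("copy_image"))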
16:09 <dansmith> abhishekk: lmk when you're back from dinner
16:10 *** mgagne has joined #openstack-glance
16:17 *** udesale has quit IRC
16:20 <abhishekk> dansmith, I am now
16:21 <dansmith> abhishekk: so I've been looking at this situation where we have nodes that staged an image and then the image was deleted or imported from another
16:21 <dansmith> like if someone doesn't have a shared dir for staging,
16:21 <dansmith> or one node stages the image and then the image is deleted without being able to delete the staging file
16:21 <dansmith> I think we discussed this briefly back when doing the copy-to-store stuff
16:22 <dansmith> but as I mentioned, I confirmed that it's a problem, would you agree that's a bug?
16:22 <abhishekk> yes
16:22 <dansmith> if it happens persistently, you could end up filling the disk with abandoned images
16:22 <dansmith> okay, so I will file a bug for that
16:22 <abhishekk> ack
16:22 <dansmith> abhishekk: so here's my next question, related to this staging stuff:
16:23 <dansmith> why is staging a glance_store at all?
16:23 <dansmith> the one benefit I see is that we can do a store copy from $staging to $dest
16:23 <dansmith> but otherwise, all the code that deals with staging seems to have to snip off the file:/// and use os-level commands on the files themselves
16:23 <dansmith> and
16:24 <dansmith> AFAICT, we can never have staging be anything other than file-on-disk, if we want things like format inspect (during import) and format conversion to work
16:24 <abhishekk> as far as I recollect we were using the upload call to upload the data into the staging area
16:24 <dansmith> and even if you discount those, making staging be a network location like swift would be massively inefficient because you could copy across the network multiple times
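To make the prefix-snipping concrete, this is the kind of handling being described, shown with a made-up staging URI rather than a quote of the actual glance code:

    import os

    # Hypothetical node-local URI from the file-based staging store; the import
    # flow needs a plain filesystem path so it can run os-level tools (format
    # inspection, qemu-img conversion) directly against the file.
    staging_uri = "file:///var/lib/glance/staging/4b2c7a90-example-image"
    local_path = staging_uri[len("file://"):]
    exists = os.path.exists(local_path)   # os-level operations work on local_path from here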
16:24 <jokke> I'd disagree with it being a bug; it's problematic, though, but the requirements are clearly laid down in the documentation for deployments to avoid it
16:25 <abhishekk> kind of a known issue
16:25 <jokke> that it is
16:25 <dansmith> seems like a bug to me, because it can happen even if properly config'd, but nfs gets unmounted or something similar
16:26 <abhishekk> and that's why we are recommending to use web-download instead of glance-direct at the moment
16:26 <openstackgerrit> Ghanshyam proposed openstack/glance master: DNM: testing json->yaml migration failure  https://review.opendev.org/c/openstack/glance/+/772912
16:26 <dansmith> I forget, does web-download download it first to staging?
16:27 <jokke> it's kind of like saying that a VM stopping running because the server breaks is a nova bug
16:27 <abhishekk> web-download downloads into staging first, but all requests go to the same node
16:27 <dansmith> jokke: leaking resources is not the same as a vm crashing
16:28 <dansmith> abhishekk: but if I power off that worker in the middle of web-download, then when it starts back up, the image could be deleted by the user and the image will still be in staging on that node, forever right?
16:29 <abhishekk> yes
16:29 <dansmith> jokke: would you agree that is a bug?
16:31 <jokke> The very specific case is that we don't have means to automatically clean up after an unclean shutdown; I think the actual bug is that we have no means to resume where we left off
16:31 <jokke> whether the resume then gets completed or reverted is a different story
16:32 <dansmith> no,
16:32 <dansmith> I'm saying the following sequence:
16:32 <dansmith> 1. I start a web-download import, which downloads the image to staging and is in the process of uploading to, say, ceph
16:32 <dansmith> 2. The node running the import loses power
16:32 <dansmith> 3. I lose patience and delete the image via another worker which still has power
16:33 <dansmith> 4. That worker comes back online and leaks 1TB of data until the operator notices
16:33 <dansmith> (1TB because I get to choose the image size :)
16:33 <jokke> worst case it leaks almost 2TB
16:34 <dansmith> what we should do is, on startup, scan our staging area, and if we find any images there that *also* have a deleted image record, we delete the staging file
16:34 <jokke> as the location gets registered when the upload finishes, so you will have the data in staging and partially in the store (I got to choose how far the upload got before the power went down)
16:34 <dansmith> presumably the delete of the image would have seen the partial location for the image and deleted that from ceph too, right?
16:34 <dansmith> so no leak of the 0.9TB we already wrote
16:35 <dansmith> well, that's a problem too then
16:35 <dansmith> I thought it created the location active=false so it could store the destination location, but if not that's another problem I guess
16:35 <dansmith> so, does that mean you're on board with this as a bug now?
16:37 <jokke> which one, that we allow the user to delete the image while there is supposedly an import going on, or that we don't proactively delete user data? ;)
16:37 *** coreycb has quit IRC
16:37 <dansmith> do we block the delete of the in-progress image?
16:38 *** coreycb has joined #openstack-glance
16:38 <jokke> Currently we don't, we just deal with it, but that's one way to approach the problem
16:38 <dansmith> I think that users find not being able to delete things pretty annoying
16:39 <dansmith> and generally, the philosophy is that you can delete anything in any state, and the server collects it if something is in progress
16:39 <dansmith> especially for an operation like this where we're deleting it because it's stuck or not progressing...
16:39 <jokke> there is indeed a bug somewhere there we know of, but which is the actual bug, that we let the user delete a record we are supposedly processing, or that we don't automatically delete data on start, is the debatable part of it
16:39 <dansmith> right now you can't get the delete back to the original worker,
16:40 <dansmith> which means you can't let them ever delete it even after the node comes back :)
16:41 <dansmith> to your other point, I don't think that deleting from the staging dir is "deleting user data"
16:41 *** dosaboy has quit IRC
16:41 *** dosaboy has joined #openstack-glance
16:42 <dansmith> but, if I work on a fix to make nodes clean their staging dir on restart, is that going to be acceptable? (cc: abhishekk )
16:42 <abhishekk> at the moment I think this is the best possible solution we could have
16:43 <dansmith> I agree, it's very "eventual consistency"
16:43 <dansmith> which is supposedly what we're doing here
16:43 <dansmith> much like the OS doesn't scrub your bits from disk before unlink() returns
16:43 <jokke> Just out of curiosity and not pointing any fingers in any direction: back to the Nova example I made, when nova comes back up and sees that the VM ain't running, will it just go and delete the ephemeral data used by that VM, or how are you dealing with these there?
16:44 <dansmith> it absolutely does
16:44 <dansmith> if there is a matching instance record that is now deleted
16:44 <dansmith> so,
16:44 <dansmith> instance is running, compute crashes, client decides they'll recreate elsewhere, so they delete the original
16:44 <dansmith> we can't actually delete when the node is down, but we let them nuke the record right then
16:45 <dansmith> nova starts back up, scans the instance records, disk images, libvirt domains, and cleans up after the delete
16:46 <dansmith> there is a config that lets you tell it to leave those things in place, for forensics if you want, but it defaults to "reap"
16:46 <dansmith> since staging is temporary space, I'm not sure the same makes sense for glance, but if that's important to you then that's fine
16:47 <jokke> So this is exactly one of the reasons why I think it's important to have the location record for the staging data. 1) if delayed delete is enabled, we will let the user nuke the image and the scrubber will eventually clean it up, or 2) if the data cannot be deleted due to the store not being available (aka the host where the data is sitting), the delete won't go through before we can act on it
16:49 <dansmith> I thought the scrubber was broken and potentially deprecated?
16:49 <dansmith> during the wsgi discussions, I asked abhishekk about not being able to make it work and I thought it was needing some love
16:50 <jokke> Well that's a bug
16:50 <jokke> I wasn't aware of
16:50 <abhishekk> the scrubber has not been tested in ages, at least not by me
16:51 <dansmith> so, the scrubber does what? deletes user data from actual stores when the image record is deleted, instead of starting that synchronously with the image delete?
16:51 <dansmith> abhishekk: it was crashing when I tried it
16:51 <abhishekk> ack
16:52 <abhishekk> the concept of delayed delete is: if it is enabled, it marks the image as soft deleted when the user deletes the image, and later the scrubber picks up those images and deletes them from the backend
16:53 <abhishekk> it was added so that the user could restore the image if it is still in the pending delete state (soft deleted), but this later part of restoring was never addressed
16:53 <jokke> yup, so the image status goes to deleted, the metadata gets whacked, and the actual location removal from the stores happens out of band instead of keeping the client hostage for it
16:53 <dansmith> right, so if the scrubber was working, does it properly handle stores that are non-local?
16:54 <jokke> no! the restore was never a real thing and was tried much later as a side effect of it
16:54 <abhishekk> ok
16:54 <jokke> it was implemented because synchronous deletes of big images were equally taxing on the clients as uploads
16:54 <dansmith> like, does it say "well, I see a location for this, but I can't access it because it's not on my local disk, so I better leave this location in place so another node will try to clean it" ?
16:54 <jokke> and held the client hostage until the data was deleted from the store
16:57 <jokke> dansmith: the scrubber basically went and tried to delete the locations; if it fails it just hops to the next image on the list and tries again on the next cycle it runs. Once all the locations of that deleted image were whacked, it stops trying and the image is considered deleted
16:57 <jokke> everything else is cleaned during the actual delete call
16:58 <jokke> That allowed, for example, ceph-backed images that had usage by cow to be deleted, and the data was finally whacked when the ceph reference count dropped to 0 and ceph allowed the data to be deleted
16:59 <jokke> so you could delete an image without killing a VM still running from that image
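A conceptual sketch of the scrubber pass jokke describes, with the store and DB operations passed in as stand-in callables (it mirrors the description above, not glance's actual scrubber code):

    def scrub_once(list_pending_delete_images, delete_location, mark_image_deleted):
        """One delayed-delete cycle: try to remove every location of each
        soft-deleted image; anything that fails is retried on the next cycle."""
        for image in list_pending_delete_images():
            remaining = []
            for location in image['locations']:
                try:
                    delete_location(location)        # backend delete, e.g. the CoW clone in ceph
                except Exception:
                    remaining.append(location)       # store not reachable yet; retry next cycle
            image['locations'] = remaining
            if not remaining:
                mark_image_deleted(image['id'])      # all data gone -> status 'deleted'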
17:00 <dansmith> so, if we go the location route to solve this staging residue bug, then people without delayed delete have to turn on and run the scrubber to periodically scan everything in order to avoid leaking storage right?
17:01 <dansmith> as opposed to just cleaning the local staging dir on startup, right after we know we could have had a problem
17:01 <dansmith> but it sounds like the staging leak is a legit bug, regardless of the solution
17:02 <abhishekk> either way IMO it is beneficial to scan staging_dir and clean at the startup
17:02 <dansmith> agreed
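A rough sketch of the startup cleanup being agreed on here, with hypothetical names for the staging directory and the image-status lookup (this is the idea under discussion, not the eventual glance patch):

    import os

    def clean_staging_on_startup(staging_dir, get_image_status):
        # get_image_status(image_id) stands in for a DB/repo lookup; assume it
        # returns the image's status string, or None if the record is gone.
        for name in os.listdir(staging_dir):
            image_id = name.split('.', 1)[0]     # assume staging files are keyed by image id
            status = get_image_status(image_id)
            if status is None or status == 'deleted':
                os.unlink(os.path.join(staging_dir, name))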
17:02 *** gyee has joined #openstack-glance
17:02 <dansmith> I tend to think that it would also be good to clean files there that are related to import-unfinished image records, not only deleted ones
17:03 <dansmith> because if someone didn't delete the image while we were down, we still want to clean up the mess in staging
17:03 <jokke> I've been thinking that ^^
17:05 <jokke> just wondering if we have a meaningful way to resume or if we just need to go and clean up ... the benefit of resume is that we would not be whacking user data without their agreement
17:05 <dansmith> it's staging data.. it's /tmp
17:06 <dansmith> resume is always nice, but unless or until you're going to make that work reliably, I'd rather have a clean staging dir :)
17:07 <abhishekk> and that will be a lot of work to do as well
17:12 <dansmith> so should I spend time on this or not?
17:13 <abhishekk> yes
17:13 <jokke> dansmith: somehow this feels like you're looking to go to incredible lengths of work to avoid us having an actual record that we have data related to the image record that has not been cleaned up and will require an action. Or am I missing something?
17:16 <dansmith> I assume you're talking about the location for staging data,
17:16 <dansmith> but as I noted above, that only helps if we're going to require people to turn on the scrubber (and fix the scrubber, and make sure the scrubber does the right thing), in order to avoid the leak
17:17 <jokke> Also, should we go the locations route, be it wallaby or xena, all that work will be void, as we definitely should not go and delete data when we do have a location record for it
17:17 <dansmith> if that's the plan, then that's cool, but I think there's a lot of work in front of us for that to happen
17:17 <dansmith> in staging when we have a deleted image? I think that's totally fine
17:17 <dansmith> but okay, I'm hearing "don't work on this" so I won't
17:18 <abhishekk> jokke, even if we go the locations approach, I think we need to have this check on startup: if delayed delete is not enabled, then clear this staging area??
17:20 <jokke> abhishekk: there won't be deleted images with data left behind, as the delete will fail if the location can't be removed. So that code should never be executed in either case
17:21 <dansmith> images should always be deletable
17:22 <jokke> dansmith: we can debate that with the Images API v3 spec, but that is not the case with v1 or v2 ;)
17:23 <dansmith> yep, well, I'll just drop it
17:23 <dansmith> I'd like to file the bug to document the concern, you can close it and I'll leave it alone
17:26 <jokke> dansmith: sounds good, we can either solve the bug in the refactoring happening on the distributed import to clean up the debt we've accumulated, or if that does not happen then we have a bug to follow up on
17:27 <jokke> Like I totally agree we need a solution to clean up after ourselves, I just don't think you running in the opposite direction again right now is a good use of either of our time
17:38 * abhishekk going offline for the day
17:38 <abhishekk> dansmith, jokke ^^
17:38 <dansmith> o/
17:38 <jokke> night o/~
17:38 <abhishekk> o/~
19:08 *** k_mouza_ has quit IRC
19:48 *** k_mouza has joined #openstack-glance
19:54 *** k_mouza has quit IRC
19:58 *** whoami-rajat__ has quit IRC
20:03 *** rcernin has joined #openstack-glance
20:30 *** rcernin has quit IRC
20:41 *** rcernin has joined #openstack-glance
20:48 *** ralonsoh has quit IRC
20:57 *** zzzeek has quit IRC
20:57 *** zzzeek has joined #openstack-glance
21:34 *** rcernin has quit IRC
21:51 *** rcernin has joined #openstack-glance
23:09 *** k_mouza has joined #openstack-glance
23:14 *** k_mouza has quit IRC
23:36 *** knikolla has quit IRC
23:36 *** knikolla has joined #openstack-glance
23:37 *** PrinzElvis_ has joined #openstack-glance
23:37 *** PrinzElvis has quit IRC
23:37 *** PrinzElvis_ is now known as PrinzElvis

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!