Friday, 2021-08-27

<opendevreview> Qiu Fossen proposed openstack/nova master: Allow migrating PMEM's data  https://review.opendev.org/c/openstack/nova/+/802225  01:31
<opendevreview> melanie witt proposed openstack/nova master: Revert "consumer gen: more tests for delete allocation cases"  https://review.opendev.org/c/openstack/nova/+/806292  01:32
<opendevreview> melanie witt proposed openstack/nova master: Revert "Consumer gen support for delete instance allocations"  https://review.opendev.org/c/openstack/nova/+/806293  01:32
<melwitt> gibi, sean-k-mooney, lyarwood: fyi I have uploaded reverts https://review.opendev.org/c/openstack/nova/+/806292 and https://review.opendev.org/c/openstack/nova/+/806293 as one possible option for addressing the gate bug. another option is the WIP patch https://review.opendev.org/c/openstack/nova/+/688802 to add a "force" kwarg to allocation delete  01:36
<opendevreview> Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest  https://review.opendev.org/c/openstack/nova/+/806294  01:44
*** abhishekk is now known as akekane|home  06:00
*** akekane|home is now known as abhishekk  06:00
<slaweq> gibi: hi  06:42
<slaweq> gibi: I see a lot of job failures with an error on deleting an instance, e.g. https://a91ddba9ffca8a8b5d72-1d82fafbdd07c6a56856d7b1449f73a8.ssl.cf5.rackcdn.com/800059/3/check/neutron-ovn-tempest-ovs-release/fd05a18/testr_results.html  06:43
<slaweq> and it's not only on neutron-tempest-plugin jobs  06:43
<slaweq> I saw it on some devstack jobs too  06:43
<slaweq> gibi: is it a known issue or should I open one?  06:43
<gibi> slaweq: hi  07:04
<frickler> slaweq: I would hope that these are solved by the above reverts https://review.opendev.org/c/openstack/nova/+/806293 etc.  07:04
<gibi> just a couple of lines above, melwitt proposed patches  07:04
<slaweq> gibi: frickler: thx a lot  07:05
<gibi> I've just started my day but I will prioritize this issue  07:05
<slaweq> gibi: thx, I just wanted to ask as I knew You would know better whether it's something already reported :)  07:06
<gibi> sure  07:06
<frickler> slaweq: if you check n-cpu.log for errors, you'll see tracebacks about conflicts in placement allocations  07:07
*** rpittau|afk is now known as rpittau  07:07
<slaweq> frickler: yes, now I see it :)  07:09
<gibi> melwitt: thanks for proposing the patches. I read up on the bug discussion and the patches, and I prefer the force kwarg: it gives us a more fine-grained approach to handling deletes in different situations  07:37
<gibi> I will resolve the merge conflict in https://review.opendev.org/c/openstack/nova/+/688802 and then I will +2 it  07:37
<gibi> sean-k-mooney, lyarwood, stephenfin: ^^  07:37
*** mgoddard- is now known as mgoddard  07:58
<lyarwood> Morning  08:19
* lyarwood reads scrollback  08:19
<gibi> lyarwood: morning  08:33
<lyarwood> gibi: sorry, got distracted with $child things  08:55
<lyarwood> gibi: so how's the rebase going? Are we not going ahead with the reverts just yet?  08:55
<gibi> lyarwood: almost done  08:55
<lyarwood> ack  08:55
<gibi> lyarwood: for me the reverts are more complex than the force kwarg fix  08:56
<lyarwood> yup, agreed  08:56
<opendevreview> Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance  https://review.opendev.org/c/openstack/nova/+/688802  09:01
<gibi> lyarwood, sean-k-mooney, stephenfin: ^^  09:02
<lyarwood> gibi: LGTM, had to do a little background reading to make sure, but yeah, still agree this is the better approach  09:15
<gibi> lyarwood: yepp, the bug report has a good summary of what we are dealing with  09:19
<opendevreview> Fabian Wiesel proposed openstack/nova master: VmWare: Remove unused legacy_nodename regex  https://review.opendev.org/c/openstack/nova/+/806336  09:28
<gibi> lyarwood: thanks for the review on the pps series. I replied in https://review.opendev.org/c/openstack/nova/+/793619/15#message-a2916b6a95c6f4061429e9fb2898ed738dfc3427  09:46
<opendevreview> FossenQiu proposed openstack/nova master: Allow migrating PMEM's data  https://review.opendev.org/c/openstack/nova/+/802225  10:33
<opendevreview> Fabian Wiesel proposed openstack/nova master: Vmware: Fix spelling in test  https://review.opendev.org/c/openstack/nova/+/806348  10:36
<opendevreview> Fabian Wiesel proposed openstack/nova master: VmWare: Use of id shadows built-in function  https://review.opendev.org/c/openstack/nova/+/806390  10:43
<lyarwood> gibi: https://review.opendev.org/c/openstack/nova/+/688802 has some unit and functional failures if you have time to look  10:45
<opendevreview> Fabian Wiesel proposed openstack/nova master: Vmware: Fix indentation in conditionals  https://review.opendev.org/c/openstack/nova/+/806391  10:47
<gibi> lyarwood: whaat? I did run those tests before pushing  10:49
<gibi> looking  10:49
<lyarwood> yeah, assumed you had, weird.  10:50
<gibi> obviously I ran it on something else, as I see failures that I haven't seen before :/  10:54
<gibi> anyhow, fixing it ...  10:54
<gibi> the two failing functional tests are actually testing our bug, which is now resolved with force=True  11:09
<sean-k-mooney> the reverts to me are still the better long-term solution, as overall i think it's more complex to have 2 ways to delete things  11:09
<gibi> sean-k-mooney: but the automatic reclaim of a soft deleted instance should not win over a user-initiated restore of the same soft deleted instance  11:10
<sean-k-mooney> although the force flag patch is slightly smaller  11:10
<sean-k-mooney> hum... i can see that, although i also dont think we should support soft delete in general  11:12
<sean-k-mooney> so you are saying the added complexity to support soft delete requires force  11:12
<sean-k-mooney> to ensure we can restore on the same host  11:13
<sean-k-mooney> without other allocations taking up the space  11:13
<sean-k-mooney> im still not sure we need to use a PUT in the soft delete case  11:13
<gibi> what I'm saying is that a conflicting delete should be forced most of the time, except if the delete is conflicting with a restore of a soft deleted instance  11:14
<gibi> conflict means the restore was received first; we should not allow removing the allocation blindly in this case, as it will lead to an active instance without an allocation  11:14
<sean-k-mooney> would we not be able to do that with a lock on the instance uuid  11:15
<sean-k-mooney> have restore acquire the lock and have the periodic try to acquire it too before it does the delete  11:15
<gibi> yes, a lock would also be a solution  11:16
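
A minimal sketch of the per-instance lock idea floated above, using oslo.concurrency; the helper names here are hypothetical, not actual nova code:

    from oslo_concurrency import lockutils

    def restore_instance(instance):
        # restore takes the per-instance lock first
        with lockutils.lock(instance.uuid):
            undelete(instance)  # hypothetical helper

    def reclaim_queued_deletes(instances):
        # the periodic reclaim task takes the same lock before hard-deleting,
        # so a racing restore and reclaim are serialized per instance
        for instance in instances:
            with lockutils.lock(instance.uuid):
                if is_still_soft_deleted(instance):  # hypothetical helper
                    hard_delete(instance)  # hypothetical helper
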
<sean-k-mooney> if we want to proceed with the force flag for now im not necessarily against that  11:17
<sean-k-mooney> but im still not thrilled about doing deletes with PUT in general  11:17
<opendevreview> Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance  https://review.opendev.org/c/openstack/nova/+/688802  11:17
<gibi> sean-k-mooney: I'd like to proceed with the force solution, reverting old patches feels messy.  11:17
<gibi> I need to go grab lunch, be back in 30  11:18
<sean-k-mooney> ack, enjoy  11:19
<sean-k-mooney> stephenfin: ya, removing the globals is certainly another valid option. if there is no global state we have nothing to reset. ill take a look at that quickly  11:20
<opendevreview> sean mooney proposed openstack/nova master: db: Handle parameters in DB strings  https://review.opendev.org/c/openstack/nova/+/805663  11:34
<opendevreview> sean mooney proposed openstack/nova master: remove module level caching  https://review.opendev.org/c/openstack/nova/+/806394  11:34
<sean-k-mooney> stephenfin: ^ they were only ever used in that module, so there was no test coverage for the globals; simple to remove them as you suggested  11:35
* gibi is back  11:38
<gibi> sean-k-mooney: I agree, less global is better  11:39
<sean-k-mooney> we can re-add them if this really is a problem, but personally i would prefer to use the lru cache decorator on the function if it came to that  11:42
<sean-k-mooney> i think stephen is correct that this likely is not providing much, if any, useful performance benefit today  11:42
<gibi> sean-k-mooney: the lru cache decorator probably stores info on the func def, so that is also a global, just a hidden one. Still needs the same reset logic as a pure module level global  11:44
<gibi> I'm not sure which I like more, but I guess in this case a pure global is easier to notice  11:44
<sean-k-mooney> gibi: yes it would, but i just prefer using a library function for caching over inventing it ourselves  11:44
<sean-k-mooney> anyway im not suggesting we use caching here  11:45
<gibi> yepp, reinventing-the-wheel is on the other side of the compromise  11:45
<gibi> agree, only add a cache if we see slowness  11:45
<sean-k-mooney> for what its worth, where we do have our own caching i like the pattern we have in the compute rpc module https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/compute/rpcapi.py#L43-L50 and a few others  11:46
<sean-k-mooney> e.g. where we provide a reset_globals function in the module to handle that  11:47
<sean-k-mooney> as an external caller i dont need to look at the details then  11:47
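
The reset_globals pattern referenced above, plus the lru_cache alternative gibi mentions, in rough form (the cached value and constructor are illustrative, not the actual rpcapi code):

    import functools

    _CLIENT_CACHE = None  # explicit module-level global, easy to spot

    def get_client():
        global _CLIENT_CACHE
        if _CLIENT_CACHE is None:
            _CLIENT_CACHE = _build_client()  # hypothetical expensive setup
        return _CLIENT_CACHE

    def reset_globals():
        # external callers (e.g. test fixtures) reset via this hook
        # instead of poking at module internals
        global _CLIENT_CACHE
        _CLIENT_CACHE = None

    # the stdlib alternative keeps the cache hidden on the function object
    @functools.lru_cache(maxsize=None)
    def get_client_cached():
        return _build_client()

    # ...and tests then need the equivalent reset:
    # get_client_cached.cache_clear()
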
<sean-k-mooney> by the way, speaking of global state, there was a patch i wanted to flag to you  11:48
<sean-k-mooney> gibi: https://review.opendev.org/c/openstack/nova/+/804985  11:49
<gibi> looking..  11:50
<gibi> sean-k-mooney: I agree with your comment on ^^. I do think eventlet and python threads do not play nice together, and we have extensive separation between them on the compute side to prevent issues  11:52
<sean-k-mooney> their change would not break anything, but if they deployed the way we suggest, with only 1 thread and multiple processes, i dont think it would fix anything either, right  11:52
<sean-k-mooney> they would have to deploy in an "unsupported" configuration where they were using multiple threads currently, right?  11:53
<gibi> sean-k-mooney: if they use an oslo lock without external=True then the lock only helps with threads, not processes.  11:55
<gibi> and I do think we only support process based scaling  11:56
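
For context, the distinction gibi is drawing, sketched with oslo.concurrency (the lock name and path are illustrative):

    from oslo_concurrency import lockutils

    # serializes only threads within a single process
    @lockutils.synchronized('allocation-update')
    def update_within_process():
        ...

    # external=True adds a file-based lock, so it also serializes
    # multiple worker processes on the same host
    @lockutils.synchronized('allocation-update', external=True,
                            lock_path='/var/lock/nova')
    def update_across_processes():
        ...
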
<sean-k-mooney> looking at their response to my comment https://review.opendev.org/c/openstack/nova/+/804985/3#message-e6eddd2104d423b51d8c4beb14e3eaa7998cf177  11:56
<sean-k-mooney> im wondering if we are timing out on an io operation  11:57
<sean-k-mooney> rather than a race  11:57
<sean-k-mooney> i think im just going to ask them to file a bug. part of the problem is we dont know what version of nova this happens in  11:57
<gibi> sean-k-mooney: yepp, we need to find a way to see if they are really using threads and that is causing the problem, or if there is a problem even with eventlet  12:03
<sean-k-mooney> ok, i have asked them to file a bug report and linked them to the docs on our threading model in case that helps  12:05
<sean-k-mooney> gibi: im going to try and fix the pep8 issue in bauzas' mdev series and then split out the docs patch, as stephenfin requested, into its own patch at the end. are you ok to re-review them if i do that? bauzas should be back on monday or tuesday but i would prefer to move this forward if we can before hitting FF  12:15
<gibi> sean-k-mooney: sure, ping me and I will review it  12:17
<sean-k-mooney> oh, its a mypy failure: we declare we return a string but if we have an exception we return none. that explains why fast8 did not see it but pep8 does  12:22
<gibi> yes, mypy only runs in the pep8 target, not in the fast8 one  12:24
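
The failure mode sean-k-mooney describes, as a minimal sketch (the function names and helper are made up):

    from typing import Optional

    def get_mdev_class(dev: str) -> str:
        try:
            return lookup(dev)  # hypothetical helper
        except KeyError:
            return None  # mypy error: incompatible return value (None vs str)

    # the fix is to widen the annotation (or raise instead of returning None)
    def get_mdev_class_fixed(dev: str) -> Optional[str]:
        try:
            return lookup(dev)
        except KeyError:
            return None
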
<stephenfin> sean-k-mooney: small nits on https://review.opendev.org/c/openstack/nova/+/806394  12:28
<opendevreview> Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest  https://review.opendev.org/c/openstack/nova/+/806294  12:34
<sean-k-mooney> stephenfin: yep, just saw them come in. ill fix those up once i finish with bauzas' mdev series  12:38
* sean-k-mooney current todo list is mdev series -> db issue -> multi queue extra specs series -> day of learning project if i get time  12:40
<opendevreview> Merged openstack/nova master: [func test] move port resource request tests  https://review.opendev.org/c/openstack/nova/+/801815  12:56
<opendevreview> Merged openstack/nova master: [func test] create pps resource on OVS agent RP  https://review.opendev.org/c/openstack/nova/+/787205  13:00
<gibi> lyarwood, stephenfin: I'm not sure I get https://review.opendev.org/c/openstack/nova/+/800086/16/nova/tests/functional/test_servers_resource_request.py#b1669  13:02
<stephenfin> gibi: lyarwood is looking at your earlier statement that those tests are passing even though they wouldn't work in real life, and is saying the tests are therefore misleading  13:03
<stephenfin> and therefore it would be better to comment them out until such time as their real-world equivalent scenario would pass  13:03
<lyarwood> stephenfin / gibi: well, I'm saying removing them isn't the way to document that  13:04
<lyarwood> right  13:04
<lyarwood> just seems odd to add things, remove them because they are passing when they shouldn't be, and then later add coverage back in again  13:04
<gibi> stephenfin: ahh, so it is about the interface attach tests that started passing earlier than the actual implementation?  13:04
<lyarwood> gibi: right  13:04
<stephenfin> yup, iiuc  13:04
<lyarwood> it's a nit  13:04
<lyarwood> I shouldn't really -1 for it, it's just confusing as a reviewer tbh, as opposed to leaving them commented out with a note  13:05
<lyarwood> IMHO  13:05
<gibi> sorry, I got distracted by a call  13:21
<gibi> OK, I got confused because you marked code lines that are about booting VMs, not about interface attach  13:24
<gibi> anyhow  13:24
<gibi> those interface attach tests that are passing are not false positives  13:24
<gibi> those scenarios really work  13:24
<gibi> but those scenarios are error scenarios  13:24
<gibi> so they are pretty useless for the end user  13:25
<gibi> I will respin the patches to move the service version checks to the compute.api  13:26
<gibi> but I'm not sure what to do about the interface attach tests you commented on  13:26
<gibi> I cannot comment them out as they are in the base class, where they pass  13:26
<lyarwood> ah crap, right, sorry, ignore that then  13:27
<gibi> I can redefine them with an empty body and a comment, but that feels more confusing than actually doing nothing  13:27
<sean-k-mooney> you can boot with one ovs and one sriov port today  13:27
<gibi> sean-k-mooney: ignore the context of the comment; the comment is not about those test cases  13:27
<gibi> sean-k-mooney: it is about interface attach test cases that started to pass earlier than expected (and are therefore not mentioned in the code at all)  13:28
<sean-k-mooney> right, but im wondering why they were previously marked as expected failures  13:28
<sean-k-mooney> was that because of resource requests?  13:28
<gibi> sean-k-mooney: yepp,  13:28
<sean-k-mooney> ah ok  13:28
<gibi> sean-k-mooney: we had first patches to reject every operation with the new extended resource request format  13:28
<gibi> sean-k-mooney: then this patch adds the impl for them and removes the rejection  13:29
<sean-k-mooney> got it  13:29
<sean-k-mooney> ok, ill leave it for now, since i would have to start at the beginning of the series to review properly  13:29
<gibi> sean-k-mooney: yeah, it is a long one :)  13:30
<sean-k-mooney> well, its more that if i jump in mid-series without context i wont really understand what you're doing, so it wont be helpful  13:30
<sean-k-mooney> and i dont have time today, unfortunately, to load all that context  13:31
<sean-k-mooney> but if there is anything you want me to look at just let me know  13:31
<gibi> sean-k-mooney: no worries, I got good reviews from stephenfin and lyarwood on the series  13:31
<gibi> sean-k-mooney: it is better to spread our effort, like making the mdev series land as well  13:31
<sean-k-mooney> ack, yep, thats why im working on it now  13:32
<gibi> cool  13:32
<sean-k-mooney> im hoping unified limits can also make it  13:32
<gibi> have dansmith's comments been resolved there?  13:33
<dansmith> not that I've seen,  13:33
<sean-k-mooney> i think melwitt was working on that  13:33
<gibi> ack  13:33
<dansmith> and I think unified limits had a long way to go to be landable even before  13:33
<dansmith> i.e. it's still reaching into oslo library internals and such, last I saw  13:33
<sean-k-mooney> oh ok, melwitt raised it as possibly at risk for this cycle but i think she was hopeful it could still be ready in time  13:34
<dansmith> wow, okay I... would be surprised  13:34
<sean-k-mooney> ill take your word for it since i have not been following the details of it  13:35
<dansmith> I didn't see tempest tests for it,  13:35
<dansmith> but I definitely shook out some stuff from the glance implementation when I wrote tempest tests for it  13:36
<dansmith> I don't see docs, and I would think it needs docs  13:36
<sean-k-mooney> there are some https://review.opendev.org/q/topic:%22bp%252Funified-limits-nova%22+(status:open%20OR%20status:merged)  13:36
<sean-k-mooney> https://review.opendev.org/c/openstack/tempest/+/790186 and https://review.opendev.org/c/openstack/tempest/+/804311  13:36
<dansmith> I see gibi has a bunch of +2s over the patch I'm -1 on, but I think some of those have code that reaches into oslo.limit  13:37
<dansmith> ah okay, cool, glad to see those tempest patches  13:38
<dansmith> missed them being WIP I guess  13:38
<gibi> dansmith: as far as I understand, even if we land unified limits it will be a sort of experimental, optional feature  13:38
<gibi> there was no API impact so I considered it safe to experiment  13:38
<dansmith> mmmkay :)  13:39
<gibi> :)  13:39
<sean-k-mooney> https://www.youtube.com/watch?v=6wJXBUfcIOE  13:40
*** rpittau is now known as rpittau|afk  13:42
<dansmith> gibi: did you see my comment on this patch you have +Wd? https://review.opendev.org/c/openstack/nova/+/712139  13:43
<dansmith> I think those exceptions are wrong, no?  13:43
<stephenfin> gibi: I'm +2 on basically the whole QoS series now, bar the things you've already talked about and the last two patches (small issues there)  13:45
<gibi> dansmith: it could be a bit more chatty, yes, but ValueError tends to be used to point out which formal parameter got an unexpected value in the function call  13:46
<gibi> stephenfin: thanks, much appreciated  13:47
<dansmith> gibi: sure, but is that caught and logged appropriately?  13:47
<dansmith> gibi: usually that sort of thing ends up not as a trace but as a log line where you just see "ValueError: entity_type" and never know what was going on when it happened  13:48
<dansmith> but if so, cool  13:48
<gibi> dansmith: thanks for pointing that out. I dropped my vote. I have to go back and see how that error is propagated  13:48
<dansmith> gibi: I see there's a test for the specific string, so I guess it's intentional, but I would still like to know that it's turned into something useful :)  13:49
<gibi> dansmith: I agree with your concerns  13:50
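
dansmith's point in concrete form (illustrative code, not the actual patch):

    # a bare parameter name is all that survives into the log line:
    raise ValueError("entity_type")

    # a message with context is still testable but actually tells the
    # reader what went wrong:
    raise ValueError(
        "entity_type must be one of %s, got %r" % (valid_types, entity_type))
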
<dansmith> btw, is the plan really to just outright revert the consumer types stuff because of that bug?  13:53
<gibi> it felt for multiple cycles like this work was just sitting there waiting; I'm happy that now it got enough attention that we can make progress with it  13:53
<dansmith> gibi: definitely been sidelined and I'm very happy to see it moving. having just implemented this for glance I've also got context on it, which is helpful since nova's situation is obviously more complex  13:54
<sean-k-mooney> dansmith: i think we are going with the force flag instead of a full revert  13:54
<sean-k-mooney> dansmith: although melwitt has also proposed patches for the revert if we decide that is what we want to do  13:54
<dansmith> orly  13:55
<sean-k-mooney> sorry, melwitt's revert patches are for the delete of the allocation using put  13:55
<sean-k-mooney> is there a reason to revert the consumer types if the force flag patch merges  13:56
<dansmith> is the force flag safe?  13:56
<sean-k-mooney> gibi: that is already making its way through the gate, correct?  13:56
<dansmith> that conflict is specifically to avoid deleting or changing data you don't realize has changed, right?  13:56
<sean-k-mooney> dansmith: it is, but im not sure the original premise of this was valid  13:57
<gibi> sean-k-mooney: I had to respin due to a couple of missed tests  13:57
<gibi> sean-k-mooney: so it is not in the gate, but it has now passed the check queue  13:57
<dansmith> sean-k-mooney: premise of what, the force flag or the premise of the conflict?  13:57
<sean-k-mooney> the conflict on allocation delete  13:57
<sean-k-mooney> or rather the precedent in the non-soft-delete case  13:58
<sean-k-mooney> if we got a delete from the user i think that should have taken precedence over allocation changes in general  13:59
<sean-k-mooney> gibi raised a valid concern with soft delete which the force flag patch addresses  13:59
<sean-k-mooney> e.g. restore should take precedence over soft delete  13:59
<dansmith> the bug discusses a retry, which is what I expect to do when I get a 409 from placement  14:02
<dansmith> but the proposed solution is to force?  14:02
<sean-k-mooney> perhaps its better for gibi to summarise, or look at his patch, so i dont butcher the answer :)  14:03
<gibi> retry here means read the new allocation with the new generation and null it out in the response  14:03
<gibi> so for me that is logically the same as ignoring the current allocation and forcing the delete  14:04
<dansmith> null it out in the request?  14:04
<gibi> dansmith: yepp, the current code deletes an allocation with an empty PUT  14:04
<gibi> "empty" as there is the generation  14:04
<dansmith> okay, the current proposed patch does a delete with no generation right off the bat if force=true, never tries "nicely" once, AFAICT  14:05
<gibi> dansmith: correct  14:06
<sean-k-mooney> so i think doing an http delete is the correct thing if you are not using soft delete and the user requested the instance to be deleted  14:06
<dansmith> okay, confused about the "retry" terminology then :)  14:06
<sean-k-mooney> which is one of the cases where we will pass force=true  14:06
<gibi> dansmith: we can only do a meaningful retry if our original intention depends on the current state of the resource we are updating  14:07
<gibi> in many delete cases we don't care about the current state, we just want to delete  14:07
<dansmith> gibi: yeah, technically now we're just nulling out the allocations and PUTing the thing back as-is, and generation + retry would be the right thing to do in that case,  14:08
<dansmith> but I understand, if we're deleting the thing then maybe delete is the right answer  14:08
<dansmith> the thing I was saying above was that it seemed like you (all) were saying try the put, and if the put fails, delete to override  14:11
<dansmith> but I see that's not what the patch does  14:11
<sean-k-mooney> right, no, the patch adds a force flag and we are selectively setting it depending on which behavior we want  14:12
<sean-k-mooney> force=true to just delete regardless of state, and force=false if we want an opportunity to reconsider  14:12
<gibi> dansmith: if we know we will force on failure then what is the point of trying nicely first?  14:12
<dansmith> gibi: I'm not arguing that we should, I was just trying to understand the proposed new behavior  14:13
<gibi> ack  14:13
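
Roughly, the two code paths the force kwarg selects between, per the discussion above (schematic placement client calls, not the actual patch; the real change is https://review.opendev.org/c/openstack/nova/+/688802):

    def delete_allocation_for_instance(client, consumer_uuid, force=False):
        if force:
            # blind delete: no consumer generation involved, so it cannot
            # 409 on a concurrent update (e.g. user-requested instance delete)
            client.delete('/allocations/%s' % consumer_uuid)
        else:
            # "safe" delete: PUT an empty allocation set with the current
            # consumer generation; placement returns 409 Conflict if the
            # allocation changed under us (e.g. a racing restore)
            current = client.get('/allocations/%s' % consumer_uuid,
                                 version='1.28').json()
            client.put('/allocations/%s' % consumer_uuid,
                       {'allocations': {},
                        'consumer_generation': current['consumer_generation'],
                        'project_id': current['project_id'],
                        'user_id': current['user_id']},
                       version='1.28')
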
<dansmith> I guess I'm wondering why we did the PUT with empty allocations in the first place instead of delete  14:14
<sean-k-mooney> we used to do delete  14:15
<sean-k-mooney> we started doing put when we added generations  14:15
<dansmith> well, right :)  14:15
<sean-k-mooney> but that is why we also have the straight revert patches as an option to consider  14:15
<sean-k-mooney> melwitt, myself, and possibly others were wondering the same thing  14:15
<sean-k-mooney> the force flag patch is kind of a middle ground  14:16
<dansmith> it also seems super odd to have force=True as the default  14:16
<dansmith> I'm guessing that's just for expediency, but..  14:17
<sean-k-mooney> i think in most cases we really do just want to delete, not put  14:17
<sean-k-mooney> although calling it force is perhaps questionable in that case  14:17
<dansmith> right  14:17
<dansmith> inverting the logic to "extra_checks=False" or something if you opt into it might be good,  14:18
<dansmith> but again,  14:18
<dansmith> we're opting out of the force for things like move and soft delete,  14:18
<dansmith> which I kinda expect to exist in more places than actual instance delete  14:18
<dansmith> anyway, I didn't mean to come in here and throw a wrench in things, I just wanted to understand what was going on  14:18
<dansmith> I saw 409s in failed runs yesterday, but I assumed those were just retries and didn't give it much thought  14:19
<opendevreview> sean mooney proposed openstack/nova master: Provide the mdev class for every PCI device  https://review.opendev.org/c/openstack/nova/+/802918  14:19
<opendevreview> sean mooney proposed openstack/nova master: Provide and use other RCs for mdevs if needed  https://review.opendev.org/c/openstack/nova/+/803233  14:19
<opendevreview> sean mooney proposed openstack/nova master: Expose the mdev class  https://review.opendev.org/c/openstack/nova/+/801743  14:19
<opendevreview> sean mooney proposed openstack/nova master: [WIP] update vgpu docs to account for generic mdev support  https://review.opendev.org/c/openstack/nova/+/806412  14:19
<artom> Oooohhhh  14:21
* artom just fell in love with Gerrit v3  14:21
<sean-k-mooney> this will be good. why so  14:21
<artom> If something's changed between PSN and PSN+1 that's not actually part of the change itself, it'll show the diff in yellow, not green  14:21
<sean-k-mooney> ah, yes  14:22
<sean-k-mooney> it does  14:22
<dansmith> and sometimes blue, right?  14:22
<sean-k-mooney> the colours depend on light vs dark mode  14:22
<dansmith> I haven't figured out the pattern  14:22
<dansmith> ah  14:22
<artom> I've only seen yellow so far  14:22
<dansmith> yeah, probably the dark switch, I had one machine in dark the other day  14:23
<artom> Reading the scrollback, what *was* the logic behind PUTing empty allocations instead of outright DELETE?  14:23
<artom> We had consumer generations, so... just because we could?  14:23
<artom> Like, we're deleting, why do races matter?  14:23
<dansmith> artom: well, I'm kinda wondering if it's related to races during move and such  14:23
<dansmith> artom: we delete allocations for all kinds of reasons other than deleting an instance  14:23
<artom> So the idea is: before we delete, make sure we have the latest version so that we can then POST it to a new RP?  14:24
<dansmith> which is why it seems to me like maybe we should only force=true on instance delete, and default to not everywhere else, instead of what is proposed, which is always force  14:24
<sean-k-mooney> so just invert the default for force  14:25
<dansmith> we're not really checking the allocations though, so, probably not worth it  14:25
<dansmith> anyway, it just feels like we're removing a fence because it's currently getting in the way :)  14:26
<artom> Yeah - hence the question - why did we put the fence there in the first place?  14:26
<dansmith> well, generations went in to avoid us clobbering allocations all the time  14:27
<dansmith> during things like moves  14:27
<sean-k-mooney> artom: this is the original commit that added the put behavior https://review.opendev.org/c/openstack/nova/+/591597  14:28
<artom> So, I'm sure there's a legitimate reason for it, but I don't grok it. Even during a move operation, why would there be 2 bits of code trying to update allocations concurrently?  14:28
<sean-k-mooney> it has some motivation for it in the commit but not a full explanation  14:28
<sean-k-mooney> the main motivation i got from that was to intentionally be able to put the instance into error on delete if it was in an inconsistent state  14:30
<sean-k-mooney> to presumably give the operator or user a chance to investigate  14:30
<dansmith> originally, before we were using migration allocations, I think both sides fought over who was going to claim the resources and delete the other  14:30
<artom> sean-k-mooney, that's for instance deletion; dansmith is talking about moves needing generations  14:30
<dansmith> and of course same-host migrations made that hard  14:31
<artom> ... so we did a thing in placement because Nova was dumb?  14:31
<artom> Ah, resize to same host...  14:31
<sean-k-mooney> isnt that why placement was created :)  14:31
<dansmith> artom: no dude, placement was/is accessed by multiple services at once  14:31
<artom> Hah  14:31
<sean-k-mooney> but here this is a distributed lock problem  14:32
<sean-k-mooney> in that we dont have one  14:32
<artom> dansmith, right, I know that. So was this something like Nova updating the VCPUs or whatever, while Neutron concurrently tries to update something about the port?  14:32
<dansmith> well, generation is the lock  14:33
<sean-k-mooney> so generations were added ^ yep  14:33
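
"Generation is the lock" in the optimistic-concurrency sense; schematically (a made-up client, not nova code):

    def update_allocation(client, consumer_uuid, mutate, max_retries=4):
        # read-modify-write loop: send back the generation we read with the
        # write; placement rejects it with 409 Conflict if any other writer
        # updated the consumer in between, and we retry from a fresh read
        for _ in range(max_retries):
            current = client.get('/allocations/%s' % consumer_uuid).json()
            payload = mutate(current)
            payload['consumer_generation'] = current['consumer_generation']
            resp = client.put('/allocations/%s' % consumer_uuid, payload)
            if resp.status_code != 409:
                return resp
        raise Exception('allocation update conflict for %s' % consumer_uuid)
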
<dansmith> I'm trying to remember how this works, but is the generation actually on the consumer?  14:33
<dansmith> such that one instance moving between two RPs shares a generation?  14:33
<sean-k-mooney> i think the generation is on the allocation  14:33
<sean-k-mooney> i dont know if they share the same generation when we have both in a migration case  14:34
<dansmith> we don't when migration is holding the resources on the other,  14:35
* sean-k-mooney closes tabs... i have/had 4 tabs with different versions of the same patch open and it was confusing the hell out of me  14:35
<dansmith> but the generation is on the allocation for the consumer across both compute nodes (if you do it that way)  14:35
<dansmith> so even with both compute nodes being "separate" they both had to update the allocation for the instance,  14:35
<dansmith> to claim resources on incoming and drop the resources from the outgoing  14:35
<dansmith> but this delete is actually _deleting_ the allocation  14:36
<dansmith> so this is from after that concept of a single allocation against both providers  14:36
<sean-k-mooney> ah, so we dont have 2 independent allocations  14:37
<dansmith> right  14:37
<sean-k-mooney> i thought we used the migration uuid to create an independent allocation  14:37
<dansmith> oh we do *now*,  14:37
<sean-k-mooney> ah ok  14:37
<dansmith> I'm saying before we did that, we were creating allocations for the instance against both computes  14:38
<sean-k-mooney> you were describing your initial implementation  14:38
<sean-k-mooney> before you refactored it  14:38
<dansmith> which required both to modify it during/after the migration  14:38
<dansmith> not mine, but yes  14:38
<dansmith> mine was the "use migration allocation" change  14:38
<sean-k-mooney> right, ok, that aligns with what i recall  14:38
<sean-k-mooney> so you think this was really only needed before the move to separate allocations?  14:39
<sean-k-mooney> or do you think its still useful now, the put instead of delete  14:39
<dansmith> I think the PUT came after we were already using two separate allocations, but not positive  14:40
<sean-k-mooney> for migrations, that is  14:40
<sean-k-mooney> ok, im not sure of the timeline but that sounds right  14:41
<dansmith> gibi wrote that initial patch you linked above to move *away* from delete specifically to get generation stuff,  14:41
<dansmith> and it was part of "nested allocation candidates"  14:41
<sean-k-mooney> the migration allocation change was a long time ago, relatively speaking  14:41
<dansmith> right  14:41
<sean-k-mooney> oh, it was for nested resource providers  14:41
<sean-k-mooney> so maybe for port resource requests?  14:42
<dansmith> well, that's kinda what I was alluding to above,  14:42
<dansmith> with multiple things accessing here  14:42
<sean-k-mooney> ack  14:42
<dansmith> like does PUT {} avoid nuking a sub-allocation that neutron hasn't yet cleaned up?  14:43
<sean-k-mooney> i dont think we technically support neutron modifying the allocation directly  14:43
<sean-k-mooney> i think nova always handles that on its behalf  14:43
<sean-k-mooney> but valid question  14:43
<dansmith> okay, I thought in some cases it would.. wasn't that the point of that blueprint?  14:44
<dansmith> I wasn't really involved in it so I dunno  14:44
<gibi> it was definitely before port resource requests, as nested a_c is a prerequisite for that  14:44
<sean-k-mooney> so there was the open question about modifying QoS policies for bound ports  14:44
<sean-k-mooney> which could change the allocations  14:44
<gibi> the only case where neutron modifies the allocation ^^  14:44
<sean-k-mooney> but i dont think we support that yet, do we  14:44
<gibi> we do  14:45
<sean-k-mooney> oh ok  14:45
<gibi> you can change the QoS policy on a bound port; neutron will try to amend the instance allocation according to the new policy  14:45
<sean-k-mooney> so ya, changing the bandwidth or pps qos amount on a bound port would be the only time neutron could modify the allocation, i think  14:45
<gibi> yepp, I think so too  14:45
<dansmith> okay, but this was for nested A-C, meaning another allocation against another provider coming back from placement, not any sort of actual nesting of allocations as stored, right?  14:45
<sean-k-mooney> yes, i dont think there is really a concept of nested allocations  14:46
<gibi> dansmith: right, we dont have sub-allocations, we have multiple RPs in an allocation  14:46
<sean-k-mooney> just a single allocation with resources from multiple RPs today  14:46
<dansmith> gibi: so you wrote this initial patch to move from delete to put specifically for the "benefit" of generations.. are you thinking that was wrong now?  14:47
<gibi> dansmith: I tried to figure out this morning the reason why I had that patch  14:48
<gibi> the only thing I know is that bumping the placement api to use nested a_cs meant that we got generations as well  14:48
<gibi> so we had to do something with generations  14:48
<gibi> i.e. the nested a_c support was in a bigger placement microversion than the generation support  14:49
<gibi> 1.28 and 1.29  14:50
<gibi> if the request from the user is to delete the instance, but that deletion fails due to a parallel operation, then I now think that the delete operation should be the one that wins  14:51
<dansmith> I don't think this patch was required just because we did generations and nested a_c, because you're updating the version you pass there,  14:52
<gibi> on the other hand, if the user requested a restore of a soft deleted instance and the timer on the soft delete -> hard delete is racing with that request, then the restore should win  14:52
<dansmith> and each call uses whatever version it wants, right?  14:52
<sean-k-mooney> gibi: microversions are per call though  14:53
<dansmith> right  14:53
<gibi> right  14:53
<gibi> I think the warning at the end of https://docs.openstack.org/placement/latest/placement-api-microversion-history.html#consumer-generation-support shows that at that time we were pretty serious about using generations strictly  14:54
<sean-k-mooney> gibi: for the soft delete case we are expecting that this will help because? restore will update the allocation generation and the periodic will fail?  14:55
<sean-k-mooney> for soft delete i assume we continue to claim the resources with the allocation until the actual hard delete happens  14:56
<sean-k-mooney> so would there be a need to update the allocation at all  14:56
<sean-k-mooney> for either soft delete or restore  14:56
<gibi> sean-k-mooney: if you are right then there could be no race between restore and soft -> hard delete  14:56
<sean-k-mooney> e.g. will the generation actually change  14:56
<gibi> btw the microversion history doc states  14:57
<dansmith> gibi: I haven't mapped it out, but is there some reason we can't specifically opt into the nuke-and-destroy behavior when we're deleting an instance and still fail for other cases?  14:57
<gibi> "Passing an empty allocations object along with a consumer_generation makes PUT /allocations/{consumer_uuid} a safe way to delete allocations for a consumer."  14:57
<gibi> dansmith: you mean the user should be able to opt into the force?  14:57
<dansmith> no  14:58
<sean-k-mooney> dansmith: i think the reason is that technically deletes are not allowed to have request bodies and we did not want to use a query arg  14:58
<dansmith> you're force=True by default  14:58
<sean-k-mooney> oh, you mean the force default  14:58
<dansmith> yes  14:58
<gibi> dansmith: so just inverting the default  14:58
<gibi> dansmith: that is OK to me  14:58
<sean-k-mooney> we can set that to false and just pass true for a user delete  14:58
<sean-k-mooney> ya, im fine with inverting too  14:58
<gibi> I think at some point melwitt stated that she used force=True as the default as that was the more frequent call  14:59
<dansmith> right, I think that's better signaling, and also better for whoever comes along later and uses that to delete an allocation  14:59
<dansmith> I think making the dangerous activity explicit is worth the verbosity  14:59
<sean-k-mooney> i think its pretty close either way  14:59
<sean-k-mooney> in terms of calls  15:00
<dansmith> ack  15:00
<dansmith> gibi: sean-k-mooney: do we know why the consumer types thing suddenly means we need to do this, btw? I imagine the answer is just "more chance for the generation to change" but.. why specifically?  15:40
<gibi> dansmith: exactly, the consumer_types series merged the two transactions handling the allocation update into one, causing a longer transaction and hence a longer race window  15:42
<dansmith> really? I wouldn't expect that to be meaningfully long  15:43
<dansmith> as in, I would expect people running this in a larger cloud with lots of stuff going on to have hit this by now, and not just in a gate run with few instances at any one point  15:44
<dansmith> I saw that analysis from melwitt in the bug and wasn't so sure  15:44
<dansmith> but I guess if it's really related to that change only then that has to be it  15:44
<gibi> there is a correlation between the consumer_type series landing and the uptick in the conflict failures in the gate  15:45
<gibi> I know correlation is not causation but it still feels relevant  15:46
<gibi> and the consumer_type series had this transaction logic change  15:46
<dansmith> hrm  15:46
<gibi> i have no better explanation at the moment  15:46
<dansmith> ack  15:47
<sean-k-mooney> dansmith: sorry, was on an internal call  15:53
<sean-k-mooney> ah, no, i know there is a correlation but not why  15:54
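
The transaction-window hypothesis, sketched generically with SQLAlchemy (the engine and statements are placeholders, not the actual placement queries):

    import sqlalchemy

    engine = sqlalchemy.create_engine('sqlite://')  # stand-in engine

    # before: two short transactions, each committing quickly
    with engine.begin() as conn:
        conn.execute(update_consumer_stmt)      # e.g. bump consumer record
    with engine.begin() as conn:
        conn.execute(replace_allocations_stmt)  # e.g. rewrite allocations

    # after: one merged transaction; atomically correct, but the window
    # between reading state and committing is now the combined duration,
    # giving a concurrent writer more time to bump the generation and
    # turn our write into a 409
    with engine.begin() as conn:
        conn.execute(update_consumer_stmt)
        conn.execute(replace_allocations_stmt)
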
<opendevreview> Balazs Gibizer proposed openstack/nova master: Support boot with extended resource request  https://review.opendev.org/c/openstack/nova/+/800086  16:11
<opendevreview> Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest  https://review.opendev.org/c/openstack/nova/+/806294  16:12
<opendevreview> Balazs Gibizer proposed openstack/nova master: Support move ops with extended resource request  https://review.opendev.org/c/openstack/nova/+/800087  16:13
<opendevreview> Balazs Gibizer proposed openstack/nova master: [func test] refactor interface attach with qos  https://review.opendev.org/c/openstack/nova/+/800088  16:14
<opendevreview> Balazs Gibizer proposed openstack/nova master: Support interface attach / detach with new resource request format  https://review.opendev.org/c/openstack/nova/+/800089  16:15
<opendevreview> Balazs Gibizer proposed openstack/nova master: [func test] move unshelve test to the proper place  https://review.opendev.org/c/openstack/nova/+/793621  16:17
<opendevreview> Balazs Gibizer proposed openstack/nova master: [nova-manage]support extended resource request  https://review.opendev.org/c/openstack/nova/+/802060  16:18
<opendevreview> sean mooney proposed openstack/nova master: Remove module level caching  https://review.opendev.org/c/openstack/nova/+/806394  16:19
<opendevreview> sean mooney proposed openstack/nova master: db: Handle parameters in DB strings  https://review.opendev.org/c/openstack/nova/+/805663  16:19
<opendevreview> Balazs Gibizer proposed openstack/nova master: Reno for qos-minimum-guaranteed-packet-rate  https://review.opendev.org/c/openstack/nova/+/805046  16:19
<opendevreview> Balazs Gibizer proposed openstack/nova master: Add force kwarg to delete_allocation_for_instance  https://review.opendev.org/c/openstack/nova/+/688802  16:30
<gibi> dansmith, sean-k-mooney, melwitt, lyarwood: based on the above discussion I inverted the force default value in ^^  16:30
<gibi> lyarwood, stephenfin: I fixed up the nits in the pps series. thanks for the valuable feedback  16:31
<gibi> and with that I end my week. o/  16:33
<sean-k-mooney> gibi: https://review.opendev.org/c/openstack/project-config/+/787523 has now merged so we should have the review priority label available, just an fyi  16:42
<gmann> finally :)  16:45
<sean-k-mooney> i think https://review.opendev.org/q/project:openstack/nova+Review-Priority:1 is how to use it  16:46
<sean-k-mooney> but we dont have any patches with it set  16:46
<sean-k-mooney> https://review.opendev.org/q/project:openstack/nova+Review-Priority:0 gets hits however  16:46
<gmann> or even this: https://review.opendev.org/q/project:openstack/nova+Review-Priority:1+label:Verified%253D1++NOT+label:Workflow%253C%253D-1  16:56
<gmann> sean-k-mooney: ^^  16:56
<sean-k-mooney> perhaps we will need at least one patch marked as a priority to test it  16:57
<gmann> it can be appended with NOT+owner:self  16:57
<gmann> sean-k-mooney: you can test the query with 0 priority  16:57
<melwitt> gibi: ack  16:58
<melwitt> dansmith: I didn't think changing to a single transaction would make that much of a timing difference either, but the consumer types patch series hit that bug far more often than anything else did. I didn't see anything else in those patches that could possibly be related, so I guessed it was the db transaction change. my guess might be wrong  17:22
<dansmith> I'm not saying you're wrong, I'm just surprised  17:22
<dansmith> and surprised lots of real deployments with much busier systems weren't having trouble before the consumer types patch made the transaction longer  17:23
<melwitt> I was surprised too and thought maybe it was a coincidence, but once it merged everything was hitting the bug  17:23
<dansmith> and if that's really it, I sure hope we're not in for crazy pain, if that made the transaction like waaaay longer or something  17:23
<melwitt> yeah, same. I haven't seen something like this before  17:24
<sean-k-mooney> i know that i have seen conflicts in ci jobs from update available resources before the consumer types change  17:37
<sean-k-mooney> so there were already some conflicts happening  17:38
<sean-k-mooney> i dont think it caused test failures but i have seen it in the logs  17:38
<dansmith> oh, definitely  17:43
<opendevreview> Ghanshyam proposed openstack/nova master: Convert features not supported error to HTTPBadRequest  https://review.opendev.org/c/openstack/nova/+/806294  17:57
<melwitt> yeah, I know it was occurring prior to consumer types; just saying that on the consumer types series it hit the bug often, and after it merged it's happening more often  18:28
<melwitt> and re: "yeah, same. I haven't seen something like this before" - I haven't seen grouping writes into a single db transaction cause a significant timing difference before  18:34
<dansmith> melwitt: it failed zuul, but I don't see any conflict messages in there, although it seems to be one of those "hung until timeout" sorts  18:53
<dansmith> maybe that multicell failure is known and some other pattern?  18:53
<melwitt> dansmith: you're talking about the failure on the force kwarg patch, right? I would expect fails on that not to be the conflict bug, and at a glance it looks like it is indeed something different  18:56
<melwitt> unrelated to that, this is a new error in the controller compute log on that run: Remote error: DBReferenceError (pymysql.err.IntegrityError) (1452, 'Cannot add or update a child row: a foreign key constraint fails (`nova_cell1`.`instance_info_caches`, CONSTRAINT `instance_info_caches_instance_uuid_fkey`  18:58
<melwitt> https://zuul.opendev.org/t/openstack/build/575784eb62da4365ab8a1942da0353bc/log/controller/logs/screen-n-cpu.txt#30972  18:58
<dansmith> I saw that too  18:59
<dansmith> maybe that dropped an update and that's why the test waited until timeout?  18:59
<melwitt> and on the instance from the failed tempest test: [instance: 99c8640e-cb21-4fa9-bb90-d3361201ce7c] Failed to allocate network(s): nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed  19:00
<melwitt> Aug 27 17:45:54.463587 ubuntu-focal-rax-ord-0026150414 nova-compute[111496]: ERROR nova.compute.manager [instance: 99c8640e-cb21-4fa9-bb90-d3361201ce7c] Traceback (most recent call last):  19:00
<melwitt> https://zuul.opendev.org/t/openstack/build/575784eb62da4365ab8a1942da0353bc/log/controller/logs/screen-n-cpu.txt#33831  19:00
* dansmith nods  19:00
<melwitt> the server failed to spawn. the info cache error was on a different instance  19:02
<melwitt> [instance: 217a81e4-b9cf-4e9e-97a9-edcb0fab8349] Can not refresh info_cache because instance was not found  19:04
<melwitt> trying to refresh the info cache on an instance that's gone.. that's odd  19:05
<melwitt> oh, it looks like a network event went to a compute/cell that the instance was not in, so when it tried to refresh the info cache, it was not found  19:25
<opendevreview> Merged openstack/nova stable/wallaby: Avoid modifying the Mock class in test  https://review.opendev.org/c/openstack/nova/+/805759  21:20
<opendevreview> Merged openstack/nova master: [func test] move port creation to the NeutronFixture  https://review.opendev.org/c/openstack/nova/+/787206  21:21

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!