Friday, 2023-01-27

04:41 <opendevreview> Ghanshyam proposed openstack/nova stable/xena: DNM: testing tempest pin for stable/wallaby  https://review.opendev.org/c/openstack/nova/+/871800
04:51 <gmann> gibi: bauzas: updates on stable/wallaby gate: I have updated the devstack patch (depends-on) and it fixes the gate - https://review.opendev.org/c/openstack/nova/+/871798
04:52 <gmann> gibi: bauzas: but it unhides another bug on the devstack/grenade side, due to which the stable/xena grenade job started failing (with the devstack stable/wallaby tempest pin) - https://bugs.launchpad.net/grenade/+bug/2003993
04:53 <gmann> I have proposed the fix https://review.opendev.org/q/I5e938139b47f443a4c358415d0d4dcf6549cd085 and am testing it in https://review.opendev.org/c/openstack/nova/+/871800
05:58 <opendevreview> Ghanshyam proposed openstack/nova stable/xena: DNM: testing tempest pin for stable/wallaby  https://review.opendev.org/c/openstack/nova/+/871800
07:10 <opendevreview> Sahid Orentino Ferdjaoui proposed openstack/nova master: compute: enhance compute evacuate instance to support target state  https://review.opendev.org/c/openstack/nova/+/858383
07:10 <opendevreview> Sahid Orentino Ferdjaoui proposed openstack/nova master: api: extend evacuate instance to support target state  https://review.opendev.org/c/openstack/nova/+/858384
07:11 <sahid> ^ fixed merge conflict
07:11 <sahid> o/
07:22 *** blarnath is now known as d34dh0r53
07:47 *** bhagyashris|ruck is now known as bhagyashris
09:01 <bauzas> sahid: I have a busy morning, but I'll try to take a look
09:01 <bauzas> gmann: ack, thanks for the heads-up
09:02 <priteau> Would the nova team consider disabling the failing grenade job on stable/wallaby temporarily, to be able to merge the VMDK patch?
09:02 <bauzas> priteau: that's one of the options on the table
09:03 <bauzas> but before doing it, I need to correctly understand the problem
09:40 <opendevreview> Kashyap Chamarthy proposed openstack/nova stable/zed: libvirt: At start-up rework compareCPU() usage with a workaround  https://review.opendev.org/c/openstack/nova/+/871968
09:45 <gibi> gmann: those fixes look good to me, thanks for proposing them
09:46 <kashyap> Hm, /me lost track of these upstream backports from last July still :-( -- https://review.opendev.org/q/topic:bug%252F1982853
09:54 <opendevreview> Kashyap Chamarthy proposed openstack/nova stable/yoga: Add a workaround to skip hypervisor version check on LM  https://review.opendev.org/c/openstack/nova/+/851202
09:56 <opendevreview> Kashyap Chamarthy proposed openstack/nova stable/yoga: libvirt: At start-up rework compareCPU() usage with a workaround  https://review.opendev.org/c/openstack/nova/+/871969
09:57 <kashyap> elodilles: bauzas: This backport has been waiting for a while, can it be put through? - https://review.opendev.org/c/openstack/nova/+/851205
10:02 <bauzas> kashyap: done
10:02 <kashyap> Thx!
10:09 <elodilles> it won't merge, as the yoga patch has not been merged yet: https://review.opendev.org/c/openstack/nova/+/851202
10:09 <elodilles> bauzas: ^^^
10:09 <bauzas> voilà why I didn't +W before
10:09 <sahid> bauzas: no worries, thank you for your time!
10:09 <bauzas> -ETOOMANYREVIEWSONFLY
10:11 <elodilles> (as I see, you +W'd it once, but the patch wasn't cherry-picked from the latest, merged PS)
10:12 <kashyap> elodilles: Oh, yeah; the Yoga one is still waiting.  And are you saying the Xena cherry-pick is not correct?
10:13 <kashyap> Ah, you were talking about the _past_ ("wasn't").  Now it should be fine
10:21 <elodilles> kashyap: the cherry-pick needs to be done again, from the yoga patch to stable/xena
10:21 <kashyap> Duh, I thought I just did it ... /me face-palms and looks
10:21 <elodilles> (and the current xena patch won't merge, as it is not cherry-picked from the latest yoga PS)
10:22 <elodilles> kashyap: thx for fixing it
10:23 <kashyap> elodilles: Gonna cherry-pick from this Yoga commit to Xena: c07495d9d64dd0635d72fc7ff67d73a656a40d13
10:24 <elodilles> kashyap: yepp, that is the hash of the latest PS
10:29 <kashyap> elodilles: Hmm, I also need to backport another one before that (https://review.opendev.org/c/openstack/nova/+/845045)
10:29 <kashyap> For Xena, i.e.; /me goes to do it
10:31 <opendevreview> Kashyap Chamarthy proposed openstack/nova stable/xena: Add a workaround to skip hypervisor version check on LM  https://review.opendev.org/c/openstack/nova/+/851205
10:31 <opendevreview> Kashyap Chamarthy proposed openstack/nova stable/xena: libvirt: Add a workaround to skip compareCPU() on destination  https://review.opendev.org/c/openstack/nova/+/871975
10:31 <kashyap> elodilles: Hope that looks better --^
10:33 <elodilles> kashyap: yepp, looks good, that should be accepted by the backport validator job as well
10:33 * kashyap nods; thx
10:45 <opendevreview> Jorge San Emeterio proposed openstack/nova master: WIP: Dividing global privsep profile  https://review.opendev.org/c/openstack/nova/+/871729
10:47 <opendevreview> Jorge San Emeterio proposed openstack/nova master: WIP: Dividing global privsep profile  https://review.opendev.org/c/openstack/nova/+/871729
11:22 <opendevreview> Sofia Enriquez proposed openstack/nova master: Implement encryption on backingStore  https://review.opendev.org/c/openstack/nova/+/870012
11:51 <opendevreview> Jorge San Emeterio proposed openstack/nova master: WIP: Dividing global privsep profile  https://review.opendev.org/c/openstack/nova/+/871729
11:51 <opendevreview> Jorge San Emeterio proposed openstack/nova master: WIP: Moving privsep profiles to nova/__init__.py  https://review.opendev.org/c/openstack/nova/+/872010
12:13 <opendevreview> Kashyap Chamarthy proposed openstack/nova stable/xena: libvirt: At start-up rework compareCPU() usage with a workaround  https://review.opendev.org/c/openstack/nova/+/872011
12:23 *** labedz__ is now known as labed
12:23 *** labed is now known as labedz
12:37 <elodilles> bauzas: speaking about the tons of things you have to review, I don't know whether you saw the 2023.2 Bobcat schedule plan: https://review.opendev.org/c/openstack/releases/+/869976
12:37 <elodilles> o:)
12:38 <auniyal> Hello o/
12:39 <auniyal> please review these patches improving logging at '_allocate_mdevs':
12:39 <auniyal>   - https://review.opendev.org/q/topic:bug%252F1992451
12:55 *** tosky_ is now known as tosky
13:18 <bauzas> elodilles: I've seen it, and I forgot to ask folks on Tuesday to look at this change
13:18 <bauzas> elodilles: during the nova meeting
13:18 <bauzas> elodilles: I'm OK with it fwiw, as we would have the Summit *before* spec freeze, which is nice
13:44 <opendevreview> Jorge San Emeterio proposed openstack/nova master: DNM: Testing check pipeline on master branch  https://review.opendev.org/c/openstack/nova/+/872018
13:52 <elodilles> bauzas: yepp, it's a fairly long cycle (28 weeks), so milestones have longer times
14:01 *** dasm|off is now known as dasm
14:12 <darkhorse> Hi team, is it possible to disallow a user from all compute resources using keystone policy or any other means? The use case is that I want to have multiple roles in a project: some users have compute permissions, some have network permissions, some have volume permissions, etc.
14:12 <darkhorse> I am not sure if this question is more relevant to the keystone channel.
15:54 <opendevreview> Alexey Stupnikov proposed openstack/nova stable/victoria: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/833436
16:01 <opendevreview> Merged openstack/nova stable/yoga: Add a workaround to skip hypervisor version check on LM  https://review.opendev.org/c/openstack/nova/+/851202
16:02 <opendevreview> Alexey Stupnikov proposed openstack/nova stable/ussuri: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/833437
16:02 <opendevreview> Alexey Stupnikov proposed openstack/nova stable/train: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/833438
16:03 <opendevreview> Alexey Stupnikov proposed openstack/nova stable/ussuri: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/833437
16:04 <opendevreview> Alexey Stupnikov proposed openstack/nova stable/train: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/833438
16:04 <opendevreview> Alexey Stupnikov proposed openstack/nova stable/train: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/833438
16:14 <sean-k-mooney> dansmith: can you take a look at https://review.opendev.org/c/openstack/nova/+/863919/12/nova/tests/unit/compute/test_resource_tracker.py#1553 when you have time
16:15 <dansmith> sean-k-mooney: yeah, it's on my list.. y'all drug me into a call at 6:30am and I have another starting in a few ;P
16:15 <sean-k-mooney> sorry about that
16:15 <dansmith> thanks for hitting the ones below, that's some progress for sure
16:16 <sean-k-mooney> so I'll try and test the rest when I get devstack deployed, and I'll loop back to them again later today if that works out
16:16 <dansmith> cool
17:10 <sean-k-mooney> cool, devstack stacked first time
17:11 <sean-k-mooney> now to actually use your patches
17:12 <sean-k-mooney> so currently I have an all-in-one, so when I check out your branch and restart the nova services, I'm expecting the compute to start up, write the uuid to a file, and update the service version, right?
17:16 <sean-k-mooney> yep, that worked https://paste.opendev.org/show/bLMvRfuUugp9iMcn4R60/
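[Editor's note: the persist-on-first-start behaviour being tested above can be sketched roughly as below. This is a simplified illustration of the pattern under discussion, not nova's actual implementation; the `compute_id` file name and the helper name are assumptions.]

```python
import os
import uuid


def get_or_create_node_uuid(state_dir):
    """Return this compute node's stable identity, generating and
    persisting a uuid on first start-up (simplified sketch)."""
    path = os.path.join(state_dir, "compute_id")
    if os.path.exists(path):
        # Later restarts reuse the persisted identity, independent
        # of whatever the hostname happens to be now.
        with open(path) as f:
            return uuid.UUID(f.read().strip())
    node_uuid = uuid.uuid4()
    with open(path, "w") as f:
        f.write(str(node_uuid))
    return node_uuid
```

Calling this twice against the same state directory returns the same uuid, which is the property sean-k-mooney's paste confirms: the service wrote the file on first start and bumped the service version.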
17:17 <sean-k-mooney> I'm going to break this in a few different ways and document it in an etherpad, and I'll let you know if I find anything odd or broken
17:25 <dansmith> sean-k-mooney: yeah, cool
17:29 <sean-k-mooney> https://etherpad.opendev.org/p/Stable-compute-uuid-manual-testing I'll be using that to keep my notes
17:30 <sean-k-mooney> there is not really much there yet, but I'll keep adding to it as I go
17:59 <sean-k-mooney> dansmith: found a bug
17:59 <sean-k-mooney> getting logs now
18:00 <dansmith> in "test 3 restart with deleted file"?
18:00 <sean-k-mooney> yep
18:00 <dansmith> does that mean deleted node id file?
18:00 <sean-k-mooney> I moved it to compute_id_old
18:01 <sean-k-mooney> the agent tried to add a duplicate row in the db, which resulted in a key error
18:01 <sean-k-mooney> and it wrote a new file
18:01 <sean-k-mooney> with a different uuid
18:01 <dansmith> well, right,
18:02 <dansmith> because it knows it's not an upgrade (now) and it didn't find a local compute node uuid
18:02 <dansmith> what do you think it should do there?
18:02 <sean-k-mooney> well, it should not result in a traceback in the log, for one
18:03 <sean-k-mooney> but it should see that the compute node exists in the db and not try to create a new one
18:03 <sean-k-mooney> and we should not write a new file to disk
18:03 <dansmith> but the whole point of this is that the file becomes the source of truth
18:04 <dansmith> so maybe we should see the failure to create with a keyerror as a "something's wrong, abort", but I bet it happens too late for us to abort
18:04 <sean-k-mooney> well, the file didn't exist, correct
18:04 <sean-k-mooney> so before we try to create a new compute node record, should we not check if one exists the old way
18:04 <sean-k-mooney> and abort then
18:05 <dansmith> we do, specifically on upgrade only
18:05 <dansmith> which you saw work
18:05 <dansmith> but once you have gotten past upgrade, it knows you're not upgrading, and can only really assume that it's a greenfield deployment
18:05 <dansmith> if we're going to stop relying on the hostname, there's really no other way, right?
18:06 <sean-k-mooney> no, if we have a compute node and the service is upgraded, then the file should exist
18:06 <sean-k-mooney> if it does not, then we know something odd happened
18:07 <sean-k-mooney> basically: if compute service >= 62? and no file -> error
18:07 <sean-k-mooney> for greenfield there won't be a compute node in the db
18:07 <dansmith> how do we know the difference between "already upgraded" and "greenfield"?
18:07 <dansmith> but again, the only way you know the "compute node in the db" is because you're relying on the hostname, which we should not be doing
18:08 <sean-k-mooney> or we can catch the duplicate key error from the db
18:08 <dansmith> depending on when the duplicate key happens, I'm fine aborting in that case for sure
18:08 <sean-k-mooney> we at least should not write the file with the wrong uuid
18:08 <dansmith> but I think we don't create that record until far too late
18:08 <sean-k-mooney> it's currently writing the one for the row that was rejected
18:08 <sean-k-mooney> probably because we set the singleton uuid
18:09 <dansmith> okay, I'm with you on not writing it if there's a keyerror, but I think it will be pretty hard to line those things up
18:09 <dansmith> so maybe delete it if we wrote it, or something
18:10 <dansmith> although, hang on
18:10 <dansmith> let's say I deploy two new computes with the same hostname by accident; they both generate and write unique uuids,
18:10 <dansmith> one will fail to create its compute node because of the clash,
18:10 <dansmith> but if I just rename the offending duplicate one, then the uuid it generated is fine to use when I restart
18:10 <sean-k-mooney> yep
18:11 <dansmith> the "I deleted the id" case seems pretty edge-y to me, because there are tons of things I can randomly delete from a running system that will break stuff
18:11 <dansmith> if we can abort on key conflict at start, then I'm on board, but otherwise, I dunno
18:11 <dansmith> this is a bit like saying I deleted /etc/ssh/ssh_host_key, and when I restarted it generated a new one and my clients all complain :)
18:12 <sean-k-mooney> honestly, I did this as my first test because I thought it could be a trivial thing that might happen, and we should prevent it
18:12 <sean-k-mooney> dansmith: I was more thinking about what happens if you fail to bind-mount this in a container
18:13 <sean-k-mooney> if the file is ever lost, I think it defeats much of the utility of the feature if we allow the agent to start
18:13 <dansmith> okay, but if you do, there's not much harm, because the resolution is to restart with it, and no harm no foul, right?
18:14 <dansmith> if you failed to bind-mount this, you likely also failed to bind-mount the instance images, no?
18:14 <sean-k-mooney> fair, it's in the state dir
18:14 <dansmith> in the pre-provisioned case, this goes in /etc/nova anyway
18:14 <sean-k-mooney> although it depends on the location
18:17 <sean-k-mooney> anyway, I'm going to keep testing/breaking it for a while and see what else I find
18:17 <sean-k-mooney> I just added the traceback
18:17 <dansmith> ack, I say we punt on that for the moment and see what else you can find
18:17 <dansmith> maybe we can circle back with some extra stuff that makes that smarter
18:18 <dansmith> at least catching keyerror and logging something that says "okay, here's how you've screwed up..."
18:18 <dansmith> because we have that trace right now, and it's not super obvious why
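[Editor's note: the "catch it and log something actionable" idea dansmith describes could look roughly like this. All names here — `DuplicateEntry`, `create_in_db`, the message wording — are illustrative assumptions, not nova's actual code.]

```python
import logging

LOG = logging.getLogger(__name__)


class DuplicateEntry(Exception):
    """Stand-in for the db layer's duplicate-key exception."""


def register_compute_node(node, create_in_db):
    """Create the compute node record, converting a bare duplicate-key
    traceback into an actionable operator error (illustrative sketch)."""
    try:
        create_in_db(node)
    except DuplicateEntry:
        # Instead of an opaque traceback, tell the operator what
        # probably happened and how to recover.
        LOG.error(
            "A compute node with hostname %s already exists in the db "
            "with a different uuid than our local one (%s). If the "
            "local compute_id file was lost, restore it with the uuid "
            "of the existing record rather than letting a new uuid be "
            "generated.", node["host"], node["uuid"])
        raise SystemExit(1)
```

On conflict the service aborts with a clear message rather than retrying forever with a generated uuid that can never be inserted.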
18:26 <sean-k-mooney> so I think the abort is broken in general
18:28 <dansmith> the abort on rename?
18:28 <sean-k-mooney> yep, so I tried restarting it with an incorrect uuid
18:29 <sean-k-mooney> and it just keeps trying to create the record with the incorrect uuid
18:29 <sean-k-mooney> which keeps failing
18:29 <dansmith> does the uuid exist in the db though?
18:29 <sean-k-mooney> I'll check, but I don't think so
18:29 <dansmith> then it is doing what it should do
18:29 <dansmith> because that's the pre-provisioned case
18:29 <dansmith> if you create an object in the db with that uuid but a different host, then it should trigger the rename detection
18:30 <sean-k-mooney> so I really think this is broken as-is
18:30 <sean-k-mooney> but I will try renames later
18:30 <dansmith> if you give it a uuid that doesn't exist in the database, how is it supposed to know that's not what you want it to use?
18:31 <dansmith> you've told it "this is your uuid, end of story", and if there's no object in the db with that uuid, it's going to try to create it
18:31 <dansmith> it should catch and log a better error than the trace, saying there's a name conflict or something, but otherwise it's doing the right thing I think
18:31 <sean-k-mooney> so I don't think we should be ignoring the host/hypervisor_hostname entirely, especially when we get the db duplicate key error
18:32 <dansmith> isn't that the entire point of this series? to break that as the link?
18:32 <sean-k-mooney> no
18:32 <sean-k-mooney> it's to make sure a host never changes its hostname or host value
18:32 <sean-k-mooney> that is not entirely the same thing
18:33 <dansmith> I totally disagree that that's the point of this series :)
18:33 <sean-k-mooney> we want to make sure we always have a stable way to map a host to the same compute node record
18:33 <sean-k-mooney> and that the host and hypervisor_hostname don't change
18:33 <dansmith> because I didn't even have the rename protection in there until late
18:34 <sean-k-mooney> well, this would not have prevented the issues our downstream customer had without the host/hypervisor host checks
18:34 <dansmith> it would, if they didn't delete the state file
18:35 <dansmith> like I said, if the uuid is actually a compute node in the db, but with a hostname that doesn't match, it will abort
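[Editor's note: the rename-detection rule dansmith summarizes — abort when the persisted uuid's db record carries a different hostname — can be sketched as follows. `db_lookup_by_uuid` is a hypothetical helper standing in for the real db query; this is a simplification of the behaviour described, not nova's code.]

```python
def check_for_rename(local_uuid, conf_host, db_lookup_by_uuid):
    """Decide what to do with our persisted node uuid at start-up
    (simplified sketch of the rename-detection rule discussed)."""
    record = db_lookup_by_uuid(local_uuid)
    if record is None:
        # Pre-provisioned / greenfield case: no record with that uuid,
        # so the service will try to create one.
        return "create"
    if record["host"] != conf_host:
        # The uuid exists but under another hostname: an apparent
        # rename, so refuse to start.
        raise RuntimeError(
            "compute node %s is registered with hostname %r but we are "
            "running as %r; refusing to start after apparent rename"
            % (local_uuid, record["host"], conf_host))
    return "reuse"
```

This also shows why sean-k-mooney's "incorrect uuid" test did not abort: an unknown uuid falls into the "create" branch, and only the eventual duplicate-key error on the hostname reveals the conflict.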
18:35 <sean-k-mooney> and you want to rely on them not doing something we told them not to do, while they are doing something else we told them not to do :)
18:35 <dansmith> all I want to do is make finding the compute node not tied to the hostname
18:36 <sean-k-mooney> ok, so that will just result in it failing with placement
18:36 <sean-k-mooney> because we will move the duplicate key to the resource provider creation
18:36 <dansmith> but we won't be recreating providers
18:37 <dansmith> we'll be accessing them by uuid, which hasn't changed, right?
18:37 <dansmith> the hostname may be wrong, and if something else claims the hostname, then it will fail to create one,
18:37 <sean-k-mooney> well, I'm about to start testing that stuff, so we will see
18:37 <dansmith> but the case we've had downstream was not that they *shuffled* their compute node names, but rather moved to a different naming schema, still no overlaps
18:37 <sean-k-mooney> setting it back to the correct uuid, and I'll change CONF.host and the hostname separately
18:38 <sean-k-mooney> anyway, my initial feedback is that this is not preventing as much as I was expecting it to
18:38 <dansmith> so here's the accepted spec: https://specs.openstack.org/openstack/nova-specs/specs/2023.1/approved/stable-compute-uuid.html#proposed-change
18:38 <dansmith> and I think that's covered here
18:39 <dansmith> it says that the uuid file will be what we use to find the compute node,
18:39 <dansmith> and we will detect compute node renames
18:40 <sean-k-mooney> right, and I kind of assumed "we will not introduce any new DB exceptions that prevent the resource tracker from working" would be an important point too
18:40 <dansmith> I really think that if you take out the "randomly deleted a key state file from the system" case, then it prevents quite a bit
18:41 <dansmith> how is this a new db exception? it's the same db exception as before if we try to create a conflicting compute node record
18:41 <sean-k-mooney> dansmith: well, if your release automation updated the uuid, it would cause the same failure I'm seeing
18:41 <sean-k-mooney> that was basically step 4
18:41 <sean-k-mooney> no
18:41 <sean-k-mooney> before, we looked it up by hostname
18:41 <sean-k-mooney> and would not have got a collision in this case
18:42 <sean-k-mooney> so we are trying to create a duplicate record that would not have been created before
18:42 <dansmith> okay, I'm getting frustrated
18:42 <dansmith> shall we take this to a gmeet?
18:42 <sean-k-mooney> sure
18:42 <sean-k-mooney> I'm not trying to frustrate you, by the way
18:42 <sean-k-mooney> just letting you know what I'm finding
18:42 <dansmith> meet.google.com/gkf-fdhr-wgr
19:26 <dansmith> sean-k-mooney: one other thing, we could also assert that if we're already upgraded *and* are not starting fresh, we could abort if the uuid file is missing
19:26 <dansmith> i.e. your didn't-bind-mount case
19:26 <dansmith> although,
19:27 <dansmith> if we handle that in the extra check we discussed, we can say "found X, expected Y", which will be the easy way for them to fix their stuff, even if X is empty
19:27 <sean-k-mooney> yes, I thought that was one of the things I said above.
19:28 <sean-k-mooney> yep
19:28 <dansmith> oh, maybe I was foaming at the mouth and missed it
19:30 <sean-k-mooney> I read over your comments on the persist change too, so I'm ok to proceed with that now based on what we discussed
19:31 <dansmith> ack
19:31 <sean-k-mooney> so the first 4 have +W and the first 2 are merged
19:32 <dansmith> thanks, I'll wait until those merge or fail before I push anything else up
19:33 <sean-k-mooney> ok, I'm going to see what else I can break or not break, and I'll review the last 3 on monday
19:53 <opendevreview> Merged openstack/nova master: Add get_available_node_uuids() to virt driver  https://review.opendev.org/c/openstack/nova/+/863917
20:04 <sean-k-mooney> dansmith: fyi, I'm going to bold the titles of the tests that look odd
20:05 <sean-k-mooney> but the rename logic does not detect a change in hypervisor_hostname today, as long as CONF.host does not change
20:06 <sean-k-mooney> https://etherpad.opendev.org/p/Stable-compute-uuid-manual-testing#L216
20:06 <dansmith> and that's because we get that from libvirt, yeah?
20:06 <sean-k-mooney> yep
20:07 <sean-k-mooney> so if the value of virsh hostname changes, then we update hypervisor_hostname in the db and rename the placement RP
20:07 <dansmith> and that's problematic why, just because cinder/neutron will be unhappy?
20:08 <sean-k-mooney> I'm sure that will break things like QoS or other nested resource providers, or at least how we discover them
20:08 <dansmith> that's technically out of scope of what I was trying to guard against in this set
20:08 <dansmith> and what if you change it back?
20:08 <sean-k-mooney> that said, if they also rename their RPs, then it might just fix itself
20:08 <sean-k-mooney> that's going to be the next test
20:08 <dansmith> okay
20:09 <sean-k-mooney> after that I'm going to try changing the value via /etc/hosts
20:09 <sean-k-mooney> I expect changing it back to revert all the changes
20:10 <sean-k-mooney> I probably should test this with nested resource providers from neutron, but I'll do that after all the simple tests
20:11 <sean-k-mooney> actually, I need to check a few db tables first
20:13 <dansmith> okay, so you're saying it updates placement but does *not* update our compute node, right?
20:13 <sean-k-mooney> it updates the cn table and placement; they are both kept in sync
20:13 <sean-k-mooney> I'm checking what the api db looks like
20:14 <dansmith> L219 says otherwise
20:15 <dansmith> or you meant "isn't detected to abort"
20:15 <sean-k-mooney> I think we should abort starting up if the hypervisor_hostname for a compute node changes, yes
20:16 <sean-k-mooney> at least for libvirt
20:16 <sean-k-mooney> but again, that's an extra check we can add to the end of the series
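[Editor's note: the extra check proposed here — refuse to start when the hypervisor-reported hostname (e.g. `virsh hostname`) no longer matches the stored compute node record, instead of silently renaming the record and its placement RP — could be sketched like this. It is a hypothetical guard under discussion, not current nova behaviour, and the record shape is assumed.]

```python
def check_hypervisor_hostname(record, reported_hostname):
    """Abort start-up on a hypervisor_hostname change rather than
    renaming the compute node and its placement resource provider
    (illustrative sketch of the proposed extra check)."""
    stored = record["hypervisor_hostname"]
    if stored != reported_hostname:
        raise RuntimeError(
            "hypervisor hostname changed from %r to %r; refusing to "
            "start rather than renaming the resource provider"
            % (stored, reported_hostname))
```

Aborting here would sidestep the delete-and-recreate path that later fails with a placement 409 once instances hold allocations on the provider.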
sean-k-mooneyso i proably should have a vm runnign while i do this as i think the instance.node is going to be out of sync 20:17
sean-k-mooneyso ill boot one before i fix it20:17
sean-k-mooneyand my expectation is that when i restor the old hostname the isntance.node will not be updated but placement and the compute nodes table will be20:17
sean-k-mooneythats proably the last test ill do tonight20:18
sean-k-mooneybut ill do more on monday.20:18
dansmithyeah, like I said, out of scope of what I'm trying to do, but probably should be in scope20:18
dansmithack20:18
sean-k-mooneyactully i need to test with provider.yaml too at some point20:19
sean-k-mooneywe can refernce the RP by name i belive which uses the hypervior_hostname20:19
sean-k-mooneywe can also refrrence it by UUID20:19
sean-k-mooneyso i suspect the uuid way will be fine but the other way might break20:20
sean-k-mooneyagain not nessiarly in scope but i want to know what happens20:20
sean-k-mooneyok so if there are allocation in placment we cant delete and rename the resouce provider becaues we get a 40920:41
sean-k-mooneyso that fails and we never update the compute node record either20:42
sean-k-mooneythe perodic is broken20:42
sean-k-mooneybut the agent does not exit and we just get tracebacks in the logs.20:43
sean-k-mooneyill leave it there for today and come back to this on monday20:44
20:46 <dansmith> wait,
20:46 <dansmith> I thought you said it *did* update the placement provider hostname?
20:47 <sean-k-mooney> it appeared to, but based on that log it deleted the RP and recreated it with the old uuid and new name
20:48 <sean-k-mooney> so it's not updating in place; it's trying to delete the orphan compute node and then create a new one
20:48 <sean-k-mooney> that, by the way, I think is ironic code
20:48 <sean-k-mooney> or rather, code we have in the common manager loop for ironic
20:50 <sean-k-mooney> if we detect hypervisor_hostname changes and abort, just like conf.host, we don't have to care about this
20:50 <dansmith> ah, because it had no instances?
20:51 <sean-k-mooney> right, so with no instances it deleted it fine
20:51 <dansmith> right, okay
20:51 <sean-k-mooney> the 409 conflict in placement is because of the allocation for the instance
20:51 <dansmith> so if no instances, maybe no harm to cinder and neutron?
20:51 <sean-k-mooney> it might still break the naming
20:51 <sean-k-mooney> but it might be ok in the no-instance case
20:52 <sean-k-mooney> I will configure bandwidth QoS or something next week and see
20:52 <sean-k-mooney> cinder, I don't think, will use placement at all right now
20:52 <sean-k-mooney> but cyborg could break
20:53 <sean-k-mooney> cyborg and neutron might need me to restack, so I'll test what I can before that
20:54 <sean-k-mooney> dansmith: in this particular case, this placement exception I think happened before your code
20:54 <dansmith> during reshape or something?
20:54 <sean-k-mooney> I have definitely seen this before, and it's what I was expecting if your code did not block it
20:54 <dansmith> so that sanity check of the hostname->node mapping generates 94 functional test failures
20:54 <dansmith> fml
20:55 <sean-k-mooney> no, I have seen this when people actually changed dns/hostname but had CONF.host set
20:57 <sean-k-mooney> we don't need to fix all these cases this cycle either; like, I know that test case 10 is a pre-existing failure mode
20:57 <sean-k-mooney> anyway, I'm going to go get food
20:57 <sean-k-mooney> don't spend your weekend on this o/
20:59 <dansmith> I shan't, you either
21:35 *** dasm is now known as dasm|off

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!