Wednesday, 2022-08-31

*** dasm|off is now known as Guest164504:00
opendevreviewRajat Dhasmana proposed openstack/python-novaclient master: Add support to rebuild boot volume 2.94  https://review.opendev.org/c/openstack/python-novaclient/+/82716306:24
bauzasgood morning Nova07:05
Ugglagood morning07:22
bauzasmmm, I forgot to ask about your preferences for that *yet again* virtual PTG07:44
bauzasI mean, how many slots we should book07:44
bauzasI'm about to book 4 x 4 slots (Tues-Fri) but I'll cancel some if we agree 07:44
bauzasjust done, see the scheduled tracks https://ptg.opendev.org/ptg.html07:49
bauzasI can unbook the slots07:49
gibimorning07:50
gibibauzas: 4x4 works for me07:50
gibisean-k-mooney: thanks for testing the user data out. 07:50
bauzasgibi: we'll see how much we need once we have an agenda07:51
gibibauzas: sure07:51
gibisean-k-mooney, dansmith: in the light of the actual regeneration issue I agree to punt the user_data and land rebuild bfv first07:51
sean-k-mooney[m]I'm locally trying to fix the patch, by the way, as a follow-up patch07:52
sean-k-mooney[m]ok, what I was trying won't work, but I know what to do instead, so trying that now08:03
hkominosHi guys. Quick question. When overriding policies for nova, should one redefine the whole policy.json file, or can I just define the single policy that I want to override? My current policy file is empty {}08:34
bauzashkominos: no, you should just update what you need08:37
gibihkominos: you only need to add the rules that you want to override to the file08:37
gibibauzas: you won :)08:37
bauzas\o/08:37
hkominosthank you guys !08:40
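For reference: oslo.policy layers operator overrides on top of the in-code defaults, so the override file only needs the rules being changed. A minimal sketch of such a file; the rule name and check string below are illustrative, not a recommendation:

    {
        "os_compute_api:servers:show": "rule:admin_or_owner"
    }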
bauzasgmann: good spot for the WIP status on my election patch08:40
bauzasgmann: that's because I flagged the WIP status directly in Gerrit when I created the PS1 in Aug08:41
bauzasnow the patch is in active state08:41
sean-k-mooney[m]gibi: this is such a can of worms09:00
sean-k-mooney[m]I think to make this work we need to delete the old config drive first, but that means we need to deal with all the image backends separately since they don't have a rm() function09:01
bauzassean-k-mooney: a tl;dr about the problem ?09:02
sean-k-mooney[m]I can add one but I was hoping this would be simple, and then I remembered this should also work for lvm and rbd so I can't just delete the file09:02
sean-k-mooney[m]bauzas:  the user data update feature is trying to rebuild the config drive on hard reboot if you updated the user data09:03
sean-k-mooney[m]but that fails with a permission denied09:03
sean-k-mooney[m]I believe that is because the disk is owned by qemu09:03
sean-k-mooney[m]so nova can't update it, and this needs to work for rbd and lvm anyway, so I can't really just assume it's a file on disk and chown it09:04
sean-k-mooney[m]the user_data feature is blocking the BFV rebuild feature because of the microversion order09:05
sean-k-mooney[m]so I was trying to unblock that but we are just going to have to change the order09:05
sean-k-mooney[m]make BFV rebuild 2.93 and likely punt mutable user_data to AA09:06
sean-k-mooney[m]there are other issues with rpc/testing in the userdata patch that likely won't get addressed without a FFE09:06
bauzasagreed on flipping the microversions09:07
bauzasat least for not blocking bfv rebuild09:07
gibisean-k-mooney[m]: do we store the config drive in lvm / rbd? I thought it was always a file on disk09:07
gibibut agreed, if this is complicated then punt it09:08
bauzasfor userdata, we can't just tell "sorry you can't update your data because config drive" as this is a config-driven API behaviour09:08
bauzascould we have a separate flag that would say "I allow you to update your userdata" and leave the responsibility to the ops to enable it ?09:08
kashyapgibi: I think it's a file on the disk, too, the config-drive09:08
bauzaswith a caveat saying there could be perm issues with configdrive09:08
sean-k-mooney[m]gibi: good question, I believe we do, looking at the code, but I'll check. I know swap gets created on rbd09:09
sean-k-mooney[m]and this appears to be using the generic imagebackend code09:09
sean-k-mooney[m]but I'll confirm again09:09
sean-k-mooney[m]bauzas I don't think we should hack around this09:10
sean-k-mooney[m]the feature is currently broken if you use config drive and I think it's important that it works09:10
sean-k-mooney[m]especially since we added a standard trait for this09:11
bauzasthat's my point09:11
sean-k-mooney[m]we could simply not report that for the libvirt driver I guess09:11
bauzasin theory, we could only allow updating userdata if config drive isn't in use09:11
sean-k-mooney[m]so in this cycle no driver would report it so you could only use this if you did not have config drive09:11
sean-k-mooney[m]bauzas:  that works today i think09:12
bauzasbut I don't wanna leak a config-driven behaviour09:12
sean-k-mooney[m]it would not be config driven behavior09:12
bauzassean-k-mooney: then this is documentation 09:12
sean-k-mooney[m]we just set the compute capability trait to false for libvirt09:12
* bauzas needs to look at the patches09:12
sean-k-mooney[m]and then when config drive works we set it to true09:12
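A minimal sketch of the capability-to-trait mechanism being discussed: each virt driver declares a capabilities dict, and supports_* entries surface as COMPUTE_* traits on the compute node resource provider, so setting one to False in a driver simply suppresses the trait. The capability name below is hypothetical, not nova's actual user_data trait:

    # Sketch only: how a driver capability becomes a placement trait name.
    class LibvirtDriverSketch:
        capabilities = {
            # hypothetical capability, False until configdrive rebuild works
            "supports_mutable_user_data": False,
        }

    def capability_traits(driver):
        """Yield (trait, supported) derived from supports_* capabilities."""
        for cap, supported in driver.capabilities.items():
            # supports_mutable_user_data -> COMPUTE_MUTABLE_USER_DATA
            yield "COMPUTE_" + cap[len("supports_"):].upper(), supported

    for trait, supported in capability_traits(LibvirtDriverSketch()):
        print(trait, "reported" if supported else "not reported")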
sean-k-mooney[m]bauzas:  either way I'm not sure it's reasonable of us to ask whoami-rajat to wait any longer given the state of the user data patch09:13
bauzasagreed again09:13
bauzaswe should ask to flip the microversions09:13
sean-k-mooney[m]so flip the microversion and split the user data patch in two: set the capability trait to false in all drivers in the first patch, and then implement the config drive rebuild and rpc change in the second09:15
sean-k-mooney[m]although that still feels a bit wrong09:16
bauzasrpc change becaaaaause ?09:16
sean-k-mooney[m]because currently it uses a dirty flag in the instance system metadata to say the config drive needs to be rebuilt09:17
sean-k-mooney[m]and dansmith made a very valid point that that's a shadow rpc interface and we should have added it as a parameter to the reboot rpc call09:17
sean-k-mooney[m]so that was already requested as a followup09:17
gibiI'm OK to only allow user data update for non config drive instances in Zed. That is a good enough compromise. Also config drive is just half config driven, as it can also be requested via the API09:17
sean-k-mooney[m]yes it can, that's how I was testing09:18
sean-k-mooney[m]it can be forced via nova.conf but it's user requestable as you said09:19
gibias a side note I found a bug in the allocation candidate filtering in the PCI series.09:19
gibiso I think what is realistic there is to land the healing part 09:20
gibiup until https://review.opendev.org/c/openstack/nova/+/850468/2009:20
gibistephenfin: Sean is +2 til ^^ so if you have time :)09:20
sean-k-mooney[m]yes that was the partition I thought made sense if we did not land it all09:21
stephenfinokay09:21
sean-k-mooney[m]that's everything except the scheduler part09:21
gibialso in the light of the recent shadow RPC discussion I feel that storing data in InstancePCIRequest.extra_info is also shadowy in my series09:21
gibisean-k-mooney[m]: yepp, that already gives visibility of the PCI resource inventories in placement which is good to have09:22
sean-k-mooney[m]it would have been nice to have the split pools by PF change09:22
gibiwe could land that, that is not affected by the bug09:22
sean-k-mooney[m]that would be nice since it gives preference to PFs without VFs when you ask for a PF09:23
gibiit does not do any externally visible change09:23
gibiI'm not sure it is a real preference or just an ordering change09:23
gibiI have to look if we sort the pool09:23
sean-k-mooney[m]I always thought the way we allocated was more or less deterministic09:23
sean-k-mooney[m]at least it has been stable enough for us to use in the functional tests without intermittent failures09:24
gibiit is stable09:24
gibibut it might be stable not due to sorting09:24
gibibut due to simple iteration order of devices / pools09:24
sean-k-mooney[m]ack09:25
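A toy illustration of the point gibi and sean are converging on: allocation results can be perfectly stable without any sort, because Python lists and dicts iterate in insertion order, so whichever pool was created first always wins:

    # Stable-but-unsorted: the first matching pool wins in creation order.
    pools = [
        {"address": "0000:81:00.0", "dev_type": "type-PF"},  # created first
        {"address": "0000:81:00.1", "dev_type": "type-VF"},
    ]

    def first_pool(pools, dev_type):
        return next(p for p in pools if p["dev_type"] == dev_type)

    print(first_pool(pools, "type-PF")["address"])  # always 0000:81:00.0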
bauzassean-k-mooney: so I've read https://review.opendev.org/c/openstack/nova/+/816157/1509:25
sean-k-mooney[m]so I guess we can look at that again and confirm if it is of benefit09:25
bauzassean-k-mooney: can you request the patch split and the other microversion ?09:25
gibisean-k-mooney[m]: yes, I will look at the ordering09:25
bauzasI can leave some comments but I'm way behind09:25
gibiI have to drop for an hour now for an early lunch but I will be back after09:25
sean-k-mooney[m]bauzas:  sure 09:26
bauzasthanks09:26
bauzasI'll review quickly the bfv rebuild series09:26
bauzasboth you and gibi gave +2s but I'll take a quick glance anyway09:26
bauzasand I'll tell whoami-rajat to respin the API patch with 2.9309:27
gibibauzas: you could look at the viommu one too09:27
bauzasgibi: yup, on my list09:27
sean-k-mooney[m]https://review.opendev.org/c/openstack/nova/+/816157/15#message-f9bcfb9f6d0115f20223c4cd1e91d5c5428798d709:32
sean-k-mooney[m]done ^09:32
sean-k-mooney[m]whoami-rajat:  dansmith  do you have time to update the BFV series to use 2.93 and rebase to master09:32
bauzasthanks09:32
sean-k-mooney[m]it's too early for them but they should see it when they get online09:33
sean-k-mooney[m]gibi I think you are right by the way.  there is no preference; it's just that we create the pool with just the PF before we create the pool with the VF in those tests, so it's just down to PCI address/pool ordering I think.09:36
sean-k-mooney[m]we could add a preference but that is out of scope of the spec so don't worry about it09:37
sean-k-mooney[m]... it's annoying but I think I see how to fix configdrive09:45
sean-k-mooney[m]for RBD we do in fact import the local config drive into ceph and it will cleanly delete and recreate it if required09:45
sean-k-mooney[m]if I use a temporary path to generate the new config drive and implement import_file for the other backends then that should fix it09:46
sean-k-mooney[m]for flat/qcow that's just an atomic move to the normal location via privsep to work around the permission issue09:47
sean-k-mooney[m]for lvm it's not much harder to do but it's a little more work09:47
sean-k-mooney[m]so I'm still going to try and do that so that they have a reference patch for how to fix it09:48
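A rough sketch of the fix sean outlines above: build the replacement config drive at a temporary path that nova owns, then hand it to the image backend. import_file() is an assumed backend hook modeled on the RBD import behaviour he describes, and the else branch stands in for the privsep-assisted atomic move for flat/qcow:

    import os
    import tempfile

    def rebuild_config_drive(backend, build_iso):
        """Sketch only: regenerate a config drive without touching the live disk."""
        with tempfile.TemporaryDirectory() as tmpdir:
            new_drive = os.path.join(tmpdir, "disk.config")
            build_iso(new_drive)  # nova owns this file, so no qemu permission issue
            if hasattr(backend, "import_file"):
                # rbd/lvm style: the backend deletes and re-imports the image
                backend.import_file(new_drive, "disk.config")
            else:
                # flat/qcow style: atomic rename into place (assumes the same
                # filesystem; in nova the move would go through privsep to get
                # past qemu's ownership of the old file)
                os.replace(new_drive, backend.path)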
bauzassean-k-mooney: gibi: fwiw https://review.opendev.org/c/openstack/nova/+/820368/comment/bd35c4c9_0ad9dd3d/ I'm asking to update https://docs.openstack.org/nova/latest/user/support-matrix.html#operation_rebuild in the API patch10:03
sean-k-mooney[m]bauzas:  gibi  ok i have a working version of the config drive part10:11
sean-k-mooney[m]i can push it for review and then work on tests  to see what people think10:11
bauzasok10:14
bauzasI need to bail out for lunch but I'll be back later10:14
opendevreviewsean mooney proposed openstack/nova master: support configdrive rebuilding  https://review.opendev.org/c/openstack/nova/+/85535110:26
whoami-rajatsean-k-mooney[m], ack, on it10:44
opendevreviewRajat Dhasmana proposed openstack/nova master: Add support for volume backed server rebuild  https://review.opendev.org/c/openstack/nova/+/82036811:35
opendevreviewRajat Dhasmana proposed openstack/nova master: Add conductor RPC interface for rebuild  https://review.opendev.org/c/openstack/nova/+/83121911:35
opendevreviewRajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances  https://review.opendev.org/c/openstack/nova/+/83088311:35
whoami-rajatsean-k-mooney[m], bauzas gibi dansmith  ^^ rebased to 2.9311:37
bauzas++11:37
whoami-rajatonce this merges, i will change novaclient, openstackclient, tempest changes as well11:38
whoami-rajatbauzas, I wasn't sure what your ask was regarding the support matrix change, can you check if this is the change you intended ? https://review.opendev.org/c/openstack/nova/+/830883/29..30/doc/source/user/support-matrix.ini11:39
whoami-rajathttps://review.opendev.org/c/openstack/nova/+/830883/30/doc/source/user/support-matrix.ini11:39
bauzaswhoami-rajat: well, the status should be 'complete' for the libvirt drivers, right?11:45
whoami-rajatbauzas, yeah, there were too many, so wasn't sure, will update for all libvirt drivers11:48
bauzaswhoami-rajat: say 'unknown' for the ones you don't know11:49
bauzaslike ppc64, s390x and lxc11:49
whoami-rajatto be honest, I'm not sure about different architectures11:49
whoami-rajatack will do11:50
gibiI guess on those architectures where the libvirt virt driver supports boot from volume the rebuild will work too. at least I don't remember any arch dependent code in the rebuild patches11:52
sean-k-mooneythe arch should not matter11:53
sean-k-mooneythe virt types might11:53
sean-k-mooneyas i expect this to work for qemu and kvm11:53
sean-k-mooney we could list them as unknown but I strongly suspect it will work on aarch64 at the very least, or as gibi suggested anywhere where boot from volume is supported11:55
sean-k-mooneywhoami-rajat: we support attach and detach volume on all the kvm and qemu combinations11:57
sean-k-mooneydo you have any reason to believe that it would not work or would depend on the architecture11:58
gibiwe don't have a row in the matrix about boot from volume support11:59
sean-k-mooneyya I noticed that11:59
sean-k-mooneythis could be addressed in a follow-up patch yes11:59
gibiyes11:59
gibithe whole matrix thing is just doc12:00
gibithat can be done even after FF12:00
gibibefore RC112:00
gibior as a doc bugfix later12:00
whoami-rajatsean-k-mooney, I don't think so, if BFV works then this should work too as we're doing some attach/detach operations, and on the cinder side we're doing dd to copy image data to the volume12:00
whoami-rajatmore or less same as what we do in BFV12:01
sean-k-mooneyack that's what I thought would be the case12:01
whoami-rajatso should I just mark all kvm/qemu drivers as supported?12:01
sean-k-mooneyI think we could mark them complete unless we get a bug report saying it does not work12:01
sean-k-mooneythat's what I would do but bauzas might prefer unknown12:02
whoami-rajatack, will update that12:02
bauzasI'm not opinionated12:02
bauzasand I thought, given whoami-rajat was about to update his API change, it was OK to ask for this in the API change12:03
bauzasand not by a FUP12:03
sean-k-mooneyeither is fine12:03
gibiI'm OK with the patch as is if we go with a FUP, or I can reapply my +2 if it is respined12:03
sean-k-mooneyI was waiting for zuul to report before reviewing after the microversion change12:03
opendevreviewRajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances  https://review.opendev.org/c/openstack/nova/+/83088312:05
whoami-rajatdoes this look good? ^12:05
sean-k-mooneyyes i think so12:06
bauzas+2d12:06
sean-k-mooneylibvirt-lxc won't work since attach/detach volumes is missing but that's a nit12:07
sean-k-mooneyunknown is fine12:07
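For reference, the entries being edited in doc/source/user/support-matrix.ini look roughly like this; the section name and per-driver statuses below are illustrative of the format, not the final content:

    [operation.rebuild-volume-backed]
    title=Rebuild volume backed instance
    status=optional
    driver.libvirt-kvm-x86=complete
    driver.libvirt-kvm-aarch64=complete
    driver.libvirt-lxc=missing
    driver.libvirt-kvm-ppc64=unknown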
gibiwe need a +2+A on the first patch then the series will land12:09
sean-k-mooneydoing that now12:09
sean-k-mooneywas just reviewing it12:09
sean-k-mooneywhoami-rajat: thanks for fixing the typos and renaming the detach function12:10
whoami-rajatsean-k-mooney, np, I had to rebase so was just avoiding follow-ups12:10
sean-k-mooneywhoami-rajat: and sorry for the microversion hassle. stacking the changes was meant to prevent fighting for the same microversion but in this case it did not help12:11
opendevreviewTakashi Natsume proposed openstack/nova master: Add missing descriptions in HACKING.rst  https://review.opendev.org/c/openstack/nova/+/85305412:12
opendevreviewTakashi Natsume proposed openstack/nova master: Add a hacking rule for the setDaemon method  https://review.opendev.org/c/openstack/nova/+/85465312:12
* sean-k-mooney should really be on a downstream call12:12
whoami-rajatsean-k-mooney, I can understand there were some last minute decisions to be made, everything is fine as long as the changes get in so no problem :)12:12
* sean-k-mooney that is why I have been wearing this headset for 15mins in total silence12:12
opendevreviewRajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances  https://review.opendev.org/c/openstack/nova/+/83088312:36
whoami-rajat^ there was a functional test failure due to the method name change in compute manager, everything else is running fine12:37
sean-k-mooneyack I'll re-review after the downstream call12:38
sean-k-mooneyif gibi or bauzas don't do it before12:39
gibiI'm on it12:39
whoami-rajatthanks12:40
dansmithgibi: wanna hear something funny?13:42
*** Guest1645 is now known as dasm13:50
gibidansmith: sure13:51
dansmithgibi: I would have bet my lunch on user_data being in the set of things we don't query out by default and only load if required13:52
dansmithI remember extensive discussions about it around icehouse when all this was being done13:52
dansmithbecause of its size13:52
dansmithbut I think we're actually *always* pulling that out and *always* sending it over the bus right now, which is insane because it's almost never used13:52
dansmith*and* in any of the joined queries where we duplicate the instance fields (like why we stopped joining the metadata queries), we'd be duplicating up to 64k of user data in the DB response...13:53
sean-k-mooneydansmith: we had a cve related to that in the past13:53
sean-k-mooneythat should be fixed now13:54
dansmithsean-k-mooney: related to what?13:54
gibiI don't see any special casing for user_data so probably you are right we are always pulling it out13:54
sean-k-mooneypulling GBs of results into a join when you have large userdata 13:54
dansmithokay, then fixed how?13:54
dansmithbecause AFAICT, we're always pulling it13:55
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/758928/13:55
dansmithsean-k-mooney: that's metadata13:55
dansmithbut you're saying that because we were also loading user_data that was much larger I guess?13:56
sean-k-mooneyyep so I think we fixed the one case where we had a really large join as a result13:56
dansmithyeah so that patch will maybe avoid some duplication of it, but we don't need to be pulling it out of the DB except when we need it13:56
sean-k-mooneyyes that's also fair, we don't13:57
sean-k-mooneyI guess it would be nice to do as a performance enhancement in general13:58
gibidoes it also mean that we can save the updated user data on the compute side now?13:58
sean-k-mooneybut also possibly reducing the chance of another CVE13:58
dansmithwell, it could seriously cut down on our rabbit load13:58
sean-k-mooneywhere the user data is large yes13:58
dansmithgibi: yeah I assume that if that patch works, it works because somewhere in reboot the instance gets saved, probably due to state13:59
dansmithgibi: no test coverage of that though that I can see :)13:59
sean-k-mooneyunfortunately people that use heat tend to also abuse the userdata13:59
sean-k-mooneywe have had requests in the past to increase the limit even more13:59
gibidansmith: ahh yes, as we don't call save on the api side13:59
sean-k-mooneywe said no the last few times it came up13:59
dansmithsean-k-mooney: yeah, I bet people are bzipping stuff to keep it under the limit :D13:59
sean-k-mooneyoh by the way the limit of mediumtext is 16MB not 64KB but I think we limit to 64KB but now I want to go check14:01
sean-k-mooneywe use mediumtext in the db schema14:01
sean-k-mooneyi think we put the limit in the api14:01
dansmithwe are limiting in the api yeah14:01
dansmithat least in this patch14:02
sean-k-mooneyour api ref says Configuration information or scripts to use upon launch. Must be Base64 encoded. Restricted to 65535 bytes.14:02
sean-k-mooneywhich is what i expected14:03
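So the 64KiB cap is enforced by the API request schema rather than by the mediumtext column. A minimal JSON-Schema-style sketch of such a constraint (the field layout is illustrative, not nova's exact schema module):

    user_data = {
        "type": "string",
        "format": "base64",
        "maxLength": 65535,
    }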
dansmithbtw, I've seen this fail several times in the last day: https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/testr_results.html14:03
dansmithit complains of a missing host, which the test is unshelving to by name, so I assume that's a test bug or something14:04
sean-k-mooneyCompute host ubuntu-focal-rax-dfw-0030919238 could not be found.14:04
sean-k-mooneyya14:04
sean-k-mooneyso tempest config error maybe14:04
dansmithso anyway, on the user_data thing, I think it would be a good idea for us to en-lazy that by default and only load it in the api and metadata api by default and I bet we'll see some rabbit load relaxed14:05
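A condensed sketch of the lazy-load pattern dansmith is proposing, modeled loosely on how nova objects demand-load fields left out of expected_attrs; the class below is a simplified stand-in, not nova's Instance:

    class InstanceSketch:
        """Heavy fields load on first touch instead of at query time."""

        LAZY = {"user_data"}  # never joined or fetched up front

        def __init__(self, db_row):
            self._data = {k: v for k, v in db_row.items() if k not in self.LAZY}

        def __getattr__(self, name):
            data = self.__dict__["_data"]
            if name not in data:
                # extra DB round trip only when someone actually reads the
                # field, so user_data never rides the RPC bus by default
                data[name] = self._fetch(name)
            return data[name]

        def _fetch(self, field):
            return "<%s loaded lazily from the DB>" % field

    inst = InstanceSketch({"uuid": "fake-uuid", "user_data": "64k blob"})
    print(inst.uuid)       # cheap, already loaded
    print(inst.user_data)  # triggers the lazy load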
gmanndansmith: sean-k-mooney gibi Uggla I am also seeing nova-multi-cell failing consistently for the new test added for test_unshelve_to_specific_host14:06
gmannhttps://zuul.opendev.org/t/openstack/builds?job_name=nova-multi-cell&skip=014:06
dansmithack, I've probably rechecked 8 of that failure in the last 24 hours14:06
sean-k-mooneyubuntu-focal-rax-dfw-0030919243 is the compute, ubuntu-focal-rax-dfw-0030919238 is the controller14:04
gmannthis test was recently merged yesterday and was passing at that time14:07
gibiit is interesting as the test looks up the other compute by the service list14:07
dansmithmight be flaky or might not work on some cloud providers if there's a name weirdness?14:07
sean-k-mooneydansmith: so the name is correct, at least it's trying to unshelve to the controller14:07
sean-k-mooneyI wonder if there is FQDN stuff going on14:08
gibiubuntu-focal-rax-dfw-0030919238 doesn't feel weird14:08
gmannyeah14:08
dansmithgibi: I meant FQDN type things14:08
dansmithsean-k-mooney and I are scarred for life on FQDN problems :)14:08
sean-k-mooneyubuntu-focal-rax-dfw-0030919238 is what we see for the host value in the compute agent startup14:09
sean-k-mooneyso maybe hypervisor hostname14:09
sean-k-mooneyDEBUG nova.compute.resource_tracker [None req-49d84742-3abf-4111-8a83-e2c5e0e67661 None None] Hypervisor/Node resource view: name=ubuntu-focal-rax-dfw-0030919238 14:10
sean-k-mooneynope they all seem to line up14:10
sean-k-mooney odd14:10
dansmithdo we do a disabled or service liveness check before we let you unshelve there? maybe the compute is stuck and not updating its counter?14:11
gibiwhen we query the service we check that it is up and enabled14:12
gmanndansmith: you mean just before unshelve or while selecting the host? while selecting host we do check service is up and enable 14:12
gmannbut after that shelve happens and then unshelve, and at that time there is no check before unshelve14:12
dansmithgmann: yeah I mean whatever check we're doing that gives that error14:13
dansmithis there a tempest test that is disabling a compute by chance for testing?14:13
sean-k-mooneyit's probably worth checking the scheduler logs14:13
dansmithoh is this just the scheduler "is this host okay" check?14:13
gmannI do not think we have any such test of touching the compute services enable/disable 14:15
gibiit is not about service state14:15
gibithe unshelve fails on compute_node_get_all_by_host14:15
gibiso the compute is not in the DB14:15
gibihttps://github.com/openstack/nova/blob/master/nova/compute/api.py#L457714:17
gibithis where the unshelve fails14:17
sean-k-mooneyso it is actually trying to shelve and unshelve multiple times14:18
sean-k-mooneyit shelve offloaded and unshelved to the compute14:19
sean-k-mooneythen it shelved again14:19
sean-k-mooneyand failed to unshelve to the controller with the not found issue14:19
gibiyes14:19
gibifirst it unshelves back to where it was, then it unshelves to the other compute14:20
sean-k-mooneyand req-b3e7705d-a20b-42b3-8eb5-594181a096c8 is the request that failed14:20
sean-k-mooneywhich failed in the api right14:20
sean-k-mooneyit's not in the scheduler14:20
gibiyes14:21
gmannyes it is failing the 2nd time, on unshelve to another host14:21
gibiit does not fail in the scheduler14:21
gibiit fails in the compute api14:21
gibihttps://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/controller/logs/screen-n-api.txt14:21
gibihttps://github.com/openstack/nova/blob/master/nova/compute/api.py#L457714:21
gibiunshelve checks if the requested host exists14:21
gibibefore it goes to the conductor14:21
sean-k-mooneyby calling compute_node_get_all_by_host14:21
gibiit calls get_first_node_by_host_for_old_compat14:22
sean-k-mooneythat's on the main db, i.e. the cell db, right14:22
dansmithso that is looking up by the service, and getting the first compute node on the service14:22
dansmithnot by_host_and_node14:22
* dansmith tickles his "robustify compute node hostnames" spec14:23
sean-k-mooneyfor libvirt there should only be one but ya14:23
sean-k-mooneyif shelve were a thing for vmware or ironic that would not be ideal14:23
dansmithsean-k-mooney: right but this is very loose association which doesn't have FK constraints14:23
sean-k-mooneywhy is it hitting the cell db? it could look at the host mapping table in the api db right14:24
sean-k-mooneyor cell mappings table14:24
dansmitheh?14:24
dansmiththose just tell us which db the host is in14:24
sean-k-mooneyright but if all you care about is does it exist is that not enough14:24
dansmithnothing about whether it's really there or up or whatever14:24
gibisean-k-mooney: you have a point, so the compute-api does a cell DB query without targeting the cell first14:24
sean-k-mooneydansmith: I guess it could be down but I'm not sure this is checking that14:25
dansmithsean-k-mooney: or deleted14:25
sean-k-mooneyI would expect that to be checked by the scheduler to be honest14:25
dansmithgibi: these hosts are in the same cell though so if that wasn't working, we wouldn't have found the first one right?14:25
gibithis is the multicell job so I assume we have hosts in different cells14:26
sean-k-mooneyyes there are only two nodes in the job14:26
sean-k-mooneyso each is in a different cell14:26
dansmithmulticell meaning superconductor not actually one host per cell right?14:26
sean-k-mooneywell we can check but I thought this was 2 cells14:27
dansmithor is this really unshelving to different cells?14:27
dansmithcan we unshelve across cells?14:27
dansmithI thought not14:27
dansmithso maybe we're targeted at the cell where the instance is and not the cell where the host is that we're unshelving to?14:28
dansmithnot sure how that would have ever passed14:28
sean-k-mooneythe compute is not running a second conductor14:28
gibiI'm looking at the compute.api.API.unshelve and I think the context is not targeted14:28
sean-k-mooneyoh the controller is running both cell1 and cell2 conductors14:29
sean-k-mooneyand the super conductor14:29
sean-k-mooneyhttps://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/compute1/logs/etc/nova/nova_cell1_conf.txt14:30
sean-k-mooneytransport_url = rabbit://stackrabbit:secretrabbit@10.208.192.130:5672/nova_cell214:30
sean-k-mooneyso ya compute1 is nova_cell214:31
sean-k-mooneyand the controller is transport_url = rabbit://stackrabbit:secretrabbit@10.208.192.130:5672/nova_cell114:31
gibiI don't know why this ever passed the test14:31
dansmithI also don't know how unshelve could be using an untargeted context14:31
dansmithit should be pointing at cell0 all the time which would never work14:31
sean-k-mooneydansmith: did you say this just started failing recently?14:32
sean-k-mooneybecause we merged unshelve to host a while ago14:32
dansmithsean-k-mooney: it just recently merged I think gmann said14:32
sean-k-mooneythe tempest test?14:32
sean-k-mooneyif so I guess we don't run the multi-cell job on tempest14:33
gmanndansmith: sean-k-mooney gibi that passed in tempest-multinode-full-py3 job and I do not think we checked multi-cell job run for this14:33
dansmiththe base nova conf doesn't have a db connection pointing at cell0, which I assume is what the apis are using14:33
gmannyeah we do not run multi-cell job there and only checked tempest-multinode-full-py3 job passing it14:34
sean-k-mooneygmann: ack that makes sense14:34
gibiyeah nova-multicell did not run on the tempest patch14:34
sean-k-mooneyok so we have two things: one, we should skip this temporarily on the multi-cell job14:34
sean-k-mooneyand then fix the bug14:34
gibiyes14:34
dansmitha normal run should have the api pointing at cell0: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_8cf/831219/22/check/tempest-integrated-compute/8cfb267/controller/logs/etc/nova/nova_conf.txt14:34
dansmithbut that multicell run has nothing defined there14:35
sean-k-mooneywe at least should check we are not unshelving across cells for now, but that actually should work14:35
sean-k-mooneysince we use shelve for14:35
sean-k-mooneycross cell resize14:35
dansmithsean-k-mooney: it shouldn't work14:35
* gibi goes and files a bug14:35
dansmithbecause cross-cell migration does other stuff on top of the shelve14:35
sean-k-mooneyoh you're right14:35
gmannand cross-cell is not in that feature scope right? which was added in this cycle only.14:36
gmannsean-k-mooney: agree to skip it in multi-cell job14:36
sean-k-mooneyit was not explicitly, no14:36
sean-k-mooneyso we can just document the limitation I guess14:36
gmannbut it will be good to add in the document/release notes14:37
gmannyes14:37
sean-k-mooneydansmith: other than using multiple port bindings to test if the destination cell can bind the port, how much more does cross cell migration do over shelve? there is a bunch of stuff in the super conductor to copy the instance object between the cell dbs right14:38
dansmithyeah it moves the thing between DBs, that's the big thing14:38
sean-k-mooneywell for now I guess it's fine; the error could be improved but the api won't let you corrupt the db or anything like that14:39
dansmithso this multicell job has two cells and two computes, one compute per cell? it must be disabling live migration and lots of other stuff that would normally require multinode right?14:39
sean-k-mooneyya it's two nodes (controller and compute) and both nodes are in a different cell14:40
dansmithso it must disable a bunch of things14:40
sean-k-mooneyyep14:40
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/.zuul.yaml#L514-L57114:41
gmannyeah, live migration is disabled https://github.com/openstack/nova/blob/master/.zuul.yaml#L55414:41
sean-k-mooneythat's the definition14:41
dansmithokay14:42
gibifiled the bug https://bugs.launchpad.net/nova/+bug/198831614:42
gibias it blocks the gate I will set it to Critical14:43
dansmithalso, the api config I was looking at was on the compute, the controller is pointed at cell0 by default14:43
dansmithgibi: ++14:43
sean-k-mooneygmann: does the tempest test have a compute_feature_flag or will I just add it to the exclude regex14:43
gibiand I will push the skip14:43
gibito unblock the gate14:43
sean-k-mooneyoh ok14:43
sean-k-mooneyI'll leave it to you so14:43
gmannsean-k-mooney: exclude the test14:44
gmanngibi: ok, I was doing it but please go ahead14:44
gibiack, I'm on it14:45
dansmithsean-k-mooney and I can get our +2 hammers polished up14:45
gmanngibi: we can add test class name in case more test are added for this https://github.com/openstack/nova/blob/master/devstack/nova-multi-cell-exclude-list.txt14:45
sean-k-mooneyya UnshelveToHostMultiNodesTest14:45
sean-k-mooneyor tempest.api.compute.admin.test_servers_on_multinodes.UnshelveToHostMultiNodesTest14:46
sean-k-mooneyif we want to fully qualify it14:46
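The skip itself ends up as one regex line in devstack/nova-multi-cell-exclude-list.txt, along the lines of:

    # cross-cell unshelve is not supported, see bug 1988316
    ^tempest.api.compute.admin.test_servers_on_multinodes.UnshelveToHostMultiNodesTest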
gibidansmith: would the current code fail to find the compute even if it is in the same cell as the instance, as it does not target the cell?14:48
dansmithgibi: I think it has to be targeting the cell14:48
sean-k-mooneyit seemed to work for the host it started on14:48
gibiI don't see where it targets it14:48
dansmithI know it's not clear but it wouldn't be getting as far as it is if not14:48
sean-k-mooneyit's presumably targeting the source cell and not finding the dest host14:49
dansmithI confirmed it's using cell0 by default, which means it must be getting targeted14:49
sean-k-mooneyoh ok14:50
opendevreviewribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (db)  https://review.opendev.org/c/openstack/nova/+/83119314:52
opendevreviewribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (objects)  https://review.opendev.org/c/openstack/nova/+/83940114:52
opendevreviewribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (manila abstraction)  https://review.opendev.org/c/openstack/nova/+/83119414:52
opendevreviewribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (drivers and compute manager part)  https://review.opendev.org/c/openstack/nova/+/83309014:52
opendevreviewribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (api)  https://review.opendev.org/c/openstack/nova/+/83683014:52
opendevreviewribaudr proposed openstack/nova master: Bump compute version and check shares support  https://review.opendev.org/c/openstack/nova/+/85049914:52
opendevreviewribaudr proposed openstack/nova master: Add metadata for shares  https://review.opendev.org/c/openstack/nova/+/85050014:52
opendevreviewribaudr proposed openstack/nova master: Add instance.share_attach notification  https://review.opendev.org/c/openstack/nova/+/85050114:52
opendevreviewribaudr proposed openstack/nova master: Add instance.share_detach notification  https://review.opendev.org/c/openstack/nova/+/85102814:52
opendevreviewribaudr proposed openstack/nova master: Add shares to InstancePayload  https://review.opendev.org/c/openstack/nova/+/85102914:52
opendevreviewribaudr proposed openstack/nova master: Add instance.power_on_error notification  https://review.opendev.org/c/openstack/nova/+/85208414:52
opendevreviewribaudr proposed openstack/nova master: Add instance.power_off_error notification  https://review.opendev.org/c/openstack/nova/+/85227814:52
opendevreviewribaudr proposed openstack/nova master: Add helper methods to attach/detach shares  https://review.opendev.org/c/openstack/nova/+/85208514:52
opendevreviewribaudr proposed openstack/nova master: Add libvirt test to ensure metadata are working.  https://review.opendev.org/c/openstack/nova/+/85208614:52
opendevreviewribaudr proposed openstack/nova master: Add virt/libvirt error test cases  https://review.opendev.org/c/openstack/nova/+/85208714:52
opendevreviewribaudr proposed openstack/nova master: Add share_info parameter to reboot method for each driver (driver part)  https://review.opendev.org/c/openstack/nova/+/85482314:52
opendevreviewribaudr proposed openstack/nova master: Support rebooting an instance with shares (compute and API part)  https://review.opendev.org/c/openstack/nova/+/85482414:52
opendevreviewribaudr proposed openstack/nova master: Change microversion to 2.93  https://review.opendev.org/c/openstack/nova/+/85208814:52
opendevreviewBalazs Gibizer proposed openstack/nova master: Skip UnshelveToHostMultiNodesTest in nova-multi-cell  https://review.opendev.org/c/openstack/nova/+/85537814:52
gibigmann, dansmith, sean-k-mooney: here is the skip ^^14:52
sean-k-mooneydo we still use that file14:54
gibidansmith: in the rest api code we load the instance properly from its cell could that codepath alter the context too?14:54
sean-k-mooneyoh we do14:54
dansmithno, because we're doing scatter14:54
gmannyes14:54
dansmithso actually14:55
dansmithI think gibi  might be right and that the feature itself is broken14:55
dansmithbut I can't explain why it's working on single-cell, since we're targeted to cell0 by default14:55
dansmithit looks like we don't target until we hit superconductor14:55
dansmithhttps://github.com/openstack/nova/blob/cbc9b516fb5e6b079b57cc40380e1d730751cd6c/nova/conductor/manager.py#L974-L97514:55
dansmithso I can't explain why this ever works, because the api should have only checked for the host being present in cell0, which should never have any hosts14:56
gibiI think I see where we target the context14:56
gibi_get_instance_from_cell14:56
gibihttps://github.com/openstack/nova/blob/3862cfc649d0099971c91d17e6f8800d10712a26/nova/compute/api.py#L2867-L287014:56
gibithis actually has a side effect to alter the context14:56
dansmithah, yep, okay I missed that14:57
dansmithphew :)14:57
gibiso unshelve actually works within a cell14:57
gibiwe need to document that it does not work across cells and it needs to emit a better error14:57
dansmithyeah14:57
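A condensed sketch of the behaviour just worked out (a toy model, not nova's real code): a host lookup only succeeds if the request context has been pointed at the right cell database, which _get_instance_from_cell effectively does for the instance's own cell, and nothing does for the destination cell:

    # Toy model: each cell has its own DB; lookups see only the targeted one.
    CELL_DBS = {"cell1": {"controller"}, "cell2": {"compute1"}}

    class RequestContext:
        cell = None  # untargeted by default

    def target_cell(ctxt, cell):
        ctxt.cell = cell  # the side effect gibi points to
        return ctxt

    def compute_node_exists(ctxt, host):
        return host in CELL_DBS.get(ctxt.cell, set())

    ctxt = target_cell(RequestContext(), "cell2")   # instance lives in cell2
    print(compute_node_exists(ctxt, "compute1"))    # True: same-cell unshelve works
    print(compute_node_exists(ctxt, "controller"))  # False: cross-cell -> "not found"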
sean-k-mooneyya although that's probably a latent bug14:58
sean-k-mooneyi.e. if we use the old way of targeting an AZ14:58
sean-k-mooneyand that AZ was in a different cell14:58
sean-k-mooneyit likely would have failed in the same place14:58
dansmithlatent bug in terms of raising a useful error you mean?14:58
sean-k-mooneyyes14:58
dansmithyeah14:58
sean-k-mooneyit's just that now that we have a tempest test that hits it we are seeing it14:58
sean-k-mooneybut we probably were not doing cross-AZ unshelve in the multi cell job14:59
gibiI've updated the bug15:00
dansmithsean-k-mooney: can you +W the skip so it's headed in after tests?15:00
dansmithor gmann 15:01
gibiUggla: probably the above discussion about unshelve is interesting to you too ;) 15:01
gibiUggla: also https://bugs.launchpad.net/nova/+bug/1988316 :)15:01
gmanndansmith: gibi done15:03
gibigmann: thanks15:03
dansmithsweet15:03
gmannI think it's worth adding the multi-cell job on the tempest side too and checking the new tests for cross-cell things15:04
gmannI will do15:04
gibinah then we had the necessary gate block at FF week done15:04
gibicould have been worse15:06
gmann? did not get15:06
gmannyou mean we should not add in tempest gate ?15:06
gibinah I'm just joking15:06
dansmithI believe gibi is being funny :)15:07
gmann:)15:07
gibiso we tend to break the gate during FF as we are rushing15:07
dansmithbut in all seriousness, gmann adding multicell to tempest gate might be a bit heavy and nova-centric15:07
dansmithnot sure it's really necessary15:07
dansmithmaybe in experimental or something15:07
gmannyeah experimental is good idea15:07
bauzasadd this in periodic15:09
bauzasperiodic-weekly15:09
bauzasand then I'll check its state every week before the meeting15:09
sean-k-mooneydansmith: sorry, was getting a drink; yes I can15:09
bauzasbetter than experimental where not all of us would check it15:09
sean-k-mooneyoh gmann  already did15:10
gmannbauzas: with experimental we can check it during the new test addition. it runs in the nova gate as voting, so a periodic on the tempest side would not be much benefit. if anything is broken then the nova gate will catch it and inform tempest15:11
sean-k-mooneyI'm not sure periodic-weekly helps15:11
sean-k-mooneywe will notice it on the nova side before it runs15:11
gmannyeah15:11
bauzasokok15:11
bauzasmaybe I misunderstood15:12
dansmithright, periodic won't really help15:12
gmannbauzas: we are talking about adding the multi-cell job to the tempest experimental pipeline so that we can check the new tests (the migration/unshelve ones) in cross-cell envs15:12
bauzasaaaaah15:13
bauzasOK, then yes15:13
bauzasfine with me15:13
bauzasI was thinking it was a nova job 15:13
bauzaslgtm then to be experimental15:13
gmannbauzas: with the new change in RBAC (dropped system scope from policy), we do not need this blueprint change anymore. I have added the notes there https://blueprints.launchpad.net/nova/+spec/allow-project-admin-list-hypervisors16:29
gmannbauzas: please close/mark it not needed 16:30
gmannadded the same notes in etherpad too for your easy tracking https://etherpad.opendev.org/p/nova-zed-blueprint-status16:32
sean-k-mooney gmann I just set it to obsolete16:33
gmannsean-k-mooney: thanks16:34
bauzasricolin: I just said -1 for https://review.opendev.org/c/openstack/nova/+/830646 but if you don't have time today or tomorrow morning, lemme know16:58
bauzasricolin: we could just use a follow-up patch for my nits16:58
bauzasand I'll +216:58
ricolinbauzas: thanks, I will try to update it ASAP17:15
whoami-rajatdansmith, do you have any idea about the nova multi cell breaking constantly or is it just flaky?17:16
dansmithwhoami-rajat: there's a fix in the queue17:16
dansmithwhoami-rajat: no point in rechecking until that lands17:16
whoami-rajatah ok17:16
whoami-rajatwasn't aware17:16
dansmithit has been in check for hours, not even in the gate yet17:16
whoami-rajatdoesn't sound good17:17
dansmithwhoami-rajat: it's in the gate now17:30
whoami-rajatgreat17:30
whoami-rajatcan i have the link?17:30
dansmithit's 85537817:31
whoami-rajatack17:31
sean-k-mooneywhoami-rajat: since your change is approved, even if it's not merged by COB tomorrow at m3 it can be rechecked to land it17:56
sean-k-mooneyso we will keep an eye on it until it's landed17:56
sean-k-mooneywhoami-rajat: if you have a Depends-On from the novaclient and osc patches against the top patch we can review those in parallel17:57
whoami-rajatsean-k-mooney, thanks, just wanted to update the other project patches after this merges, guess it's safe to update the clients and tempest in parallel so will get to that17:57
whoami-rajatsean-k-mooney, yep, I will just update them quickly and let you know17:58
sean-k-mooneyyep, Depends-On will prevent them merging until the api change lands and then they can just be rechecked at that point17:58
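For reference, that linkage is just a Zuul-recognised footer in the client patches' commit messages, e.g.:

    Add support to rebuild boot volume 2.93

    Depends-On: https://review.opendev.org/c/openstack/nova/+/830883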
whoami-rajatack, sounds good18:00
opendevreviewRico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest  https://review.opendev.org/c/openstack/nova/+/83064618:14
opendevreviewRico Lin proposed openstack/nova master: Add traits for viommu model  https://review.opendev.org/c/openstack/nova/+/84450718:14
sean-k-mooneyricolin: you still have not addressed my feedback on https://review.opendev.org/c/openstack/nova/+/844507/1518:15
sean-k-mooneythe way you are doing the traits reporting is wrong18:15
opendevreviewRajat Dhasmana proposed openstack/python-novaclient master: Add support to rebuild boot volume 2.93  https://review.opendev.org/c/openstack/python-novaclient/+/82716318:16
whoami-rajatnovaclient: https://review.opendev.org/c/openstack/python-novaclient/+/82716318:18
whoami-rajatosc: https://review.opendev.org/c/openstack/python-openstackclient/+/83101418:18
whoami-rajatsean-k-mooney, ^18:18
sean-k-mooneyack ill go review those now18:19
whoami-rajatthanks!18:20
ricolinsean-k-mooney: okay, checking the comments again18:28
dansmiththat skip is about to pop19:31
dansmithnext ten minutes or so19:32
sean-k-mooneywhoami-rajat: I looked over the changes; novaclient looks good. minor issue with osc, and does the tempest change really need its own job? I would just add that test to tempest-full if cinder is deployed and nova is new enough19:43
sean-k-mooneynone of that is blocking the nova change but that's just my feedback19:43
sean-k-mooneyI'm going to go have dinner/breakfast19:43
opendevreviewMerged openstack/nova master: Skip UnshelveToHostMultiNodesTest in nova-multi-cell  https://review.opendev.org/c/openstack/nova/+/85537819:50
opendevreviewRico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest  https://review.opendev.org/c/openstack/nova/+/83064620:18
opendevreviewRico Lin proposed openstack/nova master: Add traits for viommu model  https://review.opendev.org/c/openstack/nova/+/84450720:18
ricolinsean-k-mooney: can you check if I got it right this time, thanks ^^^20:18
opendevreviewRico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest  https://review.opendev.org/c/openstack/nova/+/83064620:50
opendevreviewRico Lin proposed openstack/nova master: Add traits for viommu model  https://review.opendev.org/c/openstack/nova/+/84450720:50
*** dasm is now known as dasm|off21:40
opendevreviewMerged openstack/nova master: Add documentation and releasenotes for RBAC change  https://review.opendev.org/c/openstack/nova/+/85488223:27
*** efried1 is now known as efried23:49
