Wednesday, 2019-12-11

*** igordc has quit IRC00:13
*** sapd1 has joined #openstack-nova00:30
*** awalende has joined #openstack-nova00:31
*** mlavalle has quit IRC00:34
*** awalende has quit IRC00:36
*** jcosmao has left #openstack-nova00:36
*** tbachman has quit IRC00:46
*** tbachman has joined #openstack-nova00:51
*** zhanglong has joined #openstack-nova01:02
*** sapd1 has quit IRC01:02
*** gyee has quit IRC01:09
*** Liang__ has joined #openstack-nova01:12
openstackgerritMerged openstack/nova master: Tie requester_id to RequestGroup suffix  https://review.opendev.org/69694601:13
*** jmlowe has joined #openstack-nova01:32
*** zhanglong has quit IRC01:36
*** mmethot has quit IRC01:36
*** zhanglong has joined #openstack-nova01:38
*** mdbooth has quit IRC01:44
*** mdbooth has joined #openstack-nova01:46
*** mmethot has joined #openstack-nova01:47
*** larainema has joined #openstack-nova01:49
*** rcernin has quit IRC01:53
*** dave-mccowan has joined #openstack-nova02:00
*** jmlowe has quit IRC02:10
*** lvbin02 has joined #openstack-nova02:18
*** lvbin01 has quit IRC02:21
*** lvbin02 has quit IRC02:21
*** lvbin01 has joined #openstack-nova02:22
openstackgerritjichenjc proposed openstack/nova master: libvirt: avoid cpu check at s390x arch  https://review.opendev.org/69622802:30
*** awalende has joined #openstack-nova02:32
*** awalende has quit IRC02:37
*** nweinber__ has joined #openstack-nova02:44
openstackgerrithutianhao27 proposed openstack/nova master: Revert "nova shared storage: rbd is always shared storage"  https://review.opendev.org/68252302:45
*** rcernin has joined #openstack-nova02:54
*** dave-mccowan has quit IRC02:55
*** zhanglong has quit IRC03:07
*** nweinber__ has quit IRC03:18
*** munimeha1 has quit IRC03:28
*** mkrai has joined #openstack-nova03:28
*** factor has joined #openstack-nova03:31
*** nweinber__ has joined #openstack-nova03:36
*** psachin has joined #openstack-nova03:41
*** zhanglong has joined #openstack-nova03:54
openstackgerrithutianhao27 proposed openstack/nova master: Revert "nova shared storage: rbd is always shared storage"  https://review.opendev.org/68252303:55
*** tetsuro has quit IRC03:58
*** tetsuro has joined #openstack-nova03:59
*** ociuhandu has joined #openstack-nova04:04
*** nweinber__ has quit IRC04:06
*** ociuhandu has quit IRC04:08
*** awalende has joined #openstack-nova04:10
*** udesale has joined #openstack-nova04:13
*** awalende has quit IRC04:15
*** bhagyashris has joined #openstack-nova04:20
*** zhanglong has quit IRC04:46
*** dansmith has quit IRC05:10
*** dansmith has joined #openstack-nova05:14
*** tetsuro has quit IRC05:28
*** tetsuro has joined #openstack-nova05:31
*** shilpasd has quit IRC05:35
*** tetsuro has quit IRC06:04
*** tetsuro has joined #openstack-nova06:09
*** pcaruana has joined #openstack-nova06:09
*** tbachman has quit IRC06:19
*** tbachman has joined #openstack-nova06:21
openstackgerritMerged openstack/nova master: Switch to uses_virtio to enable iommu driver for AMD SEV  https://review.opendev.org/69669706:25
openstackgerritMerged openstack/nova master: Also enable iommu for virtio controllers and video in libvirt  https://review.opendev.org/68482506:25
*** avolkov has joined #openstack-nova06:39
*** dpawlik has joined #openstack-nova07:00
*** damien_r has quit IRC07:03
*** dpawlik has quit IRC07:04
*** dpawlik has joined #openstack-nova07:07
*** tosky has joined #openstack-nova07:09
*** lpetrut has joined #openstack-nova07:18
openstackgerritShilpa Devharakar proposed openstack/nova master: Handle new is_volume_backend join column query  https://review.opendev.org/69446207:18
openstackgerritShilpa Devharakar proposed openstack/nova master: Instance object changes for the new 'is_volume_backed' expected_attr  https://review.opendev.org/69446307:18
openstackgerritShilpa Devharakar proposed openstack/nova master: Ignore root_gb if instance is booted from volume  https://review.opendev.org/61262607:18
*** damien_r has joined #openstack-nova07:27
*** zainub_wahid has joined #openstack-nova07:30
*** gibi_off is now known as gibi07:44
* gibi is back07:46
gibireading the notifications while I was away, I feel like IRC's away message system doesn't work as intended: I noted there that I'd be back on Wednesday, but it seems this info was unknown to the folks on the channel07:47
*** slaweq has joined #openstack-nova07:47
*** bhagyashris has quit IRC07:54
*** belmoreira has joined #openstack-nova07:56
*** damien_r has quit IRC07:57
*** tesseract has joined #openstack-nova07:57
*** ociuhandu has joined #openstack-nova08:05
*** ociuhandu has quit IRC08:05
*** maciejjozefczyk has joined #openstack-nova08:06
*** bhagyashris has joined #openstack-nova08:06
*** ociuhandu has joined #openstack-nova08:07
openstackgerritEric Xie proposed openstack/nova master: Report trait 'COMPUTE_IMAGE_TYPE_PLOOP'  https://review.opendev.org/69813208:10
*** tkajinam has quit IRC08:12
*** ccamacho has joined #openstack-nova08:13
*** awalende has joined #openstack-nova08:20
*** damien_r has joined #openstack-nova08:21
*** pcaruana has quit IRC08:22
*** ociuhandu has quit IRC08:30
*** bhagyashris has quit IRC08:30
*** rpittau|afk is now known as rpittau08:36
*** links has joined #openstack-nova08:44
*** aloga has quit IRC08:45
openstackgerritGuo Jingyu proposed openstack/nova master: Make scheduling more debuggable  https://review.opendev.org/69842108:45
*** aloga has joined #openstack-nova08:45
*** ccamacho is now known as ccamacho|pto08:50
*** ociuhandu has joined #openstack-nova08:54
*** ralonsoh has joined #openstack-nova08:55
*** awalende has quit IRC08:59
*** ociuhandu has quit IRC08:59
openstackgerritGuo Jingyu proposed openstack/nova master: Make scheduling more debuggable  https://review.opendev.org/69842108:59
*** ociuhandu has joined #openstack-nova09:00
*** awalende has joined #openstack-nova09:01
*** bhagyashris has joined #openstack-nova09:01
*** iurygregory has joined #openstack-nova09:02
openstackgerritGuo Jingyu proposed openstack/nova master: Make scheduling more debuggable  https://review.opendev.org/69842109:03
*** jangutter has quit IRC09:03
*** jangutter has joined #openstack-nova09:04
*** ociuhandu has quit IRC09:05
*** zhanglong has joined #openstack-nova09:08
*** ociuhandu has joined #openstack-nova09:27
*** udesale has quit IRC09:29
*** udesale has joined #openstack-nova09:32
*** martinkennelly has joined #openstack-nova09:32
*** jangutter_ has joined #openstack-nova09:36
*** zainub_wahid has quit IRC09:37
*** dpawlik has quit IRC09:38
*** ccamacho|pto has quit IRC09:38
*** ccamacho has joined #openstack-nova09:38
*** jangutter has quit IRC09:39
openstackgerritBalazs Gibizer proposed openstack/nova master: Extend NeutronFixture to handle multiple bindings  https://review.opendev.org/69624609:42
*** derekh has joined #openstack-nova09:43
openstackgerritBalazs Gibizer proposed openstack/nova master: Do not mock setup net and migrate inst in NeutronFixture  https://review.opendev.org/69624709:43
openstackgerritBalazs Gibizer proposed openstack/nova master: Move _get_request_group_mapping() to RequestSpec  https://review.opendev.org/69654109:45
*** abaindur has quit IRC09:46
openstackgerritBalazs Gibizer proposed openstack/nova master: Move _update_pci_request_spec_with_allocated_interface_name  https://review.opendev.org/69657409:46
*** factor has quit IRC09:46
*** factor has joined #openstack-nova09:47
*** ttsiouts has joined #openstack-nova09:47
openstackgerritBalazs Gibizer proposed openstack/nova master: Support live migration with qos ports  https://review.opendev.org/69590509:48
*** rcernin has quit IRC09:50
openstackgerritBalazs Gibizer proposed openstack/nova master: Extend NeutronFixture to handle multiple bindings  https://review.opendev.org/69624609:53
*** farhanjamil has joined #openstack-nova09:54
*** ccamacho is now known as ccamacho|pto09:54
*** dpawlik has joined #openstack-nova09:54
openstackgerritBalazs Gibizer proposed openstack/nova master: Do not mock setup net and migrate inst in NeutronFixture  https://review.opendev.org/69624709:55
openstackgerritBalazs Gibizer proposed openstack/nova master: Move _get_request_group_mapping() to RequestSpec  https://review.opendev.org/69654109:56
openstackgerritBalazs Gibizer proposed openstack/nova master: Move _update_pci_request_spec_with_allocated_interface_name  https://review.opendev.org/69657409:58
openstackgerritBalazs Gibizer proposed openstack/nova master: Support live migration with qos ports  https://review.opendev.org/69590509:59
*** dtantsur|afk is now known as dtantsur10:06
*** farhanjamil has quit IRC10:07
*** vishalmanchanda has quit IRC10:09
*** vishalmanchanda has joined #openstack-nova10:10
*** Liang__ has quit IRC10:17
*** dpawlik has quit IRC10:19
*** dpawlik has joined #openstack-nova10:22
*** zhanglong has quit IRC10:25
*** salmankhan has joined #openstack-nova10:31
*** salmankhan has joined #openstack-nova10:32
*** belmoreira has quit IRC10:44
*** belmoreira has joined #openstack-nova10:45
*** belmoreira has quit IRC10:49
*** bhagyashris has quit IRC10:53
*** mkrai_ has joined #openstack-nova10:54
*** mkrai has quit IRC10:57
*** mkrai_ has quit IRC10:59
*** mkrai has joined #openstack-nova10:59
*** udesale has quit IRC11:04
*** mkrai has quit IRC11:04
*** bhagyashris has joined #openstack-nova11:07
*** awalende has quit IRC11:20
*** awalende has joined #openstack-nova11:21
*** dpawlik has quit IRC11:21
*** awalende_ has joined #openstack-nova11:23
*** ttsiouts has quit IRC11:24
*** awalende has quit IRC11:25
*** awalende_ has quit IRC11:27
*** arxcruz is now known as arxcruz|pto11:32
*** udesale has joined #openstack-nova11:35
*** pcaruana has joined #openstack-nova11:36
*** tbachman has quit IRC11:46
*** ttsiouts has joined #openstack-nova11:52
sean-k-mooneygibi: ya we did not get any away notifications11:53
sean-k-mooneydid you perhaps set yourself as away before changing your nick or something11:53
gibisean-k-mooney: I used /away <message> in my irssi, but I won't rely on that in the future11:54
sean-k-mooneythat is what i use but with weechat and it works fine11:54
sean-k-mooneyif you ping me now you should get a message11:54
gibiyeah, could be an ordering issue11:55
*** sean-k-mooney is now known as skm11:55
skmdid you get a message before11:55
*** skm is now known as sean-k-mooney11:55
sean-k-mooneyoh you didn't use my name11:56
sean-k-mooneygibi: anyway, shall i proceed with keeping the notifications for now?11:57
gibisean-k-mooney: I think you should add your new image meta to the payload11:58
gibijust to follow the pattern11:58
sean-k-mooneybut i am the only person that has done that since the notifications were added11:59
sean-k-mooneyand we have added lots of image properties since then11:59
sean-k-mooneyi also added the 1.1 version bump11:59
*** brinzhang has joined #openstack-nova12:00
*** dpawlik has joined #openstack-nova12:00
sean-k-mooneyso i can set and follow a new pattern, but if i were to follow the existing one i would ignore it12:00
gibiI opened the original patch that introduced the payload class, and at that time it contained everything, so the intention is clear to me based on that12:00
sean-k-mooneyyep which is why i added the property but we have not been doing that12:01
*** larainema has quit IRC12:01
*** awalende has joined #openstack-nova12:01
*** brinzhang has quit IRC12:02
sean-k-mooneyi can file a bug to bring them in sync and work on a patch to do that if that is what we want to do going forward12:02
gibihonestly I would not block your patch on this either way. It is pretty clear to me now that there are no active notification consumers out there that are willing to give us feedback on what they need12:02
*** brinzhang has joined #openstack-nova12:02
gibiso I won't be strict about the rules12:02
*** brinzhang has quit IRC12:03
sean-k-mooneywell it's more that efried was interested, since he was adding image properties too and wondering if he should be doing the same12:03
*** brinzhang has joined #openstack-nova12:03
sean-k-mooneyi was wondering if we should document the policy, if there is one, somewhere so that we can do the right thing12:04
*** brinzhang has quit IRC12:04
*** spsurya has joined #openstack-nova12:05
*** dpawlik has quit IRC12:05
gibiso we need to 1) decide what is the policy 2) make sure we can somehow enforce the policy12:06
gibifor 1) the original policy was to mirror image meta, but then later patches did not follow that policy12:06
sean-k-mooneywell we can just compare the fields dict of both ovos12:06
sean-k-mooneythey should always be in sync right12:07
*** brinzhang has joined #openstack-nova12:07
*** brinzhang has quit IRC12:07
gibithen we need a separate patch to re-sync the two12:07
sean-k-mooneyyes, i commented on the bug that i was not sure why it needed to be a separate object vs the nova.object.ImageMetaProps12:08
*** brinzhang has joined #openstack-nova12:08
sean-k-mooney*patch12:08
sean-k-mooneywas there a reason you made the payload object separate?12:08
*** brinzhang has quit IRC12:09
gibiin general we wanted to have separate objects for notification payloads so that we are not forced to push every piece of internal data out in the notification. But for this particular case, if we decide that we always want to push every image meta field in the notification, then sure, we don't need a separate payload class12:09
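The separate-payload-object pattern gibi describes can be sketched in a few lines; the class names, field names, and SCHEMA attribute below are made-up stand-ins for illustration, not nova's actual versioned objects:

```python
# Hypothetical sketch: the internal object carries everything, while the
# payload class whitelists which fields get pushed out in notifications,
# so internal-only data is not exposed. All names here are illustrative.

class ImageMetaProps:
    """Stand-in for the internal versioned object."""
    def __init__(self, **kwargs):
        self.hw_cpu_policy = kwargs.get('hw_cpu_policy')
        self.internal_only_hint = kwargs.get('internal_only_hint')

class ImageMetaPropsPayload:
    """Stand-in for the notification payload object."""
    SCHEMA = ('hw_cpu_policy',)  # deliberately omits internal-only data

    def __init__(self, props):
        self.data = {f: getattr(props, f) for f in self.SCHEMA}

props = ImageMetaProps(hw_cpu_policy='dedicated', internal_only_hint='x')
payload = ImageMetaPropsPayload(props)
print(payload.data)  # {'hw_cpu_policy': 'dedicated'}
```

The trade-off discussed below follows directly: the whitelist protects internal data, but nothing forces it to be updated when the internal object grows a field.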
*** brinzhang has joined #openstack-nova12:11
sean-k-mooneywell i guess that is the issue right. having a separate class made the notification update optional since it would not fail tests if you forgot.12:11
*** brinzhang has quit IRC12:11
*** brinzhang has joined #openstack-nova12:12
sean-k-mooneyok so i'll leave the notification updates in the patch for now, assuming efried is ok with that, and i'll file a bug for the fact they are not in sync.12:12
gibisean-k-mooney: ack, it works for me12:12
gibiand thanks for taking care of it12:12
sean-k-mooneythen we can decide if we want to fix that by extending the new object and adding the test for the fields, or if we want to just use one object in this case. sound good?12:13
*** brinzhang_ has joined #openstack-nova12:16
*** nicolasbock has joined #openstack-nova12:18
*** brinzhang has quit IRC12:19
*** dpawlik has joined #openstack-nova12:20
*** dpawlik has quit IRC12:25
*** Liang__ has joined #openstack-nova12:27
*** elod has quit IRC12:28
*** dpawlik has joined #openstack-nova12:33
*** ociuhandu has quit IRC12:37
*** jangutter_ is now known as jangutter12:41
*** parlos has joined #openstack-nova12:50
*** udesale has quit IRC12:50
*** elod has joined #openstack-nova12:51
*** ociuhandu has joined #openstack-nova13:03
*** CeeMac has quit IRC13:08
*** brinzhang has joined #openstack-nova13:10
*** tbachman has joined #openstack-nova13:11
*** mmethot has quit IRC13:13
*** tbachman_ has joined #openstack-nova13:16
*** tbachman has quit IRC13:16
*** tbachman_ is now known as tbachman13:16
*** mmethot has joined #openstack-nova13:17
*** brinzhang_ has quit IRC13:33
*** damien_r has quit IRC13:33
openstackgerritGuo Jingyu proposed openstack/nova master: Make scheduling more debuggable  https://review.opendev.org/69842113:53
*** Liang__ is now known as LiangFang13:54
*** mriedem has joined #openstack-nova13:59
gibisean-k-mooney: sorry, I was pulled away. Yeah, it sounds like a plan14:00
sean-k-mooneyhehe no worries14:01
*** bnemec has joined #openstack-nova14:01
*** ociuhandu_ has joined #openstack-nova14:02
*** mdbooth has quit IRC14:02
*** ociuhandu has quit IRC14:05
*** bhagyashris has quit IRC14:06
*** kaisers has joined #openstack-nova14:06
*** mdbooth has joined #openstack-nova14:08
*** brinzhang has quit IRC14:15
efriedgibi, sean-k-mooney: ack, thanks for the followup. Sounds like we'll want a big patch to sync the two objects.14:16
efriedand maybe some kind of clever test that enforces their parity.14:16
efriedfor the future14:16
gibiefried: hi! I can get behind this plan14:17
efriedThis ship has probably sailed, but is there no way to use the same object for both purposes?14:17
sean-k-mooneywell the clever test is just to loop over the fields dict in the nova.object.image_meta.ImageProp object and assert they are in the notification object's fields dict14:18
gibiefried: if we hack the versioning then I guess we can. But I don't know if we really want to hack the versioning14:18
gibisean-k-mooney: yeah that could work14:19
sean-k-mooneyill write a patch to do that when i finish updating the functional tests14:19
efriedsean-k-mooney: and vice versa14:20
efriedthanks sean-k-mooney14:20
*** ociuhandu_ has quit IRC14:20
sean-k-mooneywell actually they are dicts so i could just assert the keys are equal14:21
sean-k-mooneyif we want the types to match i could just assert the dicts are equal to check the keys and values14:21
efriedI'd be fine just asserting they have the same keys.14:26
efriedit's really just a sniff test to make sure devs didn't miss syncing.14:26
sean-k-mooneyyep ill include it in the sync patch14:27
sean-k-mooneyalso ddt makes updating the func test way simpler14:27
efriedsweet.14:27
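The key-parity check sean-k-mooney and efried settle on above could look roughly like this. The two classes here are stand-ins with fake `fields` dicts; the real test would import nova.objects.image_meta.ImageMetaProps and the notification payload object instead:

```python
# Hedged sketch of the sync test discussed above: compare the `fields`
# dicts of the internal object and the notification payload and assert
# the keys match. These classes are illustrative stand-ins, not nova's.

class ImageMetaProps:
    fields = {'hw_cpu_policy': 'StringField',
              'hw_mem_page_size': 'StringField'}

class ImageMetaPropsPayload:
    fields = {'hw_cpu_policy': 'StringField',
              'hw_mem_page_size': 'StringField'}

def test_image_meta_props_in_sync():
    # fields missing from the payload mean a dev forgot to sync it
    missing = set(ImageMetaProps.fields) - set(ImageMetaPropsPayload.fields)
    # extra fields in the payload mean it no longer mirrors the source
    extra = set(ImageMetaPropsPayload.fields) - set(ImageMetaProps.fields)
    assert not missing, 'payload is missing fields: %s' % sorted(missing)
    assert not extra, 'payload has extra fields: %s' % sorted(extra)

test_image_meta_props_in_sync()  # passes while the keys stay in sync
```

As efried notes, this is just a sniff test on the keys; asserting the whole dicts equal would additionally pin the field types.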
efriedgibi: what's your vacation schedule?14:28
gibiefried: last official day in the office is the 16th. If I cannot finish the qos live migration by then, I will add some extra time before the 20th14:29
gibithen back on the 6th of Jan14:29
efriedOkay. I'll try to get some reviews in on that. I'm also going to ask for another look at the vTPM spec soon if that's okay. I need to do another update, so maybe tomorrow?14:30
mriedemstephenfin: i've got a few questions in this one https://review.opendev.org/#/c/696509/14:31
gibiefried: sure, lets do it14:32
efriedI'm going to make it simpler. Basically, the bits that are really awkward or hard to explain to users, I'm just going to say "behaves like baremetal".14:32
gibi:)14:32
mriedemstephenfin: also, not sure how others feel about this, but a massive rename like this screws up git history https://review.opendev.org/#/c/696745/714:32
mriedemand backports14:32
mriedemso i'd rather not do that personally14:32
mriedemdansmith: how are your feels on this ^?14:32
efriedI'm more offended by the misused apostrophe in the commit message.14:33
stephenfinmriedem: It really should. git's automerge tooling should detect renames for us14:33
mriedemi guess https://review.opendev.org/#/c/696745/7/nova/network/api.py isn't as terrible as i would expect14:33
stephenfin*it really shouldn't14:33
mriedemmaybe i still have ptsd from when ed renamed all of the legacy v2 contrib api modules14:34
dansmithmriedem: yeah, that seems unnecessary to me14:34
stephenfinThere will be merge conflicts but that's due to stuff having been moved around. Straight up file renames aren't an issue14:34
dansmithif others really care about it, then whatever, but it seems more trouble than it's worth to me14:35
mriedembackports related to anything touching networking stuff is going to be a nightmare for awhile anyway, but that's expected14:35
efriedso might as well shoot the whole hog?14:36
stephenfindansmith: Well, I really care about it :)14:36
dansmiththis definitely makes backports harder,14:36
dansmithI'm not sure why they're hard without it14:36
efried(hm, I've been misusing that idiom for years. It's "go the whole hog".)14:36
efrieddansmith: because of all the *other* nova-net removal churn14:37
stephenfinmocks, for one14:37
stephenfinwe no longer need to care about mocking stuff that nova-net was doing. If you backport a test, that goes back to not being true14:38
dansmithefried: okay but that's hard like code conflict, not hard like "this file doesn't exist in the old branch" hard14:38
efriedI get it.14:38
openstackgerritsean mooney proposed openstack/nova master: support pci numa affinity policies in flavor and image  https://review.opendev.org/67407214:39
mriedemanywho, i didn't -1 it and it's later in the series, just mentioning it14:39
mriedemsomeone has to take on the current big ass bottom change that i +2ed already14:39
efriedstephenfin: Given that we have nova/image/glance.py, I don't see a problem with having nova/networking/neutron.py14:39
sean-k-mooneyefried: i have not added the spy function to check the pci assignment but i think i have made all the other changes you suggested14:40
stephenfindansmith: but it won't - git is smart14:40
mriedemefried: that's a good point14:40
mriedemwe also have a cinder.py14:40
efriednova/volume/cinder.py yeah.14:40
stephenfindansmith: Go make a trivial modification to nova/tests/unit/test_nova_manage.py, commit on master, then backport to stable/stein14:40
efriedwe should make sure Sundar names it nova/accelerator/cyborg.py and we're golden.14:40
stephenfinit'll happen cleanly, even though I renamed that file in commit 463017b51b8cde48582b2f55ad7a1f2321d03d0214:41
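stephenfin's suggested experiment can be reproduced in miniature; a hedged sketch where the file names (api.py, neutron.py) and branch name (stable) are made up for illustration and are not the nova tree:

```shell
# Minimal demo of git's rename detection across a cherry-pick, as
# discussed above: a fix committed after a file rename backports cleanly
# onto a branch that still has the old file name.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name demo
printf 'a\nb\nc\n' > api.py
git add api.py && git commit -qm 'initial commit'
git branch stable                         # old branch keeps the old name
git mv api.py neutron.py && git commit -qm 'rename api.py to neutron.py'
printf 'a\nB\nc\n' > neutron.py           # bug fix on master, post-rename
git commit -qam 'bug fix'
FIX=$(git rev-parse HEAD)
git checkout -q stable                    # stable still has api.py
git cherry-pick "$FIX"                    # rename detection maps the hunk
grep B api.py                             # fix landed in the old file name
```

The three-way merge behind cherry-pick sees the rename between the merge base and the stable branch and applies the hunk to the renamed file, which is why a straight file rename by itself does not break backports.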
mriedemactually i did just -114:41
mriedemi think having neutron in the module name would be less confusing to someone coming to hack on nova who knows there was at least once a nova-network thing, and who would otherwise find nova/network/api.py and wonder if it's nova-net or neutron14:42
sean-k-mooneyefried: that would imply that that code is virt driver independent. if the cyborg code in that folder were, then sure14:42
dansmithstephenfin: aight, well, when we moved tests/ into tests/unit things were not "smart", but whatevs14:42
mriedemsince we have explicit glance.py and cinder.py modules those are pretty clear what they are for14:42
mriedemdansmith: oh yeah the tests -> tests/unit was another big one14:42
efriedmriedem: oh, I hope we're getting rid of the API shim14:43
efriedstephenfin: ? ^14:43
mriedemefried: i'm fine with removing base_api.py and api.py (nova-net)14:43
mriedembut i don't think we need to rename neutronv2.api to api.py14:43
efriedyeah, I'm fine leaving it named neutron14:43
sean-k-mooneywell on that how would people feel about eventurally moving the neutron code to os-vif?14:44
mriedemwhy?14:44
efriedthat sounds like a conversation for a far future release14:44
*** dtantsur is now known as dtantsur|brb14:44
sean-k-mooneywell partly to not need to have any networking code in nova14:44
mriedem"hey let's move some already really complicated and not very well understood code out to a library with a different core team"14:44
canori01Is it safe to rebuild the placement database?  I have an issue where all my hypervisors are running into conflicts (conflicting resource provider name)14:45
sean-k-mooneywell all nova cores are os-vif cores14:45
canori01Can I just empty the db and bounce nova-compute?14:45
sean-k-mooneybut anyway its just an idea14:45
mriedemsean-k-mooney: not worth the effort14:45
mriedempick your battles14:45
sean-k-mooneyits currently pluggable so i was thinking of porting it, then we could swap after14:45
sean-k-mooneybut ok14:45
mriedemcanori01: the inventory will rebuild itself automatically, the consumers/allocations will not14:46
sean-k-mooneyill drop it for now14:46
*** ociuhandu has joined #openstack-nova14:46
mriedemcanori01: nova-manage placement heal_allocations should heal those up those if you do have to rebuild the placement db14:46
mriedem*heal those up though14:47
*** eharney has quit IRC14:47
sean-k-mooneymriedem: that was added in rocky right. in queens we still had the periodic heal task from the move to placement14:47
mriedemwrongish14:48
canori01mriedem: What happened is I used to have a third-party backup service that had service entries under nova.  After removing them, all the hypervisors complain that an entry for them already exists and are unschedulable as a result14:48
canori01So my thought was to rebuild the placement db. I don't know if there's a better option14:49
*** links has quit IRC14:49
sean-k-mooneycanori01: did you remove all the nova services14:49
canori01no, the nova services are still there14:50
canori01sean-k-mooney: would removing the service and bouncing nova-compute put things back in order?14:51
mriedemkvm right?14:53
mriedemthe problem is placement has a unique constraint on the hostname, but the uuid on your computes has changed14:53
mriedemthe uuid on the compute_nodes table record that nova creates and uses to report the resource_providers to placement14:53
mriedemso i think you're hitting some version of this https://bugs.launchpad.net/nova/+bug/181783314:55
openstackLaunchpad bug 1817833 in OpenStack Compute (nova) "Check compute_id existence when nova-compute reports info to placement" [Medium,In progress] - Assigned to Matt Riedemann (mriedem)14:55
sean-k-mooneycanori01: no. i asked because if you removed the service the uuid would change when the agent restarts, but the hostname would be the same and would cause a resource provider conflict14:56
*** ociuhandu has quit IRC14:56
mriedemif you really need to rebuild the placement db, then i think your steps would be:14:57
mriedem1. backup your current placement db14:57
mriedem2. drop it and rebuild the schema so it's empty14:57
mriedem3. let the computes report their inventory in which will create resource providers on the first run14:57
mriedem4. run: nova-manage placement heal_allocations14:57
stephenfinmriedem: replied on https://review.opendev.org/#/c/696509/14:57
mriedem5. run: nova-manage placement sync_aggregates14:57
stephenfintl;dr: if I don't do that req stuff, s*** breaks, so I did enough to make it work ¯\_(ツ)_/¯14:58
mriedemthis is also assuming you aren't using some of the more advanced features like QoS ports in neutron14:58
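mriedem's five steps, expanded into a hedged command sketch. The database commands, the MySQL-backed extracted placement service, and the systemd unit name are all assumptions that will differ per deployment; only the two nova-manage subcommands come from the discussion above. Destructive, so back up first:

```shell
# Hedged expansion of the rebuild steps above; adapt names/credentials
# to your deployment before running anything.

# 1. back up the current placement db
mysqldump placement > placement-backup.sql

# 2. drop it and rebuild the schema so it's empty
mysql -e 'DROP DATABASE placement; CREATE DATABASE placement;'
placement-manage db sync

# 3. restart the computes; the resource tracker recreates the resource
#    providers and reports inventory on its next periodic run
systemctl restart openstack-nova-compute

# 4. heal the instance allocations
nova-manage placement heal_allocations

# 5. re-create the placement aggregates from nova host aggregates
nova-manage placement sync_aggregates
```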
canori01mriedem: I am not yet using QoS for neutron14:58
stephenfinmriedem: also, it's way up the stack but you should definitely look at https://review.opendev.org/#/c/696746/ since it affects your security group caching changes. I think what I did is correct14:59
*** chason has joined #openstack-nova14:59
sean-k-mooneycanori01: it specifically would only be an issue if you were using a minimum bandwidth qos policy14:59
mriedemjesus that is a big change14:59
sean-k-mooneycanori01: the other qos policies do not interact with placement14:59
sean-k-mooneycanori01: are you using routed networks, out of interest? e.g. calico15:00
canori01sean-k-mooney: So if I sync the uuid on the database to match the uuid of my existing service entries, that would also solve the issue?15:00
canori01For example, nova service-list has:15:00
canori012c1037b3-4977-4a13-aea8-700a805cc11c | nova-compute     | bctlz7nova3615:00
sean-k-mooneycanori01: that is easier said than done, as you would also have to consider existing allocations, but in principle yes15:01
canori01placement has: | 2019-10-22 19:12:06 | 2019-11-25 19:27:36 | 157 | c53e4b12-0b0b-4eaa-9fb1-373da8538cea | bctlz7nova3615:01
*** chason has left #openstack-nova15:01
mriedemno those aren't the same15:01
mriedemthe nova services table uuid and nova compute_nodes uuid are not the same15:01
canori01ah ok15:01
*** chason has joined #openstack-nova15:01
sean-k-mooneyright, the placement uuid is the compute node uuid15:01
canori01sean-k-mooney: I'm not using calico. Just overlay vxlan networks advertised out with the neutron bgp agent15:02
*** amodi has quit IRC15:02
sean-k-mooneycanori01: ok, neutron reports the network segments for routed networks to placement too15:02
sean-k-mooneycanori01: if you are using vxlan then you are fine15:02
mriedemstephenfin: ok +2 on the 2nd from bottom change15:05
stephenfinta15:05
efriedstephenfin: and the bottom one is +A15:05
canori01mriedem: so would the safest course of action be to rebuild the placement db and heal the allocations?15:06
mriedemi'm assuming canori01 didn't understand the question about routed networks as a feature in neutron15:06
mriedemhttps://docs.openstack.org/neutron/latest/admin/config-routed-networks.html15:06
mriedemtl;dr it relies on aggregates15:06
mriedemand they aren't supported in nova anyway so it's a red herring here15:06
*** ociuhandu has joined #openstack-nova15:07
mriedemcanori01: i think doing that (what i laid out above) is likely more foolproof than trying to hack the uuids to get all synced up15:07
mriedemdisclaimer: if you run into problems with that this isn't a support channel nor am i your paid vendor so i'm not going to be walking you through every issue you run into :)15:08
efriedIt would be a good real world test of that procedure, anyway.15:08
efriedA canori in a coal mine, so to speak.15:08
canori01mriedem: ok, thanks. Also, I'm definitely not using the routed networks. My provider network is just one segment15:08
mriedemefried: indeed - test it in (someone else's) production15:08
mriedemefried: maybe a troubleshooting item to document, "oh no my placement db is all screwed up, how can i just start over w/ my existing nova"15:09
canori01mriedem: of course. I understand about the disclaimer :D15:09
efriedI thought cdent had that somewhere mebbe?15:09
efriedhe's hanging out in -placement atm...15:09
* efried bbiab15:09
*** ociuhandu has quit IRC15:12
*** chason has quit IRC15:13
mriedemsean-k-mooney: to answer your earlier question, yes heal_allocations was added in rocky, but the RT did not report allocations periodically in rocky *unless* it's an ironic compute15:13
mriedemsee https://review.opendev.org/#/c/576462/15:14
*** dpawlik has quit IRC15:16
sean-k-mooneymriedem: no i meant it did that in queens15:16
sean-k-mooneyalthough only if you had ironic or pike compute nodes15:16
mriedemcorrect15:17
mriedemwait, no, <pike computes15:17
sean-k-mooneyam maybe15:17
mriedemstarting in pike once all your computes were upgraded we stopped having the RT report allocations because it would overwrite what the scheduler did and screw up allocations during move operations15:17
sean-k-mooneyyes15:17
mriedemthat's also why dansmith did migration-based allocations for move ops in queens15:18
sean-k-mooneyyep so basically the downstream issue was related to fixing allocations for an undercloud (ironic) node where the customer accidentally deleted the service15:19
mriedemstephenfin: while you're still around, i need you and efried to come to an agreement on https://review.opendev.org/#/c/696582/15:20
sean-k-mooneybecause it was ironic and queens that the periodic saved them. but they were asking what would happen if the same happened on the overcloud (libvirt nodes)15:20
sean-k-mooneywhich was when we noticed that the heal_allocations command was not in queens, just rocky15:21
*** eharney has joined #openstack-nova15:23
*** parlos has quit IRC15:24
*** ociuhandu has joined #openstack-nova15:24
*** ociuhandu has quit IRC15:28
*** damien_r has joined #openstack-nova15:30
*** mkrai has joined #openstack-nova15:30
*** lpetrut has quit IRC15:32
*** damien_r has quit IRC15:33
mriedemspeaking of which, melwitt - should i continue backporting these to rocky? https://review.opendev.org/#/q/topic:heal_allocations_dry_run+(status:open+OR+status:merged)15:34
*** links has joined #openstack-nova15:38
openstackgerritMatt Riedemann proposed openstack/nova master: Add troubleshooting doc about rebuilding the placement db  https://review.opendev.org/69851715:42
mriedemefried: canori01: ^ brain dump15:42
*** panda is now known as panda|bbl15:44
*** ttsiouts has quit IRC15:48
*** mlavalle has joined #openstack-nova15:53
*** mkrai has quit IRC15:58
melwittmriedem: if you do, it would be a help16:00
melwittI support ++16:00
openstackgerritBalazs Gibizer proposed openstack/nova master: Support live migration with qos ports  https://review.opendev.org/69590516:06
gibimriedem: the happy case support for live migration is now complete in ^^16:06
efriedstephenfin: re https://review.opendev.org/#/c/696582/ -- I want the shiny new command in the docs for sure. And I'm not sure your PS2 commentary meant you wanted it actually removed -- did it?16:10
mriedemgibi: ack - throw that series into the runways etherpad?16:12
mriedemi'm also waiting on efried to come back on https://review.opendev.org/#/c/696541/16:12
*** lbragstad_ has joined #openstack-nova16:12
efriedmriedem: looking now16:13
gibimriedem: ack, adding...16:13
mriedemgibi: i also replied to your comments on https://review.opendev.org/#/c/637070/ but then accidentally rebased16:14
gibimriedem: ack, put it in my queue16:14
*** lbragstad has quit IRC16:15
*** Sundar has joined #openstack-nova16:16
efriedmriedem, gibi: I'm +2 on https://review.opendev.org/#/c/696541/16:19
gibiefried: thanks16:19
openstackgerritLee Yarwood proposed openstack/nova master: libvirt: Remove native LUKS compat code  https://review.opendev.org/66912116:25
mriedemi'm not16:25
*** jaosorior has joined #openstack-nova16:25
openstackgerritMatt Riedemann proposed openstack/nova stable/rocky: Add --dry-run option to heal_allocations CLI  https://review.opendev.org/69852516:29
gibimriedem: ack, I will need to get back to that patch tomorrow16:30
*** jmlowe has joined #openstack-nova16:30
gibimriedem: most of the noise is there because this patch went through a couple of PSs with different solutions16:31
gibimriedem: I will get back to it tomorrow and clean it up16:31
* gibi leaves for today, happy hacking folks16:32
sean-k-mooneygibi: o/16:33
openstackgerritMatt Riedemann proposed openstack/nova stable/rocky: Add --instance option to heal_allocations  https://review.opendev.org/69852916:38
openstackgerritsean mooney proposed openstack/nova stable/train: Block rebuild when NUMA topology changed  https://review.opendev.org/69853016:38
openstackgerritMatt Riedemann proposed openstack/nova stable/rocky: Add BFV wrinkle to TestNovaManagePlacementHealAllocations  https://review.opendev.org/69853116:39
openstackgerritsean mooney proposed openstack/nova stable/train: Disable NUMATopologyFilter on rebuild  https://review.opendev.org/69853216:40
sean-k-mooneymriedem: since im backporting stuff should i backport https://review.opendev.org/#/c/695118/ which is the fix to https://bugs.launchpad.net/nova/+bug/184736716:45
openstackLaunchpad bug 1847367 in OpenStack Compute (nova) "Images with hw:vif_multiqueue_enabled can be limited to 8 queues even if more are supported" [Undecided,Fix released] - Assigned to sean mooney (sean-k-mooney)16:45
sean-k-mooneymriedem: it was opened against rocky so i guess it should go back at least that far16:45
mriedemi'd probably let eandersson or his minions do the backports if they want them16:47
efriedmriedem: If I delete a shelved instance, does the virt driver ever get a crack at cleaning up?16:49
efrieda) if not offloaded, I assume yes, because the instance is still on the host16:49
efriedb) if offloaded, I assume no, because whose virt driver would we hit?16:49
mriedemcorrect16:49
efriedthx16:49
mriedemis this vpmem or accelerator related?16:49
efriedneither, vtpm16:51
efriedMeans I think we're going to have to delete the swift obj from the conductor rather than the virt driver.16:52
efriedwhich kinda sucks because it's a virt driver-specific thing. At least the contents are.16:52
mriedemi haven't been paying attention to the vtpm hullabaloo16:52
mriedemconductor isn't involved in the server delete btw,16:53
mriedemso the api would be doing whatever external cleanup is necessary16:53
efriedsigh, that's what I meant.16:53
efried"controller"16:53
efriedcan't imagine why I confuse that with "conductor".16:53
efriedthough by now you would have thought I could get it the f right.16:54
efriedmaybe after 8 years...16:54
mriedemwith enough pedantic ridicule you'll get there!16:54
efriedis that what it's for? gtk16:54
efriedI thought it was just plain old schoolyard bullying.16:54
*** gyee has joined #openstack-nova16:55
*** dtantsur|brb is now known as dtantsur16:55
*** iurygregory has quit IRC16:56
*** maciejjozefczyk has quit IRC16:59
*** LiangFang has quit IRC17:01
canori01mriedem: That seems to have worked well.  I'm still doing tests, but looks promising.  I had to do it slightly differently.  I'm running rocky, but my placement db is not broken out.  So instead of dropping and recreating the db, I had to truncate the tables that placement uses. Then I bounced nova-compute and did the healing and aggregate syncing17:04
mriedemah cool17:07
mriedemglad it's working17:07
*** rpittau is now known as rpittau|afk17:13
openstackgerritMerged openstack/nova stable/queens: Do not update root_device_name during guest config  https://review.opendev.org/69646917:16
openstackgerritMerged openstack/nova master: nova-net: Drop nova-network-base security group tests  https://review.opendev.org/69650817:16
sean-k-mooneyefried: the virt driver should have cleaned up everything on the host as part of the offload step17:19
sean-k-mooneyoh i see this is related to vtpm17:19
*** psachin has quit IRC17:19
*** panda|bbl is now known as panda17:20
*** jlvillal has left #openstack-nova17:20
*** links has quit IRC17:22
openstackgerritStephen Finucane proposed openstack/nova master: trivial: Resolve (most) flake8 3.x issues  https://review.opendev.org/69573217:24
openstackgerritStephen Finucane proposed openstack/nova master: Switch to flake8 3.x  https://review.opendev.org/69573317:24
*** yan0s has quit IRC17:26
*** ociuhandu has joined #openstack-nova17:30
mriedemwtf, so back on dec 5 i had a passing run of nova-multi-cell with migration tests enabled. now since the 9th with a new run all migration tests are failing because once the confirmed resized server is active again the api is saying the flavor is the old id even though i can see in the conductor logs where right before that the instance has the correct flavor17:31
mriedemthe api is pulling from the correct cell17:31
openstackgerritStephen Finucane proposed openstack/nova master: Add resource provider allocation unset example to troubleshooting doc  https://review.opendev.org/69658217:32
stephenfinmriedem: Revert that if you don't like what I've done. It was just easier do it than explain what I wanted in a comment ^17:33
stephenfinefried: you asked about that earlier ^17:33
mriedemok17:34
openstackgerritMerged openstack/nova master: Use provider mappings from Placement (mostly)  https://review.opendev.org/69699217:40
openstackgerritMerged openstack/nova master: Create a controller for qga when SEV is used  https://review.opendev.org/69307217:40
openstackgerritMerged openstack/nova master: Extend NeutronFixture to handle multiple bindings  https://review.opendev.org/69624617:40
openstackgerritMerged openstack/nova master: Do not mock setup net and migrate inst in NeutronFixture  https://review.opendev.org/69624717:40
*** igordc has joined #openstack-nova17:42
*** ociuhandu has quit IRC17:45
*** awalende has quit IRC17:46
*** awalende has joined #openstack-nova17:47
*** awalende has quit IRC17:51
*** awalende has joined #openstack-nova17:52
*** awalende_ has joined #openstack-nova17:54
*** awalende has quit IRC17:56
*** awalende_ has quit IRC17:57
*** derekh has quit IRC18:01
*** Sundar has quit IRC18:12
*** igordc has quit IRC18:13
*** igordc has joined #openstack-nova18:14
*** salmankhan has quit IRC18:19
*** awalende has joined #openstack-nova18:21
efriedmriedem: are you okay with stephenfin's update to that doc patch? Since you both have hands in it, if you're okay with it I'll fast approve, taking stephenfin's authorship as implicit approval and since it's docs...18:22
melwittjohnthetubaguy: hey, are you around by chance?18:22
*** dtantsur is now known as dtantsur|afk18:23
*** ociuhandu has joined #openstack-nova18:24
*** awalende has quit IRC18:25
*** igordc has quit IRC18:25
*** jaosorior has quit IRC18:27
*** ociuhandu has quit IRC18:29
mriedemefried: yet to look at it18:37
mriedembut soon, very soon....muwahhaaha18:37
openstackgerritMatt Riedemann proposed openstack/nova master: Add cross-cell resize tests for _poll_unconfirmed_resizes  https://review.opendev.org/69832218:40
openstackgerritMatt Riedemann proposed openstack/nova master: DNM: debug cross-cell resize  https://review.opendev.org/69830418:40
openstackgerritMatt Riedemann proposed openstack/nova master: DNM: debug cross-cell resize  https://review.opendev.org/69830418:41
mriedemefried: stephenfin's changes look fine to me18:44
*** tosky has quit IRC18:44
efried+A18:45
mriedemif someone can push through https://review.opendev.org/#/c/696509/ it's the current bottom of the nova-net removal series; it looks like the rest of the series after that is now in merge conflict so the whole thing has to be rebased.18:45
*** henriqueof has joined #openstack-nova18:50
efriedmriedem: I'm gonna try to hit that today, but it keeps getting pushed down my stack :(18:52
mriedemack it's pretty mechanical so anyone should be able to hit it18:53
*** ralonsoh has quit IRC18:55
melwittTheJulia: do you know whether cpu_arch is supposed to be required from an ironic pov? https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L103-L106 is it valid for a deployment *not* to specify cpu_arch? (for example, in a single arch environment) context is this bug https://bugzilla.redhat.com/show_bug.cgi?id=168883819:02
openstackbugzilla.redhat.com bug 1688838 in openstack-nova "Ironic should not treat cpu_arch as mandatory" [Medium,New] - Assigned to mwitt19:02
sean-k-mooneygibi: by the way i assume attaching a port with a resource request to an existing instance is still not supported, correct? it was declared out of scope in stein and it was not mentioned as addressed in train. is that on your ussuri todo list?19:06
*** mvkr has quit IRC19:06
*** martinkennelly has quit IRC19:07
sean-k-mooneyi assume we would have updated https://github.com/openstack/nova/blob/master/releasenotes/notes/reject-interface-attach-with-port-resource-request-17473ddc5a989a2a.yaml  if it was supported or added another release note19:08
*** nicolasbock has quit IRC19:11
*** nicolasbock has joined #openstack-nova19:11
*** spsurya has quit IRC19:14
melwittjroll: ^ maybe you might know (ironic driver question from me)19:18
*** eharney has quit IRC19:30
mriedemsean-k-mooney: not supported - the request has to be validated with placement on attach and that isn't done19:31
sean-k-mooneyya that is what i understood too19:32
sean-k-mooneyim doing a downstream docs review and wanted to make sure that was called out19:32
*** damien_r has joined #openstack-nova19:32
mnasereandersson: is this similar to what you've been running into? https://bugs.launchpad.net/nova/+bug/183563719:33
openstackLaunchpad bug 1835637 in OpenStack Compute (nova) "(404) NOT_FOUND - failed to perform operation on queue 'notifications.info' in vhost '/nova' due to timeout" [Undecided,Incomplete]19:33
*** lpetrut has joined #openstack-nova19:33
*** iurygregory has joined #openstack-nova19:34
*** damien_r has quit IRC19:35
*** mriedem has quit IRC19:37
*** tesseract has quit IRC19:46
*** lpetrut has quit IRC19:46
*** awalende has joined #openstack-nova19:48
*** awalende has quit IRC19:53
*** lbragstad has joined #openstack-nova20:03
openstackgerritSundar Nadathur proposed openstack/nova master: ksa auth conf and client for Cyborg access  https://review.opendev.org/63124220:04
openstackgerritSundar Nadathur proposed openstack/nova master: Add Cyborg device profile groups to request spec.  https://review.opendev.org/63124320:05
openstackgerritSundar Nadathur proposed openstack/nova master: Define Cyborg ARQ binding notification event.  https://review.opendev.org/69270720:05
openstackgerritSundar Nadathur proposed openstack/nova master: Create and bind Cyborg ARQs.  https://review.opendev.org/63124420:05
openstackgerritSundar Nadathur proposed openstack/nova master: Compose accelerator PCI devices into domain XML in libvirt driver.  https://review.opendev.org/63124520:05
openstackgerritSundar Nadathur proposed openstack/nova master: Delete ARQs for an instance when the instance is deleted.  https://review.opendev.org/67373520:05
openstackgerritSundar Nadathur proposed openstack/nova master: Enable hard reboot with accelerators.  https://review.opendev.org/69794020:05
openstackgerritSundar Nadathur proposed openstack/nova master: Add cyborg tempest job.  https://review.opendev.org/67099920:05
openstackgerritSundar Nadathur proposed openstack/nova master: Pass accelerator requests to each virt driver from compute manager.  https://review.opendev.org/69858120:05
openstackgerritSundar Nadathur proposed openstack/nova master: Create and bind Cyborg ARQs.  https://review.opendev.org/63124420:07
openstackgerritSundar Nadathur proposed openstack/nova master: Pass accelerator requests to each virt driver from compute manager.  https://review.opendev.org/69858120:07
openstackgerritSundar Nadathur proposed openstack/nova master: Compose accelerator PCI devices into domain XML in libvirt driver.  https://review.opendev.org/63124520:07
openstackgerritSundar Nadathur proposed openstack/nova master: Delete ARQs for an instance when the instance is deleted.  https://review.opendev.org/67373520:07
openstackgerritSundar Nadathur proposed openstack/nova master: Enable hard reboot with accelerators.  https://review.opendev.org/69794020:07
openstackgerritSundar Nadathur proposed openstack/nova master: Add cyborg tempest job.  https://review.opendev.org/67099920:07
*** ociuhandu has joined #openstack-nova20:09
*** eharney has joined #openstack-nova20:12
*** abaindur has joined #openstack-nova20:16
*** abaindur has quit IRC20:17
*** abaindur has joined #openstack-nova20:17
*** ociuhandu has quit IRC20:21
*** ociuhandu has joined #openstack-nova20:21
*** ociuhandu has quit IRC20:40
*** ociuhandu has joined #openstack-nova20:41
*** martinkennelly has joined #openstack-nova20:43
openstackgerritLee Yarwood proposed openstack/nova stable/stein: block_device: Copy original volume_type when missing for snapshot based volumes  https://review.opendev.org/69668620:44
openstackgerritLee Yarwood proposed openstack/nova stable/rocky: block_device: Copy original volume_type when missing for snapshot based volumes  https://review.opendev.org/69726020:44
openstackgerritLee Yarwood proposed openstack/nova stable/queens: block_device: Copy original volume_type when missing for snapshot based volumes  https://review.opendev.org/69726120:45
*** iurygregory has quit IRC20:45
*** damien_r has joined #openstack-nova20:47
*** damien_r has quit IRC20:47
*** damien_r has joined #openstack-nova20:47
eanderssonmnaser: I don't think it's the same issue, but could be similar20:49
mnasereandersson: seems like the queues just crash even on a restart20:49
eanderssonhttps://github.com/rabbitmq/rabbitmq-server/issues/64120:49
mnasereven with 3.8.120:49
eanderssonThis is the issue we are running into20:49
mnasergiven that we have k8s running side by side our entire infra now, we're just probably going to run one single non-clustered rabbitmq instance20:50
eanderssonbut yea could be the same because we see different symptoms each time20:50
mnaserfor every service20:50
eanderssonYea - this has been very draining for us20:51
eanderssonSome of our RabbitMQ clusters have 17k+ queues as well20:52
eanderssonSince every compute ends up with like at least 7 queues20:52
eanderssonbtw mnaser do you have network partition auto-healing enabled?20:52
eanderssonor pause_minority rather20:53
*** mgariepy has quit IRC20:54
eanderssonI believe the trigger for these issues is that when RabbitMQ comes back up it is starting to accept connections before it is fully recovered.20:54
eanderssonAt least in 3.6.X we had that issue. Could be a new issue in 3.7.X20:54
mnasereandersson: i ran into it even with 3.8.120:55
mnasereandersson: cluster_partition_handling, pause_minority20:55
eanderssonI tried to tell the RabbitMQ guys about this, but I don't know how to reproduce it properly.20:56
eanderssonPlus I run like 3.7.5 so they just keep telling me to upgrade.20:56
eanderssonBut after the RabbitMQ 3.6.3 debacle we don't upgrade too frequently without first testing it properly. So takes time.20:57
*** ociuhandu has quit IRC20:57
*** Sundar has joined #openstack-nova20:59
Sundardansmith: Would it help to discuss in IRC and then summarize in Gerrit?21:00
dansmithSundar: if you were ever in irc, sure21:01
dansmithSundar: what I want is discussion instead of replies of "no you're wrong" two minutes before pushing up replacement set without addressing the thing21:01
Sundardansmith: Sure. I addressed most of your points, BTW.21:02
SundarI think the disconnect is the understanding of the object model. Please see https://review.opendev.org/#/c/631243/46/nova/accelerator/cyborg.py@2621:03
*** slaweq has quit IRC21:03
dansmithSundar: I'm not sure what your point is21:04
dansmithSundar: I will eventually be able to have one crypto accelerator and one gzip accelerator, right?21:05
Sundardansmith: Yes, for the same instance21:05
dansmithand those are two device profiles?21:05
SundarNo, one device profile with 2 request groups21:05
dansmithokay, so an instance will only ever have one device profile?21:05
SundarYes. That single device profile's name is set in the flavor.21:06
dansmithso, why are you setting the tag in the event to the device profile?21:06
SundarThat was the only logical choice for a tag that seemed relevant.21:07
dansmithsetting the tag means that the event needs to be multiplexed for the instance21:07
dansmithwhich is why we set it to the port id for neutron ports, for example, because there are multiple ports per instance21:08
dansmithI'm not sure why it was decided the encompass all of the accelerators for an instance into a single entity, but alas21:08
SundarWould we ever need multiple events per instance? For example, for hot adds/deletes in the future, by updating the device profile?21:09
dansmithSundar: those would be different event types21:09
SundarThen what exactly is the problem -- that the tag is superfluous?21:10
dansmithsetting the tag implies that there can be multiples, so ... yes21:11
*** slaweq has joined #openstack-nova21:11
Sundardansmith: Got the disconnect. We don't use the tag, at least not today. With hot adds/deletes, since it is going to be another event type, we still don't need it.21:12
Sundardansmith: BTW, the idea of one device profile per instance came from the idea of setting one device profile name in the instance.21:12
dansmithI dunno what those would look like, but I would hope that there is some indication of what device the "thing got added" event pertains to21:12
dansmith...which would be done with a tag21:13
*** ociuhandu has joined #openstack-nova21:13
Sundardansmith: It may make sense to have a single notification for an update too -- because there is not much that Nova can do with a partial update, knowing that the next event may indicate a failure and things need to be rolled back21:14
SundarThat is the same reasoning as for the bind logic here21:15
dansmithSundar: "something failed" events are pretty terrible21:15
*** slaweq has quit IRC21:16
Sundardansmith: If you mean that the event should say what exactly failed, agreed.21:16
dansmithSundar: I mean we should at least know which thing failed, and tag is the "which"21:16
openstackgerritLee Yarwood proposed openstack/nova master: Remove 'nova-xvpvncproxy'  https://review.opendev.org/68790921:17
Sundardansmith: The problem is that, if I were to send a separate event for each ARQ for binding success/failure, tagged with the ARQ UUID, Nova wouldn't know what to do with that. For one, there is no association between the ARQ UUID and the instance in nova -- that would presumably need a db change.21:18
SundarSo, nova wouldn't know which instance is affected.21:19
dansmithhuh? the event would still be delivered to the instance, so we know which instance is affected21:19
dansmithI understand we don't store the arq uuid anywhere, although I think we likely still have it before we're going to wait, since we just polled cyborg for the list21:20
Sundardansmith: true. The event has a server-uuid field.21:20
dansmiththat's kindof the whole point.21:20
dansmithit would be much less odd if you did that, even if it doesn't mean that much to us right now21:20
dansmithbecause we can log the detail, potentially report it in the instance action event,21:21
dansmithinstead of "well, I dunno, cyborg said it didn't work.. *shrug*"21:21
Sundardansmith: However, what does a single ARQ's success mean to Nova? It would have to wait anyway for all of them. If any of them failed, the whole set needs to be rolled back.21:21
dansmithbut whatever happens, if you're going to send one per instance, it needs to not have a tag21:21
Sundardansmith: Ok, I can remove the tag.21:22
dansmithSundar: in some future we may need to do extra things to the device on the host, and we can do that for devices that complete early while we wait for the others,21:22
dansmithand like I said, being able to report to the user and the admin which and what happened is useful21:22
*** rcernin has joined #openstack-nova21:23
*** francoisp has quit IRC21:23
SundarRe. report to user/admin, the exact details would be in Cyborg logs. Given the range of errors, it will be tough to state the failure reason in the Nova event.21:24
SundarPerhaps the Nova logs could point the admin to Cyborg logs.21:24
dansmithno, I mean report *which* thing failed21:25
dansmithnot why21:25
SundarFor example, say we had one event per ARQ, tagged by ARQ UUID. Nova logs report that a certain ARQ UUID failed to bind, Would that really be useful, without knowing more about the ARQ?21:26
dansmithyes?21:26
dansmithI grab that arq uuid and hit logstash to see what else has reported stuff about that thing21:27
dansmithWe could just always report "error with (neutron|cinder|cyborg). You go figure it out" but that's not likely to be very popular21:28
Sundardansmith: IMHO, I think it will be better to look at an error in context. Giving the user an ARQ UUID is equivalent to 'go search Cyborg logs', right?21:30
dansmithI can't believe I'm having this conversation.21:30
*** mvkr has joined #openstack-nova21:30
dansmithSundar: either make the event look like the rest of our events, or stop passing the tag.21:30
Sundardansmith: Sure, I said I can stop passing the tag. You had 2 concerns with that -- one was ease of use of logs. We were attempting to resolve that.21:32
SundarThe other was "in some future we may need to do extra things to the device on the host".21:32
SundarI am still thinking about that, esp. if those extra things will require a different event type, which could have or not have a tag, independent of this event.21:33
Sundardansmith: Please LMK if I am off.21:34
*** lbragstad has quit IRC21:34
*** tbachman has quit IRC21:35
dansmithSundar: is there any reason you can't send the event per ARQ in cyborg, and any reason you can't know the ARQ UUIDs at the time we wait? We have that info passed in now right?21:35
*** pcaruana has quit IRC21:38
Sundardansmith: We are debating the benefits of sending one event per ARQ, beyond 'this is what we have done in the past.' It seems to me that logging the ARQ UUID gives little info to the user beyond 'look elsewhere for the details'. Please LMK why that is wrong.21:38
SundarSecondly, we still have to wait for all events for that instance, right?21:39
dansmithefried: you around now?21:39
*** martinkennelly has quit IRC21:40
efriedhi, yes, all the doctors this week.21:40
efriedshall I read scrollback or are you going to summarize for me?21:40
dansmithso, it has become clear to me that the cyborg event as currently proposed is being used very differently than what we have for the other services21:40
efriedIf we're discussing how many events...21:41
dansmithspecifically, it's one event per instance with "everything is done" or "everything failed" level granularity21:41
dansmithinstead of per-ARQ, akin to per-port or per-volume or anything else21:41
efriedI skimmed the reviews briefly and it makes sense to me to have an event per "attachable thingy"21:41
dansmithright21:41
dansmithso I think I'm going to convert to taking a hard line on that21:41
efriedThat's me looking at a very high level, without really digging in.21:41
dansmiththat the event should be per-ARQ21:41
dansmitheven though everything is very monolithic right now,21:42
efriedunless the contract with cyborg is "all or nothing"21:42
Sundarefried: What will Nova do with a per-ARQ event?21:42
efriedI mean, I'm guessing we would abort the build if we get partial21:42
dansmitheventually this should be similar to what we do with ports and volumes in terms of attachability21:42
dansmithefried: yes, just like we do for ports and volumes now during build21:42
dansmithhowever,21:42
dansmithwe use the same event during attach-one-later type operations21:42
dansmithand this should be the same.21:43
efriedsure, but iiuc the (anticipated, future) design is still going to involve wrapping "attach later" arqs in a device profile-like thing.21:43
SundarIf Nova starts sending per-ARQ delete for an instance, that is going to complicate things on both sides21:43
efriedand again, wouldn't we want to fail the whole operation if a subset of attachments fail?21:43
dansmithefried: during build21:44
dansmithefried: just like we do for network and storage21:44
dansmithefried: but during attach, the granularity matters of course21:44
efriedOkay, if we do post-build attach of networks/volumes, we support "partial success" and don't revert the whole operation?21:45
Sundarefried: Exactly. That is why it makes sense to wait for the whole thing. If Cyborg gets an abort in the middle of a device prep task, there may not be much to do, except wait for it to complete21:45
dansmithefried: post-build attach sends one event per attachable thing, exactly the same cardinality as one-per during build21:45
dansmithefried: and yes, failure to attach just the new thing is not a completely totally fatal thing, of course21:46
dansmithSundar: waiting for them all to complete has nothing to do with whether or not there are multiple events21:46
dansmithI don't even care if you currently batch them up and send them all at once today21:46
efrieddansmith: tbf, it's not "of course"; that's an architectural decision that was made at some point. Which is fine if that's the way it is. And21:46
efriedSundar: I agree we want to have parity with the network/volume operations, even if it doesn't make sense for accels right now.21:46
dansmithit's about the data structure and the protocol, and how that affects the future21:46
efriedso yeah, it happens that for build we'll abort the whole thing if any subset fails.21:47
efriedbut this will allow us to do the future thing without having to like rearchitect the API on both sides and sweat upgrades etc.21:47
efriedSundar: the added complexity is negligible, really.21:47
dansmiththat's the thing I want ... to be granular now so we can be granular in the future21:48
*** ociuhandu has quit IRC21:48
Sundardansmith: efried: just to be sure, you are ok if Cyborg bunches up all the ARQ events for an instance, after all ARQs have bound?21:49
efriedSundar: multiple events in the same POST /os-server-external-events call, yes.21:49
dansmithSundar: yes, I'm okay with you batching them all so they arrive at the same time as they do today, I just care that you represent them separately21:49
efrieddansmith: presumably if we run into the no-host-yet race, it would impact all the events in that payload, because all same instance, and that instance is loaded just once in the method...21:51
SundarThe current logic to wait_for_instance_event waits for a single event, right? So, if we poll Cyborg at that point, and find that all have resolved, can we still exit early, like today?21:51
dansmithSundar: you're going to need to do one of two things, I think, which is to poll cyborg once more without the binding=complete filter to get those, or pass them down from conductor so you have them. And yes, I'm fine with that and think it's worth the overhead21:51
dansmithefried: it doesn't matter21:51
efried(I just checked it, the instance is loaded once, so yeah, if one 422s, all will 422.)21:52
dansmithefried: the whole reason we got into this conversation is because he wasn't doing the skip granularly.. so we still do that, and only skip the ones we're missing, which is why we started having this discussion21:52
*** henriqueof has quit IRC21:52
dansmithefried: ah you mean the api side, yeah I think that's fine21:52
eanderssonmriedem, sean-k-mooney for the max_queues patch we don't need it as we just internally changed the logic to always allow 256 queues, but if I get some time over this week I can help backport it.21:52
*** ociuhandu has joined #openstack-nova21:52
*** henriqueof has joined #openstack-nova21:52
efriedSundar: wait_for_instance_event takes multiple events to wait for. If you query cyborg and a subset have already bound, you cancel just those events -- dansmith's patch accounts for that -- and then wait_for_instance_event will continue to wait for the remainder.21:53
efrieddansmith: right ^ ?21:53
dansmithprecisely21:53
Sundarefried: There's a problem with that. Nova doesn't keep ARQ UUIDs around, to use as tags in the event to wait for.21:54
dansmithSundar: I just addressed that above21:54
dansmith[13:51:32]  <dansmith>Sundar: you're going to need to do one of two things, I think, which is to poll cyborg once more without the binding=complete filter to get those, or pass them down from conductor so you have them. And yes, I'm fine with that and think it's worth the overhead21:55
efriedso21:56
efriedall_arqs = poll without binding=complete21:56
efriedwait_for_instance_event(all_arqs):21:56
efried    done_arqs = poll with binding=complete21:56
efried    if done_arqs: cancel(done_arqs)21:56
efrieddansmith: is that what you meant ^21:56
dansmiththat's one option yes21:56
efried"pass down from conductor" like change RPC signatures and stuff?21:56
dansmithyou could also stash the ARQ UUIDs in sysmeta in the conductor and avoid having to dick with the rpc signature21:57
efriedI would prefer the other, just because mucking with RPC gorp scares me.21:57
dansmithefried: so, something I've had in my back pocket on this is that you're already breaking upgrades21:57
dansmithefried: because you are assuming that computes are new enough to do the thing you're promising to the user, but no checks are made for that21:57
dansmithefried: an RPC change would bring that to the forefront21:58
efriedwell21:58
efriedyou'll never get scheduled to such a compute.21:58
efriedBecause such a compute will never advertise the accel inventory21:58
efriedso I think that's n/a21:58
dansmithefried: doesn't cyborg manage that?21:58
efriedthe cyborg agent on the compute21:58
dansmithright, so you're wrong :)21:59
dansmithnew cyborg agent, old compute agent = breakage21:59
efriedso as long as there's a xdep between the cyborg agent and compute on the same host21:59
efriedwhich, that must be a thing, no??21:59
dansmitheh?21:59
efriedI mean, we enforce that for placement21:59
dansmithyou mean like an RPM or DEB dependency?21:59
efriedno, I mean like cyborg agent refuses to start up if compute isn't at version xyz22:00
dansmithefried: how would it know?22:00
Sundardansmith: An old Nova would not query Cyborg for device profiles or create/bind ARQs22:00
dansmithSundar: a new nova control plane would22:00
dansmithSundar: nova supports backlevel nova-compute services, which is the scenario I'm talking about22:00
*** dviroel has quit IRC22:00
22:01 <Sundar> dansmith: You mean new n-api, n-super-cond, n-sch but old n-cpu?
22:01 <efried> ...do we seriously not have a way to query the nova version on a compute?
22:01 <dansmith> Sundar: yes, a very important scenario we have supported for a long time that people depend on
22:01 <efried> I mean, obv the RPC API does, which is what you're leading up to.
22:02 <dansmith> efried: we don't and shouldn't
22:03 <dansmith> I guess I better go ahead and drop that in the review somewhere before I go poof for the year and y'all try to merge this :)
22:03 <Sundar> dansmith: I may be missing something. We had the create/bind and the wait all in n-cpu, which would have avoided this issue (at the expense of less concurrency), right?
22:04 <efried> Sundar: no
22:04 <dansmith> Sundar: no, you still have the same problem with that arrangement
22:04 <efried> So, ugh, what we actually need is something like:
22:04 <efried> compute advertises a COMPUTE_CAN_DO_CYBORG_SHIT trait so that
22:04 <efried> cyborg (which already queries that RP to know where to hang the accel RPs) can know whether it's even allowed to do that.
22:04 <dansmith> efried: or do it with service version but you might as well do an RPC version in that case
22:04 <efried> Sundar: The create/bind happens at deploy time, when you've already picked a host. We're talking about the bootstrapping process where cyborg decides to advertise accel resources in the first place.
22:04 <dansmith> efried: the trait is more expensive unless you've already got them
22:05 <efried> dansmith: expensive how? Because an extra placement query, once, on startup? Not worried about that.
22:05 <efried> or were you talking about something else?
22:05 <dansmith> something else, and I misunderstood what you mean so let me 'splain:
22:06 <dansmith> efried: you could either try to avoid cyborg hanging the inventory off the compute at bootstrap time, or you could avoid letting the conductor/scheduler start down this path for a compute that can't handle it
22:06 <efried> the former gets you the latter for free
22:06 <dansmith> IMHO, the latter is the job of nova anyway,
22:06 <dansmith> and would fit with a "this compute isn't new enough to do that thing you want" sort of check
22:06 <dansmith> efried: it depends on another service for upgrade correctness though
22:06 <efried> if you expose the inventory but have something else that prevents you from using it, that's a waste, isn't it?
22:07 <dansmith> i.e. it depends on cyborg being well-behaved, and/or not being modified to just always do it
22:07 <dansmith> efried: well, it means that cyborg can be upgraded and working before you've upgraded all your computes
22:07 <efried> heck, on that theory we could hang the inventory off a *really* old compute and it mushroom clouds because it can't handle nested providers at all.
22:07 <dansmith> so if you upgrade service by service, you don't have to loop back around
22:07 <dansmith> which was always annoying with things like ironic that had kinda circular deps with nova
22:08 <efried> cyborg is polling though, isn't it Sundar?
22:08 <dansmith> efried: well, that might be a good reason to do both then I guess, I dunno
22:08 <Sundar> efried: Sorry. Cyborg is polling for what?
22:08 <efried> Heh, mushroom cloud. That should be a thing.
22:09 <Sundar> Polling for new devices? Yes
22:09 <dansmith> efried: I understand that we avoid asking for the new thing in a roundabout way by not exposing a trait, cyborg not exposing inventory, placement not returning any candidates, and thus us not actually running our control plane code,
22:09 <efried> Sundar: If cyborg agent is shiny and new on a host, but the nova-compute on the host is downlevel
22:09 <efried> and then you upgrade the nova-compute
22:09 <efried> will cyborg eventually figure that out and realize it should now start reporting accel inventory on that host?
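[Editor's note: the bootstrap-time gating being discussed here — Cyborg only hanging accelerator inventory off a compute node RP once that host's nova-compute advertises support, and picking it up on a later poll after an upgrade — could be sketched roughly as below. The trait name and both helpers are illustrative assumptions, not actual Cyborg agent code.]

```python
# Illustrative sketch only: trait name and helpers are hypothetical,
# not real Nova/Cyborg API.
CAN_DO_CYBORG = "COMPUTE_ACCELERATORS"  # assumed capability trait


def should_report_accel_inventory(compute_rp_traits):
    """Gate accelerator inventory on the compute advertising support.

    compute_rp_traits: the set of traits currently on the compute node
    resource provider (as fetched from placement).
    """
    return CAN_DO_CYBORG in compute_rp_traits


def poll(compute_rp_traits, report_inventory):
    # Called from the agent's periodic device poll: if nova-compute was
    # upgraded since the last pass, the trait appears and accelerator
    # inventory starts being reported on the next cycle.
    if should_report_accel_inventory(compute_rp_traits):
        report_inventory()
```

Under this sketch, an old nova-compute never exposes the trait, so no accel inventory is ever advertised and the scheduler can never pick that host for an accelerator request.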
22:10 <dansmith> efried: but it would be a lot more in line with our regular checks to just be explicit about it, even if we should never hit it because of the other
22:10 <efried> dansmith: okay, so what is it that you're suggesting exactly? I didn't pick up on that.
22:10 <efried> An RPC version check from conductor to compute?
22:10 <openstackgerrit> Mykola Yakovliev proposed openstack/nova master: Fix boot_roles in InstanceSystemMetadata  https://review.opendev.org/698040
22:11 <dansmith> efried: if you pass the ARQs down in the spawn call, you need an RPC version, so you get a service version, which is something you can just check for easily in conductor, alternately just bump the service version without the rpc and check for that in conductor
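[Editor's note: a conductor-side check along these lines might look as follows. The version constant, function name, and error text are assumptions for illustration, not the real nova conductor code.]

```python
# Hypothetical sketch: the service version number is made up.
ACCEL_SUPPORT_SERVICE_VERSION = 42  # assumed version that added ARQ support


def check_computes_support_accel(min_compute_service_version, requests_accel):
    """Fail fast in conductor if the oldest nova-compute service in the
    deployment predates accelerator (ARQ) support."""
    if requests_accel and (
            min_compute_service_version < ACCEL_SUPPORT_SERVICE_VERSION):
        raise RuntimeError(
            "Accelerators requested but some nova-compute services are "
            "too old to handle ARQs")
```

This is the "just check for that in conductor" variant: it needs no RPC signature change, only a service version bump.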
22:11 <dansmith> efried: or expose it as a trait and have conductor/scheduler filter or check for that trait before agreeing to use that compute
22:11 <dansmith> efried: perhaps the easiest thing would be just "if we have accel resources, also add CAN_DO_CYBORG_SHIT trait into the requirements" before we call placement during scheduling?
22:12 <dansmith> you'll need to do it again for attach when we add that
22:12 <dansmith> so CAN_DO_CYBORG_SHIT_V2alpha
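[Editor's note: the "add the trait into the requirements before we call placement" step could be sketched as below. The trait name and helper are illustrative; `accel:device_profile` is assumed here as the flavor extra-spec key that requests a Cyborg device profile.]

```python
# Illustrative sketch only: trait and extra-spec names are assumptions.
CAN_DO_CYBORG = "COMPUTE_ACCELERATORS"  # assumed capability trait


def required_traits_for_request(flavor_extra_specs, required_traits=()):
    """Before the placement query, require the capability trait whenever
    the flavor asks for an accelerator device profile, so that old
    computes are filtered out by placement itself."""
    required = set(required_traits)
    if "accel:device_profile" in flavor_extra_specs:
        required.add(CAN_DO_CYBORG)
    return required
```

A later attach-style operation would, as dansmith notes, need its own trait added the same way.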
22:13 <efried> dansmith: I'm not offended by the thought of those capability traits, which seem on par with what we have for e.g. "can do multiattach" or whatever
22:13 <efried> so IMO we should do those regardless
22:13 <dansmith> ack
22:13 <dansmith> so cyborg can look for that,
22:13 <efried> those are simply flags on that virt driver dict
22:13 <dansmith> and we throw that into the scheduling request also if we're doing accel stuff?
22:13 <efried> Well, I don't think we need it in the sched request, if cyborg is using it to decide whether to present the inventory in the first place.
22:14 <efried> but if you think it should be there for form's sake, I wouldn't object.
22:14 <dansmith> again, that puts the onus on cyborg to be responsible, which I don't like
22:14 <efried> where's the trust??
22:14 <Sundar> So, the use case for all this is that the operator added an accelerator to a compute node but did not upgrade n-cpu on that node?
22:14 <dansmith> right, thank you :)
22:14 <dansmith> efried: let me check my butt, hang on
22:14 <efried> That's not spelled "onus" dansmith
22:15 <dansmith> hehe
22:15 <efried> Sundar: yes, exactly.
22:15 <efried> Sundar: we're trying to make nova account for poorly-behaved a) operators, b) service code from other services.
22:15 <efried> and not-work in a predictable rather than unpredictable (and potentially unrecoverable) way.
22:16 <dansmith> just to be clear,
22:16 <dansmith> I definitely think that the rpc method change is the right thing to do, instead of just coding in new assumptions on both sides
22:16 <dansmith> I understand the hesitation to that, so I won't hold strictly to it, despite it being the safer option, IMHO
22:17 <efried> dansmith: if we did that, we would have to have a post-placement filter to remove hosts based on RPC version, right?
22:17 <efried> perhaps we already have that.
22:17 <Sundar> What's wrong with the option where Nova places a CAN_DO_CYBORG trait on some compute node RPs, and factors that into the Placement query?
22:18 <efried> Sundar: nothing's wrong with that, and we should do it, but we're talking about doing another thing as well.
22:18 <dansmith> efried: no, you'd still want to pick a capable host during scheduling, and scheduling based on rpc version is not a thing
22:18 <efried> dansmith: so what's the RPC version for again? Just passing the arq uuids?
22:19 <dansmith> efried: yes, and to crystallize our request to the compute node so that (a) the RPC layer can tell if we're (for whatever reason) asking the compute node to do something it can't handle, and so that compute doesn't bake the assumption of the instance having a magical key in its flavor into meaning that it should call to cyborg to do all this stuff
22:20 <dansmith> efried: because sometime in the future, the mere presence of that flavor key will imply a slightly different thing, and we will have an upgrade concern to deal with because we have a bunch of old computes that will still be doing the assume-it thing
22:20 <dansmith> I made reference to this in a previous discussion about all of this
22:22 <efried> This is getting into details I probably don't care about rn, but do we get (a) for free (does the call bounce if the compute is too old) or do we have to do an explicit check?
22:22 <dansmith> summarized the bits you do care about here: https://review.opendev.org/#/c/631244/51/nova/conductor/manager.py
22:23 <efried> ack
22:23 <dansmith> efried: the rpc layer knows what the minimum supported version is across the cluster, so when it is converting the request, it can explode if doing so would drop an arqs parameter
22:24 <dansmith> efried: it gets you the circuit breaker, not necessarily the "graceful" part -- that should still be scheduler-based
22:24 <efried> cool
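[Editor's note: the circuit breaker dansmith describes — explode instead of silently dropping the arqs parameter when downgrading a call for an older compute — could be sketched like this. The version numbers, method shape, and error text are hypothetical, not the real nova/oslo.messaging RPC code.]

```python
# Hypothetical sketch of the RPC-layer circuit breaker: version numbers
# and names are made up for illustration.
ARQ_RPC_VERSION = (5, 11)  # assumed RPC version that added the arqs argument


def prepare_spawn_kwargs(pinned_version, arqs=None, **kwargs):
    """Downgrade the call for an older compute, but refuse to silently
    drop accelerator requests the caller asked for."""
    if pinned_version >= ARQ_RPC_VERSION:
        kwargs["arqs"] = arqs
        return kwargs
    if arqs:
        raise ValueError(
            "Compute RPC is pinned to %s.%s, which cannot accept ARQs"
            % pinned_version)
    return kwargs  # old compute, no accel request: safe to downgrade
```

As noted above, this only breaks the circuit; picking a capable host in the first place remains a scheduler/placement concern.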
22:26 <dansmith> Sundar: while you're here, what's the plan for getting some small example of this running with an actual pci device in the intel pci ci system (or something)?
22:27 <Sundar> dansmith: We have the server and the FPGA in a lab, and one person is working on that. We expect it by Jan, if not this year.
22:28 <Sundar> FWIW, I check each patch set with real FPGAs, with different device profiles, and also the fake driver.
22:29 <efried> We should have a 3rd party CI called "Intel Cyborg Sundar Manual CI"
22:30 <dansmith> Sundar: that would be good except that I totally don't trust you as much as a computer
22:30 <dansmith> :)
22:30 <Sundar> We already have that hehe
22:30 <dansmith> Sundar: so one server and one fpga and one person.. what is the plan? to have that be a manual "recheck use fpgas" command or something?
22:30 <Sundar> dansmith: Too bad, I am more reliable than other past Intel CIs ;)
22:30 <dansmith> Sundar: that's true, although doesn't really change my feelings about it :)
22:31 <efried> dansmith: experimental queue?
22:31 <dansmith> efried: that's what I meant.. some not-on-everything per-request method
22:31 <efried> or severely restrict files
22:31 <efried> but yeah
22:32 <dansmith> per-request would be better I think, if it's severely constrained
23:09 <openstackgerrit> Matt Riedemann proposed openstack/nova master: DNM: debug cross-cell resize  https://review.opendev.org/698304
23:17 <openstackgerrit> Merged openstack/nova master: Add resource provider allocation unset example to troubleshooting doc  https://review.opendev.org/696582
23:17 <openstackgerrit> Merged openstack/nova master: nova-net: Convert remaining API tests to use neutron  https://review.opendev.org/696509
23:38 <pmatulis> how do i map the hypervisor long ID (server show) to the hypervisor hostname (hypervisor list/show)?
23:40 <dansmith> pmatulis: if you server-show as admin, you should see the unobfuscated hostid, IIRC
23:41 <dansmith> pmatulis: as OS-EXT-SRV-ATTR:hypervisor_hostname

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!