Wednesday, 2022-09-28

opendevreviewAmit Uniyal proposed openstack/nova master: Adds check for VM snapshot fail while quiesce  https://review.opendev.org/c/openstack/nova/+/85217106:10
bauzasJayF: ping me when you're back if you want, sure07:00
opendevreviewBalazs Gibizer proposed openstack/nova stable/yoga: Bump min oslo.concurrencty to >= 5.0.1  https://review.opendev.org/c/openstack/nova/+/85942110:10
damiandabrowskihey folks, I noticed that if a compute node already exceeds its overcommit ratios, it's not possible to migrate VMs out of it10:29
damiandabrowskiit makes no sense to me and should be considered a bug, but maybe there is some reason behind it?10:29
damiandabrowskihttps://paste.openstack.org/raw/byjHufprWu3hwDFPSKAD/10:29
gibidamiandabrowski: the reason is that if the compute already exceeds the overcommit then it cannot accept new allocations. And the migration needs to move the instance allocation to the migration_uuid in placement, and that would require a new allocation even if the total usage would not change. 10:37
gibiwhat you can do is10:37
gibii) increase the allocation ratio temporarily to a level the usage no longer exceeds10:38
gibiii) migrate the workload out10:38
gibiiii) revert the allocation ratio to the original level10:38
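A minimal sketch of that temporary bump, assuming openstacksdk with a configured cloud and raising the VCPU allocation ratio directly in placement; the cloud name, provider UUID and the new ratio are placeholders. Note that nova-compute may overwrite a direct placement edit on its next periodic update depending on how the *_allocation_ratio options are set, so changing the ratio in nova.conf on the host is the more reliable route:

    import openstack

    conn = openstack.connect(cloud='mycloud')      # hypothetical cloud name
    placement = conn.placement                     # the proxy is a keystoneauth Adapter, so raw calls work

    rp_uuid = 'REPLACE-WITH-PROVIDER-UUID'         # placeholder resource provider UUID
    # Fetch the current VCPU inventory; the body includes the
    # resource_provider_generation that the PUT needs.
    inv = placement.get(f'/resource_providers/{rp_uuid}/inventories/VCPU').json()
    inv['allocation_ratio'] = 32.0                 # temporarily raised; revert after the migrations
    placement.put(f'/resource_providers/{rp_uuid}/inventories/VCPU', json=inv)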
damiandabrowskithanks, i'll do that10:39
damiandabrowskiso you don't consider it a bug, right?10:39
gibiyou exceeded the overallocation, so you are in an unsupported state I would say10:40
damiandabrowskiokok, thanks for clarification10:41
opendevreviewOleksii Butenko proposed openstack/os-vif master: Add new os-vif `network` property  https://review.opendev.org/c/openstack/os-vif/+/85957412:17
jkulikdamiandabrowski: FYI, we have a patch downstream in our Placement to allow switching resources on overused providers, so migrations off should still be possible: https://github.com/sapcc/placement/commit/2691056d6fa3e7db0cf9966b082af51ae6b5dda912:19
jkulikuse at your own risk, though12:20
damiandabrowskioh that's convenient, thanks12:29
opendevreviewOleksii Butenko proposed openstack/nova master: Napatech SmartNIC support  https://review.opendev.org/c/openstack/nova/+/85957712:41
opendevreviewJ.P.Klippel proposed openstack/nova master: Fix link to Cyborg device profiles API  https://review.opendev.org/c/openstack/nova/+/85957812:53
opendevreviewJustas Poderys proposed openstack/nova-specs master: Add support for Napatech LinkVirt SmartNICs  https://review.opendev.org/c/openstack/nova-specs/+/85929013:17
*** dasm|off is now known as dasm13:18
justas_napaHi bauzas, gibi. Re: Napatech smartnic support, we have now pushed the code changes to opendev13:18
justas_napaall changes are visible in opendev via https://review.opendev.org/q/topic:Napatech_SmartNIC_support13:19
noonedeadpunkgibi: I guess the question is - why is a new allocation created in the first place? It's already present, isn't it?13:45
noonedeadpunkso why is another allocation being created against the source host13:46
gibinoonedeadpunk: it is a technicality. When the instance is created, the resource allocation of the instance is held by consumer=instance.uuid in Placement. When the instance is being migrated, Nova first moves the allocation held by consumer=instance.uuid to consumer=migration.uuid in Placement, then Nova calls the scheduler to select a target host for the migration and allocates the resource on the target host to consumer=instance.uuid in Placement. 13:47
gibinoonedeadpunk: the move of the allocation from instance.uuid to migration.uuid is the one that is considered a new allocation in Placement and rejected, as the host is already overallocated13:48
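For illustration, roughly what that move looks like at the placement API level: nova POSTs a payload like the one below to /allocations (microversion >= 1.28), creating the allocation under the migration consumer and emptying the instance consumer in one atomic request. All UUIDs and resource amounts here are made up. The migration consumer's new allocation goes through the normal capacity check, which is why it fails on an already overallocated provider:

    # Illustrative payload only; every UUID and amount is a placeholder
    instance_uuid = '11111111-1111-1111-1111-111111111111'
    migration_uuid = '22222222-2222-2222-2222-222222222222'
    source_rp_uuid = '33333333-3333-3333-3333-333333333333'

    move_payload = {
        migration_uuid: {
            # the migration consumer takes over the source-host allocation
            'allocations': {source_rp_uuid: {'resources': {'VCPU': 4, 'MEMORY_MB': 8192}}},
            'project_id': 'project-uuid',
            'user_id': 'user-uuid',
            'consumer_generation': None,   # brand new consumer
        },
        instance_uuid: {
            'allocations': {},             # the instance's claim on the source host is released
            'project_id': 'project-uuid',
            'user_id': 'user-uuid',
            'consumer_generation': 1,
        },
    }
    # nova's placement report client effectively does: POST /allocations with this body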
noonedeadpunkwell, it's quite a fine line between a bug and "as designed"13:48
noonedeadpunkbecause to me, an allocation change that does not involve a change of resources shouldn't be considered a new one...13:49
gibiI intentionally don't say whether it is a bug or not; what I say is that having the host overallocated (over the overallocation limit) is the cause of the problem. A host should never be in that state13:49
noonedeadpunkbut I do agree that the workaround is easy enough not to spend time making the placement logic more complex for this13:50
gibiin all these cases you have to first figure out why and how you ended up in an overallocated state13:51
gibiand as you said, recovering from this state is not that hard13:51
noonedeadpunkA super simple example: you decide that the overcommit ratio is too high and it would be good to lower it. And you have Prometheus that can tell you that, on average for the region, the overcommit is lower than what you want to define 13:52
noonedeadpunkSo you lower it, but some computes still have a higher overcommit, as you have not checked that against placement per host13:53
noonedeadpunkAnd you can't lower it easily by reverting the setting and disabling the compute for scheduling on top13:53
noonedeadpunk* And you can't lower it easily by migrating instances out, so you have to revert the setting13:55
gibibasically that process makes the mistake of lowering the allocation ratio on the host before checking whether it causes overallocation13:59
gibiso the proper process would be (in my eyes): 1) disable the host temporarily to avoid new scheduling while you reconfigure it 2) migrate VMs out of the host to achieve the target resource usage 3) reconfigure the host with the new allocation ratio 4) enable the host14:01
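A rough sketch of steps 1, 2 and 4 with openstacksdk (step 3 is a nova.conf change plus a nova-compute restart on the host); the cloud and host names are placeholders, and the service disable/enable is shown as the equivalent CLI in comments to avoid depending on a particular SDK version:

    import openstack

    HOST = 'compute-01'                         # hypothetical hostname
    conn = openstack.connect(cloud='mycloud')   # hypothetical cloud name

    # 1) disable the host so the scheduler stops placing new VMs on it, e.g.:
    #    openstack compute service set --disable --disable-reason "ratio change" compute-01 nova-compute

    # 2) migrate VMs out (or stop once the target usage is reached)
    for server in conn.compute.servers(all_projects=True, host=HOST):
        conn.compute.live_migrate_server(server)

    # 3) set the new cpu/ram/disk allocation_ratio in nova.conf on the host and restart nova-compute

    # 4) re-enable the host, e.g.:
    #    openstack compute service set --enable compute-01 nova-compute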
noonedeadpunkwell... it's hard to disagree here :)14:03
noonedeadpunk(it's trickier to do with microversion 2.88, though)14:03
noonedeadpunkor maybe not, and it's just me who got used to dealing with the nova API...14:07
noonedeadpunkas, in the end, the representation of inventory in placement is way more accurate14:08
gibiyeah you should use placement to calculate how many vms you need to move14:10
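A small pure-Python illustration of that calculation: placement capacity per resource class is (total - reserved) * allocation_ratio, so what has to be freed before a lower ratio fits is simply the usage minus the new capacity (the numbers below are made up):

    # Example numbers for one compute's VCPU inventory (made up)
    total, reserved, usage = 48, 0, 320        # physical cores, reserved cores, VCPUs allocated
    new_ratio = 4.0                            # the lower cpu_allocation_ratio you want to set

    new_capacity = (total - reserved) * new_ratio
    overshoot = max(0, usage - new_capacity)   # VCPUs that must be migrated off first
    print(f'need to free {overshoot} VCPUs before the ratio can be lowered to {new_ratio}')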
auniyalHello #openstack-nova 14:29
noonedeadpunkgibi: the problem is that with the SDK it's quite tricky to use placement, as you need to actually call the API rather than get resources with all attributes in them (like it was with hypervisors)14:29
auniyalhow does this work? - https://opendev.org/openstack/nova/src/commit/aad31e6ba489f720f5bdc765c132fd0f059a0329/nova/context.py#L39614:30
noonedeadpunkgibi: just to express my pain as an operator: https://paste.openstack.org/show/bpA2Iq28mHEfo8r4sH2W/ takes about 3 seconds to execute. And as you see, it overrides the microversion to <2.88.14:55
noonedeadpunk If I am to use placement to get the same data, I will need code like this https://paste.openstack.org/show/bOsJwmnMMpftgZ01St0o/ and it takes... 42 seconds to execute against the exact same cluster14:56
noonedeadpunkAnd that's not even counting the load on the APIs, as for each resource provider I will need to make 2 calls to placement14:57
noonedeadpunkso for users the deprecation of providing resource statistics from hypervisors is quite a serious regression in performance. And if I want to monitor that regularly, it's really a waste of time.14:59
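For reference, the slow pattern being described, a per-provider loop against placement, looks roughly like this with openstacksdk (the cloud name is a placeholder; raw GETs through the placement proxy are used since the SDK does not expose usages as attributes the way the old hypervisor objects did):

    import openstack

    conn = openstack.connect(cloud='mycloud')   # hypothetical cloud name
    placement = conn.placement

    report = {}
    for rp in placement.get('/resource_providers').json()['resource_providers']:
        # two extra calls per provider: one for inventories, one for usages
        inv = placement.get(f"/resource_providers/{rp['uuid']}/inventories").json()['inventories']
        used = placement.get(f"/resource_providers/{rp['uuid']}/usages").json()['usages']
        report[rp['name']] = {
            rc: {'capacity': (inv[rc]['total'] - inv[rc]['reserved']) * inv[rc]['allocation_ratio'],
                 'used': used.get(rc, 0)}
            for rc in inv
        }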
gibinoonedeadpunk: interesting. so there are some heavy inefficiencies somewhere, as it's just 2x the number of calls but it takes more than 10 times as long to execute15:00
noonedeadpunkfwiw - I was testing a remote cloud (not from inside it)15:00
noonedeadpunkjust to show that I'm not exaggerating https://paste.openstack.org/show/b1irA4GLwJEdUORuLarb/15:03
noonedeadpunkBut I'm not sure about the 2x number of calls, as it feels like the hypervisor data was fetched with a single API call...15:04
noonedeadpunkNot sure though15:04
noonedeadpunk(or at least I can't explain it otherwise)15:05
gibihm, you are right, the hypervisor one returned all computes from the cluster 15:06
gibi/os-hypervisors/detail15:06
gibiso then I think it would make sense to add a similar API to placement15:06
gibia single call that returns all the usages per provider15:06
gibithis explains the performance difference15:07
noonedeadpunkyeah, and right now it actually depends quite dramatically on the number of providers15:08
gibiso I would support adding such an API to placement. I cannot commit to implementing it, but I can surely review the implementation15:08
noonedeadpunkI can only put it in my backlog and hope to get to it one day...15:09
gibinoonedeadpunk: and if you want to raise this issue, I think we have dedicated time at the coming PTG for operator feedback15:10
noonedeadpunkthat is a good idea actually15:10
gibinoonedeadpunk: https://etherpad.opendev.org/p/oct2022-ptg-operator-hour-nova15:11
noonedeadpunkthe first hour does not overlap with anything, so I can join :)15:11
gibicool15:11
jkulikyou could™ make the calls in parallel at least ;)15:17
noonedeadpunkisn't it still a waste of resources?15:19
noonedeadpunkand a regression from the operator perspective?15:19
jkuliksure. it would be helpful to get it in a single call, but at least it reduces the pain of waiting15:19
noonedeadpunkand add some hardware to serve the increased load? :)15:20
noonedeadpunkbut yeah, that's fair15:20
noonedeadpunk*fair workaround 15:20
noonedeadpunkfwiw, even with 10 threads it's still 5 times slower than just asking nova15:28
noonedeadpunkmaybe I shouldn't have used joblib...15:28
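A minimal sketch of that parallel fan-out with the stdlib thread pool instead, reusing the per-provider calls from the earlier snippet (same number of placement calls, just overlapped; assumes the shared keystoneauth session is safe to use across threads, which it generally is):

    from concurrent.futures import ThreadPoolExecutor

    import openstack

    conn = openstack.connect(cloud='mycloud')   # hypothetical cloud name
    placement = conn.placement
    rps = placement.get('/resource_providers').json()['resource_providers']

    def provider_usage(rp):
        # the same two calls per provider, now issued from worker threads
        inv = placement.get(f"/resource_providers/{rp['uuid']}/inventories").json()['inventories']
        used = placement.get(f"/resource_providers/{rp['uuid']}/usages").json()['usages']
        return rp['name'], inv, used

    with ThreadPoolExecutor(max_workers=10) as pool:
        report = {name: (inv, used) for name, inv, used in pool.map(provider_usage, rps)}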
*** dasm is now known as dasm|off21:31
rm_workhey, i've noticed the OSC seems to sometimes hide the "fault" field on a server show command... but also sometimes it doesn't. anyone know why this is the case?22:07
rm_workstarting to dig into the code now, but hoping someone has some idea, since the clients are often a bit funky to interpret, lots of magic 😛22:08
rm_workyeah, didn't find anything specifically relating to "fault" in there... this is super weird, I can see the "fault" come back with `--debug` in the response from nova, it just gets filtered out or something before the results are shown22:21
rm_worknevermind, figured it out, there's a client patch I didn't know about >_< FML22:31
clarkbrm_work: don't leave us hanging. What causes it?22:35
rm_worklocal client patch where someone did something ... ill advised22:41
rm_worktrying to figure out how to untangle it now22:41
clarkbOh I see what you mean by client patch now. I thought you meant you found the change in gerrit that did it or something22:45

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!