Thursday, 2018-08-30

*** Sundar has quit IRC00:12
*** efried1 has joined #openstack-cyborg00:50
*** efried has quit IRC00:51
*** efried1 is now known as efried00:51
*** efried has quit IRC00:58
*** efried has joined #openstack-cyborg01:20
*** jiapei has joined #openstack-cyborg01:23
openstackgerritwangzhh proposed openstack/cyborg master: Add "Report device data to cyborg"  https://review.openstack.org/59669102:32
openstackgerritwangzhh proposed openstack/cyborg master: Add "Report device data to cyborg"  https://review.openstack.org/59669102:58
openstackgerritwangzhh proposed openstack/cyborg master: Add "Report device data to cyborg"  https://review.openstack.org/59669103:15
*** jiapei has quit IRC03:33
openstackgerritXinran WANG proposed openstack/cyborg master: Allocation/Deallocation API Specification  https://review.openstack.org/59799105:43
*** openstackgerrit has quit IRC06:07
*** openstackgerrit has joined #openstack-cyborg06:16
openstackgerritwangzhh proposed openstack/cyborg master: Add "Report device data to cyborg"  https://review.openstack.org/59669106:16
nguyenhai_Does the openstack/os-acc project belong to cyborg?07:02
nguyenhai_Are the cyborg core reviewers responsible for openstack/os-acc or not? Thanks.07:02
*** sahid has joined #openstack-cyborg07:48
kosamaraSundar: yes :)07:50
kosamaraSundar: But this spec is derived from efried's nova-powervm spec, which you have already commented on. The core principles are the same, but a lot of things have changed.07:51
*** sahid has quit IRC08:23
*** sahid has joined #openstack-cyborg08:25
*** kosamara has quit IRC10:01
*** kosamara has joined #openstack-cyborg10:03
openstackgerritMerged openstack/os-acc master: import zuul job settings from project-config  https://review.openstack.org/59283614:09
openstackgerritMerged openstack/os-acc master: switch documentation job to new PTI  https://review.openstack.org/59283714:16
openstackgerritMerged openstack/os-acc master: add python 3.6 unit test job  https://review.openstack.org/59283814:16
openstackgerritMerged openstack/os-acc master: add lib-forward-testing-python3 test job  https://review.openstack.org/59283914:19
*** sahid has quit IRC16:55
openstackgerritwangzhh proposed openstack/cyborg master: Add "Report device data to cyborg"  https://review.openstack.org/59669118:58
*** Sundar has joined #openstack-cyborg19:49
Sundarefried: Please ping me when you can19:50
efriedSundar: Hi!19:50
SundarHi, I got tons of feedback on the os-acc spec ;). Can we discuss them now?19:51
efriedokay, sure19:51
efriedThough I think you got most of it from sean19:51
SundarOne feedback was that the attaching to the VM should be left to Nova virt driver, and os-acc should not do it. That is hypervisor-specific and that's what virt drivers are for.19:52
SundarDo you agree with that?19:52
efriedI... think so, yes. I think that's how neutron plugins operate. But maybe it's not how the os-vif model is set up.19:53
efriedanyway, it's certainly hypervisor-specific. No question there.19:54
efriedSo when you say "os-acc" you really mean "the plugin at the behest of os-acc".19:54
SundarOne main difference between os-vif and os-acc that I am struggling to communicate is this: with os-vif, a VIF gets allocated, port binding happens and then the plug() operation happens.19:55
SundarWith os-acc, the equivalent of binding and plug cannot be separated cleanly. Example: A GPU may need to be reconfigured to create vGPUs of a certain type. Until that point, the accelerators of the right type don't even exist. So, the device may need to be configured to create the right accelerators, or change the inventory of accelerators, before we get the attach handle19:57
efriedokay. That stuff is *also* hypervisor-specific.19:58
SundarSure, it is also device-specific. The device driver has to do this in a hypervisor-specific way19:59
efriedMay be the purview of the "driver" rather than the "plugin" but still.19:59
SundarYes, I think we already agree that the driver would be hypervisor-specific19:59
efriedit may in fact be device *type* specific, not necessarily device specific.19:59
efriedAnyway, I think we're in violent agreement on the high points here.20:00
SundarWell, the actual act of writing to the device and manipulating it to create a new vGPU type or program a bitstream would in fact be specific to device models or vendors.20:00
SundarSo, the Cyborg and its driver(s) would have to handle the equivalent of port binding (device specific) and plug. That gives us a VAN with the device end configured, ready to be attached to an instance20:02
SundarAfter that, Nova virt can take that VAN and do the needful for that hypervisor, in a device-independent way20:02
efriedand does that happen in a separate step prior to spawn?20:02
SundarYes. I shared a flow diagram with you this morning.20:03
SundarBefore we get to that :),20:03
Sundarthe role of os-acc plugin is highly diminished with these aspects considered. The driver does the device end, Nova virt does the instance end20:04
efried"nova virt" by invoking the plugin through os-acc?20:04
SundarThe possible role for os-acc extensions is to handle device-compute interactions, such as NUMA affinity for interrupt vectors.20:04
SundarNova virt calls os-acc, and that in turn calls Cyborg in some way. When that call returns, Nova virt has to persist the instance - VAN association in Nova's db. neither os-acc nor Cyborg can do that, so it has to go to Nova virt. The subsequent step of attaching to the VM is already in Nova virt.20:06
SundarSo, what does the plugin do for attach?20:06
* efried shrugs20:07
efriedno idea20:07
efriedif it's nothing, it's nothing.20:07
SundarI understand your need for keeping things hypervisor-specific, but Nova virt is already hypervisor-specific, right?20:08
efriedyup20:08
efriedI don't really have a stake in getting lots of code into these plugins.20:08
efriedI just want to make sure we don't end up with linux-isms in common code paths.20:08
efriedlinux/libvirt20:08
SundarOK. Could you look at the flow diagram I shared this morning?20:09
efriedIf I didn't have to write a separate driver/plugin at all, I would be pretty happy. But I'm not sure that is going to happen.20:09
SundarNot to spoil your afternoon, but I think you would need PowerVM drivers for GPUs, FPGAs, or whatever you support :)20:10
SundarAre you looking at Power+KVM hypervisor too?20:11
Sundarbrb in 5 min20:12
efriedI don't know anything about pkvm20:12
Sundarback. ok20:14
SundarBack to os-acc plugins. The possible role for os-acc plugins is to handle device-compute interactions, such as NUMA affinity for interrupt vectors. But it may be too much to bring them in this spec. Shall we put os-acc plugins as a placeholder until we get to that?20:15
efried"NUMA affinity for interrupt vectors" <== Greek to me.20:18
SundarOK, NUMA affinity for devices in general?20:18
efriedIf you don't have an actual use case for os-acc plugins, then...20:19
efriedNUMA affinity is something we want to be able to handle via resource provider structure.20:19
efriedwe can't yet20:19
efriedbut it's something we should bring up (again) in Denver.20:19
SundarSure. Shall we have a f2f chat at Denver? It'll be good to sync up and drive the specs to a close shortly thereafter20:20
efrieddefinitely20:21
efriedIf I were you, I would ask for a nova/cyborg cross-project session.20:21
efriedBased on what I know of people's schedules, I would say Tuesday would be a good day for it.20:21
SundarI have entered a slot in Nova etherpad. Maybe I should ping melwitt?20:22
efriedIt might be better to schedule something during cyborg's time (M/T) where you can get the "placement cores" to visit the cyborg room and spend a couple of hours hashing stuff out.20:23
efriedAnd keep it on the nova schedule in case we a) don't get everything sorted; and/or b) need a wider audience.20:23
SundarOK20:23
efriedSend something to the dev ML, like what Blazar did, to organize the Tuesday thing.20:24
efriedoh20:24
efriedSorry, I forgot, Blazar sent that directly to people.20:24
efriednot to the ML.20:24
SundarMeanwhile, please LMK if you are ok with the flow diagram. Did you get it?20:24
efriedI got it.20:26
SundarGreat. I'll grab lunch and get back in 15 min. We can shake it out after that.20:27
efriedgive me 2020:29
*** ildikov has joined #openstack-cyborg20:53
Sundarefried: Ready any time21:07
efriedSundar: Nova meeting in progress. Not sure how long it will last, but not beyond top of the hour. Will you be around?21:08
SundarYes21:09
efriedSundar: ō/21:21
SundarThat was quick.21:23
efriednow where did I put that diagram...21:24
efriedgot it21:24
SundarThe main thing to note in the flow is that there isn't much that os-acc is doing.21:25
SundarMost of it is Nova virt or Cyborg. I am still pondering what os-acc can do that is not device-specific (and so in Cyborg) or hypervisor-specific (and so in Nova virt)21:25
efriedSundar: as a touchpoint/router and future extension point for those pieces of the requests, it may be useful.21:26
efriedbut I'm really not sure.21:27
SundarYes, I agree. That's what I was thinking21:27
efriednot sure it makes much sense for the calls from os-acc to cyborg API/conductor to be asynchronous if it's just going to poll for them to complete right away...21:28
SundarThe calls may take milliseconds to possibly seconds, depending on whether Glance bitstreams need to be fetched, one or more FPGAs need to be programmed, etc21:29
efriedso?21:30
SundarThat is why it is async21:30
efriedThe caller is blocking on their completion anyway21:30
efriedSo why does it matter if they take "a long time"?21:30
SundarAh, the n-cpu could do other things while this is blocked, right?21:30
efriedcould it?21:30
efriedI guess.21:31
SundarThe allocation of other resources -- networking, storage -- could go in parallel21:31
efriedJust sounds like it would make things pretty complicated.21:31
efriedIf n-cpu wanted to parallelize, it could send that request off in its own thread.21:31
efriedBut having it be async *forces* n-cpu to deal with that async-ness.21:31
efriedAnyway, this is really a nit.21:31
efriednot a substantive thing.21:31
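
[Editor's sketch] The alternative efried describes above, keeping the accelerator call synchronous and letting the caller parallelize it in its own thread, might look roughly like the following. This is a minimal illustration under stated assumptions, not Nova or os-acc code: `bind_accelerators`, `allocate_network`, and `spawn` are hypothetical names, the sleeps stand in for real work, and the standard-library thread pool is used here only for simplicity.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def bind_accelerators(instance_uuid, delay=0.05):
    """Hypothetical blocking call standing in for n-cpu -> os-acc -> Cyborg."""
    time.sleep(delay)  # simulates bitstream fetch / device programming
    return {"instance": instance_uuid, "vans": ["van-1"]}


def allocate_network(instance_uuid, delay=0.05):
    """Hypothetical blocking call standing in for another resource allocation."""
    time.sleep(delay)
    return {"instance": instance_uuid, "ports": ["port-1"]}


def spawn(instance_uuid):
    """Run both blocking calls in parallel; each callee stays synchronous."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        acc_future = pool.submit(bind_accelerators, instance_uuid)
        net_future = pool.submit(allocate_network, instance_uuid)
        # The caller chooses where to wait; the API does not force async on it.
        return acc_future.result(), net_future.result()
```

Calling `spawn("inst-1")` returns both results once the slower of the two calls finishes, so the choice to parallelize stays with the caller instead of being imposed by an asynchronous API.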
SundarThe n-cpu to os-acc could be async too.21:32
SundarSo, os-acc deals with the async weirdness.21:32
SundarHmm, well, os-acc to Cyborg can be sync if os-acc itself is called in an async way21:32
SundarJust wondering if a REST API call can really block for seconds21:33
efriedabsolutely.21:33
efriedI'm sure there's a connection timeout at the HTTP level. But seconds shouldn't be a problem.21:34
SundarAre there precedents in OpenStack where a REST API blocks for seconds?21:34
efriedI have no idea.21:35
efriedThat might be something edleafe and/or cdent would know.21:36
SundarOK. Another note: The proposal calls for persisting VAN objects in Cyborg db, but Nova db maintains the association between the instance and its VAN UUIDs. That's because, on a VM suspend for example, Nova would need to call os-acc to detach VANs but not deallocate them21:37
SundarNever mind, the detach would be in Nova virt.21:37
SundarBut you get the point21:38
SundarOn a termination, Nova virt would detach each VAN and then call os-acc to release the resources21:38
efriedright, some kind of handle (i.e. the VAN UUID) would need to be associated with the instance in the nova db.21:38
SundarYes21:38
efriedAs long as we can query cyborg with the UUID to get the rest of the VAN info, the UUID should be the only thing we need to store, I would think.21:39
SundarYes21:39
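
[Editor's sketch] The split agreed on above (Cyborg persists the full VAN record, Nova stores only the VAN UUIDs per instance, suspend detaches without deallocating, termination detaches and then releases) can be illustrated with in-memory stand-ins. Every class, method, and field name below is hypothetical and chosen for illustration; none of it is the Cyborg, os-acc, or Nova API.

```python
import uuid


class CyborgStore:
    """Hypothetical stand-in for the Cyborg db: owns the full VAN records."""

    def __init__(self):
        self._vans = {}

    def create_van(self, device, attach_handle):
        van_uuid = str(uuid.uuid4())
        self._vans[van_uuid] = {"device": device, "attach_handle": attach_handle}
        return van_uuid

    def get_van(self, van_uuid):
        return self._vans[van_uuid]

    def has_van(self, van_uuid):
        return van_uuid in self._vans

    def release_van(self, van_uuid):
        del self._vans[van_uuid]


class NovaStore:
    """Hypothetical stand-in for the Nova db: instance -> VAN UUIDs only."""

    def __init__(self):
        self._instance_vans = {}

    def associate(self, instance_uuid, van_uuid):
        self._instance_vans.setdefault(instance_uuid, []).append(van_uuid)

    def vans_for(self, instance_uuid):
        return list(self._instance_vans.get(instance_uuid, []))

    def clear(self, instance_uuid):
        self._instance_vans.pop(instance_uuid, None)


def suspend(nova, cyborg, instance_uuid, detached):
    """Detach each VAN (virt-driver side) but keep it allocated in Cyborg."""
    for van_uuid in nova.vans_for(instance_uuid):
        detached.append(cyborg.get_van(van_uuid)["attach_handle"])


def terminate(nova, cyborg, instance_uuid, detached):
    """Detach each VAN, release it in Cyborg, then drop the association."""
    for van_uuid in nova.vans_for(instance_uuid):
        detached.append(cyborg.get_van(van_uuid)["attach_handle"])
        cyborg.release_van(van_uuid)
    nova.clear(instance_uuid)
```

Because the UUID is enough to look up everything else, the Nova-side record stays a plain list of identifiers and the device details live in one place.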
SundarGood. Since we are in agreement, I'll write this up in the spec.21:42
efriedSundar: Note that I'm only one person, and my stake in the details we've just discussed is not an especially strong one.21:43
SundarWe need some path to get this converged. I am incorporating sean-mooney's comments and responding to them. But, if somebody else were to come along in a month, we can't keep waiting.21:45
SundarI hope to get this closed during the PTG or at most a week after21:45
SundarI requested melwitt for a PTG session. Could I ask for your help in ensuring that all Nova feedback is given by that time?21:46
efriedhah22:13
efriedYou're asking me to move mountains.22:13
efriedThere's no way to force all the stakeholders to review a thing. And there's no way to prevent taking weeks or months to revise something with the collaboration of many people only to have someone who should have been involved from the start come along and throw a huge monkey wrench in the works.22:14
efriedI think your best bet is to be ready to present an overview of the architecture at the PTG, counting on at least some key players in the audience to have *not* reviewed the specs, and talk out some of the issues there.22:16
efriedSundar: ^22:16
