*** Yumeng has joined #openstack-cyborg | 01:59 | |
openstackgerrit | YumengBao proposed openstack/cyborg-specs master: Create an initial cyborg-specs repository using cookiecutter https://review.openstack.org/554766 | 03:07 |
*** jaypipes has quit IRC | 04:05 | |
*** jaypipes has joined #openstack-cyborg | 04:06 | |
*** masuberu has quit IRC | 06:04 | |
*** masber has joined #openstack-cyborg | 06:31 | |
*** masber has quit IRC | 07:05 | |
*** masber has joined #openstack-cyborg | 07:06 | |
*** masber has quit IRC | 08:40 | |
*** masber has joined #openstack-cyborg | 09:20 | |
*** Yumeng has quit IRC | 11:13 | |
*** edleafe has quit IRC | 13:29 | |
*** edleafe has joined #openstack-cyborg | 13:29 | |
*** Yumeng__ has joined #openstack-cyborg | 13:39 | |
*** shaohe_feng_ has joined #openstack-cyborg | 13:43 | |
*** zhipeng has joined #openstack-cyborg | 13:57 | |
*** Sundar has joined #openstack-cyborg | 13:58 | |
zhipeng | #startmeeting openstack-cyborg | 14:00 |
openstack | Meeting started Wed Mar 21 14:00:17 2018 UTC and is due to finish in 60 minutes. The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot. | 14:00 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 14:00 |
*** openstack changes topic to " (Meeting topic: openstack-cyborg)" | 14:00 | |
openstack | The meeting name has been set to 'openstack_cyborg' | 14:00 |
*** crushil has joined #openstack-cyborg | 14:00 | |
zhipeng | #topic Roll Call | 14:00 |
*** openstack changes topic to "Roll Call (Meeting topic: openstack-cyborg)" | 14:00 | |
zhipeng | #info Howard | 14:00 |
shaohe_feng_ | hi zhipeng | 14:01 |
kosamara | hi! | 14:01 |
zhipeng | hi guys | 14:01 |
crushil | \o | 14:01 |
zhipeng | please use #info to record your names | 14:01 |
kosamara | #info Konstantinos | 14:01 |
Sundar | #info Sundar | 14:01 |
crushil | #info Rushil | 14:02 |
shaohe_feng_ | #info shaohe | 14:02 |
Sundar | Do we have a Zoom link? | 14:05 |
zhipeng | let's wait for a few more minutes in case more people will join | 14:05 |
zhipeng | Sundar no we only have irc today | 14:05 |
Sundar | Sure, Zhipeng. Thanks | 14:05 |
Yumeng__ | #info Yumeng__ | 14:06 |
zhipeng | okey let's start | 14:09 |
zhipeng | #topic CERN GPU use case introduction | 14:09 |
*** openstack changes topic to "CERN GPU use case introduction (Meeting topic: openstack-cyborg)" | 14:09 | |
*** Li_Liu has joined #openstack-cyborg | 14:10 | |
zhipeng | kosamara plz take us away | 14:10 |
kosamara | I joined you recently, so let me remind you that I'm a technical student at CERN, integrating GPUs into our openstack. | 14:10 |
kosamara | Our use case is computation only at this point. | 14:10 |
kosamara | We have implemented and are currently testing a nova-only pci-passthrough. | 14:11 |
kosamara | We intend to also explore vGPUs, but they don't seem to fit our use case very much: licensing costs, limited CUDA support (both in nvidia's case). | 14:12 |
kosamara | At the moment the big issues are enforcing quotas on GPUs and security concerns. | 14:12 |
kosamara | 1. Subsequent users can potentially access data on the GPU memory | 14:13 |
*** helloway has joined #openstack-cyborg | 14:13 | |
kosamara | 2. Low-level access means they could change the firmware or even cause the host to restart | 14:13 |
kosamara | We are looking at a way to mitigate at least the first issue, by performing some kind of cleanup on the GPU after use | 14:14 |
kosamara | But in our current workflow this would require a change in nova. | 14:14 |
kosamara | It looks like something that could be done in cyborg? | 14:15 |
zhipeng | i think so, i remember we had similar convo regarding clean up on FPGA in the PTG | 14:15 |
Li_Liu | are you trying to force a "reset" after usage of the device? | 14:15 |
kosamara | Reset of the device? | 14:16 |
kosamara | According to a research article, in nvidia's case performing a device reset through nvidia-smi is not enough to prevent the data leaks. | 14:16 |
Li_Liu | something like that. like Zhipeng said, "clean up" | 14:16 |
kosamara | - article: https://www.semanticscholar.org/paper/Confidentiality-Issues-on-a-GPU-in-a-Virtualized-E-Maurice-Neumann/693a8b56a9e961052702ff088131eb553e88d9ae | 14:16 |
kosamara | The additional complexity in pci passthrough is that the host can't access the GPUs (no drivers) | 14:17 |
shaohe_feng_ | so once the GPU devices are detached, cyborg should do clean up at once. | 14:17 |
Sundar | What kind of clean up do you have in mind? Zero out the RAM? | 14:17 |
kosamara | My thinking is to put the relevant resources in a special state after deallocation from their previous VM and use a service VM to perform the cleanup ops themselves. | 14:18 |
Li_Liu | memset() to zero for all ? | 14:18 |
kosamara | Yes, zero the ram | 14:18 |
kosamara | But ideally we would like to ensure the firmware's state is valid too. | 14:19 |
shaohe_feng_ | only this one method to clear up? | 14:19 |
kosamara | Sorry, I'm out of context. This method is from where? | 14:19 |
shaohe_feng_ | zero the ram | 14:19 |
kosamara | Yes for the ram. | 14:20 |
shaohe_feng_ | OK, got it. | 14:20 |
kosamara | But we would also like to check the firmware's state. | 14:20 |
kosamara | And at least ensure that it hasn't been tampered with. | 14:20 |
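The cleanup workflow kosamara describes above — hold a released GPU out of the pool, zero its memory from a service VM, and verify the firmware hasn't been tampered with before reuse — could be sketched roughly as below. All names here (`GpuCleaner`, `scrub_fn`, the state names) are illustrative, not Cyborg API; the real scrub step would be device-specific (e.g. attaching the GPU to a service VM and zeroing VRAM via the vendor driver).

```python
# Hypothetical sketch of a post-deallocation GPU cleanup flow:
# ALLOCATED -> CLEANING (scrub memory, check firmware) -> AVAILABLE,
# or QUARANTINED if the firmware digest doesn't match a known-good one.
import enum
import hashlib


class DeviceState(enum.Enum):
    ALLOCATED = "allocated"
    CLEANING = "cleaning"
    AVAILABLE = "available"
    QUARANTINED = "quarantined"


class GpuCleaner:
    def __init__(self, scrub_fn, read_firmware_fn, known_good_digest):
        # scrub_fn stands in for the device-specific memory zeroing
        # (performed by a service VM or a host-side stub driver).
        self._scrub = scrub_fn
        self._read_firmware = read_firmware_fn
        self._good_digest = known_good_digest
        self.state = DeviceState.ALLOCATED

    def release(self):
        """Called when the tenant VM detaches the GPU."""
        self.state = DeviceState.CLEANING
        self._scrub()
        digest = hashlib.sha256(self._read_firmware()).hexdigest()
        if digest != self._good_digest:
            # Possible tampering: keep the device out of the pool.
            self.state = DeviceState.QUARANTINED
        else:
            self.state = DeviceState.AVAILABLE
        return self.state
```

A quarantine state addresses kosamara's second concern (firmware tampering) separately from the first (memory leaks): a device that fails the firmware check never re-enters the schedulable pool.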
shaohe_feng_ | so no ways to power off the GPU device, and re-power it? | 14:21 |
zhipeng | i think this is doable with cyborg | 14:21 |
kosamara | Not that I know of, apart from rebooting the host. | 14:21 |
zhipeng | shaohe_feng_ that will mess up the state ? | 14:21 |
Sundar | The service VM needs to run in every compute node, and it needs to have all the right drivers for various GPU devices. We need to see how practical that is. | 14:21 |
Li_Liu | we might need to get some help from Nvidia | 14:22 |
zhipeng | let's say we have the NVIDIA driver support | 14:22 |
Sundar | In a vGPU scenario, how do you power off the device without affecting other users? | 14:22 |
zhipeng | for cyborg | 14:22 |
kosamara | Sundar: unless vfio is capable of doing these basic operations? | 14:22 |
zhipeng | do we still need the service VM ? | 14:23 |
kosamara | I haven't explored practically nvidia's vGPU scenario. The host is supposed to be able to operate on the GPU in that case. | 14:24 |
Sundar | kosamara: Not sure I understand. vfio is a generic module for allowing device access. How would it know about specific nvidia devices? | 14:24 |
shaohe_feng_ | kosamara: do you use vfio pci passthrough or pci-stub passthrough? | 14:24 |
kosamara | zhipeng: vfio-pci is the host stub driver when the gpu is passed through. If we could do something through that, then we wouldn't need the service VM. | 14:25 |
kosamara | shaohe_feng_: vfio-pci | 14:25 |
zhipeng | kosamara good to know | 14:25 |
kosamara | Sundar: yes, I'm just making a hypothesis. I expect it can't, but perhaps zeroing-out the ram is a general enough operation and it can do it...? | 14:26 |
*** Vipparthy has joined #openstack-cyborg | 14:27 | |
Vipparthy | hi | 14:27 |
shaohe_feng_ | kosamara: is zeroing-out the ram time consuming? | 14:27 |
Sundar | kosamara: I am not a Nvidia expert. But presumably, to access the RAM, one would need to look at the PCI BAR space and find the base address or something? If so, that would be device-specific | 14:28 |
kosamara | Sundar: thanks for the input. I can research the feasibility of this option this week. | 14:29 |
zhipeng | kosamara I will also talk to the NVIDIA OpenStack team about it | 14:30 |
zhipeng | see if we could come out with something for Rocky :) | 14:30 |
kosamara | shaohe_feng_: I don't have a number right now. It should depend on the manner: if it happens on the device without pci transports it should be quite fast. | 14:30 |
kosamara | zhipeng thanks :) | 14:31 |
zhipeng | kosamara is there anything like an architecture diagram for the use case ? | 14:32 |
kosamara | Not really. | 14:32 |
zhipeng | okey :) | 14:32 |
Sundar | kosamara: you also mentioned quotas on GPUs? | 14:33 |
kosamara | What exactly do you mean by architecture diagram? | 14:33 |
Li_Liu | this rings a bell to me about the driver apis | 14:33 |
zhipeng | kosamara like the overall setup | 14:33 |
Li_Liu | currently we have report() and program() for the vendor driver api | 14:33 |
kosamara | Yes. We currently implement quotas indirectly. We only allow GPU flavors on specific projects and quota them by cpu. | 14:33 |
Li_Liu | we might want to consider a reset() api for the drivers | 14:33 |
zhipeng | Li_Liu makes sense | 14:34 |
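The reset() hook Li_Liu proposes alongside the existing report() and program() vendor-driver methods could look roughly like the sketch below. The base class and the fake GPU driver are illustrative only; the actual Cyborg driver interface may differ.

```python
# Hypothetical vendor-driver interface with the proposed reset() added
# next to report()/program(). FakeGpuDriver is a stand-in for testing.
import abc


class GenericDriver(abc.ABC):
    @abc.abstractmethod
    def report(self):
        """Return discovered accelerator resources on this host."""

    @abc.abstractmethod
    def program(self, device, image):
        """Load a bitstream/firmware image onto the device (FPGA case)."""

    @abc.abstractmethod
    def reset(self, device):
        """Return the device to a clean state after tenant detach."""


class FakeGpuDriver(GenericDriver):
    def __init__(self):
        self._dirty = {"gpu0"}  # devices still holding tenant data

    def report(self):
        return [{"name": "gpu0", "type": "GPU",
                 "clean": "gpu0" not in self._dirty}]

    def program(self, device, image):
        raise NotImplementedError("GPUs are not programmable like FPGAs")

    def reset(self, device):
        # For a GPU this would trigger the memory scrub / firmware check;
        # for an FPGA it might reload a default bitstream.
        self._dirty.discard(device)
```

Keeping reset() in the driver API lets each vendor define what "clean" means for its hardware, which matters since (per the discussion above) nvidia-smi's device reset alone is not sufficient.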
zhipeng | kosamara we will also have quota support for Rocky | 14:34 |
shaohe_feng_ | +1 | 14:34 |
kosamara | Good to know. | 14:35 |
zhipeng | okey folks let's move on to the next topic | 14:35 |
zhipeng | thx again kosamara | 14:35 |
kosamara | thanks! | 14:35 |
shaohe_feng_ | kosamara: so do you not support GPU quota for specific projects? | 14:35 |
zhipeng | #topic subteam lead report | 14:35 |
*** openstack changes topic to "subteam lead report (Meeting topic: openstack-cyborg)" | 14:35 | |
kosamara | No, we do it indirectly, in the way I mentioned. | 14:36 |
zhipeng | shaohe_feng_ scrow up :) | 14:36 |
zhipeng | scroll | 14:36 |
zhipeng | okey yumeng could you introduce the progress on your side ? | 14:37 |
zhipeng | Yumeng__ | 14:37 |
Yumeng__ | Last week I set up a repository for cyborg-specs | 14:39 |
Yumeng__ | some patches are still waiting for review from the infra team | 14:40 |
Yumeng__ | Hope it could be merged ASAP | 14:40 |
Sundar | Are we not using the git repo's doc/specs/ area anymore? | 14:40 |
zhipeng | Sundar we are migrating it out :) | 14:41 |
Sundar | Maybe I am missing the context. What is this repository for specs? | 14:41 |
zhipeng | but it is fine now to keep submit to that folder | 14:41 |
zhipeng | we will migrate all the approved specs after MS1 to cyborg-specs | 14:41 |
Sundar | Howard, could you explain why we are doing that? | 14:42 |
zhipeng | It would be better for the documentation when we do release | 14:42 |
zhipeng | all the core projects are doing it | 14:42 |
zhipeng | i think yumeng also add the gate check on docs for cyborg-specs | 14:43 |
crushil | Sundar All the core projects split out the specs and the main project. So, it makes sense to follow suit | 14:44 |
zhipeng | yes exactly | 14:44 |
Sundar | IIUC, in the git repo, approved specs will be doc/specs/<release>/approved, but in the new repo, all release specs will be in one place. Is that right? | 14:44 |
zhipeng | it will still follow a similar directory structure | 14:45 |
Sundar | OK, so we are just separating code from docs | 14:45 |
zhipeng | yes | 14:45 |
zhipeng | from specs to be preceise | 14:45 |
zhipeng | precise | 14:45 |
Sundar | Got it, thanks :) | 14:45 |
zhipeng | since general documentation is still in cyborg repo, if I understand correctly | 14:46 |
Li_Liu | for now we still check in the docs to the code repo right? | 14:46 |
zhipeng | Li_Liu yes, nothing changes | 14:46 |
Li_Liu | ok | 14:46 |
zhipeng | thx to Yumeng__ for the quick progress | 14:46 |
Sundar | In future releases, would we check specs into code repo, and have it be migrated after approval? | 14:46 |
Yumeng__ | zhipeng: :) | 14:47 |
zhipeng | in the future we will just submit the spec patch to cyborg-specs | 14:47 |
Sundar | ok | 14:47 |
zhipeng | shaohe_feng_ any update on the python-cyborgclient ? | 14:47 |
shaohe_feng_ | the spec and code will be separated | 14:48 |
shaohe_feng_ | zhipeng: jinghan has some personal stuff these days. So no more update. | 14:48 |
shaohe_feng_ | zhipeng: I will help him on it. | 14:48 |
zhipeng | thx :) was just gonna mention this | 14:49 |
zhipeng | plz work with him | 14:49 |
shaohe_feng_ | hopefully we will make progress next week. | 14:49 |
shaohe_feng_ | OK | 14:49 |
zhipeng | ok thx | 14:49 |
zhipeng | I will work with zhuli on the os-acc | 14:49 |
zhipeng | that one will most likely involve nova team discussion | 14:50 |
Sundar | Yes. I will be happy to work with zhuli if he needs any help | 14:50 |
zhipeng | Sundar gr8t :) | 14:51 |
zhipeng | #topic rocky spec/patch discussion | 14:51 |
*** openstack changes topic to "rocky spec/patch discussion (Meeting topic: openstack-cyborg)" | 14:51 | |
zhipeng | #link https://review.openstack.org/#/q/status:open+project:openstack/cyborg | 14:51 |
zhipeng | first up, Sundar's spec patch | 14:52 |
shaohe_feng_ | Sundar: good work | 14:53 |
Sundar | Shaohe, thanks :) | 14:53 |
shaohe_feng_ | but I have a question: why do the nova developers think the accelerator weigher call causes performance loss? | 14:53 |
zhipeng | #link https://review.openstack.org/#/c/554717/ | 14:53 |
Sundar | I think the assumption was that the weigher will call into Cyborg REST API for each host | 14:53 |
Sundar | If the weigher is in Nova tree, that is true | 14:54 |
Sundar | But, if Cyborg keeps it, we have other options | 14:54 |
shaohe_feng_ | Sundar: why for each host? | 14:54 |
Sundar | The typical filter today operates per host | 14:54 |
shaohe_feng_ | Sundar: I have discussed it before. | 14:54 |
shaohe_feng_ | Sundar: the cyborg API will run on controller node. | 14:55 |
shaohe_feng_ | Sundar: we only call the api in controller node. | 14:55 |
shaohe_feng_ | just one api call is OK. | 14:55 |
shaohe_feng_ | for example, the scheduler filter choose the suitable hosts | 14:56 |
shaohe_feng_ | and the scheduler weigher just call a API to query the accelerator infos of these hosts | 14:56 |
shaohe_feng_ | zhipeng: Li_Liu: right? | 14:57 |
zhipeng | that is still , per host | 14:57 |
Li_Liu | you mean the weigher is on the Cyborg controller side? | 14:57 |
shaohe_feng_ | zhipeng: no, we get list for filter API | 14:57 |
Sundar | Shaohe, yes, we could override the BaseWeigher and handle multiple hosts in one call. That call could invoke Cyborg REST API. | 14:58 |
shaohe_feng_ | for example: GET /cyborg/v1/accelerators?hosts=cyborg-1,cyborg-2&type=fpga | 14:58 |
Sundar | To me, it is not clear what the performance hit would be. | 14:58 |
Sundar | I suspect any performance hit would not be noticeable until we get to some scale | 14:59 |
Li_Liu | Does this involve the 2-stages scheduling problem we were trying to avoid? | 14:59 |
Sundar | There is no 2-stage scheduling here: the proposed filter/weigher is a typical one, which just filters hosts based on function calls. | 15:00 |
shaohe_feng_ | Sundar: yes, the scheduler has to call placement several times; is there a performance issue there? | 15:00 |
zhipeng | Sundar I think Li_Liu meant for weigher in Nova | 15:00 |
zhipeng | shaohe_feng_ it is not the same thing | 15:00 |
shaohe_feng_ | they are both http request. | 15:01 |
zhipeng | anyways this has been discussed in extent with Nova team and let's stay with the conclusion | 15:01 |
Li_Liu | ok | 15:01 |
zhipeng | shaohe_feng_ we could discuss offline more with Alex | 15:01 |
shaohe_feng_ | OK | 15:01 |
zhipeng | but let's not dwell on it | 15:01 |
Sundar | Maybe I misunderstood :) We are proposing a weigher maintained in Cyborg tree, which the operator will configure in nova.conf. Is that a concern? | 15:01 |
shaohe_feng_ | a weigher maintained in Cyborg tree still needs one cyborg api request, so is there also a performance issue? | 15:02 |
zhipeng | Sundar I don't think that would be a concern | 15:03 |
Sundar | shaohe: I personally don't think so, but we'll check the data to assure everybody | 15:03 |
shaohe_feng_ | zhipeng: Li_Liu: do you think a weigher maintained in Cyborg tree is a good idea? | 15:03 |
zhipeng | yes, at the moment | 15:03 |
shaohe_feng_ | Sundar: you still need to tell cyborg which hosts need to be weighed | 15:04 |
shaohe_feng_ | on cyborg api call. | 15:04 |
Sundar | shaohe: This weigher is querying Cyborg DB. It is better to keep it in Cyborg | 15:04 |
*** crushil has quit IRC | 15:04 | |
Li_Liu | I agree with zhipeng | 15:04 |
zhipeng | the weigher will just talk to the conductor | 15:04 |
zhipeng | it is not blocking nova operations | 15:05 |
*** Vipparthy has quit IRC | 15:05 | |
zhipeng | that is the point | 15:05 |
shaohe_feng_ | anyway, the api call will talk to the conductor to query the Cyborg DB | 15:05 |
Sundar | Shaohe, yes. The weigher gets a list of hosts. We could either introduce a new Cyborg API for that, or just have the weigher query the db directly | 15:06 |
shaohe_feng_ | no difference. | 15:06 |
zhipeng | shaohe_feng_ let's leave it offline | 15:06 |
shaohe_feng_ | zhipeng: OK. | 15:06 |
zhipeng | Sundar thx for the spec :) | 15:06 |
zhipeng | We will definitely review it more | 15:06 |
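The batched weigher shape Sundar and shaohe_feng_ converge on above — one Cyborg call for the whole host list (e.g. GET /cyborg/v1/accelerators?hosts=cyborg-1,cyborg-2&type=fpga) rather than one REST call per host — might look roughly like this. The function below is a sketch: a real implementation would subclass nova's BaseWeigher and use an actual Cyborg REST client, both of which are stubbed out here.

```python
# Hypothetical batched weigher: weigh every candidate host from a single
# Cyborg query instead of issuing one request per host.
def batched_accelerator_weigher(hosts, query_accelerators, accel_type="fpga"):
    """Return {host: weight in [0, 1]} using one batched Cyborg query.

    query_accelerators(hosts, accel_type) stands in for the single REST
    call (GET /cyborg/v1/accelerators?hosts=...&type=...); it returns a
    mapping of host name -> free accelerator count.
    """
    free = query_accelerators(hosts, accel_type)
    if not free:
        return {h: 0.0 for h in hosts}
    top = max(free.values()) or 1
    # Normalize so hosts with more free accelerators weigh higher;
    # hosts absent from the response get weight 0.
    return {h: free.get(h, 0) / top for h in hosts}
```

Example use with a stubbed client: for `{"node1": 4, "node2": 1}` and candidates `["node1", "node2", "node3"]`, node1 weighs 1.0, node2 0.25, and node3 0.0. Whatever the measured overhead turns out to be, this keeps the per-scheduling-cycle cost at one Cyborg round trip.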
zhipeng | next up, Li Liu's patch | 15:07 |
zhipeng | #info Implemented the Objects and APIs for vf/pf | 15:08 |
zhipeng | #link https://review.openstack.org/552734 | 15:08 |
Sundar | Sorry, I need to leave for my next call. :( Will catch up from minutes | 15:08 |
Li_Liu | later | 15:09 |
zhipeng | Sundar no problem | 15:09 |
zhipeng | Sundar we need another discussion on k8s actions :) | 15:09 |
zhipeng | Li_Liu any update or comment on your patch ? | 15:10 |
zhipeng | things we need to be aware of ? | 15:10 |
Li_Liu | not really. I think shaohe needs to change some code in his resource tracker to adopt the change | 15:10 |
Li_Liu | other than that. one thing left over is to utilize the attribute table in deployables. This is still a missing piece | 15:11 |
shaohe_feng_ | Li_Liu: thanks for reminder. | 15:11 |
Li_Liu | I will keep working on that as well as 2 other specs | 15:12 |
Li_Liu | shaohe_feng_ np. let me know if you need any help on using my pf/vfs | 15:12 |
shaohe_feng_ | Li_Liu: 2 other specs include image management? | 15:12 |
shaohe_feng_ | Li_Liu: ok, thanks. | 15:13 |
Li_Liu | programability and image metadata standardization | 15:13 |
zhipeng | yes , big task on your shoulder :) | 15:13 |
Li_Liu | :) | 15:14 |
zhipeng | okey we've gone through our agenda list today | 15:15 |
zhipeng | I think we can end the meeting now :) | 15:15 |
zhipeng | and talk to you guys next week | 15:16 |
kosamara | bye | 15:18 |
Yumeng__ | bye | 15:19 |
zhipeng | #endmeeting | 15:19 |
*** openstack changes topic to "OpenStack Cyborg Project Discussion" | 15:19 | |
openstack | Meeting ended Wed Mar 21 15:19:38 2018 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 15:19 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.html | 15:19 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.txt | 15:19 |
openstack | Log: http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.log.html | 15:19 |
*** helloway has quit IRC | 15:19 | |
*** zhipeng has quit IRC | 15:26 | |
*** Sundar has quit IRC | 15:53 | |
*** masber has quit IRC | 16:27 | |
*** Vipparthy has joined #openstack-cyborg | 16:38 | |
*** Vipparthy has quit IRC | 16:38 | |
*** Helloway has joined #openstack-cyborg | 17:05 | |
*** Helloway has quit IRC | 17:11 | |
*** circ-user-rwgGd has joined #openstack-cyborg | 17:12 | |
*** circ-user-rwgGd has quit IRC | 17:12 | |
*** Sundar has joined #openstack-cyborg | 17:13 | |
*** Sundar has quit IRC | 17:13 | |
*** Yumeng__ has quit IRC | 17:29 | |
*** tim__ has joined #openstack-cyborg | 17:40 | |
*** tim__ has quit IRC | 17:40 | |
*** shaohe_feng_ has quit IRC | 20:28 | |
*** guhcampos has joined #openstack-cyborg | 21:17 | |
*** guhcampos has quit IRC | 22:30 | |
*** masber has joined #openstack-cyborg | 23:04 |