Wednesday, 2018-03-21

*** Yumeng has joined #openstack-cyborg  01:59
<openstackgerrit> YumengBao proposed openstack/cyborg-specs master: Create an initial cyborg-specs repository using cookiecutter  https://review.openstack.org/554766  03:07
*** jaypipes has quit IRC  04:05
*** jaypipes has joined #openstack-cyborg  04:06
*** masuberu has quit IRC  06:04
*** masber has joined #openstack-cyborg  06:31
*** masber has quit IRC  07:05
*** masber has joined #openstack-cyborg  07:06
*** masber has quit IRC  08:40
*** masber has joined #openstack-cyborg  09:20
*** Yumeng has quit IRC  11:13
*** edleafe has quit IRC  13:29
*** edleafe has joined #openstack-cyborg  13:29
*** Yumeng__ has joined #openstack-cyborg  13:39
*** shaohe_feng_ has joined #openstack-cyborg  13:43
*** zhipeng has joined #openstack-cyborg  13:57
*** Sundar has joined #openstack-cyborg  13:58
<zhipeng> #startmeeting openstack-cyborg  14:00
<openstack> Meeting started Wed Mar 21 14:00:17 2018 UTC and is due to finish in 60 minutes.  The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot.  14:00
<openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.  14:00
*** openstack changes topic to " (Meeting topic: openstack-cyborg)"  14:00
<openstack> The meeting name has been set to 'openstack_cyborg'  14:00
*** crushil has joined #openstack-cyborg  14:00
<zhipeng> #topic Roll Call  14:00
*** openstack changes topic to "Roll Call (Meeting topic: openstack-cyborg)"  14:00
<zhipeng> #info Howard  14:00
<shaohe_feng_> hi zhipeng  14:01
<kosamara> hi!  14:01
<zhipeng> hi guys  14:01
<crushil> \o  14:01
<zhipeng> please use #info to record your names  14:01
<kosamara> #info Konstantinos  14:01
<Sundar> #info Sundar  14:01
<crushil> #info Rushil  14:02
<shaohe_feng_> #info shaohe  14:02
<Sundar> Do we have a Zoom link?  14:05
<zhipeng> let's wait for a few more minutes in case more people join  14:05
<zhipeng> Sundar no we only have irc today  14:05
<Sundar> Sure, Zhipeng. Thanks  14:05
<Yumeng__> #info Yumeng__  14:06
<zhipeng> okay let's start  14:09
<zhipeng> #topic CERN GPU use case introduction  14:09
*** openstack changes topic to "CERN GPU use case introduction (Meeting topic: openstack-cyborg)"  14:09
*** Li_Liu has joined #openstack-cyborg  14:10
<zhipeng> kosamara plz take us away  14:10
<kosamara> I joined you recently, so let me remind you that I'm a technical student at CERN, integrating GPUs into our OpenStack deployment.  14:10
<kosamara> Our use case is computation only at this point.  14:10
<kosamara> We have implemented and are currently testing a nova-only PCI passthrough.  14:11
<kosamara> We intend to also explore vGPUs, but they don't seem to fit our use case very well: licensing costs and limited CUDA support (both in NVIDIA's case).  14:12
<kosamara> At the moment the big issues are enforcing quotas on GPUs and security concerns.  14:12
<kosamara> 1. Subsequent users can potentially access data left in GPU memory.  14:13
*** helloway has joined #openstack-cyborg  14:13
<kosamara> 2. Low-level access means they could change the firmware or even cause the host to restart.  14:13
<kosamara> We are looking at a way to mitigate at least the first issue, by performing some kind of cleanup on the GPU after use.  14:14
<kosamara> But in our current workflow this would require a change in nova.  14:14
<kosamara> It looks like something that could be done in Cyborg?  14:15
<zhipeng> i think so, i remember we had a similar convo regarding cleanup on FPGAs at the PTG  14:15
<Li_Liu> are you trying to force a "reset" after usage of the device?  14:15
<kosamara> Reset of the device?  14:16
<kosamara> According to a research article, in NVIDIA's case performing a device reset through nvidia-smi is not enough to prevent the data leaks.  14:16
<Li_Liu> something like that. like Zhipeng said, "clean up"  14:16
<kosamara> - article: https://www.semanticscholar.org/paper/Confidentiality-Issues-on-a-GPU-in-a-Virtualized-E-Maurice-Neumann/693a8b56a9e961052702ff088131eb553e88d9ae  14:16
<kosamara> The additional complexity in PCI passthrough is that the host can't access the GPUs (no drivers)  14:17
<shaohe_feng_> so once the GPU devices are detached, cyborg should do the cleanup at once.  14:17
<Sundar> What kind of cleanup do you have in mind? Zero out the RAM?  14:17
<kosamara> My thinking is to put the relevant resources in a special state after deallocation from their previous VM and use a service VM to perform the cleanup ops themselves.  14:18
<Li_Liu> memset() to zero for all?  14:18
<kosamara> Yes, zero the RAM  14:18
<kosamara> But ideally we would like to ensure the firmware's state is valid too.  14:19
<shaohe_feng_> is this the only method to clean up?  14:19
<kosamara> Sorry, I'm out of context. This method is from where?  14:19
<shaohe_feng_> zero the RAM  14:19
<kosamara> Yes, for the RAM.  14:20
<shaohe_feng_> OK, got it.  14:20
<kosamara> But we would also like to check the firmware's state.  14:20
<kosamara> And at least ensure that it hasn't been tampered with.  14:20
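
The cleanup kosamara sketches above (a service VM that receives the freed GPU and wipes its memory) could be prototyped along these lines. This is a minimal sketch, assuming the service VM has the NVIDIA driver and pycuda available; the chunked allocate-and-zero loop is an illustration, not the team's agreed design, and it is best-effort in that it only scrubs memory the allocator hands back:

```python
# Best-effort VRAM scrub from inside a service VM that has the freed
# GPU passed through to it. Assumes the NVIDIA driver and pycuda.
import pycuda.autoinit            # initializes a CUDA context on GPU 0
import pycuda.driver as cuda

def scrub_gpu(chunk=256 * 1024 * 1024):
    """Claim free device memory in chunks and overwrite it with zeros."""
    held = []                            # keep chunks allocated so each
    free, _total = cuda.mem_get_info()   # iteration claims fresh memory
    while free >= chunk:
        buf = cuda.mem_alloc(chunk)
        cuda.memset_d8(buf, 0, chunk)    # zero the whole chunk
        held.append(buf)
        free, _total = cuda.mem_get_info()
    # chunks are released when 'held' goes out of scope

if __name__ == '__main__':
    scrub_gpu()
```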
<shaohe_feng_> so there is no way to power off the GPU device and re-power it?  14:21
<zhipeng> i think this is doable with cyborg  14:21
<kosamara> Not that I know of, apart from rebooting the host.  14:21
<zhipeng> shaohe_feng_ that will mess up the state?  14:21
<Sundar> The service VM needs to run on every compute node, and it needs to have all the right drivers for the various GPU devices. We need to see how practical that is.  14:21
<Li_Liu> we might need to get some help from Nvidia  14:22
<zhipeng> let's say we have the NVIDIA driver support  14:22
<Sundar> In a vGPU scenario, how do you power off the device without affecting other users?  14:22
<zhipeng> for cyborg  14:22
<kosamara> Sundar: unless vfio is capable of doing these basic operations?  14:22
<zhipeng> do we still need the service VM?  14:23
<kosamara> I haven't explored NVIDIA's vGPU scenario in practice. The host is supposed to be able to operate on the GPU in that case.  14:24
<Sundar> kosamara: Not sure I understand. vfio is a generic module for allowing device access. How would it know about specific NVIDIA devices?  14:24
<shaohe_feng_> kosamara: do you use vfio pci passthrough or pci-stub passthrough?  14:24
<kosamara> zhipeng: vfio-pci is the host stub driver when the GPU is passed through. If we could do something through that, then we wouldn't need the service VM.  14:25
<kosamara> shaohe_feng_: vfio-pci  14:25
<zhipeng> kosamara good to know  14:25
<kosamara> Sundar: yes, I'm just making a hypothesis. I expect it can't, but perhaps zeroing out the RAM is a general enough operation that it could do it...?  14:26
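
As an aside on the vfio-pci point: which host driver a GPU is bound to is visible in sysfs, so a hypothetical cleanup or inventory agent could confirm the passthrough state with a few lines like these (the PCI address is a placeholder):

```python
# Report the host driver currently bound to a PCI device.
import os

ADDR = '0000:3b:00.0'   # placeholder; substitute the GPU's PCI address
link = '/sys/bus/pci/devices/%s/driver' % ADDR
if os.path.islink(link):
    print(os.path.basename(os.path.realpath(link)))   # e.g. 'vfio-pci'
else:
    print('no driver bound')
```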
*** Vipparthy has joined #openstack-cyborg  14:27
<Vipparthy> hi  14:27
<shaohe_feng_> kosamara: is zeroing out the RAM time consuming?  14:27
<Sundar> kosamara: I am not an Nvidia expert. But presumably, to access the RAM, one would need to look at the PCI BAR space and find the base address or something? If so, that would be device-specific  14:28
<kosamara> Sundar: thanks for the input. I can research the feasibility of this option this week.  14:29
<zhipeng> kosamara I will also talk to the NVIDIA OpenStack team about it  14:30
<zhipeng> see if we could come up with something for Rocky :)  14:30
<kosamara> shaohe_feng_: I don't have a number right now. It should depend on the method: if it happens on the device without PCI transfers it should be quite fast.  14:30
<kosamara> zhipeng thanks :)  14:31
<zhipeng> kosamara is there anything like an architecture diagram for the use case?  14:32
<kosamara> Not really.  14:32
<zhipeng> okay :)  14:32
<Sundar> kosamara: you also mentioned quotas on GPUs?  14:33
<kosamara> What exactly do you mean by architecture diagram?  14:33
<Li_Liu> this rings a bell to me about the driver apis  14:33
<zhipeng> kosamara like the overall setup  14:33
<Li_Liu> currently we have report() and program() for the vendor driver api  14:33
<kosamara> Yes. We currently implement quotas indirectly. We only allow GPU flavors on specific projects and quota them by CPU.  14:33
<Li_Liu> we might want to consider a reset() api for the drivers  14:33
<zhipeng> Li_Liu makes sense  14:34
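
Li_Liu's suggestion amounts to one more method on the vendor driver interface. A minimal sketch, assuming the report()/program() names mentioned above; the signatures and the abstract-base-class shape are illustrative, not Cyborg's actual API:

```python
# Illustrative vendor driver interface with the proposed reset() hook.
import abc

class AcceleratorDriver(abc.ABC):

    @abc.abstractmethod
    def report(self):
        """Enumerate the accelerator devices this driver manages."""

    @abc.abstractmethod
    def program(self, device, image):
        """Load a bitstream or firmware image onto the device."""

    @abc.abstractmethod
    def reset(self, device):
        """Return the device to a clean state after a VM releases it,
        e.g. zero device memory and restore known-good firmware."""
```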
<zhipeng> kosamara we will also have quota support for Rocky  14:34
<shaohe_feng_> +1  14:34
<kosamara> Good to know.  14:35
<zhipeng> okay folks let's move on to the next topic  14:35
<zhipeng> thx again kosamara  14:35
<kosamara> thanks!  14:35
<shaohe_feng_> kosamara: so do you not support GPU quota for specific projects?  14:35
<zhipeng> #topic subteam lead report  14:35
*** openstack changes topic to "subteam lead report (Meeting topic: openstack-cyborg)"  14:35
<kosamara> No, we do it indirectly, in the way I mentioned.  14:36
<zhipeng> shaohe_feng_ scroll up :)  14:36
<zhipeng> okay yumeng could you introduce the progress on your side?  14:37
<zhipeng> Yumeng__  14:37
<Yumeng__> Last week I set up a repository for cyborg-specs  14:39
<Yumeng__> some patches are still waiting for review from the infra team  14:40
<Yumeng__> Hope they can be merged ASAP  14:40
<Sundar> Are we not using the git repo's doc/specs/ area anymore?  14:40
<zhipeng> Sundar we are migrating it out :)  14:41
<Sundar> Maybe I am missing the context. What is this repository for specs?  14:41
<zhipeng> but it is fine for now to keep submitting to that folder  14:41
<zhipeng> we will migrate all the approved specs after MS1 to cyborg-specs  14:41
<Sundar> Howard, could you explain why we are doing that?  14:42
<zhipeng> It would be better for the documentation when we do a release  14:42
<zhipeng> all the core projects are doing it  14:42
<zhipeng> i think yumeng also added the gate check on docs for cyborg-specs  14:43
<crushil> Sundar All the core projects split out the specs from the main project. So, it makes sense to follow suit  14:44
<zhipeng> yes exactly  14:44
<Sundar> IIUC, in the git repo, approved specs will be in doc/specs/<release>/approved, but in the new repo, all release specs will be in one place. Is that right?  14:44
<zhipeng> it will still follow a similar directory structure  14:45
<Sundar> OK, so we are just separating code from docs  14:45
<zhipeng> yes  14:45
<zhipeng> from specs, to be precise  14:45
<Sundar> Got it, thanks :)  14:45
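
For reference, the layout being described would presumably mirror the other projects' specs repositories, roughly like this (an assumed sketch, not the final structure):

```
cyborg-specs/
└── specs/
    └── rocky/
        ├── approved/
        │   └── some-spec.rst
        └── implemented/
```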
<zhipeng> since general documentation is still in the cyborg repo, if I understand correctly  14:46
<Li_Liu> for now we still check in the docs to the code repo, right?  14:46
<zhipeng> Li_Liu yes, nothing changes  14:46
<Li_Liu> ok  14:46
<zhipeng> thx to Yumeng__ for the quick progress  14:46
<Sundar> In future releases, would we check specs into the code repo and have them migrated after approval?  14:46
<Yumeng__> zhipeng: :)  14:47
<zhipeng> in the future we will just submit the spec patch to cyborg-specs  14:47
<Sundar> ok  14:47
<zhipeng> shaohe_feng_ any update on the python-cyborgclient?  14:47
<shaohe_feng_> the spec and code will be separated  14:48
<shaohe_feng_> zhipeng: jinghan has some personal stuff going on these days. So no more updates.  14:48
<shaohe_feng_> zhipeng: I will help him with it.  14:48
<zhipeng> thx :) was just gonna mention this  14:49
<zhipeng> plz work with him  14:49
<shaohe_feng_> hopefully we will make progress next week.  14:49
<shaohe_feng_> OK  14:49
<zhipeng> ok thx  14:49
<zhipeng> I will work with zhuli on the os-acc  14:49
<zhipeng> that one will most likely involve nova team discussion  14:50
<Sundar> Yes. I will be happy to work with zhuli if he needs any help  14:50
<zhipeng> Sundar great :)  14:51
<zhipeng> #topic rocky spec/patch discussion  14:51
*** openstack changes topic to "rocky spec/patch discussion (Meeting topic: openstack-cyborg)"  14:51
<zhipeng> #link https://review.openstack.org/#/q/status:open+project:openstack/cyborg  14:51
<zhipeng> first up, Sundar's spec patch  14:52
<shaohe_feng_> Sundar: good work  14:53
<Sundar> Shaohe, thanks :)  14:53
<shaohe_feng_> but I have a question. why do the nova developers think the accelerator weigher call causes a performance loss?  14:53
<zhipeng> #link https://review.openstack.org/#/c/554717/  14:53
<Sundar> I think the assumption was that the weigher will call into the Cyborg REST API for each host  14:53
<Sundar> If the weigher is in the Nova tree, that is true  14:54
<Sundar> But, if Cyborg keeps it, we have other options  14:54
<shaohe_feng_> Sundar: why for each host?  14:54
<Sundar> The typical filter today operates per host  14:54
<shaohe_feng_> Sundar: I have discussed it before.  14:54
<shaohe_feng_> Sundar: the cyborg API will run on the controller node.  14:55
<shaohe_feng_> Sundar: we only call the api on the controller node.  14:55
<shaohe_feng_> just one api call is OK.  14:55
<shaohe_feng_> for example, the scheduler filter chooses the suitable hosts  14:56
<shaohe_feng_> and the scheduler weigher just calls an API to query the accelerator infos of these hosts  14:56
<shaohe_feng_> zhipeng: Li_Liu: right?  14:57
<zhipeng> that is still, per host  14:57
<Li_Liu> you mean the weigher is on the Cyborg controller side?  14:57
<shaohe_feng_> zhipeng: no, we get a list for the filter API  14:57
<Sundar> Shaohe, yes, we could override the BaseWeigher and handle multiple hosts in one call. That call could invoke the Cyborg REST API.  14:58
<shaohe_feng_> for example: GET /cyborg/v1/accelerators?hosts=cyborg-1,cyborg-2&type=fpga  14:58
<Sundar> To me, it is not clear what the performance hit would be.  14:58
<Sundar> I suspect any performance hit would not be noticeable until we get to some scale  14:59
<Li_Liu> Does this involve the 2-stage scheduling problem we were trying to avoid?  14:59
<Sundar> There is no 2-stage scheduling here: the proposed filter/weigher is a typical one, which just filters hosts based on function calls.  15:00
<shaohe_feng_> Sundar: yes, the scheduler has to call placement several times; is there a performance issue there?  15:00
<zhipeng> Sundar I think Li_Liu meant for a weigher in Nova  15:00
<zhipeng> shaohe_feng_ it is not the same thing  15:00
<shaohe_feng_> they are both http requests.  15:01
<zhipeng> anyways this has been discussed at length with the Nova team, so let's stay with the conclusion  15:01
<Li_Liu> ok  15:01
<zhipeng> shaohe_feng_ we could discuss offline more with Alex  15:01
<shaohe_feng_> OK  15:01
<zhipeng> but let's not dwell on it  15:01
<Sundar> Maybe I misunderstood :) We are proposing a weigher maintained in the Cyborg tree, which the operator will configure in nova.conf. Is that a concern?  15:01
<shaohe_feng_> a weigher maintained in the Cyborg tree still needs one cyborg api request, so is there also a performance issue?  15:02
<zhipeng> Sundar I don't think that would be a concern  15:03
<Sundar> shaohe: I personally don't think so, but we'll check the data to assure everybody  15:03
<shaohe_feng_> zhipeng: Li_Liu: do you think a weigher maintained in the Cyborg tree is a good idea?  15:03
<zhipeng> yes, at the moment  15:03
<shaohe_feng_> Sundar: you still need to tell cyborg which hosts need to be weighed  15:04
<shaohe_feng_> in the cyborg api call.  15:04
<Sundar> shaohe: This weigher is querying the Cyborg DB. It is better to keep it in Cyborg  15:04
*** crushil has quit IRC  15:04
<Li_Liu> I agree with zhipeng  15:04
<zhipeng> the weigher will just talk to the conductor  15:04
<zhipeng> it is not blocking nova operations  15:05
*** Vipparthy has quit IRC  15:05
<zhipeng> that is the point  15:05
<shaohe_feng_> either way, the api call will talk to the conductor to query the Cyborg DB  15:05
<Sundar> Shaohe, yes. The weigher gets a list of hosts. We could either introduce a new Cyborg API for that, or just have the weigher query the db directly  15:06
<shaohe_feng_> no difference.  15:06
<zhipeng> shaohe_feng_ let's leave it offline  15:06
<shaohe_feng_> zhipeng: OK.  15:06
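
Putting this exchange together: the batched weigher Sundar describes would override BaseWeigher to make one Cyborg call for the whole host list, along the lines of shaohe_feng_'s example URL. A minimal sketch; the endpoint, response shape, and override details are assumptions, though nova's BaseWeigher class itself is real:

```python
# Sketch of a Cyborg-maintained weigher that queries accelerator counts
# for all candidate hosts in a single REST call.
import requests
from nova.weights import BaseWeigher

CYBORG_API = 'http://controller:6666'   # placeholder endpoint

class AcceleratorWeigher(BaseWeigher):

    def weigh_objects(self, weighed_obj_list, weight_properties):
        hosts = [w.obj.host for w in weighed_obj_list]
        resp = requests.get(
            CYBORG_API + '/cyborg/v1/accelerators',
            params={'hosts': ','.join(hosts), 'type': 'fpga'})
        free = resp.json()   # assumed shape: {"cyborg-1": 2, ...}
        return [float(free.get(w.obj.host, 0)) for w in weighed_obj_list]

    def _weigh_object(self, host_state, weight_properties):
        return 0.0           # unused: weigh_objects is overridden above

# Per Sundar's description, the operator would then list this class in
# nova.conf's scheduler weight_classes option.
```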
<zhipeng> Sundar thx for the spec :)  15:06
<zhipeng> We will definitely review it more  15:06
<zhipeng> next up, Li Liu's patch  15:07
<zhipeng> #info Implemented the Objects and APIs for vf/pf  15:08
<zhipeng> #link https://review.openstack.org/552734  15:08
<Sundar> Sorry, I need to leave for my next call. :( Will catch up from the minutes  15:08
<Li_Liu> later  15:09
<zhipeng> Sundar no problem  15:09
<zhipeng> Sundar we need another discussion on k8s actions :)  15:09
<zhipeng> Li_Liu any update or comment on your patch?  15:10
<zhipeng> things we need to be aware of?  15:10
<Li_Liu> not really. I think shaohe needs to change some code in his resource tracker to adopt the change  15:10
<Li_Liu> other than that, one thing left over is to utilize the attribute table in deployables. This is still a missing piece  15:11
<shaohe_feng_> Li_Liu: thanks for the reminder.  15:11
<Li_Liu> I will keep working on that as well as 2 other specs  15:12
<Li_Liu> shaohe_feng_ np. let me know if you need any help using my pf/vfs  15:12
<shaohe_feng_> Li_Liu: do the 2 other specs include image management?  15:12
<shaohe_feng_> Li_Liu: ok, thanks.  15:13
<Li_Liu> programmability and image metadata standardization  15:13
<zhipeng> yes, big task on your shoulders :)  15:13
<Li_Liu> :)  15:14
<zhipeng> okay we've gone through our agenda list today  15:15
<zhipeng> I think we can end the meeting now :)  15:15
<zhipeng> and talk to you guys next week  15:16
<kosamara> bye  15:18
<Yumeng__> bye  15:19
<zhipeng> #endmeeting  15:19
*** openstack changes topic to "OpenStack Cyborg Project Discussion"  15:19
<openstack> Meeting ended Wed Mar 21 15:19:38 2018 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)  15:19
<openstack> Minutes:        http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.html  15:19
<openstack> Minutes (text): http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.txt  15:19
<openstack> Log:            http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.log.html  15:19
*** helloway has quit IRC  15:19
*** zhipeng has quit IRC  15:26
*** Sundar has quit IRC  15:53
*** masber has quit IRC  16:27
*** Vipparthy has joined #openstack-cyborg  16:38
*** Vipparthy has quit IRC  16:38
*** Helloway has joined #openstack-cyborg  17:05
*** Helloway has quit IRC  17:11
*** circ-user-rwgGd has joined #openstack-cyborg  17:12
*** circ-user-rwgGd has quit IRC  17:12
*** Sundar has joined #openstack-cyborg  17:13
*** Sundar has quit IRC  17:13
*** Yumeng__ has quit IRC  17:29
*** tim__ has joined #openstack-cyborg  17:40
*** tim__ has quit IRC  17:40
*** shaohe_feng_ has quit IRC  20:28
*** guhcampos has joined #openstack-cyborg  21:17
*** guhcampos has quit IRC  22:30
*** masber has joined #openstack-cyborg  23:04
