Wednesday, 2018-03-21

*** Yumeng has joined #openstack-cyborg  01:59
<openstackgerrit> YumengBao proposed openstack/cyborg-specs master: Create an initial cyborg-specs repository using cookiecutter  https://review.openstack.org/554766  03:07
*** jaypipes has quit IRC  04:05
*** jaypipes has joined #openstack-cyborg  04:06
*** masuberu has quit IRC  06:04
*** masber has joined #openstack-cyborg  06:31
*** masber has quit IRC  07:05
*** masber has joined #openstack-cyborg  07:06
*** masber has quit IRC  08:40
*** masber has joined #openstack-cyborg  09:20
*** Yumeng has quit IRC  11:13
*** edleafe has quit IRC  13:29
*** edleafe has joined #openstack-cyborg  13:29
*** Yumeng__ has joined #openstack-cyborg  13:39
*** shaohe_feng_ has joined #openstack-cyborg  13:43
*** zhipeng has joined #openstack-cyborg  13:57
*** Sundar has joined #openstack-cyborg  13:58
<zhipeng> #startmeeting openstack-cyborg  14:00
<openstack> Meeting started Wed Mar 21 14:00:17 2018 UTC and is due to finish in 60 minutes.  The chair is zhipeng. Information about MeetBot at http://wiki.debian.org/MeetBot.  14:00
<openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.  14:00
*** openstack changes topic to " (Meeting topic: openstack-cyborg)"  14:00
<openstack> The meeting name has been set to 'openstack_cyborg'  14:00
*** crushil has joined #openstack-cyborg  14:00
<zhipeng> #topic Roll Call  14:00
*** openstack changes topic to "Roll Call (Meeting topic: openstack-cyborg)"  14:00
<zhipeng> #info Howard  14:00
<shaohe_feng_> hi zhipeng  14:01
<kosamara> hi!  14:01
<zhipeng> hi guys  14:01
<crushil> \o  14:01
<zhipeng> please use #info to record your names  14:01
<kosamara> #info Konstantinos  14:01
<Sundar> #info Sundar  14:01
<crushil> #info Rushil  14:02
<shaohe_feng_> #info shaohe  14:02
<Sundar> Do we have a Zoom link?  14:05
<zhipeng> let's wait for a few more minutes in case more people join  14:05
<zhipeng> Sundar no we only have irc today  14:05
<Sundar> Sure, Zhipeng. Thanks  14:05
<Yumeng__> #info Yumeng__  14:06
<zhipeng> okay let's start  14:09
<zhipeng> #topic CERN GPU use case introduction  14:09
*** openstack changes topic to "CERN GPU use case introduction (Meeting topic: openstack-cyborg)"  14:09
*** Li_Liu has joined #openstack-cyborg  14:10
<zhipeng> kosamara plz take us away  14:10
<kosamara> I joined you recently, so let me remind you that I'm a technical student at CERN, integrating GPUs into our OpenStack deployment.  14:10
<kosamara> Our use case is computation only at this point.  14:10
<kosamara> We have implemented and are currently testing a nova-only PCI passthrough.  14:11
<kosamara> We intend to also explore vGPUs, but they don't seem to fit our use case very well: licensing costs and limited CUDA support (both in NVIDIA's case).  14:12
<kosamara> At the moment the big issues are enforcing quotas on GPUs and security concerns.  14:12
<kosamara> 1. Subsequent users can potentially access data left in GPU memory.  14:13
*** helloway has joined #openstack-cyborg  14:13
<kosamara> 2. Low-level access means they could change the firmware or even cause the host to restart.  14:13
<kosamara> We are looking at a way to mitigate at least the first issue, by performing some kind of cleanup on the GPU after use.  14:14
<kosamara> But in our current workflow this would require a change in nova.  14:14
<kosamara> It looks like something that could be done in Cyborg?  14:15
<zhipeng> i think so, i remember we had a similar convo regarding cleanup on FPGAs at the PTG  14:15
<Li_Liu> are you trying to force a "reset" after usage of the device?  14:15
<kosamara> Reset of the device?  14:16
<kosamara> According to a research article, in NVIDIA's case performing a device reset through nvidia-smi is not enough to prevent the data leaks.  14:16
<Li_Liu> something like that. like Zhipeng said, "clean up"  14:16
<kosamara> - article: https://www.semanticscholar.org/paper/Confidentiality-Issues-on-a-GPU-in-a-Virtualized-E-Maurice-Neumann/693a8b56a9e961052702ff088131eb553e88d9ae  14:16
<kosamara> The additional complexity in PCI passthrough is that the host can't access the GPUs (no drivers)  14:17
<shaohe_feng_> so once the GPU devices are detached, cyborg should do the cleanup at once.  14:17
<Sundar> What kind of cleanup do you have in mind? Zero out the RAM?  14:17
<kosamara> My thinking is to put the relevant resources in a special state after deallocation from their previous VM and use a service VM to perform the cleanup ops themselves.  14:18
<Li_Liu> memset() to zero for all?  14:18
<kosamara> Yes, zero the RAM  14:18
<kosamara> But ideally we would like to ensure the firmware's state is valid too.  14:19
<shaohe_feng_> is this the only method to clean up?  14:19
<kosamara> Sorry, I'm out of context. This method is from where?  14:19
<shaohe_feng_> zero the RAM  14:19
<kosamara> Yes, for the RAM.  14:20
<shaohe_feng_> OK, got it.  14:20
<kosamara> But we would also like to check the firmware's state.  14:20
<kosamara> And at least ensure that it hasn't been tampered with.  14:20
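
The cleanup kosamara sketches above (a service VM that receives the freed GPU and wipes its memory) could be prototyped along these lines. This is a minimal sketch, assuming the service VM has the NVIDIA driver and pycuda available; the chunked allocate-and-zero loop is an illustration, not the team's agreed design, and it is best-effort in that it only scrubs memory the allocator hands back:

```python
# Best-effort VRAM scrub from inside a service VM that has the freed
# GPU passed through to it. Assumes the NVIDIA driver and pycuda.
import pycuda.autoinit            # initializes a CUDA context on GPU 0
import pycuda.driver as cuda

def scrub_gpu(chunk=256 * 1024 * 1024):
    """Claim free device memory in chunks and overwrite it with zeros."""
    held = []                            # keep chunks allocated so each
    free, _total = cuda.mem_get_info()   # iteration claims fresh memory
    while free >= chunk:
        buf = cuda.mem_alloc(chunk)
        cuda.memset_d8(buf, 0, chunk)    # zero the whole chunk
        held.append(buf)
        free, _total = cuda.mem_get_info()
    # chunks are released when 'held' goes out of scope

if __name__ == '__main__':
    scrub_gpu()
```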
<shaohe_feng_> so there is no way to power off the GPU device and re-power it?  14:21
<zhipeng> i think this is doable with cyborg  14:21
<kosamara> Not that I know of, apart from rebooting the host.  14:21
<zhipeng> shaohe_feng_ that will mess up the state?  14:21
<Sundar> The service VM needs to run on every compute node, and it needs to have all the right drivers for the various GPU devices. We need to see how practical that is.  14:21
<Li_Liu> we might need to get some help from Nvidia  14:22
<zhipeng> let's say we have the NVIDIA driver support  14:22
<Sundar> In a vGPU scenario, how do you power off the device without affecting other users?  14:22
<zhipeng> for cyborg  14:22
<kosamara> Sundar: unless vfio is capable of doing these basic operations?  14:22
<zhipeng> do we still need the service VM?  14:23
<kosamara> I haven't explored NVIDIA's vGPU scenario in practice. The host is supposed to be able to operate on the GPU in that case.  14:24
<Sundar> kosamara: Not sure I understand. vfio is a generic module for allowing device access. How would it know about specific NVIDIA devices?  14:24
<shaohe_feng_> kosamara: do you use vfio pci passthrough or pci-stub passthrough?  14:24
<kosamara> zhipeng: vfio-pci is the host stub driver when the GPU is passed through. If we could do something through that, then we wouldn't need the service VM.  14:25
<kosamara> shaohe_feng_: vfio-pci  14:25
<zhipeng> kosamara good to know  14:25
<kosamara> Sundar: yes, I'm just making a hypothesis. I expect it can't, but perhaps zeroing out the RAM is a general enough operation that it could do it...?  14:26
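
As an aside on the vfio-pci point: which host driver a GPU is bound to is visible in sysfs, so a hypothetical cleanup or inventory agent could confirm the passthrough state with a few lines like these (the PCI address is a placeholder):

```python
# Report the host driver currently bound to a PCI device.
import os

ADDR = '0000:3b:00.0'   # placeholder; substitute the GPU's PCI address
link = '/sys/bus/pci/devices/%s/driver' % ADDR
if os.path.islink(link):
    print(os.path.basename(os.path.realpath(link)))   # e.g. 'vfio-pci'
else:
    print('no driver bound')
```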
*** Vipparthy has joined #openstack-cyborg  14:27
<Vipparthy> hi  14:27
<shaohe_feng_> kosamara: is zeroing out the RAM time consuming?  14:27
<Sundar> kosamara: I am not an Nvidia expert. But presumably, to access the RAM, one would need to look at the PCI BAR space and find the base address or something? If so, that would be device-specific  14:28
<kosamara> Sundar: thanks for the input. I can research the feasibility of this option this week.  14:29
<zhipeng> kosamara I will also talk to the NVIDIA OpenStack team about it  14:30
<zhipeng> see if we could come up with something for Rocky :)  14:30
<kosamara> shaohe_feng_: I don't have a number right now. It should depend on the method: if it happens on the device without PCI transfers it should be quite fast.  14:30
<kosamara> zhipeng thanks :)  14:31
<zhipeng> kosamara is there anything like an architecture diagram for the use case?  14:32
<kosamara> Not really.  14:32
<zhipeng> okay :)  14:32
<Sundar> kosamara: you also mentioned quotas on GPUs?  14:33
<kosamara> What exactly do you mean by architecture diagram?  14:33
<Li_Liu> this rings a bell to me about the driver apis  14:33
<zhipeng> kosamara like the overall setup  14:33
<Li_Liu> currently we have report() and program() for the vendor driver api  14:33
<kosamara> Yes. We currently implement quotas indirectly. We only allow GPU flavors on specific projects and quota them by CPU.  14:33
<Li_Liu> we might want to consider a reset() api for the drivers  14:33
<zhipeng> Li_Liu makes sense  14:34
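
Li_Liu's suggestion amounts to one more method on the vendor driver interface. A minimal sketch, assuming the report()/program() names mentioned above; the signatures and the abstract-base-class shape are illustrative, not Cyborg's actual API:

```python
# Illustrative vendor driver interface with the proposed reset() hook.
import abc

class AcceleratorDriver(abc.ABC):

    @abc.abstractmethod
    def report(self):
        """Enumerate the accelerator devices this driver manages."""

    @abc.abstractmethod
    def program(self, device, image):
        """Load a bitstream or firmware image onto the device."""

    @abc.abstractmethod
    def reset(self, device):
        """Return the device to a clean state after a VM releases it,
        e.g. zero device memory and restore known-good firmware."""
```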
<zhipeng> kosamara we will also have quota support for Rocky  14:34
<shaohe_feng_> +1  14:34
<kosamara> Good to know.  14:35
<zhipeng> okay folks let's move on to the next topic  14:35
<zhipeng> thx again kosamara  14:35
<kosamara> thanks!  14:35
<shaohe_feng_> kosamara: so do you not support GPU quota for specific projects?  14:35
<zhipeng> #topic subteam lead report  14:35
*** openstack changes topic to "subteam lead report (Meeting topic: openstack-cyborg)"  14:35
<kosamara> No, we do it indirectly, in the way I mentioned.  14:36
<zhipeng> shaohe_feng_ scroll up :)  14:36
<zhipeng> okay yumeng could you introduce the progress on your side?  14:37
<zhipeng> Yumeng__  14:37
<Yumeng__> Last week I set up a repository for cyborg-specs  14:39
<Yumeng__> some patches are still waiting for review from the infra team  14:40
<Yumeng__> Hope they can be merged ASAP  14:40
<Sundar> Are we not using the git repo's doc/specs/ area anymore?  14:40
<zhipeng> Sundar we are migrating it out :)  14:41
<Sundar> Maybe I am missing the context. What is this repository for specs?  14:41
<zhipeng> but it is fine for now to keep submitting to that folder  14:41
<zhipeng> we will migrate all the approved specs after MS1 to cyborg-specs  14:41
<Sundar> Howard, could you explain why we are doing that?  14:42
<zhipeng> It would be better for the documentation when we do a release  14:42
<zhipeng> all the core projects are doing it  14:42
<zhipeng> i think yumeng also added the gate check on docs for cyborg-specs  14:43
<crushil> Sundar All the core projects split out the specs from the main project. So, it makes sense to follow suit  14:44
<zhipeng> yes exactly  14:44
<Sundar> IIUC, in the git repo, approved specs will be in doc/specs/<release>/approved, but in the new repo, all release specs will be in one place. Is that right?  14:44
<zhipeng> it will still follow a similar directory structure  14:45
<Sundar> OK, so we are just separating code from docs  14:45
<zhipeng> yes  14:45
<zhipeng> from specs, to be precise  14:45
<Sundar> Got it, thanks :)  14:45
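
For reference, the layout being described would presumably mirror the other projects' specs repositories, roughly like this (an assumed sketch, not the final structure):

```
cyborg-specs/
└── specs/
    └── rocky/
        ├── approved/
        │   └── some-spec.rst
        └── implemented/
```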
<zhipeng> since general documentation is still in the cyborg repo, if I understand correctly  14:46
<Li_Liu> for now we still check in the docs to the code repo, right?  14:46
<zhipeng> Li_Liu yes, nothing changes  14:46
<Li_Liu> ok  14:46
<zhipeng> thx to Yumeng__ for the quick progress  14:46
<Sundar> In future releases, would we check specs into the code repo and have them migrated after approval?  14:46
<Yumeng__> zhipeng: :)  14:47
<zhipeng> in the future we will just submit the spec patch to cyborg-specs  14:47
<Sundar> ok  14:47
<zhipeng> shaohe_feng_ any update on the python-cyborgclient?  14:47
<shaohe_feng_> the spec and code will be separated  14:48
<shaohe_feng_> zhipeng: jinghan has some personal stuff going on these days. So no more updates.  14:48
<shaohe_feng_> zhipeng: I will help him with it.  14:48
<zhipeng> thx :) was just gonna mention this  14:49
<zhipeng> plz work with him  14:49
<shaohe_feng_> hopefully we will make progress next week.  14:49
<shaohe_feng_> OK  14:49
<zhipeng> ok thx  14:49
<zhipeng> I will work with zhuli on the os-acc  14:49
<zhipeng> that one will most likely involve nova team discussion  14:50
<Sundar> Yes. I will be happy to work with zhuli if he needs any help  14:50
<zhipeng> Sundar great :)  14:51
<zhipeng> #topic rocky spec/patch discussion  14:51
*** openstack changes topic to "rocky spec/patch discussion (Meeting topic: openstack-cyborg)"  14:51
<zhipeng> #link https://review.openstack.org/#/q/status:open+project:openstack/cyborg  14:51
<zhipeng> first up, Sundar's spec patch  14:52
<shaohe_feng_> Sundar: good work  14:53
<Sundar> Shaohe, thanks :)  14:53
<shaohe_feng_> but I have a question. why do the nova developers think the accelerator weigher call causes a performance loss?  14:53
<zhipeng> #link https://review.openstack.org/#/c/554717/  14:53
<Sundar> I think the assumption was that the weigher will call into the Cyborg REST API for each host  14:53
<Sundar> If the weigher is in the Nova tree, that is true  14:54
<Sundar> But, if Cyborg keeps it, we have other options  14:54
<shaohe_feng_> Sundar: why for each host?  14:54
<Sundar> The typical filter today operates per host  14:54
<shaohe_feng_> Sundar: I have discussed it before.  14:54
<shaohe_feng_> Sundar: the cyborg API will run on the controller node.  14:55
<shaohe_feng_> Sundar: we only call the api on the controller node.  14:55
<shaohe_feng_> just one api call is OK.  14:55
<shaohe_feng_> for example, the scheduler filter chooses the suitable hosts  14:56
<shaohe_feng_> and the scheduler weigher just calls an API to query the accelerator infos of these hosts  14:56
<shaohe_feng_> zhipeng: Li_Liu: right?  14:57
<zhipeng> that is still, per host  14:57
<Li_Liu> you mean the weigher is on the Cyborg controller side?  14:57
<shaohe_feng_> zhipeng: no, we get a list for the filter API  14:57
<Sundar> Shaohe, yes, we could override the BaseWeigher and handle multiple hosts in one call. That call could invoke the Cyborg REST API.  14:58
<shaohe_feng_> for example: GET /cyborg/v1/accelerators?hosts=cyborg-1,cyborg-2&type=fpga  14:58
<Sundar> To me, it is not clear what the performance hit would be.  14:58
<Sundar> I suspect any performance hit would not be noticeable until we get to some scale  14:59
<Li_Liu> Does this involve the 2-stage scheduling problem we were trying to avoid?  14:59
<Sundar> There is no 2-stage scheduling here: the proposed filter/weigher is a typical one, which just filters hosts based on function calls.  15:00
<shaohe_feng_> Sundar: yes, the scheduler has to call placement several times; is there a performance issue there?  15:00
<zhipeng> Sundar I think Li_Liu meant for a weigher in Nova  15:00
<zhipeng> shaohe_feng_ it is not the same thing  15:00
<shaohe_feng_> they are both http requests.  15:01
<zhipeng> anyways this has been discussed at length with the Nova team, so let's stay with the conclusion  15:01
<Li_Liu> ok  15:01
<zhipeng> shaohe_feng_ we could discuss offline more with Alex  15:01
<shaohe_feng_> OK  15:01
<zhipeng> but let's not dwell on it  15:01
<Sundar> Maybe I misunderstood :) We are proposing a weigher maintained in the Cyborg tree, which the operator will configure in nova.conf. Is that a concern?  15:01
<shaohe_feng_> a weigher maintained in the Cyborg tree still needs one cyborg api request, so is there also a performance issue?  15:02
<zhipeng> Sundar I don't think that would be a concern  15:03
<Sundar> shaohe: I personally don't think so, but we'll check the data to assure everybody  15:03
<shaohe_feng_> zhipeng: Li_Liu: do you think a weigher maintained in the Cyborg tree is a good idea?  15:03
<zhipeng> yes, at the moment  15:03
<shaohe_feng_> Sundar: you still need to tell cyborg which hosts need to be weighed  15:04
<shaohe_feng_> in the cyborg api call.  15:04
<Sundar> shaohe: This weigher is querying the Cyborg DB. It is better to keep it in Cyborg  15:04
*** crushil has quit IRC  15:04
<Li_Liu> I agree with zhipeng  15:04
<zhipeng> the weigher will just talk to the conductor  15:04
<zhipeng> it is not blocking nova operations  15:05
*** Vipparthy has quit IRC  15:05
<zhipeng> that is the point  15:05
<shaohe_feng_> either way, the api call will talk to the conductor to query the Cyborg DB  15:05
<Sundar> Shaohe, yes. The weigher gets a list of hosts. We could either introduce a new Cyborg API for that, or just have the weigher query the db directly  15:06
<shaohe_feng_> no difference.  15:06
<zhipeng> shaohe_feng_ let's leave it offline  15:06
<shaohe_feng_> zhipeng: OK.  15:06
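
Putting this exchange together: the batched weigher Sundar describes would override BaseWeigher to make one Cyborg call for the whole host list, along the lines of shaohe_feng_'s example URL. A minimal sketch; the endpoint, response shape, and override details are assumptions, though nova's BaseWeigher class itself is real:

```python
# Sketch of a Cyborg-maintained weigher that queries accelerator counts
# for all candidate hosts in a single REST call.
import requests
from nova.weights import BaseWeigher

CYBORG_API = 'http://controller:6666'   # placeholder endpoint

class AcceleratorWeigher(BaseWeigher):

    def weigh_objects(self, weighed_obj_list, weight_properties):
        hosts = [w.obj.host for w in weighed_obj_list]
        resp = requests.get(
            CYBORG_API + '/cyborg/v1/accelerators',
            params={'hosts': ','.join(hosts), 'type': 'fpga'})
        free = resp.json()   # assumed shape: {"cyborg-1": 2, ...}
        return [float(free.get(w.obj.host, 0)) for w in weighed_obj_list]

    def _weigh_object(self, host_state, weight_properties):
        return 0.0           # unused: weigh_objects is overridden above

# Per Sundar's description, the operator would then list this class in
# nova.conf's scheduler weight_classes option.
```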
<zhipeng> Sundar thx for the spec :)  15:06
<zhipeng> We will definitely review it more  15:06
<zhipeng> next up, Li Liu's patch  15:07
<zhipeng> #info Implemented the Objects and APIs for vf/pf  15:08
<zhipeng> #link https://review.openstack.org/552734  15:08
<Sundar> Sorry, I need to leave for my next call. :( Will catch up from the minutes  15:08
<Li_Liu> later  15:09
<zhipeng> Sundar no problem  15:09
<zhipeng> Sundar we need another discussion on k8s actions :)  15:09
<zhipeng> Li_Liu any update or comment on your patch?  15:10
<zhipeng> things we need to be aware of?  15:10
<Li_Liu> not really. I think shaohe needs to change some code in his resource tracker to adopt the change  15:10
<Li_Liu> other than that, one thing left over is to utilize the attribute table in deployables. This is still a missing piece  15:11
<shaohe_feng_> Li_Liu: thanks for the reminder.  15:11
<Li_Liu> I will keep working on that as well as 2 other specs  15:12
<Li_Liu> shaohe_feng_ np. let me know if you need any help using my pf/vfs  15:12
<shaohe_feng_> Li_Liu: do the 2 other specs include image management?  15:12
<shaohe_feng_> Li_Liu: ok, thanks.  15:13
<Li_Liu> programmability and image metadata standardization  15:13
<zhipeng> yes, big task on your shoulders :)  15:13
<Li_Liu> :)  15:14
<zhipeng> okay we've gone through our agenda list today  15:15
<zhipeng> I think we can end the meeting now :)  15:15
<zhipeng> and talk to you guys next week  15:16
<kosamara> bye  15:18
<Yumeng__> bye  15:19
<zhipeng> #endmeeting  15:19
*** openstack changes topic to "OpenStack Cyborg Project Discussion"  15:19
<openstack> Meeting ended Wed Mar 21 15:19:38 2018 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)  15:19
<openstack> Minutes:        http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.html  15:19
<openstack> Minutes (text): http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.txt  15:19
<openstack> Log:            http://eavesdrop.openstack.org/meetings/openstack_cyborg/2018/openstack_cyborg.2018-03-21-14.00.log.html  15:19
*** helloway has quit IRC  15:19
*** zhipeng has quit IRC  15:26
*** Sundar has quit IRC  15:53
*** masber has quit IRC  16:27
*** Vipparthy has joined #openstack-cyborg  16:38
*** Vipparthy has quit IRC  16:38
*** Helloway has joined #openstack-cyborg  17:05
*** Helloway has quit IRC  17:11
*** circ-user-rwgGd has joined #openstack-cyborg  17:12
*** circ-user-rwgGd has quit IRC  17:12
*** Sundar has joined #openstack-cyborg  17:13
*** Sundar has quit IRC  17:13
*** Yumeng__ has quit IRC  17:29
*** tim__ has joined #openstack-cyborg  17:40
*** tim__ has quit IRC  17:40
*** shaohe_feng_ has quit IRC  20:28
*** guhcampos has joined #openstack-cyborg  21:17
*** guhcampos has quit IRC  22:30
*** masber has joined #openstack-cyborg  23:04
