Friday, 2017-09-01

*** rbudden has quit IRC01:08
*** priteau has joined #scientific-wg01:50
*** priteau has quit IRC01:55
*** rbudden has joined #scientific-wg02:02
*** rbudden has quit IRC02:02
masberI see, we would like to use openstack to deploy HPC but also VMs through nova and containers using zun+kuryr. Right now we use rocks cluster for deployment and to provision SGE (job scheduler); does openstack/ironic have any integration with a job scheduler?02:25
jmlowerbudden and trandles are the ones to ask about that02:26
jmlowemasber I take it you are not in North America?02:26
masberjmlowe, correct I am in Sydney (Australia)02:27
jmlowealways seem to be on in the middle of the night for most of us02:27
jmloweb1airo might be on, he's from Nectar02:28
jmloweI'm up late babysitting my ceph cluster converting from filestore to bluestore02:28
masberI would love to get in contact with the people from Nectar02:29
masberjmlowe, bluestore looks quite nice, apparently it is a better fit for all-flash ceph environments?02:30
jmloweI'm estimating by Saturday when this is all said and done I will have moved around 750TB of data 35M objects02:30
jmloweI think it's just better all around02:30
jmloweit really does CoW (copy-on-write) instead of emulating it with atomic filesystem operations02:31
jmlowegets rid of double write penalty02:31
jmloweif you are all flash, should make your system twice as fast and last twice as long02:32
masberjmlowe, nice, I am playing with jewel as it is the one used by openstack-kolla02:32
jmloweall data is checksummed on disk02:32
jmloweon disk inline compression, they use some trick similar to zfs to make erasure coded pools for rbd usable02:33
jmlowethe rados gw gets swift public buckets, great for hosting data sets02:33
jmloweI can't say enough good things about this luminous update02:34
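
For context, a minimal sketch of the Luminous feature described above (usable erasure-coded pools for RBD): overwrites get enabled on an EC data pool while image metadata stays in a replicated pool. Pool names, PG counts and the image size below are illustrative, and allow_ec_overwrites requires bluestore OSDs.

    # replicated pool for RBD metadata, EC pool for the data
    ceph osd pool create rbd-meta 64
    ceph osd pool create rbd-ecdata 128 128 erasure
    # the Luminous trick: let RBD overwrite objects in the EC pool
    ceph osd pool set rbd-ecdata allow_ec_overwrites true
    # the image lives in the replicated pool, its data objects land in the EC pool
    rbd create --size 100G --data-pool rbd-ecdata rbd-meta/myimage
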
masberjmlowe, I tried to configure it by reading a document from intel, they said to pin each nvme drive to the cpu... couldn't figure out how to do that, in the end I just reduced the size of the objects (my files are not big) to get better performance but I guess I can get much more than what I am doing now02:34
jmloweright, what you would do is pin the osd process to the cpu02:34
masberyes, I am using NUMA which makes things a little bit more complicated I guess02:35
masberI am not an expert, just reading and trying things around02:36
jmloweapparently you use numactl to do the pinning02:36
jmlowehttp://www.glennklockwood.com/hpc-howtos/process-affinity.html02:36
jmlowecpu pinning is an old hpc trick02:36
jmlowewhen you are counting floating point operations per clock tick getting yourself on the wrong numa node is to be avoided at all costs02:37
masberyes something like this --> Ceph startup scripts need change with setaffinity=" numactl --membind=0 --cpunodebind=0 "02:38
masberthis is the docs I was reading http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments02:38
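
As a reference point, a minimal sketch of the pinning being discussed, with OSD id 12 and nvme0 purely as placeholders: check which NUMA node the NVMe sits on, then start the OSD bound to that node for both CPU and memory (roughly what the setaffinity= line in the Intel doc expands to).

    # which NUMA node is this NVMe attached to? (-1 means no affinity reported)
    cat /sys/class/nvme/nvme0/device/numa_node

    # start the OSD bound to NUMA node 0 for CPU and memory
    numactl --cpunodebind=0 --membind=0 \
        /usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph
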
masberjmlowe, may I ask, have you used or heard about Panasas?02:40
jmloweYeah, they were always too expensive for us, we have traditionally been a gpfs (came with the cluster from IBM) and lustre shop02:41
masberjmlowe, we are currently using Panasas02:43
masberyes very expensive02:43
masberjmlowe, how do you use ceph?02:59
jmlowebacks the openstack for jetstream-cloud.org03:01
masberwow03:03
masberspinning disks I guess? and are you using 25Gb/s network?03:05
jmlowe10Gb/s; 25Gb/s wasn't really a thing when we were putting together the proposal 3 years ago03:05
jmlowefrom the time we found out we were getting the award to the time the hardware arrived was about 14 months if I remember correctly03:06
jmlowebut yes, 20x12x4TB 12Gb/s SAS spinning disks03:09
jmlowebonded 2x10Gb/s networking03:10
*** priteau has joined #scientific-wg03:51
*** priteau has quit IRC03:56
b1airohi masber04:43
b1airojust noticed this conversation04:43
masberb1airo, hi04:43
b1airoand hey jmlowe04:43
b1airowhat process are you using to convert those OSDs?04:43
b1airoto your earlier question masber, Nova/Ironic don't currently have any official HPC scheduler integrations, but the placement-api and scheduler decoupling work that has been a focus of Nova dev for the last couple of cycles is going to make deep integration easier04:46
b1airothere are people out there today doing interesting things with SLURM, e.g., using a SLURM job to essentially request a VM, with a job prolog converting a regular HPC compute node into a Nova node for the VM04:47
b1airoi think that is something the folks at PSC running Bridges were doing04:47
masberb1airo, so the job scheduler will provision the vm to run the job?04:48
b1airoin the case i was talking about, the job *is* "a VM"04:51
b1airoso once the job has started the user's new VM should be up and running for them within the cloud environment04:52
b1airorather than going to the dashboard or using IaaS APIs they requested it through SLURM04:52
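
A purely hypothetical sketch of that "job is a VM" pattern (not PSC's actual implementation, which isn't shown here): a prolog hands the allocated node to Nova, then the VM is booted and forced onto that host. The service name, flavor, image and use of the az:host syntax are all assumptions.

    #!/bin/bash
    # hypothetical SLURM prolog, run as root on the allocated node;
    # nova-compute is assumed installed but normally kept stopped on HPC nodes
    systemctl start openstack-nova-compute

    # hypothetical follow-up by a trusted wrapper: boot the user's VM and pin
    # it to this node via the admin-only ZONE:HOST availability-zone form
    openstack server create --flavor m1.large --image centos7 \
        --availability-zone nova:"$(hostname)" \
        --wait "vm-for-job-${SLURM_JOB_ID}"

An epilog would presumably do the reverse (delete the VM, stop nova-compute) when the job ends.
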
masberb1airo, I see, sounds interesting. I was wondering about deploying the os and the job scheduler integration with the master using ironic04:53
b1airothat is a possible approach to a more narrow and highly-managed HPC-integrated cloud service, which would probably be of interest if you are a HPC centre/shop but need some way to offer more flexible on-demand systems04:53
masberand having the master on a vm already running04:53
b1airosorry, what do you mean by master in this case?04:54
masberlike the master host for sge04:55
masberI was thinking of having the master on a vm and the clients on baremetal04:58
masbernot sure if I explained properly05:04
b1airomasber - ah right, your cluster master controller/s, at first i thought you meant Ironic master branch o_005:10
b1airoi'd say having those sorts of service machines as VMs is pretty normal practice even without cloud thrown in05:12
masberthe key to me would be to find a way to install all the needed packages and the job scheduler integration automatically05:12
masbersame if I want to shrink the HPC cluster05:12
masberthat way I could do things like assign nodes to HPC or to hadoop or VMs or containers05:13
masbervms provided by nova, containers by zun+kuryr05:13
masberfor hadoop I was thinking of using apache ambari or sahara05:13
masberbut I am not sure how to provision the HPC after ironic05:14
masbermaybe ansible?05:14
masberdo you have an opinion about that? maybe I am crazy?05:14
*** simon-AS559 has joined #scientific-wg05:15
*** simon-AS559 has quit IRC05:33
*** priteau has joined #scientific-wg05:52
*** priteau has quit IRC05:57
*** priteau has joined #scientific-wg07:53
*** priteau has quit IRC07:58
*** priteau has joined #scientific-wg09:02
*** jmlowe has quit IRC11:38
*** jmlowe has joined #scientific-wg11:41
*** rbudden has joined #scientific-wg11:57
jmloweb1airo set the crush weight to zero, go have lunch, come back, take a nap, make dinner, get a good night's sleep, have brunch, then stop osd and follow the manual osd replacement procedure from the docs http://docs.ceph.com/docs/luminous/rados/operations/add-or-rm-osds/#adding-osds-manual12:20
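
Condensed into commands, the per-OSD drain-and-replace loop described above looks roughly like this; OSD id 42 and /dev/sdq are placeholders, and on early 12.2.x point releases ceph-disk rather than ceph-volume may be the tool available.

    # 1) drain: push data off the OSD, then wait (hours) until `ceph -s`
    #    shows every PG active+clean again
    ceph osd crush reweight osd.42 0

    # 2) remove the old filestore OSD, per the add-or-rm-osds doc
    systemctl stop ceph-osd@42
    ceph osd out 42
    ceph osd crush remove osd.42
    ceph auth del osd.42
    ceph osd rm 42

    # 3) recreate it as bluestore on the same device and let backfill refill it
    ceph-volume lvm create --bluestore --data /dev/sdq
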
jmlowemasber rbudden uses puppet to deploy hpc (slurm) following ironic for https://www.psc.edu/bridges12:21
b1airomasber, at Monash we use Ansible for those sorts of tasks, seems to work pretty well and is definitely a better choice for quick and simple stuff compared to Puppet12:22
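
A small sketch of that Ansible approach: build a throwaway inventory from the instances Nova/Ironic just provisioned, then run the playbook that lays down the scheduler stack. The node name pattern, single-network assumption and playbook name are all illustrative.

    # one inventory line per node, e.g. "hpc-node-01 ansible_host=10.0.0.5"
    # (assumes one network per node so the last field is the IP)
    openstack server list --name hpc-node -f value -c Name -c Networks \
        | awk -F'[ =]' '{print $1, "ansible_host=" $NF}' > hosts.ini

    # install SGE/SLURM packages, munge keys, scheduler config, ...
    ansible-playbook -i hosts.ini hpc-compute.yml
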
jmloweb1airo 17 hours later I'm still waiting for the first 1/3 to drain12:23
jmlowe             5842308/90961059 objects misplaced (6.423%)12:23
jmlowe             11152 active+clean12:23
jmlowe             1198  active+remapped+backfill_wait12:23
jmlowe             248   active+remapped+backfilling12:23
jmloweI did one node, it was relatively painless except for the draining12:24
b1airojmlowe, so you are letting the cluster recover completely before re-provisioning those OSDs? our biggest cluster has about 1400 OSDs now so I think we'll probably have to take a shorter route to make it less painful and just reformat the current set of OSDs we're working on without doing any recovery12:24
jmlowewell depending on how much free space you have....12:25
b1airosure, and the crush topology12:25
jmloweI'm 1/3 full so I did 1/3 of the osd's and I expect the osd's to go up to 50% used12:25
jmloweI'll set the weights back up to normal and the next 1/3 down to zero so in another 18-20 hours I'll be able to convert the next 1/312:27
jmloweso I'm guessing with sleep the whole process is going to take about 96 hours12:28
jmloweI'll give this whole process a proper writeup somewhere12:30
jmloweI also did some contrarian things: I had my disks in raid0 pairs to cut down on the number of osd's; with async messaging and no journaling for the data, I'm splitting them and going jbod as part of the conversion12:32
jmloweraid0 pairs meant I had 50GB ssd journals rather than 25GB, performance was markedly better than jbod with smaller disks that the texas jetstream cloud uses12:33
jmloweI'm a bit sleep deprived, texas cloud has smaller journals and per disk osd's on identical hardware, if that wasn't clear12:35
jmloweb1airo if you drain there is no blocking, it's a lot less disruptive12:37
b1airoshouldn't cause blocking either way unless you have an undersize issue?12:44
b1airojmlowe, interesting that the raid0 setup was better performing - for what benchmarks? i guess you get to a point where you have enough OSDs to soak up your peak workload so adding extra OSDs will just give you more headroom for extra clients that you don't need, whereas making existing OSDs faster (such as RAID0) will mean individual clients get better performance12:48
jmlowethe only scenario where more osd's did better was small writes, once you got up above 4k for writes the larger journals kicked in12:49
jmloweI've got a graph somewhere12:50
b1airoare you sure it was to do with the journal sizes? once you add them all up you should need a lot of sustained writes at high bandwidth before their size matters much. did you also change the sync interval?12:51
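
For anyone wanting to reproduce the small-write vs large-write comparison, rados bench against a scratch pool with different op sizes is the usual quick check; the pool name and thread count here are arbitrary.

    # 4 KiB writes - the regime where more, smaller OSDs did better
    rados bench -p benchpool 60 write -b 4096 -t 16 --no-cleanup
    # default 4 MiB objects - where the larger journals paid off
    rados bench -p benchpool 60 write -t 16 --no-cleanup
    # tidy up the benchmark objects afterwards
    rados -p benchpool cleanup
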
b1airohave you started CephFS-ing in JetStream yet?12:52
jmlowenot yet, I'm looking at manila and I'm kind of freaked out by the idea of the instances being able to directly talk to the ceph cluster12:52
jmloweI'd also have to change all of my ip addresses12:53
jmlowethat being said, it's something I think I want to do, doubling my osd's will give me the overhead I need to add the cephfs pools, I was at 300 pgs/osd before12:55
jmloweI'm really not sure how raid controller cache with raid0 pairs is going to compare with being able to enable on disk cache13:00
b1airoyou mean the ~64MB on-disk cache?13:01
b1airoin one Nova AZ I forced that into write-caching mode - but it is usually not power safe so...13:02
b1airore. CephFS + Manila, Red Hat is working towards NFS gateway as the standard mode I think13:03
b1airowould expect to hear more on that in Sydney13:03
jmloweoh, I like the sound of that13:03
b1airowe're going to start out soon running in native mode but only give access to "trusted" tenants13:04
b1airowe have the ganesha cephfs-nfs gateway up on one cluster too, seems to work but our HA setup needs more testing13:05
jmloweI can live with the trusted service vm exporting cephfs over nfs, similar to the generic manila setup13:12
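
From the tenant side, the NFS-gateway flow being discussed ends up looking roughly like this; the share type name depends on what the operator configured for the CephFS/Ganesha backend and is an assumption here.

    # create a 100 GB NFS share and allow a client network
    manila create NFS 100 --name dataset01 --share-type cephfsnfs
    manila access-allow dataset01 ip 10.0.0.0/24
    # look up the export path, then mount it from inside an instance
    manila share-export-location-list dataset01
    mount -t nfs <export-path-from-above> /mnt/dataset01
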
jmlowere: cache, 768MB of DRAM vs 2048MB of flash13:37
*** simon-AS559 has joined #scientific-wg16:26
*** simon-AS559 has quit IRC16:58
trandlesmasber: +1 what b1airo and jmlowe said about everything, they're fonts of knowledge ;)18:36
trandlesI'll be in Sydney for the Summit in November if you're attending.  I'm happy to sit down and go over how we're using OpenStack with HPC at Los Alamos18:37
jmloweI'm not sure I'd call what I'm spewing knowledge :)18:37
trandlesthere are several orthogonal use cases that we're addressing that largely fall along user-support and system-management lines18:37
trandlesto answer your immediate question, there's no direct integration between something like SLURM and OpenStack, but I've written SLURM plugins to do some integration18:38
trandlesthe plugin source code release is being held up by the lab's legal process though :(  Not for any technical reason, but for bureaucratic reasons18:39
trandlesjmlowe: you're just being modest :)18:39
*** jmlowe_ has joined #scientific-wg20:04
*** jmlowe has quit IRC20:06
*** simon-AS559 has joined #scientific-wg20:46
*** simon-AS5591 has joined #scientific-wg20:50
*** simon-AS559 has quit IRC20:51
*** simon-AS5591 has quit IRC21:22
*** priteau has quit IRC21:39
