Friday, 2016-01-22

*** rbak_ has quit IRC00:01
*** liamji has quit IRC00:03
*** jfluhmann has quit IRC00:06
*** ddieterly has quit IRC00:11
mnaserthrew another node, total of 40 cores (over 2 nodes) at full load, api endpoint not responding for gnocchi :(00:25
*** zqfan has joined #openstack-telemetry00:26
*** yarkot has joined #openstack-telemetry00:29
openstackgerritSam Morrison proposed openstack/gnocchi: Convert user and project IDs from UUIDs to Strings  https://review.openstack.org/27000700:48
*** cheneydc has joined #openstack-telemetry01:06
*** yarkot_ has joined #openstack-telemetry01:18
*** cheneydc has quit IRC01:25
*** liusheng has joined #openstack-telemetry01:25
*** cheneydc has joined #openstack-telemetry01:25
*** ljxiash has joined #openstack-telemetry01:27
*** jfluhmann has joined #openstack-telemetry01:29
*** yarkot_ has quit IRC01:29
*** cheneydc has quit IRC01:34
*** thorst has joined #openstack-telemetry01:43
*** caishan has joined #openstack-telemetry01:52
*** ddieterly has joined #openstack-telemetry02:01
openstackgerritliusheng proposed openstack/ceilometer-specs: Improve Nova Instance metering  https://review.openstack.org/20979902:03
*** thorst has quit IRC02:05
*** cheneydc has joined #openstack-telemetry02:15
openstackgerritSam Morrison proposed openstack/gnocchi: Convert user and project IDs from UUIDs to Strings  https://review.openstack.org/27000702:22
*** liusheng has quit IRC02:39
*** liusheng has joined #openstack-telemetry02:40
*** Ich has joined #openstack-telemetry02:49
*** ccz has joined #openstack-telemetry02:49
*** Ich is now known as Guest9344802:49
*** achatterjee has quit IRC03:00
*** thorst has joined #openstack-telemetry03:06
*** achatterjee has joined #openstack-telemetry03:07
openstackgerritMerged openstack/python-aodhclient: add alarm-history interface  https://review.openstack.org/26500103:10
*** thorst has quit IRC03:14
*** links has joined #openstack-telemetry03:26
*** prashantD_ has quit IRC03:28
*** links has quit IRC03:40
*** links has joined #openstack-telemetry03:48
*** links has quit IRC04:04
*** thorst has joined #openstack-telemetry04:12
*** links has joined #openstack-telemetry04:16
*** ddieterly has quit IRC04:16
*** fabian2 has joined #openstack-telemetry04:18
*** thorst has quit IRC04:19
*** agireud has quit IRC04:20
*** agireud has joined #openstack-telemetry04:22
*** links has quit IRC04:27
*** links has joined #openstack-telemetry04:28
openstackgerritZi Lian Ji proposed openstack/python-aodhclient: Add the functional tests  https://review.openstack.org/27113704:44
*** ljxiash has quit IRC04:46
*** ljxiash has joined #openstack-telemetry04:47
*** ddieterly has joined #openstack-telemetry04:47
*** links has quit IRC04:50
*** ljxiash has quit IRC04:51
openstackgerritMerged openstack/ceilometer: Updated from global requirements  https://review.openstack.org/26963404:51
*** fabian2 has quit IRC04:51
openstackgerritMerged openstack/ceilometer: tempest: copy telemetry client from tempest tree  https://review.openstack.org/25570604:51
*** ddieterly has quit IRC04:52
*** links has joined #openstack-telemetry04:57
*** achatterjee has quit IRC05:08
*** links has quit IRC05:13
*** thorst has joined #openstack-telemetry05:17
*** achatterjee has joined #openstack-telemetry05:20
*** links has joined #openstack-telemetry05:21
*** thorst has quit IRC05:25
*** links has quit IRC05:36
*** links has joined #openstack-telemetry05:37
*** ljxiash has joined #openstack-telemetry05:40
*** yprokule has joined #openstack-telemetry05:44
*** ddieterly has joined #openstack-telemetry05:48
*** ddieterly has quit IRC05:52
*** links has quit IRC05:59
*** links has joined #openstack-telemetry06:01
*** _nadya_ has joined #openstack-telemetry06:10
openstackgerritOpenStack Proposal Bot proposed openstack/ceilometer: Imported Translations from Zanata  https://review.openstack.org/26857506:16
*** links has quit IRC06:22
*** thorst has joined #openstack-telemetry06:22
*** links has joined #openstack-telemetry06:23
*** achatterjee has quit IRC06:28
*** thorst has quit IRC06:29
*** liamji has joined #openstack-telemetry06:42
*** _nadya_ has quit IRC06:43
*** links has quit IRC06:46
*** ddieterly has joined #openstack-telemetry06:49
*** links has joined #openstack-telemetry06:50
*** ddieterly has quit IRC06:53
*** idegtiarov_ has quit IRC07:01
openstackgerritZi Lian Ji proposed openstack/python-aodhclient: Add the functional tests  https://review.openstack.org/27113707:08
*** links has quit IRC07:09
*** links has joined #openstack-telemetry07:09
*** thorst has joined #openstack-telemetry07:28
*** idegtiarov_ has joined #openstack-telemetry07:29
*** links has quit IRC07:31
*** rcernin has joined #openstack-telemetry07:31
*** ljxiash_ has joined #openstack-telemetry07:32
*** liamji has quit IRC07:34
*** belmoreira has joined #openstack-telemetry07:34
*** ljxiash has quit IRC07:35
*** thorst has quit IRC07:35
*** liamji has joined #openstack-telemetry07:36
*** links has joined #openstack-telemetry07:48
*** ddieterly has joined #openstack-telemetry07:50
*** links has quit IRC07:54
*** ddieterly has quit IRC07:55
*** safchain has joined #openstack-telemetry07:58
*** boris-42 has quit IRC08:13
*** _nadya_ has joined #openstack-telemetry08:14
*** _nadya_ has quit IRC08:17
openstackgerritZi Lian Ji proposed openstack/python-aodhclient: Add the functional tests  https://review.openstack.org/27113708:21
*** links has joined #openstack-telemetry08:22
*** thorst has joined #openstack-telemetry08:32
*** ljxiash has joined #openstack-telemetry08:33
*** ljxiash_ has quit IRC08:34
*** links has quit IRC08:35
silehtjd__, look at this jsonschema example: http://paste.openstack.org/show/484651/08:37
silehtjd__, creating a parser to generate the right column type will not be an easy task...08:37
silehtjd__, also no jsonschema python lib does deserialization, they only do strict validation08:38
silehtjd__, if you say {"type": "number"} and send the string "2", it is refused, whereas with voluptuous it is converted to int08:39
*** thorst has quit IRC08:39
jd__sileht: right08:41
jd__sileht: let's forget about that; we could still do an export to jsonpath I guess that'd be simpler08:41
silehtjd__, yes, I had thought we could mimic the jsonpath keywords08:42
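For readers of the log: a minimal illustration of the validation-vs-coercion difference sileht describes above, assuming both libraries are installed; the "size" field is only a placeholder.

    import jsonschema
    import voluptuous

    schema = {"type": "object", "properties": {"size": {"type": "number"}}}
    try:
        # jsonschema only validates: the string "2" is rejected for a number field
        jsonschema.validate({"size": "2"}, schema)
    except jsonschema.ValidationError as exc:
        print("jsonschema refused it: %s" % exc.message)

    # voluptuous can deserialize: Coerce(int) converts the string "2" into 2
    v_schema = voluptuous.Schema({"size": voluptuous.Coerce(int)})
    print(v_schema({"size": "2"}))  # {'size': 2}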
*** shardy has joined #openstack-telemetry08:53
*** efoley has joined #openstack-telemetry09:01
*** mattyw has joined #openstack-telemetry09:13
jd__sileht: I guess we'll have to fix stable/1.3 to work with keystonemiddleware 4 too…09:15
jd__i don't know if it's complicated to support09:15
jd__if you have an idea, it's welcome,  otherwise I'll look into it :(09:15
silehtjd__, I can't help, I'm in PTO today and I will go out soon09:16
jd__oh sure, nevermind09:16
jd__have fun :)09:16
openstackgerritJulien Danjou proposed openstack/gnocchi: storage: autoconfigure coordination_url  https://review.openstack.org/27090709:17
*** yassine__ has joined #openstack-telemetry09:22
*** boris-42 has joined #openstack-telemetry09:32
*** _nadya_ has joined #openstack-telemetry09:34
*** _nadya_ has quit IRC09:35
*** achatterjee has joined #openstack-telemetry09:36
openstackgerritMerged openstack/aodh: tempest: migrate codes from tempest tree  https://review.openstack.org/25518709:36
*** thorst has joined #openstack-telemetry09:39
*** thorst has quit IRC09:45
*** ddieterly has joined #openstack-telemetry09:52
*** ljxiash has quit IRC09:53
*** ddieterly has quit IRC09:57
openstackgerritMerged openstack/gnocchi: plugin cleanup  https://review.openstack.org/27106610:00
*** _nadya_ has joined #openstack-telemetry10:01
*** liamji has quit IRC10:01
*** cheneydc has quit IRC10:01
*** liamji has joined #openstack-telemetry10:02
openstackgerritZi Lian Ji proposed openstack/python-aodhclient: Add the functional tests  https://review.openstack.org/27113710:06
*** ljxiash has joined #openstack-telemetry10:21
*** ljxiash has quit IRC10:22
*** cdent has joined #openstack-telemetry10:23
*** liamji has quit IRC10:24
openstackgerritJulien Danjou proposed openstack/gnocchi: storage: autoconfigure coordination_url  https://review.openstack.org/27090710:30
*** caishan has quit IRC10:32
*** thorst has joined #openstack-telemetry10:43
*** thorst has quit IRC10:49
silehtw10:51
*** ddieterly has joined #openstack-telemetry10:53
*** ddieterly has quit IRC10:57
*** ljxiash has joined #openstack-telemetry11:09
*** mattyw has quit IRC11:40
*** cdent has quit IRC11:43
*** thorst has joined #openstack-telemetry11:47
*** openstackgerrit has quit IRC11:47
*** openstackgerrit has joined #openstack-telemetry11:48
*** boris-42 has quit IRC11:53
*** efoley has quit IRC11:54
*** efoley has joined #openstack-telemetry11:54
*** ddieterly has joined #openstack-telemetry11:54
*** ddieterly has quit IRC11:59
openstackgerritChaozhe Chen(ccz) proposed openstack/gnocchi: Replace dict.iteritems() with six.iteritems()  https://review.openstack.org/27125812:03
openstackgerritMichael Krotscheck proposed openstack/aodh: Added CORS support to Aodh  https://review.openstack.org/26534212:18
openstackgerritMichael Krotscheck proposed openstack/aodh: gabbi's own paste.ini file  https://review.openstack.org/26533012:18
*** efoley has quit IRC12:27
*** openstackgerrit has quit IRC12:33
*** openstackgerrit has joined #openstack-telemetry12:34
*** gordc has joined #openstack-telemetry12:34
*** mattyw has joined #openstack-telemetry12:35
*** ildikov has quit IRC12:37
*** liamji has joined #openstack-telemetry12:38
*** Guest93448 has quit IRC12:40
*** ccz has quit IRC12:40
*** efoley has joined #openstack-telemetry12:43
*** ildikov has joined #openstack-telemetry12:49
*** cdent has joined #openstack-telemetry12:53
*** ddieterly has joined #openstack-telemetry12:55
*** leitan has joined #openstack-telemetry12:55
leitanHi guys, maybe gordc or jd__ can help me on this, is there any way that we can configure several memcached servers on the coordination_url ? if so, what's the correct syntax, cause i have tried several12:56
*** ddieterly has quit IRC12:59
*** cdent is now known as anticdent13:04
gordcleitan: for ceilometer, aodh, or gnocchi?13:05
gordcleitan: why would you want to do that?13:05
leitangordc, for gnocchi, i have multiple gnocchi-api and metricd, and i need the locking feature13:06
leitangordc, i have read that the tooz backends are supported, but i cant find the way to configure multiple memcached on the coordination_url line13:07
gordcleitan: if you have several different memcache servers how would they coordinate? the processes on one memcache server wouldn't know the ones on the other memcache server exist.13:07
leitangordc, yes, im guessing this can be done similar to how nova or keystone work with memcached: if not found in one, try the other one13:08
*** julim has joined #openstack-telemetry13:08
mnaseri think when you have multiple memcache servers, the hashing distributes the data throughout the cluster13:08
leitangordc, if it doesnt work like the other projects, ill have to use clustered redis13:09
mnaserleitan: thats what i ended up setting up13:09
mnaseralso I ran into an issue yesterday and after troubleshooting, it seems that the Ceph driver might have some scaling bugs :(13:09
leitangordc, but since i already have a big memcached pool , i really wanted to use it13:09
gordcleitan: does keystone use memcache as a cache or group membership?13:09
leitangordc, as a cache13:10
leitanmnaser, yes, performance is a complicated thing13:10
leitanmnaser, were running into many issues now13:10
leitantroubleshooting13:10
leitanalso with ceph backend13:10
mnaserleitan: *something* happened which caused a delay in the backend which accumulated a large # of measures to be processed13:11
gordcyeah, so tooz is a group membership tool... i'm not sure there's a way to define two different logical targets. a tooz expert can correct me here.13:11
mnaserbut there is a call that lists all measures and it has to go through ~80k records before it can process a single request13:11
gordcmnaser: you see a spike randomly? or it just grows consistently?13:12
mnaseri've had to shut down ceilometer and im trying to let it catch up13:12
mnasergordc: i mean this is only the 2nd day it's existed.. but basically once it goes to a really high number, i think it will be almost impossible for it to catch up13:12
leitangordc, thats why i thought, but offering memcached as a membership backend and expecting to use just one memcached ? i guess thats weird, ill research a little bit more on the tooz code13:12
mnasergordc: I've documented this here - https://bugs.launchpad.net/gnocchi/+bug/153690913:12
openstackLaunchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete]13:12
mnaser_list_object_names_to_process is called a lot and loop/goes through some 80k items on *every* single time it has to process a new measure13:13
gordcleitan: well it's not really a lot of data. if you have a group of say 10 agents... why would you need a lot of memcache servers?13:13
mnaserin around 10 hours, it was only able to process 10k measures.  this is with 48 total metricd workers across 2 machines, each with 24 cores (machines dedicated for gnocchi at this point)13:14
leitanits actually not for the amount of data, but more likely about distribution13:14
leitanmnaser, is metricd not catching up, or metricd its actually killing your ceph cluster ? (or both )13:14
mnasermetricd not catching up, and because of that "part" of the codebase, it would basically *never* be able to catch up, because the more it grows behind, the slower it gets .. is it killing the ceph cluster, i dont think so13:15
mnaserclient io 57575 kB/s rd, 51290 B/s wr, 92 op/s13:15
mnasernot much13:15
leitanmnaser, just asking because with all our ceilometer-agent-compute and metricd turned on, we start to have A LOT of ceph blocking requests13:16
mnaserleitan: i haven't even turned on ceilometer-agent-compute yet... just the collector so far and ran into this issue13:17
leitanand just a slow metricd, ceph not blocking anything ? mnaser13:18
mnaserleitan: yes, i dont think ceph is blocking anything.  it's still responding fine using rados commands13:19
gordcleitan: you might want to read into each tooz driver. memcache is probably not something you want in production.13:19
gordcleitan: http://docs.openstack.org/developer/tooz/drivers.html#memcached13:19
gordci use it... but it's for local testing.13:20
mnaseri just wrote a small poc code to check if my theory that i'm being delayed by that command is true using timeutils.StopWatch13:20
leitangordc, yup ill take a look thanks, just didnt want to run setting up just another cluster of stuff to support another service :)13:21
leitanmnaser, im really interested on how that is going to result, please let me know13:21
gordcmnaser: i'm trying to think about stable/1.3. we made quite a few changes to the metricd code to break/organise data chunks13:21
mnaserleitan: i will!   you can follow this bug if you want automatic updates too.. https://bugs.launchpad.net/gnocchi/+bug/153690913:22
openstackLaunchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete]13:22
gordcleitan: you'll probably need to. ceilometer/aodh/gnocchi use tooz for coordination. nova is adopting it in mitaka. and it's going to be used across all openstack eventually.13:22
mnaseryep, theory valid woo :<13:23
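The quick check mnaser describes could look roughly like this; the `storage` argument standing in for an instantiated Gnocchi Ceph storage driver is an assumption, and the listing method's real signature may differ.

    from oslo_utils import timeutils


    def time_listing(storage):
        # storage is assumed to be a gnocchi Ceph storage driver instance;
        # _list_object_names_to_process is the call discussed in the bug above
        watch = timeutils.StopWatch()
        watch.start()
        names = list(storage._list_object_names_to_process())
        watch.stop()
        print("listed %d pending measure objects in %.2fs"
              % (len(names), watch.elapsed()))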
gordcmnaser: does your environment work with gnocchi 1.4?13:24
leitangordc, ill go with redis and stop playing around :)13:24
gordcleitan: :) it should be good enough... especially if you don't want to maintain zookeeper13:25
jd__you can't have multiple memcached backend13:25
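To restate jd__'s point in code: tooz coordination is one shared backend that every daemon points at, not a pool of caches. A minimal sketch using the tooz API; the URL, member id and lock name are placeholders.

    from tooz import coordination

    # every gnocchi-api / gnocchi-metricd process must use the *same* backend
    # URL; a lock taken here is invisible to a worker pointed at another server
    coordinator = coordination.get_coordinator('redis://127.0.0.1:6379',
                                               b'metricd-worker-1')
    coordinator.start()

    lock = coordinator.get_lock(b'gnocchi-metric-lock-example')
    with lock:
        pass  # work on the metric while holding the distributed lock

    coordinator.stop()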
* jd__ reads the backlog13:26
mnasergordc: i *guess*?  I don't really see why it would not13:26
mnaseralso added my findings with more accurate numbers here - https://bugs.launchpad.net/gnocchi/+bug/153690913:26
openstackLaunchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete]13:26
jd__mnaser: can you try with your SQL backend as a coordination url?13:26
mnaserthe actual problem still exists in master from what I see13:26
*** eglynn has joined #openstack-telemetry13:27
mnaserjd__: i could but I suspect it's not the problem because I was having this issue when I first started out (single node, local file lock)13:27
jd__mnaser: ok!13:28
mnaserand that function itself doesn't seem to be reliant or blocking on the tooz lock (the one that's delaying the processing of each measure)13:28
jd__mnaser: which tooz version?13:28
mnaserii  python-tooz                          1.21.0-1ubuntu1~cloud0           all          coordination library for distributed systems - Python 2.x13:28
jd__ah I see you're using 1.3.013:28
jd__you *really* need to upgrade then13:28
jd__we fixed a lot of bugs since then…13:28
mnaserok, I think zigo ran into some issues when he was upgrading the packaging13:29
jd__with the unit tests yeah13:29
jd__I'm working on that13:29
mnasermaybe when he's around i could ask him for the packaging files he currently has and build it without the tests13:30
zigoI already upgraded tooz to 1.29.0-1 in debian Experimental.13:31
leitangordc, i suffered zookeeper a lot ... its a no go for me jaja13:31
zigomnaser: o/13:31
zigoHere ...13:31
gordcleitan: makes sense. i believe rdo also forgoes zookeeper and uses redis as well.13:33
zigojd__: mnaser: The current issue I have with tooz is that it has broken unit tests with Py3.13:33
jd__what's the link with gnocchi here?13:33
zigojsonschema.exceptions.ValidationError: 'group_id' is a required property13:33
mnaserzigo: i'm using python 2.7 so I could skip building packages for py3 .. i'd appreciate if you have the .tar.gz and debian files to work with13:34
mnaseri can build them and install them and give feedback as wel13:34
zigomnaser: Tooz in Experimental has disabled Py3 unit tests currently.13:34
*** diogogmt has quit IRC13:35
zigoThe package for Py3 is built, but just not unit tested.13:35
mnaserah I see, and for the gnocchi, I only found 1.3.0, is there 1.3.3 laying around somewhere?13:35
*** diogogmt has joined #openstack-telemetry13:36
*** nicodemus_ has joined #openstack-telemetry13:38
openstackgerritXia Linjuan proposed openstack/python-ceilometerclient: make aggregation-method argument as a mandatory field  https://review.openstack.org/26894713:42
mnaserit looks like this is already fixed but in master: https://github.com/openstack/gnocchi/commit/614e13d47fdcaeea9d41bebc214014e0c83a0e8313:43
gordcsigh. i don't understand how project-config works.13:43
gordchow is oslo.messaging good, but not ceilometerclient...  https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/ceilometer.yaml#L247-L25113:43
gordcmnaser: yeah, there's quite a difference between 1.3 and master13:44
*** ildikov has quit IRC13:44
mnaseri just patched that part to see if that functions runs more effectively13:45
gordcjd__: i'm going to write up something about gnocchi for summit over weekend.13:45
gordcany random thoughts you want me to consider?13:45
mnaserstill taking around 43s.. im going to see how/if there is a way to speed up iterating through the xattrs13:45
jd__gordc: write up for?13:46
gordcjd__: a summit talk re: gnocchi13:46
nicodemus_hello13:46
*** diogogmt has quit IRC13:47
openstackgerritXia Linjuan proposed openstack/python-ceilometerclient: make aggregation-method argument as a mandatory field  https://review.openstack.org/26894713:47
jd__gordc: oh cool, you wanna do it alone or do you want me too?13:47
nicodemus_I believe I'm having issues running gnocchi-metricd distributed with redis... after two days of such deploy, I'm seeing several errors in gnocchi-metricd logs13:47
jd__gordc: I'll be glad to help you anyway13:47
jd__nicodemus_: paste, paste :)13:47
nicodemus_such as: "ObjectNotFound: Failed to remove 'measure_00e8c7d6-4128-4164-bfc6-107cda486b77_027493bc-d830-4f33-b596-a20fcb95378f_20164221_18:42:24"13:48
jd__mnaser: I'm backporting it but stable/1.3 is blocked right now, working on it as I speak13:48
gordcjd__: together? sileht: cdent: want in as well?13:48
jd__gordc: sileht won't be there13:48
nicodemus_or: "TypeError: _get_measures() takes exactly 3 arguments (4 given)"13:48
jd__nicodemus_: paste me the full log so I can fix all of that :) on paste.openstack.org for example13:48
gordcjd__: remember the idea about explaining how to use gnocchi and some random made metrics13:48
mnasernicodemus_: do you have your indexer_url setup properly? i saw that problem when the indexer wasnt configured13:49
nicodemus_jd__: certainly13:49
nicodemus_mnaser: the coordination url is coordination_url = redis://10.10.20.30:6379 (single redis, no sentinel)13:49
mnaserindexer_url, nicodemus_ :)13:50
nicodemus_mnaser: doing a monitor through redis-cli I see all four metricd connected and sending commands13:50
nicodemus_mnaser: mmm... now that you mention that, I haven't configured an indexer_url :/13:51
mnaserjd__: sounds good, i'm going to do more checking on speeding up this large iteration, still iterating through 80k items taking quite sometime.. "time rados -p gnocchi listxattr measure | cut -d'_' -f2 | uniq" => 0.9s13:51
mnaserthats exactly your problem nicodemus_13:51
mnaser(regarding the TypeErrors at least)13:51
nicodemus_mnaser: the indexer_url should point to redis as well?13:51
mnaserno, indexer_url points to a database13:51
mnaserhttp://docs.openstack.org/developer/gnocchi/configuration.html nicodemus_13:52
jd__mnaser: yeah but it should not do it that often13:52
nicodemus_mnaser: oh, I get it... it's under the [indexer] section, the url is: url = mysql://gnocchi:SERGITOgnocchi@10.10.10.252/gnocchi?charset=utf813:52
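For anyone following along, the two settings being mixed up here live in different sections of gnocchi.conf; a minimal sketch with placeholder hosts and credentials:

    [indexer]
    # SQL database holding resource and metric definitions
    url = mysql://gnocchi:secret@10.10.10.252/gnocchi?charset=utf8

    [storage]
    driver = ceph
    # shared tooz backend used by the api and metricd daemons for locking
    coordination_url = redis://10.10.20.30:6379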
mnaserjd__: https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L108 - it calls it for every single time it has to process a measure13:53
mnaserwhich is why my processing has been slow13:53
mnaseradds 45s in each run13:53
jd__yeah I saw, I added a comment13:54
jd__it's likely a dup of #153379313:54
jd__I'm really trying to get 1.3.4 out by the end of today with all those fixes out, but well13:54
jd__keystonemiddleware is not helping13:54
jd__:)13:54
mnasernicodemus_: i dont think that's the right config13:54
mnaseroh it is, im not sure13:54
mnaserdid you make sure to start services after?13:55
mnaserim using postgres so not sure if any differences might affect13:55
mnaserjd__: 4.0.0 problems?  if you need a hand let me know, if it helps getting the release out13:55
nicodemus_jd__: I've pasted the two kind of errors I'm having here: http://paste.openstack.org/show/484686/13:55
*** ddieterly has joined #openstack-telemetry13:55
jd__mnaser: they broke everything between 2.3.0 and 4.0.0, so supporting *both* is painful13:56
nicodemus_mnaser: using that indexer_url I've had no errors with a single metricd instance, the errors began only when I deployed a second metricd with redis13:56
jd__nicodemus_: you're on master, right? (just to be sure :)13:56
*** jfluhmann has quit IRC13:56
nicodemus_mnaser: yes13:56
mnaserim assuming that's for jd__ ^ :)13:57
jd__nicodemus_: oh you did git pull today or something?13:57
nicodemus_jd__: one metricd was installed on dec 16, the other three were installed a couple of days ago. Didn't pull today nor yesterday13:58
jd__ok13:58
jd__this is going to be a problem as we changed the file formats in master during this last week13:58
jd__so if you install different version of master it's likely going to fail :)13:58
jd__not sure it's your problem there yet though13:58
nicodemus_oops.... want me to try pulling on all four today and see if these errors are still present?13:59
jd__nicodemus_: yeah that'd be helpful13:59
jd__because the TypeError: _get_measures() takes exactly 3 arguments (4 given) is really suspicious14:00
gordcjd__: i see you gave up and backported :) https://review.openstack.org/#/c/271296/1/gnocchi/opts.py14:00
*** belmoreira has quit IRC14:00
jd__we'll support 1.3 -> 2.0 upgrade, but upgrading in the middle of master is risky14:00
*** ddieterly has quit IRC14:00
nicodemus_jd__: ok, I'll give it a try and let you know how it goes :)14:00
jd__gordc: ah sh*t maybe I should not remove that option though14:00
jd__gordc:  let me try to keep it14:00
mnaserfairly sure this is a rados bug.  I ripped out all the code and it takes 36s to list all measures with get_xattrs (i think the iterator implemented here is to blame).  using the rados CLI, takes 1.5s14:08
gordcmnaser: sucks... if only we knew people who worked at red hat ;)14:12
mnaser:P14:12
mnaseri'm doing a bit of reading... it would be nice if the API had something which gave a list instead of an iterator14:13
mnaserpython isn't too good at iterating in these scenarios (ex: the same happens in the db, where doing a fetchall is much more effective)14:14
*** ljxiash has quit IRC14:19
mnaserJust to list them (not even create a list out of the items, a simple loop with pass inside): http://paste.openstack.org/show/484696/14:19
*** ljxiash has joined #openstack-telemetry14:20
gordcyou are profiling the rados call?14:21
mnasergordc: profiled just this only http://paste.openstack.org/show/484697/14:21
mnaserin comparison, "time rados -p gnocchi listxattr measure >/dev/null" => runs in 1.7s14:22
*** Ephur has quit IRC14:23
jd__mnaser: yeah if this is a binding bug in rados we can't really do anything on the Gnocchi side… but we have multiple hats though14:24
jd__and one of them is Red :]14:24
mnaserjd__: sounds good :) i'll do some more investigating when I have something more concrete14:25
jd__mnaser: 👍14:25
jd__gordc: I have more faith in https://review.openstack.org/#/c/271296/214:30
gordcjd__: my review is: 'seems legit'14:33
*** efoley has quit IRC14:33
*** efoley has joined #openstack-telemetry14:34
jd__gordc: I did not expect less than you :)14:38
jd__I was about to write an email about how upper-constraint is fucked14:38
jd__finally, I'll pass.14:39
gordcjd__: hahah. you can still right it.14:40
gordcwrite*14:40
gordcor right... still works14:40
*** ddieterly has joined #openstack-telemetry14:40
mnaseralright .. i now have a librados code to list all xattrs both in python and C .. in python, around 45s for 80k items, in C, 1.3s.14:43
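The Python side of that comparison could look roughly like the following, using the rados binding's xattr iterator; the pool name and ceph.conf path are assumptions.

    import time

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('gnocchi')  # pool name is an assumption

    start = time.time()
    # get_xattrs() yields (name, value) pairs; with ~80k xattrs on the single
    # 'measure' object this iteration is the slow path discussed above
    names = [name for name, _value in ioctx.get_xattrs('measure')]
    print("%d xattrs listed in %.1fs" % (len(names), time.time() - start))

    ioctx.close()
    cluster.shutdown()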
liamjigordc: hello :), I hit a problem while implementing the aodhclient tests. When the test cases are related to gnocchi, it seems we need to install gnocchi as a dependency. Or should we use mock to test the cases with gnocchi? thx14:43
openstackgerritJulien Danjou proposed openstack/gnocchi: storage/carbonara: store TimeSerieAggregated in splits  https://review.openstack.org/25843414:47
*** ildikov has joined #openstack-telemetry14:47
gordcliamji: you should probably mock it. i think aodh checks gnocchi for rule types14:48
*** pcaruana has joined #openstack-telemetry14:50
liamjiyes, it throws exception 'raise GnocchiUnavailable(e)'. okey, I will do it. Thanks :)14:50
gordcliamji: thanks14:50
gordcmnaser: i think ideally we want metricd to be processing as quick as possible to minimise the number of items but yeah... 45s vs 1.3s sucks.14:51
mnaserim not really sure *how* i ended up with that 80k or so to start with14:51
mnaserstill 70k left to process and i think it will take much longer.. but it's something to be figured out thats for sure.. i dont know if introducing a small caching concept to this can help14:52
gordcmnaser: probably the bugs jd__  mentioned above... causing a backlog for metricd... and then the ceph stuff compounds the issue.14:53
liamjigordc: np :)14:53
mnaserpossibly14:53
mnaserif 1.3.4 makes it out and zigo has some deb stuff, i'll try to upgrade and start ceilometer again and see if it catches up14:53
gordckk14:54
jd__there's a bug where Gnocchi returns a list vs a set which is fixed in master and being fixed in 1.3 – that reduces a lot the time14:54
jd__then if rados takes 45s on itself, we can't help much14:54
jd__but the listing is done once in a while anyway14:54
mnaserjd__: this is where im confused, you're saying its done once in a while but the ceph code base does it on *every* single time it has to process a metric (unless im not reading things right)14:55
mnaserhttps://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L10814:55
gordcanticdent: you understand tis? https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/ceilometer.yaml#L247-L25114:55
mnaserso for every metric, it runs this to grab the list of all measures14:55
jd__mnaser: yeah but that one is not 80k14:56
* anticdent looks14:56
*** jd__ has left #openstack-telemetry14:56
*** jd__ has joined #openstack-telemetry14:56
anticdentgordc: I reckon that's something sileht did but14:56
mnaserjd__: it is, while it may not "send" 80k items or copy them over, it's processing them (which is taking the 45s) ... it is a "rados" problem at the end of the day anyways14:56
zigomnaser: As soon as 1.3.4 is out, I'll package it.14:57
mnasermerci beaucoup :)14:57
gordcanticdent: he did oslo part, i did client... the client one doesn't work14:57
jd__mnaser: oh right I missed that it was using the same prefix14:58
anticdentis the idea that it is supposed to be installed from git when the project that initiated the test is X?14:58
jd__mnaser: we could probably be smarter here and list less14:58
gordcyeah.14:58
gordcanticdent: sileht added the test to ceilometerclient gate but i noticed it was just using pip14:58
_nadya_gordc: hi Gordon! Am I right that we send all data from one compute node in one batch during polling?14:59
gordcso i thought, hey, client is like oslo.messsaging... oslo gate works correct...let's copy14:59
mnaserjd__: i dont mind writing a small code to cache the results (done some of this for Horizon) .. but i dont know what the impact of caching these results might have14:59
gordcno dice.14:59
mnaserthe driver seems pretty straightforward14:59
anticdentgordc: reasonable plan, but not workie. My guess would be that g-r is messing things somewhere, but why that would be for one and not the other is unclear14:59
gordc_nadya_: yep. i believe it batches by default.15:00
_nadya_gordc: cool, thanks!15:00
*** rbak has joined #openstack-telemetry15:00
gordcanticdent: yeah... i love guessing at project-config changes.15:01
gordchttp://logs.openstack.org/08/235208/3/check/gate-ceilometer-dsvm-integration/a637ac7/logs/devstacklog.txt.gz#_2016-01-22_03_37_41_35815:01
anticdentnobody in infra to come to your aid?15:01
gordchttp://logs.openstack.org/81/267981/1/check/gate-ceilometer-dsvm-integration/6588583/logs/devstacklog.txt.gz#_2016-01-22_11_57_04_84615:01
gordcyou are my aid... does anticdent mean don't bother me?15:01
* anticdent has friday brain dumb15:02
anticdentMaybe try doing what it says: when the project is ceiloclient, add ceiloclient to PROJ15:02
nicodemus_jd__: after pulling and re-deploying gnocchi-metricd, so far I didn't see any errors. Nevertheless, the errors might not appear immediately. I'll keep watching how it goes.15:02
gordcanticdent: i'm scared as that has blown up quite a bit... that projects stuff is magic.15:03
gordcalong with everything15:03
gordcanticdent: you want to join our gnocchi summit talk?15:03
jd__nicodemus_: ok15:03
anticdentgordc: thus my "ask infra" because we both dumb15:04
anticdentgordc: I've already got two in the works, but what's the topic?15:04
gordcgnocchi magic15:04
anticdentsomething something magic something something15:04
gordcor some flame topic like 'we beat <insert db name here>'15:05
anticdentthat would be fun15:05
*** yprokule has quit IRC15:05
anticdentI reckon most of my gnocchi brain will have been archived by then so I may not be a useful addition15:05
gordcanticdent: kk15:05
anticdentso unless you want someone to say "one way bs to uuid mapping!" over and over...15:06
gordcanticdent: you doing uuid everywhere topic for cross project?15:06
anticdentnot yet, it will come up soon enough15:07
gordc'soon' 2 years?15:07
*** liamji has quit IRC15:08
*** liamji has joined #openstack-telemetry15:08
anticdentthat's pretty fast in openstack time15:10
gordcanticdent: i'm optimistic.15:12
*** diogogmt has joined #openstack-telemetry15:12
anticdentI'll try to write something up after the nova mid-cycle, that will get the ball created, a few months later it will start rolling15:12
gordc.. you have midcycle 1 month before m-3?15:13
gordcisn't feature freeze already set in nova?15:14
anticdentthe mid-cycle is next week, but yeah freeze has already happened15:14
anticdentI'm not sure I understand it15:14
gordcair miles15:15
anticdentbut then again, people frequently talk about "yeah we should be able to merge that code in O"15:15
anticdentwhich boggles15:15
mnaserlol, the rados.py bindings literally spawns a thread for every single time a next() is called on the iterator15:16
mnaserwhich means 80k threads are generated in the whole period of time15:16
nicodemus_mnaser: in my setup, I have 7.5M of measures in ceph >.<15:19
gordcmnaser: ceph just gave you justification for new hardware request15:19
gordcrequest supercomputer, reason: need to query ceph15:20
mnaserhaha15:21
mnasernicodemus_ -- i think at this point, you're a bit doomed.  the time to process a single metric is 70 minutes if your numbers match mine15:22
mnaserbecause for 80k it takes 45 seconds15:22
mnaserI'm thinking of writing something in the codebase that relies of using redis sets to do this part instead of xattrs15:23
mnaserxattrs can work at small scale but at large scale it will be near impossible (imho)15:23
nicodemus_mnaser: I'll try the code you wrote on https://bugs.launchpad.net/gnocchi/+bug/1536909 to get some time measurement15:25
openstackLaunchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete]15:25
mnasernicodemus_ i'd leave it running in a screen, i suspect it might take quite some time with 7.5m records15:25
mnaserim going to work on something based on redis with a small migration script as a poc and see where it goes15:25
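A rough sketch of the redis-set idea being floated here: track pending measure object names per metric in redis instead of re-listing xattrs on every pass. The key layout and redis location are assumptions, not the eventual patch.

    import redis

    conn = redis.StrictRedis(host='127.0.0.1', port=6379)


    def add_pending(metric_id, object_name):
        # recorded when a new measure object is written to ceph
        conn.sadd('gnocchi:pending:%s' % metric_id, object_name)


    def pending_for_metric(metric_id):
        # cost proportional to this metric's set, not to every xattr on 'measure'
        return conn.smembers('gnocchi:pending:%s' % metric_id)


    def metrics_with_pending():
        # key pattern scan -- the redis-specific part mentioned later in the log
        return [key.rsplit(':', 1)[-1]
                for key in conn.scan_iter(match='gnocchi:pending:*')]


    def mark_done(metric_id, object_name):
        conn.srem('gnocchi:pending:%s' % metric_id, object_name)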
*** anticdent has quit IRC15:28
*** mragupat has joined #openstack-telemetry15:30
_nadya_Redis is everywhere...15:31
mnaser_nadya_: i dont know :( i'm kinda trying to do the best I can to see if this can be adjusted but it looks like there was a conscious decision to spawn everything in threads - https://github.com/ceph/ceph/commit/2ce5cb7d06c80e34824981c36adcaa2dbfbc581f15:35
_nadya_mnaser: I have nothing against Redis :) It's funny that in 3 "areas" in Ceilometer we suddenly need Redis (or just distr cache)15:39
*** peristeri has joined #openstack-telemetry15:39
mnaserheh, well it's also the fact that a large number of xattrs might start causing problems.. im not sure if xattrs are the best way to implement something for this type of need at a large scale (esp with all the O(N) code that's put in place in the driver)15:39
gordcmnaser: yeah. so that part is kinda an intermediate step where we store data before it gets aggregated/processed... ideally that shouldn't be as backed up as it is... but using another form of storage for the backlog sounds interesting.15:43
*** ddieterly has quit IRC15:46
*** ddieterly has joined #openstack-telemetry15:48
*** cdent has joined #openstack-telemetry15:52
*** safchain has quit IRC15:54
*** idegtiarov_ has quit IRC16:01
mnasergordc: poc, listing all uuids or metrics take 0.111s and measures for a specific metric = 0.055s.16:04
mnaserthat was by importing all my 80k records into redis16:04
mnasertime to make the small changes to the ceph driver and see what happens..16:04
*** zqfan has quit IRC16:11
*** mragupat has quit IRC16:12
*** ddieterly has quit IRC16:14
*** mragupat has joined #openstack-telemetry16:15
*** rcernin has quit IRC16:15
*** ddieterly has joined #openstack-telemetry16:21
*** mattyw has quit IRC16:22
mnaseroh man this is great16:25
mnaserprocessing at ~1k/s16:26
mnasernicodemus_ : i may have some good news in that redis can help alleviate A LOT of this load16:48
mnaseri just cleared the entire 80k in minutes .. i will try to draft up a patch + some docs16:48
gordcmnaser: sounds good.16:49
gordcmnaser: have you tried with oslo.cache? so we can make it pluggable for someone out there not on redis?16:50
mnaserthat's a good idea gordc --- this was just an initial "hey would this really make things faster" type of thing, oslo.cache is good but ill have to check to see if its possible16:50
mnaserim relying on set based operations (that should be available in most kvs) but i use a redis-specific way of retrieving the list of all keys matching a specific string16:51
gordcmnaser: cool cool.16:51
mnaseralso, we would have to document things some more as well.. because we dont want people using oslo.cache+memcache to do this type of thing16:51
gordcmnaser: i see... yeah i think it might not be possible. vaguely remember _nadya_ said oslo.cache has limited functionality16:51
mnaserif that memcache instance dies, you now have a ton of orphan measurements in your ceph cluster16:52
gordcmnaser: true16:52
*** efoley has quit IRC16:52
mnaserbut an sql backend is worth considering for this type of thing too16:52
mnaseralright, found another bug doing unnecessary scans, but we're getting there...16:54
*** rcernin has joined #openstack-telemetry16:55
*** pradk has quit IRC16:55
*** pradk has joined #openstack-telemetry16:56
gordcmnaser: feel free to post your fixes too :)16:57
*** eglynn has quit IRC16:58
*** mattyw has joined #openstack-telemetry17:01
*** ildikov has quit IRC17:04
zigogordc: Do you know when Gnocchi 1.3.4 will be out?17:07
zigo<= 1.3.3 is currently not working well at all with the rest of Mitaka17:08
nicodemus_mnaser: great news!17:08
gordczigo: i think we want to merge these first: https://review.openstack.org/#/q/status:open+project:openstack/gnocchi+branch:stable/1.317:09
zigogordc: So you would say it's a matter of (working) days?17:10
nicodemus_mnaser: please let me know if you want me and my millions of metrics to give it a try17:10
mnasernicodemus_: i will def let you know!17:10
gordcdepends whether jd__ is around when they merge17:10
mnaseri would actually want you to do that17:10
mnaseri want to see how it performs with a large # of records17:11
gordczigo: he has release permissions17:11
zigogordc: Ok, thanks. We'll wait then.17:11
zigoFYI this is blocking the release of Ceilometer Mitaka b2.17:12
gordcthanks for the help testing folks. packaging and code... much appreciated.17:12
zigo(in Debian & Ubuntu)17:12
gordczigo: you take over ubuntu work?17:12
zigoSince when Ubuntu guys have been doing any packaging work? :) :) :)17:12
gordczigo: lol you said it, not me.17:13
zigoMore seriously, Gnocchi is being maintained in Debian, and they just sync from it.17:13
mnaseroh man17:13
mnaseri guess this is good...17:13
zigoAnd they'll be waiting on it.17:13
mnaserload average is ~40 lol17:13
mnaserredis doing work17:13
*** prashantD_ has joined #openstack-telemetry17:14
mnaser| storage/number of metric having measures to process | 1085  |17:14
mnaser| storage/total number of measures to process         | 2395  |17:14
nicodemus_hahahah well that's what it's there for17:14
*** mattyw has quit IRC17:14
mnaseri dont know if im generating too much data.. this seems like a lot heh17:14
mnaseri wonder if im generating too much data17:15
gordcwhat kinda policies do you have defined?17:15
zigo2k measure is like nothing.17:15
mnaserthe default ones really17:15
zigoinfluxdb can process like 500k / s !17:16
mnaser2k constant measurements, not total measurements :p17:16
mnaserlike i understand its not much but17:16
gordcmnaser: shouldn't be that bad then.17:16
gordcmnaser: can't improve it if we set bar low17:16
mnaserit seemstrue17:16
mnaser*true17:16
jd__zigo: I want to release it today but not sure i'll have time, I wait for the patches to be merged17:17
mnaseri think swift is ceilometer-collector + swift is causing a lot of this17:17
zigojd__: Ah, cool, thanks for the info. Forwarding it to the Ubuntu guys.17:17
gordcmnaser: possibly. the metering we do for swift is quite chatty considering it's every single api request17:18
jd__zigo: influxdb can do zillions per second!17:18
mnaseri havent even set that up yet actually gordc :<17:18
gordcjd__: it's cool. my slide says gnocchi does zillions+1 per second17:18
jd__haha17:18
gordcmnaser: ceilometermiddleware?17:18
jd__mnaser: this is not too much, but did you apply the patches?17:18
mnaserdoing a small log file analysis...   31124 image resource requests and  182143 swift_account requests17:18
mnaserjd__: i did.. but i sorta rewrote a big portion of the ceph driver to use redis instead of xattrs to store the list and i got back up processing to ~1k/sec events17:19
jd__mnaser: what's taking long?17:19
jd__1k/sec sounds not that bad if you have 2k waiting :)17:20
mnaserjd__: im not sure if my ceph cluster cant keep up, or not enough cpu power17:20
mnaserits slowly creeping up, its up to 6k measures to process now17:20
gordcjd__: taht's with redis not ceph.17:20
jd__yeah17:20
mnasergordc: redis only used to replace the xattr stuff, measurement objects still stored in ceph17:21
jd__mnaser seems smart enough to dig this out so I'm not worried17:21
jd__)17:21
gordcmnaser: yep17:21
jd__:)17:21
*** leitan has quit IRC17:21
mnaseryeah i'll do my checking and more digging17:21
gordcjd__: somethign we might want to add17:21
mnaserthe number is rising :( cant keep up17:21
jd__we have large perf improvement in master merged and coming though this could be interesting to try them I guess17:21
gordci assume it would benefit swift backend too17:21
jd__gordc: what do we want to add?17:22
gordcinstead of dumping temporary measures in ceph/swift, we dump them in redis and then store them permanently in ceph/swift17:22
jd__right, to be tested I guess17:23
jd__though I still think we can optimize Ceph a bit before going that road17:23
cdentwasn't there some effort to temp on regular files and perm in ceph/swift?17:24
gordcyeah. definitely. don't want to increase complexity if not needed17:24
gordccdent: what happens if they are local on different machines?17:25
gordci have no idea... maybe nothing of concern.17:26
cdentpresumably you're running multiple metricd's across several machines17:26
cdentand several api servers across multiple machines17:26
cdenteach api server dumps to local, metricd picks it up, sends it on?17:27
cdentit's not really adding more failure space (since the api server itself could fail)17:27
* cdent is just speculating17:27
gordctrue true.17:27
jd__but it seems that the first issue to fix is the xattr listing taking 45x the normal time in Python :)17:27
cdentand the files have been demonstrated to be fast17:27
jd__sileht might help on that17:27
cdentyeah, that's weird17:28
mnaserjd__: agreed, it shouldn't do that.. but i doubt this will be easily addressed17:31
*** yassine__ has quit IRC17:32
jd__mnaser: have faith17:34
mnaserjd__: i tried asking the person who did the change (back in 2013)17:34
mnaserbut we'll see (hopefully) i get a response17:34
gordcmnaser: just for reference, how much thread/processes did you configure for httpd?17:36
mnaser# of CPU cores on the system which is 24 gordc17:36
mnaserand i have 2 metricd processes too right now17:36
gordcso 24processes, and 32 threads?17:36
mnasergordc: http://paste.openstack.org/show/484722/17:37
mnaserthat was using puppet-gnocchi unless that's not ideal17:37
gordchmm.. and you don't have a backlog in rabbit queues?17:38
mnasergordc: i dont even use the notifications yet17:38
mnaserjust ceilometer-collector17:38
gordcmnaser: what is your data source? collector just dispatches data from a metering.sample queue17:39
mnasergordc: doesn't ceilometer by default do polling out?17:39
mnaserfor images and swift container sizes17:40
gordcmnaser: yeah, ah i see. you use polling agents as well.17:40
gordcmnaser: this is with kilo?17:40
mnaserliberty17:40
gordc... then you probably have notification agent running too since data goes polling -> notification -> collector -> gnocchi17:40
mnaseri do, huge amount of POST requests swift_account/<...>/..17:41
gordcif you're using rabbit, can you check how large metering.sample queue is?17:41
gordcrabbitmqctl list_queues... or via webui17:41
jd__if you meter swift requests that make sense17:42
mnasernot doing requests monitoring yet, they're mostly things like this: ::1 - - [22/Jan/2016:17:42:38 +0000] "POST /v1/resource/swift_account/f91c013d82dc46da8f52530332408160/metric/storage.objects.containers/measures HTTP/1.1" 202 - "-" "python-requests/2.7.0 CPython/2.7.6 Linux/3.16.0-57-generic"17:42
mnaserits just the measurements it does for usage checks etc17:44
*** jd__ has left #openstack-telemetry17:44
*** jd__ has joined #openstack-telemetry17:44
jd__ok :)17:44
mnaserbtw gordc - metering.sample 73917:45
gordcthat's with 600 VM?17:45
mnaseryes17:45
mnaserbut i didnt configure the compute agents17:45
mnaserso its mostly swift data17:46
*** _nadya_ has quit IRC17:46
gordcoh. ok. that probably makes a bit more sense. i was surprised it wasn't a lot higher if you were running with 1 process and 24 threads.17:46
gordcmine gets backed up with a lot more threads and a lot less vms17:47
gordccool cool. was just trying to get a sense of how it was deployed before i play around with deployment configurations17:47
-openstackstatus- NOTICE: Restarting zuul due to a memory leak17:51
*** thumpba has joined #openstack-telemetry17:54
mnaserwell, we did progress but it looks like my metricd processes can't keep up17:55
*** ildikov has joined #openstack-telemetry17:58
*** Ephur has joined #openstack-telemetry17:58
*** mragupat has quit IRC17:58
*** boris-42 has joined #openstack-telemetry18:07
mnaserjust turned it off to catch up.. i have a few hints of what's going on.. the process may be doing lots of extra loops on metrics with 0 values18:13
nicodemus_mnaser: I'm running the code you pasted on #153690918:13
mnasernicodemus_: probably still stuck eh? :p18:13
nicodemus_yeah... still waiting on it to finish18:14
mnasernicodemus_: suspect you'll have an hour or so to run i think18:14
mnaseri'm working a bit more on this redis patch..18:14
openstackgerritMerged openstack/python-gnocchiclient: Add granularity argument to measures show  https://review.openstack.org/26608218:20
*** dan-t has quit IRC18:33
*** dan-t has joined #openstack-telemetry18:34
mnaserhmm i think i've ran into an oddity in carbonara18:41
mnaserwhere it actually breaks down under scale18:41
nicodemus_mnaser: the code snippet took 41m16s to run18:42
mnasernicodemus_ : thats the time it's taking to process a *single* metric in your cluster right now18:42
mnaser7.5 million * 41 minutes18:42
mnasersee ya in 584 years18:42
mnaserhttps://github.com/openstack/gnocchi/blob/master/gnocchi/storage/_carbonara.py#L228 <-- this line grabs all of the metrics that need work, it could be a list of 1000+ metrics18:43
mnasernow the problem is that, the metricd process seems to loop around those even tho many of them were already processes18:43
mnaserif process #1 does the first 100, process #2 will spend most of it's time running _process_measure_for_metric for 100 metrics that are already done18:44
mnaserand now you have 24 processes looping around in empty/already done metrics18:44
*** harlowja has quit IRC18:46
nicodemus_I don't think I have 584 years T___T18:46
mnaserrunning a single worker makes work serialized, but the worker process is actually doing work 100% of the time.. so there has to be an improvement done so that self._list_metric_with_measures_to_process() somehow distributes work per process18:46
*** harlowja has joined #openstack-telemetry18:46
mnasernicodemus_: i'll make sure we get it all figured out :)18:46
nicodemus_mnaser: I'm afraid I don't have much coding skills to assist, but let me know if I can help doing tests/benchmarks :)18:48
mnasernicodemus_: that's very useful on its own!  i dont have 7.5 million records (yet) and it'd be an immense help once it's cooked up18:48
mnaserwhat im hopefully saying isnt incorrect :p18:49
nicodemus_right now all ceilometer-agents are stopped, so there are no new metrics being pushed to gnocchi18:49
mnaseryeah thats the best thing to do18:50
*** PsionTheory has joined #openstack-telemetry19:01
*** _nadya_ has joined #openstack-telemetry19:07
gordcyeah i think we need to look at why get_xattrs is so slow. that is a required component regardless19:09
mnasergordc: even if it worked fast, what i mentioned above is a problem too19:10
mnaser24 processes getting the same list, first 1 or 2 will start working through them, the last ones will just loop over empty measures, doing work for nothing19:11
mnaseri think there should be a proper way where a set of metrics is dispatched to be processed by a specific worker, that way you dont have workers sitting around looping over already-processed data19:11
mnaserit seems to me the bigger amount of workers you have, the bigger this becomes a problem19:12
gordcright. i'm not saying it's not an issue. but i think we need to start at step 119:12
gordchttps://github.com/openstack/gnocchi/blob/master/gnocchi/storage/_carbonara.py#L24219:12
*** rcernin has quit IRC19:12
mnasertrue, but this exposed another issue that would likely be a problem with all backends19:12
gordcthis lock should essentially ensure those later workers scan through and restart again.19:13
mnasergordc: correct, that locks it, but the list of metrics which is acquired using " metrics_to_process = self._list_metric_with_measures_to_process()" can be huge in some cases19:13
mnaserand the metric might even be processed at that point as well19:14
mnaserit wouldn't be locked19:14
mnaserself._process_measure_for_metric would just return 0, but imagine this happening 2000 times x 48 processes19:14
mnaserthe processes are spinning and going over a list of metrics which have already been processed, it would be far more effective for it to not be doing this19:15
mnaserresults in a lot of stuff in the logs like this: 2016-01-22 19:09:14.363 30510 DEBUG gnocchi.storage._carbonara [-] Computed new metric b389e4e8-2d7d-4697-ad5c-68013a9b7948 with 0 new measures in 0.46 seconds process_measures /usr/lib/python2.7/dist-packages/gnocchi/storage/_carbonara.py:22919:15
gordcright. so we need smarter exit strategy...19:16
gordcand then more improvement would be to bucketise what results we get back19:16
nicodemus_I have to sign off for a while (power in the office is down)... will be back in an hour or so19:16
mnaserlater nicodemus_ !19:16
gordclates19:16
mnasergordc: correct, something where each worker gets a batch of N metrics to process that no other worker gets19:17
mnaser_pending_measures_to_process_count(metric_id) is implemented and i think that would be useful at exiting early19:18
gordcmnaser: not sure how possible the 'no other worker' part is. but we can definitely dump into smaller buckets and distribute the dump buckets to workers.19:18
mnasergordc: of course, this becomes a bit more complex, i did try to feed each worker batches of 10 which was effective, but runs much slower (and still suffers from this issue)19:19
gordcmnaser: yes, that could be a good solution.19:19
gordcmnaser: want to push a patch19:19
mnaserlet me try adding some code re: _pending_measures_to_process_count and early exit and see if that improves things19:19
*** mragupat has joined #openstack-telemetry19:19
mnaseryep, ill do that19:19
gordcmnaser: awesome!19:20
mnasergordc: question, should i check _pending_measures_to_process_count before the lock or after19:20
mnaserim thinking it safe to do it before the lock, worst case scenario it will come back to it on the next run19:20
gordcmaybe after the lock. if the lock is taken, then the worker will just pass through it since it assumes another worker is working on it.19:21
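The shape of the change being discussed, sketched against the _carbonara.py loop referenced above; the lock helper and the processing call are approximations, not the merged patch.

    def process_measures(self, indexer):
        metrics_to_process = self._list_metric_with_measures_to_process()
        for metric_id in metrics_to_process:
            # per-metric distributed lock (tooz); name and helper are approximate
            lock = self._coord.get_lock(('gnocchi-%s' % metric_id).encode())
            if not lock.acquire(blocking=False):
                continue  # another worker owns this metric right now
            try:
                # the early exit agreed on above: checked *after* taking the lock,
                # so only metrics another worker has genuinely drained are skipped
                if not self._pending_measures_to_process_count(metric_id):
                    continue
                self._process_measure_for_metric(indexer, metric_id)
            finally:
                lock.release()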
*** nicodemus_ has quit IRC19:23
*** chmouel_ is now known as chmouel19:25
*** dan-t has quit IRC19:33
mnasergordc: huge difference19:34
mnaserwent from can barely handle the load to now only around ~38 in queue19:34
gordcmnaser: what is 'queue' in this case?19:35
gordcmnaser: this sounds promising19:35
mnasersorry, | storage/total number of measures to process         | 39    |19:35
mnaserso measurements are actually being processed much faster19:36
gordcah got it.19:36
mnaserbefore, they were just queueing up19:36
gordcthat's great.19:36
gordccare to upload your tweak?19:36
mnasergordc: also, i am pretty sure we can go back to the original implementation and it will perform just fine, because the issue only manifests itself when the # of xattrs aka measurements pending is huge19:36
gordcmnaser: yeah. definitely one of those cases where if it explodes, it only gets worse.19:37
gordcseems like a good first fix if it still functions as expected.19:38
mnasergordc: i just had a talk with some of the ceph devs, and the person who worked on making the changes re: moving things to its own thread that i suspect is causing this19:39
mnaserapparently there is work being done to port rados to cython which will improve performance, and they also asked if we could rerun it with the run-in-a-thread code overridden and see if the performance changes19:40
mnaserso first i'll just upload this small change then i'll go ahead and test that19:40
gordcmnaser: awesome!19:41
gordcthanks for the help19:41
mnasernp, thanks for your help and support too19:41
*** cdent has quit IRC19:44
*** pradk has quit IRC19:45
*** pradk has joined #openstack-telemetry19:45
openstackgerritMohammed Naser proposed openstack/gnocchi: Skip already processed measurements  https://review.openstack.org/27149919:48
mnasergordc ^ :)19:48
gordcmnaser: cool cool. heading to doctors right now. will take a look at that later19:48
gordcthanks again.19:48
mnasernp, no rush19:48
*** gordc has quit IRC19:48
*** dan-t has joined #openstack-telemetry19:54
*** ddieterly has quit IRC20:02
openstackgerritMohammed Naser proposed openstack/gnocchi: Skip already processed measurements  https://review.openstack.org/27149920:08
openstackgerritMohammed Naser proposed openstack/gnocchi: Skip already processed measurements  https://review.openstack.org/27149920:11
*** pcaruana has quit IRC20:14
openstackgerritGeorge Peristerakis proposed openstack/ceilometer: Load a directory of YAML event config files  https://review.openstack.org/24717720:21
*** nicodemus_ has joined #openstack-telemetry20:30
*** ddieterly has joined #openstack-telemetry20:41
mnaserman right now the code for processing measures is just a big race between workers which becomes a big mess at scale :(20:55
mnasertrying to figure out how to split it apart in buckets or something20:55
*** jwcroppe has quit IRC21:27
*** peristeri has quit IRC21:45
*** pcaruana has joined #openstack-telemetry21:48
*** rcernin has joined #openstack-telemetry21:53
*** mragupat has quit IRC21:59
*** gordc has joined #openstack-telemetry22:01
*** gordc has quit IRC22:37
*** PsionTheory has quit IRC22:38
*** pradk has quit IRC22:52
*** rcernin has quit IRC22:53
*** dan-t has quit IRC22:54
*** dan-t has joined #openstack-telemetry23:20
*** shakamunyi has quit IRC23:22
*** rbak has quit IRC23:26
*** ddieterly has quit IRC23:27
*** vishwanathj has quit IRC23:29
openstackgerritRohit Jaiswal proposed openstack/ceilometer: Enhances get_meters to return unique meters  https://review.openstack.org/25962623:30
*** thumpba has quit IRC23:34
*** thorst has quit IRC23:43
*** thorst has joined #openstack-telemetry23:43
*** thorst has quit IRC23:52
*** Ephur has quit IRC23:54
