*** rbak_ has quit IRC | 00:01 | |
*** liamji has quit IRC | 00:03 | |
*** jfluhmann has quit IRC | 00:06 | |
*** ddieterly has quit IRC | 00:11 | |
mnaser | threw another node, total of 40 cores (over 2 nodes) at full load, api endpoint not responding for gnocchi :( | 00:25 |
*** zqfan has joined #openstack-telemetry | 00:26 | |
*** yarkot has joined #openstack-telemetry | 00:29 | |
openstackgerrit | Sam Morrison proposed openstack/gnocchi: Convert user and project IDs from UUIDs to Strings https://review.openstack.org/270007 | 00:48 |
*** cheneydc has joined #openstack-telemetry | 01:06 | |
*** yarkot_ has joined #openstack-telemetry | 01:18 | |
*** cheneydc has quit IRC | 01:25 | |
*** liusheng has joined #openstack-telemetry | 01:25 | |
*** cheneydc has joined #openstack-telemetry | 01:25 | |
*** ljxiash has joined #openstack-telemetry | 01:27 | |
*** jfluhmann has joined #openstack-telemetry | 01:29 | |
*** yarkot_ has quit IRC | 01:29 | |
*** cheneydc has quit IRC | 01:34 | |
*** thorst has joined #openstack-telemetry | 01:43 | |
*** caishan has joined #openstack-telemetry | 01:52 | |
*** ddieterly has joined #openstack-telemetry | 02:01 | |
openstackgerrit | liusheng proposed openstack/ceilometer-specs: Improve Nova Instance metering https://review.openstack.org/209799 | 02:03 |
*** thorst has quit IRC | 02:05 | |
*** cheneydc has joined #openstack-telemetry | 02:15 | |
openstackgerrit | Sam Morrison proposed openstack/gnocchi: Convert user and project IDs from UUIDs to Strings https://review.openstack.org/270007 | 02:22 |
*** liusheng has quit IRC | 02:39 | |
*** liusheng has joined #openstack-telemetry | 02:40 | |
*** Ich has joined #openstack-telemetry | 02:49 | |
*** ccz has joined #openstack-telemetry | 02:49 | |
*** Ich is now known as Guest93448 | 02:49 | |
*** achatterjee has quit IRC | 03:00 | |
*** thorst has joined #openstack-telemetry | 03:06 | |
*** achatterjee has joined #openstack-telemetry | 03:07 | |
openstackgerrit | Merged openstack/python-aodhclient: add alarm-history interface https://review.openstack.org/265001 | 03:10 |
*** thorst has quit IRC | 03:14 | |
*** links has joined #openstack-telemetry | 03:26 | |
*** prashantD_ has quit IRC | 03:28 | |
*** links has quit IRC | 03:40 | |
*** links has joined #openstack-telemetry | 03:48 | |
*** links has quit IRC | 04:04 | |
*** thorst has joined #openstack-telemetry | 04:12 | |
*** links has joined #openstack-telemetry | 04:16 | |
*** ddieterly has quit IRC | 04:16 | |
*** fabian2 has joined #openstack-telemetry | 04:18 | |
*** thorst has quit IRC | 04:19 | |
*** agireud has quit IRC | 04:20 | |
*** agireud has joined #openstack-telemetry | 04:22 | |
*** links has quit IRC | 04:27 | |
*** links has joined #openstack-telemetry | 04:28 | |
openstackgerrit | Zi Lian Ji proposed openstack/python-aodhclient: Add the functional tests https://review.openstack.org/271137 | 04:44 |
*** ljxiash has quit IRC | 04:46 | |
*** ljxiash has joined #openstack-telemetry | 04:47 | |
*** ddieterly has joined #openstack-telemetry | 04:47 | |
*** links has quit IRC | 04:50 | |
*** ljxiash has quit IRC | 04:51 | |
openstackgerrit | Merged openstack/ceilometer: Updated from global requirements https://review.openstack.org/269634 | 04:51 |
*** fabian2 has quit IRC | 04:51 | |
openstackgerrit | Merged openstack/ceilometer: tempest: copy telemetry client from tempest tree https://review.openstack.org/255706 | 04:51 |
*** ddieterly has quit IRC | 04:52 | |
*** links has joined #openstack-telemetry | 04:57 | |
*** achatterjee has quit IRC | 05:08 | |
*** links has quit IRC | 05:13 | |
*** thorst has joined #openstack-telemetry | 05:17 | |
*** achatterjee has joined #openstack-telemetry | 05:20 | |
*** links has joined #openstack-telemetry | 05:21 | |
*** thorst has quit IRC | 05:25 | |
*** links has quit IRC | 05:36 | |
*** links has joined #openstack-telemetry | 05:37 | |
*** ljxiash has joined #openstack-telemetry | 05:40 | |
*** yprokule has joined #openstack-telemetry | 05:44 | |
*** ddieterly has joined #openstack-telemetry | 05:48 | |
*** ddieterly has quit IRC | 05:52 | |
*** links has quit IRC | 05:59 | |
*** links has joined #openstack-telemetry | 06:01 | |
*** _nadya_ has joined #openstack-telemetry | 06:10 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/ceilometer: Imported Translations from Zanata https://review.openstack.org/268575 | 06:16 |
*** links has quit IRC | 06:22 | |
*** thorst has joined #openstack-telemetry | 06:22 | |
*** links has joined #openstack-telemetry | 06:23 | |
*** achatterjee has quit IRC | 06:28 | |
*** thorst has quit IRC | 06:29 | |
*** liamji has joined #openstack-telemetry | 06:42 | |
*** _nadya_ has quit IRC | 06:43 | |
*** links has quit IRC | 06:46 | |
*** ddieterly has joined #openstack-telemetry | 06:49 | |
*** links has joined #openstack-telemetry | 06:50 | |
*** ddieterly has quit IRC | 06:53 | |
*** idegtiarov_ has quit IRC | 07:01 | |
openstackgerrit | Zi Lian Ji proposed openstack/python-aodhclient: Add the functional tests https://review.openstack.org/271137 | 07:08 |
*** links has quit IRC | 07:09 | |
*** links has joined #openstack-telemetry | 07:09 | |
*** thorst has joined #openstack-telemetry | 07:28 | |
*** idegtiarov_ has joined #openstack-telemetry | 07:29 | |
*** links has quit IRC | 07:31 | |
*** rcernin has joined #openstack-telemetry | 07:31 | |
*** ljxiash_ has joined #openstack-telemetry | 07:32 | |
*** liamji has quit IRC | 07:34 | |
*** belmoreira has joined #openstack-telemetry | 07:34 | |
*** ljxiash has quit IRC | 07:35 | |
*** thorst has quit IRC | 07:35 | |
*** liamji has joined #openstack-telemetry | 07:36 | |
*** links has joined #openstack-telemetry | 07:48 | |
*** ddieterly has joined #openstack-telemetry | 07:50 | |
*** links has quit IRC | 07:54 | |
*** ddieterly has quit IRC | 07:55 | |
*** safchain has joined #openstack-telemetry | 07:58 | |
*** boris-42 has quit IRC | 08:13 | |
*** _nadya_ has joined #openstack-telemetry | 08:14 | |
*** _nadya_ has quit IRC | 08:17 | |
openstackgerrit | Zi Lian Ji proposed openstack/python-aodhclient: Add the functional tests https://review.openstack.org/271137 | 08:21 |
*** links has joined #openstack-telemetry | 08:22 | |
*** thorst has joined #openstack-telemetry | 08:32 | |
*** ljxiash has joined #openstack-telemetry | 08:33 | |
*** ljxiash_ has quit IRC | 08:34 | |
*** links has quit IRC | 08:35 | |
sileht | jd__, look at this jsonschema example: http://paste.openstack.org/show/484651/ | 08:37 |
sileht | jd__, creating a parser to generate the right column type will not be an easy task... | 08:37 |
sileht | jd__, also, no jsonschema python lib does deserialization; they only do strict validation | 08:38 |
sileht | jd__, if you say {"type": "number"} and send the string "2", it is refused, whereas with voluptuous it is converted to an int | 08:39 |
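A small runnable illustration of the difference sileht describes, assuming the jsonschema and voluptuous packages are installed (both real libraries; the schema and input are taken from the message above):

    import jsonschema
    import voluptuous

    schema = {"type": "number"}
    try:
        # jsonschema validates strictly: the string "2" is not a number
        jsonschema.validate("2", schema)
    except jsonschema.ValidationError:
        print("jsonschema: '2' refused")

    # voluptuous can coerce: the string "2" becomes the int 2
    coerce_schema = voluptuous.Schema(voluptuous.Coerce(int))
    print("voluptuous:", coerce_schema("2"))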
*** thorst has quit IRC | 08:39 | |
jd__ | sileht: right | 08:41 |
jd__ | sileht: let's forget about that; we could still do an export to jsonpath I guess that'd be simpler | 08:41 |
sileht | jd__, yes, that's what I thought; we could mimic the jsonpath keywords | 08:42 |
*** shardy has joined #openstack-telemetry | 08:53 | |
*** efoley has joined #openstack-telemetry | 09:01 | |
*** mattyw has joined #openstack-telemetry | 09:13 | |
jd__ | sileht: I guess we'll have to fix stable/1.3 to work with keystonemiddleware 4 too… | 09:15 |
jd__ | i don't know if it's complicated to support | 09:15 |
jd__ | if you have an idea, it's welcome, otherwise I'll look into it :( | 09:15 |
sileht | jd__, I can't help, I'm in PTO today and I will go out soon | 09:16 |
jd__ | oh sure, nevermind | 09:16 |
jd__ | have fun :) | 09:16 |
openstackgerrit | Julien Danjou proposed openstack/gnocchi: storage: autoconfigure coordination_url https://review.openstack.org/270907 | 09:17 |
*** yassine__ has joined #openstack-telemetry | 09:22 | |
*** boris-42 has joined #openstack-telemetry | 09:32 | |
*** _nadya_ has joined #openstack-telemetry | 09:34 | |
*** _nadya_ has quit IRC | 09:35 | |
*** achatterjee has joined #openstack-telemetry | 09:36 | |
openstackgerrit | Merged openstack/aodh: tempest: migrate codes from tempest tree https://review.openstack.org/255187 | 09:36 |
*** thorst has joined #openstack-telemetry | 09:39 | |
*** thorst has quit IRC | 09:45 | |
*** ddieterly has joined #openstack-telemetry | 09:52 | |
*** ljxiash has quit IRC | 09:53 | |
*** ddieterly has quit IRC | 09:57 | |
openstackgerrit | Merged openstack/gnocchi: plugin cleanup https://review.openstack.org/271066 | 10:00 |
*** _nadya_ has joined #openstack-telemetry | 10:01 | |
*** liamji has quit IRC | 10:01 | |
*** cheneydc has quit IRC | 10:01 | |
*** liamji has joined #openstack-telemetry | 10:02 | |
openstackgerrit | Zi Lian Ji proposed openstack/python-aodhclient: Add the functional tests https://review.openstack.org/271137 | 10:06 |
*** ljxiash has joined #openstack-telemetry | 10:21 | |
*** ljxiash has quit IRC | 10:22 | |
*** cdent has joined #openstack-telemetry | 10:23 | |
*** liamji has quit IRC | 10:24 | |
openstackgerrit | Julien Danjou proposed openstack/gnocchi: storage: autoconfigure coordination_url https://review.openstack.org/270907 | 10:30 |
*** caishan has quit IRC | 10:32 | |
*** thorst has joined #openstack-telemetry | 10:43 | |
*** thorst has quit IRC | 10:49 | |
sileht | w | 10:51 |
*** ddieterly has joined #openstack-telemetry | 10:53 | |
*** ddieterly has quit IRC | 10:57 | |
*** ljxiash has joined #openstack-telemetry | 11:09 | |
*** mattyw has quit IRC | 11:40 | |
*** cdent has quit IRC | 11:43 | |
*** thorst has joined #openstack-telemetry | 11:47 | |
*** openstackgerrit has quit IRC | 11:47 | |
*** openstackgerrit has joined #openstack-telemetry | 11:48 | |
*** boris-42 has quit IRC | 11:53 | |
*** efoley has quit IRC | 11:54 | |
*** efoley has joined #openstack-telemetry | 11:54 | |
*** ddieterly has joined #openstack-telemetry | 11:54 | |
*** ddieterly has quit IRC | 11:59 | |
openstackgerrit | Chaozhe Chen(ccz) proposed openstack/gnocchi: Replace dict.iteritems() with six.iteritems() https://review.openstack.org/271258 | 12:03 |
openstackgerrit | Michael Krotscheck proposed openstack/aodh: Added CORS support to Aodh https://review.openstack.org/265342 | 12:18 |
openstackgerrit | Michael Krotscheck proposed openstack/aodh: gabbi's own paste.ini file https://review.openstack.org/265330 | 12:18 |
*** efoley has quit IRC | 12:27 | |
*** openstackgerrit has quit IRC | 12:33 | |
*** openstackgerrit has joined #openstack-telemetry | 12:34 | |
*** gordc has joined #openstack-telemetry | 12:34 | |
*** mattyw has joined #openstack-telemetry | 12:35 | |
*** ildikov has quit IRC | 12:37 | |
*** liamji has joined #openstack-telemetry | 12:38 | |
*** Guest93448 has quit IRC | 12:40 | |
*** ccz has quit IRC | 12:40 | |
*** efoley has joined #openstack-telemetry | 12:43 | |
*** ildikov has joined #openstack-telemetry | 12:49 | |
*** cdent has joined #openstack-telemetry | 12:53 | |
*** ddieterly has joined #openstack-telemetry | 12:55 | |
*** leitan has joined #openstack-telemetry | 12:55 | |
leitan | Hi guys, maybe gordc or jd__ can help me with this: is there any way to configure several memcached servers in the coordination_url? If so, what's the correct syntax? I have tried several variants. | 12:56 |
*** ddieterly has quit IRC | 12:59 | |
*** cdent is now known as anticdent | 13:04 | |
gordc | leitan: for ceilometer, aodh, or gnocchi? | 13:05 |
gordc | leitan: why would you want to do that? | 13:05 |
leitan | gordc, for gnocchi, i have multiple gnocchi-api and metricd, and i need the locking feature | 13:06 |
leitan | gordc, i have read that the tooz backends are supported, but i can't find a way to configure multiple memcached servers on the coordination_url line | 13:07 |
gordc | leitan: if you have several different memcache servers how would they coordinate? the processes on one memcache server wouldn't know the ones on the other memcache server exist. | 13:07 |
leitan | gordc, yes, i'm guessing this could work similarly to how nova or keystone use memcached: if a key isn't found in one server, try the next one | 13:08 |
*** julim has joined #openstack-telemetry | 13:08 | |
mnaser | i think when you have multiple memcache servers, the hashing distributes the data throughout the cluster | 13:08 |
leitan | gordc, if it doesn't work like the other projects, i'll have to use clustered redis | 13:09 |
mnaser | leitan: that's what i ended up setting up | 13:09 |
mnaser | also I ran into an issue yesterday and after troubleshooting, it seems that the Ceph driver might have some scaling bugs :( | 13:09 |
leitan | gordc, but since i already have a big memcached pool , i really wanted to use it | 13:09 |
gordc | leitan: does keystone use memcache as a cache or group membership? | 13:09 |
leitan | gordc, as a cache | 13:10 |
leitan | mnaser, yes, performance is a complicated thing | 13:10 |
leitan | mnaser, were running into many issues now | 13:10 |
leitan | troubleshooting | 13:10 |
leitan | also with ceph backend | 13:10 |
mnaser | leitan: *something* happened which caused a delay in the backend which accumulated a large # of measures to be processed | 13:11 |
gordc | yeah, so tooz is a group membership tool... i'm not sure there's a way to define two different logical targets. a tooz expert can correct me here. | 13:11 |
mnaser | but there is a call that lists all measures and it has to go through ~80k records before it can process a single request | 13:11 |
gordc | mnaser: you see a spike randomly? or it just grows consistently? | 13:12 |
mnaser | i've had to shut down ceilometer and im trying to let it catch up | 13:12 |
mnaser | gordc: i mean this is only the 2nd day it's existed.. but basically once it goes to a really high number, i think it will be almost impossible for it to catch up | 13:12 |
leitan | gordc, that's what i thought, but offering memcached as a membership backend and expecting just one memcached server? that seems weird; i'll dig into the tooz code a bit more | 13:12 |
mnaser | gordc: I've documented this here - https://bugs.launchpad.net/gnocchi/+bug/1536909 | 13:12 |
openstack | Launchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete] | 13:12 |
mnaser | _list_object_names_to_process is called a lot and loops through some 80k items *every* single time it has to process a new measure | 13:13 |
gordc | leitan: well it's not really a lot of data. if you have a group of say 10 agents... why would you need a lot of memcache servers? | 13:13 |
mnaser | in around 10 hours, it was only able to process 10k measures. this is with 48 total metricd workers across 2 machines, each with 24 cores (machines dedicated for gnocchi at this point) | 13:14 |
leitan | it's actually not about the amount of data, it's more about distribution | 13:14 |
leitan | mnaser, is metricd not catching up, or metricd its actually killing your ceph cluster ? (or both ) | 13:14 |
mnaser | metricd not catching up, and because of that "part" of the codebase, it would basically *never* be able to catch up, because the further it falls behind, the slower it gets .. is it killing the ceph cluster? i don't think so | 13:15 |
mnaser | client io 57575 kB/s rd, 51290 B/s wr, 92 op/s | 13:15 |
mnaser | not much | 13:15 |
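A toy numeric sketch (not gnocchi code) of the feedback loop mnaser describes: if draining each pending measure first requires re-listing the whole backlog, total work grows quadratically with backlog size, which is why the further it falls behind the slower it gets:

    def visits_to_drain(backlog):
        # one full O(backlog) listing per measure processed;
        # the backlog shrinks by one each time in this idealized drain
        return sum(backlog - i for i in range(backlog))

    for backlog in (1000, 10000, 80000):
        print("backlog %6d -> ~%d item visits" % (backlog, visits_to_drain(backlog)))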
leitan | mnaser, just asking because with all our ceilometer-agent-compute and metricd turned on, we start to have A LOT of ceph blocking requests | 13:16 |
mnaser | leitan: i haven't even turned on ceilometer-agent-compute yet... just the collector so far and ran into this issue | 13:17 |
leitan | and just a slow metricd, ceph not blocking anything ? mnaser | 13:18 |
mnaser | leitan: yes, i dont think ceph is blocking anything. it's still responding fine using rados commands | 13:19 |
gordc | leitan: you might want to read into each tooz driver. memcache is probably not something you want in production. | 13:19 |
gordc | leitan: http://docs.openstack.org/developer/tooz/drivers.html#memcached | 13:19 |
gordc | i use it... but it's for local testing. | 13:20 |
mnaser | i just wrote some small PoC code using timeutils.StopWatch to check my theory that i'm being delayed by that call | 13:20 |
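A minimal sketch of that kind of PoC (not mnaser's actual code), assuming the python-rados bindings and oslo.utils are installed, and using the pool and object names that appear later in this log (a "gnocchi" pool and a "measure" object holding one xattr per pending measure):

    import rados
    from oslo_utils import timeutils

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('gnocchi')

    watch = timeutils.StopWatch()
    watch.start()
    # iterate the xattrs the same way the driver's listing does
    count = sum(1 for _name, _value in ioctx.get_xattrs('measure'))
    print("listed %d xattrs in %.2fs" % (count, watch.elapsed()))

    ioctx.close()
    cluster.shutdown()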
leitan | gordc, yup ill take a look thanks, just didn't want to end up setting up yet another cluster of stuff to support another service :) | 13:21 |
leitan | mnaser, i'm really interested in how that turns out, please let me know | 13:21 |
gordc | mnaser: i'm trying to think about stable/1.3. we made quite a few changes to the metricd code to break up/organise data chunks | 13:21 |
mnaser | leitan: i will! you can follow this bug if you want automatic updates too.. https://bugs.launchpad.net/gnocchi/+bug/1536909 | 13:22 |
openstack | Launchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete] | 13:22 |
gordc | leitan: you'll probably need to. ceilometer/aodh/gnocchi use tooz for coordination. nova is adopting it in mitaka. and it's going to be used across all openstack eventually. | 13:22 |
mnaser | yep, theory valid woo :< | 13:23 |
gordc | mnaser: does your environment work with gnocchi 1.4? | 13:24 |
leitan | gordc, ill go with redis and stop playing around :) | 13:24 |
gordc | leitan: :) it should be good enough... especially if you don't want to maintain zookeeper | 13:25 |
jd__ | you can't have multiple memcached backends | 13:25 |
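A hedged config sketch for the question leitan raised: a tooz coordination URL names a single backend endpoint, so gnocchi.conf takes one host per URL. Hosts and ports below are placeholders, and the [storage] section is an assumption based on the "storage: autoconfigure coordination_url" patch at 09:17 above:

    [storage]
    # one memcached server (tooz memcached driver):
    coordination_url = memcached://127.0.0.1:11211
    # or one redis server, as nicodemus_ uses later in this log:
    # coordination_url = redis://10.10.20.30:6379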
* jd__ reads the backlog | 13:26 | |
mnaser | gordc: i *guess*? I don't really see why it would not | 13:26 |
mnaser | also added my findings with more accurate numbers here - https://bugs.launchpad.net/gnocchi/+bug/1536909 | 13:26 |
openstack | Launchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete] | 13:26 |
jd__ | mnaser: can you try with your SQL backend as a coordination url? | 13:26 |
mnaser | the actual problem still exists in master from what I see | 13:26 |
*** eglynn has joined #openstack-telemetry | 13:27 | |
mnaser | jd__: i could but I suspect it's not the problem because I was having this issue when I first started out (single node, local file lock) | 13:27 |
jd__ | mnaser: ok! | 13:28 |
mnaser | and that function itself doesn't seem to be reliant on or blocked by the tooz lock (it's the one delaying the processing of each measure) | 13:28 |
jd__ | mnaser: which tooz version? | 13:28 |
mnaser | ii python-tooz 1.21.0-1ubuntu1~cloud0 all coordination library for distributed systems - Python 2.x | 13:28 |
jd__ | ah I see you're using 1.3.0 | 13:28 |
jd__ | you *really* need to upgrade then | 13:28 |
jd__ | we fixed a lot of bugs since then… | 13:28 |
mnaser | ok, I think zigo ran into some issues when he was upgrading the packaging | 13:29 |
jd__ | with the unit tests yeah | 13:29 |
jd__ | I'm working on that | 13:29 |
mnaser | maybe when he's around i could ask him for the packaging files he currently has and build it without the tests | 13:30 |
zigo | I already upgraded tooz to 1.29.0-1 in debian Experimental. | 13:31 |
leitan | gordc, i suffered zookeeper a lot ... it's a no-go for me haha | 13:31 |
zigo | mnaser: o/ | 13:31 |
zigo | Here ... | 13:31 |
gordc | leitan: makes sense. i believe rdo also forgoes zookeeper and uses redis as well. | 13:33 |
zigo | jd__: mnaser: The current issue I have with tooz is that it has broken unit tests with Py3. | 13:33 |
jd__ | what's the link with gnocchi here? | 13:33 |
zigo | jsonschema.exceptions.ValidationError: 'group_id' is a required property | 13:33 |
mnaser | zigo: i'm using python 2.7 so I could skip building packages for py3 .. i'd appreciate it if you have the .tar.gz and debian files to work with | 13:34 |
mnaser | i can build them and install them and give feedback as well | 13:34 |
zigo | mnaser: Tooz in Experimental has disabled Py3 unit tests currently. | 13:34 |
*** diogogmt has quit IRC | 13:35 | |
zigo | The package for Py3 is built, but just not unit tested. | 13:35 |
mnaser | ah I see, and for gnocchi, I only found 1.3.0; is there a 1.3.3 lying around somewhere? | 13:35 |
*** diogogmt has joined #openstack-telemetry | 13:36 | |
*** nicodemus_ has joined #openstack-telemetry | 13:38 | |
openstackgerrit | Xia Linjuan proposed openstack/python-ceilometerclient: make aggregation-method argument as a mandatory field https://review.openstack.org/268947 | 13:42 |
mnaser | it looks like this is already fixed but in master: https://github.com/openstack/gnocchi/commit/614e13d47fdcaeea9d41bebc214014e0c83a0e83 | 13:43 |
gordc | sigh. i don't understand how project-config works. | 13:43 |
gordc | how is oslo.messaging good, but not ceilometerclient... https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/ceilometer.yaml#L247-L251 | 13:43 |
gordc | mnaser: yeah, there's quite a difference between 1.3 and master | 13:44 |
*** ildikov has quit IRC | 13:44 | |
mnaser | i just patched that part to see if that function runs more efficiently | 13:45 |
gordc | jd__: i'm going to write up something about gnocchi for summit over weekend. | 13:45 |
gordc | any random thoughts you want me to consider? | 13:45 |
mnaser | still taking around 43s.. i'm going to see how/if there is a way to speed up iterating through the xattrs | 13:45 |
jd__ | gordc: write up for? | 13:46 |
gordc | jd__: a summit talk re: gnocchi | 13:46 |
nicodemus_ | hello | 13:46 |
*** diogogmt has quit IRC | 13:47 | |
openstackgerrit | Xia Linjuan proposed openstack/python-ceilometerclient: make aggregation-method argument as a mandatory field https://review.openstack.org/268947 | 13:47 |
jd__ | gordc: oh cool, you wanna do it alone or do you want me too? | 13:47 |
nicodemus_ | I believe I'm having issues running gnocchi-metricd distributed with redis... after two days of such a deployment, I'm seeing several errors in the gnocchi-metricd logs | 13:47 |
jd__ | gordc: I'll be glad to help you anyway | 13:47 |
jd__ | nicodemus_: paste, paste :) | 13:47 |
nicodemus_ | such as: "ObjectNotFound: Failed to remove 'measure_00e8c7d6-4128-4164-bfc6-107cda486b77_027493bc-d830-4f33-b596-a20fcb95378f_20164221_18:42:24" | 13:48 |
jd__ | mnaser: I'm backporting it but stable/1.3 is blocked right now, working on it as I speak | 13:48 |
gordc | jd__: together? sileht: cdent: want in as well? | 13:48 |
jd__ | gordc: sileht won't be there | 13:48 |
nicodemus_ | or: "TypeError: _get_measures() takes exactly 3 arguments (4 given)" | 13:48 |
jd__ | nicodemus_: paste me the full log so I can fix all of that :) on paste.openstack.org for example | 13:48 |
gordc | jd__: remember the idea about explaining how to use gnocchi and some random made metrics | 13:48 |
mnaser | nicodemus_: do you have your indexer_url setup properly? i saw that problem when the indexer wasnt configured | 13:49 |
nicodemus_ | jd__: certainly | 13:49 |
nicodemus_ | mnaser: the coordination url is coordination_url = redis://10.10.20.30:6379 (single redis, no sentinel) | 13:49 |
mnaser | indexer_url, nicodemus_ :) | 13:50 |
nicodemus_ | mnaser: doing a monitor through redis-cli I see all four metricd connected and sending commands | 13:50 |
nicodemus_ | mnaser: mmm... now that you mention that, I haven't configured an indexer_url :/ | 13:51 |
mnaser | jd__: sounds good, i'm going to do more checking on speeding up this large iteration; iterating through 80k items still takes quite some time.. "time rados -p gnocchi listxattr measure | cut -d'_' -f2 | uniq" => 0.9s | 13:51 |
mnaser | thats exactly your problem nicodemus_ | 13:51 |
mnaser | (regarding the TypeErrors at least) | 13:51 |
nicodemus_ | mnaser: the indexer_url should point to redis as well? | 13:51 |
mnaser | no, indexer_url points to a database | 13:51 |
mnaser | http://docs.openstack.org/developer/gnocchi/configuration.html nicodemus_ | 13:52 |
jd__ | mnaser: yeah but it should not do it that often | 13:52 |
nicodemus_ | mnaser: oh, I get it... it's under the [indexer] section, the url is: url = mysql://gnocchi:SERGITOgnocchi@10.10.10.252/gnocchi?charset=utf8 | 13:52 |
mnaser | jd__: https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L108 - it calls it for every single time it has to process a measure | 13:53 |
mnaser | which is why my processing has been slow | 13:53 |
mnaser | adds 45s in each run | 13:53 |
jd__ | yeah I saw, I added a comment | 13:54 |
jd__ | it's likely a dup of #1533793 | 13:54 |
jd__ | I'm really trying to get 1.3.4 out by the end of today with all those fixes out, but well | 13:54 |
jd__ | keystonemiddleware is not helping | 13:54 |
jd__ | :) | 13:54 |
mnaser | nicodemus_: i dont think that's the right config | 13:54 |
mnaser | oh it is, im not sure | 13:54 |
mnaser | did you make sure to start services after? | 13:55 |
mnaser | im using postgres so not sure if any differences might affect | 13:55 |
mnaser | jd__: 4.0.0 problems? if you need a hand let me know, if it helps getting the release out | 13:55 |
nicodemus_ | jd__: I've pasted the two kind of errors I'm having here: http://paste.openstack.org/show/484686/ | 13:55 |
*** ddieterly has joined #openstack-telemetry | 13:55 | |
jd__ | mnaser: they broke everything between 2.3.0 and 4.0.0, so supporting *both* is painful | 13:56 |
nicodemus_ | mnaser: using that indexer_url I've had no errors with a single metricd instance, the errors began only when I deployed a second metricd with redis | 13:56 |
jd__ | nicodemus_: you're on master, right? (just to be sure :) | 13:56 |
*** jfluhmann has quit IRC | 13:56 | |
nicodemus_ | mnaser: yes | 13:56 |
mnaser | im assuming that's for jd__ ^ :) | 13:57 |
jd__ | nicodemus_: oh you did git pull today or something? | 13:57 |
nicodemus_ | jd__: one metricd was installed on dec 16, the other three were installed a couple of days ago. Didn't pull today nor yesterday | 13:58 |
jd__ | ok | 13:58 |
jd__ | this is going to be a problem as we changed the file formats in master during this last week | 13:58 |
jd__ | so if you install different version of master it's likely going to fail :) | 13:58 |
jd__ | not sure it's your problem there yet though | 13:58 |
nicodemus_ | oops.... want me to try pulling on all four today and see if these errors are still present? | 13:59 |
jd__ | nicodemus_: yeah that'd be helpful | 13:59 |
jd__ | because the TypeError: _get_measures() takes exactly 3 arguments (4 given) is really suspicious | 14:00 |
gordc | jd__: i see you gave up and backported :) https://review.openstack.org/#/c/271296/1/gnocchi/opts.py | 14:00 |
*** belmoreira has quit IRC | 14:00 | |
jd__ | we'll support 1.3 -> 2.0 upgrade, but upgrading in the middle of master is risky | 14:00 |
*** ddieterly has quit IRC | 14:00 | |
nicodemus_ | jd__: ok, I'll give it a try and let you know how it goes :) | 14:00 |
jd__ | gordc: ah sh*t maybe I should not remove that option though | 14:00 |
jd__ | gordc: let me try to keep it | 14:00 |
mnaser | fairly sure this is a rados bug. I ripped out all the code and it takes 36s to list all measures with get_xattrs (i think the iterator implemented there is to blame). using the rados CLI, it takes 1.5s | 14:08 |
gordc | mnaser: sucks... if only we knew people who worked at red hat ;) | 14:12 |
mnaser | :P | 14:12 |
mnaser | i'm doing a bit of reading... it would be nice if the API had something which gave a list instead of an iterator | 14:13 |
mnaser | python isn't too good at iterating in these scenarios (e.g. the same thing happens with the db: doing a fetchall is much more efficient than fetching row by row) | 14:14 |
*** ljxiash has quit IRC | 14:19 | |
mnaser | Just to list them (not even create a list out of the items, a simple loop with pass inside): http://paste.openstack.org/show/484696/ | 14:19 |
*** ljxiash has joined #openstack-telemetry | 14:20 | |
gordc | you are profiling the rados call? | 14:21 |
mnaser | gordc: profiled just this only http://paste.openstack.org/show/484697/ | 14:21 |
mnaser | in comparison, "time rados -p gnocchi listxattr measure >/dev/null" => runs in 1.7s | 14:22 |
*** Ephur has quit IRC | 14:23 | |
jd__ | mnaser: yeah if this is a binding bug in rados we can't really do anything on the Gnocchi side… but we have multiple hats though | 14:24 |
jd__ | and one of them is Red :] | 14:24 |
mnaser | jd__: sounds good :) i'll do some more investigating when I have something more concrete | 14:25 |
jd__ | mnaser: 👍 | 14:25 |
jd__ | gordc: I have more faith in https://review.openstack.org/#/c/271296/2 | 14:30 |
gordc | jd__: my review is: 'seems legit' | 14:33 |
*** efoley has quit IRC | 14:33 | |
*** efoley has joined #openstack-telemetry | 14:34 | |
jd__ | gordc: I expected no less from you :) | 14:38 |
jd__ | I was about to write an email about how upper-constraint is fucked | 14:38 |
jd__ | in the end, I'll pass. | 14:39 |
gordc | jd__: hahah. you can still right it. | 14:40 |
gordc | write* | 14:40 |
gordc | or right... still works | 14:40 |
*** ddieterly has joined #openstack-telemetry | 14:40 | |
mnaser | alright .. i now have a librados code to list all xattrs both in python and C .. in python, around 45s for 80k items, in C, 1.3s. | 14:43 |
liamji | gordc: hello :), I ran into a problem while implementing the aodhclient tests. When the test cases are related to gnocchi, it seems we need to install gnocchi as a dependency. Or should we use mock to test the cases involving gnocchi? thx | 14:43 |
openstackgerrit | Julien Danjou proposed openstack/gnocchi: storage/carbonara: store TimeSerieAggregated in splits https://review.openstack.org/258434 | 14:47 |
*** ildikov has joined #openstack-telemetry | 14:47 | |
gordc | liamji: you should probably mock it. i think aodh checks gnocchi for rule types | 14:48 |
*** pcaruana has joined #openstack-telemetry | 14:50 | |
liamji | yes, it throws the exception 'raise GnocchiUnavailable(e)'. okay, I will do it. Thanks :) | 14:50 |
gordc | liamji: thanks | 14:50 |
gordc | mnaser: i think ideally we want metricd to be processing as quickly as possible to minimise the number of items but yeah... 45s vs 1.3s sucks. | 14:51 |
mnaser | im not really sure *how* i ended up with that 80k or so to start with | 14:51 |
mnaser | still 70k left to process and i think it will take much longer.. but it's something to be figured out thats for sure.. i dont know if introducing a small caching concept to this can help | 14:52 |
gordc | mnaser: probably the bugs jd__ mentioned above... causing a backlog for metricd... and then the ceph stuff compounds the issue. | 14:53 |
liamji | gordc: np :) | 14:53 |
mnaser | possibly | 14:53 |
mnaser | if 1.3.4 makes it out and zigo has some deb stuff, i'll try to upgrade and start ceilometer again and see if it catches up | 14:53 |
gordc | kk | 14:54 |
jd__ | there's a bug where Gnocchi returns a list vs a set which is fixed in master and being fixed in 1.3 – that reduces the time a lot | 14:54 |
jd__ | then if rados takes 45s on itself, we can't help much | 14:54 |
jd__ | but the listing is done once in a while anyway | 14:54 |
mnaser | jd__: this is where i'm confused: you're saying it's done once in a while, but the ceph code does it *every* single time it has to process a metric (unless i'm not reading things right) | 14:55 |
mnaser | https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L108 | 14:55 |
gordc | anticdent: you understand this? https://github.com/openstack-infra/project-config/blob/master/jenkins/jobs/ceilometer.yaml#L247-L251 | 14:55 |
mnaser | so for every metric, it runs this to grab the list of all measures | 14:55 |
jd__ | mnaser: yeah but that one is not 80k | 14:56 |
* anticdent looks | 14:56 | |
*** jd__ has left #openstack-telemetry | 14:56 | |
*** jd__ has joined #openstack-telemetry | 14:56 | |
anticdent | gordc: I reckon that's something sileht did but | 14:56 |
mnaser | jd__: it is, while it may not "send" 80k items or copy them over, it's processing them (which is taking the 45s) ... it is a "rados" problem at the end of the day anyways | 14:56 |
zigo | mnaser: As soon as 1.3.4 is out, I'll package it. | 14:57 |
mnaser | merci beaucoup :) | 14:57 |
gordc | anticdent: he did oslo part, i did client... the client one doesn't work | 14:57 |
jd__ | mnaser: oh right I missed that it was using the same prefix | 14:58 |
anticdent | is the idea that it is supposed to be fetched from git when the project that initiated the test is X? | 14:58 |
jd__ | mnaser: we could probably be smarter here and list less | 14:58 |
gordc | yeah. | 14:58 |
gordc | anticdent: sileht added the test to the ceilometerclient gate but i noticed it was just using pip | 14:58 |
_nadya_ | gordc: hi Gordon! Am I right that we send all data from one compute node in one batch during polling? | 14:59 |
gordc | so i thought, hey, client is like oslo.messaging... the oslo gate works correctly... let's copy | 14:59 |
mnaser | jd__: i don't mind writing some small code to cache the results (i've done some of this for Horizon) .. but i don't know what impact caching these results might have | 14:59 |
gordc | no dice. | 14:59 |
mnaser | the driver seems pretty straightforward | 14:59 |
anticdent | gordc: reasonable plan, but not workie. My guess would be that g-r is messing things up somewhere, but why that would be for one and not the other is unclear | 14:59 |
gordc | _nadya_: yep. i believe it batches by default. | 15:00 |
_nadya_ | gordc: cool, thanks! | 15:00 |
*** rbak has joined #openstack-telemetry | 15:00 | |
gordc | anticdent: yeah... i love guessing at project-config changes. | 15:01 |
gordc | http://logs.openstack.org/08/235208/3/check/gate-ceilometer-dsvm-integration/a637ac7/logs/devstacklog.txt.gz#_2016-01-22_03_37_41_358 | 15:01 |
anticdent | nobody in infra to come to your aid? | 15:01 |
gordc | http://logs.openstack.org/81/267981/1/check/gate-ceilometer-dsvm-integration/6588583/logs/devstacklog.txt.gz#_2016-01-22_11_57_04_846 | 15:01 |
gordc | you are my aid... does anticdent mean don't bother me? | 15:01 |
* anticdent has friday brain dumb | 15:02 | |
anticdent | Maybe try doing what it says: when the project is ceiloclient, add ceiloclient to PROJ | 15:02 |
nicodemus_ | jd__: after pulling and re-deploying gnocchi-metricd, so far I haven't seen any errors. Nevertheless, the errors might not appear immediately. I'll keep watching how it goes. | 15:02 |
gordc | anticdent: i'm scared as that has blown up quite a bit... that projects stuff is magic. | 15:03 |
gordc | along with everything | 15:03 |
gordc | anticdent: you want to join our gnocchi summit talk? | 15:03 |
jd__ | nicodemus_: ok | 15:03 |
anticdent | gordc: thus my "ask infra" because we both dumb | 15:04 |
anticdent | gordc: I've already got two in the works, but what's the topic? | 15:04 |
gordc | gnocchi magic | 15:04 |
anticdent | something something magic something something | 15:04 |
gordc | or some flame topic like 'we beat <insert db name here>' | 15:05 |
anticdent | that would be fun | 15:05 |
*** yprokule has quit IRC | 15:05 | |
anticdent | I reckon most of my gnocchi brain will have been archived by then so I may not be a useful addition | 15:05 |
gordc | anticdent: kk | 15:05 |
anticdent | so unless you want someone to say "one way bs to uuid mapping!" over and over... | 15:06 |
gordc | anticdent: you doing uuid everywhere topic for cross project? | 15:06 |
anticdent | not yet, it will come up soon enough | 15:07 |
gordc | 'soon' 2 years? | 15:07 |
*** liamji has quit IRC | 15:08 | |
*** liamji has joined #openstack-telemetry | 15:08 | |
anticdent | that's pretty fast in openstack time | 15:10 |
gordc | anticdent: i'm optimistic. | 15:12 |
*** diogogmt has joined #openstack-telemetry | 15:12 | |
anticdent | I'll try to write something up after the nova mid-cycle, that will get the ball created, a few months later it will start rolling | 15:12 |
gordc | .. you have midcycle 1 month before m-3? | 15:13 |
gordc | isn't feature freeze already set in nova? | 15:14 |
anticdent | the mid-cycle is next week, but yeah freeze has already happened | 15:14 |
anticdent | I'm not sure I understand it | 15:14 |
gordc | air miles | 15:15 |
anticdent | but then again, people frequently talk about "yeah we should be able to merge that code in O" | 15:15 |
anticdent | which boggles | 15:15 |
mnaser | lol, the rados.py bindings literally spawn a thread every single time next() is called on the iterator | 15:16 |
mnaser | which means 80k threads are created over the course of a listing | 15:16 |
nicodemus_ | mnaser: in my setup, I have 7.5M measures in ceph >.< | 15:19 |
gordc | mnaser: ceph just gave you justification for new hardware request | 15:19 |
gordc | request supercomputer, reason: need to query ceph | 15:20 |
mnaser | haha | 15:21 |
mnaser | nicodemus_ -- i think at this point, you're a bit doomed. the time to process a single metric is 70 minutes if your numbers match mine | 15:22 |
mnaser | because for 80k it takes 45 seconds | 15:22 |
mnaser | I'm thinking of writing something in the codebase that relies on using redis sets to do this part instead of xattrs | 15:23 |
mnaser | xattrs can work at small scale but at large scale it will be near impossible (imho) | 15:23 |
nicodemus_ | mnaser: I'll try the code you wrote on https://bugs.launchpad.net/gnocchi/+bug/1536909 to get some time measurement | 15:25 |
openstack | Launchpad bug 1536909 in Gnocchi "Scaling problems with Ceph" [Undecided,Incomplete] | 15:25 |
mnaser | nicodemus_ i'd leave it running in a screen, i suspect it might take quite some time with 7.5m records | 15:25 |
mnaser | im going to work on something based on redis with a small migration script as a poc and see where it goes | 15:25 |
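A rough sketch of the redis-set idea mnaser describes (not his actual patch; the key names here are invented): keeping one set of pending measure names per metric makes listing the backlog for a metric an O(set size) SMEMBERS call instead of a scan over every xattr in the pool. Requires the redis-py package.

    import redis

    r = redis.StrictRedis(host='127.0.0.1', port=6379)

    def add_measure(metric_id, measure_name):
        # called when a new measure arrives for a metric
        r.sadd('gnocchi:pending:%s' % metric_id, measure_name)

    def pending_measures(metric_id):
        # only touches the keys for this one metric
        return r.smembers('gnocchi:pending:%s' % metric_id)

    def mark_processed(metric_id, measure_name):
        r.srem('gnocchi:pending:%s' % metric_id, measure_name)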
*** anticdent has quit IRC | 15:28 | |
*** mragupat has joined #openstack-telemetry | 15:30 | |
_nadya_ | Redis is everywhere... | 15:31 |
mnaser | _nadya_: i don't know :( i'm kinda trying to do the best I can to see if this can be adjusted, but it looks like there was a conscious decision to spawn everything in threads - https://github.com/ceph/ceph/commit/2ce5cb7d06c80e34824981c36adcaa2dbfbc581f | 15:35 |
_nadya_ | mnaser: I have nothing against Redis :) It's funny that in 3 "areas" in Ceilometer we suddenly need Redis (or just a distributed cache) | 15:39 |
*** peristeri has joined #openstack-telemetry | 15:39 | |
mnaser | heh, well it's also the fact that a large number of xattrs might start causing problems.. i'm not sure xattrs are the best way to implement this kind of thing at large scale (esp. with all the O(N) code in the driver) | 15:39 |
gordc | mnaser: yeah. so that part is kinda an intermediate step where we store data before it gets aggregated/processed... ideally that shouldn't be as backed up as it is... but using another form of storage for the backlog sounds interesting. | 15:43 |
*** ddieterly has quit IRC | 15:46 | |
*** ddieterly has joined #openstack-telemetry | 15:48 | |
*** cdent has joined #openstack-telemetry | 15:52 | |
*** safchain has quit IRC | 15:54 | |
*** idegtiarov_ has quit IRC | 16:01 | |
mnaser | gordc: poc: listing all uuids of metrics takes 0.111s, and measures for a specific metric = 0.055s. | 16:04 |
mnaser | that was by importing all my 80k records into redis | 16:04 |
mnaser | time to make the small changes to the ceph driver and see what happens.. | 16:04 |
*** zqfan has quit IRC | 16:11 | |
*** mragupat has quit IRC | 16:12 | |
*** ddieterly has quit IRC | 16:14 | |
*** mragupat has joined #openstack-telemetry | 16:15 | |
*** rcernin has quit IRC | 16:15 | |
*** ddieterly has joined #openstack-telemetry | 16:21 | |
*** mattyw has quit IRC | 16:22 | |
mnaser | oh man this is great | 16:25 |
mnaser | processing at ~1k/s | 16:26 |
mnaser | nicodemus_ : i may have some good news in that redis can help alleviate A LOT of this load | 16:48 |
mnaser | i just cleared the entire 80k in minutes .. i will try to draft up a patch + some docs | 16:48 |
gordc | mnaser: sounds good. | 16:49 |
gordc | mnaser: have you tried with oslo.cache? so we can make it pluggable for someone out there not on redis? | 16:50 |
mnaser | that's a good idea gordc --- this was just an initial "hey, would this really make things faster" type of thing; oslo.cache is good but i'll have to check whether it's possible | 16:50 |
mnaser | i'm relying on set-based operations (which should be available in most KV stores) but i use a redis-specific way of retrieving the list of all keys matching a specific pattern | 16:51 |
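If that redis-specific retrieval is key-pattern matching, a sketch with redis-py's cursor-based SCAN might look like this (key pattern hypothetical, matching the earlier sketch); SCAN avoids blocking the server the way the KEYS command can on large keyspaces:

    import redis

    r = redis.StrictRedis(host='127.0.0.1', port=6379)
    # cursor-based iteration over all keys matching the pattern
    metric_keys = list(r.scan_iter(match='gnocchi:pending:*'))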
gordc | mnaser: cool cool. | 16:51 |
mnaser | also, we'd have to document things some more as well.. because we don't want people using oslo.cache+memcache for this type of thing | 16:51 |
gordc | mnaser: i see... yeah i think it might not be possible. i vaguely remember _nadya_ saying oslo.cache has limited functionality | 16:51 |
mnaser | if that memcache instance dies, you now have a ton of orphan measurements in your ceph cluster | 16:52 |
gordc | mnaser: true | 16:52 |
*** efoley has quit IRC | 16:52 | |
mnaser | but an sql backend is worth considering for this type of thing too | 16:52 |
mnaser | alright, found another bug causing unnecessary scans, but we're getting there... | 16:54 |
*** rcernin has joined #openstack-telemetry | 16:55 | |
*** pradk has quit IRC | 16:55 | |
*** pradk has joined #openstack-telemetry | 16:56 | |
gordc | mnaser: feel free to post your fixes too :) | 16:57 |
*** eglynn has quit IRC | 16:58 | |
*** mattyw has joined #openstack-telemetry | 17:01 | |
*** ildikov has quit IRC | 17:04 | |
zigo | gordc: Do you know when Gnocchi 1.3.4 will be out? | 17:07 |
zigo | <= 1.3.3 is currently not working well at all with the rest of Mitaka | 17:08 |
nicodemus_ | mnaser: great news! | 17:08 |
gordc | zigo: i think we want to merge these first: https://review.openstack.org/#/q/status:open+project:openstack/gnocchi+branch:stable/1.3 | 17:09 |
zigo | gordc: So you'd say it's a matter of (working) days? | 17:10 |
nicodemus_ | mnaser: please let me know if you want me and my millions of metrics to give it a try | 17:10 |
mnaser | nicodemus_: i will def let you know! | 17:10 |
gordc | depends whether jd__ is around when they merge | 17:10 |
mnaser | i would actually want you to do that | 17:10 |
mnaser | i want to see how it performs with a large # of records | 17:11 |
gordc | zigo: he has release permissions | 17:11 |
zigo | gordc: Ok, thanks. We'll wait then. | 17:11 |
zigo | FYI this is blocking the release of Ceilometer Mitaka b2. | 17:12 |
gordc | thanks for the help testing folks. packaging and code... much appreciated. | 17:12 |
zigo | (in Debian & Ubuntu) | 17:12 |
gordc | zigo: you take over ubuntu work? | 17:12 |
zigo | Since when have the Ubuntu guys been doing any packaging work? :) :) :) | 17:12 |
gordc | zigo: lol you said it, not me. | 17:13 |
zigo | More seriously, Gnocchi is being maintained in Debian, and they just sync from it. | 17:13 |
mnaser | oh man | 17:13 |
mnaser | i guess this is good... | 17:13 |
zigo | And they'll be waiting on it. | 17:13 |
mnaser | load average is ~40 lol | 17:13 |
mnaser | redis doing work | 17:13 |
*** prashantD_ has joined #openstack-telemetry | 17:14 | |
mnaser | | storage/number of metric having measures to process | 1085 | | 17:14 |
mnaser | | storage/total number of measures to process | 2395 | | 17:14 |
nicodemus_ | hahahah well that's what it's there for | 17:14 |
*** mattyw has quit IRC | 17:14 | |
mnaser | i dont know if im generating too much data.. this seems like a lot heh | 17:14 |
mnaser | i wonder if im generating too much data | 17:15 |
gordc | what kinda policies do you have defined? | 17:15 |
zigo | 2k measures is like nothing. | 17:15 |
mnaser | the default ones really | 17:15 |
zigo | influxdb can process like 500k / s ! | 17:16 |
mnaser | 2k constant measurements, not total measurements :p | 17:16 |
mnaser | like i understand its not much but | 17:16 |
gordc | mnaser: shouldn't be that bad then. | 17:16 |
gordc | mnaser: can't improve it if we set the bar low | 17:16 |
mnaser | it seems true | 17:16 |
jd__ | zigo: I want to release it today but I'm not sure i'll have time; I'm waiting for the patches to be merged | 17:17 |
mnaser | i think ceilometer-collector + swift is causing a lot of this | 17:17 |
zigo | jd__: Ah, cool, thanks for the info. Forwarding it to the Ubuntu guys. | 17:17 |
gordc | mnaser: possibly. the metering we do for swift is quite chatty considering it's every single api request | 17:18 |
jd__ | zigo: influxdb can do zillions per second! | 17:18 |
mnaser | i haven't even set that up yet actually gordc :< | 17:18 |
gordc | jd__: it's cool. my slide says gnocchi does zillions+1 per second | 17:18 |
jd__ | haha | 17:18 |
gordc | mnaser: ceilometermiddleware? | 17:18 |
jd__ | mnaser: this is not too much, but did you apply the patches? | 17:18 |
mnaser | doing a small log file analysis... 31124 image resource requests and 182143 swift_account requests | 17:18 |
mnaser | jd__: i did.. but i sorta rewrote a big portion of the ceph driver to use redis instead of xattrs to store the list, and that got processing back up to ~1k events/sec | 17:19 |
jd__ | mnaser: what's taking long? | 17:19 |
jd__ | 1k/sec sounds not that bad if you have 2k waiting :) | 17:20 |
mnaser | jd__: i'm not sure if my ceph cluster can't keep up, or if there's not enough cpu power | 17:20 |
mnaser | it's slowly creeping up; it's at 6k measures to process now | 17:20 |
gordc | jd__: that's with redis, not ceph. | 17:20 |
jd__ | yeah | 17:20 |
mnaser | gordc: redis is only used to replace the xattr stuff; measurement objects are still stored in ceph | 17:21 |
jd__ | mnaser seems smart enough to dig this out so I'm not worried | 17:21 |
jd__ | ) | 17:21 |
gordc | mnaser: yep | 17:21 |
jd__ | :) | 17:21 |
*** leitan has quit IRC | 17:21 | |
mnaser | yeah i'll do my checking and more digging | 17:21 |
gordc | jd__: something we might want to add | 17:21 |
mnaser | the number is rising :( can't keep up | 17:21 |
jd__ | we have large perf improvements merged in master and more coming, so it could be interesting to try them I guess | 17:21 |
gordc | i assume it would benefit the swift backend too | 17:21 |
jd__ | gordc: what do we want to add? | 17:22 |
gordc | instead of dumping temporary measures in ceph/swift, we'd dump them in redis and then store the permanent data in ceph/swift | 17:22 |
jd__ | right, to be tested I guess | 17:23 |
jd__ | though I still think we can optimize Ceph a bit before going down that road | 17:23 |
cdent | wasn't there some effort to store temp data on regular files and permanent data in ceph/swift? | 17:24 |
gordc | yeah. definitely. don't want to increase complexity if not needed | 17:24 |
gordc | cdent: what happens if they are local on different machines? | 17:25 |
gordc | i have no idea... maybe nothing of concern. | 17:26 |
cdent | presumably you're running multiple metricd's across several machines | 17:26 |
cdent | and several api servers across multiple machines | 17:26 |
cdent | each api server dumps to local disk, metricd picks it up, sends it on? | 17:27 |
cdent | it's not really adding more failure space (since the api server itself could fail) | 17:27 |
* cdent is just speculating | 17:27 | |
gordc | true true. | 17:27 |
jd__ | but it seems that the first issue to fix is the xattr listing taking 45x the normal time in Python :) | 17:27 |
cdent | and the files have been demonstrated to be fast | 17:27 |
jd__ | sileht might help on that | 17:27 |
cdent | yeah, that's weird | 17:28 |
mnaser | jd__: agreed, it shouldn't do that.. but i doubt this will be easily addressed | 17:31 |
*** yassine__ has quit IRC | 17:32 | |
jd__ | mnaser: have faith | 17:34 |
mnaser | jd__: i tried asking the person who made the change (back in 2013) | 17:34 |
mnaser | but we'll see, hopefully i get a response | 17:34 |
gordc | mnaser: just for reference, how many threads/processes did you configure for httpd? | 17:36 |
mnaser | # of CPU cores on the system which is 24 gordc | 17:36 |
mnaser | and i have 2 metricd processes too right now | 17:36 |
gordc | so 24 processes, and 32 threads? | 17:36 |
mnaser | gordc: http://paste.openstack.org/show/484722/ | 17:37 |
mnaser | that was using puppet-gnocchi unless that's not ideal | 17:37 |
gordc | hmm.. and you don't have a backlog in rabbit queues? | 17:38 |
mnaser | gordc: i dont even use the notifications yet | 17:38 |
mnaser | just ceilometer-collector | 17:38 |
gordc | mnaser: what is your data source? collector just dispatches data from a metering.sample queue | 17:39 |
mnaser | gordc: doesn't ceilometer do polling by default? | 17:39 |
mnaser | for images and swift container sizes | 17:40 |
gordc | mnaser: yeah, ah i see. you use polling agents as well. | 17:40 |
gordc | mnaser: this is with kilo? | 17:40 |
mnaser | liberty | 17:40 |
gordc | ... then you probably have notification agent running too since data goes polling -> notification -> collector -> gnocchi | 17:40 |
mnaser | i do, a huge amount of POST requests to swift_account/<...>/.. | 17:41 |
gordc | if you're using rabbit, can you check how large metering.sample queue is? | 17:41 |
gordc | rabbitmqctl list_queues... or via webui | 17:41 |
jd__ | if you meter swift requests that makes sense | 17:42 |
mnaser | not doing request monitoring yet, they're mostly things like this: ::1 - - [22/Jan/2016:17:42:38 +0000] "POST /v1/resource/swift_account/f91c013d82dc46da8f52530332408160/metric/storage.objects.containers/measures HTTP/1.1" 202 - "-" "python-requests/2.7.0 CPython/2.7.6 Linux/3.16.0-57-generic" | 17:42 |
mnaser | its just the measurements it does for usage checks etc | 17:44 |
*** jd__ has left #openstack-telemetry | 17:44 | |
*** jd__ has joined #openstack-telemetry | 17:44 | |
jd__ | ok :) | 17:44 |
mnaser | btw gordc - metering.sample: 739 | 17:45 |
gordc | that's with 600 VM? | 17:45 |
mnaser | yes | 17:45 |
mnaser | but i didnt configure the compute agents | 17:45 |
mnaser | so its mostly swift data | 17:46 |
*** _nadya_ has quit IRC | 17:46 | |
gordc | oh. ok. that probably makes a bit more sense. i was surprised it wasn't a lot higher if you were running with 1 process and 24 threads. | 17:46 |
gordc | mine gets backed up with a lot more threads and a lot less vms | 17:47 |
gordc | cool cool. was just trying to get a sense of how it was deployed before i play around with deployment configurations | 17:47 |
-openstackstatus- NOTICE: Restarting zuul due to a memory leak | 17:51 | |
*** thumpba has joined #openstack-telemetry | 17:54 | |
mnaser | well, we did make progress but it looks like my metricd processes can't keep up | 17:55 |
*** ildikov has joined #openstack-telemetry | 17:58 | |
*** Ephur has joined #openstack-telemetry | 17:58 | |
*** mragupat has quit IRC | 17:58 | |
*** boris-42 has joined #openstack-telemetry | 18:07 | |
mnaser | just turned it off so it can catch up.. i have a few hints of what's going on.. the process may be doing lots of extra loops on metrics with 0 values | 18:13 |
nicodemus_ | mnaser: I'm running the code you pasted on #1536909 | 18:13 |
mnaser | nicodemus_: probably still stuck eh? :p | 18:13 |
nicodemus_ | yeah... still waiting on it to finish | 18:14 |
mnaser | nicodemus_: i suspect it'll take an hour or so to run i think | 18:14 |
mnaser | i'm working a bit more on this redis patch.. | 18:14 |
openstackgerrit | Merged openstack/python-gnocchiclient: Add granularity argument to measures show https://review.openstack.org/266082 | 18:20 |
*** dan-t has quit IRC | 18:33 | |
*** dan-t has joined #openstack-telemetry | 18:34 | |
mnaser | hmm i think i've run into an oddity in carbonara | 18:41 |
mnaser | where it actually breaks down under scale | 18:41 |
nicodemus_ | mnaser: the code snippet took 41m16s to run | 18:42 |
mnaser | nicodemus_ : thats the time it's taking to process a *single* metric in your cluster right now | 18:42 |
mnaser | 7.5 million * 41 minutes | 18:42 |
mnaser | see ya in 584 years | 18:42 |
mnaser | https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/_carbonara.py#L228 <-- this line grabs all of the metrics that need work, it could be a list of 1000+ metrics | 18:43 |
mnaser | now the problem is that the metricd process seems to loop over those even though many of them were already processed | 18:43 |
mnaser | if process #1 does the first 100, process #2 will spend most of its time running _process_measure_for_metric for 100 metrics that are already done | 18:44 |
mnaser | and now you have 24 processes looping around in empty/already done metrics | 18:44 |
*** harlowja has quit IRC | 18:46 | |
nicodemus_ | I don't think I have 584 years T___T | 18:46 |
mnaser | running a single worker makes the work serialized, but that worker is actually doing work 100% of the time.. so there has to be an improvement so that self._list_metric_with_measures_to_process() somehow distributes work per process | 18:46 |
*** harlowja has joined #openstack-telemetry | 18:46 | |
mnaser | nicodemus_: i'll make sure we get it all figured out :) | 18:46 |
nicodemus_ | mnaser: I'm afraid I don't have much coding skills to assist, but let me know if I can help doing tests/benchmarks :) | 18:48 |
mnaser | nicodemus_: that's very useful on its own! i don't have 7.5 million records (yet) and it'd be an immense help once this is cooked up | 18:48 |
mnaser | hopefully what i'm saying isn't incorrect :p | 18:49 |
nicodemus_ | right now all ceilometer-agents are stopped, so there are no new metrics being pushed to gnocchi | 18:49 |
mnaser | yeah thats the best thing to do | 18:50 |
*** PsionTheory has joined #openstack-telemetry | 19:01 | |
*** _nadya_ has joined #openstack-telemetry | 19:07 | |
gordc | yeah i think we need to look at why get_xattrs is so slow. that is a required component regardless | 19:09 |
mnaser | gordc: even if it worked fast, what i mentioned above is a problem too | 19:10 |
mnaser | 24 processes getting the same list, first 1 or 2 will start working through them, the last ones will just loop over empty measures, doing work for nothing | 19:11 |
mnaser | i think there should be a proper way where a set of metrics is dispatched to be processed by a specific worker, that way you don't have workers sitting around looping over already-processed data | 19:11 |
mnaser | it seems to me the more workers you have, the bigger this problem becomes | 19:12 |
gordc | right. i'm not saying it's not an issue. but i think we need to start at step 1 | 19:12 |
gordc | https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/_carbonara.py#L242 | 19:12 |
*** rcernin has quit IRC | 19:12 | |
mnaser | true, but this exposed another issue that would likely be a problem with all backends | 19:12 |
gordc | this lock should essentially ensure those later workers scan through and restart again. | 19:13 |
mnaser | gordc: correct, that locks it, but the list of metrics acquired using "metrics_to_process = self._list_metric_with_measures_to_process()" can be huge in some cases | 19:13 |
mnaser | and the metric might even be processed at that point as well | 19:14 |
mnaser | it wouldn't be locked | 19:14 |
mnaser | self._process_measure_for_metric would just return 0, but imagine this happening 2000 times x 48 processes | 19:14 |
mnaser | the processes are spinning and going over a list of metrics which have already been processed, it would be far more effective for it to not be doing this | 19:15 |
mnaser | results in a lot of stuff in the logs like this: 2016-01-22 19:09:14.363 30510 DEBUG gnocchi.storage._carbonara [-] Computed new metric b389e4e8-2d7d-4697-ad5c-68013a9b7948 with 0 new measures in 0.46 seconds process_measures /usr/lib/python2.7/dist-packages/gnocchi/storage/_carbonara.py:229 | 19:15 |
gordc | right. so we need a smarter exit strategy... | 19:16 |
gordc | and then a further improvement would be to bucketise the results we get back | 19:16 |
nicodemus_ | I have to sign off for a while (power in the office is down)... will be back in an hour or so | 19:16 |
mnaser | later nicodemus_ ! | 19:16 |
gordc | lates | 19:16 |
mnaser | gordc: correct, something where each worker gets a batch of N metrics to process that no other worker gets | 19:17 |
mnaser | _pending_measures_to_process_count(metric_id) is implemented and i think that would be useful for exiting early | 19:18 |
gordc | mnaser: not sure how possible the 'no other worker' part is. but we can definitely dump into smaller buckets and distribute those buckets to workers. | 19:18 |
mnaser | gordc: of course, this becomes a bit more complex; i did try feeding each worker batches of 10, which was effective but ran much slower (and still suffers from this issue) | 19:19 |
gordc | mnaser: yes, that could be a good solution. | 19:19 |
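A toy sketch of the bucketing idea gordc floats here (hypothetical; not how gnocchi partitions work): shard metric IDs deterministically across workers so each pending metric is examined by exactly one worker.

    import hashlib

    def my_share(metric_ids, worker_index, worker_count):
        """Return the metrics this worker is responsible for."""
        def bucket(metric_id):
            digest = hashlib.md5(str(metric_id).encode()).hexdigest()
            return int(digest, 16) % worker_count
        return [m for m in metric_ids if bucket(m) == worker_index]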
gordc | mnaser: want to push a patch? | 19:19 |
mnaser | let me try adding some code re: _pending_measures_to_process_count and early exit and see if that improves things | 19:19 |
*** mragupat has joined #openstack-telemetry | 19:19 | |
mnaser | yep, ill do that | 19:19 |
gordc | mnaser: awesome! | 19:20 |
mnaser | gordc: question, should i check _pending_measures_to_process_count before the lock or after? | 19:20 |
mnaser | i'm thinking it's safe to do it before the lock; worst case scenario, it will come back to it on the next run | 19:20 |
gordc | maybe after the lock. if the lock is taken, then the worker will just pass through it since it assumes another worker is working on it. | 19:21 |
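A minimal sketch of the early-exit pattern settled on here. The helper names (_list_metric_with_measures_to_process, _pending_measures_to_process_count, _process_measure_for_metric) are taken from this log, while the _lock helper and the loop shape are paraphrased assumptions, not the actual gnocchi code:

    def process_background_tasks(storage):
        for metric in storage._list_metric_with_measures_to_process():
            lock = storage._lock(metric)  # a per-metric tooz interprocess lock
            if not lock.acquire(blocking=False):
                continue  # another worker is already handling this metric
            try:
                # re-check after taking the lock: another worker may have
                # drained this metric between the listing and the lock
                if storage._pending_measures_to_process_count(metric) == 0:
                    continue
                storage._process_measure_for_metric(metric)
            finally:
                lock.release()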
*** nicodemus_ has quit IRC | 19:23 | |
*** chmouel_ is now known as chmouel | 19:25 | |
*** dan-t has quit IRC | 19:33 | |
mnaser | gordc: huge difference | 19:34 |
mnaser | went from barely handling the load to only around ~38 in the queue now | 19:34 |
gordc | mnaser: what is 'queue' in this case? | 19:35 |
gordc | mnaser: this sounds promising | 19:35 |
mnaser | sorry, | storage/total number of measures to process | 39 | | 19:35 |
mnaser | so measurements are actually being processed much faster | 19:36 |
gordc | ah got it. | 19:36 |
mnaser | before, they were just queueing up | 19:36 |
gordc | that's great. | 19:36 |
gordc | care to upload your tweak? | 19:36 |
mnaser | gordc: also, i am pretty sure we can go back to the original implementation and it will perform just fine, because the issue only manifests itself when the # of xattrs (i.e. pending measurements) is huge | 19:36 |
gordc | mnaser: yeah. definitely one of those cases where if it explodes, it only gets worse. | 19:37 |
gordc | seems like a good first fix if it still functions as expected. | 19:38 |
mnaser | gordc: i just had a talk with some of the ceph devs, including the person who made the change (moving things into their own thread) that i suspect is causing this | 19:39 |
mnaser | apparently there is work being done to port rados to cython, which should improve performance, and they asked if we could rerun it with the run-in-thread code overridden to see if the performance changes | 19:40 |
mnaser | so first i'll just upload this small change then i'll go ahead and test that | 19:40 |
gordc | mnaser: awesome! | 19:41 |
gordc | thanks for the help | 19:41 |
mnaser | np, thanks for your help and support too | 19:41 |
*** cdent has quit IRC | 19:44 | |
*** pradk has quit IRC | 19:45 | |
*** pradk has joined #openstack-telemetry | 19:45 | |
openstackgerrit | Mohammed Naser proposed openstack/gnocchi: Skip already processed measurements https://review.openstack.org/271499 | 19:48 |
mnaser | gordc ^ :) | 19:48 |
gordc | mnaser: cool cool. heading to doctors right now. will take a look at that later | 19:48 |
gordc | thanks again. | 19:48 |
mnaser | np, no rush | 19:48 |
*** gordc has quit IRC | 19:48 | |
*** dan-t has joined #openstack-telemetry | 19:54 | |
*** ddieterly has quit IRC | 20:02 | |
openstackgerrit | Mohammed Naser proposed openstack/gnocchi: Skip already processed measurements https://review.openstack.org/271499 | 20:08 |
openstackgerrit | Mohammed Naser proposed openstack/gnocchi: Skip already processed measurements https://review.openstack.org/271499 | 20:11 |
*** pcaruana has quit IRC | 20:14 | |
openstackgerrit | George Peristerakis proposed openstack/ceilometer: Load a directory of YAML event config files https://review.openstack.org/247177 | 20:21 |
*** nicodemus_ has joined #openstack-telemetry | 20:30 | |
*** ddieterly has joined #openstack-telemetry | 20:41 | |
mnaser | man right now the code for processing measures is just a big race between workers which becomes a big mess at scale :( | 20:55 |
mnaser | trying to figure out how to split it apart in buckets or something | 20:55 |
*** jwcroppe has quit IRC | 21:27 | |
*** peristeri has quit IRC | 21:45 | |
*** pcaruana has joined #openstack-telemetry | 21:48 | |
*** rcernin has joined #openstack-telemetry | 21:53 | |
*** mragupat has quit IRC | 21:59 | |
*** gordc has joined #openstack-telemetry | 22:01 | |
*** gordc has quit IRC | 22:37 | |
*** PsionTheory has quit IRC | 22:38 | |
*** pradk has quit IRC | 22:52 | |
*** rcernin has quit IRC | 22:53 | |
*** dan-t has quit IRC | 22:54 | |
*** dan-t has joined #openstack-telemetry | 23:20 | |
*** shakamunyi has quit IRC | 23:22 | |
*** rbak has quit IRC | 23:26 | |
*** ddieterly has quit IRC | 23:27 | |
*** vishwanathj has quit IRC | 23:29 | |
openstackgerrit | Rohit Jaiswal proposed openstack/ceilometer: Enhances get_meters to return unique meters https://review.openstack.org/259626 | 23:30 |
*** thumpba has quit IRC | 23:34 | |
*** thorst has quit IRC | 23:43 | |
*** thorst has joined #openstack-telemetry | 23:43 | |
*** thorst has quit IRC | 23:52 | |
*** Ephur has quit IRC | 23:54 |