*** irclogbot_2 has quit IRC | 02:19 | |
*** apetrich has quit IRC | 02:58 | |
*** pgaxatte has joined #openstack-mistral | 06:28 | |
*** openstackgerrit has joined #openstack-mistral | 06:59 | |
openstackgerrit | Merged openstack/python-mistralclient stable/stein: Update .gitreview for stable/stein https://review.openstack.org/644169 | 06:59 |
*** apetrich has joined #openstack-mistral | 07:32 | |
rakhmerov | d0ugal, apetrich: hi, any idea why this is happening? http://logs.openstack.org/16/648316/5/check/openstack-tox-docs/e9b504c/job-output.txt.gz#_2019-03-29_21_46_34_025523 | 07:33 |
apetrich | rakhmerov, that looks like flask from pecan inside sphinx. that's something I didn't know happened. But that seems to be a sphinx error | 07:34 |
rakhmerov | yeah | 07:35 |
apetrich | that is very odd indeed | 07:35 |
*** akovi has joined #openstack-mistral | 07:51 | |
rakhmerov | akovi: hi ) | 07:58 |
akovi | rakhmerov: hi Renat! | 07:59 |
rakhmerov | akovi: just FYI: my last few patches seriously improve Mistral performance | 07:59 |
rakhmerov | in the case of big data contexts | 07:59 |
rakhmerov | one of them was indeed a regression fix | 07:59 |
akovi | good to know | 07:59 |
rakhmerov | yes | 07:59 |
rakhmerov | you may want to try it | 07:59 |
akovi | I'll try to get an update in the product | 07:59 |
rakhmerov | yep | 08:00 |
akovi | Currently I'm working on solving the expiring tokens issue | 08:00 |
rakhmerov | some of our NSs now get deployed 4-5x faster | 08:00 |
rakhmerov | ooh, cool | 08:00 |
akovi | Nasty hack, this cannot be upstreamed :( | 08:00 |
rakhmerov | aah ) | 08:00 |
*** gkadam has joined #openstack-mistral | 08:02 | |
*** vgvoleg has joined #openstack-mistral | 08:09 | |
vgvoleg | Hi everyone! Does Mistral have any recommendations on how to work with huge contexts? Engines lose their connections to rabbit during YAQL evaluation, and sometimes engines get killed by the OOM killer and we lose some delayed calls | 08:25 |
akovi | you've got to increase the RPC and DB heartbeat timeouts | 08:26 |
akovi | and increase the memory limits | 08:26 |
vgvoleg | We have a workflow with about 6000 tasks, each of which puts about 200 key:value objects into the context | 08:26 |
akovi | I had an instance lately where I had to move the limit above 8G | 08:27 |
akovi | Does the kill happen when the Execution completes? | 08:27 |
vgvoleg | It should be less than 1GB of mem | 08:28 |
vgvoleg | Yes | 08:28 |
akovi | you should decrease the batch size | 08:28 |
vgvoleg | We tried to give 10-15 GB to the engines, but they eat everything | 08:28 |
akovi | Renat introduced it last summer if I remember correctly | 08:28 |
vgvoleg | we tried setting the batch size to 5 :D | 08:29 |
vgvoleg | ok we'll try to increase RPC heartbeat timeout | 08:29 |
vgvoleg | ty | 08:29 |
akovi | drop the limits for now | 08:30 |
akovi | run only a single engine | 08:30 |
akovi | to see what you have to deal with | 08:30 |
akovi | looks like the JSON marshalling/unmarshalling is really memory intensive | 08:30 |
akovi | 200 tasks with 4 MB contexts add up to 800 MB of context data | 08:31 |
akovi | that required 8-9 GB of memory to get through | 08:31 |
vgvoleg | how did you calculate this? :D | 08:32 |
vgvoleg | 4MB context | 08:32 |
*** gkadam has quit IRC | 08:32 | |
vgvoleg | I've just woken up... | 08:32 |
akovi | this is an example of what I had to tackle lately | 08:33 |
akovi | it's easiest to check the context size from the DB | 08:33 |
akovi | select sum(length(in_context)) from action_executions_v2; | 08:34 |
akovi | or something similar | 08:34 |
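A hedged follow-up to the query above, assuming MySQL and Mistral's action_executions_v2 schema (the id column and exact table layout are assumptions; adjust for your release and database):

    -- the ten largest action contexts, to identify the offending tasks
    select id, length(in_context) as ctx_bytes
    from action_executions_v2
    order by ctx_bytes desc
    limit 10;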
vgvoleg | oh ty | 08:34 |
vgvoleg | `increase the RPC and DB heartbeat timeouts` == heartbeat_timeout_threshold ? | 08:36 |
akovi | yes, and the number of missed HBs | 08:36 |
akovi | during YAQL evaluation the thread does not yield, so the greenthread is stuck | 08:37 |
akovi | we tried to put it on a separate real thread but other issues arose immediately | 08:38 |
akovi | (or rather: consequently) | 08:38 |
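A minimal Python sketch of the behaviour akovi describes, assuming eventlet (which Mistral's services are built on): CPU-bound work such as a long YAQL evaluation never yields, so the green thread that would answer RabbitMQ heartbeats is starved for the whole duration:

    import eventlet

    eventlet.monkey_patch()

    def heartbeats():
        # stands in for oslo.messaging's AMQP heartbeat loop
        while True:
            print("heartbeat")
            eventlet.sleep(1)

    def long_evaluation():
        # stands in for a long YAQL evaluation: pure CPU work with no
        # I/O, so it never yields control to other green threads
        return sum(i * i for i in range(10 ** 8))

    eventlet.spawn(heartbeats)
    print(eventlet.spawn(long_evaluation).wait())
    # no "heartbeat" line is printed while the evaluation runs, which
    # is why the broker eventually declares the connection dead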
vgvoleg | yes, I thought about it | 08:38 |
vgvoleg | `the number of missed HBs` - can't find it | 08:38 |
akovi | #heartbeat_rate = 2 | 08:39 |
akovi | #heartbeat_timeout_threshold = 60 | 08:39 |
akovi | I think these are the two important ones | 08:39 |
akovi | #heartbeat_interval = 3 | 08:40 |
akovi | maybe this one too | 08:40 |
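A consolidated sketch of those settings in mistral.conf, assuming the oslo.messaging RabbitMQ driver; the values are illustrative and the exact option set varies between releases (heartbeat_interval may belong to a different component), so verify against your release's config sample:

    [oslo_messaging_rabbit]
    # seconds without a heartbeat before the connection is declared dead
    heartbeat_timeout_threshold = 600
    # how many times per timeout interval the heartbeat check runs
    heartbeat_rate = 2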
akovi | it's freakin' loaded with legacy :) | 08:40 |
vgvoleg | ty so much | 08:40 |
vgvoleg | Is there any mechanism to handle stuck delayed calls? | 08:42 |
vgvoleg | I've found `pickup_job_after` option | 08:42 |
akovi | yes | 08:42 |
vgvoleg | But I can't tell whether it's what I need | 08:43 |
akovi | ah, it's only in our version | 08:44 |
akovi | you can implement it with a cron job | 08:44 |
akovi | select the delayed calls that have had the processing=1 flag set for too long | 08:44 |
akovi | simply update those rows back to 0 | 08:44 |
akovi | the engine will start processing them | 08:45 |
akovi | this is the simplest way we could tackle OOM kills too | 08:45 |
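A hedged sketch of that cron job's query, assuming MySQL and Mistral's delayed_calls_v2 table (the table and column names and the 10-minute threshold are assumptions to adapt to your schema and workload):

    -- release delayed calls that have been 'processing' for too long,
    -- e.g. because the engine that picked them up was OOM-killed
    update delayed_calls_v2
    set processing = 0
    where processing = 1
      and updated_at < now() - interval 10 minute;

As the next few lines note, the threshold has to comfortably exceed the longest legitimate processing time, or a call can be picked up twice.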
vgvoleg | Yes, I thought about implementing it locally. How do you handle the case when this timeout is less than the execution time? | 08:46 |
akovi | it should not be :) | 08:47 |
akovi | delayed calls are usually short | 08:47 |
vgvoleg | Not all functions that can be delayed are OK with being executed twice simultaneously | 08:47 |
vgvoleg | oh OK | 08:47 |
akovi | in practice this fixes the discrepancy that in-flight calls may not have been recorded consistently at the time of the OOM kill | 08:48 |
akovi | so yes, this is far slower than having an optimal solution but at least it keeps your service running | 08:49 |
vgvoleg | I think by the time we stop crashing because of memory, it will be possible to set a reasonable timeout | 08:49 |
vgvoleg | thank you | 08:50 |
akovi | you're welcome :) | 08:50 |
vgvoleg | This mechanism could be implemented right in the scheduler; was there any problem with that? Or why do you use an external cron job for it? | 08:52 |
*** bobh has joined #openstack-mistral | 09:06 | |
*** bobh has quit IRC | 09:11 | |
*** jrist has quit IRC | 09:15 | |
*** jrist has joined #openstack-mistral | 09:16 | |
*** gkadam has joined #openstack-mistral | 09:28 | |
vgvoleg | btw parallel execution takes more time than sequential | 09:35 |
vgvoleg | lol | 09:35 |
*** akovi has quit IRC | 09:44 | |
*** akovi has joined #openstack-mistral | 09:45 | |
*** d0ugal has quit IRC | 09:55 | |
*** d0ugal has joined #openstack-mistral | 10:05 | |
rakhmerov | vgvoleg: can you remind what version of Mistral you're using? | 10:36 |
rakhmerov | if you have a version from last summer (I remember something like that, Mistral Queens) then recommendation #1 from me is to switch to the latest available version from master | 10:37 |
rakhmerov | + my latest patch https://review.openstack.org/#/c/648316/ | 10:38 |
rakhmerov | this patch removes a HUGE performance regression related to YAQL evaluation | 10:38 |
rakhmerov | also lots of performance improvements were made in Oct-Nov | 10:39 |
rakhmerov | apetrich, d0ugal: sphinx was updated to 2.0.0 on Mar 28 | 10:51 |
rakhmerov | I guess that's the cause | 10:51 |
d0ugal | Sounds likely | 10:52 |
openstackgerrit | Renat Akhmerov proposed openstack/mistral master: WIP: try to pin sphinx version https://review.openstack.org/648944 | 10:58 |
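A hedged sketch of what such a pin might look like in doc/requirements.txt (the bounds are illustrative, not the contents of the WIP patch above):

    sphinx>=1.8.0,<2.0.0  # 2.0.0 breaks the docs build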
vgvoleg | rakhmerov: I'm using the latest with your patch | 11:01 |
rakhmerov | ok | 11:01 |
rakhmerov | 6000 tasks is a lot :) | 11:01 |
rakhmerov | I guess parsing the YAML alone takes a lot of time | 11:02 |
vgvoleg | we are trying to move all the cycles in the publish section into one YAQL expression; I think we could win some time with it | 11:08 |
rakhmerov | vgvoleg: cycles? | 11:22 |
rakhmerov | what do you mean by that? | 11:22 |
rakhmerov | d0ugal, apetrich: yes, it's the sphinx version. https://review.openstack.org/#/c/648944/ passes the docs job but doesn't pass requirements-check | 11:23 |
apetrich | rakhmerov, great | 11:23 |
rakhmerov | I probably need your advice here. Do you think we need to send a patch to the global requirements to pin sphinx version? | 11:24 |
rakhmerov | vgvoleg: and make sure to set the config property "convert_input_data" to false | 11:25 |
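A hedged sketch of where that property lives in mistral.conf; in recent releases Mistral's YAQL conversion options sit in a [yaql] group, but the group name is an assumption to check against your release's config sample:

    [yaql]
    # skip converting/deep-copying input data before each YAQL
    # evaluation; a significant saving on large contexts
    convert_input_data = False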
apetrich | rakhmerov, I'm not sure that it isn't an interaction with pecan | 11:25 |
apetrich | I'd keep it like that for now. It's going to hit other projects if it's a sphinx bug. If it's not and it's just an interaction with pecan, we might need to leave it like this anyway | 11:26 |
rakhmerov | it seems like the new sphinx 2.0.0 conflicts with sphinxcontrib-pecanwsme | 11:26 |
rakhmerov | I guess that's what it is | 11:27 |
rakhmerov | there will probably have to be a new version of sphinxcontrib-pecanwsme to fix that, but it doesn't exist yet | 11:27 |
rakhmerov | because yes, the problem comes from the interaction of sphinx and pecanwsme | 11:28 |
apetrich | yeah | 11:28 |
apetrich | makes sense | 11:28 |
apetrich | "sense" | 11:28 |
*** akovi has quit IRC | 11:30 | |
rakhmerov | ok, I'll leave it as is for now | 11:30 |
rakhmerov | let's see if they fix it | 11:30 |
*** akovi has joined #openstack-mistral | 11:30 | |
openstackgerrit | Vlad Gusev proposed openstack/mistral master: Add release note for I04ba85488b27cb05c3b81ad8c973c3cc3fe56d36 https://review.openstack.org/648956 | 12:11 |
*** apetrich has quit IRC | 12:16 | |
*** apetrich has joined #openstack-mistral | 12:17 | |
*** apetrich has quit IRC | 12:36 | |
openstackgerrit | Vlad Gusev proposed openstack/mistral stable/stein: Add http_proxy_to_wsgi middleware https://review.openstack.org/647694 | 12:36 |
*** apetrich has joined #openstack-mistral | 12:53 | |
*** irclogbot_2 has joined #openstack-mistral | 13:26 | |
*** bobh has joined #openstack-mistral | 15:15 | |
*** bobh has quit IRC | 15:19 | |
*** pgaxatte has quit IRC | 15:49 | |
*** akovi has quit IRC | 16:24 | |
*** gkadam has quit IRC | 17:05 | |
*** bobh has joined #openstack-mistral | 17:15 | |
*** zigo has quit IRC | 17:37 | |
*** bobh has quit IRC | 17:54 | |
*** openstackgerrit has quit IRC | 23:56 |