*** apetrich has quit IRC | 01:57 | |
*** threestrands has joined #openstack-mistral | 03:31 | |
*** altlogbot_0 has quit IRC | 03:44 | |
*** altlogbot_1 has joined #openstack-mistral | 03:45 | |
*** ykarel|away has joined #openstack-mistral | 03:53 | |
*** ykarel|away has quit IRC | 03:53 | |
*** ykarel|away has joined #openstack-mistral | 03:55 | |
*** ykarel|away has quit IRC | 03:55 | |
*** akovi has joined #openstack-mistral | 04:08 | |
*** quenti[m] has quit IRC | 04:36 | |
*** quenti[m]1 has quit IRC | 04:36 | |
*** altlogbot_1 has quit IRC | 04:38 | |
*** altlogbot_1 has joined #openstack-mistral | 04:40 | |
*** altlogbot_1 has quit IRC | 04:40 | |
*** altlogbot_2 has joined #openstack-mistral | 04:41 | |
*** quenti[m] has joined #openstack-mistral | 04:42 | |
*** quenti[m]1 has joined #openstack-mistral | 04:58 | |
*** sapd1 has joined #openstack-mistral | 07:03 | |
*** apetrich has joined #openstack-mistral | 07:32 | |
*** vgvoleg_ has joined #openstack-mistral | 07:51 | |
rakhmerov | hi all | 08:01 |
---|---|---|
rakhmerov | anybody here for the meeting? | 08:01 |
vgvoleg_ | hi | 08:01 |
vgvoleg_ | + | 08:01 |
rakhmerov | :) | 08:01 |
rakhmerov | akovi, apetrich: hi, if you wanna participate, welcome ) | 08:02 |
rakhmerov | #startmeeting Mistral | 08:02 |
openstack | Meeting started Wed May 29 08:02:23 2019 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot. | 08:02 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 08:02 |
*** openstack changes topic to " (Meeting topic: Mistral)" | 08:02 | |
openstack | The meeting name has been set to 'mistral' | 08:02 |
apetrich | Morning | 08:02 |
rakhmerov | morning | 08:03 |
rakhmerov | apetrich: btw, just a reminder: still waiting for you to change those backports ) | 08:03 |
rakhmerov | no rush but please don't forget | 08:03 |
apetrich | rakhmerov, I know. Thanks for understanding. | 08:04 |
rakhmerov | no problem | 08:04 |
rakhmerov | so | 08:04 |
vgvoleg_ | oh, I did not have time to write a blueprint about the fail-on policy | 08:04 |
vgvoleg_ | sorry | 08:04 |
rakhmerov | :) | 08:04 |
rakhmerov | please do | 08:04 |
rakhmerov | vgvoleg_: I know you wanted to share some concerns during the office hour | 08:05 |
rakhmerov | you can go ahead and do that | 08:05 |
vgvoleg_ | yes | 08:06 |
vgvoleg_ | Right now we are testing mistral with a huge workflow | 08:06 |
*** ricolin_ has joined #openstack-mistral | 08:06 | |
vgvoleg_ | it has about 600 nested wf and a very big context | 08:07 |
*** ricolin_ has quit IRC | 08:07 | |
vgvoleg_ | we found 3 problems | 08:07 |
*** ricolin has joined #openstack-mistral | 08:07 | |
vgvoleg_ | 1) There are some memory leaks in the engine | 08:08 |
rakhmerov | ok | 08:08 |
vgvoleg_ | 2) Mistral gets stuck for some reason because of DB deadlocks in the action execution reporter | 08:09 |
rakhmerov | Oleg, how do you know that you're observing memory leaks? | 08:09 |
rakhmerov | maybe it's just a large memory footprint (which is totally OK with that big workflow) | 08:10 |
vgvoleg_ | we see lots of active sessions 'update state error info heartbeat wasn't received' | 08:10 |
rakhmerov | ok | 08:10 |
rakhmerov | maybe you need to increase timeouts? | 08:10 |
vgvoleg_ | nonono | 08:10 |
vgvoleg_ | it's ok for an action to fail | 08:11 |
vgvoleg_ | it is not ok for mistral to get stuck :D | 08:11 |
vgvoleg_ | and from that point the engines couldn't do anything, they lose the connection to rabbit and never return to a working state | 08:12 |
vgvoleg_ | btw we see a lot of sessions in the 'idle in transaction' state, tbh I don't know what that means | 08:12 |
rakhmerov | ok | 08:13 |
rakhmerov | I see | 08:13 |
vgvoleg_ | about memory leaks: we use monitoring to see current state of Mistral's pods | 08:13 |
rakhmerov | Oleg, we've recently found one issue with RabbitMQ | 08:13 |
rakhmerov | if you're using the latest code you probably hit it as well | 08:14 |
rakhmerov | the thing is that oslo.messaging recently removed some deprecated options | 08:14 |
vgvoleg_ | and we see that the memory value increases after a complete run | 08:14 |
rakhmerov | and the configuration option responsible for retries is now zero by default | 08:14 |
rakhmerov | so it never tries to reconnect | 08:14 |
rakhmerov | it's easy to solve just by reconfiguring the connection a little bit | 08:14 |
vgvoleg_ | if we run the flow once again, the value grows by some more memory, in our case about 2GB per pod | 08:15 |
vgvoleg_ | 2GB, and we don't know where it comes from | 08:15 |
vgvoleg_ | even if we turn off all caches | 08:15 |
vgvoleg_ | Oh, great news about rabbit, ty! We'll try to research it | 08:16 |
rakhmerov | vgvoleg_: yes, I can share the details on Rabbit with you separately | 08:17 |
rakhmerov | as for leaks, ok, understood. We used to observe them but all of them have been fixed | 08:18 |
rakhmerov | we haven't observed any leaks for at least a year of constant Mistral use | 08:18 |
rakhmerov | although the workflows are also huge | 08:18 |
rakhmerov | but ok, I assume you may be hitting some corner case or something | 08:19 |
rakhmerov | I can also advise you on how to find memory leaks | 08:19 |
rakhmerov | I can recommend some tools for that | 08:19 |
vgvoleg_ | can anyone help me and tell me about mechanisms to detect where they come from? | 08:19 |
vgvoleg_ | oh :) | 08:19 |
vgvoleg_ | ok | 08:20 |
rakhmerov | basically you need to see which types of objects dominate the Python heap | 08:20 |
rakhmerov | yes, I'll let you know later here in the channel | 08:20 |
rakhmerov | so that others could see as well | 08:20 |
rakhmerov | I just need to find all the relevant links | 08:20 |
vgvoleg_ | I've tried to do something like this, and a result like 'dict: 9999KB' does not help at all | 08:20 |
rakhmerov | yeah-yeah, I know | 08:22 |
rakhmerov | it's not that simple | 08:22 |
vgvoleg_ | The third issue I'd like to talk about is load balancing between engine instances | 08:23 |
rakhmerov | you need to trace a chain of references from these primitive types to our user types | 08:23 |
rakhmerov | vgvoleg_: | 08:23 |
rakhmerov | ok | 08:23 |
vgvoleg_ | There are cases where, e.g., one task has an on-success clause with lots of other tasks | 08:24 |
rakhmerov | yep | 08:24 |
vgvoleg_ | Starting and executing all these tasks is one indivisible operation | 08:24 |
rakhmerov | yes | 08:25 |
vgvoleg_ | so it isn't split between engines and we can see that one engine uses 99% CPU and the others 2-4% | 08:25 |
rakhmerov | as we discussed, I'd propose to make that option that we recently added (start_subworkflows_via_rpc) more universal | 08:25 |
vgvoleg_ | we use one very dirty hack to solve this problem: we add 'join:all' to all these tasks, and it helps to balance the load | 08:26 |
rakhmerov | so that it works not only during the start but at any point during the execution lifetime | 08:26 |
rakhmerov | vgvoleg_: yes | 08:27 |
rakhmerov | what do you think about my suggestion? Do you think it will help you? | 08:27 |
vgvoleg_ | I think creating tasks in the WAITING state and starting them via RPC is the only correct solution | 08:28 |
rakhmerov | yes, makes sense | 08:29 |
vgvoleg_ | this is good because it could solve one more problem | 08:29 |
rakhmerov | can you come up with some short spec or blueprint for this please? | 08:29 |
rakhmerov | we need to understand what else it will affect | 08:29 |
rakhmerov | basically, here we're going to change how tasks change their state | 08:30 |
rakhmerov | and this is a serious change | 08:30 |
vgvoleg_ | if all execution steps are atomic and RPC-based, we can use priority to make old executions finish faster than new ones | 08:31 |
rakhmerov | vgvoleg_: we definitely need to write up a description of this somewhere :) | 08:31 |
rakhmerov | with all the details and consequences of this change | 08:31 |
rakhmerov | I'd propose to make a blueprint for now | 08:32 |
rakhmerov | can you do this please? | 08:32 |
vgvoleg_ | ok, i'll try | 08:32 |
akovi | hi | 08:32 |
rakhmerov | #action vgvoleg_: file a blueprint for processing big on-success|on-error|on-complete clauses using WAITING state and RPC | 08:33 |
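The idea discussed above (one RPC message per task start, with older executions given priority) can be illustrated with a toy sketch. This is not Mistral code: `TaskQueue`, `dispatch`, the engine names, and the timestamps are all hypothetical, and an in-memory heap stands in for the real RPC transport.

```python
import heapq
import itertools


class TaskQueue:
    """Toy stand-in for an RPC queue holding one message per task start."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so tasks of the same execution keep their FIFO order.
        self._seq = itertools.count()

    def put(self, execution_created_at, task_name):
        # Older executions have smaller timestamps and are popped first.
        heapq.heappush(
            self._heap, (execution_created_at, next(self._seq), task_name)
        )

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def dispatch(queue, engines):
    """Hand queued task starts to engines round-robin, oldest execution first."""
    assignment = {engine: [] for engine in engines}
    for engine in itertools.cycle(engines):
        if len(queue) == 0:
            break
        assignment[engine].append(queue.get())
    return assignment


queue = TaskQueue()
queue.put(200, "new.t1")  # task of a newer execution
queue.put(100, "old.t1")  # tasks of an older execution
queue.put(100, "old.t2")
result = dispatch(queue, ["engine-1", "engine-2"])
```

Because every task start is its own message, no single engine has to process a whole on-success clause in one indivisible step, and the tasks of the older execution are handed out before those of the newer one.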
akovi | the action execution reporting may fail because of custom actions, unfortunately | 08:33 |
rakhmerov | akovi: hi! how's it going? | 08:33 |
rakhmerov | custom actions? | 08:33 |
rakhmerov | ad-hoc you mean? | 08:33 |
akovi | if the reporter thread (green) does not get time to run, timeouts will happen in the engine and the execution will be closed | 08:33 |
vgvoleg_ | we use only std.noop and std.echo in our test case | 08:34 |
akovi | we found such an error in one of our custom actions that listened to the output of a forked process | 08:34 |
rakhmerov | ok | 08:35 |
vgvoleg_ | btw we have tried to use random_delay in this job, but deadlocks appear quite often | 08:36 |
rakhmerov | vgvoleg_: it's not going to help much | 08:38 |
rakhmerov | we've already proven that | 08:38 |
rakhmerov | I have to admit that it was a stupid attempt to help mitigate a bad scheduler architecture | 08:38 |
rakhmerov | that whole thing is going to change pretty soon | 08:38 |
akovi | large context can also cause issues in RPC processing | 08:39 |
rakhmerov | yes | 08:39 |
vgvoleg_ | yes, but action execution reporter works with DB | 08:39 |
akovi | I tried compressing the data (with lzo and then lz4) but the results were not conclusive | 08:40 |
akovi | no | 08:40 |
akovi | the reporter has a list of the running actions | 08:40 |
akovi | this list is sent to the engine over RPC in given intervals | 08:41 |
akovi | the DB update happens in the engine | 08:41 |
vgvoleg_ | oh I got it | 08:41 |
akovi | could be moved to the executor (this happened accidentally once) | 08:41 |
akovi | but that would mean the executor has to have direct DB access | 08:41 |
akovi | which was not required earlier | 08:42 |
vgvoleg_ | I'll try to research it (the problem appeared yesterday) | 08:42 |
rakhmerov | yeah, we've always tried to avoid that for several reasons | 08:42 |
akovi | if the RPC channels are overloaded and messages pile up, that can cause heartbeat misses too | 08:42 |
akovi | action heartbeating should probably be a last resort actually | 08:44 |
akovi | to close an execution | 08:44 |
akovi | giving a fair amount of time for processing is a good practice | 08:45 |
akovi | may cause crashes to be processed later but the overall stability improves significantly | 08:45 |
akovi | 30 sec reporting and 10 misses should be ok | 08:46 |
rakhmerov | akovi: yes, it's been working for us pretty well so far | 08:47 |
rakhmerov | no complaints | 08:47 |
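To make the "30 sec reporting and 10 misses" figure concrete: under those settings an action would have to go silent for 300 seconds before it is given up on. The sketch below only illustrates that arithmetic; the function and parameter names are hypothetical and are not Mistral's actual configuration options.

```python
def heartbeat_tolerance(report_interval_s, max_missed):
    """Seconds of silence tolerated before an action is presumed dead."""
    return report_interval_s * max_missed


def is_presumed_dead(last_heartbeat_s, now_s, report_interval_s=30.0, max_missed=10):
    """True once more than `max_missed` reporting intervals have passed in silence."""
    silence = now_s - last_heartbeat_s
    return silence > heartbeat_tolerance(report_interval_s, max_missed)


tolerance = heartbeat_tolerance(30.0, 10)  # 30 s reporting, 10 allowed misses
```

A generous tolerance like this means a real crash is noticed later, which is the trade-off akovi describes: slower failure detection in exchange for fewer executions closed by mistake under load.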
rakhmerov | vgvoleg: so, please complete those two action items (creating blueprints) | 08:49 |
rakhmerov | I'll provide you details on Rabbit connectivity and detecting leaks in about an hour | 08:49 |
vgvoleg_ | thank you :) | 08:50 |
rakhmerov | sure thing | 08:50 |
rakhmerov | ok, I guess we've discussed what we had | 08:52 |
rakhmerov | I'll close the meeting now (the logs will be available online) | 08:53 |
akovi | (thumbs up) | 08:53 |
rakhmerov | #endmeeting | 08:53 |
*** openstack changes topic to "Mistral the Workflow Service for OpenStack. https://docs.openstack.org/mistral/latest/" | 08:53 | |
openstack | Meeting ended Wed May 29 08:53:20 2019 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 08:53 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.html | 08:53 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.txt | 08:53 |
openstack | Log: http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.log.html | 08:53 |
akovi | thank you Renat! | 08:53 |
rakhmerov | :) | 08:53 |
rakhmerov | thank you for joining ) | 08:53 |
rakhmerov | akovi: do you know if this bug is still relevant? https://bugs.launchpad.net/mistral/+bug/1812170 | 08:58 |
openstack | Launchpad bug 1812170 in Mistral "Flaky unittest: mistral.tests.unit.engine.test_workflow_cancel.WorkflowCancelTest" [Undecided,New] | 08:58 |
rakhmerov | I haven't seen such failures for a while | 08:58 |
akovi | if it's gone, that's good :) | 09:10 |
akovi | These are very much load sensitive issues | 09:10 |
akovi | these are usually more frequent around the release deadlines | 09:11 |
rakhmerov | ok | 09:15 |
rakhmerov | I'll leave it for now then ) | 09:15 |
rakhmerov | vgvoleg: hey, I used this article for finding memory leaks: https://www.darkcoding.net/software/finding-memory-leaks-in-python-with-objgraph/ | 09:40 |
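As a rough, standard-library-only illustration of the first step described in the meeting (seeing which object types grow between runs), one can snapshot the garbage collector's tracked objects before and after a workload and diff the counts. The linked article uses objgraph for this; the sketch below uses only `gc` and `collections`, and `LeakyCache` and `Payload` are hypothetical classes that simulate a leak.

```python
import gc
from collections import Counter


def count_types():
    """Count live GC-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=5):
    """Types whose instance count grew between two snapshots, largest first."""
    delta = {
        name: after[name] - before[name]
        for name in after
        if after[name] > before[name]
    }
    return sorted(delta.items(), key=lambda item: item[1], reverse=True)[:limit]


class Payload:
    """Hypothetical user type that ends up being retained."""


class LeakyCache:
    """Hypothetical holder that never releases what it stores."""

    def __init__(self):
        self.items = []


cache = LeakyCache()
gc.collect()
before = count_types()

for _ in range(1000):  # simulated workload that leaks into the cache
    cache.items.append(Payload())

gc.collect()
after = count_types()
top = growth(before, after)
```

Diffing snapshots surfaces user types such as `Payload` rather than only a large `dict` total. Finding out who holds those objects is the second step mentioned above, tracing the chain of references back to a user type, which `gc.get_referrers` or the tooling in the article can help with.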
rakhmerov | as for the RabbitMQ connection, you need to take these steps: | 09:41 |
rakhmerov | 1. Remove the group [oslo_messaging_rabbit] completely with all config options | 09:41 |
rakhmerov | 2. Add the property “transport_url = rabbit://...” under the default group | 09:41 |
rakhmerov | 3. Under the group [oslo_messaging_amqp] add “default_reply_retry = -1” | 09:42 |
rakhmerov | #3 is very important because like I said, the value for this property is now 0 by default so we have to change it explicitly | 09:42 |
rakhmerov | -1 means indefinitely | 09:43 |
rakhmerov | you can choose your own value | 09:43 |
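The three steps above can be sketched as a small script that rewrites an INI-style config. This is only an illustration under assumptions: it uses Python's `configparser` rather than oslo.config, the function name `rewrite_messaging_config` and the sample input are hypothetical, and the transport URL is left elided exactly as in the chat.

```python
import configparser
import io


def rewrite_messaging_config(text, transport_url):
    """Apply the three steps from the chat to an INI-style config string."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    # Step 1: drop the [oslo_messaging_rabbit] group with all its options.
    parser.remove_section("oslo_messaging_rabbit")

    # Step 2: set transport_url under the default group.
    parser["DEFAULT"]["transport_url"] = transport_url

    # Step 3: set default_reply_retry = -1 under [oslo_messaging_amqp].
    if not parser.has_section("oslo_messaging_amqp"):
        parser.add_section("oslo_messaging_amqp")
    parser["oslo_messaging_amqp"]["default_reply_retry"] = "-1"

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical starting config; "some_option" is a placeholder, not a real option.
old_config = (
    "[DEFAULT]\n"
    "debug = false\n"
    "\n"
    "[oslo_messaging_rabbit]\n"
    "some_option = value\n"
)
new_config = rewrite_messaging_config(old_config, "rabbit://...")
```

Note that `configparser` drops comments when it writes the file back, so for a real deployment editing the file by hand may be preferable; the script mainly documents what the end state should look like.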
vgvoleg_ | thank you so much | 09:47 |
*** threestrands has quit IRC | 09:54 | |
openstackgerrit | Mike Fedosin proposed openstack/mistral master: Fix adhoc action lookup https://review.opendev.org/661959 | 12:18 |
openstackgerrit | Mike Fedosin proposed openstack/mistral master: Fix adhoc action lookup https://review.opendev.org/661959 | 13:03 |
*** akovi has quit IRC | 13:05 | |
*** mmethot_ has joined #openstack-mistral | 13:53 | |
*** mmethot has quit IRC | 13:54 | |
openstackgerrit | abner zhao proposed openstack/mistral master: feat: Adds workflow input dict Chinese parameter support https://review.opendev.org/661991 | 14:00 |
*** ricolin has quit IRC | 14:20 | |
*** vgvoleg_ has quit IRC | 14:21 | |
*** ricolin has joined #openstack-mistral | 14:26 | |
openstackgerrit | Mike Fedosin proposed openstack/mistral master: Fix adhoc action lookup https://review.opendev.org/661959 | 15:14 |
*** altlogbot_2 has quit IRC | 15:35 | |
*** altlogbot_3 has joined #openstack-mistral | 15:36 | |
*** irclogbot_3 has quit IRC | 15:36 | |
*** irclogbot_0 has joined #openstack-mistral | 15:37 | |
*** ricolin has quit IRC | 16:42 | |
*** jtomasek has quit IRC | 20:55 | |
openstackgerrit | Mike Fedosin proposed openstack/mistral master: Rework finding indirectly affected created joins https://review.opendev.org/662102 | 22:17 |