Wednesday, 2019-05-29

*** apetrich has quit IRC01:57
*** threestrands has joined #openstack-mistral03:31
*** altlogbot_0 has quit IRC03:44
*** altlogbot_1 has joined #openstack-mistral03:45
*** ykarel|away has joined #openstack-mistral03:53
*** ykarel|away has quit IRC03:53
*** ykarel|away has joined #openstack-mistral03:55
*** ykarel|away has quit IRC03:55
*** akovi has joined #openstack-mistral04:08
*** quenti[m] has quit IRC04:36
*** quenti[m]1 has quit IRC04:36
*** altlogbot_1 has quit IRC04:38
*** altlogbot_1 has joined #openstack-mistral04:40
*** altlogbot_1 has quit IRC04:40
*** altlogbot_2 has joined #openstack-mistral04:41
*** quenti[m] has joined #openstack-mistral04:42
*** quenti[m]1 has joined #openstack-mistral04:58
*** sapd1 has joined #openstack-mistral07:03
*** apetrich has joined #openstack-mistral07:32
*** vgvoleg_ has joined #openstack-mistral07:51
rakhmerovhi all08:01
rakhmerovanybody here for the meeting?08:01
vgvoleg_hi08:01
vgvoleg_+08:01
rakhmerov:)08:01
rakhmerovakovi, apetrich: hi, if you wanna participate, welcome )08:02
rakhmerov#startmeeting Mistral08:02
openstackMeeting started Wed May 29 08:02:23 2019 UTC and is due to finish in 60 minutes.  The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.08:02
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.08:02
*** openstack changes topic to " (Meeting topic: Mistral)"08:02
openstackThe meeting name has been set to 'mistral'08:02
apetrichMorning08:02
rakhmerovmorning08:03
rakhmerovapetrich: btw, just a reminder: still waiting for you to update those backports )08:03
rakhmerovno rush but please don't forget08:03
apetrichrakhmerov, I know. Thanks for understanding.08:04
rakhmerovno problem08:04
rakhmerovso08:04
vgvoleg_oh, I didn't have time to write a blueprint about the fail-on policy08:04
vgvoleg_sorry08:04
rakhmerov:)08:04
rakhmerovplease do08:04
rakhmerovvgvoleg_: I know you wanted to share some concerns during the office hour08:05
rakhmerovyou can go ahead and do that08:05
vgvoleg_yes08:06
vgvoleg_Right now we are testing Mistral with a huge workflow08:06
*** ricolin_ has joined #openstack-mistral08:06
vgvoleg_it has about 600 nested workflows and a very big context08:07
*** ricolin_ has quit IRC08:07
vgvoleg_we found 3 problems08:07
*** ricolin has joined #openstack-mistral08:07
vgvoleg_1) There are some memory leaks in the engine08:08
rakhmerovok08:09
vgvoleg_2) Mistral for some reason gets stuck because of DB deadlocks in the action execution reporter08:09
rakhmerovOleg, how do you know that you're observing memory leaks?08:09
rakhmerovmaybe it's just a large memory footprint (which is totally OK with that big workflow)08:10
vgvoleg_we see lots of active sessions like 'update state error info: heartbeat wasn't received'08:10
rakhmerovok08:10
rakhmerovmaybe you need to increase timeouts?08:10
vgvoleg_nonono08:10
vgvoleg_it's OK for an action to fail08:11
vgvoleg_but it's not OK for Mistral to get stuck :D08:11
vgvoleg_and from that point the engines can't do anything: they lose the connection to rabbit and never return to a working state08:12
vgvoleg_btw we see a lot of sessions in the 'idle in transaction' state; tbh I don't know what that means08:12
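In PostgreSQL, 'idle in transaction' means a session has an open transaction but is not currently executing a statement. It usually indicates the application is holding transactions open, which keeps locks held and makes deadlocks and lock waits more likely. A minimal way to inspect those sessions, assuming PostgreSQL as the backing database and a placeholder connection string:

    # List sessions stuck in 'idle in transaction', oldest transaction first.
    # Assumes PostgreSQL; the DSN below is a placeholder, adjust to your setup.
    import psycopg2

    conn = psycopg2.connect("dbname=mistral user=mistral host=127.0.0.1")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pid, now() - xact_start AS tx_age, query "
            "FROM pg_stat_activity "
            "WHERE state = 'idle in transaction' "
            "ORDER BY tx_age DESC"
        )
        for pid, tx_age, query in cur.fetchall():
            print(pid, tx_age, query)
    conn.close()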
rakhmerovok08:13
rakhmerovI see08:13
vgvoleg_about the memory leaks: we use monitoring to see the current state of Mistral's pods08:13
rakhmerovOleg, we've recently found one issue with RabbitMQ08:13
rakhmerovif you're using the latest code you probably hit it as well08:14
rakhmerovthe thing is that oslo.messaging recently removed some deprecated options08:14
vgvoleg_and we see that the memory usage increases after a complete run08:14
rakhmerovand the configuration option responsible for retries is now zero by default08:14
rakhmerovso it never tries to reconnect08:14
rakhmerovit's easy to solve just by reconfiguring the connection a little bit08:14
vgvoleg_if we run the workflow once again, the usage grows by some more memory; in our case it's about 2GB per pod08:15
vgvoleg_2GB that we don't know where it comes from08:15
vgvoleg_even if we turn off all caches08:15
vgvoleg_Oh, great news about rabbit, ty! We'll try to research it08:16
rakhmerovvgvoleg_: yes, I can share the details on Rabbit with you separately08:17
rakhmerovas for the leaks, ok, understood. We used to observe them but all of them have been fixed08:18
rakhmerovwe haven't observed any leaks for at least a year of constant Mistral use08:18
rakhmerovalthough the workflows are also huge08:18
rakhmerovbut ok, I assume you may be hitting some corner case or something08:19
rakhmerovI can also advise you on how to find memory leaks08:19
rakhmerovI can recommend some tools for that08:19
vgvoleg_can anyone help me and tell me about mechanisms for detecting where they come from?08:19
vgvoleg_oh :)08:19
vgvoleg_ok08:20
rakhmerovbasically you need to see which types of objects dominate the Python heap08:20
rakhmerovyes, I'll let you know later here in the channel08:20
rakhmerovso that others could see as well08:20
rakhmerovI just need to find all the relevant links08:20
vgvoleg_I've tried to do something like this, and output like 'dict: 9999KB' doesn't help at all08:20
rakhmerovyeah-yeah, I know08:22
rakhmerovit's not that simple08:22
vgvoleg_The third issue I'd like to raise is load balancing between engine instances08:23
rakhmerovyou need to trace a chain of references from these primitive types to our user types08:23
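A minimal sketch of that kind of tracing, assuming the objgraph library (the one in the article rakhmerov links further down); picking 'dict' and the output filename are just illustrative, and rendering the chain as an image needs graphviz installed:

    # See which types dominate the heap, then trace one of the many dicts back
    # to whatever is keeping it alive. Assumes objgraph (pip install objgraph).
    import random
    import objgraph

    objgraph.show_most_common_types(limit=10)

    sample = random.choice(objgraph.by_type('dict'))
    chain = objgraph.find_backref_chain(sample, objgraph.is_proper_module)
    objgraph.show_chain(chain, filename='owner-chain.png')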
rakhmerovvgvoleg_:08:23
rakhmerovok08:23
vgvoleg_There are cases where, e.g., one task has an on-success clause with lots of other tasks08:24
rakhmerovyep08:24
vgvoleg_Starting and executing all these tasks is one indivisible operation08:24
rakhmerovyes08:25
vgvoleg_so it isn't split between engines, and we can see one engine using 99% CPU while the others use 2-4%08:25
rakhmerovas we discussed, I'd propose to make that option that we recently added (start_subworkflows_via_rpc) more universal08:25
vgvoleg_we use one very dirty hack to solve this problem: we add 'join: all' to all these tasks, and it helps to balance the load08:26
rakhmerovso that it works not only during the start but at any point during the execution lifetime08:26
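For illustration, a workflow in the shape vgvoleg_ describes: one task fanning out through on-success, with the 'join: all' workaround added to the downstream tasks. The names and the number of branches are made up; the real case fans out to far more tasks.

    ---
    version: '2.0'

    fanout_example:
      tasks:
        prepare:
          action: std.noop
          on-success:
            - work_1
            - work_2
            - work_3          # the real workflow fans out to many more tasks

        work_1:
          action: std.echo output="1"
          join: all           # the workaround from the discussion: with a join the
                              # task is created in WAITING state and started
                              # separately, which helps balance load across engines
        work_2:
          action: std.echo output="2"
          join: all
        work_3:
          action: std.echo output="3"
          join: all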
rakhmerovvgvoleg_: yes08:27
rakhmerovwhat do you think about my suggestion? Do you think it will help you?08:27
vgvoleg_I think creating tasks in the WAITING state and starting them via RPC is the only correct solution08:28
rakhmerovyes, makes sense08:29
vgvoleg_this is good because it could solve one more problem08:29
rakhmerovcan you come up with some short spec or blueprint for this please?08:29
rakhmerovwe need to understand what else it will affect08:29
rakhmerovbasically, here we're going to change how tasks change their state08:30
rakhmerovand this is a serious change08:30
vgvoleg_if all execution steps are atomic and RPC-based, we can use priorities to make old executions finish faster than new ones08:31
rakhmerovvgvoleg_: we definitely need to write up a description of this somewhere :)08:31
rakhmerovwith all the details and consequences of this change08:31
rakhmerovI'd propose to make a blueprint for now08:32
rakhmerovcan you do this please?08:32
vgvoleg_ok, I'll try08:32
akovihi08:32
rakhmerov#action vgvoleg_: file a blueprint for processing big on-success|on-error|on-complete clauses using WAITING state and RPC08:33
akovithe action execution reporting may fail because of custom actions, unfortunately08:33
rakhmerovakovi: hi! how's it going?08:33
rakhmerovcustom actions?08:33
rakhmerovad-hoc you mean?08:33
akoviif the reporter thread (green) does not get time to run, timeouts will happen in the engine and the execution will be closed08:33
vgvoleg_we use only std.noop and std.echo in our test case08:34
akoviwe found such an error in one of our custom actions that listened to the output of a forked process08:34
rakhmerovok08:35
vgvoleg_btw we have tried to use random_delay in this job, but deadlocks appear quite often08:36
rakhmerovvgvoleg_: it's not going to help much08:38
rakhmerovwe've already proven that08:38
rakhmerovI have to admit that it was a stupid attempt to help mitigate a bad scheduler architecture08:38
rakhmerovthat whole thing is going to change pretty soon08:38
akovilarge context can also cause issues in RPC processing08:39
rakhmerovyes08:39
vgvoleg_yes, but the action execution reporter works with the DB08:39
akoviI tried compressing the data (with lzo and then lz4) but the results were not conclusive08:40
akovino08:40
akovithe reporter has a list of the running actions08:40
akovithis list is sent to the engine over RPC at regular intervals08:41
akovithe DB update happens in the engine08:41
vgvoleg_oh I got it08:41
akovicould be moved to the executor (this happened accidentally once)08:41
akovibut that would mean the executor has to have direct DB access08:41
akoviwhich was not required earlier08:42
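A rough sketch of the reporting loop akovi describes, not Mistral's actual code: the RPC call, interval, and names are illustrative. The point is that the reporter green thread must get scheduled, or the engine sees missed heartbeats.

    # Rough sketch of an action heartbeat reporter as a green thread.
    # report_to_engine() stands in for the RPC call that sends the list of
    # running action execution IDs to the engine, which updates the DB.
    import eventlet

    eventlet.monkey_patch()

    _running_action_ex_ids = set()   # maintained by the executor as actions start/finish
    REPORT_INTERVAL = 30             # seconds between reports (illustrative value)

    def report_to_engine(action_ex_ids):
        """Placeholder for the RPC call; the engine side updates heartbeats in the DB."""

    def _reporter_loop():
        while True:
            if _running_action_ex_ids:
                report_to_engine(list(_running_action_ex_ids))
            # If another green thread never yields (e.g. a custom action blocking
            # on a forked process), this loop never gets scheduled and heartbeats
            # are missed, which is the failure mode akovi mentions.
            eventlet.sleep(REPORT_INTERVAL)

    reporter_thread = eventlet.spawn(_reporter_loop)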
vgvoleg_I'll try to research it (the problem appeared yesterday)08:42
rakhmerovyeah, we've always tried to avoid that for several reasons08:42
akoviif the RPC channels are overloaded and messages pile up, that can cause heartbeat misses too08:42
akoviaction heartbeating should probably be a last resort for closing an execution, actually08:44
akovigiving a fair amount of time for processing is good practice08:45
akoviit may cause crashes to be processed later, but the overall stability improves significantly08:45
akovi30 sec reporting and 10 misses should be ok08:46
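In config terms that would look roughly like the snippet below; the group and option names are recalled from Mistral's action-heartbeat settings and may differ between releases, so treat them as assumptions to verify against your deployment.

    [action_heartbeat]
    # report roughly every 30 seconds (assumed option name)
    check_interval = 30
    # only close the action execution after 10 missed reports (assumed option name)
    max_missed_heartbeats = 10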
rakhmerovakovi: yes, it's been working for us pretty well so far08:47
rakhmerovno complaints08:47
rakhmerovvgvoleg: so, please complete those two action items (creating blueprints)08:49
rakhmerovI'll provide you details on Rabbit connectivity and detecting leaks in about an hour08:49
vgvoleg_thank you :)08:50
rakhmerovsure thing08:50
rakhmerovok, I guess we've discussed what we had08:52
rakhmerovI'll close the meeting now (the logs will be available online)08:53
akovi(thumbs up)08:53
rakhmerov#endmeeting08:53
*** openstack changes topic to "Mistral the Workflow Service for OpenStack. https://docs.openstack.org/mistral/latest/"08:53
openstackMeeting ended Wed May 29 08:53:20 2019 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)08:53
openstackMinutes:        http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.html 08:53
openstackMinutes (text): http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.txt 08:53
openstackLog:            http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.log.html 08:53
akovithank you Renat!08:53
rakhmerov:)08:53
rakhmerovthank you for joining )08:53
rakhmerovakovi: do you know if this bug is still relevant? https://bugs.launchpad.net/mistral/+bug/1812170 08:58
openstackLaunchpad bug 1812170 in Mistral "Flaky unittest: mistral.tests.unit.engine.test_workflow_cancel.WorkflowCancelTest" [Undecided,New]08:58
rakhmerovI haven't seen such failures for a while08:58
akoviif it's gone, that's good :)09:10
akoviThese are very much load-sensitive issues09:10
akovithese are usually more frequent around the release deadlines09:11
rakhmerovok09:15
rakhmerovI'll leave it for now then )09:15
rakhmerovvgvoleg: hey, I used this article for finding memory leaks: https://www.darkcoding.net/software/finding-memory-leaks-in-python-with-objgraph/ 09:40
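The approach in that article amounts to comparing object counts before and after the operation that leaks. A rough sketch with objgraph, where run_big_workflow() is only a placeholder for whatever reproduces one complete run of the big workflow:

    # Print only the object types whose counts grew across one workflow run.
    # Assumes objgraph; run_big_workflow() is a placeholder, not a real API.
    import objgraph

    def run_big_workflow():
        """Placeholder: trigger one complete run of the big workflow here."""

    objgraph.show_growth(limit=10)   # first call records the baseline counts
    run_big_workflow()
    objgraph.show_growth(limit=10)   # second call prints only what grew since then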
rakhmerovas far as RabbitMQ connection, you need to make these steps:09:41
rakhmerov1. Remove the group [oslo_messaging_rabbit] completely with all config options09:41
rakhmerov2. Add the property “transport_url = rabbit://...” under the default group09:41
rakhmerov3. Under the group [oslo_messaging_amqp] add “default_reply_retry = -1"09:42
rakhmerov#3 is very important because like I said, the value for this property is now 0 by default so we have to change it explicitly09:42
rakhmerov-1 means indefinitely09:43
rakhmerovyou can choose your own value09:43
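Putting those three steps together, a minimal sketch of the relevant part of mistral.conf; the transport URL below is a placeholder for your own broker host and credentials:

    # Step 1: the whole [oslo_messaging_rabbit] group is removed.

    [DEFAULT]
    # Step 2: point oslo.messaging at RabbitMQ through the transport URL.
    transport_url = rabbit://mistral:PASSWORD@rabbit-host:5672/

    [oslo_messaging_amqp]
    # Step 3: retry indefinitely instead of the new default of 0 (no retries).
    default_reply_retry = -1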
vgvoleg_thank you so much09:47
*** threestrands has quit IRC09:54
openstackgerritMike Fedosin proposed openstack/mistral master: Fix adhoc action lookup  https://review.opendev.org/661959 12:18
openstackgerritMike Fedosin proposed openstack/mistral master: Fix adhoc action lookup  https://review.opendev.org/661959 13:03
*** akovi has quit IRC13:05
*** mmethot_ has joined #openstack-mistral13:53
*** mmethot has quit IRC13:54
openstackgerritabner zhao proposed openstack/mistral master: feat: Adds workflow input dict Chinese parameter support  https://review.opendev.org/661991 14:00
*** ricolin has quit IRC14:20
*** vgvoleg_ has quit IRC14:21
*** ricolin has joined #openstack-mistral14:26
openstackgerritMike Fedosin proposed openstack/mistral master: Fix adhoc action lookup  https://review.opendev.org/661959 15:14
*** altlogbot_2 has quit IRC15:35
*** altlogbot_3 has joined #openstack-mistral15:36
*** irclogbot_3 has quit IRC15:36
*** irclogbot_0 has joined #openstack-mistral15:37
*** ricolin has quit IRC16:42
*** jtomasek has quit IRC20:55
openstackgerritMike Fedosin proposed openstack/mistral master: Rework finding indirectly affected created joins  https://review.opendev.org/662102 22:17
