Wednesday, 2019-05-29

*** apetrich has quit IRC		01:57
*** threestrands has joined #openstack-mistral		03:31
*** altlogbot_0 has quit IRC		03:44
*** altlogbot_1 has joined #openstack-mistral		03:45
*** ykarel|away has joined #openstack-mistral		03:53
*** ykarel|away has quit IRC		03:53
*** ykarel|away has joined #openstack-mistral		03:55
*** ykarel|away has quit IRC		03:55
*** akovi has joined #openstack-mistral		04:08
*** quenti[m] has quit IRC		04:36
*** quenti[m]1 has quit IRC		04:36
*** altlogbot_1 has quit IRC		04:38
*** altlogbot_1 has joined #openstack-mistral		04:40
*** altlogbot_1 has quit IRC		04:40
*** altlogbot_2 has joined #openstack-mistral		04:41
*** quenti[m] has joined #openstack-mistral		04:42
*** quenti[m]1 has joined #openstack-mistral		04:58
*** sapd1 has joined #openstack-mistral		07:03
*** apetrich has joined #openstack-mistral		07:32
*** vgvoleg_ has joined #openstack-mistral		07:51
rakhmerov	hi all	08:01
rakhmerov	anybody here for the meeting?	08:01
vgvoleg_	hi	08:01
vgvoleg_	+	08:01
rakhmerov	:)	08:01
rakhmerov	akovi, apetrich: hi, if you wanna participate, welcome )	08:02
rakhmerov	#startmeeting Mistral	08:02
openstack	Meeting started Wed May 29 08:02:23 2019 UTC and is due to finish in 60 minutes. The chair is rakhmerov. Information about MeetBot at http://wiki.debian.org/MeetBot.	08:02
openstack	Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.	08:02
*** openstack changes topic to " (Meeting topic: Mistral)"		08:02
openstack	The meeting name has been set to 'mistral'	08:02
apetrich	Morning	08:02
rakhmerov	morning	08:03
rakhmerov	apetrich: btw, just a reminder: still waiting for you to update those backports )	08:03
rakhmerov	no rush but please don't forget	08:03
apetrich	rakhmerov, I know. Thanks for understanding.	08:04
rakhmerov	no problem	08:04
rakhmerov	so	08:04
vgvoleg_	oh, I didn't have time to write a blueprint about the fail-on policy	08:04
vgvoleg_	sorry	08:04
rakhmerov	:)	08:04
rakhmerov	please do	08:04
rakhmerov	vgvoleg_: I know you wanted to share some concerns during the office hour	08:05
rakhmerov	you can go ahead and do that	08:05
vgvoleg_	yes	08:06
vgvoleg_	Right now we are testing Mistral with a huge workflow	08:06
*** ricolin_ has joined #openstack-mistral		08:06
vgvoleg_	it has about 600 nested workflows and a very big context	08:07
*** ricolin_ has quit IRC		08:07
vgvoleg_	we found 3 problems	08:07
*** ricolin has joined #openstack-mistral		08:07
vgvoleg_	1) There are some memory leaks in the engine	08:08
rakhmerov	ok	08:09
vgvoleg_	2) For some reason Mistral gets stuck because of DB deadlocks in the action execution reporter	08:09
rakhmerov	Oleg, how do you know that you're observing memory leaks?	08:09
rakhmerov	maybe it's just a large memory footprint (which is totally OK with that big workflow)	08:10
vgvoleg_	we see lots of active sessions 'update state error info heartbeat wasn't received'	08:10
rakhmerov	ok	08:10
rakhmerov	maybe you need to increase timeouts?	08:10
vgvoleg_	nonono	08:10
vgvoleg_	it's ok for an action to fail	08:11
vgvoleg_	it is not ok for Mistral to get stuck :D	08:11
vgvoleg_	and from that point the engines can't do anything: they lose the connection to Rabbit and never return to a working state	08:12
vgvoleg_	btw we see a lot of sessions in the 'idle in transaction' state, tbh I don't know what it means	08:12
rakhmerov	ok	08:13
rakhmerov	I see	08:13
vgvoleg_	about memory leaks: we use monitoring to see the current state of Mistral's pods	08:13
rakhmerov	Oleg, we've recently found one issue with RabbitMQ	08:13
rakhmerov	if you're using the latest code you probably hit it as well	08:14
rakhmerov	the thing is that oslo.messaging recently removed some deprecated options	08:14
vgvoleg_	and we see that the memory value increases after a complete run	08:14
rakhmerov	and the configuration option responsible for retries is now zero by default	08:14
rakhmerov	so it never tries to reconnect	08:14
rakhmerov	it's easy to solve just by reconfiguring the connection a little bit	08:14
vgvoleg_	if we run the flow once again, this value grows some more, in our case by about 2GB per pod	08:15
vgvoleg_	2GB, and we don't know where it comes from	08:15
vgvoleg_	even if we turn off all caches	08:15
vgvoleg_	Oh, great news about rabbit, ty! We'll try to research it	08:16
rakhmerov	vgvoleg_: yes, I can share the details on Rabbit with you separately	08:17
rakhmerov	as far as leaks, ok, I understood. We used to observe them but all of them have been fixed	08:18
rakhmerov	we haven't observed any leaks for at least a year of constant Mistral use	08:18
rakhmerov	although the workflows are also huge	08:18
rakhmerov	but ok, I assume you may be hitting some corner case or something	08:19
rakhmerov	I can also advise you on how to find memory leaks	08:19
rakhmerov	I can recommend some tools for that	08:19
vgvoleg_	can anyone help me and tell me about ways to detect where they come from?	08:19
vgvoleg_	oh :)	08:19
vgvoleg_	ok	08:20
rakhmerov	basically you need to see which object types dominate the Python heap	08:20
rakhmerov	yes, I'll let you know later here in the channel	08:20
rakhmerov	so that others could see as well	08:20
rakhmerov	I just need to find all the relevant links	08:20
vgvoleg_	I've tried to do something like this, and output like 'dict: 9999KB' doesn't help at all	08:20
rakhmerov	yeah-yeah, I know	08:22
rakhmerov	it's not that simple	08:22
vgvoleg_	The third issue I'd like to bring up is load balancing between engine instances	08:23
rakhmerov	you need to trace a chain of references from these primitive types to our user types	08:23
rakhmerov	vgvoleg_:	08:23
rakhmerov	ok	08:23
vgvoleg_	There are cases, e.g. one task has an on-success clause with lots of other tasks	08:24
rakhmerov	yep	08:24
vgvoleg_	Starting and executing all these tasks is one indivisible operation	08:24
rakhmerov	yes	08:25
vgvoleg_	so it isn't split between engines, and we can see that one engine uses 99% CPU and the others 2-4%	08:25
rakhmerov	as we discussed, I'd propose to make that option that we recently added (start_subworkflows_via_rpc) more universal	08:25
vgvoleg_	we use one very dirty hack to solve this problem: we add 'join: all' to all these tasks, and it helps to balance the load	08:26
rakhmerov	so that it works not only during the start but at any point during the execution lifetime	08:26
rakhmerov	vgvoleg_: yes	08:27
rakhmerov	what do you think about my suggestion? Do you think it will help you?	08:27
vgvoleg_	I think creating tasks in the WAITING state and starting them via RPC is the only correct solution	08:28
rakhmerov	yes, makes sense	08:29
vgvoleg_	this is good because it could solve one more problem	08:29
rakhmerov	can you come up with some short spec or blueprint for this please?	08:29
rakhmerov	we need to understand what else it will affect	08:29
rakhmerov	basically, here we're going to change how tasks change their state	08:30
rakhmerov	and this is a serious change	08:30
vgvoleg_	if all execution steps are atomic and RPC-based, we can use priorities to make old executions finish faster than new ones	08:31
rakhmerov	vgvoleg_: we definitely need to write up a description of this somewhere :)	08:31
rakhmerov	with all the details and consequences of this change	08:31
rakhmerov	I'd propose to make a blueprint for now	08:32
rakhmerov	can you do this please?	08:32
vgvoleg_	ok, i'll try	08:32
akovi	hi	08:32
rakhmerov	#action vgvoleg_: file a blueprint for processing big on-success|on-error|on-complete clauses using WAITING state and RPC	08:33
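The idea discussed above (create the tasks of a big on-success clause in the WAITING state, start each one through its own RPC message, and let older executions go first) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Mistral code: the `dispatch` and `load_per_engine` helpers, the `(execution_created_at, task_name)` tuples, and the round-robin delivery standing in for competing RPC consumers are all hypothetical.

```python
import heapq
import itertools
from collections import Counter


def dispatch(tasks, engines):
    """Assign WAITING tasks to engines one message at a time.

    tasks: list of (execution_created_at, task_name) tuples (hypothetical shape).
    engines: list of engine names; round-robin stands in for RPC delivery.
    Returns a list of (engine, task_name) in the order tasks are started.
    """
    order = itertools.count()  # tie-breaker keeps insertion order stable
    queue = []
    for created_at, name in tasks:
        # Older executions get a smaller key, so they are popped first.
        heapq.heappush(queue, (created_at, next(order), name))

    engine_cycle = itertools.cycle(engines)
    assignment = []
    while queue:
        _, _, name = heapq.heappop(queue)
        assignment.append((next(engine_cycle), name))
    return assignment


def load_per_engine(assignment):
    """Count how many tasks each engine received."""
    return Counter(engine for engine, _ in assignment)
```

Because every task becomes a separate unit of work, no single engine has to start the whole clause on its own, and the priority key is what lets an older execution finish before a newer one.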
akovi	the action execution reporting may fail because of custom actions, unfortunately	08:33
rakhmerov	akovi: hi! how's it going?	08:33
rakhmerov	custom actions?	08:33
rakhmerov	ad-hoc you mean?	08:33
akovi	if the reporter thread (green) does not get time to run, timeouts will happen in the engine and the execution will be closed	08:33
vgvoleg_	we use only std.noop and std.echo in our test case	08:34
akovi	we found such an error in one of our custom actions that listened to the output of a forked process	08:34
rakhmerov	ok	08:35
vgvoleg_	btw we have tried to use random_delay in this job, but deadlocks appear quite often	08:36
rakhmerov	vgvoleg_: it's not going to help much	08:38
rakhmerov	we've already proven that	08:38
rakhmerov	I have to admit that it was a stupid attempt to help mitigate a bad scheduler architecture	08:38
rakhmerov	that whole thing is going to change pretty soon	08:38
akovi	large context can also cause issues in RPC processing	08:39
rakhmerov	yes	08:39
vgvoleg_	yes, but action execution reporter works with DB	08:39
akovi	I tried compressing the data (with lzo and then lz4) but the results were not conclusive	08:40
akovi	no	08:40
akovi	the reporter has a list of the running actions	08:40
akovi	this list is sent to the engine over RPC in given intervals	08:41
akovi	the DB update happens in the engine	08:41
vgvoleg_	oh I got it	08:41
akovi	could be moved to the executor (this happened accidentally once)	08:41
akovi	but that would mean the executor has to have direct DB access	08:41
akovi	which was not required earlier	08:42
vgvoleg_	I'll try to research it (the problem appeared yesterday)	08:42
rakhmerov	yeah, we've always tried to avoid that for several reasons	08:42
akovi	if the RPC channels are overloaded and messages pile up, that can cause heartbeat misses too	08:42
akovi	action heartbeating should probably be a last resort actually	08:44
akovi	to close an execution	08:44
akovi	giving a fair amount of time for processing is a good practice	08:45
akovi	may cause crashes to be processed later but the overall stability improves significantly	08:45
akovi	30 sec reporting and 10 misses should be ok	08:46
rakhmerov	akovi: yes, it's been working for us pretty well so far	08:47
rakhmerov	no complaints	08:47
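The "30 sec reporting and 10 misses" rule of thumb above amounts to failing an action only after roughly 300 seconds of silence. A minimal sketch of that check, assuming timestamps in seconds; the function and parameter names are hypothetical and are not Mistral configuration options:

```python
def missed_heartbeats(last_heartbeat, now, report_interval=30.0):
    """Number of whole reporting intervals that passed without a heartbeat."""
    elapsed = max(0.0, now - last_heartbeat)
    return int(elapsed // report_interval)


def should_fail_action(last_heartbeat, now, report_interval=30.0, max_missed=10):
    """True once the action has missed max_missed consecutive heartbeats."""
    return missed_heartbeats(last_heartbeat, now, report_interval) >= max_missed
```

A generous threshold like this means a crashed executor is noticed later, but short stalls in the reporter or in an overloaded RPC channel do not close healthy executions.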
rakhmerov	vgvoleg: so, please complete those two action items (creating blueprints)	08:49
rakhmerov	I'll provide you details on Rabbit connectivity and detecting leaks in about an hour	08:49
vgvoleg_	thank you :)	08:50
rakhmerov	sure thing	08:50
rakhmerov	ok, I guess we've discussed what we had	08:52
rakhmerov	I'll close the meeting now (the logs will be available online)	08:53
akovi	(thumbs up)	08:53
rakhmerov	#endmeeting	08:53
*** openstack changes topic to "Mistral the Workflow Service for OpenStack. https://docs.openstack.org/mistral/latest/"		08:53
openstack	Meeting ended Wed May 29 08:53:20 2019 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)	08:53
openstack	Minutes: http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.html	08:53
openstack	Minutes (text): http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.txt	08:53
openstack	Log: http://eavesdrop.openstack.org/meetings/mistral/2019/mistral.2019-05-29-08.02.log.html	08:53
akovi	thank you Renat!	08:53
rakhmerov	:)	08:53
rakhmerov	thank you for joining )	08:53
rakhmerov	akovi: do you know if this bug is still relevant? https://bugs.launchpad.net/mistral/+bug/1812170	08:58
openstack	Launchpad bug 1812170 in Mistral "Flaky unittest: mistral.tests.unit.engine.test_workflow_cancel.WorkflowCancelTest" [Undecided,New]	08:58
rakhmerov	I haven't seen such failures for a while	08:58
akovi	if it's gone, that's good :)	09:10
akovi	These are very much load sensitive issues	09:10
akovi	these are usually more frequent around the release deadlines	09:11
rakhmerov	ok	09:15
rakhmerov	I'll leave it for now then )	09:15
rakhmerov	vgvoleg: hey, I used this article for finding memory leaks: https://www.darkcoding.net/software/finding-memory-leaks-in-python-with-objgraph/	09:40
rakhmerov	as for the RabbitMQ connection, you need to take these steps:	09:41
rakhmerov	1. Remove the group [oslo_messaging_rabbit] completely with all config options	09:41
rakhmerov	2. Add the property "transport_url = rabbit://..." under the default group	09:41
rakhmerov	3. Under the group [oslo_messaging_amqp] add "default_reply_retry = -1"	09:42
rakhmerov	#3 is very important because like I said, the value for this property is now 0 by default so we have to change it explicitly	09:42
rakhmerov	-1 means indefinitely	09:43
rakhmerov	you can choose your own value	09:43
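The three steps above could be applied to a configuration file with a short script such as the one below. It is a sketch that assumes the file is plain INI that Python's `configparser` can read (oslo.config files with repeated keys would need different handling); the `apply_rabbit_fix` helper is hypothetical, and the transport URL passed to it is whatever value fits the deployment.

```python
import configparser
import io


def apply_rabbit_fix(conf_text, transport_url, reply_retry=-1):
    """Return conf_text with the three RabbitMQ steps applied."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(conf_text)

    # Step 1: drop the [oslo_messaging_rabbit] group with all its options.
    parser.remove_section("oslo_messaging_rabbit")

    # Step 2: set transport_url under the default group.
    parser["DEFAULT"]["transport_url"] = transport_url

    # Step 3: set default_reply_retry explicitly (-1 means retry indefinitely).
    if not parser.has_section("oslo_messaging_amqp"):
        parser.add_section("oslo_messaging_amqp")
    parser["oslo_messaging_amqp"]["default_reply_retry"] = str(reply_retry)

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Passing a different `reply_retry` covers the "choose your own value" case; the function only rewrites text, so the result still has to be written back to the actual configuration file and the services restarted.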
vgvoleg_	thank you so much	09:47
*** threestrands has quit IRC		09:54
openstackgerrit	Mike Fedosin proposed openstack/mistral master: Fix adhoc action lookup https://review.opendev.org/661959	12:18
openstackgerrit	Mike Fedosin proposed openstack/mistral master: Fix adhoc action lookup https://review.opendev.org/661959	13:03
*** akovi has quit IRC		13:05
*** mmethot_ has joined #openstack-mistral		13:53
*** mmethot has quit IRC		13:54
openstackgerrit	abner zhao proposed openstack/mistral master: feat: Adds workflow input dict Chinese parameter support https://review.opendev.org/661991	14:00
*** ricolin has quit IRC		14:20
*** vgvoleg_ has quit IRC		14:21
*** ricolin has joined #openstack-mistral		14:26
openstackgerrit	Mike Fedosin proposed openstack/mistral master: Fix adhoc action lookup https://review.opendev.org/661959	15:14
*** altlogbot_2 has quit IRC		15:35
*** altlogbot_3 has joined #openstack-mistral		15:36
*** irclogbot_3 has quit IRC		15:36
*** irclogbot_0 has joined #openstack-mistral		15:37
*** ricolin has quit IRC		16:42
*** jtomasek has quit IRC		20:55
openstackgerrit	Mike Fedosin proposed openstack/mistral master: Rework finding indirectly affected created joins https://review.opendev.org/662102	22:17