*** zhurong has joined #senlin | 00:39 | |
*** XueFeng has joined #senlin | 01:06 | |
*** dixiaoli has joined #senlin | 01:33 | |
*** yanyanhu has joined #senlin | 01:35 | |
xuhaiwei | morning Qiming, yanyanhu and others, about the concurrency problem I reported the day before yesterday, I found there is a problem here https://github.com/openstack/senlin/blob/master/senlin/engine/scheduler.py#L140 | 01:43 |
xuhaiwei | because when the concurrency happened, one action was executed and the other went to READY status; that action is supposed to run later, but the scheduler logic finishes checking the action's status before it becomes READY | 01:45 |
xuhaiwei | so the READY action is left unexecuted, even though the former one has released the lock | 01:46 |
yanyanhu | hi, xuhaiwei, that action will finally get a chance to run when the next scheduling event comes | 01:48 |
*** yuanbin has quit IRC | 01:48 | |
yanyanhu | although there is a risk the action could fail if no more scheduling events come | 01:49 |
*** yuanbin has joined #senlin | 01:49 | |
yanyanhu | I mean in the case where no new action is created | 01:49 |
xuhaiwei | yanyanhu, but if no other scheduling event comes for a long time, will the action just be left there indefinitely? | 01:49 |
yanyanhu | xuhaiwei, yes, currently it is | 01:50 |
yanyanhu | so maybe we can add a periodic task to trigger action scheduling internally | 01:51 |
yanyanhu | to avoid this issue | 01:51 |
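The periodic re-scheduling idea floated above could look roughly like the sketch below. It is only an illustration, not Senlin's scheduler code; `fetch_ready_action_ids` and `start_action` are hypothetical stand-ins for the engine's DB query and dispatcher notification.

```python
# Illustrative sketch only (not Senlin code): periodically re-dispatch READY
# actions so they are not stranded waiting for the next scheduling event.
import threading


def start_periodic_rescan(interval, fetch_ready_action_ids, start_action):
    """Every `interval` seconds, re-trigger scheduling for READY actions.

    `fetch_ready_action_ids` and `start_action` are placeholders for the
    engine's DB query and dispatcher call respectively.
    """
    def _tick():
        for action_id in fetch_ready_action_ids():
            start_action(action_id)
        timer = threading.Timer(interval, _tick)
        timer.daemon = True
        timer.start()

    _tick()
```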
XueFeng | hi, haiwei,yanyanhu | 01:51 |
XueFeng | else: # result == self.RES_RETRY: | 01:51 |
XueFeng | status = self.READY | 01:51 |
XueFeng | # Action failed at the moment, but can be retried | 01:51 |
XueFeng | # We abandon it and then notify other dispatchers to execute it | 01:51 |
XueFeng | ao.Action.abandon(self.context, self.id) | 01:51 |
xuhaiwei | so the scheduler should be monitoring the action pool all the time, instead of only starting to work when triggered by someone | 01:51 |
xuhaiwei | yanyanhu, exactly | 01:51 |
yanyanhu | oh, check what XueFeng posted | 01:52 |
xuhaiwei | XueFeng, 'abandon' doesn't mean abandoning the action | 01:52 |
xuhaiwei | # We abandon it and then notify other dispatchers to execute it | 01:53 |
XueFeng | yes, the status is READY | 01:53 |
yanyanhu | XueFeng, could you please paste the link to that code section | 01:53 |
XueFeng | and the reason string is about abandoning | 01:53 |
XueFeng | yanyanhu | 01:53 |
XueFeng | ok | 01:53 |
xuhaiwei | yanyanhu https://github.com/openstack/senlin/blob/master/senlin/db/sqlalchemy/api.py#L1220 | 01:54 |
yanyanhu | let me take a look :) if the comment is accurate, the problem haiwei mentioned won't exist | 01:54 |
XueFeng | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n316 | 01:54 |
XueFeng | here | 01:55 |
yanyanhu | oh, this is different | 01:55 |
XueFeng | Currently, we notify nothing | 01:55 |
yanyanhu | it is just for a worker/engine to abandon an action that it has locked itself | 01:56 |
XueFeng | So, the READY action will be picked up when the next action comes | 01:56 |
yanyanhu | and then allow other workers to pick it up for scheduling | 01:56 |
yanyanhu | XueFeng, they are two different cases I think | 01:56 |
yanyanhu | they are for two different cases | 01:56 |
XueFeng | ok | 01:56 |
yanyanhu | action_abandon is to avoid action deadlock I guess | 01:57 |
yanyanhu | while the problem xuhaiwei mentioned is more about scheduling | 01:57 |
XueFeng | I remember I met the problem when I did cluster_resize | 01:58 |
yanyanhu | once an action is acquired and locked by a worker/engine, no other engine or worker can acquire it anymore. So there could be cases where the action owner wants to give up the action and allow other workers to acquire it | 01:58 |
yanyanhu | I guess this db api is for this purpose | 01:58 |
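The acquire/abandon semantics described above can be pictured with a toy in-memory model: one owner per action, and abandoning clears ownership so another worker may take the action later. This is only a conceptual sketch, not the SQLAlchemy implementation behind the linked db api.

```python
# Toy in-memory model of action ownership: acquire locks an action to one
# worker; abandon releases it so another worker may acquire it later.
_action_owner = {}  # action_id -> worker_id


def acquire(action_id, worker_id):
    """Try to take ownership of an action; return True on success."""
    if _action_owner.get(action_id) is None:
        _action_owner[action_id] = worker_id
        return True
    return False


def abandon(action_id):
    """Give up ownership so other workers can acquire the action."""
    _action_owner.pop(action_id, None)
```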
xuhaiwei | yanyanhu: I think the scheduling logic needs to be modified | 01:58 |
yanyanhu | xuhaiwei, there can be an improvement I think | 01:59 |
XueFeng | Two cluster_resize commands coming in back to back | 01:59 |
yanyanhu | XueFeng, you mean the same issue xuhaiwei met? | 02:00 |
XueFeng | Not sure | 02:02 |
yanyanhu | could be, if you observed a READY action hanging there without being scheduled for a long time :) | 02:02 |
XueFeng | I think we have problem in http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n316 | 02:02 |
*** openstackgerrit has joined #senlin | 02:03 | |
*** ChanServ sets mode: +v openstackgerrit | 02:03 | |
openstackgerrit | RUIJIE YUAN proposed openstack/senlin master: handle node which status is WARNING https://review.openstack.org/455542 | 02:03 |
XueFeng | yanyanhu, yes | 02:03 |
XueFeng | It's easy to reproduce | 02:03 |
XueFeng | We expect another senlin engine to schedule the action | 02:04 |
XueFeng | But we only run one engine worker most of the time | 02:04 |
yanyanhu | so maybe simply adding "dispatcher.start_action(action_id)" after this line can address the issue? | 02:05 |
yanyanhu | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n316 | 02:05 |
yanyanhu | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n320 | 02:05 |
yanyanhu | this line | 02:05 |
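The change being discussed here, abandoning the RES_RETRY action back to READY and immediately waking the dispatchers instead of waiting for the next scheduling event, might be sketched in self-contained form as follows. `abandon_action` and `notify_dispatchers` are hypothetical stand-ins for ao.Action.abandon() and dispatcher.start_action(); this is not the actual patch.

```python
# Self-contained sketch of the proposed RES_RETRY handling; the two callables
# stand in for ao.Action.abandon() and dispatcher.start_action() in Senlin.
RES_OK, RES_ERROR, RES_RETRY = 'OK', 'ERROR', 'RETRY'


def complete_action(action_id, result, abandon_action, notify_dispatchers):
    """Map an execution result to a new status, re-firing retried actions."""
    if result == RES_OK:
        return 'SUCCEEDED'
    if result == RES_ERROR:
        return 'FAILED'
    # result == RES_RETRY: release the action so any worker can re-acquire it,
    # then proactively notify dispatchers (the proposed addition) rather than
    # waiting for the next scheduling event to come along.
    abandon_action(action_id)
    notify_dispatchers(action_id)
    return 'READY'
```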
XueFeng | Maybe not:) | 02:06 |
XueFeng | we reschedule the action, but it can also fail again | 02:06 |
xuhaiwei | yanyanhu: it is meaningless to set an action to READY status if you do so | 02:06 |
XueFeng | and maybe it will go into an endless loop | 02:07 |
yanyanhu | yes. but once the code execution reaches line 320, the action will be freed and re-fired again | 02:07 |
yanyanhu | XueFeng, that could be. But that means the target object (cluster/node) is always locked | 02:07 |
yanyanhu | there is nothing we can do about that I guess | 02:08 |
XueFeng | right, so we can try the idea | 02:08 |
XueFeng | :) | 02:08 |
yanyanhu | currently, we don't promise strictly time-ordered operations in senlin | 02:08 |
xuhaiwei | yanyanhu: you mean if an action fails, we try it again and again until it gets the lock to run? | 02:08 |
yanyanhu | users should be aware of this point | 02:08 |
yanyanhu | xuhaiwei, yes. If the user doesn't control it | 02:09 |
yanyanhu | by checking the action or operation target status themselves | 02:09 |
xuhaiwei | yanyanhu, it may work, but it's not an intelligent way | 02:09 |
yanyanhu | xuhaiwei, yea, it's not perfect. It's just that, over the internet, no one knows which API request REALLY comes first and will be handled first, especially when you have multiple API service instances running | 02:11 |
yanyanhu | so we may expect users to have some logic on their side to arrange the operation sequence | 02:11 |
xuhaiwei | yanyanhu, if the user writes the senlin resource in a heat template, they can't control the sequence | 02:12 |
yanyanhu | e.g. if you request multiple cluster scaling operations at the same time, you won't know which one will get executed first... | 02:12 |
yanyanhu | although the final result should be the same | 02:12 |
yanyanhu | xuhaiwei, yes. For a limited number of operations (with random ordering), senlin should guarantee that the final result is consistent | 02:13 |
yanyanhu | however, we can't ensure the order in which each operation happens | 02:13 |
xuhaiwei | yes, that's not that important | 02:14 |
yanyanhu | so if you do file LOTS of operations targeting the same cluster/node, some of them could wait a long while before getting a chance to be executed | 02:14 |
yanyanhu | an alternative is adding timeout logic for actions | 02:15 |
yanyanhu | not the current timeout, which only takes effect after the action is scheduled | 02:15 |
yanyanhu | we log the timestamp when the action becomes READY; once the elapsed time exceeds a threshold, e.g. 24 hours, we mark the action as failed | 02:16 |
yanyanhu | however, this could be inappropriate in cases where an action genuinely takes a long time to finish | 02:16 |
yanyanhu | so this can be an option | 02:16 |
yanyanhu | I mean a configuration option | 02:17 |
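The "ready-action timeout" idea above could, as a rough hypothetical sketch, look like the snippet below; the threshold constant and helper are invented for illustration and are not existing Senlin options or APIs.

```python
# Hypothetical sketch: fail READY actions that have waited longer than a
# configurable threshold. Not an existing Senlin option or API.
import time

READY_ACTION_TIMEOUT = 24 * 3600  # seconds; imagined as a config option


def expire_stale_ready_actions(ready_actions, now=None):
    """Return IDs of READY actions whose wait exceeded the threshold.

    `ready_actions` is an iterable of (action_id, ready_since) pairs, where
    `ready_since` is a Unix timestamp recorded when the action became READY.
    """
    now = time.time() if now is None else now
    return [action_id for action_id, ready_since in ready_actions
            if now - ready_since > READY_ACTION_TIMEOUT]
```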
xuhaiwei | so your suggestion is to add 'try again' logic when the action can't get the lock? | 02:18 |
XueFeng | yanyanhu, xuhaiwei, action_acquire_random_ready may need to be optimized | 02:19 |
yanyanhu | xuhaiwei, yes, that could help address the issue you met. But for an action waiting a long time without being scheduled, we may need another solution | 02:19 |
yanyanhu | XueFeng, you mean? | 02:20 |
xuhaiwei | yanyanhu, I would suggest improving the scheduling logic so the READY action gets executed | 02:21 |
yanyanhu | xuhaiwei, yes, that will also work. It just still can't resolve the problem of an action waiting too long to be scheduled... | 02:21 |
yanyanhu | e.g. if a cluster is locked for 24 hours for some reason, e.g. scaling or maintenance, any other action targeting it will keep failing until the in-progress action finishes | 02:23 |
yanyanhu | keep failing and retrying | 02:23 |
XueFeng | We pick a READY action randomly. For a cluster/node, we could pick it by create time | 02:25 |
XueFeng | Add the target and create_time | 02:26 |
yanyanhu | XueFeng, you mean ordering the actions by their timestamps first? | 02:27 |
yanyanhu | then picking up the oldest one | 02:27 |
XueFeng | When an action is running, the cluster or node is locked. | 02:30 |
XueFeng | And later actions on the cluster/node can't get the lock, and may go back to READY | 02:31 |
XueFeng | Then for that target we'd better pick the action by create time | 02:31 |
yanyanhu | XueFeng, we did consider this approach before. However, it could cause another problem: if the oldest action keeps failing and retrying, younger actions could never get a chance to be scheduled... | 02:32 |
yanyanhu | we acquire a READY action randomly to avoid this issue, since we have no "Action Queue" here | 02:33 |
yanyanhu | unless we update the action timestamp before putting it back into the DB | 02:34 |
yanyanhu | if so, we may need to add an extra timestamp for each different status | 02:34 |
yanyanhu | e.g. a timestamp for when the action becomes ready/failed/succeeded/init | 02:35 |
yanyanhu | etc. | 02:35 |
yanyanhu | currently, we only have created_at and updated_at | 02:35 |
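To make the trade-off concrete: the current behaviour amounts to picking a READY action at random, roughly as in the simplified helper below, because a strict oldest-first policy would let a perpetually retrying action starve every younger action behind it. The real logic lives in the DB layer (the action_acquire_random_ready api mentioned above) and also handles locking; this is only a sketch.

```python
# Simplified illustration of random READY-action selection; the real DB api
# also locks the selected row to the acquiring worker.
import random


def pick_ready_action(ready_action_ids):
    """Pick one READY action id at random, or None when the list is empty."""
    return random.choice(ready_action_ids) if ready_action_ids else None
```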
XueFeng | Yes, another problem would come up | 02:36 |
yanyanhu | yes... so my suggestion is re-triggering the action scheduling after we mark it READY at the following position, or, as xuhaiwei suggested, optimizing our scheduler to add periodic scheduling logic | 02:37 |
yanyanhu | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n320 | 02:38 |
XueFeng | We can do this first | 02:38 |
yanyanhu | currently, the senlin scheduler works in a way that combines both tickless and event-driven scheduling | 02:39 |
yanyanhu | XueFeng, yes | 02:39 |
yanyanhu | then we can consider how to better handle the situation where a READY action waits too long to be scheduled | 02:39 |
XueFeng | Yes | 02:40 |
XueFeng | And thinking about it again, waiting too long to be scheduled may not happen frequently | 02:41 |
XueFeng | Here it was rescheduled because the action couldn't get the target lock. And once it gets the lock, it will run to success/failure... | 02:43 |
yanyanhu | XueFeng, uhm, yes, it could happen, depending on the use case. But anyway, the user can handle it as well by checking the action and target cluster/node status | 02:43 |
XueFeng | ok, we can try this way first | 02:44 |
XueFeng | I will do a test now | 02:45 |
yanyanhu | XueFeng, great, thanks a lot :) | 02:45 |
XueFeng | my pleasure:) | 02:46 |
XueFeng | root@tecs:/home/openstack/devstack# openstack cluster resize mycluster1 --capacity 2 | 02:49 |
XueFeng | Request accepted by action: e43d5b4e-fd6a-409b-adab-5ecec25c84ef | 02:49 |
XueFeng | root@tecs:/home/openstack/devstack# openstack cluster resize mycluster1 --capacity 3 | 02:49 |
XueFeng | Request accepted by action: e1863376-70cd-4605-9611-b99dc546be6a | 02:49 |
XueFeng | It's easy to reproduce now | 02:50 |
XueFeng | And I will change the code to see the effect | 02:51 |
openstackgerrit | Qiming Teng proposed openstack/senlin master: Fix ovo object for requests https://review.openstack.org/455891 | 03:02 |
*** zhurong has quit IRC | 04:06 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/python-senlinclient master: Updated from global requirements https://review.openstack.org/454535 | 04:22 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/senlin master: Updated from global requirements https://review.openstack.org/456018 | 04:22 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/senlin-dashboard master: Updated from global requirements https://review.openstack.org/456019 | 04:23 |
*** zhurong has joined #senlin | 04:34 | |
openstackgerrit | XueFeng Liu proposed openstack/senlin master: Fix scheduing problem about abandon action https://review.openstack.org/456048 | 05:08 |
*** shu-mutou-AWAY is now known as shu-mutou | 05:45 | |
openstackgerrit | XueFeng Liu proposed openstack/senlin master: Fix scheduing problem about abandon action https://review.openstack.org/456048 | 06:10 |
*** yuanying_ has joined #senlin | 06:48 | |
*** yuanying has quit IRC | 06:48 | |
Qiming | @everyone, py35 jobs are now demoted to non-voting | 07:44 |
Qiming | it may take some time for this change to propagate to all CI nodes; after that we won't get blocked by the py35 jobs for critical patches | 07:45 |
Qiming | we can fix the py35 job later when the root cause is identified | 07:45 |
XueFeng | ok, got it | 08:05 |
openstackgerrit | RUIJIE YUAN proposed openstack/senlin master: revise engine cluster obj to update runtime data https://review.openstack.org/456133 | 09:38 |
openstackgerrit | Merged openstack/senlin master: fix node do_check invalid code https://review.openstack.org/455575 | 09:40 |
openstackgerrit | Qiming Teng proposed openstack/senlin master: Pike-1 release notes https://review.openstack.org/456146 | 10:10 |
*** yanyanhu has quit IRC | 10:50 | |
*** dixiaoli has quit IRC | 11:06 | |
*** zhurong has quit IRC | 11:26 | |
openstackgerrit | yangyide proposed openstack/senlin master: Improve check_object for health_policy_poll recover https://review.openstack.org/456187 | 11:50 |
openstackgerrit | yangyide proposed openstack/senlin master: Improve check_object for health_policy_poll recover https://review.openstack.org/456187 | 11:52 |
openstackgerrit | Merged openstack/senlin master: Fix scheduing problem about abandon action https://review.openstack.org/456048 | 11:53 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/senlin master: Updated from global requirements https://review.openstack.org/456018 | 11:54 |
openstackgerrit | Shu Muto proposed openstack/senlin-dashboard master: [DNM] Fix test environments https://review.openstack.org/456189 | 11:55 |
*** shu-mutou is now known as shu-mutou-AWAY | 11:55 | |
openstackgerrit | Shu Muto proposed openstack/senlin-dashboard master: [DNM] Fix test environments https://review.openstack.org/456189 | 12:09 |
*** catintheroof has joined #senlin | 13:27 | |
*** rate has joined #senlin | 13:28 | |
*** zhurong has joined #senlin | 13:46 | |
*** zhurong_ has joined #senlin | 13:59 | |
*** zhurong has quit IRC | 14:01 | |
*** rate has quit IRC | 14:13 | |
*** rate has joined #senlin | 14:19 | |
*** rate has quit IRC | 14:48 | |
*** rate has joined #senlin | 14:55 | |
*** zhurong_ has quit IRC | 14:57 | |
*** rate has quit IRC | 15:06 | |
*** rate has joined #senlin | 15:07 | |
*** rate has quit IRC | 15:11 | |
openstackgerrit | Merged openstack/senlin master: Updated from global requirements https://review.openstack.org/456018 | 15:50 |
-openstackstatus- NOTICE: Restarting Gerrit for our weekly memory leak cleanup. | 21:27 | |
*** Qiming has quit IRC | 21:41 | |
*** Qiming has joined #senlin | 21:46 | |
*** catintheroof has quit IRC | 22:48 |