*** zhurong has joined #senlin | 00:39 | |
*** XueFeng has joined #senlin | 01:06 | |
*** dixiaoli has joined #senlin | 01:33 | |
*** yanyanhu has joined #senlin | 01:35 | |
xuhaiwei | morning Qiming, yanyanhu and others, about the concurrency problem I reported the day before yesterday, I found there is a problem here https://github.com/openstack/senlin/blob/master/senlin/engine/scheduler.py#L140 | 01:43 |
xuhaiwei | because when the concurrency happened, one action was executed and the other went to READY status; that action is supposed to run later, but the scheduler logic finishes checking the action's status before it becomes READY | 01:45 |
xuhaiwei | so the READY action is left unexecuted, even though the former one has released the lock | 01:46 |
yanyanhu | hi, xuhaiwei, that action will finally get a chance to run when the next scheduling event comes | 01:48 |
*** yuanbin has quit IRC | 01:48 | |
yanyanhu | although there is a risk the action could fail if no more scheduling events come | 01:49 |
*** yuanbin has joined #senlin | 01:49 | |
yanyanhu | I mean in the case where no new action is created | 01:49 |
xuhaiwei | yanyanhu, but if no other scheduling event comes for a long time, will the action just be left there indefinitely? | 01:49 |
yanyanhu | xuhaiwei, yes, currently it is | 01:50 |
yanyanhu | so maybe we can add a periodic task to trigger action scheduling internally | 01:51 |
yanyanhu | to avoid this issue | 01:51 |
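The periodic re-scheduling idea floated above could look roughly like the sketch below. It is only an illustration, not Senlin's scheduler code; `fetch_ready_action_ids` and `start_action` are hypothetical stand-ins for the engine's DB query and dispatcher notification.

```python
# Illustrative sketch only (not Senlin code): periodically re-dispatch READY
# actions so they are not stranded waiting for the next scheduling event.
import threading


def start_periodic_rescan(interval, fetch_ready_action_ids, start_action):
    """Every `interval` seconds, re-trigger scheduling for READY actions.

    `fetch_ready_action_ids` and `start_action` are placeholders for the
    engine's DB query and dispatcher call respectively.
    """
    def _tick():
        for action_id in fetch_ready_action_ids():
            start_action(action_id)
        timer = threading.Timer(interval, _tick)
        timer.daemon = True
        timer.start()

    _tick()
```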
XueFeng | hi, haiwei,yanyanhu | 01:51 |
XueFeng | else: # result == self.RES_RETRY: | 01:51 |
XueFeng | status = self.READY | 01:51 |
XueFeng | # Action failed at the moment, but can be retried | 01:51 |
XueFeng | # We abandon it and then notify other dispatchers to execute it | 01:51 |
XueFeng | ao.Action.abandon(self.context, self.id) | 01:51 |
xuhaiwei | so the scheduler should be monitoring the action pool all the time, instead of only starting to work when triggered by someone | 01:51 |
xuhaiwei | yanyanhu, exactly | 01:51 |
yanyanhu | oh, check what XueFeng posted | 01:52 |
xuhaiwei | XueFeng, 'abandon' doesn't mean abandoning the action | 01:52 |
xuhaiwei | # We abandon it and then notify other dispatchers to execute it | 01:53 |
XueFeng | yes, the status is READY | 01:53 |
yanyanhu | XueFeng, could you please paste the link to that code section | 01:53 |
XueFeng | and the reason string is about abandoning | 01:53 |
XueFeng | yanyanhu | 01:53 |
XueFeng | ok | 01:53 |
xuhaiwei | yanyanhu https://github.com/openstack/senlin/blob/master/senlin/db/sqlalchemy/api.py#L1220 | 01:54 |
yanyanhu | let me take a look :) if the comment is accurate, the problem haiwei mentioned won't exist | 01:54 |
XueFeng | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n316 | 01:54 |
XueFeng | here | 01:55 |
yanyanhu | oh, this is different | 01:55 |
XueFeng | Currently, we notify nothing | 01:55 |
yanyanhu | it is just for a worker/engine to abandon an action that it has locked itself | 01:56 |
XueFeng | So, the READY action will be picked up when the next action comes | 01:56 |
yanyanhu | and then allow other workers to pick it up for scheduling | 01:56 |
yanyanhu | XueFeng, they are two different cases I think | 01:56 |
yanyanhu | they are for two different cases | 01:56 |
XueFeng | ok | 01:56 |
yanyanhu | action_abandon is to avoid action deadlock I guess | 01:57 |
yanyanhu | while the problem xuhaiwei mentioned is more about scheduling | 01:57 |
XueFeng | I remember I met the problem when I did cluster_resize | 01:58 |
yanyanhu | once an action is acquired and locked by a worker/engine, no other engine or worker can acquire it anymore. So there could be cases where the action owner wants to give up the action and allow other workers to acquire it | 01:58 |
yanyanhu | I guess this db api is for this purpose | 01:58 |
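The acquire/abandon semantics described above can be pictured with a toy in-memory model: one owner per action, and abandoning clears ownership so another worker may take the action later. This is only a conceptual sketch, not the SQLAlchemy implementation behind the linked db api.

```python
# Toy in-memory model of action ownership: acquire locks an action to one
# worker; abandon releases it so another worker may acquire it later.
_action_owner = {}  # action_id -> worker_id


def acquire(action_id, worker_id):
    """Try to take ownership of an action; return True on success."""
    if _action_owner.get(action_id) is None:
        _action_owner[action_id] = worker_id
        return True
    return False


def abandon(action_id):
    """Give up ownership so other workers can acquire the action."""
    _action_owner.pop(action_id, None)
```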
xuhaiwei | yanyanhu: I think the scheduling logic needs to be modified | 01:58 |
yanyanhu | xuhaiwei, there can be an improvement I think | 01:59 |
XueFeng | Two cluster_resize commands coming in back to back | 01:59 |
yanyanhu | XueFeng, you mean the same issue xuhaiwei met? | 02:00 |
XueFeng | Not sure | 02:02 |
yanyanhu | could be, if you observed a READY action hanging there without being scheduled for a long time :) | 02:02 |
XueFeng | I think we have problem in http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n316 | 02:02 |
*** openstackgerrit has joined #senlin | 02:03 | |
*** ChanServ sets mode: +v openstackgerrit | 02:03 | |
openstackgerrit | RUIJIE YUAN proposed openstack/senlin master: handle node which status is WARNING https://review.openstack.org/455542 | 02:03 |
XueFeng | yanyanhu, yes | 02:03 |
XueFeng | It's easy to reproduce | 02:03 |
XueFeng | We expect another senlin engine to schedule the action | 02:04 |
XueFeng | But we only run one engine worker most of the time | 02:04 |
yanyanhu | so maybe simply adding "dispatcher.start_action(action_id)" after this line can address the issue? | 02:05 |
yanyanhu | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n316 | 02:05 |
yanyanhu | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n320 | 02:05 |
yanyanhu | this line | 02:05 |
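The change being discussed here, abandoning the RES_RETRY action back to READY and immediately waking the dispatchers instead of waiting for the next scheduling event, might be sketched in self-contained form as follows. `abandon_action` and `notify_dispatchers` are hypothetical stand-ins for ao.Action.abandon() and dispatcher.start_action(); this is not the actual patch.

```python
# Self-contained sketch of the proposed RES_RETRY handling; the two callables
# stand in for ao.Action.abandon() and dispatcher.start_action() in Senlin.
RES_OK, RES_ERROR, RES_RETRY = 'OK', 'ERROR', 'RETRY'


def complete_action(action_id, result, abandon_action, notify_dispatchers):
    """Map an execution result to a new status, re-firing retried actions."""
    if result == RES_OK:
        return 'SUCCEEDED'
    if result == RES_ERROR:
        return 'FAILED'
    # result == RES_RETRY: release the action so any worker can re-acquire it,
    # then proactively notify dispatchers (the proposed addition) rather than
    # waiting for the next scheduling event to come along.
    abandon_action(action_id)
    notify_dispatchers(action_id)
    return 'READY'
```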
XueFeng | Maybe not:) | 02:06 |
XueFeng | we reschedule the action, but it can also fail again | 02:06 |
xuhaiwei | yanyanhu: it is meaningless to set an action to READY status if you do so | 02:06 |
XueFeng | and maybe it will go into an endless loop | 02:07 |
yanyanhu | yes. but once the code execution reaches line 320, the action will be freed and re-fired again | 02:07 |
yanyanhu | XueFeng, that could be. But that means the target object (cluster/node) is always locked | 02:07 |
yanyanhu | there is nothing we can do about that I guess | 02:08 |
XueFeng | right, so we can try the idea | 02:08 |
XueFeng | :) | 02:08 |
yanyanhu | currently, we don't promise strictly time-ordered operations in senlin | 02:08 |
xuhaiwei | yanyanhu: you mean if an action fails, we try it again and again until it gets the lock to run? | 02:08 |
yanyanhu | users should be aware of this point | 02:08 |
yanyanhu | xuhaiwei, yes. If the user doesn't control it | 02:09 |
yanyanhu | by checking the action or operation target status themselves | 02:09 |
xuhaiwei | yanyanhu, it may work, but it's not an intelligent way | 02:09 |
yanyanhu | xuhaiwei, yea, it's not perfect. It's just that, over the internet, no one knows which API request REALLY comes first and will be handled first, especially when you have multiple API service instances running | 02:11 |
yanyanhu | so we may expect users to have some logic on their side to arrange the operation sequence | 02:11 |
xuhaiwei | yanyanhu, if the user writes the senlin resource in a heat template, they can't control the sequence | 02:12 |
yanyanhu | e.g. if you request multiple cluster scaling operations at the same time, you won't know which one will get executed first... | 02:12 |
yanyanhu | although the final result should be the same | 02:12 |
yanyanhu | xuhaiwei, yes. For a limited number of operations (with random ordering), senlin should guarantee that the final result is consistent | 02:13 |
yanyanhu | however, we can't ensure the order in which each operation happens | 02:13 |
xuhaiwei | yes, that's not that important | 02:14 |
yanyanhu | so if you do file LOTS of operations targeting the same cluster/node, some of them could wait a long while before getting a chance to be executed | 02:14 |
yanyanhu | an alternative is adding timeout logic for actions | 02:15 |
yanyanhu | not the current timeout, which only takes effect after the action is scheduled | 02:15 |
yanyanhu | we log the timestamp when the action becomes READY; once the elapsed time exceeds a threshold, e.g. 24 hours, we mark the action as failed | 02:16 |
yanyanhu | however, this could be inappropriate in cases where an action genuinely takes a long time to finish | 02:16 |
yanyanhu | so this can be an option | 02:16 |
yanyanhu | I mean a configuration option | 02:17 |
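The "ready-action timeout" idea above could, as a rough hypothetical sketch, look like the snippet below; the threshold constant and helper are invented for illustration and are not existing Senlin options or APIs.

```python
# Hypothetical sketch: fail READY actions that have waited longer than a
# configurable threshold. Not an existing Senlin option or API.
import time

READY_ACTION_TIMEOUT = 24 * 3600  # seconds; imagined as a config option


def expire_stale_ready_actions(ready_actions, now=None):
    """Return IDs of READY actions whose wait exceeded the threshold.

    `ready_actions` is an iterable of (action_id, ready_since) pairs, where
    `ready_since` is a Unix timestamp recorded when the action became READY.
    """
    now = time.time() if now is None else now
    return [action_id for action_id, ready_since in ready_actions
            if now - ready_since > READY_ACTION_TIMEOUT]
```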
xuhaiwei | so your suggestion is to add 'try again' logic when the action can't get the lock? | 02:18 |
XueFeng | yanyanhu, xuhaiwei, action_acquire_random_ready may need to be optimized | 02:19 |
yanyanhu | xuhaiwei, yes, that could help address the issue you met. But for an action waiting a long time without being scheduled, we may need another solution | 02:19 |
yanyanhu | XueFeng, you mean? | 02:20 |
xuhaiwei | yanyanhu, I would suggest improving the scheduling logic so the READY action gets executed | 02:21 |
yanyanhu | xuhaiwei, yes, that will also work. It just still can't resolve the problem of an action waiting too long to be scheduled... | 02:21 |
yanyanhu | e.g. if a cluster is locked for 24 hours for some reason, e.g. scaling or maintenance, any other action targeting it will keep failing until the in-progress action finishes | 02:23 |
yanyanhu | keep failing and retrying | 02:23 |
XueFeng | We pick a READY action randomly. For a cluster/node, we could pick it by create time | 02:25 |
XueFeng | Add the target and create_time | 02:26 |
yanyanhu | XueFeng, you mean ordering the actions by their timestamps first? | 02:27 |
yanyanhu | then picking up the oldest one | 02:27 |
XueFeng | When an action is running, the cluster or node is locked. | 02:30 |
XueFeng | And later actions on the cluster/node can't get the lock, and may go back to READY | 02:31 |
XueFeng | Then for that target we'd better pick the action by create time | 02:31 |
yanyanhu | XueFeng, we did consider this approach before. However, it could cause another problem: if the oldest action keeps failing and retrying, younger actions could never get a chance to be scheduled... | 02:32 |
yanyanhu | we acquire a READY action randomly to avoid this issue, since we have no "Action Queue" here | 02:33 |
yanyanhu | unless we update the action timestamp before putting it back into the DB | 02:34 |
yanyanhu | if so, we may need to add an extra timestamp for each different status | 02:34 |
yanyanhu | e.g. a timestamp for when the action becomes ready/failed/succeeded/init | 02:35 |
yanyanhu | etc. | 02:35 |
yanyanhu | currently, we only have created_at and updated_at | 02:35 |
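To make the trade-off concrete: the current behaviour amounts to picking a READY action at random, roughly as in the simplified helper below, because a strict oldest-first policy would let a perpetually retrying action starve every younger action behind it. The real logic lives in the DB layer (the action_acquire_random_ready api mentioned above) and also handles locking; this is only a sketch.

```python
# Simplified illustration of random READY-action selection; the real DB api
# also locks the selected row to the acquiring worker.
import random


def pick_ready_action(ready_action_ids):
    """Pick one READY action id at random, or None when the list is empty."""
    return random.choice(ready_action_ids) if ready_action_ids else None
```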
XueFeng | Yes, another problem would come up | 02:36 |
yanyanhu | yes... so my suggestion is re-triggering the action scheduling after we mark it READY at the following position, or, as xuhaiwei suggested, optimizing our scheduler to add periodic scheduling logic | 02:37 |
yanyanhu | http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/actions/base.py#n320 | 02:38 |
XueFeng | We can do this first | 02:38 |
yanyanhu | currently, the senlin scheduler works in a way that combines both tickless and event-driven scheduling | 02:39 |
yanyanhu | XueFeng, yes | 02:39 |
yanyanhu | then we can consider how to better handle the situation where a READY action waits too long to be scheduled | 02:39 |
XueFeng | Yes | 02:40 |
XueFeng | And thinking about it again, waiting too long to be scheduled may not happen frequently | 02:41 |
XueFeng | Here it was rescheduled because the action couldn't get the target lock. And once it gets the lock, it will run to success/failure... | 02:43 |
yanyanhu | XueFeng, uhm, yes, it could happen, depending on the use case. But anyway, the user can handle it as well by checking the action and target cluster/node status | 02:43 |
XueFeng | ok, we can try this way first | 02:44 |
XueFeng | I will do a test now | 02:45 |
yanyanhu | XueFeng, great, thanks a lot :) | 02:45 |
XueFeng | my pleasure:) | 02:46 |
XueFeng | root@tecs:/home/openstack/devstack# openstack cluster resize mycluster1 --capacity 2 | 02:49 |
XueFeng | Request accepted by action: e43d5b4e-fd6a-409b-adab-5ecec25c84ef | 02:49 |
XueFeng | root@tecs:/home/openstack/devstack# openstack cluster resize mycluster1 --capacity 3 | 02:49 |
XueFeng | Request accepted by action: e1863376-70cd-4605-9611-b99dc546be6a | 02:49 |
XueFeng | It's easy to reproduce now | 02:50 |
XueFeng | And I will change the code to see the effect | 02:51 |
openstackgerrit | Qiming Teng proposed openstack/senlin master: Fix ovo object for requests https://review.openstack.org/455891 | 03:02 |
*** zhurong has quit IRC | 04:06 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/python-senlinclient master: Updated from global requirements https://review.openstack.org/454535 | 04:22 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/senlin master: Updated from global requirements https://review.openstack.org/456018 | 04:22 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/senlin-dashboard master: Updated from global requirements https://review.openstack.org/456019 | 04:23 |
*** zhurong has joined #senlin | 04:34 | |
openstackgerrit | XueFeng Liu proposed openstack/senlin master: Fix scheduing problem about abandon action https://review.openstack.org/456048 | 05:08 |
*** shu-mutou-AWAY is now known as shu-mutou | 05:45 | |
openstackgerrit | XueFeng Liu proposed openstack/senlin master: Fix scheduing problem about abandon action https://review.openstack.org/456048 | 06:10 |
*** yuanying_ has joined #senlin | 06:48 | |
*** yuanying has quit IRC | 06:48 | |
Qiming | @everyone, py35 jobs are now demoted to non-voting | 07:44 |
Qiming | it may take some time for this change to propagate to all CI nodes; after that we won't get blocked by the py35 jobs for critical patches | 07:45 |
Qiming | we can fix the py35 job later when the root cause is identified | 07:45 |
XueFeng | ok, got it | 08:05 |
openstackgerrit | RUIJIE YUAN proposed openstack/senlin master: revise engine cluster obj to update runtime data https://review.openstack.org/456133 | 09:38 |
openstackgerrit | Merged openstack/senlin master: fix node do_check invalid code https://review.openstack.org/455575 | 09:40 |
openstackgerrit | Qiming Teng proposed openstack/senlin master: Pike-1 release notes https://review.openstack.org/456146 | 10:10 |
*** yanyanhu has quit IRC | 10:50 | |
*** dixiaoli has quit IRC | 11:06 | |
*** zhurong has quit IRC | 11:26 | |
openstackgerrit | yangyide proposed openstack/senlin master: Improve check_object for health_policy_poll recover https://review.openstack.org/456187 | 11:50 |
openstackgerrit | yangyide proposed openstack/senlin master: Improve check_object for health_policy_poll recover https://review.openstack.org/456187 | 11:52 |
openstackgerrit | Merged openstack/senlin master: Fix scheduing problem about abandon action https://review.openstack.org/456048 | 11:53 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/senlin master: Updated from global requirements https://review.openstack.org/456018 | 11:54 |
openstackgerrit | Shu Muto proposed openstack/senlin-dashboard master: [DNM] Fix test environments https://review.openstack.org/456189 | 11:55 |
*** shu-mutou is now known as shu-mutou-AWAY | 11:55 | |
openstackgerrit | Shu Muto proposed openstack/senlin-dashboard master: [DNM] Fix test environments https://review.openstack.org/456189 | 12:09 |
*** catintheroof has joined #senlin | 13:27 | |
*** rate has joined #senlin | 13:28 | |
*** zhurong has joined #senlin | 13:46 | |
*** zhurong_ has joined #senlin | 13:59 | |
*** zhurong has quit IRC | 14:01 | |
*** rate has quit IRC | 14:13 | |
*** rate has joined #senlin | 14:19 | |
*** rate has quit IRC | 14:48 | |
*** rate has joined #senlin | 14:55 | |
*** zhurong_ has quit IRC | 14:57 | |
*** rate has quit IRC | 15:06 | |
*** rate has joined #senlin | 15:07 | |
*** rate has quit IRC | 15:11 | |
openstackgerrit | Merged openstack/senlin master: Updated from global requirements https://review.openstack.org/456018 | 15:50 |
-openstackstatus- NOTICE: Restarting Gerrit for our weekly memory leak cleanup. | 21:27 | |
*** Qiming has quit IRC | 21:41 | |
*** Qiming has joined #senlin | 21:46 | |
*** catintheroof has quit IRC | 22:48 |