Tuesday, 2016-11-08

openstackgerritXueFeng Liu proposed openstack/senlin: Fix node_lock can't be stole  https://review.openstack.org/39440700:27
*** zhurong has joined #senlin01:30
*** yanyanhu has joined #senlin01:34
Qimingyanyanhu, there?02:14
yanyanhuyes02:14
Qimingabout this bug report: https://bugs.launchpad.net/senlin/+bug/163912202:14
openstackLaunchpad bug 1639122 in senlin "senlin-cluster-deadlock" [Medium,Confirmed] - Assigned to XueFeng Liu (jonnary-liu)02:14
*** elynn has joined #senlin02:15
Qimingthe description is a mess, cannot read it through02:15
Qimingit is wordy, and not touching the essence of the problem02:15
Qimingwhen I quickly go thru the report, it seems the so called "dead lock" bugs are all caused by engine restart02:15
Qimingthat is something we were not designed for02:16
yanyanhuQiming, yes, we didn't consider the case engine restarts before02:16
QimingI have seen some patches coming in in an attempt to solve the problem02:16
Qimingbut those patches are solving the problem in the wrong way02:16
QimingI mean, they are trying to solve the DB problem case by case ... but there are too many cases to be enumerated, so ... it won't be a comprehensive solution02:17
Qiminginstead, I think we need a discussion on how to cleanse the database when engine gets restarted02:18
yanyanhuI see.02:18
Qimingthat is, how to reset each action that was owned by the dead engine, and release locks held by those actions02:19
yanyanhuI recall ethan's earlier work partially touched this problem? about dead lock stealing02:20
*** shu-mutou has joined #senlin02:20
*** elynn has quit IRC02:20
Qimingyes, that patch has somehow solved the problem, but not completely02:20
Qimingwe have a two-leveled lock design02:20
*** elynn has joined #senlin02:20
Qimingthe lock-steal mechanism is a little bit lazy02:21
yanyanhuyes, it happens only in case engine starts02:21
Qimingit is stealing locks when an action wants to lock a node or a cluster02:21
Qimingwhile in real life, restarting engine would be rare02:22
yanyanhuah, right, false statement02:22
Qiming(let's skip testing scenarios for the moment)02:22
yanyanhuyes. the key is we don need some events to trigger this checking and lock stealing02:23
Qimingelynn, thank god you joined promptly02:23
yanyanhus/don/do02:23
Qimingright02:23
elynnQiming, what happened?02:23
Qimingso ... if we do house cleaning when an engine starts, I mean, let the new engine cleanse everything once and for all02:23
elynnlock-steal mechanism mess up?02:24
Qimingyes, elynn, it seems not a clean or comprehensive solution, there are still dangling locks somewhere, see : https://bugs.launchpad.net/senlin/+bug/163912202:24
openstackLaunchpad bug 1639122 in senlin "senlin-cluster-deadlock" [Medium,Confirmed] - Assigned to XueFeng Liu (jonnary-liu)02:24
Qimingso I'm thinking maybe we can fix it this way02:24
Qimingsince the lock-steal mechanism (thanks elynn again) is already there working02:25
Qimingand that mechanism is already stealing the lock in question02:25
Qimingwhy don't we trigger another thread to cleanse the database for the dead engine ?02:26
Qiminghere: http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/senlin_lock.py#n14302:27
Qimingwe checked and found that an engine is dead, and we steal the lock on the node02:28
yanyanhuso sounds like a periodic task02:29
yanyanhumaybe running beside each engine02:29
Qimingmaybe we should trigger a worker thread to do a GC for the dead engine02:29
QimingI thought about that before, but it is too heavy weight02:29
Qimingit is the same reason people hate polling02:29
Qimingwith Python, not real multi-processing, we should avoid doing polling whenever possible, it is a waste of CPU cycles02:30
elynnWhat's wrong with that code? still trying to catch up on your discussion...02:30
Qiming http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/senlin_lock.py#n14302:30
yanyanhuif so, we need to consider the situation that this gc thread is dead02:31
Qiminghere, after having found that an engine is dead, we steal the lock on line 151, right? elynn02:31
elynnyes02:31
elynnis it a problem?02:31
Qimingwe are talking about an improvement to that02:31
Qimingwe only stole the node lock today02:32
Qimingand we know the engine E has died02:32
Qimingwhy don't we trigger a "background" worker thread there, doing some further garbage collection job for the dead engine E?02:32
elynnHow can we know that engine E has died? is there any notifications?02:33
Qimingno ... it was detected by your timeout logic02:33
Qimingfunction called on line 14402:33
elynnOh, you mean once when we run into that code and check that engine is dead, then we maybe could clean all dead actions02:33
Qimingyes, exactly02:34
Qimingall dead actions related to that engine02:34
Qimingand all node locks/cluster locks held by those dead actions02:34
Qimingit could be just a new sqlalchemy API02:34
Qimingif done right02:34
yanyanhusounds good02:35
elynnHmm, that could do.02:35
QimingI'd suggest we don't block the return on line 15302:35
yanyanhumaybe the first step is improve this cleaning logic, then we can consider to add gc thread to invoke it periodically02:35
Qiminginstead we launch a background thread to do the job02:35
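
The idea in the lines above (steal the lock, return to the caller immediately, and let a background worker clean up after the dead engine) could be sketched roughly as follows. This is only an illustration: `steal_node_lock`, `gc_dead_engine`, and the in-memory `locks` dict are hypothetical names, not Senlin code, and a plain `threading.Thread` stands in for whatever threading facility the engine actually uses.

```python
import threading


def gc_dead_engine(engine_id, cleaned):
    # Placeholder for the real cleanup (reset actions, release locks).
    cleaned.append(engine_id)


def steal_node_lock(node_id, action_id, dead_engine_id, locks, cleaned):
    """Hypothetical helper: take over a lock, then clean up in the background."""
    locks[node_id] = action_id  # steal the lock held by the dead engine's action
    worker = threading.Thread(
        target=gc_dead_engine, args=(dead_engine_id, cleaned), daemon=True)
    worker.start()  # do not wait: the caller gets the lock right away
    return worker


locks = {"node-1": "action-old"}
cleaned = []
worker = steal_node_lock("node-1", "action-new", "engine-E", locks, cleaned)
worker.join()  # only for this demo; a real caller would not block here
```

The lock acquisition path stays short because the garbage collection runs once, on the event "dead engine detected", rather than on a polling timer.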
Qimingyanyanhu, that is another point I want to discuss/argue with you02:36
Qimingplease be cautious when introducing periodical threads02:36
elynnI think we could add a thread after http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/senlin_lock.py#n152 to clean all actions02:36
Qimingyep, elynn02:36
Qimingsounds feasible?02:36
Qimingyanyanhu, we only have a single process in the Python world02:37
elynnMaybe it's less impact for the current code.02:37
yanyanhuQiming, yes, each engine is actually a single thread worker02:37
XueFengHi02:37
elynnThat's a problem02:37
yanyanhuhi, XueFeng02:37
Qimingwe should kill all pollers, all periodicals if possible02:37
yanyanhuQiming, yes, for the architecture we are using now is actually event driven02:38
Qimingright, let's keep things event driven when possible02:38
Qiminguntil the day we introduce multiprocessing into the code02:39
yanyanhuQiming, sure, that makes sense02:39
Qimingwhen one worker/thread is loop polling something, it is tax on all other workers ... unfair, unproductive, waste of resources ...02:39
Qimingelynn, if you (or others), can come up with a DB call that can reset all actions (AND the nodes/clusters locked by those actions) related to an engine known to be dead02:41
Qimingwe will be more confident in writing complex/efficient DB queries02:41
elynnActually I'm thinking making the lock table more general, for example when clean  thread is up then add a lock there02:41
Qimingthat is doable, but don't lock the world02:42
yanyanhuelynn, you mean something like fencing?  :)02:42
openstackgerritRUIJIE YUAN proposed openstack/senlin: engine work prepare for policy update v2  https://review.openstack.org/39471102:43
elynnyanyanhu, no... just to make sure only one cleaning thread is running.02:43
yanyanhuI see02:43
yanyanhuyes, that is important02:43
elynncleaning thread is the thread to reset actions lock.02:43
Qimingwe will be more confident in writing complex/efficient DB queries, we can add (back) a 'senlin-manage purge-db' call which can clean up data in the 'action' and 'event' table which are too old based on a criteria02:43
yanyanhuto avoid potential race condition on lock table02:43
Qimingelynn, yep, absolutely02:43
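
One way to make sure only one cleaning thread runs at a time, as elynn suggests, is to let a uniquely keyed row in a lock table act as the guard. The sketch below uses an in-memory SQLite database; the `service_lock` table and the `try_acquire`/`release` helpers are assumptions made for illustration and do not reflect Senlin's actual schema.

```python
import sqlite3


def try_acquire(conn, name, owner):
    """Return True if `owner` got the named lock, False if it is already taken."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO service_lock (name, owner) VALUES (?, ?)",
                (name, owner))
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: another cleaner holds the lock.
        return False


def release(conn, name, owner):
    with conn:
        conn.execute(
            "DELETE FROM service_lock WHERE name = ? AND owner = ?",
            (name, owner))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service_lock (name TEXT PRIMARY KEY, owner TEXT)")
first = try_acquire(conn, "gc:engine-E", "engine-A")
second = try_acquire(conn, "gc:engine-E", "engine-B")
release(conn, "gc:engine-E", "engine-A")
third = try_acquire(conn, "gc:engine-E", "engine-B")
```

Keying the lock on the dead engine's ID, rather than on a single global name, keeps with Qiming's advice not to "lock the world".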
elynnQiming, I will start from a DB call maybe.02:44
elynnBut the problem is how do we deal with that lock if it's not released.02:44
Qimingyou can learn from Heat's DB logic on deleting stack tags ... if that helps02:44
Qimingaction is locked by a dead engine? okay, it is no longer locked. It is ready to be executed again? not sure, need to check its state.02:46
Qimingnode/cluster is locked by an action which was owned by a dead engine? it is no longer locked. Was it in state UPDATING? em ... now it should be put into ERROR maybe.02:47
Qimingwe don't know what happened02:47
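
The cleanup described above could be sketched as a single DB call along these lines. It is a minimal sketch using SQLite with a simplified, hypothetical schema (`action`, `node`, `node_lock`, `cluster_lock`); the function name `gc_by_engine` is also an assumption. Whether a released action should be re-run is deliberately left open, as in the discussion.

```python
import sqlite3


def gc_by_engine(conn, engine_id):
    """Release everything held by actions of a dead engine, in one transaction."""
    owned = "SELECT id FROM action WHERE owner = ?"
    with conn:  # commit on success, roll back on error
        # Nodes caught mid-update are in an unknown state; mark them ERROR.
        conn.execute(
            "UPDATE node SET status = 'ERROR' WHERE status = 'UPDATING' "
            "AND id IN (SELECT node_id FROM node_lock "
            f"WHERE action_id IN ({owned}))", (engine_id,))
        conn.execute(
            f"DELETE FROM node_lock WHERE action_id IN ({owned})", (engine_id,))
        conn.execute(
            f"DELETE FROM cluster_lock WHERE action_id IN ({owned})", (engine_id,))
        # Drop ownership only; the action's own status still needs checking.
        conn.execute("UPDATE action SET owner = NULL WHERE owner = ?", (engine_id,))


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE action (id TEXT PRIMARY KEY, owner TEXT, status TEXT);
    CREATE TABLE node (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE node_lock (node_id TEXT PRIMARY KEY, action_id TEXT);
    CREATE TABLE cluster_lock (cluster_id TEXT PRIMARY KEY, action_id TEXT);
    INSERT INTO action VALUES ('a1', 'engine-E', 'RUNNING');
    INSERT INTO action VALUES ('a2', 'engine-F', 'RUNNING');
    INSERT INTO node VALUES ('n1', 'UPDATING');
    INSERT INTO node VALUES ('n2', 'UPDATING');
    INSERT INTO node_lock VALUES ('n1', 'a1');
    INSERT INTO node_lock VALUES ('n2', 'a2');
    INSERT INTO cluster_lock VALUES ('c1', 'a1');
""")
gc_by_engine(conn, "engine-E")
```

Locks and nodes belonging to the live engine (`engine-F` here) are untouched, which is the point of scoping the cleanup to one dead engine.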
Qimingwhen you have questions about how to set the node/cluster states, we can discuss. otherwise, using your own judgement would be fine.02:48
elynnQiming, okay02:48
elynnQiming, I could start from DB call first, right?02:49
Qimingsure, you decide where to start02:49
Qiming:)02:49
Qimingif you still have doubts about the problems you are solving, check the latest comment posted by XueFeng here: https://review.openstack.org/#/c/394407/02:50
elynnokay, will go through your comments on that patch.02:51
Qimingit was referred to as an 'inconsistency' problem02:51
XueFengI have no good idea about how to avoid the problem happening. I only submitted patch 392218, so I am trying to handle the inconsistency.02:53
Qimingyes, read back our discussion since 10:1402:53
XueFengWill read.02:54
QimingXueFeng, hope this can solve the problem you encountered and help you understand our philosophy on the internal design, :)02:54
XueFengok,thanks:)02:55
openstackgerritMerged openstack/python-senlinclient: Revise "enabled" related code in ClusterPolicyUpdate  https://review.openstack.org/39431402:59
openstackgerritMerged openstack/python-senlinclient: cluster policy attach cannot work  https://review.openstack.org/39428002:59
openstackgerritMerged openstack/python-senlinclient: policy binding update cannot work  https://review.openstack.org/39425403:00
openstackgerritShu Muto proposed openstack/senlin-dashboard: Angularize node tables  https://review.openstack.org/38732103:00
openstackgerritMerged openstack/python-senlinclient: Policy binding attach cannot work  https://review.openstack.org/39425603:01
openstackgerritRUIJIE YUAN proposed openstack/senlin: engine work prepare for policy update v2  https://review.openstack.org/39471103:01
openstackgerritYanyan Hu proposed openstack/senlin: Minor fix on node-create API  https://review.openstack.org/39472303:27
XueFengHi elynn,03:39
elynnHi XueFeng03:40
XueFengelynn03:40
XueFeng"I think we could add a thread after http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/senlin_lock.py#n152 to clean all actions "03:40
XueFengI solve the problem that line 152 can't be run.03:41
elynnIn which patch?03:41
XueFengelynn03:41
XueFengI think we could add a thread after http://git.openstack.org/cgit/openstack/senlin/tree/senlin/engine/senlin_lock.py#n152 to clean all actions03:41
XueFenghttps://review.openstack.org/#/c/394407/03:42
XueFeng39440703:42
elynnActually I haven't quite understood why line 152 can't be run?03:42
openstackgerritMerged openstack/senlin: Versioned request object for receiver list  https://review.openstack.org/39434003:43
XueFengSee line 14303:43
elynnCan I call you XueFeng ?03:43
XueFengOK03:44
QimingXueFeng, in that patch03:44
QimingLine 143 in the original code contains something you didn't change, Line 143 in the revised code is empty03:45
Qimingwhat are you referring to?03:45
QimingI'd call that fix a piecemeal fix, it fixes a corner case of the problem03:45
Qimingit introduces two DB interactions on a critical path of lock acquire03:46
Qimingwhat we discussed this morning is about a more generic solution to solve lock/action/node/cluster problem03:46
Qimingthe line 138 in your revised code contains some strange condition for setting forced to True03:48
Qimingit is at best some logic difficult to maintain03:48
Qimingfor example, why don't we add the same logic before line 71?03:48
Qimingwhy do we only set forced to True if action.owner is None?03:49
Qimingthe thing is not about whether we trust your test results ... okay it is working03:49
Qimingthe thing is that we want a more generic solution aiming at the root cause of a problem, which is "engine restarted, db appears inconsistent"03:50
Qimingthat root cause will cause various problems, not just the stealing of node locks03:51
QimingI believe there will be far more problems uncovered if we don't handle engine restart properly03:51
xuhaiweihi sorry to interrupt, is senlin working well under the latest codes?  I got this error when running senlin commands  http://paste.openstack.org/show/588334/03:56
openstackgerritMerged openstack/senlin: Engine support for receiver_list2  https://review.openstack.org/39435703:59
XueFeng"for example, why don't we add the same logic before line 71? " this original plan to do in another patch, in this patch' message mentioned "chandling will be in another patch.04:21
XueFengYes, we need to find the root cause04:22
XueFengAnd we need to design a more generic solution04:23
XueFeng"for example, why don't we add the same logic before line 71? " this original plan to do in another patch, in this patch' message mentioned "cluster_lock handling will be in another patch."04:28
XueFenghaiwei, I met the same problem04:29
xuhaiweiXueFeng, it seems something has changed in keystoneauth104:30
XueFengMaybe04:30
XueFengAnd I update all modules04:31
XueFengAlso update senlinclient/openstack-sdk04:31
xuhaiweianyway, I will file a bug report for this04:31
XueFengYes04:31
XueFengSeems incompatible04:32
xuhaiweihi elynn, does Heat senlin resource support nested template in OS::Senlin::Profile type resource?  like this  http://paste.openstack.org/show/588338/04:39
elynnxuhaiwei, let me see...04:40
xuhaiweithanks04:40
elynnNeed to do a little change, try with {get_file: example.yaml} when you use heat template.04:41
xuhaiweigot it, thanks04:41
Qimingxuhaiwei, looks like a python-openstacksdk problem again05:55
Qimingxuhaiwei, seems you are invoking some service which doesn't support version negotiation?05:58
Qimingbetter check the 'uri' content here: File "/usr/local/lib/python2.7/dist-packages/openstack/session.py", line 16805:58
xuhaiweiQiming, the uri is the IP address of my host, seem no problem06:05
xuhaiweiin senlin-api logs, it logs http://host_ip, and to client side it returns http://host_ip:8778  with port number added06:07
Qimingwhat do you get if you do curl http://host_ip:8778 ?06:11
xuhaiwei{"versions": [{"status": "CURRENT", "updated": "2016-01-18T00:00:00Z", "links": [{"href": "/v1", "rel": "self"}, {"href": "http://developer.openstack.org/api-ref/clustering", "rel": "help"}], "min_version": "1.0", "max_version": "1.3", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.clustering-v1+json"}], "id": "1.0"}]}06:11
Qimingokay, that means senlin api is set up correctly and working properly06:12
Qimingopenstacksdk wasn't able to parse this to get a versioned endpoint06:12
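
For reference, turning a version document like the one returned by the `curl` call above into a versioned endpoint could look roughly like this. It is an illustrative sketch only, not openstacksdk's actual discovery logic; the helper name `versioned_endpoint` is hypothetical, and the sample body is a trimmed copy of the response shown above.

```python
import json
from urllib.parse import urljoin


def versioned_endpoint(base_url, body):
    """Return the 'self' link of the CURRENT version, resolved against base_url."""
    doc = json.loads(body)
    for version in doc.get("versions", []):
        if version.get("status") != "CURRENT":
            continue
        for link in version.get("links", []):
            if link.get("rel") == "self":
                # A relative href such as "/v1" is resolved against the base URL.
                return urljoin(base_url, link["href"])
    return None


body = ('{"versions": [{"status": "CURRENT", "id": "1.0", '
        '"links": [{"href": "/v1", "rel": "self"}]}]}')
endpoint = versioned_endpoint("http://host_ip:8778", body)
```

Note that the `href` in the response is relative (`/v1`), so a client has to combine it with the base URL it queried.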
XueFengopenstacksdk version need update?06:14
Qimingit depends on which version you are using06:18
Qimingin my case, I'm using the master HEAD version, i.e. openstacksdk-0.9.10.dev1, so I don't have a problem with it06:21
xuhaiweiQiming, XueFeng, I got the reason, it  is caused by this patch https://review.openstack.org/#/c/386440/06:21
Qimingline 77 didn't check if the response contains a JSON or not?06:22
xuhaiweiit is a little strange, if no exception is returned the exception handling should not be run06:25
Qimingit seems there still is an exception though06:27
openstackgerritRUIJIE YUAN proposed openstack/senlin: api support policy update v2  https://review.openstack.org/39478906:54
openstackgerritRUIJIE YUAN proposed openstack/senlin: remove dead code in policy update  https://review.openstack.org/39479006:58
openstackgerritShu Muto proposed openstack/senlin-dashboard: Angularize node tables  https://review.openstack.org/38732107:00
openstackgerritQiming Teng proposed openstack/senlin: Ensure /v1 endpoint returns proper version info  https://review.openstack.org/39480407:36
QimingHINT: if you are talking about "codes" instead of "code", make sure you are actually meaning it, :)  http://english.stackexchange.com/questions/20455/is-it-wrong-to-use-the-word-codes-in-a-programming-context07:38
*** yanyanhu has quit IRC07:44
openstackgerritRUIJIE YUAN proposed openstack/senlin: engine support policy delete v2  https://review.openstack.org/39480607:48
*** yanyanhu has joined #senlin07:51
openstackgerritQiming Teng proposed openstack/senlin: Move container spec to approved dir  https://review.openstack.org/39481308:07
Qimingruijie, online?08:11
ruijieyes, Qiming08:58
*** shu-mutou is now known as shu-mutou-AWAY09:02
Qiminghi, ruijie09:04
ruijiehi, Qiming :)09:04
Qimingwhat's the progress of this blueprint: https://blueprints.launchpad.net/senlin/+spec/support-batching-policy09:05
Qimingcan we mark it as completed?09:05
QimingAFAICT, the spec is mainly about three things: policy itself, update operation and delete operation09:06
ruijieyes, Qiming. Currently, it supports cluster_create/update.09:06
ruijieI am not sure if we need to support more cluster actions09:06
Qimingwe don't have a decent solution for cluster_create09:06
ruijiesorry, cluster_delete :)09:07
Qimingbefore a cluster is created, no policy can be checked ...09:07
Qiminghow about add/delete nodes?09:07
ruijieyes, but, like some other action cluster_add_nodes or del_nodes may work with it.09:07
Qimingscale out and scale in?09:07
ruijieyes Qiming, some actions need to create or delete nodes09:08
Qimingdo we want to add support to more node creation/deletion operations or we just stop here?09:09
ruijieIn fact, I hope to add support to more actions09:10
ruijieone thing is that, current logic will reduce efficiency in my opinion09:11
Qimingokay, I'm marking the bp as 'good progress'09:11
Qimingreason?09:11
ruijiesince we need to sleep for period after each batch09:11
Qimingyes, that is what the policy says09:12
Qimingbatches don't overlap with each other09:12
Qimingit is a tradeoff between robustness and efficiency09:13
ruijieIn that case, do we need to consider the "TIME_OUT" exception09:13
Qimingif the interaction with sqlalchemy is ideal, e.g. every db operation will be immediately reflected into Python objects, we can do a lot of optimizations09:13
ruijieif the working chain is too long09:14
Qimingwe will never know the optimal working chain length, or batch size09:14
Qimingthat is why we leave it a policy parameter for users to tune09:15
ruijieokay, that makes sense.09:15
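
The batching behaviour discussed above (batches never overlap, with a pause after each batch) can be sketched as follows. The function and parameter names are hypothetical and only illustrate the policy's two tunables, batch size and pause time; the `sleep` argument is injectable so the demo does not actually wait.

```python
import time


def run_in_batches(items, batch_size, pause, operate, sleep=time.sleep):
    """Apply `operate` to items in non-overlapping batches, pausing in between."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(operate(item) for item in batch)
        if start + batch_size < len(items):
            sleep(pause)  # wait before the next batch; skipped after the last one
    return results


pauses = []  # record requested pauses instead of really sleeping
done = run_in_batches(["n1", "n2", "n3", "n4", "n5"], 2, 10,
                      str.upper, sleep=pauses.append)
```

With five nodes and a batch size of two there are three batches and two pauses, which is where the efficiency cost ruijie mentions comes from.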
Qimingin an ideal world, this should be implemented as a sliding window, but sqlalchemy interaction is not reliable ... we have encountered a lot of problems trying to get the latest status out of DB table09:16
Qimingpython runtime and sqlalchemy package will cache things that have to be explicitly invalidated ...09:16
Qimingtracking action status closely remains an open issue till today09:17
ruijieeh, then waiting until the whole batch completed will be better.09:18
ruijieokay, Qiming, I will add support to more actions like scale in/out, resize etc. after versioned resource support is completed.09:18
Qimingmany thanks!09:19
ruijiemy pleasure :)09:20
*** elynn has quit IRC09:37
openstackgerritOpenStack Proposal Bot proposed openstack/python-senlinclient: Updated from global requirements  https://review.openstack.org/39413009:53
openstackgerritQiming Teng proposed openstack/senlin: A spec for generic event/notification support  https://review.openstack.org/39487409:58
*** zhurong has quit IRC09:58
openstackgerritQiming Teng proposed openstack/senlin: A spec for generic event/notification support  https://review.openstack.org/39487409:59
openstackgerritMerged openstack/python-senlinclient: Updated from global requirements  https://review.openstack.org/39413010:04
*** yanyanhu has quit IRC10:13
*** XueFeng has quit IRC10:21
openstackgerritShan Guo proposed openstack/senlin: Replaces uuid.uuid4 with uuidutils.generate_uuid()  https://review.openstack.org/39489210:39
openstackgerritxu-haiwei proposed openstack/senlin: Remove container nodes information from dependents property of cluster  https://review.openstack.org/39489610:51
*** ruijie has quit IRC11:14
*** zhurong has joined #senlin12:05
openstackgerritMerged openstack/senlin-dashboard: Fix the wrong url to the policy detail page in cluster detail page  https://review.openstack.org/39016312:12
openstackgerritMerged openstack/senlin-dashboard: Imported Translations from Zanata  https://review.openstack.org/38928812:12
*** zhurong has quit IRC12:13
*** zhurong has joined #senlin12:21
*** zhurong has quit IRC12:38
*** yanyanhu has joined #senlin12:40
*** zhurong has joined #senlin12:43
*** Drago has joined #senlin12:46
openstackgerritlvdongbing proposed openstack/senlin: Versioned objects for profile request  https://review.openstack.org/39494112:47
*** lvdongbing has joined #senlin12:49
*** Ruijie has joined #senlin12:50
yanyanhuhi, guys, meeting will start in #openstack-meeting channel in minutes12:54
yanyanhuhi, meeting has started13:00
*** lixinhui_ has joined #senlin13:01
*** elynn has joined #senlin13:02
openstackgerritLiuqing Jing proposed openstack/senlin-dashboard: Add spec file for profiles.service.js  https://review.openstack.org/39494913:04
*** miaohb has joined #senlin13:09
*** miaohb has left #senlin13:09
openstackgerritLiuqing Jing proposed openstack/senlin-dashboard: Remove unnecessary variable assignment  https://review.openstack.org/39495213:16
*** Liuqing has joined #senlin13:18
*** Liuqing has quit IRC13:19
*** Liuqing has joined #senlin13:20
*** zhurong has quit IRC13:37
*** openstackgerrit has quit IRC13:48
*** openstackgerrit has joined #senlin13:49
*** ChanServ sets mode: +v openstackgerrit13:49
*** elynn has quit IRC13:59
*** XueFeng has joined #senlin14:00
*** XueFeng has quit IRC14:00
*** yanyanhu has quit IRC14:01
*** Ruijie has quit IRC14:01
*** lvdongbing has quit IRC14:04
*** Liuqing has quit IRC14:10
*** Liuqing has joined #senlin14:11
*** Liuqing has quit IRC14:14
*** lixinhui_ has quit IRC14:25
*** Drago has quit IRC14:49
*** catintheroof has joined #senlin14:57
*** Drago has joined #senlin15:38
*** Drago has quit IRC15:42
*** Drago has joined #senlin15:48
*** Drago1 has joined #senlin17:02
*** Drago has quit IRC17:06
*** Drago1 has quit IRC17:06
*** Drago has joined #senlin17:13
*** Drago has quit IRC17:22
*** Drago has joined #senlin17:40
openstackgerritOpenStack Proposal Bot proposed openstack/senlin: Updated from global requirements  https://review.openstack.org/39519821:31
*** catintheroof has quit IRC22:22
*** catintheroof has joined #senlin22:22
*** catintheroof has quit IRC22:27
*** lixinhui_ has joined #senlin23:16
*** lixinhui_ has quit IRC23:17
*** catintheroof has joined #senlin23:29
*** catintheroof has quit IRC23:31
*** catintheroof has joined #senlin23:31
