Monday, 2023-03-06

__ministry: Hello, I am using Senlin in a production environment and have a problem: the Senlin health manager calls Keystone many times and health-checks a cluster more often than expected. I have raised this problem before. After monitoring for a long time, I found that the problem starts when registering the health manager for a cluster times out, with a log line like: "Registering health manager for cluster 9af41150-f44c-4d0f-a905-0492bfcfbc9c timed out." [02:36]
__ministry: Has anyone else run into this problem? [02:36]
eandersson: __ministry, this bug, right? https://bugs.launchpad.net/senlin/+bug/1975440 [02:49]
eandersson: Are you running the version that has that fix? 14.0 [02:50]
eandersson: Zed [02:51]
__ministry: Yup. I am running code that has this commit, but the problem still happens. [02:56]
eandersson: Can you add the new information to the bug too, so that we don't lose track of the info? [02:56]
eandersson: Need to figure out a way to reproduce this. [02:56]
__ministry: I can't reproduce this bug, but I have collected some information: [03:01]
__ministry: 1. The bug starts when this log appears: "Registering health manager for cluster timed out." [03:01]
__ministry: 2. The health manager keeps health-checking these clusters even though they are not found. Example: "Cluster (0dd4f97a-effb-4223-9863-7d7932960ad8) is not found." [03:01]
__ministry: 3. When I restart the health manager, the problem is resolved. [03:01]
eandersson: Does that cluster id exist? [03:26]
eandersson: I wonder if we should just handle this differently here. [03:33]
eandersson: https://github.com/openstack/senlin/blob/master/senlin/engine/health_manager.py#L390 [03:33]
eandersson: Maybe if this fails just stop trying, because I don't know if there is any point in continuing if the cluster does not exist. What do you think dtruong? [03:35]
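The "just stop trying" idea can be illustrated with a minimal, self-contained sketch. It is not Senlin code: `ClusterNotFound`, `HealthRegistry`, and `make_lookup` are hypothetical names invented for illustration, and the lookup callable stands in for whatever database query the real health manager performs. The only point shown is that a check pass drops a cluster from the registry as soon as the lookup reports it missing.

```python
class ClusterNotFound(Exception):
    """Hypothetical stand-in for a 'cluster not found' lookup error."""


class HealthRegistry:
    """Hypothetical in-memory registry of clusters being health-checked."""

    def __init__(self, lookup):
        # lookup: callable taking a cluster id; raises ClusterNotFound
        # when the cluster no longer exists.
        self._lookup = lookup
        self.registered = set()

    def register(self, cluster_id):
        self.registered.add(cluster_id)

    def check_once(self):
        """Run one pass and return the ids dropped because they are gone."""
        dropped = []
        for cluster_id in sorted(self.registered):
            try:
                self._lookup(cluster_id)
            except ClusterNotFound:
                # No point in checking a cluster that does not exist.
                dropped.append(cluster_id)
        self.registered.difference_update(dropped)
        return dropped


def make_lookup(existing_ids):
    """Build a fake lookup over a fixed set of existing cluster ids."""
    def lookup(cluster_id):
        if cluster_id not in existing_ids:
            raise ClusterNotFound(cluster_id)
        return {"id": cluster_id}
    return lookup
```

With a registry holding one existing and one deleted cluster, the first call to `check_once()` drops the deleted one and later calls no longer touch it.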
eandersson: __ministry: Do you see a UUID in the logs with this log line: [03:39]
eandersson: > Unregistering health manager for cluster <uuid> timed out. [03:39]
eandersson: It really feels like this is a RabbitMQ issue. [03:41]
eandersson: A few (or many) lost messages. [03:41]
eandersson: But it could also be that the health manager isn't responsive, maybe stuck processing some call. [03:43]
__ministry: Those clusters were deleted and are not found in the database, but the health manager still health-checks them. [04:02]
__ministry: I have also seen logs in the format above: "Registering health manager for cluster 9af41150-f44c-4d0f-a905-0492bfcfbc9c timed out." [04:03]
__ministry: But I have never seen a log with the format "Unregistering health manager for cluster <uuid> timed out." [04:05]
opendevreview: Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Check for orphaned health checks  https://review.opendev.org/c/openstack/senlin/+/876275 [05:40]
eandersson: No clue if the above works, but we might have to do something like this to ensure that we don't have unwanted health checks hanging around. [05:41]
__ministry: Thank you very much. Let me try your code. [07:27]
opendevreview: Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Add cleanup step for orphaned health checks  https://review.opendev.org/c/openstack/senlin/+/876275 [07:30]
eandersson: Sounds good, thanks __ministry. I updated the code a bit to not be as aggressive, but the previous version should do what we want. [07:31]
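A periodic cleanup of orphaned health checks could look roughly like the sketch below. This is not the code in the review above and makes no claim about what that patch does; `sweep_orphans`, its parameters, and the `threshold` default are assumptions made for illustration. It shows one possible way to make a cleanup less aggressive: a registered cluster is reported as orphaned only after it has been missing for several sweeps in a row, so a single failed or stale lookup does not remove a live health check.

```python
def sweep_orphans(registered, existing_ids, misses, threshold=3):
    """Return registered ids missing for `threshold` consecutive sweeps.

    registered:   set of cluster ids currently being health-checked
    existing_ids: set of cluster ids that exist right now
    misses:       dict of consecutive-miss counters, updated in place
    threshold:    hypothetical tuning knob, not a real Senlin option
    """
    orphans = set()
    for cluster_id in registered:
        if cluster_id in existing_ids:
            # Cluster is present again: reset its miss counter.
            misses.pop(cluster_id, None)
            continue
        misses[cluster_id] = misses.get(cluster_id, 0) + 1
        if misses[cluster_id] >= threshold:
            orphans.add(cluster_id)
    # Forget counters for entries the caller is about to remove.
    for cluster_id in orphans:
        misses.pop(cluster_id, None)
    return orphans
```

The caller is expected to unregister the returned ids itself; the function only decides which entries look orphaned.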
eandersson: I still don't know how we are getting into this state though. [07:32]
__ministry: Let me trace and collect more information. [07:34]
opendevreview: Erik Olof Gunnar Andersson proposed openstack/senlin master: [WIP] Add cleanup step for orphaned health checks  https://review.opendev.org/c/openstack/senlin/+/876275 [19:24]
