Thursday, 2020-05-07

kgiustieandersson: ah... yes you've nailed it.  So long story short, there's a bunch of tricky code wrapping the server (listener) service in oslo.messaging that enforces order to the server.start(), .stop(), and .wait() calls.00:54
kgiustieandersson: predates me, but IIRC it ensured that if service.stop() was ever called by a thread before service.start() had been invoked, the service.stop() would block until the server.start() call completed.00:55
kgiustieandersson: since service.start() isn't called, service.stop() waits... and waits....00:56
eanderssonI wonder why this only showed up now.00:56
kgiustieandersson: IIUC server.stop() can take an optional timeout, which if triggered server.stop() will throw a MessagingServerError.00:56
kgiustieandersson: check out the _OrderedTask class & friends in oslo_messaging.server.py ... I fear to tread there :)00:57
kgiustieandersson: that's a good question - it should've been hanging since the dawn of time.  That "feature" has been around for years.00:58
eanderssonYea exactly00:58
eanderssonMight just be that we never tested it before tbh00:58
eanderssonI see most services are just killed at the end of a test run with very little regard00:58
eanderssone.g. https://zuul.opendev.org/t/openstack/build/53bdbdd36b2f4fa388048e2cebcfa59a/log/controller/logs/screen-designate-sink.txt#32000:59
eanderssonIt's a minor bug anyway, since it only affects unconfigured services... like the one we are running in our default tests lol01:00
kgiustieandersson: yeah this is first I've heard of it being hit...01:02
*** ralonsoh has joined #openstack-oslo07:26
*** tosky has joined #openstack-oslo07:31
*** rpittau|afk is now known as rpittau07:47
*** TheJulia has quit IRC07:56
*** mnasiadka has quit IRC07:56
*** Anticimex has quit IRC07:56
*** TheJulia has joined #openstack-oslo08:01
*** mnasiadka has joined #openstack-oslo08:01
*** Anticimex has joined #openstack-oslo08:01
*** irclogbot_1 has quit IRC09:41
*** irclogbot_3 has joined #openstack-oslo09:42
*** rpittau is now known as rpittau|bbl10:10
openstackgerritHervé Beraud proposed openstack/oslo.tools master: Introduce a new tools to release oslo in a semi-automatized way  https://review.opendev.org/72593810:53
openstackgerritHervé Beraud proposed openstack/oslo.tools master: Ignore a struture error (E010) on bashate  https://review.opendev.org/72609210:53
openstackgerritHervé Beraud proposed openstack/oslo.tools master: Introduce a new tools to release oslo in a semi-automatized way  https://review.opendev.org/72593811:58
openstackgerritHervé Beraud proposed openstack/oslo.tools master: Ignore a struture error (E010) on bashate  https://review.opendev.org/72609211:59
openstackgerritHervé Beraud proposed openstack/oslo.tools master: Introduce a new tools to release oslo in a semi-automatized way  https://review.opendev.org/72593811:59
openstackgerritBenjamin Reichel proposed openstack/oslo.log master: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/71871012:06
*** rpittau|bbl is now known as rpittau12:11
moguimartkajinam, nice progress!12:12
moguimarwrong channel xD12:12
*** raildo has joined #openstack-oslo12:25
openstackgerritBenjamin Reichel proposed openstack/oslo.log master: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/71871012:47
*** jhesketh has quit IRC13:43
*** jhesketh has joined #openstack-oslo13:47
openstackgerritBenjamin Reichel proposed openstack/oslo.log stable/ussuri: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/72615614:26
openstackgerritBenjamin Reichel proposed openstack/oslo.log stable/train: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/72615714:26
openstackgerritBenjamin Reichel proposed openstack/oslo.log stable/stein: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/72615914:27
*** tosky_ has joined #openstack-oslo14:29
*** tosky has quit IRC14:32
*** dtantsur has joined #openstack-oslo14:33
dtantsurhey oslo people, have you ever seen this oslo.messaging failure? https://zuul.opendev.org/t/openstack/build/6a16fe9e33a340b691fe70e10778afb8/log/controller/logs/screen-ironic-inspector-api.txt#84614:34
dtantsurwhat can it mean? we see it pretty often in the CI.14:34
*** tkajinam has quit IRC14:43
*** tosky_ is now known as tosky15:04
bnemeckgiusti: ^15:06
* kgiusti looks....15:07
kgiustidtantsur: sorry if I can't give more detail but for whatever reason the tcp connection to rabbitmq is failing15:11
kgiustidtantsur: that error is thrown when the socket read fails.   Is there anything in the rabbitmq logs?15:11
dtantsurkgiusti: anything weird you can spot here? https://zuul.opendev.org/t/openstack/build/6a16fe9e33a340b691fe70e10778afb8/log/controller/logs/rabbitmq/rabbit@ubuntu-bionic-ovh-bhs1-0016428334_log.txt15:12
kgiustidtantsur: yes - it appears that rabbit is dropping the connection due to missed heartbeats from client15:17
dtantsurmmm, ok15:17
kgiustidtantsur: from the rabbit log it appears if the heartbeat is set to 60s, which is plenty long.15:17
dtantsurI wonder what we could do wrong..15:17
openstackgerritBenjamin Reichel proposed openstack/oslo.log stable/queens: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/72617315:18
kgiustidtantsur: from the rabbit log - "uwsgi" appears...  hmm, seem to recall something about wsgi and eventlet not playing together well..15:19
openstackgerritBenjamin Reichel proposed openstack/oslo.log stable/rocky: Add missing SYSLOG_FACILITY to JournalHandler  https://review.opendev.org/72617415:19
dtantsurugh15:19
dtantsurkgiusti: aren't all API services using uwsgi on devstack?15:19
kgiustidtantsur: I forget the specific issue here - I'll need to try to page that back in to be sure.  Had something to do with a client lib holding a lock under a C extension (breaks eventlet IIRC)15:20
dtantsursigh15:22
dtantsurI wonder why we see it on inspector, but not on ironic15:22
dtantsurmaybe they're started differently15:22
* dtantsur will check after the current meeting15:23
kgiustidtantsur: see https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/impl_rabbit.py#L47315:42
dtantsurugh15:43
kgiustidtantsur: just stepping back a bit - I'm *guessing* the heartbeats are timing out due to something preventing the heartbeat thread from running15:43
dtantsurit's a good guess, I suspect15:43
kgiustidtantsur: it's what we've seen before...15:43
kgiustidtantsur: in any case, hberaud did some work in this area recently to provide a way to run the heartbeat thread in an isolated pthread15:44
* hberaud read15:45
kgiustidtantsur: might be worth trying it out by setting heartbeat_in_pthread to true in the config - I'd defer to hberaud's wise opinion on this.15:46
hberauddtantsur, kgiusti: hum... maybe something related to => https://github.com/eventlet/eventlet/issues/59215:46
* hberaud think out loud about my first idea15:48
hberaudthread, eventlet, heartbeat, greenthread, py37...15:48
dtantsurkey words of pain :)15:48
hberauddtantsur: all in one15:49
hberaudalso could be named: the recipe of the disaster...15:50
dtantsuryeah15:50
hberauddtantsur: which version is impacted?15:51
dtantsurwe've noticed this on master15:51
hberaudok so I guess py37 for sure on CI15:52
dtantsurI *think* dsvm jobs are still running on 3615:53
hberaudok15:53
dtantsurwhich is what bionic has15:53
hberauddtantsur: a quick one could be to test something like => https://github.com/openstack/sahara/commit/a6ee52234382f8c52023796ed13cb3cdc852109815:54
dtantsurlemme check if we've merged this one15:54
hberauddtantsur: to help us to find the culprit15:54
hberauddtantsur: IIRC around => https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L50,L6015:55
dtantsurokay, we have this patch proposed, we'll try to merge it15:55
hberauddtantsur: also I wondering if we should doing the same thing here ^^^15:56
hberauddtantsur: I think yes15:56
hberauddtantsur: any opinion?15:56
dtantsurI don't entirely understand what these patches are doing.. likely yes15:57
dtantsurwe're kind of in a chicken-and-egg situation now, because we have another breakage which we've been unable to fix15:58
hberaudthey patching the original thread to use the up-to-date _active global variable15:58
hberauddtantsur: awesome situation15:58
dtantsurwhat can be better15:58
hberauddtantsur: do you plan to enjoy your weekend?15:58
dtantsurnot until Friday evening15:59
hberaud:)15:59
dtantsur(Germany doesn't quite celebrate the Victory day)15:59
kgiustihberaud: is there any way be can determine if the oslo.messaging heartbeat thread is in fact blocked?  We only have rabbit logs indicating hb timeouts as our only clue.15:59
hberauddtantsur: ahaha good point16:00
hberaudkgiusti: I suppose by attaching/snooping the right pid or something like16:01
hberaudps pidstat16:01
hberaudbut on CI I don't know how to catch this16:02
hberaudand I don't think we've more log than that...16:02
kgiustihberaud: I imagine slapping a log w/timestamp here: https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/impl_rabbit.py#L99316:03
kgiustihberaud: brute force low tech stuff.16:04
* hberaud on another meeting so sorry for the delays16:05
hberaudkgiusti: yep log.debug around => https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/impl_rabbit.py#L959 could help us16:07
hberauddtantsur: do not hesitate to ping me when you will get more feedback with your (un)merged patch (eventlet related)16:15
dtantsurcool, thanks16:15
dtantsurinterestingly, it doesn't fail all the time, "only" about 50%16:18
hberaudso it doesn't reflect a standard behavior :)16:20
kgiustidtantsur: just in case it's necessary: https://paste.centos.org/view/a3ef28e116:20
hberaudI guess if eventlet is behind this one we should observe all the time this bug16:21
dtantsurall bets are off when dealing with eventlet16:21
hberauddtantsur: do you know if this one is in use here => https://github.com/openstack/puppet-oslo/blob/23eb13c6c86ef7b3a1deb9b1b8f636a8c072908d/manifests/messaging/rabbit.pp#L106,L114 ?16:23
hberauddtantsur: in the used config16:23
dtantsurit's not hard to check, gimme a minute16:24
openstackgerritMerged openstack/oslo.utils master: Remove monotonic usage  https://review.opendev.org/72542416:25
dtantsurhberaud: I don't see this option in https://zuul.opendev.org/t/openstack/build/6a16fe9e33a340b691fe70e10778afb8/log/controller/logs/etc/ironic-inspector/inspector_conf.txt so probably the default is used16:26
dtantsuroslo_messaging_rabbit.heartbeat_in_pthread = False16:26
hberaudok so we are in a greenthread for sure16:26
hberaudanother check could be to turn true this one16:27
hberaudto switch on standard python thread16:27
*** dtantsur is now known as dtantsur|afk16:28
hberauds/on/to a/16:28
hberauddtantsur|afk: in other word you're using the native behavior (historical behavior)16:30
hberauddtantsur|afk: and this option allow us to move the heartbeat in a std python thread to avoid eventlet side effects16:31
*** rpittau is now known as rpittau|afk16:34
openstackgerritMerged openstack/oslo.service master: Remove monotonic usage  https://review.opendev.org/72541416:58
openstackgerritMerged openstack/oslo.service stable/ussuri: Monkey patch original current_thread _active  https://review.opendev.org/72561316:58
*** toabctl has quit IRC17:04
*** gmann is now known as gmann_afk17:25
*** ralonsoh has quit IRC17:51
*** gmann_afk is now known as gmann19:06
*** hemna has quit IRC21:26
*** hemna has joined #openstack-oslo21:39
*** raildo has quit IRC21:45
*** hoonetorg has quit IRC22:34
*** hoonetorg has joined #openstack-oslo22:46
*** tkajinam has joined #openstack-oslo22:49
*** tosky has quit IRC23:36

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!