kgiusti | eandersson: ah... yes you've nailed it. So long story short, there's a bunch of tricky code wrapping the server (listener) service in oslo.messaging that enforces order to the server.start(), .stop(), and .wait() calls. | 00:54 |
---|---|---|
kgiusti | eandersson: predates me, but IIRC it ensured that if service.stop() was ever called by a thread before service.start() had been invoked, the service.stop() would block until the server.start() call completed. | 00:55 |
kgiusti | eandersson: since service.start() isn't called, service.stop() waits... and waits.... | 00:56 |
eandersson | I wonder why this only showed up now. | 00:56 |
kgiusti | eandersson: IIUC server.stop() can take an optional timeout, which if triggered server.stop() will throw a MessagingServerError. | 00:56 |
kgiusti | eandersson: check out the _OrderedTask class & friends in oslo_messaging.server.py ... I fear to tread there :) | 00:57 |
kgiusti | eandersson: that's a good question - it should've been hanging since the dawn of time. That "feature" has been around for years. | 00:58 |
eandersson | Yea exactly | 00:58 |
eandersson | Might just be that we never tested it before tbh | 00:58 |
eandersson | I see most services are just killed at the end of a test run with very little regard | 00:58 |
eandersson | e.g. https://zuul.opendev.org/t/openstack/build/53bdbdd36b2f4fa388048e2cebcfa59a/log/controller/logs/screen-designate-sink.txt#320 | 00:59 |
eandersson | It's a minor bug anyway, since it only affects unconfigured services... like the one we are running in our default tests lol | 01:00 |
kgiusti | eandersson: yeah this is first I've heard of it being hit... | 01:02 |
*** ralonsoh has joined #openstack-oslo | 07:26 | |
*** tosky has joined #openstack-oslo | 07:31 | |
*** rpittau|afk is now known as rpittau | 07:47 | |
*** TheJulia has quit IRC | 07:56 | |
*** mnasiadka has quit IRC | 07:56 | |
*** Anticimex has quit IRC | 07:56 | |
*** TheJulia has joined #openstack-oslo | 08:01 | |
*** mnasiadka has joined #openstack-oslo | 08:01 | |
*** Anticimex has joined #openstack-oslo | 08:01 | |
*** irclogbot_1 has quit IRC | 09:41 | |
*** irclogbot_3 has joined #openstack-oslo | 09:42 | |
*** rpittau is now known as rpittau|bbl | 10:10 | |
openstackgerrit | Hervé Beraud proposed openstack/oslo.tools master: Introduce a new tools to release oslo in a semi-automatized way https://review.opendev.org/725938 | 10:53 |
openstackgerrit | Hervé Beraud proposed openstack/oslo.tools master: Ignore a struture error (E010) on bashate https://review.opendev.org/726092 | 10:53 |
openstackgerrit | Hervé Beraud proposed openstack/oslo.tools master: Introduce a new tools to release oslo in a semi-automatized way https://review.opendev.org/725938 | 11:58 |
openstackgerrit | Hervé Beraud proposed openstack/oslo.tools master: Ignore a struture error (E010) on bashate https://review.opendev.org/726092 | 11:59 |
openstackgerrit | Hervé Beraud proposed openstack/oslo.tools master: Introduce a new tools to release oslo in a semi-automatized way https://review.opendev.org/725938 | 11:59 |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log master: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/718710 | 12:06 |
*** rpittau|bbl is now known as rpittau | 12:11 | |
moguimar | tkajinam, nice progress! | 12:12 |
moguimar | wrong channel xD | 12:12 |
*** raildo has joined #openstack-oslo | 12:25 | |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log master: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/718710 | 12:47 |
*** jhesketh has quit IRC | 13:43 | |
*** jhesketh has joined #openstack-oslo | 13:47 | |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log stable/ussuri: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/726156 | 14:26 |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log stable/train: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/726157 | 14:26 |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log stable/stein: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/726159 | 14:27 |
*** tosky_ has joined #openstack-oslo | 14:29 | |
*** tosky has quit IRC | 14:32 | |
*** dtantsur has joined #openstack-oslo | 14:33 | |
dtantsur | hey oslo people, have you ever seen this oslo.messaging failure? https://zuul.opendev.org/t/openstack/build/6a16fe9e33a340b691fe70e10778afb8/log/controller/logs/screen-ironic-inspector-api.txt#846 | 14:34 |
dtantsur | what can it mean? we see it pretty often in the CI. | 14:34 |
*** tkajinam has quit IRC | 14:43 | |
*** tosky_ is now known as tosky | 15:04 | |
bnemec | kgiusti: ^ | 15:06 |
* kgiusti looks.... | 15:07 | |
kgiusti | dtantsur: sorry if I can't give more detail but for whatever reason the tcp connection to rabbitmq is failing | 15:11 |
kgiusti | dtantsur: that error is thrown when the socket read fails. Is there anything in the rabbitmq logs? | 15:11 |
dtantsur | kgiusti: anything weird you can spot here? https://zuul.opendev.org/t/openstack/build/6a16fe9e33a340b691fe70e10778afb8/log/controller/logs/rabbitmq/rabbit@ubuntu-bionic-ovh-bhs1-0016428334_log.txt | 15:12 |
kgiusti | dtantsur: yes - it appears that rabbit is dropping the connection due to missed heartbeats from client | 15:17 |
dtantsur | mmm, ok | 15:17 |
kgiusti | dtantsur: from the rabbit log it appears if the heartbeat is set to 60s, which is plenty long. | 15:17 |
dtantsur | I wonder what we could do wrong.. | 15:17 |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log stable/queens: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/726173 | 15:18 |
kgiusti | dtantsur: from the rabbit log - "uwsgi" appears... hmm, seem to recall something about wsgi and eventlet not playing together well.. | 15:19 |
openstackgerrit | Benjamin Reichel proposed openstack/oslo.log stable/rocky: Add missing SYSLOG_FACILITY to JournalHandler https://review.opendev.org/726174 | 15:19 |
dtantsur | ugh | 15:19 |
dtantsur | kgiusti: aren't all API services using uwsgi on devstack? | 15:19 |
kgiusti | dtantsur: I forget the specific issue here - I'll need to try to page that back in to be sure. Had something to do with a client lib holding a lock under a C extension (breaks eventlet IIRC) | 15:20 |
dtantsur | sigh | 15:22 |
dtantsur | I wonder why we see it on inspector, but not on ironic | 15:22 |
dtantsur | maybe they're started differently | 15:22 |
* dtantsur will check after the current meeting | 15:23 | |
kgiusti | dtantsur: see https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/impl_rabbit.py#L473 | 15:42 |
dtantsur | ugh | 15:43 |
kgiusti | dtantsur: just stepping back a bit - I'm *guessing* the heartbeats are timing out due to something preventing the heartbeat thread from running | 15:43 |
dtantsur | it's a good guess, I suspect | 15:43 |
kgiusti | dtantsur: it's what we've seen before... | 15:43 |
kgiusti | dtantsur: in any case, hberaud did some work in this area recently to provide a way to run the heartbeat thread in an isolated pthread | 15:44 |
* hberaud read | 15:45 | |
kgiusti | dtantsur: might be worth trying it out by setting heartbeat_in_pthread to true in the config - I'd defer to hberaud's wise opinion on this. | 15:46 |
hberaud | dtantsur, kgiusti: hum... maybe something related to => https://github.com/eventlet/eventlet/issues/592 | 15:46 |
* hberaud think out loud about my first idea | 15:48 | |
hberaud | thread, eventlet, heartbeat, greenthread, py37... | 15:48 |
dtantsur | key words of pain :) | 15:48 |
hberaud | dtantsur: all in one | 15:49 |
hberaud | also could be named: the recipe of the disaster... | 15:50 |
dtantsur | yeah | 15:50 |
hberaud | dtantsur: which version is impacted? | 15:51 |
dtantsur | we've noticed this on master | 15:51 |
hberaud | ok so I guess py37 for sure on CI | 15:52 |
dtantsur | I *think* dsvm jobs are still running on 36 | 15:53 |
hberaud | ok | 15:53 |
dtantsur | which is what bionic has | 15:53 |
hberaud | dtantsur: a quick one could be to test something like => https://github.com/openstack/sahara/commit/a6ee52234382f8c52023796ed13cb3cdc8521098 | 15:54 |
dtantsur | lemme check if we've merged this one | 15:54 |
hberaud | dtantsur: to help us to find the culprit | 15:54 |
hberaud | dtantsur: IIRC around => https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L50,L60 | 15:55 |
dtantsur | okay, we have this patch proposed, we'll try to merge it | 15:55 |
hberaud | dtantsur: also I wondering if we should doing the same thing here ^^^ | 15:56 |
hberaud | dtantsur: I think yes | 15:56 |
hberaud | dtantsur: any opinion? | 15:56 |
dtantsur | I don't entirely understand what these patches are doing.. likely yes | 15:57 |
dtantsur | we're kind of in a chicken-and-egg situation now, because we have another breakage which we've been unable to fix | 15:58 |
hberaud | they patching the original thread to use the up-to-date _active global variable | 15:58 |
hberaud | dtantsur: awesome situation | 15:58 |
dtantsur | what can be better | 15:58 |
hberaud | dtantsur: do you plan to enjoy your weekend? | 15:58 |
dtantsur | not until Friday evening | 15:59 |
hberaud | :) | 15:59 |
dtantsur | (Germany doesn't quite celebrate the Victory day) | 15:59 |
kgiusti | hberaud: is there any way be can determine if the oslo.messaging heartbeat thread is in fact blocked? We only have rabbit logs indicating hb timeouts as our only clue. | 15:59 |
hberaud | dtantsur: ahaha good point | 16:00 |
hberaud | kgiusti: I suppose by attaching/snooping the right pid or something like | 16:01 |
hberaud | ps pidstat | 16:01 |
hberaud | but on CI I don't know how to catch this | 16:02 |
hberaud | and I don't think we've more log than that... | 16:02 |
kgiusti | hberaud: I imagine slapping a log w/timestamp here: https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/impl_rabbit.py#L993 | 16:03 |
kgiusti | hberaud: brute force low tech stuff. | 16:04 |
* hberaud on another meeting so sorry for the delays | 16:05 | |
hberaud | kgiusti: yep log.debug around => https://opendev.org/openstack/oslo.messaging/src/branch/master/oslo_messaging/_drivers/impl_rabbit.py#L959 could help us | 16:07 |
hberaud | dtantsur: do not hesitate to ping me when you will get more feedback with your (un)merged patch (eventlet related) | 16:15 |
dtantsur | cool, thanks | 16:15 |
dtantsur | interestingly, it doesn't fail all the time, "only" about 50% | 16:18 |
hberaud | so it doesn't reflect a standard behavior :) | 16:20 |
kgiusti | dtantsur: just in case it's necessary: https://paste.centos.org/view/a3ef28e1 | 16:20 |
hberaud | I guess if eventlet is behind this one we should observe all the time this bug | 16:21 |
dtantsur | all bets are off when dealing with eventlet | 16:21 |
hberaud | dtantsur: do you know if this one is in use here => https://github.com/openstack/puppet-oslo/blob/23eb13c6c86ef7b3a1deb9b1b8f636a8c072908d/manifests/messaging/rabbit.pp#L106,L114 ? | 16:23 |
hberaud | dtantsur: in the used config | 16:23 |
dtantsur | it's not hard to check, gimme a minute | 16:24 |
openstackgerrit | Merged openstack/oslo.utils master: Remove monotonic usage https://review.opendev.org/725424 | 16:25 |
dtantsur | hberaud: I don't see this option in https://zuul.opendev.org/t/openstack/build/6a16fe9e33a340b691fe70e10778afb8/log/controller/logs/etc/ironic-inspector/inspector_conf.txt so probably the default is used | 16:26 |
dtantsur | oslo_messaging_rabbit.heartbeat_in_pthread = False | 16:26 |
hberaud | ok so we are in a greenthread for sure | 16:26 |
hberaud | another check could be to turn true this one | 16:27 |
hberaud | to switch on standard python thread | 16:27 |
*** dtantsur is now known as dtantsur|afk | 16:28 | |
hberaud | s/on/to a/ | 16:28 |
hberaud | dtantsur|afk: in other word you're using the native behavior (historical behavior) | 16:30 |
hberaud | dtantsur|afk: and this option allow us to move the heartbeat in a std python thread to avoid eventlet side effects | 16:31 |
*** rpittau is now known as rpittau|afk | 16:34 | |
openstackgerrit | Merged openstack/oslo.service master: Remove monotonic usage https://review.opendev.org/725414 | 16:58 |
openstackgerrit | Merged openstack/oslo.service stable/ussuri: Monkey patch original current_thread _active https://review.opendev.org/725613 | 16:58 |
*** toabctl has quit IRC | 17:04 | |
*** gmann is now known as gmann_afk | 17:25 | |
*** ralonsoh has quit IRC | 17:51 | |
*** gmann_afk is now known as gmann | 19:06 | |
*** hemna has quit IRC | 21:26 | |
*** hemna has joined #openstack-oslo | 21:39 | |
*** raildo has quit IRC | 21:45 | |
*** hoonetorg has quit IRC | 22:34 | |
*** hoonetorg has joined #openstack-oslo | 22:46 | |
*** tkajinam has joined #openstack-oslo | 22:49 | |
*** tosky has quit IRC | 23:36 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!