gus | harlowja: yep, looks good. That reminds me that I still need to go back and audit the code I wrote yesterday to make sure all the flows are using consistent names so they can be composed usefully... | 00:18 |
openstackgerrit | Joshua Harlow proposed openstack/taskflow: Add *basic* scope visibility constraints (WIP) https://review.openstack.org/137245 | 00:20 |
harlowja | gus ^ also | 00:20 |
harlowja | seems to work on a local test, ha | 00:20 |
harlowja | ship it :-P | 00:23 |
vishy | does anyone here have any idea why periodics fail to fire while oslo.messaging is attempting to reconnect? | 00:24 |
harlowja | vishy using rabbit? | 00:30 |
*** dims_ has quit IRC | 00:30 | |
vishy | yup | 00:30 |
vishy | i’m seeing a failure which has happened before where suddenly a worker is processing messages | 00:31 |
vishy | but not doing anything with them | 00:31 |
vishy | in this case it happened after a 5 min network outage | 00:31 |
harlowja | hmmm, using oslo.messaging master or a released version? | 00:32 |
harlowja | cause i think that code just recently changed | 00:32 |
harlowja | * https://github.com/openstack/oslo.messaging/commit/973301aa70 | 00:32 |
harlowja | maybe something changed there that shouldn't have (if u are using that) | 00:33 |
harlowja | sileht might have some ideas too | 00:35 |
vishy | @harlowja http://paste.openstack.org/show/138493/ | 00:36 |
vishy | icehouse | 00:37 |
vishy | so that is an example of where it is hung | 00:37 |
vishy | ugh darn it recovered | 00:37 |
*** ViswaV_ has quit IRC | 00:38 | |
*** ViswaV has joined #openstack-oslo | 00:39 | |
harlowja | :( | 00:42 |
*** ViswaV has quit IRC | 00:44 | |
vishy | harlowja: ok great | 00:49 |
vishy | so connecting to the eventlet backdoor and doing a print greenthreads causes it to recover | 00:49 |
vishy | fun | 00:49 |
harlowja | seems like this loop https://github.com/openstack/oslo.messaging/blob/master/oslo/messaging/_drivers/amqpdriver.py#L272 | 00:50 |
*** alexpilotti has quit IRC | 00:51 | |
vishy | it looks like it was stuck here: | 00:51 |
vishy | http://paste.openstack.org/show/138506/ | 00:51 |
vishy | @harlowja it looks like that while loop doesn’t ever yield to other greenthreads | 00:52 |
harlowja | ya | 00:52 |
vishy | which might be what hangs the periodics | 00:52 |
vishy | and then they somehow get into an endless loop where they don’t recover | 00:52 |
vishy | let me try adding a sleep in there | 00:52 |
harlowja | kk | 00:52 |
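A minimal sketch (not the actual oslo.messaging code, only assuming eventlet is installed) of the starvation being discussed here: a reconnect-style loop with no yield point keeps the hub busy, so the periodic greenthread cannot fire until the loop ends, while even a zero-second eventlet.sleep() gives it a chance to run.

```python
# Sketch only: demonstrates greenthread starvation, not the real driver loop.
import time
import eventlet


def periodic():
    while True:
        print("periodic fired at", time.strftime("%H:%M:%S"))
        eventlet.sleep(0.5)


def busy_reconnect_loop(seconds):
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass                 # pretend to retry the connection, never yielding
        # eventlet.sleep(0)  # uncomment: a zero-length yield lets periodic() run


eventlet.spawn(periodic)
eventlet.spawn(busy_reconnect_loop, 3)
eventlet.sleep(5)  # after the first fire, nothing prints for ~3 seconds
```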
harlowja | or at https://github.com/openstack/oslo.messaging/blob/stable/icehouse/oslo/messaging/_drivers/amqpdriver.py#L220 | 00:53 |
harlowja | if that timeout is None or something that might be bad | 00:54 |
harlowja | it seems like other greenthreads there are also doing self._queues[msg_id].get(block=True, timeout=timeout) which i'm not sure, is that the same queue everyone is blocking on | 00:55 |
vishy | harlowja: weird there is another spot that is responsible for reconnect | 00:55 |
vishy | and it does have a time.sleep | 00:55 |
harlowja | vishy i also wasn't aware that more than 1 greenthread runs periodic tasks (but i might be wrong) | 01:00 |
harlowja | so if 1 periodic task is sucked up running that loop | 01:00 |
vishy | yeah it has a spawn_n in loopingcall | 01:01 |
cburgess | vishy: I'm with harlowja on this. I thought the periodics run in the main loop which will block on RPC reconnect. | 01:01 |
vishy | so each loopingcall runs one | 01:01 |
*** yamahata has joined #openstack-oslo | 01:01 | |
vishy | i think the periodic tasks is a single call | 01:01 |
harlowja | so that equates to 1 thread? | 01:01 |
* harlowja checking what nova is doing | 01:01 | |
cburgess | Oh you are saying it reconnects but the task never unblocks, sorry just read your thread trace. | 01:01 |
vishy | so the issue is reconnect happens but periodics don’t start and the process is completely screwed | 01:02 |
vishy | messages get pulled off of the queue but don’t make any progress | 01:02 |
vishy | i noticed during the reconnect loop periodics are not firing | 01:02 |
vishy | i thought it might be related | 01:02 |
vishy | the other odd thing is logging in to backdoor and pgt() | 01:02 |
cburgess | Normal orchestration requests also get pulled off but don't do any work? | 01:02 |
vishy | seems to unblock the periodics | 01:02 |
vishy | yup | 01:03 |
cburgess | Damn ok I feel like we saw this once a few months ago. | 01:03 |
cburgess | Trying to remember what we determined it was. | 01:03 |
vishy | i got a launch command and it said trying to start | 01:03 |
vishy | but doesn’t actually do anything | 01:03 |
vishy | like it doesn’t ever get into the libvirt thread | 01:03 |
vishy | i strongly suspect this old issue: https://bitbucket.org/eventlet/eventlet/pull-request/29/fix-use-of-semaphore-with-tpool-issue-137/diff | 01:04 |
vishy | hanging on the logging semaphore or the libvirt tpool | 01:04 |
vishy | but i’m still a bit confused why the amqp sleep isn’t allowing the periodic tasks to make progress | 01:05 |
*** stevemar has quit IRC | 01:05 | |
*** takedakn has joined #openstack-oslo | 01:05 | |
cburgess | That link isn't loading. | 01:06 |
vishy | yay atlassian | 01:07 |
vishy | whole thing is down https://bitbucket.org/eventlet/eventlet/issues | 01:07 |
cburgess | LOL | 01:07 |
cburgess | RAD | 01:07 |
cburgess | I vaguely remember that one though. | 01:07 |
cburgess | eventlet 0.13 fixed it or something as I recall. | 01:07 |
vishy | google cache | 01:08 |
vishy | http://webcache.googleusercontent.com/search?q=cache:H57Vyv3erXEJ:https://bitbucket.org/eventlet/eventlet/pull-request/29/fix-use-of-semaphore-with-tpool-issue-137+&cd=1&hl=en&ct=clnk&gl=us | 01:08 |
vishy | no it has not been fixed | 01:08 |
cburgess | Oh this one from comstud from way back when. | 01:08 |
vishy | yup | 01:08 |
vishy | and rax is still using a forked eventlet afaik | 01:08 |
vishy | with his fix in | 01:09 |
*** takedakn has quit IRC | 01:10 | |
*** amotoki has joined #openstack-oslo | 01:10 | |
vishy | is it possible that somehow that does not have a monkeypatched time? | 01:11 |
vishy | i don’t see how that could happen | 01:11 |
cburgess | I guess it's theoretically possible. | 01:11 |
cburgess | Does it happen frequently? You could test the object and log something if it's not patched. | 01:12 |
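For what it's worth, eventlet can answer that question directly; a quick check like this (from the backdoor or a temporary log line, assuming eventlet) would confirm whether time is actually patched:

```python
# Quick sanity check, assuming eventlet: is the time module monkeypatched?
import time
import eventlet.patcher

print("time patched:", eventlet.patcher.is_monkey_patched("time"))
print("thread patched:", eventlet.patcher.is_monkey_patched("thread"))
# Another tell-tale: after monkeypatching, time.sleep is eventlet's green sleep.
print("time.sleep ->", time.sleep)
```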
harlowja | i wonder if the periodic task gets sucked into the following (never getting its reply for some reason) | 01:13 |
harlowja | https://github.com/openstack/oslo.messaging/blob/stable/icehouse/oslo/messaging/_drivers/amqpdriver.py#L252 | 01:13 |
* harlowja just speculating | 01:13 | |
vishy | holy crap that works | 01:13 |
cburgess | What works? | 01:13 |
vishy | periodcs are still going at the moment | 01:13 |
vishy | oh wait | 01:13 |
vishy | so yeah the periodics are hanging waiting for replies | 01:14 |
vishy | they run initially | 01:14 |
harlowja | ya, do they get locked into that loop? | 01:14 |
vishy | ah so that’s it | 01:14 |
vishy | the periodic processor gets stuck waiting for a reply from conductor | 01:14 |
harlowja | ya, which i guess drops on the floor (due to reconnect) | 01:14 |
harlowja | and then spins | 01:15 |
harlowja | maybe even becoming 'Ok, we're the thread responsible for polling the connection' | 01:15 |
harlowja | :-/ | 01:15 |
cburgess | Yeah that _poll_connection function doesn't seem to have a sleep or yield. | 01:15 |
cburgess | https://github.com/openstack/oslo.messaging/blob/stable/icehouse/oslo/messaging/_drivers/amqpdriver.py#L201 | 01:15 |
harlowja | ya, its almost like we don't want periodic tasks to get stuck in that part | 01:15 |
harlowja | they shouldn't become the 'thread responsible for polling the connection' | 01:16 |
cburgess | Agreed | 01:16 |
vishy | i suspect it is locking trying to send the message actually | 01:16 |
harlowja | leave it to some other thread, ha | 01:16 |
vishy | lemme look | 01:16 |
cburgess | Could be.. | 01:16 |
vishy | so i have three threads stuck in reconnect | 01:17 |
cburgess | So the rabbit server is down right now? | 01:18 |
cburgess | Or unavailable? | 01:18 |
vishy | one in service_update | 01:18 |
vishy | report state | 01:18 |
vishy | which makes sense | 01:18 |
vishy | now in this case the recovery happens on that one | 01:18 |
vishy | because my services were happily reporting state | 01:18 |
vishy | one in update_available_resource | 01:19 |
vishy | that one is the one that never recovers | 01:19 |
cburgess | Oh.... | 01:19 |
cburgess | Can you send a link to the reconnect function you are stuck in? | 01:20 |
vishy | its here | 01:20 |
vishy | File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/impl_rabbit.py", line 580, in reconnect | 01:20 |
vishy | self._connect(params) | 01:20 |
vishy | but when it does reconnect | 01:21 |
*** tsekiyam_ has joined #openstack-oslo | 01:21 | |
cburgess | That line number is a bit off from what is in the stable/icehouse branch | 01:21 |
vishy | i think it is stuck in loopingcall here: | 01:21 |
vishy | http://paste.openstack.org/show/138506/ | 01:21 |
vishy | maybe i need to update my oslo.messaging? | 01:22 |
vishy | lemme look at a diff | 01:22 |
vishy | i’m using 1.3.0-0ubuntu1~cloud0 | 01:22 |
harlowja | it'd be interesting to see whatever the following outputs | 01:23 |
harlowja | LOG.debug('Dynamic looping call %(func_name)s sleeping ' | 01:23 |
harlowja | 'for %(idle).02f seconds', | 01:23 |
harlowja | {'func_name': repr(self.f), 'idle': idle}) | 01:23 |
harlowja | if u think its stuck there | 01:23 |
*** tsekiyama has quit IRC | 01:23 | |
vishy | i added an import so that would move the lines by 2 | 01:24 |
vishy | to try switching time.sleep for greenthread.sleep | 01:25 |
cburgess | OK thats harmless then. | 01:25 |
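The swap being described, in isolation (assumes eventlet; not the real loopingcall code): if time isn't monkeypatched, time.sleep blocks the whole hub, whereas greenthread.sleep always yields cooperatively.

```python
# Illustration of the time.sleep -> greenthread.sleep swap.
import time
from eventlet import greenthread


def pause(idle):
    # time.sleep(idle)       # blocks every greenthread if time isn't patched
    greenthread.sleep(idle)  # always yields to the eventlet hub
```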
*** tsekiyam_ has quit IRC | 01:25 | |
vishy | harlowja that stops printing | 01:25 |
harlowja | hmmm | 01:26 |
vishy | i did get a WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 426.364903 sec | 01:26 |
vishy | on one of them | 01:26 |
harlowja | i do wonder if stuff is just stuck in https://github.com/openstack/oslo.messaging/blob/stable/icehouse/oslo/messaging/_drivers/amqpdriver.py#L250 | 01:27 |
*** mtanino has quit IRC | 01:27 | |
harlowja | in that durn loop | 01:27 |
vishy | that is often the last message that prints | 01:27 |
vishy | i don’t see that loop in the pgt output | 01:28 |
vishy | any thoughts on how i could test for that? | 01:28 |
harlowja | File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 280, in wait ? | 01:28 |
harlowja | hmmm | 01:29 |
*** dimsum__ has joined #openstack-oslo | 01:30 | |
cburgess | Log on entrance and exit of the loop? | 01:30 |
vishy | wait a sec | 01:31 |
vishy | if the dynamic looping call function throws an exception | 01:32 |
vishy | doesn’t it just exit | 01:32 |
harlowja | seems so, maybe that too | 01:33 |
harlowja | but i think u get a 'LOG.exception(_LE('in dynamic looping call'))' | 01:33 |
vishy | i’m not seeing that message | 01:35 |
harlowja | ya | 01:35 |
vishy | ok so it isn’t an exception | 01:36 |
harlowja | it'd be interesting to know the timeout that 'message = self.waiters.get(msg_id, timeout)' has | 01:36 |
harlowja | or whatever timeout it thinks it's using (hopefully not None) | 01:36 |
*** dimsum__ has quit IRC | 01:36 | |
*** dimsum__ has joined #openstack-oslo | 01:37 | |
harlowja | although the default that seems to be passed along if not provided is timeout=None | 01:37 |
harlowja | hmmm | 01:37 |
*** denis_makogon has quit IRC | 01:37 | |
harlowja | which i guess becomes self.conf.rpc_response_timeout | 01:38 |
harlowja | but it could be X times of that, seeing all the places where that seems to be used | 01:38 |
vishy | harlowja: added some logs to that method | 01:40 |
vishy | so we will see | 01:40 |
harlowja | cools | 01:40 |
vishy | not seeing logs at all | 01:41 |
vishy | I don’t think it is making calls | 01:41 |
*** dimsum__ has quit IRC | 01:41 | |
vishy | i think the updates are casts | 01:41 |
vishy | so it isn’t getting stuck there | 01:42 |
harlowja | hmmm | 01:42 |
harlowja | seems to be though that the update_available_resource calls into the conductor, so i think thats a call; but i guess this isn't it | 01:43 |
harlowja | pretty sure most of stuff using conductor is call | 01:43 |
harlowja | anyway | 01:44 |
harlowja | any way u can get more traces from when this happens? | 01:45 |
harlowja | wonder if a pattern will emerge | 01:45 |
harlowja | wonder/hope, ha | 01:45 |
harlowja | its interesting how so much goes through the RPC layer nowadays | 01:46 |
harlowja | probably be seeing more stuff like this i think | 01:46 |
harlowja | *especially under reconnects, disconnects, partitions... | 01:50 |
vishy | ok so this is interesting | 01:50 |
vishy | this is the last trace | 01:51 |
vishy | http://paste.openstack.org/show/138536/ | 01:51 |
vishy | so it tries to reconnect with timeouts a bunch | 01:51 |
cburgess | vishy: How are you simulating the failure? Killing rabbit? | 01:52 |
vishy | no dropping traffic to rabbit | 01:52 |
vishy | in both directions | 01:52 |
vishy | but this definitely reproed | 01:52 |
cburgess | OK random lark... try setting kombu_reconnect_delay to 10 | 01:53 |
vishy | my report state seems to be ok in this case | 01:53 |
vishy | but my other thread failed | 01:53 |
vishy | so i’m suspecting that this is the issue | 01:53 |
cburgess | What is? | 01:53 |
vishy | two threads are attempting to publish to conductor | 01:53 |
*** dimsum__ has joined #openstack-oslo | 01:53 | |
vishy | they both attempt to redeclare the exchange | 01:54 |
vishy | the first succeeds | 01:54 |
vishy | but the second fails | 01:54 |
vishy | and it doesn’t recover | 01:54 |
cburgess | OH | 01:54 |
cburgess | Oh | 01:54 |
cburgess | There is a bug about this. | 01:54 |
cburgess | Where the hell did I see this bug... | 01:54 |
cburgess | https://bugs.launchpad.net/neutron/+bug/1318721 | 01:55 |
vishy | i remember a bug about needing to redeclare | 01:55 |
vishy | looks like that never merged | 01:56 |
cburgess | Nope it didn't | 01:56 |
cburgess | But pretty sure its the same bug. | 01:56 |
harlowja | anyone know the historical reason that oslo.messaging just doesn't have 1 dispatch thread (that can also do reconnects and such); instead of having many threads that seem to try to do it together? (just out of curiosity) | 01:59 |
vishy | cburgess: so not sure | 02:00 |
vishy | so i have two greenthreads logging that they are trying to reconnect | 02:00 |
vishy | one of them gets a timeout | 02:01 |
vishy | then the other one gets an ioerror | 02:01 |
vishy | they both log the error properly | 02:01 |
vishy | but then only one tries to reconnect | 02:01 |
cburgess | Well similar | 02:01 |
vishy | the other one is just sitting there | 02:01 |
vishy | ok i have the hang finally | 02:06 |
harlowja | ?? | 02:06 |
vishy | this is where it is hung | 02:07 |
vishy | http://paste.openstack.org/show/138542/ | 02:07 |
harlowja | so its in that loop of death | 02:08 |
harlowja | https://github.com/openstack/oslo.messaging/blob/stable/icehouse/oslo/messaging/_drivers/amqpdriver.py#L265 ? | 02:08 |
harlowja | somehow the periodic task got to take over 'Ok, we're the thread responsible for polling the connection' ? | 02:08 |
vishy | no it is hung in send | 02:09 |
vishy | not receive | 02:09 |
cburgess | OK I have to run but let me know how this turns out. | 02:09 |
harlowja | hmmm, from that paste send -> wait (the loop?) -> _poll_connection -> consume | 02:10 |
vishy | yeah | 02:10 |
vishy | weird | 02:11 |
vishy | i added logging there | 02:11 |
vishy | and i didn’t see any log messages | 02:11 |
harlowja | odd | 02:11 |
harlowja | leftover pyc file or something? | 02:11 |
harlowja | there are 2 branches of that loop, only log one branch? | 02:11 |
harlowja | or both? | 02:11 |
harlowja | there's also nested while True that it can get stuck in | 02:12 |
harlowja | a few of those | 02:12 |
harlowja | if the timeout is the default 60; and it does that spinning it really becomes 60 * infinite from what i can tell (not knowing this code so well) | 02:13 |
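A toy illustration of that "60 * infinite" worry (not the actual driver code; reply_queue is a stand-in): if every pass through the loop restarts the same per-attempt timeout, the overall wait is unbounded even though each individual get() does time out.

```python
import queue  # sketch only; reply_queue is a hypothetical per-call queue


def wait_for_reply(reply_queue, per_attempt_timeout=60):
    while True:  # nothing bounds the *total* time spent here
        try:
            return reply_queue.get(timeout=per_attempt_timeout)
        except queue.Empty:
            continue  # timed out -> just try again, forever
```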
vishy | i definitely put logs everywhere | 02:14 |
harlowja | odd | 02:15 |
vishy | maybe the pyc didn’t get regenerated | 02:15 |
vishy | i have four logs in that wait method and i didn’t see any of them get hit at any point | 02:15 |
harlowja | a few more while True loops in https://github.com/openstack/oslo.messaging/blob/stable/icehouse/oslo/messaging/_drivers/amqpdriver.py#L201 | 02:15 |
harlowja | weird | 02:15 |
vishy | ok got it logging now | 02:17 |
vishy | sweet | 02:17 |
harlowja | woot | 02:17 |
harlowja | log baby log | 02:18 |
harlowja | ha | 02:18 |
vishy | ok i see the log saying it | 02:19 |
vishy | 2014-11-26 02:18:00.833 29972 WARNING oslo.messaging._drivers.amqpdriver [-] GOT LOCK FOR d9d2dec9dc88496fba77cd42f1a4f79a with timeout: 120 | 02:19 |
vishy | i have three separate methods waiting on reconnect | 02:19 |
harlowja | with one or 2 spinning? | 02:20 |
vishy | so i suspect that it is when it passes the 120 seconds that it fails | 02:20 |
harlowja | possibly | 02:20 |
vishy | because if i reconnect quickly it seems to recover ok | 02:20 |
harlowja | the resource tracker does have a lot of '@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)' on it | 02:21 |
harlowja | although if one of those threads that has the lock gets promoted to the 'thread responsible for polling the connection' that would seem bad | 02:21 |
harlowja | since at that point nobody else will get said lock | 02:21 |
harlowja | until that thread stops being the 'thread responsible for polling the connection' | 02:22 |
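A sketch of the hazard described above (hypothetical names, assuming eventlet; not nova or oslo.messaging code): if the greenthread holding the compute-resource semaphore becomes the connection poller and that poll never returns, everything else queued on the semaphore is wedged behind it.

```python
import eventlet
from eventlet import semaphore
from eventlet.event import Event

COMPUTE_RESOURCE_SEMAPHORE = semaphore.Semaphore()  # stand-in name


def update_available_resource():
    with COMPUTE_RESOURCE_SEMAPHORE:
        # "promoted" to the thread responsible for polling the connection,
        # and the poll/recv has no timeout, so this never returns...
        Event().wait()


def another_synchronized_task():
    with COMPUTE_RESOURCE_SEMAPHORE:  # ...so this can never acquire the lock
        print("never reached while the poller holds the semaphore")


eventlet.spawn(update_available_resource)
eventlet.spawn(another_synchronized_task)
eventlet.sleep(1)  # observe: nothing prints
```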
vishy | or it could be a race | 02:22 |
vishy | cuz it recovered ok this time | 02:22 |
harlowja | keep on trying, maybe adding the logging stuff in screwed it up due to timing changes | 02:22 |
vishy | http://paste.openstack.org/show/138558/ | 02:23 |
vishy | that is very interesting | 02:23 |
vishy | it looked like all three threads that were waiting there attempted to process all three message responses | 02:24 |
harlowja | :-/ | 02:24 |
* harlowja makes me wonder why oslo.messaging has so many dispatch threads (vs just one) | 02:24 | |
*** raildo_ has joined #openstack-oslo | 02:24 | |
harlowja | anyway, gotta head out, let me know what u find | 02:25 |
harlowja | seems to be getting closer (maybe) | 02:25 |
harlowja | ha | 02:25 |
harlowja | *not like a mean 'ha' | 02:26 |
harlowja | lol | 02:26 |
vishy | ok | 02:26 |
vishy | well i will try to repro again | 02:26 |
openstackgerrit | Joshua Harlow proposed openstack/taskflow: Add *basic* scope visibility constraints (WIP) https://review.openstack.org/137245 | 02:26 |
vishy | with the logging in | 02:26 |
vishy | haven’t had it repro with logging in | 02:26 |
vishy | so.. add a sleep :) | 02:26 |
harlowja | ya, sucky part is if it never occurs with logging on due to eventlet | 02:26 |
harlowja | but lets see | 02:26 |
vishy | recovered 3/3 so far | 02:27 |
harlowja | :( | 02:27 |
harlowja | ya, so there's your problem, run more with debug logging | 02:27 |
harlowja | ha | 02:27 |
vishy | so looks like it is this issue as I suspected: https://bitbucket.org/eventlet/eventlet/issue/137/use-of-threading-locks-causes-deadlock | 02:30 |
*** ftcjeff has joined #openstack-oslo | 02:33 | |
*** raildo_ has quit IRC | 02:35 | |
*** dimsum__ has quit IRC | 02:52 | |
*** raildo_ has joined #openstack-oslo | 02:57 | |
*** raildo_ has quit IRC | 03:24 | |
*** harlowja is now known as harlowja_away | 03:29 | |
*** ftcjeff has quit IRC | 03:37 | |
*** ftcjeff has joined #openstack-oslo | 03:41 | |
*** ftcjeff has quit IRC | 03:42 | |
*** dimsum__ has joined #openstack-oslo | 03:52 | |
*** ftcjeff has joined #openstack-oslo | 03:55 | |
*** dimsum__ has quit IRC | 03:57 | |
*** amrith is now known as _amrith_ | 04:04 | |
*** ftcjeff has quit IRC | 04:08 | |
*** zzzeek has quit IRC | 04:52 | |
*** harlowja_at_home has joined #openstack-oslo | 05:08 | |
*** harlowja_at_home has quit IRC | 05:12 | |
*** stevemar has joined #openstack-oslo | 05:13 | |
*** arnaud___ has quit IRC | 05:21 | |
vishy | harlowja_away: so that fix does not fix the issue | 05:27 |
vishy | which makes me feel better | 05:27 |
*** stevemar has quit IRC | 05:36 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/oslo.db: Imported Translations from Transifex https://review.openstack.org/136684 | 06:01 |
*** stevemar has joined #openstack-oslo | 06:01 | |
*** k4n0 has joined #openstack-oslo | 06:03 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/oslo.utils: Imported Translations from Transifex https://review.openstack.org/136566 | 06:13 |
*** arnaud___ has joined #openstack-oslo | 06:30 | |
*** takedakn has joined #openstack-oslo | 06:31 | |
*** stevemar has quit IRC | 06:34 | |
openstackgerrit | Joshua Harlow proposed openstack/taskflow: Add *basic* scope visibility constraints https://review.openstack.org/137245 | 06:45 |
*** takedakn has quit IRC | 06:58 | |
*** ajo has joined #openstack-oslo | 07:08 | |
*** exploreshaifali has joined #openstack-oslo | 07:18 | |
*** e0ne has joined #openstack-oslo | 07:38 | |
*** e0ne has quit IRC | 07:38 | |
*** jaypipes has quit IRC | 07:50 | |
openstackgerrit | Jens Rosenboom proposed openstack/oslo.messaging: Fix reconnect race condition with RabbitMQ cluster https://review.openstack.org/103157 | 07:56 |
*** dtantsur|afk is now known as dtantsur | 08:13 | |
*** oomichi has quit IRC | 08:17 | |
*** yamahata has quit IRC | 08:18 | |
frickler | vishy: I reanimated the re-declare patch from https://review.openstack.org/103157 before seeing your discussion on this issue. | 08:32 |
frickler | vishy: Did you check whether that patch might help mitigate the issues you are seeing? | 08:32 |
*** andreykurilin_ has joined #openstack-oslo | 08:35 | |
vishy | i haven’t checked yet | 08:51 |
vishy | but i don’t think it will | 08:51 |
vishy | frickler: it seems to be getting stuck on a recv that never times out | 08:51 |
vishy | nasty | 08:51 |
*** andreykurilin_ has quit IRC | 08:58 | |
*** andreykurilin_ has joined #openstack-oslo | 08:59 | |
*** amotoki has quit IRC | 09:11 | |
*** i159 has joined #openstack-oslo | 09:13 | |
*** e0ne has joined #openstack-oslo | 09:16 | |
*** ihrachyshka has joined #openstack-oslo | 09:19 | |
*** vigneshvar has joined #openstack-oslo | 09:20 | |
*** denis_makogon has joined #openstack-oslo | 09:26 | |
*** pblaho has joined #openstack-oslo | 09:43 | |
*** yamahata has joined #openstack-oslo | 09:44 | |
*** andreykurilin_ has quit IRC | 09:52 | |
*** subscope has quit IRC | 10:08 | |
openstackgerrit | Merged openstack/oslo.concurrency: Add a TODO for retrying pull request #20 https://review.openstack.org/137225 | 10:20 |
frickler | jenkins seems to fail with: 'oslo.middleware' is not in global-requirements.txt -- is this a known issue? | 10:26 |
*** andreykurilin_ has joined #openstack-oslo | 10:27 | |
*** subscope has joined #openstack-oslo | 10:34 | |
*** yamahata has quit IRC | 10:43 | |
*** yamahata has joined #openstack-oslo | 10:45 | |
*** arnaud___ has quit IRC | 10:48 | |
jokke_ | ttx: ping | 11:05 |
ttx | jokke_: pong | 11:06 |
jokke_ | ttx: sorry, wrong channel, will pm you | 11:06 |
*** exploreshaifali has quit IRC | 11:20 | |
ihrachyshka | frickler: a link to review? | 11:26 |
openstackgerrit | Matthew Gilliard proposed openstack/oslo.log: Remove NullHandler https://review.openstack.org/137338 | 11:45 |
*** denis_makogon has quit IRC | 11:49 | |
*** dmakogon_ is now known as denis_makogon | 11:49 | |
*** denis_makogon_ has joined #openstack-oslo | 11:49 | |
*** viktors|afk is now known as viktors | 11:52 | |
jokke_ | anyone here who could help me with oslo.messaging? | 11:54 |
*** denis_makogon_ has quit IRC | 12:04 | |
viktors | hi! Does anybody know the current state of the gate-tempest-dsvm-neutron-src-*-icehouse gate jobs ? | 12:15 |
viktors | jd__: hi! Maybe you know something about ^^ | 12:20 |
viktors | ? | 12:20 |
frickler | ihrachyshka: https://review.openstack.org/103157 | 12:25 |
ihrachyshka | frickler: hm. I wonder what those jobs are for. they include 'neutron' in their name, and icehouse, and neutron definitely hasn't used oslo.messaging till Juno | 12:27 |
frickler | the icehouse test has a different error from tempest about floating IPs | 12:29 |
frickler | but I think none of these could be related to the patch | 12:29 |
ihrachyshka | frickler: as for oslo.middleware issue for juno+ jobs, I guess we'll need to add it to requirements repo for Juno branch | 12:30 |
ihrachyshka | frickler: devstack checks that all oslo.messaging requirements are present in Juno requirements repo, and I guess oslo.middleware didn't get there, yet | 12:30 |
*** hemanthm has left #openstack-oslo | 12:32 | |
frickler | ihrachyshka: so can you fix that? or who takes care of these gate things? | 12:34 |
frickler | all the latest reviews at https://review.openstack.org/#/q/oslo.messaging,n,z seem to share these three failures | 12:36 |
ihrachyshka | frickler: see https://review.openstack.org/#/c/134727/ | 12:37 |
ihrachyshka | frickler: I guess it's meant to fix similar failures, though I wonder whether the patch is actually correct and will solve your particular issue | 12:38 |
ihrachyshka | frickler: I've asked Dave to consider your failures in comments, let's see what he has to say | 12:38 |
frickler | yeah, that seems to be related, so I will wait for that, thx | 12:41 |
*** exploreshaifali has joined #openstack-oslo | 13:04 | |
*** alexpilotti has joined #openstack-oslo | 13:22 | |
jd__ | viktors: I don't | 13:22 |
viktors | jd__: ok, got it | 13:23 |
*** vigneshvar_ has joined #openstack-oslo | 13:31 | |
*** alexpilotti has quit IRC | 13:32 | |
*** vigneshvar has quit IRC | 13:34 | |
*** jaypipes has joined #openstack-oslo | 13:36 | |
*** gordc has joined #openstack-oslo | 13:42 | |
*** yamahata has quit IRC | 13:50 | |
*** dimsum__ has joined #openstack-oslo | 13:51 | |
*** yamahata has joined #openstack-oslo | 13:51 | |
*** vigneshvar_ has quit IRC | 13:55 | |
openstackgerrit | Davanum Srinivas (dims) proposed openstack/oslo.vmware: Switch to use requests/urllib3 and enable cacert validation https://review.openstack.org/121956 | 13:56 |
*** _amrith_ is now known as amrith | 13:58 | |
*** sigmavirus24_awa is now known as sigmavirus24 | 14:08 | |
openstackgerrit | Merged openstack/oslo-incubator: Don't log missing policy.d as a warning https://review.openstack.org/137191 | 14:14 |
*** exploreshaifali has quit IRC | 14:15 | |
*** jgrimm is now known as zz_jgrimm | 14:32 | |
*** kgiusti has joined #openstack-oslo | 14:49 | |
*** zzzeek has joined #openstack-oslo | 14:53 | |
openstackgerrit | Davanum Srinivas (dims) proposed openstack/oslo.vmware: Switch to use requests/urllib3 and enable cacert validation https://review.openstack.org/121956 | 15:12 |
*** exploreshaifali has joined #openstack-oslo | 15:19 | |
*** mtanino has joined #openstack-oslo | 15:22 | |
*** dimsum__ has quit IRC | 15:28 | |
*** stevemar has joined #openstack-oslo | 15:28 | |
*** dimsum__ has joined #openstack-oslo | 15:29 | |
viktors | dhellmann: hi! | 15:30 |
dhellmann | viktors: hi. Looks like we have some persistent gate issues. :-( | 15:35 |
viktors | dhellmann: are you talking about gate-tempest-dsvm-src-*-icehouse job? | 15:37 |
dhellmann | viktors: yes | 15:37 |
dhellmann | viktors: I was out yesterday; is someone working on that? | 15:37 |
* dhellmann is still catching up on the backlog of email | 15:38 | |
viktors | dhellmann: I looked into project meeting log - http://eavesdrop.openstack.org/meetings/project/2014/project.2014-11-25-21.01.log.html | 15:38 |
dhellmann | I'm just reading that now | 15:39 |
*** alexpilotti has joined #openstack-oslo | 15:39 | |
*** tsekiyama has joined #openstack-oslo | 15:55 | |
*** pblaho_ has joined #openstack-oslo | 16:01 | |
*** dimsum__ is now known as dims | 16:03 | |
viktors | dhellmann: can we merge this patch to unblock gates - https://review.openstack.org/#/c/136856/ ? | 16:03 |
dhellmann | viktors: see the backlog in #openstack-dev -- I am working on a patch to pin the oslo requirements in the icehouse branch | 16:04 |
*** pblaho has quit IRC | 16:04 | |
viktors | dhellmann: ok, thanks! | 16:04 |
*** bogdando has quit IRC | 16:09 | |
*** bogdando has joined #openstack-oslo | 16:11 | |
*** kgiusti has quit IRC | 16:16 | |
*** pblaho_ is now known as pblaho | 16:16 | |
*** k4n0 has quit IRC | 16:30 | |
*** pblaho_ has joined #openstack-oslo | 16:31 | |
viktors | dhellmann: as for stable requirements - what should we do with requirements, which were missed in Icehouse | 16:33 |
viktors | ? | 16:33 |
*** pblaho has quit IRC | 16:35 | |
dhellmann | viktors: I'm not sure what you mean? | 16:44 |
viktors | dhellmann: patch https://review.openstack.org/#/c/136207/ fails on icehouse gates with error 'doc8' is not a global requirement but it should be,something went wrong | 16:45 |
viktors | that's true, because doc8 was added to global requirements in juno | 16:45 |
dhellmann | viktors: yeah, I think I need to cap whatever is calling for that to a lower number for icehouse | 16:45 |
dhellmann | viktors: do you know what is adding that requirement? | 16:47 |
viktors | dhellmann: see https://github.com/openstack/requirements/commit/83b3a598558d1ce51d81d3dd0c5dd219f96f8c84 | 16:47 |
dhellmann | viktors: ok, so taskflow at least and possibly some others. I'll work out which version of taskflow added that and lower the cap | 16:49 |
*** pblaho_ has quit IRC | 16:49 | |
*** viktors is now known as viktors|afk | 16:53 | |
sileht | dhellmann, viktors|afk I know that oslo.config and oslo.messaging are broken too | 16:54 |
sileht | by the introduction of docs8 and oslo.middleware | 16:55 |
sileht | dhellmann, icehouse/juno compat-jobs for the master branch can be removed when the stable requirements have been merged | 16:56 |
*** i159 has quit IRC | 16:59 | |
*** mfedosin has quit IRC | 17:05 | |
ihrachyshka | dhellmann: may I ask oslo team to prioritize https://review.openstack.org/136999 which blocks oslo.middleware consumption for neutron? | 17:08 |
openstackgerrit | Sergey Skripnick proposed openstack/oslo-incubator: Add ConnectionError exception https://review.openstack.org/137412 | 17:16 |
vishy | harlowja_away: so still no luck in fixing the issue | 17:21 |
*** e0ne has quit IRC | 17:28 | |
*** andreykurilin_ has quit IRC | 17:34 | |
*** kgiusti has joined #openstack-oslo | 17:38 | |
dims | ihrachyshka: lgtm | 17:44 |
ihrachyshka | dims: thanks sir! | 17:45 |
*** arnaud___ has joined #openstack-oslo | 17:48 | |
openstackgerrit | Davanum Srinivas (dims) proposed openstack/oslo.vmware: Switch to use requests/urllib3 and enable cacert validation https://review.openstack.org/121956 | 18:02 |
*** arnaud___ has quit IRC | 18:03 | |
*** harlowja_away is now known as harlowja | 18:04 | |
harlowja | vishy durn, i thought u found it :) | 18:04 |
harlowja | vishy did adding logs basically just cause eventlet to switch more and then it doesn't happen? | 18:05 |
harlowja | dhellmann the lower cap i think u put for taskflow is ok with me | 18:05 |
dhellmann | harlowja: k, thanks | 18:08 |
harlowja | np | 18:09 |
harlowja | sileht yt | 18:09 |
*** e0ne has joined #openstack-oslo | 18:13 | |
*** dtantsur is now known as dtantsur|afk | 18:13 | |
*** exploreshaifali has quit IRC | 18:13 | |
openstackgerrit | Joshua Harlow proposed openstack/taskflow: Add *basic* scope visibility constraints https://review.openstack.org/137245 | 18:15 |
*** e0ne has quit IRC | 18:18 | |
*** harlowja has quit IRC | 18:18 | |
*** harlowja has joined #openstack-oslo | 18:19 | |
*** e0ne has joined #openstack-oslo | 18:20 | |
openstackgerrit | Darragh Bailey proposed openstack-dev/hacking: Add import check for try/except wrapped imports https://review.openstack.org/134288 | 18:24 |
*** e0ne has quit IRC | 18:29 | |
*** arnaud___ has joined #openstack-oslo | 18:34 | |
*** _honning_ has joined #openstack-oslo | 18:38 | |
*** andreykurilin_ has joined #openstack-oslo | 18:39 | |
*** e0ne has joined #openstack-oslo | 18:40 | |
*** harlowja_ has joined #openstack-oslo | 18:41 | |
*** e0ne has quit IRC | 18:44 | |
*** harlowja has quit IRC | 18:45 | |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Add a decorator that logs enter/exit on wait() (no merge) https://review.openstack.org/137443 | 18:48 |
harlowja_ | vishy ^ might be useful for u | 18:48 |
harlowja_ | if u just use that patch, and take all the other logging out, i wonder if that would show the issue again (and still show useful info) | 18:49 |
harlowja_ | can place it on other methods, if u feel thats useful | 18:49 |
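Roughly the shape of such an enter/exit logging decorator (a generic sketch, not necessarily what that review does):

```python
import functools
import logging

LOG = logging.getLogger(__name__)


def log_enter_exit(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        LOG.debug("entering %s args=%s kwargs=%s", func.__name__, args, kwargs)
        try:
            return func(*args, **kwargs)
        finally:
            LOG.debug("exiting %s", func.__name__)
    return wrapper

# Usage: decorate the suspect method, e.g.
#
#     @log_enter_exit
#     def wait(self, msg_id, timeout):
#         ...
```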
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Add a decorator that logs enter/exit on wait() (no merge) https://review.openstack.org/137443 | 18:52 |
vishy | harlowja_: it isn’t in the loop | 18:56 |
*** yamahata has quit IRC | 18:59 | |
vishy | harlowja_: it gets stuck doing a recv and never times out | 19:05 |
vishy | then the other thread hits and doesn’t get the lock so it can’t ever make progress | 19:05 |
openstackgerrit | Merged openstack/oslo-incubator: Add middleware.catch_errors shim for Kilo https://review.openstack.org/136999 | 19:05 |
harlowja_ | vishy kk, so it gets a timeout passed in, but doesn't use it apparently? | 19:09 |
harlowja_ | or recv just broken, lol | 19:09 |
*** _honning_ has quit IRC | 19:10 | |
*** andreykurilin_ has quit IRC | 19:13 | |
vishy | apparently | 19:14 |
harlowja_ | :- | 19:16 |
harlowja_ | :-/ | 19:16 |
*** exploreshaifali has joined #openstack-oslo | 19:37 | |
vishy | harlowja_: so the main executor thread calls drain events without a timeout :( | 19:40 |
harlowja_ | :( | 19:40 |
harlowja_ | pew pew pew | 19:40 |
harlowja_ | that sux | 19:40 |
*** ihrachyshka has quit IRC | 19:41 | |
*** ihrachyshka has joined #openstack-oslo | 19:43 | |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Have the timeout decrement inside the wait() method https://review.openstack.org/137456 | 19:47 |
harlowja_ | vishy ^ probably is a good thing to have also | 19:47 |
harlowja_ | but won't address your main concern i think | 19:47 |
vishy | so i found the issue | 19:48 |
vishy | i think | 19:48 |
harlowja_ | cools | 19:49 |
vishy | harlowja_: https://github.com/openstack/oslo.messaging/blob/master/oslo/messaging/_executors/impl_eventlet.py#L87 | 19:50 |
harlowja_ | poll without timeout :( | 19:50 |
vishy | the main executor polls (which does a drain event) with no timeout | 19:50 |
harlowja_ | durn | 19:51 |
vishy | so yeah it hangs on reconnect | 19:56 |
*** dims has quit IRC | 19:56 | |
vishy | and none of the other threads can process events because they don’t have the lock | 19:56 |
vishy | cburgess: ^^ | 19:57 |
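A simplified version of the difference being pointed out (not the real executor code; 'listener' is a stand-in for whatever object exposes poll()): a poll with no timeout lets a dead connection wedge the one greenthread holding the connection lock forever, while a bounded poll at least comes back up for air each pass.

```python
def executor_loop_unbounded(listener):
    while True:
        incoming = listener.poll()  # no timeout: a dead connection hangs here
        # ... dispatch incoming ...


def executor_loop_bounded(listener, poll_timeout=1.0):
    while True:
        incoming = listener.poll(timeout=poll_timeout)
        if incoming is None:  # timed out with nothing to receive
            continue          # go around again; any hang is bounded per pass
        # ... dispatch incoming ...
```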
harlowja_ | :( | 19:58 |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Have the timeout decrement inside the wait() method https://review.openstack.org/137456 | 20:01 |
vishy | harlowja_: so it looks like a timeout param was recently added to the poll method for the trollius work | 20:02 |
vishy | the question is what should the timeout be for that call? | 20:02 |
vishy | should it be the same as the rpc timeout or do we need a new param | 20:02 |
harlowja_ | i don't think its rpc timeout related, just seems to be more poll related | 20:03 |
harlowja_ | rpc seems separate (hopefully 137456 helps with that one) | 20:03 |
*** ihrachyshka has quit IRC | 20:04 | |
*** denis_makogon_ has joined #openstack-oslo | 20:05 | |
*** e0ne has joined #openstack-oslo | 20:06 | |
*** ihrachyshka has joined #openstack-oslo | 20:08 | |
*** e0ne has quit IRC | 20:17 | |
*** denis_makogon has quit IRC | 20:19 | |
*** denis_makogon_ is now known as denis_makogon | 20:19 | |
*** dmakogon_ has joined #openstack-oslo | 20:19 | |
*** andreykurilin_ has joined #openstack-oslo | 20:20 | |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Add a listener provided default poll() timeout https://review.openstack.org/137467 | 20:28 |
harlowja_ | vishy ^ | 20:28 |
harlowja_ | seems reasonable... | 20:28 |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Add a listener provided default poll() timeout https://review.openstack.org/137467 | 20:34 |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Add a listener provided default poll() timeout https://review.openstack.org/137467 | 20:36 |
*** rpodolyaka1 has quit IRC | 20:36 | |
*** andreykurilin_ has quit IRC | 20:37 | |
*** exploreshaifali has quit IRC | 20:37 | |
*** ajo has quit IRC | 20:58 | |
*** ajo has joined #openstack-oslo | 21:05 | |
*** tsekiyam_ has joined #openstack-oslo | 21:06 | |
*** ajo has quit IRC | 21:09 | |
*** tsekiyama has quit IRC | 21:09 | |
*** e0ne has joined #openstack-oslo | 21:09 | |
*** mtanino has quit IRC | 21:11 | |
*** tsekiyam_ has quit IRC | 21:11 | |
*** e0ne has quit IRC | 21:13 | |
*** andreykurilin_ has joined #openstack-oslo | 21:19 | |
openstackgerrit | Thiago Paiva Brito proposed openstack/oslo-incubator: Improving docstrings for policy API https://review.openstack.org/137476 | 21:21 |
openstackgerrit | Thiago Paiva Brito proposed openstack/oslo-incubator: Improving docstrings for policy API https://review.openstack.org/137476 | 21:32 |
*** kgiusti has quit IRC | 21:39 | |
*** kgiusti has joined #openstack-oslo | 21:42 | |
*** ajo has joined #openstack-oslo | 21:44 | |
openstackgerrit | Joshua Harlow proposed openstack/oslo.messaging: Add a listener provided default poll() timeout https://review.openstack.org/137467 | 21:44 |
*** kgiusti has quit IRC | 21:51 | |
*** ihrachyshka has quit IRC | 21:52 | |
*** ajo has quit IRC | 21:56 | |
*** ajo has joined #openstack-oslo | 21:59 | |
*** alexpilotti has quit IRC | 22:15 | |
*** ajo has quit IRC | 22:18 | |
openstackgerrit | Joshua Harlow proposed openstack/taskflow: Add *basic* scope visibility constraints https://review.openstack.org/137245 | 22:35 |
openstackgerrit | Ben Nemec proposed openstack/oslo.concurrency: Add external lock fixture https://review.openstack.org/131517 | 22:39 |
*** alexpilotti has joined #openstack-oslo | 22:45 | |
*** stevemar has quit IRC | 22:50 | |
*** gordc has quit IRC | 22:55 | |
*** tsekiyama has joined #openstack-oslo | 23:01 | |
vishy | harlowja_: 1 sec may not be long enough, I got a timeout in normal operation when starting the service | 23:09 |
vishy | Timeout: Timeout while waiting on RPC response - topic: "<unknown>", RPC method: "<unknown>" info: "<unknown>" | 23:09 |
harlowja_ | hmmm, ideally u'd get a timeout, then the loop would continue | 23:10 |
harlowja_ | and try again and again? | 23:10 |
vishy | yeah it did continue | 23:11 |
vishy | but i don’t think it should timeout unless there is a failure | 23:11 |
*** kgiusti has joined #openstack-oslo | 23:11 | |
vishy | I’m guessing the greenlet switching can cause more than 1 second | 23:11 |
vishy | or something | 23:11 |
vishy | just set it to 120 | 23:11 |
vishy | seems ok so far | 23:11 |
vishy | testing to see if it fixes the issue | 23:11 |
harlowja_ | kk | 23:11 |
harlowja_ | can also handle the exception better; instead of invoking the logic in excutils.forever_retry_uncaught_exceptions to retry | 23:12 |
harlowja_ | or just mess around with poll() itself | 23:12 |
harlowja_ | although if it is timing out, thats good | 23:13 |
harlowja_ | it'd be interesting to try https://review.openstack.org/#/c/137456/ also | 23:14 |
harlowja_ | which imho does more of the 'correct' thing for threads that get stuck in the wait() method | 23:15 |
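The general shape of the "decrement the timeout inside wait()" idea (a sketch of the concept only, not the patch itself): track one deadline so repeated poll attempts can't stretch a single logical RPC timeout indefinitely.

```python
import queue
import time


def wait_for_reply(reply_queue, timeout=120):
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no RPC reply within %s seconds" % timeout)
        try:
            # poll in short slices, but never past the original deadline
            return reply_queue.get(timeout=min(remaining, 1.0))
        except queue.Empty:
            continue
```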
vishy | harlowja_: well partial victory | 23:19 |
vishy | it does raise a timeout properly | 23:19 |
harlowja_ | yippe | 23:19 |
harlowja_ | lol | 23:19 |
vishy | every 120 seconds... | 23:19 |
vishy | so the socket itself is no longer sending data | 23:20 |
vishy | which means it isn’t reconnecting properly | 23:20 |
harlowja_ | onwards toward the next victory/battle! | 23:20 |
harlowja_ | ha | 23:20 |
harlowja_ | what'd the reconnect do, just never really do it and giveup? | 23:22 |
*** andreykurilin_ has quit IRC | 23:22 | |
harlowja_ | i wonder if https://github.com/openstack/oslo.messaging/commit/973301aa7 then fixed a bunch of that | 23:23 |
harlowja_ | i don't think u are running master/that commit though | 23:23 |
vishy | harlowja_: was that stuff in juno? | 23:26 |
harlowja_ | not afaik | 23:26 |
vishy | I could try upgrading to master | 23:26 |
harlowja_ | authored 9 days ago so i don't think so | 23:27 |
vishy | although i’m sure there will be new dependencies that will break | 23:27 |
vishy | something about the reconnect logic seems to be a bit borked | 23:27 |
harlowja_ | ya, its complicated, so that doesn't help either :-/ | 23:27 |
harlowja_ | which is why i'm wondering if that commit fixes it | 23:27 |
harlowja_ | since it seemed to have overhauled that whole thing | 23:28 |
* harlowja_ crosses fingers, ha | 23:28 | |
vishy | i suspect it will be somewhat difficult to use that due to dependencies but i will try it | 23:51 |
*** tsekiyama has quit IRC | 23:52 | |
*** sigmavirus24 is now known as sigmavirus24_awa | 23:53 |