Friday, 2016-02-05

00:23 <openstackgerrit> Adam Gandelman proposed openstack/astara: WIP Add clustering capabilities to instance manager  https://review.openstack.org/267263
01:42 *** shashank_hegde has quit IRC
01:49 *** phil_hopk has quit IRC
01:50 *** phil_hop has joined #openstack-astara
01:51 *** phil_hopk has joined #openstack-astara
01:55 *** phil_hop has quit IRC
02:18 *** davidlenwell has quit IRC
02:19 *** davidlenwell has joined #openstack-astara
02:23 *** davidlenwell has quit IRC
02:27 *** davidlenwell has joined #openstack-astara
02:37 *** shashank_hegde has joined #openstack-astara
04:24 *** justinlund has quit IRC
05:02 *** shashank_hegde has quit IRC
05:27 *** shashank_hegde has joined #openstack-astara
06:30 *** reedip is now known as reedip_away
07:23 *** shashank_hegde has quit IRC
09:17 *** openstackgerrit has quit IRC
09:17 *** openstackgerrit has joined #openstack-astara
11:48 *** phil_hop has joined #openstack-astara
11:52 *** phil_hopk has quit IRC
14:54 *** justinlund has joined #openstack-astara
17:31 <drwahl> so, akanda dudes, i think we found a bug
17:31 <drwahl> https://github.com/openstack/astara/blob/f2360d861f3904c8a06d94175be553fe5e7bab05/astara/worker.py#L212
17:31 <drwahl> that get is blocking. for some reason, we have threads hanging on that get
17:31 <drwahl> which causes everything to back up and the workers to stop processing jobs
17:32 <drwahl> we're testing a (probably improper) patch for this right now
17:32 <drwahl> by setting sm = self.work_queue.get(block=False)
17:32 <drwahl> and then adding a 10 second sleep to the except
17:32 <drwahl> if that solves our problem, we'll create a bug and submit a patch, and then you guys can tell us how crappy our patch is and we can figure out a proper fix for this
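A rough standalone sketch of the workaround just described, not the actual astara patch; work_queue, worker_loop, and handle are illustrative names here. The blocking get is swapped for a non-blocking one, with a sleep in the Queue.Empty handler so the loop doesn't spin:

    import time
    try:
        import Queue as queue  # Python 2, as in the Queue.Empty references above
    except ImportError:
        import queue

    work_queue = queue.Queue()

    def handle(sm):
        pass  # stand-in for the real state-machine dispatch

    def worker_loop():
        while True:
            try:
                # non-blocking get instead of the blocking get at worker.py#L212
                sm = work_queue.get(block=False)
            except queue.Empty:
                time.sleep(10)  # back off when there is no work, then poll again
                continue
            handle(sm)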
17:34 <cleverdevil> :)
17:35 <cleverdevil> the `Queue.Queue` docs pretty clearly say that it *won't* block for longer than `timeout` seconds.
17:35 <cleverdevil> so, that's a strange issue.
17:36 <rods> yeah, that's weird
17:39 <drwahl> i have other suspicions as to what *could* be happening, and it wouldn't incriminate Queue
17:39 <cleverdevil> yeah, I would find it highly unlikely that Queue is the issue.
17:39 <cleverdevil> have we confirmed that we actually block longer on that line than `timeout`?
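One hypothetical way to answer that question (purely illustrative, not astara code; TIMEOUT and work_queue are assumed names): time the get() call and log whenever it returns noticeably later than its timeout:

    import logging
    import time
    try:
        import Queue as queue
    except ImportError:
        import queue

    LOG = logging.getLogger(__name__)
    work_queue = queue.Queue()
    TIMEOUT = 10  # assumed timeout value, for illustration

    def timed_get():
        start = time.time()
        try:
            return work_queue.get(timeout=TIMEOUT)
        except queue.Empty:
            return None
        finally:
            elapsed = time.time() - start
            if elapsed > TIMEOUT + 1:
                LOG.warning('work_queue.get() blocked %.1fs, timeout was %ds',
                            elapsed, TIMEOUT)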
17:40 <drwahl> we created a bug in that try/except and it spooled up the queue as well
17:40 <drwahl> which makes me think that the try/except block isn't working as expected
17:40 <cleverdevil> hmm
17:40 <drwahl> need to test more, but haven't had time yet
17:41 <cleverdevil> it's possible that something other than Queue.Empty is being raised.
17:41 <drwahl> indeed
17:42 <drwahl> however, i'd expect that to actually raise through the stack and cause problems
17:42 <drwahl> instead of just spooling up the queue...
17:42 <cleverdevil> yeah... still... Queue is a pretty proven and simple bit of the Python standard library.
17:42 <drwahl> yup, i doubt the bug resides in Queue
17:43 <drwahl> but rather, in the way it's being used in that section of the code
17:43 * cleverdevil nods
17:43 <cleverdevil> we'll figure it out :)
17:43 <drwahl> unless our current patch *doesn't* fix it. then we're back to square 1 :(
17:45 <cleverdevil> yeah, not to be a downer, but my instinct is that the problem happens somewhere in the body of the loop, when an operation hangs.
17:45 <cleverdevil> not in the management of the queue
17:46 *** shashank_hegde has joined #openstack-astara
17:49 <drwahl> so, in the "except Queue.Empty", we added time.sleep(10), but forgot to import time, so it was creating a traceback
17:49 <drwahl> however
17:49 <drwahl> it wasn't crashing the thread or creating any logs mentioning anything
17:49 <drwahl> and the queue was spooling up
17:49 <drwahl> which is the *exact* behavior we see happen eventually
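A self-contained illustration of that failure mode (not astara code; the names are made up): an exception raised inside the except-handler, here the NameError from the missing time import, escapes the loop and ends the consumer thread. By default the traceback only goes to stderr, which a daemonized service may never capture, so the queue just quietly stops draining:

    import threading
    try:
        import Queue as queue
    except ImportError:
        import queue

    work_queue = queue.Queue()

    def worker_loop():
        while True:
            try:
                sm = work_queue.get(block=False)
            except queue.Empty:
                time.sleep(10)  # NameError: 'time' was never imported
                continue
            print('processing %s' % sm)

    t = threading.Thread(target=worker_loop)
    t.daemon = True
    t.start()
    t.join(1)
    print('worker alive? %s' % t.is_alive())  # False: the thread died on the NameError
    work_queue.put('job')  # nothing is left to consume this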
17:49 <drwahl> also, checking strace, the threads are clearly hanging on some FIFO read
17:50 <drwahl> which this queue get call should be executing
17:51 <drwahl> so we're clearly in the right area of the code. just need to figure out what exactly in this area is causing the threads to hang
17:51 <drwahl> also, since we changed that queue.get to non-blocking, things seem to be working, which again makes it suspicious that a non-blocking call works there whereas a blocking call causes us issues
17:52 <drwahl> afaik, this is all an internal/IPC queue, so there shouldn't be anything external (like rabbitmq) causing the threads to hang
18:16 <markmcclain> drwahl, cleverdevil: interesting... don't forget eventlet messes with implementations of standard things, so it's possible something isn't yielding for the timeout event to fire
18:16 <markmcclain> adam_g: ^^^^
18:28 <cleverdevil> that's a *really* good point
18:28 <cleverdevil> I absolutely think it could be eventlet.
18:28 <cleverdevil> (I. hate. eventlet.)
18:28 <markmcclain> eventlet and multiprocessing are known not to play nice with each other
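A quick way to see markmcclain's point in practice, assuming eventlet is installed (illustrative only): monkey_patch() replaces the primitives that Queue and threading are built on with green versions, so a blocking get(timeout=...) only times out if the hub gets a chance to run:

    import eventlet
    eventlet.monkey_patch()

    from eventlet import patcher
    for name in ('os', 'select', 'socket', 'thread', 'time'):
        # report which standard modules eventlet has swapped for green versions
        print('%s patched: %s' % (name, patcher.is_monkey_patched(name)))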
18:30 <cleverdevil> is there any reason that eventlet needs to be used at all with the rug?
18:31 <markmcclain> rpc and message processing, since we inherit some hooks from neutron
18:31 <cleverdevil> hm.
18:32 <adam_g> drwahl, do you have any good way to reproduce the original issue or is it sporadic?
18:32 <adam_g> also i'd be interested in seeing the orchestrator logs leading up to the hangup, if you have them
18:36 <drwahl> adam_g: there has been effectively nothing notable in the logs
18:36 <drwahl> i'll see if i can dig them up
18:36 <drwahl> but we've gone over them really thoroughly the last couple of weeks and saw nothing
18:40 <adam_g> drwahl, it might be worth adding a bare 'except:' at the end of the block to log and re-raise
18:41 <drwahl> was thinking the same thing
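A rough sketch of that diagnostic, with placeholder names rather than the real worker code (and except Exception standing in for the suggested bare except): a catch-all at the end of the block that logs whatever is actually escaping and then re-raises, so behaviour is otherwise unchanged:

    import logging
    try:
        import Queue as queue
    except ImportError:
        import queue

    LOG = logging.getLogger(__name__)
    work_queue = queue.Queue()

    def get_next():
        try:
            return work_queue.get(timeout=10)
        except queue.Empty:
            return None
        except Exception:
            # diagnostic only: record what is really being raised, then re-raise
            LOG.exception('unexpected exception while waiting on work_queue')
            raise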
18:41 <adam_g> also, are you hitting this on master or a stable branch?
18:44 <adam_g> how did you narrow it down to the .get() blocking? i believe an uncaught exception anywhere in the thread target will cause the state machine to gum up. i ran into a similar issue that had the same symptoms, fixed by https://review.openstack.org/#/c/271158/
18:47 <drwahl> adam_g: we narrowed it down to that get because when we set .get(block=False), things work
18:47 <drwahl> at least, they've been working for the last 2 hours...
18:47 <drwahl> the jury is still out
18:47 <drwahl> also, when things are locked up, the threads are stuck reading some FIFO pipe
18:47 <drwahl> so clearly it's some IPC
18:50 <drwahl> as for the code base, we're building from stable/kilo
18:51 <drwahl> we last built from it on Feb 2
18:51 <adam_g> drwahl, i wonder, is setting get(block=False) just masking the fact that the get()'s corresponding task_done() is never getting called?
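In generic terms (not astara's actual worker), the pairing adam_g is asking about looks like this: every successful get() should be matched by a task_done(), otherwise the queue's unfinished-task count never drains and anything join()ing the queue waits forever. A try/finally keeps the pair intact even when processing raises:

    try:
        import Queue as queue
    except ImportError:
        import queue

    work_queue = queue.Queue()

    def process_one(handle):
        try:
            sm = work_queue.get(timeout=10)
        except queue.Empty:
            return
        try:
            handle(sm)
        finally:
            # without this, an exception in handle() skips task_done()
            work_queue.task_done()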
18:53 <adam_g> drwahl, so https://review.openstack.org/#/c/271158/1 hasn't been backported to kilo yet and the symptoms sound the same, you might want to pull that in as well
18:53 <adam_g> i'll put up a backport now
18:55 * drwahl doesn't see any "_release_resource_lock"
18:55 <drwahl> that entire function seems absent from the code we're running
18:58 <adam_g> drwahl, https://github.com/openstack/astara/blob/stable/kilo/akanda/rug/worker.py#L366
18:58 <adam_g> backporting it now
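A hypothetical sketch of what the review's title describes (handling the case where a worker unlocks an already unlocked resource); the real change is the one being backported here, and these names are made up for illustration:

    import logging

    LOG = logging.getLogger(__name__)
    _held_locks = set()  # illustrative registry of locked resource ids

    def release_resource_lock(resource_id):
        try:
            _held_locks.remove(resource_id)
        except KeyError:
            # resource was already unlocked: log it instead of letting the
            # exception propagate and gum up the worker
            LOG.debug('resource %s was already unlocked', resource_id)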
18:59 <cleverdevil> once we get our new cluster fully stabilized, we just need to push forward to get on the newest release of openstack we're happy with
19:02 <openstackgerrit> Adam Gandelman proposed openstack/astara: Handle exception when worker unlocks already unlocked resource  https://review.openstack.org/276875
19:02 <adam_g> drwahl, ^
19:04 <drwahl> do i have perms to +2 that?
19:04 * drwahl checks...
19:06 <drwahl> +1
19:06 <drwahl> (as good as i can do for you)
19:06 <adam_g> :)
19:09 <openstackgerrit> Adam Gandelman proposed openstack/astara: Handle exception when worker unlocks already unlocked resource  https://review.openstack.org/276877
19:09 <adam_g> (liberty)
19:11 <markmcclain> drwahl: rods and ryanpetrello have +2 too
19:12 <drwahl> roger that. thanks
20:25 <openstackgerrit> Merged openstack/astara: Handle exception when worker unlocks already unlocked resource  https://review.openstack.org/276875
20:32 *** adam_g has left #openstack-astara
20:32 *** adam_g has joined #openstack-astara
20:57 *** justinlund1 has joined #openstack-astara
20:58 *** justinlund has quit IRC
21:49 *** shashank_hegde has quit IRC
22:32 *** shashank_hegde has joined #openstack-astara
22:34 *** phil_h has quit IRC
22:50 *** phil_h has joined #openstack-astara
23:04 <markmcclain> adam_g: took a moment to dig into the functional failure when we turn off auto add resources
23:04 <markmcclain> looks like I'll have to make a few changes to how we set up the router, since we set it into error state fairly early
23:09 *** phil_hop has quit IRC
23:10 *** phil_hop has joined #openstack-astara
