Monday, 2018-11-19

*** tosky has quit IRC00:49
*** sambetts_ has quit IRC02:35
*** sambetts_ has joined #openstack-oslo02:39
<openstackgerrit> Merged openstack-dev/oslo-cookiecutter master: change default python 3 env in tox to 3.5  https://review.openstack.org/617149  03:17
*** njohnston has quit IRC04:47
*** njohnston has joined #openstack-oslo04:48
*** e0ne has joined #openstack-oslo06:12
*** e0ne has quit IRC06:17
*** Luzi has joined #openstack-oslo07:03
*** pcaruana has joined #openstack-oslo07:24
*** hoonetorg has quit IRC07:55
*** hberaud has joined #openstack-oslo07:56
*** shardy has joined #openstack-oslo07:58
*** hoonetorg has joined #openstack-oslo08:13
*** a-pugachev has joined #openstack-oslo08:42
*** tosky has joined #openstack-oslo08:50
*** moguimar has joined #openstack-oslo09:32
*** jaosorior has joined #openstack-oslo09:41
*** cdent has joined #openstack-oslo09:51
*** e0ne has joined #openstack-oslo09:57
*** e0ne_ has joined #openstack-oslo09:59
*** e0ne has quit IRC10:02
*** sean-k-mooney has quit IRC10:15
*** sean-k-mooney has joined #openstack-oslo10:23
*** toabctl has quit IRC10:33
*** cdent has quit IRC10:57
*** tosky has quit IRC11:03
*** tosky has joined #openstack-oslo11:06
*** raildo has joined #openstack-oslo11:46
*** njohnston has quit IRC11:50
*** njohnston has joined #openstack-oslo11:51
*** raildo has quit IRC12:01
*** cdent has joined #openstack-oslo12:19
*** dkehn has quit IRC12:33
*** raildo has joined #openstack-oslo12:45
*** cdent has quit IRC13:13
*** kgiusti has joined #openstack-oslo13:37
<openstackgerrit> Juan Antonio Osorio Robles proposed openstack/oslo.policy master: Implement base for pluggable policy drivers  https://review.openstack.org/577807  13:41
<openstackgerrit> Josephine Seifert proposed openstack/oslo-specs master: Adding library for encryption and decryption  https://review.openstack.org/618754  13:50
*** cdent has joined #openstack-oslo13:50
*** cdent has quit IRC13:55
*** pbourke has quit IRC14:04
*** cdent has joined #openstack-oslo14:05
*** pbourke has joined #openstack-oslo14:06
*** lbragstad has joined #openstack-oslo14:09
*** notq has joined #openstack-oslo14:21
<notq> for oslo.messaging, when rabbitmq is restarted for the rpc bus, we get message timeouts and the service stays down. It's waiting on some message that was in progress, and is no longer getting a response back from the other service. It never seems to time out and move on, though. The service has to be restarted to recover.  14:25
<notq> Is anyone familiar with this type of issue, and can point me in the right direction?  14:25
*** ansmith has joined #openstack-oslo14:26
<notq> I understand how to debug the issue, but there's no need to debug it: I know the rabbitmq was restarted. Could rabbitmq, on restart, stop accepting new messages and wait until the in-flight ones finish during shutdown? That would be one solution.  14:26
<notq> Another solution would be for the openstack service to understand this state, time out properly if a message isn't received, and retry.  14:27
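For reference, the reconnect and timeout behaviour discussed here is mostly governed by a handful of oslo.messaging options. A minimal sketch of the relevant knobs in a service config such as nova.conf; the values shown are the usual defaults and purely illustrative, not a recommended tuning:

    [DEFAULT]
    # how long an RPC caller waits for a reply before raising MessagingTimeout
    rpc_response_timeout = 60

    [oslo_messaging_rabbit]
    # delay before kombu tries to re-establish a lost connection
    kombu_reconnect_delay = 1.0
    # AMQP heartbeats so dead TCP connections are noticed promptly
    heartbeat_timeout_threshold = 60
    heartbeat_rate = 2
    # backoff schedule for reconnection attempts to rabbitmq
    rabbit_retry_interval = 1
    rabbit_retry_backoff = 2
    rabbit_interval_max = 30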
<dhellmann> notq: which version of openstack and oslo.messaging are you using? there are timeout handling features but I wonder if you've found a buggy configuration  14:35
<notq> it's on rocky, not sure of the oslo.messaging version  14:36
<notq> can check  14:36
<kgiusti> notq: do you know what service is blocking?  Ideally where the service is blocking?  14:40
<notq> this time it was nova, but it's happened with others as well.  14:41
<notq> where the service is blocking, it's always in a stack trace to amqp  14:41
<notq> and it keeps repeating the attempt for the message, and the service stops  14:42
<kgiusti> notq: can you pastebin a sample of the stack trace?  14:42
<notq> sure  14:42
<kgiusti> notq: it may be best to open a launchpad bug on this and upload any logs you have to it:  https://bugs.launchpad.net/oslo.messaging/+filebug  14:43
*** bobh has joined #openstack-oslo14:43
<kgiusti> notq: that way we can track this publicly and others can help out  14:44
<notq> ok. It's gone on over many releases. I wasn't on the topic, but the rest of the team always said it was rabbitmq being bad, or a password change, always some sort of weird answer. Recently I was assigned to look after them, and this whole class of problems keeps happening the same way: someone deploys a rabbitmq change, it updates the container, it restarts, and a service loses its connection in this exact same way.  14:45
<kgiusti> notq: does the service's tcp connection to the 'new' rabbitmq service come up?  14:46
<notq> I've had other rabbitmq connection issues; we're using kubernetes to deploy openstack, for reference.  14:46
<notq> No, there's no consumer for it, it's waiting on this message id  14:47
<notq> the similar problems i've had were with logstash and rabbitmq. Getting the retry settings just right solved that.  14:48
<notq> because if it's connection pooling, or somehow saving state, and the new rabbitmq comes up on the service ip, some settings can handle that correctly, and others don't  14:49
<notq> so for example, a java application caches the dns resolution, which causes a problem. you have to turn off caching. Golang http clients cache the connection as well; you have to turn that off in the http library. Etc etc.  14:49
<notq> elasticsearch has an altogether different caching problem, but the same story: its connection pool saves state, and saving state when a new container comes up on the same service ip doesn't work.  14:50
<notq> from experience, i'd guess kombu is saving something with its connection pooling which is causing a problem  14:52
<notq> and it never happens with notification buses, though. Only for the rpc case where it's waiting for a return message  14:53
*** zaneb has joined #openstack-oslo14:55
<kgiusti> notq: I'm not familiar with the type of failure you describe - it's the first I've heard of it actually.   Can you capture debug logs of the service that gets stuck?  14:59
*** zaneb has quit IRC15:01
<kgiusti> notq: it could very well be some state in kombu - the oslo.messaging driver relies heavily on kombu to handle connection failures/retries  15:01
<notq> I can try. We only turn debug on in staging, so I'll have to reboot the rabbitmq repeatedly until I can catch it. Which is a bit harder given staging is not used to the same extent  15:01
<notq> and it will need the oslo debug, i'm guessing, right? not just debug=true  15:02
<kgiusti> notq: yeah - oslo.messaging debug would be ideal  15:02
*** Luzi has quit IRC15:03
<kgiusti> notq: usually I enable "oslo.messaging=DEBUG,oslo_messaging=DEBUG" in log_levels for oslo.log  15:04
<kgiusti> notq: but if you do have existing logs w/o debug that capture the error, that would be helpful also - perhaps we start with that before venturing into debug hell...  15:05
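For reference, the oslo.log option kgiusti is describing is default_log_levels. A minimal sketch of what that could look like in a service config such as nova.conf; in practice the two oslo.messaging entries would be appended to the existing default list rather than replacing it:

    [DEFAULT]
    debug = True
    # bump only the messaging loggers to DEBUG; other loggers keep their
    # usual levels (list trimmed here for brevity)
    default_log_levels = amqp=DEBUG,oslo.messaging=DEBUG,oslo_messaging=DEBUG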
<notq> sure, but i doubt it's going to explain much. I think it's simply an issue of kombu and how it handles the service ip situation with kubernetes. Now, perhaps nova shouldn't just keep retrying and dying on the amqp driver failure...  15:06
<kgiusti> notq: re: nova - yes, oslo.messaging is best effort.  It cannot guarantee zero message loss.  But it could be that nova is using exceptionally long timeouts.  15:08
<notq> well, i think it's failing past exceptionally long timeouts. I think this case is missing the timeout somehow, but I'd need to look at it.  15:09
<kgiusti> notq: we (oslo) recently implemented a heartbeat mechanism for faster detection of failed RPC calls specifically for nova, but I don't believe it is being used in Rocky  15:09
<notq> we have long timeouts due to recovering, but this will last hours until someone wakes up  15:09
<notq> yeah, a heartbeat would be a great solution  15:10
<kgiusti> notq: yeesh - hours...  not what I'd expect for a retry timeout (more like 10 minutes)  15:10
<notq> but the heartbeat would also have to not be caching the connection to retry, for it to work with the service ip  15:10
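For reference, the heartbeat mechanism kgiusti mentions is oslo.messaging's RPC call monitoring (the call_monitor_timeout option). A minimal sketch of how a caller might opt in, assuming a configured transport_url and an oslo.messaging new enough on both client and server; the topic and method names are hypothetical:

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='my_topic')  # hypothetical topic

    # call_monitor_timeout makes the call fail fast if the server stops
    # sending periodic heartbeats, even though the overall timeout is long.
    client = oslo_messaging.RPCClient(transport, target,
                                      timeout=600,
                                      call_monitor_timeout=60)
    result = client.call({}, 'my_method', arg='value')  # hypothetical method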
*** zaneb has joined #openstack-oslo15:12
<kgiusti> notq: can you explain in a bit more detail what you mean by "caching the service ip" - specifically what metadata you believe is being cached?  15:12
<notq> I can only give examples of other cases I've sorted out, like with java, or golang's http client.  15:12
<notq> basically, there's a service ip, and a real ip.  15:13
<notq> if it caches the real ip for connection pooling, or anything else at any point - the dns, for example  15:13
<kgiusti> notq: ok, gotcha  15:13
<notq> then it will keep trying the ip for the old container, and not the new container which is now up on the service ip  15:13
<notq> and when i work with library owners on this type of issue, i'm generally told nothing caches, repeatedly, until we find where it caches. because libraries, inside libraries, inside libraries, might be caching it  15:14
<kgiusti> notq: understood.  Yeah, oslo.messaging simply passes along whatever the value of "transport_url" is to kombu.  15:16
<kgiusti> notq: do you happen to know what version of kombu is being used?  15:16
<notq> i don't, it's a great question. kombu also has issues, and critical bugs fixed in the newest version regarding its reconnection logic  15:17
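For reference, a quick way to check which versions are actually installed in the service's environment (container or virtualenv); a minimal sketch using pkg_resources:

    import pkg_resources

    for name in ('oslo.messaging', 'kombu', 'amqp'):
        print(name, pkg_resources.get_distribution(name).version)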
*** bobh has quit IRC15:18
<kgiusti> notq: in any case it sounds like a combination of bugs: failure to reconnect to rabbitmq, and failure to time out while waiting for an RPC reply message.  15:19
<notq> agreed.  15:22
<kgiusti> notq: just looking at the rocky releases for oslo.messaging, there's a requirements bump to py-amqp in release 8.1.0:  15:26
<kgiusti> notq: https://bugs.launchpad.net/oslo.messaging/+bug/1780992  15:26
<openstack> Launchpad bug 1780992 in oslo.messaging "Trying to connect to Rabbit servers can timeout if only /etc/hosts entries are used" [Undecided,Fix released] - Assigned to Daniel Alvarez (dalvarezs)  15:26
<kgiusti> notq: plus a connection related bugfix: https://bugs.launchpad.net/oslo.messaging/+bug/1745166  15:28
<openstack> Launchpad bug 1745166 in oslo.messaging "amqp.py warns about deprecated feature" [Low,Fix released] - Assigned to Ken Giusti (kgiusti)  15:28
<notq> @kgiusti that's very helpful, I wasn't aware that was the release name.  15:28
<kgiusti> notq: that last one ended up fixing some intermittent connection failures we were seeing in the python3 gates, FWIW  15:28
*** bobh has joined #openstack-oslo15:29
<notq> it doesn't help that we are mostly off openstack-helm and onto our own dedicated helm charts, but the one that failed this time was the openstack-helm rabbitmq, and that may be different  15:29
*** notq has quit IRC15:36
*** bobh has quit IRC15:58
*** pcaruana has quit IRC16:07
*** e0ne_ has quit IRC16:16
*** e0ne has joined #openstack-oslo16:21
*** e0ne has quit IRC16:27
*** kgiusti has quit IRC17:24
*** a-pugachev has quit IRC17:35
*** zbitter has joined #openstack-oslo17:37
*** zaneb has quit IRC17:39
*** pcaruana has joined #openstack-oslo17:44
<openstackgerrit> Merged openstack/oslo.messaging master: Use ensure_connection to prevent loss of connection error logs  https://review.openstack.org/615649  17:51
*** e0ne has joined #openstack-oslo18:47
*** e0ne has quit IRC18:51
*** e0ne has joined #openstack-oslo19:00
*** shardy has quit IRC19:27
*** kgiusti has joined #openstack-oslo19:45
<openstackgerrit> Merged openstack/sphinx-feature-classification master: Optimizing the safety of the http link site in HACKING.rst  https://review.openstack.org/618351  20:06
*** e0ne has quit IRC20:07
*** e0ne has joined #openstack-oslo20:10
*** e0ne has quit IRC20:10
*** zbitter is now known as zaneb20:18
*** a-pugachev has joined #openstack-oslo20:34
*** a-pugachev has quit IRC20:35
*** efried has joined #openstack-oslo20:41
<efried> dhellmann: Are we allowed to edit renos that are out the door already?  20:41
<efried> dhellmann: à la https://review.openstack.org/#/c/617222/ ?  20:41
<dhellmann> efried: it's possible and allowed but "dangerous" if done wrong  20:41
<efried> dhellmann: Dangerous under what circumstances?  20:42
<efried> i.e. what's "done wrong"?  20:42
<dhellmann> https://docs.openstack.org/reno/latest/user/usage.html#updating-stable-branch-release-notes  20:42
<efried> ...  20:42
<efried> nyaha. So that one is bogus, needs to be reproposed to stable/rocky.  20:43
*** raildo has quit IRC20:43
<dhellmann> efried: I commented; it's better to set up a redirect  20:44
<efried> nice, thanks dhellmann  20:44
<efried> in fact this reno was introduced in ocata  20:45
<efried> afaict  20:45
*** a-pugachev has joined #openstack-oslo21:35
*** e0ne has joined #openstack-oslo21:35
*** e0ne has quit IRC21:36
*** ansmith has quit IRC21:42
*** pcaruana has quit IRC21:51
<openstackgerrit> Merged openstack/oslo.messaging master: Add a test for rabbit URLs lacking terminating '/'  https://review.openstack.org/611870  21:55
*** HenryG has quit IRC22:12
*** HenryG has joined #openstack-oslo22:14
*** cdent has quit IRC22:19
*** kgiusti has quit IRC22:39
*** moguimar has quit IRC22:39
*** jbadiapa has quit IRC22:39
*** dhellmann has quit IRC22:39
*** sambetts_ has quit IRC22:39
*** dhellmann has joined #openstack-oslo22:40
*** jbadiapa has joined #openstack-oslo22:40
*** sambetts_ has joined #openstack-oslo22:43
<efried> dhellmann: More reno fun. See https://review.openstack.org/#/c/618708/ -- adding the release notes job to .zuul.yaml didn't trigger the release notes job to run on that patch.  23:16
<efried> I think I get that this is because the current conditionals are set up to only trigger the job if something under the releasenotes/ dir changes  23:16
<efried> just wondering how hard it would be to add a condition for this special case.  23:16
<efried> If not trivial, I wouldn't bother, since this should be quite rare  23:17
*** a-pugachev has quit IRC23:29
*** njohnston has quit IRC23:29
*** njohnston has joined #openstack-oslo23:30
*** gibi has quit IRC23:34
*** gibi has joined #openstack-oslo23:35
<dhellmann> efried: we could change the job definition to also look at the zuul configuration files, but it would be just as easy to add an empty release notes file in that patch, I think  23:40
*** lbragstad has quit IRC23:40
<efried> dhellmann: I threw a patch on top to verify. No biggie. Just wondered if it was worth doing in the job def.  23:40
<dhellmann> efried: I'd hate to run that job every time we update some other settings :-/  23:53
<efried> dhellmann: Right, it would have to be smart enough to just run the job for the specific line that was added (whatever that may be).  23:54
<efried> i.e. this isn't a reno-specific thing.  23:54
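For reference, the change being discussed would amount to a file matcher on the release notes job. A hedged sketch of what that could look like in Zuul configuration; the job name and the exact regexes are assumptions for illustration, not the real project-config definition:

    - job:
        name: build-openstack-releasenotes   # assumed job name
        # ... existing job definition elided ...
        files:
          - ^releasenotes/.*
          - ^\.zuul\.yaml$

As dhellmann points out, Zuul's file matchers are per-path rather than per-hunk, so matching on .zuul.yaml would also trigger the job for unrelated zuul configuration edits.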

Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!