Monday, 2019-06-17

*** yamamoto has joined #openstack-lbaas00:54
*** yamamoto has quit IRC01:49
*** goldyfruit has joined #openstack-lbaas02:23
*** ianychoi has quit IRC02:25
*** ianychoi has joined #openstack-lbaas02:26
lxkongjohnsom: for the fedora amphora image, here are the logs you asked for02:39
lxkongnova console log: http://paste.openstack.org/show/753065/02:39
lxkongcloud-init.log: http://paste.openstack.org/show/753066/02:39
lxkong`/var/log/message`: http://paste.openstack.org/show/753067/02:39
*** goldyfruit has quit IRC02:42
lxkongjohnsom: (updated) the `/var/log/message`: http://dpaste.com/1Z0BZ9V02:54
*** yamamoto has joined #openstack-lbaas03:07
*** psachin has joined #openstack-lbaas03:37
*** yamamoto has quit IRC04:51
*** yamamoto has joined #openstack-lbaas04:53
*** gcheresh has joined #openstack-lbaas05:03
*** vishalmanchanda has joined #openstack-lbaas05:18
*** ivve has quit IRC05:30
*** gcheresh has quit IRC05:32
*** ramishra has joined #openstack-lbaas05:38
*** AlexStaf has quit IRC05:47
*** gcheresh has joined #openstack-lbaas05:49
*** luksky has joined #openstack-lbaas05:55
*** rcernin has quit IRC06:02
*** gcheresh has quit IRC06:09
*** gcheresh has joined #openstack-lbaas06:25
*** ccamposr has joined #openstack-lbaas06:30
*** yboaron has joined #openstack-lbaas06:34
*** gcheresh has quit IRC06:36
*** ivve has joined #openstack-lbaas06:46
*** rpittau|afk is now known as rpittau06:56
*** rcernin has joined #openstack-lbaas07:00
*** trident has quit IRC07:06
*** trident has joined #openstack-lbaas07:08
*** yamamoto has quit IRC07:16
*** tesseract has joined #openstack-lbaas07:24
*** AlexStaf has joined #openstack-lbaas07:27
*** yamamoto has joined #openstack-lbaas07:48
*** yboaron has quit IRC07:54
*** yamamoto has quit IRC07:57
*** yamamoto has joined #openstack-lbaas08:08
*** ricolin has joined #openstack-lbaas08:12
*** yboaron has joined #openstack-lbaas08:44
*** yboaron_ has joined #openstack-lbaas08:50
*** pcaruana has joined #openstack-lbaas08:52
*** yboaron has quit IRC08:52
*** luksky has quit IRC08:54
*** numans has joined #openstack-lbaas09:00
*** ricolin has quit IRC09:14
*** luksky has joined #openstack-lbaas09:27
*** yamamoto has quit IRC09:34
*** lemko has joined #openstack-lbaas09:35
*** rcernin has quit IRC09:36
lemkoHi. I'm getting the error " Amphora ae091665-aa7d-44c0-bc33-f3107e879e41 health message was processed too slowly: 18.0750341415s! The system may be overloaded or otherwise malfunctioning. This heartbeat has been ignored and no update was made to the amphora health entry. THIS IS NOT GOOD.". What can I do to make this appear less often? Because sometimes, it tends to kill my amphorae. What could I do to avoid this?10:04
*** yamamoto has joined #openstack-lbaas10:05
*** ccamposr__ has joined #openstack-lbaas10:08
*** ccamposr has quit IRC10:12
*** yamamoto has quit IRC10:15
*** yamamoto has joined #openstack-lbaas10:34
*** yamamoto has quit IRC10:36
*** yamamoto has joined #openstack-lbaas10:39
*** yamamoto has quit IRC10:43
openstackgerritMerged openstack/octavia master: Fix TCP listener logging bug  https://review.opendev.org/66547010:48
*** yamamoto has joined #openstack-lbaas11:08
*** yamamoto has quit IRC11:12
sapd1lemko, What version are you running? I got this issue before in an older version.11:45
lemkoSome version from like 5 months ago.11:46
lemkoHow did you fix it, sapd1?11:46
lemkodid you increase the time check interval?11:46
sapd1there was an optimization patch which fixed it.11:46
sapd1The root cause was that Octavia ran a query-all command against the database.11:47
*** yamamoto has joined #openstack-lbaas11:49
lemkoSo the problem is in the database query?11:51
sapd1This patch: https://review.opendev.org/#/c/603242/11:51
*** boden has joined #openstack-lbaas11:55
*** yamamoto has quit IRC11:56
*** yamamoto has joined #openstack-lbaas11:57
*** yamamoto has quit IRC12:02
lemkothanks a lot :)12:04
*** yboaron_ has quit IRC12:13
*** yamamoto has joined #openstack-lbaas12:14
*** yamamoto has quit IRC12:14
*** yamamoto has joined #openstack-lbaas12:14
*** rtjure has quit IRC12:16
*** yamamoto has quit IRC12:19
openstackgerritNir Magnezi proposed openstack/octavia master: Fix a python3 issue in the amphora-agent  https://review.opendev.org/66549812:20
openstackgerritNir Magnezi proposed openstack/octavia master: Fix a python3 issue in the amphora-agent  https://review.opendev.org/66549812:22
lemkoIt seems I already have this patch, sapd1 :( Maybe increasing the heartbeat control interval will help12:27
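(For reference, the knobs being discussed here live in the [health_manager] section of octavia.conf - a sketch only; option names and defaults are recalled from that era of Octavia and should be checked against your release:)
```
[health_manager]
# How often each amphora sends a heartbeat, in seconds
heartbeat_interval = 10
# How long without a heartbeat before an amphora is considered failed, in seconds
heartbeat_timeout = 60
# How often the health manager scans for stale amphorae, in seconds
health_check_interval = 3
```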
*** ricolin has joined #openstack-lbaas12:45
*** psachin has quit IRC12:46
johnsomlemko With that patch the process should take a few thousandths of a second to complete, but yours are taking over 10 seconds. Something is seriously overloaded in your environment, and changing the heartbeat interval will likely only buy you some time. Check that the database is not having trouble or overloaded. That is the most likely cause.12:47
*** vishalmanchanda has quit IRC12:57
*** gcheresh has joined #openstack-lbaas12:58
*** gcheresh has quit IRC13:09
*** yamamoto has joined #openstack-lbaas13:13
*** ricolin has quit IRC13:17
*** yamamoto has quit IRC13:19
*** yboaron_ has joined #openstack-lbaas13:22
*** goldyfruit has joined #openstack-lbaas13:24
*** pcaruana|afk| has joined #openstack-lbaas13:26
*** pcaruana has quit IRC13:26
*** gcheresh has joined #openstack-lbaas13:30
*** ricolin has joined #openstack-lbaas13:40
lemkoany tips to check how overloaded the database is, johnsom?13:40
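(One way to get a rough answer - a sketch assuming a MySQL/MariaDB backend, admin credentials in ~/.my.cnf, and the default `octavia` database and `amphora_health` table names:)
```
# Thread/connection pressure and slow queries on the DB backing Octavia
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_%';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
# Anything long-running right now stands out here
mysql -e "SHOW FULL PROCESSLIST;"
# Time the table the health manager touches on every heartbeat
time mysql octavia -e "SELECT COUNT(*) FROM amphora_health;"
```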
*** pcaruana has joined #openstack-lbaas13:48
*** pcaruana|afk| has quit IRC13:51
kklimondaso I have an interesting case: over the last 4-5 days our network has been flapping like crazy, and now I have two load balancers in a weird state: one has provisioning_status 'PENDING_UPDATE' and two amphorae, one ERROR/BACKUP and another ACTIVE/STANDBY.13:58
kklimondathe other load balancer is ACTIVE with one amphora ALLOCATED/MASTER but no BACKUP, even though the topology is ACTIVE_STANDBY13:58
kklimondaerm, for the first one the states are ERROR/BACKUP and ACTIVE/STANDALONE, not STANDBY13:59
kklimondanot sure how to get this into a working state13:59
sapd1lemko, Does it take a long time to list all load balancers in your deployment?14:01
sapd1kklimonda, In my experience, you can update the provisioning status for the errored LB in your DB to ACTIVE, then perform a load balancer failover.14:03
kklimonda@sapd1 yeah, that was my plan - both were in the ERROR/PENDING_UPDATE state initially and I've sacrificed one of the LBs, changing its state to ACTIVE, but it's not spawning the BACKUP amphora14:04
kklimondaI'll try failing over amphora now, instead of the LB itself14:05
sapd1So the amphora was deleted. You can change its status from DELETED to ACTIVE, then fail over that errored amphora.14:06
kklimondahmm, I think I see what you mean14:07
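(Roughly what sapd1 is describing, as a sketch only - table and column names are recalled from the Octavia schema, the IDs are placeholders, and a DB backup first is strongly advised:)
```
# Reset the stuck records directly in the octavia database, then retry the failover.
mysql octavia -e "UPDATE load_balancer SET provisioning_status='ACTIVE' WHERE id='<LB_ID>';"
# Per sapd1: resurrect the deleted amphora record (the exact status value to use may vary by release)
mysql octavia -e "UPDATE amphora SET status='ALLOCATED' WHERE id='<AMP_ID>';"
openstack loadbalancer failover <LB_ID>
```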
openstackgerritMichael Johnson proposed openstack/octavia master: Fix diskimage-create.sh datasources  https://review.opendev.org/66568014:07
openstackgerritNir Magnezi proposed openstack/octavia stable/rocky: Fix auto setup Barbican's ACL in the legacy driver.  https://review.opendev.org/66568114:08
openstackgerritGregory Thiemonge proposed openstack/octavia stable/rocky: Update amphora-agent to report UDP listener health  https://review.opendev.org/66568314:14
openstackgerritMichael Johnson proposed openstack/octavia master: Fix diskimage-create.sh datasources  https://review.opendev.org/66568014:14
*** yboaron_ has quit IRC14:16
kklimonda@sapd1 thanks, that seems to have helped - now I just have to figure out the 2 minute delay between the amphora-agent starting and it starting to process requests from the worker/health check14:21
johnsomkklimonda There is a built in delay after a controller restart where the health checks will not run. This is to allow the amphora instances to all report in fresh heartbeats before we start checking for failed amphora.14:25
johnsomBasically it's in receive-only mode for a short window, then will start looking for failed instances.14:25
kklimonda@johnsom mhm, but in my deployment it takes a long time for the worker to connect on port 9443 and fetch `/0.5/info` - I get connection refused, but the agent is running14:26
johnsomOh, you are talking about a different issue.14:26
johnsomkklimonda CentOS or Ubuntu image?14:26
kklimondaubuntu14:26
johnsomHmm, ok. We are tracking down a CentOS issue along these lines, but Ubuntu has been fine.14:27
*** fnaval has joined #openstack-lbaas14:27
johnsomI would start by watching the instance console to see how long it is taking for the instance to come up.14:27
kklimondait's much quicker - I can log in to the amphora way before it starts listening on port 944314:28
kklimondait takes ~30 seconds for the amphora instance to come up14:31
kklimondanot sure how to debug it further14:31
johnsomYeah, my whole load balancer build finishes in about 23 seconds14:31
kklimondaand then agent is stuck for ~2 minutes - not sure how to debug that14:32
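(One crude way to see that gap from outside the amphora - a sketch; the agent API requires the controller's client certificate, and the cert path and IP below are placeholders:)
```
# Poll the amphora agent until it answers; the dead time shows up as repeated refusals.
while ! curl -sk --cert /path/to/controller-client-cert-and-key.pem \
        https://<AMPHORA_LB_NETWORK_IP>:9443/0.5/info; do
  date; sleep 5
done
```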
kklimondajournalctl is not helpful, but there seems to be a 2 minute delay of some sort: ```Jun 17 14:30:59 amphora-7e0666b0-a07f-4510-a428-141ccf19de26 systemd[1]: Started OpenStack Octavia Amphora Agent.14:33
kklimondaJun 17 14:32:43 amphora-7e0666b0-a07f-4510-a428-141ccf19de26 amphora-agent[843]: 2019-06-17 14:32:43.165 843 INFO octavia.common.config [-] Logging enabled!```14:33
johnsomWell, like I said, I would check the console log. The amphora-agent startup should be included there. The amphora-agent logs to both the amphora-agent log and syslog in older versions of the image (like before master last week), so those would also be good places to look at timing14:33
kklimondaah, lets see14:33
kklimondaI thought console would basically display the same thing journalctl does14:33
johnsomNo, before last week gunicorn was writing straight to log files.14:34
johnsomI was just working on cleaning up the logging inside the amphora14:34
kklimondathe console has nothing but the login prompt14:35
*** AlexStaf has quit IRC14:45
*** pcaruana has quit IRC14:47
openstackgerritGregory Thiemonge proposed openstack/neutron-lbaas stable/rocky: Improve performance on get and create/update/delete requests  https://review.opendev.org/66569614:56
openstackgerritNir Magnezi proposed openstack/octavia stable/stein: Fix a python3 issue in the amphora-agent  https://review.opendev.org/66569814:57
openstackgerritGregory Thiemonge proposed openstack/neutron-lbaas stable/queens: Improve performance on get and create/update/delete requests  https://review.opendev.org/66570014:59
*** ivve has quit IRC15:05
*** luksky has quit IRC15:20
cgoncalvesjohnsom, backport candidate? https://review.opendev.org/#/c/559460/15:21
*** bonguardo has joined #openstack-lbaas15:23
johnsomcgoncalves Technically that was an API change so I think that is why we did not backport it.15:24
cgoncalvesI was/am on the fence with this one. it's an API change, but at the same time compatibility is kept and it is a fix patch15:26
kklimondawell, that's unexpected: https://pastebin.com/5ZMHatj915:31
kklimondahow can I have two standalone amphoras for a single ACTIVE_STANDBY LB?15:31
johnsomcgoncalves Yeah, the extent of the API change is adding the links... Not a huge change.15:32
kklimondabonus points, neither one actually has VIP set in the amphora-haproxy netns15:33
johnsomkklimonda During a failover the amps are in standalone configuration, then transitioned to their final roles.15:33
openstackgerritNir Magnezi proposed openstack/octavia stable/stein: Fix a python3 issue in the amphora-agent  https://review.opendev.org/66569815:33
kklimondayeah, but that's not happening15:33
johnsomDid someone kill -9 one of the controllers?15:33
kklimondanot while they were spawning15:33
johnsomOr is the load balancer marked as provisioning_status ERROR?15:33
kklimondano, although it was marked as ERROR initially15:34
johnsomThat would not be spawning, that would be failover.15:34
kklimondaI'm trying to get it to work now15:34
johnsomOk, yeah, if it was in ERROR that means we ran out of retry attempts due to some cloud failure. It's likely the failover could not progress so we stopped and marked it in ERROR.15:34
kklimondaok, how do I proceed in that case?15:35
johnsomManually trigger a failover once the cloud is back to functional15:35
kklimondafailover the LB or amphorae?15:35
johnsomUse the load balancer failover, not the amphora failover15:35
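(For clarity, the two commands being distinguished - the amphora-level command may not exist in older python-octaviaclient releases:)
```
# Fail over the whole load balancer (the recommendation here)
openstack loadbalancer failover <LB_ID>
# Fail over a single amphora (newer client versions)
openstack loadbalancer amphora failover <AMPHORA_ID>
```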
johnsomlol, yeah, was just clarifying15:36
kklimondamhm, I've done that and that's how I ended up with two STANDALONE amphorae15:36
johnsomSo you did a manual failover, but you ended up back in ERROR?15:36
kklimonda no, the LB itself is ACTIVE now, but both amphorae are STANDALONE15:37
*** tesseract has quit IRC15:38
johnsomUmmm, now that is not good. Let me look in the code for a minute, but if you don't mind sharing, it would be helpful to have the controller logs for that failover flow.15:38
kklimondafrom the initial one, or can I failover now?15:39
johnsomThe failover now15:39
*** pcaruana has joined #openstack-lbaas15:40
johnsomI wonder if you have a warning log "Could not fetch Amphora %s from DB, ignoring "15:42
johnsomOr: skipping the failover15:43
*** gcheresh has quit IRC15:43
johnsomThere are a few known bugs in the failover flow that are planned to be fixed soon.15:44
kklimondahuh, there is no flow: https://pastebin.com/VSrnL6ts15:46
kklimondaafter that the second amphora is being failed over15:46
*** ianychoi_ has joined #openstack-lbaas15:48
johnsomAh, the reset is probably in debug level logging....15:49
*** Vorrtex has joined #openstack-lbaas15:49
kklimondaI can bump to debug and retry15:50
johnsomAh, ok, I see the bug15:50
*** ianychoi has quit IRC15:52
johnsomHmm, no, it still should have come in. I think what happened is an automated failover occurred, but did not complete.  This ties back to the known issue with the failover flow. It assumes too much based on what is in the DB about the previous amphora.15:52
kklimondaah, the flow is only logged with debug15:53
kklimondanow I have more :)15:53
kklimondait's possible that I'm partially to blame, initially the LB was stuck in PENDING_UPDATE15:54
johnsomFrom a controller kill -9?15:54
kklimondawell, it wasn't kill -915:55
johnsomOr did the retry timer not yet expire?15:55
kklimondabut I think the L2 flapping resulted in the same as kill -915:55
kklimondawe've had the network flapping for like 12 hours, and that affected rabbit, mysql, and probably any other connectivity15:55
johnsomIf we could not reach the DB to update the status, that could happen15:55
kklimondaso I ended up changing its status to ACTIVE (or ERROR) directly in the DB15:56
*** trident has quit IRC15:56
kklimondabecause everything was failing with "LB immutable"15:56
johnsomOk, yeah. So, path forward, I would go into the DB, set one to role=MASTER and one to role=BACKUP, then failover again. It should fix it15:57
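(A sketch of that path forward - table and column names recalled from the Octavia schema, IDs are placeholders, back up the DB first:)
```
# Give the two STANDALONE amphorae explicit roles, then retry the failover.
mysql octavia -e "UPDATE amphora SET role='MASTER' WHERE id='<AMP_ID_1>';"
mysql octavia -e "UPDATE amphora SET role='BACKUP' WHERE id='<AMP_ID_2>';"
openstack loadbalancer failover <LB_ID>
```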
kklimondaeven with it being stuck in PENDING_UPDATE? failover command wasn't working15:57
kklimondahttp://paste.openstack.org/show/753104/ <- the new log from failover with debug enabled15:57
johnsomYeah, if the controller can't set the status to ERROR or ACTIVE in the DB, it will log the ERRORs and give up.15:57
*** trident has joined #openstack-lbaas15:58
johnsomYeah PENDING_* anything means that one of the controllers is ACTIVELY working on the objects and no other actions should occur until the controller releases ownership. We are actually also working on improving that this cycle as well with the jobboard work. However, DB outage is kind of catastrophic as we can't update any state then. After that retry timer expires, we don't really have much we can do....15:59
kklimondamhm, I've seen some discussions regarding making it more robust16:01
kklimondacould I set one STANDALONE to MASTER, another to BACKUP in the DB and initiate failover?16:03
johnsomYeah, that is what I recommended above.16:03
johnsom<johnsom> Michael Johnson Ok, yeah. So, path forward, I would go into the DB, set one to role=MASTER and one to role=BACKUP, then failover again. It should fix it16:04
kklimondaoh, I thought you meant initially16:04
johnsomNope, current state16:04
kklimondaalso, bonus points for explaining why it takes 2 minutes for worker to connect to amphora on port 9443 ;)16:06
kklimondathe instance is up after ~30 seconds16:06
johnsomI saw your log earlier. I'm not sure why there is a two minute delay there between systemd saying it's starting it and when it actually starts. Last time I checked I was not seeing that myself; I was still at 23 seconds to LB ACTIVE. However, I will say someone else reported seeing that recently too.16:08
johnsomWe may have to turn on some kind of systemd tracing16:08
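(Short of full tracing, stock systemd tooling inside the amphora can already narrow the gap down - nothing Octavia-specific, assuming the Ubuntu image's amphora-agent.service unit name:)
```
# Per-unit startup cost for the whole boot
systemd-analyze blame
# The dependency chain that gated the agent's start
systemd-analyze critical-chain amphora-agent.service
# Exact activation timestamps for the agent unit
systemctl show amphora-agent.service -p InactiveExitTimestamp -p ActiveEnterTimestamp
```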
kklimondathanks, setting roles and failover made the LB work again16:13
johnsom+116:14
*** lemko has quit IRC16:15
*** mithilarun has joined #openstack-lbaas16:21
*** yamamoto has joined #openstack-lbaas16:27
*** yamamoto has quit IRC16:31
*** rpittau is now known as rpittau|afk16:31
*** ricolin has quit IRC16:35
openstackgerritMichael Johnson proposed openstack/octavia master: DNM CentOS7 gate test  https://review.opendev.org/66546416:48
*** ricolin has joined #openstack-lbaas16:53
*** ivve has joined #openstack-lbaas16:58
*** ramishra has quit IRC17:08
openstackgerritMichael Johnson proposed openstack/octavia master: DNM: CentOS test job - cpu-mode  https://review.opendev.org/66572617:10
openstackgerritMichael Johnson proposed openstack/octavia-tempest-plugin master: DNM: CentOS test - cpu-mode  https://review.opendev.org/66572717:15
*** eandersson has joined #openstack-lbaas17:19
*** bonguardo has quit IRC17:22
*** trident has quit IRC17:27
*** trident has joined #openstack-lbaas17:29
*** ccamposr__ has quit IRC17:30
*** ricolin has quit IRC17:30
openstackgerritMichael Johnson proposed openstack/octavia master: Add a note about nova hardware architectures  https://review.opendev.org/66573217:37
*** AlexStaf has joined #openstack-lbaas18:11
*** luksky has joined #openstack-lbaas18:12
*** ccamposr has joined #openstack-lbaas18:34
*** yamamoto has joined #openstack-lbaas18:36
*** ccamposr__ has joined #openstack-lbaas18:39
*** ccamposr has quit IRC18:42
*** pcaruana has quit IRC18:44
*** gcheresh has joined #openstack-lbaas19:07
*** AlexStaf has quit IRC19:17
*** pcaruana has joined #openstack-lbaas19:17
*** yamamoto has quit IRC19:21
*** yamamoto has joined #openstack-lbaas19:25
*** pcaruana has quit IRC19:27
*** yamamoto has quit IRC19:30
*** mithilarun has quit IRC19:34
*** trident has quit IRC19:37
*** trident has joined #openstack-lbaas19:39
*** yamamoto has joined #openstack-lbaas19:39
*** yamamoto has quit IRC19:39
*** yamamoto has joined #openstack-lbaas19:40
*** yamamoto has quit IRC19:45
*** mithilarun has joined #openstack-lbaas20:02
*** gcheresh has quit IRC20:34
*** mithilarun has quit IRC20:36
*** mithilarun has joined #openstack-lbaas20:36
*** mithilarun has quit IRC20:41
*** mithilarun has joined #openstack-lbaas20:49
*** xgerman_ has joined #openstack-lbaas20:54
xgermanShanghai will be Mandarin and English… reminds me of Tokyo where half the presentations were Japanese :-)20:55
johnsomHa, yeah. Tokyo was fun20:56
johnsomxgerman BTW, the log offloading patches are merging as we chat20:56
xgermanyep, sweet :-)20:56
*** boden has quit IRC21:10
openstackgerritMichael Johnson proposed openstack/octavia master: DNM CentOS7 gate test  https://review.opendev.org/66546421:16
*** Vorrtex has quit IRC21:17
openstackgerritMichael Johnson proposed openstack/octavia master: Add the Amphora image building guide to the docs  https://review.opendev.org/66576921:39
*** rcernin has joined #openstack-lbaas21:43
*** ccamposr__ has quit IRC21:51
rm_workjohnsom / kklimonda: yes, i was seeing at least 30s from when i was able to ssh in to an amp to when the agent even started... but i redid my devstack and moved to a different patch and now i can't reproduce21:52
rm_workif i am able to repro again I will let you know21:52
*** lemko has joined #openstack-lbaas21:52
rm_workand for me it wasn't just centos, it was ubuntu also21:52
johnsomRight, that user was using ubuntu as well.21:54
rm_workyeah21:54
rm_workwasn't sure if when you said "but it was on centos" you were referring to my testing21:54
rm_workbecause i saw it on both i think21:55
johnsomIt looked like a gap between when systemd said it was starting it and when the process actually started21:55
johnsomCarlos and I are working on the CentOS issue which is yet another boot performance issue21:55
openstackgerritMichael Johnson proposed openstack/octavia master: Add the Amphora image building guide to the docs  https://review.opendev.org/66576921:56
*** mithilarun has quit IRC21:57
*** mithilarun has joined #openstack-lbaas22:00
*** fnaval has quit IRC22:05
*** luksky has quit IRC22:08
*** ChanServ sets mode: +o johnsom22:21
*** luksky has joined #openstack-lbaas22:21
*** blake has joined #openstack-lbaas22:22
*** goldyfruit has quit IRC22:27
openstackgerritMerged openstack/octavia-tempest-plugin master: Save amphora logs in gate  https://review.opendev.org/62640622:34
*** luksky has quit IRC22:39
*** blake has quit IRC22:40
*** sapd1_x has joined #openstack-lbaas23:01
johnsomI keep forgetting how painful running tox on neutron is....23:16
rm_workyes23:36
rm_workugh been trying to get remote-debugging working on my devstack VM for like 4 hours now23:36
rm_workit's just not behaving23:36
rm_workvery frustrating23:36
rm_workupgraded to latest pycharm and still no luck. connects but doesn't break23:37
rm_workand only the first time, after the first connection it never tries to connect again >_><23:37
*** goldyfruit has joined #openstack-lbaas23:43
johnsomI am working on cleaning out the neutron-lbaas stuff from neutron-lib/neutron. It's the part I signed up for at the PTG23:47
johnsomneutron-lib: https://review.opendev.org/66582823:49
