openstackgerrit | Michael Johnson proposed openstack/octavia master: WIP - Refactor the failover flows https://review.opendev.org/705317 | 00:05 |
*** spatel has joined #openstack-lbaas | 00:09 | |
*** spatel has quit IRC | 00:13 | |
*** vishalmanchanda has joined #openstack-lbaas | 00:21 | |
*** nicolasbock has quit IRC | 01:15 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Support HTTP and TCP checks in UDP healthmonitor https://review.opendev.org/589180 | 01:29 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Select the right lb_network_ip interface using AZ https://review.opendev.org/704927 | 01:30 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Network Delta calculations should respect AZs https://review.opendev.org/705165 | 01:30 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow AZ to override valid_vip_networks config https://review.opendev.org/699521 | 01:30 |
*** yamamoto has joined #openstack-lbaas | 01:31 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Conf option to use VIP ip as source ip for backend https://review.opendev.org/702535 | 01:31 |
rm_work | REALLY still need to merge that whole az-tweaks chain ^^ | 01:34 |
rm_work | (not 702535, that's unrelated, no hurry on that) | 01:34 |
rm_work | we're just leaving AZs broken in the meantime | 01:35 |
*** ccamposr__ has quit IRC | 02:01 | |
*** ccamposr__ has joined #openstack-lbaas | 02:01 | |
*** TMM has quit IRC | 02:26 | |
*** TMM has joined #openstack-lbaas | 02:26 | |
*** vishalmanchanda has quit IRC | 03:09 | |
*** psachin has joined #openstack-lbaas | 03:29 | |
*** psachin has quit IRC | 03:33 | |
*** psachin has joined #openstack-lbaas | 03:35 | |
*** vishalmanchanda has joined #openstack-lbaas | 04:31 | |
*** numans has quit IRC | 06:15 | |
*** strigazi has quit IRC | 06:16 | |
*** strigazi has joined #openstack-lbaas | 06:18 | |
*** numans has joined #openstack-lbaas | 07:11 | |
*** psachin has quit IRC | 07:47 | |
*** gcheresh has joined #openstack-lbaas | 07:55 | |
*** maciejjozefczyk has joined #openstack-lbaas | 08:12 | |
*** yamamoto has quit IRC | 08:16 | |
openstackgerrit | Gregory Thiemonge proposed openstack/octavia stable/queens: Fix uncaught DB exception when trying to get a spare amphora https://review.opendev.org/709569 | 08:17 |
*** yamamoto has joined #openstack-lbaas | 08:21 | |
*** tkajinam has quit IRC | 08:22 | |
*** tesseract has joined #openstack-lbaas | 08:23 | |
*** tkajinam has joined #openstack-lbaas | 08:43 | |
*** yamamoto has quit IRC | 09:58 | |
*** yamamoto has joined #openstack-lbaas | 10:09 | |
*** yamamoto has quit IRC | 11:20 | |
*** yamamoto has joined #openstack-lbaas | 11:22 | |
*** ivve has joined #openstack-lbaas | 11:30 | |
ivve | hey guys, what version of keepalived is recommended in amphoras? and have you had any issues reported with 1.3.9-1 (the ubuntu 16.04/18.04 repo version)? | 11:33 |
openstackgerrit | Ann Taraday proposed openstack/octavia master: [Amphorav2] Fix noop driver case https://review.opendev.org/709696 | 11:36 |
*** rcernin has quit IRC | 11:37 | |
*** jamesdenton has quit IRC | 11:45 | |
*** jamesdenton has joined #openstack-lbaas | 11:46 | |
*** nicolasbock has joined #openstack-lbaas | 11:53 | |
*** ccamposr has joined #openstack-lbaas | 12:15 | |
*** xakaitetoia1 has joined #openstack-lbaas | 12:15 | |
xakaitetoia1 | Hello team, now that fwaas is being deprecated, what's a good alternative for putting a firewall in front of LBaaS? | 12:16 |
*** ccamposr__ has quit IRC | 12:17 | |
*** tkajinam has quit IRC | 12:30 | |
*** maciejjozefczyk has quit IRC | 13:14 | |
*** ccamposr has quit IRC | 13:17 | |
*** ccamposr has joined #openstack-lbaas | 13:17 | |
*** gcheresh has quit IRC | 13:19 | |
*** gcheresh has joined #openstack-lbaas | 13:19 | |
*** maciejjozefczyk has joined #openstack-lbaas | 13:27 | |
*** nicolasbock has quit IRC | 13:33 | |
*** nicolasbock has joined #openstack-lbaas | 13:34 | |
*** nicolasbock has quit IRC | 13:36 | |
*** yamamoto has quit IRC | 13:36 | |
*** gcheresh_ has joined #openstack-lbaas | 13:40 | |
*** gcheresh has quit IRC | 13:40 | |
*** tkajinam has joined #openstack-lbaas | 13:53 | |
johnsom | ivve We have had minor issues with keepalived, but we have put workarounds in place for those. We currently just use the version included in the supported distributions (centos 7/8, rhel 7/8, ubuntu 16.04/18.04). In "v" we will have the newer versions of keepalived across both the centos and ubuntu platforms and will make some adjustments. | 13:58 |
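A quick way to confirm which keepalived build an amphora image actually carries is to check from inside a running amphora. This is only a sketch: the SSH user, key path, and address are deployment-specific placeholders, not values from the log.

```bash
# Check the keepalived version inside a running amphora.
# "ubuntu" applies to Ubuntu-based images; the key path and the
# lb_network_ip below are placeholders for your deployment.
ssh -i /etc/octavia/.ssh/octavia_ssh_key ubuntu@<amphora-lb-network-ip> \
    "keepalived --version 2>&1 | head -1"
```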
*** yamamoto has joined #openstack-lbaas | 13:58 | |
johnsom | xakaitetoia1 What do you need the firewall for? Octavia manages the security groups for the load balancer automatically, so curious what your use case is. | 13:59 |
ivve | johnsom: alright, do you recognize this: when using keepalived, sometimes sessions are passed back and forth between load balancers, effectively cutting sessions and/or causing resets | 14:00 |
ivve | we seem to experience this not only in amphoras, but also in snat namespaces (via neutron) | 14:01 |
ivve | when using standalone amphoras it works 100% of the time. with active_standby topology the failure doesn't occur 100% of the time either | 14:02 |
ivve | we have around a ~50% failure rate i guess on active_standby | 14:02 |
ivve | we upgraded octavia to 4.1.0 and now to 4.1.1 (still testing) but so far the same issue exists in 4.1.0, making me think this is keepalived doing some funny business | 14:04 |
rm_work | ivve: to me that sounds like your network is either incredibly unreliable, or entirely blocking the connection between amps and they are just both in split-brain constantly fighting to gARP for the VIP (so it passes back and forth as fast as the switches will listen) | 14:24 |
ivve | rm_work: we tested on the same host | 14:24 |
ivve | same problem for amphoras | 14:24 |
rm_work | ivve: if you are using our standard image building without any custom changes... then i very much doubt it is the image | 14:25 |
rm_work | when was your image built? | 14:25 |
ivve | i was thinking about that since we do have some vxlan tunneling between DC but then move all traffic to in-DC and also later on in-host without any difference | 14:25 |
ivve | 18.04, built just a few days ago | 14:25 |
rm_work | hmm | 14:25 |
rm_work | can you SSH into the amphorae and look at the keepalived logs? or are you offloading those logs somewhere? | 14:26 |
ivve | yea | 14:27 |
rm_work | what are they showing? | 14:27 |
ivve | so for the snat we do get double master promotion, but nothing like that in the amphorae | 14:27 |
ivve | we're also going to upgrade neutron from 14.0.2 to 14.0.4 | 14:28 |
ivve | doubt anything will change though | 14:28 |
ivve | even with -D flags the logs are very quiet | 14:28 |
rm_work | can you paste the logs perchance? | 14:29 |
ivve | when dumping we can just see 1 keepalived sending its regular heartbeat to the multicast address | 14:29 |
ivve | gonna check if i have some old around | 14:29 |
rm_work | ideally you could have one currently running | 14:30 |
rm_work | and paste logs from both sides | 14:30 |
rm_work | and make sure they are actually connectable from inside the netns | 14:30 |
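One way to check that, assuming access to both amphorae: watch the VRRP advertisements on eth1 inside the amphora-haproxy namespace. In a healthy pair only the current MASTER advertises; seeing advertisements sourced from both units for the same VRID would indicate the split-brain rm_work describes. A minimal sketch:

```bash
# VRRP is IP protocol 112. Run this on each amphora and compare:
# only one side should be sourcing advertisements at any given time.
sudo ip netns exec amphora-haproxy \
    tcpdump -nli eth1 'ip proto 112 or arp'
```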
ivve | yea | 14:32 |
ivve | didn't have one but ill generate one, gimme 5-10mins need to do some reconfs | 14:33 |
ivve | im getting the feeling that if we can solve it in one of the components we can kill two birds with one stone | 14:35 |
ivve | i mean the strange thing is that we have the failure at spawntime | 14:35 |
ivve | if it works, it works all the time, all the way | 14:35 |
ivve | if it "fails" during spawn, it keeps failing until we bump them | 14:35 |
ivve | like with a failover or restart stuff | 14:36 |
ivve | for neutron-snat we just have to delete the routes in the "second" master | 14:36 |
ivve | then its all good | 14:36 |
ivve | spawning them now, should have logs on both sides soonish | 14:36 |
*** TrevorV has joined #openstack-lbaas | 14:41 | |
rm_work | hmm | 14:46 |
ivve | the debug logs from neutron-snat are really nothing, they don't say why it promotes to master, just that it does | 14:48 |
rm_work | the "why" is pretty predictable -- it loses its connection to the pair | 14:49 |
ivve | yea but one would expect some kind of heartbeat fail message | 14:50 |
ivve | at least when running debug logs | 14:50 |
ivve | and then it keeps switching back and forth without saying anything about master promotion; basically at that stage you have a split-brain situation | 14:51 |
ivve | so sessions flipflop between the namespaces | 14:51 |
johnsom | Sorry, I'm giving a training this morning. I will catch up with the thread here in an hour or so | 14:52 |
johnsom | ivve The keepalived logs inside the amphora go to the "syslog" or "messages" log | 14:54 |
ivve | thanks, im aware. it's a bit different for neutron-snat, they choose to send it to a file | 14:55 |
ivve | which unfortunately is deleted along with the namespace if the router is removed | 14:56 |
ivve | also, when the pair is lost, shouldn't octavia-services be informed? | 14:58 |
rm_work | no | 14:58 |
ivve | okay | 14:58 |
rm_work | we let the amp pair handle itself | 14:58 |
rm_work | we never track which is MASTER/BACKUP after initial provisioning | 14:58 |
ivve | got it | 14:58 |
rm_work | I'm ... on the fence about whether that's actually best or not; IMO it would be nice to have the health message include info about whether the amp thinks it is the MASTER or not -- but there's the obvious problem of deciding what is actually correct | 15:00 |
ivve | i was thinking about the nopreempt setting as well | 15:00 |
ivve | for haproxy | 15:00 |
ivve | since configs are master/backup with priorities anyways | 15:01 |
*** tkajinam has quit IRC | 15:08 | |
xakaitetoia1 | johnsom, ah perfect, I didn't know octavia manages this and i wanted to protect the LB itself. Any idea how i can check that ? | 15:18 |
xakaitetoia1 | rm_work, yes i recently saw that myself when doing a failover of the master amphora. It immediately switched to the backup node but didn't make it master. Then octavia destroyed the "master node" and re-created it, after which the keepalived address moved from the backup node back to the master node | 15:20 |
ivve | damn, after updating to 4.1.1 i don't seem to be able to reproduce it. into 7th stack now | 15:20 |
rm_work | hmmm i thought we had nopreempt on and that would normally prevent the "new master" from taking back over | 15:21 |
ivve | ok now i think i have one, lol | 15:22 |
ivve | rm_work: okay i now have one failing, all i can see is this in syslog | 15:24 |
ivve | https://hastebin.com/uzoticuhod.nginx | 15:24 |
ivve | the backup is all silent | 15:25 |
ivve | and we have a script running in the background that does curl to the listener | 15:25 |
ivve | with roughly 10% failure rate on those curls atm | 15:25 |
ivve | so 1 / 7 failed atm | 15:25 |
ivve | the others have a 100% success rate for the entire time from when i started, so 15mins give or take | 15:26 |
*** irclogbot_0 has quit IRC | 15:27 | |
ivve | tcpdump is only the heartbeat udp's at port 5555 | 15:28 |
ivve | from controller nodes | 15:28 |
ivve | (on the slave) | 15:28 |
ivve | and some arps | 15:28 |
*** irclogbot_1 has joined #openstack-lbaas | 15:28 | |
ivve | and i can see the udp traffic from all 3 controller nodes | 15:29 |
ivve | so the slave never sends out any grat arps | 15:30 |
ivve | https://hastebin.com/epijiwakuf.css <-- slave; the promiscuous mode switches are from my tcpdumping | 15:32 |
*** gcheresh has joined #openstack-lbaas | 15:46 | |
*** gcheresh_ has quit IRC | 15:50 | |
*** nicolasbock has joined #openstack-lbaas | 15:53 | |
rm_work | ivve: can you enter the netns on the amphora (`ip netns exec amphora-haproxy bash`) and try a tcpdump in there to see what's going on with the keepalived pair? | 15:57 |
*** Trevor_V has joined #openstack-lbaas | 16:03 | |
*** TrevorV has quit IRC | 16:07 | |
johnsom | ivve I think we have seen an issue in the 4.1.x series that was fixed in keepalived. Let me see if I can find the reference. | 16:07 |
johnsom | Also, yes, you need to be in the network namespace | 16:07 |
johnsom | xakaitetoia1 As an operator/admin, you can look at the load balancer VIP port, then see the security groups associated. Then query the security group details. Basically Octavia only opens the ports defined for the listener(s). | 16:09 |
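A sketch of that lookup using the standard openstack CLI (run as admin); the load balancer name and IDs are placeholders, and the column names should be double-checked against the client version in use.

```bash
# Find the LB's VIP port, the security group(s) attached to it, and the
# rules Octavia created for the configured listeners.
openstack loadbalancer show <lb-name> -c vip_port_id
openstack port show <vip-port-id> -c security_group_ids
openstack security group rule list <security-group-id>
```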
ivve | rm_work: this line is being spammed: fa:16:3e:7c:93:58 > fa:16:3e:9e:09:91, ethertype IPv4 (0x0800), length 54: 10.150.1.57 > 10.150.1.166: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20 | 16:11 |
johnsom | ivve This is the issue we saw with keepalived > 2.0 and < 2.0.11: https://github.com/acassen/keepalived/commit/30eeb48b1a0737dc7443fd421fd6613e0d55fd17 | 16:11 |
johnsom | ivve Yes, that is the normal Active<->Standby keepalived heartbeat. That is outside the network namespace | 16:12 |
ivve | yea | 16:12 |
ivve | expected | 16:12 |
*** maciejjozefczyk has quit IRC | 16:12 | |
johnsom | Oh, wait, it is inside the netns actually. | 16:12 |
ivve | yes | 16:12 |
johnsom | The GARP messages are coming from the current "MASTER" instance | 16:12 |
ivve | this is on the non-namespace interface | 16:12 |
rm_work | err | 16:13 |
rm_work | outside the NS? | 16:13 |
johnsom | If you run "grep -i stat /var/log/syslog" it should pop out which is master and which is backup. Since these can switch sub-second, it's rough to try to track at the control plane level | 16:13 |
ivve | https://hastebin.com/uzoticuhod.nginx | 16:14 |
ivve | keeps coming from the master | 16:14 |
ivve | and on the non-namespace interface all i can see is port 5555 udp traffic and some normal BUM traffic | 16:15 |
johnsom | Right | 16:15 |
ivve | like arp who-has / reply | 16:15 |
johnsom | So, check your syslog for the status change messages, see if you have any or if it is a network issue outside the amphora | 16:16 |
ivve | syslog is all empty in the slave | 16:18 |
ivve | in the master it says keeps doing garps | 16:18 |
ivve | thats it | 16:18 |
johnsom | Ok, now I am done training a new QE team, I can answer quicker.... | 16:18 |
johnsom | Hmm, it must have rotated out already. Check systemd status of the keepalived on the backup | 16:19 |
johnsom | Just to make sure it is running. You can also check the process list, it should be running using the /var/lib/octavia/... config file | 16:20 |
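A minimal sketch of those checks on the backup amphora. The octavia-keepalived unit name is an assumption; the process list check works regardless of how the service unit is named.

```bash
# Is keepalived actually running, and has it ever changed state?
sudo systemctl status octavia-keepalived      # unit name may differ per image
ps -ef | grep '[k]eepalived'                  # should reference the /var/lib/octavia/... config
grep -iE 'keepalived.*(master|backup|fault)' /var/log/syslog
```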
*** gcheresh has quit IRC | 16:20 | |
ivve | getting the full log from when it started | 16:21 |
johnsom | I am going to boot an act/stdby so I can paste you some expected log snippets | 16:21 |
johnsom | ivve What version of Octavia are you running? | 16:21 |
ivve | 4.1.1 now | 16:22 |
ivve | this is the slave from namespace config (right after sysctl) down to me starting tcpdumps - https://hastebin.com/uqoxefojid.sql | 16:23 |
johnsom | ivve Yeah, that looks normal for a backup. From that log, it has never FAILED or became MASTER. | 16:25 |
johnsom | It is still in the BACKUP state | 16:25 |
ivve | this is the master | 16:25 |
ivve | Feb 25 15:04:34 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Transition to MASTER STATE | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Entering MASTER STATE | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol VIPs. | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol Virtual Routes | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol Virtual Rules | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: message repeated 3 times: [ Sending gratuitous ARP on eth1 for 10.150.1.40] | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
johnsom | Otherwise, that is all normal, including the warnings (we should clean up at some point) | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: message repeated 4 times: [ Sending gratuitous ARP on eth1 for 10.150.1.40] | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Starting HAProxy Load Balancer... | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Started HAProxy Load Balancer. | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:26 |
johnsom | pastebin is your friend | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150440 (1104) : Reexecuting Master process | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150446 (1104) : Former worker 1105 exited with code 0 | 16:26 |
johnsom | https://paste.openstack.org | 16:26 |
ivve | Feb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150446 (1104) : Reexecuting Master process | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150517 (1104) : Former worker 1166 exited with code 0 | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:27 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:27 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150517 (1104) : Reexecuting Master process | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150520 (1104) : Former worker 1246 exited with code 0 | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150520 (1104) : Reexecuting Master process | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150524 (1104) : Former worker 1301 exited with code 0 | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:29 |
ivve | fuck | 16:29 |
ivve | https://hastebin.com/qiwotufufe.coffeescript | 16:29 |
ivve | sorry, clipboards failure | 16:29 |
ivve | would have happened anyway, unless that clears the secondary clipboard :P | 16:29 |
ivve | anyways | 16:29 |
ivve | this is the master - https://hastebin.com/qiwotufufe.coffeescript | 16:29 |
ivve | same timeline | 16:29 |
ivve | from sysctl finish confs to master up sending grat arps | 16:29 |
ivve | Keepalived v1.3.9 (10/21,2017) | 16:29 |
ivve | this is the default in ubuntu 18.04.4 | 16:29 |
ivve | from their repo | 16:29 |
ivve | the re-executions are nothing strange since that's configuring backends/members | 16:29 |
ivve | could it be on the neutron side? the neutron router having the wrong arp or something | 16:30 |
*** gcheresh has joined #openstack-lbaas | 16:35 | |
johnsom | Well, that is why we beat it over the head with GARPs.... | 16:36 |
johnsom | What is your ML2 in neutron? | 16:36 |
ivve | l2 | 16:36 |
johnsom | OVS? OVN? linux bridge? | 16:37 |
ivve | oh ovs | 16:37 |
ivve | l2pop | 16:37 |
johnsom | Ok, yeah, that should be fine | 16:37 |
johnsom | OVN had a nasty bug at one point (maybe still does) where ARPs took 30 seconds to actually update, but OVS doesn't have that issue. | 16:38 |
ivve | it's as if the router is getting the wrong arp every now and then and then it gets corrected, but just for a split second | 16:39 |
ivve | or something.. this is so weird at this point | 16:39 |
ivve | 10% of sessions from this lb is just gone atm | 16:39 |
ivve | over the time i found the lb to now | 16:40 |
ivve | which is over an hour | 16:40 |
johnsom | Can you run "sudo ip netns exec amphora-haproxy ip a" on the standby amphora? | 16:40 |
ivve | https://hastebin.com/hemurirone.cs | 16:42 |
johnsom | Ok, yeah, so it's behaving correctly, it doesn't have the VIP IP on eth1, so it's not ARPing for it at all. | 16:42 |
johnsom | On master, eth1 will have two IPs, and should be GARP/ARP for the VIP address. | 16:43 |
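Based on that, a quick way to see which unit currently holds the VIP without digging through logs; <vip-address> is a placeholder.

```bash
# Run on each amphora: the unit showing the VIP on eth1 inside the
# namespace is the one currently acting as MASTER.
sudo ip netns exec amphora-haproxy ip addr show eth1 | grep '<vip-address>'
```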
ivve | yea | 16:43 |
johnsom | You could "sudo ip netns exec amphora-haproxy tcpdump -nli eth1 arp" on the backup just to make sure, but I'm 99.9995% sure it's not going to be advertising the VIP address. | 16:44 |
johnsom | You will see other ARP traffic of course, but it shouldn't be from the eth1 with the VIP in it | 16:44 |
*** gcheresh has quit IRC | 16:45 | |
ivve | https://hastebin.com/icexoziluy.css | 16:47 |
johnsom | Yeah, so the arps are fine | 16:48 |
ivve | testmethod: https://hastebin.com/modowocaba.bash | 16:51 |
ivve | expected outcome is 0 100% | 16:51 |
ivve | but right now im getting around 85% on just this one lb | 16:52 |
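The actual hastebin script isn't reproduced here; the following is only a sketch of the kind of probe described, a curl loop against the listener that tracks the running failure percentage (VIP address and port are placeholders).

```bash
#!/bin/bash
# Probe the VIP in a loop and report the running failure rate.
VIP="http://<vip-address>:80/"
total=0; failed=0
while true; do
    total=$((total + 1))
    curl -s -m 2 -o /dev/null "$VIP" || failed=$((failed + 1))
    echo "failed ${failed}/${total} ($((failed * 100 / total))%)"
    sleep 1
done
```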
johnsom | What is the web server? | 16:52 |
ivve | it's sooooo strange, why not every time, why only like one lb every now and then | 16:52 |
ivve | like i setup 7 lbs | 16:52 |
ivve | only 1 of them behaves like this | 16:53 |
ivve | yeah its 3 members behind | 16:53 |
ivve | k8s | 16:53 |
johnsom | It's a real web server right? not a "netcat/nc" faker right? | 16:53 |
ivve | yea | 16:53 |
*** tesseract has quit IRC | 16:54 | |
johnsom | Ok. Because we get that with netcat as it can't handle concurrent connections, so the health check will interfere with the user request. That is why we have a golang web server for the cirros test instances. | 16:54 |
johnsom | Ok, next thing I would check, in the master amphora, look in the /var/log/haproxy log, you should see a line per connection. Towards the end of the line is the "status" string that describes how the connection was resolved. They should all basically look the same | 16:56 |
johnsom | In this example: | 16:57 |
ivve | when the error happens i get cD or CD | 16:57 |
johnsom | 4798-bc0c-400465756fa1 172.24.4.1 42940 25/Feb/2020:16:57:07.480 "GET / HTTP/1.1" 200 79 73 - "" 8c3cc362-7516-4c6f-a889-9347dd021fb5:08a2bedc-96d0-4798-bc0c-400465756fa1 36472f77-ee9c-435f-8a34-a9e6538bfce7 4 ---- | 16:57 |
johnsom | The four hyphens at the end mean successful flow. | 16:57 |
johnsom | Ah! ok, so there is the clue. Can you past the exact four characters there? | 16:58 |
ivve | https://hastebin.com/sonofesafe.nginx | 16:58 |
ivve | there are 3 errors i think in there | 16:58 |
ivve | 6 actually | 16:59 |
johnsom | hmm, that is an odd log. Are these TCP LBs? | 16:59 |
ivve | this is haproxy.log | 16:59 |
ivve | straight tail, no fuzz | 16:59 |
johnsom | Oh, right, nevermind, I changed the log format for offloading, you don't have that in stein. | 16:59 |
johnsom | Ok | 16:59 |
ivve | :) | 16:59 |
johnsom | decoding, just a sec | 17:00 |
johnsom | c : the client-side timeout expired while waiting for the client to | 17:01 |
johnsom | send or receive data. | 17:01 |
johnsom | D : the session was in the DATA phase. | 17:01 |
johnsom | C : the TCP session was unexpectedly aborted by the client. | 17:02 |
johnsom | So, the connection from the client to the VIP of the load balancer is unexpectedly terminated. | 17:02 |
johnsom | Pretty much what you said..... | 17:02 |
ivve | :( | 17:02 |
johnsom | So it's in front of the haproxy somewhere. | 17:02 |
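A rough way to quantify how many connections ended with those client-side abort/timeout codes straight from the amphora's haproxy log; the exact field layout depends on the haproxy log format in use, so this just pattern-matches the two codes seen above.

```bash
# Connections terminated with the cD / CD client-side codes vs. total lines.
grep -c -E ' (cD|CD) ' /var/log/haproxy.log
wc -l < /var/log/haproxy.log
```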
ivve | yeah, im out of juice for this problem | 17:04 |
ivve | think we checked this particular log 2 days ago | 17:04 |
ivve | no, it was friday | 17:04 |
ivve | anyways | 17:04 |
johnsom | ivve ok, so my next step would be tcpdump on eth1 and see if there are any clues there. | 17:05 |
johnsom | inside the netns | 17:05 |
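For that kind of offline analysis, a capture written to a file is easier to share than live output; a minimal sketch (the filename is arbitrary):

```bash
# Capture full packets on eth1 inside the amphora-haproxy namespace,
# then copy /tmp/vip-eth1.pcap off the amphora for inspection.
sudo ip netns exec amphora-haproxy \
    tcpdump -ni eth1 -s 0 -w /tmp/vip-eth1.pcap
```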
johnsom | But I doubt it. It's most likely in the neutron or forwarding layer. | 17:05 |
ivve | mm | 17:06 |
ivve | im probably gonna try upgrade neutron tomorrow | 17:06 |
johnsom | Ok, let me know if you want to keep debugging lower in the amphora. I.e. doing a packet capture and I can look at the pcap for you. | 17:07 |
johnsom | I'm happy to spend more time helping if you want it. | 17:07 |
ivve | aye no worries. ill get back to you if i need help. thanks a bunch, much appreciated! | 17:08 |
johnsom | You could also stop keepalived and configure the interface manually, just to remove keepalived from the path, but if there aren't issues in the syslog, I doubt it's your problem. | 17:08 |
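A sketch of that test on the MASTER amphora; the octavia-keepalived unit name and the /24 prefix length are assumptions, so match them to the actual deployment and VIP subnet before trying it.

```bash
# Take keepalived out of the data path and plumb the VIP by hand.
sudo systemctl stop octavia-keepalived
sudo ip netns exec amphora-haproxy ip addr add <vip-address>/24 dev eth1
# ... run the curl probe, then undo:
sudo ip netns exec amphora-haproxy ip addr del <vip-address>/24 dev eth1
sudo systemctl start octavia-keepalived
```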
ivve | but i think i need to look more at neutron | 17:08 |
johnsom | Ok, no problem | 17:08 |
johnsom | Good luck! | 17:08 |
ivve | due to the clues, it only happens every now and then | 17:08 |
ivve | maybe lower the amount of garps? | 17:09 |
ivve | i understand the failover time will be slower | 17:09 |
ivve | but maybe it's excessive for ovs? | 17:09 |
johnsom | No, actually it won't. The garps are just there to make sure neutron has the right mapping. | 17:09 |
ivve | failure/switchover forces a new? | 17:10 |
johnsom | We saw issues with some ML2 drivers where they aged them out too aggressively | 17:10 |
ivve | oh | 17:10 |
johnsom | Yeah, on transition it will always send a garp. | 17:10 |
ivve | naturally | 17:10 |
johnsom | You can certainly lower it, but I don't think it is related | 17:10 |
ivve | yea im just grasping for straws here | 17:10 |
*** vishalmanchanda has quit IRC | 17:11 | |
ivve | and we still do have the issues with snat on routers too | 17:11 |
johnsom | We erred on the side of being aggressive with the GARPs as they should be harmless. | 17:11 |
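For reference, the GARP cadence is tunable on the controller side. The option names below come from the Octavia configuration reference for the [keepalived_vrrp] group and should be verified against the deployed release; the values are examples only, and they only affect amphorae built or failed over after the change.

```bash
# Append (or merge into an existing [keepalived_vrrp] section of)
# /etc/octavia/octavia.conf, then restart the Octavia controller services.
cat <<'EOF' | sudo tee -a /etc/octavia/octavia.conf
[keepalived_vrrp]
vrrp_garp_refresh_interval = 30
vrrp_garp_refresh_count = 1
EOF
```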
ivve | which is kinda similar but not | 17:11 |
ivve | where neutron tells two namespaces out of 3 that they are master half a minute after creation | 17:12 |
*** ccamposr__ has joined #openstack-lbaas | 17:30 | |
*** ccamposr has quit IRC | 17:32 | |
*** ivve has quit IRC | 17:47 | |
*** yamamoto has quit IRC | 17:53 | |
johnsom | Sigh, 500 error from github.... Don't forget the code really lives at opendev.org | 17:58 |
*** trident has quit IRC | 18:52 | |
*** trident has joined #openstack-lbaas | 18:58 | |
*** maciejjozefczyk has joined #openstack-lbaas | 20:01 | |
*** maciejjozefczyk has quit IRC | 20:21 | |
*** gcheresh has joined #openstack-lbaas | 20:59 | |
*** Trevor_V has quit IRC | 21:00 | |
*** ccamposr__ has quit IRC | 21:03 | |
*** ccamposr has joined #openstack-lbaas | 21:40 | |
*** gcheresh has quit IRC | 21:59 | |
*** ccamposr has quit IRC | 22:06 | |
*** xakaitetoia1 has quit IRC | 22:49 | |
*** tkajinam has joined #openstack-lbaas | 22:58 | |
*** rcernin has joined #openstack-lbaas | 23:39 |