Tuesday, 2020-02-25

openstackgerritMichael Johnson proposed openstack/octavia master: WIP - Refactor the failover flows  https://review.opendev.org/705317  00:05
*** spatel has joined #openstack-lbaas00:09
*** spatel has quit IRC00:13
*** vishalmanchanda has joined #openstack-lbaas00:21
*** nicolasbock has quit IRC01:15
openstackgerritAdam Harwell proposed openstack/octavia master: Support HTTP and TCP checks in UDP healthmonitor  https://review.opendev.org/589180  01:29
openstackgerritAdam Harwell proposed openstack/octavia master: Select the right lb_network_ip interface using AZ  https://review.opendev.org/704927  01:30
openstackgerritAdam Harwell proposed openstack/octavia master: Network Delta calculations should respect AZs  https://review.opendev.org/705165  01:30
openstackgerritAdam Harwell proposed openstack/octavia master: Allow AZ to override valid_vip_networks config  https://review.opendev.org/699521  01:30
openstackgerritAdam Harwell proposed openstack/octavia master: Allow AZ to override valid_vip_networks config  https://review.opendev.org/699521  01:30
*** yamamoto has joined #openstack-lbaas01:31
openstackgerritAdam Harwell proposed openstack/octavia master: Conf option to use VIP ip as source ip for backend  https://review.opendev.org/702535  01:31
rm_workREALLY still need to merge that whole az-tweaks chain ^^01:34
rm_work(not 702535, that's unrelated, no hurry on that)01:34
rm_workwe're just leaving AZs broken in the meantime01:35
*** ccamposr__ has quit IRC02:01
*** ccamposr__ has joined #openstack-lbaas02:01
*** TMM has quit IRC02:26
*** TMM has joined #openstack-lbaas02:26
*** vishalmanchanda has quit IRC03:09
*** psachin has joined #openstack-lbaas03:29
*** psachin has quit IRC03:33
*** psachin has joined #openstack-lbaas03:35
*** vishalmanchanda has joined #openstack-lbaas04:31
*** numans has quit IRC06:15
*** strigazi has quit IRC06:16
*** strigazi has joined #openstack-lbaas06:18
*** numans has joined #openstack-lbaas07:11
*** psachin has quit IRC07:47
*** gcheresh has joined #openstack-lbaas07:55
*** maciejjozefczyk has joined #openstack-lbaas08:12
*** yamamoto has quit IRC08:16
openstackgerritGregory Thiemonge proposed openstack/octavia stable/queens: Fix uncaught DB exception when trying to get a spare amphora  https://review.opendev.org/709569  08:17
*** yamamoto has joined #openstack-lbaas08:21
*** tkajinam has quit IRC08:22
*** tesseract has joined #openstack-lbaas08:23
*** tkajinam has joined #openstack-lbaas08:43
*** yamamoto has quit IRC09:58
*** yamamoto has joined #openstack-lbaas10:09
*** yamamoto has quit IRC11:20
*** yamamoto has joined #openstack-lbaas11:22
*** ivve has joined #openstack-lbaas11:30
ivvehey guys, what version of keepalived is recommended in amphorae? and have you had any issues reported with 1.3.9-1 (the ubuntu 16.04/18.04 repo version)?11:33
openstackgerritAnn Taraday proposed openstack/octavia master: [Amphorav2] Fix noop driver case  https://review.opendev.org/709696  11:36
*** rcernin has quit IRC11:37
*** jamesdenton has quit IRC11:45
*** jamesdenton has joined #openstack-lbaas11:46
*** nicolasbock has joined #openstack-lbaas11:53
*** ccamposr has joined #openstack-lbaas12:15
*** xakaitetoia1 has joined #openstack-lbaas12:15
xakaitetoia1Hello team, now that fwaas is being deprecated, what's a good alternative for putting a firewall in front of LBaaS?12:16
*** ccamposr__ has quit IRC12:17
*** tkajinam has quit IRC12:30
*** maciejjozefczyk has quit IRC13:14
*** ccamposr has quit IRC13:17
*** ccamposr has joined #openstack-lbaas13:17
*** gcheresh has quit IRC13:19
*** gcheresh has joined #openstack-lbaas13:19
*** maciejjozefczyk has joined #openstack-lbaas13:27
*** nicolasbock has quit IRC13:33
*** nicolasbock has joined #openstack-lbaas13:34
*** nicolasbock has quit IRC13:36
*** yamamoto has quit IRC13:36
*** gcheresh_ has joined #openstack-lbaas13:40
*** gcheresh has quit IRC13:40
*** tkajinam has joined #openstack-lbaas13:53
johnsomivve We have had minor issues with keepalived, but we have put workarounds in place for those. We currently just use the version included in the supported distributions (centos 7/8, rhel 7/8, ubuntu 16.04/18.04). In "V" we will have the newer versions of keepalived across both the centos and ubuntu platforms and will make some adjustments.13:58
*** yamamoto has joined #openstack-lbaas13:58
johnsomxakaitetoia1 What do you need the firewall for? Octavia manages the security groups for the load balancer automatically, so curious what your use case is.13:59
ivvejohnsom: alright, do you recognize this: when using keepalived, sometimes sessions are passed back and forth between loadbalancers, effectively cutting sessions and/or causing resets14:00
ivvewe seem to experience this not only in amphorae, but also in snat namespaces (via neutron)14:01
ivvewhen using standalone built amphorae it works 100% of the time. this doesn't occur 100% of the time when you use active_standby topology14:02
ivvewe have around a ~50% failure rate i guess on active_standby14:02
ivvewe upgraded octavia to 4.1.0 and now to 4.1.1 (still testing) but so far the same issue exists in 4.1.0, making me think this is keepalived doing some funny business14:04
rm_workivve: to me that sounds like your network is either incredibly unreliable, or entirely blocking the connection between amps and they are just both in split-brain constantly fighting to gARP for the VIP (so it passes back and forth as fast as the switches will listen)14:24
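(One way to confirm that kind of VIP flapping from outside the amphorae is to watch which MAC answers ARP for the VIP from another host on that subnet; a rough sketch, assuming iputils arping, with the VIP address borrowed from later in this log and the interface name a placeholder:)
    VIP=10.150.1.40      # example VIP
    IFACE=eth0           # interface on the VIP subnet
    while true; do
        # if the replying MAC keeps alternating, the two amps are fighting over the VIP
        arping -c 1 -I "$IFACE" "$VIP"
        sleep 1
    done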
ivverm_work: we tested on the same host14:24
ivvesame problem for amphoras14:24
rm_workivve: if you are using our standard image building without any custom changes... then i very much doubt it is the image14:25
rm_workwhen was your image built?14:25
ivvei was thinking about that since we do have some vxlan tunneling between DC but then move all traffic to in-DC and also later on in-host without any difference14:25
ivve18.04 just few days ago14:25
rm_workhmm14:25
rm_workcan you SSH into the amphorae and look at the keepalived logs? or are you offloading those logs somewhere?14:26
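(For anyone following along, a sketch of how an operator can locate and log into the amphorae, assuming the ubuntu-based image and that SSH key access was configured when the amphorae were built; IDs and key path are placeholders:)
    # list the amphorae backing a load balancer and their lb-mgmt-net addresses
    openstack loadbalancer amphora list --loadbalancer <lb-id>
    # log in over the management network
    ssh -i /path/to/amp_ssh_key ubuntu@<lb_network_ip>
    # keepalived messages land in the regular system log inside the amphora
    sudo grep -i keepalived /var/log/syslog | tail -50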
ivveyea14:27
rm_workwhat are they showing?14:27
ivveso for the snat we do get double master promotion, but nothing like that in the amphorae14:27
ivvewe're also going to upgrade neutron from 14.0.2 to 14.0.414:28
ivvedoubt anything will change though14:28
ivveeven with -D flags the logs are very quiet14:28
rm_workcan you paste the logs perchance?14:29
ivvewhen dumping we can just see 1 keepalived sending its regular heartbeat to the multicast address14:29
ivvegonna check if i have some old around14:29
rm_workideally you could have one currently running14:30
rm_workand paste logs from both sides14:30
rm_workand make sure they are actually connectable from inside the netns14:30
ivveyea14:32
ivvedidn't have one but ill generate one, gimme 5-10mins need to do some reconfs14:33
ivveim getting the feeling if we can solve it in one of the components we can hit two birds with one stone14:35
ivvei mean the strange thing is that we have the failure at spawntime14:35
ivveif it works, it works all the time, all the way14:35
ivveif it "fails" during spawn, it keeps failing until we bump them14:35
ivvelike with a failover or restart stuff14:36
ivvefor neutron-snat we just have to delete the routes in the "second" master14:36
ivvethen its all good14:36
ivvespawning them now, should have logs on both sides soonish14:36
*** TrevorV has joined #openstack-lbaas14:41
rm_workhmm14:46
ivvethe debug logs from neutron-snat is just really nothing, doesn't say why it promotes to master, just that it does14:48
rm_workthe "why" is pretty predictable -- it loses its connection to the pair14:49
ivveyea but one would expect some kind of heartbeat fail message14:50
ivveat least when running debug logs14:50
ivveand then it keeps switching back and forth without saying anything about master promotion, well basically at that stage you have a split brain situation14:51
ivveso sessions flipflop between the namespaces14:51
johnsomSorry, I'm giving a training this morning. I will catch up with the thread here in an hour or so14:52
johnsomivve The keepalived logs inside the amphora go to the "syslog" or "messages" log14:54
ivvethanks, im aware. its a bit different for neutron-snat, they choose to send it to file14:55
ivvewhich unfortunately is deleted along with the namespace if the router is removed14:56
ivvealso, when the pair is lost, shouldn't octavia-services be informed?14:58
rm_workno14:58
ivveokay14:58
rm_workwe let the amp pair handle itself14:58
rm_workwe never track which is MASTER/BACKUP after initial provisioning14:58
ivvegot it14:58
rm_workI'm ... on the fence about whether that's actually best or not; IMO it would be nice to have the health message include info about whether the amp thinks it is the MASTER or not -- but there's the obvious problem of deciding what is actually correct15:00
ivvei was thinking about the nopreempt setting as well15:00
ivvefor haproxy15:00
ivvesince configs are master/backup with priorities anyways15:01
*** tkajinam has quit IRC15:08
xakaitetoia1johnsom, ah perfect, I didn't know octavia manages this and i wanted to protect the LB itself. Any idea how i can check that ?15:18
xakaitetoia1rm_work, yes i recently saw that myself when doing a failover of the master amphora. It immediately switched over to the backup node, but the backup didn't become master. Then octavia destroyed the "master node" and re-created it, and at that point the keepalived address moved from the backup node to the new master node15:20
ivvedamn, after updating to 4.1.1 i don't seem to be able to reproduce it. into 7th stack now15:20
rm_workhmmm i thought we had nopreempt on and that would normally prevent the "new master" from taking back over15:21
ivveok now i think i have one, lol15:22
ivverm_work: okay i now have one failing, all i can see is this in syslog15:24
ivvehttps://hastebin.com/uzoticuhod.nginx  15:24
ivvethe backup is all silent15:25
ivveand we have a script running in the background that does curl to the listener15:25
ivvewith roughly 10% failure rate on those curls atm15:25
ivveso 1 / 7 failed atm15:25
ivvethe others have a 100% success rate for the entire time from when i started, so 15 mins give or take15:26
*** irclogbot_0 has quit IRC15:27
ivvetcpdump is only the heartbeat udp's at port 555515:28
ivvefrom controller nodes15:28
ivve(on the slave)15:28
ivveand some arps15:28
*** irclogbot_1 has joined #openstack-lbaas15:28
ivveand i can see the udp traffic from all 3 controller nodes15:29
ivveso the slave never sends out any grat ARPs15:30
ivvehttps://hastebin.com/epijiwakuf.css <-- slave, the prom switches from my tcpdumping  15:32
*** gcheresh has joined #openstack-lbaas15:46
*** gcheresh_ has quit IRC15:50
*** nicolasbock has joined #openstack-lbaas15:53
rm_workivve: can you enter the netns on the amphora (`ip netns exec amphora-haproxy bash`) and try a tcpdump in there to see what's going on with the keepalived pair?15:57
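(A sketch of those captures inside the namespace; "vrrp" is the pcap shorthand for IP protocol 112:)
    # VRRP advertisements exchanged by the keepalived pair
    sudo ip netns exec amphora-haproxy tcpdump -nn -i eth1 vrrp
    # ARP/GARP traffic for the VIP
    sudo ip netns exec amphora-haproxy tcpdump -nn -i eth1 arp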
*** Trevor_V has joined #openstack-lbaas16:03
*** TrevorV has quit IRC16:07
johnsomivve I think we have seen an issue in the 4.1.x series that was fixed in keepalived. Let me see if I can find the reference.16:07
johnsomAlso, yes, you need to be in the network namespace16:07
johnsomxakaitetoia1 As an operator/admin, you can look at the load balancer VIP port, then see the security groups associated. Then query the security group  details. Basically Octavia only opens the ports defined for the listener(s).16:09
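(A rough sketch of that check with the admin CLI; IDs are placeholders:)
    # find the VIP port the load balancer owns
    openstack loadbalancer show <lb-id> -c vip_port_id -c vip_address
    # see which security group(s) neutron has on that port
    openstack port show <vip-port-id> -c security_group_ids
    # inspect the rules Octavia created -- only the listener ports should be open
    openstack security group rule list <sg-id>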
ivverm_work: this line is spammed fa:16:3e:7c:93:58 > fa:16:3e:9e:09:91, ethertype IPv4 (0x0800), length 54: 10.150.1.57 > 10.150.1.166: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20  16:11
johnsomivve This is the issue we saw with keepalived > 2.0  and < 2.0.11: https://github.com/acassen/keepalived/commit/30eeb48b1a0737dc7443fd421fd6613e0d55fd17  16:11
johnsomivve Yes, that is the normal Active<->Standby keepalived heartbeat. That is outside the network namespace16:12
ivveyea16:12
ivveexpected16:12
*** maciejjozefczyk has quit IRC16:12
johnsomOh, wait, it is inside the netns actually.16:12
ivveyes16:12
johnsomThe GARP messages are coming from the current "MASTER" instance16:12
ivvethis is on the non-namespace interface16:12
rm_workerr16:13
rm_workoutside the NS?16:13
johnsomIf you grep "grep -i stat /var/log/syslog" it should pop out which is master which is backup. Since these can switch sub-second, it's rough to try to track at the control plane level16:13
ivvehttps://hastebin.com/uzoticuhod.nginx  16:14
ivvekeeps coming from the master16:14
ivveand on the non-namespace interface all i can see is port 5555 udp traffic and some normal bum traffic16:15
johnsomRight16:15
ivvelike arp who-has / reply16:15
johnsomSo, check your syslog for the status change messages, see if you have any or if it is a network issue outside the amphora16:16
ivvesyslog is all empty in the slave16:18
ivvein the master it says keeps doing garps16:18
ivvethats it16:18
johnsomOk, now I am done training a new QE team, I can answer quicker....16:18
johnsomHmm, it must have rotated out already. Check systemd status of the keepalived on the backup16:19
johnsomJust to make sure it is running. You can also check the process list, it should be running using the /var/lib/octavia/... config file16:20
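(Both checks on the backup amphora might look something like this; the systemd unit name is an assumption and may differ by image:)
    # is keepalived running, and which config is it using?
    ps -ef | grep '[k]eepalived'
    sudo systemctl status octavia-keepalived    # unit name assumed
    # the Octavia-rendered config should live somewhere under /var/lib/octavia/
    sudo find /var/lib/octavia -name '*keepalived*'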
*** gcheresh has quit IRC16:20
ivvegetting the full log from when it started16:21
johnsomI am going to boot an act/stdby so I can paste you some expected log snippets16:21
johnsomivve What version of Octavia are you running?16:21
ivve4.1.1 now16:22
ivvethis is the slave from namespace config (right after sysctl) down to me starting tcpdumps  - https://hastebin.com/uqoxefojid.sql  16:23
johnsomivve Yeah, that looks normal for a backup. From that log, it has never FAILED or became MASTER.16:25
johnsomIt is still in the BACKUP state16:25
ivvethis is the master16:25
ivveFeb 25 15:04:34 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Transition to MASTER STATE16:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Entering MASTER STATE16:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol VIPs.16:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol Virtual Routes16:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol Virtual Rules16:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:25
ivveFeb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: message repeated 3 times: [ Sending gratuitous ARP on eth1 for 10.150.1.40]16:25
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:25
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:25
johnsomOtherwise, that is all normal, including the warnings (we should clean up at some point)16:25
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:25
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:25
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: message repeated 4 times: [ Sending gratuitous ARP on eth1 for 10.150.1.40]16:25
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:26
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Starting HAProxy Load Balancer...16:26
ivveFeb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Started HAProxy Load Balancer.16:26
ivveFeb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:26
ivveFeb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:26
ivveFeb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer.16:26
johnsompastebin is your friend16:26
ivveFeb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150440 (1104) : Reexecuting Master process16:26
ivveFeb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer.16:26
ivveFeb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150446 (1104) : Former worker 1105 exited with code 016:26
johnsomhttps://paste.openstack.org  16:26
ivveFeb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:26
ivveFeb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:27
ivveFeb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:27
ivveFeb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:27
ivveFeb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer.16:27
ivveFeb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150446 (1104) : Reexecuting Master process16:27
ivveFeb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150517 (1104) : Former worker 1166 exited with code 016:27
ivveFeb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer.16:27
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:27
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer.16:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150517 (1104) : Reexecuting Master process16:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150520 (1104) : Former worker 1246 exited with code 016:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer.16:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:28
ivveFeb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading.16:28
ivveFeb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer.16:28
ivveFeb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150520 (1104) : Reexecuting Master process16:28
ivveFeb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150524 (1104) : Former worker 1301 exited with code 016:28
ivveFeb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer.16:28
ivveFeb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:28
ivveFeb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.4016:29
ivveFeb 25 15:05:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.4016:29
ivvefuck16:29
ivvehttps://hastebin.com/qiwotufufe.coffeescript  16:29
ivvesorry, clipboards failure16:29
ivvewould have happened anyway, unless that clears the secondary clipboard :P16:29
ivveanyways16:29
ivvethis is the master - https://hastebin.com/qiwotufufe.coffeescript  16:29
ivvesame timeline16:29
ivvefrom sysctl finish confs to master up sending grat arps16:29
ivveKeepalived v1.3.9 (10/21,2017)16:29
ivvethis is the default in ubuntu 18.04.416:29
ivvefrom their repo16:29
ivvethe re-executions are nothing strange since that's configuring backends/members16:29
ivvecould it be on the neutron side? the neutron router having the wrong arp or something16:30
*** gcheresh has joined #openstack-lbaas16:35
johnsomWell, that is why we beat it over the head with GARPs....16:36
johnsomWhat is your ML2 in neutron?16:36
ivvel216:36
johnsomOVS? OVN? linux bridge?16:37
ivveoh ovs16:37
ivvel2pop16:37
johnsomOk, yeah, that should be fine16:37
johnsomOVN had a nasty bug at one point (maybe still does) where ARPs took 30 seconds to actually update, but OVS doesn't have that issue.16:38
ivveits as if the router is getting the wrong arp every now and then and then it is corrected, but just for a split second16:39
ivveor something.. this is so weird at this point16:39
ivve10% of sessions from this lb is just gone atm16:39
ivveover the time i found the lb to now16:40
ivvewhich is over an hour16:40
johnsomCan you run "sudo ip netns exec amphora-haproxy ip a" on the standby amphora?16:40
ivvehttps://hastebin.com/hemurirone.cs  16:42
johnsomOk, yeah, so it's behaving correctly, it doesn't have the VIP IP on eth1, so it's not ARPing for it at all.16:42
johnsomOn master, eth1 will have two IPs, and should be GARP/ARP for the VIP address.16:43
ivveyea16:43
johnsomYou could "sudo ip netns exec amphora-haproxy tcpdump -nli eth1 arp" on the backup just to make sure, but I'm 99.9995% sure it's not going to be advertising the VIP address.16:44
johnsomYou will see other ARP traffic of course, but it shouldn't be from the eth1 with the VIP in it16:44
*** gcheresh has quit IRC16:45
ivvehttps://hastebin.com/icexoziluy.css  16:47
johnsomYeah, so the arps are fine16:48
ivvetestmethod: https://hastebin.com/modowocaba.bash  16:51
ivveexpected outcome is 0 failures, 100% success16:51
ivvebut right now im getting around 85% on just this one lb16:52
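(The hastebin link has since expired; a hypothetical loop in the same spirit, counting successful vs. failed requests against the VIP -- the address and request count are illustrative, not ivve's actual script:)
    VIP=10.150.1.40
    ok=0; fail=0
    for i in $(seq 1 1000); do
        if curl -sf -m 2 -o /dev/null "http://${VIP}/"; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
    done
    echo "ok=${ok} fail=${fail}"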
johnsomWhat is the web server?16:52
ivveits sooooo strange why not everytime, why only like one lb every now and then16:52
ivvelike i setup 7 lbs16:52
ivveonly 1 of them behaves like this16:53
ivveyeah its 3 members behind16:53
ivvek8s16:53
johnsomIt's a real web server right? not a "netcat/nc" faker right?16:53
ivveyea16:53
*** tesseract has quit IRC16:54
johnsomOk. Because we get that with netcat as it can't handle concurrent connections, so the health check will interfere with the user request. That is why we have a golang web server for the cirros test instances.16:54
johnsomOk, next thing I would check, in the master amphora, look in the /var/log/haproxy log, you should see a line per connection. Towards the end of the line is the "status" string that describes how the connection was resolved. They should all basically look the same16:56
johnsomIn this example:16:57
ivvewhen the error happens i get cD or CD16:57
johnsom4798-bc0c-400465756fa1 172.24.4.1 42940 25/Feb/2020:16:57:07.480 "GET / HTTP/1.1" 200 79 73 - "" 8c3cc362-7516-4c6f-a889-9347dd021fb5:08a2bedc-96d0-4798-bc0c-400465756fa1 36472f77-ee9c-435f-8a34-a9e6538bfce7 4 ----  16:57
johnsomThe four hyphens at the end mean successful flow.16:57
johnsomAh! ok, so there is the clue. Can you past the exact four characters there?16:58
ivvehttps://hastebin.com/sonofesafe.nginx  16:58
ivvethere are 3 error i think in there16:58
ivve6 actually16:59
johnsomhmm, that is an odd log. Are these TCP LBs?16:59
ivvethis is haproxy.log16:59
ivvestraight tail, no fuzz16:59
johnsomOh, right, nevermind, I changed the log format for offloading, you don't have that in stein.16:59
johnsomOk16:59
ivve:)16:59
johnsomdecoding, just a sec17:00
johnsomc : the client-side timeout expired while waiting for the client to17:01
johnsom            send or receive data.17:01
johnsomD : the session was in the DATA phase.17:01
johnsom C : the TCP session was unexpectedly aborted by the client.17:02
johnsomSo, the connection from the client to the VIP of the load balancer is unexpectedly terminated.17:02
johnsomPretty much what you said.....17:02
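(A quick way to tally how often those termination states show up on the master amphora; a sketch only, since the exact field position depends on the haproxy log format in use:)
    # abnormal client-side terminations seen in this thread vs. total log lines
    grep -c ' cD' /var/log/haproxy.log
    grep -c ' CD' /var/log/haproxy.log
    wc -l < /var/log/haproxy.log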
ivve:(17:02
johnsomSo it's in front of the haproxy somewhere.17:02
ivveyeah, im out of juice for this problem17:04
ivvethink we checked this particular log 2 days ago17:04
ivveno, it was friday17:04
ivveanyways17:04
johnsomivve ok, so my next step would be tcpdump on eth1 and see if there are any clues there.17:05
johnsominside the netns17:05
johnsomBut I doubt it. It's most likely in the neutron or forwarding layer.17:05
ivvemm17:06
ivveim probably gonna try upgrade neutron tomorrow17:06
johnsomOk, let me know if you want to keep debugging lower in the amphora. I.e. doing a packet capture and I can look at the pcap for you.17:07
johnsomI'm happy to spend more time helping if you want it.17:07
ivveaye no worries. ill get back to you if i need help. thanks a bunch, much appreciated!17:08
johnsomYou could also stop keepalived and configure the interface manually, just to remove keepalived from the path, but if there aren't issues in the syslog, I doubt it's your problem.17:08
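(A sketch of that test on the amphora that should own the VIP; the keepalived unit name and the /24 prefix are assumptions, the VIP is the one from this log:)
    sudo systemctl stop octavia-keepalived      # unit name assumed
    sudo ip netns exec amphora-haproxy ip addr add 10.150.1.40/24 dev eth1
    # announce the address so upstream ARP caches update (if arping is installed)
    sudo ip netns exec amphora-haproxy arping -c 3 -U -I eth1 10.150.1.40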
ivvebut i think i need to look more at neutron17:08
johnsomOk, no problem17:08
johnsomGood luck!17:08
ivvedue to the clues, it only happens every now and then17:08
ivvemaybe lower the amount of garps?17:09
ivvei understand the failover time will be slower17:09
ivvebut maybe its excessive for ovs?17:09
johnsomNo, actually it won't. The garps are just there to make sure neutron has the right mapping.17:09
ivvefailure/switchover forces a new?17:10
johnsomWe saw issues with some ML2 where it aged them out too aggressively17:10
ivveoh17:10
johnsomYeah, on transition it will always send a garp.17:10
ivvenaturally17:10
johnsomYou can certainly lower it, but I don't think it is related17:10
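(For reference, the GARP cadence is tunable in octavia.conf under [keepalived_vrrp]; the values below are illustrative only, so check the configuration reference for your release before changing them:)
    [keepalived_vrrp]
    # seconds between gratuitous ARP refreshes sent by the MASTER amphora
    vrrp_garp_refresh_interval = 10
    # number of gratuitous ARPs sent per refresh
    vrrp_garp_refresh_count = 2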
ivveyea im just grasping for straws here17:10
*** vishalmanchanda has quit IRC17:11
ivveand we still do have the issues with snat on routers too17:11
johnsomWe erred on the side of being aggressive with the GARPs as they should be harmless.17:11
ivvewhich is kinda similar but not17:11
ivvewhere neutron tells two namespaces out of 3 that they are master half a minute after creation17:12
*** ccamposr__ has joined #openstack-lbaas17:30
*** ccamposr has quit IRC17:32
*** ivve has quit IRC17:47
*** yamamoto has quit IRC17:53
johnsomSigh, 500 error from github.... Don't forget the code really lives at opendev.org17:58
*** trident has quit IRC18:52
*** trident has joined #openstack-lbaas18:58
*** maciejjozefczyk has joined #openstack-lbaas20:01
*** maciejjozefczyk has quit IRC20:21
*** gcheresh has joined #openstack-lbaas20:59
*** Trevor_V has quit IRC21:00
*** ccamposr__ has quit IRC21:03
*** ccamposr has joined #openstack-lbaas21:40
*** gcheresh has quit IRC21:59
*** ccamposr has quit IRC22:06
*** xakaitetoia1 has quit IRC22:49
*** tkajinam has joined #openstack-lbaas22:58
*** rcernin has joined #openstack-lbaas23:39
