openstackgerrit | Michael Johnson proposed openstack/octavia master: WIP - Refactor the failover flows https://review.opendev.org/705317 | 00:05 |
*** spatel has joined #openstack-lbaas | 00:09 | |
*** spatel has quit IRC | 00:13 | |
*** vishalmanchanda has joined #openstack-lbaas | 00:21 | |
*** nicolasbock has quit IRC | 01:15 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Support HTTP and TCP checks in UDP healthmonitor https://review.opendev.org/589180 | 01:29 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Select the right lb_network_ip interface using AZ https://review.opendev.org/704927 | 01:30 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Network Delta calculations should respect AZs https://review.opendev.org/705165 | 01:30 |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Allow AZ to override valid_vip_networks config https://review.opendev.org/699521 | 01:30 |
*** yamamoto has joined #openstack-lbaas | 01:31 | |
openstackgerrit | Adam Harwell proposed openstack/octavia master: Conf option to use VIP ip as source ip for backend https://review.opendev.org/702535 | 01:31 |
rm_work | REALLY still need to merge that whole az-tweaks chain ^^ | 01:34 |
rm_work | (not 702535, that's unrelated, no hurry on that) | 01:34 |
rm_work | we're just leaving AZs broken in the meantime | 01:35 |
*** ccamposr__ has quit IRC | 02:01 | |
*** ccamposr__ has joined #openstack-lbaas | 02:01 | |
*** TMM has quit IRC | 02:26 | |
*** TMM has joined #openstack-lbaas | 02:26 | |
*** vishalmanchanda has quit IRC | 03:09 | |
*** psachin has joined #openstack-lbaas | 03:29 | |
*** psachin has quit IRC | 03:33 | |
*** psachin has joined #openstack-lbaas | 03:35 | |
*** vishalmanchanda has joined #openstack-lbaas | 04:31 | |
*** numans has quit IRC | 06:15 | |
*** strigazi has quit IRC | 06:16 | |
*** strigazi has joined #openstack-lbaas | 06:18 | |
*** numans has joined #openstack-lbaas | 07:11 | |
*** psachin has quit IRC | 07:47 | |
*** gcheresh has joined #openstack-lbaas | 07:55 | |
*** maciejjozefczyk has joined #openstack-lbaas | 08:12 | |
*** yamamoto has quit IRC | 08:16 | |
openstackgerrit | Gregory Thiemonge proposed openstack/octavia stable/queens: Fix uncaught DB exception when trying to get a spare amphora https://review.opendev.org/709569 | 08:17 |
*** yamamoto has joined #openstack-lbaas | 08:21 | |
*** tkajinam has quit IRC | 08:22 | |
*** tesseract has joined #openstack-lbaas | 08:23 | |
*** tkajinam has joined #openstack-lbaas | 08:43 | |
*** yamamoto has quit IRC | 09:58 | |
*** yamamoto has joined #openstack-lbaas | 10:09 | |
*** yamamoto has quit IRC | 11:20 | |
*** yamamoto has joined #openstack-lbaas | 11:22 | |
*** ivve has joined #openstack-lbaas | 11:30 | |
ivve | hey guys, what version of keepalived is recommended in amphoras? and have you had any issues reported with 1.3.9-1 (the ubuntu 16.04/18.04 repo version)? | 11:33 |
openstackgerrit | Ann Taraday proposed openstack/octavia master: [Amphorav2] Fix noop driver case https://review.opendev.org/709696 | 11:36 |
*** rcernin has quit IRC | 11:37 | |
*** jamesdenton has quit IRC | 11:45 | |
*** jamesdenton has joined #openstack-lbaas | 11:46 | |
*** nicolasbock has joined #openstack-lbaas | 11:53 | |
*** ccamposr has joined #openstack-lbaas | 12:15 | |
*** xakaitetoia1 has joined #openstack-lbaas | 12:15 | |
xakaitetoia1 | Hello team, now that fwaas is being deprecated, what's a good alternative for putting a firewall in front of LBaaS? | 12:16 |
*** ccamposr__ has quit IRC | 12:17 | |
*** tkajinam has quit IRC | 12:30 | |
*** maciejjozefczyk has quit IRC | 13:14 | |
*** ccamposr has quit IRC | 13:17 | |
*** ccamposr has joined #openstack-lbaas | 13:17 | |
*** gcheresh has quit IRC | 13:19 | |
*** gcheresh has joined #openstack-lbaas | 13:19 | |
*** maciejjozefczyk has joined #openstack-lbaas | 13:27 | |
*** nicolasbock has quit IRC | 13:33 | |
*** nicolasbock has joined #openstack-lbaas | 13:34 | |
*** nicolasbock has quit IRC | 13:36 | |
*** yamamoto has quit IRC | 13:36 | |
*** gcheresh_ has joined #openstack-lbaas | 13:40 | |
*** gcheresh has quit IRC | 13:40 | |
*** tkajinam has joined #openstack-lbaas | 13:53 | |
johnsom | ivve We have had minor issues with keepalived, but we have put workarounds in place for those. We currently just use the version included in the supported distributions (centos 7/8, rhel 7/8, ubuntu 16.04/18.04). In "v" we will have the newer versions of keepalived across both the centos and ubuntu platforms and will make some adjustments. | 13:58 |
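A quick way to confirm which keepalived build an amphora image actually carries is to check from inside a running amphora. This is only a sketch: the SSH user, key path, and address are deployment-specific placeholders, not values from the log.

```bash
# Check the keepalived version inside a running amphora.
# "ubuntu" applies to Ubuntu-based images; the key path and the
# lb_network_ip below are placeholders for your deployment.
ssh -i /etc/octavia/.ssh/octavia_ssh_key ubuntu@<amphora-lb-network-ip> \
    "keepalived --version 2>&1 | head -1"
```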
*** yamamoto has joined #openstack-lbaas | 13:58 | |
johnsom | xakaitetoia1 What do you need the firewall for? Octavia manages the security groups for the load balancer automatically, so curious what your use case is. | 13:59 |
ivve | johnsom: alright, do you recognize this: when using keepalived, sometimes sessions are passed back and forth between load balancers, effectively cutting sessions and/or causing resets | 14:00 |
ivve | we seem to experience this not only in amphoras, but also in snat namespaces (via neutron) | 14:01 |
ivve | when using standalone amphoras it works 100% of the time. with active_standby topology the failure doesn't occur 100% of the time either | 14:02 |
ivve | we have around a ~50% failure rate i guess on active_standby | 14:02 |
ivve | we upgraded octavia to 4.1.0 and now to 4.1.1 (still testing) but so far the same issue exists in 4.1.0, making me think this is keepalived doing some funny business | 14:04 |
rm_work | ivve: to me that sounds like your network is either incredibly unreliable, or entirely blocking the connection between amps and they are just both in split-brain constantly fighting to gARP for the VIP (so it passes back and forth as fast as the switches will listen) | 14:24 |
ivve | rm_work: we tested on the same host | 14:24 |
ivve | same problem for amphoras | 14:24 |
rm_work | ivve: if you are using our standard image building without any custom changes... then i very much doubt it is the image | 14:25 |
rm_work | when was your image built? | 14:25 |
ivve | i was thinking about that since we do have some vxlan tunneling between DC but then move all traffic to in-DC and also later on in-host without any difference | 14:25 |
ivve | 18.04, built just a few days ago | 14:25 |
rm_work | hmm | 14:25 |
rm_work | can you SSH into the amphorae and look at the keepalived logs? or are you offloading those logs somewhere? | 14:26 |
ivve | yea | 14:27 |
rm_work | what are they showing? | 14:27 |
ivve | so for the snat we do get double master promotion, but nothing like that in the amphorae | 14:27 |
ivve | we're also going to upgrade neutron from 14.0.2 to 14.0.4 | 14:28 |
ivve | doubt anything will change though | 14:28 |
ivve | even with -D flags the logs are very quiet | 14:28 |
rm_work | can you paste the logs perchance? | 14:29 |
ivve | when dumping we can just see 1 keepalived sending its regular heartbeat to the multicast address | 14:29 |
ivve | gonna check if i have some old around | 14:29 |
rm_work | ideally you could have one currently running | 14:30 |
rm_work | and paste logs from both sides | 14:30 |
rm_work | and make sure they are actually connectable from inside the netns | 14:30 |
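One way to check that, assuming access to both amphorae: watch the VRRP advertisements on eth1 inside the amphora-haproxy namespace. In a healthy pair only the current MASTER advertises; seeing advertisements sourced from both units for the same VRID would indicate the split-brain rm_work describes. A minimal sketch:

```bash
# VRRP is IP protocol 112. Run this on each amphora and compare:
# only one side should be sourcing advertisements at any given time.
sudo ip netns exec amphora-haproxy \
    tcpdump -nli eth1 'ip proto 112 or arp'
```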
ivve | yea | 14:32 |
ivve | didn't have one but ill generate one, gimme 5-10mins need to do some reconfs | 14:33 |
ivve | im getting the feeling that if we can solve it in one of the components we can kill two birds with one stone | 14:35 |
ivve | i mean the strange thing is that we have the failure at spawntime | 14:35 |
ivve | if it works, it works all the time, all the way | 14:35 |
ivve | if it "fails" during spawn, it keeps failing until we bump them | 14:35 |
ivve | like with a failover or restart stuff | 14:36 |
ivve | for neutron-snat we just have to delete the routes in the "second" master | 14:36 |
ivve | then its all good | 14:36 |
ivve | spawning them now, should have logs on both sides soonish | 14:36 |
*** TrevorV has joined #openstack-lbaas | 14:41 | |
rm_work | hmm | 14:46 |
ivve | the debug logs from neutron-snat are really nothing, they don't say why it promotes to master, just that it does | 14:48 |
rm_work | the "why" is pretty predictable -- it loses its connection to the pair | 14:49 |
ivve | yea but one would expect some kind of heartbeat fail message | 14:50 |
ivve | at least when running debug logs | 14:50 |
ivve | and then it keeps switching back and forth without saying anything about master promotion; basically at that stage you have a split-brain situation | 14:51 |
ivve | so sessions flipflop between the namespaces | 14:51 |
johnsom | Sorry, I'm giving a training this morning. I will catch up with the thread here in an hour or so | 14:52 |
johnsom | ivve The keepalived logs inside the amphora go to the "syslog" or "messages" log | 14:54 |
ivve | thanks, im aware. it's a bit different for neutron-snat, they choose to send it to a file | 14:55 |
ivve | which unfortunately is deleted along with the namespace if the router is removed | 14:56 |
ivve | also, when the pair is lost, shouldn't octavia-services be informed? | 14:58 |
rm_work | no | 14:58 |
ivve | okay | 14:58 |
rm_work | we let the amp pair handle itself | 14:58 |
rm_work | we never track which is MASTER/BACKUP after initial provisioning | 14:58 |
ivve | got it | 14:58 |
rm_work | I'm ... on the fence about whether that's actually best or not; IMO it would be nice to have the health message include info about whether the amp thinks it is the MASTER or not -- but there's the obvious problem of deciding what is actually correct | 15:00 |
ivve | i was thinking about the nopreempt setting as well | 15:00 |
ivve | for haproxy | 15:00 |
ivve | since configs are master/backup with priorities anyways | 15:01 |
*** tkajinam has quit IRC | 15:08 | |
xakaitetoia1 | johnsom, ah perfect, I didn't know octavia manages this and i wanted to protect the LB itself. Any idea how i can check that ? | 15:18 |
xakaitetoia1 | rm_work, yes i recently saw that myself when doing a failover of the master amphora. It immediately switched to the backup node but didn't make it master. Then octavia destroyed the "master node" and re-created it, after which the keepalived address moved from the backup node back to the master node | 15:20 |
ivve | damn, after updating to 4.1.1 i don't seem to be able to reproduce it. into 7th stack now | 15:20 |
rm_work | hmmm i thought we had nopreempt on and that would normally prevent the "new master" from taking back over | 15:21 |
ivve | ok now i think i have one, lol | 15:22 |
ivve | rm_work: okay i now have one failing, all i can see is this in syslog | 15:24 |
ivve | https://hastebin.com/uzoticuhod.nginx | 15:24 |
ivve | the backup is all silent | 15:25 |
ivve | and we have a script running in the background that does curl to the listener | 15:25 |
ivve | with roughly 10% failure rate on those curls atm | 15:25 |
ivve | so 1 / 7 failed atm | 15:25 |
ivve | the others have a 100% success rate for the entire time from when i started, so 15mins give or take | 15:26 |
*** irclogbot_0 has quit IRC | 15:27 | |
ivve | tcpdump is only the heartbeat udp's at port 5555 | 15:28 |
ivve | from controller nodes | 15:28 |
ivve | (on the slave) | 15:28 |
ivve | and some arps | 15:28 |
*** irclogbot_1 has joined #openstack-lbaas | 15:28 | |
ivve | and i can see the udp traffic from all 3 controller nodes | 15:29 |
ivve | so the slave never sends out any grat arps | 15:30 |
ivve | https://hastebin.com/epijiwakuf.css <-- slave; the promiscuous mode switches are from my tcpdumping | 15:32 |
*** gcheresh has joined #openstack-lbaas | 15:46 | |
*** gcheresh_ has quit IRC | 15:50 | |
*** nicolasbock has joined #openstack-lbaas | 15:53 | |
rm_work | ivve: can you enter the netns on the amphora (`ip netns exec amphora-haproxy bash`) and try a tcpdump in there to see what's going on with the keepalived pair? | 15:57 |
*** Trevor_V has joined #openstack-lbaas | 16:03 | |
*** TrevorV has quit IRC | 16:07 | |
johnsom | ivve I think we have seen an issue in the 4.1.x series that was fixed in keepalived. Let me see if I can find the reference. | 16:07 |
johnsom | Also, yes, you need to be in the network namespace | 16:07 |
johnsom | xakaitetoia1 As an operator/admin, you can look at the load balancer VIP port, then see the security groups associated. Then query the security group details. Basically Octavia only opens the ports defined for the listener(s). | 16:09 |
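A sketch of that lookup using the standard openstack CLI (run as admin); the load balancer name and IDs are placeholders, and the column names should be double-checked against the client version in use.

```bash
# Find the LB's VIP port, the security group(s) attached to it, and the
# rules Octavia created for the configured listeners.
openstack loadbalancer show <lb-name> -c vip_port_id
openstack port show <vip-port-id> -c security_group_ids
openstack security group rule list <security-group-id>
```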
ivve | rm_work: this line is being spammed: fa:16:3e:7c:93:58 > fa:16:3e:9e:09:91, ethertype IPv4 (0x0800), length 54: 10.150.1.57 > 10.150.1.166: VRRPv2, Advertisement, vrid 1, prio 100, authtype simple, intvl 1s, length 20 | 16:11 |
johnsom | ivve This is the issue we saw with keepalived > 2.0 and < 2.0.11: https://github.com/acassen/keepalived/commit/30eeb48b1a0737dc7443fd421fd6613e0d55fd17 | 16:11 |
johnsom | ivve Yes, that is the normal Active<->Standby keepalived heartbeat. That is outside the network namespace | 16:12 |
ivve | yea | 16:12 |
ivve | expected | 16:12 |
*** maciejjozefczyk has quit IRC | 16:12 | |
johnsom | Oh, wait, it is inside the netns actually. | 16:12 |
ivve | yes | 16:12 |
johnsom | The GARP messages are coming from the current "MASTER" instance | 16:12 |
ivve | this is on the non-namespace interface | 16:12 |
rm_work | err | 16:13 |
rm_work | outside the NS? | 16:13 |
johnsom | If you run "grep -i stat /var/log/syslog" it should pop out which is master and which is backup. Since these can switch sub-second, it's rough to try to track at the control plane level | 16:13 |
ivve | https://hastebin.com/uzoticuhod.nginx | 16:14 |
ivve | keeps coming from the master | 16:14 |
ivve | and on the non-namespace interface all i can see is port 5555 udp traffic and some normal BUM traffic | 16:15 |
johnsom | Right | 16:15 |
ivve | like arp who-has / reply | 16:15 |
johnsom | So, check your syslog for the status change messages, see if you have any or if it is a network issue outside the amphora | 16:16 |
ivve | syslog is all empty in the slave | 16:18 |
ivve | in the master it says keeps doing garps | 16:18 |
ivve | thats it | 16:18 |
johnsom | Ok, now I am done training a new QE team, I can answer quicker.... | 16:18 |
johnsom | Hmm, it must have rotated out already. Check systemd status of the keepalived on the backup | 16:19 |
johnsom | Just to make sure it is running. You can also check the process list, it should be running using the /var/lib/octavia/... config file | 16:20 |
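A minimal sketch of those checks on the backup amphora. The octavia-keepalived unit name is an assumption; the process list check works regardless of how the service unit is named.

```bash
# Is keepalived actually running, and has it ever changed state?
sudo systemctl status octavia-keepalived      # unit name may differ per image
ps -ef | grep '[k]eepalived'                  # should reference the /var/lib/octavia/... config
grep -iE 'keepalived.*(master|backup|fault)' /var/log/syslog
```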
*** gcheresh has quit IRC | 16:20 | |
ivve | getting the full log from when it started | 16:21 |
johnsom | I am going to boot an act/stdby so I can paste you some expected log snippets | 16:21 |
johnsom | ivve What version of Octavia are you running? | 16:21 |
ivve | 4.1.1 now | 16:22 |
ivve | this is the slave from namespace config (right after sysctl) down to me starting tcpdumps - https://hastebin.com/uqoxefojid.sql | 16:23 |
johnsom | ivve Yeah, that looks normal for a backup. From that log, it has never FAILED or became MASTER. | 16:25 |
johnsom | It is still in the BACKUP state | 16:25 |
ivve | this is the master | 16:25 |
ivve | Feb 25 15:04:34 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Transition to MASTER STATE | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Entering MASTER STATE | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol VIPs. | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol Virtual Routes | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) setting protocol Virtual Rules | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: message repeated 3 times: [ Sending gratuitous ARP on eth1 for 10.150.1.40] | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
johnsom | Otherwise, that is all normal, including the warnings (we should clean up at some point) | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: message repeated 4 times: [ Sending gratuitous ARP on eth1 for 10.150.1.40] | 16:25 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Starting HAProxy Load Balancer... | 16:26 |
ivve | Feb 25 15:04:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Started HAProxy Load Balancer. | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:26 |
ivve | Feb 25 15:04:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:26 |
johnsom | pastebin is your friend | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150440 (1104) : Reexecuting Master process | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:26 |
ivve | Feb 25 15:04:46 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150446 (1104) : Former worker 1105 exited with code 0 | 16:26 |
johnsom | https://paste.openstack.org | 16:26 |
ivve | Feb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:50 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:26 |
ivve | Feb 25 15:04:55 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:00 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:05 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:10 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:15 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150446 (1104) : Reexecuting Master process | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150517 (1104) : Former worker 1166 exited with code 0 | 16:27 |
ivve | Feb 25 15:05:17 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:27 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:27 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150517 (1104) : Reexecuting Master process | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150520 (1104) : Former worker 1246 exited with code 0 | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:20 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading. | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloading HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150520 (1104) : Reexecuting Master process | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 ip[1104]: [WARNING] 055/150524 (1104) : Former worker 1301 exited with code 0 | 16:28 |
ivve | Feb 25 15:05:24 amphora-78aa0630-f462-4607-85ce-26c348db1571 systemd[1]: Reloaded HAProxy Load Balancer. | 16:28 |
ivve | Feb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:25 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:30 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:28 |
ivve | Feb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:35 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:40 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: Sending gratuitous ARP on eth1 for 10.150.1.40 | 16:29 |
ivve | Feb 25 15:05:45 amphora-78aa0630-f462-4607-85ce-26c348db1571 Keepalived_vrrp[1031]: VRRP_Instance(45d418d985cc4baf9dd966a107325515) Sending/queueing gratuitous ARPs on eth1 for 10.150.1.40 | 16:29 |
ivve | fuck | 16:29 |
ivve | https://hastebin.com/qiwotufufe.coffeescript | 16:29 |
ivve | sorry, clipboards failure | 16:29 |
ivve | would have happened anyway, unless that clears the secondary clipboard :P | 16:29 |
ivve | anyways | 16:29 |
ivve | this is the master - https://hastebin.com/qiwotufufe.coffeescript | 16:29 |
ivve | same timeline | 16:29 |
ivve | from sysctl finish confs to master up sending grat arps | 16:29 |
ivve | Keepalived v1.3.9 (10/21,2017) | 16:29 |
ivve | this is the default in ubuntu 18.04.4 | 16:29 |
ivve | from their repo | 16:29 |
ivve | the re-executions are nothing strange since that's configuring backends/members | 16:29 |
ivve | could it be on the neutron side? the neutron router having the wrong arp or something | 16:30 |
*** gcheresh has joined #openstack-lbaas | 16:35 | |
johnsom | Well, that is why we beat it over the head with GARPs.... | 16:36 |
johnsom | What is your ML2 in neutron? | 16:36 |
ivve | l2 | 16:36 |
johnsom | OVS? OVN? linux bridge? | 16:37 |
ivve | oh ovs | 16:37 |
ivve | l2pop | 16:37 |
johnsom | Ok, yeah, that should be fine | 16:37 |
johnsom | OVN had a nasty bug at one point (maybe still does) where ARPs took 30 seconds to actually update, but OVS doesn't have that issue. | 16:38 |
ivve | it's as if the router is getting the wrong arp every now and then and then it gets corrected, but just for a split second | 16:39 |
ivve | or something.. this is so weird at this point | 16:39 |
ivve | 10% of sessions from this lb is just gone atm | 16:39 |
ivve | over the time i found the lb to now | 16:40 |
ivve | which is over an hour | 16:40 |
johnsom | Can you run "sudo ip netns exec amphora-haproxy ip a" on the standby amphora? | 16:40 |
ivve | https://hastebin.com/hemurirone.cs | 16:42 |
johnsom | Ok, yeah, so it's behaving correctly, it doesn't have the VIP IP on eth1, so it's not ARPing for it at all. | 16:42 |
johnsom | On master, eth1 will have two IPs, and should be GARP/ARP for the VIP address. | 16:43 |
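Based on that, a quick way to see which unit currently holds the VIP without digging through logs; <vip-address> is a placeholder.

```bash
# Run on each amphora: the unit showing the VIP on eth1 inside the
# namespace is the one currently acting as MASTER.
sudo ip netns exec amphora-haproxy ip addr show eth1 | grep '<vip-address>'
```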
ivve | yea | 16:43 |
johnsom | You could "sudo ip netns exec amphora-haproxy tcpdump -nli eth1 arp" on the backup just to make sure, but I'm 99.9995% sure it's not going to be advertising the VIP address. | 16:44 |
johnsom | You will see other ARP traffic of course, but it shouldn't be from the eth1 with the VIP in it | 16:44 |
*** gcheresh has quit IRC | 16:45 | |
ivve | https://hastebin.com/icexoziluy.css | 16:47 |
johnsom | Yeah, so the arps are fine | 16:48 |
ivve | testmethod: https://hastebin.com/modowocaba.bash | 16:51 |
ivve | expected outcome is 0 100% | 16:51 |
ivve | but right now im getting around 85% on just this one lb | 16:52 |
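The actual hastebin script isn't reproduced here; the following is only a sketch of the kind of probe described, a curl loop against the listener that tracks the running failure percentage (VIP address and port are placeholders).

```bash
#!/bin/bash
# Probe the VIP in a loop and report the running failure rate.
VIP="http://<vip-address>:80/"
total=0; failed=0
while true; do
    total=$((total + 1))
    curl -s -m 2 -o /dev/null "$VIP" || failed=$((failed + 1))
    echo "failed ${failed}/${total} ($((failed * 100 / total))%)"
    sleep 1
done
```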
johnsom | What is the web server? | 16:52 |
ivve | it's sooooo strange, why not every time, why only like one lb every now and then | 16:52 |
ivve | like i setup 7 lbs | 16:52 |
ivve | only 1 of them behaves like this | 16:53 |
ivve | yeah its 3 members behind | 16:53 |
ivve | k8s | 16:53 |
johnsom | It's a real web server right? not a "netcat/nc" faker right? | 16:53 |
ivve | yea | 16:53 |
*** tesseract has quit IRC | 16:54 | |
johnsom | Ok. Because we get that with netcat as it can't handle concurrent connections, so the health check will interfere with the user request. That is why we have a golang web server for the cirros test instances. | 16:54 |
johnsom | Ok, next thing I would check, in the master amphora, look in the /var/log/haproxy log, you should see a line per connection. Towards the end of the line is the "status" string that describes how the connection was resolved. They should all basically look the same | 16:56 |
johnsom | In this example: | 16:57 |
ivve | when the error happens i get cD or CD | 16:57 |
johnsom | 4798-bc0c-400465756fa1 172.24.4.1 42940 25/Feb/2020:16:57:07.480 "GET / HTTP/1.1" 200 79 73 - "" 8c3cc362-7516-4c6f-a889-9347dd021fb5:08a2bedc-96d0-4798-bc0c-400465756fa1 36472f77-ee9c-435f-8a34-a9e6538bfce7 4 ---- | 16:57 |
johnsom | The four hyphens at the end mean successful flow. | 16:57 |
johnsom | Ah! ok, so there is the clue. Can you past the exact four characters there? | 16:58 |
ivve | https://hastebin.com/sonofesafe.nginx | 16:58 |
ivve | there are 3 errors i think in there | 16:58 |
ivve | 6 actually | 16:59 |
johnsom | hmm, that is an odd log. Are these TCP LBs? | 16:59 |
ivve | this is haproxy.log | 16:59 |
ivve | straight tail, no fuzz | 16:59 |
johnsom | Oh, right, nevermind, I changed the log format for offloading, you don't have that in stein. | 16:59 |
johnsom | Ok | 16:59 |
ivve | :) | 16:59 |
johnsom | decoding, just a sec | 17:00 |
johnsom | c : the client-side timeout expired while waiting for the client to | 17:01 |
johnsom | send or receive data. | 17:01 |
johnsom | D : the session was in the DATA phase. | 17:01 |
johnsom | C : the TCP session was unexpectedly aborted by the client. | 17:02 |
johnsom | So, the connection from the client to the VIP of the load balancer is unexpectedly terminated. | 17:02 |
johnsom | Pretty much what you said..... | 17:02 |
ivve | :( | 17:02 |
johnsom | So it's in front of the haproxy somewhere. | 17:02 |
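A rough way to quantify how many connections ended with those client-side abort/timeout codes straight from the amphora's haproxy log; the exact field layout depends on the haproxy log format in use, so this just pattern-matches the two codes seen above.

```bash
# Connections terminated with the cD / CD client-side codes vs. total lines.
grep -c -E ' (cD|CD) ' /var/log/haproxy.log
wc -l < /var/log/haproxy.log
```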
ivve | yeah, im out of juice for this problem | 17:04 |
ivve | think we checked this particular log 2 days ago | 17:04 |
ivve | no, it was friday | 17:04 |
ivve | anyways | 17:04 |
johnsom | ivve ok, so my next step would be tcpdump on eth1 and see if there are any clues there. | 17:05 |
johnsom | inside the netns | 17:05 |
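For that kind of offline analysis, a capture written to a file is easier to share than live output; a minimal sketch (the filename is arbitrary):

```bash
# Capture full packets on eth1 inside the amphora-haproxy namespace,
# then copy /tmp/vip-eth1.pcap off the amphora for inspection.
sudo ip netns exec amphora-haproxy \
    tcpdump -ni eth1 -s 0 -w /tmp/vip-eth1.pcap
```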
johnsom | But I doubt it. It's most likely in the neutron or forwarding layer. | 17:05 |
ivve | mm | 17:06 |
ivve | im probably gonna try upgrade neutron tomorrow | 17:06 |
johnsom | Ok, let me know if you want to keep debugging lower in the amphora. I.e. doing a packet capture and I can look at the pcap for you. | 17:07 |
johnsom | I'm happy to spend more time helping if you want it. | 17:07 |
ivve | aye no worries. ill get back to you if i need help. thanks a bunch, much appreciated! | 17:08 |
johnsom | You could also stop keepalived and configure the interface manually, just to remove keepalived from the path, but if there aren't issues in the syslog, I doubt it's your problem. | 17:08 |
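A sketch of that test on the MASTER amphora; the octavia-keepalived unit name and the /24 prefix length are assumptions, so match them to the actual deployment and VIP subnet before trying it.

```bash
# Take keepalived out of the data path and plumb the VIP by hand.
sudo systemctl stop octavia-keepalived
sudo ip netns exec amphora-haproxy ip addr add <vip-address>/24 dev eth1
# ... run the curl probe, then undo:
sudo ip netns exec amphora-haproxy ip addr del <vip-address>/24 dev eth1
sudo systemctl start octavia-keepalived
```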
ivve | but i think i need to look more at neutron | 17:08 |
johnsom | Ok, no problem | 17:08 |
johnsom | Good luck! | 17:08 |
ivve | due to the clues, it only happens every now and then | 17:08 |
ivve | maybe lower the amount of garps? | 17:09 |
ivve | i understand the failover time will be slower | 17:09 |
ivve | but maybe it's excessive for ovs? | 17:09 |
johnsom | No, actually it won't. The garps are just there to make sure neutron has the right mapping. | 17:09 |
ivve | failure/switchover forces a new? | 17:10 |
johnsom | We saw issues with some ML2 drivers where they aged them out too aggressively | 17:10 |
ivve | oh | 17:10 |
johnsom | Yeah, on transition it will always send a garp. | 17:10 |
ivve | naturally | 17:10 |
johnsom | You can certainly lower it, but I don't think it is related | 17:10 |
ivve | yea im just grasping for straws here | 17:10 |
*** vishalmanchanda has quit IRC | 17:11 | |
ivve | and we still do have the issues with snat on routers too | 17:11 |
johnsom | We erred on the side of being aggressive with the GARPs as they should be harmless. | 17:11 |
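For reference, the GARP cadence is tunable on the controller side. The option names below come from the Octavia configuration reference for the [keepalived_vrrp] group and should be verified against the deployed release; the values are examples only, and they only affect amphorae built or failed over after the change.

```bash
# Append (or merge into an existing [keepalived_vrrp] section of)
# /etc/octavia/octavia.conf, then restart the Octavia controller services.
cat <<'EOF' | sudo tee -a /etc/octavia/octavia.conf
[keepalived_vrrp]
vrrp_garp_refresh_interval = 30
vrrp_garp_refresh_count = 1
EOF
```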
ivve | which is kinda similar but not | 17:11 |
ivve | where neutron tells two namespaces out of 3 that they are master half a minute after creation | 17:12 |
*** ccamposr__ has joined #openstack-lbaas | 17:30 | |
*** ccamposr has quit IRC | 17:32 | |
*** ivve has quit IRC | 17:47 | |
*** yamamoto has quit IRC | 17:53 | |
johnsom | Sigh, 500 error from github.... Don't forget the code really lives at opendev.org | 17:58 |
*** trident has quit IRC | 18:52 | |
*** trident has joined #openstack-lbaas | 18:58 | |
*** maciejjozefczyk has joined #openstack-lbaas | 20:01 | |
*** maciejjozefczyk has quit IRC | 20:21 | |
*** gcheresh has joined #openstack-lbaas | 20:59 | |
*** Trevor_V has quit IRC | 21:00 | |
*** ccamposr__ has quit IRC | 21:03 | |
*** ccamposr has joined #openstack-lbaas | 21:40 | |
*** gcheresh has quit IRC | 21:59 | |
*** ccamposr has quit IRC | 22:06 | |
*** xakaitetoia1 has quit IRC | 22:49 | |
*** tkajinam has joined #openstack-lbaas | 22:58 | |
*** rcernin has joined #openstack-lbaas | 23:39 |