*** yamamoto has joined #openstack-lbaas | 00:15 | |
*** armax has joined #openstack-lbaas | 00:35 | |
*** abaindur has joined #openstack-lbaas | 00:58 | |
*** yamamoto has quit IRC | 01:19 | |
johnsom | FYI, I have found that nova anti-affinity appears to be broken: https://bugs.launchpad.net/nova/+bug/1863190 | 01:33 |
openstack | Launchpad bug 1863190 in OpenStack Compute (nova) "Server group anti-affinity no longer works" [Undecided,New] | 01:33 |
johnsom | I was testing that failover puts the amp back in the server group correctly (which works BTW). | 01:34 |
johnsom | On that fine note, catch you all tomorrow. | 01:35 |
johnsom | #canary | 01:35 |
*** vishalmanchanda has joined #openstack-lbaas | 01:54 | |
*** armax has quit IRC | 02:54 | |
*** abaindur has quit IRC | 03:02 | |
*** abaindur has joined #openstack-lbaas | 03:03 | |
*** armax has joined #openstack-lbaas | 03:07 | |
*** abaindur has quit IRC | 03:08 | |
*** yamamoto has joined #openstack-lbaas | 03:25 | |
*** armax has quit IRC | 03:55 | |
*** andy_ has quit IRC | 05:07 | |
*** andy_ has joined #openstack-lbaas | 05:08 | |
*** nicolasbock has quit IRC | 05:09 | |
*** psachin has joined #openstack-lbaas | 05:37 | |
*** ramishra has joined #openstack-lbaas | 05:46 | |
*** abaindur has joined #openstack-lbaas | 05:59 | |
*** goldyfruit has quit IRC | 06:00 | |
*** goldyfruit has joined #openstack-lbaas | 06:00 | |
*** ramishra has quit IRC | 06:30 | |
*** abaindur has quit IRC | 06:32 | |
*** abaindur has joined #openstack-lbaas | 06:32 | |
*** ivve has joined #openstack-lbaas | 07:56 | |
*** maciejjozefczyk has joined #openstack-lbaas | 07:58 | |
*** abaindur_ has joined #openstack-lbaas | 08:05 | |
*** abaindur has quit IRC | 08:08 | |
*** abaindur_ has quit IRC | 08:11 | |
*** abaindur has joined #openstack-lbaas | 08:12 | |
*** gcheresh_ has joined #openstack-lbaas | 08:14 | |
cgoncalves | hah! | 08:17 |
cgoncalves | #canary neutron-dhcp-agent seems to be broken. fails to spawn dnsmasq | 08:18 |
cgoncalves | the bot should start collecting these messages like it does for #success | 08:18 |
cgoncalves | #success Octavia is the canary | 08:18 |
openstackstatus | cgoncalves: Added success to Success page (https://wiki.openstack.org/wiki/Successes) | 08:18 |
*** ccamposr has quit IRC | 08:19 | |
*** ccamposr has joined #openstack-lbaas | 08:20 | |
*** ramishra has joined #openstack-lbaas | 08:21 | |
*** tesseract has joined #openstack-lbaas | 08:21 | |
*** ccamposr__ has joined #openstack-lbaas | 08:39 | |
*** ccamposr has quit IRC | 08:41 | |
*** tkajinam has quit IRC | 08:42 | |
ivve | question: wouldn't it be great if octavia timed out of the "pending update" immutable state after some (configurable) time and moved it back into the last state, with an error on the last request, rather than being stuck in the immutable state forever (forcing an admin to hack it out of existence with db commands)? | 08:48 |
cgoncalves | ivve, lucky us there is a timeout setting ;) | 08:49 |
ivve | great! | 08:49 |
ivve | whats the default and what is the actual timeout? | 08:49 |
*** gcheresh_ has quit IRC | 08:49 | |
ivve | because currently everything just hangs indefinitely | 08:50 |
ivve | (with default value) | 08:50 |
cgoncalves | default is 25 minutes IIRC | 08:50 |
ivve | oh, then either a) it is not, OR b) it doesn't work, OR c) it is refreshed if someone retries a delete command on the entire loadbalancer | 08:51 |
cgoncalves | ivve, it hangs indefinitely in PENDING_* if you restart the controller service while it was still processing the request | 08:51 |
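For context on the 25-minute figure: it most likely corresponds to the worker's amphora connection retries (300 retries x 5 s = 1500 s). A minimal, hedged sketch of the relevant octavia.conf knobs, assuming that is indeed the timeout being discussed:

```bash
# Hedged sketch: shorten how long a flow retries before it reverts and the LB
# leaves PENDING_*. Assumes the ~25 min default comes from
# [haproxy_amphora] connection_max_retries (300) * connection_retry_interval (5s).
cat >> /etc/octavia/octavia.conf <<'EOF'
[haproxy_amphora]
connection_max_retries = 120     # fewer retries -> faster revert to ERROR
connection_retry_interval = 5    # seconds between retries
EOF
# Restart the Octavia worker/health-manager services to pick up the change.
```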
ivve | i see | 08:51 |
ivve | or it lost network connection during that time, i suppose? | 08:51 |
cgoncalves | I don't think that would have an impact on the revert | 08:52 |
cgoncalves | do you have service logs from the start of the request? | 08:53 |
ivve | i have a debug command from the last request only | 08:54 |
ivve | it was in available state, everything was ok. tried to delete it | 08:54 |
ivve | went into delete, then update pending | 08:54 |
ivve | then stuck | 08:54 |
ivve | for a day | 08:54 |
cgoncalves | can you confirm the octavia services were not restarted during that period? | 08:55 |
ivve | yea | 08:55 |
cgoncalves | no octavia worker logs we could check? | 08:55 |
ivve | users create k8s clusters and then create out-of-stack loadbalancer items for them (ingress objects etc) | 08:56 |
ivve | when they're done testing, they lazily remove the stack | 08:56 |
ivve | this same thing happens with cinder objects | 08:56 |
ivve | i guess | 08:57 |
ivve | the logs could be long, but i have an appointment now. i will check when i come back! | 08:57 |
cgoncalves | is Heat involved in creating/deleting Octavia resources? | 08:58 |
cgoncalves | ok. talk to you later | 08:58 |
*** xakaitetoia has joined #openstack-lbaas | 08:59 | |
*** rcernin has quit IRC | 09:12 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia-tempest-plugin master: Add tests for allowed CIDRs in listeners https://review.opendev.org/702629 | 09:19 |
*** baffle has joined #openstack-lbaas | 09:32 | |
*** ccamposr__ has quit IRC | 09:33 | |
*** ccamposr__ has joined #openstack-lbaas | 09:34 | |
*** yamamoto has quit IRC | 09:57 | |
*** rcernin has joined #openstack-lbaas | 09:59 | |
*** yamamoto has joined #openstack-lbaas | 10:04 | |
*** rcernin has quit IRC | 10:17 | |
*** abaindur has quit IRC | 10:18 | |
*** abaindur has joined #openstack-lbaas | 10:18 | |
*** abaindur has quit IRC | 10:23 | |
*** psachin has quit IRC | 10:25 | |
*** yamamoto has quit IRC | 10:28 | |
*** vishalmanchanda has quit IRC | 10:28 | |
*** psachin has joined #openstack-lbaas | 10:33 | |
*** yamamoto has joined #openstack-lbaas | 10:39 | |
*** gcheresh_ has joined #openstack-lbaas | 10:52 | |
*** abaindur has joined #openstack-lbaas | 11:05 | |
*** abaindur has quit IRC | 11:10 | |
*** psachin has quit IRC | 11:10 | |
*** psachin has joined #openstack-lbaas | 11:12 | |
openstackgerrit | Carlos Goncalves proposed openstack/octavia-tempest-plugin master: Add tests for allowed CIDRs in listeners https://review.opendev.org/702629 | 11:15 |
*** gcheresh_ has quit IRC | 11:16 | |
*** yamamoto has quit IRC | 11:21 | |
ivve | cgoncalves: back again. so heat created the initial lb, but then the k8s ingress controller creates a bunch of other stuff and probably more loadbalancers | 11:41 |
ivve | then, when the stack is removed, those loadbalancers got stuck (some of those resources are not included in the stack) | 11:41 |
ivve | so in order to get rid of them i set them as active/online in the db and quickly remove them | 11:43 |
cgoncalves | ivve, I can't think of any other way to troubleshoot this than checking the logs :/ | 11:44 |
ivve | yea i get it | 11:45 |
ivve | it's just that this issue spans multiple scenarios | 11:45 |
cgoncalves | we've also had customers reporting that their LB resources got stuck while deleting a k8s cluster on top of openstack via Heat | 11:45 |
ivve | restarting controllers, even when doing it one by one | 11:45 |
cgoncalves | so your case seems similar | 11:45 |
ivve | if a network failure occurs in parts of the datacenter/infrastructure | 11:46 |
ivve | then this happens again | 11:46 |
ivve | and then manual db labour to fix tens upon tens of loadbalancers | 11:46 |
ivve | what im saying is: could the general state recovery be improved, even if the service gets a restart? | 11:47 |
ivve | this is the easiest way to fail it | 11:48 |
ivve | stop the octavia mgmt net | 11:48 |
ivve | and then hell breaks loose | 11:49 |
ivve | and the api calls in existence don't help to reset the loadbalancers when they are in the immutable state, there is no way other than DB hacking and restarting things manually | 11:50 |
cgoncalves | there's work ongoing now in master that will mitigate resources getting stuck in PENDING_* | 11:51 |
cgoncalves | https://review.opendev.org/#/c/647406/ | 11:51 |
ivve | oh okay cool | 11:51 |
*** yamamoto has joined #openstack-lbaas | 11:54 | |
ivve | cgoncalves: one last question: if i have a loadbalancer that ends up in an error or immutable state, what is the best approach to recreate it (even if it takes some db hacking / sending api commands as admin)? | 11:56 |
ivve | like this one, for example. it's still working, but no matter my approach, octavia tries to solve the issue but is unable to, and it now looks like this: | 11:57 |
cgoncalves | ivve, if in PENDING_*, set it to ERROR rather than ACTIVE. then, issue a loadbalancer failover | 11:57 |
ivve | https://hastebin.com/esoyitefaj.rb | 11:57 |
cgoncalves | $ openstack loadbalancer failover $LB_ID | 11:57 |
ivve | failover the error one im assuming? | 11:58 |
*** tkajinam has joined #openstack-lbaas | 11:58 | |
ivve | sometimes they come back, but assume role: standalone | 11:58 |
cgoncalves | correct, failover the LB in ERROR provisioning_status | 11:58 |
ivve | not amphora failover? | 11:59 |
*** yamamoto has quit IRC | 11:59 | |
cgoncalves | better failover LB | 11:59 |
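A hedged sketch of the recovery path described above (flip the stuck record to ERROR in the database, then fail over the whole load balancer). The table and column names are assumptions based on the standard Octavia schema:

```bash
# Sketch only: assumes direct access to the Octavia MySQL database and the
# usual schema (load_balancer.provisioning_status).
LB_ID=bb5fe733-82c3-4156-b26d-7735b9a8c7dc   # example LB ID from this log

# 1. Move the load balancer stuck in PENDING_* to ERROR.
mysql octavia -e "UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '${LB_ID}';"

# 2. Fail over the load balancer so Octavia rebuilds its amphorae.
openstack loadbalancer failover ${LB_ID}
```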
ivve | :( | 12:01 |
ivve | it got worse | 12:01 |
ivve | https://hastebin.com/pulupayuha.rb | 12:02 |
ivve | keeps trying to create that new one | 12:02 |
ivve | then fails, then gives up | 12:03 |
cgoncalves | show me the logs .... :) | 12:03 |
ivve | ill grab logs | 12:03 |
cgoncalves | heh | 12:03 |
ivve | LB bb5fe733-82c3-4156-b26d-7735b9a8c7dc failover exception: port not found (port id: a9b9ce28-bf7f-4d81-b08a-cf7ab554149e).: PortNotFound: port not found (port id: a9b9ce28-bf7f-4d81-b08a-cf7ab554149e). | 12:04 |
ivve | im guessing this is the problem | 12:04 |
cgoncalves | oh, either the vrrp or the vip port got deleted :/ | 12:05 |
ivve | aye | 12:05 |
cgoncalves | johnsom helped a couple of users with this problem by recreating the ports manually | 12:06 |
cgoncalves | I'm not 100% sure I'd know the whole process to do so | 12:06 |
cgoncalves | he may be able to help you better than I can once he's online | 12:07 |
cgoncalves | also, johnsom has been working on fixing the failover flow which if I recall correctly will also address this scenario, i.e. recreate the port if missing | 12:07 |
ivve | yeah, i guess im just looking to be taught how to fish as this happens from time to time in a prod env | 12:07 |
cgoncalves | https://review.opendev.org/#/c/705317/ | 12:08 |
cgoncalves | if you're in a hurry and that LB can be re-created from scratch, you can do openstack loadbalancer delete | 12:09 |
ivve | yea so this is where the next issue comes up | 12:12 |
ivve | it's not possible | 12:12 |
ivve | :) | 12:12 |
ivve | the only way i can delete it is to fake it to be all healthy and active in the db before octavia notices, and then delete it before its health is checked | 12:12 |
*** nicolasbock has joined #openstack-lbaas | 12:13 | |
cgoncalves | setting it to ERROR and deleting immediately after should work | 12:13 |
ivve | yeah or that | 12:13 |
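And a similarly hedged sketch of the quick-delete workaround above, for when the load balancer can simply be recreated from scratch (same schema assumption as before):

```bash
# Flip to ERROR, then delete right away; --cascade also removes the listeners,
# pools and members owned by the load balancer.
mysql octavia -e "UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '${LB_ID}';"
openstack loadbalancer delete --cascade ${LB_ID}
```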
ivve | it would rock if we could get "openstack loadbalancer set --state <states> <lb>" like cinder | 12:14 |
*** psachin has quit IRC | 12:15 | |
cgoncalves | we lost count of how many people have asked for that :) | 12:15 |
ivve | hahah | 12:15 |
ivve | :D | 12:15 |
ivve | "another one" :P | 12:15 |
ivve | the issue comes when i have tons of users which in turn have tons of loadbalancers | 12:16 |
ivve | and we have some kind of outage | 12:16 |
ivve | and as admin you do that dreadful openstack loadbalancer list and see all those errors :( | 12:17 |
ivve | and then go through them one by one and fix them | 12:18 |
cgoncalves | the refactor failover patch will help you in failing over broken LBs, including when ports got deleted, like in your case now | 12:18 |
ivve | also maybe an openstack loadbalancer re-create <lb> | 12:18 |
cgoncalves | the jobboard patch will help when the controller handling the CUD request gets killed halfway through the flow | 12:18 |
ivve | just delete the whole thing and recreate it with identical uuids/ips. that would probably lead to even more issues in the end, i guess, if resources half-exist or can't be removed | 12:19 |
cgoncalves | how would recreate be different than failover? | 12:19 |
ivve | it would not mend an lb, it would delete the resource and recreate it, i guess | 12:20 |
ivve | well in my case (and i guess in a lot of cases) i want an active/passive topology | 12:21 |
*** maciejjozefczyk has quit IRC | 12:21 | |
ivve | in a standalone topology i guess the failover is exactly that, and im guessing that the failover procedure for a standalone works more often than not compared to the active/passive | 12:21 |
cgoncalves | ivve, that is what failover does. it recreates the amphora (delete + create) | 12:21 |
ivve | yea but not if a part of them is functioning, right? | 12:22 |
ivve | so i have an active and a backup; the backup is in error but the active is fine. it only fixes the backup if i failover, correct or not? | 12:22 |
cgoncalves | failover in active-standby topology will recreate the amps one at a time. this is to avoid data plane outages | 12:22 |
ivve | oh, so it does recreate both? (should?) | 12:22 |
ivve | i haven't seen that | 12:23 |
cgoncalves | ivve, in that case you could do "openstack loadbalancer amphora failover $AMP_ID" | 12:23 |
ivve | yea thats what i have been using most of the time | 12:23 |
cgoncalves | if you failover the loadbalancer, it will failover all amps associated to it | 12:23 |
ivve | got it | 12:23 |
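A hedged sketch of the targeted amphora failover discussed above, for when only one amphora of an active-standby pair is broken (assumes python-octaviaclient supports the --loadbalancer filter on amphora list):

```bash
# List the amphorae belonging to the load balancer and note the one in ERROR.
openstack loadbalancer amphora list --loadbalancer ${LB_ID}

# Fail over just the broken amphora; the surviving one keeps serving traffic.
openstack loadbalancer amphora failover ${AMP_ID}
```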
ivve | like here is the result of an attempted fix. eventually it started working, but now it looks like an abomination and nobody dares touch it for fear of halting production https://hastebin.com/ucesumaduy.rb | 12:30 |
ivve | this is probably from setting it to the error state & using amphora failover after a network outage between controller nodes | 12:32 |
*** maciejjozefczyk has joined #openstack-lbaas | 12:37 | |
ivve | cgoncalves: while im at it asking questions, will there ever be support for using multiple images at the same time? today we use haproxy with some OS (tagging the image with amphora) | 12:38 |
ivve | i mean flavor is one thing, but what about images :) | 12:40 |
cgoncalves | ivve, support to set the amphora image in the flavor is on the to-do list. just needs someone to go and do it. should be trivial, I guess | 12:41 |
ivve | cool | 12:42 |
ivve | it's a request from my users; myself, im just thinking of testing new images before putting them in "prod" | 12:42 |
*** yamamoto has joined #openstack-lbaas | 12:43 | |
ivve | but they want multiple types of OS in the background (don't ask me why) | 12:43 |
cgoncalves | yeah, I understand | 12:44 |
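For reference, a hedged sketch of how amphora image selection works today, until per-LB image selection via flavors is implemented: the controller boots new amphorae from the newest Glance image carrying the tag configured in [controller_worker] amp_image_tag (assumed here to be "amphora"):

```bash
# Upload and tag a new amphora image; Octavia uses the newest image with the
# configured tag for any amphora it boots from now on.
openstack image create --disk-format qcow2 --container-format bare \
    --file amphora-x64-haproxy.qcow2 --tag amphora amphora-haproxy-new
# Existing amphorae are untouched, so a test LB can be failed over first to
# vet the new image before it reaches production load balancers.
```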
ivve | okay well, i will handle this and wait for coming updates | 12:45 |
ivve | thanks a bunch for answering my questions, much appreciated | 12:45 |
*** yamamoto has quit IRC | 12:45 | |
ivve | i will also try to do the full loadbalancer failover next time, setting it to error first if it isn't already. not sure if i have done exactly that before | 12:46 |
cgoncalves | sorry about the trouble. the team is working hard to fix these issues | 12:47 |
cgoncalves | having users report issues and help us troubleshoot is great | 12:47 |
*** yamamoto has joined #openstack-lbaas | 13:06 | |
*** tkajinam has quit IRC | 13:17 | |
ataraday | johnsom, Hi! Sorry for bothering you, but could you take a look at points 7-9 in https://etherpad.openstack.org/p/octavia-worker-v2-issue-tracker and leave some comments on what you think | 13:23 |
*** ivve has quit IRC | 13:28 | |
*** psachin has joined #openstack-lbaas | 13:37 | |
*** TrevorV has joined #openstack-lbaas | 14:18 | |
*** psachin has quit IRC | 14:48 | |
johnsom | Just a comment on the above thread. Restart of the controller will not hang a flow, but kill -9 will. You must use graceful shutdown until our jobboard work is done. | 14:56 |
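A hedged illustration of the graceful-shutdown point above; the systemd unit name is an assumption (distro packages commonly call it octavia-worker, devstack uses devstack@o-cw):

```bash
# Graceful stop: SIGTERM lets the worker finish or revert in-flight flows, so
# load balancers are not stranded in PENDING_*.
systemctl stop octavia-worker

# Avoid this until the jobboard work lands: SIGKILL aborts flows mid-way and
# leaves PENDING_* records behind.
# kill -9 <octavia-worker-pid>
```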
johnsom | ataraday: will do | 14:56 |
johnsom | Sounds like the root cause of ivve's problem was nova, or not being able to reach the database from the controllers. | 15:05 |
*** abaindur has joined #openstack-lbaas | 15:16 | |
*** abaindur has quit IRC | 15:21 | |
xakaitetoia | i've seen a lot of communication issues and generally a good thing is to always check rabbit. | 15:35 |
cgoncalves | johnsom, why do you say that? one problem was the vrrp port was gone in neutron | 15:35 |
johnsom | cgoncalves exactly that, based on looking at the logs and what he said. | 15:36 |
johnsom | A VRRP port gone either means nova didn't release it, or we couldn't access the database. Also, stuck in PENDING_* could mean that the controllers gave up trying to write "ERROR" or "ACTIVE" to the database. It only retries for so long | 15:37 |
cgoncalves | one can also delete a port even if it is attached, I tried that the other day | 15:38 |
johnsom | Yes, but not when nova is hung | 15:39 |
johnsom | I.e. it is having DB issues, or the compute instance is not reachable by nova | 15:39 |
cgoncalves | ah, right. k, didn't try that | 15:39 |
johnsom | This is the open nova bug (that also affects cinder as he mentioned) | 15:40 |
*** maciejjozefczyk has quit IRC | 15:42 | |
johnsom | cgoncalves Just to clarify, heat stack deletes cannot cause a stuck PENDING_* state. That is not an RCA reason. | 15:42 |
*** armax has joined #openstack-lbaas | 15:53 | |
*** TrevorV has quit IRC | 16:09 | |
*** ramishra has quit IRC | 16:36 | |
*** yamamoto has quit IRC | 17:05 | |
*** yamamoto has joined #openstack-lbaas | 17:06 | |
*** yamamoto has quit IRC | 17:06 | |
*** yamamoto has joined #openstack-lbaas | 17:06 | |
*** yamamoto has quit IRC | 17:11 | |
*** xakaitetoia has quit IRC | 17:11 | |
*** yamamoto has joined #openstack-lbaas | 17:26 | |
*** tesseract has quit IRC | 17:29 | |
*** gmann is now known as gmann_afk | 17:42 | |
openstackgerrit | Merged openstack/octavia stable/queens: Use stable upper-constraints.txt in Amphora builds https://review.opendev.org/706052 | 17:47 |
johnsom | Wahooo! Thanks cgoncalves for your persistence. lol | 17:47 |
cgoncalves | happy to help. 23 rechecks | 17:54 |
johnsom | 🤦 | 17:54 |
*** gcheresh_ has joined #openstack-lbaas | 18:14 | |
*** gcheresh_ has quit IRC | 18:32 | |
*** gcheresh_ has joined #openstack-lbaas | 18:42 | |
openstackgerrit | Brian Haley proposed openstack/octavia master: Allow multiple VIPs per LB https://review.opendev.org/660239 | 19:16 |
*** gcheresh_ has quit IRC | 19:22 | |
openstackgerrit | Michael Johnson proposed openstack/octavia master: WIP - Refactor the failover flows https://review.opendev.org/705317 | 19:26 |
cgoncalves | cores: gate fix https://review.opendev.org/#/c/706051/ | 19:43 |
johnsom | +2 | 19:45 |
cgoncalves | thanks | 19:52 |
*** gmann_afk is now known as gmann | 20:12 | |
haleyb | johnsom: since you're approving gate fixes, https://review.opendev.org/#/c/707687/ :) | 20:22 |
*** abaindur has joined #openstack-lbaas | 21:08 | |
*** abaindur has quit IRC | 21:08 | |
*** abaindur has joined #openstack-lbaas | 21:09 | |
johnsom | haleyb +W | 21:38 |
*** nicolasbock has quit IRC | 21:54 | |
*** yamamoto has quit IRC | 22:01 | |
*** ccamposr has joined #openstack-lbaas | 22:02 | |
*** ccamposr__ has quit IRC | 22:03 | |
*** yamamoto has joined #openstack-lbaas | 22:37 | |
*** yamamoto has quit IRC | 22:45 | |
openstackgerrit | Merged openstack/octavia stable/stein: Fix pep8 failures on stable/stein branch https://review.opendev.org/707687 | 23:20 |
*** armax has quit IRC | 23:33 | |
*** armax has joined #openstack-lbaas | 23:33 | |
*** abaindur has quit IRC | 23:52 | |
*** spatel has joined #openstack-lbaas | 23:58 |