Thursday, 2021-01-21

*** sapd1 has joined #openstack-lbaas00:26
*** armax has joined #openstack-lbaas01:11
*** sapd1 has quit IRC01:11
*** jamesdenton has quit IRC01:33
*** jamesden_ has joined #openstack-lbaas01:33
*** armax has quit IRC01:44
*** yamamoto_ has joined #openstack-lbaas04:51
*** yamamoto has quit IRC04:52
rm_workI think I'm going to be patching our haproxy element internally to add the following:05:00
rm_workin elements/haproxy-octavia/post-install.d/20-setup-haproxy-log05:00
rm_worksed -i 's/daily/size 10G/' /etc/logrotate.d/haproxy05:00
rm_workjohnsom: thoughts? rather than daily rotations, just rotate on 10G size, should be 19G total space approximately with a rotate value of 10, and assuming compression is about 10:105:01
rm_workwe've had cases where logs fill up WAY too fast for a daily rotation05:02
rm_workideally offloading, but that's going to be next quarter :D05:02
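For reference, a minimal sketch of the element hook rm_work describes above, assuming the standard diskimage-builder post-install.d layout (the sed expression is the one quoted; the wrapper script around it is illustrative):

    #!/bin/bash
    # elements/haproxy-octavia/post-install.d/20-setup-haproxy-log (sketch)
    set -eux

    # Switch the packaged haproxy logrotate config from a daily interval
    # to a 10G size trigger, so a busy listener can't fill the disk
    # between daily rotations.
    sed -i 's/daily/size 10G/' /etc/logrotate.d/haproxy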
johnsomI would check that your OS will actually look at the size more often than once a day.05:03
johnsomJust saying.... I heard a rumor05:03
rm_workah was that the issue you had with a certain logrotate being broken? :D05:03
johnsomOk, I didn’t hear a rumor but was shocked to see an issue there05:04
johnsomIdeally, it would rotate on size once an hour05:04
rm_workyou sure that isn't related to size vs maxsize?05:06
rm_worksize rotates only on the specified interval even if it's passed before that time05:07
rm_workmaxsize ignores interval05:07
rm_workso actually I should be using maxsize05:07
rm_workmaxsize added in 3.8.105:08
rm_workcent8 has 3.14.005:09
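To illustrate the size/maxsize distinction with a hedged example: maxsize rotates on the configured interval or as soon as the size is exceeded, whichever comes first (checked each time logrotate actually runs, hence the cron discussion that follows). The stanza below only approximates the packaged CentOS 8 haproxy config; it is not the exact file:

    # Sketch: install a maxsize-based stanza during the image build.
    cat > /etc/logrotate.d/haproxy <<'EOF'
    /var/log/haproxy.log {
        weekly
        maxsize 10G
        rotate 10
        missingok
        notifempty
        compress
        sharedscripts
        postrotate
            /usr/bin/systemctl kill -s HUP rsyslog.service >/dev/null 2>&1 || true
        endscript
    }
    EOF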
johnsomSo check my work...05:09
johnsomLook in all of the crontab files and directories.05:09
johnsomIf you can find an hourly trigger for logrotate I will stand corrected05:10
rm_workhmm, no you're right about the crontab being only daily, but i thought "logrotate.d" was a ... "d"05:11
rm_workie, running daemon05:11
rm_workguess not?05:11
rm_workbut if this is the "bug" then this is crazy easy to fix05:11
johnsomAs I looked further, it seemed that simply adding the hourly trigger might break other logrotate configs that expect daily only05:11
rm_workah well ok then05:11
rm_workthat was my naive assumption about fixing it :D05:11
johnsomSorry to be a few steps ahead on this one05:12
rm_workeh, according to config i think default is "weekly"05:12
johnsomThat is the point where I had to stop looking at it, because I didn’t have sponsorship to spend the time testing/fixing, etc05:13
rm_workso unless a service specified hourly (which was not honored, but should at least be OK if it specifies it), it should be fine05:13
rm_workso I'm going to go ahead and assume copying the cron trigger is OK05:13
johnsomWell, if the task is set for size and currently runs once a day, with, say, 5 rotations of history, having it actually run on the hour will rotate things out05:14
johnsomBut, hey, I am also the one that implemented “log nothing in the amp”, lol05:15
rm_workyeah but the difference should be ... nothing05:16
rm_workif it rotates vs not... it's a log05:16
rm_workand it won't use MORE space05:16
rm_workI'm honestly just sad there's no /etc/cron.minutely05:17
johnsomRight, just delete sooner than expected and send signals to processes more often05:17
johnsomActually, you can do that via the crontab config05:17
rm_workif processes can handle the signals daily, they should be able to handle them hourly... this is not a super busy system besides haproxy05:17
rm_workand if they can't handle them daily, we'd be in trouble anyway05:17
rm_workbut ok, let's play this out05:18
rm_workit's not many services... here's the contents of logrotate.d/05:18
johnsomYeah, adding the symbolic link is “It should work, but there are warning signs it might have side effects”05:18
rm_workamphora-agent  btmp  dnf  haproxy  iscsiuiolog  syslog  wtmp05:18
rm_work /var/log/amphora-agent.log specifies daily05:19
rm_work /var/log/btmp specifies monthly05:20
rm_work /var/log/dnf.librepo.log specifies weekly05:20
rm_work /var/log/hawkey.log specifies weekly05:20
rm_work /var/log/iscsiuio.log specifies weekly05:20
johnsomI am on mobile, so can’t dig into this now05:20
rm_workright, i'm telling you what they all are :D05:20
rm_work /var/log/wtmp specifies monthly05:21
rm_work /etc/logrotate.d/syslog is the only one that doesn't specify a time05:21
rm_workwhich means it defaults to weekly05:22
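Putting the "copy the cron trigger" idea into a sketch: on CentOS 8, /etc/cron.daily/logrotate is a small shell wrapper, and the survey above suggests duplicating it hourly is safe because every other config pins daily/weekly/monthly (or defaults to weekly), so the extra runs are no-ops for them. Verify the wrapper's exact contents on your image first:

    # Run logrotate hourly as well as daily so maxsize is actually
    # checked more than once per day. Assumes the stock cron wrapper.
    cp /etc/cron.daily/logrotate /etc/cron.hourly/logrotate
    chmod +x /etc/cron.hourly/logrotate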
*** zzzeek has quit IRC05:41
*** zzzeek has joined #openstack-lbaas05:42
*** vishalmanchanda has joined #openstack-lbaas05:58
*** rcernin has quit IRC06:07
*** gcheresh has joined #openstack-lbaas06:37
*** ramishra has quit IRC06:45
*** xgerman has quit IRC07:05
*** ramishra has joined #openstack-lbaas07:08
*** tkajinam_ has joined #openstack-lbaas07:19
*** tkajinam has quit IRC07:20
openstackgerritOpenStack Proposal Bot proposed openstack/octavia-dashboard master: Imported Translations from Zanata  https://review.opendev.org/c/openstack/octavia-dashboard/+/76667907:36
*** luksky has joined #openstack-lbaas08:03
*** jamesden_ has quit IRC08:05
*** jamesdenton has joined #openstack-lbaas08:06
openstackgerritGregory Thiemonge proposed openstack/octavia master: Validate listener protocol in amphora driver  https://review.opendev.org/c/openstack/octavia/+/77175908:11
*** rpittau|afk is now known as rpittau08:11
openstackgerritAnn Taraday proposed openstack/octavia master: Add retry for getting amphora VM  https://review.opendev.org/c/openstack/octavia/+/72608409:27
openstackgerritGregory Thiemonge proposed openstack/octavia master: Validate listener protocol in amphora driver  https://review.opendev.org/c/openstack/octavia/+/77175909:37
*** yamamoto_ has quit IRC09:39
admin0lb stuck on pending-update .. how to delete it ?09:58
*** jamesdenton has quit IRC10:00
*** jamesdenton has joined #openstack-lbaas10:01
openstackgerritGregory Thiemonge proposed openstack/octavia master: Add SCTP support in Amphora  https://review.opendev.org/c/openstack/octavia/+/75324710:12
gthiemongeadmin0: it should go in an ERROR state after a timeout, then you'll be able to delete it10:13
admin0its been like this for 14 hours10:13
gthiemongeadmin0: do you see any activity in the logs? I mean, it can go into ERROR, then an API call tries to update it, it goes in PENDING_UPDATE... some kind of loop of PENDING_UPDATE and ERROR statuses10:16
openstackgerritArieh Maron proposed openstack/octavia-tempest-plugin master: Updating _test_pool_CRUD to enable testing of updates to the load balancer algorithm:  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77178110:20
*** yamamoto has joined #openstack-lbaas10:22
*** yamamoto has quit IRC10:28
openstackgerritMerged openstack/octavia stable/stein: Fix listener update with SNI certificates  https://review.opendev.org/c/openstack/octavia/+/76733810:53
*** rcernin has joined #openstack-lbaas11:23
openstackgerritMerged openstack/octavia stable/train: Fix amphora failover when VRRP port is missing  https://review.opendev.org/c/openstack/octavia/+/75319311:33
*** yamamoto has joined #openstack-lbaas11:56
*** rcernin has quit IRC11:57
*** jamesdenton has quit IRC12:18
*** jamesdenton has joined #openstack-lbaas12:19
*** yamamoto has quit IRC12:40
*** AlexStaf has quit IRC13:10
*** AlexStaf has joined #openstack-lbaas13:10
*** yamamoto has joined #openstack-lbaas13:13
*** yamamoto has quit IRC13:18
*** yamamoto has joined #openstack-lbaas13:18
*** AlexStaf has quit IRC13:27
openstackgerritGregory Thiemonge proposed openstack/octavia-tempest-plugin master: Add SCTP protocol listener api tests  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/76030513:40
openstackgerritGregory Thiemonge proposed openstack/octavia-tempest-plugin master: Add SCTP protocol scenario tests  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/73864313:40
*** sapd1 has joined #openstack-lbaas13:47
*** AlexStaf has joined #openstack-lbaas13:50
*** vishalmanchanda has quit IRC14:38
*** TrevorV has joined #openstack-lbaas14:50
*** ccamposr has joined #openstack-lbaas15:10
*** wolsen has quit IRC15:19
*** irclogbot_1 has quit IRC15:21
*** irclogbot_0 has joined #openstack-lbaas15:22
*** armax has joined #openstack-lbaas15:25
*** wolsen has joined #openstack-lbaas15:27
*** ccamposr has quit IRC15:28
*** armax has quit IRC15:48
*** armax has joined #openstack-lbaas16:06
*** armax has quit IRC16:52
*** njohnston is now known as njohnston|lunch17:01
*** AlexStaf has quit IRC17:01
admin0gthiemonge, 24 hours .. pending update17:25
admin0hi all .. how do I delete a lb that is in  Pending Update mode17:26
johnsomadmin0 Under normal operations it will automatically go to ERROR once the retry timeouts have expired.17:26
johnsomDid you have a rabbit or host outage on your control plane?17:26
johnsom If it's been that long and you don't see retry WARNING messages scrolling in the controller logs (worker or health), it likely means a controller was forcefully killed (not a graceful shutdown) somehow while the controller had ownership (PENDING_*) and was mid-provisioning.17:28
admin0the controller is up and running .. it's a 3-controller HA setup (openstack-ansible)17:29
johnsomIf you have checked your logs, and it's not continuing to work on that instance (24 hours could still be valid if you changed the retry timeouts in the config file).17:29
admin0but even assuming something bad might have happened, it's impossible to change state and just kill this lb ?17:29
admin0just the defaults are used ( .. but i did not check what osa defaults are for octavia )17:30
johnsomadmin0 Yeah, up and running doesn't mean it wasn't forcefully killed in a bad way. It's the same as nova/neutron, we just call it out in your face a bit more. lol17:30
admin0i meant .. how do I kill this specific one now ?17:31
admin0i could not see any option to change state .. except going into the database .. which i don't like17:31
johnsomIf, and I mean if, you have checked the logs and it's not being worked on (don't do this to LBs that are still retrying), you can update the load balancer object in the DB to provisioning_status == ERROR, then failover or delete the load balancer.17:31
johnsomYeah, it's super dangerous to modify the state of the load balancer objects. You can really mess up the cloud.17:32
johnsomThe work we are doing on the amphorav2 is to address situations where, say the power was pulled from a controller mid-provisioning. However the bugs are still being worked out of that.17:33
johnsomIf it wasn't kill -9 or power pulled, all code paths lead back to a consistent and mutable state after the retries time out.17:34
johnsomEither ACTIVE if the cloud's services recover, or ERROR if they do not come back up in time.17:35
johnsomGraceful shutdowns will not lead to this as we can put the objects in a consistent state on shutdown17:37
*** ccamposr has joined #openstack-lbaas17:41
*** rpittau is now known as rpittau|afk17:42
admin0johnsom, journalctl shows no errors17:45
admin0inside the amphora image17:46
*** xgerman has joined #openstack-lbaas17:46
admin0how to map an amphora image from a lb uuid ?17:46
johnsomYeah, this has nothing to do with the load balancer or the amphora. It is purely a control plane (controller) issue.17:46
johnsomIf the operating_status of the LB is ONLINE, the LB/amphora are still passing traffic, etc.17:47
admin0operating state = online,  provisioning state = pending update, admin state up = yes17:48
admin0i can reach the controllers from inside the amphora .. they ping fine17:48
johnsomYep. The load balancer/amphora are happy and working fully. The last requested control plane change (via the API) did not complete its provisioning steps due to the controller being forcefully killed while it was trying to take the action.17:49
johnsomOk, let me try to step back and explain this.17:49
johnsomLet's say a user calls the Octavia API to change the maximum connections on their load balancer.17:50
johnsomBefore this call, the load balancer status is provisioning = Active, operating status = online.17:50
admin0WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='172.29.250.148', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fadd60040d0>: Failed to establish a new connection: [Errno 111] Connection refused'))17:50
admin0i can ping/ssh the amphora fine from this octaiva api controllers17:51
admin0they are up and running for a long time17:51
admin022 days17:51
johnsomYes, that WARNING message is the controller retrying to connect, that LB will be in PENDING_*.17:51
johnsomLet me finish explaining what is happening.17:51
admin0can a firewall in the provisioning ( terraform) prevent this internal communication ?17:51
johnsomSure, neutron security groups, firewalls, misconfigured switches, etc.17:52
admin0on what port of the amphora image is the controller trying to connect ?17:52
admin0i see 22 and 9443 allowed in the octavia_sec_group created17:53
johnsomOk, so, API call comes in. It's queued for a controller to start the provisioning. At this point the LB is marked PENDING_* as there is an assigned provisioning action.17:53
johnsomPer the WARNING above, it is 944317:53
*** rcernin has joined #openstack-lbaas17:54
johnsomA controller will pop the requested change off the queue and start the process of reconfiguring the load balancer. That controller instance has a lock and ownership of the LB.17:55
johnsomNow, let's say that API request takes 10 steps to complete. I.e. building a config, pushing it out, triggering reloads, etc.17:56
johnsomThen on step 4, someone kill -9 the controller that has ownership.17:56
admin0that did not happen in my case ( no one touched the controller, or even logged in ) .. the controllers are working fine17:57
admin0if a lb was created with strict firewall rules, can this also happen ?17:57
johnsomkill -9 does not allow the controller to clean up or finish. At that point the object is now stuck in the state it was when someone kill -9 it. The controller can no longer keep retrying or advancing through the tasks.17:57
johnsomNo, there is no way this could happen other than rabbit/mysql/or the controller being killed.17:58
admin0and this is false in my case17:58
admin0there are no errors from rabbit or mysql, and no controllers dying or giving errors17:58
*** rcernin has quit IRC17:59
admin0i only see the controller unable to connect to 9443 on the amphora ip17:59
admin0while the lb is up and i can ssh in just fine17:59
admin09443 does not work, ssh works fine17:59
johnsomIf you are seeing those WARNING messages in the logs scrolling, that means the controller is still retrying based on the timeouts in the config. It will eventually give up and move it to ERROR if it cannot reach the amphora.17:59
johnsomOk, so is that WARNING repeating in your log? or was that historical?17:59
johnsomObviously that WARNING can also happen at LB creation time, while nova is still booting the VM.18:01
johnsomSo if it's historical, that probably doesn't relate.18:01
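One way to tell whether that WARNING is live or historical is to tail the controller logs for it; a sketch, with systemd unit names that vary by deployment tooling (the grep string is taken from the message quoted above):

    # Follow the worker and health-manager logs, watching for the
    # amphora connection-retry WARNING. Unit names are an assumption;
    # openstack-ansible runs these in containers with its own naming.
    journalctl -f -u octavia-worker -u octavia-health-manager \
        | grep 'Could not connect to instance'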
admin0i killed the amphora image .. thinking the image was corrupt .. then it created a new amphora image with a new ip address.. it came up 3 times .. and then it said .. Mark READY in DB for amphora: c46ffb37-6c3f-4a7e-b178-9be1922e4ff0  .. but on the frontend side, the lb is still marked PENDING18:01
admin0if load balancer ID is   xyz, how do I map it to amphora image abc ?18:02
admin0so that i know i killed the right image18:02
johnsomImage or VM? Image is in glance, VM instance is in nova18:02
admin0as user, i see my lb uuid as   say abc123 .. and as octavia, i see amphora images as  xyz12318:03
admin0how to tell which amphora image uuid is used by which lb uuid18:03
johnsomI think you mean amphora VM instance. You can map those with "openstack loadbalancer amphora list --loadbalancer <LB ID>". That is different than the image ID which is the qcow2 image ID as it is stored in glance to boot future VMs.18:04
admin0sorry .. amphora vm uuid18:04
admin0my mistake in typing .. sorry18:04
admin0i meant the vm uuid18:04
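Spelled out, the mapping johnsom describes (IDs below are placeholders):

    # List the amphorae that belong to one load balancer...
    openstack loadbalancer amphora list --loadbalancer <LB ID>
    # ...then show one record to get its nova VM ID (compute_id).
    openstack loadbalancer amphora show <AMPHORA ID>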
admin0i had killed the wrong lb :)18:06
admin0wrong amphora image18:06
admin0will killing that amphora image help fix stuff ?18:06
johnsomNo, as the amphora VM instance has no relation to provisioning_status18:06
admin0ok..  assuming something bad happened, how do I change its status and delete this lb entry ?18:07
johnsomprovisioning_status is the status from the control plane side.18:07
johnsomAh, DON'T delete from the DB.18:07
admin0checking if there is something like we can do for nova , cinder etc .. change its state to --state available and then issue a delete18:08
admin0in pending update status, i cannot delete it, or pool or listeners18:08
johnsomOk, so assuming you have no scrolling messages in the logs showing it's still being worked on.18:08
admin0logs -- is it the amphora image logs or oactaiva logs ?18:08
johnsomRight, it's still owned, so you can't make additional changes to it.18:08
johnsomOctavia worker and health manager logs. Control plane18:08
johnsomAmphora logs will have no knowledge of any of this18:09
admin0on all 3 controllers, following the journalctl , i don't see anything recurring18:12
johnsomOk, yeah, so a controller was killed somehow. So, we need to switch the provisioning_status to ERROR on the impacted load balancer.18:13
johnsomDo you know how to connect to the mysql octavia database in your cloud?18:13
admin0yep18:15
johnsomupdate load_balancer set provisioning_status = 'ERROR' where id = 'd9775ed5-40e2-4641-bdd6-89b61102d467';18:16
admin0where ID = lb id ?18:16
johnsomWhere ID is the load balancer ID.18:16
admin0ok18:16
johnsomThen, you can either "openstack loadbalancer failover <LB ID>" or delete it18:16
johnsomIn the future, just be sure that a controller isn't still working on it, as if you do this when a controller still actually owns the object, you will get a mess and a broken LB.18:18
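Collected into one hedged sequence (run it only after confirming the logs are quiet, per the warning above; the UUID is the example from earlier, and --cascade on delete is an addition not mentioned in the discussion):

    # 1. Verify no controller still owns the LB: no retry WARNINGs
    #    scrolling in the worker/health-manager logs.
    # 2. Release the PENDING_* lock by marking the LB ERROR in the DB.
    mysql octavia -e "update load_balancer \
        set provisioning_status = 'ERROR' \
        where id = 'd9775ed5-40e2-4641-bdd6-89b61102d467';"
    # 3. Then EITHER rebuild it in place ...
    openstack loadbalancer failover d9775ed5-40e2-4641-bdd6-89b61102d467
    # ... OR delete it instead (--cascade also removes listeners/pools).
    openstack loadbalancer delete --cascade d9775ed5-40e2-4641-bdd6-89b61102d467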
admin0but my controllers are fine18:18
admin03 load balancers were created at the same time ... one after another .. this is the middle one .. 2 of them worked fine18:18
johnsomWell, if this LB was stuck in PENDING_*, at some point in the past one of them was not.18:19
johnsomIf you look at the flow documentation here: https://docs.openstack.org/octavia/latest/contributor/devref/flows.html18:19
johnsomYou will see that every controller flow will either complete or revert to ERROR should something software wise go wrong. It's only when the controller is killed that the status does not advance.18:20
johnsomIf you have the controller logs for the time period you created those three, I am happy to go through them and see if I can isolate what/when it happened.18:22
admin0i need to also reduce the time period to say like 10 mins max18:27
admin0to prevent this from getting stuck forever18:27
johnsomYeah, most deployments have retry timeouts like that. However, some argue they want it to try forever....18:28
johnsomThe defaults are fairly long if I remember correctly.18:28
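If memory serves, the amphora-connection retry knobs live in octavia.conf under [haproxy_amphora] (among other timeouts that affect how long a flow holds the PENDING_* lock); a sketch of capping them at roughly ten minutes (120 retries x 5s), using crudini as one arbitrary way to edit the INI:

    # Cap how long the controller retries an unreachable amphora before
    # giving up and moving the LB to ERROR (~10 minutes total here).
    crudini --set /etc/octavia/octavia.conf haproxy_amphora connection_max_retries 120
    crudini --set /etc/octavia/octavia.conf haproxy_amphora connection_retry_interval 5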
*** ccamposr__ has joined #openstack-lbaas18:42
*** jamesdenton has quit IRC18:43
*** jamesdenton has joined #openstack-lbaas18:43
*** ccamposr has quit IRC18:44
admin0johnsom, is an api/command coming soon to set it to error or other states, to not have to go via the db19:00
johnsomNo, never, it is too dangerous19:01
johnsomAs I mentioned, amphorav2 will help resolve the issue of forceful kills of the controllers.19:01
*** sapd1 has quit IRC19:02
*** njohnston|lunch is now known as njohnston19:02
*** sapd1 has joined #openstack-lbaas19:06
admin0just wondering .. why/how is openstack lb set --status error  dangerous ?    .. its just a field update in the db isnt it ?19:06
johnsomWe have had many long conversations about this at PTGs, etc.19:06
johnsomOh, it's not, PENDING_* is an object lock. It's how the HA control plane works to make sure multiple controllers aren't trying to take action on the same load balancer at the same time. If you aren't super careful to make sure one of the controllers doesn't have ownership, you can completely corrupt the LB, break it to the point it isn't passing traffic, leave broken objects in nova and neutron that may cause19:08
johnsomthe load balancer to not work again, etc.19:08
*** haleyb has quit IRC19:27
*** dougwig has quit IRC19:27
*** bbezak has quit IRC19:28
*** headphoneJames has quit IRC19:28
*** f0o has quit IRC19:28
*** bbezak has joined #openstack-lbaas19:28
*** dougwig has joined #openstack-lbaas19:28
*** headphoneJames has joined #openstack-lbaas19:28
*** f0o has joined #openstack-lbaas19:30
admin0thanks19:37
admin0when is v2 coming up ?19:37
johnsomIt's in Victoria, but we are working through bugs with it19:37
admin0ok19:37
johnsomActually, it was in earlier releases too, but not so usable19:37
*** jamesdenton has quit IRC20:06
*** jamesdenton has joined #openstack-lbaas20:06
*** ccamposr has joined #openstack-lbaas20:12
*** irclogbot_0 has quit IRC20:14
*** ccamposr__ has quit IRC20:15
*** irclogbot_2 has joined #openstack-lbaas20:16
openstackgerritMerged openstack/octavia stable/ussuri: Add missing log line for finishing amp operations  https://review.opendev.org/c/openstack/octavia/+/75286020:22
*** TrevorV has quit IRC21:12
*** gcheresh has quit IRC21:22
openstackgerritBrian Haley proposed openstack/octavia-tempest-plugin master: Remove duplicate operating status check  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77188721:41
*** rcernin has joined #openstack-lbaas21:54
openstackgerritBrian Haley proposed openstack/octavia-tempest-plugin master: DNM - SIP TCP debugging  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77188821:57
*** rcernin has quit IRC21:59
*** rcernin has joined #openstack-lbaas22:38
*** rcernin has quit IRC22:56
*** rcernin has joined #openstack-lbaas22:57
*** yamamoto has quit IRC23:04
*** yamamoto has joined #openstack-lbaas23:04
openstackgerritBrian Haley proposed openstack/octavia-tempest-plugin master: DNM - SIP TCP debugging  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77188823:10
*** luksky has quit IRC23:51
*** lemko has quit IRC23:51
*** lemko6 has joined #openstack-lbaas23:51
