Thursday, 2021-01-21

*** sapd1 has joined #openstack-lbaas00:26
*** armax has joined #openstack-lbaas01:11
*** sapd1 has quit IRC01:11
*** jamesdenton has quit IRC01:33
*** jamesden_ has joined #openstack-lbaas01:33
*** armax has quit IRC01:44
*** yamamoto_ has joined #openstack-lbaas04:51
*** yamamoto has quit IRC04:52
rm_workI think I'm going to be patching our haproxy element internally to add the following:05:00
rm_workin elements/haproxy-octavia/post-install.d/20-setup-haproxy-log05:00
rm_worksed -i 's/daily/size 10G/' /etc/logrotate.d/haproxy05:00
rm_workjohnsom: thoughts? rather than daily rotations, just rotate on 10G size, should be 19G total space approximately with a rotate value of 10, and assuming compression is about 10:105:01
rm_workwe've had cases where logs fill up WAY too fast for a daily rotation05:02
rm_workideally offloading, but that's going to be next quarter :D05:02
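For reference, a minimal sketch of the element hook rm_work describes above, assuming the standard diskimage-builder post-install.d layout (the sed expression is the one quoted; the wrapper script around it is illustrative):

    #!/bin/bash
    # elements/haproxy-octavia/post-install.d/20-setup-haproxy-log (sketch)
    set -eux

    # Switch the packaged haproxy logrotate config from a daily interval
    # to a 10G size trigger, so a busy listener can't fill the disk
    # between daily rotations.
    sed -i 's/daily/size 10G/' /etc/logrotate.d/haproxy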
johnsomI would check that your OS will actually look at the size more often than once a day.05:03
johnsomJust saying.... I heard a rumor05:03
rm_workah was that the issue you had with a certain logrotate being broken? :D05:03
johnsomOk, I didn’t hear a rumor but was shocked to see an issue there05:04
johnsomIdeally, it would rotate on size once an hour05:04
rm_workyou sure that isn't related to size vs maxsize?05:06
rm_worksize rotates only on the specified interval even if it's passed before that time05:07
rm_workmaxsize ignores interval05:07
rm_workso actually I should be using maxsize05:07
rm_workmaxsize added in 3.8.105:08
rm_workcent8 has 3.14.005:09
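To illustrate the size/maxsize distinction with a hedged example: maxsize rotates on the configured interval or as soon as the size is exceeded, whichever comes first (checked each time logrotate actually runs, hence the cron discussion that follows). The stanza below only approximates the packaged CentOS 8 haproxy config; it is not the exact file:

    # Sketch: install a maxsize-based stanza during the image build.
    cat > /etc/logrotate.d/haproxy <<'EOF'
    /var/log/haproxy.log {
        weekly
        maxsize 10G
        rotate 10
        missingok
        notifempty
        compress
        sharedscripts
        postrotate
            /usr/bin/systemctl kill -s HUP rsyslog.service >/dev/null 2>&1 || true
        endscript
    }
    EOF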
johnsomSo check my work...05:09
johnsomLook in all of the crontab files and directories.05:09
johnsomIf you can find an hourly trigger for logrotate I will stand corrected05:10
rm_workhmm, no you're right about the crontab being only daily, but i thought "logrotate.d" was a ... "d"05:11
rm_workie, running daemon05:11
rm_workguess not?05:11
rm_workbut if this is the "bug" then this is crazy easy to fix05:11
johnsomAs I looked further, it seemed that simply adding the hourly trigger might break other logrotate configs that expect daily only05:11
rm_workah well ok then05:11
rm_workthat was my naive assumption about fixing it :D05:11
johnsomSorry to be a few steps ahead on this one05:12
rm_workeh, according to config i think default is "weekly"05:12
johnsomThat is the point where I had to stop looking at it, because I didn’t have sponsorship to spend the time testing/fixing, etc05:13
rm_workso unless a service specified hourly (which was not honored, but should at least be OK if it specifies it), it should be fine05:13
rm_workso I'm going to go ahead and assume copying the cron trigger is OK05:13
johnsomWell, if the task is set for size and currently runs once a day, with, say, 5 rotations of history, having it actually run on the hour will rotate things out05:14
johnsomBut, hey, I am also the one that implemented “log nothing in the amp”, lol05:15
rm_workyeah but the difference should be ... nothing05:16
rm_workif it rotates vs not... it's a log05:16
rm_workand it won't use MORE space05:16
rm_workI'm honestly just sad there's no /etc/cron.minutely05:17
johnsomRight, just delete sooner than expected and send signals to processes more often05:17
johnsomActually, you can do that via the crontab config05:17
rm_workif processes can handle the signals daily, they should be able to handle them hourly... this is not a super busy system besides haproxy05:17
rm_workand if they can't handle them daily, we'd be in trouble anyway05:17
rm_workbut ok, let's play this out05:18
rm_workit's not many services... here's the contents of logrotate.d/05:18
johnsomYeah, adding the symbolic link is “It should work, but there are warning signs it might have side effects”05:18
rm_workamphora-agent  btmp  dnf  haproxy  iscsiuiolog  syslog  wtmp05:18
rm_work /var/log/amphora-agent.log specifies daily05:19
rm_work /var/log/btmp specifies monthly05:20
rm_work /var/log/dnf.librepo.log specifies weekly05:20
rm_work /var/log/hawkey.log specifies weekly05:20
rm_work /var/log/iscsiuio.log specifies weekly05:20
johnsomI am on mobile, so can’t dig into this now05:20
rm_workright, i'm telling you what they all are :D05:20
rm_work /var/log/wtmp specifies monthly05:21
rm_work /etc/logrotate.d/syslog is the only one that doesn't specify a time05:21
rm_workwhich means it defaults to weekly05:22
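Putting the "copy the cron trigger" idea into a sketch: on CentOS 8, /etc/cron.daily/logrotate is a small shell wrapper, and the survey above suggests duplicating it hourly is safe because every other config pins daily/weekly/monthly (or defaults to weekly), so the extra runs are no-ops for them. Verify the wrapper's exact contents on your image first:

    # Run logrotate hourly as well as daily so maxsize is actually
    # checked more than once per day. Assumes the stock cron wrapper.
    cp /etc/cron.daily/logrotate /etc/cron.hourly/logrotate
    chmod +x /etc/cron.hourly/logrotate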
*** zzzeek has quit IRC05:41
*** zzzeek has joined #openstack-lbaas05:42
*** vishalmanchanda has joined #openstack-lbaas05:58
*** rcernin has quit IRC06:07
*** gcheresh has joined #openstack-lbaas06:37
*** ramishra has quit IRC06:45
*** xgerman has quit IRC07:05
*** ramishra has joined #openstack-lbaas07:08
*** tkajinam_ has joined #openstack-lbaas07:19
*** tkajinam has quit IRC07:20
openstackgerritOpenStack Proposal Bot proposed openstack/octavia-dashboard master: Imported Translations from Zanata  https://review.opendev.org/c/openstack/octavia-dashboard/+/76667907:36
*** luksky has joined #openstack-lbaas08:03
*** jamesden_ has quit IRC08:05
*** jamesdenton has joined #openstack-lbaas08:06
openstackgerritGregory Thiemonge proposed openstack/octavia master: Validate listener protocol in amphora driver  https://review.opendev.org/c/openstack/octavia/+/77175908:11
*** rpittau|afk is now known as rpittau08:11
openstackgerritAnn Taraday proposed openstack/octavia master: Add retry for getting amphora VM  https://review.opendev.org/c/openstack/octavia/+/72608409:27
openstackgerritGregory Thiemonge proposed openstack/octavia master: Validate listener protocol in amphora driver  https://review.opendev.org/c/openstack/octavia/+/77175909:37
*** yamamoto_ has quit IRC09:39
admin0lb stuck on pending-update .. how to delete it ?09:58
*** jamesdenton has quit IRC10:00
*** jamesdenton has joined #openstack-lbaas10:01
openstackgerritGregory Thiemonge proposed openstack/octavia master: Add SCTP support in Amphora  https://review.opendev.org/c/openstack/octavia/+/75324710:12
gthiemongeadmin0: it should go in an ERROR state after a timeout, then you'll be able to delete it10:13
admin0its been like this for 14 hours10:13
gthiemongeadmin0: do you see any activity in the logs? I mean, it can go into ERROR, then an API call tries to update it, it goes in PENDING_UPDATE... some kind of loop of PENDING_UPDATE and ERROR statuses10:16
openstackgerritArieh Maron proposed openstack/octavia-tempest-plugin master: Updating _test_pool_CRUD to enable testing of updates to the load balancer algorithm:  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77178110:20
*** yamamoto has joined #openstack-lbaas10:22
*** yamamoto has quit IRC10:28
openstackgerritMerged openstack/octavia stable/stein: Fix listener update with SNI certificates  https://review.opendev.org/c/openstack/octavia/+/76733810:53
*** rcernin has joined #openstack-lbaas11:23
openstackgerritMerged openstack/octavia stable/train: Fix amphora failover when VRRP port is missing  https://review.opendev.org/c/openstack/octavia/+/75319311:33
*** yamamoto has joined #openstack-lbaas11:56
*** rcernin has quit IRC11:57
*** jamesdenton has quit IRC12:18
*** jamesdenton has joined #openstack-lbaas12:19
*** yamamoto has quit IRC12:40
*** AlexStaf has quit IRC13:10
*** AlexStaf has joined #openstack-lbaas13:10
*** yamamoto has joined #openstack-lbaas13:13
*** yamamoto has quit IRC13:18
*** yamamoto has joined #openstack-lbaas13:18
*** AlexStaf has quit IRC13:27
openstackgerritGregory Thiemonge proposed openstack/octavia-tempest-plugin master: Add SCTP protocol listener api tests  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/76030513:40
openstackgerritGregory Thiemonge proposed openstack/octavia-tempest-plugin master: Add SCTP protocol scenario tests  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/73864313:40
*** sapd1 has joined #openstack-lbaas13:47
*** AlexStaf has joined #openstack-lbaas13:50
*** vishalmanchanda has quit IRC14:38
*** TrevorV has joined #openstack-lbaas14:50
*** ccamposr has joined #openstack-lbaas15:10
*** wolsen has quit IRC15:19
*** irclogbot_1 has quit IRC15:21
*** irclogbot_0 has joined #openstack-lbaas15:22
*** armax has joined #openstack-lbaas15:25
*** wolsen has joined #openstack-lbaas15:27
*** ccamposr has quit IRC15:28
*** armax has quit IRC15:48
*** armax has joined #openstack-lbaas16:06
*** armax has quit IRC16:52
*** njohnston is now known as njohnston|lunch17:01
*** AlexStaf has quit IRC17:01
admin0gthiemonge, 24 hours .. pending update17:25
admin0hi all .. how do I delete a lb that is in  Pending Update mode17:26
johnsomadmin0 Under normal operations it will automatically go to ERROR once the retry timeouts have expired.17:26
johnsomDid you have a rabbit or host outage on your control plane?17:26
johnsom If it's been that long and you don't see retry WARNING messages scrolling in the controller logs (worker or health), it likely means a controller was forcefully killed (not a graceful shutdown) somehow while the controller had ownership (PENDING_*) and was mid-provisioning.17:28
admin0the controller is up and running .. it's a 3-controller HA setup (openstack-ansible)17:29
johnsomIf you have checked your logs, and it's not continuing to work on that instance (24 hours could still be valid if you changed the retry timeouts in the config file).17:29
admin0but even assuming something bad might have happened, it's impossible to change state and just kill this lb ?17:29
admin0just the defaults are used ( .. but i did not check what osa defaults are for octavia )17:30
johnsomadmin0 Yeah, up and running doesn't mean it wasn't forcefully killed in a bad way. It's the same as nova/neutron, we just call it out in your face a bit more. lol17:30
admin0i meant .. how do I kill this specific one now ?17:31
admin0i could not see any option to change state .. except going into the database .. which i don't like17:31
johnsomIf, and I mean if, you have checked the logs and it's not being worked on (don't do this to LBs that are still retrying), you can update the load balancer object in the DB to provisioning_status == ERROR, then failover or delete the load balancer.17:31
johnsomYeah, it's super dangerous to modify the state of the load balancer objects. You can really mess up the cloud.17:32
johnsomThe work we are doing on the amphorav2 is to address situations where, say the power was pulled from a controller mid-provisioning. However the bugs are still being worked out of that.17:33
johnsomIf it wasn't kill -9 or power pulled, all code paths lead back to a consistent and mutable state after the retries time out.17:34
johnsomEither ACTIVE if the cloud's services recover, or ERROR if they do not come back up in time.17:35
johnsomGraceful shutdowns will not lead to this as we can put the objects in a consistent state on shutdown17:37
*** ccamposr has joined #openstack-lbaas17:41
*** rpittau is now known as rpittau|afk17:42
admin0johnsom, journalctl shows no errors17:45
admin0inside the amphora image17:46
*** xgerman has joined #openstack-lbaas17:46
admin0how to map an amphora image from a lb uuid ?17:46
johnsomYeah, this has nothing to do with the load balancer or the amphora. It is purely a control plane (controller) issue.17:46
johnsomIf the operating_status of the LB is ONLINE, the LB/amphora are still passing traffic, etc.17:47
admin0operating state = online,  provisioning state = pending update, admin state up = yes17:48
admin0i can reach the controllers from inside the amphora .. they ping fine17:48
johnsomYep. The load balancer/amphora are happy and working fully. The last requested control plane change (via the API) did not complete its provisioning steps due to the controller being forcefully killed while it was trying to take the action.17:49
johnsomOk, let me try to step back and explain this.17:49
johnsomLet's say a user calls the Octavia API to change the maximum connections on their load balancer.17:50
johnsomBefore this call, the load balancer status is provisioning = Active, operating status = online.17:50
admin0WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='172.29.250.148', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fadd60040d0>: Failed to establish a new connection: [Errno 111] Connection refused'))17:50
admin0i can ping/ssh the amphora fine from this octaiva api controllers17:51
admin0they are up and running for a long time17:51
admin022 days17:51
johnsomYes, that WARNING message is the controller retrying to connect, that LB will be in PENDING_*.17:51
johnsomLet me finish explaining what is happening.17:51
admin0can a firewall in the provisioning ( terraform) prevent this internal communication ?17:51
johnsomSure, neutron security groups, firewalls, misconfigured switches, etc.17:52
admin0on what port of the amphora image is the controller trying to connect ?17:52
admin0i see 22 and 9443 allowed in the octavia_sec_group created17:53
johnsomOk, so, API call comes in. It's queued for a controller to start the provisioning. At this point the LB is marked PENDING_* as there is an assigned provisioning action.17:53
johnsomPer the WARNING above, it is 944317:53
*** rcernin has joined #openstack-lbaas17:54
johnsomA controller will pop the requested change off the queue and start the process of reconfiguring the load balancer. That controller instance has a lock and ownership of the LB.17:55
johnsomNow, let's say that API request takes 10 steps to complete. I.e. building a config, pushing it out, triggering reloads, etc.17:56
johnsomThen on step 4, someone kill -9 the controller that has ownership.17:56
admin0that did not happen in my case ( no one touched the controller, or even logged in ) .. the controllers are working fine17:57
admin0if a lb was created with strict firewall rules, can this also happen ?17:57
johnsomkill -9 does not allow the controller to clean up or finish. At that point the object is now stuck in the state it was when someone kill -9 it. The controller can no longer keep retrying or advancing through the tasks.17:57
johnsomNo, there is no way this could happen other than rabbit/mysql/or the controller being killed.17:58
admin0and this is false in my case17:58
admin0there are no errors from rabbit or mysql, and no controllers dying or giving errors17:58
*** rcernin has quit IRC17:59
admin0i only see the controller unable to connect to 9443 on the amphora ip17:59
admin0while the lb is up and i can ssh in just fine17:59
admin09443 does not work, ssh works fine17:59
johnsomIf you are seeing those WARNING messages in the logs scrolling, that means the controller is still retrying based on the timeouts in the config. It will eventually give up and move it to ERROR if it cannot reach the amphora.17:59
johnsomOk, so is that WARNING repeating in your log? or was that historical?17:59
johnsomObviously that WARNING can also happen at LB creation time, while nova is still booting the VM.18:01
johnsomSo if it's historical, that probably doesn't relate.18:01
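One way to tell whether that WARNING is live or historical is to tail the controller logs for it; a sketch, with systemd unit names that vary by deployment tooling (the grep string is taken from the message quoted above):

    # Follow the worker and health-manager logs, watching for the
    # amphora connection-retry WARNING. Unit names are an assumption;
    # openstack-ansible runs these in containers with its own naming.
    journalctl -f -u octavia-worker -u octavia-health-manager \
        | grep 'Could not connect to instance'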
admin0i killed the amphora image .. thinking the image was corrupt .. then it created a new amphora image with a new ip address.. it came up 3 times .. and then it said .. Mark READY in DB for amphora: c46ffb37-6c3f-4a7e-b178-9be1922e4ff0  .. but on the frontend side, the lb is still marked PENDING18:01
admin0if load balancer ID is   xyz, how do I map it to amphora image abc ?18:02
admin0so that i know i killed the right image18:02
johnsomImage or VM? Image is in glance, VM instance is in nova18:02
admin0as user, i see my lb uuid as   say abc123 .. and as octavia, i see amphora images as  xyz12318:03
admin0how to tell which amphora image uuid is used by which lb uuid18:03
johnsomI think you mean amphora VM instance. You can map those with "openstack loadbalancer amphora list --loadbalancer <LB ID>". That is different than the image ID which is the qcow2 image ID as it is stored in glance to boot future VMs.18:04
admin0sorry .. amphora vm uuid18:04
admin0my mistake in typing .. sorry18:04
admin0i meant the vm uuid18:04
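Spelled out, the mapping johnsom describes (IDs below are placeholders):

    # List the amphorae that belong to one load balancer...
    openstack loadbalancer amphora list --loadbalancer <LB ID>
    # ...then show one record to get its nova VM ID (compute_id).
    openstack loadbalancer amphora show <AMPHORA ID>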
admin0i had killed the wrong lb :)18:06
admin0wrong amphora image18:06
admin0will killing that amphora image help fix stuff ?18:06
johnsomNo, as the amphora VM instance has no relation to provisioning_status18:06
admin0ok..  assuming something bad happened, how do I change its status and delete this lb entry ?18:07
johnsomprovisioning_status is the status from the control plane side.18:07
johnsomAh, DON'T delete from the DB.18:07
admin0checking if there is something like we can do for nova , cinder etc .. change its state to --state available and then issue a delete18:08
admin0in pending update status, i cannot delete it, or pool or listeners18:08
johnsomOk, so assuming you have no scrolling messages in the logs showing it's still being worked on.18:08
admin0logs -- is it the amphora image logs or oactaiva logs ?18:08
johnsomRight, it's still owned, so you can't make additional changes to it.18:08
johnsomOctavia worker and health manager logs. Control plane18:08
johnsomAmphora logs will have no knowledge of any of this18:09
admin0on all 3 controllers, following the journalctl , i don't see anything recurring18:12
johnsomOk, yeah, so a controller was killed somehow. So, we need to switch the provisioning_status to ERROR on the impacted load balancer.18:13
johnsomDo you know how to connect to the mysql octavia database in your cloud?18:13
admin0yep18:15
johnsomupdate load_balancer set provisioning_status = 'ERROR' where id = 'd9775ed5-40e2-4641-bdd6-89b61102d467';18:16
admin0where ID = lb id ?18:16
johnsomWhere ID is the load balancer ID.18:16
admin0ok18:16
johnsomThen, you can either "openstack loadbalancer failover <LB ID>" or delete it18:16
johnsomIn the future, just be sure that a controller isn't still working on it, as if you do this when a controller still actually owns the object, you will get a mess and a broken LB.18:18
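Collected into one hedged sequence (run it only after confirming the logs are quiet, per the warning above; the UUID is the example from earlier, and --cascade on delete is an addition not mentioned in the discussion):

    # 1. Verify no controller still owns the LB: no retry WARNINGs
    #    scrolling in the worker/health-manager logs.
    # 2. Release the PENDING_* lock by marking the LB ERROR in the DB.
    mysql octavia -e "update load_balancer \
        set provisioning_status = 'ERROR' \
        where id = 'd9775ed5-40e2-4641-bdd6-89b61102d467';"
    # 3. Then EITHER rebuild it in place ...
    openstack loadbalancer failover d9775ed5-40e2-4641-bdd6-89b61102d467
    # ... OR delete it instead (--cascade also removes listeners/pools).
    openstack loadbalancer delete --cascade d9775ed5-40e2-4641-bdd6-89b61102d467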
admin0but my controllers are fine18:18
admin03 load balancers were created at the same time ... one after another .. this is the middle one .. 2 of them worked fine18:18
johnsomWell, if this LB was stuck in PENDING_*, at some point in the past one of them was not.18:19
johnsomIf you look at the flow documentation here: https://docs.openstack.org/octavia/latest/contributor/devref/flows.html18:19
johnsomYou will see that every controller flow will either complete or revert to ERROR should something software wise go wrong. It's only when the controller is killed that the status does not advance.18:20
johnsomIf you have the controller logs for the time period you created those three, I am happy to go through them and see if I can isolate what/when it happened.18:22
admin0i need to also reduce the time period to say like 10 mins max18:27
admin0to prevent this from getting stuck forever18:27
johnsomYeah, most deployments have retry timeouts like that. However, some argue they want it to try forever....18:28
johnsomThe defaults are fairly long if I remember correctly.18:28
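If memory serves, the amphora-connection retry knobs live in octavia.conf under [haproxy_amphora] (among other timeouts that affect how long a flow holds the PENDING_* lock); a sketch of capping them at roughly ten minutes (120 retries x 5s), using crudini as one arbitrary way to edit the INI:

    # Cap how long the controller retries an unreachable amphora before
    # giving up and moving the LB to ERROR (~10 minutes total here).
    crudini --set /etc/octavia/octavia.conf haproxy_amphora connection_max_retries 120
    crudini --set /etc/octavia/octavia.conf haproxy_amphora connection_retry_interval 5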
*** ccamposr__ has joined #openstack-lbaas18:42
*** jamesdenton has quit IRC18:43
*** jamesdenton has joined #openstack-lbaas18:43
*** ccamposr has quit IRC18:44
admin0johnsom, is an api/command coming soon to set it to error or other states, to not have to go via the db19:00
johnsomNo, never, it is too dangerous19:01
johnsomAs I mentioned, amphorav2 will help resolve the issue of forceful kills of the controllers.19:01
*** sapd1 has quit IRC19:02
*** njohnston|lunch is now known as njohnston19:02
*** sapd1 has joined #openstack-lbaas19:06
admin0just wondering .. why/how is openstack lb set --status error  dangerous ?    .. its just a field update in the db isnt it ?19:06
johnsomWe have had many long conversations about this at PTGs, etc.19:06
johnsomOh, it's not, PENDING_* is an object lock. It's how the HA control plane works to make sure multiple controllers aren't trying to take action on the same load balancer at the same time. If you aren't super careful to make sure one of the controllers doesn't have ownership, you can completely corrupt the LB, break it to the point it isn't passing traffic, leave broken objects in nova and neutron that may cause19:08
johnsomthe load balancer to not work again, etc.19:08
*** haleyb has quit IRC19:27
*** dougwig has quit IRC19:27
*** bbezak has quit IRC19:28
*** headphoneJames has quit IRC19:28
*** f0o has quit IRC19:28
*** bbezak has joined #openstack-lbaas19:28
*** dougwig has joined #openstack-lbaas19:28
*** headphoneJames has joined #openstack-lbaas19:28
*** f0o has joined #openstack-lbaas19:30
admin0thanks19:37
admin0when is v2 coming up ?19:37
johnsomIt's in Victoria, but we are working through bugs with it19:37
admin0ok19:37
johnsomActually, it was in earlier releases too, but not so usable19:37
*** jamesdenton has quit IRC20:06
*** jamesdenton has joined #openstack-lbaas20:06
*** ccamposr has joined #openstack-lbaas20:12
*** irclogbot_0 has quit IRC20:14
*** ccamposr__ has quit IRC20:15
*** irclogbot_2 has joined #openstack-lbaas20:16
openstackgerritMerged openstack/octavia stable/ussuri: Add missing log line for finishing amp operations  https://review.opendev.org/c/openstack/octavia/+/75286020:22
*** TrevorV has quit IRC21:12
*** gcheresh has quit IRC21:22
openstackgerritBrian Haley proposed openstack/octavia-tempest-plugin master: Remove duplicate operating status check  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77188721:41
*** rcernin has joined #openstack-lbaas21:54
openstackgerritBrian Haley proposed openstack/octavia-tempest-plugin master: DNM - SIP TCP debugging  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77188821:57
*** rcernin has quit IRC21:59
*** rcernin has joined #openstack-lbaas22:38
*** rcernin has quit IRC22:56
*** rcernin has joined #openstack-lbaas22:57
*** yamamoto has quit IRC23:04
*** yamamoto has joined #openstack-lbaas23:04
openstackgerritBrian Haley proposed openstack/octavia-tempest-plugin master: DNM - SIP TCP debugging  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/77188823:10
*** luksky has quit IRC23:51
*** lemko has quit IRC23:51
*** lemko6 has joined #openstack-lbaas23:51
