Wednesday, 2023-06-07

pyjouHi, I have a quick question. Has the octavia project migrated to launchpad, or do we still need to work with storyboard?13:30
tweininghi. the channel topic says it :)13:30
tweininglaunchpad13:30
pyjouTrue. Sorry, I didn't see it. Thank you :)13:31
opendevreviewPierre-Yves Jourel proposed openstack/octavia master: Add specs to resize a load balancer  https://review.opendev.org/c/openstack/octavia/+/88549013:39
opendevreviewPierre-Yves Jourel proposed openstack/octavia master: Add specs to resize a load balancer  https://review.opendev.org/c/openstack/octavia/+/88549014:38
mnasergthiemonge: did you ever get to the bottom of https://storyboard.openstack.org/#!/story/2008226 ?15:26
* mnaser currently has a broken lb in my hands with this issue15:26
gthiemongemnaser: nope, it has happened randomly in the CI, I haven't seen it for a long time15:41
mnasergthiemonge: ah gr, i'm trying to understand why/how it's happening right now, it looks like the gunicorn process keeps on respawning and then timing out, trying to strace my way out of this15:41
gthiemongeI think that something might be really slow in the amphora, and the client times out, so when the server replies the client is gone15:42
mnasergthiemonge: yeah except in this case it's respawning the worker non-stop as it times out15:42
mnaserwrite(7, "[2023-06-07 15:41:45 +0000] [808] [CRITICAL] WORKER TIMEOUT (pid:19986)\n", 72) = 7215:43
mnaserand then it boots a new worker and goes back to being stuck, but i just got my first trace of it rebooting so lets see15:43
gthiemongemnaser: what version do you have? what distrib for the amphora?15:44
gthiemongemnaser: do you see timeouts in the Octavia worker logs?15:45
gthiemongemaybe if we can isolate the API call that takes too much time, we could understand what's happening15:46
mnasergthiemonge: yeah, octavia-worker kept looping, trying to contact the worker and timing out, which led me to start doing a simple 'curl' that was also timing out (unless the reason my call times out is that the amphora agent can only do 1 req at a time)15:46
mnaserlet me see what version.. it is ancient in here, pushing to get it updated :)15:47
mnasera nice ussuri... :(15:47
mnaseralso it looks like the image is 18.04 bionic15:47
mnaserunfortunately strace is not helping too much bc tls, let me see15:48
johnsommnaser We do limit the gunicorn to 1 worker on purpose15:48
mnaserok so that means it's indeed something taking too long and timing out, so my `curl` test is useless15:48
johnsomBack in that time frame there were some kernel issues with bringing up the interfaces. The ifup calls would take forever or fail occasionally. It was specific kernel versions in 18.0415:50
mnaserthe thing is this will be a perfectly happy amphora until it gets stuck in PENDING_UPDATE15:50
mnaserso something starts this chain15:50
mnaseronly the worker talks to the api eh?15:51
johnsomRight15:51
mnaser2023-06-07 15:51:15.022 20 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [req-241a87d1-af88-4a1a-84b2-728e64845018 - 4d5940f2bb1040ee9d6468a7d1e8bf64 - - -] Could not connect to instance. Retrying.: requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='10.8.15.55', port=9443): Read timed out. (read timeout=10.0)15:51
gthiemongewell, housekeeping and health-manager too (to the Amphora API)15:51
mnaserits indeed having its way at it, except i think we have it set at *sigh* some 1500 retries so15:51
mnaserlet me see what triggers these constant attempts15:52
johnsomYeah, worker timing out the connection due to no response in time from the amp agent.15:52
mnasernothing in octavia worker logs, but i think there was a batch update that it tried to do and couldnt talk/reach15:53
mnaseri wonder if i can use gmr to pull the stack15:53
johnsomIt probably failed over for some reason then landed in this situation. We do recommend in production dropping the retry attempt numbers, otherwise people think it is "stuck" in a pending state, where really it is just endlessly banging its head retrying whatever is failing.15:54
johnsomIt is only one or two deployments that have slow setups that led to those crazy retry numbers.15:54
mnaserright, that puts two issues here -- faster failovers.. but also figuring out why it is stuck here in the first place15:56
mnaserim pulling gmr reports to try and see if i can get a stack of why/how its stuck15:57
gthiemongethere's another similar story: https://storyboard.openstack.org/#!/story/201001015:58
mnaserthere, i got the stack trace15:59
gthiemongeguys, we have the weekly meeting in 1min, can we resume the discussion after it?15:59
mnaserah yes sorry15:59
mnaserill toss this first :) https://www.irccloud.com/pastebin/jG1sdOEr/16:00
gthiemonge#startmeeting Octavia16:00
opendevmeetMeeting started Wed Jun  7 16:00:41 2023 UTC and is due to finish in 60 minutes.  The chair is gthiemonge. Information about MeetBot at http://wiki.debian.org/MeetBot.16:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.16:00
opendevmeetThe meeting name has been set to 'octavia'16:00
gthiemongeo/16:00
johnsomo/16:00
pyjouo/16:00
jdacunha2o/16:00
tweining_o/16:00
QGo/16:01
gthiemonge#topic Announcements16:01
gthiemonge* OpenInfra Summit Vancouver 202316:01
gthiemongereminder: the openinfra summit is next week16:02
gthiemongejohnsom will host the OpenStack Octavia Forum Session (Wed, June 14, 9:00am - 9:30am)16:02
johnsomThe Octavia Forum session is at 9am on Wednesday16:02
gthiemonge(thank you johnsom BTW!)16:02
gthiemongejohnsom: I think it's the same time as our weekly meeting, right?16:03
johnsomYes16:03
gthiemongeok16:03
gthiemongeso I'm proposing that we cancel next week's meeting, are you fine with it?16:03
tweining_I noticed there is no Octavia PTG session in https://ptg.opendev.org/ptg.html yet16:04
gthiemongenope, we haven't scheduled a PTG session16:04
tweining_I think it's fine to cancel16:04
gthiemongeack16:05
gthiemongeany other announcements?16:05
gthiemongenop, ok16:07
gthiemonge#topic CI Status16:07
gthiemongeI didn't find the time to investigate the failures in the FIPS jobs16:07
gthiemongehttps://zuul.openstack.org/builds?job_name=octavia-v2-dsvm-tls-barbican-fips&skip=016:07
gthiemongeand I still need to open a launchpad issue for this error16:08
gthiemongeI think that's it for CI Status16:09
gthiemonge#topic Brief progress reports / bugs needing review16:09
johnsomLooks like it can't ssh into the cirros image16:09
gthiemongeyeah16:09
johnsomSo test setup fails16:09
pyjouI have an old change to review16:10
gthiemongewe need to add a testing patch that fetches the logs of the VMs16:10
pyjouhttps://review.opendev.org/c/openstack/octavia-dashboard/+/86606416:10
opendevreviewJulian DA CUNHA proposed openstack/octavia master: Add new spec Let's Encrypt support  https://review.opendev.org/c/openstack/octavia/+/87728116:11
gthiemongepyjou: nice feature, I'll review/test it16:12
pyjouAnd I've added a spec to add load balancer resize as a new feature.16:12
pyjouhttps://review.opendev.org/c/openstack/octavia/+/88549016:12
gthiemongepyjou: nice too!16:12
tweining_yeah, that spec looks interesting16:14
johnsomCool, I will review that spec. At first glance it's aligned to our vision for that.16:14
pyjouGreat :) 16:15
johnsomYou have to failover as nova resizing causes a reboot which drops the encrypted ram disk content16:15
gthiemongeyeah this is what we discussed during the last PTG16:16
gthiemongeI'm working on the Multi-Active BGP support16:18
pyjouYeah, that's exactly what I'm proposing. It's not a resize in the nova sense. It's just a failover with a new octavia flavor as a param of the failover flow.16:18
gthiemongeI need to propose a new spec, I hope I will upload it before the end of the week (maybe I'm too optimistic)16:18
gthiemonge#topic Open Discussion16:23
gthiemongeanything else folks?16:24
jdacunha2could you please review ACMEv2 RFE again ?16:25
jdacunha2https://review.opendev.org/c/openstack/octavia/+/87728116:26
gthiemongeack16:26
gthiemongeok i guess that's all for today!16:27
gthiemongethank you folks!16:27
gthiemonge#endmeeting16:27
opendevmeetMeeting ended Wed Jun  7 16:27:53 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)16:27
opendevmeetMinutes:        https://meetings.opendev.org/meetings/octavia/2023/octavia.2023-06-07-16.00.html16:27
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/octavia/2023/octavia.2023-06-07-16.00.txt16:27
opendevmeetLog:            https://meetings.opendev.org/meetings/octavia/2023/octavia.2023-06-07-16.00.log.html16:27
gthiemongeback to you mnaser 16:27
jdacunha2quit16:27
mnasergthiemonge: i think one thing i've identified is that the amphora was overloaded at some point or something, i see some traces in the kernel and the rtc time was like 40 seconds behind the kernel time16:28
gthiemongemnaser: this is really weird, the API call initiated by get_api_version is the simplest API call we may have16:28
gthiemongeoh16:28
mnaserso that tells me there was a bunch of hangs that got us in a super weird state16:28
mnaserso i think for now we can put octavia out of the equation till i figure this out16:29
gthiemongemnaser: ack, ping us if there's anything else we can do16:29
opendevreviewMerged openstack/octavia stable/2023.1: Send IP advertisements when plugging a new member subnet  https://review.opendev.org/c/openstack/octavia/+/88072317:16
opendevreviewMerged openstack/octavia stable/zed: Send IP advertisements when plugging a new member subnet  https://review.opendev.org/c/openstack/octavia/+/88072417:58
opendevreviewMerged openstack/octavia stable/wallaby: Send IP advertisements when plugging a new member subnet  https://review.opendev.org/c/openstack/octavia/+/88072717:58
opendevreviewMerged openstack/octavia stable/yoga: Send IP advertisements when plugging a new member subnet  https://review.opendev.org/c/openstack/octavia/+/88072517:58
opendevreviewGregory Thiemonge proposed openstack/octavia-tempest-plugin master: DNM/WIP Testing server output for FIPS  https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/87766719:13
opendevreviewGregory Thiemonge proposed openstack/octavia master: DNM/WIP Testing FIPS job  https://review.opendev.org/c/openstack/octavia/+/88554019:16
gthiemongeit's not only FIPS that is failing, c9s job fails too: https://zuul.openstack.org/builds?job_name=octavia-v2-dsvm-scenario-centos-9-stream&branch=master&skip=019:48
gthiemongehttps://paste.opendev.org/show/b0WFpGOQQ7854jHEqGic/20:19
