Tuesday, 2022-03-22

07:43 <Brace> Ok, after doing those fixes recommended a few days ago, all the Neutron ports have now built, yet we still can't connect to any instances.
07:43 <Brace> https://paste.openstack.org/show/bY9wtajyDwd12D5j6M0z/ - this is the error message we're seeing, anyone got any ideas?
07:44 <Brace> Also the privsep-helper process for L3 has 1070+ 'pipe' file descriptors attached
08:33 <jrosser> Brace: what version of openstack are you using - i only see really really old bugs related to that
08:36 <noonedeadpunk> admin1: currently in Poland
08:38 <noonedeadpunk> But I've been travelling _a lot_ during the last month, as I'm originally from Ukraine
08:39 <Brace> jrosser: we're on Victoria 22.3.3
08:42 <noonedeadpunk> Brace: and you were increasing the open file limit, right?
08:43 <noonedeadpunk> or what were the recommended fixes? :)
08:44 <Brace> noonedeadpunk: yeah, you recommended https://paste.openstack.org/show/bY235whPe5LKkFFzo6pn/
08:44 <Brace> so we increased the LimitNOFILE and ended up rebooting the whole cluster
08:45 <Brace> however we have 1000 projects and that means we end up with 5000 ports in build, which is why it's taken me a few days to come back
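For context, a minimal sketch of the kind of change being described - counting the privsep-helper file descriptors and raising LimitNOFILE on the l3 agent via a systemd drop-in. The unit name, grep pattern and limit value are assumptions; the original paste may have differed.

    # count open fds of the l3 privsep-helper (adjust the pattern to match your process)
    ls /proc/$(pgrep -n -f 'privsep-helper.*l3_agent')/fd | wc -l

    # raise the open-file limit for the l3 agent with a systemd drop-in (value is an example)
    mkdir -p /etc/systemd/system/neutron-l3-agent.service.d
    cat > /etc/systemd/system/neutron-l3-agent.service.d/limits.conf <<'EOF'
    [Service]
    LimitNOFILE=65536
    EOF
    systemctl daemon-reload && systemctl restart neutron-l3-agent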
08:49 <noonedeadpunk> um... that change should not ever be that breaking. and it should only apply to l3-agents, kind of
08:52 <Brace> every time we do something with Neutron, it ends up with many thousands of ports in build
09:15 <jrosser> Brace: this might be a question more for #openstack-neutron as i think there have been improvements in how neutron handles lots of networks
09:15 <jrosser> but i couldn't say which release any of that was done in
09:17 <Brace> jrosser: ok, thanks for that, I'll ask in there
09:19 <jrosser> there may also be tuning you can do around the number of workers, but i've not really any experience there
10:13 <noonedeadpunk> that said - we never saw such behaviour with ports going into the build state...
10:14 <noonedeadpunk> I guess the top problem for us was always l3 being rebuilt or misbehaving...
10:14 <jrosser> 1000 projects really is a lot though
10:18 <noonedeadpunk> I bet we have 10000+ ports in some regions...
10:22 <noonedeadpunk> never checked the number of projects though...
10:23 *** arxcruz is now known as arxcruz|ruck
11:08 *** dviroel_ is now known as dviroel
11:22 <Brace> jrosser: yeah, I'm vaguely aware that 1000 projects is bad, but we're using it to manage resources
11:22 <Brace> we don't seem to have much more than about 6000 ports at the moment
11:48 *** prometheanfire is now known as Guest0
11:52 *** ChanServ changes topic to "Launchpad: https://launchpad.net/openstack-ansible || Weekly Meetings: https://wiki.openstack.org/wiki/Meetings/openstack-ansible || Review Dashboard: http://bit.ly/osa-review-board-v4_1"
12:08 <noonedeadpunk> so what I'm saying is that it's not _that_ many ports, not enough to cause issues IMO...
12:08 <noonedeadpunk> Brace: do you use ovs or lxb?
12:13 <Brace> is lxb - linuxbridge?
12:15 <mgariepy> yes
12:15 <Brace> yes, we use linuxbridge then
12:16 <noonedeadpunk> oh, ok, we use ovs at that scale...
12:17 <mgariepy> ovs or ovn would perform a lot better i guess
12:18 <noonedeadpunk> well, it's arguable I'd say...
12:18 <noonedeadpunk> I was running a big enough deployment with lxb, but that was quite some time ago
12:18 <mgariepy> lxb with 500 routers is not good...
12:19 <mgariepy> if it runs smoothly it's ok, but if you need to migrate the load across servers it takes forever to sync.
12:19 <noonedeadpunk> is ovs somewhat different lol
12:20 <mgariepy> lol
12:20 <Brace> heh
12:20 <mgariepy> lxb had some nasty issues (getting sysctl stuff takes a lot of time when you have a lot of ports)
12:20 <noonedeadpunk> I mean - to cleanly move 500 l3 routers off a net node with ovs, in serial, takes like 30-45 mins?
12:21 <mgariepy> 30 min is fast.. compared to lxb..
12:21 <mgariepy> with lxb i saw a couple of hours.
12:22 <noonedeadpunk> with lxb I never had to do that :p as an l3-agent restart was recovering properly. and with ovs you _always_ have ~10 routers that can't recover on their own, and overall router recovery can take time as well...
12:22 <Brace> well, when we did a config change a while back it took 4 days to rebuild all the ports
12:22 <noonedeadpunk> but dunno... I haven't used lxb for a while now, so it can have nasty stuff today as well...
12:23 <noonedeadpunk> (and we never-ever were rebuilding ports)
12:23 <mgariepy> i don't think lxb gets a lot of work on it these days
12:23 <Brace> we've had a couple of times when networking has just failed (and we can't figure out why) so we just restart all the networking components
12:23 <mgariepy> i had to when a network node crashed.
12:23 <Brace> and after a few days it starts working again
12:24 <Brace> but this time, we've done that and it's still dead
12:24 <mgariepy> the network part is the most flaky imo :)
12:24 <noonedeadpunk> so you mean ports that are part of l3?
12:25 <mgariepy> Brace: are you using ha-routers?
12:25 <Brace> mgariepy: so I'm learning
12:26 <Brace> mgariepy: you mean haproxy? yup, that's in there
12:26 <mgariepy> no. vrrp for the routers
12:27 <Brace> I *think* so, there's lots of lines talking about vrrp in the logs
12:27 <Brace> Mar 22 12:25:38 controller-3 Keepalived_vrrp[873858]: VRRP_Instance(VR_9) removing protocol Virtual Routes
12:27 <Brace> tbh, we used openstack-ansible a number of years ago to deploy the cluster, but normally it broadly works, so my knowledge of the exact config is pretty poor
12:27 <mgariepy> openstack router show <router_name_here> -c ha -f value
12:29 <mgariepy> if you use ha-routers it takes more ports.
12:29 <Brace> that command came back with True
12:30 <mgariepy> what version of openstack are you running ?
12:30 <Brace> 22.3.3
12:31 <mgariepy> ok
12:31 <jamesdenton> so are all of your routers down right now? networking is busted?
12:32 <Brace> nope, the routers are ACTIVE and UP, however we can't connect to any of the servers
12:32 <Brace> if we build new servers, they're equally broken
12:32 <jamesdenton> can you reach the servers from the respective dhcp namespaces?
12:32 <jamesdenton> that would be L2 adjacent.
12:33 <jamesdenton> are you using vxlan or vlan for those tenant networks?
12:33 <Brace> vxlan, I'm fairly sure
12:33 <mgariepy> back in kilo, with over 50 or 100 routers in ha mode it took longer to recover than with non-ha ones using the scheduler script.
12:34 <Brace> I sort of understand what you're saying there, but no idea how to actually test reaching the servers via l2
12:34 <jamesdenton> "back in kilo" :D
12:34 <mgariepy> lol :) yep.
12:34 <mgariepy> i know it's a long time ago.. lol
12:34 <jamesdenton> if you have access to the controllers, there should be a qdhcp namespace for each network. "ip netns list" will list them all
12:35 <jamesdenton> you correlate the network uuid of your VM to the qdhcp namespace name - the ID will be the same. You should then be able to ssh or ping from that namespace to the VM. It's essentially avoiding the routers.
12:35 <Brace> ok, so I have stuff like this - qrouter-00eb4cab-d5d5-4d99-b6d1-bb2efd043903
12:36 <Brace> so I just ssh to that?
12:36 <mgariepy> maybe you can tweak: https://docs.openstack.org/neutron/ussuri/configuration/neutron.html#DEFAULT.max_l3_agents_per_router
12:36 <admin1> Brace, how many network nodes do you actually have vs networks?
12:36 <jamesdenton> yes, that's for each router, and that would work too, if you can find the router that sits in front of your tenant network. the command would be something like "ip netns exec qrouter-00eb...3903 ping x.x.x.x"
12:37 <jamesdenton> or "ip netns exec qdhcp-3b2e...0697 ssh ubuntu@x.x.x.x"
12:37 <mgariepy> also `dhcp_agents_per_network` is a good candidate to remove some ports.
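A minimal sketch of those two scheduler knobs as they would sit in neutron.conf on the controllers; the values are illustrative only, not recommendations from this discussion (restart neutron-server after changing them).

    # /etc/neutron/neutron.conf
    [DEFAULT]
    # number of L3 agents each HA router is scheduled to (upstream default 3)
    max_l3_agents_per_router = 2
    # number of DHCP agents serving each network; every extra agent adds a port per network
    dhcp_agents_per_network = 2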
12:38 <Brace> admin1: we have three controllers, we don't have dedicated network nodes
12:39 <lowercase> One issue i've had in the past is that when neutron goes down and all the interfaces on the routers attempt to come up at the same time, rabbitmq gets overwhelmed and the network doesn't come up. Sorry if this isn't relevant, I just popped in for a sec.
12:39 <Brace> lowercase: nope, it's useful information
12:40 <jamesdenton> if HA is enabled (which it appears to be) you will have a qrouter namespace on EACH controller/network node, but only one will be active. They use VRRP across a dedicated (vxlan likely) network to determine the master
12:40 <jamesdenton> so, all three will have an 'ha' interface, but only one should have a qg and qr interface
12:40 <jamesdenton> that's the one you'd want to test from
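A quick way to see which node is actually carrying the router, per the explanation above: the active instance shows qg-*/qr-* interfaces in its qrouter namespace, while the standbys only carry an ha-* interface. The controller hostnames are assumptions based on the log.

    for host in controller-1 controller-2 controller-3; do
      echo "== $host =="
      ssh $host "ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip -brief addr"
    done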
12:42 <Brace> ok
12:42 <mgariepy> with: openstack network agent list --router <router_name>
12:42 <Brace> so I need to connect to dhcp-b27..... and not the qrouter
12:43 <jamesdenton> both would be a good test
12:43 <jamesdenton> and yes, that openstack network agent list command should tell you which one it thinks is active.
12:43 <Brace> ok, so sshing via the qrouter just gives a 'No route to host'
12:44 <jamesdenton> can you post the command?
12:44 <Brace> ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ssh ubuntu@10.72.103.62
12:44 <jamesdenton> ok, try: ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip addr
12:45 <jamesdenton> what interfaces do you see, and can you pastebin that
12:45 <Brace> https://paste.openstack.org/show/bPLAlfK6Wuex68WnZ1Pw/
12:46 <jamesdenton> ok, try the 192.168.5 addr of the VM instead
12:46 <jamesdenton> the fixed IP vs the floating IP
12:46 <jamesdenton> 10.72.103.62 may not be a floating IP or this may not be the right router, hard to tell
12:46 <Brace> that's a connection refused
12:47 <jamesdenton> can you post the 'openstack server show' output for the VM?
12:49 <Brace> https://paste.openstack.org/show/bFiL16SeU7ROmH1J3amH/
12:49 <jamesdenton> ok, so ssh to 192.168.5.108 is connection refused?
12:50 <Brace> no, I was sshing to 192.168.5.1
12:50 <jamesdenton> ahh ok, try 192.168.5.108
12:50 <Brace> .108 works, just tried it now
12:50 <jamesdenton> ok good. well, one problem, then, is that 10.72.103.62 is the floating IP but does not appear on the qg interface
12:50 <jamesdenton> as a /32
12:51 <Brace> ok
12:51 <jamesdenton> trying to think of the easiest way to rebuild that router without an agent restart. might be able to set admin state down for the router, watch the namespaces disappear, then admin up
12:52 <jamesdenton> can you check the other two controllers and do that same "ip netns exec qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3 ip addr" and post?
12:52 <jamesdenton> also, curious to know the specs of those controller nodes. cpu/ram/etc
12:53 <Brace> 16 core Xeons, 192gb ram, 5x363gb hdds for ceph
12:54 <Brace> 3 controller nodes, with 17 compute nodes
12:54 <jamesdenton> ok, all identical specs then?
12:54 <jamesdenton> thanks
12:56 <Brace> the compute nodes have a lower spec, but the controller nodes are all the spec above
12:58 <Brace> https://paste.openstack.org/show/bCLxmVkZPwM6KhTExrTR/
12:59 <jamesdenton> ok, thanks
12:59 <jamesdenton> just making sure there wasn't some sort of split-brain issue
13:01 <Brace> ah ok
13:01 <jamesdenton> ok, so you can try disabling the router with this command: openstack router set --disable b27d7882-d19f-4b8a-a34f-11e7d7821ba3. That should tear down the namespaces for that router only. Then wait a min, and issue: openstack router set --enable b27d7882-d19f-4b8a-a34f-11e7d7821ba3. That should implement the namespaces and build out the floating IPs.
13:01 <jamesdenton> you might watch the l3 agent logs simultaneously across the three controllers to see if there are any errors
13:09 <Brace> ok
13:13 <Brace> nope, can't connect to the server on the 10. ip
13:14 <jamesdenton> can you post that ip netns output again from all 3?
13:14 <jamesdenton> and also, can you find the port that corresponds to 10.72.103.62 and post the 'port show' output?
13:21 <Brace> https://paste.openstack.org/show/b6dbUxyrRWMpX0sdQuEc/
13:23 <jamesdenton> ok, so if you see c2 and c3, they both have interfaces for this router. Only C3 has the correct one by the looks of it
13:23 <jamesdenton> so there is something up with C2 - i'm wondering if the namespace never got torn down.
13:24 <jrosser> is it odd that 10.72.103.133/32 is on both C2 and C3 as well
13:24 <jamesdenton> you might try the disable command again, wait a min, and check to see if the namespaces disappear from all 3 controllers. if C2 is still there, then the agent may not be processing that properly
13:24 <jamesdenton> jrosser: i think that namespace never went away
13:25 <jrosser> right
13:25 <jamesdenton> Brace: if you perform that check, and the namespace on C2 is still there, you can try to destroy it with "ip netns delete qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3". then re-enable the router. the logs on C2 may help indicate what's happening, and/or rabbitmq may reveal stuck messages or something
13:26 <jamesdenton> restarting the L3 agent *only* on C2 could help, too
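A sketch of that recovery sequence on the controller holding the stale namespace (C2 here); the unit name assumes a typical Victoria/openstack-ansible layout.

    # on C2: confirm the namespace is genuinely stale after disabling the router
    ip netns list | grep b27d7882
    ip netns delete qrouter-b27d7882-d19f-4b8a-a34f-11e7d7821ba3
    systemctl restart neutron-l3-agent
    # re-enable the router and watch this node's l3 agent for errors
    openstack router set --enable b27d7882-d19f-4b8a-a34f-11e7d7821ba3
    journalctl -u neutron-l3-agent -f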
13:26 <Brace> ok, I'll give that a bash, but I'm going to grab some lunch first
13:26 <Brace> but here is the port show pastebin: https://paste.openstack.org/show/b5RuQSovWmz0K6pSvRri/
13:26 <jamesdenton> good idea :D
13:27 <jamesdenton> cool, thanks. is this a new router with only 2 floating IPs?
13:27 <Brace> thanks for all the help so far, much appreciated!
13:27 <jamesdenton> just want to make sure
13:27 <Brace> yes, that's the router of the instance that I've been doing all the troubleshooting on so far
13:27 <Brace> port even
13:27 <jamesdenton> cool, thanks
13:27 <Brace> openstack port show d04ed8a1-3313-4a14-bbd3-8a6057bc52e8
13:28 <Brace> that was the command I used
13:28 <jamesdenton> perfect. i'm glad to see it turned up in the C3 namespace.
14:22 <Brace> jamesdenton: so we have restarted the l3 agent on C2 several times (and rebooted C2), so I'm not sure it'll do much tbh
14:27 <Brace> I'm going to try the disable command again
14:31 <Brace> so I deleted the router namespace as per ip netns delete qrouter-b27....
14:32 <Brace> but a qrouter-b27d7.... is still there on C3
14:41 <jamesdenton> sure - is the router re-enabled?
14:41 <Brace> no, I didn't re-enable it
14:41 <spatel> noonedeadpunk: hey! around?
14:44 <jamesdenton> ok - so you disabled the router and the namespace didn't disappear from C3? weird. Can you check to see if there are any agent queues in rabbit that have messages sitting there?
14:44 <noonedeadpunk> semi-around
14:45 <Brace> nope, according to haproxy nothing much happening
14:45 <Brace> I guess the best way to look would be to go into the rabbit container and look at the queues in there?
14:46 <spatel> noonedeadpunk: do you have any example yaml file for k8s to create hello-world with an octavia lb to expose the port? my yaml doesn't spin up a LB :(
14:47 <noonedeadpunk> nope, I don't :(
14:48 <noonedeadpunk> trying to avoid messing with k8s as much as I can :)
14:48 <Brace> nope, looking at the neutron rabbit queues, they're all empty
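For reference, one way to do that rabbit check in an openstack-ansible deployment - attach to the rabbitmq container on a controller and look for agent queues with messages piling up. The container lookup assumes an LXC-based deployment.

    lxc-attach -n $(lxc-ls -1 | grep rabbit | head -n1)
    rabbitmqctl list_queues name messages consumers | grep -Ei 'l3|dhcp|agent'
    # non-zero "messages" counts on agent queues would point at stuck or unconsumed notifications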
14:51 <spatel> what do you mean by avoiding messing with k8s?
14:51 <NeilHanlon> spatel: can you share what you have so far?
14:51 <jamesdenton> Brace: ok cool. Very odd the namespace didn't get destroyed. you can try to delete it on there, then turn up the router again and see if it gets built across all 3 and if only 1 becomes master.
14:53 <spatel> NeilHanlon: i have this example code to spin up an app. I am trying to expose my app port via Octavia but it doesn't work. I mean i am not seeing k8s create any LB to expose the port - https://paste.opendev.org/show/b7jCTenIZdMX3VFw5Rv2/
14:53 <spatel> Just trying to understand how k8s integrates with octavia
14:55 <NeilHanlon> have you deployed the octavia ingress controller?
14:55 <jamesdenton> ^^
14:55 <NeilHanlon> https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/octavia-ingress-controller/using-octavia-ingress-controller.md
14:58 <spatel> NeilHanlon: deploy an ingress controller???
14:59 <spatel> This is my first time playing with k8s with octavia, so sorry if i missed something or don't understand the steps
14:59 <NeilHanlon> spatel: check out that github link for the octavia ingress controller. Octavia doesn't natively interact with Kubernetes
15:00 <NeilHanlon> you need the octavia ingress controller running in your kubernetes cluster to facilitate using Octavia to load balance for kubernetes
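A minimal sketch of verifying that path once the octavia-ingress-controller from the linked doc is running; the resource names are placeholders and the deployment steps themselves are in the linked guide.

    kubectl get pods -A | grep -i octavia            # controller pod present and Running?
    kubectl get ingress hello-world -o wide          # after creating an Ingress for the app
    openstack loadbalancer list                      # a new LB should appear and go ACTIVE
    openstack loadbalancer show <lb-id>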
15:01 <noonedeadpunk> #startmeeting openstack_ansible_meeting
15:01 <opendevmeet> Meeting started Tue Mar 22 15:01:04 2022 UTC and is due to finish in 60 minutes.  The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01 <opendevmeet> The meeting name has been set to 'openstack_ansible_meeting'
15:01 <noonedeadpunk> #topic rollcall
15:01 <noonedeadpunk> hey everyone
15:01 <noonedeadpunk> I'm half-there as I have a meeting
15:04 <damiandabrowski[m]> hi!
15:08 <noonedeadpunk> #topic office hours
15:08 <noonedeadpunk> well, I don't have many topics, that said
15:08 <noonedeadpunk> we're super close to PTL and the openstack release
15:09 <noonedeadpunk> * s/PTL/PTG/
15:10 <noonedeadpunk> So please register for the PTG and let's fill in topics for discussion in the etherpad
15:10 <noonedeadpunk> #link https://etherpad.opendev.org/p/osa-Z-ptg
15:10 <NeilHanlon> heya
15:12 <NeilHanlon> TY for the note about PTG!
15:14 <Brace> jamesdenton: nope, it doesn't get rebuilt on C2, and it takes a disable/enable before it comes up on the other two, so I guess that leans towards rabbitmq being the issue?
15:17 <jamesdenton> well, or somehow the router is no longer scheduled against C2. can you try openstack network agent list --router <id>
15:18 <noonedeadpunk> other than that I didn't have time for anything else, since the week was quite tough with internal things going on
15:19 <Brace> jamesdenton: it's alive and up on all controllers according to that command
15:21 <jamesdenton> ok. it may be that the other routers are experiencing a similar condition
15:21 <jamesdenton> not sure if only C2 is to blame, but it seems problematic anyway.
15:25 <Brace> yup
15:25 <jrosser> o/ hello
15:27 <opendevreview> Jonathan Rosser proposed openstack/ansible-role-pki master: Refactor conditional generation of CA and certificates  https://review.opendev.org/c/openstack/ansible-role-pki/+/830794
15:50 *** dviroel is now known as dviroel|lunch
15:50 <Brace> jamesdenton: I'm going to try restarting the l3-agent on C2 and see what that does, can't make things worse can it!
15:50 <jamesdenton> i guess not :D
15:50 <jamesdenton> tail the log after you restart it and see if there's anything unusual reported
15:51 <jamesdenton> main loop exceeding timeout or anything along those lines
15:57 <Brace> jamesdenton: I wouldn't know what's unusual tbh, but there's a note about finding left-over processes, which indicates an unclean termination of a previous run
15:57 <jrosser> oh well, now that is a thing with systemd, isn't it?
15:58 <Brace> not sure tbh, but it otherwise seems ok
15:59 <jrosser> this: https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/772538
15:59 <Brace> but the router still doesn't come back on c2
15:59 <jrosser> and https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/781109
16:00 <jrosser> so restarting the neutron agent does not necessarily remove all the things that it created
16:00 <jamesdenton> Brace: you might try unscheduling the router from the agent, and rescheduling, to see if it will clean up and recreate
16:01 <Brace> ah, I just tried disabling and enabling the port
16:01 <Brace> router even
16:01 <jamesdenton> right, ok.
16:01 <jamesdenton> try to unschedule and reschedule. i cannot remember offhand the command
16:01 <jamesdenton> i'm headed OOO for the day
16:02 <Brace> jamesdenton: thank you for all your help, much appreciated
16:02 <Brace> I'll try the unschedule and reschedule
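For the record, the unschedule/reschedule that couldn't be recalled offhand can be done with the network agent commands; the agent and router IDs below are placeholders, and the hostname is an assumption.

    openstack network agent list --agent-type l3 --host controller-2   # find C2's l3 agent id
    openstack network agent remove router --l3 <l3-agent-id> <router-id>
    openstack network agent add router --l3 <l3-agent-id> <router-id>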
16:04 <noonedeadpunk> #endmeeting
16:04 <opendevmeet> Meeting ended Tue Mar 22 16:04:40 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
16:04 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.html
16:04 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.txt
16:04 <opendevmeet> Log:            https://meetings.opendev.org/meetings/openstack_ansible_meeting/2022/openstack_ansible_meeting.2022-03-22-15.01.log.html
16:33 *** Guest0 is now known as prometheanfire
16:56 <admin1> has anyone seen an issue with nova snapshots when glance and cinder are ceph but nova is local storage?
16:56 <admin1> and if my glance backend is only rbd, do I need enabled_backends = rbd:rbd,http:http,cinder:cinder -- all of these?
16:57 <admin1> i am getting an error that snapshots do not work .. and research points to multiple backends being configured in glance
17:04 *** tosky is now known as Guest38
17:04 *** tosky_ is now known as tosky
18:30 <opendevreview> Merged openstack/openstack-ansible-os_neutron master: Set fail_mode explicitly  https://review.opendev.org/c/openstack/openstack-ansible-os_neutron/+/834436
18:33 <admin1> if that bug is really what i think i am hitting, how do I override glance in haproxy from mode http to tcp?
20:20 <admin1> so i changed the mode in haproxy to tcp from http and it's fixed .. now I want to know how to make this fix permanent
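One way to make that persistent in openstack-ansible, rather than hand-editing haproxy.cfg, is to override the glance service definition's balance type in /etc/openstack_deploy/user_variables.yml and re-run the haproxy playbook. This is only a sketch - the exact variable layout differs between OSA releases, so check the haproxy group_vars on your branch first.

    grep -r glance /opt/openstack-ansible/inventory/group_vars/haproxy*   # see how the glance service is defined
    # set the glance service's haproxy_balance_type to "tcp" in /etc/openstack_deploy/user_variables.yml,
    # then re-apply haproxy so the change survives future runs:
    openstack-ansible /opt/openstack-ansible/playbooks/haproxy-install.yml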
20:21 *** dviroel is now known as dviroel|brb
23:24 *** dviroel|brb is now known as dviroel
23:31 *** dviroel is now known as dviroel|out
