Wednesday, 2021-09-22

mrf3 hi, deploying wallaby ansible in a lab for test, do you use the default HAProxy settings for galera? 06:16
mrf3 with the default balance params you can get timeouts during deployment 06:16
jrosser the defaults should be "sensible" 07:05
jrosser mrf3: can you give a specific example? do you mean something times out during the deployment and it fails? 07:06
mrf3 haproxy by default for galera is configured so that just node1 gets all traffic during deployment 07:12
mrf3 due to the "massive" number of queries / connections for setup, galera got timeouts in a 3-node lab 07:13
mrf3 does this not happen in your tests? 07:13
mrf3 or do people just test using AIO? 07:14
jrosser the galera setup is deliberately active/standby/standby 07:27
jrosser are you exceeding the maximum number of connections? 07:28
*** rpittau|afk is now known as rpittau 07:28
mrf3 yes just deploying, 3 controllers, 3 networks, 1 compute 07:32
mrf3 yes we got "backend galera-back has no server available!" 07:32
mrf3 galera has 3 containers ofc 07:32
jrosser well, there is a kind of corner case, that if you exceed the maximum number of database connections, then the haproxy healthcheck itself cannot connect to the database 07:40
jrosser so the loadbalancer thinks that the db is down 07:40
jrosser so i think it's worth trying to understand why in your case the loadbalancer thinks there is no backend 07:41
noonedeadpunk mornings 07:50
jrosser morning 07:54
mrf3 i'm surprised that no one else hit this corner case, or did people just solve it themselves without reporting it? 08:12
mrf3 running out of mysql connections is not a "corner" case, it's a future production problem 08:12
mrf3 because haproxy by default only enables a single node for the full infrastructure 08:12
noonedeadpunk yeah and we have plans to move away from balancing with haproxy to proxysql 08:13
noonedeadpunk actually, right now we've raised the max connection limit but it's not really a long-term solution for sure 08:14
noonedeadpunk but we also need to reduce the number of sleeping connections 08:14
noonedeadpunk we have an etherpad with some ideas https://etherpad.opendev.org/p/db_pool_calculations but haven't been able to get back to it :( 08:15
anskiy I've actually seen this same issue in a lab. Fixed it with this override: https://paste.opendev.org/show/809487/. Another problem with the way it works right now is that queries are not distributed across the (galera) cluster nodes, so all the load essentially hits the first node :( 08:18
noonedeadpunk well, actually, galera can consume writes only on a single node. But reads could be distributed across many 08:22
noonedeadpunk However haproxy can't really distinguish read/write because here it's an l4 (TCP) balancer, while l7 is required for this kind of balancing 08:23
mrf3 anskiy +1 same happened to me 08:23
mrf3 i was having nightmares thinking i had set up something wrong at the network level, and then i saw the haproxy galera balancer was down and that caused the playbooks to fail 08:24
jrosser some specific tuning per deployment is probably needed 08:32
jrosser depending on how many services you deploy / how many CPUs your controllers have / blah blah this all affects the number of database connections 08:32
noonedeadpunk we set smth like `galera_max_connections: 5000` in user_vars 08:35
noonedeadpunk because 90% of all connections are sleeping ones atm 08:35
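
A rough sketch of the sizing question raised above: how the number of services, worker processes and controllers can add up against a limit such as `galera_max_connections: 5000`. The service list, worker counts and pool sizes below are hypothetical values chosen for illustration, and the formula is a simplified worst-case model (every worker fills its pool plus overflow), not how OSA itself calculates anything.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    workers: int        # worker processes per controller (often tied to CPU count)
    max_pool_size: int  # pooled connections each worker may keep open
    max_overflow: int   # extra connections each worker may open under load


def worst_case_connections(services, controllers):
    """Upper bound if every worker on every controller fills pool + overflow."""
    per_controller = sum(
        s.workers * (s.max_pool_size + s.max_overflow) for s in services
    )
    return per_controller * controllers


# Hypothetical deployment, for illustration only.
services = [
    Service("keystone", workers=8, max_pool_size=5, max_overflow=50),
    Service("nova-api", workers=8, max_pool_size=5, max_overflow=50),
    Service("neutron-server", workers=8, max_pool_size=5, max_overflow=50),
    Service("cinder-api", workers=8, max_pool_size=5, max_overflow=50),
]

galera_max_connections = 5000  # the value mentioned in the discussion
total = worst_case_connections(services, controllers=3)
print(f"worst case: {total}, fits under limit: {total <= galera_max_connections}")
```

Because the setup is active/standby/standby, all of these connections land on the single active galera node, which is why the limit is compared against the total rather than a per-node share.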
mrf3 sleeping connections are the problem 08:40
noonedeadpunk well, it depends. Because reusing already established connections speeds up reaching maria 09:10
noonedeadpunk Also you shouldn't get more connections to the DB now beyond the already spawned pool 09:10
jrosser the etherpad explains it quite will tbh 09:11
jrosser *well 09:11
mrf3 is galera_wait_timeout the timeout for idle connections? 09:59
jrosser mrf3: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/templates/my.cnf.j2#L66 10:34
mrf3 ty jrosser, i set the default timeout to 120 secs 10:54
mrf3 3600 secs idle for a sql connection is crazy 10:55
jrosser this is not an OSA thing btw 10:55
jrosser it is how oslo.db middleware works, and by extension all of the services 10:55
jrosser you can tune the settings on the db connection pool based on your use case 10:56
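
A minimal sketch of what tuning the connection pool could look like as a service's `[database]` section. The option names (`max_pool_size`, `max_overflow`, `pool_timeout`, `connection_recycle_time`) are oslo.db options as far as I know; the values are hypothetical, and the only assumption behind them is that the recycle time should stay below the server-side idle timeout (120 seconds in mrf3's case) so the pool drops connections before the database does.

```python
import configparser
import io

# Hypothetical tuning values; only the option names come from oslo.db.
SERVER_WAIT_TIMEOUT = 120  # seconds, the server-side value mentioned above

pool_settings = {
    "max_pool_size": "5",
    "max_overflow": "10",
    "pool_timeout": "30",
    # Recycle pooled connections before the server would close them as idle.
    "connection_recycle_time": str(SERVER_WAIT_TIMEOUT - 20),
}

config = configparser.ConfigParser()
config["database"] = pool_settings

# Render the section as it would appear in a service's INI-style config file.
buffer = io.StringIO()
config.write(buffer)
rendered = buffer.getvalue()
print(rendered)
```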
spatel jamesdenton morning 13:14
*** arxcruz is now known as arxcruz|ruck 13:14
jamesdenton good morning 13:14
spatel I found the solution :) 13:14
jamesdenton mice :) 13:15
jamesdenton nice, too! 13:15
spatel did you? 13:15
jamesdenton tell me, was it obvious? 13:15
spatel i saw your post related to that error 13:15
jamesdenton no, i've been messing with it 13:15
spatel solution is Disable -> HP Shared Memory Features 13:15
spatel In BIOS 13:15
jamesdenton hmm, ok 13:16
spatel First you have to upgrade the NIC firmware, which will give you the option in BIOS to disable/enable HP Shared Memory 13:16
jamesdenton ok, let me reboot this thing and see 13:16
spatel I am very surprised there is no simple answer on the internet, every post is just clueless... 13:17
spatel https://ibb.co/YW1HTdG 13:17
jamesdenton it all varies so much 13:17
spatel In Device level configuration > disable Shared memory and DPDK will work like a charm 13:17
spatel Make sure you have the latest firmware for the HP NIC, otherwise you won't be able to find that option in BIOS 13:18
jamesdenton i think i do, but i'll double check 13:18
jamesdenton i need to compare this config to another G9 i have tested DPDK with. It's possible i disabled it there, i dunno 13:19
jamesdenton that was 2 yrs ago 13:19
spatel :) HP is just making this harder for everyone.. 13:19
spatel now i am testing on a different server just to check this is the correct process, and i would like to blog about it so other folks don't need to struggle 13:20
spatel learnt one more thing: in a blog never reference an HP kb article, because they started restricting some of them like redhat is doing, only paying customers are allowed to use them 13:21
jamesdenton maybe a screenshot would help 13:21
spatel yep.. i was reading your blog post and the same thing happened, most of the links are broken :( 13:22
jamesdenton which post was that? 13:22
spatel let me find it.. you have a very nice blog post related to this issue 13:22
spatel jamesdenton https://www.jimmdenton.com/proliant-intel-dpdk/ 13:24
spatel maybe that solves your issue :) 13:24
noonedeadpunk haha lol 13:25
jamesdenton Well, i think the IOMMU stuff was <= G8, i'm not seeing the same thing on this G9 13:25
jamesdenton but the shared memory stuff sounds plausible 13:25
spatel I have Gen9 servers and i am seeing the same issue 13:25
spatel Maybe HP later added that feature to disable/enable shared memory to handle RMRR (using NIC firmware) 13:27
spatel jamesdenton i am planning to patch ovn to support dpdk. it's broken at present and only supported in the ml2.ovs plugin 13:31
jamesdenton by all means 13:31
spatel jamesdenton did you try AF_XDP ? 13:33
spatel looks like it's the next big thing in OVS, if it works out then it will overtake DPDK 13:34
jamesdenton if it's easier to implement it just might do it 13:35
spatel https://docs.openvswitch.org/en/latest/intro/install/afxdp/ 13:35
spatel did you play with it? i am going to try it out 13:35
jamesdenton i have not 13:40
spatel jamesdenton i am reading this - https://docs.openstack.org/networking-ovn/queens/admin/dpdk.html 13:44
spatel help me understand, why are they saying to just set datapath_type=netdev 13:44
spatel for br-int ? 13:45
spatel what about br-provider, we don't need it there? 13:45
jamesdenton yes 13:45
jamesdenton "and all other bridges if connected to the integration bridge via patch ports" 13:45
spatel br-provider is the external network so that should be connected to the dpdk port, right? 13:46
spatel so we don't need a command like this ? - ovs-vsctl add-port br-int dpdk-1 -- set Interface dpdk-1 type=dpdk options:dpdk-devargs=0000:06:00.1 13:47
jamesdenton i would expect to need that 13:49
jamesdenton i don't think those instructions are complete 13:49
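
A small sketch of the two steps discussed above: setting `datapath_type=netdev` on the integration bridge and on bridges patched to it, then adding a DPDK port with the `ovs-vsctl add-port` form quoted by spatel. It only builds and prints the commands and does not run them. Attaching the DPDK port to `br-provider` rather than `br-int` is an assumption based on spatel's question, and the helper names are hypothetical.

```python
import shlex


def netdev_bridge_cmd(bridge):
    """Command that switches a bridge to the userspace (netdev) datapath."""
    return ["ovs-vsctl", "set", "Bridge", bridge, "datapath_type=netdev"]


def dpdk_port_cmd(bridge, port, pci_address):
    """Command that adds a DPDK port, in the form quoted in the discussion."""
    return [
        "ovs-vsctl", "add-port", bridge, port,
        "--", "set", "Interface", port, "type=dpdk",
        f"options:dpdk-devargs={pci_address}",
    ]


# br-int plus the bridge connected to it via patch ports.
commands = [netdev_bridge_cmd(b) for b in ("br-int", "br-provider")]
# Assumption: the physical DPDK NIC belongs on the provider bridge.
commands.append(dpdk_port_cmd("br-provider", "dpdk-1", "0000:06:00.1"))

for command in commands:
    print(shlex.join(command))
```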
spatel that is what i thought, something is missing 13:51
spatel let me ask in the openvswitch channel 13:51
strattao we need to change our cinder_service_password. I see warnings all over the documentation saying that changing the password and running the playbooks will break the region. So... how would I go about changing the cinder_service_password? (or any other passwords that might need to be changed for that matter...) 14:01
strattao (sorry to hijack this fascinating conversation spatel & jamesdenton) 14:02
spatel strattao no worries, i mostly go with whatever secret password is generated by OSA, why do you want to change that ? 14:03
spatel you are not going to use them on a daily basis so why bother 14:04
strattao our password complexity requirements changed and the ones that were originally autogenerated no longer match the new requirements. 14:04
spatel but why just cinder_service_password? 14:05
noonedeadpunk strattao: Actually I think you can just update the password and re-run the role. 14:06
strattao looks like that's the only one that doesn't meet the new requirements 14:07
noonedeadpunk The service will really be broken, but only until the role execution ends 14:07
noonedeadpunk Because you need services to be restarted after the password is changed, and it's changed somewhere in the middle of role execution 14:07
noonedeadpunk if you use manila - you should also re-run it 14:08
strattao hmmm, well I was testing this out in one of our dev regions and it failed re-running os-cinder. The error was with keystone, so I re-ran os-keystone, then os-cinder, to no avail. I have setup-everything running in the background now, but figured I'd ask here if anyone has ever had to update a password, since all the documentation says that this is bad and I don't want to break production. 14:10
noonedeadpunk And what was the issue with keystone? 14:14
noonedeadpunk Have you tried commenting out no_log there? 14:14
noonedeadpunk because we should have had force set for quite some time 14:14
noonedeadpunk Which means that the role will attempt to change the password 14:14
noonedeadpunk Unless you've also changed the keystone admin password, then things are really bad 14:15
strattao Yeah, you're right - the keystone error I was seeing was that the cinder-api is not responsive. So, I'll try to restart cinder services in the api container now that the password has been reset and then re-run the os-cinder playbook and ..... ??? profit?!? 14:23
strattao Nope, that didn't work :( 14:26
strattao Okay, well I'll keep plugging along, thx 14:26
spotz strattao: Depending on the error it might also be a good idea to ask in Cinder assuming it's not an OSA issue 14:37
noonedeadpunk I guess it's osa actually.... 15:06
noonedeadpunk strattao: but some output would be useful 15:06
opendevreview Dmitriy Rabotyagov proposed openstack/openstack-ansible master: Add serial execution to all playbooks https://review.opendev.org/c/openstack/openstack-ansible/+/805188 15:17
*** rpittau is now known as rpittau|afk 15:45
spatel jamesdenton, i have tested my method on 3 servers and it just works after disabling the HP shared memory feature in BIOS 15:57
jamesdenton nice! so far, that has not worked for me 16:15
*** mgoddard- is now known as mgoddard 17:13
-opendevstatus- NOTICE: Zuul has been restarted in order to address a performance regression related to event processing; any changes pushed or approved between roughly 17:00 and 18:30 UTC should be rechecked if they're not already enqueued according to the Zuul status page 18:35
mrf3 mm guys how much memory is needed for a controller deployed with ansible? 16GB is not enough 22:25
