Tuesday, 2021-11-02

<rohit02> hi team, while deploying OSA Wallaby, setup openstack failed at TASK [os_nova : Perform a cell_v2 discover] https://paste.opendev.org/show/810323/ [07:19]
<rohit02> noonedeadpunk: any idea here? while deploying OSA Wallaby, setup openstack failed at TASK [os_nova : Perform a cell_v2 discover] https://paste.opendev.org/show/810323/ [07:43]
<anskiy> rohit02: OperationalError: (1040, 'Too many connections'), you can check mysqladmin processlist and I think there are a bunch of processes in COMMIT state [07:45]
<anskiy> which should lead to this: https://jira.mariadb.org/browse/MDEV-25368 [07:46]
<rohit02> anskiy: thanks.. any resolution to overcome this error? [07:47]
<anskiy> I've ended up pinning the mysql version to 10.5.6, like this: https://paste.opendev.org/show/810326/ [07:48]
<anskiy> or you are just hitting that "too many connections" error normally; in that case you just need to increase galera_max_connections. It is set based on this formula: (100 x vCPUs), but it takes into account ALL nodes in the cluster, so having one node with 2 cores (like the deployment host, which I use for logging) sets this to 200, which could be a little low, depending on the number of services you're trying to deploy. [07:52]
<noonedeadpunk> rohit02: we here bumped max connections for galera [08:26]
<noonedeadpunk> ie galera_max_connections: 1000 [08:27]
<noonedeadpunk> This won't solve the issue with connections in COMMIT [08:27]
<noonedeadpunk> but if there are plenty of connections in wait/sleep state it will [08:27]
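
The sizing rule anskiy describes above can be sketched in a few lines of Python. This is only an illustration of the arithmetic as stated in the conversation (100 connections per vCPU, with the smallest node in the list deciding the result), not the role's actual template; the function name and the sample core counts are hypothetical.

```python
def estimated_max_connections(vcpus_per_node, per_vcpu=100):
    """Estimate a connection limit from per-node vCPU counts.

    Assumption (taken from the chat, not from the role's code): the node
    with the fewest vCPUs determines the result.
    """
    return min(vcpus_per_node) * per_vcpu


# Hypothetical cluster: two 8-core nodes plus one 2-core node.
small_node_limit = estimated_max_connections([8, 8, 2])
# Hypothetical cluster of three 16-core nodes.
even_cluster_limit = estimated_max_connections([16, 16, 16])
print(small_node_limit, even_cluster_limit)
```

Under these assumptions a single 2-core node is enough to pull the estimate down to 200, which matches the cap reported in the channel; setting galera_max_connections explicitly, as noonedeadpunk suggests, sidesteps the calculation.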
<opendevreview> Dmitriy Rabotyagov proposed openstack/ansible-role-python_venv_build stable/ussuri: Set centos-7 jobs to non voting  https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/816316 [10:42]
<opendevreview> Dmitriy Rabotyagov proposed openstack/ansible-role-python_venv_build stable/ussuri: Workaround distro provided pip having old CA certs on centos-7  https://review.opendev.org/c/openstack/ansible-role-python_venv_build/+/816317 [10:43]
<opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/ussuri: Fetch upper constraints file with curl rather than allow pip to download it  https://review.opendev.org/c/openstack/openstack-ansible/+/815632 [10:51]
<jrosser_> anskiy: does the connection limit really get calculated based on the deploy host cpus? that sounds surprising? [10:52]
<opendevreview> Dmitriy Rabotyagov proposed openstack/openstack-ansible stable/ussuri: Bump OpenStack-Ansible Ussuri  https://review.opendev.org/c/openstack/openstack-ansible/+/815589 [10:52]
<anskiy> jrosser_: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/templates/my.cnf.j2#L56-L64 and that default is calculated at the beginning of the same file [11:32]
<jrosser_> but only for the galera hosts? https://github.com/openstack/openstack-ansible-galera_server/blob/master/templates/my.cnf.j2#L2 [11:33]
<jrosser_> i'm not sure that i can see how the deploy host having two vcpu would affect it? [11:33]
<anskiy> jrosser_: hm, it's true. Maybe I should've dug more into this. But what I was seeing is that I got max_connections capped at 200, and the only node I had with that core count was the logging one. Or it could've been because that default value was being triggered somehow, instead of the proper value from the fact. [11:39]
<jrosser_> if there's a bug somewhere then it would be great to understand why it's coming out that way for you [11:40]
<jrosser_> i guess that the default(2) here https://github.com/openstack/openstack-ansible-galera_server/blob/master/templates/my.cnf.j2#L4 might have something to say about this [11:41]
<jrosser_> if the number of vcpu is somehow not available, but i'd have expected that to fail already at the previous line when it tries to retrieve the value into `vcpus` [11:42]
<anskiy> jrosser_: yeah. Thanks for pointing this out, I'll try to dig more into this and file a bug if appropriate [11:45]
<mgariepy> depending on how many compute nodes you have, even with a decent core count on the galera nodes you might end up not having enough connections. [11:48]
<mgariepy> Also if your fact cache is stale for some nodes it can be set a bit too low. (ex: it calculates the vcpus of only 1 node out of the 3.) [11:49]
<jrosser_> i'm surprised that the template doesn't blow up if the facts are not available [11:49]
<mgariepy> there is always 1 there.. [11:50]
<mgariepy> but yeah. indeed it's surprising that it didn't fail hard lol [11:50]
<mgariepy> on a smallish setup of 120 compute nodes i generally set it to: galera_max_connections: 4800 [11:51]
<opendevreview> Merged openstack/openstack-ansible-os_tempest stable/stein: Fix tempest plugin versions  https://review.opendev.org/c/openstack/openstack-ansible-os_tempest/+/814535 [11:53]
<strattao> We've also seen the issue with the max connections being set to 200 even though I have more than 2 vcpus available. Haven't pinned down the issue, but it does seem that there is a bug with the calculation that is being made. We had to manually bump up the galera_max_connections, but I don't recall having to do that before our Wallaby tests [12:26]
<opendevreview> Merged openstack/openstack-ansible stable/stein: Remove tempest plugins CI overrides  https://review.opendev.org/c/openstack/openstack-ansible/+/814558 [12:31]
<opendevreview> Merged openstack/ansible-role-pki master: Slurp all server certs not just first one  https://review.opendev.org/c/openstack/ansible-role-pki/+/815849 [12:55]
<opendevreview> Merged openstack/openstack-ansible master: Switch services to track stable/xena  https://review.opendev.org/c/openstack/openstack-ansible/+/815597 [14:00]
*** sshnaidm_ is now known as sshnaidm [15:49]
<jrosser_> behaviour of this galera max connections template is bizarre https://paste.opendev.org/show/810346/ [16:39]
<mgariepy> it appends strings? [16:58]
<jrosser_> apparently [17:01]
<jrosser_> i thought i'd just grab the bit of the template out into a test playbook and mess with it [17:01]
<mgariepy> what version of ansible ? [17:01]
<jrosser_> ah good question [17:01]
<jrosser_> hmmm 2.9.11 [17:02]
<jrosser_> whatever i turn the multiplier to, it puts the vcpu number in that many times [17:03]
<mgariepy> '>' not supported between instances of 'str' and 'int'" [17:07]
<mgariepy> what if you swap the strings for ints in the append statement? [17:09]
<jrosser_> that's why i've had to add the quotes around '2' for example [17:09]
<jrosser_> i have no idea how this works at all in the current code :( [17:10]
<mgariepy> '2' and '9999' are actually strings. and with ansible 2.11.1 it does error out on the comparison. [17:10]
<mgariepy> https://paste.opendev.org/show/810347/ [17:10]
<jrosser_> i was getting `'<' not supported between instances of 'int' and 'AnsibleUnsafeText'"}` [17:14]
<jrosser_> without the quotes around the '2' on 2.9.11 [17:14]
<mgariepy> lol [17:14]
<mgariepy> we got to love ansible :) [17:14]
<mgariepy> very predictable. [17:14]
<mgariepy> for stable features, that is. lol [17:14]
<jrosser_> ah right, and i'm just trying with ansible 4.8.0 and i get the same as you [17:15]
<jrosser_> but either way it still templates out N times the vcpu number [17:16]
<mgariepy> https://paste.opendev.org/show/810348/ [17:16]
<mgariepy> yep. [17:16]
<jrosser_> oh wait - how did you do that? [17:17]
<mgariepy> https://paste.opendev.org/show/810349/ [17:17]
<jrosser_> oh right, so i guess this is python interpreter / jinja library version trouble then [17:18]
<mgariepy> i wonder what the rationale was for calculating the number of connections from the galera core count. [17:19]
<mgariepy> most apis (most of the time) will have X threads (based on core count) i guess, and if they all run on the same host some number based on the core count of the host might make sense. but ignoring all the other nodes in the cluster is not a good idea imo. [17:21]
<jrosser_> as soon as you hit the connection limit the LB healthcheck starts to fail (as it can't connect either) and you end up with a catastrophic failure [17:23]
<mgariepy> what i've seen is the first server stops responding, then the second one takes over.. [17:24]
<jrosser_> anyway, adding the quotes on the numerical values and putting this in `| min | int * 20) %}` makes the template behave here [17:25]
<mgariepy> wouldn't be an issue if all the nodes were really masters.. proxysql to the rescue ! [17:25]
<jrosser_> and it does fail hard if the vcpus fact is not present [17:32]
<jrosser_> i am confused about how this is ending up getting stuck at 200 for people [17:32]
<mgariepy> i've seen that with Rocky in the past. when upgrading the controllers. [17:36]
<mgariepy> when the facts were expired for some nodes it didn't matter. [17:37]
<mgariepy> can a bug somewhere set the fact to 0 ? [17:43]
<mgariepy> for some reason [17:43]
<jrosser_> I wonder if the galera_nodes can be empty somehow [17:45]
<mgariepy> if the galera_cluster_members is overwritten by the user ? [17:52]
<mgariepy> besides that i don't see how it can be empty. [17:53]
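
The two symptoms discussed above (the vcpu number being repeated N times, and the `not supported between instances` error) can be reproduced in plain Python. This is a sketch that assumes the Jinja expression follows ordinary Python semantics for `*` and for comparisons, and that the fact value arrives as a string; the variable names and the value "4" are hypothetical and not taken from the role.

```python
# Hypothetical vcpus fact arriving as a string rather than an integer.
vcpus_fact = "4"

# Multiplying a string by an integer repeats the string.
repeated = vcpus_fact * 3

# Comparing a string with an integer raises TypeError on Python 3.
try:
    max([vcpus_fact, 2])
    comparison_error = None
except TypeError as exc:
    comparison_error = str(exc)

# Casting to int first, as the `| int` filter does, gives real arithmetic.
fixed = int(vcpus_fact) * 20

print(repr(repeated), comparison_error, fixed)
```

Casting before the multiplication is what the `| min | int * 20` change achieves in the template: the multiplication then acts on a number instead of on text.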
*** ianw_pto is now known as ianw [19:00]
<strattao> spatel, for ovn deployments in wallaby, is there supposed to be a default driver_interface defined in the ml2.ovn section of /etc/ansible/roles/os_neutron/vars/main.yml? [19:45]
<spatel> strattao what do you mean? [19:46]
<strattao> I was testing an ovn install off of the stable/wallaby branch and got an error that driver_interface was not defined. I looked in /vars/main.yml of os_neutron and saw that many of the other defined ml2 plugins specify a driver_interface in that file, but ml2.ovn does not. [19:50]
<strattao> I didn't know if that was the place it would get defined, if it even needs it, or if there is another reason why it would be complaining about the driver_interface for ovn... etc. [19:50]
<spatel> I have never seen that error [20:07]
<spatel> strattao if you post config/error/output etc.. that would be good [20:07]
<spatel> Here's what i did - https://satishdotpatel.github.io/openstack-ansible-ovn-deployment-part1/ [20:08]
