Tuesday, 2021-11-16

noonedeadpunk#startmeeting openstack_ansible_meeting15:00
opendevmeetMeeting started Tue Nov 16 15:00:59 2021 UTC and is due to finish in 60 minutes.  The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot.15:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.15:00
opendevmeetThe meeting name has been set to 'openstack_ansible_meeting'15:00
noonedeadpunk#topic office hours15:01
noonedeadpunkah, I eventually skipped rollcall :(15:01
noonedeadpunk\o/15:01
damiandabrowski[m]hey!15:01
opendevreviewDmitriy Rabotyagov proposed openstack/openstack-ansible-rabbitmq_server master: Update rabbitmq version  https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/81738015:04
noonedeadpunkSo, I pushed some patches to update rabbitmq and galera version before release15:05
noonedeadpunkFrustrating thing is that for rabbitmq (erlang to be specific) bullseye repo is still missing15:06
noonedeadpunkand galera 10.6 fails while running mariadb-upgrade15:06
noonedeadpunkwell, actually, it times out15:06
noonedeadpunkI didn't have time to dig into this though15:06
noonedeadpunkI guess once we figure this out we can do milestone release for testing15:08
noonedeadpunkwhile still taking time before roles branching15:09
noonedeadpunkalso we have less than a month for final release just in case15:09
noonedeadpunklike updating maria pooling vars across roles if we got them agreed in time15:10
damiandabrowski[m]regarding SQLAlchemy's connection pooling, i've checked how many active connections we have on our production environments and i think we should stick to oslo.db's defaults (max_pool_size: 5, max_overflow: 50, pool_timeout: 30). I'll try to run some rally tests to make sure it won't degrade performance 15:16
mgariepyhey.15:20
noonedeadpunk\o/15:20
noonedeadpunkI wonder if we should set max_pool_size: 10 but might be we don't need even this amount15:21
noonedeadpunk5 feels somehow extreme15:21
mgariepyif you want i can take a couple hours to dig into the issue this afternoon. 15:22
noonedeadpunkI won't refuse help - the more eyes we have on this the more balanced result we get I guess15:24
spatelhey15:24
noonedeadpunkBut I'm absolutely sure we should adjust this results and I'd wait for patches to land before release tbh15:24
damiandabrowski[m]noonedeadpunk: regarding max_pool_size, none of the environments i checked needed a max_pool_size above 2 during the 12 hours (and with max_overflow set to 50, that is far more than enough). But i'll share some info after rally tests15:26
noonedeadpunkwell, in case of 1  controller failure I guess this number would increase?15:27
damiandabrowski[m]spatel: there is a script i used to check how many active SQL connections I have for each openstack service. You may find it useful15:27
damiandabrowski[m]https://paste.openstack.org/show/811029/15:27
noonedeadpunkI think you meant mgariepy hehe15:27
spatel:)15:28
damiandabrowski[m]ouh my bad, sorry! 15:28
noonedeadpunkandrewbonney: you had your head in this topic as well. wdyt?15:28
mgariepynice damiandabrowski[m] thanks i'll take a look15:30
spateldamiandabrowski[m] i am trying to run but throwing some error :)15:33
spatelmay be my box doesn't have some tools 15:33
spatelsort: cannot read: ./data/keystone: No such file or directory15:34
damiandabrowski[m]well, i found an issue with this script yesterday. Looks like it has to be executed from your $PWD. maybe it's the case?15:35
damiandabrowski[m]line 13 should create ./data and line 16 should create service directories inside it15:36
damiandabrowski[m]ouh and another thing: You need to run this script with 'collect' argument to collect some data, then You will be able to use 'summary' argument ;)15:37
andrewbonneyIt's a while since I've looked at this, but it wouldn't surprise me if existing defaults could be reduced. I just wouldn't have confidence doing so without checking larger deployments15:39
noonedeadpunkwould be interesting to see spatel's result actually15:39
spatelnoonedeadpunk are you talking about related SQL script? 15:40
noonedeadpunkyeah15:40
spatelfyi, i am trying to test it on wallaby not latest branch 15:40
spateldamiandabrowski[m] - https://paste.opendev.org/show/811030/15:41
spatelmay need some love to that script.. 15:42
damiandabrowski[m]tbh i ran it with bash so it may have some issues with sh :/15:42
spateltrying with bash.. 15:44
spatelhttps://paste.opendev.org/show/811031/15:45
spatelall zero.. that is impossible 15:45
noonedeadpunkit's sandbox?15:45
spatellab with 5 compute nodes 15:45
noonedeadpunkand no activity?15:46
spatel1 controller 15:46
spatellet me run on busy cloud 15:46
noonedeadpunkbecause what this script calculates, I believe, is the number of active sql requests at a time15:46
mgariepyyep, if threads are sleeping it won't count them15:47
spatelassuming collect continue collecting data in background right?  ./mysql-active-conn.sh collect15:47
damiandabrowski[m]i think it's very likely to be 0, but You can check the result of `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"`15:48
damiandabrowski[m]yeah, i was running 'collect' in the background for 12 hours15:48
spatelzero.. result 15:48
noonedeadpunkhm, that's weird on busy cloud indeed15:48
spatelhttps://paste.opendev.org/show/811032/15:49
noonedeadpunkthat is possible actually :)15:49
mgariepybut even sleeping connection are still a connection 15:50
damiandabrowski[m]but if i understand it correctly, our main point of implementing pooling is to reduce the number of sleeping connections as we don't need them ;)15:51
damiandabrowski[m]i mean, with max_pool_size=5, we will have 5 sleeping/active connections per worker per service15:51
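[Editor's note] The sizing argument above can be put into numbers. This is a minimal sketch, assuming every worker process holds its own SQLAlchemy pool, so that limits multiply per worker per service. The function name and the service and worker counts are hypothetical and are not taken from the discussion; the pool defaults are the oslo.db values quoted earlier in the log.

```python
def connection_bounds(services, workers_per_service,
                      max_pool_size=5, max_overflow=50):
    """Return (steady, burst) upper bounds on database connections.

    steady: connections the pools may keep open (sleeping or active).
    burst:  ceiling reached if every pool also uses its full overflow.
    """
    pools = services * workers_per_service
    steady = pools * max_pool_size
    burst = pools * (max_pool_size + max_overflow)
    return steady, burst


# Hypothetical deployment: 10 services with 16 workers each.
steady, burst = connection_bounds(services=10, workers_per_service=16)
print(steady, burst)
```

Comparing the burst figure against Galera's max_connections gives a rough idea of whether a given pool setting leaves headroom, for example when one controller fails and its load moves to the remaining nodes.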
noonedeadpunkyeah idea is that our current default setup is weird in terms of pooling15:52
noonedeadpunkbecause of huge amount of sleeping connections 15:52
spatelfyi, i rant this script on most busiest cloud in my datacenter and result is all zero. i am running collect for just 1 min.. 15:52
spatelran*15:53
opendevreviewJames Denton proposed openstack/openstack-ansible-os_ironic master: Add [nova] section to ironic.conf  https://review.opendev.org/c/openstack/openstack-ansible-os_ironic/+/81811515:53
spatelmay be i am running on stein release thats why :)15:53
noonedeadpunknah I don't think it matters15:53
damiandabrowski[m]and `cat ./data/nova` has only zeros? i was testing it on victoria, but i think it doesn't matter15:54
noonedeadpunkit's pure mysql15:54
spatelk15:54
spateldamiandabrowski[m] yes all zero for nova and even other services also 15:55
spatelonly root and system user has 1 and 5 connection 15:55
damiandabrowski[m]so IMO, openstack queries are processed so fast that the script doesn't catch them 15:55
damiandabrowski[m]i mean, it just parses `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"` and saves the output to service files in ./data/*15:56
damiandabrowski[m]in my case, the output was 0 in 99% of cases15:57
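[Editor's note] A minimal sketch of the sampling step described above, not the actual script from the paste. It assumes the `mysql -sse` query prints one `user<TAB>count` line per user, and that a service with no running query at that instant simply does not appear in the output. The function name and the sample text are hypothetical.

```python
def parse_active_counts(output, services):
    """Map each service user to its number of non-sleeping connections.

    Services missing from the query output are reported as 0, which is
    what a sample looks like when no query is running at that instant.
    """
    counts = {name: 0 for name in services}
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        user, count = parts
        if user in counts:
            counts[user] = int(count)
    return counts


# Hypothetical sample in which only non-service users are active.
sample = "root\t1\nsystem user\t5\n"
print(parse_active_counts(sample, ["nova", "keystone"]))
```

Because each sample only sees queries that are executing at that exact moment, short-lived queries are easy to miss, which would be consistent with the all-zero results reported here.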
spatelnoonedeadpunk this is my mytop command output - https://paste.opendev.org/show/811033/15:59
spatelmajority are sleeping connection15:59
noonedeadpunkyeah and we aim to reduce their number)16:00
spateldoes sleep consume resources ?16:00
noonedeadpunkwell, they do. it's not like _huge_ problem, but unpleasant16:01
noonedeadpunkeventually you have tons of hanging tcp connections, which also put some load on the tcp stack 16:01
spatelhow about setting up - interactive_timeout = 30016:02
spateli believe default is 8hrs 16:02
spatelor wait_timeout ?16:03
spatelinteractive_timeout                   | 2880016:03
spatel wait_timeout                          | 2880016:03
damiandabrowski[m]sleeping connections are the most problematic when galera nodes are going down.16:04
damiandabrowski[m]Galera will keep them until timeout, that's how galera can easily reach max_connections16:04
damiandabrowski[m]yeah, actually my main point was to implement connection pooling and lower wait_timeout - but I'm open for Your ideas ;)16:05
spateli had same issue.. that is why i kept max_connection to 5000 and more... my upgrade failed last time because of high max_connection limit 16:05
mgariepydamiandabrowski[m], which service does consume the most mysql connections ?16:08
damiandabrowski[m]in my case: nova16:10
spatelyes nova is always on top 16:10
noonedeadpunk#endmeeting16:11
opendevmeetMeeting ended Tue Nov 16 16:11:58 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)16:11
opendevmeetMinutes:        https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.html16:11
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.txt16:11
opendevmeetLog:            https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.log.html16:11
mgariepynoonedeadpunk, which one of galera or rabbit do you want me to tackle?16:13
noonedeadpunkwould be great if you could take a look on galera16:14
noonedeadpunkbut I bet this is mariadb bug16:14
noonedeadpunksince according to zuul logs mariadb-upgrade always fails on schema_redundant_indexes16:15
mgariepyok i'll take a look at it a bit later.16:15
noonedeadpunkand this table has been implemented with 10.6 according to https://mariadb.com/docs/reference/mdb/sys/schema_redundant_indexes/16:15
noonedeadpunkalso it's not easily reproducible...16:15
mgariepyok16:18
noonedeadpunkI've also asked in #maria (on libera) but I guess they will just suggest to file a bug report and provide the collected stack trace16:19
mgariepyhmm ok it fails sometimes during upgrade? do you have the pointers to the logs somewhere? 16:21
noonedeadpunkcheck out timeout tasks for https://review.opendev.org/c/openstack/openstack-ansible/+/81738516:21
noonedeadpunkthat is always last task before timeout https://zuul.opendev.org/t/openstack/build/fed013733f9f415bb2d1118a26287bab/log/logs/host/mariadb.service.journal-17-59-27.log.txt#20516:22
mgariepyok perfect.16:24
mgariepyit's always a timeout on focal metal or it doesn't matter ?16:24
noonedeadpunkoh well, I saw it indeed 2 times and only on focal right now16:25
mgariepyok 16:25
mgariepyperfect i'll dig into it after lunch. see if i can figure out what's going on.16:25
noonedeadpunkthanks! 16:27
noonedeadpunkI think we should try at least to reproduce and collect debug information to post bug report16:27
noonedeadpunkie https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#enabling-core-dumps16:29
noonedeadpunkas I dont think it's us who does smth wrong16:30
mgariepyyep ok 16:30
mgariepyi'll script the launch of a couple of vms in my cloud to reproduce it and enable core-dumps to try to grab the trace.16:31
opendevreviewJames Denton proposed openstack/openstack-ansible master: Deprecate OVN-related haproxy configuration  https://review.opendev.org/c/openstack/openstack-ansible/+/81385816:32
opendevreviewMerged openstack/ansible-config_template master: Fix repository URL in galaxy.yml  https://review.opendev.org/c/openstack/ansible-config_template/+/81772017:01
*** sshnaidm is now known as sshnaidm|afk17:33
opendevreviewMerged openstack/openstack-ansible-os_nova master: Allow to provide mdev addresses as list  https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/81773817:51
*** tosky is now known as Guest605418:05
*** tosky_ is now known as tosky18:05
mgariepyi'll try to reproduce on an AIO lxc install, this way i can reset only the galera container if it passes the mysql_upgrade18:21
*** sshnaidm|afk is now known as sshnaidm18:39
opendevreviewMerged openstack/openstack-ansible master: Minor update of openstack collection  https://review.opendev.org/c/openstack/openstack-ansible/+/81785118:45
opendevreviewMerged openstack/openstack-ansible master: Remove note about metal/horizon compatability  https://review.opendev.org/c/openstack/openstack-ansible/+/77157318:45
*** tosky_ is now known as tosky18:48
mgariepynoonedeadpunk, every single time i did the run on my server it has the issue.19:53
mgariepylooks like a race condition to me.19:53
mgariepythe line in journalctl : sys.schema_redundant_indexes                       OK  is printed on startup.20:03
opendevreviewMarc Gariépy proposed openstack/openstack-ansible-galera_server master: Relaod deamon on overrides file creation  https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/81813820:47
mgariepymaybe this would belong to systemd_services ?20:48
mgariepyi'll be back tomorrow !20:54
*** tosky is now known as Guest607022:42
*** tosky_ is now known as tosky22:42
*** tosky is now known as Guest607323:07
*** tosky_ is now known as tosky23:07
