Tuesday, 2021-11-16

noonedeadpunk	#startmeeting openstack_ansible_meeting	15:00
opendevmeet	Meeting started Tue Nov 16 15:00:59 2021 UTC and is due to finish in 60 minutes. The chair is noonedeadpunk. Information about MeetBot at http://wiki.debian.org/MeetBot.	15:00
opendevmeet	Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.	15:00
opendevmeet	The meeting name has been set to 'openstack_ansible_meeting'	15:00
noonedeadpunk	#topic office hours	15:01
noonedeadpunk	ah, I eventually skipped rollcall :(	15:01
noonedeadpunk	\o/	15:01
damiandabrowski[m]	hey!	15:01
opendevreview	Dmitriy Rabotyagov proposed openstack/openstack-ansible-rabbitmq_server master: Update rabbitmq version https://review.opendev.org/c/openstack/openstack-ansible-rabbitmq_server/+/817380	15:04
noonedeadpunk	So, I pushed some patches to update rabbitmq and galera version before release	15:05
noonedeadpunk	Frustrating thing is that for rabbitmq (erlang to be specific) bullseye repo is still missing	15:06
noonedeadpunk	and galera 10.6 fails while running mariadb-upgrade	15:06
noonedeadpunk	well, actually, it times out	15:06
noonedeadpunk	I didn't have time to dig into this though	15:06
noonedeadpunk	I guess once we figure this out we can do milestone release for testing	15:08
noonedeadpunk	while still taking time before roles branching	15:09
noonedeadpunk	also we have less than a month for final release just in case	15:09
noonedeadpunk	like updating maria pooling vars across roles if we got them agreed in time	15:10
damiandabrowski[m]	regarding sqlalchemy's connection pooling, i've checked how many active connections we have on our production environments and i think we should stick to oslodb's defaults (max_pool_size: 5, max_overflow: 50, pool_timeout: 30). I'll try to run some rally tests to make sure it won't degrade performance	15:16
mgariepy	hey.	15:20
noonedeadpunk	\o/	15:20
noonedeadpunk	I wonder if we should set max_pool_size: 10 but might be we don't need even this amount	15:21
noonedeadpunk	5 feels somehow extreme	15:21
mgariepy	if you want i can take a couple hours to dig issue this afternoon.	15:22
noonedeadpunk	I won't refuse help - the more eyes we have on this the more balanced result we get I guess	15:24
spatel	hey	15:24
noonedeadpunk	But I'm absolutely sure we should adjust this results and I'd wait for patches to land before release tbh	15:24
damiandabrowski[m]	noonedeadpunk: regarding to max_pool_size, all of the environments i checked did not need more than 2 max_pool_size for the 12 hours(and with max_overflow set to 50, it makes it far more than enough). But i'll share some info after rally tests	15:26
noonedeadpunk	well, in case of 1 controller failure I guess this number would increase?	15:27
damiandabrowski[m]	spatel: there is a script i used to check how many active SQL connections I have for each openstack service. You may find it useful	15:27
damiandabrowski[m]	https://paste.openstack.org/show/811029/	15:27
noonedeadpunk	I think you meant mgariepy hehe	15:27
spatel	:)	15:28
damiandabrowski[m]	ouh my bad, sorry!	15:28
noonedeadpunk	andrewbonney: you had your head in this topic as well. wdyt?	15:28
mgariepy	nice damiandabrowski[m] thanks i'll take a look	15:30
spatel	damiandabrowski[m] i am trying to run but throwing some error :)	15:33
spatel	may be my box doesn't have some tools	15:33
spatel	sort: cannot read: ./data/keystone: No such file or directory	15:34
damiandabrowski[m]	well, i found an issue with this script yesterday. Looks like it has to be executed from your $PWD. maybe it's the case?	15:35
damiandabrowski[m]	line 13 should create ./data and line 16 should create service directories inside it	15:36
damiandabrowski[m]	ouh and another thing: You need to run this script with 'collect' argument to collect some data, then You will be able to use 'summary' argument ;)	15:37
andrewbonney	It's a while since I've looked at this, but it wouldn't surprise me if existing defaults could be reduced. I just wouldn't have confidence doing so without checking larger deployments	15:39
noonedeadpunk	would be interesting to see spatel's result actually	15:39
spatel	noonedeadpunk are you talking about related SQL script?	15:40
noonedeadpunk	yeah	15:40
spatel	fyi, i am trying to test it on wallaby not latest branch	15:40
spatel	damiandabrowski[m] - https://paste.opendev.org/show/811030/	15:41
spatel	may need some love to that script..	15:42
damiandabrowski[m]	tbh i ran it with bash so it may have some issues with sh :/	15:42
spatel	trying with bash..	15:44
spatel	https://paste.opendev.org/show/811031/	15:45
spatel	all zero.. that is impossible	15:45
noonedeadpunk	it's sandbox?	15:45
spatel	lab with 5 compute nodes	15:45
noonedeadpunk	and no activity?	15:46
spatel	1 controller	15:46
spatel	let me run on busy cloud	15:46
noonedeadpunk	because what this script calculates, I believe, is the number of active sql requests at a time	15:46
mgariepy	yep, if threads are sleeping it won't count them	15:47
spatel	assuming collect continue collecting data in background right? ./mysql-active-conn.sh collect	15:47
damiandabrowski[m]	i think it's very likely to be 0, but You can check the result of `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"`	15:48
damiandabrowski[m]	yeah, i was running 'collect' in the background for 12 hours	15:48
spatel	zero.. result	15:48
noonedeadpunk	hm, that's weird on busy cloud indeed	15:48
spatel	https://paste.opendev.org/show/811032/	15:49
noonedeadpunk	that is possible actually :)	15:49
mgariepy	but even sleeping connection are still a connection	15:50
damiandabrowski[m]	but if i understand it correctly, our main point of implementing pooling is to reduce the number of sleeping connections as we don't need them ;)	15:51
damiandabrowski[m]	i mean, with max_pool_size=5, we will have 5 sleeping/active connections per worker per service	15:51
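The per-worker arithmetic above can be sketched as follows. This is only a rough illustration: it assumes each worker process keeps its own SQLAlchemy pool, and the worker count of 16 is a hypothetical value, not a figure from the discussion. The pool settings are the defaults quoted earlier in the meeting.

```python
# Rough upper bounds on database connections held by one service on one host.
# Assumption: every worker process has its own connection pool.

def pool_bounds(workers, max_pool_size=5, max_overflow=50):
    """Return (persistent, peak) connection counts for one service."""
    # Connections that may stay open (possibly sleeping) between requests.
    persistent = workers * max_pool_size
    # Hard ceiling when every worker also uses its whole overflow allowance.
    peak = workers * (max_pool_size + max_overflow)
    return persistent, peak


# Hypothetical example: a service running 16 workers.
persistent, peak = pool_bounds(workers=16)
print(persistent, peak)
```

Multiplying the first number by the number of services and controllers gives a feel for how many sleeping connections Galera may end up holding.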
noonedeadpunk	yeah idea is that our current default setup is weird in terms of pooling	15:52
noonedeadpunk	because of huge amount of sleeping connections	15:52
spatel	fyi, i rant this script on most busies cloud in my datacenter and result is all zero. i am running collect for just 1 min..	15:52
spatel	ran*	15:53
opendevreview	James Denton proposed openstack/openstack-ansible-os_ironic master: Add [nova] section to ironic.conf https://review.opendev.org/c/openstack/openstack-ansible-os_ironic/+/818115	15:53
spatel	may be i am running on stein release thats why :)	15:53
noonedeadpunk	nah I don't think it matters	15:53
damiandabrowski[m]	and `cat ./data/nova` has only zeros? i was testing it on victoria, but i think it doesn't matter	15:54
noonedeadpunk	it's pure mysql	15:54
spatel	k	15:54
spatel	damiandabrowski[m] yes all zero for nova and even other services also	15:55
spatel	only root and system user has 1 and 5 connection	15:55
damiandabrowski[m]	so IMO, openstack queries are parsed super fast so script doesn't catch it	15:55
damiandabrowski[m]	i mean, it just parses `mysql -sse "SELECT user,count(*) FROM information_schema.processlist WHERE command != 'Sleep' GROUP BY user;"` and saves the output to service files in ./data/	15:56
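A minimal sketch of the parsing step described above, not the actual script from the paste. It assumes the tab-separated `user<TAB>count` lines that `mysql -sse` prints; the function name and the sample text are hypothetical.

```python
# Parse the output of the processlist query into {user: active_connections}.
# Assumption: one "user<TAB>count" pair per line, as printed by `mysql -sse`.

def parse_active_counts(output):
    counts = {}
    for line in output.splitlines():
        line = line.strip()
        if "\t" not in line:
            continue  # skip blank or unexpected lines
        user, _, count = line.rpartition("\t")
        counts[user] = int(count)
    return counts


# Hypothetical sample, not real data from the meeting.
sample = "nova\t3\nkeystone\t1\n"
print(parse_active_counts(sample))
```

A user whose threads are all in the Sleep state produces no row in the query result, so it is simply absent from the dictionary, which is consistent with the zeros reported here.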
damiandabrowski[m]	during my case, the output was 0 in 99% of cases	15:57
spatel	noonedeadpunk this is my mytop command output - https://paste.opendev.org/show/811033/	15:59
spatel	majority are sleeping connection	15:59
noonedeadpunk	yeah and we aim to reduce their number)	16:00
spatel	does sleep consume resources ?	16:00
noonedeadpunk	well, they do. it's not like _huge_ problem, but unpleasant	16:01
noonedeadpunk	eventually you have tons of hanging tcp connections, which also puts some load on the tcp stack	16:01
spatel	how about setting up - interactive_timeout = 300	16:02
spatel	i believe default is 8hrs	16:02
spatel	or wait_timeout ?	16:03
spatel	interactive_timeout | 28800	16:03
spatel	wait_timeout | 28800	16:03
damiandabrowski[m]	sleeping connections are the most problematic when galera nodes are going down.	16:04
damiandabrowski[m]	Galera will keep them until timeout, that's how galera can easily reach max_connections	16:04
damiandabrowski[m]	yeah, actually my main point was to implement connection pooling and lower wait_timeout - but I'm open for Your ideas ;)	16:05
spatel	i had same issue.. that is why i kept max_connection to 5000 and more... my upgrade failed last time because of high max_connection limit	16:05
mgariepy	damiandabrowski[m], which service does consume the most mysql connections ?	16:08
damiandabrowski[m]	in my case: nova	16:10
spatel	yes nova is always on top	16:10
noonedeadpunk	#endmeeting	16:11
opendevmeet	Meeting ended Tue Nov 16 16:11:58 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)	16:11
opendevmeet	Minutes: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.html	16:11
opendevmeet	Minutes (text): https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.txt	16:11
opendevmeet	Log: https://meetings.opendev.org/meetings/openstack_ansible_meeting/2021/openstack_ansible_meeting.2021-11-16-15.00.log.html	16:11
mgariepy	noonedeadpunk, which one of galera or rabbit do you want me to tackle?	16:13
noonedeadpunk	would be great if you could take a look on galera	16:14
noonedeadpunk	but I bet this is mariadb bug	16:14
noonedeadpunk	since according to zuul logs mariadb-upgrade always fails on schema_redundant_indexes	16:15
mgariepy	ok i'll take a look at it a bit later.	16:15
noonedeadpunk	and this table has been implemented with 10.6 according to https://mariadb.com/docs/reference/mdb/sys/schema_redundant_indexes/	16:15
noonedeadpunk	also it's not easily reproducible...	16:15
mgariepy	ok	16:18
noonedeadpunk	I've also asked in #maria (on libera) but I guess they will just suggest filing a bug report and providing the collected stack trace	16:19
mgariepy	hmm ok it fails sometimes during upgrade? do you have the pointers to the logs somewhere?	16:21
noonedeadpunk	check out timeout tasks for https://review.opendev.org/c/openstack/openstack-ansible/+/817385	16:21
noonedeadpunk	that is always last task before timeout https://zuul.opendev.org/t/openstack/build/fed013733f9f415bb2d1118a26287bab/log/logs/host/mariadb.service.journal-17-59-27.log.txt#205	16:22
mgariepy	ok perfect.	16:24
mgariepy	it's always a timeout on focal metal or it doesn't matter ?	16:24
noonedeadpunk	oh well, I saw it indeed 2 times and only on focal right now	16:25
mgariepy	ok	16:25
mgariepy	perfect i'll dig into it after lunch. see if i can figure out what's going on.	16:25
noonedeadpunk	thanks!	16:27
noonedeadpunk	I think we should try at least to reproduce and collect debug information to post bug report	16:27
noonedeadpunk	ie https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#enabling-core-dumps	16:29
noonedeadpunk	as I dont think it's us who does smth wrong	16:30
mgariepy	yep ok	16:30
mgariepy	i'll script launching a couple of vms in my cloud to reproduce it and enable core dumps to try to grab the trace.	16:31
opendevreview	James Denton proposed openstack/openstack-ansible master: Deprecate OVN-related haproxy configuration https://review.opendev.org/c/openstack/openstack-ansible/+/813858	16:32
opendevreview	Merged openstack/ansible-config_template master: Fix repository URL in galaxy.yml https://review.opendev.org/c/openstack/ansible-config_template/+/817720	17:01
*** sshnaidm is now known as sshnaidm|afk		17:33
opendevreview	Merged openstack/openstack-ansible-os_nova master: Allow to provide mdev addresses as list https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/817738	17:51
*** tosky is now known as Guest6054		18:05
*** tosky_ is now known as tosky		18:05
mgariepy	i'll try to reproduce on a AIO lxc install this way i can reset only the galera container if it pass the mysql_upgrade	18:21
*** sshnaidm|afk is now known as sshnaidm		18:39
opendevreview	Merged openstack/openstack-ansible master: Minor update of openstack collection https://review.opendev.org/c/openstack/openstack-ansible/+/817851	18:45
opendevreview	Merged openstack/openstack-ansible master: Remove note about metal/horizon compatability https://review.opendev.org/c/openstack/openstack-ansible/+/771573	18:45
*** tosky_ is now known as tosky		18:48
mgariepy	noonedeadpunk, everysingle time i did the run on my server it has the issue.	19:53
mgariepy	looks like a race condition to me.	19:53
mgariepy	the line in journalctl: sys.schema_redundant_indexes OK is printed on startup.	20:03
opendevreview	Marc Gariépy proposed openstack/openstack-ansible-galera_server master: Relaod deamon on overrides file creation https://review.opendev.org/c/openstack/openstack-ansible-galera_server/+/818138	20:47
mgariepy	maybe this would belong to systemd_services ?	20:48
mgariepy	i'll be back tomorrow !	20:54
*** tosky is now known as Guest6070		22:42
*** tosky_ is now known as tosky		22:42
*** tosky is now known as Guest6073		23:07
*** tosky_ is now known as tosky		23:07
