Thursday, 2020-04-16

*** diablo_rojo has joined #opendev-meeting		17:51
*** diablo_rojo has quit IRC		18:18
corvus	hi me!	22:24
corvus	okay, i have stopped zk03	22:24
clarkb	hello I'm paying attention to the zk stuff too and didn't mean to interrupt it. Just noticed the cross platform folks were asked to do ptg thinking by April 16th	22:24
* clarkb follows along here too		22:24
fungi	this is the last cluster member, right?	22:25
corvus	yep	22:25
fungi	awesome	22:25
fungi	so with it down we're running entirely on containerized cluster members already	22:25
corvus	i noticed during the leader election of zk02 that it emitted an error about being unable to write out the 'next version of the dynamic config'	22:26
corvus	it should have write access to the conf dir, but the existing zoo.cfg was root-owned	22:26
corvus	i'm not expecting it to update zoo.cfg, so i dunno what's up with that	22:26
corvus	it also wasn't too upset -- it didn't crash or anything	22:26
corvus	but to learn more, i've chowned those on the other hosts, so maybe in a bit i'll force a leader election and see what happens	22:27
corvus	but for now, i'll just proceed with zk03	22:27
corvus	running playbook now	22:28
corvus	config files look good, starting	22:32
corvus	it says it's synchronizing, but like zk02, it's taking a while; i may stop and start it again	22:36
corvus	that appears better	22:39
corvus	okay, i'm going to try restarting zk02 now since i think it's the leader	22:41
corvus	this is not going well	22:42
corvus	the remaining servers are rejecting client connection requests	22:43
clarkb	corvus: we expected them to elect a leader and continue to run along right?	22:44
corvus	infra-root: we may be about to see a lot of fallout in zuul	22:44
corvus	yep	22:44
clarkb	the "good" news is we know what that fallout looks like	22:44
clarkb	(it's the OOM in scheduler case)	22:45
fungi	thanks for the heads up	22:45
corvus	i restarted zk02 but they're still not happy	22:45
corvus	i'm open to ideas	22:48
corvus	maybe a full stop/start?	22:48
clarkb	Refusing session request for client /104.130.246.196:53542 as it has seen zxid 0xb00000000 our last zxid is 0xa00040d89 client must try another server	22:49
clarkb	that seems to be the issue. I think we want to find the server with zxid 0xb00000000 and make it the running one?	22:49
clarkb	(the issue with clients)	22:49
corvus	i don't think that zk02 is able to join	22:49
fungi	i'm not finding any literal references to "next version of the dynamic config" in a web search, fwiw	22:49
fungi	(if that was a literal quote)	22:50
corvus	it was not	22:50
clarkb	[WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2) is also interesting	22:51
corvus	clarkb: 0xb00000000 looks suspicious to me	22:51
corvus	i'd like to just stop everything and restart	22:51
corvus	any objections?	22:51
fungi	that is a very round nuber	22:51
fungi	number	22:52
fungi	corvus: no objection here	22:52
clarkb	corvus: no objections here and yes I believe it is suspect too	22:54
clarkb	if you look at logs for 'proposed zxid' they are all smaller	22:54
corvus	fully stopped	22:54
corvus	i will start it up in reverse order	22:54
clarkb	when the election starts they emit log info for myid and proposed zkid	22:54
clarkb	*zxid	22:54
corvus	zk_1 | 2020-04-16 22:55:13,988 [myid:1] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):QuorumPeer@1619] - Error writing next dynamic config file to disk:	22:56
corvus	that's the error	22:56
corvus	it looks like zk1 and zk3 have established a quorum with 3 as the leader	22:57
corvus	nodepool has reconnected	22:57
corvus	zk02 still looks like it has not connected	22:58
clarkb	https://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html#sc_reconfig_file	22:58
clarkb	dynamicConfigFile is the config setting for that file path	22:58
clarkb	trying to figure out what the default is now	22:58
fungi	"the docs don't mention that the "static" configuration files _needs_	22:59
fungi	to be writable"	22:59
fungi	https://github.com/pravega/zookeeper-operator/issues/66	22:59
corvus	yeah, maybe it's writing to some unwritable path in the container	22:59
corvus	and we should explicitly set that	22:59
clarkb	explicitly setting it solves needing to figure out what the default is :)	23:00
clarkb	I like explicit	23:00
corvus	i still can't think of why zk02 won't join	23:00
clarkb	corvus: 2020-04-16 22:54:49,250 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@685] - Cannot open channel to 1 at election address zk01.openstack.org/23.253.236.126:3888	23:02
clarkb	could it be a tcp problem?	23:02
clarkb	hrm no I can telnet	23:02
clarkb	though maybe that was a race during startup and I need to look later in the logs	23:03
corvus	yeah, that time looks suspect	23:03
clarkb	crazy idea. we stop zk02, move its data file aside and start it and force it to sync that state from the other two	23:03
corvus	sounds good	23:04
clarkb	this assumes that maybe it's a conflict between their db states (we would be asserting 01 and 03 are correct)	23:04
corvus	fyi: here's an explanation of the 'smaller identifier' message http://zookeeper-user.578899.n2.nabble.com/Have-smaller-server-identifier-so-dropping-the-connection-td7583860.html	23:04
clarkb	basically do the process of completely replacing a server	23:04
corvus	02 is down, moving data	23:04
corvus	oh!	23:05
corvus	cat myid	23:05
corvus	02	23:05
corvus	that should be single digit	23:05
corvus	and it is on the other hosts	23:06
corvus	how did that end up as 02?	23:06
clarkb	corvus: I believe it was 02 from puppet	23:06
clarkb	let me rephrase	23:06
clarkb	I believe puppet put the 0 prefix because it got it from the hostname	23:06
clarkb	and zk was ok with that on puppet deployed zk	23:06
corvus	hrm. well, i guess i "improved" it then	23:06
corvus	because the ansible stuff should be doing single digits	23:06
corvus	and....	23:07
corvus	maybe i accidentally copied the puppet myid file on 02	23:07
clarkb	oh that could explain the errors with permissions too if file paths were different	23:07
corvus	yeah, i don't think i can confirm that, but i think that's entirely possible	23:07
corvus	i don't think there are any perm errors other than the dynamic config thing	23:08
corvus	(i definitely didn't copy the puppet zoo.cfg file, that's in a different dir)	23:08
corvus	clarkb: so let's continue with the 'restart from scratch' process, and just also fix the myid file	23:08
clarkb	k	23:08
corvus	i've restarted 2	23:13
corvus	no joy yet	23:13
clarkb	this time around might take longer if it's syncing the data?	23:14
clarkb	hrm looks like container isn't running anymore	23:14
corvus	oh, it was a different myid file	23:14
corvus	i stopped it	23:14
corvus	i'm going to rm the data again	23:15
corvus	started again	23:15
clarkb	corvus: INFO [NIOWorkerThread-1:FourLetterCommands@235] - The list of enabled four letter word commands is : [[srvr]] that was me trying to run the stat command and it wasn't allowed	23:24
corvus	clarkb: ah yeah, i think we may need to whitelist them in this version	23:24
clarkb	srvr is apparently more verbose than stat so I'll try that now	23:24
clarkb	and that errors with zk is not currently serving requests	23:25
clarkb	so it seems busy?	23:25
corvus	clarkb: on zk02?	23:25
clarkb	ya	23:25
corvus	that's the one that's not in the quorum	23:25
clarkb	did telnet localhost 2181 and issued that command	23:25
clarkb	it's like gearman	23:25
corvus	try it on the other nodes	23:26
clarkb	corvus: ya I seem to recall when I set up the puppetized cluster that things would still respond to confirm but maybe that's because it was always in quorum and never unhappy	23:26
clarkb	but let me see what 01 says	23:26
clarkb	https://issues.apache.org/jira/browse/ZOOKEEPER-2164	23:28
clarkb	corvus: ^ possible that is related?	23:29
clarkb	there is at least one individual in there indicating the docker images have exhibited this where older 3.4 zk never did	23:29
clarkb	corvus: the last thing there indicates that 3.5.8 will have the fix	23:30
corvus	gah	23:30
clarkb	computers	23:30
clarkb	I've been summoned to start dinner plans, back in a bit	23:30
corvus	there are 2 bugs in that bug	23:31
corvus	one of them is this: 1254	23:31
corvus	grr	23:31
corvus	https://github.com/apache/zookeeper/pull/1254	23:31
corvus	i think that's the one that's fixed in 3.5.8	23:31
corvus	and it's related to using 0.0.0.0 and connection ids being smaller	23:32
corvus	so looks very likely	23:32
corvus	okay, we might be able to work around this by specifying our actual ip addresses	23:32
corvus	we should have that in the inventory	23:32
corvus	i'm going to stop 02, then replace the config entirely with ipv4 addrs, then start 2	23:33
clarkb	ok	23:33
corvus	instajoin	23:35
corvus	clarkb: nice find :)	23:35
clarkb	wow	23:37
clarkb	so we need to name the members by ip explicitly?	23:38
corvus	yep, due to that bug	23:39
corvus	i'm a little confused about how we should handle the dynamic config	23:39
corvus	perhaps we should omit it entirely and leave our zoo.cfg file owned by root	23:41
clarkb	corvus: does it try to write to the normal file by default?	23:41
clarkb	seems like explicitly setting a writeable path should work?	23:42
corvus	clarkb: the way i read that is that if it writes a dynamic config file, it may, in some circumstances, rewrite the main config file and remove some lines	23:42
corvus	which would be weird for us the next time ansible runs	23:42
clarkb	oh huh	23:43
corvus	i think i'm going to eod and leave things as they are right now	23:44
corvus	i have 2 notes in my zk change to address; i'll do that tomorrow, and as part of that, we'll get the ips in the other config files	23:44
clarkb	ya seems stable enough for now, they are in the emergency file right?	23:44
corvus	then we can merge the change and remove from emergency	23:44
corvus	yep	23:44
corvus	infra-root: ^ fyi	23:45
corvus	basically, zk hosts are stable, but in emergency file, because their configuration is ahead of what's in ansible.	23:45
corvus	and we're running in containers now	23:45
corvus	#status log upgraded zk ensemble to 3.5.7 running in containers	23:46
openstackstatus	corvus: finished logging	23:46
fungi	makes sense, thanks for working through that	23:46
corvus	that almost went perfectly :)	23:46
corvus	we actually did complete a rolling upgrade of zk from 3.4 to 3.5 without any user-visible impact	23:47
corvus	it was only when we did further testing afterwards that we had the hiccup	23:47
fungi	and a worthwhile experiment	23:48
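
Note on the zxid values quoted at 22:49: a ZooKeeper zxid is a 64-bit number whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter within that epoch. The minimal Python sketch below splits the two values from the "Refusing session request" message. The helper name `split_zxid` is hypothetical and is not part of any ZooKeeper tooling. Read this way, 0xb00000000 is the very first zxid of epoch 11, which would explain why it looked so round, and it suggests (as an interpretation, not something the log confirms) that the client had seen a newer epoch than the server that refused it.

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit ZooKeeper zxid into (epoch, counter).

    Hypothetical helper for illustration only.
    """
    epoch = zxid >> 32            # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: counter within the epoch
    return epoch, counter


# Values copied from the log line at 22:49.
client_seen = split_zxid(0xb00000000)
server_last = split_zxid(0xa00040d89)

print(client_seen)  # (11, 0)
print(server_last)  # (10, 265609)
```

Comparing the epochs first and the counters second gives the same ordering as comparing the raw zxid integers.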
