*** diablo_rojo has joined #opendev-meeting | 17:51 | |
*** diablo_rojo has quit IRC | 18:18 | |
corvus | hi me! | 22:24 |
---|---|---|
corvus | okay, i have stopped zk03 | 22:24 |
clarkb | hello I'm paying attention to the zk stuff too and didn't mean to interrupt it. Just noticed the cross platform folks were asked to do ptg thinking by april 16th | 22:24 |
* clarkb follows along here too | 22:24 | |
fungi | this is the last cluster member, right? | 22:25 |
corvus | yep | 22:25 |
fungi | awesome | 22:25 |
fungi | so with it down we're running entirely on containerized cluster members already | 22:25 |
corvus | i noticed during the leader election of zk02 that it emitted an error about being unable to write out the 'next version of the dynamic config' | 22:26 |
corvus | it should have write access to the conf dir, but the existing zoo.cfg was root-owned | 22:26 |
corvus | i'm not expecting it to update zoo.cfg, so i dunno what's up with that | 22:26 |
corvus | it also wasn't too upset -- it didn't crash or anything | 22:26 |
corvus | but to learn more, i've chowned those on the other hosts, so maybe in a bit i'll force a leader election and see what happens | 22:27 |
corvus | but for now, i'll just proceed with zk03 | 22:27 |
corvus | running playbook now | 22:28 |
corvus | config files look good, starting | 22:32 |
corvus | it says it's synchronizing, but like zk02, it's taking a while; i may stop and start it again | 22:36 |
corvus | that appears better | 22:39 |
corvus | okay, i'm going to try restarting zk02 now since i think it's the leader | 22:41 |
corvus | this is not going well | 22:42 |
corvus | the remaining servers are rejecting client connection requests | 22:43 |
clarkb | corvus: we expected them to elect a leader and continue to run along right? | 22:44 |
corvus | infra-root: we may be about to see a lot of fallout in zuul | 22:44 |
corvus | yep | 22:44 |
clarkb | the "good" news is we know what that fallout looks like | 22:44 |
clarkb | (its the OOM in scheduler case) | 22:45 |
fungi | thanks for the heads up | 22:45 |
corvus | i restarted zk02 but they're still not happy | 22:45 |
corvus | i'm open to ideas | 22:48 |
corvus | maybe a full stop/start? | 22:48 |
clarkb | Refusing session request for client /104.130.246.196:53542 as it has seen zxid 0xb00000000 our last zxid is 0xa00040d89 client must try another server | 22:49 |
clarkb | that seems to be the issue. I think we want to find the server with zxid 0xb00000000 and make it the running one? | 22:49 |
clarkb | (the issue with clients) | 22:49 |
corvus | i don't think that zk02 is able to join | 22:49 |
fungi | i'm not finding any literal references to "next version of the dynamic config" in a web search, fwiw | 22:49 |
fungi | (if that was a literal quote) | 22:50 |
corvus | it was not | 22:50 |
clarkb | [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, so dropping the connection: (3, 2) is also interesting | 22:51 |
corvus | clarkb: 0xb00000000 looks suspicious to me | 22:51 |
corvus | i'd like to just stop everything and restart | 22:51 |
corvus | any objections? | 22:51 |
fungi | that is a very round nuber | 22:51 |
fungi | number | 22:52 |
fungi | corvus: no objection here | 22:52 |
clarkb | corvus: no objections here and yes I believe it is suspect too | 22:54 |
clarkb | if you look at logs for 'proposed zxid' they are all smaller | 22:54 |
corvus | fully stopped | 22:54 |
corvus | i will start it up in reverse order | 22:54 |
clarkb | when the election starts they emit log info for myid and proposed zkid | 22:54 |
clarkb | *zxid | 22:54 |
corvus | zk_1 | 2020-04-16 22:55:13,988 [myid:1] - ERROR [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):QuorumPeer@1619] - Error writing next dynamic config file to disk: | 22:56 |
corvus | that's the error | 22:56 |
corvus | it looks like zk1 and zk3 have established a quorum with 3 as the leader | 22:57 |
corvus | nodepool has reconnected | 22:57 |
corvus | zk02 still looks like it has not connected | 22:58 |
clarkb | https://zookeeper.apache.org/doc/r3.5.2-alpha/zookeeperReconfig.html#sc_reconfig_file | 22:58 |
clarkb | dynamicConfigFile is the config setting for that file path | 22:58 |
clarkb | trying to figure out what the default is now | 22:58 |
fungi | "the docs don't mention that the "static" configuration files _needs_ | 22:59 |
fungi | to be writable" | 22:59 |
fungi | https://github.com/pravega/zookeeper-operator/issues/66 | 22:59 |
corvus | yeah, maybe it's writing to some unwritable path in the container | 22:59 |
corvus | and we should explicitly set that | 22:59 |
clarkb | explicitly setting it solves needing to figure out what the default is :) | 23:00 |
clarkb | I like explicit | 23:00 |
corvus | i still can't think of why zk02 won't join | 23:00 |
clarkb | corvus: 2020-04-16 22:54:49,250 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@685] - Cannot open channel to 1 at election address zk01.openstack.org/23.253.236.126:3888 | 23:02 |
clarkb | could it be a tcp problem? | 23:02 |
clarkb | hrm no I can telnet | 23:02 |
clarkb | though maybe that was a race during startup and I need to look later in the logs | 23:03 |
corvus | yeah, that time looks suspect | 23:03 |
clarkb | crazy idea. we stop zk02, move its data file aside and start it and force it to sync that state from the other two | 23:03 |
corvus | sounds good | 23:04 |
clarkb | this assumes that maybe its a conflict between their db states (we would be asserting 01 and 03 are correct) | 23:04 |
corvus | fyi: here's an explanation of the 'smaller identifier' message http://zookeeper-user.578899.n2.nabble.com/Have-smaller-server-identifier-so-dropping-the-connection-td7583860.html | 23:04 |
clarkb | basically do the process of completely replacing a server | 23:04 |
corvus | 02 is down, moving data | 23:04 |
corvus | oh! | 23:05 |
corvus | cat myid | 23:05 |
corvus | 02 | 23:05 |
corvus | that should be single digit | 23:05 |
corvus | and it is on the other hosts | 23:06 |
corvus | how did that end up as 02? | 23:06 |
clarkb | corvus: I believe it was 02 from puppet | 23:06 |
clarkb | let me rephrase | 23:06 |
clarkb | I believe puppet put the 0 prefix because it got it from the hostname | 23:06 |
clarkb | and zk was ok with that on puppet deployed zk | 23:06 |
corvus | hrm. well, i guess i "improved" it then | 23:06 |
corvus | because the ansible stuff should be doing single digits | 23:06 |
corvus | and.... | 23:07 |
corvus | maybe i accidentally copied the puppet myid file on 02 | 23:07 |
clarkb | oh that could explain the errors with permissions too if file paths were different | 23:07 |
corvus | yeah, i don't think i can confirm that, but i think that's entirely possible | 23:07 |
corvus | i don't think there are any perm errors other than the dynamic config thing | 23:08 |
corvus | (i definitely didn't copy the puppet zoo.cfg file, that's in a different dir) | 23:08 |
corvus | clarkb: so let's continue with the 'restart from scratch' process, and just also fix the myid file | 23:08 |
clarkb | k | 23:08 |
corvus | i've restarted 2 | 23:13 |
corvus | no joy yet | 23:13 |
clarkb | this time around might take longer if its syncing the data? | 23:14 |
clarkb | hrm looks like container isn't running anymore | 23:14 |
corvus | oh, it was a *different* myid file | 23:14 |
corvus | i stopped it | 23:14 |
corvus | i'm going to rm the data again | 23:15 |
corvus | started again | 23:15 |
clarkb | corvus: INFO [NIOWorkerThread-1:FourLetterCommands@235] - The list of enabled four letter word commands is : [[srvr]] that was me trying to run the stat command and it wasn't allowed | 23:24 |
corvus | clarkb: ah yeah, i think we may need to whitelist them in this version | 23:24 |
clarkb | srvr is apparently more verbose than stat so I'll try that now | 23:24 |
clarkb | and that errors with zk is not currently serving requests | 23:25 |
clarkb | so it seems busy? | 23:25 |
corvus | clarkb: on zk02? | 23:25 |
clarkb | ya | 23:25 |
corvus | that's the one that's not in the quorum | 23:25 |
clarkb | did telnet localhost 2181 and issued that command | 23:25 |
clarkb | its like gearman | 23:25 |
corvus | try it on the other nodes | 23:26 |
clarkb | corvus: ya I seem to recall when I set up the puppetized cluster that things would still respond to confirm but maybe thats because it was always in quorum and never unhappy | 23:26 |
clarkb | but let me see what 01 says | 23:26 |
clarkb | https://issues.apache.org/jira/browse/ZOOKEEPER-2164 | 23:28 |
clarkb | corvus: ^ possible that is related? | 23:29 |
clarkb | there is at least one individual in there indicating the docker images have exhibited this where older 3.4 zk never did | 23:29 |
clarkb | corvus: the last thing there indicates that 3.5.8 will have the fix | 23:30 |
corvus | gah | 23:30 |
clarkb | computers | 23:30 |
clarkb | I've been summoned to start dinner plans, back in a bit | 23:30 |
corvus | there are 2 bugs in that bug | 23:31 |
corvus | one of them is this: 1254 | 23:31 |
corvus | grr | 23:31 |
corvus | https://github.com/apache/zookeeper/pull/1254 | 23:31 |
corvus | i think that's the one that's fixed in 3.5.8 | 23:31 |
corvus | and it's related to using 0.0.0.0 and connection ids being smaller | 23:32 |
corvus | so looks very likely | 23:32 |
corvus | okay, we *might* be able to work around this by specifying our actual ip addresses | 23:32 |
corvus | we should have that in the inventory | 23:32 |
corvus | i'm going to stop 02, then replace the config entirely with ipv4 addrs, then start 2 | 23:33 |
clarkb | ok | 23:33 |
corvus | instajoin | 23:35 |
corvus | clarkb: nice find :) | 23:35 |
clarkb | wow | 23:37 |
clarkb | so we need to name the members by ip explicitly? | 23:38 |
corvus | yep, due to that bug | 23:39 |
corvus | i'm a little confused about how we should handle the dynamic config | 23:39 |
corvus | perhaps we should omit it entirely and leave our zoo.cfg file owned by root | 23:41 |
clarkb | corvus: does it try to write to the normal file by default? | 23:41 |
clarkb | seems like explicitly setting a writeable path should work? | 23:42 |
corvus | clarkb: the way i read that is that if it writes a dynamic config file, it may, in some circumstances, rewrite the main config file and remove some lines | 23:42 |
corvus | which would be weird for us the next time ansible runs | 23:42 |
clarkb | oh huh | 23:43 |
corvus | i think i'm going to eod and leave things as they are right now | 23:44 |
corvus | i have 2 notes in my zk change to address; i'll do that tomorrow, and as part of that, we'll get the ips in the other config files | 23:44 |
clarkb | ya seems stable enough for now, they are in the emergency file right? | 23:44 |
corvus | then we can merge the change and remove from emergency | 23:44 |
corvus | yep | 23:44 |
corvus | infra-root: ^ fyi | 23:45 |
corvus | basically, zk hosts are stable, but in emergency file, because their configuration is ahead of what's in ansible. | 23:45 |
corvus | and we're running in containers now | 23:45 |
corvus | #status log upgraded zk ensemble to 3.5.7 running in containers | 23:46 |
openstackstatus | corvus: finished logging | 23:46 |
fungi | makes sense, thanks for working through that | 23:46 |
corvus | that *almost* went perfectly :) | 23:46 |
corvus | we actually did complete a rolling upgrade of zk from 3.4 to 3.5 without any user-visible impact | 23:47 |
corvus | it was only when we did further testing afterwards that we had the hiccup | 23:47 |
fungi | and a worthwhile experiment | 23:48 |
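
A note on the zxids quoted at 22:49: a ZooKeeper zxid is a 64-bit value whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch, which would explain why 0xb00000000 looks so round. Below is a minimal Python sketch of that split, using only the two values from the log line above; the helper name `split_zxid` is made up for illustration and is not part of any ZooKeeper tooling.

```python
def split_zxid(zxid: int) -> tuple:
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for illustration only.
    """
    epoch = zxid >> 32            # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: counter within the epoch
    return epoch, counter


# The two zxids from the "Refusing session request" message.
client_seen = split_zxid(0xB00000000)
server_last = split_zxid(0xA00040D89)
print("client has seen:", client_seen)
print("server last zxid:", server_last)
```

Read this way, the client had seen a zxid from epoch 11 with a counter of 0, while the server refusing it was still at epoch 10, which seems consistent with the "client must try another server" refusal.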