I have a 3-node ES cluster. Each node has 64 GB of memory and 12 x 1 TB SATA disks. The ES version is 1.4.4.
When I do a rolling restart, the restarted node is very slow to rejoin the cluster, even when I disable allocation via cluster.routing.allocation.enable.
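For reference, disabling allocation before a restart is normally done through the cluster update settings API. The sketch below only builds the request and does not send it. The host `http://localhost:9200` and the helper name `allocation_request` are assumptions for illustration and do not come from the report above.

```python
import json
import urllib.request


def allocation_request(enable, host="http://localhost:9200"):
    """Build (but do not send) a PUT /_cluster/settings request.

    `enable` is the value for cluster.routing.allocation.enable,
    e.g. "none" before stopping a node and "all" once it has rejoined.
    The host is a placeholder, not taken from the original report.
    """
    body = {"transient": {"cluster.routing.allocation.enable": enable}}
    return urllib.request.Request(
        url=host + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = allocation_request("none")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```

A transient setting is used here so that the value does not persist across a full cluster restart.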
From the log, it can take 6 minutes to join the cluster, after several retries.
So I turned on debug logging.
When I killed node A and started it again, I found the following:
Node A discovered the master node, but then appeared to block, as if waiting for something.
This is node B's log:
[2015-05-29 17:43:32,019][DEBUG][cluster.service ] [B] set local cluster state to version 3810
[2015-05-29 17:49:12,923][DEBUG][discovery.zen.publish ] [B] node [C][r910JSFeSKmdAQGIsWQ0fA][test3][inet[/xxx.xxx.xxx.xxx:9300]] responded for cluster state [3810] (took longer than [30s])
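The two log timestamps above are about 5 minutes 41 seconds apart, which is consistent with the roughly 6 minute join delay. A small sketch of that arithmetic, assuming the default log timestamp layout shown in the lines above (the helper name `gap_seconds` is made up for illustration):

```python
from datetime import datetime

# Timestamp layout used in the log lines above, e.g. "2015-05-29 17:43:32,019"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


if __name__ == "__main__":
    # B applies cluster state 3810, then C finally acknowledges it.
    gap = gap_seconds("2015-05-29 17:43:32,019", "2015-05-29 17:49:12,923")
    print("C took %.3f seconds to respond" % gap)
```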
After C responded for cluster state 3810, node A joined the cluster without any further blocking.
Then I read the source code and found that the cluster may be blocked while node C processes the cluster state change.
So I reproduced this situation and used jstack to dump the process stack:
"elasticsearch[231.189][clusterService#updateTask][T#1]" daemon prio=10 tid=0x00007f00a9b42000 nid=0x2f508 runnable [0x00007ef5f0e37000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
at org.elasticsearch.gateway.local.state.shards.LocalGatewayShardsState.writeShardState(LocalGatewayShardsState.java:294)
at org.elasticsearch.gateway.local.state.shards.LocalGatewayShardsState.clusterChanged(LocalGatewayShardsState.java:153)
at org.elasticsearch.gateway.local.LocalGateway.clusterChanged(LocalGateway.java:208)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:460)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:184)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:154)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
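The stack trace shows the cluster state update thread inside `FileChannelImpl.force`, which is an fsync, called from `writeShardState`. As a rough illustration of why that could add up (this is not Elasticsearch code; the file names, file count and payload are made up), the sketch below times writing many small state files with and without an fsync after each one. On spinning disks each fsync may cost several milliseconds or more, so a large number of shard state files written one after another on a single thread could plausibly take a long time.

```python
import os
import tempfile
import time


def write_states(directory, count, sync=True):
    """Write `count` small files, optionally fsyncing each one.

    Returns the elapsed wall-clock time in seconds.
    """
    payload = b'{"version": 1, "primary": true}'  # made-up content
    started = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, "state-%d" % i)
        with open(path, "wb") as handle:
            handle.write(payload)
            if sync:
                handle.flush()             # push Python's buffer to the OS
                os.fsync(handle.fileno())  # force the OS to write to disk
    return time.perf_counter() - started


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for sync in (False, True):
            elapsed = write_states(tmp, 500, sync=sync)
            print("fsync=%s: %.3fs for 500 files" % (sync, elapsed))
```

The absolute numbers depend heavily on the disk and filesystem, so this only shows the relative cost of forcing every write.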
The best place to ask about how to configure your cluster correctly is on the forum https://discuss.elastic.co/c/elasticsearch. That said, I'm guessing that you have a lot of indices and it is taking time to process all of the new shard allocations. There have been a number of improvements since 1.4, especially #11179 and #11336 and #11262.
1.6 will be out shortly - I advise upgrading to it when it is out, and it should improve cluster restart times a lot.
Thanks, clintongormley.
The cluster does have a lot of indices, but most of them are read-only.
I have posted the question to the forum.
Once 1.6 is released, I will give it a try.
Then I attached strace to the process to see what it was doing.
What happened in the cluster? How can I speed up the cluster restart?