The restart node join cluster very slow #11415

Closed
foobarget opened this issue May 29, 2015 · 2 comments

@foobarget

I have a 3-node ES cluster. Each node has 64 GB of memory and 12 x 1 TB SATA disks. The ES version is 1.4.4.

When I do a rolling restart, the restarted node joins the cluster very slowly, even when shard allocation is disabled through cluster.routing.allocation.enable.
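
For reference, the allocation toggle mentioned above is normally applied through the cluster settings API (`PUT /_cluster/settings`) before and after each node restart. Below is a minimal sketch of the request bodies. It assumes the `transient` scope, since the issue does not say which scope was used, and the helper name `allocation_settings_body` is made up for illustration.

```python
import json

def allocation_settings_body(mode: str) -> str:
    """Build the JSON body for PUT /_cluster/settings.

    `mode` is a value of cluster.routing.allocation.enable: "none" stops
    shard allocation and "all" restores the default behaviour.
    This helper is hypothetical and not part of any Elasticsearch client.
    """
    allowed = {"all", "primaries", "new_primaries", "none"}
    if mode not in allowed:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Assumption: transient scope, so the setting is lost on a full cluster restart.
    return json.dumps({"transient": {"cluster.routing.allocation.enable": mode}})

# Before stopping a node: disable allocation.
print(allocation_settings_body("none"))
# After the node has rejoined: enable allocation again.
print(allocation_settings_body("all"))
```

Note that this setting only controls shard allocation. It would not be expected to shorten the join itself if the delay comes from cluster state publishing, as the logs below suggest.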

From the log, it can take 6 minutes to join the cluster, after several retries.

So I enabled debug logging.

When I kill node A and then start it, I see that node A discovers the master node but then blocks, as if it is waiting for something.

This is node B's log:

[2015-05-29 17:43:32,019][DEBUG][cluster.service          ] [B] set local cluster state to version 3810
[2015-05-29 17:49:12,923][DEBUG][discovery.zen.publish    ] [B] node [C][r910JSFeSKmdAQGIsWQ0fA][test3][inet[/xxx.xxx.xxx.xxx:9300]] responded for cluster state [3810] (took longer than [30s])

After node C responded for cluster state 3810, node A joined the cluster without any further blocking.
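
The two timestamps in node B's log give the length of the stall directly. A small sketch of that arithmetic, assuming the timestamp layout shown in the log lines above (the function name `seconds_between` is only illustrative):

```python
from datetime import datetime

# Timestamp layout used in the log lines above, e.g. "2015-05-29 17:43:32,019".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()

# Cluster state 3810 set locally on B vs. C's response for the same state.
delay = seconds_between("2015-05-29 17:43:32,019", "2015-05-29 17:49:12,923")
print(f"{delay:.3f} s")  # 340.904 s, roughly 5 min 41 s
```

That is consistent with the roughly 6 minutes reported for the join.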

Then I read the source code and found that the cluster may be blocked while node C applies the cluster state change.
So I reproduced these steps and used jstack to dump the process stack:

"elasticsearch[231.189][clusterService#updateTask][T#1]" daemon prio=10 tid=0x00007f00a9b42000 nid=0x2f508 runnable [0x00007ef5f0e37000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
    at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
    at org.elasticsearch.gateway.local.state.shards.LocalGatewayShardsState.writeShardState(LocalGatewayShardsState.java:294)
    at org.elasticsearch.gateway.local.state.shards.LocalGatewayShardsState.clusterChanged(LocalGatewayShardsState.java:153)
    at org.elasticsearch.gateway.local.LocalGateway.clusterChanged(LocalGateway.java:208)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:460)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:184)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:154)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
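
When taking several dumps in a row, it can help to pull out only the cluster state update thread and its top frame, to check whether it is sitting in `force` every time. A rough sketch, assuming the usual jstack layout (a header line per thread, frames prefixed with `at`, threads separated by blank lines); `top_frames` is a made-up helper, and the sample is abbreviated from the dump above:

```python
def top_frames(dump: str, name_fragment: str):
    """Return (thread header, first stack frame) pairs for matching threads."""
    results = []
    header = None
    for line in dump.splitlines():
        stripped = line.strip()
        if not stripped:
            # Blank line: end of the current thread's section.
            header = None
        elif stripped.startswith("at "):
            if header is not None:
                results.append((header, stripped[3:]))
                header = None  # only the top frame is of interest
        elif name_fragment in stripped:
            header = stripped
    return results

sample = """\
"elasticsearch[231.189][clusterService#updateTask][T#1]" daemon prio=10 runnable
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
    at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
"""

for header, frame in top_frames(sample, "clusterService#updateTask"):
    print(frame)  # sun.nio.ch.FileDispatcherImpl.force0(Native Method)
```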

Then I attached strace to the process and saw:

[pid 193800] stat("/data1/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 193800] open("/data1/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state/state-63", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 2530
[pid 193800] stat("/data2/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 193800] open("/data2/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state/state-63", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 2530
[pid 193800] stat("/data3/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 193800] open("/data3/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state/state-63", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 2530
......
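
In the trace above, the same `state-63` file for one shard is opened under `/data1`, `/data2` and `/data3` in turn, which suggests the shard state is written once per data path. To see how many such writes a longer trace contains, the `open` calls can be grouped per shard. A sketch, assuming the path layout visible in the trace; the regular expression and the name `state_writes_by_shard` are illustrative, and the sample only contains the three `open` lines shown above since the rest of the trace is elided:

```python
import re
from collections import defaultdict

# Matches open() calls on shard state files, based on the paths in the trace.
STATE_OPEN = re.compile(
    r'open\("(?P<data_path>/[^"]*?)/nodes/\d+/indices/'
    r'(?P<index>[^/"]+)/(?P<shard>\d+)/_state/(?P<file>state-\d+)"'
)

def state_writes_by_shard(strace_lines):
    """Map (index, shard, state file) to the data paths it was opened under."""
    writes = defaultdict(list)
    for line in strace_lines:
        match = STATE_OPEN.search(line)
        if match:
            key = (match["index"], int(match["shard"]), match["file"])
            writes[key].append(match["data_path"])
    return dict(writes)

prefix = "/logsearch/data/elasticsearch/nodes/0/indices/xxx-2015.05.13/4/_state/state-63"
sample = [
    f'[pid 193800] open("/data{n}{prefix}", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 2530'
    for n in (1, 2, 3)
]

for key, paths in state_writes_by_shard(sample).items():
    print(key, len(paths))  # ('xxx-2015.05.13', 4, 'state-63') 3
```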

What is happening in the cluster? How can I speed up the cluster restart?

@foobarget changed the title from "cluster recover very slow" to "The restart node join cluster very slow" on May 29, 2015
@clintongormley

Hi @foobarget

The best place to ask about how to configure your cluster correctly is on the forum https://discuss.elastic.co/c/elasticsearch. That said, I'm guessing that you have a lot of indices and it is taking time to process all of the new shard allocations. There have been a number of improvements since 1.4, especially #11179 and #11336 and #11262.

1.6 will be out shortly - I advise upgrading to it when it is out, and it should improve cluster restart times a lot.

@foobarget
Author

Thanks, clintongormley.
The cluster does have a lot of indices, but most of them are read-only.
I will post the question to the forum.
When 1.6 is ready, I will give it a try.
