
Delayed allocation on node leave #11438

Closed

Conversation

@kimchy
Member

kimchy commented May 31, 2015

Delay allocation of unassigned shards for a specific period (defaults to 5m) when a node leaves the cluster, to give it a chance to come back and not cause excessive allocations.

With the default value, this new behavior means that when a node leaves the cluster, the shards that were assigned to it will only be reallocated to the rest of the cluster after the specified duration.

The number of delayed unassigned shards can be retrieved using the cluster health API.

The setting that controls the duration is `cluster.routing.allocation.delay_unassigned_allocation.duration`, and it is dynamically updatable using the cluster update settings API (only applicable to master nodes).
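For illustration, a minimal sketch of updating the proposed setting transiently through the Java client; the setting name is the one introduced by this PR, and builder/accessor naming may differ across client versions:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class DelayedAllocationSettingExample {

    // Sketch only: assumes an already-connected Client; the setting name is the
    // one proposed in this PR, and builder naming may differ between versions.
    public static void setDelayedAllocationDuration(Client client, String duration) {
        client.admin().cluster().prepareUpdateSettings()
                .setTransientSettings(Settings.builder()
                        .put("cluster.routing.allocation.delay_unassigned_allocation.duration", duration)
                        .build())
                .get();
    }
}
```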

The reroute cluster command now also accepts an optional `delayed_duration` parameter; when set, it overrides the duration for that reroute operation. This can be handy, for example, for setting it to 0 so that the currently delayed shards get assigned immediately.

The concept of a node key is also introduced, allowing users to specify what identifies a node in a "cross restart" manner. The possible values for the setting `cluster.routing.allocation.delay_unassigned_allocation.node_key` are `name` (the node name), `id` (the node id; note that it is randomly generated on node startup), `host_address` (the host IP address), `host_name` (the host name), and `transport_address` (the transport address, i.e. IP + port of the node). Defaults to `transport_address`.
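To make the node key options concrete, here is a hypothetical sketch of how such a key might be derived from a `DiscoveryNode`. The enum and method names are illustrative only; the PR's diff (excerpted further down) only shows the key being consumed via `nodeKey.getNodeKey(...)`:

```java
import org.elasticsearch.cluster.node.DiscoveryNode;

// Illustrative only: mirrors the documented node_key values.
public enum NodeKeyType {
    NAME, ID, HOST_ADDRESS, HOST_NAME, TRANSPORT_ADDRESS;

    public String getNodeKey(DiscoveryNode node) {
        switch (this) {
            case NAME:         return node.getName();        // node name
            case ID:           return node.getId();          // randomly generated on node startup
            case HOST_ADDRESS: return node.getHostAddress(); // host IP address
            case HOST_NAME:    return node.getHostName();    // host name
            default:           return node.getAddress().toString(); // transport address: IP + port
        }
    }
}
```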

Closes #7288

    /**
     * Clears the current delayed allocations.
     */
    public void clearDelayedAllocations() {
Member

This should probably be synchronized too since you're protecting access to delayedAllocations.
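For clarity, the suggested change is just adding the keyword to the declaration (sketch; the method body is elided in the excerpt above):

```java
/**
 * Clears the current delayed allocations.
 */
public synchronized void clearDelayedAllocations() {
    // body unchanged (elided above); synchronized so that access to the
    // delayedAllocations map is guarded consistently with the other methods
}
```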

Member Author

Doh! Thanks for spotting it. It used to be a concurrent map, but then with the sync it was not needed, and I missed it.

@pickypg
Member

pickypg commented May 31, 2015

Minor comments and thoughts. I'm excited to see this get in!

for (ObjectCursor<DiscoveryNode> cursor : allocation.nodes().dataNodes().values()) {
    if (delayedAllocationNodeKey.equals(nodeKey.getNodeKey(cursor.value))) {
        // we found a node key that has come back, remove it from the delayed allocations
        logger.info("node {} joined the cluster back, removing delayed allocation", cursor.value);

"joined the cluster back" -> "rejoined the cluster"

@clintongormley

If a node dies unexpectedly, does the admin have a way to override the duration, or is it just a matter of resetting the duration temporarily?

@kimchy
Member Author

kimchy commented May 31, 2015

Pushed a change for the first review.

> If a node dies unexpectedly, does the admin have a way to override the duration, or is it just a matter of resetting the duration temporarily?

Yea, the duration setting needs to be updated temporarily to unblock it; we could have a flag in reroute as well to ignore it if we want.

I also want to raise the question of the default duration. I went with 5 minutes, but I would love to get input on what the best default would be. It needs to cover the most common restart times of a node (and, conceptually, of a VM).

@kimchy force-pushed the delay_node_leaving_reroute branch from ce79a49 to c953bd2 on June 1, 2015 at 00:20
@kimchy
Member Author

kimchy commented Jun 1, 2015

@clintongormley I added the ability to set the `delayed_duration` in the cluster reroute command as well, so people can more easily set it without having to change the setting value, which is more permanent.

@s1monw
Contributor

s1monw commented Jun 1, 2015

Thanks Shay for opening this PR; this has been a pain for a long time. I looked at the implementation, and there are a couple of things that concern me:

  • Having a time component directly attached to the shard allocation makes me wanna run away. I am really concerned about this producing very weird effects (maybe due to misconfiguration, what have you) where people still see the problems we are trying to fix here.
  • Binding a shard to something like a transport address, node id, port, or $some_other_random_value is what we are trying to get rid of, not something we should add. We essentially already have the right information to do the right thing if a node comes back.
  • Adding such a large amount of stateful code to a class that I'd love to get rid of is essentially the wrong path forward IMO. GatewayAllocator owns too much already, and I think we should rather make use of our already existing abstractions.

Based on this, I'd like to propose a different way of preventing the allocation dance, which might also allow separating out the time component. What I have in mind is a notion of stickiness, where a shard can only be allocated on a node that has the shard's data locally AND a version of the shard greater than or equal to the ShardRouting it used to have before it was moved to UNASSIGNED. Shards that are created due to replica expansion or due to index creation can always be allocated.
Ideally we would be able to have a simple allocation decider that prevents us from allocating such shards (a rough sketch of one follows below).

In a second step we can add some mechanism that removes the stickiness from a shard based on elapsed time, the number of nodes in the cluster, the number of unallocated shards, user interaction, $yoursgoeshere.
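To make the proposal concrete, here is a rough sketch of what such a stickiness decider might look like, built on the existing `AllocationDecider` extension point. The lookup helpers are hypothetical placeholders for the on-disk shard state, and constructor/registration plumbing (which differs across versions) is omitted:

```java
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

/**
 * Sketch of the proposed "stickiness" rule: an unassigned shard that was
 * previously assigned may only go to a node that already holds its data
 * locally, at a version greater than or equal to the one it had before it
 * became unassigned. Newly created shards (index creation, replica expansion)
 * are never sticky. Constructor and registration plumbing are omitted and the
 * lookup helpers are placeholders.
 */
public class StickyShardAllocationDecider extends AllocationDecider {

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if (isNewlyCreated(shardRouting)) {
            // shards created by index creation or replica expansion can always be allocated
            return Decision.YES;
        }
        // sticky shards: only allow nodes that hold an up-to-date local copy
        if (hasLocalCopy(node, shardRouting)
                && localShardVersion(node, shardRouting) >= lastKnownVersion(shardRouting)) {
            return Decision.YES;
        }
        return Decision.NO;
    }

    // --- hypothetical helpers, standing in for the real on-disk shard state lookups ---

    private boolean isNewlyCreated(ShardRouting shardRouting) {
        return false; // placeholder: was this shard created rather than previously assigned?
    }

    private boolean hasLocalCopy(RoutingNode node, ShardRouting shardRouting) {
        return false; // placeholder: does this node have the shard's data on disk?
    }

    private long localShardVersion(RoutingNode node, ShardRouting shardRouting) {
        return -1;    // placeholder: on-disk version of the shard on this node
    }

    private long lastKnownVersion(ShardRouting shardRouting) {
        return -1;    // placeholder: version the shard had before it became unassigned
    }
}
```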

@kimchy
Member Author

kimchy commented Jun 12, 2015

We brainstormed about it a bit, and there is a way to make it simpler: basically keeping, on the shard routing, when and why a shard moved to unassigned, and using that to decide when to force-allocate it (this basically only applies to replicas) when no copy is found within the cluster. Will open a subsequent pull request for the new logic.
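As a rough illustration of that direction, the information carried on the shard routing might look something like the following; this is a hypothetical sketch, not code from the follow-up pull request:

```java
/**
 * Hypothetical sketch of the "when and why" record that would travel with a
 * shard routing once it becomes unassigned, so the allocator can later decide
 * whether to keep waiting for the original node to return or to force-allocate
 * the replica elsewhere.
 */
public final class UnassignedReason {

    public enum Reason {
        INDEX_CREATED,      // brand new shard, allocate immediately
        REPLICA_ADDED,      // replica expansion, allocate immediately
        NODE_LEFT,          // node left the cluster, candidate for delayed allocation
        ALLOCATION_FAILED   // a previous allocation attempt failed
    }

    private final Reason reason;
    private final long unassignedTimeMillis;

    public UnassignedReason(Reason reason, long unassignedTimeMillis) {
        this.reason = reason;
        this.unassignedTimeMillis = unassignedTimeMillis;
    }

    public Reason getReason() {
        return reason;
    }

    /** Time the shard has spent unassigned, used to decide when to force allocation. */
    public long elapsedMillis(long nowMillis) {
        return nowMillis - unassignedTimeMillis;
    }
}
```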

Merging this pull request may close: Add option to prevent recovering replica shards onto nodes that did not have them locally.