
Delayed allocation on node leave #11438

Closed

Conversation

@kimchy
Member

kimchy commented May 31, 2015

Delay allocation of unassigned shards for a specific period (defaults to 5m) when a node leaves the cluster, to give it a chance to come back and not cause excessive allocations.

With the default value, this new behavior means that when a node leaves the cluster, the shards that were assigned to it will only be reallocated to the rest of the cluster after the specified duration.

The number of delayed unassigned shards can be retrieved using the cluster health API.

The setting that controls the duration is `cluster.routing.allocation.delay_unassigned_allocation.duration`, and it is dynamically updatable using the cluster update settings API (only applicable to master nodes).
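For illustration, a minimal sketch of updating the proposed setting transiently through the Java client; the setting name is the one introduced by this PR, and builder/accessor naming may differ across client versions:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class DelayedAllocationSettingExample {

    // Sketch only: assumes an already-connected Client; the setting name is the
    // one proposed in this PR, and builder naming may differ between versions.
    public static void setDelayedAllocationDuration(Client client, String duration) {
        client.admin().cluster().prepareUpdateSettings()
                .setTransientSettings(Settings.builder()
                        .put("cluster.routing.allocation.delay_unassigned_allocation.duration", duration)
                        .build())
                .get();
    }
}
```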

The reroute cluster command now also accepts an optional `delayed_duration` parameter; when set, it overrides the duration for that reroute operation. This can be handy, for example, for setting it to 0 so that the currently delayed shards get assigned immediately.

The concept of a node key is also introduced, allowing users to specify what identifies a node in a "cross restart" manner. The possible values for the setting `cluster.routing.allocation.delay_unassigned_allocation.node_key` are `name` (the node name), `id` (the node id; note that it is randomly generated on node startup), `host_address` (the host IP address), `host_name` (the host name), and `transport_address` (the transport address, i.e. IP + port of the node). Defaults to `transport_address`.
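To make the node key options concrete, here is a hypothetical sketch of how such a key might be derived from a `DiscoveryNode`. The enum and method names are illustrative only; the PR's diff (excerpted further down) only shows the key being consumed via `nodeKey.getNodeKey(...)`:

```java
import org.elasticsearch.cluster.node.DiscoveryNode;

// Illustrative only: mirrors the documented node_key values.
public enum NodeKeyType {
    NAME, ID, HOST_ADDRESS, HOST_NAME, TRANSPORT_ADDRESS;

    public String getNodeKey(DiscoveryNode node) {
        switch (this) {
            case NAME:         return node.getName();        // node name
            case ID:           return node.getId();          // randomly generated on node startup
            case HOST_ADDRESS: return node.getHostAddress(); // host IP address
            case HOST_NAME:    return node.getHostName();    // host name
            default:           return node.getAddress().toString(); // transport address: IP + port
        }
    }
}
```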

Closes #7288

    /**
     * Clears the current delayed allocations.
     */
    public void clearDelayedAllocations() {
Member

This should probably be synchronized too since you're protecting access to delayedAllocations.
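For clarity, the suggested change is just adding the keyword to the declaration (sketch; the method body is elided in the excerpt above):

```java
/**
 * Clears the current delayed allocations.
 */
public synchronized void clearDelayedAllocations() {
    // body unchanged (elided above); synchronized so that access to the
    // delayedAllocations map is guarded consistently with the other methods
}
```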

Member Author

Doh! Thanks for spotting it. It used to be a concurrent map, but then with the sync it was not needed, and I missed it.

@pickypg
Member

pickypg commented May 31, 2015

Minor comments and thoughts. I'm excited to see this get in!

for (ObjectCursor<DiscoveryNode> cursor : allocation.nodes().dataNodes().values()) {
    if (delayedAllocationNodeKey.equals(nodeKey.getNodeKey(cursor.value))) {
        // we found a node key that has come back, remove it from the delayed allocations
        logger.info("node {} joined the cluster back, removing delayed allocation", cursor.value);

"joined the cluster back" -> "rejoined the cluster"

@clintongormley

If a node dies unexpectedly, does the admin have a way to override the duration, or is it just a matter of resetting the duration temporarily?

@kimchy
Member Author

kimchy commented May 31, 2015

Pushed a change for the first review.

> If a node dies unexpectedly, does the admin have a way to override the duration, or is it just a matter of resetting the duration temporarily?

Yea, the duration setting needs to be updated temporarily to unblock it; we could have a flag in reroute as well to ignore it if we want.

I also want to raise the question of the default duration. I went with 5 minutes, but I would love to get input on what the best default would be. It needs to cover the most common restart times of a node (and, conceptually, of a VM).

@kimchy force-pushed the delay_node_leaving_reroute branch from ce79a49 to c953bd2 on June 1, 2015 at 00:20
@kimchy
Member Author

kimchy commented Jun 1, 2015

@clintongormley I added the ability to set the `delayed_duration` in the cluster reroute command as well, so people can more easily set it without having to change the setting value, which is more permanent.

@s1monw
Contributor

s1monw commented Jun 1, 2015

Thanks Shay for opening this PR; this has been a pain for a long time. I looked at the implementation, and there are a couple of things that concern me:

  • Having a time component directly attached to the shard allocation makes me wanna run away. I am really concerned about this producing very weird effects (maybe due to misconfiguration, what have you) where people still see the problems we are trying to fix here.
  • Binding a shard to something like a transport address, node id, port, or $some_other_random_value is what we are trying to get rid of, not something we should add. We essentially already have the right information to do the right thing if a node comes back.
  • Adding such a large amount of stateful code to a class that I'd love to get rid of is essentially the wrong path forward IMO. GatewayAllocator owns too much already, and I think we should rather make use of our already existing abstractions.

Based on this, I'd like to propose a different way of preventing the allocation dance, which might also allow separating out the time component. What I have in mind is a notion of stickiness, where a shard can only be allocated on a node that has the shard's data locally AND a version of the shard greater than or equal to the ShardRouting it used to have before it was moved to UNASSIGNED. Shards that are created due to replica expansion or due to index creation can always be allocated.
Ideally we would be able to have a simple allocation decider that prevents us from allocating such shards (a rough sketch of one follows below).

In a second step we can add some mechanism that removes the stickiness from a shard based on elapsed time, the number of nodes in the cluster, the number of unallocated shards, user interaction, $yoursgoeshere.
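To make the proposal concrete, here is a rough sketch of what such a stickiness decider might look like, built on the existing `AllocationDecider` extension point. The lookup helpers are hypothetical placeholders for the on-disk shard state, and constructor/registration plumbing (which differs across versions) is omitted:

```java
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

/**
 * Sketch of the proposed "stickiness" rule: an unassigned shard that was
 * previously assigned may only go to a node that already holds its data
 * locally, at a version greater than or equal to the one it had before it
 * became unassigned. Newly created shards (index creation, replica expansion)
 * are never sticky. Constructor and registration plumbing are omitted and the
 * lookup helpers are placeholders.
 */
public class StickyShardAllocationDecider extends AllocationDecider {

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if (isNewlyCreated(shardRouting)) {
            // shards created by index creation or replica expansion can always be allocated
            return Decision.YES;
        }
        // sticky shards: only allow nodes that hold an up-to-date local copy
        if (hasLocalCopy(node, shardRouting)
                && localShardVersion(node, shardRouting) >= lastKnownVersion(shardRouting)) {
            return Decision.YES;
        }
        return Decision.NO;
    }

    // --- hypothetical helpers, standing in for the real on-disk shard state lookups ---

    private boolean isNewlyCreated(ShardRouting shardRouting) {
        return false; // placeholder: was this shard created rather than previously assigned?
    }

    private boolean hasLocalCopy(RoutingNode node, ShardRouting shardRouting) {
        return false; // placeholder: does this node have the shard's data on disk?
    }

    private long localShardVersion(RoutingNode node, ShardRouting shardRouting) {
        return -1;    // placeholder: on-disk version of the shard on this node
    }

    private long lastKnownVersion(ShardRouting shardRouting) {
        return -1;    // placeholder: version the shard had before it became unassigned
    }
}
```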

@kimchy
Member Author

kimchy commented Jun 12, 2015

We brainstormed about it a bit, and there is a way to make it simpler: basically keeping, on the shard routing, when and why a shard moved to unassigned, and using that to decide when to force-allocate it (this basically only applies to replicas) when no copy is found within the cluster. Will open a subsequent pull request for the new logic.
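As a rough illustration of that direction, the information carried on the shard routing might look something like the following; this is a hypothetical sketch, not code from the follow-up pull request:

```java
/**
 * Hypothetical sketch of the "when and why" record that would travel with a
 * shard routing once it becomes unassigned, so the allocator can later decide
 * whether to keep waiting for the original node to return or to force-allocate
 * the replica elsewhere.
 */
public final class UnassignedReason {

    public enum Reason {
        INDEX_CREATED,      // brand new shard, allocate immediately
        REPLICA_ADDED,      // replica expansion, allocate immediately
        NODE_LEFT,          // node left the cluster, candidate for delayed allocation
        ALLOCATION_FAILED   // a previous allocation attempt failed
    }

    private final Reason reason;
    private final long unassignedTimeMillis;

    public UnassignedReason(Reason reason, long unassignedTimeMillis) {
        this.reason = reason;
        this.unassignedTimeMillis = unassignedTimeMillis;
    }

    public Reason getReason() {
        return reason;
    }

    /** Time the shard has spent unassigned, used to decide when to force allocation. */
    public long elapsedMillis(long nowMillis) {
        return nowMillis - unassignedTimeMillis;
    }
}
```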

Merging this pull request may close: Add option to prevent recovering replica shards onto nodes that did not have them locally.