Delayed allocation on node leave #11438
Conversation
/**
 * Clears the current delayed allocations.
 */
public void clearDelayedAllocations() {
This should probably be `synchronized` too, since you're protecting access to `delayedAllocations`.
Doh! Thanks for spotting it. It used to be a concurrent map, but with the synchronization that was no longer needed, and I missed this one.
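A minimal sketch of the fix being discussed (not the actual Elasticsearch class; names are illustrative): once the former concurrent map is replaced with a plain map guarded by the instance monitor, every accessor, including the clear method, has to take that same monitor.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the delayed-allocations holder in this PR.
public class DelayedAllocationsDemo {
    // guarded by "this"; a plain HashMap is only safe if every access synchronizes
    private final Map<String, Long> delayedAllocations = new HashMap<>();

    public synchronized void addDelayedAllocation(String nodeKey, long timeoutMillis) {
        delayedAllocations.put(nodeKey, timeoutMillis);
    }

    public synchronized int delayedAllocationCount() {
        return delayedAllocations.size();
    }

    // without "synchronized" here, clear() could race with the put() above
    public synchronized void clearDelayedAllocations() {
        delayedAllocations.clear();
    }

    public static void main(String[] args) {
        DelayedAllocationsDemo demo = new DelayedAllocationsDemo();
        demo.addDelayedAllocation("10.0.0.5:9300", 300_000L);
        System.out.println("before clear: " + demo.delayedAllocationCount());
        demo.clearDelayedAllocations();
        System.out.println("after clear: " + demo.delayedAllocationCount());
    }
}
```

With a concurrent map the extra monitor would be redundant for single operations, which is exactly how the missing keyword slipped through here.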
Minor comments and thoughts. I'm excited to see this get in!
for (ObjectCursor<DiscoveryNode> cursor : allocation.nodes().dataNodes().values()) {
    if (delayedAllocationNodeKey.equals(nodeKey.getNodeKey(cursor.value))) {
        // we found a node key that has come back, remove it from the delayed allocations
        logger.info("node {} joined the cluster back, removing delayed allocation", cursor.value);
"joined the cluster back" -> "rejoined the cluster"
If a node dies unexpectedly, does the admin have a way to override the …
Pushed a change for the first review.
Yeah, the duration setting needs to be updated temporarily to unblock it; we could have a flag in reroute as well to ignore it if we want. I also want to raise the question of the default duration. I went with 5 minutes, but I would love input on what the best default would be. It needs to cover the most common restart times of a node (and, conceptually, a VM).
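A sketch of what that temporary override could look like, assuming a cluster on `localhost:9200`; the setting name is the one from this PR, but the exact request shapes (transient settings body, `delayed_duration` as a query parameter) are assumptions for illustration:

```sh
# Temporarily drop the delay so currently-delayed shards get assigned now
# (transient, so it reverts on full cluster restart).
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.delay_unassigned_allocation.duration": "0s"
  }
}'

# Or, hypothetically, override the delay for a single reroute invocation:
curl -XPOST 'localhost:9200/_cluster/reroute?delayed_duration=0s'
```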
Delay allocation of unassigned shards when a node leaves, for a specific period (defaults to 5m), to give it a chance to come back and not cause excessive allocations.

This new behavior, specifically with the default value, means that when a node leaves the cluster, the shards assigned to it will only be allocated back to the rest of the cluster after the specified duration. The number of delayed unassigned shards can be retrieved using the cluster health API.

The setting to control the duration is `cluster.routing.allocation.delay_unassigned_allocation.duration`, and it's dynamically updatable using the cluster update settings API (only applicable to master nodes). The reroute cluster command now also accepts an optional `delayed_duration` parameter; when set, it overrides the duration for that reroute operation. This can be handy, for example, to set it to 0 and get the currently delayed shards assigned.

The concept of a node key is also introduced, allowing you to specify what represents a node in a "cross restart" manner. The potential values for the setting `cluster.routing.allocation.delay_unassigned_allocation.node_key` are `name` (the node name), `id` (the node id; note, randomly generated on node startup), `host_address` (the host IP address), `host_name` (the host name), and `transport_address` (the transport address, IP + port of the node). Defaults to `transport_address`.
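The node-key resolution described above could be sketched as follows. This is a hedged illustration, not the PR's implementation: `DiscoveryNode` here is a stand-in record, not the real `org.elasticsearch.cluster.node.DiscoveryNode`.

```java
// Illustrative mapping from the node_key setting values to a key that is
// (mostly) stable across a node restart.
public class NodeKeyDemo {
    record DiscoveryNode(String name, String id, String hostAddress,
                         String hostName, String transportAddress) {}

    static String nodeKey(DiscoveryNode node, String nodeKeySetting) {
        return switch (nodeKeySetting) {
            case "name" -> node.name();
            case "id" -> node.id(); // random per startup, so NOT restart-stable
            case "host_address" -> node.hostAddress();
            case "host_name" -> node.hostName();
            case "transport_address" -> node.transportAddress(); // the default
            default -> throw new IllegalArgumentException("unknown node_key: " + nodeKeySetting);
        };
    }

    public static void main(String[] args) {
        DiscoveryNode node = new DiscoveryNode("node-1", "aB3xYz", "10.0.0.5",
                                               "data-1.example.com", "10.0.0.5:9300");
        System.out.println(nodeKey(node, "transport_address"));
        System.out.println(nodeKey(node, "host_name"));
    }
}
```

The comment on `id` is why it makes a poor default: a restarted node comes back with a fresh random id, so its delayed shards would never be matched back to it.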
Force-pushed from ce79a49 to c953bd2.
@clintongormley I added the ability to set the …
Thanks Shay for opening this PR, this has been a pain for a long time. I looked at the implementation and there are a couple of things that concern me:
Based on this I'd like to propose a different way of preventing the allocation dance, which might allow us to separate out the time component as well. What I have in mind is something like a notion of stickiness, where a shard can only be allocated on a node that has the shard's data locally AND a version of the shard greater than or equal to the … In a second step we can add some mechanism that removes the stickiness from a shard after time, number of nodes in the cluster, number of unallocated shards, user interaction, $yoursgoeshere.
We brainstormed about it a bit, and there is a way to make it simpler: basically, keep on the shard routing when and why a shard moved to unassigned, and use that to decide when to force-allocate it (this basically only applies to replicas) when no copy is found within the cluster. Will open a subsequent pull request for the new logic.
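The simpler follow-up idea can be sketched like this: record when and why the shard became unassigned directly on its routing entry, and derive the remaining delay from that. All names here are illustrative assumptions, not the eventual Elasticsearch API.

```java
// Hypothetical per-shard unassigned metadata and the delay decision built on it.
public class UnassignedInfoDemo {
    enum Reason { NODE_LEFT, INDEX_CREATED, REPLICA_ADDED }

    record UnassignedInfo(Reason reason, long unassignedAtMillis) {
        // remaining delay before this shard may be allocated; <= 0 means allocate now
        long remainingDelayMillis(long nowMillis, long delayMillis) {
            if (reason != Reason.NODE_LEFT) {
                return 0; // only node departures are delayed
            }
            return delayMillis - (nowMillis - unassignedAtMillis);
        }
    }

    public static void main(String[] args) {
        long delay = 5 * 60 * 1000; // the 5m default discussed in this PR
        UnassignedInfo info = new UnassignedInfo(Reason.NODE_LEFT, 0L);
        System.out.println(info.remainingDelayMillis(60_000L, delay));  // 1 minute in
        System.out.println(info.remainingDelayMillis(400_000L, delay)); // past the delay
    }
}
```

The appeal of this shape is that it needs no separate delayed-allocations map at all: the allocator just consults each unassigned shard's own record on every reroute.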
Closes #7288