Recovery should re-assign shards when a node re-joins the cluster #2908

Closed
mrflip opened this issue Apr 17, 2013 · 3 comments
Labels
:Distributed/Recovery -- Anything around constructing a new shard, either from a local or a remote source.

Comments


mrflip commented Apr 17, 2013

When a node disconnects and then re-joins the cluster, it is stripped of all its shard assignments, even if it is the best recovery candidate. This significantly prolongs recovery, increases the burden on the cluster, and in observed cases can cause the cluster to go red (more below).

Either of the following would help the situation:

  1. Re-plan the recovery when a node joins. If the re-joining node has more segments than the current recoverer, stop that recovery and have the re-joining node take it over.
  2. Do not un-assign the shards from the errant node. Let the shards become over-replicated, then rebalance (rather than letting them become under-replicated and then rebalancing).

We run nodes with hundreds of GB of data on EC2 instances. The combination of heavy shards and modest network bandwidth means recovery can take tens of minutes. This is much longer than a node is typically offline after encountering a fault -- even a full rebuild of a node takes only a few minutes, and it comes back with all its data intact thanks to the magic of EBS. In the common case of a rolling restart, each node rejoins within a few seconds.

Suppose in my cluster Kitty has shards 1, 2; Xavier has shards 3, 4; Jane has 1, 3; Hank has 2, 3, 4; Scott has 1, 2, 4. Kitty will phase out (to later re-join); Xavier and Jane will start recovery of shards 1 and 2, respectively.

In the current world, with cluster.routing.allocation.allow_rebalance left at its default, Kitty will re-join and do nothing until recovery is complete -- she does not send or receive shards, and does not answer queries. Once the cluster is green, she is then assigned an arbitrary portion of shards and the cluster begins rebalancing onto her. In general, only a few of the new shard assignments will overlap with what she already holds, so her first act is to delete most of the data on her disk. With allow_rebalance set to always, she will at least begin rebalancing immediately on join, but again to an arbitrary set of shards: she'll rejoin with shards 1 and 2 complete or mostly-complete (due to intervening writes), but the odds are only (1/nshards^2) that a shard she already holds is re-assigned to her.
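For concreteness, this is roughly how that setting gets flipped, assuming the elasticsearch-py client and a hypothetical localhost endpoint:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# cluster.routing.allocation.allow_rebalance controls when rebalancing may begin:
#   indices_all_active (the default) -- only once every shard is active (cluster green)
#   always                           -- immediately, even while recoveries are in flight
es.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.allow_rebalance": "always"
    }
})
```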

The downsides:

  • Any transient disconnect results in a minimum recovery period of (MB per node) / (MB/s recovery throughput) -- see the back-of-the-envelope sketch after this list.
  • During that time the cluster has gone from 5 strong nodes to 4 nodes doing recovery and serving 125% of their normal data burden.
  • The shards are under-replicated during this time. If Scott blips out during recovery as well, you all of a sudden have effectively no replication for shards 1 and 2 even though three machines have the data. If Hank additionally goes down, we've seen (at least in 19.8) a situation where the shards remained un-assigned until a full-cluster stop / full-cluster start could be effected.
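To put numbers on the first downside, a back-of-the-envelope sketch (the node size and throughput figures are hypothetical, just to show the scale):

```python
# Hypothetical figures: a node holding ~300 GB of shard data and a recovery
# pipeline that sustains ~100 MB/s into that node.
data_per_node_mb = 300 * 1024          # MB per node
recovery_throughput_mb_s = 100         # MB/s recovery throughput

recovery_minutes = data_per_node_mb / recovery_throughput_mb_s / 60
print(f"minimum recovery window: ~{recovery_minutes:.0f} minutes")
# ~51 minutes of degraded operation, even if the node was only gone for seconds
```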

Proposed Alternatives:

  1. Re-plan the recovery when a node joins (sketched in code after this list). When Kitty phases out, her shards are assigned to Xavier and Jane and recovery initiates as normal. When Kitty re-joins, the master node takes stock of how many segments she has for all under-replicated shards and forms a new recovery plan. Suppose Kitty rejoins with 60% of the current segments for shard 1 and 95% of shard 2, while the other nodes are 70% through their transfers. Xavier will complete the recovery of shard 1, and Kitty will delete hers; Jane will stop recovering shard 2, and Kitty will recover that last 5%. Once the cluster is green, some shard will be rebalanced onto Kitty.
  2. Do not un-assign the shards from the errant node. When Kitty phases out, assign her shards as usual to Xavier and Jane -- but leave them also assigned to Kitty. If Kitty rejoins, she will complete whatever incremental recovery of her shards is necessary. The cluster will then choose how to discard the over-replicated shards to find the optimal balance. The improvement here is that a) however quickly Kitty completes recovery is how quickly you're serving data from a full-strength cluster again; and b) you're spending as little time as possible in an under-replicated state. It's safer to ski downhill.
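A rough sketch of the re-planning rule from Alternative 1, using the example above (the function, its inputs, and the segment-fraction heuristic are illustrative -- this is not an existing Elasticsearch API):

```python
def replan_recovery(shard, rejoining_fraction, inflight_fraction):
    """Decide who should finish recovering `shard` when its old holder re-joins.

    rejoining_fraction: share of the shard's current segments the re-joining
                        node already holds (0.0 - 1.0)
    inflight_fraction:  share of the replacement recovery already transferred
                        (0.0 - 1.0)
    """
    if rejoining_fraction > inflight_fraction:
        return "cancel the in-flight recovery; re-joining node recovers the remainder"
    return "let the in-flight recovery finish; re-joining node deletes its stale copy"


# Kitty rejoins holding 60% of shard 1 and 95% of shard 2, while Xavier's and
# Jane's replacement recoveries are 70% through their transfers.
print("shard 1:", replan_recovery(1, rejoining_fraction=0.60, inflight_fraction=0.70))
print("shard 2:", replan_recovery(2, rejoining_fraction=0.95, inflight_fraction=0.70))
```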

ghoseb commented Nov 10, 2013

👍

@clintongormley added the :Distributed/Recovery label Nov 29, 2014
@clintongormley

Related to #7288, #8190, #6069

@clintongormley

This is now closed by #11438, #12421, and #11417
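The practical upshot for the transient-disconnect case is the delayed-allocation window, which leaves a departed node's shards unassigned for a grace period instead of re-replicating them immediately, so a node that rejoins quickly reclaims its own copies. A minimal sketch of widening that window, assuming the index.unassigned.node_left.delayed_timeout index setting and the elasticsearch-py client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Keep a departed node's shards unassigned (rather than re-replicated) for
# five minutes, so a node that blips out and rejoins quickly gets them back.
es.indices.put_settings(
    index="_all",
    body={"index.unassigned.node_left.delayed_timeout": "5m"},
)
```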
