Recovery should re-assign shards when a node re-joins the cluster #2908

Closed
mrflip opened this issue Apr 17, 2013 · 3 comments
Labels
:Distributed/Recovery -- Anything around constructing a new shard, either from a local or a remote source.

Comments


mrflip commented Apr 17, 2013

When a node disconnects and then re-joins the cluster, it is stripped of all its shard assignments, even if it is the best recovery candidate. This significantly prolongs recovery, increases the burden on the cluster, and in observed cases can cause the cluster to go red (more below).

Either of the following would help the situation:

  1. Re-plan the recovery when a node joins. If the re-joining node has more segments than the current recoverer, stop that recovery and have the re-joining node take it over.
  2. Do not un-assign the shards from the errant node. Let the shards become over-replicated, then rebalance (rather than letting them become under-replicated and then rebalancing).

We run nodes with hundreds of GB of data on EC2 instances. The combination of heavy shards and modest network bandwidth means recovery can take tens of minutes. This is much longer than a node is typically offline after encountering a fault -- even a full rebuild of a node takes only a few minutes, and it comes back with all its data intact thanks to the magic of EBS. In the common case of a rolling restart, each node rejoins within a few seconds.

Suppose in my cluster Kitty has shards 1, 2; Xavier has shards 3, 4; Jane has 1, 3; Hank has 2, 3, 4; Scott has 1, 2, 4. Kitty will phase out (to later re-join); Xavier and Jane will start recovery of shards 1 and 2, respectively.

In the current world, with cluster.routing.allocation.allow_rebalance left at its default, Kitty will re-join and do nothing until recovery is complete -- she does not send or receive shards, and does not answer queries. Once the cluster is green, she is then assigned an arbitrary portion of shards and the cluster begins rebalancing onto her. In general, only a few of the new shard assignments will overlap with what she already holds, so her first act is to delete most of the data on her disk. With allow_rebalance set to always, she will at least begin rebalancing immediately on join, but again to an arbitrary set of shards: she'll rejoin with shards 1 and 2 complete or mostly-complete (due to intervening writes), but the odds are only (1/nshards^2) that a shard she already holds is re-assigned to her.
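For concreteness, this is roughly how that setting gets flipped, assuming the elasticsearch-py client and a hypothetical localhost endpoint:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# cluster.routing.allocation.allow_rebalance controls when rebalancing may begin:
#   indices_all_active (the default) -- only once every shard is active (cluster green)
#   always                           -- immediately, even while recoveries are in flight
es.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.allow_rebalance": "always"
    }
})
```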

The downsides:

  • Any transient disconnect results in a minimum recovery period of (MB per node) / (MB/s recovery throughput) -- see the back-of-the-envelope sketch after this list.
  • During that time the cluster has gone from 5 strong nodes to 4 nodes doing recovery and serving 125% of their normal data burden.
  • The shards are under-replicated during this time. If Scott blips out during recovery as well, you all of a sudden have effectively no replication for shards 1 and 2 even though three machines have the data. If Hank additionally goes down, we've seen (at least in 19.8) a situation where the shards remained un-assigned until a full-cluster stop / full-cluster start could be effected.
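To put numbers on the first downside, a back-of-the-envelope sketch (the node size and throughput figures are hypothetical, just to show the scale):

```python
# Hypothetical figures: a node holding ~300 GB of shard data and a recovery
# pipeline that sustains ~100 MB/s into that node.
data_per_node_mb = 300 * 1024          # MB per node
recovery_throughput_mb_s = 100         # MB/s recovery throughput

recovery_minutes = data_per_node_mb / recovery_throughput_mb_s / 60
print(f"minimum recovery window: ~{recovery_minutes:.0f} minutes")
# ~51 minutes of degraded operation, even if the node was only gone for seconds
```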

Proposed Alternatives:

  1. Re-plan the recovery when a node joins (sketched in code after this list). When Kitty phases out, her shards are assigned to Xavier and Jane and recovery initiates as normal. When Kitty re-joins, the master node takes stock of how many segments she has for all under-replicated shards and forms a new recovery plan. Suppose Kitty rejoins with 60% of the current segments for shard 1 and 95% of shard 2, while the other nodes are 70% through their transfers. Xavier will complete the recovery of shard 1, and Kitty will delete hers; Jane will stop recovering shard 2, and Kitty will recover that last 5%. Once the cluster is green, some shard will be rebalanced onto Kitty.
  2. Do not un-assign the shards from the errant node. When Kitty phases out, assign her shards as usual to Xavier and Jane -- but leave them also assigned to Kitty. If Kitty rejoins, she will complete whatever incremental recovery of her shards is necessary. The cluster will then choose how to discard the over-replicated shards to find the optimal balance. The improvement here is that a) however quickly Kitty completes recovery is how quickly you're serving data from a full-strength cluster again; and b) you're spending as little time as possible in an under-replicated state. It's safer to ski downhill.
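A rough sketch of the re-planning rule from Alternative 1, using the example above (the function, its inputs, and the segment-fraction heuristic are illustrative -- this is not an existing Elasticsearch API):

```python
def replan_recovery(shard, rejoining_fraction, inflight_fraction):
    """Decide who should finish recovering `shard` when its old holder re-joins.

    rejoining_fraction: share of the shard's current segments the re-joining
                        node already holds (0.0 - 1.0)
    inflight_fraction:  share of the replacement recovery already transferred
                        (0.0 - 1.0)
    """
    if rejoining_fraction > inflight_fraction:
        return "cancel the in-flight recovery; re-joining node recovers the remainder"
    return "let the in-flight recovery finish; re-joining node deletes its stale copy"


# Kitty rejoins holding 60% of shard 1 and 95% of shard 2, while Xavier's and
# Jane's replacement recoveries are 70% through their transfers.
print("shard 1:", replan_recovery(1, rejoining_fraction=0.60, inflight_fraction=0.70))
print("shard 2:", replan_recovery(2, rejoining_fraction=0.95, inflight_fraction=0.70))
```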

ghoseb commented Nov 10, 2013

👍

@clintongormley added the :Distributed/Recovery label Nov 29, 2014
@clintongormley

Related to #7288, #8190, #6069

@clintongormley

This is now closed by #11438, #12421, and #11417
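The practical upshot for the transient-disconnect case is the delayed-allocation window, which leaves a departed node's shards unassigned for a grace period instead of re-replicating them immediately, so a node that rejoins quickly reclaims its own copies. A minimal sketch of widening that window, assuming the index.unassigned.node_left.delayed_timeout index setting and the elasticsearch-py client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Keep a departed node's shards unassigned (rather than re-replicated) for
# five minutes, so a node that blips out and rejoins quickly gets them back.
es.indices.put_settings(
    index="_all",
    body={"index.unassigned.node_left.delayed_timeout": "5m"},
)
```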
