Auto-expand replicas when adding or removing nodes #30423

Merged: ywelsch merged 4 commits into elastic:master on May 7, 2018

Conversation

@ywelsch ywelsch commented May 7, 2018

Auto-expands replicas in the same cluster state update (instead of a follow-up reroute) where nodes are added or removed.

Fixes #1873
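
For context, a minimal sketch (not part of this PR) of how an index opts into auto-expansion via the index.auto_expand_replicas setting; the class name and the index name "my-index" are illustrative assumptions:

import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.common.settings.Settings;

// Illustrative only: an index that opts into auto-expansion. "0-all" keeps a replica on
// every data node; a bounded range such as "0-1" works the same way.
public class AutoExpandSettingsExample {
    public static CreateIndexRequest autoExpandedIndex() {
        Settings settings = Settings.builder()
                .put("index.number_of_shards", 1)
                .put("index.auto_expand_replicas", "0-all") // min-max bounds for the replica count
                .build();
        return new CreateIndexRequest("my-index").settings(settings); // "my-index" is a made-up name
    }
}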

@ywelsch ywelsch added the >bug, v7.0.0, :Distributed/Allocation (all issues relating to the decision making around placing a shard, both master logic & on the nodes), and v6.4.0 labels on May 7, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes bleskes left a comment

LGTM. +1 to move it into the allocation service. I left some nits and questions. Thanks for picking this up!

numberOfReplicas = max;
}

if (numberOfReplicas >= min && numberOfReplicas <= max) {
Contributor

I'm confused by this. If min <= max, we always match this clause?

Contributor Author

Yeah, but only if min <= max. Unfortunately we never validated this condition anywhere (I would have expected this during setting validation, but to my surprise, it was not there), and I didn't feel like making a breaking change in this PR by strengthening the validation logic. For now, just as before, if min > max, the auto-expand-replicas setting will be ignored.
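
To make the discussion above concrete, a standalone sketch of the clamping (not the actual Elasticsearch code; the class and method names are made up): the value is first clamped into [min, max], so with min <= max the final range check always passes, while min > max silently disables the expansion.

import java.util.OptionalInt;

// Standalone sketch of the auto-expand computation; not the actual Elasticsearch code.
final class AutoExpandClampSketch {
    static OptionalInt desiredReplicas(int min, int max, int numDataNodes) {
        int numberOfReplicas = numDataNodes - 1; // one shard copy per data node, minus the primary
        if (numberOfReplicas < min) {
            numberOfReplicas = min;
        } else if (numberOfReplicas > max) {
            numberOfReplicas = max;
        }
        // After clamping, this check can only fail when min > max; in that case the
        // auto-expand setting is effectively ignored (nothing validates min <= max).
        if (numberOfReplicas >= min && numberOfReplicas <= max) {
            return OptionalInt.of(numberOfReplicas);
        }
        return OptionalInt.empty();
    }
}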

@@ -88,6 +119,25 @@ public String toString() {
boolean isEnabled() {
return enabled;
}

public static Map<Integer, List<String>> getAutoExpandReplicaChanges(MetaData metaData, DiscoveryNodes discoveryNodes) {
Contributor

Can we add some javadocs about what we're computing here?

Contributor

I know you copied over how it was, so this is optional in my mind as it is out of scope for this PR. I personally find this confusing without much added value and would prefer a String -> Integer map which gives a per-index updated value.

Contributor Author

The reason I chose to stick with this model for now is that the PR still calls RoutingTable.Builder#updateNumberOfReplicas(int numberOfReplicas, String... indices) to make the actual update, which is also used by MetaDataUpdateSettingsService when a user updates the number of replicas manually. This allows reusing the method as is. As a follow-up, I would like us to migrate updateNumberOfReplicas from RoutingTable.Builder to RoutingNodes, as I would like (for consistency purposes) all routing table updates to go through that class instead.
I've added some javadocs to explain this.
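
For illustration, a hedged sketch of how the returned Map<Integer, List<String>> (target replica count -> index names) can be applied. The helper class and the way the builders are passed in are illustrative; only the RoutingTable.Builder#updateNumberOfReplicas call is named in the comment above, and the matching MetaData.Builder call is an assumption on my part:

import java.util.List;
import java.util.Map;

import org.elasticsearch.cluster.metadata.AutoExpandReplicas;
import org.elasticsearch.cluster.metadata.MetaData;
import org.elasticsearch.cluster.node.DiscoveryNodes;
import org.elasticsearch.cluster.routing.RoutingTable;

// Sketch only: apply the computed auto-expand changes to the routing table and metadata.
// The builders are assumed to be created by the caller from the current cluster state.
final class AutoExpandApplySketch {
    static void applyAutoExpandChanges(MetaData metaData,
                                       DiscoveryNodes discoveryNodes,
                                       RoutingTable.Builder routingTableBuilder,
                                       MetaData.Builder metaDataBuilder) {
        Map<Integer, List<String>> changes =
                AutoExpandReplicas.getAutoExpandReplicaChanges(metaData, discoveryNodes);
        for (Map.Entry<Integer, List<String>> entry : changes.entrySet()) {
            int numberOfReplicas = entry.getKey();
            String[] indices = entry.getValue().toArray(new String[0]);
            // Reuses the same update path as a manual number_of_replicas settings change.
            routingTableBuilder.updateNumberOfReplicas(numberOfReplicas, indices);
            metaDataBuilder.updateNumberOfReplicas(numberOfReplicas, indices);
        }
    }
}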

@ywelsch ywelsch merged commit 82b251a into elastic:master May 7, 2018

ywelsch commented May 7, 2018

Thanks @bleskes

ywelsch added a commit that referenced this pull request May 7, 2018
Auto-expands replicas in the same cluster state update (instead of a follow-up reroute) where nodes are added or removed.

Closes #1873, fixing an issue where nodes drop their copy of auto-expanded data when coming up, only to sync it again later.
@lukas-vlcek
Contributor

@ywelsch I was quickly looking at the test in this PR. This issue resulted in dropping the data and resyncing it from other nodes right after that. If I read the test correctly, it only works with the cluster state a bit; it does not explicitly verify that the data-resyncing issue no longer occurs. So the assumption is that the changes in how the cluster state is built should no longer lead to the data-replication issue, is that correct?

ywelsch commented May 8, 2018

@lukas-vlcek The issue resulted from how two distinct components in the system interacted: the auto-expand replica logic and the shard deletion logic.

The shard deletion logic works as follows: when a data node receives a cluster state where a shard is fully allocated (i.e. primary + all replicas are active) and that node has some unallocated data for that shard on disk, it proceeds to delete said data, as there is no need to keep this extra copy of the data around.

The auto-expand replica logic used to work as follows: this component hooked into the cluster lifecycle, waiting for cluster state changes, comparing the number of nodes in the cluster with the number of auto-expanded replicas, and then submitting a settings update to the cluster to adjust number_of_replicas if the two got out of sync.

The way these two components interacted resulted in the following behavior. Assume a 5-node cluster and an index with auto-expand-replicas set to 0-all. In that case, number_of_replicas is expanded to 4 (i.e. there are 4 replica copies for each primary). Now assume that one node drops out, only to later rejoin the cluster. When the node drops from the cluster, it is removed from the cluster state. In a follow-up settings update initiated by the auto-expand logic, number_of_replicas for the index is adjusted from 4 to 3. When the node rejoins the cluster, it is added to the cluster state by the master and receives this cluster state. The cluster state now has 5 nodes again, but number_of_replicas is not yet expanded from 3 back to 4, as this is only triggered shortly afterwards by the auto-expand component (running on the active master). The shard deletion logic on the data node now sees that the shard is fully allocated (number_of_replicas set to 3 with all shard copies active). It then proceeds to delete the extra local shard copy, just before it receives the updated cluster state where number_of_replicas is adjusted to 4 again. It therefore has to resync the full data from the primary, as the local copy was completely wiped.

The two main options to fix this were:

  • make the shard deletion logic auto-expand replicas aware, accounting for the inconsistencies in the cluster state.
  • ensure that cluster states are always consistent w.r.t the auto-expansion of replicas.

For better separation of concerns, and to have stronger consistency guarantees on cluster states, I opted for the second kind of fix. With this PR, the auto-expansion of replicas is done in the same cluster state update where nodes are added or removed from the cluster state. This means that when the newly joining node receives its first cluster state, number_of_replicas is already correctly expanded to 4, and the node is aware that the shard is not fully allocated, and therefore does not proceed to delete its local shard copy. The tests in this PR check that the consistency guarantees in the cluster state (number_of_replicas is properly expanded at all times) cannot be violated. The shard deletion logic is already separately tested, working on the assumption of a consistent cluster state.
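
To spell out that invariant, here is a self-contained toy model (not Elasticsearch code; the numbers and helper names are made up) of how the shard-deletion rule interacts with the auto-expanded replica count:

// Toy model (not Elasticsearch code) of the invariant described above.
final class AutoExpandInvariantSketch {

    // Desired replica count for auto_expand_replicas "0-all" on a cluster of dataNodes.
    static int expandedReplicas(int dataNodes) {
        return Math.max(0, dataNodes - 1);
    }

    // Shard-deletion rule: a data node may drop a stale local copy only if the shard is
    // fully allocated in the cluster state it received (primary + all replicas active).
    static boolean mayDeleteLocalCopy(int numberOfReplicas, int activeCopies) {
        return activeCopies >= numberOfReplicas + 1;
    }

    public static void main(String[] args) {
        int dataNodesAfterRejoin = 5;
        int activeCopies = 4; // primary + 3 replicas; the rejoining node's copy is still unallocated

        // Old behavior: the first cluster state still says number_of_replicas = 3 (expansion
        // happens in a follow-up update), so the shard looks fully allocated and the local
        // copy gets deleted, forcing a full resync shortly afterwards.
        System.out.println(mayDeleteLocalCopy(3, activeCopies));        // true

        // New behavior: the same cluster state already carries the expanded value 4, so the
        // shard is not fully allocated and the local copy is kept.
        int expanded = expandedReplicas(dataNodesAfterRejoin);           // 4
        System.out.println(mayDeleteLocalCopy(expanded, activeCopies));  // false
    }
}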

colings86 pushed a commit that referenced this pull request May 8, 2018
Auto-expands replicas in the same cluster state update (instead of a follow-up reroute) where nodes are added or removed.

Closes #1873, fixing an issue where nodes drop their copy of auto-expanded data when coming up, only to sync it again later.
dnhatn added a commit that referenced this pull request May 8, 2018
* master:
  Mute ML upgrade test (#30458)
  Stop forking javac (#30462)
  Client: Deprecate many argument performRequest (#30315)
  Docs: Use task_id in examples of tasks (#30436)
  Security: Rename IndexLifecycleManager to SecurityIndexManager (#30442)
  [Docs] Fix typo in cardinality-aggregation.asciidoc (#30434)
  Avoid NPE in `more_like_this` when field has zero tokens (#30365)
  Build: Switch to building javadoc with html5 (#30440)
  Add a quick tour of the project to CONTRIBUTING (#30187)
  Reindex: Use request flavored methods (#30317)
  Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure. (#30432)
  Auto-expand replicas when adding or removing nodes (#30423)
  Docs: fix changelog merge
  Fix line length violation in cache tests
  Add stricter geohash parsing (#30376)
  Add failing test for core cache deadlock
  [DOCS] convert forcemerge snippet
  Update forcemerge.asciidoc (#30113)
  Added zentity to the list of API extension plugins (#29143)
  Fix the search request default operation behavior doc (#29302) (#29405)
dnhatn added a commit that referenced this pull request May 8, 2018
* 6.x:
  Stop forking javac (#30462)
  Fix tribe tests
  Docs: Use task_id in examples of tasks (#30436)
  Security: Rename IndexLifecycleManager to SecurityIndexManager (#30442)
  Packaging: Set elasticsearch user to have non-existent homedir (#29007)
  [Docs] Fix typo in cardinality-aggregation.asciidoc (#30434)
  Avoid NPE in `more_like_this` when field has zero tokens (#30365)
  Build: Switch to building javadoc with html5 (#30440)
  Add a quick tour of the project to CONTRIBUTING (#30187)
  Add stricter geohash parsing (#30376)
  Reindex: Use request flavored methods (#30317)
  Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure.  (#30432)
  Auto-expand replicas when adding or removing nodes (#30423)
  Silence IndexUpgradeIT test failures. (#30430)
  Fix line length violation in cache tests
  Add failing test for core cache deadlock
  [DOCS] convert forcemerge snippet
  Update forcemerge.asciidoc (#30113)
  Added zentity to the list of API extension plugins (#29143)
  Fix the search request default operation behavior doc (#29302) (#29405)
  Watcher: Mark watcher as started only after loading watches (#30403)
  Correct wording in log message (#30336)
  Do not fail snapshot when deleting a missing snapshotted file (#30332)
  AwaitsFix testCreateShrinkIndexToN
  DOCS: Correct mapping tags in put-template api
  DOCS: Fix broken link in the put index template api
  Add put index template api to high level rest client (#30400)
  [Docs] Add snippets for POS stop tags default value
  Remove entry inadvertently picked into changelog
  Move respect accept header on no handler to 6.3.1
  Respect accept header on no handler (#30383)
  [Test] Add analysis-nori plugin to the vagrant tests
  [Rollup] Validate timezone in range queries (#30338)
  [Docs] Fix bad link
  [Docs] Fix end of section in the korean plugin docs
  add the Korean nori plugin to the change logs
  Expose the Lucene Korean analyzer module in a plugin (#30397)
  Docs: remove transport_client from CCS role example (#30263)
  Test: remove cluster permission from CCS user (#30262)
  Watcher: Remove unneeded index deletion in tests
  fix docs branch version
  fix lucene snapshot version
  Upgrade to 7.4.0-snapshot-1ed95c097b (#30357)
  [ML][TEST] Clean up jobs in ModelPlotIT
  Watcher: Ensure trigger service pauses execution (#30363)
  [DOCS] Fixes ordering of changelog sections
  [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372)
  Make RepositoriesMetaData contents unmodifiable (#30361)
  Change signature of Get Repositories Response (#30333)
  6.x Backport: Terms query validate bug  (#30319)
  InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121)
  Security: reduce garbage during index resolution (#30180)
  Test: use trial license in qa tests with security
  [ML] Add integration test for model plots (#30359)
  SQL: Fix bug caused by empty composites (#30343)
  [ML] Account for gaps in data counts after job is reopened (#30294)
  [ML] Refactor DataStreamDiagnostics to use array (#30129)
  Make licensing FIPS-140 compliant (#30251)
  Do not load global state when deleting a snapshot (#29278)
  Don't load global state when only restoring indices (#29239)
  Tests: Use different watch ids per test in smoke test (#30331)
  Watcher: Make start/stop cycle more predictable and synchronous (#30118)
  [Docs] Add term query with normalizer example
  Adds Eclipse config for xpack licence headers (#30299)
  Fix message content in users tool (#30293)
  [DOCS] Removed X-Pack breaking changes page
  [DOCS] Added security breaking change
  [DOCS] Fixes link to TLS LDAP info
  [DOCS] Merges X-Pack release notes into changelog (#30350)
  [DOCS] Fixes broken links to bootstrap user (#30349)
  [Docs] Remove errant changelog line
  Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641)
  [DOCS] Reorganizes authentication details in Stack Overview (#30280)
  Tests: Simplify VersionUtils released version splitting (#30322)
  Fix merging logic of Suggester Options (#29514)
  ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316)
  [DOCS] Adds LDAP realm configuration details (#30214)
  [DOCS] Adds native realm configuration details (#30215)
  Disable SSL on testing old BWC nodes (#30337)
  [DOCS] Enables edit links for X-Pack pages
  Cancelling a peer recovery on the source can leak a primary permit (#30318)
  SQL: Reduce number of ranges generated for comparisons (#30267)
  [DOCS] Adds links to changelog sections
  Convert server javadoc to html5 (#30279)
  REST Client: Add Request object flavored methods (#29623)
  Create default ES_TMPDIR on Windows (#30325)
  [Docs] Clarify `fuzzy_like_this` redirect (#30183)
  Fix docs of the `_ignored` meta field.
  Add a new `_ignored` meta field. (#29658)
  Move repository-azure fixture test to QA project (#30253)
ywelsch added a commit that referenced this pull request May 14, 2018
#30423 combined auto-expansion in the same cluster state update where nodes are removed. As the auto-expansion step would run before deassociating the dead nodes from the routing table, the auto-expansion could remove replicas from live nodes instead of dead ones. This commit reverses the order to ensure that, when nodes leave the cluster, the auto-expand-replica functionality only triggers after failing the shards on the removed nodes. This ensures that active shards on other live nodes are not failed if the primary resided on a now-dead node. Instead, one of the replicas on the live nodes first gets promoted to primary, and the auto-expansion (removing replicas) only triggers in a follow-up step (but still within the same cluster state update).

Relates to #30456 and follow-up of #30423
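
To make the ordering concrete, a small self-contained toy (not the AllocationService implementation; node names and counts are made up) contrasting the two orders:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (not Elasticsearch code) of why the ordering matters when a node leaves.
// Setup: nodes n1..n3, auto_expand_replicas "0-all", and a shard copy on every node.
final class NodeLeaveOrderingSketch {

    // Stand-in for auto-expansion shrinking the number of copies to the desired count.
    static List<String> trimToDesired(List<String> copies, int desired) {
        List<String> result = new ArrayList<>(copies);
        while (result.size() > desired) {
            result.remove(result.size() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> copies = Arrays.asList("n1", "n2", "n3");
        String deadNode = "n1";  // n1 drops out of the cluster
        int desiredCopies = 2;   // "0-all" on the remaining two-node cluster

        // Problematic order: shrink first, then fail the dead node's copy. Auto-expansion
        // removes a copy from a live node (n3), and the fail step then removes n1's copy
        // too, leaving a single copy that must be resynced later.
        List<String> shrinkFirst = new ArrayList<>(trimToDesired(copies, desiredCopies));
        shrinkFirst.remove(deadNode);
        System.out.println(shrinkFirst); // [n2]

        // Fixed order: fail the dead node's copy first, then apply auto-expansion.
        List<String> failFirst = new ArrayList<>(copies);
        failFirst.remove(deadNode);
        System.out.println(trimToDesired(failFirst, desiredCopies)); // [n2, n3]
    }
}
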
ywelsch added a commit that referenced this pull request May 14, 2018
#30423 combined auto-expansion in the same cluster state update where nodes are removed. As the auto-expansion step would run before deassociating the dead nodes from the routing table, the auto-expansion could remove replicas from live nodes instead of dead ones. This commit reverses the order to ensure that, when nodes leave the cluster, the auto-expand-replica functionality only triggers after failing the shards on the removed nodes. This ensures that active shards on other live nodes are not failed if the primary resided on a now-dead node. Instead, one of the replicas on the live nodes first gets promoted to primary, and the auto-expansion (removing replicas) only triggers in a follow-up step (but still within the same cluster state update).

Relates to #30456 and follow-up of #30423
Labels
>bug, :Distributed/Allocation, v6.4.0, v7.0.0-beta1