Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replica starts peer recovery with safe commit #28181

Merged
merged 10 commits into from
Jan 13, 2018

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jan 11, 2018

Today a replica starts a peer-recovery with the last commit. If the last
commit is not a safe commit, a replica will immediately fallback to the
file based sync which is more expensive than the sequence based
recovery. This commit modifies the peer-recovery in replica to start
with a safe commit. Moreover we can keep the existing translog on the
target if the recovery is sequence based recovery.

Relates #10708

Today a replica starts a peer-recovery with the last commit. If the last
commit is not a safe commit, a replica will immediately fallback to the
file based sync which is more expensive than the sequence based
recovery. This commit modifies the peer-recovery in replica to start
with a safe commit. Moreover, we can keep the existing translog on the
target if the recovery is sequence based recovery.
@dnhatn dnhatn added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.2.0 labels Jan 11, 2018
@dnhatn dnhatn changed the title Replica starts a peer recovery with a safe commit Replica starts peer recovery with safe commit Jan 11, 2018
@dnhatn dnhatn added the review label Jan 11, 2018
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx Nhat. I left some initial feedback


import java.io.IOException;

final class RecoveryOpenSeqBasedEngineRequest extends TransportRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've slept on this (as promised :)) and I prefer we go back to how you had it with a boolean in RecoveryPrepareForTranslogOperationsRequest. The reason is that I want to do some refactoring to simplify how the engine is created and I expect this to change making the boolean not needed and only use one message. I rather not have to deal with two messages and another layer of BWC. I think we should call the boolean "deleteLocalTranslog"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on not having a separate message here.

@@ -188,7 +188,7 @@ public RecoveryResponse recoverToTarget() throws IOException {
runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()));

try {
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can roll back all these naming changes if we we keep the old message (and the boolean)


translogLocation.set(writeTranslog(replica.shardId(), translogUUID, translog.currentFileGeneration(), maxSeqNo));

// commit is good, global checkpoint is above max
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we lost the extra check that advancing the global checking advanceses the commit usage.

@@ -241,4 +246,38 @@ public void testPeerRecoveryPersistGlobalCheckpoint() throws Exception {
assertThat(replica.getTranslog().getLastSyncedGlobalCheckpoint(), equalTo(numDocs - 1));
}
}

public void testSequenceBasedRecoveryKeepsTranslog() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you double check we have a test that make sure that all forms of recovery removes unneeded ops above the global checkpoint? if we don't think we can now add it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We have RecoveryDuringReplicationTests#testRecoveryAfterPrimaryPromotion. Previously this test was expected to execute a file-based sync, but now it will execute seq-based recovery.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only do sequence number based recovery in case where a safe commit exists (whose definition is based on the local knowledge of the global checkpoint). How about locally replaying the translog up to the (local knowledge of the) globalcheckpoint and then requesting seq-num recovery only from the (local knowledge of the) globalcheckpoint onwards. This might allow sequence-number based recovery in more cases and would also require less translog operations to be shipped over the wire.

@bleskes
Copy link
Contributor

bleskes commented Jan 11, 2018

How about locally replaying the translog up to the (local knowledge of the) globalcheckpoint and then requesting seq-num recovery only from the (local knowledge of the) globalcheckpoint onwards. This might allow sequence-number based recovery in more cases and would also require less translog operations to be shipped over the wire.

That's where we're heading indeed but will require more follow ups.

@dnhatn
Copy link
Member Author

dnhatn commented Jan 11, 2018

@bleskes and @ywelsch, I've replaced the extra message by an additional parameter. Could you please have another look? Thank you.

# Conflicts:
#	core/src/test/java/org/elasticsearch/indices/recovery/RecoveryTests.java
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some very minor comments. It looks good. I think we can now also add assertions in IndexShard post recovery that we have a "safe" commit if the index is new enough and that we the global checkpoint is <= the local checkpoint etc.?

private final long recoveryId;
private final ShardId shardId;
private final int totalTranslogOps;
private final boolean deleteLocalTranslog;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I came up with the name and I'm sorry for changing my mind. I would propose createNewTranslog. better no?

@@ -188,7 +188,9 @@ public RecoveryResponse recoverToTarget() throws IOException {
runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()));

try {
prepareTargetForTranslog(translog.estimateTotalOperationsFromMinSeq(startingSeqNo));
// For a sequence based recovery, the target can keep its local translog
prepareTargetForTranslog(isSequenceNumberBasedRecoveryPossible == false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shall we rename isSequenceNumberBasedRecoveryPossible to isSequenceNumberBasedRecovery?

state().getTranslog().totalOperations(totalTranslogOps);
// TODO: take the local checkpoint from store as global checkpoint, once we know it's safe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the todo is still relevant no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed it back as a note as it's a valid TODO.


// As a replica keeps a safe commit, the file-based recovery only happens if the required translog
// for the sequence based recovery are not fully retained and extra documents were added to the primary.
boolean expectSeqNoRecovery = (moreDocs == 0 || frequently());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not have a random boolean here instead of frequently?

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 12, 2018
We are targeting to always have a safe index once the recovery is done.
This invariant does not hold if the translog is manually truncated by
users because the truncate translog cli resets the global checkpoint to
unassigned. This commit assigns the max_seqno of the last commit to the
global checkpoint when truncating translog.

Relates elastic#28181
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx Nhat.

@@ -365,6 +365,7 @@ private void ensureRefCount() {
public void prepareForTranslogOperations(boolean createNewTranslog, int totalTranslogOps) throws IOException {
state().getTranslog().totalOperations(totalTranslogOps);
if (createNewTranslog) {
// TODO: take the local checkpoint from store as global checkpoint, once we know it's safe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add when you expect it to be safe (what version)

dnhatn added a commit that referenced this pull request Jan 13, 2018
We are targeting to always have a safe index once the recovery is done. 
This invariant does not hold if the translog is manually truncated by 
users because the truncate translog cli resets the global checkpoint to
unassigned. This commit assigns the global checkpoint to the max_seqno
of the last commit when truncating translog. We can only safely do it
because the truncate translog command will generate a new history uuid
for that shard. With a new history UUID, sequence-based recovery between
that shard and other old shards will be disabled.

Relates #28181
@dnhatn
Copy link
Member Author

dnhatn commented Jan 13, 2018

Thanks @bleskes and @ywelsch for reviewing.

@dnhatn dnhatn merged commit 095f31b into elastic:master Jan 13, 2018
@dnhatn dnhatn deleted the rollback-replica branch January 13, 2018 00:09
dnhatn added a commit that referenced this pull request Jan 13, 2018
We are targeting to always have a safe index once the recovery is done. 
This invariant does not hold if the translog is manually truncated by 
users because the truncate translog cli resets the global checkpoint to
unassigned. This commit assigns the global checkpoint to the max_seqno
of the last commit when truncating translog. We can only safely do it
because the truncate translog command will generate a new history uuid
for that shard. With a new history UUID, sequence-based recovery between
that shard and other old shards will be disabled.

Relates #28181
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 13, 2018
We introduced a new option `createNewTranslog` in elastic#28181. However, we
named that parameter as deleteLocalTranslog in other places. This commit
makes sure to have a consistent naming in these places.

Relates elastic#28181
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Jan 13, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 13, 2018
* master:
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
dnhatn added a commit that referenced this pull request Jan 13, 2018
Today a replica starts a peer-recovery with the last commit. If the last
commit is not a safe commit, a replica will immediately fallback to the
file based sync which is more expensive than the sequence based
recovery. This commit modifies the peer-recovery in replica to start
with a safe commit. Moreover we can keep the existing translog on the
target if the recovery is sequence based recovery.

Relates #10708
dnhatn added a commit that referenced this pull request Jan 13, 2018
The previous backport was not corect.

Relates #28181
matarrese added a commit to matarrese/elasticsearch that referenced this pull request Jan 13, 2018
* master: (59 commits)
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  AwaitsFix #testRecoveryAfterPrimaryPromotion
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks  (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  ...
dnhatn added a commit that referenced this pull request Jan 14, 2018
As a replica always keeps a safe commit and starts peer-recovery with
that commit; file-based recovery  only happens if new operations are
added to the primary and the required translog is not fully retained. In
the test, we tried to produce this condition by flushing a new commit in
order to trim all translog. However, if the new global checkpoint is not
persisted yet, we will keep two commits and not trim translog. This
commit tightens the file-based condition in the test by waiting for the
global checkpoint persisted properly on the new primary before flushing.

Close #28209
Relates #28181
dnhatn added a commit that referenced this pull request Jan 14, 2018
As a replica always keeps a safe commit and starts peer-recovery with
that commit; file-based recovery  only happens if new operations are
added to the primary and the required translog is not fully retained. In
the test, we tried to produce this condition by flushing a new commit in
order to trim all translog. However, if the new global checkpoint is not
persisted yet, we will keep two commits and not trim translog. This
commit tightens the file-based condition in the test by waiting for the
global checkpoint persisted properly on the new primary before flushing.

Close #28209
Relates #28181
matarrese added a commit to matarrese/elasticsearch that referenced this pull request Jan 15, 2018
* master: (74 commits)
  Update version of TaskInfo header serialization after backport
  TEST: Tightens file-based condition in peer-recovery
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  AwaitsFix #testRecoveryAfterPrimaryPromotion
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks  (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 15, 2018
* master: (21 commits)
  [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder
  Painless: Add whitelist extensions (elastic#28161)
  Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter (elastic#28225)
  Avoid doing redundant work when checking for self references. (elastic#26927)
  Fix casts in HotThreads. (elastic#27578)
  Ignore the `-snapshot` suffix when comparing the Lucene version in the build and the docs. (elastic#27927)
  Allow update of `eager_global_ordinals` on `_parent`. (elastic#28014)
  Fix NPE on composite aggregation with sub-aggregations that need scores (elastic#28129)
  `MockTcpTransport` to connect asynchronously (elastic#28203)
  Fix synonym phrase query expansion for cross_fields parsing (elastic#28045)
  Introduce elasticsearch-core jar (elastic#28191)
  elastic#28218: Update the Lucene version for 6.2.0 after backport
  upgrade to lucene 7.2.1 (elastic#28218)
  [Docs] Fix an error in painless-types.asciidoc (elastic#28221)
  Adds metadata to rewritten aggregations (elastic#28185)
  Update version of TaskInfo header serialization after backport
  TEST: Tightens file-based condition in peer-recovery
  Correct backport replica rollback to 6.2 (elastic#28181)
  Backport replica rollback to 6.2 (elastic#28181)
  Rename deleteLocalTranslog to createNewTranslog
  ...
dnhatn added a commit that referenced this pull request Jan 16, 2018
Keeping unsafe commits when opening an engine can be problematic because
these commits are not safe at the recovering time but they can suddenly
become safe in the future. The following issues can happen if unsafe
commits are kept oninit.

1. Replica can use unsafe commit in peer-recovery. This happens when a
replica with a safe commit c1 (max_seqno=1) and an unsafe commit c2
(max_seqno=2) recovers from a primary with c1(max_seqno=1). If a new
document (seqno=2) is added without flushing, the global checkpoint is
advanced to 2; and the replica recovers again, it will use the unsafe
commit c2 (max_seqno=2 <= gcp=2) as the starting commit for sequenced
based recovery even the commit c2 contains a stale operation and the
document (with seqno=2) will not be replicated to the replica.

2. Min translog gen for recovery can go backwards in peer-recovery. This
happens when a replica with a safe commit c1 (local_checkpoint=1,
recovery_translog_gen=1) and an unsafe commit c2 (local_checkpoint=2,
recovery_translog_gen=2). The replica recovers from a primary, and keeps
c2 as the last commit, then sets last_translog_gen to 2. Flushing a new
commit on the replica will cause exception as the new last commit c3
will have recovery_translog_gen=1. The recovery translog generation of a
commit is calculated based on the current local checkpoint. The local
checkpoint of c3 is 1 while the local checkpoint of c2 is 2.

3. Commit without translog can be used for recovery. An old index, which
was created before multiple-commits is introduced (v6.2), may not have a
safe commit. If that index has a snapshotted commit without translog and
an unsafe commit, the policy can consider the snapshotted commit as a
safe commit for recovery even the commit does not have translog.

These issues can be avoided if the combined deletion policy keeps only
the starting commit onInit.

Relates #27804
Relates #28181
dnhatn added a commit that referenced this pull request Jan 16, 2018
Keeping unsafe commits when opening an engine can be problematic because
these commits are not safe at the recovering time but they can suddenly
become safe in the future. The following issues can happen if unsafe
commits are kept oninit.

1. Replica can use unsafe commit in peer-recovery. This happens when a
replica with a safe commit c1 (max_seqno=1) and an unsafe commit c2
(max_seqno=2) recovers from a primary with c1(max_seqno=1). If a new
document (seqno=2) is added without flushing, the global checkpoint is
advanced to 2; and the replica recovers again, it will use the unsafe
commit c2 (max_seqno=2 <= gcp=2) as the starting commit for sequenced
based recovery even the commit c2 contains a stale operation and the
document (with seqno=2) will not be replicated to the replica.

2. Min translog gen for recovery can go backwards in peer-recovery. This
happens when a replica with a safe commit c1 (local_checkpoint=1,
recovery_translog_gen=1) and an unsafe commit c2 (local_checkpoint=2,
recovery_translog_gen=2). The replica recovers from a primary, and keeps
c2 as the last commit, then sets last_translog_gen to 2. Flushing a new
commit on the replica will cause exception as the new last commit c3
will have recovery_translog_gen=1. The recovery translog generation of a
commit is calculated based on the current local checkpoint. The local
checkpoint of c3 is 1 while the local checkpoint of c2 is 2.

3. Commit without translog can be used for recovery. An old index, which
was created before multiple-commits is introduced (v6.2), may not have a
safe commit. If that index has a snapshotted commit without translog and
an unsafe commit, the policy can consider the snapshotted commit as a
safe commit for recovery even the commit does not have translog.

These issues can be avoided if the combined deletion policy keeps only
the starting commit onInit.

Relates #27804
Relates #28181
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 16, 2018
* compile-with-jdk-9: (56 commits)
  TEST: init unassigned gcp in testAcquireIndexCommit
  Replica start peer recovery with safe commit (elastic#28181)
  Truncate tlog cli should assign global checkpoint (elastic#28192)
  Fix lock accounting in releasable lock
  Add ability to associate an ID with tasks  (elastic#27764)
  [DOCS] Removed differencies between text and code (elastic#27993)
  text fixes (elastic#28136)
  Update getting-started.asciidoc (elastic#28145)
  [Docs] Spelling fix in painless-getting-started.asciidoc (elastic#28187)
  Fixed the cat.health REST test to accept 4ms, not just 4.0ms (elastic#28186)
  Do not keep 5.x commits once having 6.x commits (elastic#28188)
  Rename core module to server (elastic#28180)
  upgraded jna from 4.4.0-1 to 4.5.1 (elastic#28183)
  [TEST] Do not call RandomizedTest.scaledRandomIntBetween from multiple threads
  Primary send safe commit in file-based recovery (elastic#28038)
  [Docs] Correct response json in rank-eval.asciidoc
  Add scroll parameter to _reindex API (elastic#28041)
  Include all sentences smaller than fragment_size in the unified highlighter (elastic#28132)
  Modifies the JavaAPI docs related to AggregationBuilder
  [Docs] Improvements in script-fields.asciidoc (elastic#28174)
  ...
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v6.2.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants