Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce sequence-number-based recovery #22484

Merged
merged 46 commits into from
Jan 27, 2017

Conversation

jasontedor
Copy link
Member

@jasontedor jasontedor commented Jan 7, 2017

This commit introduces sequence-number-based recovery. When a replica has fallen out of sync, rather than performing a file-based recovery we first attempt to replay operations since the last local checkpoint on the replica. To do this, at the start of recovery the replica tells the primary what its local checkpoint is. The primary will then wait for all operations between that local checkpoint and the current maximum sequence number to complete; this is to ensure that there are no gaps in the operations that will be replayed from the primary to the replica. This is a best-effort attempt as we currently have no guarantees on the primary that these operations will be available; if we are not able to replay all operations in the desired range, we just fallback to file-based recovery. Later work will strengthen the guarantees.

Relates #10708

@jasontedor jasontedor added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. :Sequence IDs >enhancement v6.0.0-alpha1 labels Jan 7, 2017
@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch from 6265447 to ac1d630 Compare January 7, 2017 16:04
@jasontedor
Copy link
Member Author

retest this please

1 similar comment
@jasontedor
Copy link
Member Author

retest this please

@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch from ac1d630 to 54e8224 Compare January 8, 2017 02:14
@jasontedor jasontedor changed the title Introduce sequence number-based recovery Introduce sequence-number-based recovery Jan 8, 2017
This commit introduces sequence-number-based recovery. When a replica
has fallen out of sync, rather than performing a file-based recovery we
first attempt to replay operations since the last local checkpoint on
the replica. To do this, at the start of recovery the replica tells the
primary what its local checkpoint is. The primary will then wait for all
operations between that local checkpoint and the current maximum
sequence number to complete; this is to ensure that there are no gaps in
the operations that will be replayed from the primary to the
replica. This is a best-effort attempt as we currently have no
guarantees on the primary that these operations will be available; if we
are not able to replay all operations in the desired range, we just
fallback to file-based recovery. Later work will strengthen the
guarantees.
This commit simplifies sequence number-based recovery. Rather than
execute a dance between the replica and the primary of having the
replica request a sequence number-based recovery, then failling that
recovery if it is not possible and having the replica request a second
file-based recovery, we simply check on the primary side if a sequence
number-based recovery is possible and immediately fallback to file-basd
recovery if not.
@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch from 54e8224 to d360a23 Compare January 8, 2017 02:16
@@ -16,6 +16,7 @@
* specific language governing permissions and limitations
* under the License.
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a typo here

@@ -379,6 +379,7 @@ void setTook(long took) {
void freeze() {
freeze.set(true);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

} else {
// no version conflict
if (index.origin() == Operation.Origin.PRIMARY) {
seqNo = seqNoService().generateSeqNo();
}

/**
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a Javadoc-style comment inside a method where it has no impact on javadoc. While javac will treat it as a block comment either way, my IDE formats it as a Javadoc-style comment instead of as a block comment and it annoys me.

@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch 2 times, most recently from b6a32d1 to b9b816f Compare January 8, 2017 15:27
* master:
  [TEST] Fixed the incorrect indentation for the `skip` clauses in the REST tests
  Fix primary relocation for shadow replicas (elastic#22474)
@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch from b9b816f to 6ec0ef6 Compare January 8, 2017 15:28
This commit removes a field that was left behind in a previous
refactoring that rendered the field obsolete.
@bleskes
Copy link
Contributor

bleskes commented Jan 8, 2017

w00t. I'll review as soon as I catch up with everything.

@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch from 40f5c2e to 8b5ea52 Compare January 8, 2017 19:33
If a file-based recovery completes phase one successfully, but a network
partition happens before the translog is opened, during the retry loop
the recovery target will proceed to attempt a sequence-number-based
recovery as the index files are present. However, as the translog was
never opened it will be missing on disk leading to a no such file
exception while preparing for a sequence-number-based recovery. We
should not let this fail the recovery, but instead proceed to attempt
another file-based recovery.
@jasontedor jasontedor force-pushed the replica-sequence-number-recovery branch from 8b5ea52 to 91e1ff0 Compare January 8, 2017 19:33
A version conflict exception can happen during recovery. If this
operation is from an old primary, a sequence number will have not been
assigned to the operation. In this case, we should skip adding a no-op
to the translog.
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I left some minor comments and suggestions.

@@ -361,7 +361,7 @@ public long getTook() {

void setTranslogLocation(Translog.Location translogLocation) {
if (freeze.get() == null) {
assert failure == null : "failure has to be null to set translog location";
assert failure == null || translogLocation == null: "failure has to be null to set translog location";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering - was this required for this PR or is it preparation for the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will no longer be necessary after #22626.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 8b0e501.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I integrated master into this branch after #22626 in d71aa16.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also pushed cea70f4.

}

return new SeqNoStats(maxSeqNo, localCheckpoint, globalCheckpoint);
return SequenceNumbers.loadSeqNoStatsFromLuceneCommit(globalCheckpoint, indexWriter.getLiveCommitData());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this all worth it? I wonder if we should just load from store in the constructor and save on this method:

                switch (openMode) {
                    case OPEN_INDEX_AND_TRANSLOG:
                        seqNoStats = store.loadSeqNoStats(Translog.readGlobalCheckpoint(engineConfig.getTranslogConfig().getTranslogPath()));
                        writer = createWriter(false);
                        break;
                    case OPEN_INDEX_CREATE_TRANSLOG:
                        seqNoStats = store.loadSeqNoStats(SequenceNumbersService.UNASSIGNED_SEQ_NO);
                        writer = createWriter(false);
                        break;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; I pushed 1929c03 (I think that this will also fit better with future developments).

@@ -57,4 +57,5 @@ public int totalOperations() {
}
return null;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a shame to touch this file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed dac513a.

return;
}
final Optional<StartRecoveryRequest> maybeRequest = getStartRecoveryRequest(recoveryTarget);
if (!maybeRequest.isPresent()) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really thank that just have the try catch for errors here, potentially part of the try with resources block ([1]) will be much cleaner and easier to read than the optional song and dance.

[1] starting at try (RecoveryRef recoveryRef = onGoingRecoveries.getRecovery(recoveryId)) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I agree, having tried to rewrite it, I find it easier to reason about as-is. Let's discuss if you feel strongly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Here's a patch with what I meant. Talk tomorrow :)
https://gist.github.com/bleskes/9177f512ebe803dc07dbfdda6f8ef2b7

Copy link
Member Author

@jasontedor jasontedor Jan 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that's exactly what I did except I was trying to preserve the messages here and here, and that made it uglier than I preferred. If you're okay dropping those and just getting a generic message, and you appear to be, then I'll just do that already. 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hehe. yeah. I think the error in the cause exception should be enough to give us the information we need. No need to be heroic about the extra info.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed cc2002c.

}

logger.trace("{} preparing shard for peer recovery", recoveryTarget.shardId());
recoveryTarget.indexShard().prepareForIndexRecovery();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - this seems like a weird side effect to have here. Can we move it back to the main method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 34fbb37.

replicas.add(replica);
updateAllocationIDsOnPrimary();
return replica;
}

public synchronized IndexShard addReplica(IndexShard replica) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we name this something like closeAndAddAsInitializingReplica? (feel free to shorten, but I think we should make it clear what do, at the expense of length if need be)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should call this addReplicaWithExistingPath and give it two parameters - ShardPath and node Id? (leaving all the shard wrangling to the caller).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I pushed c0169c2.

shards.recoverReplica(recoveredReplica);
if (flushPrimary && replicaHasDocsSinceLastFlushedCheckpoint) {
// replica has something to catch up with, but since we flushed the primary, we should fall back to full recovery
assertThat(recoveredReplica.recoveryState().getIndex().fileDetails(), not(empty()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also assert the number of translog ops recovered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 8960522; please review this one carefully. 😉

@@ -57,11 +61,71 @@ public void testIndexingDuringFileRecovery() throws Exception {
}
}

public void testRecoveryOfDisconnectedReplica() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great simple test. I wonder how much effort it will be to add "ops in flight while starting recovery" to it (or another test). I'm thinking of how we test that we wait for the the translog to have a complete continuous section.

Copy link
Member Author

@jasontedor jasontedor Jan 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 7281b75. 😇

localCheckpointService = new LocalCheckpointService(shardId, indexSettings, maxSeqNo, localCheckpoint);
globalCheckpointService = new GlobalCheckpointService(shardId, indexSettings, globalCheckpoint);
localCheckpointTracker = new LocalCheckpointTracker(indexSettings, maxSeqNo, localCheckpoint);
globalCheckpointTracker = new GlobalCheckpointTracker(indexSettings, globalCheckpoint, logger);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this violate the logger per component policy we have?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed b6e6cc3.

@@ -684,13 +670,20 @@ private IndexResult innerIndex(Index index) throws IOException {
final IndexResult indexResult;
if (checkVersionConflictResult.isPresent()) {
indexResult = checkVersionConflictResult.get();
// norelease: this is not correct as this does not force an fsync, and we need to handle failures including replication
if (indexResult.hasFailure() || seqNo == SequenceNumbersService.UNASSIGNED_SEQ_NO) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed another solution for this problem in another channel. The gist was to change the way we deal with version conflicts on replicas. The idea was to try do it as another first and then base this PR on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #22626 for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed 8b0e501 to revert the change in preparation for merging master in after #22626 lands there.

This commit reverts adding no-ops to the translog when a version
conflict exception arises on a replica. Instead, we will treat these as
normal operations on a replica, but this will happen in another commit.
This commit reverts a whitespace change in MultiSnapshot.java.
When reading the translog on the source during peer recovery, if an I/O
exception occurs it is wrapped in an unchecked exception. This is
unnecessary as we can just let the I/O exception bubble all the way
up. This commit does that.
@jasontedor
Copy link
Member Author

@bleskes I pushed adafa21 but I'm not happy with it. Can you take a look and see if you can come up with something better. A failing test seed is FE6770A74885D66E (be warned, it takes several minutes, and this is the only seed I know of that makes the test fail).

* master: (47 commits)
  Remove non needed import
  use expectThrows instead of manually testing exception
  Fix checkstyle and a test
  Update after review
  Read ec2 discovery address from aws instance tags
  Invalidate cached query results if query timed out (elastic#22807)
  Add remaining generated painless API
  Generate reference links for painless API (elastic#22775)
  [TEST] Fix ElasticsearchExceptionTests
  Add parsing method for ElasticsearchException.generateThrowableXContent() (elastic#22783)
  Improve connection closing in `RemoteClusterConnection` (elastic#22804)
  Docs: Cluster allocation explain should be on one page
  Remove DFS_QUERY_AND_FETCH as a search type (elastic#22787)
  Add repository-url module and move URLRepository (elastic#22752)
  fix date-processor to a new default year for every new pipeline execution. (elastic#22601)
  Add tests for top_hits aggregation (elastic#22754)
  [TEST] Added this for 93a28b0 submitted via elastic#22772
  Fix typo in comment in OsProbe.java
  Add new ruby search library to community clients doc (elastic#22765)
  RangeQuery WITHIN case now normalises query (elastic#22431)
  ...
@jasontedor
Copy link
Member Author

@bleskes I pushed 06a3785.

@jasontedor
Copy link
Member Author

retest this please

@jasontedor
Copy link
Member Author

retest this please

3 similar comments
@jasontedor
Copy link
Member Author

retest this please

@bleskes
Copy link
Contributor

bleskes commented Jan 27, 2017

retest this please

@bleskes
Copy link
Contributor

bleskes commented Jan 27, 2017

retest this please

@jasontedor jasontedor merged commit 930282e into elastic:master Jan 27, 2017
@jasontedor jasontedor deleted the replica-sequence-number-recovery branch January 27, 2017 16:16
@jasontedor
Copy link
Member Author

Thanks @bleskes. 😄

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 28, 2017
The seq# base recovery logic relies on rolling back lucene to remove any operations above the global checkpoint. This part of the plan is not implemented yet but have to have these guarantees. Instead we should make the seq# logic validate that the last commit point (and the only one we have) maintains the invariant and if not, fall back to file based recovery.

 This commit adds a test that creates situation where rollback is needed (primary fail over with ops in flight) and fixes another issue that was surfaced by it - if a primary can't serve a seq# based recovery request and does a file copy, it still used the incoming `startSeqNo` as a filter.

 Relates to elastic#22484 & #elastic#10708
bleskes added a commit that referenced this pull request Jan 31, 2017
…#22851)

The seq# base recovery logic relies on rolling back lucene to remove any operations above the global checkpoint. This part of the plan is not implemented yet but have to have these guarantees. Instead we should make the seq# logic validate that the last commit point (and the only one we have) maintains the invariant and if not, fall back to file based recovery.

 This commit adds a test that creates situation where rollback is needed (primary failover with ops in flight) and fixes another issue that was surfaced by it - if a primary can't serve a seq# based recovery request and does a file copy, it still used the incoming `startSeqNo` as a filter.

 Relates to #22484 & #10708
bleskes added a commit that referenced this pull request Feb 9, 2017
…sts (#22900)

EvillPeerRecoveryIT checks scenario where recovery is happening while there are on going indexing operation that already have been assigned a seq# . This is fairly hard to achieve and the test goes through a couple of hoops via the plugin infra to achieve that. This PR extends the unit tests infra to allow for those hoops to happen in unit tests. This allows the test to be moved to RecoveryDuringReplicationTests

Relates to #22484
@clintongormley clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Sequence IDs labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants