Introduce sequence-number-based recovery #22484

jasontedor · 2017-01-07T12:52:39Z

This commit introduces sequence-number-based recovery. When a replica has fallen out of sync, rather than performing a file-based recovery we first attempt to replay operations since the last local checkpoint on the replica. To do this, at the start of recovery the replica tells the primary what its local checkpoint is. The primary will then wait for all operations between that local checkpoint and the current maximum sequence number to complete; this is to ensure that there are no gaps in the operations that will be replayed from the primary to the replica. This is a best-effort attempt as we currently have no guarantees on the primary that these operations will be available; if we are not able to replay all operations in the desired range, we just fall back to file-based recovery. Later work will strengthen the guarantees.
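To make the "no gaps" requirement concrete, here is a minimal sketch of how a local checkpoint can be tracked and waited on. It is an illustration only, not the tracker added in this PR: the class name `SimpleCheckpointTracker` and all of its methods are hypothetical, and a plain `HashSet` stands in for whatever bookkeeping the real implementation uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions, not Elasticsearch code.
final class SimpleCheckpointTracker {
    // Completed sequence numbers that are still above the checkpoint (i.e. behind a gap).
    private final Set<Long> completedAboveCheckpoint = new HashSet<>();
    // Highest sequence number such that every sequence number <= it has completed.
    private long checkpoint;
    // Next sequence number to hand out to an incoming operation.
    private long nextSeqNo;

    SimpleCheckpointTracker(long initialCheckpoint) {
        this.checkpoint = initialCheckpoint;
        this.nextSeqNo = initialCheckpoint + 1;
    }

    synchronized long generateSeqNo() {
        return nextSeqNo++;
    }

    synchronized long getMaxSeqNo() {
        return nextSeqNo - 1;
    }

    synchronized long getCheckpoint() {
        return checkpoint;
    }

    synchronized void markCompleted(long seqNo) {
        if (seqNo <= checkpoint) {
            return; // already covered by the checkpoint
        }
        completedAboveCheckpoint.add(seqNo);
        // Advance the checkpoint across any contiguous run of completed operations.
        while (completedAboveCheckpoint.remove(checkpoint + 1)) {
            checkpoint++;
        }
        notifyAll(); // wake up anyone waiting for the checkpoint to advance
    }

    // Blocks until every operation up to and including seqNo has completed.
    synchronized void waitForOpsToComplete(long seqNo) throws InterruptedException {
        while (checkpoint < seqNo) {
            wait();
        }
    }
}
```

In this sketch, a primary would call `waitForOpsToComplete(getMaxSeqNo())` before replaying, so that an operation still in flight cannot leave a hole in the replayed range.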

Relates #10708

jasontedor · 2017-01-07T16:58:33Z

retest this please

jasontedor · 2017-01-07T21:03:01Z

retest this please

This commit simplifies sequence-number-based recovery. Rather than execute a dance between the replica and the primary of having the replica request a sequence-number-based recovery, then failing that recovery if it is not possible and having the replica request a second file-based recovery, we simply check on the primary side if a sequence-number-based recovery is possible and immediately fall back to file-based recovery if not.
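The primary-side check described here can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `SeqNoRecoveryPlanner`, `RecoveryMode`, `canReplay`, and `choose` are hypothetical names, and a plain set of sequence numbers stands in for the operations the primary still retains.

```java
import java.util.Set;

// Hypothetical sketch; names and data structures are assumptions, not Elasticsearch code.
final class SeqNoRecoveryPlanner {

    enum RecoveryMode { SEQUENCE_NUMBER_BASED, FILE_BASED }

    // True if every sequence number in (replicaLocalCheckpoint, maxSeqNo] can still be replayed.
    static boolean canReplay(long replicaLocalCheckpoint, long maxSeqNo, Set<Long> retainedOps) {
        for (long seqNo = replicaLocalCheckpoint + 1; seqNo <= maxSeqNo; seqNo++) {
            if (!retainedOps.contains(seqNo)) {
                return false; // a gap: replaying would silently skip an operation
            }
        }
        return true;
    }

    // The primary decides in a single step, so the replica never has to issue a second request.
    static RecoveryMode choose(long replicaLocalCheckpoint, long maxSeqNo, Set<Long> retainedOps) {
        return canReplay(replicaLocalCheckpoint, maxSeqNo, retainedOps)
            ? RecoveryMode.SEQUENCE_NUMBER_BASED
            : RecoveryMode.FILE_BASED;
    }
}
```

The point of the design is that the decision and the fallback happen in one place, so a failed attempt does not cost an extra round trip.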

dakrone · 2017-01-08T04:46:25Z

core/src/main/java/org/elasticsearch/common/util/concurrent/AbstractRefCounted.java

@@ -16,6 +16,7 @@
 * specific language governing permissions and limitations
 * under the License.
 */
+


Looks like a typo here

dakrone · 2017-01-08T04:46:50Z

core/src/main/java/org/elasticsearch/index/engine/Engine.java

@@ -379,6 +379,7 @@ void setTook(long took) {
        void freeze() {
            freeze.set(true);
        }
+


dakrone · 2017-01-08T04:49:34Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

            } else {
                // no version conflict
                if (index.origin() == Operation.Origin.PRIMARY) {
                    seqNo = seqNoService().generateSeqNo();
                }

-                /**
+                /*


This was a Javadoc-style comment inside a method where it has no impact on javadoc. While javac will treat it as a block comment either way, my IDE formats it as a Javadoc-style comment instead of as a block comment and it annoys me.

bleskes · 2017-01-08T19:10:47Z

w00t. I'll review as soon as I catch up with everything.

If a file-based recovery completes phase one successfully, but a network partition happens before the translog is opened, during the retry loop the recovery target will proceed to attempt a sequence-number-based recovery as the index files are present. However, as the translog was never opened it will be missing on disk leading to a no such file exception while preparing for a sequence-number-based recovery. We should not let this fail the recovery, but instead proceed to attempt another file-based recovery.
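A rough sketch of that fallback, under stated assumptions: the class `StartingSeqNoResolver`, the idea of reading the local checkpoint from a small text file, and the sentinel value `-2` for `UNASSIGNED_SEQ_NO` are all made up for the example and do not reflect the real translog format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical sketch; the file layout and the sentinel value are assumptions.
final class StartingSeqNoResolver {
    // Sentinel meaning "no usable sequence number, request a file-based recovery".
    static final long UNASSIGNED_SEQ_NO = -2L;

    static long resolveStartingSeqNo(Path checkpointFile) throws IOException {
        try {
            String text = new String(Files.readAllBytes(checkpointFile), StandardCharsets.UTF_8).trim();
            // Replay starts at the first operation after the local checkpoint.
            return Long.parseLong(text) + 1;
        } catch (NoSuchFileException e) {
            // The translog was never opened, so nothing is on disk yet.
            // Do not fail the recovery; signal that a file-based recovery is needed.
            return UNASSIGNED_SEQ_NO;
        }
    }
}
```

Only the missing-file case is swallowed; any other I/O error still propagates to the caller.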

bleskes

This looks great. I left some minor comments and suggestions.

bleskes · 2017-01-11T10:53:35Z

core/src/main/java/org/elasticsearch/index/engine/Engine.java

@@ -361,7 +361,7 @@ public long getTook() {

        void setTranslogLocation(Translog.Location translogLocation) {
            if (freeze.get() == null) {
-                assert failure == null : "failure has to be null to set translog location";
+                assert failure == null || translogLocation == null: "failure has to be null to set translog location";


wondering - was this required for this PR or is it preparation for the future?

This change will no longer be necessary after #22626.

I pushed 8b0e501.

I integrated master into this branch after #22626 in d71aa16.

I also pushed cea70f4.

bleskes · 2017-01-11T11:46:54Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

-        }
-
-        return new SeqNoStats(maxSeqNo, localCheckpoint, globalCheckpoint);
+        return SequenceNumbers.loadSeqNoStatsFromLuceneCommit(globalCheckpoint, indexWriter.getLiveCommitData());


Is this all worth it? I wonder if we should just load from store in the constructor and save on this method:

    switch (openMode) {
        case OPEN_INDEX_AND_TRANSLOG:
            seqNoStats = store.loadSeqNoStats(Translog.readGlobalCheckpoint(engineConfig.getTranslogConfig().getTranslogPath()));
            writer = createWriter(false);
            break;
        case OPEN_INDEX_CREATE_TRANSLOG:
            seqNoStats = store.loadSeqNoStats(SequenceNumbersService.UNASSIGNED_SEQ_NO);
            writer = createWriter(false);
            break;

I agree; I pushed 1929c03 (I think that this will also fit better with future developments).

bleskes · 2017-01-11T11:50:00Z

core/src/main/java/org/elasticsearch/index/translog/MultiSnapshot.java

@@ -57,4 +57,5 @@ public int totalOperations() {
        }
        return null;
    }
+


nit: a shame to touch this file?

I pushed dac513a.

bleskes · 2017-01-11T12:26:09Z

core/src/main/java/org/elasticsearch/indices/recovery/PeerRecoveryTargetService.java

-                return;
-            }
+            final Optional<StartRecoveryRequest> maybeRequest = getStartRecoveryRequest(recoveryTarget);
+            if (!maybeRequest.isPresent()) return;


I really think that just having the try-catch for errors here, potentially as part of the try-with-resources block ([1]), will be much cleaner and easier to read than the Optional song and dance.

[1] starting at try (RecoveryRef recoveryRef = onGoingRecoveries.getRecovery(recoveryId)) {

I'm not sure I agree, having tried to rewrite it, I find it easier to reason about as-is. Let's discuss if you feel strongly.

OK. Here's a patch with what I meant. Talk tomorrow :)
https://gist.github.com/bleskes/9177f512ebe803dc07dbfdda6f8ef2b7

Okay, that's exactly what I did except I was trying to preserve the messages here and here, and that made it uglier than I preferred. If you're okay dropping those and just getting a generic message, and you appear to be, then I'll just do that already. 😄

hehe. yeah. I think the error in the cause exception should be enough to give us the information we need. No need to be heroic about the extra info.

I pushed cc2002c.
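For readers who have not opened the gist, the general shape being discussed (a try-catch nested in a try-with-resources block, with a generic failure message and the cause attached) might look like the sketch below. It is not the patch from the gist or from cc2002c; `StartRecoverySketch`, `RecoveryRef`, `RequestFactory`, and the message text are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are the real Elasticsearch classes.
final class StartRecoverySketch {

    // Stand-in for a ref-counted handle that must be released when we are done.
    static final class RecoveryRef implements AutoCloseable {
        boolean closed = false;

        @Override
        public void close() {
            closed = true;
        }
    }

    interface RequestFactory {
        String create(RecoveryRef ref) throws Exception;
    }

    final List<String> failures = new ArrayList<>();
    final List<String> sent = new ArrayList<>();

    void doRecovery(RecoveryRef recoveryRef, RequestFactory factory) {
        try (RecoveryRef ref = recoveryRef) {
            final String request;
            try {
                request = factory.create(ref);
            } catch (Exception e) {
                // Generic message; the cause carries the details.
                failures.add("failed to prepare start recovery request: " + e.getMessage());
                return;
            }
            sent.add(request);
        }
    }
}
```

The early `return` replaces the `Optional` check, and the resource is released on both paths because the try-with-resources block closes it either way.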

bleskes · 2017-01-11T12:45:55Z

core/src/main/java/org/elasticsearch/indices/recovery/PeerRecoveryTargetService.java

+            }
+
+            logger.trace("{} preparing shard for peer recovery", recoveryTarget.shardId());
+            recoveryTarget.indexShard().prepareForIndexRecovery();


nit - this seems like a weird side effect to have here. Can we move it back to the main method?

I pushed 34fbb37.

bleskes · 2017-01-11T13:04:25Z

core/src/test/java/org/elasticsearch/index/replication/ESIndexLevelReplicationTestCase.java

            replicas.add(replica);
            updateAllocationIDsOnPrimary();
            return replica;
        }

+        public synchronized IndexShard addReplica(IndexShard replica) throws IOException {


can we name this something like closeAndAddAsInitializingReplica? (feel free to shorten, but I think we should make it clear what it does, at the expense of length if need be)

maybe we should call this addReplicaWithExistingPath and give it two parameters - ShardPath and node Id? (leaving all the shard wrangling to the caller).

Okay, I pushed c0169c2.

bleskes · 2017-01-11T13:12:03Z

core/src/test/java/org/elasticsearch/index/replication/RecoveryDuringReplicationTests.java

+            shards.recoverReplica(recoveredReplica);
+            if (flushPrimary && replicaHasDocsSinceLastFlushedCheckpoint) {
+                // replica has something to catch up with, but since we flushed the primary, we should fall back to full recovery
+                assertThat(recoveredReplica.recoveryState().getIndex().fileDetails(), not(empty()));


can we also assert the number of translog ops recovered?

I pushed 8960522; please review this one carefully. 😉

bleskes · 2017-01-11T13:13:38Z

core/src/test/java/org/elasticsearch/index/replication/RecoveryDuringReplicationTests.java

@@ -57,11 +61,71 @@ public void testIndexingDuringFileRecovery() throws Exception {
        }
    }

+    public void testRecoveryOfDisconnectedReplica() throws Exception {


this is a great simple test. I wonder how much effort it would be to add "ops in flight while starting recovery" to it (or another test). I'm thinking of how we test that we wait for the translog to have a complete continuous section.

I pushed 7281b75. 😇

bleskes · 2017-01-11T13:19:09Z

core/src/main/java/org/elasticsearch/index/seqno/SequenceNumbersService.java

-        localCheckpointService = new LocalCheckpointService(shardId, indexSettings, maxSeqNo, localCheckpoint);
-        globalCheckpointService = new GlobalCheckpointService(shardId, indexSettings, globalCheckpoint);
+        localCheckpointTracker = new LocalCheckpointTracker(indexSettings, maxSeqNo, localCheckpoint);
+        globalCheckpointTracker = new GlobalCheckpointTracker(indexSettings, globalCheckpoint, logger);


doesn't this violate the logger per component policy we have?

I pushed b6e6cc3.

bleskes · 2017-01-11T22:37:42Z

core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java

@@ -684,13 +670,20 @@ private IndexResult innerIndex(Index index) throws IOException {
            final IndexResult indexResult;
            if (checkVersionConflictResult.isPresent()) {
                indexResult = checkVersionConflictResult.get();
+                // norelease: this is not correct as this does not force an fsync, and we need to handle failures including replication
+                if (indexResult.hasFailure() || seqNo == SequenceNumbersService.UNASSIGNED_SEQ_NO) {


We discussed another solution for this problem in another channel. The gist was to change the way we deal with version conflicts on replicas. The idea was to try to do it as another PR first and then base this PR on it.

I opened #22626 for this.

I pushed 8b0e501 to revert the change in preparation for merging master in after #22626 lands there.

jasontedor · 2017-01-25T03:25:40Z

@bleskes I pushed adafa21 but I'm not happy with it. Can you take a look and see if you can come up with something better. A failing test seed is FE6770A74885D66E (be warned, it takes several minutes, and this is the only seed I know of that makes the test fail).

* master: (47 commits) Remove non needed import use expectThrows instead of manually testing exception Fix checkstyle and a test Update after review Read ec2 discovery address from aws instance tags Invalidate cached query results if query timed out (elastic#22807) Add remaining generated painless API Generate reference links for painless API (elastic#22775) [TEST] Fix ElasticsearchExceptionTests Add parsing method for ElasticsearchException.generateThrowableXContent() (elastic#22783) Improve connection closing in `RemoteClusterConnection` (elastic#22804) Docs: Cluster allocation explain should be on one page Remove DFS_QUERY_AND_FETCH as a search type (elastic#22787) Add repository-url module and move URLRepository (elastic#22752) fix date-processor to a new default year for every new pipeline execution. (elastic#22601) Add tests for top_hits aggregation (elastic#22754) [TEST] Added this for 93a28b0 submitted via elastic#22772 Fix typo in comment in OsProbe.java Add new ruby search library to community clients doc (elastic#22765) RangeQuery WITHIN case now normalises query (elastic#22431) ...

jasontedor · 2017-01-26T18:09:40Z

@bleskes I pushed 06a3785.

jasontedor · 2017-01-26T19:11:52Z

retest this please

jasontedor · 2017-01-26T21:21:21Z

retest this please

jasontedor · 2017-01-26T23:33:45Z

retest this please

bleskes · 2017-01-27T07:14:25Z

retest this please

bleskes · 2017-01-27T14:24:28Z

retest this please

jasontedor · 2017-01-27T16:16:55Z

Thanks @bleskes. 😄

The seq#-based recovery logic relies on rolling back Lucene to remove any operations above the global checkpoint. This part of the plan is not implemented yet, but we have to have these guarantees. Instead, we should make the seq# logic validate that the last commit point (and the only one we have) maintains the invariant and, if not, fall back to file-based recovery. This commit adds a test that creates a situation where rollback is needed (primary failover with ops in flight) and fixes another issue that was surfaced by it: if a primary can't serve a seq#-based recovery request and does a file copy, it still used the incoming `startSeqNo` as a filter. Relates to #22484 & #10708
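The invariant check and the filter fix described above can be sketched as follows. This is a simplified illustration under assumptions: `CommitInvariantSketch`, both method names, and the sentinel value `-2` are hypothetical, and the real change may be structured differently.

```java
// Hypothetical sketch; names and the sentinel value are assumptions.
final class CommitInvariantSketch {
    // Sentinel meaning "no sequence-number-based recovery, copy files instead".
    static final long UNASSIGNED_SEQ_NO = -2L;

    // Target side: only offer a starting sequence number if the last commit holds
    // nothing above the global checkpoint, since we cannot roll Lucene back yet.
    static long startingSeqNo(long commitMaxSeqNo, long commitLocalCheckpoint, long globalCheckpoint) {
        if (commitMaxSeqNo <= globalCheckpoint) {
            return commitLocalCheckpoint + 1;
        }
        return UNASSIGNED_SEQ_NO;
    }

    // Source side: if the primary ends up doing a file copy, it must not filter the
    // replayed operations by the sequence number the replica originally asked for.
    static long replayFilter(boolean seqNoBasedRecoveryPossible, long requestedStartSeqNo) {
        return seqNoBasedRecoveryPossible ? requestedStartSeqNo : UNASSIGNED_SEQ_NO;
    }
}
```

For example, a commit with max sequence number 15 and a global checkpoint of 12 may contain operations that a new primary never acknowledged, so the sketch returns the sentinel and a file-based recovery follows.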

…sts (#22900) EvilPeerRecoveryIT checks a scenario where recovery is happening while there are ongoing indexing operations that have already been assigned a seq#. This is fairly hard to achieve and the test goes through a couple of hoops via the plugin infra to achieve that. This PR extends the unit test infra to allow for those hoops to happen in unit tests. This allows the test to be moved to RecoveryDuringReplicationTests. Relates to #22484

jasontedor added :Distributed/Recovery, :Sequence IDs, >enhancement, v6.0.0-alpha1 labels Jan 7, 2017

jasontedor requested a review from bleskes January 7, 2017 12:52

jasontedor force-pushed the replica-sequence-number-recovery branch from 6265447 to ac1d630 Compare January 7, 2017 16:04

jasontedor force-pushed the replica-sequence-number-recovery branch from ac1d630 to 54e8224 Compare January 8, 2017 02:14

jasontedor changed the title ~~Introduce sequence number-based recovery~~ Introduce sequence-number-based recovery Jan 8, 2017

jasontedor added 2 commits January 7, 2017 21:16

jasontedor force-pushed the replica-sequence-number-recovery branch from 54e8224 to d360a23 Compare January 8, 2017 02:16

dakrone reviewed Jan 8, 2017

jasontedor force-pushed the replica-sequence-number-recovery branch 2 times, most recently from b6a32d1 to b9b816f Compare January 8, 2017 15:27

Merge branch 'master' into replica-sequence-number-recovery

6ec0ef6

* master: [TEST] Fixed the incorrect indentation for the `skip` clauses in the REST tests Fix primary relocation for shadow replicas (elastic#22474)

jasontedor force-pushed the replica-sequence-number-recovery branch from b9b816f to 6ec0ef6 Compare January 8, 2017 15:28

Remove obsolete field from RecoverySourceHandler

b9200cf

This commit removes a field that was left behind in a previous refactoring that rendered the field obsolete.

jasontedor force-pushed the replica-sequence-number-recovery branch from 40f5c2e to 8b5ea52 Compare January 8, 2017 19:33

jasontedor force-pushed the replica-sequence-number-recovery branch from 8b5ea52 to 91e1ff0 Compare January 8, 2017 19:33

Skip adding operations without sequence number

1c14260

A version conflict exception can happen during recovery. If this operation is from an old primary, a sequence number will have not been assigned to the operation. In this case, we should skip adding a no-op to the translog.

bleskes suggested changes Jan 11, 2017

jasontedor added 3 commits January 15, 2017 15:33

Revert adding no-ops on version confict in replica

8b0e501

This commit reverts adding no-ops to the translog when a version conflict exception arises on a replica. Instead, we will treat these as normal operations on a replica, but this will happen in another commit.

Revert whitespace change in MultiSnapshot.java

dac513a

This commit reverts a whitespace change in MultiSnapshot.java.

Bubble up translog I/O exceptions during recovery

81a1e1c

When reading the translog on the source during peer recovery, if an I/O exception occurs it is wrapped in an unchecked exception. This is unnecessary as we can just let the I/O exception bubble all the way up. This commit does that.
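The difference between wrapping and bubbling up can be sketched like this. The example is illustrative only: `TranslogReplaySketch` and its `Snapshot` interface are hypothetical, and `UncheckedIOException` is used here as a guess at the kind of unchecked wrapper involved, since the commit message does not name it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; the wrapper type and all names are assumptions.
final class TranslogReplaySketch {

    interface Snapshot {
        // Returns the next operation's sequence number, or null once exhausted.
        Long next() throws IOException;
    }

    // Before: the I/O failure is hidden behind an unchecked exception.
    static int countOpsWrapping(Snapshot snapshot) {
        try {
            int ops = 0;
            while (snapshot.next() != null) {
                ops++;
            }
            return ops;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // After: the method declares IOException and lets it bubble up unchanged.
    static int countOps(Snapshot snapshot) throws IOException {
        int ops = 0;
        while (snapshot.next() != null) {
            ops++;
        }
        return ops;
    }
}
```

Letting the checked exception propagate keeps the original type visible to callers, who can then handle I/O failures without unwrapping.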

jasontedor added 3 commits January 23, 2017 19:24

More trace logging for test

32c6702

Fix shard ID in logging statement

76f8807

Fix RFGIT#testReusePeerRecovery test bug

adafa21

jasontedor added 4 commits January 25, 2017 10:13

Cleanup

270a68a

Revert "Fix RFGIT#testReusePeerRecovery test bug"

2e67a0b

This reverts commit adafa21.

Rewrite reuse peer recovery test

06a3785

Remove unused imports

eeaa4f9

jasontedor added 2 commits January 26, 2017 14:26

Cleanup test

62aabb0

More cleanup

97e0b20

jasontedor merged commit 930282e into elastic:master Jan 27, 2017

jasontedor deleted the replica-sequence-number-recovery branch January 27, 2017 16:16

This was referenced Jan 28, 2017

Seq Number based recovery should validate last lucene commit max seq# #22851

Merged

Add Sequence Numbers to write operations #10708

Closed

bleskes mentioned this pull request Feb 1, 2017

Move EvilPeerRecoveryIT to a unit test in RecoveryDuringReplicationTests #22900

Merged

jasontedor mentioned this pull request Feb 2, 2017

Avoid losing ops in file-based recovery #22945

Merged

jasontedor mentioned this pull request Jul 6, 2017

Expand How to tune for disk usage #25562

Merged

clintongormley added :Distributed/Engine and removed :Sequence IDs labels Feb 14, 2018
