Fix race condition in SnapshotBasedIndexRecoveryIT #79404

fcofdez · 2021-10-19T00:09:46Z

If we don't cancel the relocation of the index to the same target
node, the recovery may be retried, in which case the available permit
can be granted to indexRecoveredFromSnapshot1 instead of
indexRecoveredFromSnapshot2.

Relates #79316
Closes #79420

elasticmachine · 2021-10-19T00:09:48Z

Pinging @elastic/es-distributed (Team:Distributed)

fcofdez · 2021-10-19T00:11:50Z

...nternalClusterTest/java/org/elasticsearch/indices/recovery/SnapshotBasedIndexRecoveryIT.java

@@ -933,6 +933,11 @@ public void testRecoveryUsingSnapshotsPermitIsReturnedAfterFailureOrCancellation

                targetMockTransportService.clearAllRules();
                channelRef.get().sendResponse(new IOException("unable to clean files"));
+                assertAcked(


I've been trying to find a way to ensure that the RecoveryTarget reference is released, and therefore that the snapshot file download permit is released too. Since this happens asynchronously, I couldn't find a reliable way to be sure that the permit has been released. Maybe we should add a Thread.sleep here? 🤔

DaveCTurner · 2021-10-19

Could we set index.allocation.max_retries: 1 rather than adding this filter? That way we can be sure that it's the failure that releases the permits and not the fact that the allocation filter causes allocation to be cancelled.

In terms of waiting for the permits to be released, maybe add a package-private method that exposes the RecoverySettings on the PeerRecoveryTargetService. Then, after updating the allocation filter, you can assertBusy that all the permits can be acquired.
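The suggestion above amounts to polling until the permit held by the failed recovery becomes available again. A minimal, self-contained sketch of that pattern follows. It uses a plain java.util.concurrent.Semaphore in place of the real recovery settings, and the class name, the assertBusy helper, and the timings are all hypothetical stand-ins rather than the Elasticsearch test utilities.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PermitReleaseSketch {

    /** Hypothetical stand-in for a test-framework assertBusy: retries the check until it passes or the timeout expires. */
    static void assertBusy(Runnable check, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            try {
                check.run();
                return; // the check passed
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // give up and surface the last failure
                }
            }
            try {
                Thread.sleep(10); // short back-off before the next attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
        }
    }

    /** Simulates a failed recovery releasing its permit asynchronously, then waits until it can be re-acquired. */
    static boolean permitIsEventuallyReleased() {
        Semaphore permits = new Semaphore(1);
        permits.acquireUninterruptibly(); // held by the recovery that is about to fail

        Thread cleanup = new Thread(() -> {
            try {
                Thread.sleep(50); // the release happens some time after the failure
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            permits.release();
        });
        cleanup.start();

        assertBusy(() -> {
            if (!permits.tryAcquire()) {
                throw new AssertionError("permit not released yet");
            }
            permits.release(); // hand it back so the next recovery can use it
        }, 5_000);

        return permits.availablePermits() == 1;
    }

    public static void main(String[] args) {
        System.out.println("permit released: " + permitIsEventuallyReleased());
    }
}
```

Compared with a fixed Thread.sleep, polling returns as soon as the permit is free and only fails after the full timeout has elapsed.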

fcofdez · 2021-10-27T15:35:09Z

@DaveCTurner would you mind taking a look at this when you have a chance? Thanks!

fcofdez · 2021-10-28T07:26:38Z

@elasticmachine update branch

fcofdez · 2021-10-28T08:42:34Z

@elasticmachine run elasticsearch-ci/part-2
Unrelated failure

DaveCTurner

LGTM (with a couple of nits)

DaveCTurner · 2021-11-10T10:19:21Z

...a/org/elasticsearch/xpack/snapshotbasedrecoveries/recovery/SnapshotBasedIndexRecoveryIT.java

+                        Releasable snapshotDownloadPermit = peerRecoveryTargetService.tryAcquireSnapshotDownloadPermits();
+                        assertThat(snapshotDownloadPermit, is(notNullValue()));
+                        snapshotDownloadPermit.close();


Slight preference for using a try-with-resources here.
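As an illustration of the try-with-resources preference: the permit in the diff is closed via close(), so, assuming the permit type is closeable, it can be declared in the resource header and is released even if the assertion throws. The sketch below uses a hypothetical local Releasable interface and a hypothetical tryAcquirePermit method backed by a Semaphore, not the real PeerRecoveryTargetService API.

```java
import java.util.concurrent.Semaphore;

public class TryWithResourcesSketch {

    /** Hypothetical stand-in for a releasable permit; close() declares no checked exception. */
    interface Releasable extends AutoCloseable {
        @Override
        void close();
    }

    static final Semaphore PERMITS = new Semaphore(1);

    /** Hypothetical helper: returns a permit handle, or null when no permit is available. */
    static Releasable tryAcquirePermit() {
        return PERMITS.tryAcquire() ? PERMITS::release : null;
    }

    /** Acquires the permit, checks it, and relies on try-with-resources to release it. */
    static boolean acquireCheckAndRelease() {
        // A null resource is allowed: close() is simply not called for it.
        try (Releasable permit = tryAcquirePermit()) {
            if (permit == null) {
                throw new AssertionError("expected a free permit");
            }
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("acquired and released: " + acquireCheckAndRelease());
        System.out.println("permits available afterwards: " + PERMITS.availablePermits());
    }
}
```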

DaveCTurner · 2021-11-10T10:20:44Z

...a/org/elasticsearch/xpack/snapshotbasedrecoveries/recovery/SnapshotBasedIndexRecoveryIT.java

                        .put(MergePolicyConfig.INDEX_MERGE_ENABLED, false)
                        .put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), "1s")
                        .put("index.routing.allocation.require._name", dataNodes.get(0))
-                        .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0)
+                        .put("index.allocation.max_retries", 0)


Slight preference for using the setting directly rather than its literal name:

Suggested change

-                        .put("index.allocation.max_retries", 0)
+                        .put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0)
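The reason for the preference can be shown with a small sketch: when the key comes from the setting constant, a rename or a typo becomes a compile-time error instead of a silently ignored string. The Setting class and the map-based builder below are hypothetical simplifications, not the Elasticsearch Settings API; only the key string and the constant's name are taken from the review comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SettingKeySketch {

    /** Hypothetical, minimal stand-in for a typed setting that knows its own key. */
    static final class Setting {
        private final String key;

        Setting(String key) {
            this.key = key;
        }

        String getKey() {
            return key;
        }
    }

    // The key is defined once; callers never repeat the literal.
    static final Setting SETTING_ALLOCATION_MAX_RETRY = new Setting("index.allocation.max_retries");

    /** Builds index settings using the constant's key rather than a string literal. */
    static Map<String, Object> buildIndexSettings() {
        Map<String, Object> settings = new LinkedHashMap<>();
        settings.put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0);
        return settings;
    }

    public static void main(String[] args) {
        System.out.println(buildIndexSettings());
    }
}
```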

Fix race condition in SnapshotBasedIndexRecoveryIT

0b67ab1

If we don't cancel the re-location of the index to the same target node, it is possible that the recovery is retried, causing a race condition.

fcofdez added the >test, :Distributed/Recovery, v8.0.0, Team:Distributed, and v7.16.0 labels Oct 19, 2021

fcofdez requested a review from DaveCTurner October 19, 2021 06:04

fcofdez added 2 commits October 19, 2021 11:09

Merge remote-tracking branch 'origin/master' into fix-snapshot-based-index-recovery-it

77f6bbc

More robust fix

a884aba

danhermann added v8.1.0 and removed v7.16.0 labels Oct 27, 2021

Merge remote-tracking branch 'origin/master' into fix-snapshot-based-index-recovery-it

994b458

fcofdez added the v7.16.1 label Oct 27, 2021

Merge branch 'master' into fix-snapshot-based-index-recovery-it

714d98b

fcofdez added v7.16.0 v8.0.0-beta1 and removed v7.16.1 v8.0.0 labels Oct 28, 2021

DaveCTurner approved these changes Nov 10, 2021

fcofdez added 3 commits November 24, 2021 12:11

Merge remote-tracking branch 'origin/master' into fix-snapshot-based-index-recovery-it

af47a40

Review nits

7921065

Merge remote-tracking branch 'origin/master' into fix-snapshot-based-index-recovery-it

03da57c

fcofdez removed v7.16.0 v8.0.0-beta1 labels Nov 29, 2021

fcofdez merged commit 5cb5d92 into elastic:master Nov 29, 2021
