Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in SnapshotBasedIndexRecoveryIT #79404

Merged
merged 8 commits into from
Nov 29, 2021

Conversation

fcofdez
Copy link
Contributor

@fcofdez fcofdez commented Oct 19, 2021

If we don't cancel the re-location of the index to the same target
node, it is possible that the recovery is retried, meaning that it's
possible that the available permit is granted to indexRecoveredFromSnapshot1
instead of to indexRecoveredFromSnapshot2.

Relates #79316
Closes #79420

If we don't cancel the re-location of the index to the same target
node, it is possible that the recovery is retried, causing a race
condition.
@fcofdez fcofdez added >test Issues or PRs that are addressing/adding tests :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 Team:Distributed Meta label for distributed team v7.16.0 labels Oct 19, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@@ -933,6 +933,11 @@ public void testRecoveryUsingSnapshotsPermitIsReturnedAfterFailureOrCancellation

targetMockTransportService.clearAllRules();
channelRef.get().sendResponse(new IOException("unable to clean files"));
assertAcked(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to find a way to ensure that the RecoveryTarget reference is released and therefore the snapshot file download permit is released but since it happens asynchronously I couldn't find a reliable way to be sure that the permit has been released. Maybe we should add a Thread.sleep here? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we set index.allocation.max_retries: 1 rather than adding this filter? That way we can be sure that it's the failure that releases the permits and not the fact that the allocation filter causes allocation to be cancelled.

In terms of waiting for the permits to be released, maybe add a package-private method that exposes the RecoverySettings on the PeerRecoveryTargetService and then after updating the allocation filter you can assertBusy that all the permits can be acquired.

@danhermann danhermann added v8.1.0 and removed v7.16.0 labels Oct 27, 2021
@fcofdez
Copy link
Contributor Author

fcofdez commented Oct 27, 2021

@DaveCTurner would you mind taking a look into this when you have the chance? thanks!

@fcofdez
Copy link
Contributor Author

fcofdez commented Oct 28, 2021

@elasticmachine update branch

@fcofdez
Copy link
Contributor Author

fcofdez commented Oct 28, 2021

@elasticmachine run elasticsearch-ci/part-2
Unrelated failure

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (with a couple of nits)

Comment on lines 977 to 979
Releasable snapshotDownloadPermit = peerRecoveryTargetService.tryAcquireSnapshotDownloadPermits();
assertThat(snapshotDownloadPermit, is(notNullValue()));
snapshotDownloadPermit.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for using a try-with-resources here.

.put(MergePolicyConfig.INDEX_MERGE_ENABLED, false)
.put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), "1s")
.put("index.routing.allocation.require._name", dataNodes.get(0))
.put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0)
.put("index.allocation.max_retries", 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight preference for using the setting directly rather than its literal name:

Suggested change
.put("index.allocation.max_retries", 0)
.put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0)

@fcofdez fcofdez merged commit 5cb5d92 into elastic:master Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed Meta label for distributed team >test Issues or PRs that are addressing/adding tests v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] SnapshotBasedIndexRecoveryIT testRecoveryUsingSnapshotsPermitIsReturnedAfterFailureOrCancellation failing
4 participants