-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition in SnapshotBasedIndexRecoveryIT #79404
Fix race condition in SnapshotBasedIndexRecoveryIT #79404
Conversation
If we don't cancel the re-location of the index to the same target node, it is possible that the recovery is retried, causing a race condition.
Pinging @elastic/es-distributed (Team:Distributed) |
@@ -933,6 +933,11 @@ public void testRecoveryUsingSnapshotsPermitIsReturnedAfterFailureOrCancellation | |||
|
|||
targetMockTransportService.clearAllRules(); | |||
channelRef.get().sendResponse(new IOException("unable to clean files")); | |||
assertAcked( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been trying to find a way to ensure that the RecoveryTarget reference is released and therefore the snapshot file download permit is released but since it happens asynchronously I couldn't find a reliable way to be sure that the permit has been released. Maybe we should add a Thread.sleep
here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we set index.allocation.max_retries: 1
rather than adding this filter? That way we can be sure that it's the failure that releases the permits and not the fact that the allocation filter causes allocation to be cancelled.
In terms of waiting for the permits to be released, maybe add a package-private method that exposes the RecoverySettings
on the PeerRecoveryTargetService
and then after updating the allocation filter you can assertBusy
that all the permits can be acquired.
…index-recovery-it
@DaveCTurner would you mind taking a look into this when you have the chance? thanks! |
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/part-2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (with a couple of nits)
Releasable snapshotDownloadPermit = peerRecoveryTargetService.tryAcquireSnapshotDownloadPermits(); | ||
assertThat(snapshotDownloadPermit, is(notNullValue())); | ||
snapshotDownloadPermit.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight preference for using a try-with-resources here.
.put(MergePolicyConfig.INDEX_MERGE_ENABLED, false) | ||
.put(IndexService.GLOBAL_CHECKPOINT_SYNC_INTERVAL_SETTING.getKey(), "1s") | ||
.put("index.routing.allocation.require._name", dataNodes.get(0)) | ||
.put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0) | ||
.put("index.allocation.max_retries", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slight preference for using the setting directly rather than its literal name:
.put("index.allocation.max_retries", 0) | |
.put(SETTING_ALLOCATION_MAX_RETRY.getKey(), 0) |
…index-recovery-it
…index-recovery-it
If we don't cancel the re-location of the index to the same target
node, it is possible that the recovery is retried, meaning that it's
possible that the available permit is granted to indexRecoveredFromSnapshot1
instead of to indexRecoveredFromSnapshot2.
Relates #79316
Closes #79420