Snapshot failover/retry on failed shard if a good copy is available #15940

Closed
ppf2 opened this issue Jan 12, 2016 · 1 comment
Labels
discuss :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement stalled

Comments

@ppf2
Member

ppf2 commented Jan 12, 2016

Scenario reported by the field is the following.

Periodically, snapshot fails (partial) against a specific shard.

[2015-11-10 07:20:37,413][WARN ][snapshots ] [node_name] [[index_name][1]] [snapshot:20151110t071646z] failed to create snapshot 
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [index_name][1] Failed to snapshot 
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:100) 
at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:871) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: org.elasticsearch.index.engine.FlushFailedEngineException: [index_name][1] Flush failed 
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:715) 
at org.elasticsearch.index.engine.InternalEngine.snapshotIndex(InternalEngine.java:846) 
at org.elasticsearch.index.shard.IndexShard.snapshotIndex(IndexShard.java:772) 
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:83) 
... 4 more 
Caused by: org.apache.lucene.index.CorruptIndexException: [index_name][1] Preexisting corrupted index [corrupted_JwkJ91qoSs2cbwcrhNb0iA] caused by: CorruptIndexException[verification failed : calculated=14wqrat stored=n88qsu] 
org.apache.lucene.index.CorruptIndexException: verification failed : calculated=14wqrat stored=n88qsu 
at org.elasticsearch.index.store.Store$VerifyingIndexInput.verify(Store.java:1507) 
at org.elasticsearch.index.store.Store.verify(Store.java:505) 
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshotFile(BlobStoreIndexShardRepository.java:568) 
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshot(BlobStoreIndexShardRepository.java:507) 
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:140) 
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:85) 
at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:871) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)

at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:602) 
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:583) 
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:150) 
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:709) 
... 7 more

This tends to happen when there is corruption in a large segment. Since the entire segment has to be read to check for corruption, for large segments the check is deferred until one of the following operations runs: snapshot, merge, relocation, or peer recovery (when segments are copied from a primary shard to a replica shard). So this can happen while the cluster is green: the snapshot then detects that a segment is bad, and the recovery process kicks in to try to recover the shard from a replica, etc.
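For anyone hitting this, a minimal sketch of watching that replica-based recovery complete, assuming Elasticsearch is reachable on localhost:9200 and the affected index is index_name (host and index name are placeholders, not from the original report):

import requests

ES = "http://localhost:9200"   # assumed local node address
INDEX = "index_name"           # the index whose shard failed to snapshot

# List recoveries for the index; after the corrupted copy is failed,
# a new copy is rebuilt from the surviving replica (peer recovery).
recoveries = requests.get(f"{ES}/{INDEX}/_recovery").json()
for shard in recoveries.get(INDEX, {}).get("shards", []):
    print(shard["id"], shard["type"], shard["stage"])

# Once every shard reports stage DONE, the index should be green again.
health = requests.get(f"{ES}/_cluster/health/{INDEX}").json()
print("index health:", health["status"])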

If there is a good copy available, the snapshot of that shard will succeed on the next scheduled snapshot run. However, for the snapshot operation that was previously issued, there is currently no failover/retry mechanism to retry the snapshot once recovery succeeds, or to snapshot from another copy of the shard instead (see the sketch below for the manual workaround).
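Until such a mechanism exists, the workaround is manual: inspect the finished snapshot, and if it came back PARTIAL, take another one after the shard has recovered. A rough sketch of that check, again assuming localhost:9200; the repository name my_backup is a placeholder:

import requests

ES = "http://localhost:9200"      # assumed node address
REPO = "my_backup"                # placeholder repository name
SNAPSHOT = "20151110t071646z"     # the snapshot that may have failed a shard

info = requests.get(f"{ES}/_snapshot/{REPO}/{SNAPSHOT}").json()
snap = info["snapshots"][0]
print("state:", snap["state"])    # SUCCESS or PARTIAL
for failure in snap.get("failures", []):
    print("failed shard:", failure["index"], failure["shard_id"], failure["reason"])

# If the snapshot was PARTIAL, take a new one once the cluster is green;
# the next run will read from the recovered (good) copy of the shard.
if snap["state"] == "PARTIAL":
    requests.put(
        f"{ES}/_snapshot/{REPO}/{SNAPSHOT}-retry",
        params={"wait_for_completion": "true"},
    )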

Discussed with @imotov: a solution to this will be complex, and we can revisit it once the task management API has been implemented, so that we can keep track of long-running jobs.

@ppf2 ppf2 added >enhancement discuss stalled :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Jan 12, 2016
@tlrx
Member

tlrx commented Mar 22, 2018

This feature request is interesting, but since its opening we have not seen enough feedback to indicate it is a feature we should pursue. We prefer to close this issue as a clear indication that we are not going to work on it at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed, please feel free to leave feedback on the proposal (including +1s).

@tlrx tlrx closed this as completed Mar 22, 2018