[BUG] RemoteStoreRefreshListener retries continues even after shard is closed #13996

Closed

shourya035 opened this issue Jun 5, 2024 · 1 comment

Labels: bug (Something isn't working), Storage:Remote

Comments

shourya035 (Member) commented Jun 5, 2024

Describe the bug

We have seen instances where RemoteStoreRefreshListener#scheduleRetry keeps scheduling retries even after the parent IndexShard instance has been closed. This is most easily reproduced during the remote migration seeding phase, where remote uploads fail consistently because of underlying issues with the destination remote store.
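
For context, the retry path has roughly the following shape. This is a simplified sketch using a plain ScheduledExecutorService instead of the OpenSearch ThreadPool, and performUpload() plus the fixed delay are illustrative, not the actual ReleasableRetryableRefreshListener code; the point is that the retry is scheduled unconditionally, so nothing ties the retry loop to the shard lifecycle:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified sketch (not the actual OpenSearch classes) of the retry pattern.
class RetryingRefreshListener {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void afterRefresh() {
        boolean successful = performUpload(); // hypothetical remote segment upload step
        if (!successful) {
            // The retry is scheduled unconditionally: even if the shard (and its
            // Directory) has been closed in the meantime, the rescheduled task will
            // run, hit AlreadyClosedException, and schedule yet another retry.
            scheduler.schedule(this::afterRefresh, 1, TimeUnit.SECONDS);
        }
    }

    private boolean performUpload() {
        // Placeholder: fails while the remote store is unavailable, as in the
        // seeding-failure scenario described above.
        return false;
    }
}
```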

This is also reproducible through the RemotePrimaryRelocationIT#testMixedModeRelocation_RemoteSeedingFail IT. Although the IT itself does not fail, the following log lines appear right after the allocation service marks the shard copy as failed, since the IT intentionally interrupts all remote store uploads to induce a seeding failure.

We see this log line:

[2024-06-05T11:18:08,946][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [test][0] Exception in RemoteStoreRefreshListener.afterRefresh()
org.apache.lucene.store.AlreadyClosedException: this Directory is closed
	at org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:198) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.tests.store.MockDirectoryWrapper.listAll(MockDirectoryWrapper.java:1084) ~[lucene-test-framework-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:60) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:60) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.index.SegmentInfos.getLastCommitGeneration(SegmentInfos.java:224) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.index.SegmentInfos.getLastCommitSegmentsFileName(SegmentInfos.java:245) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.isRefreshAfterCommit(RemoteStoreRefreshListener.java:383) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:238) [classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:158) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:127) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

immediately after the AllocationService marks the shard copy as failed:

[2024-06-05T11:17:54,653][WARN ][o.o.c.r.a.AllocationService] [node_t0] failing shard [failed shard, shard [test][0], node[8KhmAOUkTMeR6tpNDCegYQ], relocating [qA25wZbZS7eqschELSCcVA], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=UT-UicIZTeGIRkhHZKdx8A, rId=eDf59JrsSRucJxie4E_Njw], expected_shard_size[208], message [failed recovery], failure [RecoveryFailedException[[test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node_t0][127.0.0.1:49965][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[node_t1][127.0.0.1:49967][internal:index/shard/recovery/prepare_translog]]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: java.io.IOException: Failed to upload to remote segment store within remote upload timeout of 0 minutes]; nested: IOException[Failed to upload to remote segment store within remote upload timeout of 0 minutes]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true})
	at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) ~[classes/:?]
	at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) ~[classes/:?]
	at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) ~[classes/:?]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:758) ~[classes/:?]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:688) ~[classes/:?]
	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) ~[classes/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1493) ~[classes/:?]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:455) ~[classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) ~[classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1583) [?:?]

Related component

Storage:Remote

To Reproduce

Run the RemotePrimaryRelocationIT#testMixedModeRelocation_RemoteSeedingFail IT and observe the logs emitted by that test run.

Expected behavior

The RemoteStoreRefreshListener should not schedule further uploads or retries after the IndexShard instance has transitioned to the CLOSED state and the underlying Engine and Directory instances have been closed.
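
For illustration, a minimal sketch of the intended guard, extending the sketch above; the closed flag, close() hook, and performUpload() are stand-ins for the real IndexShard/listener lifecycle APIs, not the actual OpenSearch code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the expected behavior: stop retrying once the shard is closed.
class GuardedRefreshListener {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean closed = new AtomicBoolean(false);

    void afterRefresh() {
        if (closed.get()) {
            return; // shard is CLOSED: do not touch the Directory, do not retry
        }
        boolean successful = performUpload();
        if (!successful && !closed.get()) {
            scheduler.schedule(this::afterRefresh, 1, TimeUnit.SECONDS);
        }
    }

    void close() {
        // Invoked when the shard transitions to CLOSED; also cancel pending retries.
        closed.set(true);
        scheduler.shutdownNow();
    }

    private boolean performUpload() {
        return false; // placeholder for the remote segment upload
    }
}
```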


peternied (Member) commented:

[Triage - attendees 1 2 3 4 5 6 7]
@shourya035 Thanks for creating this issue and pull request.
