[BUG] RemoteStoreRefreshListener retries continues even after shard is closed #13996

Closed

shourya035 opened this issue Jun 5, 2024 · 1 comment

Labels: bug (Something isn't working), Storage:Remote

Comments

shourya035 (Member) commented Jun 5, 2024

Describe the bug

We have seen instances where RemoteStoreRefreshListener#scheduleRetry keeps scheduling retries even after the parent IndexShard instance has been closed. This is most easily reproduced during the remote migration seeding phase, where remote uploads fail consistently because of underlying issues with the destination remote store.
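
For context, the retry path has roughly the following shape. This is a simplified sketch using a plain ScheduledExecutorService instead of the OpenSearch ThreadPool, and performUpload() plus the fixed delay are illustrative, not the actual ReleasableRetryableRefreshListener code; the point is that the retry is scheduled unconditionally, so nothing ties the retry loop to the shard lifecycle:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Simplified sketch (not the actual OpenSearch classes) of the retry pattern.
class RetryingRefreshListener {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void afterRefresh() {
        boolean successful = performUpload(); // hypothetical remote segment upload step
        if (!successful) {
            // The retry is scheduled unconditionally: even if the shard (and its
            // Directory) has been closed in the meantime, the rescheduled task will
            // run, hit AlreadyClosedException, and schedule yet another retry.
            scheduler.schedule(this::afterRefresh, 1, TimeUnit.SECONDS);
        }
    }

    private boolean performUpload() {
        // Placeholder: fails while the remote store is unavailable, as in the
        // seeding-failure scenario described above.
        return false;
    }
}
```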

This is also reproducible through the RemotePrimaryRelocationIT#testMixedModeRelocation_RemoteSeedingFail IT. Although the IT itself does not fail, the following log lines appear right after the allocation service marks the shard copy as failed, since the IT intentionally interrupts all remote store uploads to induce a seeding failure.

We see this log line:

[2024-06-05T11:18:08,946][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [test][0] Exception in RemoteStoreRefreshListener.afterRefresh()
org.apache.lucene.store.AlreadyClosedException: this Directory is closed
	at org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:198) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.tests.store.MockDirectoryWrapper.listAll(MockDirectoryWrapper.java:1084) ~[lucene-test-framework-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:60) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:60) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.index.SegmentInfos.getLastCommitGeneration(SegmentInfos.java:224) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.apache.lucene.index.SegmentInfos.getLastCommitSegmentsFileName(SegmentInfos.java:245) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.isRefreshAfterCommit(RemoteStoreRefreshListener.java:383) ~[classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:238) [classes/:?]
	at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:158) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167) [classes/:?]
	at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:127) [classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [classes/:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

immediately after the AllocationService marks the shard copy as failed:

[2024-06-05T11:17:54,653][WARN ][o.o.c.r.a.AllocationService] [node_t0] failing shard [failed shard, shard [test][0], node[8KhmAOUkTMeR6tpNDCegYQ], relocating [qA25wZbZS7eqschELSCcVA], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=UT-UicIZTeGIRkhHZKdx8A, rId=eDf59JrsSRucJxie4E_Njw], expected_shard_size[208], message [failed recovery], failure [RecoveryFailedException[[test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node_t0][127.0.0.1:49965][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[node_t1][127.0.0.1:49967][internal:index/shard/recovery/prepare_translog]]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: java.io.IOException: Failed to upload to remote segment store within remote upload timeout of 0 minutes]; nested: IOException[Failed to upload to remote segment store within remote upload timeout of 0 minutes]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true})
	at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) ~[classes/:?]
	at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) ~[classes/:?]
	at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) ~[classes/:?]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:758) ~[classes/:?]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:688) ~[classes/:?]
	at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) ~[classes/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1493) ~[classes/:?]
	at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:455) ~[classes/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) ~[classes/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1583) [?:?]

Related component

Storage:Remote

To Reproduce

Run the RemotePrimaryRelocationIT#testMixedModeRelocation_RemoteSeedingFail IT and observe the logs emitted by that test run.

Expected behavior

The RemoteStoreRefreshListener should not schedule further uploads or retries after the IndexShard instance has transitioned to the CLOSED state and the underlying Engine and Directory instances have been closed.
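
For illustration, a minimal sketch of the intended guard, extending the sketch above; the closed flag, close() hook, and performUpload() are stand-ins for the real IndexShard/listener lifecycle APIs, not the actual OpenSearch code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the expected behavior: stop retrying once the shard is closed.
class GuardedRefreshListener {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean closed = new AtomicBoolean(false);

    void afterRefresh() {
        if (closed.get()) {
            return; // shard is CLOSED: do not touch the Directory, do not retry
        }
        boolean successful = performUpload();
        if (!successful && !closed.get()) {
            scheduler.schedule(this::afterRefresh, 1, TimeUnit.SECONDS);
        }
    }

    void close() {
        // Invoked when the shard transitions to CLOSED; also cancel pending retries.
        closed.set(true);
        scheduler.shutdownNow();
    }

    private boolean performUpload() {
        return false; // placeholder for the remote segment upload
    }
}
```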


peternied (Member) commented:

[Triage - attendees 1 2 3 4 5 6 7]
@shourya035 Thanks for creating this issue and pull request.
