Describe the bug
We have seen instances where retries scheduled via RemoteStoreRefreshListener#scheduleRetry keep running even after the parent IndexShard instance has been closed. This is most easily reproduced during the remote migration seeding phase, where remote uploads fail consistently because of underlying issues with the destination remote store.
This is also reproducible through the RemotePrimaryRelocationIT#testMixedModeRelocation_RemoteSeedingFail IT. Although the IT itself does not fail, we see the following log lines right after the allocation service marks the shard copy as failed, since the IT intentionally interrupts all remote store uploads to induce a seeding failure.
We see this log line:
[2024-06-05T11:18:08,946][ERROR][o.o.i.s.RemoteStoreRefreshListener] [node_t1] [test][0] Exception in RemoteStoreRefreshListener.afterRefresh()
org.apache.lucene.store.AlreadyClosedException: this Directory is closed
at org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:198) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.apache.lucene.tests.store.MockDirectoryWrapper.listAll(MockDirectoryWrapper.java:1084) ~[lucene-test-framework-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:60) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:60) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.apache.lucene.index.SegmentInfos.getLastCommitGeneration(SegmentInfos.java:224) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.apache.lucene.index.SegmentInfos.getLastCommitSegmentsFileName(SegmentInfos.java:245) ~[lucene-core-9.11.0-snapshot-4be6531.jar:9.11.0-snapshot-4be6531 4be65312f7bbd4b09cff66433ebd95b30067b690 - 2024-05-27 18:46:58]
at org.opensearch.index.shard.RemoteStoreRefreshListener.isRefreshAfterCommit(RemoteStoreRefreshListener.java:383) ~[classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments(RemoteStoreRefreshListener.java:238) [classes/:?]
at org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit(RemoteStoreRefreshListener.java:158) [classes/:?]
at org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit(ReleasableRetryableRefreshListener.java:167) [classes/:?]
at org.opensearch.index.shard.ReleasableRetryableRefreshListener.lambda$scheduleRetry$2(ReleasableRetryableRefreshListener.java:127) [classes/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [classes/:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
immediately after:
[2024-06-05T11:17:54,653][WARN ][o.o.c.r.a.AllocationService] [node_t0] failing shard [failed shard, shard [test][0], node[8KhmAOUkTMeR6tpNDCegYQ], relocating [qA25wZbZS7eqschELSCcVA], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=UT-UicIZTeGIRkhHZKdx8A, rId=eDf59JrsSRucJxie4E_Njw], expected_shard_size[208], message [failed recovery], failure [RecoveryFailedException[[test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node_t0][127.0.0.1:49965][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[node_t1][127.0.0.1:49967][internal:index/shard/recovery/prepare_translog]]; nested: UncategorizedExecutionException[Failed execution]; nested: NotSerializableExceptionWrapper[execution_exception: java.io.IOException: Failed to upload to remote segment store within remote upload timeout of 0 minutes]; nested: IOException[Failed to upload to remote segment store within remote upload timeout of 0 minutes]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true} ([test][0]: Recovery failed from {node_t0}{qA25wZbZS7eqschELSCcVA}{GPoMkY_zQS6gMvGMztxeRQ}{127.0.0.1}{127.0.0.1:49965}{dimr}{shard_indexing_pressure_enabled=true} into {node_t1}{8KhmAOUkTMeR6tpNDCegYQ}{cPP6MZN1TX-X9lzVQBSIgQ}{127.0.0.1}{127.0.0.1:49967}{dimr}{shard_indexing_pressure_enabled=true})
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:141) ~[classes/:?]
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:180) ~[classes/:?]
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:212) ~[classes/:?]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:758) ~[classes/:?]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:688) ~[classes/:?]
at org.opensearch.telemetry.tracing.handler.TraceableTransportResponseHandler.handleException(TraceableTransportResponseHandler.java:81) ~[classes/:?]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1493) ~[classes/:?]
at org.opensearch.transport.NativeMessageHandler.lambda$handleException$5(NativeMessageHandler.java:455) ~[classes/:?]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) ~[classes/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1583) [?:?]
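For context, the AlreadyClosedException in the first stack trace is Lucene's standard guard against touching a Directory after close(): once the shard's store Directory is closed, any listAll() call fails the ensureOpen() check. A minimal standalone illustration (our own demo class, assuming only lucene-core on the classpath; it is not OpenSearch code):

import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ClosedDirectoryDemo {
    public static void main(String[] args) throws Exception {
        Path path = Files.createTempDirectory("closed-dir-demo");
        Directory directory = FSDirectory.open(path);
        directory.close();
        // Any access after close() trips BaseDirectory.ensureOpen(), which is
        // exactly what the retried upload hits once the shard has been closed.
        directory.listAll(); // throws org.apache.lucene.store.AlreadyClosedException
    }
}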
Related component
Storage:Remote
To Reproduce
Run the RemotePrimaryRelocationIT#testMixedModeRelocation_RemoteSeedingFail IT and observe the logs from that test.
Expected behavior
The RemoteStoreRefreshListener should not schedule upload retries after the IndexShard instance has transitioned to the CLOSED state and the underlying Engine and Directory instances have been closed.
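A minimal sketch of the kind of guard we would expect (illustrative only; RetryableUploadListener and its methods are made-up names, not the actual OpenSearch classes): the retry path checks a closed flag that the shard flips on close, so no new retry is queued against a closed Engine/Directory.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Illustrative sketch only: a retryable upload listener that stops
 * re-scheduling work once its owning shard has been closed.
 */
public class RetryableUploadListener {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicBoolean closed = new AtomicBoolean(false);

    // Called when the shard transitions to CLOSED; also cancels pending retries.
    public void close() {
        closed.set(true);
        scheduler.shutdownNow();
    }

    // Attempt the upload; on failure, schedule a retry only while still open.
    public void uploadWithRetry(Runnable upload, long retryDelayMillis) {
        if (closed.get()) {
            return; // shard is closed and its Directory is gone, nothing to upload
        }
        try {
            upload.run();
        } catch (Exception e) {
            scheduleRetry(upload, retryDelayMillis);
        }
    }

    private void scheduleRetry(Runnable upload, long retryDelayMillis) {
        // Guard: never re-queue work for a closed shard, otherwise the retry
        // thread dereferences a closed Directory and logs AlreadyClosedException
        // as shown in the stack trace above.
        if (closed.get()) {
            return;
        }
        scheduler.schedule(() -> uploadWithRetry(upload, retryDelayMillis), retryDelayMillis, TimeUnit.MILLISECONDS);
    }
}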