Retry transient shard failures in search #56045

jimczi · 2020-04-30T15:46:15Z

Today a shard search request is tried on each replica in turn until one succeeds. If all replicas fail for a shard, we consider the shard as failed and move on with the other shards.
Users can choose whether to accept partial results by setting allow_partial_search_results. However, if they want the full results they have no choice but to replay the query (assuming that the shard failures were transient).
I am opening this issue to discuss whether we could apply some exponential backoff to retry transient shard failures in search requests.
Failures such as:

Rejected execution exception
Shard unavailable exception

could be retried with a configurable exponential backoff. This would be useful for search requests that run in the background (with _async_search) and that can afford waiting for a shard recovery.
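
To make the proposal concrete, here is a minimal sketch of a retry loop with a configurable exponential backoff. It is not Elasticsearch code: `TransientShardFailure`, `backoff_delays`, `search_shard_with_retry` and all of their parameters are hypothetical names chosen for illustration. The sleep function is injectable so the behaviour can be checked without actually waiting.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class TransientShardFailure(Exception):
    """Hypothetical stand-in for a failure worth retrying,
    such as a rejected execution or an unavailable shard."""


def backoff_delays(
    initial: float, multiplier: float, max_delay: float, max_retries: int
) -> Iterator[float]:
    """Yield one delay per retry, growing geometrically and capped at max_delay."""
    delay = initial
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= multiplier


def search_shard_with_retry(
    run_shard_request: Callable[[], T],
    *,
    initial: float = 0.1,
    multiplier: float = 2.0,
    max_delay: float = 5.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run the shard request, retrying transient failures with backoff.

    Any other exception propagates immediately. Once the retries are
    exhausted, the last transient failure is re-raised to the caller.
    """
    delays = backoff_delays(initial, multiplier, max_delay, max_retries)
    while True:
        try:
            return run_shard_request()
        except TransientShardFailure:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: report the shard as failed
            sleep(delay)
```

Capping the delay keeps the total wait bounded, which matters even for background searches. A real implementation would presumably schedule the retry asynchronously rather than block a thread.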

This issue is also loosely related to #37867 since low-priority search requests could be configured to retry automatically.


elasticmachine · 2020-04-30T15:46:17Z

Pinging @elastic/es-search (:Search/Search)

elasticmachine · 2020-04-30T15:46:19Z

Pinging @elastic/es-distributed (:Distributed/Distributed)

jimczi · 2020-05-05T19:29:47Z

We discussed this in Fix-it Thursday and agreed on two possible improvements:

We shouldn't retry non-transient failures.
We could add a configurable delay to wait for non-assigned shards before raising an error.

These improvements are not linked so I'll open a new issue for the latter so that it can be handled separately.
Shard failures are difficult to diagnose. For instance, there is no way to determine whether a circuit breaker exception is caused by the current shard request or by the node being overloaded.
We didn't reach a conclusion for this specific failure, but we agreed that we should categorize each error type in order to determine whether it should be retried.
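
One way to picture such a categorization is a lookup table from failure type to retry decision. The sketch below is illustrative only: the failure-type names and the decisions attached to them are assumptions, not the classification agreed in this issue. The circuit breaker case is left undecided to mirror the discussion above, and unknown failure types default to not being retried.

```python
from enum import Enum


class RetryDecision(Enum):
    RETRY = "retry"
    DO_NOT_RETRY = "do_not_retry"
    UNDECIDED = "undecided"


# Hypothetical failure-type names and decisions, for illustration only.
FAILURE_CLASSIFICATION = {
    "rejected_execution": RetryDecision.RETRY,
    "shard_unavailable": RetryDecision.RETRY,
    # Could be caused by this request or by an overloaded node.
    "circuit_breaker": RetryDecision.UNDECIDED,
    # Assumed example of a non-transient failure.
    "query_parsing": RetryDecision.DO_NOT_RETRY,
}


def should_retry(failure_type: str) -> bool:
    """Retry only failures that are explicitly classified as transient."""
    decision = FAILURE_CLASSIFICATION.get(failure_type, RetryDecision.DO_NOT_RETRY)
    return decision is RetryDecision.RETRY
```

Defaulting to "do not retry" is the conservative choice here: a failure that has not been categorized yet is never replayed automatically.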

jimczi · 2020-05-05T19:39:45Z

I opened #56236 to handle non-assigned shards in search requests. This issue is now geared towards classifying shard failures that shouldn't be retried automatically.

elasticsearchmachine · 2022-07-28T12:50:48Z

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine · 2024-07-17T19:15:47Z

Pinging @elastic/es-search-foundations (Team:Search Foundations)

jimczi added the >feature, discuss, :Search/Search, and :Distributed/Distributed labels Apr 30, 2020

elasticmachine added the Team:Search label Apr 30, 2020

elasticmachine added the Team:Distributed label Apr 30, 2020

jimczi removed the discuss label May 5, 2020

jimczi mentioned this issue Jun 24, 2020

Add a configurable delay for non-assigned shards in search request #56236

Closed

pxsalehi removed the :Distributed/Distributed and Team:Distributed labels Jul 28, 2022

javanna added >enhancement and removed >feature labels Jun 25, 2024

javanna added the :Search Foundations/Search label and removed the :Search/Search label Jul 17, 2024

elasticsearchmachine added the Team:Search Foundations label and removed the Team:Search label Jul 17, 2024
