
[ML] Adding retries for starting model deployment #99673

Merged

Conversation

jonathan-buttner
Contributor

@jonathan-buttner jonathan-buttner commented Sep 19, 2023

This PR reduces the chunk size used when downloading a model to 1 MB. It also adds retry logic for deploying a model, specifically for SearchPhaseExecutionException. This exception is thrown when a CircuitBreakingException occurs during model deployment. By catching the exception and retrying in the restorer logic, we can avoid reloading the entire model.

Addresses part of: #99409
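
For context, here is a minimal sketch of the retry pattern the description refers to; the class name, method name, retry limit, and imports are assumptions for illustration, not the PR's actual code:

    import org.elasticsearch.action.search.SearchPhaseExecutionException;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.internal.Client;

    final class ModelPartSearcher {

        // Illustrative retry limit; the PR's actual constant and value may differ.
        private static final int SEARCH_RETRY_LIMIT = 5;

        // Retry the model-part search on a failed search phase so that a transient
        // circuit breaker trip does not force a full reload of the model.
        static SearchResponse searchWithRetries(Client client, SearchRequest searchRequest) {
            int failureCount = 0;
            while (true) {
                try {
                    return client.search(searchRequest).actionGet();
                } catch (SearchPhaseExecutionException e) {
                    failureCount++;
                    if (failureCount >= SEARCH_RETRY_LIMIT) {
                        throw e;
                    }
                }
            }
        }
    }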

Testing

Retrying does occur:

[screenshot]

I can still get a CircuitBreakingException (CBE) to occur, though.

@jonathan-buttner jonathan-buttner added >non-issue :ml Machine learning Team:ML Meta label for the ML team v8.11.0 cloud-deploy Publish cloud docker image for Cloud-First-Testing labels Sep 19, 2023
@elasticsearchmachine
Collaborator

Hi @jonathan-buttner, I've created a changelog YAML for you.

while (true) {
    try {
        return client.search(searchRequest).actionGet();
    } catch (SearchPhaseExecutionException e) {
Contributor

It's highly likely that this will be wrapped in a RemoteTransportException, so catching it like this won't work.

I think you should catch Exception and then use something like:

    if (ExceptionsHelper.unwrapCause(e) instanceof SearchPhaseExecutionException) {
        // do the processing
    } else {
        throw e;
    }

Contributor

I think ExceptionsHelper.unwrapCause(e) instanceof CircuitBreakingException should also be accepted as a reason to retry. I'm not sure whether the CircuitBreakingException will always be wrapped in a SearchPhaseExecutionException, so it seems safer to accept either as a reason to retry.
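
A minimal sketch of the combined check these two comments suggest; the class and method names are hypothetical:

    import org.elasticsearch.ExceptionsHelper;
    import org.elasticsearch.action.search.SearchPhaseExecutionException;
    import org.elasticsearch.common.breaker.CircuitBreakingException;

    final class RetryDecision {

        // Retry if the unwrapped cause is either a failed search phase or a tripped circuit breaker.
        static boolean isRetryable(Exception e) {
            Throwable cause = ExceptionsHelper.unwrapCause(e);
            return cause instanceof SearchPhaseExecutionException || cause instanceof CircuitBreakingException;
        }
    }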

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 19, 2023 20:01
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


if (failureCount >= retries) {
    logger.warn(format("[%s] searching for model part failed %s times, returning failure", modelId, retries));
    throw e;
Member

Should the retry logic in TrainedModelAssignmentNodeService now be removed? If the search has already failed after N tries, trying to reload the entire model puts the cluster under even more strain.

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L219

Instead of throwing e, please wrap it in another exception with a sensible error message along the lines of: loading model [model_id] failed after [SEARCH_RETRY_LIMIT] retries. The deployment is now in a failed state; the error may be transient, please stop the deployment and restart.

The error message ends up in the failed routing state here and will eventually make its way back to the caller, assuming the request did not time out.

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L718
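
A hedged sketch of the suggested wrapping; the helper name and the exact message wording are illustrative, not the PR's final code:

    import org.elasticsearch.ElasticsearchException;

    final class ModelLoadFailure {

        // Wrap the final failure so the caller sees an actionable message in the
        // failed routing state instead of a bare search exception.
        static ElasticsearchException afterRetriesExhausted(String modelId, int retryLimit, Throwable cause) {
            return new ElasticsearchException(
                "loading model [{}] failed after [{}] retries. The deployment is now in a failed state, "
                    + "the error may be transient please stop the deployment and restart",
                cause,
                modelId,
                retryLimit
            );
        }
    }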

Contributor Author

Ah good point 👍

Contributor Author

Dave and I talked about this offline, and we don't need to adjust the retry logic in TrainedModelAssignmentNodeService: the wrapping ElasticsearchException cannot be unwrapped there, so it will not trigger that retry logic.

@jonathan-buttner jonathan-buttner changed the title [ML] Reducing chunk size and adding retries for starting model deployment [ML] Adding retries for starting model deployment Sep 20, 2023
Contributor

@droberts195 droberts195 left a comment

LGTM

@droberts195 droberts195 added auto-backport-and-merge Automatically create backport pull requests and merge when ready v8.10.3 labels Sep 21, 2023
@jonathan-buttner jonathan-buttner merged commit a8dac40 into elastic:main Sep 21, 2023
13 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.10

jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Sep 21, 2023
* Reducing chunk size and adding retries

* Testing search part

* Update docs/changelog/99673.yaml

* Unwrapping exception

* Updating changelog

* Wrapping exception and tests

* Adding comments
@jonathan-buttner jonathan-buttner deleted the ml-start-model-improvements branch September 21, 2023 20:03
elasticsearchmachine pushed a commit that referenced this pull request Sep 21, 2023
* Reducing chunk size and adding retries

* Testing search part

* Update docs/changelog/99673.yaml

* Unwrapping exception

* Updating changelog

* Wrapping exception and tests

* Adding comments
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready >bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.10.3 v8.11.0