
[ML] Adding retries for starting model deployment #99673

Merged

Conversation

jonathan-buttner
Contributor

@jonathan-buttner jonathan-buttner commented Sep 19, 2023

This PR reduces the chunk size used when downloading a model to 1 MB. It also adds retry logic for deploying a model, specifically for SearchPhaseExecutionException. This exception is thrown when a CircuitBreakingException occurs during model deployment. By catching the exception and retrying in the restorer logic, we can avoid reloading the entire model.

Addresses part of: #99409
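
For context, here is a minimal sketch of the retry pattern the description refers to; the class name, method name, retry limit, and imports are assumptions for illustration, not the PR's actual code:

    import org.elasticsearch.action.search.SearchPhaseExecutionException;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.internal.Client;

    final class ModelPartSearcher {

        // Illustrative retry limit; the PR's actual constant and value may differ.
        private static final int SEARCH_RETRY_LIMIT = 5;

        // Retry the model-part search on a failed search phase so that a transient
        // circuit breaker trip does not force a full reload of the model.
        static SearchResponse searchWithRetries(Client client, SearchRequest searchRequest) {
            int failureCount = 0;
            while (true) {
                try {
                    return client.search(searchRequest).actionGet();
                } catch (SearchPhaseExecutionException e) {
                    failureCount++;
                    if (failureCount >= SEARCH_RETRY_LIMIT) {
                        throw e;
                    }
                }
            }
        }
    }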

Testing

Retrying does occur:

[screenshot]

I can still get a CircuitBreakingException (CBE) to occur, though.

@jonathan-buttner jonathan-buttner added >non-issue :ml Machine learning Team:ML Meta label for the ML team v8.11.0 cloud-deploy Publish cloud docker image for Cloud-First-Testing labels Sep 19, 2023
@elasticsearchmachine
Collaborator

Hi @jonathan-buttner, I've created a changelog YAML for you.

while (true) {
    try {
        return client.search(searchRequest).actionGet();
    } catch (SearchPhaseExecutionException e) {
Contributor

It's highly likely that this will be wrapped in a RemoteTransportException, so catching it like this won't work.

I think you should catch Exception and then use something like:

    if (ExceptionsHelper.unwrapCause(e) instanceof SearchPhaseExecutionException) {
        // do the processing
    } else {
        throw e;
    }

Contributor

I think ExceptionsHelper.unwrapCause(e) instanceof CircuitBreakingException should also be accepted as a reason to retry. I'm not sure whether the CircuitBreakingException will always be wrapped in a SearchPhaseExecutionException, so it seems safer to accept either as a reason to retry.
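
A minimal sketch of the combined check these two comments suggest; the class and method names are hypothetical:

    import org.elasticsearch.ExceptionsHelper;
    import org.elasticsearch.action.search.SearchPhaseExecutionException;
    import org.elasticsearch.common.breaker.CircuitBreakingException;

    final class RetryDecision {

        // Retry if the unwrapped cause is either a failed search phase or a tripped circuit breaker.
        static boolean isRetryable(Exception e) {
            Throwable cause = ExceptionsHelper.unwrapCause(e);
            return cause instanceof SearchPhaseExecutionException || cause instanceof CircuitBreakingException;
        }
    }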

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 19, 2023 20:01
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)


if (failureCount >= retries) {
    logger.warn(format("[%s] searching for model part failed %s times, returning failure", modelId, retries));
    throw e;
Member

Should the retry logic in TrainedModelAssignmentNodeService now be removed? If the search has already failed after N tries, trying to reload the entire model puts the cluster under even more strain.

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L219

Instead of throwing e, please wrap it in another exception with a sensible error message along the lines of: loading model [model_id] failed after [SEARCH_RETRY_LIMIT] retries. The deployment is now in a failed state; the error may be transient, please stop the deployment and restart.

The error message ends up in the failed routing state here and will eventually make its way back to the caller, assuming the request did not time out.

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L718
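
A hedged sketch of the suggested wrapping; the helper name and the exact message wording are illustrative, not the PR's final code:

    import org.elasticsearch.ElasticsearchException;

    final class ModelLoadFailure {

        // Wrap the final failure so the caller sees an actionable message in the
        // failed routing state instead of a bare search exception.
        static ElasticsearchException afterRetriesExhausted(String modelId, int retryLimit, Throwable cause) {
            return new ElasticsearchException(
                "loading model [{}] failed after [{}] retries. The deployment is now in a failed state, "
                    + "the error may be transient please stop the deployment and restart",
                cause,
                modelId,
                retryLimit
            );
        }
    }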

Contributor Author

Ah good point 👍

Contributor Author

Dave and I talked about this offline, and we don't need to adjust the retry logic in TrainedModelAssignmentNodeService: the wrapping ElasticsearchException cannot be unwrapped there, so it will not trigger that retry logic.

@jonathan-buttner jonathan-buttner changed the title [ML] Reducing chunk size and adding retries for starting model deployment [ML] Adding retries for starting model deployment Sep 20, 2023
Contributor

@droberts195 droberts195 left a comment

LGTM

@droberts195 droberts195 added auto-backport-and-merge Automatically create backport pull requests and merge when ready v8.10.3 labels Sep 21, 2023
@jonathan-buttner jonathan-buttner merged commit a8dac40 into elastic:main Sep 21, 2023
13 checks passed
@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.10

jonathan-buttner added a commit to jonathan-buttner/elasticsearch that referenced this pull request Sep 21, 2023
* Reducing chunk size and adding retries

* Testing search part

* Update docs/changelog/99673.yaml

* Unwrapping exception

* Updating changelog

* Wrapping exception and tests

* Adding comments
@jonathan-buttner jonathan-buttner deleted the ml-start-model-improvements branch September 21, 2023 20:03
elasticsearchmachine pushed a commit that referenced this pull request Sep 21, 2023
* Reducing chunk size and adding retries

* Testing search part

* Update docs/changelog/99673.yaml

* Unwrapping exception

* Updating changelog

* Wrapping exception and tests

* Adding comments
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready >bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.10.3 v8.11.0