
[ML] CircuitBreakingException when deploying the ELSER model #99409

Closed
davidkyle opened this issue Sep 11, 2023 · 5 comments
Labels: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team)

Comments

@davidkyle
Member

Elasticsearch Version

8.9.1

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

Deploying the ELSER model can cause a CircuitBreakingException if the cluster is being used heavily. Loading the ELSER model temporarily increases memory pressure in the JVM, so the CircuitBreakingException is often thrown because a request from Kibana or another client cannot be serviced with the available memory. In this scenario Kibana becomes unresponsive; in many cases the ELSER model deployment still completes successfully, but it takes a while for the UI to see that status change.

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/read/search[phase/query]] would be [1045092822/996.6mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1045091992/996.6mb], new bytes reserved: [830/830b], usages [eql_sequence=0/0b, fielddata=1779/1.7kb, request=0/0b, inflight_requests=830/830b, model_inference=0/0b] at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:414)
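For context, these numbers are consistent with the default real-memory parent circuit breaker on a 1GB heap: 95% of 1073741824 bytes is 1020054732 bytes, the 972.7mb limit in the message. Real usage is already 1045091992 bytes, so even the tiny 830 byte reservation for the incoming search request pushes the total over the limit and the request is rejected.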

Steps to Reproduce

The problem occurs more commonly on smaller nodes with less JVM heap memory (e.g. 2GB JVM heap). Reproduce by repeatedly deploying and un-deploying the ELSER model.

Logs (if relevant)

No response

@davidkyle added the >bug and :ml (Machine learning) labels on Sep 11, 2023
@elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Sep 11, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

droberts195 commented Sep 11, 2023

Some ideas to fix this:

  1. Index ELSER in smaller chunks. So change to 1MB instead of 4MB, for example. We’ll need to do four times as many searches as before, but each will (hopefully) use a quarter of the memory.
  2. Add a small delay in between the chunks. This will mean that the PyTorchStateStreamer will need a ThreadPool member. We’ll have to pass one in through its constructor, and the constructors of all the wrapping classes back to MachineLearning. Then it will be possible to use scheduleWithFixedDelay - public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, Executor executor) - to add a random delay of between 5 and 20 milliseconds between searches (see the sketch after this list). Hopefully that slowdown will give the data node some time to garbage collect.
  3. Finally, when the search response is a circuit breaker exception, can we retry it? Once we have delays between searches, it shouldn’t be that hard to increase the delay to a random number of seconds between 3 and 7, and retry the exact same search again instead of moving to the next one. We don’t want to do this forever, but up to 5 retries per model load would be reasonable if we think the cause of the problem is that the data node is under transient pressure.
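As a rough illustration of idea 2, here is a minimal sketch of scheduling the next chunk search after a small random pause. It is not the actual PyTorchStateStreamer code: the class and method names are made up, and a plain JDK ScheduledExecutorService stands in for the Elasticsearch ThreadPool that would be passed in as described above.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of idea 2: instead of issuing the next chunk search
    // immediately, schedule it after a random 5-20ms delay so the data node
    // gets a chance to garbage collect between chunks.
    class ModelChunkLoader {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        void loadChunk(int chunkIndex, int numChunks) {
            // ... search for chunk `chunkIndex` and stream it to the native process ...
            if (chunkIndex + 1 < numChunks) {
                long delayMs = ThreadLocalRandom.current().nextLong(5, 21); // 5-20 milliseconds
                scheduler.schedule(() -> loadChunk(chunkIndex + 1, numChunks), delayMs, TimeUnit.MILLISECONDS);
            }
        }
    }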

@davidkyle
Member Author

Related to #99592

@droberts195
Contributor

droberts195 commented Sep 19, 2023

Regarding #99409 (comment), it seems that we do already retry on circuit breaker exceptions, but we do it at the very outermost layer of loading the model, here:

} else if (ExceptionsHelper.unwrapCause(ex) instanceof SearchPhaseExecutionException) {

This is bad, because if a CBE happens after 80 documents out of 100 have been loaded then on the retry we'll load all 100 again. This probably explains why we're generating such huge amounts of garbage on the data node. It would be better if we retry at the innermost point in response to a circuit breaker exception, say here:

SearchResponse searchResponse = client.search(searchRequest).actionGet();

This inner retry should:

  • Only retry for circuit breaker exceptions (possibly wrapped in remote transport exceptions)
  • Back off for a fairly long time before the retry - several seconds, not just a few milliseconds (see the sketch below)
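A minimal sketch of that inner retry, assuming a blocking context around the existing client.search(searchRequest).actionGet() call. The retry count and backoff values are illustrative, not final, and the blocking sleep is a simplification of what production code would do.

    // Illustrative only - uses org.elasticsearch.ExceptionsHelper,
    // org.elasticsearch.common.Randomness and
    // org.elasticsearch.common.breaker.CircuitBreakingException.
    private SearchResponse searchChunkWithRetry(Client client, SearchRequest searchRequest) {
        final int maxRetries = 5;
        for (int attempt = 1; ; attempt++) {
            try {
                return client.search(searchRequest).actionGet();
            } catch (Exception e) {
                // In practice the CBE may need more unwrapping, e.g. when it arrives
                // inside a SearchPhaseExecutionException or a remote transport exception.
                boolean isCircuitBreaking = ExceptionsHelper.unwrapCause(e) instanceof CircuitBreakingException;
                if (isCircuitBreaking == false || attempt >= maxRetries) {
                    throw e; // not a circuit breaker problem, or out of retries: fail the load as before
                }
                // Back off for a random 3-7 seconds so the data node has time to garbage
                // collect, then retry the exact same search rather than moving on.
                long backoffMillis = 3000L + Randomness.get().nextInt(4000);
                try {
                    Thread.sleep(backoffMillis); // simplification; real code would schedule the retry instead of blocking
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }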

Hopefully this will convert a scenario like:

  • Read 80 documents successfully
  • Get a CBE on the 81st
  • Read 90 documents successfully
  • Get a CBE on the 91st
  • Read 100 documents successfully

into:

  • Read 80 documents successfully
  • Get a CBE on the 81st
  • Wait 5 seconds
  • Read 20 documents successfully

Additionally, it's been observed that reducing the chunk size to 1MB from the current 4MB helps a lot with avoiding CBEs. It's possible that Java 20 does garbage collection differently for the smaller chunks of memory. Changing to 1MB chunks for the ELSER download is a really small change, so we should do that as well.

@droberts195
Contributor

With the combination of #99673, #99677 and elastic/eland#605 we can consider this fixed.
