
[ML] CircuitBreakingException when deploying the ELSER model #99409

Closed
davidkyle opened this issue Sep 11, 2023 · 5 comments
Labels: >bug, :ml (Machine learning), Team:ML (Meta label for the ML team)

Comments

@davidkyle
Member

Elasticsearch Version

8.9.1

Installed Plugins

No response

Java Version

bundled

OS Version

any

Problem Description

Deploying the ELSER model can cause a CircuitBreakingException if the cluster is being used heavily. Loading the ELSER model temporarily increases memory pressure in the JVM, so the CircuitBreakingException is often thrown because a request from Kibana or another client cannot be serviced with the available memory. In this scenario Kibana becomes unresponsive; in many cases the ELSER model deployment still completes successfully, but it takes a while for the UI to see that status change.

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/read/search[phase/query]] would be [1045092822/996.6mb], which is larger than the limit of [1020054732/972.7mb], real usage: [1045091992/996.6mb], new bytes reserved: [830/830b], usages [eql_sequence=0/0b, fielddata=1779/1.7kb, request=0/0b, inflight_requests=830/830b, model_inference=0/0b] at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:414)
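For context, these numbers are consistent with the default real-memory parent circuit breaker on a 1GB heap: 95% of 1073741824 bytes is 1020054732 bytes, the 972.7mb limit in the message. Real usage is already 1045091992 bytes, so even the tiny 830 byte reservation for the incoming search request pushes the total over the limit and the request is rejected.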

Steps to Reproduce

The problem occurs more commonly on smaller nodes with less JVM heap memory (e.g. 2GB JVM heap). Reproduce by repeatedly deploying and un-deploying the ELSER model.

Logs (if relevant)

No response

@davidkyle added the >bug and :ml (Machine learning) labels on Sep 11, 2023
@elasticsearchmachine added the Team:ML (Meta label for the ML team) label on Sep 11, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor

droberts195 commented Sep 11, 2023

Some ideas to fix this:

  1. Index ELSER in smaller chunks. So change to 1MB instead of 4MB, for example. We’ll need to do four times as many searches as before, but each will (hopefully) use a quarter of the memory.
  2. Add a small delay in between the chunks. This will mean that the PyTorchStateStreamer will need a ThreadPool member. We’ll have to pass one in through its constructor, and the constructors of all the wrapping classes back to MachineLearning. Then it will be possible to use scheduleWithFixedDelay - public Cancellable scheduleWithFixedDelay(Runnable command, TimeValue interval, Executor executor) - to add a random delay of between 5 and 20 milliseconds between searches (see the sketch after this list). Hopefully that slowdown will give the data node some time to garbage collect.
  3. Finally, when the search response is a circuit breaker exception, can we retry it? Once we have delays between searches, it shouldn’t be that hard to increase the delay to a random number of seconds between 3 and 7, and retry the exact same search again instead of moving to the next one. We don’t want to do this forever, but up to 5 retries per model load would be reasonable if we think the cause of the problem is that the data node is under transient pressure.
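As a rough illustration of idea 2, here is a minimal sketch of scheduling the next chunk search after a small random pause. It is not the actual PyTorchStateStreamer code: the class and method names are made up, and a plain JDK ScheduledExecutorService stands in for the Elasticsearch ThreadPool that would be passed in as described above.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch of idea 2: instead of issuing the next chunk search
    // immediately, schedule it after a random 5-20ms delay so the data node
    // gets a chance to garbage collect between chunks.
    class ModelChunkLoader {
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        void loadChunk(int chunkIndex, int numChunks) {
            // ... search for chunk `chunkIndex` and stream it to the native process ...
            if (chunkIndex + 1 < numChunks) {
                long delayMs = ThreadLocalRandom.current().nextLong(5, 21); // 5-20 milliseconds
                scheduler.schedule(() -> loadChunk(chunkIndex + 1, numChunks), delayMs, TimeUnit.MILLISECONDS);
            }
        }
    }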

@davidkyle
Member Author

Related to #99592

@droberts195
Contributor

droberts195 commented Sep 19, 2023

Regarding #99409 (comment), it seems that we do already retry on circuit breaker exceptions, but we do it at the very outermost layer of loading the model, here:

} else if (ExceptionsHelper.unwrapCause(ex) instanceof SearchPhaseExecutionException) {

This is bad, because if a CBE happens after 80 documents out of 100 have been loaded then on the retry we'll load all 100 again. This probably explains why we're generating such huge amounts of garbage on the data node. It would be better if we retry at the innermost point in response to a circuit breaker exception, say here:

SearchResponse searchResponse = client.search(searchRequest).actionGet();

This inner retry should:

  • Only retry for circuit breaker exceptions (possibly wrapped in remote transport exceptions)
  • Back off for a fairly long time before the retry - several seconds, not just a few milliseconds (see the sketch below)
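A minimal sketch of that inner retry, assuming a blocking context around the existing client.search(searchRequest).actionGet() call. The retry count and backoff values are illustrative, not final, and the blocking sleep is a simplification of what production code would do.

    // Illustrative only - uses org.elasticsearch.ExceptionsHelper,
    // org.elasticsearch.common.Randomness and
    // org.elasticsearch.common.breaker.CircuitBreakingException.
    private SearchResponse searchChunkWithRetry(Client client, SearchRequest searchRequest) {
        final int maxRetries = 5;
        for (int attempt = 1; ; attempt++) {
            try {
                return client.search(searchRequest).actionGet();
            } catch (Exception e) {
                // In practice the CBE may need more unwrapping, e.g. when it arrives
                // inside a SearchPhaseExecutionException or a remote transport exception.
                boolean isCircuitBreaking = ExceptionsHelper.unwrapCause(e) instanceof CircuitBreakingException;
                if (isCircuitBreaking == false || attempt >= maxRetries) {
                    throw e; // not a circuit breaker problem, or out of retries: fail the load as before
                }
                // Back off for a random 3-7 seconds so the data node has time to garbage
                // collect, then retry the exact same search rather than moving on.
                long backoffMillis = 3000L + Randomness.get().nextInt(4000);
                try {
                    Thread.sleep(backoffMillis); // simplification; real code would schedule the retry instead of blocking
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }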

Hopefully this will convert a scenario like:

  • Read 80 documents successfully
  • Get a CBE on the 81st
  • Read 90 documents successfully
  • Get a CBE on the 91st
  • Read 100 documents successfully

into:

  • Read 80 documents successfully
  • Get a CBE on the 81st
  • Wait 5 seconds
  • Read 20 documents successfully

Additionally, it's been observed that reducing the chunk size to 1MB from the current 4MB helps a lot with avoiding CBEs. It's possible that Java 20 does garbage collection differently for the smaller chunks of memory. Changing to 1MB chunks for the ELSER download is a really small change, so we should do that as well.

@droberts195
Contributor

With the combination of #99673, #99677 and elastic/eland#605 we can consider this fixed.
