[Ml] CircuitBreakingException when deploying the ELSER model #99409
Comments
Pinging @elastic/ml-core (Team:ML) |
Some ideas to fix this:
|
Related to #99592 |
Regarding #99409 (comment), it seems that we do already retry on circuit breaker exceptions, but we do it at the very outermost layer of loading the model, here: Line 219 in 11b6db3
This is bad because, if a CBE (CircuitBreakingException) happens after 80 of 100 documents have been loaded, the retry loads all 100 again. This probably explains why we're generating such huge amounts of garbage on the data node. It would be better to retry at the innermost point in response to a circuit breaker exception, say here: Line 145 in 4e67df8
This inner retry should:
Hopefully this will convert a scenario like:
into:
Additionally, it's been observed that reducing the chunk size to 1MB from the current 4MB helps a lot with avoiding CBEs. It's possible that Java 20 does garbage collection differently for the smaller chunks of memory. Changing to 1MB chunks for the ELSER download is a really small change, so we should do that as well. |
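To make the difference between outer and inner retries concrete, here is a minimal sketch of retrying per chunk rather than restarting the whole load. It is not the Elasticsearch implementation: `ChunkLoader`, `BreakerTrippedException` (a stand-in for `CircuitBreakingException`), the `fetchChunk` callback, the retry cap, and the doubling backoff are all hypothetical names and choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical sketch: retry each chunk in place instead of reloading everything.
final class ChunkLoader {

    /** Stand-in for Elasticsearch's CircuitBreakingException (assumption for this sketch). */
    static class BreakerTrippedException extends RuntimeException {
        BreakerTrippedException(String message) {
            super(message);
        }
    }

    /**
     * Loads chunks 0..totalChunks-1 in order. When fetching a chunk trips the
     * breaker, only that chunk is retried; chunks already loaded are kept.
     */
    static List<byte[]> loadAll(int totalChunks,
                                IntFunction<byte[]> fetchChunk,
                                int maxRetriesPerChunk,
                                long initialBackoffMillis) {
        List<byte[]> loaded = new ArrayList<>(totalChunks);
        for (int i = 0; i < totalChunks; i++) {
            long backoff = initialBackoffMillis;
            int attempt = 0;
            while (true) {
                try {
                    loaded.add(fetchChunk.apply(i));
                    break; // chunk i is done, move on to the next one
                } catch (BreakerTrippedException e) {
                    attempt++;
                    if (attempt > maxRetriesPerChunk) {
                        throw e; // give up; the caller decides what to do next
                    }
                    try {
                        // Wait so that memory pressure has a chance to ease.
                        Thread.sleep(backoff);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new RuntimeException("interrupted while backing off", ie);
                    }
                    backoff *= 2; // assumed doubling backoff
                }
            }
        }
        return loaded;
    }
}
```

With this shape, a single breaker trip on chunk 80 of 100 costs one extra fetch rather than 80 repeated ones, which is the reduction in garbage the comment above is hoping for.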
With the combination of #99673, #99677 and elastic/eland#605 we can consider this fixed. |
Elasticsearch Version
8.9.1
Installed Plugins
No response
Java Version
bundled
OS Version
any
Problem Description
Deploying the ELSER model can cause a CircuitBreakingException if the cluster is being used heavily. Often the CircuitBreakingException is thrown because a request from Kibana or another client cannot be serviced with the available memory while the ELSER model is loading, which temporarily increases memory pressure in the JVM. In this scenario Kibana becomes unresponsive. In many cases the ELSER model deployment still completes successfully, but the UI takes a while to reflect the status change.
Steps to Reproduce
The problem occurs more commonly on smaller nodes with less JVM heap memory (e.g. 2GB JVM heap). Reproduce by repeatedly deploying and un-deploying the ELSER model.
Logs (if relevant)
No response