High memory pressure for Elasticsearch versions using JDK 20+ #99592

jonathan-buttner · 2023-09-14T18:56:39Z

Elasticsearch Version

8.7.1 and above, 7.17.10 and above

Installed Plugins

No response

Java Version

JDK 20 or above.

The issue depends on the JDK version rather than the ES version.

For reference, these are the bundled JDK versions (see Dependencies and versions section in Elasticsearch docs per stack version):

Stack  - JDK
-----    ---
7.17.10 - 20.0.1+9
7.17.11 - 20.0.1+9
7.17.12 - 20.0.2+9
7.17.13 - 20.0.2+9
7.17.14 - 21+35
7.17.15 - 21.0.1+12
8.7.1   - 20.0.1+9
8.8.2   - 20.0.1+9
8.9.0   - 20.0.2+9
8.9.1   - 20.0.2+9
8.9.2   - 20.0.2+9
8.10.0  - 20.0.2+9
8.10.1  - 20.0.2+9
8.10.2  - 20.0.2+9
8.10.3  - 21+35
8.10.4  - 21+35
8.11.0  - 21.0.1+12
8.11.1  - 21.0.1+12

The last stack version that allows the escape hatch workaround of re-enabling the disabled JVM setting (-XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC - see Problem Description below) are:

Elasticsearch v8.10.2 for 8.x branch
Elasticsearch v7.17.13 for 7.x branch

since versions after this one bundle JDK21 or later that has the setting removed from the JVM entirely.
Adding or leaving the JVM setting in place and starting/upgrading to 8.10.3+ will fail to start the JVM due to the unknown setting in the later JDK versions.

Switch to non-bundled JDK < 20 (when possible) should also function as a workaround.

OS Version

N/A

Problem Description

In 8.7.1 the bundled JDK was changed to JDK 20.0.1 in this PR: #95373

When retrieving large documents from Elasticsearch we see high memory pressure on the data node returning the documents.

There seems to be a distinct difference in how allocated memory is cleaned up between JDK 19 and JDK 20.

The graphs below show memory usage when allocating many ~5mb byte arrays to transfer a ~400 mb pytorch model from a data node to an ML node. The pytorch model is chunked and stored in separate documents that are retrieved one at a time by the ML node. When repeatedly allocating these large arrays, we see memory pressure distinctly increase in JDK 20.0.1. The graphs show memory usage over time when repeatedly starting and stopping the pytorch model.

Memory pressure for 8.7.0

Memory pressure for 8.7.1

Here's one from using VisualVM to monitor heap usage using a local deployment

Memory pressure for 8.7.1 when using the -XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC options

If the data node is started with the JVM options enabled we see memory usage closer to what it looks like in JDK 19

Steps to Reproduce

The issue can be reproduced easily in cloud but I'll describe the steps for running elasticsearch locally too.

Setup

Cloud
In cloud deploy a cluster with

2 zones and 4 GB data nodes
2 zones and 4 GB ML nodes
Enable monitoring so you can see the heap usage

Locally

Run two nodes (1 data node, and 1 ML node).

Download and install 8.7.1 https://www.elastic.co/downloads/past-releases/elasticsearch-8-7-1
Download and install 8.7.1 of kibana https://www.elastic.co/downloads/past-releases/kibana-8-7-1
Download and install eland https://github.com/elastic/eland

An easy way to run two nodes is simply to decompress the bundle in two places.

Configuration

Create a file under config/jvm.options.d and add the following JVM options for both the data node and ML node

-Xms4g
-Xmx4g

Add the following settings to the data nodes config/elasticsearch.yml file

node.roles: ["master", "data", "data_content", "ingest", "data_hot", "data_warm", "data_cold", "data_frozen", "transform"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"

Add the following settings to the ml nodes config/elasticsearch.yml file

node.roles: ["ml"]
xpack.security.enabled: true
xpack.license.self_generated.type: "trial"

Reset the elastic password on the data node

From bin

./elasticsearch-reset-password -i -u elastic --url http://localhost:9200

Create a service token for kibana

From bin

./elasticsearch-service-tokens create elastic/kibana <token name>

Add this token to kibana's kibana.yml file

elasticsearch.serviceAccountToken: "<token>"

and ensure that these settings are disabled elasticsearch.username and elasticsearch.password

Start both elasticsearch nodes and kibana
Connect VisualVM to the data node and observe memory usage over time

Reproducing the bug in cloud and locally

Upload a pytorch model of around ~400 mb

Locally

docker run -it --rm --network host elastic/eland \
    eland_import_hub_model \
      --url http://elastic:changeme@host.docker.internal:9200/ \
      --hub-model-id sentence-transformers/all-distilroberta-v1 \
--clear-previous

For cloud

docker run -it --rm elastic/eland \
    eland_import_hub_model \
      --url https://elastic:<cloud password>@<cloud es url>:9243 \
      --hub-model-id sentence-transformers/all-distilroberta-v1 \
--clear-previous

Repeatedly start and stop the uploaded model (20 - 30 times)
- Navigate to Machine Learning -> Trained models
- Click the start and stop buttons repeatedly
Observe memory usage over time

Logs (if relevant)

No response

The text was updated successfully, but these errors were encountered:

benwtrent · 2023-09-14T19:35:49Z

One key issue is that the particular task that @jonathan-buttner is executing is one that searches over 10s to 100s of docs, each about 1MB in size.

They grab the source (about 1MB in size), stream to a separate process and then search for the next doc.

So, within a second, many 1MB byte buffers are being created and dereferenced.

While we should be better about this (e.g. #99590), the GC collection not happening until we are at critical mass is troubling.

The real-memory circuit breaker does protect us mostly, but GC needs to be better here.

elasticsearchmachine · 2023-09-14T19:37:23Z

Pinging @elastic/es-search (Team:Search)

elasticsearchmachine · 2023-09-14T19:37:24Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

benwtrent · 2023-09-14T19:38:23Z

Ping @elastic/ml-core

rjernst · 2023-09-14T20:42:36Z

Just a note on the workaround of re-enabling preventative GC, it has been removed in jdk 21 (which is due to be released next week and we will be updating to shortly thereafter).

elasticsearchmachine · 2023-09-18T09:19:14Z

Pinging @elastic/ml-core (Team:ML)

ChrisHegarty · 2023-09-22T15:05:27Z

Correct me if I'm wrong, but Java 21 exhibits better behaviour here ( less or no circuit breaker exceptions ), than that of Java 20.0.x.

jonathan-buttner · 2023-09-26T12:37:07Z

Correct me if I'm wrong, but Java 21 exhibits better behaviour here ( less or no circuit breaker exceptions ), than that of Java 20.0.x.

In my local testing I wasn't able to generate a CBE but memory pressure didn't look as good as Java 19

jonathan-buttner · 2023-09-26T14:31:24Z

I just tested in cloud using the latest 8.11.0-snapshot and I can get CBEs to occur when deploying an ML model around 400 MB. I used an older version of eland which does not leverage a recent fix which stores the model in 1 MB chunks.

geekpete · 2023-10-19T04:24:21Z

Just a note to confirm that the last stack version that allows the escape hatch workaround of re-enabling the disabled JVM setting (-XX:+UnlockDiagnosticVMOptions -XX:+G1UsePreventiveGC) is:

Elasticsearch v8.10.2 for 8.x branch
Elasticsearch v7.17.13 for 7.x branch

since versions after this one bundle JDK21 or later that has the setting removed from the JVM entirely.
Adding or leaving the JVM setting in place and starting/upgrading to 8.10.3+ will fail to start the JVM due to the unknown setting in the later JDK versions.

Bundled JDK versions (see Dependencies and versions section in Elasticsearch docs per stack version):

Stack  - JDK
-----    ---
7.17.9  - 19.0.2+7
7.17.10 - 20.0.1+9
7.17.11 - 20.0.1+9
7.17.12 - 20.0.2+9
7.17.13 - 20.0.2+9
7.17.14 - 21+35
8.8.2   - 20.0.1+9
8.9.0   - 20.0.2+9
8.9.1   - 20.0.2+9
8.10.0  - 20.0.2+9
8.10.1  - 20.0.2+9
8.10.2  - 20.0.2+9
8.10.3  - 21+35
8.10.4  - 21+35
8.11.0  - 21.0.1+12

stefnestor · 2023-11-09T18:01:11Z

Noting this appears to affect starting v7.17.10 (where bundled JDK allows -XX:+G1UsePreventiveGC reapply override). Starting v7.17.14 bundled JDK v21 no longer allows setting override. Where possible, users can switch to non-bundled JDKs.

geekpete · 2023-11-09T22:01:10Z

Right, so we backported the newer JDK to the still supported 7.17 branch and should probably include that in the stack to JDK version table above?

droberts195 · 2023-11-10T16:54:46Z

The fix that worked around this problem for the situation where it was originally seen is elastic/eland#605. That reduced the size of the chunked PyTorch model documents from 4MB to 1MB.

Obviously people are seeing the same underlying problem with different types of data than chunked PyTorch models. However, if any of these situations involve documents that are 2MB or bigger and it's possible to reduce the size of these documents below 2MB then doing so may help to avoid the problem. The reason I am guessing 2MB is the cutoff point is that in heap dumps from nodes that have crashed due to this problem we have observed large numbers of strange unreferenced int arrays that are all exactly 2MB in size. These may be related to something that was changed in Java 20 garbage collection. The fact that reducing chunk size from 4MB to 1MB in Eland made these mysterious arrays go away for that particular use case is what makes me guess that memory chunks bigger than 2MB are the problematic ones. This is only a guess though - I could be wrong but thought I'd mention it just in case it's useful to somebody.

ChrisHegarty · 2023-11-13T12:23:32Z

Awareness of this issue has been raised to OpenJDK. Specifically a comment summarising and referring to this GH issue has been added to several of the JDK issues, e.g. see https://bugs.openjdk.org/browse/JDK-8297639

benwtrent · 2023-12-05T19:31:43Z

@predogma no timeline on any universal fix. Multiple paths are being explored from:

Improving the circuit breaker and how it calculates when it should circuit break
Using less memory and creating less garbage in certain hot-paths (e.g. _search, _bulk)

The work around if they have this issue is to use JDK17.

We have something that can help coming in 8.12 & 8.11.2:

Add more logging to the real memory circuit breaker and lower minimum interval #102396

ewolfman · 2024-01-02T08:49:19Z

hi @ldematte,

You are right @ewolfman: due to bad timing (updating docs at the same time 8.11.1 underwent an emergency release) the known issue has not been picked up by 8.11.1 release notes. Thanks for spotting this, I have backported this manually.

Same question about the documentation for 8.11.3?

ldematte · 2024-01-02T09:02:36Z

Hi @ewolfman, I think it is responsibility of the release manager to write release notes for the versions they release, so if the release notes are incorrect/are missing something we should contact them.
That said, 8.11.2 and 8.11.3 include #102396, which should greatly mitigate the issue; I'm wondering if we still need to mention the issue in the release notes (unless we observe problems again with these versions, of course).

ewolfman · 2024-01-02T09:11:02Z

Thanks @ldematte.

Please note that the comment does appear in 8.11.2, and there is no fix mentioned in 8.11.3. So it is difficult to understand whether the issue still exists in 8.11.3 or not. If it was fixed, I suggest mentioning in the 8.11.3 release notes. If it was not fixed, I think this warning should still be there. For example this issue is one that prevents us from upgrading from version 7.17.7 to latest 8, and this is why it is important to understand what the status is.

ldematte · 2024-01-02T09:58:27Z

Hi @ewolfman, I agree with you that it should be either one or the other (either keep mentioning the known issue in 8.11.3 or mention that the issue is fixed). It seems like a mistake to not have one or the other in the release notes for 8.11.3

kunisen · 2024-01-25T01:22:16Z

Hi @ldematte

Thanks for your help on this issue. May I have a quick confirmation please?
Given #99592 (comment),

That said, 8.11.2 and 8.11.3 include #102396, which should greatly mitigate the issue;
I'm wondering if we still need to mention the issue in the release notes (unless we observe problems again with these versions, of course).

Can we safely say that, "this issue is considered to be fixed in 8.11.2+ and 8.12.0+ versions" (*1)?
We want to make it clear that, "greatly mitigate" + "removal from release notes" can be considered as "fixed" or not, or mitigation might solve or reduce the problem for some usages/use cases but not others depending on the specific pattern (i.e. partially fixed).

(*1) By saying +, we meant to say "8.11.2 and onwards" and "8.12.0 and onwards" versions.

cc @geekpete

mix4242 · 2024-01-29T14:44:29Z

Is it also fixed in 7.17.17?

While not listed as explicitly fixed (& obv this issue is still in an 'Open' state), the 7.17.17 release notes drop this "Known issue" when compared to the 7.17.16 release notes.

Was this purposeful to indicate the issue no longer affecting version 7.17.17?

Robert-Saiter · 2024-02-09T17:43:56Z

Is it also fixed in 7.17.17?

While not listed as explicitly fixed (& obv this issue is still in an 'Open' state), the 7.17.17 release notes drop this "Known issue" when compared to the 7.17.16 release notes.

Was this purposeful to indicate the issue no longer affecting version 7.17.17?

Also looking for clarification on this as it's not clear if the issue is resolved or if the "Known issue" was just left off from the notes.

geekpete · 2024-02-14T03:54:45Z

I tried the repro but could not get it to hit heap pressure for me in the same sized test env on 8.10.2, the heap stayed well below max no matter how many times I stopped/started the ML model.
My aim was to first confirm a repro on an earlier affected version (I picked 8.10.2) then upgrade to the latest stack version and test the repro again to see if it still hit the issue.

@jonathan-buttner are you able to re-check if your repro still affects latest stack version that might help confirm or deny any delivered fix?

jonathan-buttner · 2024-02-14T13:42:39Z

@geekpete

Were you testing locally or in cloud? In my experience it was much easier to reproduce in cloud. I'll spin up a 8.10.2 cloud environment and check again. The other thing to note is that ML has addressed the issue by reducing the chunk size that we store. That doesn't mean the problem is "fixed" just that it isn't reproducible using the ML mechanism. That ML fix should have gone into 8.10.3 and 8.11.0:

#99677

geekpete · 2024-02-14T13:51:23Z

Yeah was testing in cloud.
I guess another workload that could be similar for large amounts of short lived objects is aggregations? Or ad hoc searches across a long retention/many indices is another scenario I saw that hit the impact for another user.

jonathan-buttner · 2024-02-14T14:49:12Z

I was able to get this to happen in cloud on 8.10.0

2x 4gb ES nodes
2x 4gb ML nodes

When starting the deployment try 2 allocations and 2 threads.

jonathan-buttner · 2024-02-14T14:53:17Z

I guess another workload that could be similar for large amounts of short lived objects is aggregations? Or ad gov searches across a long retention/many indices is another scenario I saw that hit the impact for another user.

Hmm, I'm not sure. The test with ELSER is essentially storing around ~100 documents that are like 4 mb each. Maybe creating some dummy data and writing a script to repeatedly search for them.

Another option could be downloading ELSER in 8.10.0 (or pre the 1MB chunk fix that went into 8.10.3 and 8.11.0) and then upgrade to the latest stack version. That way you'll have the data stored in the large documents and can test via the same process.

benwtrent · 2024-02-14T15:24:46Z

That why you'll have the data stored in the large documents and I would expect the issue to still occur on the latest stack version.

I would not have the same expectation (particularly in 8.13 in the future) as various other fixes have been implemented (outside of the particular ML issue) to address the overall problem of memory pressure.

jonathan-buttner · 2024-02-14T15:35:18Z

I would not have the same expectation (particularly in 8.13 in the future) as various other fixes have been implemented (outside of the particular ML issue) to address the overall problem of memory pressure.

Good point, that was a poor assumption on my part, I'll update it.

geekpete · 2024-03-26T23:05:33Z

Is this lower minimum full GC interval in 8.13.0+ likely to have any positive effect on this issue?
#105259

gibrown · 2024-06-06T02:59:30Z

I would not have the same expectation (particularly in 8.13 in the future) as various other fixes have been implemented (outside of the particular ML issue) to address the overall problem of memory pressure.

@benwtrent my team is trying to understand the risk of upgrading from 8.6.2 to 8.13.4 and this issue had caused us to pause (~50k of our few billion docs are over 2mb). This comment, the lack of activity here, and the fact that #103779 got resolved with 8.12 makes me feel like the risk has mostly been mitigated. Is that the right way to think about it?

benwtrent · 2024-06-06T11:34:45Z

@gibrown it has mostly been mitigated. In 8.13 adjusted search hits so that the source is actually referenced from the netty buffers. Meaning, we no longer copy around huge byte arrays, instead use reusable objects.

We have also made many many other memory usage improvements and with how the circuit breaker behaves with the new garbage collection behavior.

Overall, our benchmarks have seen marked improvements over the last 6 months. Now we have also noticed that with JDK22, we see an even better improvement.

https://elasticsearch-benchmarks.elastic.co/#

Here are some hand-picked examples, but almost every track shows significant GC improvement.

Let's compare GC activity between when we updated to JDK21 vs. today:

JDK21:

Today (with all our improvements and JDK22):

Here is a more extreme example:

Today:

gibrown · 2024-06-07T01:08:21Z

Awesome. Great work and thanks for the detailed update.

kibertoad · 2024-06-07T06:34:41Z

@benwtrent Is JDK22 stable again? I remember ES downgraded away from it few versions ago.

benwtrent · 2024-06-07T11:07:44Z

@kibertoad we are still testing, but have found it to be stable. You are correct, we did upgrade and then downgrade. But, with this latest upgrade attempt, we have not encountered the issue we did before.

Additionally, we made some changes to account for the previously detected JVM bug.

The JDK bug we encountered: https://bugs.openjdk.org/browse/JDK-8329528

The work around: #108571

aosavitski · 2024-06-19T13:29:04Z

We have also made many many other memory usage improvements and with how the circuit breaker behaves with the new garbage collection behavior.

Overall, our benchmarks have seen marked improvements over the last 6 months. Now we have also noticed that with JDK22, we see an even better improvement.

@benwtrent are these improvements also planned for version 7?

benwtrent · 2024-06-20T12:32:03Z

@aosavitski many of the changes are not critical or security bug fixes. Generally, that is all we backport to 7.17.x

However, we are backporting the JDK update to JDK22 and our work around for that bug. As jdk updates are security driven, not just performance driven.

benwtrent · 2024-06-24T14:33:45Z

Given the progress, I am going to close this issue.

Further optimizations will occur, but these will be handled in separate issues.

jonathan-buttner added >bug v8.7.1 labels Sep 14, 2023

elasticsearchmachine added the needs:triage Requires assignment of a team area label label Sep 14, 2023

jonathan-buttner changed the title ~~High memory pressure for Elasticsearch versions using JDK 20.0.2~~ High memory pressure for Elasticsearch versions using JDK 20.0.1 Sep 14, 2023

benwtrent added :Search/Search Search-related issues that do not fall into other categories :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload labels Sep 14, 2023

elasticsearchmachine added Team:Core/Infra Meta label for core/infra team Team:Search Meta label for search team and removed needs:triage Requires assignment of a team area label labels Sep 14, 2023

davidkyle mentioned this issue Sep 18, 2023

[Ml] CircuitBreakingException when deploying the ELSER model #99409

Closed

davidkyle added the :ml Machine learning label Sep 18, 2023

elasticsearchmachine added the Team:ML Meta label for the ML team label Sep 18, 2023

DaveCTurner mentioned this issue Oct 16, 2023

Add method BytesReference#deepCopy #100880

Closed

ldematte changed the title ~~High memory pressure for Elasticsearch versions using JDK 20.0.1~~ High memory pressure for Elasticsearch versions using JDK 20+ Nov 10, 2023

ldematte mentioned this issue Nov 10, 2023

[doc] Add known issue to all versions affected by GC behaviour change #102025

Merged

ldematte mentioned this issue Nov 14, 2023

[7.17][doc] Add known issue to all versions affected by GC behaviour change #102124

Merged

thecoop mentioned this issue Jan 2, 2024

GC rate increases randomly, increasing the cpu to 100%, dropping indexing rate in v8.11.3 #103779

Closed

ldematte mentioned this issue Jan 31, 2024

Avoid multiple byte copies by queueing Translog.Operations #104918

Open

andsel mentioned this issue Feb 21, 2024

Investigate Java 21 and Jruby compatibility elastic/logstash#15342

Closed

6 tasks

bernd mentioned this issue Mar 15, 2024

[BUG] Higher memory consumption when running with JDK 21 opensearch-project/OpenSearch#12694

Open

tophercullen mentioned this issue May 7, 2024

OpenSearch Data Nodes memory exhaustion after upgrade from 2.9 to 2.12 (JDK 21 upgrade) opensearch-project/OpenSearch#12454

Closed

benwtrent closed this as completed Jun 24, 2024

High memory pressure for Elasticsearch versions using JDK 20+ #99592

High memory pressure for Elasticsearch versions using JDK 20+ #99592

Comments

jonathan-buttner commented Sep 14, 2023 • edited by ldematte Loading

Elasticsearch Version

Installed Plugins

Java Version

OS Version

Problem Description

Steps to Reproduce

Setup

Logs (if relevant)

benwtrent commented Sep 14, 2023

elasticsearchmachine commented Sep 14, 2023

elasticsearchmachine commented Sep 14, 2023

benwtrent commented Sep 14, 2023

rjernst commented Sep 14, 2023

elasticsearchmachine commented Sep 18, 2023

ChrisHegarty commented Sep 22, 2023

jonathan-buttner commented Sep 26, 2023

jonathan-buttner commented Sep 26, 2023

geekpete commented Oct 19, 2023 • edited Loading

stefnestor commented Nov 9, 2023

geekpete commented Nov 9, 2023

droberts195 commented Nov 10, 2023

ChrisHegarty commented Nov 13, 2023

benwtrent commented Dec 5, 2023

ewolfman commented Jan 2, 2024

ldematte commented Jan 2, 2024

ewolfman commented Jan 2, 2024 • edited Loading

ldematte commented Jan 2, 2024

kunisen commented Jan 25, 2024

mix4242 commented Jan 29, 2024

Robert-Saiter commented Feb 9, 2024

geekpete commented Feb 14, 2024

jonathan-buttner commented Feb 14, 2024

geekpete commented Feb 14, 2024 • edited Loading

jonathan-buttner commented Feb 14, 2024

jonathan-buttner commented Feb 14, 2024 • edited Loading

benwtrent commented Feb 14, 2024

jonathan-buttner commented Feb 14, 2024

geekpete commented Mar 26, 2024

gibrown commented Jun 6, 2024

benwtrent commented Jun 6, 2024

gibrown commented Jun 7, 2024

kibertoad commented Jun 7, 2024

benwtrent commented Jun 7, 2024

aosavitski commented Jun 19, 2024 • edited Loading

benwtrent commented Jun 20, 2024

benwtrent commented Jun 24, 2024

jonathan-buttner commented Sep 14, 2023 •

edited by ldematte

Loading

geekpete commented Oct 19, 2023 •

edited

Loading

ewolfman commented Jan 2, 2024 •

edited

Loading

geekpete commented Feb 14, 2024 •

edited

Loading

jonathan-buttner commented Feb 14, 2024 •

edited

Loading

aosavitski commented Jun 19, 2024 •

edited

Loading