
Kibana 7.10.0-SNAPSHOT memory usage leads to OOMKill on ECK #76783

Closed
sebgl opened this issue Sep 4, 2020 · 13 comments
Labels
Team:Core, Team:Operations, triage_needed

Comments


sebgl commented Sep 4, 2020

Kibana version: latest 7.10.0-SNAPSHOT, as of September 4, 2020

Elasticsearch version: latest 7.10.0-SNAPSHOT, as of September 4, 2020

Kibana 7.10.0-SNAPSHOT seems to OOM after a few seconds/minutes on ECK, before it is completely initialized.
ECK sets a default 1Gi memory limit, which does not seem large enough.

On successful runs (with a >1Gi memory limit), the memory usage reported by `kubectl top pod` is in the 950Mi-1200Mi range.
If I set a memory limit of 900Mi, the Pod gets OOMKilled most of the time.

When running the previous version (7.9.0), I see memory usage in the 700Mi-750Mi range.

Is that a potential bug, or is the higher memory usage expected (in which case we can raise ECK's defaults)?
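
For context, raising the limit on the ECK side looks roughly like this (a minimal sketch against ECK's `kibana.k8s.elastic.co/v1` CRD; the resource names and the 1536Mi figure are illustrative assumptions, not tested recommendations):

```yaml
# Hypothetical ECK manifest bumping Kibana's memory above the 1Gi default.
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-sample
spec:
  version: 7.10.0
  count: 1
  elasticsearchRef:
    name: elasticsearch-sample   # assumed Elasticsearch resource name
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: 1536Mi     # illustrative value, not a validated default
            limits:
              memory: 1536Mi
```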

I cannot easily reproduce the OOMKill when using Docker, though: `docker run --rm --link elasticsearch:elasticsearch -p 5601:5601 -v /tmp/kibana.yml:/usr/share/kibana/config/kibana.yml --memory=700m docker.elastic.co/kibana/kibana:7.10.0-SNAPSHOT`. The container does not get OOM killed, but it is not running with the same configuration as ECK's default one.

ECK issue: elastic/cloud-on-k8s#3710
Maybe related: #72987

@elasticmachine
Contributor

Pinging @elastic/kibana-app-arch (Team:AppArch)

@mikecote added the Team:Operations label and removed Team:AppArch Sep 4, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@mikecote
Contributor

mikecote commented Sep 4, 2020

Sorry AppArch for the ping :( wrong issue.

@etwillbefine

I saw the same behaviour on a Kubernetes cluster using the 7.9.1 release. The root cause was a "file not found" error while Kibana was starting (it tried to read a nonexistent .crt file). Maybe you can find additional information in your Kibana logs as well. The status reported for the Pod was "OOMKilled". Once that error was fixed, Kibana started successfully without OOM.

@tylersmalley
Contributor

@joshdover this appears to have started with cbf2844

8221 runs with 700MB, OOMs at 650MB
8222 runs with 1100MB, OOMs at 1000MB

@tylersmalley added the Team:Core label Sep 22, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@jbudz
Member

jbudz commented Sep 23, 2020

We should make sure `max-old-space-size` is set and has padding. I don't think we can rely on (dynamic memory limit + Docker overhead + Chromium + APM, and so on) staying below the Docker OOM kill threshold. And now I'm wondering whether that will work with https://github.com/elastic/kibana/blob/master/x-pack/plugins/reporting/server/browsers/chromium/driver_factory/start_logs.ts#L77
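
As an illustration of the padding idea (a hypothetical podTemplate sketch, not Kibana's or ECK's actual config; the 800MB heap cap is an assumed value): the Node heap would be pinned well below the container limit, leaving headroom for off-heap allocations, Chromium, and the APM agent.

```yaml
# Hypothetical override: cap the V8 old-space heap at ~800MB under a 1Gi
# container limit, so the Node process fails with a heap error before the
# kernel OOM killer fires. The 800 figure is an assumption for illustration.
podTemplate:
  spec:
    containers:
      - name: kibana
        env:
          - name: NODE_OPTIONS
            value: "--max-old-space-size=800"
        resources:
          limits:
            memory: 1Gi
```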

Do you have a link to the production configs we use (or Slack me)? E.g. this isn't a Node OOM, it's triggered at the container level, and we're targeting 0 swap or something. Any logs available? I'll get an env set up at some point, so no problem if not.

And then for the memory changes - maybe cbf2844#diff-265d2762a1cbdb7bc87354b9cfc97dd6R132 is related? The ref plus the new platform observables could be causing more frequent updates. I haven't gone any further than vaguely remembering the issue and Ctrl+F, so I'm just speculating here.

@joshdover
Contributor

joshdover commented Sep 23, 2020

PR to resolve this problem with cbf2844: #78342

@joshdover
Contributor

#78342 was merged late yesterday. The next 7.10 snapshot (which is building right now) should include this change for testing. @sebgl would you be able to confirm if the issue is fixed for ECK?

@sebgl
Author

sebgl commented Sep 29, 2020

Thanks for the heads-up @joshdover. We run our E2E tests with the latest SNAPSHOT Docker image every night, so we can let you know if this happens again.
In the meantime, I think we can close this.

Thanks for the fix!

@sebgl sebgl closed this as completed Sep 29, 2020
@sebgl
Author

sebgl commented Oct 1, 2020

It seems to be happening again in ECK nightly E2E tests: elastic/cloud-on-k8s#3710 (comment).

@sebgl sebgl reopened this Oct 1, 2020
@spalger
Contributor

spalger commented Oct 1, 2020

I think #79176 will solve this.

@jbudz
Member

jbudz commented Dec 3, 2020

Closing this out, upstream is closed.

@jbudz jbudz closed this as completed Dec 3, 2020