[Bug]: gradle check failing with java heap OutOfMemoryError #2324

Closed · dreamer-89 opened this issue Jul 11, 2022 · 21 comments
Labels: bug Something isn't working

dreamer-89 (Member) commented Jul 11, 2022

Describe the bug

The public Jenkins gradle check job is failing due to a java heap OutOfMemoryError. Raising this bug to better understand the existing gradle check setup and to prevent the failure on other machines. The EC2 hosts were previously upgraded to c5.24xlarge instances. Not sure if the instances need some cleanup.

To reproduce

https://build.ci.opensearch.org/job/gradle-check/381

Expected behavior

The job should not fail with a java heap OutOfMemoryError.


Host / Environment

Running on EC2 (Amazon_ec2_cloud) - jenkinsAgentNode-Jenkins-Agent-Ubuntu2004-X64-c524xlarge-Single-Host (i-093f212ad4f5e9583) in /var/jenkins/workspace/gradle-check


Relevant log output

1: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':example-plugins:custom-settings:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================

2: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':libs:opensearch-x-content:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================

3: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':qa:repository-multi-version:compileTestJava'.
> java.lang.OutOfMemoryError: Java heap space

* Try:
> Run with --stacktrace option to get the stack trace.
> Run with --info or --debug option to get more log output.
> Run with --scan to get full insights.
==============================================================================
dreamer-89 added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Jul 11, 2022
peterzhuamazon (Member) commented:

Please let us know what kind of cleanup you need.
We have:

  1. Cleanup docker containers
  2. Cleanup gradle cache and gradle daemons
  3. Cleanup core repo

If you have any commands to clear the Java heap, please let us know.

Thanks.
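
For illustration, a minimal sketch of what those three cleanups might look like on an agent (the exact commands in the Jenkins scripts may differ):

# 1. Clean up docker containers and images
docker system prune -af
# 2. Stop gradle daemons and clear the gradle cache
./gradlew --stop
rm -rf ~/.gradle/caches
# 3. Reset the core repo checkout
git -C /var/jenkins/workspace/gradle-check clean -fdx
# There is no command to "clear the Java heap" as such; the heap belongs
# to each JVM process, so killing stale JVMs is the practical equivalent.
pkill -f opensearch || true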

bbarani (Member) commented Jul 13, 2022

@dreamer-89 can you explore options to break up the gradle check tasks into separate modules? The current gradle check is very monolithic, and we cannot sustain it much longer by continuously increasing hardware resources.
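
For illustration, Gradle can already run the check task per subproject rather than for the whole build, which could be a starting point for splitting the job (the project paths below are taken from the log output above):

./gradlew :libs:opensearch-x-content:check
./gradlew :qa:repository-multi-version:check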

dreamer-89 (Member, Author) commented:

Another occurrence in opensearch-project/OpenSearch#3924

In this case, the build itself was successful but was marked failed due to heap space.

dreamer-89 (Member, Author) commented Jul 15, 2022

> @dreamer-89 can you explore options to break up the gradle check tasks into separate modules? The current gradle check is very monolithic, and we cannot sustain it much longer by continuously increasing hardware resources.

> Please let us know what kind of cleanup you need. We have:
>
>   1. Cleanup docker containers
>   2. Cleanup gradle cache and gradle daemons
>   3. Cleanup core repo
>
> If you have any commands to clear the Java heap, please let us know.
>
> Thanks.

Thanks @peterzhuamazon for providing the different options. At this point, I am not sure which cleanup will help. I think we need to deep dive to understand which process is consuming the heap space. As this is repeating across instances, it would be good to find the root cause, and then we can try the appropriate options from above.
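
As a sketch, the standard JDK tools can show which JVM is actually consuming heap during a run (pick the Gradle daemon or test-worker PID from jps; <pid> below is a placeholder):

jps -lvm                        # list running JVMs with their launch arguments
jcmd <pid> GC.heap_info         # current heap usage of one JVM
jmap -histo:live <pid> | head   # histogram of live objects, largest classes first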

dreamer-89 (Member, Author) commented:

> @dreamer-89 can you explore options to break up the gradle check tasks into separate modules? The current gradle check is very monolithic, and we cannot sustain it much longer by continuously increasing hardware resources.

Thanks @bbarani for the comment. I suspect the heap exhaustion is not due to limited hardware but rather some resource leak causing the heap memory issue. I think we need to spend some brain cycles on the existing failure. It could be a test, but I only started observing these failures recently.

peterzhuamazon (Member) commented:

> > @dreamer-89 can you explore options to break up the gradle check tasks into separate modules? The current gradle check is very monolithic, and we cannot sustain it much longer by continuously increasing hardware resources.
> >
> > Please let us know what kind of cleanup you need. We have:
> >
> >   1. Cleanup docker containers
> >   2. Cleanup gradle cache and gradle daemons
> >   3. Cleanup core repo
> >
> > If you have any commands to clear the Java heap, please let us know.
> > Thanks.
>
> Thanks @peterzhuamazon for providing the different options. At this point, I am not sure which cleanup will help. I think we need to deep dive to understand which process is consuming the heap space. As this is repeating across instances, it would be good to find the root cause, and then we can try the appropriate options from above.

Hi @dreamer-89, let me clarify: those 3 options have ALL been applied in our script already, meaning that every time a gradle check triggers, all 3 cleanups have already run. I am asking whether you have any way to clean up the heap, as I am not familiar with that.

peterzhuamazon (Member) commented:

Additionally, I now remember that we even kill all OpenSearch processes, if any still exist.

bbarani (Member) commented Jul 18, 2022

@dblock @dreamer-89 Please let us know if you can think of any additional cleanups / tweaks that need to be implemented for this Gradle check. We are still seeing a lot of flaky errors and memory issues after increasing the hardware resources, and we should focus on fixing them to improve developer velocity. CC: @peterzhuamazon @CEHENKLE

owaiskazi19 (Member) commented:

Can we define recommended heap specifications to bound the container memory? Flags like -Xmx and -Xms, as mentioned here.
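
For example (the sizes here are hypothetical and would need tuning for this build), Gradle reads its JVM heap settings from org.gradle.jvmargs in gradle.properties:

# hypothetical sizes; tune to the build and the instance type
echo "org.gradle.jvmargs=-Xms2g -Xmx4g -XX:+UseG1GC" >> gradle.properties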

dblock (Member) commented Jul 18, 2022

I don't have any useful advice, but one thing I did notice: we used to run gradle checks without these problems on a previous set of hardware/jenkins/instances. Did we downgrade from that, capacity-wise?

bbarani (Member) commented Jul 19, 2022

> I don't have any useful advice, but one thing I did notice: we used to run gradle checks without these problems on a previous set of hardware/jenkins/instances. Did we downgrade from that, capacity-wise?

We have actually increased the hardware resources; I think we are using c5.24xlarge instances now. I assume you are seeing more errors because we are running gradle checks more frequently now, since we eliminated the need to comment 'start gradle check' on the PR to begin the process. Having said that, I am seeing a very interesting pattern where it passes for a certain amount of time, then fails continuously for a while before it starts passing again.

bbarani removed the untriaged (Issues that have not yet been triaged) label on Jul 19, 2022
dblock (Member) commented Jul 19, 2022

I hear from @peterzhuamazon that the JDK may have been changed? I would double-check that G1GC is enabled.

Are these instances recycled every build?
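
One quick way to verify, assuming the build JDK is on the PATH (<pid> below is a placeholder for an already-running JVM):

# prints UseG1GC = true when G1 is the active collector
java -XX:+PrintFlagsFinal -version | grep -i UseG1GC
# for a JVM that is already running:
jcmd <pid> VM.flags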

peterzhuamazon (Member) commented:

> I hear from @peterzhuamazon that the JDK may have been changed? I would double-check that G1GC is enabled.
>
> Are these instances recycled every build?

I am currently writing a setup to permanently recycle all the instances: run one build, delete the agent, provision a new one, and run again, while applying all the cleanups for now. If this runs well with a higher success rate, then the problem is probably caused by some zombie process in the middle.

peterzhuamazon (Member) commented Jul 21, 2022

Hi @dblock @dreamer-89 @bbarani

After this change, gradle check generally completes in 27-36 minutes, quicker than the original 45-60 minutes. The runs are also less flaky in general.
https://build.ci.opensearch.org/job/gradle-check/853/console
https://build.ci.opensearch.org/job/gradle-check/854/console
https://build.ci.opensearch.org/job/gradle-check/855/console

Even the failures are legitimate failures most of the time:
https://build.ci.opensearch.org/job/gradle-check/850/console

Though a flaky test will occasionally show up:
https://build.ci.opensearch.org/job/gradle-check/856/console

This suggests to me that gradle check has some zombie process or memory leak that causes the continuous flaky runs on the same runner. By restricting each brand-new runner to a single run, we temporarily resolve the issue and increase the success rate.
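
One way to check a runner for leftover daemons between builds is Gradle's own status command:

# lists busy and idle Gradle daemons left over from previous runs
./gradlew --status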

The sample size is still small, but it already shows a different trend in the success rate:
[screenshot: success-rate trend]

peterzhuamazon (Member) commented:

Remember, this is not a permanent solution; we would like the core team to help identify the cause within gradle check and fix the root problem.

Thanks.

dblock (Member) commented Jul 21, 2022

@peterzhuamazon I think we need to increasingly switch gradle jobs to run with --no-daemon. If you want to hunt those down in the various build.sh scripts, I imagine the projects will gladly merge that change. But I also don't see a problem with terminating agents after every large job; it gives a completely clean machine to every job. What's the cost of recycling an agent?
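
For reference, the flag and its persistent equivalent in gradle.properties look like this:

./gradlew check --no-daemon                           # one-off: no daemon left behind
echo "org.gradle.daemon=false" >> gradle.properties   # persistent for the workspace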

peterzhuamazon (Member) commented:

> @peterzhuamazon I think we need to increasingly switch gradle jobs to run with --no-daemon. If you want to hunt those down in the various build.sh scripts, I imagine the projects will gladly merge that change. But I also don't see a problem with terminating agents after every large job; it gives a completely clean machine to every job. What's the cost of recycling an agent?

Hi @dblock, that is already done, and it is not related here. We have been using this method since we forked Jenkins.

dblock (Member) commented Jul 21, 2022

I saw another out of memory error in https://build.ci.opensearch.org/job/gradle-check/874/

dreamer-89 (Member, Author) commented:

> Remember, this is not a permanent solution; we would like the core team to help identify the cause within gradle check and fix the root problem.
>
> Thanks.

I observed the heap issue locally as well. Created opensearch-project/OpenSearch#3973 to track it on core.

peterzhuamazon (Member) commented:

Hi @dreamer-89, let us know when you have the fix on your side, so we can implement it in the Jenkins workflow. Will close this for now; feel free to reopen the issue when needed.
Thanks.
