[BUG] nightly integration test failed OOM kill in JDK11 ENV #8729

Closed · pxLi opened this issue Jul 17, 2023 · 3 comments
Labels: bug (Something isn't working) · test (Only impacts tests)

pxLi (Collaborator) commented Jul 17, 2023

Describe the bug
internal pipeline rapids_integration-dev-github #758 #762

Test cases failed due to lost executors. After investigation, we found that memory usage when testing with JDK 11 is abnormal and exceeded the resource limit of our test environment.

All other combinations peaked at around 45~50 GB of memory, but the JDK 11 run consumed more than 55 GB and triggered the OOM kill. This appears to be consistently reproducible.

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

pxLi added the bug, ? - Needs Triage, and test labels on Jul 17, 2023
mattahrens removed the ? - Needs Triage label on Jul 18, 2023
andygrove self-assigned this on Jul 19, 2023
andygrove (Contributor) commented Jul 27, 2023

The default garbage collector is different for different JDK versions:

  • JDK 8 default is Parallel.
  • JDK 11 default is G1.
  • JDK 12+ default is still G1 (Shenandoah and ZGC are available, but not the default).

@pxLi Could you try setting spark.executor.extraJavaOptions="-XX:+UseParallelGC" and see if that helps?
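
For reference, a minimal sketch of how this could be checked and applied outside of the CI scripts; the spark-submit invocation and app name below are placeholders, not our actual nightly command:

 # Show which GC the JVM picks by default for the JDK on the PATH
 # (look for -XX:+UseParallelGC vs -XX:+UseG1GC in the printed flags)
 java -XX:+PrintCommandLineFlags -version

 # Force the Parallel collector for a single run (placeholder app name)
 spark-submit \
   --conf "spark.driver.extraJavaOptions=-XX:+UseParallelGC" \
   --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" \
   your_app.py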

pxLi (Collaborator, Author) commented Jul 28, 2023

> The default garbage collector is different for different JDK versions:
>
>   • JDK 8 default is Parallel.
>   • JDK 11 default is G1.
>   • JDK 12+ default is still G1 (Shenandoah and ZGC are available, but not the default).
>
> @pxLi Could you try setting spark.executor.extraJavaOptions="-XX:+UseParallelGC" and see if that helps?

Sure, I will give it a try in the CI.

Just tried with:

 export PYSP_TEST_spark_driver_extraJavaOptions="-Duser.timezone=UTC -XX:+UseParallelGC"
 export PYSP_TEST_spark_executor_extraJavaOptions="-Duser.timezone=UTC -XX:+UseParallelGC"

but it still failed with an OOM kill. (FYI, last night's regular run passed with a peak memory footprint close to, but not reaching, 55 GB.)
I am going to limit the test parallelism (currently 6) to 4 in the JDK 11 nightly integration tests to unblock downstream tests for now; see the sketch below.
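
A rough sketch of what the adjusted JDK 11 nightly setup could look like; TEST_PARALLEL and the run_pyspark_from_build.sh entry point are assumptions about how the integration tests are driven, not the exact CI script:

 # keep the timezone setting and the GC override tried above
 export PYSP_TEST_spark_driver_extraJavaOptions="-Duser.timezone=UTC -XX:+UseParallelGC"
 export PYSP_TEST_spark_executor_extraJavaOptions="-Duser.timezone=UTC -XX:+UseParallelGC"

 # assumed knob for the pytest worker count: drop from 6 to 4 for the JDK 11 run
 export TEST_PARALLEL=4

 # assumed integration test entry point
 ./integration_tests/run_pyspark_from_build.sh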

pxLi (Collaborator, Author) commented Nov 3, 2023

No more OOM reports. Closing this for now.

pxLi closed this as completed on Nov 3, 2023