Extended configuration of OOM injection mode #10013

Merged
merged 13 commits into from
Dec 15, 2023

Collaborator

@gerashegalov gerashegalov commented Dec 11, 2023

This PR enables specifying the OOM injection mode with additional configuration, such as:

  • num_ooms: the number of OOMs to inject
  • skip=N: trigger the first OOM on the (N+1)st allocation
  • type: one of CPU_OR_GPU (the current default, a mix of both), CPU (host allocations only), or GPU (device allocations only)
  • split=bool: whether to inject SplitAndRetryOOM
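
To illustrate how skip and num_ooms interact, here is a minimal Python sketch. It is not code from this PR. The function name `should_inject` is hypothetical, and it assumes that injected OOMs hit consecutive allocations once the skip count is exhausted, which the description above does not state.

```python
def should_inject(allocation_index: int, skip: int, num_ooms: int) -> bool:
    """Return True if the given allocation should fail with an injected OOM.

    allocation_index is 1-based. With skip=N the first injected OOM lands on
    allocation N+1. Assumption: the remaining OOMs follow on the allocations
    immediately after it.
    """
    first = skip + 1
    return first <= allocation_index < first + num_ooms


def injected_allocations(total: int, skip: int, num_ooms: int) -> list:
    """List the 1-based allocation indexes that would see an injected OOM."""
    return [i for i in range(1, total + 1) if should_inject(i, skip, num_ooms)]
```

Under these assumptions, skip=4 with num_ooms=1 lets the first four allocations succeed and fails only the fifth.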

This enables integration test (IT) invocations such as:

PYSP_TEST_spark_rapids_memory_gpu_state_debug=STDERR \
TEST_PARALLEL=0 \
SPARK_HOME=~/dist/spark-3.3.0-bin-hadoop3 \
./integration_tests/run_pyspark_from_build.sh \
  -k array_exists --test_oom_injection_mode=always:type=CPU,num_ooms=1,skip=4,split=true
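
The option value in the command above has the shape `mode:key=value,key=value,...`. The following Python sketch shows one way such a string could be parsed. It is not the parser from this PR: the names `OomInjectionConfig` and `parse_oom_injection_mode` are hypothetical, and the defaults for num_ooms, skip, and split are assumptions (only the CPU_OR_GPU default for type comes from the description).

```python
from dataclasses import dataclass

# Allocation types named in the PR description.
VALID_TYPES = {"CPU_OR_GPU", "CPU", "GPU"}


@dataclass
class OomInjectionConfig:
    mode: str
    type: str = "CPU_OR_GPU"  # default stated in the PR description
    num_ooms: int = 1         # assumed default
    skip: int = 0             # assumed default
    split: bool = False       # assumed default


def parse_oom_injection_mode(value: str) -> OomInjectionConfig:
    """Parse a string such as 'always:type=CPU,num_ooms=1,skip=4,split=true'."""
    mode, _, params = value.partition(":")
    cfg = OomInjectionConfig(mode=mode.strip())
    for item in (p.strip() for p in params.split(",")):
        if not item:
            continue  # tolerate a missing parameter list or a trailing comma
        key, sep, raw = item.partition("=")
        key, raw = key.strip(), raw.strip()
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        if key == "type":
            if raw not in VALID_TYPES:
                raise ValueError(f"unknown type {raw!r}")
            cfg.type = raw
        elif key in ("num_ooms", "skip"):
            number = int(raw)  # raises ValueError on non-integer input
            if number < 0:
                raise ValueError(f"{key} must be non-negative")
            setattr(cfg, key, number)
        elif key == "split":
            if raw.lower() not in ("true", "false"):
                raise ValueError(f"split must be true or false, got {raw!r}")
            cfg.split = raw.lower() == "true"
        else:
            raise ValueError(f"unknown option {key!r}")
    return cfg
```

Rejecting unknown keys, rather than ignoring them, is a design choice in this sketch: a misspelled option in a test invocation then fails loudly instead of silently falling back to a default.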

This PR is stacked on, and requires, NVIDIA/spark-rapids-jni#1637

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov gerashegalov added the test Only impacts tests label Dec 11, 2023
@gerashegalov gerashegalov self-assigned this Dec 11, 2023
Collaborator

@revans2 revans2 left a comment

Generally it looks good.

reviews

Co-authored-by: Jim Brennan <jimb@nvidia.com>
@gerashegalov gerashegalov marked this pull request as ready for review December 14, 2023 18:29
@gerashegalov
Collaborator Author

build

Collaborator

@jbrennan333 jbrennan333 left a comment

Looking good - just a few minor nits, and it looks like prebuild checks are failing.

@gerashegalov
Collaborator Author

build

@gerashegalov
Collaborator Author

@jbrennan333 sorry missed one of your comments. PTAL again.

@jbrennan333
Collaborator

build

@gerashegalov
Collaborator Author

build

Labels
test Only impacts tests

3 participants