[QST] GPU Memory is completely consumed in AWS-EMR #10827

Closed
akmalmasud96 opened this issue May 16, 2024 · 5 comments
Labels
question Further information is requested

Comments

@akmalmasud96

I am trying to run Spark-RAPIDS on AWS EMR. As soon as processing starts, GPU memory is completely consumed, leaving no GPU memory for the actual processing.
The attached image shows this.

The AWS EMR version is 7.1.0.
How can I solve this problem?

image

@akmalmasud96 akmalmasud96 added ? - Needs Triage Need team to review and classify question Further information is requested labels May 16, 2024
@akmalmasud96
Author

The issue was caused by setting the following parameter:
"spark.plugins":"com.nvidia.spark.SQLPlugin"
Removing it resolved the issue.

@abellina
Collaborator

@akmalmasud96 just making sure I understand: do you want to run the spark-rapids plugin? If so, the spark.plugins flag is required. Thanks!
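
For readers following along, here is a minimal sketch of how one might check whether a set of Spark settings still enables the plugin. It is not part of the project; the helper name `rapids_plugin_enabled` is hypothetical, and it only assumes that `spark.plugins` holds a comma-separated list of plugin class names.

```python
# Hypothetical helper, not part of spark-rapids: checks a plain dict of
# Spark settings for the plugin class named in this thread.
RAPIDS_PLUGIN = "com.nvidia.spark.SQLPlugin"


def rapids_plugin_enabled(conf):
    """Return True if spark.plugins lists the RAPIDS SQL plugin class."""
    plugins = conf.get("spark.plugins", "")
    # spark.plugins is a comma-separated list of class names.
    names = [name.strip() for name in plugins.split(",") if name.strip()]
    return RAPIDS_PLUGIN in names


if __name__ == "__main__":
    print(rapids_plugin_enabled({"spark.plugins": "com.nvidia.spark.SQLPlugin"}))
    print(rapids_plugin_enabled({}))
```

With the setting removed, as described earlier in the thread, the check returns False, which means the job runs without the RAPIDS Accelerator rather than with less GPU memory.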

@jlowe
Member

jlowe commented May 16, 2024

The RAPIDS Accelerator consumes almost all of the GPU memory by default. It does not expect to share the GPU with another process. You can configure the RAPIDS Accelerator to consume much less GPU memory, although that often has detrimental effects on performance due to extra spilling to fit within the smaller amount of GPU memory. See the spark.rapids.memory.gpu.maxAllocFraction or spark.rapids.memory.gpu.reserve configs to limit the amount of memory used, either by setting a maximum fraction or the amount of memory to leave reserved, respectively.
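
As a rough illustration of the two settings named above, the sketch below builds `--conf` arguments for `spark-submit`. The config keys come from the comment; the helper names and the example values (`0.5`, `"2g"`) are assumptions for illustration only, and the accepted value formats should be checked against the RAPIDS Accelerator configuration docs.

```python
# Sketch only: helper names and example values are hypothetical.
# The two config keys are the ones mentioned in the comment above.

def rapids_memory_conf(max_alloc_fraction=None, reserve=None):
    """Return Spark conf entries that limit RAPIDS Accelerator GPU memory."""
    conf = {}
    if max_alloc_fraction is not None:
        if not 0.0 < max_alloc_fraction <= 1.0:
            raise ValueError("max_alloc_fraction must be in (0, 1]")
        # Upper bound on the fraction of GPU memory the plugin may use.
        conf["spark.rapids.memory.gpu.maxAllocFraction"] = str(max_alloc_fraction)
    if reserve is not None:
        # Amount of GPU memory to leave unused (value format is an assumption).
        conf["spark.rapids.memory.gpu.reserve"] = str(reserve)
    return conf


def to_submit_args(conf):
    """Render a conf dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(rapids_memory_conf(max_alloc_fraction=0.5))))
```

Either setting trades GPU memory for extra spilling, so a smaller cap may slow queries down, as noted above.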

@akmalmasud96
Author

akmalmasud96 commented May 17, 2024

@abellina, @jlowe, thanks for explaining. I had figured it out. Can we use RAFT in spark-rapids for vector operations?

@jlowe
Member

jlowe commented May 22, 2024

Apologies for the delay, I accidentally missed this. Working with RAPIDS RAFT should be possible, but the details will depend on how you're planning to use it from Spark. I assume this is via a UDF, so are you planning on a Python UDF calling the RAPIDS RAFT Python APIs, or something lower-level like the C++ RAFT APIs via a Java UDF that allows the data to stay in the JVM process? A Python UDF will require sharing the GPU across processes, which is a bit tricky, as mentioned above.

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label May 31, 2024