[DOC] Add in docs about memory debugging [skip ci] #10104

Merged (2 commits) on Dec 28, 2023

revans2 (Collaborator) commented Dec 27, 2023

This fixes #9987

Any feedback is appreciated.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
revans2 (Collaborator, Author) commented Dec 27, 2023

build

jbrennan333 (Collaborator) previously approved these changes Dec 28, 2023 and left a comment:

Looks good. Just a few nits.

@@ -241,6 +242,11 @@ port 5005.

You can also use [Compute Sanitizer](compute_sanitizer.md) to debug CUDA memory errors.

### Memory Debugging
Java's garbage collector does not play nicely with CUDA memory allocations or with off heap memory.
There are a number of tools that we have developed that can help do

Suggested change:
- There are a number of tools that we have developed that can help do
+ There are a number of tools that we have developed that can help to

close to the actual allocator as possible. But just be careful.

Also know that the address here should correspond to the address in the leak debugging if and only
if it was a DeviceMemoryBuffer that was allocated. In this case, where it is a `cudf::column` the

Suggested change:
- if it was a DeviceMemoryBuffer that was allocated. In this case, where it is a `cudf::column` the
+ if it was a `DeviceMemoryBuffer` that was allocated. In this case, where it is a `cudf::column` the


We also don't have a way to
[log exactly what was spilled](https://github.com/NVIDIA/spark-rapids/issues/10103)
and what was read back it. We can probably guess that this is happening from other logs, but it

Suggested change:
- and what was read back it. We can probably guess that this is happening from other logs, but it
+ and what was read back. We can probably guess that this is happening from other logs, but it

revans2 (Collaborator, Author) commented Dec 28, 2023

@jbrennan333 please take another look

revans2 (Collaborator, Author) commented Dec 28, 2023

build

revans2 merged commit 585794c into NVIDIA:branch-24.02 on Dec 28, 2023
39 checks passed
revans2 deleted the doc_mem_debug branch on December 28, 2023 22:03
sameerz added the documentation label (Improvements or additions to documentation) on Dec 31, 2023
Successfully merging this pull request may close these issues:
[DOC] Add dev docs for GPU memory debugging