
[FEA] Metric for maximum GPU memory per task #6745

Open
Tracked by #8027
abellina opened this issue Oct 10, 2022 · 6 comments
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina
Collaborator

abellina commented Oct 10, 2022

The maximum amount of GPU memory each task uses is a very helpful metric for knowing whether an application is getting close to needing to spill.

Tracking the memory currently on the GPU, spilled to host memory, etc., is also really interesting.

The problem is how to gather this metric efficiently. The Retry framework could keep track of the amount of memory allocated on a given thread, and the amount that is deallocated/freed by that thread, but it would not account for memory that is later freed by other threads (as in the case of spill, or UCX shuffle). Instead we would almost want to associate each allocation with a given thread, but that can be very memory intensive on the host, especially because we are likely to see thousands of active buffers.

We should experiment to see how expensive this is in practice and, if it is not too bad, implement it.

@abellina abellina added feature request New feature or request ? - Needs Triage Need team to review and classify reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Oct 10, 2022
@abellina abellina added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Oct 10, 2022
@sameerz sameerz removed feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 11, 2022
@abellina
Collaborator Author

abellina commented Oct 13, 2022

I thought about this issue a bit more. What I think we want is a version of the tracking_resource_adaptor, but rather than having a single map for all threads, we keep track of the maximum outstanding GPU footprint per thread. Also to note, the main motivation here is to figure out whether our estimate of memory usage for some GPU code is higher than anticipated, to help us debug waste or inform heuristics that control which tasks we allow on the GPU.

This should allow us to do the following:

val maxOutstandingUsage = withMemoryTracking {
  // materialize data on the GPU (placeholder for whatever allocates the input buffers)
  val gpuData = materializeDataOnGpu()
  val result = withResource(gpuData) { _.callCudfFunction() }
  result.close()
  // at this point our maximum outstanding should be:
  // size(gpuData) + max bytes allocated inside of `callCudfFunction`
}

In this scenario, when we enter the withMemoryTracking block we ask a per-thread tracking resource to start tracking this thread before we materialize data. The materialization of gpuData incurs calls to RMM to get memory, which adds to the outstanding amount, and then the call into the cuDF code can produce allocations that are kept around (outstanding) for a while, allocations and frees that happen within the C++ code before the kernel, or results from that code. So we can keep track of how much is outstanding at any given time by adding to a thread-local variable when bytes are requested and subtracting when we call free.
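To make this concrete, here is a minimal sketch of that per-thread bookkeeping (purely illustrative; ThreadMemoryTracker, onAllocated and onFreed are made-up names, not existing plugin or RMM APIs, and the real hooks would be the allocate/free path of the tracking resource):

// Hypothetical per-thread tracker: add on allocation, subtract on free,
// and remember the high-water mark of outstanding device bytes.
object ThreadMemoryTracker {
  private val enabled = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }
  private val outstanding = new ThreadLocal[Long] {
    override def initialValue(): Long = 0L
  }
  private val maxOutstanding = new ThreadLocal[Long] {
    override def initialValue(): Long = 0L
  }

  // called from the allocation hook of the tracking resource
  def onAllocated(bytes: Long): Unit = if (enabled.get) {
    val now = outstanding.get + bytes
    outstanding.set(now)
    if (now > maxOutstanding.get) maxOutstanding.set(now)
  }

  // called from the free hook of the tracking resource
  def onFreed(bytes: Long): Unit = if (enabled.get) {
    outstanding.set(outstanding.get - bytes)
  }

  // run `body` with tracking on and return the maximum outstanding bytes seen
  def withMemoryTracking[T](body: => T): Long = {
    enabled.set(true)
    outstanding.set(0L)
    maxOutstanding.set(0L)
    try {
      body
      maxOutstanding.get
    } finally {
      enabled.set(false)
    }
  }

  // temporarily turn tracking off, e.g. around spill-triggered frees
  def withoutMemoryTracking[T](body: => T): T = {
    val was = enabled.get
    enabled.set(false)
    try body finally enabled.set(was)
  }
}

The important property is that both counters are thread-local, so allocations and frees performed by other threads never touch them.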

If one of our allocations fails and we handle it via a spill, it shouldn't matter. That is because the spill code should be careful to disable the tracking for those spills (e.g. with a withoutMemoryTracking call). This means we wouldn't discount frees on this thread for some other thread's allocations that are irrelevant to the code being tracked.

I hope/believe this could be a pretty low overhead system. Note that I don't think this helps track memory used when an expensive kernel is loaded; as far as I understand, that can be a one-time penalty when we open the shared library. I know we have seen this with some of the regular expression kernels in the past. Pinging @jlowe for comments on this overall.

@abellina
Collaborator Author

abellina commented Oct 17, 2022

I think one approach here is to have a stack of simple memory tracking info in RmmJni. When a withMemoryTracking block is entered we push one of these objects onto the stack. The tracking_resource_adaptor could then check this stack for the current thread, and if it has something in it, use the top tracker to track allocations for now.

When withMemoryTracking finishes, it calls a function in the RMM JNI bits to pop this element from the stack. If it is the last element, we have turned the feature off. If it is not the last element, we take the maximum outstanding amount we just popped and add it to the next element in the stack (the calling scope also saw that maximum outstanding), and we continue to track with the remaining tracker on the stack.

Unfortunately, we also need to keep a set of the addresses we allocated in this thread. Given spill, the current thread may need to spill to satisfy an allocation, and we could then ignore frees for buffers we didn't allocate while tracking. The hope is that these withMemoryTracking blocks sit as close as possible to a cuDF call.
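A rough sketch of that stack idea, with the same caveat that the names (ScopeTracker, ScopedMemoryTracking, push/pop, onAllocated/onFreed) are hypothetical and the real bookkeeping would live in the tracking_resource_adaptor / RmmJni native layer:

import scala.collection.mutable

// Hypothetical per-scope tracking info pushed by nested withMemoryTracking blocks.
class ScopeTracker {
  var outstanding: Long = 0L
  var maxOutstanding: Long = 0L
  // addresses allocated while this scope was tracking, so we can ignore frees
  // of buffers this thread did not allocate (e.g. buffers freed due to spill)
  val owned = mutable.Set[Long]()
}

object ScopedMemoryTracking {
  private val stack = new ThreadLocal[mutable.Stack[ScopeTracker]] {
    override def initialValue(): mutable.Stack[ScopeTracker] = mutable.Stack[ScopeTracker]()
  }

  // entering a withMemoryTracking block
  def push(): Unit = stack.get.push(new ScopeTracker)

  // leaving a withMemoryTracking block: return this scope's maximum and fold it
  // into the enclosing scope, if any, since that scope also saw the usage
  def pop(): Long = {
    val finished = stack.get.pop()
    if (stack.get.nonEmpty) {
      val parent = stack.get.top
      parent.maxOutstanding =
        math.max(parent.maxOutstanding, parent.outstanding + finished.maxOutstanding)
    }
    finished.maxOutstanding
  }

  // allocation callback from the tracking resource adaptor
  def onAllocated(address: Long, bytes: Long): Unit = {
    val s = stack.get
    if (s.nonEmpty) {
      val top = s.top
      top.owned += address
      top.outstanding += bytes
      top.maxOutstanding = math.max(top.maxOutstanding, top.outstanding)
    }
  }

  // free callback: only count frees of addresses allocated while tracking
  def onFreed(address: Long, bytes: Long): Unit = {
    val s = stack.get
    if (s.nonEmpty && s.top.owned.remove(address)) {
      s.top.outstanding -= bytes
    }
  }
}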

@abellina
Collaborator Author

Nsys has recently added memory tracking capabilities, and we believe we can use the correlationId + NVTX ranges to accomplish this as a post-processing step given an NVTX range. We should investigate whether this solution does what we need.

@abellina abellina changed the title [FEA] Track held device memory per thread [FEA] Track held device memory per thread (using nsys?) Feb 10, 2023
@revans2 revans2 changed the title [FEA] Track held device memory per thread (using nsys?) [FEA] Metric for maximum GPU memory per task Apr 4, 2023
@wjxiz1992
Copy link
Collaborator

wjxiz1992 commented Oct 25, 2023

Hi @abellina, I am trying to profile the GPU memory usage during a query run. I used nsys to profile, but didn't find metrics like peak memory usage.

I was using NVIDIA Nsight Systems version 2022.2.1.31-5fe97ab installed in our internal cluster.
I saw a post about it: https://forums.developer.nvidia.com/t/nsys-measure-memory/118394, which was posted in 2021, but its graph does include the memory usage part...

Update:
The memory usage metrics are disabled by default; they can be turned on with an extra nsys argument, --cuda-memory-usage=true.
Then we can see the memory utilization part in the graph:
(screenshot: the memory utilization row is now visible in the nsys timeline)

@abellina
Collaborator Author

I haven't used this feature; the main question I'd have is whether it works with a pool, especially the async pools. It most definitely does not work with ARENA because that's all CPU managed, but I'd hope cudaAsync shows it.

@wjxiz1992
Collaborator

The profile result above is from a run with the ASYNC pool.
