
[FEA] Figure out how to monitor GPU memory usage during scale test #9448

Open
wjxiz1992 opened this issue Oct 16, 2023 · 3 comments
Labels: feature request, scale test

Comments

@wjxiz1992
Collaborator

Is your feature request related to a problem? Please describe.

We have some tools to monitor system metrics like CPU, I/O, and host memory usage, but they do not include GPU metrics.
We want to understand GPU memory usage during the scale test (related to #8811).

An easy way is to launch a script that calls nvidia-smi to capture the GPU metrics every second, or more frequently:

while true; do nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits >> gpu_usage.csv; sleep 1; done

Then we can plot the results, map the usage to each query, etc.
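
For the memory question specifically, the same loop could also record a timestamp and the memory columns; the fields below are standard nvidia-smi --query-gpu columns, so this is only a variation of the command above:

while true; do
  # timestamp + per-GPU memory in MiB, alongside the utilization numbers
  nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total,utilization.gpu,utilization.memory \
             --format=csv,noheader,nounits >> gpu_usage.csv
  sleep 1
done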

But the problem is that RMM allocates the GPU memory beforehand, so nvidia-smi won't capture the actual GPU memory used by our plugin; it will only see full GPU memory usage from the very beginning to the end. I know we can disable the RMM pool so that nvidia-smi works in this case, but I doubt that the memory usage would be identical to what it is when the RMM ASYNC allocator is enabled.
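
For reference, a minimal sketch of the two runs being compared, assuming the plugin's spark.rapids.memory.gpu.pool setting (application jar and remaining options elided):

# Run 1: pooling disabled, so nvidia-smi reflects live allocations per query
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.gpu.pool=NONE \
  ...

# Run 2: the ASYNC (cudaMallocAsync) pool, the configuration we actually run with
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.gpu.pool=ASYNC \
  ...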

We need a better solution here.

@wjxiz1992 wjxiz1992 added the feature request, ? - Needs Triage, and scale test labels on Oct 16, 2023
@mattahrens
Collaborator

Related issue for this metric generated by the plugin: #6745

@mattahrens
Collaborator

What is the use case for monitoring GPU memory? The preference is to have GPU memory metrics available in the plugin for profiler recommendations.

@mattahrens mattahrens removed the ? - Needs Triage label on Oct 17, 2023
@wjxiz1992
Collaborator Author

It looks like #6745 can satisfy our requirement. We are now trying to profile all queries in the Scale Test. Our use case is just to know the peak GPU memory when running a query. Thanks!
@winningsix According to the issue Matt mentioned, nsys now supports a GPU memory usage track. All we need to do is launch nsys to profile the Spark application and then open the output qdrep file in nsys; we should be able to see the metrics there.
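
A minimal sketch of what that could look like for a local-mode run (scale_test_app.jar is a placeholder; in cluster mode nsys would need to wrap each executor launch instead):

# Record CUDA API activity plus the GPU memory usage track
nsys profile \
  -o scale_test_profile \
  --trace=cuda,nvtx \
  --cuda-memory-usage=true \
  spark-submit \
    --master local[*] \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    scale_test_app.jar

# Open the resulting scale_test_profile.qdrep in the Nsight Systems GUI
# and inspect the GPU memory usage timeline per query.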
