
[FEA] Figure out how to monitor GPU memory usage during scale test #9448

Open
wjxiz1992 opened this issue Oct 16, 2023 · 3 comments
Labels: feature request, scale test

Comments

@wjxiz1992
Collaborator

Is your feature request related to a problem? Please describe.

We have some tools to monitor system metrics like CPU, I/O, and host memory usage, but they do not include GPU metrics.
We want to understand GPU memory usage during the scale test (related to #8811).

An easy way is to launch a script that calls nvidia-smi to capture the GPU metrics every second, or more frequently:

while true; do nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits >> gpu_usage.csv; sleep 1; done

Then we can plot the results, map the usage to each query, etc.
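
For the memory question specifically, the same loop could also record a timestamp and the memory columns; the fields below are standard nvidia-smi --query-gpu columns, so this is only a variation of the command above:

while true; do
  # timestamp + per-GPU memory in MiB, alongside the utilization numbers
  nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total,utilization.gpu,utilization.memory \
             --format=csv,noheader,nounits >> gpu_usage.csv
  sleep 1
done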

But the problem is that RMM allocates the GPU memory beforehand, so nvidia-smi won't capture the actual GPU memory used by our plugin; it will only see full GPU memory usage from the very beginning to the end. I know we can disable the RMM pool so that nvidia-smi works in this case, but I doubt that the memory usage would be identical to what it is when the RMM ASYNC allocator is enabled.
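
For reference, a minimal sketch of the two runs being compared, assuming the plugin's spark.rapids.memory.gpu.pool setting (application jar and remaining options elided):

# Run 1: pooling disabled, so nvidia-smi reflects live allocations per query
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.gpu.pool=NONE \
  ...

# Run 2: the ASYNC (cudaMallocAsync) pool, the configuration we actually run with
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.memory.gpu.pool=ASYNC \
  ...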

We need a better solution here.

@wjxiz1992 wjxiz1992 added the feature request, ? - Needs Triage, and scale test labels on Oct 16, 2023
@mattahrens
Collaborator

Related issue for this metric generated by the plugin: #6745

@mattahrens
Collaborator

What is the use case for monitoring GPU memory? The preference is to have GPU memory metrics available in the plugin for profiler recommendations.

@mattahrens mattahrens removed the ? - Needs Triage label on Oct 17, 2023
@wjxiz1992
Collaborator Author

It looks like #6745 can satisfy our requirement. We are now trying to profile all queries in the Scale Test. Our use case is just to know the peak GPU memory when running a query. Thanks!
@winningsix According to the issue Matt mentioned, nsys now supports a GPU memory usage track. All we need to do is launch nsys to profile the Spark application and then open the output qdrep file in nsys; we should be able to see the metrics there.
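
A minimal sketch of what that could look like for a local-mode run (scale_test_app.jar is a placeholder; in cluster mode nsys would need to wrap each executor launch instead):

# Record CUDA API activity plus the GPU memory usage track
nsys profile \
  -o scale_test_profile \
  --trace=cuda,nvtx \
  --cuda-memory-usage=true \
  spark-submit \
    --master local[*] \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    scale_test_app.jar

# Open the resulting scale_test_profile.qdrep in the Nsight Systems GUI
# and inspect the GPU memory usage timeline per query.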
