[FEA] Improve GDS spill performance on NVIDIA EGX #2592

Closed
rongou opened this issue Jun 4, 2021 · 0 comments
Labels: epic, performance

rongou (Collaborator) commented Jun 4, 2021

Is your feature request related to a problem? Please describe.
GDS spill performs worse on EGX clusters than on DGX. There are two potential causes, both tied to hardware configuration:

  • EGX machines have more RAM (host memory) per GPU, so for many queries the spilled buffers could be held mostly in host memory, with only a small portion spilled all the way to disk. GDS spill currently writes every spilled buffer to disk, which puts it at a disadvantage.
  • EGX machines usually have fewer NVMe drives than DGX machines, so the available I/O bandwidth is lower.

Describe the solution you'd like
One possibility is to combine GDS spill with host memory. Device buffers would be spilled to host memory first; once the host memory spill storage limit is reached, further spills would go through GDS to disk.
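
The proposed policy can be sketched as a toy model. This is only an illustration in Python, not the plugin's code: the class `TieredSpillStore`, its fields, and the `spill` method are hypothetical names, and the sketch assumes that once the host limit is reached, new buffers go straight to the disk tier (standing in for GDS) rather than evicting buffers already held in host memory.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TieredSpillStore:
    """Toy model of host-first spilling (hypothetical, for illustration only)."""

    host_limit_bytes: int  # assumed cap on host memory spill storage
    host: Dict[str, int] = field(default_factory=dict)  # buffer id -> bytes held in host memory
    disk: Dict[str, int] = field(default_factory=dict)  # buffer id -> bytes sent to the disk tier

    @property
    def host_used_bytes(self) -> int:
        return sum(self.host.values())

    def spill(self, buffer_id: str, size_bytes: int) -> str:
        """Record a spilled device buffer and return the tier that received it."""
        if self.host_used_bytes + size_bytes <= self.host_limit_bytes:
            # Fits under the host limit: keep it in host memory.
            self.host[buffer_id] = size_bytes
            return "host"
        # Host limit reached: this buffer goes to disk (the GDS path in the proposal).
        self.disk[buffer_id] = size_bytes
        return "disk"


store = TieredSpillStore(host_limit_bytes=100)
tiers = [store.spill(f"buf{i}", 40) for i in range(4)]
print(tiers)
```

With a 100-byte host limit and four 40-byte buffers, the first two buffers stay in host memory and the remaining two are sent to disk, so only the overflow pays the disk I/O cost.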

Describe alternatives you've considered
We can continue to improve GDS spill performance, but in a head-to-head comparison GDS is probably never going to be faster than host memory.

Additional context
Current spilling logic is mostly in RapidsHostMemoryStore and RapidsGdsStore.

rongou added the "feature request" and "? - Needs Triage" labels on Jun 4, 2021
rongou self-assigned this on Jun 4, 2021
rongou mentioned this issue on Jun 4, 2021 (11 tasks)
sameerz added the "performance" label and removed the "? - Needs Triage" and "feature request" labels on Jun 8, 2021
rongou added the "epic" label on Jun 14, 2021
rongou closed this as completed on Jan 12, 2022