[FEA] acquire the semaphore after concatToHost in GpuShuffleCoalesceIterator #4395

Closed
abellina opened this issue Dec 20, 2021 · 2 comments · Fixed by #4396

abellina commented Dec 20, 2021

There is an optimization opportunity in GpuShuffleCoalesceIterator: we currently acquire the semaphore before we concatenate on the host, so it is held during work that does not touch the GPU.

The cuDF call we currently make is JCudfSerialization.concatToContiguousTable, but it is trivial to split up since what it calls inside of cuDF is already public. The proposal (sketched after the list below) is to:

  1. Without holding the semaphore, call: JCudfSerialization.concatToHostBuffer to get a HostConcatResult.
  2. Acquire the semaphore
  3. Get the contiguous table on the GPU by calling: HostConcatResult.toContiguousTable.
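
A minimal sketch of what steps 1-3 could look like, assuming the cuDF Java API names above and the plugin's GpuSemaphore.acquireIfNecessary helper; concatThenMoveToGpu is a hypothetical name for illustration, not existing code:

```scala
import ai.rapids.cudf.{ContiguousTable, HostMemoryBuffer, JCudfSerialization}
import ai.rapids.cudf.JCudfSerialization.{HostConcatResult, SerializedTableHeader}
import com.nvidia.spark.rapids.GpuSemaphore
import org.apache.spark.TaskContext

// Hypothetical sketch: concatenate the serialized shuffle tables on the host
// first, and only hold the GPU semaphore for the host-to-device step.
def concatThenMoveToGpu(
    headers: Array[SerializedTableHeader],
    buffers: Array[HostMemoryBuffer]): ContiguousTable = {
  // 1. Host-side concatenation; the semaphore is not held while this runs.
  val hostResult: HostConcatResult =
    JCudfSerialization.concatToHostBuffer(headers, buffers)
  try {
    // 2. Acquire the semaphore only once we are ready to touch the GPU.
    GpuSemaphore.acquireIfNecessary(TaskContext.get())
    // 3. Materialize the concatenated result as a contiguous table on the device.
    hostResult.toContiguousTable()
  } finally {
    hostResult.close()
  }
}
```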

Results with this change are promising: summed across all queries, it saves about 1 minute of runtime. Most queries are above the 1x speedup line. The queries at or below 0.9x were q52, q46, q68, q45, and q42; when executed multiple times they all went above 1x (these are single-digit-second queries).

abellina added the feature request, ? - Needs Triage, and performance labels Dec 20, 2021
abellina self-assigned this Dec 20, 2021
abellina commented

An even better change may be to delay putting these batches on the GPU until we really need them. This would be specific to joins, where the build side can sit on the GPU for a while as we wait for the stream side. In these cases what you almost want is to call concatToHostBuffer, keep that HostConcatResult around, and not acquire the GPU at all until the stream side is also ready to go.
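
A rough sketch of that deferral idea, assuming the same cuDF HostConcatResult and GpuSemaphore helpers as above; LazyHostBatch is a hypothetical wrapper for illustration, not an existing class:

```scala
import ai.rapids.cudf.ContiguousTable
import ai.rapids.cudf.JCudfSerialization.HostConcatResult
import com.nvidia.spark.rapids.GpuSemaphore
import org.apache.spark.TaskContext

// Hypothetical wrapper: keep the concatenated build-side batch on the host and
// defer both the semaphore acquisition and the device copy until the stream
// side is ready.
class LazyHostBatch(hostResult: HostConcatResult) extends AutoCloseable {
  private var materialized: Option[ContiguousTable] = None

  // Called only when the join actually needs the build side on the GPU.
  def getOnGpu(): ContiguousTable = materialized.getOrElse {
    GpuSemaphore.acquireIfNecessary(TaskContext.get())
    val table = hostResult.toContiguousTable()
    materialized = Some(table)
    table
  }

  override def close(): Unit = {
    materialized.foreach(_.close())
    hostResult.close()
  }
}
```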


jlowe commented Dec 20, 2021

> In these cases what you almost want is to call concatToHostBuffer, keep that HostConcatResult around, and not acquire the GPU at all until the stream side is also ready to go.

This is a join-specific optimization, and I don't think we should get too far ahead of ourselves there. The optimization described above should apply to all cases where we are doing host-side concat, and thus I think is a good optimization as-is. We can always add a more sophisticated optimization for the shuffle-into-a-join case, but let's get this one done first.
