Distinct inner join #10503

Merged
merged 2 commits into from
Mar 4, 2024


@jlowe jlowe commented Feb 26, 2024

Depends on rapidsai/cudf#15019.

Updates the inner hash join code to use the cudf distinct join API, which performs the join much faster when the hash table is known to contain no duplicate join keys. Because we already run a groupby on the build-side keys to estimate how explosive a join will be, we already know when the hash table is built from unique keys.
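
To illustrate the idea only, here is a minimal CPU-side sketch in plain Python. It is not the cudf API and not the plugin code: every function name is hypothetical, keys are assumed to be non-null hashable values, and the "groupby" is approximated with a simple count per key. When every build-side key occurs exactly once, the hash table can map each key to a single row index, so the probe emits at most one match per probe row and never has to walk a list of duplicates.

```python
from collections import Counter, defaultdict


def build_keys_are_distinct(build_keys):
    """Count rows per build-side key (a stand-in for the groupby).

    The keys are distinct when no key occurs more than once.
    """
    counts = Counter(build_keys)
    return all(count == 1 for count in counts.values())


def inner_join_general(build_keys, probe_keys):
    """General hash join: each key maps to a list of build row indices."""
    table = defaultdict(list)
    for build_idx, key in enumerate(build_keys):
        table[key].append(build_idx)

    pairs = []
    for probe_idx, key in enumerate(probe_keys):
        # A probe row may match several build rows.
        for build_idx in table.get(key, ()):
            pairs.append((probe_idx, build_idx))
    return pairs


def inner_join_distinct(build_keys, probe_keys):
    """Distinct hash join: each key maps to exactly one build row index.

    Only valid when the build-side keys are known to be unique.
    """
    table = {key: build_idx for build_idx, key in enumerate(build_keys)}

    pairs = []
    for probe_idx, key in enumerate(probe_keys):
        build_idx = table.get(key)
        if build_idx is not None:
            # At most one match per probe row.
            pairs.append((probe_idx, build_idx))
    return pairs


def inner_join(build_keys, probe_keys):
    """Pick the distinct path when the key counts show no duplicates."""
    if build_keys_are_distinct(build_keys):
        return inner_join_distinct(build_keys, probe_keys)
    return inner_join_general(build_keys, probe_keys)
```

Both paths return the same (probe row, build row) index pairs whenever the build keys are unique; the distinct path simply does less work per probe row. The sketch does not model SQL null semantics or GPU execution.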

Running this on NDS at scale factor 3k resulted in a 2% overall improvement, with the following statistically significant (p-value < 0.05) changes at the individual query level:

  • query14_part2 faster by 9%
  • query23_part1 faster by 6%
  • query23_part2 faster by 6%
  • query88 faster by 13%

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe added the performance and cudf_dependency labels Feb 26, 2024
@jlowe jlowe self-assigned this Feb 26, 2024
@jlowe jlowe marked this pull request as ready for review February 28, 2024 22:46

jlowe commented Feb 29, 2024

build


jlowe commented Feb 29, 2024

build

@jlowe jlowe mentioned this pull request Feb 29, 2024
@jlowe jlowe merged commit dfc18b2 into NVIDIA:branch-24.04 Mar 4, 2024
39 of 41 checks passed
@jlowe jlowe deleted the distinct-join branch March 4, 2024 21:37
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
performance: A performance related task/issue
2 participants