Distinct left join #10520

Merged
jlowe merged 6 commits into NVIDIA:branch-24.04 from left-distinct-join
Mar 7, 2024

jlowe
Member

@jlowe jlowe commented Feb 29, 2024

Depends on #10503 and rapidsai/cudf#15154.

Updates the left hash join code to use the cudf distinct left join API, which performs the join much more quickly when the hash table is known to contain no duplicate join keys. Unlike distinct inner join, distinct left join does not require gathering the left table.
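
To illustrate why the left table needs no gather, here is a minimal pure-Python sketch of a distinct left join. It is not the cudf API or the plugin's implementation; the function name `distinct_left_join_gather_map` and the use of `None` for "no match" are assumptions made for illustration. Because the build-side keys are distinct, each left row matches at most one right row, so the output has exactly one row per left row, in the original left order, and only a gather map for the right table is needed.

```python
from typing import Hashable, List, Optional, Sequence


def distinct_left_join_gather_map(
    left_keys: Sequence[Optional[Hashable]],
    right_keys: Sequence[Optional[Hashable]],
) -> List[Optional[int]]:
    """Return, for each left row, the index of its matching right row.

    Hypothetical helper for illustration only. None in the result means
    "no match" (the right-side columns would be null for that row).
    Null keys (None) never match, following SQL equality semantics.
    """
    # Build phase: one hash table entry per distinct right key.
    index_of = {}
    for i, key in enumerate(right_keys):
        if key is None:
            continue  # null keys can never be matched
        if key in index_of:
            raise ValueError(f"duplicate build-side key: {key!r}")
        index_of[key] = i

    # Probe phase: exactly one output entry per left row, in left order,
    # so the left table can be used as-is without a gather.
    return [None if key is None else index_of.get(key) for key in left_keys]


left = [3, 1, 4, 1]
right = [1, 2, 3]
right_map = distinct_left_join_gather_map(left, right)
print(right_map)  # [2, 0, None, 0]
```

The result always has the same length as the left input, which is the property that lets the left columns pass through unchanged. A general left join cannot assume this, since a left row may match several right rows and must then be repeated.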

Running this on NDS at scale factor 3K resulted in no regressions but no statistically significant improvements either, probably because there aren't enough distinct left outer joins in practice on that benchmark. Microbenchmarks show this can be up to 35% faster than a non-distinct left outer hash join.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
@jlowe jlowe added the performance and cudf_dependency labels Feb 29, 2024
@jlowe jlowe self-assigned this Feb 29, 2024
revans2
revans2 previously approved these changes Mar 4, 2024
Review comment on integration_tests/src/main/python/join_test.py (outdated, resolved)
@jlowe jlowe marked this pull request as ready for review March 7, 2024 15:36
@jlowe
Member Author

jlowe commented Mar 7, 2024

build

@jlowe jlowe merged commit e1cbd6e into NVIDIA:branch-24.04 Mar 7, 2024
42 of 43 checks passed
@jlowe jlowe deleted the left-distinct-join branch March 7, 2024 19:20
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
performance: A performance related task/issue
2 participants