[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOM retry framework #7255

Closed
5 tasks
revans2 opened this issue Dec 5, 2022 · 1 comment · Fixed by #7902 or #7930
Labels
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@revans2
Collaborator

revans2 commented Dec 5, 2022

Is your feature request related to a problem? Please describe.
This is very similar to #7254

Describe the solution you'd like
For join operators, we can run into situations where one or both sides of the join need to be completely in GPU memory while we stream through the other side. We are working on some out-of-core fixes to help with these situations, but either way we are likely to need more memory at some point than we get in the default lease.

We should do the same tasks as with GpuWindowExec.

  • Combine the join with any GpuCoalesceBatchExec nodes that would precede it.
  • Do a high water mark estimation of how much memory will be needed in the worst case to complete the join, given the input sizes.
  • Request a higher lease if needed.
  • Experiment with RMM high water mark tracking to see how good our estimate is, and verify that we are not missing something.
  • Write scale testing to verify that our estimation code does not underestimate the amount of memory needed.
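
To make the estimation and lease steps above concrete, here is a rough sketch in Python, for illustration only. The memory model is an assumption and is not taken from the plugin: it treats the build side, one stream batch, a pair of 32-bit gather maps, and the gathered output as all resident at once, and it bounds the output by the cross product of the two row counts. `SideStats`, `estimate_join_high_water_mark`, and `extra_lease_needed` are hypothetical names.

```python
import math
from dataclasses import dataclass

INT32_BYTES = 4  # assumption: each gather-map entry is a 32-bit row index


@dataclass
class SideStats:
    """Row count and total size in bytes for one side of the join."""
    rows: int
    size_bytes: int

    @property
    def avg_row_bytes(self) -> float:
        return self.size_bytes / self.rows if self.rows else 0.0


def estimate_join_high_water_mark(build, stream_batch, max_output_rows=None):
    """Rough worst-case bytes live on the GPU while joining one stream batch.

    Assumed model: both inputs, two gather maps, and the gathered output
    are all resident at the same time.
    """
    if max_output_rows is None:
        # Worst case for an equi-join: every build row matches every stream row.
        max_output_rows = build.rows * stream_batch.rows
    gather_maps = 2 * INT32_BYTES * max_output_rows
    output = max_output_rows * (build.avg_row_bytes + stream_batch.avg_row_bytes)
    return build.size_bytes + stream_batch.size_bytes + gather_maps + math.ceil(output)


def extra_lease_needed(estimate_bytes, default_lease_bytes):
    """Bytes to request beyond the default lease; 0 if the default is enough."""
    return max(0, estimate_bytes - default_lease_bytes)
```

The cross-product bound is deliberately pessimistic. Comparing it against a measured high water mark, as the fourth task suggests, would show how much it overshoots and whether it misses anything.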

revans2 added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels on Dec 5, 2022
mattahrens added the "reliability" label (Features to improve reliability or bugs that severely impact the reliability of the plugin) and removed the "? - Needs Triage" label on Dec 6, 2022
mattahrens changed the title from "[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use GpuMemoryLeaseManager" to "[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOO retry framework" on Jan 27, 2023
sameerz changed the title from "[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOO retry framework" to "[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOM retry framework" on Feb 18, 2023
@jbrennan333
Collaborator

This issue is being used for the join OOM retry work in 23.04. We will file additional issues for work in future releases.
This covers two OOM failures we were seeing when testing with low memory:
#7930 adds retrying without splits for cases where we were running out of memory on Table.gather calls (JoinGatherer.gatherNext).
#7902 adds retrying without splits for cases where we were running out of memory during SplittableJoinIterator.createGatherer, typically when creating the gather maps.
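
For readers unfamiliar with the pattern, "retrying without splits" can be sketched roughly as below, in Python for illustration only. This is not the plugin's code or API: `GpuRetryOOM`, `with_retry_no_split`, and the `make_room` callback are hypothetical names, and the attempt limit is an assumption. The key point is that the same work is re-run on the same input after memory has been freed, rather than being divided into smaller pieces.

```python
class GpuRetryOOM(MemoryError):
    """Hypothetical signal: a GPU allocation failed but may succeed after freeing memory."""


def with_retry_no_split(attempt, make_room, max_attempts=3):
    """Run attempt() and re-run it unchanged after each GpuRetryOOM.

    make_room() is called between attempts and stands in for whatever frees
    GPU memory (for example, spilling buffers). The input is never split, so
    if the final attempt still fails the error propagates to the caller.
    """
    for attempt_number in range(1, max_attempts + 1):
        try:
            return attempt()
        except GpuRetryOOM:
            if attempt_number == max_attempts:
                raise
            make_room()
```

In this sketch, a gather call or gather-map creation would be wrapped as the `attempt` callable. The callable has to be safe to re-run, which means it must not consume or close its inputs before it succeeds.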
