[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOM retry framework #7255

revans2 · 2022-12-05T21:20:00Z

Is your feature request related to a problem? Please describe.
This is very similar to #7254

Describe the solution you'd like
For join operators we can run into situations where one or both sides of the join need to be held completely in GPU memory while we stream through the other side. We are working on some out-of-core fixes to help with these situations, but either way, at some point we are likely to need more memory than we get in the default lease.

We should do the same tasks as with GpuWindowExec.

Combine the join with any GpuCoalesceBatchExec nodes that would precede it.
Do a high water mark estimation on how much memory will be needed in the worst case to complete the join, given the input sizes.
Request a higher lease if needed.
Experiment with RMM high water mark tracking to see how good our estimate is, and verify that we are not missing something.
Write scale testing to verify that our estimation code does not underestimate the amount of memory needed.

jbrennan333 · 2023-03-30T15:32:05Z

This issue is being used for the join OOM retry work in 23.04. We will file additional issues for work in future releases.
This covers two OOM failures we were seeing when testing with low memory:
#7930 adds retrying without splits for cases where we were running out of memory on Table.gather calls (JoinGatherer.gatherNext).
#7902 adds retrying without splits for cases where we were running out of memory during SplittableJoinIterator.createGatherer, typically when creating the gather maps.
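
For readers unfamiliar with the pattern, "retrying without splits" can be sketched roughly as follows. This is only an illustration in Python, not the plugin's code: the `RetryOOM` class here is a stand-in for the real exception, and `with_retry_no_split`, `gather_next`, `spill`, and the byte counts are hypothetical names and values made up for the example. The idea is that when the work hits a retryable OOM, some memory is freed and the same work is run again on the same, unsplit input.

```python
class RetryOOM(Exception):
    """Stand-in for a retryable GPU out-of-memory signal (hypothetical)."""


def with_retry_no_split(work, free_memory, max_attempts=3):
    """Run work(); on RetryOOM, free memory and rerun the same work unsplit.

    The input is never divided into smaller pieces. If the final attempt
    still fails, the RetryOOM is propagated to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except RetryOOM:
            if attempt == max_attempts:
                raise
            free_memory()


# Simulated environment: the gather needs 300 bytes, only 100 are free at first.
available = {"bytes": 100}


def gather_next():
    """Pretend gather step that fails until enough memory is available."""
    if available["bytes"] < 300:
        raise RetryOOM("not enough GPU memory for gather")
    return "gathered batch"


def spill():
    """Pretend spill that frees 100 bytes each time it is called."""
    available["bytes"] += 100


result = with_retry_no_split(gather_next, spill)
```

In this toy run the first two attempts fail, each followed by a spill, and the third attempt succeeds. A splitting variant would instead divide the input and retry on the smaller pieces, which is not an option when a step such as building a gather map has to see its whole input.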

revans2 added the feature request and ? - Needs Triage labels Dec 5, 2022

revans2 mentioned this issue Dec 5, 2022

[FEA] Avoid memory over usage on GPU nodes in the SparkPlan #7252

Closed

7 tasks

mattahrens added the reliability label and removed the ? - Needs Triage label Dec 6, 2022

mattahrens changed the title ~~[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use GpuMemoryLeaseManager~~ [FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOO retry framework Jan 27, 2023

mattahrens assigned jbrennan333 Jan 27, 2023

sameerz mentioned this issue Jan 31, 2023

[BUG] INC AFTER CLOSE for ColumnVector during shutdown in the join code #7581

Closed

sameerz changed the title ~~[FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOO retry framework~~ [FEA] Update GpuHashJoin and GpuBroadcastHashJoin to use OOM retry framework Feb 18, 2023

This was referenced Mar 20, 2023

Add oom retry handling for createGatherer in gpu hash joins #7902

Merged

Add OOM Retry handling for join gather next #7930

Merged

This was linked to pull requests Mar 31, 2023

Add oom retry handling for createGatherer in gpu hash joins #7902

Merged

Add OOM Retry handling for join gather next #7930

Merged

revans2 closed this as completed in #7902 Apr 3, 2023

jbrennan333 mentioned this issue Apr 4, 2023

[FEA] Existence Join should handle RetryOOM exceptions #8023

Closed

sameerz removed the feature request label Apr 8, 2023

jbrennan333 mentioned this issue Apr 13, 2023

[FEA] Cross Join should handle RetryOOM exceptions #8097

Closed
