
[FEA] Implement OOM retry framework #7253

Closed
revans2 opened this issue Dec 5, 2022 · 0 comments · Fixed by #7822
Labels
feature request · New feature or request
reliability · Features to improve reliability or bugs that severely impact the reliability of the plugin


revans2 (Collaborator) commented Dec 5, 2022

Is your feature request related to a problem? Please describe.

Currently memory on the GPU is managed mostly by convention and by the GpuSemaphore. The GpuSemaphore allows a configured number of tasks onto the GPU at any one point in time, but it does not explicitly track or hand out memory to those tasks. By convention, different execution paths assume that they can use 4x the target batch size without any issues, and also assume that the input batch size is <= the target batch size. There is also no way for an operation to request more memory if it knows it will use more than is currently available.

Describe the solution you'd like
Create a GpuMemoryLeaseManager (GMLM) or update the GpuSemaphore to provide the following APIs.

def requestLease(tc: TaskContext, amount: Long): MemoryLease
def getTotalLease(tc: TaskContext): Long
def getBaseLease(tc: TaskContext): Long // Not sure if this is needed; getTotalLease is probably good enough.
def returnAllLeases(tc: TaskContext): Unit // release any outstanding leases

MemoryLease would be AutoCloseable and would return the memory to the GMLM when it is closed.
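
To make the proposed surface a little more concrete, a minimal sketch is below. Only the method names above, GpuMemoryLeaseManager, and MemoryLease come from this proposal; the trait layout and the leaseAmount accessor are illustrative assumptions.

import org.apache.spark.TaskContext

// A lease on an amount of GPU memory; closing it returns the memory to the GMLM.
trait MemoryLease extends AutoCloseable {
  def leaseAmount: Long
}

trait GpuMemoryLeaseManager {
  // Block, if needed, until `amount` bytes can be leased to the task.
  def requestLease(tc: TaskContext, amount: Long): MemoryLease
  // Total bytes currently leased to the task, including the base lease.
  def getTotalLease(tc: TaskContext): Long
  // The base lease granted when the GpuSemaphore was acquired.
  def getBaseLease(tc: TaskContext): Long
  // Release any leases still outstanding for the task.
  def returnAllLeases(tc: TaskContext): Unit
}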

The GMLM is an arbitrator. It is not intended to actually allocate any memory, just to reduce the load on the GPU when multiple operations would need more memory than is currently available, for cases like a join or a window operation where today we cannot guarantee that we stay under the 4x batch size limit. The goal is to eventually update all operators so that the limit is not set by convention, but is a value that can change dynamically if needed.

This is not intended to replace the efforts we have made for out of core algorithms. Those are still needed even on very large memory GPUs because CUDF still has column size limitations.

When a SparkPlan node wants to run on the GPU it will ask the GMLM what its current budget is. It will also estimate how much memory it will need to complete the operation at hand. If the memory needed is more than the current lease, another lease for more memory will be requested. In order to make that request the SparkPlan node will need to make sure that all of the memory it is currently using is spillable.
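
A hypothetical operator-side flow, building on the trait sketch above, might look like the following. estimateRequiredMemory, makeAllInputSpillable, and runOperation are placeholder stubs standing in for operator-specific logic, not existing plugin APIs.

import org.apache.spark.TaskContext

object LeaseAwareOperator {
  // Placeholder stubs standing in for operator-specific logic.
  private def estimateRequiredMemory(): Long = 8L * 1024 * 1024 * 1024
  private def makeAllInputSpillable(): Unit = ()
  private def runOperation(): Unit = ()

  def process(gmlm: GpuMemoryLeaseManager, tc: TaskContext): Unit = {
    val needed = estimateRequiredMemory()
    val held = gmlm.getTotalLease(tc)
    if (needed <= held) {
      runOperation()
    } else {
      // Everything this task holds must be spillable before asking for more,
      // because the GMLM may make the task wait while others use the GPU.
      makeAllInputSpillable()
      val extra = gmlm.requestLease(tc, needed - held)
      try {
        runOperation()
      } finally {
        extra.close() // return the additional memory to the GMLM
      }
    }
  }
}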

When the GMLM receives a request and there is enough memory to fulfill it, it should provide a lease to the requesting task for the desired amount of memory, ideally without blocking.

When there are more requests for memory than there is memory to fulfill them, the GMLM will need to decide which tasks should be allowed to continue and which must wait. As this is not a simple problem to solve, for the time being I would propose a FIFO pattern, where the first task to ask for memory is the first task allowed to run once enough memory is available. When a new request comes in that cannot be satisfied, all of that task's previously granted leases are made available to satisfy higher-priority tasks. This is why all of a task's memory must be made spillable before requesting a lease. When a lease is closed, that memory will also be made available to pending tasks in FIFO/priority order. In the future we may have an explicit priority for a task, which would fit in well with this priority-queue model.
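
A single-process sketch of that FIFO arbitration is below. It only shows the queueing; the spill coordination and the per-TaskContext bookkeeping are left out, and FifoLeaseArbiter and poolBytes are made-up names.

import scala.collection.mutable

class FifoLeaseArbiter(poolBytes: Long) {
  private case class Waiter(taskId: Long, amount: Long)
  private var available = poolBytes
  private val waiting = mutable.Queue[Waiter]()

  // Block until `amount` bytes can be granted, strictly in FIFO order.
  def acquire(taskId: Long, amount: Long): Unit = synchronized {
    val me = Waiter(taskId, amount)
    waiting.enqueue(me)
    // A task proceeds only when it is at the head of the queue AND enough
    // memory is free; while it waits its spillable memory can be reclaimed.
    while (!(waiting.head eq me) || available < amount) {
      wait()
    }
    waiting.dequeue()
    available -= amount
    notifyAll() // a new head of the queue may now be able to run
  }

  // Called when a lease is closed; the freed memory goes to pending tasks.
  def release(amount: Long): Unit = synchronized {
    available += amount
    notifyAll()
  }
}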

If the total requested memory is more than the GPU could ever satisfy, then the GMLM should treat the request as if it were asking for the entire GPU, and warn loudly that it is doing so. This is an attempt to let the task succeed on the chance that it overestimated the amount of memory it needs.
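
A sketch of that clamping policy, assuming the arbiter knows the total pool size (the names and the warning text are placeholders):

object LeaseClamp {
  def clampRequest(requested: Long, poolBytes: Long): Long = {
    if (requested > poolBytes) {
      // Warn loudly, but still let the task try with everything the GPU has,
      // on the chance that the estimate was too high.
      System.err.println(s"WARN: lease request of $requested bytes exceeds the " +
        s"$poolBytes bytes the GPU could ever provide; leasing the entire GPU instead")
      poolBytes
    } else {
      requested
    }
  }
}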

A task will automatically request a lease for 4 * the target batch size when it acquires the GpuSemaphore. When the semaphore is released it will also release that original lease. This amount is for backwards compatibility with existing code that has the assumption hard coded. In the future this amount may change, so the GMLM should be queried to see what the amount is, rather than going off of the target batch size.
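
One possible way to tie that base lease to semaphore acquisition is sketched below. BaseLeaseTracker and its method names are made up for illustration, and the actual GpuSemaphore call sites are elided.

import scala.collection.mutable
import org.apache.spark.TaskContext

class BaseLeaseTracker(gmlm: GpuMemoryLeaseManager, targetBatchSize: Long) {
  private val baseLeases = mutable.HashMap[Long, MemoryLease]()

  // Call right after the GpuSemaphore is acquired for this task.
  def onSemaphoreAcquired(tc: TaskContext): Unit = synchronized {
    // Backwards-compatible default: existing code assumes 4x the target batch size.
    baseLeases.getOrElseUpdate(tc.taskAttemptId(),
      gmlm.requestLease(tc, 4L * targetBatchSize))
  }

  // Call when the GpuSemaphore is released; this also returns the base lease.
  def onSemaphoreReleased(tc: TaskContext): Unit = synchronized {
    baseLeases.remove(tc.taskAttemptId()).foreach(_.close())
  }
}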

revans2 added the feature request and ? - Needs Triage labels on Dec 5, 2022
mattahrens added the reliability label and removed the ? - Needs Triage label on Dec 6, 2022
sameerz removed the feature request label on Dec 9, 2022
mattahrens changed the title from "[FEA] Write a GpuMemoryLeaseManager or update the GpuSemaphore to provide that functionality" to "[FEA] Implement OOO retry framework" on Jan 27, 2023
revans2 changed the title from "[FEA] Implement OOO retry framework" to "[FEA] Implement OOM retry framework" on Jan 31, 2023
revans2 linked a pull request on Mar 1, 2023 that will close this issue
mattahrens added the feature request label on Mar 10, 2023