
[FEA] Investigate how to handle memory explosion #6785

Closed
abellina opened this issue Oct 13, 2022 · 1 comment

abellina commented Oct 13, 2022

Assuming all goes well and we have #6784, tasks should be able to estimate whether they are about to cause an OOM given the expressions they have, the cuDF functions they are about to call, and potentially the input they just materialized from the upstream iterator.

We need to figure out how to handle the case where a task realizes it is about to cause an issue. At that point we should make sure the logging is very clear about our estimate, and we potentially need to flip #6745 on before the call to cuDF to see whether our estimates are accurate: this data should allow us to improve the memory footprint of the plugin and cuDF, help us refine our estimation logic, and likely help us spot fragmentation. There are probably four areas that we should look into at runtime:

  1. split the input: if we know that partial batches are OK to return and that each smaller input on its own will not cause an explosion when calling cuDF, splitting is probably a sensible choice. Note we probably need to set aside batches that are not being used in "the current attempt" and make them spillable. "Split the input" is too simplistic a name for this; in reality we may need to fall back to a different algorithm that processes smaller chunks of memory, like the out-of-core work that already went into aggregates, joins, and sorts. We should also be able to help the existing out-of-core algorithms along and make them smarter in some cases. A strawman sketch of the splitting follows this list.

  2. attempt to ask for more memory (a lease): this implies we have a way to track not just how many tasks are on the GPU, but how much memory each is allowed to use. If a lease is available and we get it, we may proceed with our call while holding other tasks back (see the lease sketch after this list).

  3. if no leases are available, wait or let other tasks run: if the task needs a lease but none is available right now, we may need to either wait and retry, or make ourselves spillable and release everything we have leased. Deciding this optimally is intractable (akin to the halting problem), so we need some sort of heuristic here. We also need to handle the case where a single task has no chance of succeeding: the extreme case where splitting isn't possible, and even if it were, the smallest split we could create would OOM no matter the lease (e.g. the memory requirement exceeds the pool size). The wait branch shows up in the lease sketch after this list.

  4. if we detect an OOM with an estimate that was within bounds, retry: this assumes that all the code we are calling checks that allocations succeed and does the right thing when exceptions are thrown (cuDF, for example, follows the RAII pattern, so the hope is that no leaks come from that part of the code). A retry presumably means we need to ask for a lease on memory; it could also mean we use the memory tracking feature to figure out the maximum allocation that RMM attempted but failed to satisfy. Note that all of this assumes no fragmentation: we are probably going to find cases where RMM says the failed allocation was X while much more than X is free on the GPU. We may need to detect fragmentation here, potentially letting tasks that hold GPU memory finish so the pool can defragment itself. A retry sketch follows the list.
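As a strawman for (1), here is a minimal sketch of recursive splitting on top of the cuDF Java `Table` API. `processWithSplits`, `estimate`, `process`, and the budget are all hypothetical names, not plugin APIs, and the hand-off to the spill framework is only marked as a TODO since it depends on plugin internals:

```scala
import ai.rapids.cudf.{ContiguousTable, Table}

// Hypothetical sketch: recursively split a batch in half until the estimate
// says each piece fits the budget, then process the pieces one at a time.
// In the real plugin the pieces not being worked on would be made spillable.
def processWithSplits(input: Table, budget: Long)
    (estimate: Table => Long)(process: Table => Unit): Unit = {
  if (estimate(input) <= budget || input.getRowCount <= 1) {
    // Either it fits, or it cannot be split further (the extreme case from (3)).
    process(input)
  } else {
    val mid = (input.getRowCount / 2).toInt
    val parts: Array[ContiguousTable] = input.contiguousSplit(mid)
    try {
      parts.foreach { part =>
        // TODO: register `part` with the spill framework until it is its turn
        processWithSplits(part.getTable, budget)(estimate)(process)
      }
    } finally {
      parts.foreach(_.close())
    }
  }
}
```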
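For (2) and (3), a lease could start out as simple bookkeeping against the pool size. This is a hypothetical sketch, not an existing plugin API; the pool size is an assumed constant, and the blocking variant is the "wait" branch of (3):

```scala
// Hypothetical sketch of a memory lease tracker; nothing like this exists in
// the plugin today. It tracks how many bytes tasks are allowed to use so a
// task can ask for extra headroom before a large cuDF call.
object GpuMemoryLeaser {
  private val poolBytes: Long = 8L * 1024 * 1024 * 1024 // assumed pool size
  private var leasedBytes: Long = 0L
  private val lock = new Object

  /** Non-blocking: grant the lease only if it fits in the pool right now. */
  def tryLease(bytes: Long): Boolean = lock.synchronized {
    if (leasedBytes + bytes <= poolBytes) { leasedBytes += bytes; true } else false
  }

  /** Blocking: wait until the lease can be granted (the "wait" branch of (3)). */
  def leaseBlocking(bytes: Long): Unit = lock.synchronized {
    // The hopeless case from (3): no lease could ever satisfy this request.
    require(bytes <= poolBytes, s"request of $bytes exceeds pool of $poolBytes")
    while (leasedBytes + bytes > poolBytes) lock.wait()
    leasedBytes += bytes
  }

  /** Return a lease and wake up waiting tasks. */
  def release(bytes: Long): Unit = lock.synchronized {
    leasedBytes -= bytes
    lock.notifyAll()
  }
}
```

The other branch of (3), making ourselves spillable and yielding, would need cooperation from the spill framework and is omitted here.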
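And for (4), a retry wrapper could combine the lease above with whatever exception the cuDF bindings surface on a failed device allocation. `GpuOomException` is a placeholder because the concrete type depends on the bindings, and the sketch leans on the RAII assumption in (4) that a failed call leaks nothing:

```scala
// Placeholder for whatever the cuDF Java bindings throw on a failed allocation.
class GpuOomException(msg: String) extends RuntimeException(msg)

// Hypothetical sketch: the first attempt runs with no extra headroom; on OOM
// we wait for a lease (see the lease sketch above) and retry, up to a limit.
def withRetryOnOom[T](maxRetries: Int, leaseBytes: Long)(body: => T): T = {
  try {
    body
  } catch {
    case first: GpuOomException =>
      var lastOom = first
      for (_ <- 1 to maxRetries) {
        GpuMemoryLeaser.leaseBlocking(leaseBytes)
        try {
          return body
        } catch {
          case oom: GpuOomException => lastOom = oom
        } finally {
          GpuMemoryLeaser.release(leaseBytes)
        }
      }
      throw lastOom
  }
}
```

Fragmentation detection would slot in around the catch: compare the size RMM failed to allocate against the pool's free bytes, and treat a large gap as a sign to let memory-holding tasks drain rather than retry.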

abellina added the feature request (New feature or request), ? - Needs Triage (Need team to review and classify), and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels Oct 13, 2022
mattahrens removed the ? - Needs Triage label Oct 18, 2022

revans2 commented Dec 5, 2022

#7252 and #7257 lay out the plan to tackle this.

revans2 closed this as completed Dec 5, 2022
sameerz removed the feature request label Dec 6, 2022