
[FEA] Investigate how to handle memory explosion #6785

Closed
abellina opened this issue Oct 13, 2022 · 1 comment

abellina commented Oct 13, 2022

Assuming all goes well and we have #6784, tasks should be able to estimate whether they are about to cause an OOM given the expressions they have, the cuDF functions they are about to call, and potentially the input they just materialized from the upstream iterator.

We need to figure out how to handle the case where a task realizes it is about to cause an issue. At that point we should make sure the logging is very clear about our estimate, and we potentially need to flip #6745 on before the call to cuDF to see whether our estimates are accurate: this data should allow us to improve the memory footprint of the plugin and cuDF, help us refine our estimation logic, and likely help us spot fragmentation. There are probably four areas that we should look into at runtime:

  1. split the input: if we know that partial batches are OK to return and that each smaller input on its own will not cause an explosion when calling cuDF, splitting is probably a sensible choice. Note we probably need to set aside batches that are not being used in "the current attempt" and make them spillable. "Split the input" is too simplistic a name for this; in reality we may need to fall back to a different algorithm that processes smaller chunks of memory, like the out-of-core work that already went into aggregates, joins, and sorts. We should also be able to help the existing out-of-core algorithms along and make them smarter in some cases. A strawman sketch of the splitting follows this list.

  2. attempt to ask for more memory (a lease): this implies we have a way to track not just how many tasks are on the GPU, but how much memory each is allowed to use. If a lease is available and we get it, we may proceed with our call while holding other tasks back (see the lease sketch after this list).

  3. if no leases are available, wait or let other tasks run: if the task needs a lease but none is available right now, we may need to either wait and retry, or make ourselves spillable and release everything we have leased. Deciding this optimally is intractable (akin to the halting problem), so we need some sort of heuristic here. We also need to handle the case where a single task has no chance of succeeding: the extreme case where splitting isn't possible, and even if it were, the smallest split we could create would OOM no matter the lease (e.g. the memory requirement exceeds the pool size). The wait branch shows up in the lease sketch after this list.

  4. if we detect an OOM with an estimate that was within bounds, retry: this assumes that all the code we are calling checks that allocations succeed and does the right thing when exceptions are thrown (cuDF, for example, follows the RAII pattern, so the hope is that no leaks come from that part of the code). A retry presumably means we need to ask for a lease on memory; it could also mean we use the memory tracking feature to figure out the maximum allocation that RMM attempted but failed to satisfy. Note that all of this assumes no fragmentation: we are probably going to find cases where RMM says the failed allocation was X while much more than X is free on the GPU. We may need to detect fragmentation here, potentially letting tasks that hold GPU memory finish so the pool can defragment itself. A retry sketch follows the list.
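As a strawman for (1), here is a minimal sketch of recursive splitting on top of the cuDF Java `Table` API. `processWithSplits`, `estimate`, `process`, and the budget are all hypothetical names, not plugin APIs, and the hand-off to the spill framework is only marked as a TODO since it depends on plugin internals:

```scala
import ai.rapids.cudf.{ContiguousTable, Table}

// Hypothetical sketch: recursively split a batch in half until the estimate
// says each piece fits the budget, then process the pieces one at a time.
// In the real plugin the pieces not being worked on would be made spillable.
def processWithSplits(input: Table, budget: Long)
    (estimate: Table => Long)(process: Table => Unit): Unit = {
  if (estimate(input) <= budget || input.getRowCount <= 1) {
    // Either it fits, or it cannot be split further (the extreme case from (3)).
    process(input)
  } else {
    val mid = (input.getRowCount / 2).toInt
    val parts: Array[ContiguousTable] = input.contiguousSplit(mid)
    try {
      parts.foreach { part =>
        // TODO: register `part` with the spill framework until it is its turn
        processWithSplits(part.getTable, budget)(estimate)(process)
      }
    } finally {
      parts.foreach(_.close())
    }
  }
}
```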
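For (2) and (3), a lease could start out as simple bookkeeping against the pool size. This is a hypothetical sketch, not an existing plugin API; the pool size is an assumed constant, and the blocking variant is the "wait" branch of (3):

```scala
// Hypothetical sketch of a memory lease tracker; nothing like this exists in
// the plugin today. It tracks how many bytes tasks are allowed to use so a
// task can ask for extra headroom before a large cuDF call.
object GpuMemoryLeaser {
  private val poolBytes: Long = 8L * 1024 * 1024 * 1024 // assumed pool size
  private var leasedBytes: Long = 0L
  private val lock = new Object

  /** Non-blocking: grant the lease only if it fits in the pool right now. */
  def tryLease(bytes: Long): Boolean = lock.synchronized {
    if (leasedBytes + bytes <= poolBytes) { leasedBytes += bytes; true } else false
  }

  /** Blocking: wait until the lease can be granted (the "wait" branch of (3)). */
  def leaseBlocking(bytes: Long): Unit = lock.synchronized {
    // The hopeless case from (3): no lease could ever satisfy this request.
    require(bytes <= poolBytes, s"request of $bytes exceeds pool of $poolBytes")
    while (leasedBytes + bytes > poolBytes) lock.wait()
    leasedBytes += bytes
  }

  /** Return a lease and wake up waiting tasks. */
  def release(bytes: Long): Unit = lock.synchronized {
    leasedBytes -= bytes
    lock.notifyAll()
  }
}
```

The other branch of (3), making ourselves spillable and yielding, would need cooperation from the spill framework and is omitted here.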
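And for (4), a retry wrapper could combine the lease above with whatever exception the cuDF bindings surface on a failed device allocation. `GpuOomException` is a placeholder because the concrete type depends on the bindings, and the sketch leans on the RAII assumption in (4) that a failed call leaks nothing:

```scala
// Placeholder for whatever the cuDF Java bindings throw on a failed allocation.
class GpuOomException(msg: String) extends RuntimeException(msg)

// Hypothetical sketch: the first attempt runs with no extra headroom; on OOM
// we wait for a lease (see the lease sketch above) and retry, up to a limit.
def withRetryOnOom[T](maxRetries: Int, leaseBytes: Long)(body: => T): T = {
  try {
    body
  } catch {
    case first: GpuOomException =>
      var lastOom = first
      for (_ <- 1 to maxRetries) {
        GpuMemoryLeaser.leaseBlocking(leaseBytes)
        try {
          return body
        } catch {
          case oom: GpuOomException => lastOom = oom
        } finally {
          GpuMemoryLeaser.release(leaseBytes)
        }
      }
      throw lastOom
  }
}
```

Fragmentation detection would slot in around the catch: compare the size RMM failed to allocate against the pool's free bytes, and treat a large gap as a sign to let memory-holding tasks drain rather than retry.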

abellina added the feature request (New feature or request), ? - Needs Triage (Need team to review and classify), and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels Oct 13, 2022
mattahrens removed the ? - Needs Triage label Oct 18, 2022

revans2 commented Dec 5, 2022

#7252 and #7257 lay out the plan to tackle this.

revans2 closed this as completed Dec 5, 2022
sameerz removed the feature request label Dec 6, 2022