
[BUG] leaks possible in the rapids shuffle if batches are received after the task completes #1177

Closed
abellina opened this issue Nov 20, 2020 · 1 comment · Fixed by #1180
Labels: bug (Something isn't working), P0 (Must have for release), shuffle (things that impact the shuffle plugin)

Comments


abellina commented Nov 20, 2020

See comment below.

Old explanation:

When a query contains a limit, tasks can complete early, and shuffle blocks that are still being requested in flight could be leaked.

Update:

Looking through my logs, I only see cases where we detected this under failure conditions, where the task was closed forcefully. I now think Spark is doing the right thing, even in limit cases, and draining the iterators.


The short-term solution is to make the `RapidsShuffleIterator` reject batches as they show up, which would make the `RapidsShuffleClient` close them on receipt. This is easy to do, but it is wasteful: we would still be waiting on I/O and holding up resources on the peer even though there is no compute left to do for these batches.
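A minimal sketch of that short-term approach (this is not the actual plugin code; aside from the class name, every field and method below is an illustrative assumption):

```scala
import java.util.concurrent.LinkedBlockingQueue
import org.apache.spark.sql.vectorized.ColumnarBatch

// Illustrative sketch only: everything other than the class name is an
// assumption, not the spark-rapids implementation.
class RapidsShuffleIterator {
  @volatile private var taskCompleted = false
  private val pending = new LinkedBlockingQueue[ColumnarBatch]()

  // Registered as a Spark task-completion callback so the iterator knows
  // the task is done even if it never drained all of its batches.
  def onTaskCompletion(): Unit = taskCompleted = true

  // Called by the shuffle client when a batch arrives from a peer.
  // Returns true if the batch was accepted, false if it was rejected
  // (in which case the client should treat it as already closed).
  def batchReceived(batch: ColumnarBatch): Boolean = {
    if (taskCompleted) {
      batch.close() // release the batch on receipt instead of leaking it
      false
    } else {
      pending.put(batch)
      true
    }
  }
}
```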

A longer term solution will be tracked separately.

@abellina added the `bug` and `? - Needs Triage` labels on Nov 20, 2020
@abellina added the `shuffle` label on Nov 20, 2020
@sameerz removed the `? - Needs Triage` label on Nov 24, 2020
@sameerz added this to the Nov 23 - Dec 4 milestone on Nov 24, 2020
@sameerz added the `P0` label on Nov 25, 2020
@abellina changed the title from "[BUG] limit queries could produce leaks in the rapids shuffle" to "[BUG] leaks possible in the rapids shuffle if batches are received after the task completes" on Nov 25, 2020

abellina commented Nov 30, 2020

I was having a hard time reproducing the issue again. The reason looks to be that we have a coalesce-batches node after an exchange. Since we normally prefer larger batches, coalescing can hide the fact that the iterator is requesting more data than will eventually be consumed.

If we set the batch size small, the coalesce/concat stage produces more batches, the LocalLimit nodes reach their limits sooner, and the tasks complete early, so we receive batches after completion.
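A hedged sketch of that repro setup (the config key, value, and table name are assumptions about the setup described above, not a verified reproduction):

```scala
// Force small coalesced batches so the LocalLimit is satisfied while shuffle
// fetches are still in flight. The config key used here is an assumption.
spark.conf.set("spark.rapids.sql.batchSizeBytes", (64 * 1024).toString)

// A plain limit query: LocalLimit can complete the task early, so batches
// requested from peers may arrive after the task has finished.
spark.sql("SELECT * FROM big_table LIMIT 10").collect()
```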

When an ORDER BY is followed by a LIMIT, the plan uses TakeOrderedAndProject. That setup needs to look at all the data, so we drain every batch received from the shuffle, and this bug doesn't apply there.
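For contrast, a query shaped like the following (table and column names are placeholders) plans as TakeOrderedAndProject and drains the shuffle fully:

```scala
// ORDER BY + LIMIT plans as TakeOrderedAndProject, which must examine all the
// data, so every shuffled batch is consumed and nothing arrives after the
// task has completed.
spark.sql("SELECT * FROM big_table ORDER BY some_col LIMIT 10").collect()
```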

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023