When a query has a LIMIT, tasks can complete early, and shuffle blocks that are still being requested in flight can be leaked.
Update:
Looking through my logs, I only see cases where we detected this under failure conditions, where the task was closed forcefully. I now think Spark is doing the right thing, even in limit cases, and draining the iterators.
The short-term solution is to make the RapidsShuffleIterator reject batches as they show up, which would make the RapidsShuffleClient close them on receipt. This can easily be done, but it is wasteful: we still wait for the IO and hold up resources on the peer even though there is no compute to be done for these batches.
A longer-term solution will be tracked separately.
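A minimal sketch of that short-term idea, assuming a hypothetical handoff point between the client and the iterator (the `ShuffleBatchSink` name and its methods are illustrative, not the actual spark-rapids classes):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch (not the actual spark-rapids API): once the task
// completes, arriving batches are refused and closed immediately instead
// of being queued, so their device memory is not leaked.
class ShuffleBatchSink {
  private val taskCompleted = new AtomicBoolean(false)

  // Wired to the Spark task-completion listener.
  def markTaskCompleted(): Unit = taskCompleted.set(true)

  // Called by the client for each batch arriving off the transport.
  // Returns true if accepted; false means the batch was closed here.
  def offer(batch: AutoCloseable): Boolean = {
    if (taskCompleted.get()) {
      batch.close() // no consumer left; free the resources now
      false
    } else {
      // hand the batch to the iterator's queue (elided)
      true
    }
  }
}
```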
abellina changed the title from "[BUG] limit queries could produce leaks in the rapids shuffle" to "[BUG] leaks possible in the rapids shuffle if batches are received after the task completes" on Nov 25, 2020
I was having a hard time reproducing the issue again. The reason appears to be that we have a coalesce-batches node after an exchange. Since we normally like to get larger batches, the coalescing can hide the fact that the iterator is requesting more data than will eventually be consumed.
If we set the target batch size small, the concat produces more batches, the LocalLimit nodes see their limits reached sooner, the tasks complete early, and we receive batches after completion.
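A repro sketch along those lines, run in a spark-shell with the RAPIDS shuffle manager configured (the `spark.rapids.sql.batchSizeBytes` target-batch-size config, the arbitrary threshold value, and the exact plan shape are assumptions about the setup):

```scala
import org.apache.spark.sql.functions.col

// Shrink the target batch size so the coalesce node emits many small
// batches; a plain LIMIT can then satisfy LocalLimit and complete the
// task while shuffle blocks are still being fetched.
spark.conf.set("spark.rapids.sql.batchSizeBytes", "16384")

val df = spark.range(0, 100000000L)
  .withColumn("k", col("id") % 1000)
  .repartition(col("k")) // forces an exchange through the shuffle
  .limit(10)             // LocalLimit lets the task finish early

df.collect()
```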
When you have an ORDER BY followed by a LIMIT, the plan uses TakeOrderedAndProject. That setup needs to look at all the data, so we drain the batches received from the shuffle, and this bug does not apply.
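For contrast, a quick way to see the two plan shapes (a sketch; assumes a registered table `t`, and exact plans vary by Spark version):

```scala
// Plain LIMIT: plans with LocalLimit/GlobalLimit, so a task can stop
// consuming the shuffle iterator early; this is the shape that can
// leak in-flight batches.
spark.sql("SELECT * FROM t LIMIT 10").explain()

// ORDER BY + LIMIT: plans as TakeOrderedAndProject, which must see all
// of the shuffled data, so the iterator is fully drained.
spark.sql("SELECT * FROM t ORDER BY id LIMIT 10").explain()
```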