[BUG] Code adding buffers to the spillable store should synchronize #1245

Closed
abellina opened this issue Dec 2, 2020 · 1 comment
Labels
bug Something isn't working P0 Must have for release shuffle things that impact the shuffle plugin

Comments


abellina commented Dec 2, 2020

Before inserting buffers into the spillable store, callers should make sure they have synchronized with the GPU.

Update:

After further review it looks like the only issue is adding buffers to the catalog on the write side, after compression. Compression doesn't synchronize at the end, so we could spill a compressed buffer that isn't fully materialized. The fix is therefore limited to the code around GpuPartitioning, where we finish the batch compressor.

The path below with SpillableColumnarBatch is covered by the following facts:

  1. Buffers added to the store have to be contiguously split, and contiguous_split currently synchronizes. Note that this synchronization is a byproduct of other copies that contiguous_split performs, not a documented feature; in general, cudf does not guarantee it.
  2. If the buffer was already contiguously split, it must have come from a shuffle or was previously reconstituted from a spilled buffer. The shuffle is synchronizing on receive, and the reconstituted buffers are synchronously copied back from the host to the device.
  3. If the buffer was compressed, the decompression is asynchronous but that doesn't matter, it happens after a CUDA synchronize.
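The invariant above can be illustrated with a minimal sketch. None of the names below are spark-rapids or cudf APIs; this is a Python analogy in which a background thread stands in for work queued on a CUDA stream, and `synchronize()` stands in for a stream synchronize before the buffer becomes spillable.

```python
# Sketch of the invariant: a buffer must not become spillable until the
# asynchronous work that fills it has been synchronized. All names here
# are illustrative, not real spark-rapids APIs.
import threading
import time

class FakeStream:
    """Stand-in for a CUDA stream: launched work runs asynchronously."""
    def __init__(self):
        self._pending = []

    def launch(self, fn):
        t = threading.Thread(target=fn)
        t.start()
        self._pending.append(t)

    def synchronize(self):
        # Analogous to a stream synchronize: block until queued work is done.
        for t in self._pending:
            t.join()
        self._pending.clear()

class SpillableStore:
    """Stand-in for the spillable store; spill() may run on any thread."""
    def __init__(self):
        self._lock = threading.Lock()
        self._buffers = []

    def add(self, buf):
        with self._lock:
            self._buffers.append(buf)

    def spill(self):
        # The spiller sees whatever bytes are in the buffer *right now*.
        with self._lock:
            return [bytes(b) for b in self._buffers]

stream = FakeStream()
store = SpillableStore()
buf = bytearray(4)          # "device" memory still being written

def compress():
    time.sleep(0.05)        # asynchronous compression still running
    buf[:] = b"done"

stream.launch(compress)
stream.synchronize()        # the fix: sync BEFORE the buffer becomes spillable
store.add(buf)
assert store.spill() == [b"done"]
```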

Previously this seemed to be the issue:

The reason is that as soon as the apply method adds the batch to the catalog, we are at risk of spilling, which can happen on a different thread than the producer of this batch. Our .next() for a ColumnarBatch in a query will very likely produce lazy Java representations of memory that is still being worked on asynchronously on a CUDA stream. Also, given PTDS (per-thread default streams), the producer and the spiller could well be using two different CUDA streams.
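The race described above can be sketched as follows. This is a hypothetical Python analogy, not spark-rapids code: a thread stands in for the producer's CUDA stream, and an event stands in for "all work on that stream has finished".

```python
# Sketch of the producer/spiller race: a spiller that copies before
# synchronizing may capture incomplete bytes. Illustrative names only.
import threading
import time

buf = bytearray(b"????")    # stand-in for device memory being written
done = threading.Event()    # stand-in for the producer's stream completing

def producer():
    time.sleep(0.05)        # asynchronous kernel still writing on its stream
    buf[:] = b"good"
    done.set()

threading.Thread(target=producer).start()

# A spiller on another thread that copies immediately risks stale bytes:
unsafe_copy = bytes(buf)    # very likely still b"????" -- a corrupt spill
done.wait()                 # synchronize with the producer first...
safe_copy = bytes(buf)      # ...and the copied bytes are guaranteed complete
assert safe_copy == b"good"
```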

I see the following uses:

  1. GpuCoalesceBatches
  2. GpuArrowEvalPythonExec
  3. GpuWindowInPandasExecBase

I don't think SpillableColumnarBatch itself should be responsible for this, because you want to be able to synchronize once when making a whole collection of batches spillable.

I don't believe this is a hard problem to solve, but I do think it can corrupt data, since we could issue a memcpy to the host before the device data has fully materialized. Hence, marking this a P1.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify P0 Must have for release labels Dec 2, 2020
@abellina abellina changed the title [BUG] Code using SpillableColumnarBatch should synchronize [BUG] Code adding buffers to the spillable store should synchronize Dec 2, 2020
@abellina abellina added the shuffle things that impact the shuffle plugin label Dec 2, 2020

abellina commented Dec 3, 2020

Closing this issue. I think the next step in general is to move away from synchronizing wherever we can, but that is not specific to this issue: there are many points where we'd like to avoid a stream synchronize if possible, and it will take some design work to account for all the cases.

@abellina abellina closed this as completed Dec 3, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Dec 7, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#1245)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>