Additional unit tests for GeneratedInternalRowToCudfRowIterator #10087

Merged
merged 5 commits into from
Jan 2, 2024


jbrennan333
Collaborator

Adding some additional tests to GeneratedInternalRowToCudfRowIteratorRetrySuite.
The existing GPU OOM retry tests were actually testing CPU OOMs after the CPU host memory retry changes went in.
Now that we have improved the forced OOM interfaces, we can fix these.
I changed the existing tests to target the correct GPU operations, and added some new tests for the Host Memory allocations.
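
As a rough illustration of how a test meant for GPU OOMs can end up exercising a CPU OOM, here is a toy model. It assumes (the PR does not spell out the mechanism) that an untargeted injection fails whichever allocation happens first, and that a host allocation now precedes the GPU one. Every name in it (`Injector`, `forceOom`, `runOperation`, `MemKind`) is hypothetical and is not the spark-rapids or spark-rapids-jni API.

```scala
// Toy model only: all names are hypothetical, not the real plugin API.
object OomInjectionSketch {
  sealed trait MemKind
  case object Gpu extends MemKind
  case object Host extends MemKind

  final class InjectedOom(val kind: MemKind)
    extends RuntimeException(s"injected $kind OOM")

  // Holds at most one pending injection. A target of None means
  // "fail the next allocation of any kind".
  final class Injector {
    private var pending: Option[Option[MemKind]] = None

    def forceOom(target: Option[MemKind]): Unit = pending = Some(target)

    def onAlloc(kind: MemKind): Unit = pending match {
      case Some(target) if target.forall(_ == kind) =>
        pending = None
        throw new InjectedOom(kind)
      case _ => ()
    }
  }

  // Simulated operation: a host allocation followed by a GPU allocation.
  // Returns the kind of allocation that failed, if any.
  def runOperation(inj: Injector): Option[MemKind] =
    try {
      inj.onAlloc(Host)
      inj.onAlloc(Gpu)
      None
    } catch {
      case e: InjectedOom => Some(e.kind)
    }

  def main(args: Array[String]): Unit = {
    val inj = new Injector
    inj.forceOom(None)          // untargeted: lands on the host allocation
    println(runOperation(inj))  // prints Some(Host)
    inj.forceOom(Some(Gpu))     // targeted: skips the host allocation
    println(runOperation(inj))  // prints Some(Gpu)
  }
}
```

In this model, targeting the injection by memory kind is what lets the GPU tests and the new host memory tests each hit the allocation they intend to.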

Signed-off-by: Jim Brennan <jimb@nvidia.com>
@jbrennan333 added the labels test (Only impacts tests) and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Dec 20, 2023
@jbrennan333 self-assigned this on Dec 20, 2023
@jbrennan333
Collaborator Author

build

@revans2
Collaborator

revans2 commented Dec 26, 2023

Just FYI, some of the tests that were added with this PR are failing in the premerge tests. It appears to be for -Dbuildver=333

[2023-12-20T22:05:23.226Z] - a split and retry when allocating dataBuffer is handled *** FAILED ***
[2023-12-20T22:05:23.227Z]   com.nvidia.spark.rapids.jni.CpuSplitAndRetryOOM: CPU OutOfMemory: could not split inputs and retry
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:460)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:132)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$1(GpuColumnarToRowExec.scala:262)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$1$adapted(GpuColumnarToRowExec.scala:260)
[2023-12-20T22:05:23.227Z]   at scala.Option.foreach(Option.scala:437)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:260)
[2023-12-20T22:05:23.227Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
[2023-12-20T22:05:23.227Z]   ...
[2023-12-20T22:05:24.594Z] - a split and retry when allocating offsetsBuffer is handled *** FAILED ***
[2023-12-20T22:05:24.595Z]   com.nvidia.spark.rapids.jni.CpuSplitAndRetryOOM: CPU OutOfMemory: could not split inputs and retry
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:460)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:132)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$1(GpuColumnarToRowExec.scala:262)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$1$adapted(GpuColumnarToRowExec.scala:260)
[2023-12-20T22:05:24.595Z]   at scala.Option.foreach(Option.scala:437)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:260)
[2023-12-20T22:05:24.595Z]   at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:304)
[2023-12-20T22:05:24.595Z]   ...

@jbrennan333
Collaborator Author

build

revans2 previously approved these changes Dec 28, 2023
@jbrennan333
Collaborator Author

I am having trouble reproducing the premerge failure locally. I can step through the unit tests in the debugger, and the CpuSplitAndRetryOOM appears to be thrown on the correct allocation, but in the premerge run it is clearly being hit on an allocation in ColumnarToRowIterator.loadNextBatch. It doesn't seem to be a count issue, because it happens for both of these unit tests.

@jbrennan333
Collaborator Author

I figured out what was happening here. In premerge, another test (HostAllocSuite) was calling HostAlloc.initialize(), which sets the default allocator in cudf. When I ran this test in isolation, that wasn't happening, so the allocation that hit the exception in premerge was not going through our host memory code.
I added a call to HostAlloc.initialize() in RmmSparkRetrySuiteBase, and verified that the other tests that derive from it still pass.
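
The order dependence described above can be sketched with a toy model. This is an illustration under stated assumptions, not the actual fix: `HostAllocModel`, `firstInjectedAlloc`, and `withBaseSetup` are hypothetical names, and the model simply assumes that a forced OOM lands on the first allocation that goes through the instrumented host memory path.

```scala
// Toy model only: names are hypothetical, not the spark-rapids or cudf API.
object SharedInitSketch {
  // Process-wide state, standing in for a default allocator that is only
  // routed through the instrumented path after initialize() is called.
  object HostAllocModel {
    private var defaultInstrumented = false
    def initialize(): Unit = defaultInstrumented = true
    def reset(): Unit = defaultInstrumented = false
    def isDefaultInstrumented: Boolean = defaultInstrumented
  }

  // Name of the allocation where a forced OOM would land: the first
  // allocation that goes through the instrumented path.
  def firstInjectedAlloc(): String = {
    val allocs = Seq(
      ("earlyDefaultAlloc", HostAllocModel.isDefaultInstrumented),
      ("targetAlloc", true))
    // Safe: targetAlloc is always instrumented in this model.
    allocs.collectFirst { case (name, true) => name }.get
  }

  // Base-suite style setup: initialize shared state before every test
  // instead of relying on whichever suite happened to run earlier.
  def withBaseSetup[T](body: => T): T = {
    HostAllocModel.initialize()
    body
  }

  def main(args: Array[String]): Unit = {
    HostAllocModel.reset()
    println(firstInjectedAlloc())                // prints targetAlloc (run in isolation)
    HostAllocModel.initialize()                  // another suite ran first
    println(firstInjectedAlloc())                // prints earlyDefaultAlloc
    HostAllocModel.reset()
    println(withBaseSetup(firstInjectedAlloc())) // prints earlyDefaultAlloc, regardless of order
  }
}
```

With the setup in the shared base, the test sees the same allocation path whether it runs alone or after other suites, so a local run reproduces what premerge sees.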

@jbrennan333
Collaborator Author

build

@jbrennan333
Collaborator Author

@revans2 this should be good now.

@jbrennan333
Collaborator Author

build

@jbrennan333 jbrennan333 merged commit cd2b78b into NVIDIA:branch-24.02 Jan 2, 2024
38 of 39 checks passed