[BUG] GPU OOM reported in recently passed CI premerges #9914

Closed
res-life opened this issue Dec 1, 2023 · 5 comments
Assignees: res-life
Labels: bug (Something isn't working)

Comments

res-life (Collaborator) commented Dec 1, 2023

Describe the bug
This is split out from #9829.
Recent CI premerge runs report a GPU OutOfMemory error:

GPU OutOfMemory

even though the CI passes.

Note: this is not related to the non-UTC time zone PR #9719, because that PR was not yet merged when this issue was created.
It is also not related to the non-UTC time zone PR #9773, which only adds a non-UTC xfail mark; the error is reported when testing the UTC time zone.
I believe it is unrelated to the non-UTC PRs and is a pre-existing problem.

Steps/Code to reproduce bug
Refer to the passed CI run for premerge #8591: open Blue Ocean, click Premerge CI 2, and download the log.

[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 ERROR Executor: Exception in task 0.0 in stage 22013.0 (TID 80619)
[2023-11-30T06:33:07.315Z] com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split input target batch size to less than 10 MiB
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.GpuParquetScan$.splitTargetBatchSize(GpuParquetScan.scala:266)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.chunkedSplit(GpuParquetScan.scala:1968)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1068)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:457)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:581)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
[2023-11-30T06:33:07.315Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-11-30T06:33:07.315Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2023-11-30T06:33:07.315Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2023-11-30T06:33:07.315Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2023-11-30T06:33:07.315Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-30T06:33:07.316Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-30T06:33:07.316Z] 23/11/30 06:33:07 WARN TaskSetManager: Lost task 0.0 in stage 22013.0 (TID 80619) (premerge-ci-2-jenkins-rapids-premerge-github-8591-mprcs-fphkv executor driver): com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split input target batch size to less than 10 MiB
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.GpuParquetScan$.splitTargetBatchSize(GpuParquetScan.scala:266)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.chunkedSplit(GpuParquetScan.scala:1968)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1068)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:457)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:581)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-11-30T06:33:07.316Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-30T06:33:07.316Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-30T06:33:07.316Z] 
[2023-11-30T06:33:07.316Z] 23/11/30 06:33:07 ERROR TaskSetManager: Task 0 in stage 22013.0 failed 1 times; aborting job
res-life added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Dec 1, 2023
res-life self-assigned this on Dec 4, 2023
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Dec 5, 2023
res-life (Collaborator, Author) commented Dec 6, 2023

After adding -Dai.rapids.refcount.debug=true for debugging, I did not find any GPU memory leak.
However, I found two host memory leaks:
#9974
#9971
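
For context, host-memory leaks like these typically come from HostMemoryBuffer instances that escape the plugin's withResource scope, which is what -Dai.rapids.refcount.debug=true surfaces. Below is a minimal hedged sketch of that scope; the helper name is hypothetical and this is not code from the two issues above.

```scala
// Hedged sketch only: illustrates the withResource pattern that
// -Dai.rapids.refcount.debug=true helps audit. A HostMemoryBuffer that is not
// closed through a scope like this shows up as a leaked reference in the
// debug output. The helper name withScratchHostBuffer is hypothetical.
object HostLeakSketch {
  import ai.rapids.cudf.HostMemoryBuffer
  import com.nvidia.spark.rapids.Arm.withResource

  def withScratchHostBuffer[T](numBytes: Long)(body: HostMemoryBuffer => T): T =
    withResource(HostMemoryBuffer.allocate(numBytes)) { buf =>
      body(buf) // buf.close() runs when this block exits, even on exceptions
    }
}
```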

res-life (Collaborator, Author) commented Dec 27, 2023

This issue is invalid.

The test case test_parquet_testing_error_files reads the Parquet file large_string_map.brotli.parquet, whose uncompressed chunk size is greater than 2 GB.

When the GPU uses the default 1532m in the integration tests:
It reports:

Retrying allocation of 2147483712 after a synchronize

and:

GPU OutOfMemory

(2147483712 bytes > 2 GB; 2147483712 is the chunk size of the Parquet file being read.)
This is correct behavior: it is a genuine out-of-memory, because the total GPU memory is only 1532m.

When I set the GPU memory to 10G:
cuDF throws ai.rapids.cudf.CudfColumnSizeOverflowException.
The retry logic catches it, splits the target batch size in half, and reads again; once the split size drops below 10 MiB, it reports only GPU OutOfMemory, without the "Retrying allocation of 2147483712 after a synchronize" message.
Although the error message is not precise, this is not a bug.
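
To make the split-and-retry sequence above concrete, here is a minimal hedged sketch of the behavior. The object and method names are hypothetical; this is not the plugin's actual GpuParquetScan/RmmRapidsRetryIterator implementation.

```scala
// Hedged sketch only, not the real plugin code. On a cuDF column-size
// overflow the target batch size is halved and the read is retried; once
// halving would drop below 10 MiB the reader gives up and raises the
// "GPU OutOfMemory" error seen in the log above. (The real retry framework
// also splits on retryable OOMs, not only on size overflows.)
import ai.rapids.cudf.CudfColumnSizeOverflowException
import com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM

object SplitRetrySketch {
  private val MinTargetBytes: Long = 10L * 1024 * 1024 // 10 MiB floor

  private def halveTarget(current: Long): Long = {
    val halved = current / 2
    if (halved < MinTargetBytes) {
      throw new GpuSplitAndRetryOOM(
        "GPU OutOfMemory: could not split input target batch size to less than 10 MiB")
    }
    halved
  }

  // Hypothetical retry driver around one chunked-read attempt.
  def readWithSplitRetry[T](initialTargetBytes: Long)(readChunk: Long => T): T = {
    var target = initialTargetBytes
    var result: Option[T] = None
    while (result.isEmpty) {
      try {
        result = Some(readChunk(target))
      } catch {
        case _: CudfColumnSizeOverflowException =>
          target = halveTarget(target) // throws GpuSplitAndRetryOOM below 10 MiB
      }
    }
    result.get
  }
}
```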

res-life (Collaborator, Author) commented:

@abellina, please help check the last comment.

res-life (Collaborator, Author) commented:

@jlowe Do we need to set GPU memory > 2 GB in premerge? We expect CudfColumnSizeOverflowException, but it reports a GPU out-of-memory error when GPU memory is < 2 GB.

The CPU error is:

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.BrotliCodec

I do not think we expect this either.

jlowe (Member) commented Dec 28, 2023

Do we need to set GPU memory > 2G in premerge?

No. We expect the GPU load to fail, whether that's due to OOM or size overflow.

We do expect the CPU BrotliCodec error, since we did not configure Brotli support in Spark. The parquet_testing_test programmatically detects all test files and tries each one. The Brotli file is problematic in different ways on the CPU and the GPU, which is why the test just checks for any exception rather than for the same exception type. If desired, we could extend the file detection logic to skip a list of known files we don't really want to test, like this one.
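
For illustration only (the real parquet_testing_test harness lives in the Python integration tests, not in Scala), here is a hedged sketch of the proposed skip list with hypothetical names:

```scala
// Hypothetical sketch of the proposed skip list; the actual harness
// (parquet_testing_test) is a Python integration test, not this code.
object ParquetTestingSkipListSketch {
  import java.io.File

  // Files known to behave differently on CPU vs GPU that we do not want to test.
  val knownSkippedFiles: Set[String] = Set("large_string_map.brotli.parquet")

  def filesToTest(discovered: Seq[File]): Seq[File] =
    discovered.filterNot(f => knownSkippedFiles.contains(f.getName))
}
```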
