[BUG] GPU OOM reported in recently passed CI premerges #9914

Closed
res-life opened this issue Dec 1, 2023 · 5 comments
Assignees: res-life
Labels: bug (Something isn't working)

Comments

res-life (Collaborator) commented Dec 1, 2023

Describe the bug
This is split out from #9829.
Recent CI premerge runs report a GPU OutOfMemory error:

GPU OutOfMemory

even though the CI passes.

Note: this is not related to the non-UTC time zone PR #9719, because that PR was not yet merged when this issue was created.
It is also not related to the non-UTC time zone PR #9773, which only adds a non-UTC xfail mark; the error is reported when testing the UTC time zone.
I believe it is unrelated to the non-UTC PRs and is a pre-existing problem.

Steps/Code to reproduce bug
Refer to the passed CI run for premerge #8591: open Blue Ocean, click Premerge CI 2, and download the log.

[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 1] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 2147483712 after a synchronize. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 2147483712 bytes. Total RMM allocated is 137038848 bytes.
[2023-11-30T06:33:07.315Z] 23/11/30 06:33:07 ERROR Executor: Exception in task 0.0 in stage 22013.0 (TID 80619)
[2023-11-30T06:33:07.315Z] com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split input target batch size to less than 10 MiB
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.GpuParquetScan$.splitTargetBatchSize(GpuParquetScan.scala:266)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.chunkedSplit(GpuParquetScan.scala:1968)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1068)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:457)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:581)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
[2023-11-30T06:33:07.315Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-11-30T06:33:07.315Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2023-11-30T06:33:07.315Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2023-11-30T06:33:07.315Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2023-11-30T06:33:07.315Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-30T06:33:07.315Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-30T06:33:07.316Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-30T06:33:07.316Z] 23/11/30 06:33:07 WARN TaskSetManager: Lost task 0.0 in stage 22013.0 (TID 80619) (premerge-ci-2-jenkins-rapids-premerge-github-8591-mprcs-fphkv executor driver): com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split input target batch size to less than 10 MiB
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.GpuParquetScan$.splitTargetBatchSize(GpuParquetScan.scala:266)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MultiFileParquetPartitionReader.chunkedSplit(GpuParquetScan.scala:1968)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$4(GpuMultiFileReader.scala:1068)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:457)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:581)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:510)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
[2023-11-30T06:33:07.316Z] 	at scala.collection.TraversableOnce$FlattenOps$$anon$2.hasNext(TraversableOnce.scala:469)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.GpuColumnarBatchWithPartitionValuesIterator.hasNext(GpuColumnarBatchIterator.scala:114)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1025)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:474)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$fetchNextBatch$3(GpuColumnarToRowExec.scala:285)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.fetchNextBatch(GpuColumnarToRowExec.scala:284)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:257)
[2023-11-30T06:33:07.316Z] 	at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:301)
[2023-11-30T06:33:07.316Z] 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[2023-11-30T06:33:07.316Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-30T06:33:07.316Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-30T06:33:07.316Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-30T06:33:07.316Z] 
[2023-11-30T06:33:07.316Z] 23/11/30 06:33:07 ERROR TaskSetManager: Task 0 in stage 22013.0 failed 1 times; aborting job
res-life added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Dec 1, 2023
res-life self-assigned this on Dec 4, 2023
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Dec 5, 2023
res-life (Collaborator, Author) commented Dec 6, 2023

After adding -Dai.rapids.refcount.debug=true for debugging, I did not find any GPU memory leak.
However, I found two host memory leaks:
#9974
#9971
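
For context, host-memory leaks like these typically come from HostMemoryBuffer instances that escape the plugin's withResource scope, which is what -Dai.rapids.refcount.debug=true surfaces. Below is a minimal hedged sketch of that scope; the helper name is hypothetical and this is not code from the two issues above.

```scala
// Hedged sketch only: illustrates the withResource pattern that
// -Dai.rapids.refcount.debug=true helps audit. A HostMemoryBuffer that is not
// closed through a scope like this shows up as a leaked reference in the
// debug output. The helper name withScratchHostBuffer is hypothetical.
object HostLeakSketch {
  import ai.rapids.cudf.HostMemoryBuffer
  import com.nvidia.spark.rapids.Arm.withResource

  def withScratchHostBuffer[T](numBytes: Long)(body: HostMemoryBuffer => T): T =
    withResource(HostMemoryBuffer.allocate(numBytes)) { buf =>
      body(buf) // buf.close() runs when this block exits, even on exceptions
    }
}
```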

res-life (Collaborator, Author) commented Dec 27, 2023

This issue is invalid.

The test case test_parquet_testing_error_files reads the Parquet file large_string_map.brotli.parquet, whose uncompressed chunk size is greater than 2 GB.

When the GPU uses the default 1532m in the integration tests:
It reports:

Retrying allocation of 2147483712 after a synchronize

and:

GPU OutOfMemory

(2147483712 bytes > 2 GB; 2147483712 is the chunk size of the Parquet file being read.)
This is correct behavior: it is a genuine out-of-memory, because the total GPU memory is only 1532m.

When I set the GPU memory to 10G:
cuDF throws ai.rapids.cudf.CudfColumnSizeOverflowException.
The retry logic catches it, splits the target batch size in half, and reads again; once the split size drops below 10 MiB, it reports only GPU OutOfMemory, without the "Retrying allocation of 2147483712 after a synchronize" message.
Although the error message is not precise, this is not a bug.
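
To make the split-and-retry sequence above concrete, here is a minimal hedged sketch of the behavior. The object and method names are hypothetical; this is not the plugin's actual GpuParquetScan/RmmRapidsRetryIterator implementation.

```scala
// Hedged sketch only, not the real plugin code. On a cuDF column-size
// overflow the target batch size is halved and the read is retried; once
// halving would drop below 10 MiB the reader gives up and raises the
// "GPU OutOfMemory" error seen in the log above. (The real retry framework
// also splits on retryable OOMs, not only on size overflows.)
import ai.rapids.cudf.CudfColumnSizeOverflowException
import com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM

object SplitRetrySketch {
  private val MinTargetBytes: Long = 10L * 1024 * 1024 // 10 MiB floor

  private def halveTarget(current: Long): Long = {
    val halved = current / 2
    if (halved < MinTargetBytes) {
      throw new GpuSplitAndRetryOOM(
        "GPU OutOfMemory: could not split input target batch size to less than 10 MiB")
    }
    halved
  }

  // Hypothetical retry driver around one chunked-read attempt.
  def readWithSplitRetry[T](initialTargetBytes: Long)(readChunk: Long => T): T = {
    var target = initialTargetBytes
    var result: Option[T] = None
    while (result.isEmpty) {
      try {
        result = Some(readChunk(target))
      } catch {
        case _: CudfColumnSizeOverflowException =>
          target = halveTarget(target) // throws GpuSplitAndRetryOOM below 10 MiB
      }
    }
    result.get
  }
}
```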

res-life (Collaborator, Author) commented:

@abellina, please help check the last comment.

res-life (Collaborator, Author) commented:

@jlowe Do we need to set GPU memory > 2 GB in premerge? We expect CudfColumnSizeOverflowException, but it reports a GPU out-of-memory error when GPU memory is < 2 GB.

The CPU error is:

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.io.compress.BrotliCodec

I do not think we expect this either.

jlowe (Member) commented Dec 28, 2023

Do we need to set GPU memory > 2G in premerge?

No. We expect the GPU load to fail, whether that's due to OOM or size overflow.

We do expect the CPU BrotliCodec error, since we did not configure Brotli support in Spark. The parquet_testing_test programmatically detects all test files and tries each one. The Brotli file is problematic in different ways on the CPU and the GPU, which is why the test just checks for any exception rather than for the same exception type. If desired, we could extend the file detection logic to skip a list of known files we don't really want to test, like this one.
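
For illustration only (the real parquet_testing_test harness lives in the Python integration tests, not in Scala), here is a hedged sketch of the proposed skip list with hypothetical names:

```scala
// Hypothetical sketch of the proposed skip list; the actual harness
// (parquet_testing_test) is a Python integration test, not this code.
object ParquetTestingSkipListSketch {
  import java.io.File

  // Files known to behave differently on CPU vs GPU that we do not want to test.
  val knownSkippedFiles: Set[String] = Set("large_string_map.brotli.parquet")

  def filesToTest(discovered: Seq[File]): Seq[File] =
    discovered.filterNot(f => knownSkippedFiles.contains(f.getName))
}
```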
