
[BUG] OOM error in split and retry with multifile coalesce reader with parquet data #9060

Closed
mattahrens opened this issue Aug 16, 2023 · 2 comments
Labels: bug (Something isn't working), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

@mattahrens
Collaborator

Customer encountered this OOM error:

23/06/29 06:34:56 WARN DeviceMemoryEventHandler: [RETRY 2] Retrying allocation of 12421594616 after a synchronize. Total RMM allocated is 5213155072 bytes.
23/06/29 06:34:56 INFO DeviceMemoryEventHandler: Device allocation of 12421594616 bytes failed, device store has 0 total and 0 spillable bytes. Attempt 2. Total RMM allocated is 5213155072 bytes. 
23/06/29 06:34:56 WARN DeviceMemoryEventHandler: Device store exhausted, unable to allocate 12421594616 bytes. Total RMM allocated is 5213155072 bytes.
23/06/29 06:34:56 ERROR Executor: Exception in task 49.3 in stage 10.0 (TID 560)
com.nvidia.spark.rapids.jni.SplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:410)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:520)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:458)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:275)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:128)
    at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.$anonfun$readBatch$1(GpuMultiFileReader.scala:1156)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.readBatch(GpuMultiFileReader.scala:1130)
    at com.nvidia.spark.rapids.MultiFileCoalescingPartitionReaderBase.next(GpuMultiFileReader.scala:1115)
    at com.nvidia.spark.rapids.PartitionIterator.hasNext(dataSourceUtil.scala:29)
    at com.nvidia.spark.rapids.MetricsBatchIterator.hasNext(dataSourceUtil.scala:46)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1(GpuDataSourceRDD.scala:62)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(GpuDataSourceRDD.scala:62)
    at scala.Option.exists(Option.scala:376)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:62)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.advanceToNextIter(GpuDataSourceRDD.scala:87)
    at com.nvidia.spark.rapids.shims.GpuDataSourceRDD$$anon$1.hasNext(GpuDataSourceRDD.scala:62)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.rapids.GpuFileSourceScanExec$$anon$1.hasNext(GpuFileSourceScanExec.scala:477)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$hasNext$2(aggregate.scala:548)
    at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.GpuHashAggregateIterator.hasNext(aggregate.scala:548)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:317)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:340)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2(RapidsShuffleInternalManagerBase.scala:281)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$2$adapted(RapidsShuffleInternalManagerBase.scala:274)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1(RapidsShuffleInternalManagerBase.scala:274)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.$anonfun$write$1$adapted(RapidsShuffleInternalManagerBase.scala:273)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
    at org.apache.spark.sql.rapids.RapidsShuffleThreadedWriterBase.write(RapidsShuffleInternalManagerBase.scala:273)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:552)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1533)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:555)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
23/06/29 06:34:56 INFO CoarseGrainedExecutorBackend: Got assigned task 761
23/06/29 06:34:56 INFO Executor: Running task 48.3 in stage 10.0 (TID 761)
23/06/29 06:34:56 INFO GpuParquetMultiFilePartitionReaderFactory: Using the coalesce multi-file Parquet reader
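
For context on the error at the top of the trace: the coalescing reader wraps its device allocation in the plugin's retry framework via `withRetryNoSplit` (visible in the `RmmRapidsRetryIterator` frames). The framework retries a failed allocation after synchronizing and spilling, and would normally split the input into smaller pieces before giving up, but the no-split path has nothing to split, so it surfaces `SplitAndRetryOOM`. In the log above, the request (12,421,594,616 bytes) is larger than the device can satisfy even with the store fully drained (0 spillable bytes, ~5.2 GB already allocated), so no amount of retrying at the same size can succeed. A minimal, self-contained sketch of the pattern, using hypothetical stand-in types rather than the actual plugin API:

```scala
// Hypothetical sketch of split-and-retry; the real logic lives in
// com.nvidia.spark.rapids.RmmRapidsRetryIterator and also synchronizes,
// spills, and retries the same input a few times before splitting.
class RetryOOM extends RuntimeException("GPU OutOfMemory: retry")
class SplitAndRetryOOM(msg: String) extends RuntimeException(msg)

// `attempt` does the GPU work; `split` (when available) breaks one input
// into smaller pieces so a failed attempt can run again at a smaller size.
def withRetry[A, B](input: A, split: Option[A => Seq[A]])(attempt: A => B): Seq[B] = {
  var pending: List[A] = List(input)
  val results = List.newBuilder[B]
  while (pending.nonEmpty) {
    val head = pending.head
    try {
      results += attempt(head)
      pending = pending.tail
    } catch {
      case _: RetryOOM =>
        split match {
          // Splittable input: retry on the smaller pieces.
          case Some(s) => pending = s(head).toList ++ pending.tail
          // The withRetryNoSplit case: nothing to split, so give up.
          // This corresponds to the exception in the stack trace above.
          case None => throw new SplitAndRetryOOM(
            "GPU OutOfMemory: could not split inputs and retry")
        }
    }
  }
  results.result()
}
```

The coalescing reader appears to issue one large allocation for the combined footprint of all coalesced files, which is why it ends up on the no-split path here.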
mattahrens added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels on Aug 16, 2023
mattahrens removed the ? - Needs Triage (Need team to review and classify) label on Aug 16, 2023
@mattahrens
Collaborator Author

Initial fix was completed in 23.10: rapidsai/cudf#14079. The next phase is tracked in rapidsai/cudf#14270.
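
For anyone who hits this before picking up those fixes, one configuration-level mitigation (my suggestion, not something proposed in this thread) is to steer the job off the coalescing reader or lower the per-batch read target. Both keys below are documented spark-rapids configs, but verify them against your plugin version:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical mitigation sketch, separate from the cudf fixes above.
val spark = SparkSession.builder()
  // Use the multi-threaded per-file reader instead of COALESCING, so no
  // single read has to coalesce many files into one device allocation.
  .config("spark.rapids.sql.format.parquet.reader.type", "MULTITHREADED")
  // Cap the bytes the reader targets per batch (here 512 MiB).
  .config("spark.rapids.sql.reader.batchSizeBytes", (512L << 20).toString)
  .getOrCreate()
```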

@mattahrens
Collaborator Author

mattahrens commented Aug 5, 2024 via email
