
[BUG] Leverage OOM retry framework for ORC writes #7341

Closed · abellina opened this issue Dec 12, 2022 · 1 comment · Fixed by #7972
Labels
bug Something isn't working · reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina (Collaborator)

We see scenarios where ORC writes are hitting OOM, and it would be great to address this as part of the estimation/lease APIs we are looking into (see the retry sketch after the stack trace below).

Here is a sample stack trace of the write (a partitioned write, with the 2GB default batch size):

org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(GpuFileFormatWriter.scala:354)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$write$15(GpuFileFormatWriter.scala:267)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-3-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:121: cudaErrorMemoryAllocation out of memory
	at ai.rapids.cudf.Table.writeORCChunk(Native Method)
	at ai.rapids.cudf.Table.access$1000(Table.java:41)
	at ai.rapids.cudf.Table$ORCTableWriter.write(Table.java:1394)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$2(ColumnarOutputWriter.scala:137)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$2$adapted(ColumnarOutputWriter.scala:134)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.withResource(ColumnarOutputWriter.scala:64)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$1(ColumnarOutputWriter.scala:134)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$1$adapted(ColumnarOutputWriter.scala:133)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.withResource(ColumnarOutputWriter.scala:64)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.writeBatch(ColumnarOutputWriter.scala:133)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.write(ColumnarOutputWriter.scala:99)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.$anonfun$write$5(GpuFileFormatDataWriter.scala:535)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.$anonfun$write$5$adapted(GpuFileFormatDataWriter.scala:455)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.write(GpuFileFormatDataWriter.scala:455)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.write(GpuFileFormatDataWriter.scala:403)
	at org.apache.spark.sql.rapids.GpuFileFormatDataWriter.writeWithIterator(GpuFileFormatDataWriter.scala:87)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$executeTask$1(GpuFileFormatWriter.scala:341)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(GpuFileFormatWriter.scala:348)
	... 9 more
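For illustration only, and not the plugin's actual retry API: a minimal sketch of the kind of wrapper the OOM retry framework implies around the cuDF write call. The names `GpuRetryOOM`, `freeMemory`, and `withOomRetry` are hypothetical; the assumption is that a recoverable GPU allocation failure surfaces as a dedicated exception and that the batch being written is spillable, so the same write can be re-attempted after memory is released.

```scala
// Hypothetical sketch: `GpuRetryOOM`, `freeMemory`, and `withOomRetry`
// are illustrative names, not the plugin's real API.
object OomRetrySketch {
  // Placeholder for the exception the memory framework would throw on a
  // recoverable GPU allocation failure (as opposed to a fatal OOM).
  class GpuRetryOOM(msg: String) extends RuntimeException(msg)

  /**
   * Run `attempt` and, on a recoverable OOM, call `freeMemory` (e.g. spill
   * other buffers or shrink a memory lease) and try again, up to
   * `remaining` additional attempts. Any other failure, or exhausting the
   * retries, propagates to the caller.
   */
  def withOomRetry[T](remaining: Int)(freeMemory: () => Unit)(attempt: () => T): T =
    try {
      attempt()
    } catch {
      case _: GpuRetryOOM if remaining > 0 =>
        freeMemory()
        withOomRetry(remaining - 1)(freeMemory)(attempt)
    }
}
```

A usage sketch inside the columnar writer might look like wrapping the `Table.writeORCChunk` call (via the ORC table writer) so the same spillable batch is re-driven through the writer after memory is freed, e.g. `withOomRetry(3)(() => spillOtherBuffers()) { () => orcWriter.write(table) }`, where `spillOtherBuffers` and `orcWriter` are likewise hypothetical.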
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Dec 12, 2022
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 13, 2022
@mattahrens mattahrens changed the title [BUG] automatic memory-budgeting for ORC writes [BUG] Leverage OOM retry framework for ORC writes Jan 27, 2023
@abellina (Collaborator, Author)

To support this for ORC we need certain guarantees from the API; we filed rapidsai/cudf#12792 with the cuDF team for that.
