
[BUG] Leverage OOM retry framework for ORC writes #7341

Closed · abellina opened this issue Dec 12, 2022 · 1 comment · Fixed by #7972
Labels
bug Something isn't working · reliability Features to improve reliability or bugs that severely impact the reliability of the plugin

Comments

@abellina (Collaborator)

We see scenarios where ORC writes are hitting OOM, and it would be great to address this as part of the estimation/lease APIs we are looking into (see the retry sketch after the stack trace below).

Here is a sample stack trace of the write (a partitioned write, with the 2GB default batch size):

org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(GpuFileFormatWriter.scala:354)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$write$15(GpuFileFormatWriter.scala:267)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-3-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:121: cudaErrorMemoryAllocation out of memory
	at ai.rapids.cudf.Table.writeORCChunk(Native Method)
	at ai.rapids.cudf.Table.access$1000(Table.java:41)
	at ai.rapids.cudf.Table$ORCTableWriter.write(Table.java:1394)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$2(ColumnarOutputWriter.scala:137)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$2$adapted(ColumnarOutputWriter.scala:134)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.withResource(ColumnarOutputWriter.scala:64)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$1(ColumnarOutputWriter.scala:134)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.$anonfun$writeBatch$1$adapted(ColumnarOutputWriter.scala:133)
	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.withResource(ColumnarOutputWriter.scala:64)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.writeBatch(ColumnarOutputWriter.scala:133)
	at com.nvidia.spark.rapids.ColumnarOutputWriter.write(ColumnarOutputWriter.scala:99)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.$anonfun$write$5(GpuFileFormatDataWriter.scala:535)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.$anonfun$write$5$adapted(GpuFileFormatDataWriter.scala:455)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.write(GpuFileFormatDataWriter.scala:455)
	at org.apache.spark.sql.rapids.GpuDynamicPartitionDataSingleWriter.write(GpuFileFormatDataWriter.scala:403)
	at org.apache.spark.sql.rapids.GpuFileFormatDataWriter.writeWithIterator(GpuFileFormatDataWriter.scala:87)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$executeTask$1(GpuFileFormatWriter.scala:341)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(GpuFileFormatWriter.scala:348)
	... 9 more
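For illustration only, and not the plugin's actual retry API: a minimal sketch of the kind of wrapper the OOM retry framework implies around the cuDF write call. The names `GpuRetryOOM`, `freeMemory`, and `withOomRetry` are hypothetical; the assumption is that a recoverable GPU allocation failure surfaces as a dedicated exception and that the batch being written is spillable, so the same write can be re-attempted after memory is released.

```scala
// Hypothetical sketch: `GpuRetryOOM`, `freeMemory`, and `withOomRetry`
// are illustrative names, not the plugin's real API.
object OomRetrySketch {
  // Placeholder for the exception the memory framework would throw on a
  // recoverable GPU allocation failure (as opposed to a fatal OOM).
  class GpuRetryOOM(msg: String) extends RuntimeException(msg)

  /**
   * Run `attempt` and, on a recoverable OOM, call `freeMemory` (e.g. spill
   * other buffers or shrink a memory lease) and try again, up to
   * `remaining` additional attempts. Any other failure, or exhausting the
   * retries, propagates to the caller.
   */
  def withOomRetry[T](remaining: Int)(freeMemory: () => Unit)(attempt: () => T): T =
    try {
      attempt()
    } catch {
      case _: GpuRetryOOM if remaining > 0 =>
        freeMemory()
        withOomRetry(remaining - 1)(freeMemory)(attempt)
    }
}
```

A usage sketch inside the columnar writer might look like wrapping the `Table.writeORCChunk` call (via the ORC table writer) so the same spillable batch is re-driven through the writer after memory is freed, e.g. `withOomRetry(3)(() => spillOtherBuffers()) { () => orcWriter.write(table) }`, where `spillOtherBuffers` and `orcWriter` are likewise hypothetical.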
@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Dec 12, 2022
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 13, 2022
@mattahrens mattahrens changed the title [BUG] automatic memory-budgeting for ORC writes [BUG] Leverage OOM retry framework for ORC writes Jan 27, 2023
@abellina (Collaborator, Author)

To support this for ORC we need certain guarantees from the API; we filed rapidsai/cudf#12792 with the cuDF team for that.
