
[BUG] src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit at ai.rapids.cudf.Table.concatenate(Native Method) #8021

Open
mtsol opened this issue Apr 4, 2023 · 6 comments

mtsol commented Apr 4, 2023

Describe the bug
This exception occurs after a certain number of executions:

2023-04-04 08:08:52 WARN DAGScheduler:69 - Broadcasting large task binary with size 1811.4 KiB
2023-04-04 08:11:05 WARN TaskSetManager:69 - Lost task 0.0 in stage 443.0 (TID 2841) (10.84.179.52 executor 2): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-1-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit
at ai.rapids.cudf.Table.concatenate(Native Method)
at ai.rapids.cudf.Table.concatenate(Table.java:1635)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$2(GpuKeyBatchingIterator.scala:138)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:64)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:62)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$1(GpuKeyBatchingIterator.scala:123)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.concatPending(GpuKeyBatchingIterator.scala:122)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$3(GpuKeyBatchingIterator.scala:166)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2(GpuKeyBatchingIterator.scala:165)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2$adapted(GpuKeyBatchingIterator.scala:162)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

Steps/Code to reproduce bug
/u/bin/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --master k8s://https:/k8s-master:6443 --deploy-mode cluster --name app-name --conf spark.local.dir=/y/mcpdata  --conf spark.kubernetes.executor.request.cores=1 --conf spark.executor.cores=1 --conf spark.executor.instances=2 --conf spark.executor.memory=120G --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/opt/spark/bin/fair_example.xml --conf spark.dynamicAllocation.enabled=false --conf spark.executor.heartbeatInterval=3600s --conf spark.network.timeout=36000s --conf spark.sql.broadcastTimeout=36000 --conf spark.driver.memory=70G --conf spark.kubernetes.namespace=default --conf spark.driver.maxResultSize=50g --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.pyspark.driver.python=python3.8 --conf spark.pyspark.python=python3.8 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.container.image=repo/app:tag --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.mount.path=/y/mcpdata --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.mount.path=/u/bin/pipeline_stages --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.mount.path=/u/bin/evaluation_visualizations --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.options.claimName=fe-visualizations-volume --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.options.claimName=fe-pipeline-stages-volume --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.options.claimName=fe-logs-volume --conf spark.kubernetes.driver.label.driver=driver --conf spark.kubernetes.spec.driver.dnsConfig=default-subdomain  --conf spark.kubernetes.driverEnv.ENV_SERVER=QA --conf spark.executorEnv.ENV_SERVER=QA --conf spark.sql.adaptive.enabled=false --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.kubernetes.executor.podTemplateFile=/tmp/templates/gpu-template.yaml --conf spark.executor.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 --conf spark.rapids.sql.rowBasedUDF.enabled=true --conf spark.rapids.sql.concurrentGpuTasks=1 --conf spark.executor.resource.gpu.vendor=nvidia.com --conf spark.rapids.memory.gpu.oomDumpDir=/y/mcpdata --conf spark.rapids.memory.pinnedPool.size=50g --conf spark.executor.memoryOverhead=25g --conf spark.rapids.sql.batchSizeBytes=32m --conf spark.executor.resource.gpu.discoveryScript=/getGpusResources.sh --conf spark.rapids.sql.explain=ALL --conf spark.rapids.memory.host.spillStorageSize=20g --conf spark.sql.shuffle.partitions=50 --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/y/mcpdata/ --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly=false --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly=false --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp local:///code.py &

Expected behavior
The job should run to completion; instead a cudf.Table concatenation runs out of GPU memory.

Environment details (please complete the following information)

  • Environment location: Kubernetes
  • Spark configuration settings related to the issue: included in the spark-submit command above
  • Data details: 2 million rows with 50 columns (a size-comparable synthetic stand-in is sketched below)
  • Spark processes the data, but after a certain number of stages it breaks.
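
For anyone trying to reproduce at a similar scale, here is a rough sketch that builds a size-comparable synthetic DataFrame (2 million rows, 50 string columns). The real schema, column names, and values are not in the report, so this is only a hypothetical stand-in:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

num_rows = 2_000_000
num_cols = 50

# Start from a single long column and derive 50 string columns from it.
# The real dataset's types and contents are unknown; this only matches size.
df = spark.range(num_rows)
for i in range(num_cols):
    df = df.withColumn(f"c{i}", (col("id") * (i + 1)).cast("string"))
df = df.drop("id")
df.printSchema()
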
mtsol added the '? - Needs Triage' and 'bug' labels on Apr 4, 2023
mtsol changed the title from '[BUG]' to '[BUG] src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit at ai.rapids.cudf.Table.concatenate(Native Method)' on Apr 4, 2023
mattahrens added the 'reliability' label and removed the '? - Needs Triage' label on Apr 4, 2023
revans2 (Collaborator) commented Apr 4, 2023

@mtsol

When discussing this, we were a little confused: does it fail randomly, as if GPU memory is near the limit of what it can support, so that it sometimes works and sometimes fails? Or does it look more like a memory leak, where running X times always works but run X+1 crashes?

We are working on a way to mitigate situations like this: #7778. The goal is to have it in the 23.06 release. If you want to test it sooner, I can see if we can come up with a version you could try out.

mtsol (Author) commented Apr 6, 2023

I would appreciate it if you could provide me something to test sooner.

mtsol (Author) commented Apr 18, 2023

"if this fails randomly like the GPU memory is near the limit on what it can support and some times it works, while other times it fails, or if this looks more like a memory leak where running in X times always works, but X+1 times crashes?"

Answer: In my case it always crashes at X+1. When I have 1.2 million rows in my dataset everything works fine, but when I increase the data it crashes with this error; it also crashes at 1.5 million rows, 2 million rows, and any size in between. I cannot say whether it is related to a memory leak, but from what I observed the error occurs once the data exceeds a certain limit.

PS: I would appreciate it if you could provide a jar prior to the release so I can test whether it works fine with our data.

mtsol (Author) commented Apr 19, 2023

After debugging and analysis, I found in my code that this statement:

df = df.withColumn(self.output_col_name, concat_ws(col_sep, array(self.input_col_name_list)))

was causing the error on larger data on GPUs. I think there is a bug in the GPU optimization of the concat_ws function that needs to be addressed.
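
For context, here is a minimal self-contained sketch of that pattern; the column names, separator, and sample data are hypothetical placeholders for self.input_col_name_list, col_sep, and self.output_col_name, which are not given in the report:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the pipeline's configured names.
input_cols = ["col_a", "col_b", "col_c"]
output_col = "concatenated"
col_sep = "|"

df = spark.createDataFrame([("a", "b", "c"), ("d", "e", "f")], input_cols)
# Same shape as the failing statement: collect the input columns into an
# array column, then join its elements with the separator into one string.
df = df.withColumn(output_col, concat_ws(col_sep, array(*input_cols)))
df.show()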

revans2 (Collaborator) commented Apr 19, 2023

@mtsol thanks for the updated info. concat_ws can be a memory hog, especially if you are not also dropping the input columns after concatenating them together. We are aware that we have some problems with batch sizes when doing a ProjectExec that adds more rows. We have plans to work on this; #7257 is the epic tracking it.

I am guessing that you just removed that line from your query, and that this dropped the total memory pressure at that point in time and for the data processed after it.
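
Along those lines, a hedged sketch of the "drop the inputs after concatenating" suggestion; the column names and data are hypothetical, not the reporter's actual code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical column names and data for illustration only.
input_cols = ["col_a", "col_b", "col_c"]
df = spark.createDataFrame([("a", "b", "c")], input_cols)

df = df.withColumn("concatenated", concat_ws("|", array(*input_cols)))
# If later stages only need the concatenated column, dropping the inputs
# right away keeps the originals from being carried through every batch
# alongside the new, wider string column.
df = df.drop(*input_cols)
df.show()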

revans2 (Collaborator) commented Apr 20, 2023

@mtsol I have a snapshot jar that you can try.

https://drive.google.com/file/d/15RyaI5OyeSJNEj5G-W4MnN8JeyQPq4ff/view?usp=sharing

Be aware that there are some known bugs in it, specifically #8147, which is caused by rapidsai/cudf#13173. It should go without saying, but don't use this in production, and avoid the substring command if you can.

If you want a better version I can upload another one once the issue is fixed.
