
[BUG] src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit at ai.rapids.cudf.Table.concatenate(Native Method) #8021

Open
mtsol opened this issue Apr 4, 2023 · 6 comments

mtsol commented Apr 4, 2023

Describe the bug
This exception occurs after a certain number of executions:

2023-04-04 08:08:52 WARN DAGScheduler:69 - Broadcasting large task binary with size 1811.4 KiB
2023-04-04 08:11:05 WARN TaskSetManager:69 - Lost task 0.0 in stage 443.0 (TID 2841) (10.84.179.52 executor 2): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-1-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit
at ai.rapids.cudf.Table.concatenate(Native Method)
at ai.rapids.cudf.Table.concatenate(Table.java:1635)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$2(GpuKeyBatchingIterator.scala:138)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:64)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:62)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$1(GpuKeyBatchingIterator.scala:123)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.concatPending(GpuKeyBatchingIterator.scala:122)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$3(GpuKeyBatchingIterator.scala:166)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2(GpuKeyBatchingIterator.scala:165)
at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2$adapted(GpuKeyBatchingIterator.scala:162)
at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

Steps/Code to reproduce bug
/u/bin/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --master k8s://https:/k8s-master:6443 --deploy-mode cluster --name app-name --conf spark.local.dir=/y/mcpdata  --conf spark.kubernetes.executor.request.cores=1 --conf spark.executor.cores=1 --conf spark.executor.instances=2 --conf spark.executor.memory=120G --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/opt/spark/bin/fair_example.xml --conf spark.dynamicAllocation.enabled=false --conf spark.executor.heartbeatInterval=3600s --conf spark.network.timeout=36000s --conf spark.sql.broadcastTimeout=36000 --conf spark.driver.memory=70G --conf spark.kubernetes.namespace=default --conf spark.driver.maxResultSize=50g --conf spark.kubernetes.container.image.pullPolicy=Always --conf spark.pyspark.driver.python=python3.8 --conf spark.pyspark.python=python3.8 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark --conf spark.kubernetes.container.image=repo/app:tag --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.mount.path=/y/mcpdata --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.mount.path=/u/bin/pipeline_stages --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.mount.path=/u/bin/evaluation_visualizations --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.options.claimName=fe-visualizations-volume --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.options.claimName=fe-pipeline-stages-volume --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.options.claimName=fe-logs-volume --conf spark.kubernetes.driver.label.driver=driver --conf spark.kubernetes.spec.driver.dnsConfig=default-subdomain  --conf spark.kubernetes.driverEnv.ENV_SERVER=QA --conf spark.executorEnv.ENV_SERVER=QA --conf spark.sql.adaptive.enabled=false --conf spark.plugins=com.nvidia.spark.SQLPlugin --conf spark.kubernetes.executor.podTemplateFile=/tmp/templates/gpu-template.yaml --conf spark.executor.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 --conf spark.rapids.sql.rowBasedUDF.enabled=true --conf spark.rapids.sql.concurrentGpuTasks=1 --conf spark.executor.resource.gpu.vendor=nvidia.com --conf spark.rapids.memory.gpu.oomDumpDir=/y/mcpdata --conf spark.rapids.memory.pinnedPool.size=50g --conf spark.executor.memoryOverhead=25g --conf spark.rapids.sql.batchSizeBytes=32m --conf spark.executor.resource.gpu.discoveryScript=/getGpusResources.sh --conf spark.rapids.sql.explain=ALL --conf spark.rapids.memory.host.spillStorageSize=20g --conf spark.sql.shuffle.partitions=50 --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/y/mcpdata/ --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly=false --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly=false --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp local:///code.py &

Expected behavior
The job should run to completion; instead a cudf.Table concatenation runs out of GPU memory.

Environment details (please complete the following information)

  • Environment location: Kubernetes
  • Spark configuration settings related to the issue: included in the spark-submit command above
  • Data details: 2 million rows with 50 columns (a size-comparable synthetic stand-in is sketched below)
  • Spark processes the data, but after a certain number of stages it breaks.
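
For anyone trying to reproduce at a similar scale, here is a rough sketch that builds a size-comparable synthetic DataFrame (2 million rows, 50 string columns). The real schema, column names, and values are not in the report, so this is only a hypothetical stand-in:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

num_rows = 2_000_000
num_cols = 50

# Start from a single long column and derive 50 string columns from it.
# The real dataset's types and contents are unknown; this only matches size.
df = spark.range(num_rows)
for i in range(num_cols):
    df = df.withColumn(f"c{i}", (col("id") * (i + 1)).cast("string"))
df = df.drop("id")
df.printSchema()
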
mtsol added the '? - Needs Triage' and 'bug' labels on Apr 4, 2023
mtsol changed the title from '[BUG]' to '[BUG] src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit at ai.rapids.cudf.Table.concatenate(Native Method)' on Apr 4, 2023
mattahrens added the 'reliability' label and removed the '? - Needs Triage' label on Apr 4, 2023
revans2 (Collaborator) commented Apr 4, 2023

@mtsol

When discussing this, we were a little confused: does it fail randomly, as if GPU memory is near the limit of what it can support, so that it sometimes works and sometimes fails? Or does it look more like a memory leak, where running X times always works but run X+1 crashes?

We are working on a way to mitigate situations like this: #7778. The goal is to have it in the 23.06 release. If you want to test it sooner, I can see if we can come up with a version you could try out.

mtsol (Author) commented Apr 6, 2023

I would appreciate it if you could provide me something to test sooner.

mtsol (Author) commented Apr 18, 2023

"if this fails randomly like the GPU memory is near the limit on what it can support and some times it works, while other times it fails, or if this looks more like a memory leak where running in X times always works, but X+1 times crashes?"

Answer: In my case it always crashes at X+1. When I have 1.2 million rows in my dataset everything works fine, but when I increase the data it crashes with this error; it also crashes at 1.5 million rows, 2 million rows, and any size in between. I cannot say whether it is related to a memory leak, but from what I observed the error occurs once the data exceeds a certain limit.

PS: I would appreciate it if you could provide a jar prior to the release so I can test whether it works fine with our data.

mtsol (Author) commented Apr 19, 2023

After debugging and analysis, I found in my code that this statement:

df = df.withColumn(self.output_col_name, concat_ws(col_sep, array(self.input_col_name_list)))

was causing the error on larger data on GPUs. I think there is a bug in the GPU optimization of the concat_ws function that needs to be addressed.
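
For context, here is a minimal self-contained sketch of that pattern; the column names, separator, and sample data are hypothetical placeholders for self.input_col_name_list, col_sep, and self.output_col_name, which are not given in the report:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the pipeline's configured names.
input_cols = ["col_a", "col_b", "col_c"]
output_col = "concatenated"
col_sep = "|"

df = spark.createDataFrame([("a", "b", "c"), ("d", "e", "f")], input_cols)
# Same shape as the failing statement: collect the input columns into an
# array column, then join its elements with the separator into one string.
df = df.withColumn(output_col, concat_ws(col_sep, array(*input_cols)))
df.show()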

revans2 (Collaborator) commented Apr 19, 2023

@mtsol thanks for the updated info. concat_ws can be a memory hog, especially if you are not also dropping the input columns after concatenating them together. We are aware that we have some problems with batch sizes when doing a ProjectExec that adds more rows. We have plans to work on this; #7257 is the epic tracking it.

I am guessing that you just removed that line from your query, and that this dropped the total memory pressure at that point in time and for the data processed after it.
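
Along those lines, a hedged sketch of the "drop the inputs after concatenating" suggestion; the column names and data are hypothetical, not the reporter's actual code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, concat_ws

spark = SparkSession.builder.getOrCreate()

# Hypothetical column names and data for illustration only.
input_cols = ["col_a", "col_b", "col_c"]
df = spark.createDataFrame([("a", "b", "c")], input_cols)

df = df.withColumn("concatenated", concat_ws("|", array(*input_cols)))
# If later stages only need the concatenated column, dropping the inputs
# right away keeps the originals from being carried through every batch
# alongside the new, wider string column.
df = df.drop(*input_cols)
df.show()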

revans2 (Collaborator) commented Apr 20, 2023

@mtsol I have a snapshot jar that you can try.

https://drive.google.com/file/d/15RyaI5OyeSJNEj5G-W4MnN8JeyQPq4ff/view?usp=sharing

Be aware that there are some known bugs in it, specifically #8147, which is caused by rapidsai/cudf#13173. It should go without saying, but don't use this in production, and avoid the substring command if you can.

If you want a better version I can upload another one once the issue is fixed.
