
[BUG] BroadcastExchangeExec fails to fall back to CPU on driver node on GCP Dataproc #975

Closed
sameerz opened this issue Oct 17, 2020 · 3 comments · Fixed by #1057
Labels: bug (Something isn't working) · P0 (Must have for release)

sameerz (Collaborator) commented on Oct 17, 2020

Describe the bug
Running the Mortgage ETL notebook with the sample mortgage data on a GCP Dataproc cluster fails with java.lang.UnsatisfiedLinkError: /tmp/cudf6291133341730615510.so: libnvrtc.so.10.2: cannot open shared object file: No such file or directory when the master node is not a GPU node.

Error message in this gist: https://gist.github.com/sameerz/b1df1130b0955b562a5801d7eb664811

Steps/Code to reproduce bug

  1. Create a new cluster where the master node is CPU only and the workers have both CPUs and GPUs. Sample cluster creation scripts: https://gist.github.com/sameerz/d655ff262a295e66b3384fcbd15b67dd
  2. Launch and run the notebook, following the instructions here: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-gcp.html

@tgravescs pointed out that this is avoidable by setting conf.set('spark.rapids.sql.exec.BroadcastExchangeExec', 'false'). However, this notebook previously worked on Dataproc without that setting.
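For reference, a minimal sketch of how that workaround could look in the notebook's configuration cell (the session setup below is illustrative, not the exact notebook contents):

```python
# Sketch of the workaround, assuming a standard PySpark setup in the notebook.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Keep broadcast exchanges on the CPU so the CPU-only driver never
# tries to build a GPU table and load the cudf native libraries.
conf.set('spark.rapids.sql.exec.BroadcastExchangeExec', 'false')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

This only works around the symptom; per the expected behavior below, the plugin should fall back to the CPU on its own when the driver has no GPU.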

Expected behavior
GPU operations do not run on the master node if the master is CPU only.

Environment details (please complete the following information)

One note: there is an unrelated error in this notebook. The same configuration cell in the notebook needs to start with from pyspark import SparkConf, but that will be addressed in a separate PR.

Additional context
This error occurs when running the 0.2 plugin with cudf 0.15. Something may have changed in the Dataproc environment.

sameerz added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Oct 17, 2020
jlowe (Member) commented on Oct 19, 2020

Looks like it's trying to build an empty GPU table on the driver so it can turn around and serialize out an empty table. Not sure how we didn't hit this before, as this code has been there a while. Looks like this might be triggered in a case where the broadcast table ends up being empty, which I wouldn't expect in the mortgage query.

jlowe (Member) commented on Oct 19, 2020

There might be an easy fix similar to what was done for rapidsai/cudf#5441, where we don't load native libs in HostColumnVector itself but instead defer native library loading to any class used by HostColumnVector that actually needs those libs. HostColumnVector doesn't have any native methods itself.
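The actual change would live in cudf's Java layer; purely as an illustration of the deferred-loading pattern, here is a minimal Python/ctypes sketch (all class and library names are hypothetical, not cudf's API):

```python
# Hypothetical sketch of deferred native-library loading: the host-side class
# has no native methods, so it never triggers the load; only the class that
# actually calls into native code loads the library, and only on first use.
import ctypes

_native_lib = None

def _load_native():
    # Lazy load: nothing happens at import time or when HostColumn is used.
    global _native_lib
    if _native_lib is None:
        _native_lib = ctypes.CDLL("libexample_native.so")  # hypothetical library
    return _native_lib

class HostColumn:
    # Host-memory data holder: safe to create on a CPU-only driver.
    def __init__(self, values):
        self.values = list(values)

class DeviceColumn:
    # Device-side wrapper: constructing one is what triggers the native load.
    def __init__(self, host_column):
        self._lib = _load_native()
        self._host = host_column
```

With this kind of split, building and serializing an empty host-side table on the driver would never touch the native libraries, which is the effect the deferred-loading suggestion is after.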

sameerz added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Oct 20, 2020
aroraakshit (Contributor) commented:

The PR mentioned in this bug (for the SparkConf import) has been filed here: #991

jlowe self-assigned this on Oct 23, 2020
jlowe added this to the Oct 26 - Nov 6 milestone on Oct 23, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023