
[BUG] BroadcastExchangeExec fails to fall back to CPU on driver node on GCP Dataproc #975

Closed
sameerz opened this issue Oct 17, 2020 · 3 comments · Fixed by #1057
Labels: bug (Something isn't working) · P0 (Must have for release)

sameerz (Collaborator) commented on Oct 17, 2020

Describe the bug
Running the Mortgage ETL notebook with the sample mortgage data on a GCP Dataproc cluster fails with java.lang.UnsatisfiedLinkError: /tmp/cudf6291133341730615510.so: libnvrtc.so.10.2: cannot open shared object file: No such file or directory when the master node is not a GPU node.

Error message in this gist: https://gist.github.com/sameerz/b1df1130b0955b562a5801d7eb664811

Steps/Code to reproduce bug

  1. Create a new cluster where the master node is CPU only and the workers have both CPUs and GPUs. Sample cluster creation scripts: https://gist.github.com/sameerz/d655ff262a295e66b3384fcbd15b67dd
  2. Launch and run the notebook, following the instructions here: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-gcp.html

@tgravescs pointed out that this is avoidable by setting conf.set('spark.rapids.sql.exec.BroadcastExchangeExec', 'false'). However, this notebook previously worked on Dataproc without that setting.
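For reference, a minimal sketch of how that workaround could look in the notebook's configuration cell (the session setup below is illustrative, not the exact notebook contents):

```python
# Sketch of the workaround, assuming a standard PySpark setup in the notebook.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# Keep broadcast exchanges on the CPU so the CPU-only driver never
# tries to build a GPU table and load the cudf native libraries.
conf.set('spark.rapids.sql.exec.BroadcastExchangeExec', 'false')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```

This only works around the symptom; per the expected behavior below, the plugin should fall back to the CPU on its own when the driver has no GPU.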

Expected behavior
GPU operations do not run on the master node if the master is CPU only.

Environment details (please complete the following information)

One note: there is an unrelated error in this notebook. The same configuration cell in the notebook needs to start with from pyspark import SparkConf, but that will be addressed in a separate PR.

Additional context
This error occurs when running the 0.2 plugin with cudf 0.15. Something may have changed in the Dataproc environment.

sameerz added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Oct 17, 2020
jlowe (Member) commented on Oct 19, 2020

Looks like it's trying to build an empty GPU table on the driver so it can turn around and serialize out an empty table. Not sure how we didn't hit this before, as this code has been there a while. Looks like this might be triggered in a case where the broadcast table ends up being empty, which I wouldn't expect in the mortgage query.

jlowe (Member) commented on Oct 19, 2020

There might be an easy fix similar to what was done for rapidsai/cudf#5441, where we don't load native libs in HostColumnVector itself but instead defer native library loading to any class used by HostColumnVector that actually needs those libs. HostColumnVector doesn't have any native methods itself.
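The actual change would live in cudf's Java layer; purely as an illustration of the deferred-loading pattern, here is a minimal Python/ctypes sketch (all class and library names are hypothetical, not cudf's API):

```python
# Hypothetical sketch of deferred native-library loading: the host-side class
# has no native methods, so it never triggers the load; only the class that
# actually calls into native code loads the library, and only on first use.
import ctypes

_native_lib = None

def _load_native():
    # Lazy load: nothing happens at import time or when HostColumn is used.
    global _native_lib
    if _native_lib is None:
        _native_lib = ctypes.CDLL("libexample_native.so")  # hypothetical library
    return _native_lib

class HostColumn:
    # Host-memory data holder: safe to create on a CPU-only driver.
    def __init__(self, values):
        self.values = list(values)

class DeviceColumn:
    # Device-side wrapper: constructing one is what triggers the native load.
    def __init__(self, host_column):
        self._lib = _load_native()
        self._host = host_column
```

With this kind of split, building and serializing an empty host-side table on the driver would never touch the native libraries, which is the effect the deferred-loading suggestion is after.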

sameerz added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) label on Oct 20, 2020
aroraakshit (Contributor) commented:

The PR mentioned in this bug (for the SparkConf import) has been filed here: #991

jlowe self-assigned this on Oct 23, 2020
jlowe added this to the Oct 26 - Nov 6 milestone on Oct 23, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023