[BUG] Spark-rapids v0.3.0 pytest integration tests with UCX on FAILED on Yarn cluster #1497

Closed
NvTimLiu opened this issue Jan 12, 2021 · 2 comments
Labels: bug (Something isn't working), P0 (Must have for release)

@NvTimLiu
Collaborator

Describe the bug
Running the v0.3.0 spark-rapids integration tests (pytest) with UCX enabled on a YARN cluster FAILED: src/main/python/cache_test.py always hangs there.

Steps/Code to reproduce bug
• Spark-rapids v0.3.0: https://oss.sonatype.org/content/repositories/comnvidia-1036/com/nvidia/rapids-4-spark_2.12/0.3.0/rapids-4-spark_2.12-0.3.0.jar
• cuDF v0.17 : https://urm.nvidia.com/artifactory/sw-spark-maven/ai/rapids/cudf/0.17/cudf-0.17-cuda10-1.jar
• spark-submit scripts: spark-egx-03:/home/timl/yarn-IT/ucx-submit-yarn.sh
• DOCKER_IMAGE=quay.io/nvidia/spark:abellina_ubuntu18cuda10-1-yarn3-ucx-patch
• Yarn Log: http://spark-egx-03:8088/proxy/application_1603128018386_5631/
#!/bin/bash
set +ex
export SPARK_CONF_DIR=/usr/hdp/current/spark2-client/conf/
export HADOOP_HOME=/usr/hdp/3.1.0.0-78/hadoop
export HADOOP_CONF_DIR=/usr/hdp/3.1.0.0-78/hadoop/conf
export SPARK_HOME=/home/timl/spark-3.0.1-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

export PYSPARK_PYTHON=/usr/bin/python3.6
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6

OS_TYPE=ubuntu18
CUDA_NAME=cuda10.1
CUDA_DOCKER_TAG=${CUDA_NAME/./-}

CUDA_CLASSIFIER=${CUDA_NAME/./-}

FINAL=${CUDA_CLASSIFIER: -2}
if [ "$FINAL" == "-0" ]; then
CUDA_CLASSIFIER=${CUDA_CLASSIFIER%-0}
fi
echo $CUDA_CLASSIFIER

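# Candidate docker images; only the last DOCKER_IMAGE assignment below takes effect.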
DOCKER_IMAGE=quay.io/nvidia/spark:${OS_TYPE}${CUDA_DOCKER_TAG}-yarn3

DOCKER_IMAGE=quay.io/nvidia/spark:${OS_TYPE}cudf17-${CUDA_CLASSIFIER}-udf

DOCKER_IMAGE=quay.io/nvidia/spark:abellina_ubuntu18cuda10-1-yarn3-ucx-patch
DOCKER_IMAGE=quay.io/nvidia/spark:abellina_ubuntu18cuda10-1-yarn3-ucx

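# UCX shuffle setup: RapidsShuffleManager replaces the built-in shuffle manager,
# spark.rapids.shuffle.transport.enabled=true turns on the UCX transport, and the
# spark.executorEnv.UCX_* variables configure UCX inside the executor containers
# (selected transports, IPC/memtype caches, rail count, and network device).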
SUBMIT_ARGS="
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager
--conf spark.shuffle.service.enabled=false
--conf spark.rapids.shuffle.maxMetadataSize=1MB
--conf spark.rapids.shuffle.transport.enabled=true
--conf spark.rapids.shuffle.compression.codec=none
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp
--conf spark.executorEnv.UCX_ERROR_SIGNALS=
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1
--conf spark.executorEnv.UCX_CUDA_IPC_CACHE=y
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n
--conf spark.rapids.shuffle.ucx.bounceBuffers.size=4MB
--conf spark.rapids.shuffle.ucx.bounceBuffers.device.count=32
--conf spark.rapids.shuffle.ucx.bounceBuffers.host.count=32
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib:/usr/lib/ucx
--conf spark.sql.shuffle.partitions=200
--conf spark.executor.extraClassPath=/usr/lib:/usr/lib/ucx:cudf-0.17-${CUDA_CLASSIFIER}.jar:rapids-4-spark_2.12-0.3.0.jar
--conf spark.sql.broadcastTimeout=7200
--conf spark.network.timeout=3600s
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_RUN_PRIVILEGED_CONTAINER=true
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_RUN_PRIVILEGED_CONTAINER=true
--conf spark.executorEnv.UCX_NET_DEVICES=mlx5_3:1
--master yarn --deploy-mode cluster
--num-executors 4
--driver-memory 40G --executor-memory 200G
--conf spark.executor.cores=40 --conf spark.task.cpus=1
--conf spark.sql.files.maxPartitionBytes=4294967296
--conf spark.yarn.maxAppAttempts=1
--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.prefer-pinned=true
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3.6
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro,/etc/hadoop/conf:/etc/hadoop/conf:ro
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro,/etc/hadoop/conf:/etc/hadoop/conf:ro
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3.6
--conf spark.driver.PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6
--jars hdfs:/jars/tim-test/cudf-0.17-${CUDA_CLASSIFIER}.jar,hdfs:/jars/tim-test/rapids-4-spark_2.12-0.3.0.jar
"

cd /home/timl/yarn-IT/jars/integration_tests

spark-submit $SUBMIT_ARGS \
  --archives /home/timl/yarn-IT/jars/integration_tests.zip#sampletests \
  /home/timl/yarn-IT/jars/run-3.6.py src/main/python/cache_test.py -v -rfExXs
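
As a quick sanity check (a hedged sketch, not part of the original report), the same submit line can be re-run with the RAPIDS UCX shuffle transport disabled to confirm the hang is specific to the UCX path; later --conf values override the ones packed into SUBMIT_ARGS.

# Hedged diagnostic sketch reusing the jars and paths above, with the UCX transport disabled.
spark-submit $SUBMIT_ARGS \
  --conf spark.rapids.shuffle.transport.enabled=false \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.SortShuffleManager \
  --archives /home/timl/yarn-IT/jars/integration_tests.zip#sampletests \
  /home/timl/yarn-IT/jars/run-3.6.py src/main/python/cache_test.py -v -rfExXs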

Expected behavior
Tests should PASS.

Environment details (please complete the following information)
Yarn cluster, spark-submit scripts: spark-egx-03:/home/timl/yarn-IT/ucx-submit-yarn.sh

Additional context

@NvTimLiu NvTimLiu added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 12, 2021
@razajafri razajafri self-assigned this Jan 12, 2021
@sameerz sameerz added this to the Jan 4 - Jan 15 milestone Jan 12, 2021
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Jan 12, 2021
@abellina abellina assigned abellina and unassigned razajafri Jan 14, 2021
@NvTimLiu
Collaborator Author

@abellina As PR #1540 has been merged into spark-rapids branch-0.4, should we close this issue and verify the UCX pytests against the latest rapids 0.4.0-SNAPSHOT / cudf-0.18-SNAPSHOT?
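
For that re-verification, a hedged sketch of the jar swap in the script above (the SNAPSHOT jar names are assumptions based on the usual naming, not taken from this issue):

# Hypothetical snapshot artifact names; adjust to the jars actually published.
CUDF_JAR=cudf-0.18-SNAPSHOT-${CUDA_CLASSIFIER}.jar
RAPIDS_JAR=rapids-4-spark_2.12-0.4.0-SNAPSHOT.jar
# ...then replace the 0.17/0.3.0 references in SUBMIT_ARGS:
#   --conf spark.executor.extraClassPath=/usr/lib:/usr/lib/ucx:${CUDF_JAR}:${RAPIDS_JAR}
#   --jars hdfs:/jars/tim-test/${CUDF_JAR},hdfs:/jars/tim-test/${RAPIDS_JAR}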

@abellina
Collaborator

I went ahead and ran the pytests with 0.4/0.18 and I am not seeing issues. Closing.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#1497)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>