From 2ddb450697b041353ae927abf71a9ea7c1ccdbfe Mon Sep 17 00:00:00 2001
From: Alessandro Bellina
Date: Thu, 22 Jul 2021 13:08:34 -0500
Subject: [PATCH] Tweak RAPIDS Shuffle Manager configs for 21.08 (#2994)

* The CDH shuffle manager was missing from the docs, and removes the reference to the Spark 3.2.0 shuffle manager until there is support

Signed-off-by: Alessandro Bellina

* Make UCX configuration clearer, by specifying minimum and recommended settings

* Move UCX_IB_RX_QUEUE_LEN to recommended list. Reorder for clarity

* Review comments: Make the shim part of the package generic
---
 .../rapids-shuffle.md | 49 +++++++++++++------
 1 file changed, 34 insertions(+), 15 deletions(-)

diff --git a/docs/additional-functionality/rapids-shuffle.md b/docs/additional-functionality/rapids-shuffle.md
index 1104887550e..548736aaa7f 100644
--- a/docs/additional-functionality/rapids-shuffle.md
+++ b/docs/additional-functionality/rapids-shuffle.md
@@ -290,23 +290,42 @@ In this section, we are using a docker container built using the sample dockerfi
     | 3.0.3 | com.nvidia.spark.rapids.spark303.RapidsShuffleManager |
     | 3.0.4 | com.nvidia.spark.rapids.spark304.RapidsShuffleManager |
     | 3.1.1 | com.nvidia.spark.rapids.spark311.RapidsShuffleManager |
+    | 3.1.1 CDH | com.nvidia.spark.rapids.spark311cdh.RapidsShuffleManager |
     | 3.1.2 | com.nvidia.spark.rapids.spark312.RapidsShuffleManager |
     | 3.1.3 | com.nvidia.spark.rapids.spark313.RapidsShuffleManager |
-    | 3.2.0 | com.nvidia.spark.rapids.spark320.RapidsShuffleManager |
-
-2. Recommended settings for UCX 1.10.1+
-```shell
-...
---conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager \
---conf spark.shuffle.service.enabled=false \
---conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \
---conf spark.executorEnv.UCX_ERROR_SIGNALS= \
---conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
---conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
---conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
---conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
---conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}
-```
+
+2. Settings for UCX 1.10.1+:
+
+    Minimum configuration:
+
+    ```shell
+    ...
+    --conf spark.shuffle.manager=com.nvidia.spark.rapids.[shim package].RapidsShuffleManager \
+    --conf spark.shuffle.service.enabled=false \
+    --conf spark.dynamicAllocation.enabled=false \
+    --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
+    --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
+    --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n
+    ```
+
+    Recommended configuration:
+
+    ```shell
+    ...
+    --conf spark.shuffle.manager=com.nvidia.spark.rapids.[shim package].RapidsShuffleManager \
+    --conf spark.shuffle.service.enabled=false \
+    --conf spark.dynamicAllocation.enabled=false \
+    --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
+    --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
+    --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
+    --conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
+    --conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp \
+    --conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
+    --conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1
+    ```
+
+Please replace `[shim package]` with the appropriate value. For example, the full class name for
+Apache Spark 3.1.3 is: `com.nvidia.spark.rapids.spark313.RapidsShuffleManager`.
 
 Please note `LD_LIBRARY_PATH` should optionally be set if the UCX library is installed in a
 non-standard location.
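
Reviewer note on `[shim package]`: the rows in the table touched by this patch all follow one pattern, `spark` plus the Spark version with dots removed, plus a vendor suffix such as `cdh`. The sketch below shows one way a launch script might derive the class name and list the minimum settings from the patch. The function names `shuffle_manager_class` and `minimum_shuffle_confs` are hypothetical and not part of the plugin. The helper does not check that a shim exists for the requested version, so the table in the docs remains the source of truth.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; they are not shipped with the plugin.

# Build the RapidsShuffleManager class name from a Spark version ("3.1.3")
# and an optional vendor suffix ("cdh"). Assumes the naming pattern seen in
# the docs table; it does NOT verify that the shim is actually supported.
shuffle_manager_class() {
  local version="$1"
  local vendor="${2:-}"
  # "3.1.3" -> "313", then prefix with "spark" and append the vendor suffix.
  local shim="spark${version//./}${vendor}"
  echo "com.nvidia.spark.rapids.${shim}.RapidsShuffleManager"
}

# Print the minimum configuration from the patch as key=value lines,
# one per line, so a caller can turn each into a "--conf" argument.
minimum_shuffle_confs() {
  local manager_class="$1"
  printf '%s\n' \
    "spark.shuffle.manager=${manager_class}" \
    "spark.shuffle.service.enabled=false" \
    "spark.dynamicAllocation.enabled=false" \
    "spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR}" \
    "spark.executorEnv.UCX_ERROR_SIGNALS=" \
    "spark.executorEnv.UCX_MEMTYPE_CACHE=n"
}

# Example: show the class name and settings for Apache Spark 3.1.3.
minimum_shuffle_confs "$(shuffle_manager_class 3.1.3)"
```

A caller could read these lines into an array and pass each one to `spark-submit` behind its own `--conf`. Emitting plain `key=value` pairs keeps the quoting of the empty `UCX_ERROR_SIGNALS=` value in one place.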