Add a getting started guide on workload qualification [skip ci] #4110

Merged · 37 commits · Nov 22, 2021
docs/get-started/getting-started-workload-qualification.md (new file, 135 additions)
---
layout: page
title: Workload Qualification
nav_order: 7
parent: Getting-Started
---
# Getting Started with Spark Workload Qualification

The RAPIDS Accelerator for Apache Spark does not support all of the features available in Apache Spark.
If you plan to convert an existing Spark workload from CPU to GPU, we highly recommend doing a gap analysis first
to identify any unsupported features such as functions, expressions, data types, or data formats.
The result tells you which workloads are the best fit for the GPU; this process is called "workload qualification" here.

If certain operators cannot run on the GPU due to current limitations, they will fall back to CPU mode.
This fallback may incur a performance overhead because of the host memory <=> GPU memory transfers involved.

This guide walks you through the best practices and the different tools we provide for gap analysis
and workload qualification.

## 1. Qualification and Profiling tools

### Requirements

- Access to Spark event logs from Spark 2.x or 3.x
- Spark 3.0.1+ jars
- `rapids-4-spark-tools` jar

### How to use

If you have old Spark event logs from Spark 2.x or 3.x, you can use the [Qualification tool](../spark-qualification-tool.md)
and the [Profiling tool](../spark-profiling-tool.md) to analyze them.
The Qualification tool outputs a score, a rank and some of the potentially unsupported features for each Spark application.
Its output can help you focus on the top N Spark applications which are SQL-heavy, as shown in the sketch below.
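
For reference, a typical launch of the Qualification tool looks like the following sketch. This is a hedged example: the jar file name, Scala version suffix, and paths are placeholders, so check the Qualification tool doc linked above for the exact invocation for your version.

```bash
# Run the Qualification tool against a directory of Spark event logs.
# It needs the Spark 3.0.1+ jars on the classpath alongside the tools jar.
java -cp /PathTo/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.qualification.QualificationMain \
  /PathTo/spark-event-logs
```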

The Profiling tool outputs SQL plan metrics and can also print the actual query plans to provide more insight.
For example, the Profiling tool output below for a specific Spark application shows that it has a query with a large
`HashAggregate` and a `SortMergeJoin`. Those are indicators of a good candidate for the RAPIDS Accelerator.

```
+--------+-----+------+----------------------------------------------------+-------------+------------------------------------+-------------+----------+
|appIndex|sqlID|nodeID|nodeName |accumulatorId|name |max_value |metricType|
+--------+-----+------+----------------------------------------------------+-------------+------------------------------------+-------------+----------+
|1 |88 |8 |SortMergeJoin |11111 |number of output rows |500000000 |sum |
|1 |88 |9 |HashAggregate |22222 |number of output rows |600000000 |sum |
```
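
The Profiling tool that produces output like the above ships in the same tools jar and is launched the same way; again a sketch with placeholder paths:

```bash
# Run the Profiling tool to get SQL plan metrics and printed query plans.
java -cp /PathTo/rapids-4-spark-tools_2.12-<version>.jar:$SPARK_HOME/jars/* \
  com.nvidia.spark.rapids.tool.profiling.ProfileMain \
  /PathTo/spark-event-logs
```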

Because these two tools only analyze Spark event logs, they cannot provide a fully accurate gap analysis.
However, they are very convenient because you do not need a CPU or GPU Spark cluster to run them.

## 2. Function `explainPotentialGPUPlan`

### Requirements

- A Spark 3.x CPU cluster
- A pair of `rapids-4-spark` and `cudf` jars

### How to use

The RAPIDS Accelerator 21.12 release adds a new function named `explainPotentialGPUPlan` which can help you understand
the potential GPU plan and whether there are any unsupported features, all while still running on a CPU cluster.
Basically, it prints the same log output that the driver produces with `spark.rapids.sql.explain=all` set.

1. In `spark-shell`, add the `rapids-4-spark` and `cudf` jars via the `--jars` option, or put them on the Spark classpath.

For example:

```bash
spark-shell --jars /PathTo/cudf-<version>.jar,/PathTo/rapids-4-spark_<version>.jar
```

2. Verify that the class can be loaded successfully:

```scala
import com.nvidia.spark.rapids.ExplainPlan.explainPotentialGPUPlan
```

3. Enable the Spark RAPIDS parameters needed to support the features your existing workload uses.

For example, if your jobs have `double`/`float`/`decimal` operators together with some Scala UDFs, you can set the
parameters below:

```scala
spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)
spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", true)
spark.conf.set("spark.rapids.sql.decimalType.enabled", true)
spark.conf.set("spark.rapids.sql.castFloatToDecimal.enabled",true)
spark.conf.set("spark.rapids.sql.castDecimalToFloat.enabled",true)
spark.conf.set("spark.rapids.sql.udfCompiler.enabled",true)
```

Please refer to the [config doc](../configs.md) for details on those Spark RAPIDS parameters.

4. Run the function `explainPotentialGPUPlan` on a DataFrame.

For example:

```scala
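// Note: this example assumes a table or temp view `df2` with decimal
// columns `value82` and `value63` is already registered in this session.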
scala> val df_multi=spark.sql("SELECT value82*value63 FROM df2 union SELECT value82+value63 FROM df2")
df_multi: org.apache.spark.sql.DataFrame = [(CAST(value82 AS DECIMAL(9,3)) * CAST(value63 AS DECIMAL(9,3))): decimal(15,5)]

scala> val output=com.nvidia.spark.rapids.ExplainPlan.explainPotentialGPUPlan(df_multi)
scala> println(output)
```

Below are sample driver log messages; the lines starting with `!` indicate the features not supported in this version:

```
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
!Expression <Multiply> (promote_precision(cast(value82#30 as decimal(9,3))) * promote_precision(cast(value63#31 as decimal(9,3)))) cannot run on GPU because The actual output precision of the multiply is too large to fit on the GPU DecimalType(19,6)
```

This log shows you which operators (and on which data types) cannot run on the GPU, and why.
If it mentions a specific Spark RAPIDS parameter which can be turned on to enable that feature, make sure you first
understand the risks and the ramifications of that parameter for your specific application (see the
[config doc](../configs.md)); if it is applicable, enable the parameter and try the tool again.

Since its output is based directly on a specific version of the `rapids-4-spark` jar, the gap analysis is quite accurate.
However, you need a Spark 3.x CPU cluster and the ability to modify your code to add this function call.

## 3. Run Spark applications with Spark RAPIDS Accelerator on a GPU Spark Cluster

### Requirements

- A Spark 3.x GPU cluster
- A pair of `rapids-4-spark` and `cudf` jars

### How to use

Follow the getting-started guides to start a Spark 3.x GPU cluster and run the existing Spark workloads on the GPU
cluster with the parameter `spark.rapids.sql.explain=all` set.
Collect the Spark driver log and check it for not-supported messages.
This is the most accurate way to do the gap analysis.
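
For example, a `spark-shell` launch on the GPU cluster might look like the sketch below. The plugin class is the standard one from the getting-started guides; the jar paths are placeholders for your environment:

```bash
# Run an existing workload on the GPU cluster, logging for every operator
# whether it will run on the GPU and, if not, why. Collect the driver log
# afterwards for the gap analysis.
spark-shell \
  --jars /PathTo/cudf-<version>.jar,/PathTo/rapids-4-spark_<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=all
```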