Documentation updates for branch 0.2 #617

Merged: 5 commits, Aug 28, 2020
27 changes: 14 additions & 13 deletions docs/FAQ.md
@@ -27,8 +27,8 @@ these changes and release updates as quickly as possible.
### Which distributions are supported?

The RAPIDS Accelerator for Apache Spark officially supports
[Apache Spark](get-started/getting-started.md),
[Databricks Runtime 7.0](get-started/getting-started-with-rapids-accelerator-on-databricks.md)
[Apache Spark](get-started/getting-started-on-prem.md),
[Databricks Runtime 7.0](get-started/getting-started-databricks.md)
and [Google Cloud Dataproc](get-started/getting-started-gcp.md).
Most distributions based off of Apache Spark 3.0.0 should work, but because the plugin replaces
parts of the physical plan that Apache Spark considers to be internal the code for those plans
@@ -37,7 +37,7 @@ set up testing and validation on their distributions.

### What is the right hardware setup to run GPU accelerated Spark?

Reference Architectures should be available around Q4 2020.
Reference architectures should be available around Q4 2020.

### What CUDA versions are supported?

@@ -75,7 +75,7 @@ speedup, with a 4x speedup typical. We have seen as high as 100x in some specifi
* Writing Parquet/ORC
* Reading CSV
* Transcoding (reading an input file and doing minimal processing before writing it out again,
possibly in a different format, like CSV to parquet)
possibly in a different format, like CSV to Parquet; see the sketch below)
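
A minimal sketch of the transcoding case above, assuming a cluster where the RAPIDS Accelerator is already configured; the input and output paths are placeholders, not taken from the docs:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: read CSV and write it back out as Parquet with minimal processing.
// Assumes the RAPIDS Accelerator is already configured on the cluster;
// the input and output paths are placeholders.
val spark = SparkSession.builder().appName("csv-to-parquet-transcode").getOrCreate()

val df = spark.read
  .option("header", "true")   // minimal processing: just parse the header row
  .csv("/data/input_csv")

df.write
  .mode("overwrite")
  .parquet("/data/output_parquet")
```

The heavy lifting here is exactly the accelerated pieces listed above: the CSV read and the Parquet write.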

### Are there initialization costs?

@@ -114,8 +114,9 @@ Yes, DPP still works. It might not be as efficient as it could be, and we are w

### Is Adaptive Query Execution (AQE) Supported?

We are in the process of making sure AQE works. Some parts work now, but other parts require some
changes to the internals of Spark, that we are working with the community to be able to support.
In the 0.2 release, AQE is supported, but all exchanges will default to the CPU. As of the 0.3
release, when running on Spark 3.0.1 and higher, any operation that is supported on the GPU will stay on
the GPU when AQE is enabled.
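
As a sketch only, with the caveat that which operations stay on the GPU depends on the Spark and plugin versions described above, AQE is enabled through the standard Spark setting:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a session with the RAPIDS Accelerator plugin and AQE both enabled.
// Whether individual operations stay on the GPU under AQE depends on the
// Spark/plugin versions, as described above.
val spark = SparkSession.builder()
  .appName("aqe-with-rapids")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // RAPIDS Accelerator plugin class
  .config("spark.sql.adaptive.enabled", "true")           // turn on Adaptive Query Execution
  .getOrCreate()
```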

### Are cache and persist supported?

@@ -127,28 +128,28 @@ the Spark community on changes that would allow us to accelerate compression whe
No, that is not currently supported. It would require much larger changes to Apache Spark to be able
to support this.

### Is pyspark supported?
### Is PySpark supported?

Yes

### Are the R APIs for Spark supported?

Yes, but we don't actively test them.

## Are the Java APIs for Spark supported?
### Are the Java APIs for Spark supported?

Yes, but we don't actively test them.

## Are the Scala APIs for Spark supported?
### Are the Scala APIs for Spark supported?

Yes

## Is the GPU needed on the driver? Are there any benefits to having a GPU on the driver?
### Is the GPU needed on the driver? Are there any benefits to having a GPU on the driver?

The GPU is not needed on the driver and there is no benefit to having one available on the driver
for the RAPIDS plugin.

## How does the performance compare to DataBricks' DeltaEngine?
### How does the performance compare to Databricks' Delta Engine?

We have not evaluated the performance yet. Delta Engine is not open source, so any analysis needs to
be done with Databricks in some form. When Delta Engine is generally available and the terms of
@@ -186,7 +187,7 @@ for this issue.
To fix it, you can either disable the IOMMU, or disable the use of pinned memory by setting
[spark.rapids.memory.pinnedPool.size](configs.md#memory.pinnedPool.size) to 0.
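
A hedged sketch of the second workaround; plugin settings like this are normally passed at application launch (for example via `--conf` on `spark-submit`), so the builder form below is just for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: disable the pinned memory pool to work around the IOMMU issue above.
val spark = SparkSession.builder()
  .appName("rapids-without-pinned-pool")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.memory.pinnedPool.size", "0")  // 0 disables pinned memory
  .getOrCreate()
```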

# Is speculative execution supported?
### Is speculative execution supported?

Yes, speculative execution in Spark is fine with the RAPIDS accelerator plugin.

@@ -196,4 +197,4 @@ to see how often task speculation occurs and how often the speculating task (i.e
later) finishes before the slow task that triggered speculation. If the speculating task often
finishes first, that's good; it is working as intended. If many tasks are speculating but the
original task always finishes first, this is a pure loss; the speculation is adding load to
the Spark cluster with no benefit.
the Spark cluster with no benefit.
65 changes: 36 additions & 29 deletions docs/compatibility.md
@@ -177,7 +177,7 @@ For reads when `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `C
between the Julian and Gregorian calendars are wrong, but dates are fine. When
`spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `LEGACY`, however both dates and
timestamps are read incorrectly before the Gregorian calendar transition as described
[here]('https://github.com/NVIDIA/spark-rapids/issues/133).
[here](https://github.com/NVIDIA/spark-rapids/issues/133).

When writing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is currently ignored as described
[here](https://github.com/NVIDIA/spark-rapids/issues/144).
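
For reference, a hedged sketch of how the rebase mode is normally set on a session; note that, per the text above, the GPU write path currently ignores this setting. The SparkSession `spark`, the DataFrame `df`, and the output path are placeholders:

```scala
// Sketch: request proleptic Gregorian (no rebase) dates/timestamps when writing Parquet,
// given an existing SparkSession `spark` and DataFrame `df` (both placeholders).
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
df.write.mode("overwrite").parquet("/data/events_parquet")
```
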
@@ -193,6 +193,13 @@ The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files an
fall back to the CPU when reading an unsupported compression format, and will error out
in that case.

## Regular Expressions
The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard
matches.

If a null character (`'\0'`) is in a string that is being matched by a regular expression, `LIKE` sees it as
the end of the string. This will be fixed in a future release; the issue is tracked [here](https://github.com/NVIDIA/spark-rapids/issues/119).
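
As a hedged illustration of the literal-versus-wildcard distinction, here is a sketch using `regexp_replace`; which expressions actually run on the GPU is governed by the plugin's supported-operations list, not by this snippet:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

val spark = SparkSession.builder().appName("regex-literal-sketch").getOrCreate()
import spark.implicits._

val df = Seq("spark rocks", "rapids").toDF("name")

// A literal search string: the kind of match described as supported above.
val literal = df.withColumn("fixed", regexp_replace(col("name"), "spark", "rapids"))

// A regex with wildcards: the kind of match described as unsupported above.
val wildcard = df.withColumn("fixed", regexp_replace(col("name"), "spa.*", "rapids"))
```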

## Timestamps

Spark stores timestamps internally relative to the JVM time zone. Converting an
@@ -235,18 +242,18 @@ The following formats/patterns are supported on the GPU. Timezone of UTC is assu

| Format or Pattern | Supported on GPU? |
| --------------------- | ----------------- |
| `"yyyy"` | Yes. |
| `"yyyy-[M]M"` | Yes. |
| `"yyyy-[M]M "` | Yes. |
| `"yyyy-[M]M-[d]d"` | Yes. |
| `"yyyy-[M]M-[d]d "` | Yes. |
| `"yyyy-[M]M-[d]d *"` | Yes. |
| `"yyyy-[M]M-[d]d T*"` | Yes. |
| `"epoch"` | Yes. |
| `"now"` | Yes. |
| `"today"` | Yes. |
| `"tomorrow"` | Yes. |
| `"yesterday"` | Yes. |
| `"yyyy"` | Yes |
| `"yyyy-[M]M"` | Yes |
| `"yyyy-[M]M "` | Yes |
| `"yyyy-[M]M-[d]d"` | Yes |
| `"yyyy-[M]M-[d]d "` | Yes |
| `"yyyy-[M]M-[d]d *"` | Yes |
| `"yyyy-[M]M-[d]d T*"` | Yes |
| `"epoch"` | Yes |
| `"now"` | Yes |
| `"today"` | Yes |
| `"tomorrow"` | Yes |
| `"yesterday"` | Yes |

## String to Timestamp

@@ -257,22 +264,22 @@ Casting from string to timestamp currently has the following limitations.

| Format or Pattern | Supported on GPU? |
| ------------------------------------------------------------------- | ------------------|
| `"yyyy"` | Yes. |
| `"yyyy-[M]M"` | Yes. |
| `"yyyy-[M]M "` | Yes. |
| `"yyyy-[M]M-[d]d"` | Yes. |
| `"yyyy-[M]M-[d]d "` | Yes. |
| `"yyyy-[M]M-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1]. |
| `"yyyy-[M]M-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1]. |
| `"[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1]. |
| `"T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [1]. |
| `"epoch"` | Yes. |
| `"now"` | Yes. |
| `"today"` | Yes. |
| `"tomorrow"` | Yes. |
| `"yesterday"` | Yes. |

- [1] The timestamp portion must be complete in terms of hours, minutes, seconds, and
| `"yyyy"` | Yes |
| `"yyyy-[M]M"` | Yes |
| `"yyyy-[M]M "` | Yes |
| `"yyyy-[M]M-[d]d"` | Yes |
| `"yyyy-[M]M-[d]d "` | Yes |
| `"yyyy-[M]M-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [\[1\]](#Footnote1) |
| `"yyyy-[M]M-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [\[1\]](#Footnote1) |
| `"[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [\[1\]](#Footnote1) |
| `"T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]"` | Partial [\[1\]](#Footnote1) |
| `"epoch"` | Yes |
| `"now"` | Yes |
| `"today"` | Yes |
| `"tomorrow"` | Yes |
| `"yesterday"` | Yes |

- <a name="Footnote1"></a>[1] The timestamp portion must be complete in terms of hours, minutes, seconds, and
milliseconds, with 2 digits each for hours, minutes, and seconds, and 6 digits for milliseconds.
Only timezone 'Z' (UTC) is supported. Casting unsupported formats will result in null values.
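
A short, hedged sketch exercising two of the formats above; the sample values are made up, and the full timestamp uses two-digit time fields, a six-digit fraction, and the 'Z' zone as the footnote requires:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("string-to-timestamp-sketch").getOrCreate()
import spark.implicits._

// One date-only value and one complete timestamp with the 'Z' (UTC) zone.
val df = Seq("2020-03-15", "2020-03-15T12:34:56.123456Z").toDF("s")

// Unsupported formats would come back as null, per the note above.
val withTs = df.withColumn("ts", col("s").cast("timestamp"))
withTs.show(truncate = false)
```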

2 changes: 1 addition & 1 deletion docs/dev/README.md
@@ -1,7 +1,7 @@
---
layout: page
title: Developer Overview
nav_order: 8
nav_order: 9
has_children: true
permalink: /developer-overview/
---
9 changes: 6 additions & 3 deletions docs/get-started/getting-started-gcp.md
@@ -22,7 +22,10 @@ gcloud services enable storage-api.googleapis.com
```

After the command line environment is set up, log in to your GCP account. You can now create a Dataproc cluster with the configuration shown below.
The configuration will allow users to run any of the [notebook demos](../demo/GCP) on GCP. Alternatively, users can also start 2*2T4 worker nodes.
The configuration will allow users to run any of the [notebook demos](https://github.com/NVIDIA/spark-rapids/tree/branch-0.2/docs/demo/GCP) on GCP. Alternatively, users can also start two worker nodes with two T4 GPUs each (2*2T4).

The script below will initialize with the following:

* [GPU Driver](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/gpu) and [RAPIDS Accelerator for Apache Spark](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids) through initialization actions (the init action is only available in US region public buckets as of 2020-07-16)
* One 8-core master node and 5 32-core worker nodes
* Four NVIDIA T4 GPUs for each worker node
@@ -32,7 +35,7 @@ The configuration will allow users to run any of the [notebook demos](../demo/GC


```bash
export REGION=[Your Prefer GCP Region]
export REGION=[Your Preferred GCP Region]
export GCS_BUCKET=[Your GCS Bucket]
export CLUSTER_NAME=[Your Cluster Name]
export NUM_GPUS=4
@@ -65,7 +68,7 @@ To use notebooks with a Dataproc cluster, click on the cluster name under the Da

The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use 2016 data as the evaluation set and the rest as a training set, saving to respective GCS locations. Using the default notebook configuration, the first stage should take ~110 seconds (1/3 of the CPU execution time with the same config) and the second stage takes ~170 seconds (1/7 of the CPU execution time with the same config). The notebook depends on the pre-compiled [Spark RAPIDS SQL plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark-parent) and [cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.14), which are pre-downloaded by the GCP Dataproc [RAPIDS init script]().

Once data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in Dataproc's Zeppelin service to execute the training job on the GPU. NVIDIA also ships [Spark XGBoost4j](https://github.com/NVIDIA/spark-xgboost) which is based on [DMLC xgboost](https://github.com/dmlc/xgboost). Precompiled [XGBoost4j]() and [XGBoost4j Spark](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/) libraries can be downloaded from maven. They are pre-downloaded by the GCP [RAPIDS init action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). Since github cannot render a Zeppelin notebook, we prepared a [Jupyter Notebook with Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) for you to view the code content.
Once data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in Dataproc's Zeppelin service to execute the training job on the GPU. NVIDIA also ships [Spark XGBoost4j](https://github.com/NVIDIA/spark-xgboost) which is based on [DMLC xgboost](https://github.com/dmlc/xgboost). Precompiled [XGBoost4j](https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/) and [XGBoost4j Spark](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/) libraries can be downloaded from maven. They are pre-downloaded by the GCP [RAPIDS init action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). Since github cannot render a Zeppelin notebook, we prepared a [Jupyter Notebook with Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) for you to view the code content.

The training time should be around 480 seconds (1/10 of the CPU execution time with the same config). This is shown under the cell below:
```scala
57 changes: 0 additions & 57 deletions docs/get-started/getting-started-menu.md

This file was deleted.
