
[Doc] Update 22.06 documentation [skip ci] #5641

Merged: 22 commits into NVIDIA:branch-22.06 on Jun 3, 2022

Conversation

viadea (Collaborator) commented May 25, 2022

Fixes #5217

  1. Add download page for 22.06.
    (Some of the features, such as Spark 3.3 support, are not ready yet, so I will add them later once they are merged into the 22.06 branch.)

  2. Address [FEA] Column reordering for columnar write utility #5460

  3. Address [DOC] FAQ should clarify why Spark's Java and R APIs are not tested #5217

  4. Add K8s doc to mention the base CUDA images and their Dockerfiles.

  5. Modify the examples README to point to spark-rapids-examples and spark-rapids-benchmark repos.

  6. Swap two steps in the Alluxio getting-started doc, because you cannot run the mount command before starting the Alluxio cluster.

  7. Some other minor doc updates.

Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
viadea added the documentation (Improvements or additions to documentation) label May 25, 2022
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
viadea requested a review from abellina May 25, 2022 19:09
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
abellina (Collaborator):

@viadea could we add a FAQ entry to say that the ASYNC allocator is on by default, but that for CUDA 11.4 and older drivers we will fall back to ARENA?
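
For context, a minimal sketch (not the FAQ's actual wording) of how the allocator choice surfaces to users; `spark.rapids.memory.gpu.pool` is the plugin's pool setting, and the app name here is illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch, assuming the plugin's documented pool config. In 22.06 the
// ASYNC allocator is the default; on CUDA 11.4 or older drivers the
// plugin falls back to ARENA automatically.
val spark = SparkSession.builder()
  .appName("rapids-allocator-demo")                      // illustrative name
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin") // enable the RAPIDS Accelerator
  .config("spark.rapids.memory.gpu.pool", "ASYNC")       // or "ARENA" to pin the fallback choice
  .getOrCreate()
```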


### Download v22.06.0
* Download the [RAPIDS
Accelerator for Apache Spark 22.06.0 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/22.06.0/rapids-4-spark_2.12-22.06.0.jar)
Member:

Because of this currently bad link, I'd like to see this checked in as late as possible. Otherwise we end up with every PR in the meantime being flagged for a bad link because it's checked in that way.

viadea (Collaborator, Author):

Yes, we can wait a while to merge this PR. My plan is to merge it before the merge request to main, so that future gh-pages update PRs can take it from there.

docs/download.md:
This package is built against CUDA 11.5 and has [CUDA forward
compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) enabled. It is tested
on V100, T4, A2, A10, A30 and A100 GPUs with CUDA 11.0-11.5. For those using other types of GPUs which
do not have CUDA forward compatibility (for example, GeForce), CUDA 11.5 is required. Users will
Member:

Should say "CUDA 11.5 or later is required" here, as CUDA backward compatibility will allow us to run on CUDA versions > 11.5.

viadea (Collaborator, Author):

Changed.

viadea and others added 4 commits May 25, 2022 14:03
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvdia.com>
viadea (Collaborator, Author) commented May 25, 2022

> @viadea could we add a FAQ entry to say that the ASYNC allocator is on by default, but that for CUDA 11.4 and older drivers we will fall back to ARENA?

Added. How about now?

jlowe previously approved these changes May 25, 2022
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
viadea and others added 3 commits May 25, 2022 14:37
Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Comment on lines +75 to +76
* Enable regular expression by default
* Enable some float related configurations by default
Collaborator:

Enabling CSV reads, regular expressions, and floating point operations by default ought to be higher on the list of new features. spark.sql.mapKeyDedupPolicy=LAST_WIN is probably not that important to highlight. Rather, we can highlight features like improved ANSI support and support for Avro reading of primitive types.

viadea (Collaborator, Author):

Refactored the release notes.

BTW: "Avro reading of primitive types" was already added in 22.04.

Collaborator:

Got it, thanks.
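
For reference, a hedged sketch of how a job could opt back out of these new defaults; the config names below are assumptions drawn from the plugin's `spark.rapids.sql.*` namespace and should be checked against the configs doc:

```scala
// Sketch only: revert the 22.06 default-on behavior for one session.
// Config names are assumptions; verify against the plugin's configs doc.
spark.conf.set("spark.rapids.sql.regexp.enabled", "false")           // regular expressions on GPU
spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", "false") // float aggregations on GPU
spark.conf.set("spark.rapids.sql.format.csv.enabled", "false")       // CSV reads on GPU
```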


We suggest reordering the columns needed by the queries and then rewriting the files to make those
columns adjacent. This can help Spark on both CPU and GPU.

Collaborator:

Should we add a comment here about using spark.rapids.sql.format.parquet.reader.footer.type=NATIVE if there are a large number of columns and the data format is Parquet?

Member:

The feature is experimental. Not sure we're ready to widely advertise it yet, but I'd defer to @revans2 on this.

Collaborator:

Fair enough, we can add the note about it in the tuning guide after it is no longer experimental.
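
To make the quoted tuning-guide suggestion concrete, a minimal sketch of the rewrite it describes; the input path, output path, and column names are hypothetical:

```scala
import org.apache.spark.sql.functions.col

// Put the columns the queries touch most often first, keep the remaining
// columns after them, and rewrite the dataset so the hot columns end up
// adjacent on disk. This helps Spark on both CPU and GPU.
val df   = spark.read.parquet("/data/events")         // hypothetical input
val hot  = Seq("user_id", "event_time", "event_type") // hypothetical hot columns
val rest = df.columns.filterNot(hot.contains).toSeq
df.select((hot ++ rest).map(col): _*)
  .write
  .mode("overwrite")
  .parquet("/data/events_reordered")                  // rewrite to a new location
```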

viadea and others added 6 commits May 27, 2022 10:24
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
Signed-off-by: Hao Zhu <hazhu@nvidia.com>
sameerz (Collaborator) commented May 27, 2022

Do we need to update other parts of the documentation where we refer to the cudf jar, such as:

  • getting-started-on-prem.md, which has download instructions for cudf and scripts which reference the cudf jar location
  • generate-init-script.ipynb, which downloads and sets environment variables for the cudf jar
  • getting-started-gcp.md , which mentions "The notebook depends on the pre-compiled Spark RAPIDS SQL plugin and cuDF, which are pre-downloaded by the GCP Dataproc RAPIDS init script."
  • getting-started-kubernetes.md, which has download instructions for cudf and scripts which reference the cudf jar location
  • getting-started-workload-qualification, which has instructions on downloading and using the cudf jar
  • spark-profiling-tool.md, which mentions the cudf jar as a dependency
  • additional-functionality/rapids-shuffle.md, which includes the cudf jar in the classpath
  • dev/nvtx_profiling.md might need updating, as it mentions compiling the cudf jar separately

Signed-off-by: Hao Zhu <hazhu@nvidia.com>
viadea (Collaborator, Author) commented May 31, 2022

> Do we need to update other parts of the documentation where we refer to the cudf jar, such as: […]

@sameerz

I think most of the above were already handled by previous PRs in the 22.06 branch, but I made some updates/fixes in the latest commit:

  1. getting-started-gcp.md -- Fixed.
  2. getting-started-databricks.md -- Removed the words "with RAPIDS and cuDF".
  3. additional-functionality/rapids-shuffle.md -- Fixed.
  4. dev/nvtx_profiling.md -- Since we no longer need to build the cuDF jar ourselves, I just removed the whole step 1.

Regarding "spark-profiling-tool.md", my thought is our profiling tool still needs to print cuDF jar related information based on what version of RAPIDS+CUDF the Spark eventlog was based on. So I keep the example output with cuDF jar info there.

docs/FAQ.md:
@@ -307,11 +307,15 @@ Yes

### Are the R APIs for Spark supported?

- Yes, but we don't actively test them.
+ Yes, but we don't actively test them. It is because the RAPIDS Accelerator hooks into Spark not at
Collaborator:

Suggestion for this text and the Java API text below.

Suggested change:
- Yes, but we don't actively test them. It is because the RAPIDS Accelerator hooks into Spark not at
+ Yes, but we don't actively test them, because the RAPIDS Accelerator hooks into Spark not at
(The same rewording applies to the Java API text below.)

viadea (Collaborator, Author):

Changed both.

Signed-off-by: Hao Zhu <hazhu@nvidia.com>
sameerz previously approved these changes Jun 2, 2022
sameerz (Collaborator) commented Jun 2, 2022

build

Signed-off-by: Hao Zhu <hazhu@nvidia.com>
viadea (Collaborator, Author) commented Jun 2, 2022

Fixed a parameter typo in docs/additional-functionality/rapids-udfs.md:
spark.rapids.python.gpu.enabled -> spark.rapids.sql.python.gpu.enabled

@sameerz would you mind re-approving?
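
For anyone landing here from the typo, a minimal sketch showing the corrected setting in context; everything besides the config name itself is illustrative, and the rest of the pandas-UDF setup from rapids-udfs.md still applies:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the corrected config name, set at session build time.
val spark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.python.gpu.enabled", "true") // corrected from spark.rapids.python.gpu.enabled
  .getOrCreate()
```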

tgravescs (Collaborator):
build

1 similar comment
tgravescs (Collaborator):
build

viadea merged commit ee638d5 into NVIDIA:branch-22.06 on Jun 3, 2022
Labels: documentation (Improvements or additions to documentation)
Successfully merging this pull request may close these issues: [DOC] FAQ should clarify why Spark's Java and R APIs are not tested (#5217)
5 participants