
Karthik changes #359

Merged (9 commits) on Jul 16, 2020
Conversation

sameerz (Collaborator) commented Jul 15, 2020

1. Please write a description in this text box of the changes that are being made.

Documentation changes by Karthik. Requesting merge to branch-0.2. If merged successfully into 0.2, we will merge to the gh-pages branch for visibility at http://nvidia.github.io/spark-rapids.

2. Please ensure that you have written unit tests for the changes made/features added.

Reviewed locally with `docker run --rm -it --volume="$PWD:/srv/jekyll" -p 4000:4000 sameerz/jekyll:0.1 jekyll serve`.

No code changes.

@sameerz added the documentation label on Jul 15, 2020
@sameerz added this to the Jul 6 - Jul 17 milestone on Jul 15, 2020
sameerz (Collaborator, Author) commented Jul 15, 2020

build

1 similar comment

sameerz (Collaborator, Author) commented Jul 15, 2020

build

revans2 (Collaborator) left a comment

I have two main concerns with this beyond what I have put into the review. The first is maintainability. We have lots of example notebooks that are more or less copies of each other with a few tweaks (paths and, in some cases, configs). I am concerned that if we have to make changes we will miss some of them, especially because the notebooks are in different formats (Zeppelin and Jupyter) and those formats are not human- or git-friendly.

I am also concerned that we are not running any of these notebooks to see if they actually work. They hopefully work right now, but at some point they may stop. Are we going to manually test them with each release to be sure that they still work and do the right thing, e.g. still run everything on the GPU? We have already had one issue where all of the mortgage queries likely need spark.rapids.sql.hasNans set to false because of a bug we found in some floating point aggregation operations. From what I can see, none of the notebooks do this, but it is hard to tell because the setup for each cluster is inconsistent and in some cases (Dataproc) opaque to the point that I don't know whether it actually works.
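For reference, this is roughly what the setting the comment refers to would look like in a notebook cell. A minimal sketch, assuming the config is set on the live session; on a real cluster it is more often passed as `--conf` at cluster or job setup:

```scala
// Tell the RAPIDS Accelerator the data contains no NaNs, so the affected
// floating point aggregations can stay on the GPU. `spark` is the active
// SparkSession in the notebook.
spark.conf.set("spark.rapids.sql.hasNans", "false")
```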

My second concern is about making sure we do the right thing around the mortgage data. RAPIDS has an entire page describing the mortgage data set and stating that they have permission to distribute it in this way: https://rapidsai.github.io/demos/datasets/mortgage-data. We just jump in and point people at the data. I think it would be good if we at least had a link in each of these locations so people understand where the data came from, that we have permission to use it this way, and where they can get the original data.

Resolved review threads on docs/get-started/getting-started-gcp.md (3), docs/index.md (3), and docs/version/stable-release.md (1).
```scala
// Start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = benchmark("train") {
  xgbClassifier.fit(trainSet)
}
```
Collaborator commented:
I am kind of confused by the whole XGBoost thing being in here. I understand that we want to cross-promote, but we don't explain anywhere how to install XGBoost, that it is slightly different from the official one, or where to get it. It is just magic, which I think we should at least explain a bit.
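For context, a hedged sketch of the setup the excerpt assumes: an `XGBoostClassifier` from NVIDIA's GPU build of xgboost4j-spark, which is not the stock upstream artifact. The parameter values, and the names `labelName` and `featureNames`, are illustrative assumptions rather than the notebook's actual code:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Illustrative parameters only; "tree_method" -> "gpu_hist" is what puts
// training on the GPU.
val xgbClassifier = new XGBoostClassifier(Map(
    "tree_method" -> "gpu_hist",
    "objective"   -> "binary:logistic",
    "num_round"   -> 100
  ))
  .setLabelCol(labelName)        // assumed defined earlier in the notebook
  .setFeaturesCols(featureNames) // multi-column API from the NVIDIA GPU fork
```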

Contributor commented:

I have added references in the README for the RAPIDS plugin, cuDF, and XGBoost, and mentioned that they are installed through the init action.

Contributor commented:

Filed a follow-on issue for the XGBoost notebook here: #368


The notebook first transcodes the CSV files into Parquet files and then runs an ETL query to prepare the dataset for training. In the sample notebook, we use the 2016 data as the evaluation set and the rest as the training set, saving each to its respective GCS location.
With the default configuration in the notebook, the first stage should take ~110 seconds (1/3 of the CPU execution time with the same config) and the second stage ~170 seconds (1/6 of the CPU execution time with the same config).
Once the data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in the Dataproc Zeppelin service to execute the training job on the GPU. Since GitHub cannot render Zeppelin notebooks, we have prepared a [Jupyter notebook with the same Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) so you can view the code content.
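For illustration, a minimal sketch of the transcode-and-split step described above, assuming hypothetical GCS paths and a `year` column; the actual notebook's schema, paths, and ETL query differ:

```scala
import org.apache.spark.sql.functions.col

// Transcode raw CSV to Parquet, holding out 2016 as the evaluation set.
val raw = spark.read.option("header", "true").csv("gs://<bucket>/mortgage/csv/")
raw.filter(col("year") =!= 2016)
   .write.mode("overwrite").parquet("gs://<bucket>/mortgage/parquet/train/")
raw.filter(col("year") === 2016)
   .write.mode("overwrite").parquet("gs://<bucket>/mortgage/parquet/eval/")
```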
Collaborator commented:

Why use Zeppelin if we have a Jupyter notebook that does the same thing? Maintaining two is a lot more problematic than maintaining one. What's more, because of how the notebooks work we are actually maintaining a lot of these notebooks, and it feels frustratingly needless.

What's more, the Jupyter notebook does not render for me on GitHub either. I just get a spinner until it gives up and returns an error message:

> Sorry, something went wrong. Reload?

mengdong (Contributor) commented Jul 15, 2020

Sometimes GitHub has issues rendering notebooks. We do need to figure out a way to test notebooks. If we define the Spark context in the notebook, one way to test is:

```
jupyter nbconvert --to script /tmp/test.ipynb --output /tmp/test
spark-submit /tmp/test.py
```

Could we merge the notebook for Dataproc and implement the test later?

Contributor commented:

I filed a follow-on issue here: #367

Resolved review thread on docs/get-started/getting-started-gcp.md.
revans2 (Collaborator) left a comment

The changes look good. I think the rest of the outstanding items can be addressed as follow-on work.

# Getting started with RAPIDS Accelerator on GCP Dataproc
[Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache Spark and Hadoop service. This guide will walk through the steps to show:

* [How to spin up a Dataproc Cluster Accelerated by GPU](getting-started-gcp#how-to-spin-up-a-dataproc-cluster-accelerated-by-gpu)
Collaborator commented:

This can be follow-on work, but the links don't work on GitHub itself; they only appear to work on the final web page. GitHub expects the links to point to .md files; I think the static web site generator fixes that when it produces the web site.
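For example, writing the link as `[How to spin up a Dataproc Cluster Accelerated by GPU](getting-started-gcp.md#how-to-spin-up-a-dataproc-cluster-accelerated-by-gpu)` should render on GitHub while Jekyll's relative-links handling rewrites it for the generated site; this is a suggested form, not the change as merged.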

As data scientists shift from using traditional analytics to leveraging AI applications that better model complex market demands, traditional CPU-based processing can no longer keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost efficiently with GPUs.

The RAPIDS Accelerator for Apache Spark combines the power of the <a href="https://github.com/rapidsai/cudf/">RAPIDS cuDF</a> library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on <a href="https://github.com/openucx/ucx/">UCX</a> that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.
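As an aside for readers, a minimal sketch of enabling the plugin from a Scala session; the config keys are the documented ones, but jar deployment and the UCX shuffle setup are cluster-specific and omitted here:

```scala
import org.apache.spark.sql.SparkSession

// The plugin jars (rapids-4-spark, cudf) must already be on the classpath;
// spark.plugins loads the RAPIDS SQL plugin at application startup.
val spark = SparkSession.builder()
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.enabled", "true") // on by default once loaded
  .getOrCreate()
```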

## Perfomance & Cost Benefits
Rapids Accelerator for Apache Spark reaps the benefit of GPU perfomance while saving infrastructure costs.
![Perf-cost](/img/Perf-cost.png)
Collaborator commented:

Again, this can be a follow-on fix, but this does not work on GitHub.
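A path relative to the page, such as `![Perf-cost](img/Perf-cost.png)`, would render on GitHub as well, assuming the image lives under docs/img/; this is a hypothetical fix, not what was merged.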

@revans2 merged commit 2bb9b51 into NVIDIA:branch-0.2 on Jul 16, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Changes from Karthik

* fixing _config.yml, adding get-started

* fixing _config.yml (removing whitespace)

* Some additional fixes

* remove localhost reference

* updates to index, getting started gcp, perf image, stable release

* remove unused notebook, add description and reference for each cell of the notebook, reference to mortgage data and spark rapids config

* updated examples with mortgage dataset info, databricks guide with executor info and cluster config based on comments

* add reference to rapids plugin, cudf and spark xgboost inn readme

Co-authored-by: Karthikeyan <krajendran@nvidia.com>
Co-authored-by: DougM <mengdong0427@gmail.com>
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>