Karthik changes #359
Conversation
I have two main concerns with this beyond what I have put into the review. The first is maintainability. We have lots of example notebooks that are more or less copies of each other with a few tweaks (paths and, in some cases, configs). I am concerned that if we do have to make changes we are going to miss some of them, especially because the notebooks are in different formats (Zeppelin and Jupyter) and those formats are not human/git friendly. I am also concerned that we are not running any of these notebooks to see if they actually work. They hopefully work right now, but at some point they may stop. Are we going to manually test these with each release to be sure that they still work and do the right thing, like still running everything on the GPU? We have already had one issue where all of the mortgage queries likely need to have spark.rapids.sql.hasNans set to false because of a bug that we found in some floating point aggregation operations. From what I can see, none of the notebooks are doing this, but it is hard to tell because the setup for each cluster is inconsistent and in some cases (Dataproc) opaque to the point that I don't know if it actually works.
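For reference, a hedged sketch of how the workaround mentioned above could be applied in a cluster's Spark configuration; the property name comes from the comment, but where it is set (spark-defaults.conf, `--conf`, or the notebook's session setup) depends on each cluster and is an assumption here:

```
# Illustrative only: disable NaN-sensitive GPU paths as described above.
# Could equally be passed as --conf on spark-submit.
spark.rapids.sql.hasNans false
```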
My second concern is about making sure we do the right thing around the mortgage data. RAPIDS has an entire page describing the mortgage data set and stating that they have permission to distribute it in this way: https://rapidsai.github.io/demos/datasets/mortgage-data We just jump straight to pointing people at the data. I think it would be good if we at least had a link in each of these locations so people understand where the data came from, that we do have permission to use it, and where they can get the original data.
// Start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = benchmark("train") {
  xgbClassifier.fit(trainSet)
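The snippet above assumes a `benchmark` helper defined earlier in the notebook. A minimal sketch of what such a helper might look like, with the signature inferred from the call site rather than taken from the notebook:

```scala
// Hypothetical timing helper matching the call site above: runs a block,
// prints the elapsed time under the given label, and returns the result
// paired with the elapsed seconds.
def benchmark[T](name: String)(block: => T): (T, Float) = {
  val start = System.nanoTime()
  val result = block                              // run the timed work
  val elapsed = (System.nanoTime() - start) / 1e9f
  println(s"Elapsed time [$name]: $elapsed seconds")
  (result, elapsed)
}
```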
I am kind of confused by the whole XGBoost thing being in here. I understand that we want to cross-promote, but we don't explain anywhere how to install XGBoost, that it is slightly different from the official one, or where to get it. It is just magic, which I think we should at least explain a bit.
I have added references in the README for the RAPIDS plugin, cuDF, and XGBoost, and mentioned that it is installed through the init action.
Filed a follow-on issue for XGBoost notebook here: #368
The notebook will first transcode the CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use the 2016 data as the evaluation set and the rest as the training set, saving each to its respective GCS location.
The first stage, with the default configuration in the notebook, should take ~110 seconds (about 1/3 of the CPU execution time with the same config), whereas the second stage takes ~170 seconds (about 1/6 of the CPU execution time with the same config).
Once the data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in the Dataproc Zeppelin service to execute the training job on GPU. Since GitHub cannot render Zeppelin notebooks, we have prepared a [Jupyter notebook with the Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) so you can view the code content.
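As a rough illustration only, the two stages described above might look something like the following in Spark Scala. The paths, column name, and schema handling are placeholders, not the notebook's actual code, and this assumes a running cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Stage 1: transcode the raw CSV files into Parquet.
val raw = spark.read.option("header", "true").csv("gs://<bucket>/mortgage/csv/")
raw.write.mode("overwrite").parquet("gs://<bucket>/mortgage/parquet/")

// Stage 2: split off 2016 as the evaluation set; the rest is training data.
// The "year" column is a hypothetical name for the origination-year field.
val all = spark.read.parquet("gs://<bucket>/mortgage/parquet/")
all.filter(col("year") === 2016)
  .write.mode("overwrite").parquet("gs://<bucket>/mortgage/eval/")
all.filter(col("year") =!= 2016)
  .write.mode("overwrite").parquet("gs://<bucket>/mortgage/train/")
```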
Why use Zeppelin if we have a Jupyter notebook that does the same thing? Maintaining 2 is a lot more problematic than maintaining 1. What is more, because of how the notebooks work we are actually maintaining a lot of these notebooks, and it feels frustratingly needless.
Also, the Jupyter notebook does not render for me on GitHub either. I just get a spinner until it gives up and returns an error message: "Sorry, something went wrong. Reload?"
Sometimes GitHub has issues rendering notebooks. We do need to figure out a way to test notebooks. If we define the Spark context in the notebook, one way to test is:
jupyter nbconvert --to script /tmp/test.ipynb --output /tmp/test
pyspark /tmp/test.py
Could we merge the notebook for Dataproc and implement the test later?
I filed a follow-on issue here: #367
…f the notebook, reference to mortgage data and spark rapids config
…ecutor info and cluster config based on comments
The changes look good. I think the rest of the outstanding items can be addressed as follow on work.
# Getting started with RAPIDS Accelerator on GCP Dataproc
[Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache Spark and Hadoop service. This guide will walk through the steps to show:

* [How to spin up a Dataproc Cluster Accelerated by GPU](getting-started-gcp#how-to-spin-up-a-dataproc-cluster-accelerated-by-gpu)
This can be follow-on work, but the links don't work on GitHub itself; they only appear to work on the final web page. GitHub expects the links to point to .md files. I think the static web site generator fixes that when it produces the web site.
As data scientists shift from using traditional analytics to leveraging AI applications that better model complex market demands, traditional CPU-based processing can no longer keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost-efficiently with GPUs.

The RAPIDS Accelerator for Apache Spark combines the power of the <a href="https://github.com/rapidsai/cudf/">RAPIDS cuDF</a> library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on <a href="https://github.com/openucx/ucx/">UCX</a> that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.

## Performance & Cost Benefits
The RAPIDS Accelerator for Apache Spark reaps the benefits of GPU performance while saving infrastructure costs.
![Perf-cost](/img/Perf-cost.png)
Again, this can be a follow-on fix, but this does not work on GitHub.
* Changes from Karthik
* fixing _config.yml, adding get-started
* fixing _config.yml (removing whitespace)
* Some additional fixes
* remove localhost reference
* updates to index, getting started gcp, perf image, stable release
* remove unused notebook, add description and reference for each cell of the notebook, reference to mortgage data and spark rapids config
* updated examples with mortgage dataset info, databricks guide with executor info and cluster config based on comments
* add reference to rapids plugin, cudf and spark xgboost in README

Co-authored-by: Karthikeyan <krajendran@nvidia.com>
Co-authored-by: DougM <mengdong0427@gmail.com>
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Documentation changes by Karthik. Requesting merge to branch-0.2. If merged successfully in 0.2, will merge to the gh-pages branch for visibility at http://nvidia.github.io/spark-rapids.

Reviewed locally with `docker run --rm -it --volume="$PWD:/srv/jekyll" -p 4000:4000 sameerz/jekyll:0.1 jekyll serve`.

No code changes.