Karthik changes #359
Conversation
I have two main concerns with this beyond what I have put into the review. The first is maintainability. We have lots of example notebooks that are more or less copies of each other with a few tweaks (paths and, in some cases, configs). I am concerned that if we do have to make changes we are going to miss some of them, especially because the notebooks are in different formats (Zeppelin and Jupyter) and those formats are not human/git friendly. I am also concerned that we are not running any of these notebooks to see if they actually work. They hopefully work right now, but at some point they may stop. Are we going to manually test these with each release to be sure that they still work and do the right thing, like still running everything on the GPU? We have already had one issue where all of the mortgage queries likely need to have spark.rapids.sql.hasNans set to false because of a bug that we found in some floating point aggregation operations. From what I can see, none of the notebooks are doing this, but it is hard to tell because the setup for each cluster is inconsistent and in some cases (Dataproc) opaque to the point that I don't know if it actually works.
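For reference, a hedged sketch of how the workaround mentioned above could be applied in a cluster's Spark configuration; the property name comes from the comment, but where it is set (spark-defaults.conf, `--conf`, or the notebook's session setup) depends on each cluster and is an assumption here:

```
# Illustrative only: disable NaN-sensitive GPU paths as described above.
# Could equally be passed as --conf on spark-submit.
spark.rapids.sql.hasNans false
```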
My second concern is about making sure we do the right thing around the mortgage data. RAPIDS has an entire page describing the mortgage data set and stating that they have permission to distribute it in this way: https://rapidsai.github.io/demos/datasets/mortgage-data We just jump straight to pointing people at the data. I think it would be good if we at least had a link in each of these locations so people understand where the data came from, that we do have permission to use it, and where they can get the original data.
// Start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = benchmark("train") {
  xgbClassifier.fit(trainSet)
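The snippet above assumes a `benchmark` helper defined earlier in the notebook. A minimal sketch of what such a helper might look like, with the signature inferred from the call site rather than taken from the notebook:

```scala
// Hypothetical timing helper matching the call site above: runs a block,
// prints the elapsed time under the given label, and returns the result
// paired with the elapsed seconds.
def benchmark[T](name: String)(block: => T): (T, Float) = {
  val start = System.nanoTime()
  val result = block                              // run the timed work
  val elapsed = (System.nanoTime() - start) / 1e9f
  println(s"Elapsed time [$name]: $elapsed seconds")
  (result, elapsed)
}
```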
I am kind of confused by the whole XGBoost thing being in here. I understand that we want to cross-promote, but we don't explain anywhere how to install XGBoost, that it is slightly different from the official one, or where to get it. It is just magic, which I think we should at least explain a bit.
I have added references in the README for the RAPIDS plugin, cuDF, and XGBoost, and mentioned that it is installed through the init action.
Filed a follow-on issue for XGBoost notebook here: #368
The notebook will first transcode the CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use the 2016 data as the evaluation set and the rest as the training set, saving each to its respective GCS location.
The first stage, with the default configuration in the notebook, should take ~110 seconds (about 1/3 of the CPU execution time with the same config), whereas the second stage takes ~170 seconds (about 1/6 of the CPU execution time with the same config).
Once the data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in the Dataproc Zeppelin service to execute the training job on GPU. Since GitHub cannot render Zeppelin notebooks, we have prepared a [Jupyter notebook with the Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) so you can view the code content.
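As a rough illustration only, the two stages described above might look something like the following in Spark Scala. The paths, column name, and schema handling are placeholders, not the notebook's actual code, and this assumes a running cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Stage 1: transcode the raw CSV files into Parquet.
val raw = spark.read.option("header", "true").csv("gs://<bucket>/mortgage/csv/")
raw.write.mode("overwrite").parquet("gs://<bucket>/mortgage/parquet/")

// Stage 2: split off 2016 as the evaluation set; the rest is training data.
// The "year" column is a hypothetical name for the origination-year field.
val all = spark.read.parquet("gs://<bucket>/mortgage/parquet/")
all.filter(col("year") === 2016)
  .write.mode("overwrite").parquet("gs://<bucket>/mortgage/eval/")
all.filter(col("year") =!= 2016)
  .write.mode("overwrite").parquet("gs://<bucket>/mortgage/train/")
```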
Why use Zeppelin if we have a Jupyter notebook that does the same thing? Maintaining 2 is a lot more problematic than maintaining 1. What is more, because of how the notebooks work we are actually maintaining a lot of these notebooks, and it feels frustratingly needless.
Also, the Jupyter notebook does not render for me on GitHub either. I just get a spinner until it gives up and returns an error message: "Sorry, something went wrong. Reload?"
Sometimes GitHub has issues rendering notebooks. We do need to figure out a way to test notebooks. If we define the Spark context in the notebook, one way to test is:
jupyter nbconvert --to script /tmp/test.ipynb --output /tmp/test
pyspark /tmp/test.py
Could we merge the notebook for Dataproc and implement the test later?
I filed a follow-on issue here: #367
…f the notebook, reference to mortgage data and spark rapids config
…ecutor info and cluster config based on comments
The changes look good. I think the rest of the outstanding items can be addressed as follow on work.
# Getting started with RAPIDS Accelerator on GCP Dataproc
[Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache Spark and Hadoop service. This guide will walk through the steps to show:

* [How to spin up a Dataproc Cluster Accelerated by GPU](getting-started-gcp#how-to-spin-up-a-dataproc-cluster-accelerated-by-gpu)
This can be follow-on work, but the links don't work on GitHub itself; they only appear to work on the final web page. GitHub expects the links to point to .md files. I think the static web site generator fixes that when it produces the web site.
As data scientists shift from using traditional analytics to leveraging AI applications that better model complex market demands, traditional CPU-based processing can no longer keep up without compromising either speed or cost. The growing adoption of AI in analytics has created the need for a new framework to process data quickly and cost-efficiently with GPUs.

The RAPIDS Accelerator for Apache Spark combines the power of the <a href="https://github.com/rapidsai/cudf/">RAPIDS cuDF</a> library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on <a href="https://github.com/openucx/ucx/">UCX</a> that can be configured to leverage GPU-to-GPU communication and RDMA capabilities.

## Performance & Cost Benefits
The RAPIDS Accelerator for Apache Spark reaps the benefits of GPU performance while saving infrastructure costs.
![Perf-cost](/img/Perf-cost.png)
Again, this can be a follow-on fix, but this does not work on GitHub.
* Changes from Karthik
* fixing _config.yml, adding get-started
* fixing _config.yml (removing whitespace)
* Some additional fixes
* remove localhost reference
* updates to index, getting started gcp, perf image, stable release
* remove unused notebook, add description and reference for each cell of the notebook, reference to mortgage data and spark rapids config
* updated examples with mortgage dataset info, databricks guide with executor info and cluster config based on comments
* add reference to rapids plugin, cudf and spark xgboost in README

Co-authored-by: Karthikeyan <krajendran@nvidia.com>
Co-authored-by: DougM <mengdong0427@gmail.com>
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Documentation changes by Karthik. Requesting merge to branch-0.2. If merged successfully in 0.2, will merge to the gh-pages branch for visibility at http://nvidia.github.io/spark-rapids.

Reviewed locally with `docker run --rm -it --volume="$PWD:/srv/jekyll" -p 4000:4000 sameerz/jekyll:0.1 jekyll serve`.

No code changes.