Start Guide for RAPIDS on AWS EMR 6.2 #1371

Merged: 34 commits, Jan 6, 2021
5f525c6
Add files via upload
mgzhao Oct 28, 2020
7d98cc6
Add files via upload
mgzhao Oct 28, 2020
cd5afd1
Add files via upload
mgzhao Oct 29, 2020
a9d6154
Add files via upload
mgzhao Oct 30, 2020
75a81c7
Add files via upload
mgzhao Oct 30, 2020
aafbcaf
Add files via upload
mgzhao Oct 30, 2020
38369ad
Add files via upload
mgzhao Oct 30, 2020
f246ecf
Add files via upload
mgzhao Oct 30, 2020
0d68909
Add files via upload
mgzhao Nov 3, 2020
6df285f
Add files via upload
mgzhao Dec 11, 2020
cb4d9d5
Add files via upload
mgzhao Dec 11, 2020
0fceeaf
Add files via upload
mgzhao Dec 11, 2020
fb691f0
Add files via upload
mgzhao Dec 11, 2020
c53ec3e
Add files via upload
mgzhao Dec 11, 2020
3bded18
Add files via upload
mgzhao Dec 11, 2020
56b4f8a
Add files via upload
mgzhao Dec 13, 2020
b7adfb9
Update docs/get-started/getting-started-aws-emr.md
mgzhao Dec 15, 2020
78d7a86
Update docs/get-started/getting-started-aws-emr.md
mgzhao Dec 15, 2020
1d0c0af
Update docs/get-started/getting-started-aws-emr.md
mgzhao Dec 15, 2020
5f4936d
Modified based on PR feedback
mgzhao Dec 18, 2020
7562e9c
Delete Rapids_EMR_GUI_5.PNG
mgzhao Dec 18, 2020
da857e7
Delete Rapids_EMR_GUI_4.PNG
mgzhao Dec 18, 2020
3be0d15
Delete Rapids_EMR_GUI_3.PNG
mgzhao Dec 18, 2020
bcb8212
Delete Rapids_EMR_GUI_2.PNG
mgzhao Dec 18, 2020
e9a688b
Delete Rapids_EMR_GUI_1.PNG
mgzhao Dec 18, 2020
434077f
Delete EMR_notebook_3.png
mgzhao Dec 18, 2020
da79b53
Delete EMR_notebook_2.png
mgzhao Dec 18, 2020
62bdfb3
Delete EMR_notebook_1.png
mgzhao Dec 18, 2020
df70494
Update docs for getting started and FAQ for AWS-EMR, organize img and…
sameerz Jan 3, 2021
5307648
Update docs/get-started/getting-started-aws-emr.md
sameerz Jan 4, 2021
382e5b6
Update docs/get-started/getting-started-aws-emr.md
sameerz Jan 4, 2021
5ed9996
Updated based on comments, updated to latest sparkconfig.png
sameerz Jan 5, 2021
1e451f0
Merge branch 'branch-0.3' of https://github.com/mgzhao/spark-rapids i…
sameerz Jan 5, 2021
94b5b29
Resolving merge conflicts
sameerz Jan 5, 2021
535 changes: 535 additions & 0 deletions docs/get-started/Mortgage-ETL-GPU-EMR.ipynb


218 changes: 218 additions & 0 deletions docs/get-started/getting-started-aws-emr.md
# Get Started with RAPIDS on AWS EMR


This is a getting started guide for the RAPIDS Accelerator for Apache Spark on AWS EMR. At the end of this guide, the user will be able to run a sample Apache Spark application that runs on NVIDIA GPUs on AWS EMR.


The current EMR 6.2.0 release supports Spark version 3.0.1 and RAPIDS Accelerator version 0.2.0. For more details on supported applications, please see the [EMR release notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html).


For more information on AWS EMR, please see the [AWS documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html).

## Configure and Launch AWS EMR with GPU Nodes

### Launch EMR Cluster using AWS CLI

The following steps are based on the AWS EMR documentation: ["Using the Nvidia Spark-RAPIDS Accelerator for Spark"](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html).


You can use the AWS CLI to launch a cluster with one master node (m4.4xlarge) and two g4dn.2xlarge GPU nodes (one core node and one task node):

```
aws emr create-cluster \
--release-label emr-6.2.0 \
--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \
--service-role EMR_DefaultRole \
--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \
InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \
InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \
--configurations file:///my-configurations.json \
--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://my-bucket/my-bootstrap-action.sh
```

Please fill in the actual values for `KeyName` and the file paths. You can further customize `SubnetId`, `EmrManagedSlaveSecurityGroup`, `EmrManagedMasterSecurityGroup`, the cluster name, the region, and so on.
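If you script cluster creation, it can help to keep the placeholder values in one place. The sketch below is a hypothetical Python helper (not part of the AWS CLI or any AWS SDK) that assembles the same `create-cluster` argument list as the command above from a few parameters:

```python
# Hypothetical helper: build the `aws emr create-cluster` argument list shown
# above from a few parameters, so the placeholder values (key pair name, S3
# paths) live in one place. Instance types and counts mirror the CLI example.
def build_create_cluster_args(key_name, config_uri, bootstrap_uri):
    instance_groups = [
        "InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge",
        "InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge",
        "InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge",
    ]
    return [
        "aws", "emr", "create-cluster",
        "--release-label", "emr-6.2.0",
        "--applications", "Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway",
        "--service-role", "EMR_DefaultRole",
        "--ec2-attributes", f"KeyName={key_name},InstanceProfile=EMR_EC2_DefaultRole",
        "--instance-groups", " ".join(instance_groups),
        "--configurations", config_uri,
        "--bootstrap-actions", f"Name='My Spark Rapids Bootstrap action',Path={bootstrap_uri}",
    ]

args = build_create_cluster_args(
    key_name="my-key-pair",
    config_uri="file:///my-configurations.json",
    bootstrap_uri="s3://my-bucket/my-bootstrap-action.sh",
)
print(" ".join(args))
```

You could pass `args` to `subprocess.run` once the placeholders are replaced with real values for your account.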


### Launch EMR Cluster using AWS Console (GUI)

Go to the AWS Management Console and select the `EMR` service from the "Analytics" section. Choose the region you want to launch your cluster in, e.g. US West Oregon, using the dropdown menu in the top right corner. Click `Create cluster` and select `Go to advanced options`, which will bring up a detailed cluster configuration page.

#### Step 1: Software, Configuration and Steps

Select **emr-6.2.0** or a later EMR release, uncheck all the software options, and then check **Hadoop 3.2.1**, **Spark 3.0.1**, **Livy 0.7.0** and **JupyterEnterpriseGateway 2.1.0**.

In the "Edit software settings" field, copy and paste the configuration from the [EMR document](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html). You can also create a JSON file in your own S3 bucket and reference it.

For a cluster with two g4dn.2xlarge GPU instances as core/task nodes, we recommend the following default settings:
```
[
{
"Classification":"spark",
"Properties":{
"enableSparkRapids":"true"
}
},
{
"Classification":"yarn-site",
"Properties":{
"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
"yarn.resource-types":"yarn.io/gpu",
"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
}
},
{
"Classification":"container-executor",
"Properties":{

},
"Configurations":[
{
"Classification":"gpu",
"Properties":{
"module.enabled":"true"
}
},
{
"Classification":"cgroups",
"Properties":{
"root":"/sys/fs/cgroup",
"yarn-hierarchy":"yarn"
}
}
]
},
{
"Classification":"spark-defaults",
"Properties":{
"spark.plugins":"com.nvidia.spark.SQLPlugin",
"spark.sql.sources.useV1SourceList":"",
"spark.executor.extraJavaOptions":"-Dai.rapids.cudf.prefer-pinned=true",
"spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
"spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar",
"spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
"spark.rapids.sql.concurrentGpuTasks":"4",
"spark.executor.resource.gpu.amount":"1",
"spark.executor.cores":"8",
    "spark.task.cpus":"1",
"spark.task.resource.gpu.amount":"0.125",
"spark.rapids.memory.pinnedPool.size":"2G",
"spark.executor.memoryOverhead":"2G",
"spark.locality.wait":"0s",
"spark.sql.shuffle.partitions":"200",
"spark.sql.files.maxPartitionBytes":"256m",
"spark.sql.adaptive.enabled":"false"
}
},
{
"Classification":"capacity-scheduler",
"Properties":{
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
}
}
]

```
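Before pasting a configurations file like this into the console (or uploading it to S3), it can be worth sanity-checking it programmatically. The sketch below is a minimal, hypothetical check on a trimmed-down copy of the file: EMR configurations are a JSON list of objects keyed by `Classification`, and with one GPU per executor the per-task GPU fraction should be `1 / spark.executor.cores` so tasks pack onto the executor cleanly.

```python
import json

# Trimmed-down copy of the configurations file above, for illustration only.
config_json = """
[
  {"Classification": "spark", "Properties": {"enableSparkRapids": "true"}},
  {"Classification": "spark-defaults", "Properties": {
      "spark.plugins": "com.nvidia.spark.SQLPlugin",
      "spark.executor.cores": "8",
      "spark.task.resource.gpu.amount": "0.125",
      "spark.rapids.sql.concurrentGpuTasks": "4"
  }}
]
"""

# EMR configurations are a JSON list; index the entries by Classification.
classifications = {c["Classification"]: c["Properties"] for c in json.loads(config_json)}
spark_defaults = classifications["spark-defaults"]

cores = int(spark_defaults["spark.executor.cores"])
gpu_per_task = float(spark_defaults["spark.task.resource.gpu.amount"])

# With one GPU per executor, each of the `cores` concurrent tasks should claim
# roughly 1/cores of the GPU.
assert abs(gpu_per_task - 1 / cores) < 1e-3
print(classifications["spark"]["enableSparkRapids"])
```

The same pattern extends to whatever other invariants matter for your cluster, e.g. that `spark.rapids.sql.concurrentGpuTasks` does not exceed `spark.executor.cores`.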

For a cluster with two g4dn.12xlarge instances as core nodes, make the following minor changes to the settings above:

```
"spark.rapids.sql.concurrentGpuTasks":"2",
"spark.executor.cores":"12",
"spark.task.resource.gpu.amount":"0.0833",
```
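The per-task GPU fraction in both variants follows a simple rule of thumb: with one GPU per executor, `spark.task.resource.gpu.amount` is set to `1 / spark.executor.cores` so that every executor core can host a task on the GPU. A small sketch of that arithmetic:

```python
# Rule of thumb behind the settings above: one GPU per executor, shared
# evenly across the executor's cores.
def task_gpu_amount(executor_cores: int) -> float:
    return round(1 / executor_cores, 4)

print(task_gpu_amount(8))   # 0.125  -> the g4dn.2xlarge settings
print(task_gpu_amount(12))  # 0.0833 -> the g4dn.12xlarge settings
```

`spark.rapids.sql.concurrentGpuTasks` is a separate knob; it caps how many of those tasks may use the GPU at the same time, which is why it changes between the two instance types independently of the core count.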


![Step 1: Software, Configuration and Steps](img/Rapids_EMR_GUI_1.PNG)

#### Step 2: Hardware

Select the desired VPC and availability zone in the "Network" and "EC2 Subnet" fields respectively. (The default network and subnet are fine.)

In the "Core" node row, change the "Instance type" to **g4dn.xlarge**, **g4dn.2xlarge**, or **p3.2xlarge** and ensure "Instance count" is set to **1** or any higher number. Keep the default "Master" node instance type of **m5.xlarge**.

![Step 2: Hardware](img/Rapids_EMR_GUI_2.PNG)

#### Step 3: General Cluster Settings

Enter a custom "Cluster name" and make a note of the S3 folder that cluster logs will be written to.

*Optionally* add key-value "Tags", configure a "Custom AMI", or add custom "Bootstrap Actions" for the EMR cluster on this page.

![Step 3: General Cluster Settings](img/Rapids_EMR_GUI_3.PNG)

#### Step 4: Security

Select an existing "EC2 key pair" that will be used to authenticate SSH access to the cluster's nodes. If you do not have access to an EC2 key pair, follow these instructions to [create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair).

*Optionally* set custom security groups in the "EC2 security groups" tab.

In the "EC2 security groups" tab, confirm that the security group chosen for the "Master" node allows for SSH access. Follow these instructions to [allow inbound SSH traffic](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) if the security group does not allow it yet.

![Step 4: Security](img/Rapids_EMR_GUI_4.PNG)

#### Finish Cluster Configuration

The EMR cluster management page displays the status of multiple clusters or detailed information about a chosen cluster. In the detailed cluster view, the "Summary" and "Hardware" tabs can be used to monitor the status of master and core nodes as they provision and initialize.

When the cluster is ready, a green dot will appear next to the cluster name and the "Status" column will display **Waiting, cluster ready**.

In the cluster's "Summary" tab, find the "Master public DNS" field and click the `SSH` button. Follow the instructions to SSH to the new cluster's master node.

![Finish Cluster Configuration](img/Rapids_EMR_GUI_5.PNG)


### Running an Example Join Operation Using Spark Shell

SSH to the EMR cluster's master node, start the Spark shell, and run the SQL join example below to verify GPU operation.

```
spark-shell
```

Run the following Scala code in the Spark shell:

```
val data = 1 to 10000
val df1 = sc.parallelize(data).toDF()
val df2 = sc.parallelize(data).toDF()
val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value")
out.count()
```
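To know what result to expect, here is a plain-Python stand-in for the same join (illustrative only, not Spark code): each value from 1 to 10000 appears exactly once on each side, so the inner join on `value` matches every row once and `out.count()` should return 10000.

```python
# Plain-Python stand-in for the Spark shell join above: values 1..10000 on
# each side, inner-joined on equality, so every row matches exactly once.
data = range(1, 10001)
df2_values = set(data)                          # right-hand side of the join
matches = [v for v in data if v in df2_values]  # df1.value === df2.value
print(len(matches))  # 10000
```

In the Spark shell, you can additionally run `out.explain()` and look for GPU operators (e.g. nodes prefixed with `Gpu`) in the plan to confirm the RAPIDS Accelerator is active.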

### Submit Spark Jobs to an EMR Cluster Accelerated by GPUs

As with on-prem clusters, AWS EMR supports submitting Spark application jobs via `spark-submit`. The mortgage examples we use are also available as a Spark application. You can also use **spark-shell** to run the Scala code or **pyspark** to run the Python code on the master node through the CLI.



### Running GPU Accelerated Mortgage ETL and XGBoost Example using EMR Notebook

An EMR Notebook is a "serverless" Jupyter notebook. Unlike a traditional notebook, the contents of an EMR Notebook itself—the equations, visualizations, queries, models, code, and narrative text—are saved in Amazon S3 separately from the cluster that runs the code. This provides an EMR Notebook with durable storage, efficient access, and flexibility.

You can use the following step-by-step guide to run the example mortgage dataset using RAPIDS on Amazon EMR GPU clusters. For more examples, please refer to [NVIDIA/spark-rapids for ETL](https://github.com/NVIDIA/spark-rapids/tree/main/docs/demo) and [NVIDIA/spark-rapids for XGBoost](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples).

![Create EMR Notebook](img/EMR_notebook_2.png)

#### Create EMR Notebook and Connect to EMR GPU Cluster

Go to the AWS Management Console and select Notebooks in the left column. Click the Create notebook button. You can then click "Choose an existing cluster", click the Choose button, and pick the GPU cluster you created. Once the instance is ready, launch Jupyter from the EMR Notebook instance.

![Create EMR Notebook](img/EMR_notebook_1.png)

#### Run Mortgage ETL PySpark Notebook on EMR GPU Cluster

Download [the Mortgage ETL PySpark Notebook](Mortgage-ETL-GPU-EMR.ipynb). Make sure to use PySpark as the kernel. This example uses one year (2000) of data for a two-node g4dn GPU cluster. You can adjust the settings in the notebook to run ETL on the full mortgage dataset.

While the ETL code is executing, you can see the Spark job progress within the notebook, and the code will display how long each query takes to run.

![Create EMR Notebook](img/EMR_notebook_3.png)

#### Run Mortgage XGBoost Scala Notebook on EMR GPU Cluster

Please refer to this [quick start guide](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-2/getting-started-guides/csp/aws/Using_EMR_Notebook.md) to run GPU-accelerated XGBoost on an EMR Spark cluster.
Binary file added docs/get-started/img/EMR_notebook_1.png
Binary file added docs/get-started/img/EMR_notebook_2.png
Binary file added docs/get-started/img/EMR_notebook_3.png
Binary file added docs/get-started/img/Rapids_EMR_GUI_1.PNG
Binary file added docs/get-started/img/Rapids_EMR_GUI_2.PNG
Binary file added docs/get-started/img/Rapids_EMR_GUI_2b.PNG
Binary file added docs/get-started/img/Rapids_EMR_GUI_3.PNG
Binary file added docs/get-started/img/Rapids_EMR_GUI_4.PNG
Binary file added docs/get-started/img/Rapids_EMR_GUI_5.PNG
68 changes: 68 additions & 0 deletions docs/get-started/preview_config.json
[
{
"Classification":"spark",
"Properties":{
"enableSparkRapids":"true"
}
},
{
"Classification":"yarn-site",
"Properties":{
"yarn.nodemanager.resource-plugins":"yarn.io/gpu",
"yarn.resource-types":"yarn.io/gpu",
"yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
"yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
"yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
"yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
"yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
"yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
}
},
{
"Classification":"container-executor",
"Properties":{

},
"Configurations":[
{
"Classification":"gpu",
"Properties":{
"module.enabled":"true"
}
},
{
"Classification":"cgroups",
"Properties":{
"root":"/sys/fs/cgroup",
"yarn-hierarchy":"yarn"
}
}
]
},
{
"Classification":"spark-defaults",
"Properties":{
"spark.plugins":"com.nvidia.spark.SQLPlugin",
"spark.sql.sources.useV1SourceList":"",
"spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
"spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
"spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar",
"spark.rapids.sql.concurrentGpuTasks":"1",
"spark.executor.resource.gpu.amount":"1",
"spark.executor.cores":"2",
    "spark.task.cpus":"1",
"spark.task.resource.gpu.amount":"0.5",
"spark.rapids.memory.pinnedPool.size":"0",
"spark.executor.memoryOverhead":"2G",
"spark.locality.wait":"0s",
"spark.sql.shuffle.partitions":"200",
"spark.sql.files.maxPartitionBytes":"512m"
}
},
{
"Classification":"capacity-scheduler",
"Properties":{
"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
}
}
]