diff --git a/docs/FAQ.md b/docs/FAQ.md index 567e72232f3..51ca626eb95 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -26,14 +26,14 @@ top of these changes and release updates as quickly as possible. ### Which distributions are supported? -The RAPIDS Accelerator for Apache Spark officially supports -[Apache Spark](get-started/getting-started-on-prem.md), -[Databricks Runtime 7.3](get-started/getting-started-databricks.md) -and [Google Cloud Dataproc](get-started/getting-started-gcp.md). -Most distributions based off of Apache Spark 3.0.0 should work, but because the plugin replaces -parts of the physical plan that Apache Spark considers to be internal the code for those plans -can change from one distribution to another. We are working with most cloud service providers to -set up testing and validation on their distributions. +The RAPIDS Accelerator for Apache Spark officially supports [Apache +Spark](get-started/getting-started-on-prem.md), [AWS EMR +6.2.0](get-started/getting-started-aws-emr.md), [Databricks Runtime +7.3](get-started/getting-started-databricks.md) and [Google Cloud +Dataproc](get-started/getting-started-gcp.md). Most distributions based off of Apache Spark 3.0.0 +should work, but because the plugin replaces parts of the physical plan that Apache Spark considers +to be internal the code for those plans can change from one distribution to another. We are working +with most cloud service providers to set up testing and validation on their distributions. ### What is the right hardware setup to run GPU accelerated Spark? diff --git a/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb b/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb new file mode 100644 index 00000000000..2be5b0f1419 --- /dev/null +++ b/docs/demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb @@ -0,0 +1,535 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "## Data Source\n", + "\n", + "Dataset is derived from Fannie Mae’s [Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae. 
For the full raw dataset visit [Fannie Mae]() to register for an account and to download\n", + "\n", + "Instruction is available at NVIDIA [RAPIDS demo site](https://rapidsai.github.io/demos/datasets/mortgage-data).\n", + "\n", + "## Prerequisite\n", + "\n", + "This notebook runs in a AWS EMR cluster with GPU nodes, with [Spark RAPIDS](https://https://docs.aws.amazon.com/emr/index.html) set up.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Spark conf and Create Spark Session\n", + "\n", + "For details explanation for spark conf, please go to Spark RAPIDS config guide.\n", + "Please customeize your Spark configurations based on your GPU cluster.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%configure -f\n", + "{\n", + " \"driverMemory\": \"4000M\",\n", + " \"driverCores\": 2,\n", + " \"executorMemory\": \"4000M\",\n", + " \"conf\": {\"spark.sql.adaptive.enabled\": false, \"spark.dynamicAllocation.enabled\": false, \"spark.executor.instances\":2, \"spark.executor.cores\":2, \"spark.rapids.sql.explain\":\"ALL\", \"spark.task.cpus\":\"1\", \"spark.rapids.sql.concurrentGpuTasks\":\"2\", \"spark.rapids.memory.pinnedPool.size\":\"2G\", \"spark.executor.memoryOverhead\":\"2G\", \"spark.executor.extraJavaOptions\":\"-Dai.rapids.cudf.prefer-pinned=true\", \"spark.locality.wait\":\"0s\", \"spark.sql.files.maxPartitionBytes\":\"512m\", \"spark.executor.resource.gpu.amount\":\"1\", \"spark.task.resource.gpu.amount\":\"0.5\", \"spark.plugins\":\"com.nvidia.spark.SQLPlugin\", \"spark.rapids.sql.hasNans\":\"false\", \"spark.rapids.sql.batchSizeBytes\":\"512M\", \"spark.rapids.sql.reader.batchSizeBytes\":\"768M\", \"spark.rapids.sql.variableFloatAgg.enabled\":\"true\"}\n", + "}\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%info\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "import time\n", + "from pyspark import broadcast\n", + "from pyspark.sql import SparkSession\n", + "from pyspark.sql.functions import *\n", + "from pyspark.sql.types import *\n", + "from pyspark.sql.window import Window\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define ETL Process\n", + "\n", + "Define data schema and steps to do the ETL process:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def _get_quarter_from_csv_file_name():\n", + " return substring_index(substring_index(input_file_name(), '.', 1), '_', -1)\n", + "\n", + "_csv_perf_schema = StructType([\n", + " StructField('loan_id', LongType()),\n", + " StructField('monthly_reporting_period', StringType()),\n", + " StructField('servicer', StringType()),\n", + " StructField('interest_rate', DoubleType()),\n", + " StructField('current_actual_upb', DoubleType()),\n", + " StructField('loan_age', DoubleType()),\n", + " StructField('remaining_months_to_legal_maturity', DoubleType()),\n", + " StructField('adj_remaining_months_to_maturity', DoubleType()),\n", + " StructField('maturity_date', StringType()),\n", + " StructField('msa', DoubleType()),\n", + " StructField('current_loan_delinquency_status', IntegerType()),\n", + " StructField('mod_flag', StringType()),\n", + " StructField('zero_balance_code', StringType()),\n", + " StructField('zero_balance_effective_date', StringType()),\n", + " 
StructField('last_paid_installment_date', StringType()),\n", + " StructField('foreclosed_after', StringType()),\n", + " StructField('disposition_date', StringType()),\n", + " StructField('foreclosure_costs', DoubleType()),\n", + " StructField('prop_preservation_and_repair_costs', DoubleType()),\n", + " StructField('asset_recovery_costs', DoubleType()),\n", + " StructField('misc_holding_expenses', DoubleType()),\n", + " StructField('holding_taxes', DoubleType()),\n", + " StructField('net_sale_proceeds', DoubleType()),\n", + " StructField('credit_enhancement_proceeds', DoubleType()),\n", + " StructField('repurchase_make_whole_proceeds', StringType()),\n", + " StructField('other_foreclosure_proceeds', DoubleType()),\n", + " StructField('non_interest_bearing_upb', DoubleType()),\n", + " StructField('principal_forgiveness_upb', StringType()),\n", + " StructField('repurchase_make_whole_proceeds_flag', StringType()),\n", + " StructField('foreclosure_principal_write_off_amount', StringType()),\n", + " StructField('servicing_activity_indicator', StringType())])\n", + "_csv_acq_schema = StructType([\n", + " StructField('loan_id', LongType()),\n", + " StructField('orig_channel', StringType()),\n", + " StructField('seller_name', StringType()),\n", + " StructField('orig_interest_rate', DoubleType()),\n", + " StructField('orig_upb', IntegerType()),\n", + " StructField('orig_loan_term', IntegerType()),\n", + " StructField('orig_date', StringType()),\n", + " StructField('first_pay_date', StringType()),\n", + " StructField('orig_ltv', DoubleType()),\n", + " StructField('orig_cltv', DoubleType()),\n", + " StructField('num_borrowers', DoubleType()),\n", + " StructField('dti', DoubleType()),\n", + " StructField('borrower_credit_score', DoubleType()),\n", + " StructField('first_home_buyer', StringType()),\n", + " StructField('loan_purpose', StringType()),\n", + " StructField('property_type', StringType()),\n", + " StructField('num_units', IntegerType()),\n", + " StructField('occupancy_status', StringType()),\n", + " StructField('property_state', StringType()),\n", + " StructField('zip', IntegerType()),\n", + " StructField('mortgage_insurance_percent', DoubleType()),\n", + " StructField('product_type', StringType()),\n", + " StructField('coborrow_credit_score', DoubleType()),\n", + " StructField('mortgage_insurance_type', DoubleType()),\n", + " StructField('relocation_mortgage_indicator', StringType())])\n", + "_name_mapping = [\n", + " (\"WITMER FUNDING, LLC\", \"Witmer\"),\n", + " (\"WELLS FARGO CREDIT RISK TRANSFER SECURITIES TRUST 2015\", \"Wells Fargo\"),\n", + " (\"WELLS FARGO BANK, NA\" , \"Wells Fargo\"),\n", + " (\"WELLS FARGO BANK, N.A.\" , \"Wells Fargo\"),\n", + " (\"WELLS FARGO BANK, NA\" , \"Wells Fargo\"),\n", + " (\"USAA FEDERAL SAVINGS BANK\" , \"USAA\"),\n", + " (\"UNITED SHORE FINANCIAL SERVICES, LLC D\\\\/B\\\\/A UNITED WHOLESALE MORTGAGE\" , \"United Seq(e\"),\n", + " (\"U.S. 
BANK N.A.\" , \"US Bank\"),\n", + " (\"SUNTRUST MORTGAGE INC.\" , \"Suntrust\"),\n", + " (\"STONEGATE MORTGAGE CORPORATION\" , \"Stonegate Mortgage\"),\n", + " (\"STEARNS LENDING, LLC\" , \"Stearns Lending\"),\n", + " (\"STEARNS LENDING, INC.\" , \"Stearns Lending\"),\n", + " (\"SIERRA PACIFIC MORTGAGE COMPANY, INC.\" , \"Sierra Pacific Mortgage\"),\n", + " (\"REGIONS BANK\" , \"Regions\"),\n", + " (\"RBC MORTGAGE COMPANY\" , \"RBC\"),\n", + " (\"QUICKEN LOANS INC.\" , \"Quicken Loans\"),\n", + " (\"PULTE MORTGAGE, L.L.C.\" , \"Pulte Mortgage\"),\n", + " (\"PROVIDENT FUNDING ASSOCIATES, L.P.\" , \"Provident Funding\"),\n", + " (\"PROSPECT MORTGAGE, LLC\" , \"Prospect Mortgage\"),\n", + " (\"PRINCIPAL RESIDENTIAL MORTGAGE CAPITAL RESOURCES, LLC\" , \"Principal Residential\"),\n", + " (\"PNC BANK, N.A.\" , \"PNC\"),\n", + " (\"PMT CREDIT RISK TRANSFER TRUST 2015-2\" , \"PennyMac\"),\n", + " (\"PHH MORTGAGE CORPORATION\" , \"PHH Mortgage\"),\n", + " (\"PENNYMAC CORP.\" , \"PennyMac\"),\n", + " (\"PACIFIC UNION FINANCIAL, LLC\" , \"Other\"),\n", + " (\"OTHER\" , \"Other\"),\n", + " (\"NYCB MORTGAGE COMPANY, LLC\" , \"NYCB\"),\n", + " (\"NEW YORK COMMUNITY BANK\" , \"NYCB\"),\n", + " (\"NETBANK FUNDING SERVICES\" , \"Netbank\"),\n", + " (\"NATIONSTAR MORTGAGE, LLC\" , \"Nationstar Mortgage\"),\n", + " (\"METLIFE BANK, NA\" , \"Metlife\"),\n", + " (\"LOANDEPOT.COM, LLC\" , \"LoanDepot.com\"),\n", + " (\"J.P. MORGAN MADISON AVENUE SECURITIES TRUST, SERIES 2015-1\" , \"JP Morgan Chase\"),\n", + " (\"J.P. MORGAN MADISON AVENUE SECURITIES TRUST, SERIES 2014-1\" , \"JP Morgan Chase\"),\n", + " (\"JPMORGAN CHASE BANK, NATIONAL ASSOCIATION\" , \"JP Morgan Chase\"),\n", + " (\"JPMORGAN CHASE BANK, NA\" , \"JP Morgan Chase\"),\n", + " (\"JP MORGAN CHASE BANK, NA\" , \"JP Morgan Chase\"),\n", + " (\"IRWIN MORTGAGE, CORPORATION\" , \"Irwin Mortgage\"),\n", + " (\"IMPAC MORTGAGE CORP.\" , \"Impac Mortgage\"),\n", + " (\"HSBC BANK USA, NATIONAL ASSOCIATION\" , \"HSBC\"),\n", + " (\"HOMEWARD RESIDENTIAL, INC.\" , \"Homeward Mortgage\"),\n", + " (\"HOMESTREET BANK\" , \"Other\"),\n", + " (\"HOMEBRIDGE FINANCIAL SERVICES, INC.\" , \"HomeBridge\"),\n", + " (\"HARWOOD STREET FUNDING I, LLC\" , \"Harwood Mortgage\"),\n", + " (\"GUILD MORTGAGE COMPANY\" , \"Guild Mortgage\"),\n", + " (\"GMAC MORTGAGE, LLC (USAA FEDERAL SAVINGS BANK)\" , \"GMAC\"),\n", + " (\"GMAC MORTGAGE, LLC\" , \"GMAC\"),\n", + " (\"GMAC (USAA)\" , \"GMAC\"),\n", + " (\"FREMONT BANK\" , \"Fremont Bank\"),\n", + " (\"FREEDOM MORTGAGE CORP.\" , \"Freedom Mortgage\"),\n", + " (\"FRANKLIN AMERICAN MORTGAGE COMPANY\" , \"Franklin America\"),\n", + " (\"FLEET NATIONAL BANK\" , \"Fleet National\"),\n", + " (\"FLAGSTAR CAPITAL MARKETS CORPORATION\" , \"Flagstar Bank\"),\n", + " (\"FLAGSTAR BANK, FSB\" , \"Flagstar Bank\"),\n", + " (\"FIRST TENNESSEE BANK NATIONAL ASSOCIATION\" , \"Other\"),\n", + " (\"FIFTH THIRD BANK\" , \"Fifth Third Bank\"),\n", + " (\"FEDERAL HOME LOAN BANK OF CHICAGO\" , \"Fedral Home of Chicago\"),\n", + " (\"FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB\" , \"FDIC\"),\n", + " (\"DOWNEY SAVINGS AND LOAN ASSOCIATION, F.A.\" , \"Downey Mortgage\"),\n", + " (\"DITECH FINANCIAL LLC\" , \"Ditech\"),\n", + " (\"CITIMORTGAGE, INC.\" , \"Citi\"),\n", + " (\"CHICAGO MORTGAGE SOLUTIONS DBA INTERFIRST MORTGAGE COMPANY\" , \"Chicago Mortgage\"),\n", + " (\"CHICAGO MORTGAGE SOLUTIONS DBA INTERBANK MORTGAGE COMPANY\" , \"Chicago Mortgage\"),\n", + " (\"CHASE HOME FINANCE, LLC\" , \"JP Morgan Chase\"),\n", + " (\"CHASE HOME FINANCE FRANKLIN AMERICAN 
MORTGAGE COMPANY\" , \"JP Morgan Chase\"),\n", + " (\"CHASE HOME FINANCE (CIE 1)\" , \"JP Morgan Chase\"),\n", + " (\"CHASE HOME FINANCE\" , \"JP Morgan Chase\"),\n", + " (\"CASHCALL, INC.\" , \"CashCall\"),\n", + " (\"CAPITAL ONE, NATIONAL ASSOCIATION\" , \"Capital One\"),\n", + " (\"CALIBER HOME LOANS, INC.\" , \"Caliber Funding\"),\n", + " (\"BISHOPS GATE RESIDENTIAL MORTGAGE TRUST\" , \"Bishops Gate Mortgage\"),\n", + " (\"BANK OF AMERICA, N.A.\" , \"Bank of America\"),\n", + " (\"AMTRUST BANK\" , \"AmTrust\"),\n", + " (\"AMERISAVE MORTGAGE CORPORATION\" , \"Amerisave\"),\n", + " (\"AMERIHOME MORTGAGE COMPANY, LLC\" , \"AmeriHome Mortgage\"),\n", + " (\"ALLY BANK\" , \"Ally Bank\"),\n", + " (\"ACADEMY MORTGAGE CORPORATION\" , \"Academy Mortgage\"),\n", + " (\"NO CASH-OUT REFINANCE\" , \"OTHER REFINANCE\"),\n", + " (\"REFINANCE - NOT SPECIFIED\" , \"OTHER REFINANCE\"),\n", + " (\"Other REFINANCE\" , \"OTHER REFINANCE\")]\n", + "\n", + "cate_col_names = [\n", + " \"orig_channel\",\n", + " \"first_home_buyer\",\n", + " \"loan_purpose\",\n", + " \"property_type\",\n", + " \"occupancy_status\",\n", + " \"property_state\",\n", + " \"relocation_mortgage_indicator\",\n", + " \"seller_name\",\n", + " \"mod_flag\"\n", + "]\n", + "# Numberic columns\n", + "label_col_name = \"delinquency_12\"\n", + "numeric_col_names = [\n", + " \"orig_interest_rate\",\n", + " \"orig_upb\",\n", + " \"orig_loan_term\",\n", + " \"orig_ltv\",\n", + " \"orig_cltv\",\n", + " \"num_borrowers\",\n", + " \"dti\",\n", + " \"borrower_credit_score\",\n", + " \"num_units\",\n", + " \"zip\",\n", + " \"mortgage_insurance_percent\",\n", + " \"current_loan_delinquency_status\",\n", + " \"current_actual_upb\",\n", + " \"interest_rate\",\n", + " \"loan_age\",\n", + " \"msa\",\n", + " \"non_interest_bearing_upb\",\n", + " label_col_name\n", + "]\n", + "all_col_names = cate_col_names + numeric_col_names\n", + "\n", + "def read_perf_csv(spark, path):\n", + " return spark.read.format('csv') \\\n", + " .option('nullValue', '') \\\n", + " .option('header', 'false') \\\n", + " .option('delimiter', '|') \\\n", + " .schema(_csv_perf_schema) \\\n", + " .load(path) \\\n", + " .withColumn('quarter', _get_quarter_from_csv_file_name())\n", + "\n", + "def read_acq_csv(spark, path):\n", + " return spark.read.format('csv') \\\n", + " .option('nullValue', '') \\\n", + " .option('header', 'false') \\\n", + " .option('delimiter', '|') \\\n", + " .schema(_csv_acq_schema) \\\n", + " .load(path) \\\n", + " .withColumn('quarter', _get_quarter_from_csv_file_name())\n", + "\n", + "def _parse_dates(perf):\n", + " return perf \\\n", + " .withColumn('monthly_reporting_period', to_date(col('monthly_reporting_period'), 'MM/dd/yyyy')) \\\n", + " .withColumn('monthly_reporting_period_month', month(col('monthly_reporting_period'))) \\\n", + " .withColumn('monthly_reporting_period_year', year(col('monthly_reporting_period'))) \\\n", + " .withColumn('monthly_reporting_period_day', dayofmonth(col('monthly_reporting_period'))) \\\n", + " .withColumn('last_paid_installment_date', to_date(col('last_paid_installment_date'), 'MM/dd/yyyy')) \\\n", + " .withColumn('foreclosed_after', to_date(col('foreclosed_after'), 'MM/dd/yyyy')) \\\n", + " .withColumn('disposition_date', to_date(col('disposition_date'), 'MM/dd/yyyy')) \\\n", + " .withColumn('maturity_date', to_date(col('maturity_date'), 'MM/yyyy')) \\\n", + " .withColumn('zero_balance_effective_date', to_date(col('zero_balance_effective_date'), 'MM/yyyy'))\n", + "\n", + "def _create_perf_deliquency(spark, perf):\n", + " 
aggDF = perf.select(\n", + " col(\"quarter\"),\n", + " col(\"loan_id\"),\n", + " col(\"current_loan_delinquency_status\"),\n", + " when(col(\"current_loan_delinquency_status\") >= 1, col(\"monthly_reporting_period\")).alias(\"delinquency_30\"),\n", + " when(col(\"current_loan_delinquency_status\") >= 3, col(\"monthly_reporting_period\")).alias(\"delinquency_90\"),\n", + " when(col(\"current_loan_delinquency_status\") >= 6, col(\"monthly_reporting_period\")).alias(\"delinquency_180\")) \\\n", + " .groupBy(\"quarter\", \"loan_id\") \\\n", + " .agg(\n", + " max(\"current_loan_delinquency_status\").alias(\"delinquency_12\"),\n", + " min(\"delinquency_30\").alias(\"delinquency_30\"),\n", + " min(\"delinquency_90\").alias(\"delinquency_90\"),\n", + " min(\"delinquency_180\").alias(\"delinquency_180\")) \\\n", + " .select(\n", + " col(\"quarter\"),\n", + " col(\"loan_id\"),\n", + " (col(\"delinquency_12\") >= 1).alias(\"ever_30\"),\n", + " (col(\"delinquency_12\") >= 3).alias(\"ever_90\"),\n", + " (col(\"delinquency_12\") >= 6).alias(\"ever_180\"),\n", + " col(\"delinquency_30\"),\n", + " col(\"delinquency_90\"),\n", + " col(\"delinquency_180\"))\n", + " joinedDf = perf \\\n", + " .withColumnRenamed(\"monthly_reporting_period\", \"timestamp\") \\\n", + " .withColumnRenamed(\"monthly_reporting_period_month\", \"timestamp_month\") \\\n", + " .withColumnRenamed(\"monthly_reporting_period_year\", \"timestamp_year\") \\\n", + " .withColumnRenamed(\"current_loan_delinquency_status\", \"delinquency_12\") \\\n", + " .withColumnRenamed(\"current_actual_upb\", \"upb_12\") \\\n", + " .select(\"quarter\", \"loan_id\", \"timestamp\", \"delinquency_12\", \"upb_12\", \"timestamp_month\", \"timestamp_year\") \\\n", + " .join(aggDF, [\"loan_id\", \"quarter\"], \"left_outer\")\n", + " # calculate the 12 month delinquency and upb values\n", + " months = 12\n", + " monthArray = [lit(x) for x in range(0, 12)]\n", + " # explode on a small amount of data is actually slightly more efficient than a cross join\n", + " testDf = joinedDf \\\n", + " .withColumn(\"month_y\", explode(array(monthArray))) \\\n", + " .select(\n", + " col(\"quarter\"),\n", + " floor(((col(\"timestamp_year\") * 12 + col(\"timestamp_month\")) - 24000) / months).alias(\"josh_mody\"),\n", + " floor(((col(\"timestamp_year\") * 12 + col(\"timestamp_month\")) - 24000 - col(\"month_y\")) / months).alias(\"josh_mody_n\"),\n", + " col(\"ever_30\"),\n", + " col(\"ever_90\"),\n", + " col(\"ever_180\"),\n", + " col(\"delinquency_30\"),\n", + " col(\"delinquency_90\"),\n", + " col(\"delinquency_180\"),\n", + " col(\"loan_id\"),\n", + " col(\"month_y\"),\n", + " col(\"delinquency_12\"),\n", + " col(\"upb_12\")) \\\n", + " .groupBy(\"quarter\", \"loan_id\", \"josh_mody_n\", \"ever_30\", \"ever_90\", \"ever_180\", \"delinquency_30\", \"delinquency_90\", \"delinquency_180\", \"month_y\") \\\n", + " .agg(max(\"delinquency_12\").alias(\"delinquency_12\"), min(\"upb_12\").alias(\"upb_12\")) \\\n", + " .withColumn(\"timestamp_year\", floor((lit(24000) + (col(\"josh_mody_n\") * lit(months)) + (col(\"month_y\") - 1)) / lit(12))) \\\n", + " .selectExpr('*', 'pmod(24000 + (josh_mody_n * {}) + month_y, 12) as timestamp_month_tmp'.format(months)) \\\n", + " .withColumn(\"timestamp_month\", when(col(\"timestamp_month_tmp\") == lit(0), lit(12)).otherwise(col(\"timestamp_month_tmp\"))) \\\n", + " .withColumn(\"delinquency_12\", ((col(\"delinquency_12\") > 3).cast(\"int\") + (col(\"upb_12\") == 0).cast(\"int\")).alias(\"delinquency_12\")) \\\n", + " 
.drop(\"timestamp_month_tmp\", \"josh_mody_n\", \"month_y\")\n", + " return perf.withColumnRenamed(\"monthly_reporting_period_month\", \"timestamp_month\") \\\n", + " .withColumnRenamed(\"monthly_reporting_period_year\", \"timestamp_year\") \\\n", + " .join(testDf, [\"quarter\", \"loan_id\", \"timestamp_year\", \"timestamp_month\"], \"left\") \\\n", + " .drop(\"timestamp_year\", \"timestamp_month\")\n", + "\n", + "def _create_acquisition(spark, acq):\n", + " nameMapping = spark.createDataFrame(_name_mapping, [\"from_seller_name\", \"to_seller_name\"])\n", + " return acq.join(nameMapping, col(\"seller_name\") == col(\"from_seller_name\"), \"left\") \\\n", + " .drop(\"from_seller_name\") \\\n", + " .withColumn(\"old_name\", col(\"seller_name\")) \\\n", + " .withColumn(\"seller_name\", coalesce(col(\"to_seller_name\"), col(\"seller_name\"))) \\\n", + " .drop(\"to_seller_name\") \\\n", + " .withColumn(\"orig_date\", to_date(col(\"orig_date\"), \"MM/yyyy\")) \\\n", + " .withColumn(\"first_pay_date\", to_date(col(\"first_pay_date\"), \"MM/yyyy\"))\n", + "\n", + "def _gen_dictionary(etl_df, col_names):\n", + " cnt_table = etl_df.select(posexplode(array([col(i) for i in col_names])))\\\n", + " .withColumnRenamed(\"pos\", \"column_id\")\\\n", + " .withColumnRenamed(\"col\", \"data\")\\\n", + " .filter(\"data is not null\")\\\n", + " .groupBy(\"column_id\", \"data\")\\\n", + " .count()\n", + " windowed = Window.partitionBy(\"column_id\").orderBy(desc(\"count\"))\n", + " return cnt_table.withColumn(\"id\", row_number().over(windowed)).drop(\"count\")\n", + "\n", + "\n", + "def _cast_string_columns_to_numeric(spark, input_df):\n", + " cached_dict_df = _gen_dictionary(input_df, cate_col_names).cache()\n", + " output_df = input_df\n", + " # Generate the final table with all columns being numeric.\n", + " for col_pos, col_name in enumerate(cate_col_names):\n", + " col_dict_df = cached_dict_df.filter(col(\"column_id\") == col_pos)\\\n", + " .drop(\"column_id\")\\\n", + " .withColumnRenamed(\"data\", col_name)\n", + " \n", + " output_df = output_df.join(broadcast(col_dict_df), col_name, \"left\")\\\n", + " .drop(col_name)\\\n", + " .withColumnRenamed(\"id\", col_name)\n", + " return output_df\n", + "\n", + "def run_mortgage(spark, perf, acq):\n", + " parsed_perf = _parse_dates(perf)\n", + " perf_deliqency = _create_perf_deliquency(spark, parsed_perf)\n", + " cleaned_acq = _create_acquisition(spark, acq)\n", + " df = perf_deliqency.join(cleaned_acq, [\"loan_id\", \"quarter\"], \"inner\")\n", + " # change to this for 20 year mortgage data - test_quarters = ['2016Q1','2016Q2','2016Q3','2016Q4']\n", + " test_quarters = ['2000Q4']\n", + " train_df = df.filter(df.quarter.isin(test_quarters)).drop(\"quarter\")\n", + " test_df = df.filter(df.quarter.isin(test_quarters)).drop(\"quarter\")\n", + " casted_train_df = _cast_string_columns_to_numeric(spark, train_df)\\\n", + " .select(all_col_names)\\\n", + " .withColumn(label_col_name, when(col(label_col_name) > 0, 1).otherwise(0))\\\n", + " .fillna(float(0))\n", + " casted_test_df = _cast_string_columns_to_numeric(spark, test_df)\\\n", + " .select(all_col_names)\\\n", + " .withColumn(label_col_name, when(col(label_col_name) > 0, 1).otherwise(0))\\\n", + " .fillna(float(0))\n", + " return casted_train_df, casted_test_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Define Data Input/Output location\n", + "\n", + "This example is using one year mortgage data (year 2000) for GPU Spark cluster (2x g4dn.2xlarge). 
Please use large GPU cluster to process the full mortgage data. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "orig_perf_path = 's3://spark-xgboost-mortgage-dataset-east1/mortgage-etl-demo/perf/*'\n", + "orig_acq_path = 's3://spark-xgboost-mortgage-dataset-east1/mortgage-etl-demo/acq/*'\n", + "\n", + "train_path = 's3://spark-xgboost-mortgage-dataset-east1/mortgage-xgboost-demo/train/'\n", + "test_path = 's3://spark-xgboost-mortgage-dataset-east1/mortgage-xgboost-demo/test/'\n", + "\n", + "tmp_perf_path = 's3://spark-xgboost-mortgage-dataset-east1/mortgage_parquet_gpu/perf/'\n", + "tmp_acq_path = 's3://spark-xgboost-mortgage-dataset-east1/mortgage_parquet_gpu/acq/'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Read CSV data and Transcode to Parquet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Lets transcode the data first\n", + "start = time.time()\n", + "# we want a few big files instead of lots of small files\n", + "spark.conf.set('spark.sql.files.maxPartitionBytes', '200G')\n", + "acq = read_acq_csv(spark, orig_acq_path)\n", + "acq.repartition(20).write.parquet(tmp_acq_path, mode='overwrite')\n", + "perf = read_perf_csv(spark, orig_perf_path)\n", + "perf.coalesce(80).write.parquet(tmp_perf_path, mode='overwrite')\n", + "end = time.time()\n", + "print(end - start)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Execute ETL Code Defined in 1st Cell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Now lets actually process the data\\n\",\n", + "start = time.time()\n", + "spark.conf.set('spark.sql.files.maxPartitionBytes', '1G')\n", + "spark.conf.set('spark.sql.shuffle.partitions', '160')\n", + "perf = spark.read.parquet(tmp_perf_path)\n", + "acq = spark.read.parquet(tmp_acq_path)\n", + "train_out, test_out = run_mortgage(spark, perf, acq)\n", + "train_out.write.parquet(train_path, mode='overwrite')\n", + "end = time.time()\n", + "print(end - start)\n", + "test_out.write.parquet(test_path, mode='overwrite')\n", + "end = time.time()\n", + "print(end - start)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Print Physical Plan" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#train_out.explain()\n", + "print(spark._jvm.org.apache.spark.sql.api.python.PythonSQLUtils.explainString(train_out._jdf.queryExecution(), 'simple'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "PySpark", + "language": "", + "name": "pysparkkernel" + }, + "language_info": { + "codemirror_mode": { + "name": "python", + "version": 3 + }, + "mimetype": "text/x-python", + "name": "pyspark", + "pygments_lexer": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/get-started/getting-started-aws-emr.md b/docs/get-started/getting-started-aws-emr.md new file mode 100644 index 00000000000..cdd96a16476 --- /dev/null +++ b/docs/get-started/getting-started-aws-emr.md @@ -0,0 +1,263 @@ +--- +layout: page +title: AWS-EMR +nav_order: 2 +parent: Getting-Started +--- +# Get Started with RAPIDS on AWS EMR + +This is a getting started guide for the RAPIDS Accelerator for Apache Spark on AWS EMR. 
At the end +of this guide, the user will be able to run a sample Apache Spark application that runs on NVIDIA +GPUs on AWS EMR. + +The current EMR 6.2.0 release supports Spark version 3.0.1 and RAPIDS Accelerator version 0.2.0. For +more details of supported applications, please see the [EMR release +notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html). + +For more information on AWS EMR, please see the [AWS +documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). + +## Configure and Launch AWS EMR with GPU Nodes + +The following steps are based on the AWS EMR document ["Using the Nvidia Spark-RAPIDS Accelerator +for Spark"](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html) + +### Launch an EMR Cluster using AWS CLI + +You can use the AWS CLI to launch a cluster with one Master node (m5.xlarge) and two +g4dn.2xlarge nodes: + +``` +aws emr create-cluster \ +--release-label emr-6.2.0 \ +--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \ +--service-role EMR_DefaultRole \ +--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \ +--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \ + InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \ + InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \ +--configurations file:///my-configurations.json \ +--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://my-bucket/my-bootstrap-action.sh +``` + +Please fill with actual value for `KeyName` and file paths. You can further customize SubnetId, +EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, name and region etc. + +The `my-configurations.json` installs the spark-rapids plugin on your cluster, configures YARN to use + +GPUs, configures Spark to use RAPIDS, and configures the YARN capacity scheduler. An example JSON + +configuration can be found in the section on launching in the GUI below. + +The `my-boostrap-action.sh` script referenced in the above script opens cgroup permissions to YARN +on your cluster. This is required for YARN to use GPUs. An example script is as follows: +```bash +#!/bin/bash + +set -ex + +sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct +sudo chmod a+rwx -R /sys/fs/cgroup/devices +``` + +### Launch an EMR Cluster using AWS Console (GUI) + +Go to the AWS Management Console and select the `EMR` service from the "Analytics" section. Choose +the region you want to launch your cluster in, e.g. US West (Oregon), using the dropdown menu in the +top right corner. Click `Create cluster` and select `Go to advanced options`, which will bring up a +detailed cluster configuration page. + +#### Step 1: Software Configuration and Steps + +Select **emr-6.2.0** for the release, uncheck all the software options, and then check **Hadoop +3.2.1**, **Spark 3.0.1**, **Livy 0.7.0** and **JupyterEnterpriseGateway 2.1.0**. + +In the "Edit software settings" field, copy and paste the configuration from the [EMR +document](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html). You can also +create a JSON file on you own S3 bucket. 
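If you keep the configuration JSON and the bootstrap script in your own S3 bucket, as the `create-cluster` command above assumes for the bootstrap action, you can upload them ahead of time. A minimal sketch using boto3 (the bucket name and file names are placeholders matching the CLI example, not values prescribed by this guide):

```python
# Illustrative upload of the cluster configuration and bootstrap script to S3.
# "my-bucket" and the file names are placeholders from the CLI example above.
import boto3

s3 = boto3.client("s3")
s3.upload_file("my-configurations.json", "my-bucket", "my-configurations.json")
s3.upload_file("my-bootstrap-action.sh", "my-bucket", "my-bootstrap-action.sh")
print("Bootstrap action available at s3://my-bucket/my-bootstrap-action.sh")
```

The bootstrap action must be reachable from S3 when the cluster starts; the configuration JSON can be passed either as a local `file://` path, as in the command above, or from an S3 location.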
+ +For clusters with 2x g4dn.2xlarge GPU instances as worker nodes, we recommend the following +default settings: +```json +[ + { + "Classification":"spark", + "Properties":{ + "enableSparkRapids":"true" + } + }, + { + "Classification":"yarn-site", + "Properties":{ + "yarn.nodemanager.resource-plugins":"yarn.io/gpu", + "yarn.resource-types":"yarn.io/gpu", + "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", + "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", + "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", + "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", + "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", + "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" + } + }, + { + "Classification":"container-executor", + "Properties":{ + + }, + "Configurations":[ + { + "Classification":"gpu", + "Properties":{ + "module.enabled":"true" + } + }, + { + "Classification":"cgroups", + "Properties":{ + "root":"/sys/fs/cgroup", + "yarn-hierarchy":"yarn" + } + } + ] + }, + { + "Classification":"spark-defaults", + "Properties":{ + "spark.plugins":"com.nvidia.spark.SQLPlugin", + "spark.sql.sources.useV1SourceList":"", + "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", + "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar", + "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", + "spark.rapids.sql.concurrentGpuTasks":"2", + "spark.executor.resource.gpu.amount":"1", + "spark.executor.cores":"8", + "spark.task.cpus ":"1", + "spark.task.resource.gpu.amount":"0.125", + "spark.rapids.memory.pinnedPool.size":"2G", + "spark.executor.memoryOverhead":"2G", + "spark.locality.wait":"0s", + "spark.sql.shuffle.partitions":"200", + "spark.sql.files.maxPartitionBytes":"256m", + "spark.sql.adaptive.enabled":"false" + } + }, + { + "Classification":"capacity-scheduler", + "Properties":{ + "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" + } + } +] + +``` +Adjust the settings as appropriate for your cluster. For example, setting the appropriate +number of cores based on the node type. The `spark.task.resource.gpu.amount` should be set to +1/(number of cores per executor) which will allow multiple tasks to run in parallel on the GPU. + +For example, for clusters with 2x g4dn.12xlarge as core nodes, use the following: + +```json + "spark.executor.cores":"12", + "spark.task.resource.gpu.amount":"0.0833", +``` + +More configuration details can be found in the [configuration](../configs.md) documentation. + +![Step 1: Step 1: Software, Configuration and Steps](../img/AWS-EMR/RAPIDS_EMR_GUI_1.png) + +#### Step 2: Hardware + +Select the desired VPC and availability zone in the "Network" and "EC2 Subnet" fields respectively. (Default network and subnet are ok) + +In the "Core" node row, change the "Instance type" to **g4dn.xlarge**, **g4dn.2xlarge**, or **p3.2xlarge** and ensure "Instance count" is set to **1** or any higher number. Keep the default "Master" node instance type of **m5.xlarge**. 
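Whichever GPU instance type you pick for the core nodes, keep the Step 1 Spark settings consistent with its core count: as noted above, `spark.task.resource.gpu.amount` should be roughly 1/(executor cores). A quick sketch of that arithmetic (the instance/core pairings are the examples used in this guide, rounded the same way):

```python
# Derive the per-task GPU fraction from the executor core count so that every
# task slot of an executor can share its single GPU (example values only).
def gpu_amount_per_task(executor_cores: int) -> float:
    return round(1.0 / executor_cores, 4)

for instance, cores in [("g4dn.2xlarge", 8), ("g4dn.12xlarge", 12)]:
    print(f"{instance}: spark.executor.cores={cores}, "
          f"spark.task.resource.gpu.amount={gpu_amount_per_task(cores)}")
```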
+ +![Step 2: Hardware](../img/AWS-EMR/RAPIDS_EMR_GUI_2.png) + +#### Step 3: General Cluster Settings + +Enter a custom "Cluster name" and make a note of the s3 folder that cluster logs will be written to. + +Add a custom "Bootstrap Actions" to allow cgroup permissions to YARN on your cluster. An example +bootstrap script is as follows: +```bash +#!/bin/bash + +set -ex + +sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct +sudo chmod a+rwx -R /sys/fs/cgroup/devices +``` + +*Optionally* add key-value "Tags", configure a "Custom AMI" for the EMR cluster on this page. + +![Step 3: General Cluster Settings](../img/AWS-EMR/RAPIDS_EMR_GUI_3.png) + +#### Step 4: Security + +Select an existing "EC2 key pair" that will be used to authenticate SSH access to the cluster's nodes. If you do not have access to an EC2 key pair, follow these instructions to [create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). + +*Optionally* set custom security groups in the "EC2 security groups" tab. + +In the "EC2 security groups" tab, confirm that the security group chosen for the "Master" node allows for SSH access. Follow these instructions to [allow inbound SSH traffic](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) if the security group does not allow it yet. + +![Step 4: Security](../img/AWS-EMR/RAPIDS_EMR_GUI_4.png) + +#### Finish Cluster Configuration + +The EMR cluster management page displays the status of multiple clusters or detailed information about a chosen cluster. In the detailed cluster view, the "Summary" and "Hardware" tabs can be used to monitor the status of master and core nodes as they provision and initialize. + +When the cluster is ready, a green-dot will appear next to the cluster name and the "Status" column will display **Waiting, cluster ready**. + +In the cluster's "Summary" tab, find the "Master public DNS" field and click the `SSH` button. Follow the instructions to SSH to the new cluster's master node. + +![Finish Cluster Configuration](../img/AWS-EMR/RAPIDS_EMR_GUI_5.png) + + +### Running an example joint operation using Spark Shell + +SSH to the EMR cluster's master node, get into sparks shell and run the sql join example to verify GPU operation. + +```bash +spark-shell +``` + +Running following Scala code in Spark Shell + +```scala +val data = 1 to 10000 +val df1 = sc.parallelize(data).toDF() +val df2 = sc.parallelize(data).toDF() +val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value") +out.count() +out.explain() +``` + +### Submit Spark jobs to a EMR Cluster Accelerated by GPUs + +Similar to spark-submit for on-prem clusters, AWS EMR supports a Spark application job to be submitted. The mortgage examples we use are also available as a spark application. You can also use **spark shell** to run the scala code or **pyspark** to run the python code on master node through CLI. + +### Running GPU Accelerated Mortgage ETL and XGBoost Example using EMR Notebook + +An EMR Notebook is a "serverless" Jupyter notebook. Unlike a traditional notebook, the contents of an EMR Notebook itself—the equations, visualizations, queries, models, code, and narrative text—are saved in Amazon S3 separately from the cluster that runs the code. This provides an EMR Notebook with durable storage, efficient access, and flexibility. + +You can use the following step-by-step guide to run the example mortgage dataset using RAPIDS on Amazon EMR GPU clusters. 
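Before running the full notebook, a quick sanity check from a PySpark notebook cell (or the `pyspark` shell on the master node) confirms that the RAPIDS plugin is active. This is an illustrative sketch, not a step from the AWS documentation:

```python
# Run in a PySpark cell attached to the GPU cluster (illustrative check only).
print(spark.conf.get("spark.plugins", ""))   # expect com.nvidia.spark.SQLPlugin

df = spark.range(0, 10000).join(spark.range(0, 10000), "id")
df.explain()                                 # GPU-enabled plans show Gpu* operators
print(df.count())
```

If the plan shows GPU operators, the cluster is ready for the mortgage ETL notebook described next.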
For more examples, please refer to [NVIDIA/spark-rapids for ETL](https://github.com/NVIDIA/spark-rapids/tree/main/docs/demo) and [NVIDIA/spark-rapids for XGBoost](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples). + +![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_2.png) + +#### Create EMR Notebook and Connect to EMR GPU Cluster + +Go to the AWS Management Console and select Notebooks in the left column. Click the Create notebook button. You can then click "Choose an existing cluster", pick the right cluster, and click the Choose button. Once the instance is ready, launch Jupyter from the EMR Notebook instance. + +![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_1.png) + +#### Run Mortgage ETL PySpark Notebook on EMR GPU Cluster + +Download [the Mortgage ETL PySpark Notebook](../demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb). Make sure to use PySpark as the kernel. This example uses one year (year 2000) of data for a two-node g4dn GPU cluster. You can adjust the settings in the notebook to run ETL on the full mortgage dataset. + +When executing the ETL code, you can also see the Spark job progress within the notebook, and the code will display how long it takes to run the query. + +![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_3.png) + +#### Run Mortgage XGBoost Scala Notebook on EMR GPU Cluster + +Please refer to this [quick start guide](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-2/getting-started-guides/csp/aws/Using_EMR_Notebook.md) to run GPU-accelerated XGBoost on an EMR Spark cluster. diff --git a/docs/get-started/getting-started-databricks.md b/docs/get-started/getting-started-databricks.md index 49b7b46f022..63e169ba3e1 100644 --- a/docs/get-started/getting-started-databricks.md +++ b/docs/get-started/getting-started-databricks.md @@ -34,7 +34,7 @@ We will need to create an initialization script for the cluster that installs th 4. Go back and edit your cluster to configure it to use the init script. To do this, click the “Clusters” button on the left panel, then select your cluster. 5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init Scripts” tab in the advanced options section, and paste the initialization script: `dbfs:/databricks/init_scripts/init.sh`, then click “Add”. - ![Init Script](../img/initscript.png) + ![Init Script](../img/Databricks/initscript.png) 6. Now select the “Spark” tab, and paste the following config options into the Spark Config section. Change the config values based on the workers you choose. See Apache Spark [configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator for Apache Spark [descriptions](../configs.md) for each config. @@ -55,7 +55,7 @@ We will need to create an initialization script for the cluster that installs th spark.rapids.sql.concurrentGpuTasks 2 ``` - ![Spark Config](../img/sparkconfig.png) + ![Spark Config](../img/Databricks/sparkconfig.png) 7. Once you’ve added the Spark config, click “Confirm and Restart”. 8. Once the cluster comes back up, it is now enabled for GPU-accelerated Spark with RAPIDS and cuDF.
diff --git a/docs/get-started/getting-started-gcp.md b/docs/get-started/getting-started-gcp.md index d8fc6bbd808..823227c964b 100644 --- a/docs/get-started/getting-started-gcp.md +++ b/docs/get-started/getting-started-gcp.md @@ -1,7 +1,7 @@ --- layout: page title: GCP Dataproc -nav_order: 2 +nav_order: 4 parent: Getting-Started --- @@ -59,12 +59,12 @@ gcloud dataproc clusters create $CLUSTER_NAME \ ``` This may take around 5-15 minutes to complete. You can navigate to the Dataproc clusters tab in the Google Cloud Console to see the progress. -![Dataproc Cluster](../img/dataproc-cluster.png) +![Dataproc Cluster](../img/GCP/dataproc-cluster.png) ## Run PySpark or Scala Notebook on a Dataproc Cluster Accelerated by GPUs To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click on the JupyterLab or Jupyter link to start to use sample [Mortgage ETL on GPU Jupyter Notebook](../demo/GCP/Mortgage-ETL-GPU.ipynb) to process full 17 years [Mortgage data](https://rapidsai.github.io/demos/datasets/mortgage-data). -![Dataproc Web Interfaces](../img/dataproc-service.png) +![Dataproc Web Interfaces](../img/GCP/dataproc-service.png) The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use 2016 data as the evaluation set and the rest as a training set, saving to respective GCS locations. Using the default notebook configuration the first stage should take ~110 seconds (1/3 of CPU execution time with same config) and the second stage takes ~170 seconds (1/7 of CPU execution time with same config). The notebook depends on the pre-compiled [Spark RAPIDS SQL plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark) and [cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.15), which are pre-downloaded by the GCP Dataproc [RAPIDS init script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). 
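The evaluation/training split described for the Dataproc notebook above amounts to filtering on the loan quarter; a rough PySpark sketch of that idea (the GCS paths and the `quarter` column name are assumptions for illustration, not taken from the notebook):

```python
# Illustrative sketch of the 2016-as-evaluation split described above.
# The GCS paths and the `quarter` column name are assumptions for the example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("gs://my-bucket/mortgage/etl_output/")

eval_df = df.filter(F.col("quarter").startswith("2016"))
train_df = df.filter(~F.col("quarter").startswith("2016"))

train_df.write.parquet("gs://my-bucket/mortgage/train/", mode="overwrite")
eval_df.write.parquet("gs://my-bucket/mortgage/eval/", mode="overwrite")
```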
diff --git a/docs/img/AWS-EMR/EMR_notebook_1.png b/docs/img/AWS-EMR/EMR_notebook_1.png new file mode 100644 index 00000000000..18dc7a95921 Binary files /dev/null and b/docs/img/AWS-EMR/EMR_notebook_1.png differ diff --git a/docs/img/AWS-EMR/EMR_notebook_2.png b/docs/img/AWS-EMR/EMR_notebook_2.png new file mode 100644 index 00000000000..6b6758a0522 Binary files /dev/null and b/docs/img/AWS-EMR/EMR_notebook_2.png differ diff --git a/docs/img/AWS-EMR/EMR_notebook_3.png b/docs/img/AWS-EMR/EMR_notebook_3.png new file mode 100644 index 00000000000..9b87a6bcbf4 Binary files /dev/null and b/docs/img/AWS-EMR/EMR_notebook_3.png differ diff --git a/docs/img/AWS-EMR/RAPIDS_EMR_GUI_1.png b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_1.png new file mode 100644 index 00000000000..08c3b2a22a4 Binary files /dev/null and b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_1.png differ diff --git a/docs/img/AWS-EMR/RAPIDS_EMR_GUI_2.png b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_2.png new file mode 100644 index 00000000000..f0444dc655d Binary files /dev/null and b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_2.png differ diff --git a/docs/img/AWS-EMR/RAPIDS_EMR_GUI_2b.png b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_2b.png new file mode 100644 index 00000000000..ffd1253b974 Binary files /dev/null and b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_2b.png differ diff --git a/docs/img/AWS-EMR/RAPIDS_EMR_GUI_3.png b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_3.png new file mode 100644 index 00000000000..d08d8b6dc54 Binary files /dev/null and b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_3.png differ diff --git a/docs/img/AWS-EMR/RAPIDS_EMR_GUI_4.png b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_4.png new file mode 100644 index 00000000000..1953bf68b30 Binary files /dev/null and b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_4.png differ diff --git a/docs/img/AWS-EMR/RAPIDS_EMR_GUI_5.png b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_5.png new file mode 100644 index 00000000000..cbb675d4a1c Binary files /dev/null and b/docs/img/AWS-EMR/RAPIDS_EMR_GUI_5.png differ diff --git a/docs/img/initscript.png b/docs/img/Databricks/initscript.png similarity index 100% rename from docs/img/initscript.png rename to docs/img/Databricks/initscript.png diff --git a/docs/img/sparkconfig.png b/docs/img/Databricks/sparkconfig.png similarity index 100% rename from docs/img/sparkconfig.png rename to docs/img/Databricks/sparkconfig.png diff --git a/docs/img/dataproc-cluster.png b/docs/img/GCP/dataproc-cluster.png similarity index 100% rename from docs/img/dataproc-cluster.png rename to docs/img/GCP/dataproc-cluster.png diff --git a/docs/img/dataproc-service.png b/docs/img/GCP/dataproc-service.png similarity index 100% rename from docs/img/dataproc-service.png rename to docs/img/GCP/dataproc-service.png