
Karthik changes (NVIDIA#359)
* Changes from Karthik

* fixing _config.yml, adding get-started

* fixing _config.yml (removing whitespace)

* Some additional fixes

* remove localhost reference

* updates to index, getting started gcp, perf image, stable release

* remove unused notebook, add description and reference for each cell of the notebook, reference to mortgage data and spark rapids config

* updated examples with mortgage dataset info, databricks guide with executor info and cluster config based on comments

* add reference to rapids plugin, cudf and spark xgboost in readme

Co-authored-by: Karthikeyan <krajendran@nvidia.com>
Co-authored-by: DougM <mengdong0427@gmail.com>
3 people authored Jul 16, 2020
1 parent 5522eb6 commit 1dc09f2
Showing 22 changed files with 4,143 additions and 79 deletions.
1 change: 1 addition & 0 deletions docs/demo/Databricks/generate-init-script.ipynb
@@ -0,0 +1 @@
{"cells":[{"cell_type":"code","source":["dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-0.1.0-databricks.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0-databricks/rapids-4-spark_2.12-0.1.0-databricks.jar\nsudo wget -O /databricks/jars/cudf-0.14-cuda10-1.jar https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar\"\"\", True)"],"metadata":{},"outputs":[],"execution_count":1},{"cell_type":"code","source":["%sh\ncd ../../dbfs/databricks/init_scripts\npwd\nls -ltr\ncat init.sh"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":3}],"metadata":{"name":"generate-init-script","notebookId":2645746662301564},"nbformat":4,"nbformat_minor":0}
1,174 changes: 1,174 additions & 0 deletions docs/demo/GCP/Mortgage-ETL-CPU.ipynb

Large diffs are not rendered by default.

1,292 changes: 1,292 additions & 0 deletions docs/demo/GCP/Mortgage-ETL-GPU.ipynb

Large diffs are not rendered by default.

173 changes: 173 additions & 0 deletions docs/demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb
@@ -0,0 +1,173 @@
{
"metadata": {
"name": "mortgage-gpu-scala",
"kernelspec": {
"language": "scala",
"name": "spark2-scala"
},
"language_info": {
"codemirror_mode": "text/x-scala",
"file_extension": ".scala",
"mimetype": "text/x-scala",
"name": "scala",
"pygments_lexer": "scala"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": "# Introduction to XGBoost Spark with GPU\n\nMortgage is an example of using an XGBoost classifier for binary classification. This notebook shows how to load data, train an XGBoost model, and use that model to predict whether a loan will become delinquent. Compared to the original XGBoost Spark code, there is only one API difference.\n\n## Load libraries\nFirst, load some common libraries used by both the GPU and CPU versions of XGBoost."
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}\nimport org.apache.spark.sql.SparkSession\nimport org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator\nimport org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}\nimport org.apache.spark.SparkConf"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "In addition, the CPU version requires some extra libraries:\n\n```scala\nimport org.apache.spark.ml.feature.VectorAssembler\nimport org.apache.spark.sql.DataFrame\nimport org.apache.spark.sql.functions._\nimport org.apache.spark.sql.types.FloatType\n```"
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "// Update all paths to match your Dataproc environment\nval trainPath \u003d \"gs://dataproc-nv-demo/mortgage_full/train/\"\nval evalPath \u003d \"gs://dataproc-nv-demo/mortgage_full/test/\"\nval transPath \u003d \"gs://dataproc-nv-demo/mortgage_full/test/\"\nval modelPath \u003d \"gs://dataproc-nv-demo/mortgage_full/model/mortgage\""
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Build the schema and parameters\nThe mortgage data has 28 columns: 27 features and 1 label. \"delinquency_12\" is the label column. The schema below is used when loading the data.\n\nThe next block also defines some key parameters used in the XGBoost training process."
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "val labelColName \u003d \"delinquency_12\"\nval schema \u003d StructType(List(\n StructField(\"orig_channel\", DoubleType),\n StructField(\"first_home_buyer\", DoubleType),\n StructField(\"loan_purpose\", DoubleType),\n StructField(\"property_type\", DoubleType),\n StructField(\"occupancy_status\", DoubleType),\n StructField(\"property_state\", DoubleType),\n StructField(\"product_type\", DoubleType),\n StructField(\"relocation_mortgage_indicator\", DoubleType),\n StructField(\"seller_name\", DoubleType),\n StructField(\"mod_flag\", DoubleType),\n StructField(\"orig_interest_rate\", DoubleType),\n StructField(\"orig_upb\", IntegerType),\n StructField(\"orig_loan_term\", IntegerType),\n StructField(\"orig_ltv\", DoubleType),\n StructField(\"orig_cltv\", DoubleType),\n StructField(\"num_borrowers\", DoubleType),\n StructField(\"dti\", DoubleType),\n StructField(\"borrower_credit_score\", DoubleType),\n StructField(\"num_units\", IntegerType),\n StructField(\"zip\", IntegerType),\n StructField(\"mortgage_insurance_percent\", DoubleType),\n StructField(\"current_loan_delinquency_status\", IntegerType),\n StructField(\"current_actual_upb\", DoubleType),\n StructField(\"interest_rate\", DoubleType),\n StructField(\"loan_age\", DoubleType),\n StructField(\"msa\", DoubleType),\n StructField(\"non_interest_bearing_upb\", DoubleType),\n StructField(labelColName, IntegerType)))\n\nval featureNames \u003d schema.filter(_.name !\u003d labelColName).map(_.name)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Create a new spark session and load data\n\nA new Spark session should be created to run all of the following Spark operations.\n\nNOTE: in this notebook, the dependency jars were loaded when the Toree kernel was installed. Alternatively, the jars can be loaded into the notebook with the [%AddJar magic](https://toree.incubator.apache.org/docs/current/user/faq/). However, there is one restriction for `%AddJar`: the uploaded jar is only available when `%AddJar` is called right after a new Spark session is created, as below:\n\n```scala\nimport org.apache.spark.sql.SparkSession\nval spark \u003d SparkSession.builder().appName(\"mortgage-GPU\").getOrCreate\n%AddJar file:/data/libs/cudf-XXX-cuda10.jar\n%AddJar file:/data/libs/rapids-4-spark-XXX.jar\n%AddJar file:/data/libs/xgboost4j_3.0-XXX.jar\n%AddJar file:/data/libs/xgboost4j-spark_3.0-XXX.jar\n// ...\n```\n\n##### Please note that the jar \"rapids-4-spark-XXX.jar\" is only needed for the GPU version; do not add it to the dependency list for the CPU version."
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "// Build the spark session and data reader as usual\nval conf \u003d new SparkConf()\n// executor and task sizing: 7 cores per task, so each executor runs one task\nconf.set(\"spark.executor.instances\", \"20\")\nconf.set(\"spark.executor.cores\", \"7\")\nconf.set(\"spark.task.cpus\", \"7\")\nconf.set(\"spark.executor.memory\", \"24g\")\n// pinned host memory and off-heap overhead for fast host-to-GPU transfers\nconf.set(\"spark.rapids.memory.pinnedPool.size\", \"2G\")\nconf.set(\"spark.executor.memoryOverhead\", \"16G\")\nconf.set(\"spark.executor.extraJavaOptions\", \"-Dai.rapids.cudf.prefer-pinned\u003dtrue\")\nconf.set(\"spark.locality.wait\", \"0s\")\nconf.set(\"spark.sql.files.maxPartitionBytes\", \"512m\")\n// one GPU per executor, one task per GPU\nconf.set(\"spark.executor.resource.gpu.amount\", \"1\")\nconf.set(\"spark.task.resource.gpu.amount\", \"1\")\n// enable the RAPIDS SQL plugin and tune its batch sizes\nconf.set(\"spark.plugins\", \"com.nvidia.spark.SQLPlugin\")\nconf.set(\"spark.rapids.sql.hasNans\", \"false\")\nconf.set(\"spark.rapids.sql.batchSizeBytes\", \"512M\")\nconf.set(\"spark.rapids.sql.reader.batchSizeBytes\", \"768M\")\nconf.set(\"spark.rapids.sql.variableFloatAgg.enabled\", \"true\")\nconf.set(\"spark.rapids.memory.gpu.pooling.enabled\", \"false\")\n// conf.set(\"spark.rapids.memory.gpu.allocFraction\", \"0.1\")\nval spark \u003d SparkSession.builder.appName(\"mortgage-gpu\")\n .enableHiveSupport()\n .config(conf)\n .getOrCreate\nval reader \u003d spark.read.option(\"header\", true).schema(schema)"
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "val trainSet \u003d reader.parquet(trainPath)\nval evalSet \u003d reader.parquet(evalPath)\nval transSet \u003d reader.parquet(transPath)"
},
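{
"cell_type": "markdown",
"metadata": {},
"source": "As an optional sanity check (an illustrative addition, not part of the original notebook), the loaded DataFrames can be inspected before training. Note that `count()` scans the full dataset, so it may take a while on the full mortgage data."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "// Optional sanity check (illustrative): verify the schema and row counts\ntrainSet.printSchema()\nprintln(s\"train rows: ${trainSet.count()}\")\nprintln(s\"eval rows: ${evalSet.count()}\")"
},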
{
"cell_type": "markdown",
"metadata": {},
"source": "## Set xgboost parameters and build an XGBoostClassifier\n\nFor the CPU version, `num_workers` is recommended to be the number of CPU cores; for the GPU version, it should be set to the number of GPUs in the Spark cluster.\n\nThe `tree_method` also differs between the two versions: currently only \"gpu_hist\" is supported for training on the GPU.\n\n```scala\n// parameters that differ in the CPU version\n \"num_workers\" -\u003e 12,\n \"tree_method\" -\u003e \"hist\",\n```"
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "val commParamMap \u003d Map(\n \"eta\" -\u003e 0.1,\n \"gamma\" -\u003e 0.1,\n \"missing\" -\u003e 0.0,\n \"max_depth\" -\u003e 10,\n \"max_leaves\" -\u003e 256,\n \"objective\" -\u003e \"binary:logistic\",\n \"grow_policy\" -\u003e \"depthwise\",\n \"min_child_weight\" -\u003e 30,\n \"lambda\" -\u003e 1,\n \"scale_pos_weight\" -\u003e 2,\n \"subsample\" -\u003e 1,\n \"num_round\" -\u003e 100)\n \nval xgbParamFinal \u003d commParamMap ++ Map(\"tree_method\" -\u003e \"gpu_hist\", \"num_workers\" -\u003e 20, \"nthread\" -\u003e 7)"
},
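{
"cell_type": "markdown",
"metadata": {},
"source": "For comparison, a CPU-only parameter map might be assembled as below. This is an illustrative sketch, not part of the original notebook; the `num_workers` and `nthread` values are assumptions that should be sized to the actual CPU cluster."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "// Illustrative CPU-only equivalent: \"hist\" instead of \"gpu_hist\";\n// worker and thread counts are assumptions for a hypothetical CPU cluster\nval xgbParamFinalCpu \u003d commParamMap ++ Map(\"tree_method\" -\u003e \"hist\", \"num_workers\" -\u003e 12, \"nthread\" -\u003e 4)"
},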
{
"cell_type": "markdown",
"metadata": {},
"source": "Here is the only API difference: `setFeaturesCol` in the CPU version vs. `setFeaturesCols` in the GPU version.\n\nThe CPU version needs `VectorAssembler` to assemble multiple feature columns into one column, because `setFeaturesCol` accepts only a single feature column of type `vector`.\n\n`setFeaturesCols` supports multiple columns directly, so the feature column names can be passed straight to `XGBoostClassifier`.\n\nCPU version:\n\n```scala\nval xgbClassifier \u003d new XGBoostClassifier(paramMap)\n .setLabelCol(labelName)\n .setFeaturesCol(\"features\")\n```"
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "val xgbClassifier \u003d new XGBoostClassifier(xgbParamFinal)\n .setLabelCol(labelColName)\n // \u003d\u003d\u003d diff \u003d\u003d\u003d\n .setFeaturesCols(featureNames)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Benchmark and train\nThe `benchmark` function defined below measures the elapsed time of an operation.\n\nTraining with evaluation sets is supported in two ways, matching the CPU version\u0027s behavior:\n\n* Call the `setEvalSets` API after initializing an XGBoostClassifier\n\n```scala\nxgbClassifier.setEvalSets(Map(\"eval\" -\u003e evalSet))\n```\n\n* Use the `eval_sets` parameter when initializing an XGBoostClassifier\n\n```scala\nval paramMapWithEval \u003d paramMap + (\"eval_sets\" -\u003e Map(\"eval\" -\u003e evalSet))\nval xgbClassifierWithEval \u003d new XGBoostClassifier(paramMapWithEval)\n```\n\nThis notebook uses the API approach to set the evaluation sets."
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "xgbClassifier.setEvalSets(Map(\"eval\" -\u003e evalSet))"
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "def benchmark[R](phase: String)(block: \u003d\u003e R): (R, Float) \u003d {\n val t0 \u003d System.currentTimeMillis\n val result \u003d block // call-by-name\n val t1 \u003d System.currentTimeMillis\n println(\"Elapsed time [\" + phase + \"]: \" + ((t1 - t0).toFloat / 1000) + \"s\")\n (result, (t1 - t0).toFloat / 1000)\n}"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The CPU version requires an extra step before fitting the data to the classifier: `VectorAssembler` is used to assemble all feature columns into one column. The following snippet shows how to do the vectorizing.\n\n```scala\nobject Vectorize {\n def apply(df: DataFrame, featureNames: Seq[String], labelName: String): DataFrame \u003d {\n val toFloat \u003d df.schema.map(f \u003d\u003e col(f.name).cast(FloatType))\n new VectorAssembler()\n .setInputCols(featureNames.toArray)\n .setOutputCol(\"features\")\n .transform(df.select(toFloat:_*))\n .select(col(\"features\"), col(labelName))\n }\n}\n\nval trainVec \u003d Vectorize(trainSet, featureNames, labelColName)\nval evalVec \u003d Vectorize(evalSet, featureNames, labelColName)\nval transVec \u003d Vectorize(transSet, featureNames, labelColName)\n```\n\n`VectorAssembler` is not needed for the GPU version; the loaded data can be fit directly to the XGBoostClassifier."
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "// Start training\nprintln(\"\\n------ Training ------\")\nval (xgbClassificationModel, _) \u003d benchmark(\"train\") {\n xgbClassifier.fit(trainSet)\n}"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## Transformation and evaluation\nHere `transSet` is used to evaluate the model, and some useful columns are printed to show the prediction results. Then `MulticlassClassificationEvaluator` is used to calculate the overall accuracy of the predictions."
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "println(\"\\n------ Transforming ------\")\nval (results, _) \u003d benchmark(\"transform\") {\n val ret \u003d xgbClassificationModel.transform(transSet).cache()\n ret\n}\nz.show(results.select(\"orig_channel\", labelColName,\"rawPrediction\",\"probability\",\"prediction\").limit(10))\n\nprintln(\"\\n------Accuracy of Evaluation------\")\nval evaluator \u003d new MulticlassClassificationEvaluator().setLabelCol(labelColName)\nval accuracy \u003d evaluator.evaluate(results)\nprintln(accuracy)"
},
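{
"cell_type": "markdown",
"metadata": {},
"source": "Since this is a binary classifier, the area under the ROC curve can also be reported. This cell is an illustrative addition, not part of the original notebook; it uses Spark ML\u0027s `BinaryClassificationEvaluator` on the cached `results` DataFrame."
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "// Illustrative addition: area under ROC (the evaluator\u0027s default metric)\nimport org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nval aucEvaluator \u003d new BinaryClassificationEvaluator()\n .setLabelCol(labelColName)\n .setRawPredictionCol(\"rawPrediction\")\nprintln(\"AUC: \" + aucEvaluator.evaluate(results))"
},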
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"autoscroll": "auto"
},
"outputs": [],
"source": "xgbClassificationModel.write.overwrite.save(modelPath)\n\nval modelFromDisk \u003d XGBoostClassificationModel.load(modelPath)\n\nval (results2, _) \u003d benchmark(\"transform2\") {\n modelFromDisk.transform(transSet)\n}\nz.show(results2.limit(5))"
}
]
}