Refactoring and review-comment changes
bilalbari committed May 30, 2024
1 parent 1d35aef commit ece6971
Showing 45 changed files with 254 additions and 1,248 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -4,8 +4,8 @@ A repo for Spark related benchmark sets and utilities using the
[RAPIDS Accelerator For Apache Spark](https://github.com/NVIDIA/spark-rapids).

## Benchmark sets:
-- [Nvidia Decision Support(NDS)](./nds/)
-- [Nvidia Decision Support-H(NDS-H)](./nds-h/)
+- [Nvidia Decision Support (NDS)](./nds/)
+- [Nvidia Decision Support-H (NDS-H)](./nds-h/)

Please see README in each benchmark set for more details including building instructions and usage
descriptions.
103 changes: 62 additions & 41 deletions nds-h/README.md
@@ -7,21 +7,30 @@

```bash
sudo locale-gen en_US.UTF-8
-sudo apt install openjdk-8-jdk-headless gcc make flex bison byacc maven spark
+sudo apt install openjdk-8-jdk-headless gcc make flex bison byacc maven
sudo apt install dos2unix
```
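A quick sanity check that the toolchain is in place before continuing (output details vary by distro):

```bash
java -version    # expect an OpenJDK 8 build
mvn -version
flex --version && bison --version
```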
-3. Update Configuration
-
-   - Update shared/base.template line 26 with the Spark home location.
-
-4. For GPU run
+3. Install and set up Spark (a download sketch follows this list).
+   - Download the latest distribution from [here](https://spark.apache.org/downloads.html)
+   - Spark >= 3.4 is preferred
+   - Find and note SPARK_HOME (/DOWNLOAD/LOCATION/spark-<3.4.1>-bin-hadoop3)
+   - (For local) Follow the steps mentioned [here](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#local-mode) for local setup
+   - (For local) Update the *_gpu_* templates' --files entry with the getGpuResources.sh location as mentioned in the link above
+   - (For local) Update the Spark master in shared/base.template to local[*]
+   - (For local) Remove the conf "spark.task.resource.gpu.amount=0.05" from all template files
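A minimal sketch of this step, assuming the Spark 3.4.1 / Hadoop 3 build unpacked under the current directory (adjust the version and paths to your download):

```bash
# download and unpack a Spark distribution, then note SPARK_HOME
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
tar -xzf spark-3.4.1-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.4.1-bin-hadoop3
$SPARK_HOME/bin/spark-submit --version   # sanity check
```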

+4. Update Configuration (an illustrative entry follows below)
+   - Update [shared/base.template](../shared/base.template) line 26 with the Spark home location.
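base.template is a shell-style configuration, so the line-26 entry looks roughly like the following; the path is an example, not the template's literal contents:

```bash
export SPARK_HOME=/path/to/spark-3.4.1-bin-hadoop3
```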

+5. For GPU run
- Download the latest RAPIDS jar from [here](https://oss.sonatype.org/content/repositories/staging/com/nvidia/rapids-4-spark_2.12/)

-- Update shared/base.template line 36 with rapids plugin jar location
+- Update [shared/base.template](../shared/base.template) line 36 with the rapids plugin jar location
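For illustration, one way to fetch the plugin jar; the version below is an example only (Maven Central mirrors released builds), and the downloaded jar's path is what goes into line 36:

```bash
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.0/rapids-4-spark_2.12-24.04.0.jar
# then point line 36 of shared/base.template at this jar's absolute path
```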

-5. TPC-H Tools
+6. TPC-H Tools

- Download TPC-H Tools from the [official TPC website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp). The tool will be downloaded as a zip package with a random guid string prefix.

- After unzipping it, a folder called `TPC-H V3.0.1` will be seen.
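If you build the `dbgen` tool by hand, the classic TPC-H flow is sketched below; the zip name is a placeholder, and the repo's tpch-gen tooling may automate these steps:

```bash
unzip <guid-prefix>-tpc-h-tool.zip
cd "TPC-H V3.0.1/dbgen"
cp makefile.suite makefile   # then set CC, DATABASE, MACHINE and WORKLOAD in it
make
```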

@@ -42,17 +51,18 @@ submit the Spark job with the configs defined in the template.
Example command to submit via the `spark-submit-template` utility:

```bash
-shared/spark-submit-template convert_submit_cpu.template \
-../nds-h/ndsH_transcode.py <raw_data_file_location> <parquet_location> report.txt
+cd shared
+./spark-submit-template convert_submit_cpu.template \
+../nds-h/nds_h_transcode.py <raw_data_file_location> <parquet_location> report.txt
```

-We provide 2 types of template files used in different steps of NDS:
+We provide 2 types of template files used in different steps of NDS-H:

-1. *convert_submit_*.template for converting the data by using ndsH_transcode.py
-2. *power_run_*.template for power run by using nds_power.py
+1. *convert_submit_*.template for converting the data by using nds_h_transcode.py
+2. *power_run_*.template for power run by using nds_h_power.py

We predefine different template files for different environments.
-For example, we provide below template files to run ndsH_transcode.py for different environment:
+For example, we provide the below template files to run nds_h_transcode.py in different environments:

* `convert_submit_cpu.template` is for Spark CPU cluster
* `convert_submit_gpu.template` is for Spark GPU cluster
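The GPU templates differ from the CPU ones mainly by enabling the RAPIDS Accelerator. The flags below show the general shape of what such a template adds, not its literal contents; `$RAPIDS_JAR` and `your_job.py` are placeholders:

```bash
spark-submit \
  --jars "$RAPIDS_JAR" \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  your_job.py   # plus the master/executor settings supplied by base.template
```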
@@ -76,8 +86,8 @@ When running multiple steps of NDS-H, you only need to modify `base.template` to
To generate data locally:

```bash
-$ python ndsH_gen_data.py -h
-usage: ndsH_gen_data.py [-h] [--overwrite_output] scale parallel data_dir
+$ python nds_h_gen_data.py -h
+usage: nds_h_gen_data.py [-h] <scale> <parallel> <data_dir> [--overwrite_output]
positional arguments:
scale volume of data to generate in GB.
parallel build data in <parallel_value> separate chunks
@@ -92,19 +102,20 @@ optional arguments:
Example command:
```bash
-python ndsH_gen_data.py 100 100 /data/raw_sf100 --overwrite_output
+python nds_h_gen_data.py 100 100 /data/raw_sf100 --overwrite_output
```
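As a quick sanity check (paths assumed from the example above), each TPC-H table should appear as pipe-delimited chunks under the output directory:

```bash
ls /data/raw_sf100 | head   # expect files such as lineitem.tbl.1, orders.tbl.1, ...
```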
### Convert DSV to Parquet or Other data sources
-To do the data conversion, the `ndsH_transcode.py` need to be submitted as a Spark job. User can leverage
+To do the data conversion, `nds_h_transcode.py` needs to be submitted as a Spark job. Users can leverage
the [spark-submit-template](../shared/spark-submit-template) utility to simplify the submission.
-The utility requires a pre-defined [template file](../shared/convert_submit_gpu.template) where user needs to put necessary Spark configurations. Alternatively user can submit the `ndsH_transcode.py` directly to spark with arbitrary Spark parameters.
+The utility requires a pre-defined [template file](../shared/convert_submit_gpu.template) where the user puts the necessary Spark configurations. Alternatively, the user can submit `nds_h_transcode.py` directly to Spark with arbitrary Spark parameters.
DSV (pipe) is the default input format for data conversion; it can be overridden by `--input_format`.
```bash
-shared/spark-submit-template convert_submit_cpu.template ../nds-h/ndsH_transcode.py <input_data_location>
+cd shared
+./spark-submit-template convert_submit_cpu.template ../nds-h/nds_h_transcode.py <input_data_location>
<output_data_location> <report_file_location>
```
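A sketch of the direct `spark-submit` route, assuming you replicate the template's settings by hand; the master and memory values are placeholders:

```bash
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --driver-memory 10G \
  nds-h/nds_h_transcode.py <input_data_location> <output_data_location> <report_file_location>
```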
@@ -115,31 +126,35 @@ The [templates.patch](./tpch-gen/patches/template_new.patch) that contains necessary
### Generate Specific Query or Query Streams
```text
-usage: ndsH_gen_query_stream.py [-h] (--streams STREAMS)
+usage: nds_h_gen_query_stream.py [-h] (--template TEMPLATE | --streams STREAMS)
scale output_dir
positional arguments:
scale assume a database of this scale factor.
output_dir generate query in directory.
+  template | stream   generate a query stream, or a single query from a template argument
optional arguments:
-h, --help show this help message and exit
+  --template TEMPLATE  build queries from this template. Only used to generate one query from one template.
+                       This argument is mutually exclusive with --streams. It is often used for test purposes.
--streams STREAMS generate how many query streams. This argument is mutually exclusive with --template.
```
-Example command to generate one query using query1.tpl:
+Example command to generate one query using template 1.sql (there are 22 default queries and templates):
```bash
-python ndsH_gen_query_stream.py 3000 ./query_1
+cd nds-h
+python nds_h_gen_query_stream.py 3000 ./query_1 --template <query_number>
```
-Example command to generate 10 query streams each one of which contains all NDS queries but in
+Example command to generate 10 query streams, each of which contains all NDS-H queries but in
different order:
```bash
-python ndsH_gen_query_stream.py 3000 ./query_streams --streams 10
+cd nds-h
+python nds_h_gen_query_stream.py 3000 ./query_streams --streams 10
```
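Assuming each stream is written out as a numbered `.sql` file, a quick listing confirms the result:

```bash
ls ./query_streams   # expect one numbered query_*.sql file per stream
```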
## Benchmark Runner
@@ -153,25 +168,30 @@ finished. This is often used for test or query monitoring purpose.
To build:
```bash
-cd shared/jvm_listener
+cd utils/jvm_listener
mvn package
```
-`nds-benchmark-listener-1.0-SNAPSHOT.jar` will be generated in `jvm_listener/target` folder.
+`benchmark-listener-1.0-SNAPSHOT.jar` will be generated in the `jvm_listener/target` folder.
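The power-run templates are expected to make this jar visible to Spark; a hypothetical way to attach it by hand (flag placement illustrative):

```bash
spark-submit --jars utils/jvm_listener/target/benchmark-listener-1.0-SNAPSHOT.jar <other args>
```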
### Power Run
-_After_ user generates query streams, Power Run can be executed using one of them by submitting `ndsH_power.py` to Spark.
+_After_ the user generates query streams, a Power Run can be executed using one of them by submitting `nds_h_power.py` to Spark.
-Arguments supported by `ndsH_power.py`:
+Arguments supported by `nds_h_power.py`:
```text
-usage: nds_power.py [-h] [--input_format {parquet,orc,avro,csv,json,iceberg,delta}] [--output_prefix OUTPUT_PREFIX] [--output_format OUTPUT_FORMAT] [--property_file PROPERTY_FILE] time_log
+usage: nds_h_power.py [-h] [--input_format {parquet,}]
+                      [--output_format OUTPUT_FORMAT]
+                      [--property_file PROPERTY_FILE]
+                      <input_data_location>
+                      <query_stream_file>
+                      <time_log_file>
positional arguments:
-  input_prefix         text to prepend to every input file path (e.g., "hdfs:///ds-generated-data").
-  query_stream_file    query stream file that contains NDS queries in specific order
-  time_log             path to execution time log, only support local path.
+  input_data_location  input data location (e.g., "hdfs:///ds-generated-data").
+  query_stream_file    query stream file that contains NDS-H queries in a specific order
+  time_log_file        path to the execution time log; only local paths are supported.
optional arguments:
-h, --help show this help message and exit
Expand All @@ -185,15 +205,16 @@ optional arguments:
property file for Spark configuration.
```
-Example command to submit nds_power.py by spark-submit-template utility:
+Example command to submit nds_h_power.py via the spark-submit-template utility:
```bash
-shared/spark-submit-template power_run_gpu.template \
-nds_power.py \
+cd shared
+./spark-submit-template power_run_gpu.template \
+../nds-h/nds_h_power.py \
<parquet_folder_location> \
-<query_stream>/query_0.sql \
+<query_stream_folder>/query_0.sql \
time.csv \
---property_file properties/aqe-on.properties
+--property_file ../utils/properties/aqe-on.properties
```
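Judging by its name, `aqe-on.properties` presumably toggles Adaptive Query Execution; the inline equivalent would be roughly the following (a sketch, not the file's verified contents):

```bash
spark-submit --conf spark.sql.adaptive.enabled=true <other args>
```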
Users can also use `spark-submit` to submit `nds_h_power.py` directly.
@@ -210,7 +231,7 @@ The command above will use `collect()` action to trigger Spark job for each query
```bash
./spark-submit-template power_run_gpu.template \
-nds_power.py \
+nds_h_power.py \
parquet_sf3k \
./nds_query_streams/query_0.sql \
time.csv
```
151 changes: 0 additions & 151 deletions nds-h/check.py

This file was deleted.

