Refactoring and review-comment changes
bilalbari committed May 30, 2024
1 parent 1d35aef commit ece6971
Showing 45 changed files with 254 additions and 1,248 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -4,8 +4,8 @@ A repo for Spark related benchmark sets and utilities using the
[RAPIDS Accelerator For Apache Spark](https://github.com/NVIDIA/spark-rapids).

## Benchmark sets:
-- [Nvidia Decision Support(NDS)](./nds/)
-- [Nvidia Decision Support-H(NDS-H)](./nds-h/)
+- [Nvidia Decision Support (NDS)](./nds/)
+- [Nvidia Decision Support-H (NDS-H)](./nds-h/)

Please see README in each benchmark set for more details including building instructions and usage
descriptions.
103 changes: 62 additions & 41 deletions nds-h/README.md
@@ -7,21 +7,30 @@

```bash
sudo locale-gen en_US.UTF-8
-sudo apt install openjdk-8-jdk-headless gcc make flex bison byacc maven spark
+sudo apt install openjdk-8-jdk-headless gcc make flex bison byacc maven
sudo apt install dos2unix
```
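A quick sanity check that the toolchain is in place before continuing (output details vary by distro):

```bash
java -version    # expect an OpenJDK 8 build
mvn -version
flex --version && bison --version
```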
-3. Update Configuration
-
-   - Update shared/base.template line 26 with the Spark home location.
-
-4. For GPU run
+3. Install and set up Spark (a download sketch follows this list).
+   - Download the latest distribution from [here](https://spark.apache.org/downloads.html)
+   - Spark >= 3.4 is preferred
+   - Find and note SPARK_HOME (/DOWNLOAD/LOCATION/spark-<3.4.1>-bin-hadoop3)
+   - (For local) Follow the steps mentioned [here](https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/on-premise.html#local-mode) for local setup
+   - (For local) Update the *_gpu_* templates' --files entry with the getGpuResources.sh location as mentioned in the link above
+   - (For local) Update the Spark master in shared/base.template to local[*]
+   - (For local) Remove the conf "spark.task.resource.gpu.amount=0.05" from all template files
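A minimal sketch of this step, assuming the Spark 3.4.1 / Hadoop 3 build unpacked under the current directory (adjust the version and paths to your download):

```bash
# download and unpack a Spark distribution, then note SPARK_HOME
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
tar -xzf spark-3.4.1-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.4.1-bin-hadoop3
$SPARK_HOME/bin/spark-submit --version   # sanity check
```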

+4. Update Configuration (an illustrative entry follows below)
+   - Update [shared/base.template](../shared/base.template) line 26 with the Spark home location.
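base.template is a shell-style configuration, so the line-26 entry looks roughly like the following; the path is an example, not the template's literal contents:

```bash
export SPARK_HOME=/path/to/spark-3.4.1-bin-hadoop3
```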

+5. For GPU run
- Download the latest RAPIDS jar from [here](https://oss.sonatype.org/content/repositories/staging/com/nvidia/rapids-4-spark_2.12/)

-- Update shared/base.template line 36 with rapids plugin jar location
+- Update [shared/base.template](../shared/base.template) line 36 with the rapids plugin jar location
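For illustration, one way to fetch the plugin jar; the version below is an example only (Maven Central mirrors released builds), and the downloaded jar's path is what goes into line 36:

```bash
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/24.04.0/rapids-4-spark_2.12-24.04.0.jar
# then point line 36 of shared/base.template at this jar's absolute path
```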

-5. TPC-H Tools
+6. TPC-H Tools

- Download TPC-H Tools from the [official TPC website](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp). The tool will be downloaded as a zip package with a random guid string prefix.

- After unzipping it, a folder called `TPC-H V3.0.1` will be seen.
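If you build the `dbgen` tool by hand, the classic TPC-H flow is sketched below; the zip name is a placeholder, and the repo's tpch-gen tooling may automate these steps:

```bash
unzip <guid-prefix>-tpc-h-tool.zip
cd "TPC-H V3.0.1/dbgen"
cp makefile.suite makefile   # then set CC, DATABASE, MACHINE and WORKLOAD in it
make
```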

@@ -42,17 +51,18 @@ submit the Spark job with the configs defined in the template.
Example command to submit via the `spark-submit-template` utility:

```bash
-shared/spark-submit-template convert_submit_cpu.template \
-../nds-h/ndsH_transcode.py <raw_data_file_location> <parquet_location> report.txt
+cd shared
+./spark-submit-template convert_submit_cpu.template \
+../nds-h/nds_h_transcode.py <raw_data_file_location> <parquet_location> report.txt
```

-We provide 2 types of template files used in different steps of NDS:
+We provide 2 types of template files used in different steps of NDS-H:

-1. *convert_submit_*.template for converting the data by using ndsH_transcode.py
-2. *power_run_*.template for power run by using nds_power.py
+1. *convert_submit_*.template for converting the data by using nds_h_transcode.py
+2. *power_run_*.template for power run by using nds_h_power.py

We predefine different template files for different environments.
-For example, we provide below template files to run ndsH_transcode.py for different environment:
+For example, we provide the below template files to run nds_h_transcode.py in different environments:

* `convert_submit_cpu.template` is for Spark CPU cluster
* `convert_submit_gpu.template` is for Spark GPU cluster
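The GPU templates differ from the CPU ones mainly by enabling the RAPIDS Accelerator. The flags below show the general shape of what such a template adds, not its literal contents; `$RAPIDS_JAR` and `your_job.py` are placeholders:

```bash
spark-submit \
  --jars "$RAPIDS_JAR" \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  your_job.py   # plus the master/executor settings supplied by base.template
```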
@@ -76,8 +86,8 @@ When running multiple steps of NDS-H, you only need to modify `base.template` to
To generate data locally:

```bash
-$ python ndsH_gen_data.py -h
-usage: ndsH_gen_data.py [-h] [--overwrite_output] scale parallel data_dir
+$ python nds_h_gen_data.py -h
+usage: nds_h_gen_data.py [-h] <scale> <parallel> <data_dir> [--overwrite_output]
positional arguments:
scale volume of data to generate in GB.
parallel build data in <parallel_value> separate chunks
@@ -92,19 +102,20 @@ optional arguments:
Example command:
```bash
-python ndsH_gen_data.py 100 100 /data/raw_sf100 --overwrite_output
+python nds_h_gen_data.py 100 100 /data/raw_sf100 --overwrite_output
```
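As a quick sanity check (paths assumed from the example above), each TPC-H table should appear as pipe-delimited chunks under the output directory:

```bash
ls /data/raw_sf100 | head   # expect files such as lineitem.tbl.1, orders.tbl.1, ...
```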
### Convert DSV to Parquet or Other data sources
-To do the data conversion, the `ndsH_transcode.py` need to be submitted as a Spark job. User can leverage
+To do the data conversion, `nds_h_transcode.py` needs to be submitted as a Spark job. Users can leverage
the [spark-submit-template](../shared/spark-submit-template) utility to simplify the submission.
-The utility requires a pre-defined [template file](../shared/convert_submit_gpu.template) where user needs to put necessary Spark configurations. Alternatively user can submit the `ndsH_transcode.py` directly to spark with arbitrary Spark parameters.
+The utility requires a pre-defined [template file](../shared/convert_submit_gpu.template) where the user puts the necessary Spark configurations. Alternatively, the user can submit `nds_h_transcode.py` directly to Spark with arbitrary Spark parameters.
DSV (pipe) is the default input format for data conversion; it can be overridden by `--input_format`.
```bash
-shared/spark-submit-template convert_submit_cpu.template ../nds-h/ndsH_transcode.py <input_data_location>
+cd shared
+./spark-submit-template convert_submit_cpu.template ../nds-h/nds_h_transcode.py <input_data_location>
<output_data_location> <report_file_location>
```
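A sketch of the direct `spark-submit` route, assuming you replicate the template's settings by hand; the master and memory values are placeholders:

```bash
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --driver-memory 10G \
  nds-h/nds_h_transcode.py <input_data_location> <output_data_location> <report_file_location>
```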
@@ -115,31 +126,35 @@ The [templates.patch](./tpch-gen/patches/template_new.patch) that contains necessary
### Generate Specific Query or Query Streams
```text
-usage: ndsH_gen_query_stream.py [-h] (--streams STREAMS)
+usage: nds_h_gen_query_stream.py [-h] (--template TEMPLATE | --streams STREAMS)
scale output_dir
positional arguments:
scale assume a database of this scale factor.
output_dir generate query in directory.
+  template | stream   generate a query stream, or a single query from a template argument
optional arguments:
-h, --help show this help message and exit
+  --template TEMPLATE  build queries from this template. Only used to generate one query from one template.
+                       This argument is mutually exclusive with --streams. It is often used for test purposes.
--streams STREAMS generate how many query streams. This argument is mutually exclusive with --template.
```
-Example command to generate one query using query1.tpl:
+Example command to generate one query using template 1.sql (there are 22 default queries and templates):
```bash
-python ndsH_gen_query_stream.py 3000 ./query_1
+cd nds-h
+python nds_h_gen_query_stream.py 3000 ./query_1 --template <query_number>
```
-Example command to generate 10 query streams each one of which contains all NDS queries but in
+Example command to generate 10 query streams, each of which contains all NDS-H queries but in
different order:
```bash
-python ndsH_gen_query_stream.py 3000 ./query_streams --streams 10
+cd nds-h
+python nds_h_gen_query_stream.py 3000 ./query_streams --streams 10
```
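Assuming each stream is written out as a numbered `.sql` file, a quick listing confirms the result:

```bash
ls ./query_streams   # expect one numbered query_*.sql file per stream
```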
## Benchmark Runner
@@ -153,25 +168,30 @@ finished. This is often used for test or query monitoring purpose.
To build:
```bash
-cd shared/jvm_listener
+cd utils/jvm_listener
mvn package
```
-`nds-benchmark-listener-1.0-SNAPSHOT.jar` will be generated in `jvm_listener/target` folder.
+`benchmark-listener-1.0-SNAPSHOT.jar` will be generated in the `jvm_listener/target` folder.
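The power-run templates are expected to make this jar visible to Spark; a hypothetical way to attach it by hand (flag placement illustrative):

```bash
spark-submit --jars utils/jvm_listener/target/benchmark-listener-1.0-SNAPSHOT.jar <other args>
```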
### Power Run
-_After_ user generates query streams, Power Run can be executed using one of them by submitting `ndsH_power.py` to Spark.
+_After_ the user generates query streams, a Power Run can be executed using one of them by submitting `nds_h_power.py` to Spark.
-Arguments supported by `ndsH_power.py`:
+Arguments supported by `nds_h_power.py`:
```text
-usage: nds_power.py [-h] [--input_format {parquet,orc,avro,csv,json,iceberg,delta}] [--output_prefix OUTPUT_PREFIX] [--output_format OUTPUT_FORMAT] [--property_file PROPERTY_FILE] time_log
+usage: nds_h_power.py [-h] [--input_format {parquet,}]
+                      [--output_format OUTPUT_FORMAT]
+                      [--property_file PROPERTY_FILE]
+                      <input_data_location>
+                      <query_stream_file>
+                      <time_log_file>
positional arguments:
-  input_prefix         text to prepend to every input file path (e.g., "hdfs:///ds-generated-data").
-  query_stream_file    query stream file that contains NDS queries in specific order
-  time_log             path to execution time log, only support local path.
+  input_data_location  input data location (e.g., "hdfs:///ds-generated-data").
+  query_stream_file    query stream file that contains NDS-H queries in a specific order
+  time_log_file        path to the execution time log; only local paths are supported.
optional arguments:
-h, --help show this help message and exit
Expand All @@ -185,15 +205,16 @@ optional arguments:
property file for Spark configuration.
```
-Example command to submit nds_power.py by spark-submit-template utility:
+Example command to submit nds_h_power.py via the spark-submit-template utility:
```bash
-shared/spark-submit-template power_run_gpu.template \
-nds_power.py \
+cd shared
+./spark-submit-template power_run_gpu.template \
+../nds-h/nds_h_power.py \
<parquet_folder_location> \
-<query_stream>/query_0.sql \
+<query_stream_folder>/query_0.sql \
time.csv \
---property_file properties/aqe-on.properties
+--property_file ../utils/properties/aqe-on.properties
```
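Judging by its name, `aqe-on.properties` presumably toggles Adaptive Query Execution; the inline equivalent would be roughly the following (a sketch, not the file's verified contents):

```bash
spark-submit --conf spark.sql.adaptive.enabled=true <other args>
```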
Users can also use `spark-submit` to submit `nds_h_power.py` directly.
@@ -210,7 +231,7 @@ The command above will use `collect()` action to trigger Spark job for each query
```bash
./spark-submit-template power_run_gpu.template \
-nds_power.py \
+nds_h_power.py \
parquet_sf3k \
./nds_query_streams/query_0.sql \
time.csv
```
151 changes: 0 additions & 151 deletions nds-h/check.py

This file was deleted.

