Benchmark runner script #918
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,150 @@ | ||
# Copyright (c) 2020, NVIDIA CORPORATION. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
import argparse | ||
import os | ||
import sys | ||
|
||
def main(): | ||
"""Iterate over a series of configurations and run benchmarks for each of the specified | ||
queries using that configuration. | ||
|
||
Example usage: | ||
|
||
python benchmark.py \ | ||
--template /path/to/template \ | ||
--benchmark tpcds \ | ||
--input /path/to/input \ | ||
--input-format parquet \ | ||
--output /path/to/output \ | ||
--output-format parquet \ | ||
--configs cpu gpu-ucx-on \ | ||
--query q4 q5 | ||
|
||
In this example, configuration key-value pairs will be loaded from cpu.properties and | ||
gpu-ucx-on.properties and appended to a spark-submit-template.txt to build the spark-submit | ||
abellina marked this conversation as resolved.
Review comment: Where does spark-submit-template.txt come from? The current directory (./)? What if I have several of these and want to switch between them? What if I run from a different directory?
Reply: I've made the template file configurable now. |
||
commands to run the benchmark. These configuration property files simply contain key-value | ||
pairs in the format key=value with one pair per line. For example: | ||
|
||
spark.executor.cores=2 | ||
spark.rapids.sql.enabled=true | ||
spark.sql.adaptive.enabled=true | ||
|
||
A template file must be provided, containing the command to call spark-submit along | ||
with any cluster-specific configuration options and any spark configuration settings that | ||
will be common to all benchmark runs. The template should end with a line-continuation | ||
symbol since additional --conf options will be appended for each benchmark run. | ||
|
||
Example template: | ||
|
||
$SPARK_HOME/bin/spark-submit \ | ||
--master $SPARK_MASTER_URL \ | ||
--conf spark.plugins=com.nvidia.spark.SQLPlugin \ | ||
--conf spark.eventLog.enabled=true \ | ||
--conf spark.eventLog.dir=./spark-event-logs \ | ||
|
||
The output and output-format arguments can be omitted to run the benchmark and collect | ||
results to the driver rather than write the query output to disk. | ||
|
||
This benchmark script assumes that the following environment variables have been set for | ||
the location of the relevant JAR files to be used: | ||
|
||
- SPARK_RAPIDS_PLUGIN_JAR | ||
Review comment: Any reason we use environment variables rather than parameters to the script? For scripting purposes parameters would be easier. I think they are also more obvious to users, who then don't accidentally pick up something unexpected.
Reply: My reasoning was that we tell users to set up these environment variables in the getting started guide, and I have been using this approach as a stop-gap so that the reporting tools can show the plugin and cuDF versions that were used to run benchmarks. This isn't ideal; it would be better to use cuDF and plugin APIs to get the version numbers instead. I haven't looked into whether this is possible. I'll give this some more thought. |
||
- SPARK_RAPIDS_PLUGIN_INTEGRATION_TEST_JAR | ||
- CUDF_JAR | ||
|
||
""" | ||
|
||
    parser = argparse.ArgumentParser(description='Run TPC benchmarks.') | ||
    parser.add_argument('--benchmark', required=True, | ||
                        help='Name of benchmark to run (tpcds, tpcxbb, tpch)') | ||
    parser.add_argument('--template', required=True, | ||
                        help='Path to a template script that invokes spark-submit') | ||
    parser.add_argument('--input', required=True, | ||
                        help='Path to source data set') | ||
    parser.add_argument('--input-format', required=True, | ||
                        help='Format of input data set (parquet or csv)') | ||
    # --output and --output-format are optional: when omitted, results are | ||
    # collected to the driver instead of being written to disk | ||
    parser.add_argument('--output', required=False, | ||
                        help='Path to write query output to') | ||
    parser.add_argument('--output-format', required=False, | ||
                        help='Format to write to (parquet or orc)') | ||
    parser.add_argument('--configs', required=True, type=str, nargs='+', | ||
                        help='One or more configuration filenames to run') | ||
    parser.add_argument('--query', required=True, type=str, nargs='+', | ||
                        help='Queries to run') | ||
    parser.add_argument('--iterations', required=False, | ||
                        help='The number of iterations to run (defaults to 1)') | ||
|
||
    args = parser.parse_args() | ||
|
||
    if args.benchmark == "tpcds": | ||
        class_name = "com.nvidia.spark.rapids.tests.tpcds.TpcdsLikeBench" | ||
    elif args.benchmark == "tpcxbb": | ||
        class_name = "com.nvidia.spark.rapids.tests.tpcxbb.TpcxbbLikeBench" | ||
    elif args.benchmark == "tpch": | ||
        class_name = "com.nvidia.spark.rapids.tests.tpch.TpchLikeBench" | ||
    else: | ||
        sys.exit("invalid benchmark name") | ||
|
||
    with open(args.template, "r") as myfile: | ||
        template = myfile.read() | ||
|
||
    for config_name in args.configs: | ||
        config = load_properties(config_name + ".properties") | ||
        for query in args.query: | ||
            summary_file_prefix = "{}-{}-{}".format(args.benchmark, query, config_name) | ||
|
||
            cmd = ['--conf spark.app.name="' + summary_file_prefix + '"'] | ||
            for k, v in config.items(): | ||
                cmd.append("--conf " + k + "=" + v) | ||
|
||
            cmd.append("--jars $SPARK_RAPIDS_PLUGIN_JAR,$CUDF_JAR") | ||
            cmd.append("--class " + class_name) | ||
            cmd.append("$SPARK_RAPIDS_PLUGIN_INTEGRATION_TEST_JAR") | ||
            cmd.append("--input " + args.input) | ||
|
||
            if args.input_format is not None: | ||
                cmd.append("--input-format {}".format(args.input_format)) | ||
|
||
            if args.output is not None: | ||
                cmd.append("--output " + args.output + "/" + config_name + "/" + query) | ||
|
||
            if args.output_format is not None: | ||
                cmd.append("--output-format {}".format(args.output_format)) | ||
|
||
            cmd.append("--query " + query) | ||
            cmd.append("--summary-file-prefix " + summary_file_prefix) | ||
|
||
            if args.iterations is None: | ||
                cmd.append("--iterations 1") | ||
            else: | ||
                cmd.append("--iterations {}".format(args.iterations)) | ||
|
||
            cmd = template.strip() + "\n " + " ".join(cmd).strip() | ||
|
||
            # run spark-submit | ||
            print(cmd) | ||
            os.system(cmd) | ||
|
||
|
||
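def load_properties(filename): | ||
    """Load key=value pairs from a properties file, skipping blank lines.""" | ||
    myvars = {} | ||
    with open(filename) as myfile: | ||
        for line in myfile: | ||
            # a blank line would otherwise yield an empty key and a bogus "--conf =" | ||
            if not line.strip(): | ||
                continue | ||
            name, var = line.partition("=")[::2] | ||
            myvars[name.strip()] = var.strip() | ||
    return myvars | ||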
|
||
if __name__ == '__main__': | ||
    main() |
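The property-file handling in the script can be sketched in isolation as follows. This is an illustration rather than the PR's code: `parse_properties` and `conf_args` are hypothetical names, and both the skipping of `#` comment lines and the use of `shlex.quote` are assumptions layered on top of the key=value format described in the docstring.

```python
import shlex


def parse_properties(text):
    """Parse key=value lines into a dict (hypothetical helper).

    Blank lines and lines starting with '#' are skipped; the comment
    handling is an assumption, not something the PR specifies.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # partition splits on the first '=' only, so values may contain '='
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: " + line)
        props[key.strip()] = value.strip()
    return props


def conf_args(props):
    """Render each property as a shell-quoted '--conf key=value' string."""
    return ["--conf " + shlex.quote(k + "=" + v) for k, v in props.items()]


sample = """
# executor settings
spark.executor.cores=2
spark.rapids.sql.enabled=true
spark.executor.extraJavaOptions=-Dfoo=bar -Dbaz=qux
"""
props = parse_properties(sample)
for arg in conf_args(props):
    print(arg)
```

Quoting each pair keeps values that contain spaces intact when the assembled command is handed to a shell, which plain string concatenation would not do.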
add the --template option
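The review thread on environment variables versus script parameters could be reconciled by accepting both. Below is a minimal sketch, assuming hypothetical `--plugin-jar` and `--cudf-jar` flags that do not exist in this PR; only the environment variable names come from the script's docstring, and they are used here as fallbacks.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose jar options default to environment variables.

    The flag names are hypothetical; only the variable names
    (SPARK_RAPIDS_PLUGIN_JAR, CUDF_JAR) come from the script's docstring.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Jar location sketch")
    parser.add_argument("--plugin-jar", default=env.get("SPARK_RAPIDS_PLUGIN_JAR"),
                        help="Path to the plugin jar")
    parser.add_argument("--cudf-jar", default=env.get("CUDF_JAR"),
                        help="Path to the cuDF jar")
    return parser


def resolve_jars(argv, env=None):
    """Return (plugin_jar, cudf_jar), exiting with an error if either is unset."""
    parser = build_parser(env)
    args = parser.parse_args(argv)
    missing = [flag for flag, value in
               (("--plugin-jar", args.plugin_jar), ("--cudf-jar", args.cudf_jar))
               if not value]
    if missing:
        parser.error("missing jar location(s): " + ", ".join(missing))
    return args.plugin_jar, args.cudf_jar


# A stand-in environment so the sketch runs without real jars installed
fake_env = {"SPARK_RAPIDS_PLUGIN_JAR": "/opt/jars/plugin.jar",
            "CUDF_JAR": "/opt/jars/cudf.jar"}
print(resolve_jars([], env=fake_env))
print(resolve_jars(["--cudf-jar", "/tmp/cudf.jar"], env=fake_env))
```

With this arrangement an explicit flag overrides the environment, so users who followed the getting started guide keep working while scripted runs can pass the paths directly.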