Add command-line interface for TPC-* for use with spark-submit #823

Merged
andygrove merged 3 commits into NVIDIA:branch-0.3 from tpc-spark-submit
Sep 28, 2020

andygrove
Contributor

@andygrove andygrove commented Sep 22, 2020

Provides a command-line interface for TPC-* benchmarks so that they can be run from spark-submit as an alternative to running them interactively via spark-shell. This makes it easy to run a batch of queries from a bash script. For example:

queries=("q25" "q65" "q77" "q60" "q50")

for query in "${queries[@]}"
do
  $SPARK_HOME/bin/spark-submit \
    ...
    --jars $SPARK_RAPIDS_PLUGIN_JAR,$CUDF_JAR \
    --class com.nvidia.spark.rapids.tests.tpcds.TpcdsLikeBench \
    $SPARK_RAPIDS_PLUGIN_INTEGRATION_TEST_JAR \
    --input /path/to/tpcsdata \
    --input-format parquet \
    --output /path/to/results/$query \
    --output-format parquet \
    --query $query \
    --iterations 3
done

This closes #795

TPC-H had fallen a bit behind so this PR also makes TPC-H consistent with the other benchmarks.
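
The batch loop in the description does not record which queries failed during a long run. A minimal sketch of a wrapper that runs every query and reports the failures at the end is shown here. The `run_batch` helper and the runner function passed to it are hypothetical names used for illustration and are not part of this PR; the runner is assumed to wrap the spark-submit invocation and return its exit status.

```shell
# Sketch only: run_batch and the runner function are illustrative names,
# not part of the benchmark CLI added by this PR.

# run_batch RUNNER QUERY...
# Calls RUNNER once per query, keeps going after a failure, and prints
# the names of the queries whose runner returned a non-zero status.
run_batch() {
  runner="$1"
  shift
  failed=""
  for query in "$@"; do
    if ! "$runner" "$query"; then
      failed="$failed $query"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed"
    return 1
  fi
  echo "all queries succeeded"
}
```

With a `submit_query` function whose body is the spark-submit command from the description, `run_batch submit_query q25 q65 q77` would run all three queries and return a non-zero status if any of them failed.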

@andygrove andygrove added the test Only impacts tests label Sep 22, 2020
@andygrove andygrove added this to the Sep 14 - Sep 25 milestone Sep 22, 2020
@andygrove andygrove self-assigned this Sep 22, 2020
@andygrove andygrove changed the title from "[WIP] Add command-line interface for TPC-DS and TPCx-BB for use with spark-submit" to "[WIP] Add command-line interface for TPC-* for use with spark-submit" Sep 22, 2020
@andygrove andygrove changed the title from "[WIP] Add command-line interface for TPC-* for use with spark-submit" to "Add command-line interface for TPC-* for use with spark-submit" Sep 22, 2020
@andygrove
Contributor Author

build

@andygrove andygrove mentioned this pull request Sep 22, 2020
@andygrove
Contributor Author

build

@sameerz sameerz added the performance A performance related task/issue label Sep 23, 2020
*
* @param spark The Spark session
* @param query The name of the query to run e.g. "q5"
* @param iterations The number of times to run the query.
Collaborator

Similar here and in the rest of the functions: is there any reason not to have all parameters in the docs?

}
}


Collaborator

nit: remove extra newline

@andygrove
Contributor Author

build

@andygrove
Contributor Author

@tgravescs Thanks for the review. I've updated the javadocs to add the missing parameters and also addressed the nits.

@andygrove
Contributor Author

build

@andygrove
Contributor Author

andygrove commented Sep 24, 2020

@tgravescs @abellina This is ready for re-review. It is working well in general, but there is a known issue, #852, with comparing ordering across partitioned files. That bug already exists in branch-0.3 and is not made any worse by this PR (which just adds a CLI for this utility), so I'd like to get this one in and follow up with a separate PR to fix the underlying bug, because that may require a different approach entirely.

@andygrove
Contributor Author

I added a check for multiple partitions, so it now at least fails if ignoreOrdering=true rather than reporting incorrect differences.
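
A similar guard could also run on the script side before comparing result directories. The sketch below is not the check added in this PR. It assumes that a result directory written by Spark holds one `part-*` file per written partition, and the helper names `count_part_files` and `require_single_part_file` are hypothetical.

```shell
# Sketch only: a script-side pre-flight check, not the check in this PR.
# Assumes Spark's usual part-* naming for files in an output directory.

# count_part_files DIR
# Prints the number of part-* files directly under DIR.
count_part_files() {
  count=0
  for f in "$1"/part-*; do
    # When nothing matches, the glob stays literal, so test for existence.
    if [ -e "$f" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# require_single_part_file DIR
# Returns non-zero (with a message on stderr) if DIR has more than one part file.
require_single_part_file() {
  n=$(count_part_files "$1")
  if [ "$n" -gt 1 ]; then
    echo "$1 has $n part files; refusing to compare" >&2
    return 1
  fi
}
```

A batch script could call `require_single_part_file /path/to/results/$query` before invoking a comparison, so that a run known to produce unreliable differences is stopped early instead of being reported.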

@andygrove
Contributor Author

build

@andygrove
Contributor Author

build

@@ -0,0 +1,197 @@
/*
* Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Collaborator

2020

Contributor Author

So this file already existed and I renamed it. Let me see if I can rebase so git recognizes this is a rename rather than a delete and add.

Collaborator

Yeah, if it was a git mv then it's OK. We should try to make sure it's a move so we don't lose history.

Contributor Author

@andygrove andygrove Sep 28, 2020

I have rebased, and if you look at the individual commits you can see that I did a git mv in the first commit and then changed the file contents in the second. The "Files changed" tab in the GitHub UI for this PR still shows it as a delete and add for some reason, but I do see the correct history locally.

Collaborator

As long as it's a git move, it's fine.
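
Whether git sees the change as a move can be checked locally with standard git commands. The sketch below wraps two of them in helper functions; the function names are hypothetical. Git's rename detection is similarity-based, so a file that is moved and then heavily edited may show up as a delete and add in a combined diff even when the per-commit history follows the rename, which is one possible explanation for what the "Files changed" tab shows.

```shell
# Sketch only: helper names are illustrative; the git options are standard.

# show_rename_history FILE
# Lists the commits that touched FILE, following it across renames.
show_rename_history() {
  git log --follow --oneline -- "$1"
}

# renames_between OLD_REV NEW_REV
# Prints the renames git detects between two revisions as
# "R<similarity>", old path, new path.
renames_between() {
  git diff -M --diff-filter=R --name-status "$1" "$2"
}
```

If `show_rename_history` on the new path lists commits from before the rename, the history has been preserved regardless of how a combined diff is displayed.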

Signed-off-by: Andy Grove <andygrove@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Contributor Author

build

@andygrove andygrove merged commit 6f34062 into NVIDIA:branch-0.3 Sep 28, 2020
@andygrove andygrove deleted the tpc-spark-submit branch September 28, 2020 19:36
@sameerz sameerz added benchmark Benchmarking, benchmarking tools and removed test Only impacts tests labels Oct 4, 2020
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this pull request Nov 20, 2020
…A#823)

* rename tpch benchmark file

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Add command-line interface to benchmarks

Signed-off-by: Andy Grove <andygrove@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
…A#823)

* rename tpch benchmark file

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Add command-line interface to benchmarks

Signed-off-by: Andy Grove <andygrove@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
* Revert "Update JNI submodule ref to cudf v22.12.01 (NVIDIA#816)"

This reverts commit 353e2f9.

* re-target cudf submodule to tag v22.12.00

Signed-off-by: Peixin Li <pxli@nyu.edu>

Signed-off-by: Peixin Li <pxli@nyu.edu>
Labels
benchmark Benchmarking, benchmarking tools performance A performance related task/issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Make it easier to run TPC-* benchmarks with spark-submit
3 participants