Add Qualification tool support #2574

Merged: 14 commits, Jun 4, 2021

tgravescs
Collaborator

This adds support for the Qualification tool, which ranks applications by how good a fit they are for the plugin. The ranking is currently based on SQL DataFrame time / application time. The tool reports potential problems (UDFs) that it finds and OPTIONALLY reports the percent executor CPU time. With a lot of apps, computing the percent executor CPU time can take a very long time, so I made it off by default.
The potential problems and the executor CPU time percentage are reported only as information for the user and are not used in the rankings.

I also changed the default output format to CSV. Users can also output to text, and I made both formats work with HDFS.
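To illustrate the difference between the two output formats, here is a minimal Python sketch (not the tool's actual Scala code). The function names `to_csv` and `to_text` are hypothetical. The text variant casts every value to a string before padding it, which mirrors the "cast columns to string to write to text" commit in this PR. Writing to HDFS is not shown.

```python
import csv
import io


def to_csv(header, rows):
    """Render rows as CSV text (hypothetical helper, for illustration only)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_text(header, rows):
    """Render rows as a bordered, left-aligned text table."""
    # Cast every cell to a string first so mixed types can be padded uniformly.
    str_rows = [[str(v) for v in row] for row in rows]
    widths = [
        max([len(h)] + [len(r[i]) for r in str_rows])
        for i, h in enumerate(header)
    ]
    sep = "+" + "+".join("-" * w for w in widths) + "+"

    def fmt(cells):
        return "|" + "|".join(c.ljust(w) for c, w in zip(cells, widths)) + "|"

    lines = [sep, fmt(header), sep] + [fmt(r) for r in str_rows] + [sep]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    header = ["App ID", "Rank"]
    rows = [["local-1622043423018", 68.19]]
    print(to_csv(header, rows))
    print(to_text(header, rows))
```

CSV is easier to load into other tools, while the text table is easier to read in a terminal.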

I split the qualification tool into its own Main function since qualification and profiling seem like distinct tools with different audiences; we can discuss if people have other opinions. If we make it a single tool for qualification and profiling, we need to come up with a good generic name and then some options.

I tried to remove calls to things that aren't used for qualification, so you will see options added around that. I also had to change a few tables so I didn't have to join across so many of them. The query with 100 TPC-DS apps becomes huge, and the Spark analyzer takes forever to run over it because we have so many tables.

This also contains various bug fixes to handle truncated files and missing data.
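As a rough illustration of what tolerating truncated files and missing data can look like, here is a Python sketch under stated assumptions; it is not the code in this PR. It assumes a Spark event log with one JSON object per line, where a truncated file ends in an incomplete line. The helper names `read_events` and `app_duration` are hypothetical.

```python
import json


def read_events(lines):
    """Parse event-log lines, stopping at the first line that is not valid JSON.

    Assumption: a truncated log ends with a partial JSON line, so everything
    parsed before that point is still usable.
    """
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            break  # truncated or corrupt line: keep what was read so far
    return events


def app_duration(events):
    """Return end - start timestamps, or None if either event is missing."""
    start = end = None
    for event in events:
        name = event.get("Event")
        if name == "SparkListenerApplicationStart":
            start = event.get("Timestamp")
        elif name == "SparkListenerApplicationEnd":
            end = event.get("Timestamp")
    if start is None or end is None:
        return None  # missing data: the caller decides how to report it
    return end - start
```

Returning `None` instead of raising lets one bad log be skipped or flagged without failing a run over many applications.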

This has very minimal doc changes - those will come later.

Fixed a bug with dropping tables and then removed caching.

I added more tests and manually ran the tool over the TPC-DS logs.

Output of the tool looks like:

### Qualification ###
+--------------------------------------+-------------------+-----+------------------+----------------------+------------+-------------------------+
|App Name                              |App ID             |Rank |Potential Problems|SQL Dataframe Duration|App Duration|Executor CPU Time Percent|
+--------------------------------------+-------------------+-----+------------------+----------------------+------------+-------------------------+
|Rapids Spark Profiling Tool Unit Tests|local-1622043423018|68.19|                  |11128                 |16319       |70.91                    |
|Rapids Spark Profiling Tool Unit Tests|local-1621969619749|14.34|UDF               |1560                  |10880       |43.79                    |
|Rapids Spark Profiling Tool Unit Tests|local-1621966649543|0.0  |                  |0                     |10650       |26.65                    |
|Rapids Spark Profiling Tool Unit Tests|local-1621955976602|0.0  |                  |0                     |10419       |25.8                     |
+--------------------------------------+-------------------+-----+------------------+----------------------+------------+-------------------------+
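The Rank column in the sample output above is consistent with SQL Dataframe Duration divided by App Duration, expressed as a percentage and rounded to two decimals (for example, 11128 / 16319 ≈ 68.19%). A minimal Python sketch of that scoring follows. The function name `rank_apps`, the dict keys, and the handling of a zero app duration are assumptions for illustration, not the tool's actual implementation.

```python
def rank_apps(apps):
    """Score each app by SQL DataFrame time / application time and sort.

    `apps` is a list of dicts with hypothetical keys "app_id",
    "sql_df_duration" and "app_duration" (same time unit for both durations).
    """
    ranked = []
    for app in apps:
        duration = app["app_duration"]
        # Assumption: an app with no measurable duration gets a score of 0.
        if duration > 0:
            score = round(100.0 * app["sql_df_duration"] / duration, 2)
        else:
            score = 0.0
        ranked.append({**app, "rank": score})
    # Highest score first: the best candidates for the plugin.
    return sorted(ranked, key=lambda a: a["rank"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"app_id": "local-1621969619749", "sql_df_duration": 1560, "app_duration": 10880},
        {"app_id": "local-1622043423018", "sql_df_duration": 11128, "app_duration": 16319},
    ]
    for app in rank_apps(sample):
        print(app["app_id"], app["rank"])
```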


@tgravescs tgravescs added the feature request New feature or request label Jun 3, 2021
@tgravescs tgravescs added this to the May 24 - Jun 4 milestone Jun 3, 2021
@tgravescs tgravescs self-assigned this Jun 3, 2021
@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

The tests timed out for some reason.

@tgravescs
Collaborator Author

Looks like the build didn't get nodes for a long time: 14:00:01 [Warning][sw-gpu-spark/premerge-test-jenkins-rapids-premerge-github-1776-xpf3l-tw3jd][FailedScheduling] 0/428 nodes are available: 137 Insufficient memory, 182 Insufficient nvidia.com/gpu, 23 node(s) were unschedulable, 402 node(s) didn't match node selector, 93 Insufficient cpu.

@tgravescs
Collaborator Author

build

@tgravescs tgravescs merged commit 1129641 into NVIDIA:branch-21.06 Jun 4, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Qualification tool

Signed-off-by: Thomas Graves <tgraves@apache.org>

* remove unused func

* Add missing files

* Add checks for format option

* cast columsn to string to write to text

* Revert "Add checks for format option"

This reverts commit 6f5271c.

* cleanup

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

* update output dir

* formating

* Update help messages

* update app name

* cleanup

* put test functions back

* fix typo
Labels
feature request New feature or request

3 participants