[FEA] Profiling and qualification tool #2483

Closed
17 tasks done
tgravescs opened this issue May 21, 2021 · 7 comments · Fixed by #2632, #2636, #2643, #2651 or #2669
Labels: feature request (New feature or request), P0 (Must have for release)

Comments


tgravescs commented May 21, 2021

Is your feature request related to a problem? Please describe.
We want to create a profiling and qualification tool for customers and developers. This is a high-level overview and does not currently include low-level details.

Qualification (against CPU event logs), to help indicate whether an application might be a good fit for the RAPIDS plugin:

  • [P0] What % of runtime is spent in SQL/DataFrame operations (a rough sketch of one way to estimate this from an event log follows this list)
  • [P0] Time spent (%) on IO vs. computation vs. shuffle for SQL/DataFrame operations
  • [P0] Reports for all applications and per application, with filtering criteria - i.e., filtering out application event logs before running the qualification tool on them
  • [P0] Easy for the user to install (one step) -> rely on Spark being installed plus a single tool jar with instructions
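
A minimal sketch (not the actual tool) of how the first bullet above could be estimated straight from a Spark event log. It assumes the open-source listener event names and JSON fields (SparkListenerApplicationStart/End with a "Timestamp" field, SQL execution start/end events with a "time" field); the real qualification tool would need to parse the log much more robustly:

```scala
import scala.io.Source
import scala.util.matching.Regex

object SqlTimeFraction {
  // Field names below are assumptions about the event log JSON.
  private val tsRegex: Regex   = """"Timestamp"\s*:\s*(\d+)""".r
  private val timeRegex: Regex = """"time"\s*:\s*(\d+)""".r

  private def firstLong(r: Regex, line: String): Option[Long] =
    r.findFirstMatchIn(line).map(_.group(1).toLong)

  def main(args: Array[String]): Unit = {
    // args(0): path to an (uncompressed) Spark event log file.
    val lines = Source.fromFile(args(0)).getLines().toList

    val appStart = lines.find(_.contains("SparkListenerApplicationStart")).flatMap(firstLong(tsRegex, _))
    val appEnd   = lines.find(_.contains("SparkListenerApplicationEnd")).flatMap(firstLong(tsRegex, _))

    // Naive pairing of SQL execution start/end events by order; ignores overlap.
    val sqlStarts = lines.filter(_.contains("SparkListenerSQLExecutionStart")).flatMap(firstLong(timeRegex, _))
    val sqlEnds   = lines.filter(_.contains("SparkListenerSQLExecutionEnd")).flatMap(firstLong(timeRegex, _))
    val sqlMs     = sqlStarts.zip(sqlEnds).map { case (s, e) => math.max(0L, e - s) }.sum

    (appStart, appEnd) match {
      case (Some(s), Some(e)) if e > s =>
        println(f"SQL/DataFrame time: ${100.0 * sqlMs / (e - s)}%.1f%% of application duration")
      case _ =>
        println("Could not determine the application duration from this event log")
    }
  }
}
```

Overlapping SQL executions are double-counted by this naive start/end pairing, and logs without an application end event would need a fallback; the real tool has to handle both.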

Profiling for Support (against GPU event logs) to get profiling and debug info to help analyze jobs:

  • [P0] Spark version (including Spark vendors' customized versions)
  • [P0] Spark properties (especially RAPIDS-related ones) / Hadoop properties
  • [P0] SQL physical plan for each query inside the application
  • [P0] Executor information (memory, CPU, GPU resources) (needs tests)
  • [P0] List failed jobs/stages/tasks, if there are any.
  • [P0] List failed executors, if there are any.
  • [P0] Job <-> Stage mapping and SQL <-> Job mapping (new tables for this; a sketch of pulling this mapping out of the event log follows this list)
  • [P0] What % of runtime is spent in SQL/DataFrame operations
  • [P0] Ability to compare the durations of all queries across all input event logs.
  • [P0] Time spent (%) on IO vs. computation vs. shuffle
  • [P0] What data format (Parquet/ORC/CSV) and storage type (S3/Iceberg) is used (not done) -> adding the ability to print the SQL plans to a file, so this information should be visible there.
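
For the Job <-> Stage and SQL <-> Job mapping bullet, the idea can be sketched as below. It assumes each SparkListenerJobStart event carries a "Job ID", its "Stage IDs", and, when the job belongs to a SQL query, a "spark.sql.execution.id" property; those field names are assumptions, and the real tool would build proper tables rather than printing lines:

```scala
import scala.io.Source

object JobStageSqlMapping {
  // Assumed JSON field names inside SparkListenerJobStart events.
  private val jobIdR    = """"Job ID"\s*:\s*(\d+)""".r
  private val stageIdsR = """"Stage IDs"\s*:\s*\[([^\]]*)\]""".r
  private val sqlIdR    = """"spark\.sql\.execution\.id"\s*:\s*"(\d+)""".r

  def main(args: Array[String]): Unit = {
    // args(0): path to an (uncompressed) Spark event log file.
    for (line <- Source.fromFile(args(0)).getLines()
         if line.contains("SparkListenerJobStart")) {
      val jobId  = jobIdR.findFirstMatchIn(line).map(_.group(1)).getOrElse("?")
      val stages = stageIdsR.findFirstMatchIn(line).map(_.group(1)).getOrElse("")
      // Jobs that were not triggered by a SQL query have no execution id property.
      val sqlId  = sqlIdR.findFirstMatchIn(line).map(_.group(1)).getOrElse("none")
      println(s"sqlExecutionId=$sqlId job=$jobId stages=[$stages]")
    }
  }
}
```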

Other pieces:

  • [P0] Testing different event logs - Databricks, incomplete event logs, very large event logs, TPC-DS event logs, etc.
  • [P0] Documentation

Specific tests (ideally we validate that what we generate for each event log is correct, in addition to checking that the tool runs; these apply to both profiling and qualification):

  • Try a very large event log (10 GB) if we can find one
  • At least one event log from each of the Spark versions we support - EMR, GCP, Databricks, each Spark version
  • Test reading from S3
  • Test real customer event logs that have Dataset operations
  • Scale testing (we ran on 105 TPC-DS results) - what can we scale to and how much memory do we need? Qualification testing with the executor CPU time option takes a long time with the 105 logs, but without it we should be able to go higher. It would be good to get an idea of how much memory is needed.
  • Validate the output for a few of the TPC-DS log files - i.e., that the numbers we print are correct, especially for the values we aggregate
  • Test error handling
  • Shuffle skew check with customer logs for profiling (a sketch of one possible skew heuristic follows this list)
  • Test with files that are not event logs in a directory passed to the tool
  • Test both GPU and CPU event logs
  • Test with failures - retried stages
  • Test reading compressed and rolled Spark event log files
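
For the shuffle skew check above, a hedged sketch of one possible heuristic, not necessarily the rule the tool will use: flag a stage when its largest per-task shuffle read is well above the stage average. The 2x ratio and 100 MB floor are illustrative assumptions:

```scala
object ShuffleSkewCheck {
  /** taskShuffleReadBytes: shuffle read size of each task in a single stage. */
  def isSkewed(taskShuffleReadBytes: Seq[Long],
               ratioThreshold: Double = 2.0,
               minBytes: Long = 100L * 1024 * 1024): Boolean = {
    if (taskShuffleReadBytes.isEmpty) {
      false
    } else {
      val avg = taskShuffleReadBytes.sum.toDouble / taskShuffleReadBytes.size
      val max = taskShuffleReadBytes.max
      // Only flag skew when the largest task is big in absolute terms and far above average.
      max >= minBytes && max > ratioThreshold * avg
    }
  }

  def main(args: Array[String]): Unit = {
    val balanced = Seq.fill(8)(200L * 1024 * 1024)
    val skewed   = Seq.fill(7)(50L * 1024 * 1024) :+ 900L * 1024 * 1024
    println(s"balanced stage flagged? ${isSkewed(balanced)}") // false
    println(s"skewed stage flagged?   ${isSkewed(skewed)}")   // true
  }
}
```
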
tgravescs added the 'feature request' (New feature or request) and 'P0' (Must have for release) labels on May 21, 2021
tgravescs (Author) commented:

Second PR on the profiling side: https://github.com/NVIDIA/spark-rapids/pull/2469/files

tgravescs (Author) commented:

Qualification tool PR #2574 covers the first two bullets.

tgravescs (Author) commented:

PR #2590 covers more profiling output for the Analysis section and a few missing collection items.

tgravescs (Author) commented:

PR #2576 covers the filtering criteria.

tgravescs (Author) commented:

PR #2636 covered printing SQL plans to a file.


nartal1 commented Jun 8, 2021

#2632 covers listing failed jobs/stages/tasks and failed executors.

tgravescs (Author) commented:

List of possible issues/improvements:

Qual tool:

  • A Databricks event log without an application end event gets a score of 0 because no app duration is recorded. Can we do something smarter here? (one possible fallback is sketched below)
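
One possible direction, sketched below purely as an assumption (this is not what the tool currently does): when there is no SparkListenerApplicationEnd event, estimate the duration from the latest timestamp seen anywhere in the log and mark the result as an estimate instead of scoring the application as 0. The JSON field names ("Timestamp", "time", etc.) are also assumptions:

```scala
import scala.io.Source

object AppDurationFallback {
  // Assumed timestamp-like JSON fields that may appear in various listener events.
  private val anyTimeR = """"(?:Timestamp|time|Submission Time|Completion Time)"\s*:\s*(\d+)""".r

  def main(args: Array[String]): Unit = {
    // args(0): path to an (uncompressed) Spark event log file.
    val lines = Source.fromFile(args(0)).getLines().toList
    def times(line: String): Seq[Long] =
      anyTimeR.findAllMatchIn(line).map(_.group(1).toLong).toSeq

    val appStart = lines.find(_.contains("SparkListenerApplicationStart")).toSeq.flatMap(times).headOption
    val appEnd   = lines.find(_.contains("SparkListenerApplicationEnd")).toSeq.flatMap(times).headOption
    val lastSeen = lines.flatMap(times).reduceOption(_ max _)

    (appStart, appEnd.orElse(lastSeen)) match {
      case (Some(s), Some(e)) if e >= s =>
        val note = if (appEnd.isEmpty) " (estimated: no application end event)" else ""
        println(s"application duration: ${e - s} ms$note")
      case _ =>
        println("unable to determine the application duration")
    }
  }
}
```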
