[FEA] Profiling and qualification tool #2483

Closed
17 tasks done
tgravescs opened this issue May 21, 2021 · 7 comments · Fixed by #2632, #2636, #2643, #2651 or #2669
Labels: feature request (New feature or request), P0 (Must have for release)

Comments


tgravescs commented May 21, 2021

Is your feature request related to a problem? Please describe.
We want to create a profiling and qualification tool for customers and developers. This is a high-level overview and does not currently include low-level details.

Qualification (against CPU event logs), to help indicate whether an application might be a good fit for the RAPIDS plugin:

  • [P0] What % of runtime is spent in SQL/DataFrame operations (a rough sketch of one way to estimate this from an event log follows this list)
  • [P0] Time spent (%) on IO vs. computation vs. shuffle for SQL/DataFrame operations
  • [P0] Reports for all applications and per application, with filtering criteria - i.e., filtering out application event logs before running the qualification tool on them
  • [P0] Easy for the user to install (one step) -> rely on Spark being installed plus a single tool jar with instructions
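
A minimal sketch (not the actual tool) of how the first bullet above could be estimated straight from a Spark event log. It assumes the open-source listener event names and JSON fields (SparkListenerApplicationStart/End with a "Timestamp" field, SQL execution start/end events with a "time" field); the real qualification tool would need to parse the log much more robustly:

```scala
import scala.io.Source
import scala.util.matching.Regex

object SqlTimeFraction {
  // Field names below are assumptions about the event log JSON.
  private val tsRegex: Regex   = """"Timestamp"\s*:\s*(\d+)""".r
  private val timeRegex: Regex = """"time"\s*:\s*(\d+)""".r

  private def firstLong(r: Regex, line: String): Option[Long] =
    r.findFirstMatchIn(line).map(_.group(1).toLong)

  def main(args: Array[String]): Unit = {
    // args(0): path to an (uncompressed) Spark event log file.
    val lines = Source.fromFile(args(0)).getLines().toList

    val appStart = lines.find(_.contains("SparkListenerApplicationStart")).flatMap(firstLong(tsRegex, _))
    val appEnd   = lines.find(_.contains("SparkListenerApplicationEnd")).flatMap(firstLong(tsRegex, _))

    // Naive pairing of SQL execution start/end events by order; ignores overlap.
    val sqlStarts = lines.filter(_.contains("SparkListenerSQLExecutionStart")).flatMap(firstLong(timeRegex, _))
    val sqlEnds   = lines.filter(_.contains("SparkListenerSQLExecutionEnd")).flatMap(firstLong(timeRegex, _))
    val sqlMs     = sqlStarts.zip(sqlEnds).map { case (s, e) => math.max(0L, e - s) }.sum

    (appStart, appEnd) match {
      case (Some(s), Some(e)) if e > s =>
        println(f"SQL/DataFrame time: ${100.0 * sqlMs / (e - s)}%.1f%% of application duration")
      case _ =>
        println("Could not determine the application duration from this event log")
    }
  }
}
```

Overlapping SQL executions are double-counted by this naive start/end pairing, and logs without an application end event would need a fallback; the real tool has to handle both.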

Profiling for Support (against GPU event logs) to get profiling and debug info to help analyze jobs:

  • [P0] Spark version (including Spark vendors' customized versions)
  • [P0] Spark properties (especially RAPIDS-related ones) / Hadoop properties
  • [P0] SQL physical plan for each query inside the application
  • [P0] Executor information (memory, CPU, GPU resources) (needs tests)
  • [P0] List failed jobs/stages/tasks, if there are any.
  • [P0] List failed executors, if there are any.
  • [P0] Job <-> Stage mapping and SQL <-> Job mapping (new tables for this; a sketch of pulling this mapping out of the event log follows this list)
  • [P0] What % of runtime is spent in SQL/DataFrame operations
  • [P0] Ability to compare the durations of all queries across all input event logs.
  • [P0] Time spent (%) on IO vs. computation vs. shuffle
  • [P0] What data format (Parquet/ORC/CSV) and storage type (S3/Iceberg) is used (not done) -> adding the ability to print the SQL plans to a file, so this information should be visible there.
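
For the Job <-> Stage and SQL <-> Job mapping bullet, the idea can be sketched as below. It assumes each SparkListenerJobStart event carries a "Job ID", its "Stage IDs", and, when the job belongs to a SQL query, a "spark.sql.execution.id" property; those field names are assumptions, and the real tool would build proper tables rather than printing lines:

```scala
import scala.io.Source

object JobStageSqlMapping {
  // Assumed JSON field names inside SparkListenerJobStart events.
  private val jobIdR    = """"Job ID"\s*:\s*(\d+)""".r
  private val stageIdsR = """"Stage IDs"\s*:\s*\[([^\]]*)\]""".r
  private val sqlIdR    = """"spark\.sql\.execution\.id"\s*:\s*"(\d+)""".r

  def main(args: Array[String]): Unit = {
    // args(0): path to an (uncompressed) Spark event log file.
    for (line <- Source.fromFile(args(0)).getLines()
         if line.contains("SparkListenerJobStart")) {
      val jobId  = jobIdR.findFirstMatchIn(line).map(_.group(1)).getOrElse("?")
      val stages = stageIdsR.findFirstMatchIn(line).map(_.group(1)).getOrElse("")
      // Jobs that were not triggered by a SQL query have no execution id property.
      val sqlId  = sqlIdR.findFirstMatchIn(line).map(_.group(1)).getOrElse("none")
      println(s"sqlExecutionId=$sqlId job=$jobId stages=[$stages]")
    }
  }
}
```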

Other pieces:

  • [P0] Testing different event logs - Databricks, incomplete event logs, very large event logs, TPC-DS event logs, etc.
  • [P0] Documentation

Specific tests (ideally we validate that what we generate for each event log is correct, in addition to checking that the tool runs; these apply to both profiling and qualification):

  • Try a very large event log (10 GB) if we can find one
  • At least one event log from each of the Spark versions we support - EMR, GCP, Databricks, each Spark version
  • Test reading from S3
  • Test real customer event logs that have Dataset operations
  • Scale testing (we ran on 105 TPC-DS results) - what can we scale to and how much memory do we need? Qualification testing with the executor CPU time option takes a long time with the 105 logs, but without it we should be able to go higher. It would be good to get an idea of how much memory is needed.
  • Validate the output for a few of the TPC-DS log files - i.e., that the numbers we print are correct, especially for the values we aggregate
  • Test error handling
  • Shuffle skew check with customer logs for profiling (a sketch of one possible skew heuristic follows this list)
  • Test with files that are not event logs in a directory passed to the tool
  • Test both GPU and CPU event logs
  • Test with failures - retried stages
  • Test reading compressed and rolled Spark event log files
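
For the shuffle skew check above, a hedged sketch of one possible heuristic, not necessarily the rule the tool will use: flag a stage when its largest per-task shuffle read is well above the stage average. The 2x ratio and 100 MB floor are illustrative assumptions:

```scala
object ShuffleSkewCheck {
  /** taskShuffleReadBytes: shuffle read size of each task in a single stage. */
  def isSkewed(taskShuffleReadBytes: Seq[Long],
               ratioThreshold: Double = 2.0,
               minBytes: Long = 100L * 1024 * 1024): Boolean = {
    if (taskShuffleReadBytes.isEmpty) {
      false
    } else {
      val avg = taskShuffleReadBytes.sum.toDouble / taskShuffleReadBytes.size
      val max = taskShuffleReadBytes.max
      // Only flag skew when the largest task is big in absolute terms and far above average.
      max >= minBytes && max > ratioThreshold * avg
    }
  }

  def main(args: Array[String]): Unit = {
    val balanced = Seq.fill(8)(200L * 1024 * 1024)
    val skewed   = Seq.fill(7)(50L * 1024 * 1024) :+ 900L * 1024 * 1024
    println(s"balanced stage flagged? ${isSkewed(balanced)}") // false
    println(s"skewed stage flagged?   ${isSkewed(skewed)}")   // true
  }
}
```
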
tgravescs added the 'feature request' (New feature or request) and 'P0' (Must have for release) labels on May 21, 2021
tgravescs (Author) commented:

Second PR on the profiling side: https://github.com/NVIDIA/spark-rapids/pull/2469/files

tgravescs (Author) commented:

Qualification tool PR #2574 covers the first two bullets.

tgravescs (Author) commented:

PR #2590 covers more profiling output for the Analysis section and a few missing collection items.

tgravescs (Author) commented:

PR #2576 covers the filtering criteria.

tgravescs (Author) commented:

PR #2636 covered printing SQL plans to a file.


nartal1 commented Jun 8, 2021

#2632 covers listing failed jobs/stages/tasks and failed executors.

tgravescs (Author) commented:

List of possible issues/improvements:

Qual tool:

  • A Databricks event log without an application end event gets a score of 0 because no app duration is recorded. Can we do something smarter here? (one possible fallback is sketched below)
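
One possible direction, sketched below purely as an assumption (this is not what the tool currently does): when there is no SparkListenerApplicationEnd event, estimate the duration from the latest timestamp seen anywhere in the log and mark the result as an estimate instead of scoring the application as 0. The JSON field names ("Timestamp", "time", etc.) are also assumptions:

```scala
import scala.io.Source

object AppDurationFallback {
  // Assumed timestamp-like JSON fields that may appear in various listener events.
  private val anyTimeR = """"(?:Timestamp|time|Submission Time|Completion Time)"\s*:\s*(\d+)""".r

  def main(args: Array[String]): Unit = {
    // args(0): path to an (uncompressed) Spark event log file.
    val lines = Source.fromFile(args(0)).getLines().toList
    def times(line: String): Seq[Long] =
      anyTimeR.findAllMatchIn(line).map(_.group(1).toLong).toSeq

    val appStart = lines.find(_.contains("SparkListenerApplicationStart")).toSeq.flatMap(times).headOption
    val appEnd   = lines.find(_.contains("SparkListenerApplicationEnd")).toSeq.flatMap(times).headOption
    val lastSeen = lines.flatMap(times).reduceOption(_ max _)

    (appStart, appEnd.orElse(lastSeen)) match {
      case (Some(s), Some(e)) if e >= s =>
        val note = if (appEnd.isEmpty) " (estimated: no application end event)" else ""
        println(s"application duration: ${e - s} ms$note")
      case _ =>
        println("unable to determine the application duration")
    }
  }
}
```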
