
Qualification tool: Detect RDD APIs in SQL plan #3819

Merged
merged 9 commits into NVIDIA:branch-21.12 on Oct 18, 2021

Conversation

@nartal1 nartal1 commented Oct 14, 2021

This fixes #3535.

Whenever an RDD is converted to a Dataset/DataFrame, a SerializeFromObject node appears in the SQL plan. In this PR, the logic is the same as for Dataset detection: if we see an RDD to Dataset/DataFrame conversion, we do not count the time taken by that operation when calculating the score.
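The detection described above can be sketched roughly as follows. This is a minimal illustration, not the actual qualification tool code: the object `PlanCheck` and the method `isDatasetOrRddPlan` are hypothetical names, while the keywords `$Lambda` and `.apply` and the `SerializeFromObject` node name come from this PR.

```scala
// Hedged sketch of the detection logic described in this PR; the names
// PlanCheck and isDatasetOrRddPlan are hypothetical, not the tool's API.
object PlanCheck {
  // Keywords in a plan description that indicate Dataset API usage.
  private val datasetKeywords = Seq("$Lambda", ".apply")

  // Plan node that appears when an RDD is converted to a Dataset/DataFrame.
  private val SerializeFromObject = "SerializeFromObject"

  // Returns true if the query's task time should be excluded from the score,
  // i.e. the plan contains a Dataset API call or an RDD conversion.
  def isDatasetOrRddPlan(planNodeNames: Seq[String], planDesc: String): Boolean = {
    planNodeNames.contains(SerializeFromObject) ||
      datasetKeywords.exists(k => planDesc.contains(k))
  }

  def main(args: Array[String]): Unit = {
    // An RDD converted to a DataFrame shows SerializeFromObject in its plan.
    println(isDatasetOrRddPlan(Seq("SerializeFromObject", "Scan"), "")) // prints true
    // A pure Dataframe query has neither the node nor the keywords.
    println(isDatasetOrRddPlan(Seq("Project", "Scan parquet"), "Scan parquet")) // prints false
  }
}
```

A query flagged this way is treated like a Dataset query: its task time is left out of the score entirely, since we cannot tell how much of it was RDD work.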

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 nartal1 self-assigned this Oct 14, 2021
@nartal1 nartal1 added the tools label Oct 14, 2021
@nartal1 nartal1 added this to the Oct 4 - Oct 15 milestone Oct 14, 2021
Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@tgravescs (Collaborator) left a comment


If possible, we should ask the requester of this feature to test on the event logs where they ran into this issue.

@@ -121,6 +121,8 @@ total task time in SQL Dataframe operations).

Each application (event log) could have multiple SQL queries. If a SQL plan has Dataset API keywords inside it, such as
`$Lambda` or `.apply`, that SQL query is categorized as a Dataset SQL query; otherwise it is a Dataframe SQL query.
If there is an RDD to Dataset/Dataframe conversion then it would have `SerializeFromObject` in its SQL plan. These
Collaborator

So SerializeFromObject can show up when doing a Dataset to Dataframe operation as well.

Collaborator

Also I would reword the sentence above, perhaps to something like: If a SQL plan has a Dataset API or RDD call inside of it, that SQL query is not categorized as a Dataframe SQL query. We are unable to determine how much of that query is made up of Dataset or RDD calls, so the entire query task time is not included in the score.

And honestly I would remove the other bit about how we determine RDD or Dataset, as the user shouldn't really need to know that.

@@ -1,2 +1,2 @@
App Name,App ID,Score,Potential Problems,SQL Dataframe Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,ReadFileFormat Score,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Unsupported Nested Complex Types
Spark shell,local-1624892957956,37581.00,"",3751,37581,17801,58.47,false,0,"",20,100.00,"","","",""
Spark shell,local-1624892957956,0.0,"",0,0,17801,0.0,false,0,"",20,100.00,"","","",""
Collaborator

I'd rather not have these go to 0. I want to make sure we are parsing Spark 2 logs properly; if we need to create a new Spark 2 log that doesn't use RDD, let's do that.

Collaborator Author

Sounds good. I will create another Spark 2 event log that doesn't use RDD.

@@ -0,0 +1,2 @@
App Name,App ID,Score,Potential Problems,SQL DF Duration,SQL Dataframe Task Duration,App Duration,Executor CPU Time Percent,App Duration Estimated,SQL Duration with Potential Problems,SQL Ids with Failures,Read Score Percent,Read File Format Score,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Unsupported Nested Complex Types
Spark shell,local-1634067832387,0.0,"",0,0,108138,0.0,false,0,"",20,100.0,"","","",""
Collaborator

I'd also like to start moving away from static files if it's something small that can be run in real time. In this case I don't think we really need this one, as it appears to be tested by others, right? If so, just remove it.

Collaborator Author

Yes, we do have SerializeFromObject in the SQL plan in other event logs, so we are already testing it in other files. I will remove this file and the corresponding tests and event log.

@nartal1 commented Oct 14, 2021

If possible, we should ask the requester of this feature to test on the event logs where they ran into this issue.

@viadea - Is it possible for you to test on the event logs where RDD to Dataframe conversion is used, and verify whether the scoring (i.e. the top N) better represents the queries which use only dataframes?

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@nartal1 commented Oct 15, 2021

build

@tgravescs

Tests are failing, please take a look.

@nartal1 commented Oct 15, 2021

build

Signed-off-by: Niranjan Artal <nartal@nvidia.com>
@tgravescs

build

@tgravescs

build

@nartal1 nartal1 merged commit e6a69f1 into NVIDIA:branch-21.12 Oct 18, 2021

Successfully merging this pull request may close these issues.

[FEA] Qualification tool can detect RDD APIs in SQL plan