Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Spark UT issues in RapidsDataFrameAggregateSuite #10943

Merged
merged 2 commits into from
Jun 8, 2024

Conversation

thirtiseven
Copy link
Collaborator

@thirtiseven thirtiseven commented May 29, 2024

closes #10772

issues in RapidsDataFrameAggregateSuite are not real bugs (Thanks @abellina )

This pr:

  • ported 4 cases related to collect and add sort_array around them, because order of elements in the array is non-deterministic in collect_list and collect_set.
  • marked "SPARK-19471: AggregationIterator does not initialize the generated result projection before using it as won't fix issue because it is a Codegen related UT to find WholeStageCodegenExec in gpu plan.
  • removed SPARK-24788: RelationalGroupedDataset.toString with unresolved exprs should not fail because it can pass now.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@thirtiseven thirtiseven self-assigned this May 29, 2024
@sameerz sameerz added the test Only impacts tests label May 29, 2024
@thirtiseven thirtiseven changed the base branch from branch-24.06 to branch-24.08 May 29, 2024 12:32
Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@@ -83,6 +83,7 @@ abstract class BackendTestSettings {
// or a description like "This simply can't work on GPU".
// It should never be "unknown" or "need investigation"
case class KNOWN_ISSUE(reason: String) extends ExcludeReason
case class ADJUST_UT(reason: String) extends ExcludeReason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this vs. KNOWN_ISSUE? Are we intending to fix these? If so, we should file a tracking issue and use KNOWN_ISSUE with the issue URL as the description. If we're not intending to fix these, why not use KNOWN_ISSUE with the same description?

Copy link
Collaborator Author

@thirtiseven thirtiseven Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to mark the test cases that Spark UT cases have issues themselves, something is wrong or not working for plugin. But we still want to test what it meant to test by adjusting the case. Maybe a better name would be INVALID_CASE or SPARK_UT_ISSUE?

cc @binmahone

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree KNOWN_ISSUE is good enough. Why not file an issue for tolerating non-determinism by sort and reference it?

Copy link
Collaborator

@binmahone binmahone Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ADJUST_UT in our context means "the spark test case will work for rapids, there is no bug in rapids, but the test case itself needs some modification". For example, if a spark test case looks like:

test("testcase1"){
val x = spark.sql("select sum(x), y from testdata group by y")
assert(x.at(0) == ('x0', 100))
assert(x.at(1) == ('x1', 200))
}

Notice the operations and assertions are hard coded in the test case.
We know that Rapids may return results in a different order so the test case will fail for us.
From framework perspective, we have no chance to ask it to do any sort before asserting results.
However, the test case is still meaningful and we should enable it to increase test coverage.

This is where ADJUST_UT can help. we can adjust the above test case to a new one (and at the same time exclude the old test case with reason ADJUST_UT):

test("NEW testcase1"){
val x = spark.sql("select sum(x), y from testdata group by y")
// sort x to make the result deterministic
...
// make assertions based on the deterministic result
...
}

By doing this, the test case testcase1 is considered to be "solved", there will be NO follow up issue, so it's not a known issue.

Based on our experience on Gluten, this type of case is very common, So I think it's necessary to add the ADJUST_UT enum.

What you think? @jlowe @gerashegalov

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with this, I like the ADJUST_UT label. We can always go back and look at all the tests we adjusted (audit them).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with this, I like the ADJUST_UT label. We can always go back and look at all the tests we adjusted (audit them).

thx Alessandro

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this makes sense to me. We can go with this for this PR

I wonder if we define test("testcase1") in our code, does it override the test in the base class? If it is possible then we could just do the override and not have a special ADJUST_UT exclude tag.

Currently the framework does not have this feature (everything has to be explicit now). But I agree with you that this is a good idea. @HaoYang670 please raise a framework feature request if you also see potential value of this.

Another idea for collect_list kind of issues with SQL, we could probably register our own UDAF as collect_list which will either be a simple delegate to the real collect_list , collect_list followed by sort otherwise.

Yeah, this would be the test case where ADJUST_UT is NOT finally needed. However extra sort will be a performance hurt. In my point of view, we can tolerate minor different result with Vainilla Spark, as long as both are correct answer under ANSI SQL standard. Do you think our team can reach concenses on this? @gerashegalov

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@gerashegalov gerashegalov Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However extra sort will be a performance hurt. In my point of view, we can tolerate minor different result with Vainilla Spark, as long as both are correct answer under ANSI SQL standard.

What I mean is overriding collect_list only in the test code, injecting the sort only in specially tagged tests where we know that that the order variance is permissible. We don't need to pay unnecessary performance penalty in the production code. I am bringing this up as an idea to discuss for follow-up work. This PR is fine

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @gerashegalov I can roughly imagine what you're suggesting. Still it will be good if you can bring it up as a formal proposal (for further discussion). For now, I'm not quite sure to what extent can your idea solve the inconsistent problem.

@GaryShen2008
Copy link
Collaborator

build

2 similar comments
@firestarman
Copy link
Collaborator

build

@GaryShen2008
Copy link
Collaborator

build

@thirtiseven thirtiseven merged commit 18c2579 into NVIDIA:branch-24.08 Jun 8, 2024
44 checks passed
SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024
* Fix Spark UT issues in RapidsDataFrameAggregateSuite

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

* Added SPARK-24788 back

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>

---------

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test Only impacts tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Issues found by Spark UT Framework on RapidsDataFrameAggregateSuite
9 participants