Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal can be different #10136

Closed
firestarman opened this issue Jan 2, 2024 · 3 comments · Fixed by #10137
Assignees
Labels
bug Something isn't working

Comments

@firestarman
Copy link
Collaborator

firestarman commented Jan 2, 2024

The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may not be equal to each other after the prunePartitionForFileSourceScan rule applies for some cases.

Because this rule will remove some partition columns that are not used by the first downstream ProjectExec for some patterns for optimization, leading to some partition columns not exist in the finalized output. Then the AttributeReferences in some filters(e.g. partitionFilters) but excluded from the finalized output will not be canonicalized correctly.

Repro steps:

Launch a GPU spark session, and run

Seq((1, 11, "s1"), (2, 22, "s2"), (3, 33, "s3"), (4, 44, "s4"), (5, 55, "s5")).toDF("a", "b", "c").write.mode("overwrite").partitionBy("b", "c").parquet("/tmp/reuse")
spark.conf.set("spark.sql.sources.useV1SourceList", "parquet")

val df1 = spark.read.parquet("/tmp/reuse").where("b == 33").select("a", "c")
val df2 = spark.read.parquet("/tmp/reuse").where("b == 33").select("a", "c")

val fileScans = df1.unionAll(df2).queryExecution.executedPlan.collect {
  case plan if plan.getClass.getName == "org.apache.spark.sql.rapids.GpuFileSourceScanExec" => plan
}

fileScans.head.canonicalized == fileScans.last.canonicalized

You will get

scala> fileScans.head.canonicalized == fileScans.last.canonicalized
res3: Boolean = false

But they should be equal to each other.

@firestarman firestarman added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 2, 2024
@firestarman firestarman changed the title [BUG] Canonicalized GPU file scans may not be equal to each other [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may not be equal Jan 2, 2024
@firestarman firestarman changed the title [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may not be equal [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may not be equal Jan 2, 2024
@firestarman firestarman changed the title [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may not be equal [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may be unequal Jan 2, 2024
@firestarman firestarman changed the title [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may be unequal [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may be different Jan 2, 2024
@firestarman firestarman changed the title [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal may be different [BUG] The canonicalized version of GpuFileSourceScanExecs that suppose to be semantic-equal can be different Jan 2, 2024
@mythrocks
Copy link
Collaborator

I'm trying to understand the problem.

Is it that the partition column b is pruned from (only) one scan-exec, when it should have been pruned from both? Or that it shouldn't have been pruned from either?

@jlowe
Copy link
Member

jlowe commented Jan 2, 2024

The problem is that the GpuFileSourceScanExec node doesn't canonicalize properly so semantically identical copies are recognized as such. The same partition columns are pruned from both, but the issue is that there are filter expressions that are not getting canonicalized properly. There's a bug in the expression lists we're passing to canonicalization steps for the partition and data filter parameters. That bug causes us to potentially skip some filter expressions that need canonicalization, and that in turn causes the semantic equivalence comparison to fail.

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jan 2, 2024
@firestarman
Copy link
Collaborator Author

firestarman commented Jan 3, 2024

I'm trying to understand the problem.

Column b is a partition column and will be pruned from the finalized output of GpuFileSourceScanExec because it does not required by the next select. This is an optimization to avoid generating useless partition columns.

But column b always exists in the partitionFilters. When doing the canonicalization, the AttributeReference of column b can not be found in the finalized output, so its expression id will not be normalized in QueryPlan.normalizeExpressions, causing the semantic equivalence comparison to fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants