[SPARK-44219][SQL] Adds extra per-rule validations for optimization rewrites. #41763

YannisSismanis · 2023-06-27T21:48:30Z

What changes were proposed in this pull request?

Adds per-rule validation checks for the following:

1. Aggregate expressions in Aggregate plans are valid.
2. Grouping key types in Aggregate plans cannot be of type Map.
3. No dangling references have been generated.

This validation is enabled by default for all tests, and can be enabled selectively by setting the spark.sql.planChangeValidation=true flag.
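To make the Map grouping-key check concrete, here is a minimal, self-contained sketch. It does not use Spark: `ToyType`, `MapT`, `GroupingKey`, and the error text are hypothetical stand-ins, and only the function name `validateGroupByTypes` and the `Option[String]` "error or nothing" shape are borrowed from the diff. It checks top-level key types only; whether the real check also looks inside nested types is not stated in this description.

```scala
object GroupByTypeSketch {
  // Toy type model; illustrative names, not Spark's DataType classes.
  sealed trait ToyType
  case object IntT extends ToyType
  case object StringT extends ToyType
  final case class MapT(key: ToyType, value: ToyType) extends ToyType

  // Hypothetical stand-in for a grouping expression with a resolved type.
  final case class GroupingKey(name: String, dataType: ToyType)

  // Returns Some(error) if any grouping key is a map, None otherwise.
  def validateGroupByTypes(keys: Seq[GroupingKey]): Option[String] = {
    val bad = keys.collect { case GroupingKey(name, _: MapT) => name }
    if (bad.isEmpty) {
      None
    } else {
      val names = bad.mkString(", ")
      Some(s"Grouping keys cannot be of type Map: $names")
    }
  }
}
```

Returning `Option[String]` rather than throwing lets the caller decide how to report the failure and lets several checks be chained.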

Why are the changes needed?

Extra validation for optimizer rewrites.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

cloud-fan · 2023-10-04T06:35:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala

+      case expr: AggregateExpression =>
+        val aggFunction = expr.aggregateFunction
+        aggFunction.children.foreach {
+          child =>


code style nit to save one indentation level. It's also the style of the previous code: https://github.com/apache/spark/pull/41763/files#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aL432

aggFunction.children.foreach { child => ... }

cloud-fan · 2023-10-04T06:36:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala

+          msg = s"Non-deterministic expression '${
+            toSQLExpr(expr)
+          }' should not appear in " +
+            "grouping expression.",


Suggested change

-          msg = s"Non-deterministic expression '${
-            toSQLExpr(expr)
-          }' should not appear in " +
-            "grouping expression.",
+          msg = s"Non-deterministic expression '${toSQLExpr(expr)}' should not appear in " +
+            "grouping expression.",

cloud-fan · 2023-10-04T06:38:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+  def validateNoDanglingReferences(plan: LogicalPlan): Option[String] = {
+    plan.collectFirst {
+      // DML commands and multi instance relations (like InMemoryRelation caches)
+      // have different output semantics than typical queries.


Do we really need to special-case these two? QueryPlan#missingInput should have already taken care of them:

final def missingInput: AttributeSet = references -- inputSet

Anyway, not a big deal: it's faster to skip cases that will never hit a missing-attribute issue.
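As a rough illustration of what `references -- inputSet` buys, the sketch below models a plan as a toy tree and reports the first node that reads an attribute none of its children produce. `Node` and its fields are hypothetical and use plain string sets instead of Spark's `AttributeSet`. Only `missingInput` and `validateNoDanglingReferences` echo names from the diff, and the traversal order is an assumption.

```scala
object DanglingRefSketch {
  // Toy plan node; NOT Spark's LogicalPlan. Attributes are plain strings.
  final case class Node(
      name: String,
      references: Set[String], // attributes this node's expressions read
      output: Set[String],     // attributes this node produces
      children: Seq[Node] = Nil) {

    // Everything the children make available to this node.
    def inputSet: Set[String] = children.flatMap(_.output).toSet

    // Same shape as the definition quoted in the review comment.
    def missingInput: Set[String] = references -- inputSet

    // Pre-order search: this node first, then each child subtree lazily.
    def collectFirst[A](pf: PartialFunction[Node, A]): Option[A] =
      pf.lift(this).orElse(
        children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(a) => a })
  }

  // Some(error) for the first node with a dangling reference, else None.
  def validateNoDanglingReferences(plan: Node): Option[String] =
    plan.collectFirst {
      case n if n.missingInput.nonEmpty =>
        val missing = n.missingInput.toSeq.sorted.mkString(", ")
        s"${n.name} references missing attributes: $missing"
    }
}
```

In this toy model a leaf has no references, so its `missingInput` is always empty. That matches the reviewer's point that skipping such cases is an optimization and does not change the result.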

cloud-fan · 2023-10-04T06:38:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+      case _: Command => None
+      case _: MultiInstanceRelation => None
+      case n if canGetOutputAttrs(n) =>
+        if ( n.missingInput.nonEmpty) {


Suggested change

-        if ( n.missingInput.nonEmpty) {
+        if (n.missingInput.nonEmpty) {

cloud-fan · 2023-10-04T06:40:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+          None
+        } catch {
+          case _: AnalysisException =>
+            Some(s"Aggregate: ${a.toString} is not a valid aggregate expression")


Shall we take the actual error message? For example: case e: AnalysisException => Some(e.getMessage)
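The suggestion above can be sketched without Spark as follows. `ToyAnalysisException` and `checkGroupingExpr` are hypothetical stand-ins for `AnalysisException` and the underlying expression check. The only point is that the `catch` clause forwards the original message instead of replacing it with a generic one.

```scala
object ErrorMessageSketch {
  // Hypothetical stand-in for Spark's AnalysisException.
  final class ToyAnalysisException(message: String) extends Exception(message)

  // Hypothetical check that throws with a specific, descriptive message.
  def checkGroupingExpr(expr: String): Unit =
    if (expr.contains("rand(")) {
      throw new ToyAnalysisException(
        s"Non-deterministic expression '$expr' should not appear in grouping expression.")
    }

  // Converts the exception into Some(message), keeping the original detail.
  def validate(expr: String): Option[String] =
    try {
      checkGroupingExpr(expr)
      None
    } catch {
      case e: ToyAnalysisException => Some(e.getMessage)
    }
}
```

Forwarding `e.getMessage` means the validation output names the offending expression and the reason, which a fixed "is not a valid aggregate expression" string would lose.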

cloud-fan · 2023-10-04T06:41:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

+      .orElse(LogicalPlanIntegrity.validateNoDanglingReferences(currentPlan))
+      .orElse(LogicalPlanIntegrity.validateGroupByTypes(currentPlan))
+      .orElse(LogicalPlanIntegrity.validateAggregateExpressions(currentPlan))
+      .map( err => s"${err}\nPrevious schema:${previousPlan.output.mkString(", ")}" +


Suggested change

-      .map( err => s"${err}\nPrevious schema:${previousPlan.output.mkString(", ")}" +
+      .map(err => s"${err}\nPrevious schema:${previousPlan.output.mkString(", ")}" +
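The `.orElse(...).map(...)` chain in this hunk relies on `Option.orElse` taking its argument by name: once one check returns an error, the remaining checks are never evaluated, and `.map` decorates only that first error. A small sketch under toy assumptions (plans are plain strings, the three check functions and the call counter are hypothetical):

```scala
object ValidationChainSketch {
  // Counts evaluations of the last check, to make short-circuiting visible.
  var callsToThirdCheck = 0

  // Hypothetical checks: each returns Some(error) or None.
  def checkDangling(plan: String): Option[String] =
    if (plan.contains("dangling")) Some("dangling reference") else None

  def checkGroupBy(plan: String): Option[String] =
    if (plan.contains("map-key")) Some("map grouping key") else None

  def checkAggregates(plan: String): Option[String] = {
    callsToThirdCheck += 1
    if (plan.contains("bad-agg")) Some("invalid aggregate") else None
  }

  // First error wins; later checks are skipped because orElse is by-name.
  def validate(plan: String, previousSchema: Seq[String]): Option[String] = {
    val schema = previousSchema.mkString(", ")
    checkDangling(plan)
      .orElse(checkGroupBy(plan))
      .orElse(checkAggregates(plan))
      .map(err => s"$err\nPrevious schema:$schema")
  }
}
```

One consequence of this design is that a plan with several problems reports only the first one, in the order the checks are chained.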

cloud-fan · 2023-10-06T03:19:47Z

thanks, merging to master!

…ewrites

### What changes were proposed in this pull request?

Adds per-rule validation checks for the following:

1. aggregate expressions in Aggregate plans are valid.
2. Grouping key types in Aggregate plans cannot by of type Map.
3. No dangling references have been generated.

This validation is by default enabled for all tests or selectively using the spark.sql.planChangeValidation=true flag.

### Why are the changes needed?

Extra validation for optimizer rewrites.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes apache#41763 from YannisSismanis/SC-130139_followup.

Authored-by: Yannis Sismanis <yannis.sismanis@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

minor

ee46534

github-actions bot added the SQL label Jun 27, 2023

YannisSismanis changed the title ~~[SPARK-44219] Adds extra per-rule validations for optimization rewrites.~~ [SPARK-44219][SQL] Adds extra per-rule validations for optimization rewrites. Jun 27, 2023

YannisSismanis added 5 commits June 29, 2023 22:45

minor

2e7b401

minor

2255c25

minor

507c27d

minot

048d063

minr

53ef2a9

cloud-fan reviewed Jul 21, 2023

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

cloud-fan reviewed Jul 21, 2023

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (Outdated)

YannisSismanis requested a review from cloud-fan August 15, 2023 18:56

Merge branch 'master' into SC-130139_followup

3746b66

cloud-fan reviewed Aug 24, 2023

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (Outdated)

cloud-fan reviewed Aug 25, 2023

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala (Outdated)

YannisSismanis added 2 commits October 3, 2023 10:49

Merge branch 'master' into SC-130139_followup

a4d515b

update

874e092

YannisSismanis requested a review from cloud-fan October 3, 2023 18:57

cloud-fan reviewed Oct 4, 2023

cloud-fan approved these changes Oct 4, 2023

mino

4e15bfd

YannisSismanis requested a review from cloud-fan October 5, 2023 18:47

cloud-fan closed this in 2ce1a87 Oct 6, 2023
