
[Spark] Delta SQL parser change to support CLUSTER BY #2328

Closed
wants to merge 2 commits

Conversation

zedtang
Collaborator

@zedtang zedtang commented Nov 22, 2023

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #2593

This PR is part of #1874.

This PR changes the Delta SQL parser to support CLUSTER BY for Liquid clustering. Since Delta imports Spark as a library and can't change the source code directly, we instead replace CLUSTER BY with PARTITIONED BY, and leverage the Spark SQL parser to perform validation. After parsing, clustering columns will be stored in the logical plan's partitioning transforms.

When we integrate with Apache Spark's CLUSTER BY implementation ([PR](apache/spark#42577)), we'll remove the workaround in this PR.
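The core of the workaround can be sketched as a small textual substitution before delegating to the Spark SQL parser. This is a minimal, hypothetical sketch — the actual PR operates on the ANTLR parse context rather than raw SQL text, and the object name here is illustrative:

```scala
// Hypothetical sketch of the CLUSTER BY -> PARTITIONED BY rewrite.
// The real implementation works on parser contexts, not raw strings.
object ClusterByRewrite {
  // Case-insensitive match on the CLUSTER BY keyword pair.
  private val ClusterBy = "(?i)CLUSTER\\s+BY".r

  // Rewrite CLUSTER BY into PARTITIONED BY so the stock Spark SQL
  // parser (which does not yet know CLUSTER BY) can validate the rest
  // of the statement.
  def rewrite(sqlText: String): String =
    ClusterBy.replaceAllIn(sqlText, "PARTITIONED BY")
}
```

For example, `ClusterByRewrite.rewrite("CREATE TABLE t (a INT) USING delta CLUSTER BY (a)")` yields the `PARTITIONED BY` form that Spark 3.5 already understands.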

How was this patch tested?

New unit tests.

Does this PR introduce any user-facing changes?

Yes, it introduces parser support for CLUSTER BY syntax.

@@ -76,6 +77,8 @@ class DeltaSqlParser(val delegate: ParserInterface) extends ParserInterface {

override def parsePlan(sqlText: String): LogicalPlan = parse(sqlText) { parser =>
builder.visit(parser.singleStatement()) match {
case clusterByPlan: ClusterByPlan =>
Collaborator
Sorry, this isn't obvious to me. How does Spark return a ClusterByPlan? Can a user type CREATE TABLE ... CLUSTER BY using Spark 3.5, which Delta depends on?

Collaborator Author

@zedtang zedtang Nov 27, 2023
The ClusterByPlan is injected by Delta in L416 below. We are using this workaround since Spark 3.5 doesn't support CLUSTER BY yet.

The flow looks like this:

  1. User types CREATE TABLE ... CLUSTER BY.
  2. CLUSTER BY gets matched with clusterBySpec in DeltaSqlBase.g4 and visitClusterBy is called.
  3. visitClusterBy creates a ClusterByPlan, which is detected in parsePlan and routed to ClusterByParserUtils.parsePlan.
  4. ClusterByParserUtils.parsePlan replaces CLUSTER BY with PARTITIONED BY and calls the Spark SQL parser (delegate.parsePlan) for validation.
  5. After successful parsing, the clustering columns are saved in the table's partitioning transforms.
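The sentinel dispatch in the flow above can be sketched roughly as follows. This is a simplified, hypothetical model — the real ClusterByPlan carries the parse context rather than raw SQL, and the type names here are stand-ins:

```scala
// Simplified stand-ins for the plans involved (illustrative only).
sealed trait LogicalPlan
case class ClusterByPlan(sqlText: String) extends LogicalPlan    // sentinel built by visitClusterBy
case class SparkParsedPlan(sqlText: String) extends LogicalPlan  // stand-in for Spark's parse result

// parsePlan detects the sentinel and reroutes it through the delegate
// (Spark) parser after rewriting CLUSTER BY to PARTITIONED BY; any
// other plan is passed through unchanged.
def parsePlan(plan: LogicalPlan, delegate: String => LogicalPlan): LogicalPlan =
  plan match {
    case ClusterByPlan(sql) =>
      delegate(sql.replaceAll("(?i)CLUSTER\\s+BY", "PARTITIONED BY"))
    case other => other
  }
```

The sentinel pattern keeps the Delta grammar change small: only visitClusterBy knows about the new syntax, while validation stays entirely in the unmodified Spark parser.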

Contributor

@imback82 imback82 left a comment

Looks reasonable to me as a temporary solution until we integrate with OSS Spark's CLUSTER BY.

* The plan will be used as a sentinel for DeltaSqlParser to process it further.
*/
override def visitClusterBy(ctx: ClusterByContext): LogicalPlan = withOrigin(ctx) {
val clusterBySpecCtx = ctx.clusterBySpec.asScala.head
Contributor
is head always safe?

Collaborator Author
Yes. This function is only invoked after the grammar has matched clusterBySpec+, which means one or more clusterBySpec occurrences.

}

// ClassTag is added to avoid the "same type after erasure" issue with the case class.
def apply[_: ClassTag](columnNames: Seq[Seq[String]]): ClusterBySpec = {
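The erasure issue the comment refers to: the case-class companion already has a synthetic apply taking Seq[NamedReference], and after type erasure both overloads would become apply(Seq) and collide. The extra ClassTag context bound adds an implicit parameter, which changes the erased signature. A hedged, self-contained sketch — FieldReference here stands in for Spark's NamedReference, and the type parameter is named T for portability (the PR itself writes `apply[_: ClassTag]`):

```scala
import scala.reflect.ClassTag

// Stand-in for Spark's NamedReference (illustrative only).
case class FieldReference(parts: Seq[String])

case class ClusterBySpec(columnNames: Seq[FieldReference])

object ClusterBySpec {
  // Without the ClassTag context bound, this overload would erase to
  // apply(Seq) and collide with the companion's apply(Seq[FieldReference]).
  // The implicit ClassTag parameter gives it a distinct erased signature.
  def apply[T: ClassTag](columnNames: Seq[Seq[String]]): ClusterBySpec =
    new ClusterBySpec(columnNames.map(parts => FieldReference(parts)))
}
```

With this in place, `ClusterBySpec(Seq(Seq("a", "b")))` resolves to the ClassTag overload, while `ClusterBySpec(Seq(FieldReference(Seq("a"))))` still hits the case-class apply.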
Contributor
Looks like this is only used by the parser, so it will be removed when we integrate with OSS? If so, can you add a comment that this will be removed (or just put this functionality inside the caller to make it easy to remember)? Note that I don't see this function in OSS Spark either.

Collaborator Author
Moved to its own package clustering.temp and file ClusterBySpec.scala.

*
* @param columnNames the names of the columns used for clustering.
*/
case class ClusterBySpec(columnNames: Seq[NamedReference]) {
Contributor
Maybe name this to TempClusterBySpec to make it obvious?

Collaborator Author
Moved to its own package clustering.temp and file ClusterBySpec.scala.

@zedtang zedtang deleted the delta-parser-change branch December 14, 2023 01:00
andreaschat-db pushed a commit to andreaschat-db/delta that referenced this pull request Jan 5, 2024

Closes delta-io#2328

GitOrigin-RevId: 19262070edbcaead765e7f9eefe96b6e63a7f884
Development

Successfully merging this pull request may close these issues.

[Feature Request] Delta SQL parser support for CLUSTER BY syntax