
[SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE #42577

Closed
wants to merge 19 commits

Conversation

imback82
Contributor

@imback82 imback82 commented Aug 21, 2023

What changes were proposed in this pull request?

This PR proposes to introduce the CLUSTER BY SQL clause to the CREATE/REPLACE TABLE SQL syntax:

CREATE TABLE tbl(a int, b string) CLUSTER BY (a, b)

This doesn't introduce a default implementation for clustering; it is up to the catalog/datasource implementation (e.g., Delta, Iceberg) to utilize the clustering information.
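
As a rough illustration (not part of this PR's diff), a v2 catalog or datasource that receives the table's partitioning transforms could pick out the clustering columns along the lines of the sketch below; it only assumes the cluster_by transform name introduced by this PR.

```scala
import org.apache.spark.sql.connector.expressions.Transform

// Hedged sketch: extract clustering columns from the v2 partitioning transforms
// that Spark passes to a catalog/datasource at table creation time.
def clusteringColumns(partitioning: Array[Transform]): Seq[Seq[String]] =
  partitioning.toSeq
    .filter(_.name == "cluster_by")
    .flatMap(_.references().toSeq.map(_.fieldNames().toSeq))
```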

Why are the changes needed?

To introduce the concept of clustering to datasources.

Does this PR introduce any user-facing change?

Yes, this introduces a new SQL keyword.

How was this patch tested?

Added extensive unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 21, 2023
@imback82 imback82 changed the title [SPARK-XXXXX][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE [WIP][SPARK-XXXXX][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE Aug 21, 2023
@imback82 imback82 marked this pull request as draft August 21, 2023 02:39
@imback82 imback82 changed the title [WIP][SPARK-XXXXX][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE [WIP][SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE Aug 21, 2023
@github-actions github-actions bot added the DOCS label Sep 9, 2023
@imback82 imback82 marked this pull request as ready for review November 4, 2023 14:49
@imback82 imback82 changed the title [WIP][SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE [SPARK-44886][SQL] Introduce CLUSTER BY clause for CREATE/REPLACE TABLE Nov 4, 2023
Contributor Author

@imback82 imback82 left a comment

@cloud-fan this PR is ready for review. I left my questions in the PR. TIA!

@@ -253,7 +270,8 @@ case class CatalogTable(
tracksPartitionsInCatalog: Boolean = false,
schemaPreservesCase: Boolean = true,
ignoredProperties: Map[String, String] = Map.empty,
viewOriginalText: Option[String] = None) {
viewOriginalText: Option[String] = None,
clusterBySpec: Option[ClusterBySpec] = None) {
Contributor Author

We haven't added any field to CatalogTable in the last 5 years; should we avoid adding this new field? Alternatively, we could store the clustering columns in a table property and make that property reserved.

Contributor

Since we have to store it in the table properties anyway at the Hive layer, I think it's simpler to just do that up front here.

Contributor Author

Done. I guess there's no need to make it a reserved property like the properties in TableCatalog?

)

val filteredProperties = properties.filterNot {
case (key, _) => excludedTableProperties.contains(key)
}
val comment = properties.get("comment")
val clusterBySpec = properties.get("clusteringColumns").map(ClusterBySpec(_))
Contributor Author

Should we make this property reserved?

* table catalog.
*/
class CreateTableClusterBySuite extends v1.CreateTableClusterBySuiteBase with CommandSuiteBase {
// Hive doesn't support nested column names with spaces and dots.
Contributor Author

Actually, the exception stack trace looks like the following:

 [CANNOT_RECOGNIZE_HIVE_TYPE] Cannot recognize hive type string: "STRUCT<COL4.1:INT>", column: `col3`. The specified data type for the field cannot be recognized by Spark SQL. Please check the data type of the specified field and ensure that it is a valid Spark SQL data type. Refer to the Spark SQL documentation for a list of valid data types and their format. If the data type is correct, please ensure that you are using a supported version of Spark SQL. SQLSTATE: 429BB
org.apache.spark.SparkException: [CANNOT_RECOGNIZE_HIVE_TYPE] Cannot recognize hive type string: "STRUCT<COL4.1:INT>", column: `col3`. The specified data type for the field cannot be recognized by Spark SQL. Please check the data type of the specified field and ensure that it is a valid Spark SQL data type. Refer to the Spark SQL documentation for a list of valid data types and their format. If the data type is correct, please ensure that you are using a supported version of Spark SQL. SQLSTATE: 429BB
	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1637)
	at org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1067)
	at org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1082)
	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)

COL4.1 in "STRUCT<COL4.1:INT>" is not quoted, so the parser fails. We probably need to fix DataType.catalogString...
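
For illustration only (this is not the PR's code), the kind of fix implied here is to quote any field name that is not a plain identifier when building the catalog type string, along the lines of the sketch below.

```scala
// Hedged sketch of a quote-if-needed helper for catalog type strings.
def quoteIfNeeded(part: String): String =
  if (part.matches("[a-zA-Z0-9_]+") && !part.matches("\\d+")) part
  else "`" + part.replace("`", "``") + "`"

// "COL4.1" contains a dot, so the unquoted STRUCT<COL4.1:INT> cannot be parsed back;
// quoting yields the parseable STRUCT<`COL4.1`:INT>.
val fixed = s"STRUCT<${quoteIfNeeded("COL4.1")}:INT>"
```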

Contributor

+1, cc @beliefer can you help to take a look?

}

object ClusterBySpec {
def apply(columns: String): ClusterBySpec = columns match {
Contributor

this looks weird, where do we use it?

Contributor Author

This is used to parse the property value back into a ClusterBySpec. I renamed it to fromProperty.

object ClusterBySpec {
def fromProperty(columns: String): ClusterBySpec = columns match {
case "" => ClusterBySpec(Seq.empty[UnresolvedAttribute])
case _ => ClusterBySpec(columns.split(",").map(_.trim).map(UnresolvedAttribute.quotedString))
Contributor

Hmm, what if the column name contains a comma? Shall we use JSON format?

Contributor Author

done

*
* @param columnNames the names of the columns used for clustering.
*/
case class ClusterBySpec(columnNames: Seq[UnresolvedAttribute]) {
Contributor

It's weird to use UnresolvedAttribute here as we are not going to resolve it in the analyzer. How about just Seq[Seq[String]] or Seq[FieldReference]?

Contributor Author

I went with Seq[NamedReference] to be consistent with the *Transform classes (including ClusterByTransform). Also, some helper functions such as FieldReference.unapply return NamedReference, so it's easier to work with NamedReference.
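
A minimal sketch of the shape described above (simplified from the PR; using Expressions.column here is just one way to build a NamedReference):

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}

// Clustering columns held as connector NamedReferences; no analyzer resolution is implied.
case class ClusterBySpec(columnNames: Seq[NamedReference])

// Expressions.column parses a (possibly nested, backtick-quoted) name into a NamedReference.
val spec = ClusterBySpec(Seq(
  Expressions.column("a"),
  Expressions.column("col2.`col4 1`")))
```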

override val name: String = "cluster_by"

override def references: Array[NamedReference] = {
arguments.collect { case named: NamedReference => named }
Contributor

It's columnNames: Seq[NamedReference], so we can just return columnNames here.

Contributor Author

+1
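
A hedged sketch of the resulting transform shape (not the PR's exact code): with columnNames already a Seq[NamedReference], both references and arguments can return it directly.

```scala
import org.apache.spark.sql.connector.expressions.{Expression, NamedReference, Transform}

// Minimal sketch of a cluster_by transform over NamedReference columns.
case class ClusterByTransform(columnNames: Seq[NamedReference]) extends Transform {
  override val name: String = "cluster_by"
  override def arguments: Array[Expression] = columnNames.toArray
  override def references: Array[NamedReference] = columnNames.toArray
  override def describe(): String =
    s"$name(${columnNames.map(_.describe()).mkString(", ")})"
}
```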

@@ -297,6 +297,7 @@ case class PreprocessTableCreation(catalog: SessionCatalog) extends Rule[Logical

val normalizedPartCols = normalizePartitionColumns(schema, table)
val normalizedBucketSpec = normalizeBucketSpec(schema, table)
val normalizedClusterBySpec = normalizeClusterBySpec(schema, table)
Contributor

Instead of normalizing it here, shall we normalize it before we convert the cluster spec to table properties?

Contributor Author

done

protected val nestedClusteringColumns: Seq[String] =
Seq("col2.col3", "col2.`col4 1`", "col3.`col4.1`")

def validateClusterBy(tableIdent: TableIdentifier, clusteringColumns: Seq[String]): Unit
Contributor

Better not to use the v1 TableIdentifier in a base test suite that works for both v1 and v2.

Contributor

Seq[String] should be good.

Contributor

Or just a String holding the dot-separated qualified table name.

Contributor Author

done

checkError(
exception = intercept[AnalysisException](
sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing CLUSTER BY (unknown)")),
errorClass = "COLUMN_NOT_DEFINED_IN_TABLE",
Contributor

Shouldn't this be general behavior? The analyzer should check the existence of clustering columns. I think that's already the case for partition and bucket columns.

Contributor Author

Now that the logic follows the same path as PreprocessTableCreation, the error message is unified: https://github.com/apache/spark/pull/42577/files#diff-f2a04f920c41d18a7d387216f86405bfdc6fb09c44ebe1bb09312ba7dde55333R216

}

def fromProperty(columns: String): ClusterBySpec = {
ClusterBySpec(mapper.readValue[Seq[Seq[String]]](columns).map(FieldReference(_)))
Contributor Author

Alternatively, I could have serialized FieldReference directly, but this approach is more generic.
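
For reference, a hedged sketch of the JSON round-trip this amounts to (the object name and mapper setup below are assumptions; the Seq[Seq[String]] encoding matches the snippet above):

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.{ClassTagExtensions, DefaultScalaModule}

// Encode each clustering column as its name parts, so dots, commas, and spaces in
// column names stay unambiguous (unlike a plain comma-separated string).
object ClusterByJson {
  private val mapper = new ObjectMapper() with ClassTagExtensions
  mapper.registerModule(DefaultScalaModule)

  def toProperty(columns: Seq[Seq[String]]): String =
    mapper.writeValueAsString(columns)

  def fromProperty(value: String): Seq[Seq[String]] =
    mapper.readValue[Seq[Seq[String]]](value)
}
```

For example, toProperty(Seq(Seq("a"), Seq("col2", "col4.1"))) produces [["a"],["col2","col4.1"]], which fromProperty parses back losslessly.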

checkError(
exception = intercept[AnalysisException](
sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing CLUSTER BY (unknown)")),
errorClass = "COLUMN_NOT_DEFINED_IN_TABLE",
Contributor Author

Now that the logic follows the same path as PreprocessTableCreation, the error message is unified: https://github.com/apache/spark/pull/42577/files#diff-f2a04f920c41d18a7d387216f86405bfdc6fb09c44ebe1bb09312ba7dde55333R216

clusterBySpec: ClusterBySpec,
resolver: Resolver): ClusterBySpec = {
val normalizedColumns = clusterBySpec.columnNames.map { columnName =>
val position = SchemaUtils.findColumnPosition(
Contributor

This checks column existence, but we only hit it on the v1 path (when converting to a v1 command). Where do we check column existence for the pure v2 path?

Contributor Author

It's happening here for the v2 path:

case transform: RewritableTransform =>
  val rewritten = transform.references().map { ref =>
    // Throws an exception if the reference cannot be resolved
    val position = SchemaUtils.findColumnPosition(ref.fieldNames(), schema, resolver)
    FieldReference(SchemaUtils.getColumnName(position, schema))
  }
  transform.withReferences(rewritten)

}

test("test clustering columns with comma") {
assume(!catalogVersion.contains("Hive")) // Hive catalog doesn't support column names with dots.
Contributor

We can override def excluded in the Hive suite to exclude this test case.

Contributor Author

done.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5ac88b1 Nov 9, 2023
zedtang added a commit to zedtang/delta that referenced this pull request Nov 22, 2023
This PR changes the Delta SQL parser to support CLUSTER BY for Liquid
clustering. Since Delta imports Spark as a library and can't change the
source code directly, we instead replace `CLUSTER BY` with `PARTITIONED
BY`, and leverage the Spark SQL parser to perform validation. After
parsing, clustering columns will be stored in TableSpec's properties.

When we integrate with OSS Spark's CLUSTER BY
implementation([PR](apache/spark#42577)), we'll
remove the workaround in this PR.
allisonport-db pushed a commit to delta-io/delta that referenced this pull request Dec 13, 2023
This PR changes the Delta SQL parser to support CLUSTER BY for Liquid clustering. Since Delta imports Spark as a library and can't change the source code directly, we instead replace `CLUSTER BY` with `PARTITIONED BY`, and leverage the Spark SQL parser to perform validation. After parsing, clustering columns will be stored in the logical plan's partitioning transforms.

When we integrate with Apache Spark's CLUSTER BY implementation([PR](apache/spark#42577)), we'll remove the workaround in this PR.

Closes #2328

GitOrigin-RevId: 19262070edbcaead765e7f9eefe96b6e63a7f884
andreaschat-db pushed a commit to andreaschat-db/delta that referenced this pull request Jan 5, 2024
cloud-fan added a commit that referenced this pull request Jul 3, 2024
… change clustering columns

### What changes were proposed in this pull request?

Introduce ALTER TABLE ... CLUSTER BY SQL syntax to change the clustering columns:
```sql
ALTER TABLE tbl CLUSTER BY (a, b);  -- update clustering columns to a and b
ALTER TABLE tbl CLUSTER BY NONE;  -- remove clustering columns
```

This change updates the clustering columns for catalogs to utilize. Clustering columns are maintained in:
* CatalogTable's `PROP_CLUSTERING_COLUMNS` for session catalog
* Table's `partitioning` transform array for V2 catalog

which is consistent with CREATE TABLE ... CLUSTER BY (#42577).

### Why are the changes needed?

Provides a way to update the clustering columns.

### Does this PR introduce _any_ user-facing change?

Yes, it introduces new SQL syntax and a new keyword NONE.

### How was this patch tested?

New unit tests.
### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47156 from zedtang/alter-table-cluster-by.

Lead-authored-by: Jiaheng Tang <jiaheng.tang@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Jul 10, 2024