[SPARK-49556][SQL] Add SQL pipe syntax for the SELECT operator #48047
Conversation
exclude flaky ThriftServerQueryTestSuite for new golden file
Thanks @gengliangwang for your review!
Thanks @gengliangwang for your review again!
switch to expression; move error checking to checkAnalysis
Thanks again @gengliangwang for your careful attention
@@ -604,6 +604,7 @@ queryTerm
     operator=INTERSECT setQuantifier? right=queryTerm #setOperation
     | left=queryTerm {!legacy_setops_precedence_enabled}?
     operator=(UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm #setOperation
+    | left=queryTerm OPERATOR_PIPE operatorPipeRightSide #operatorPipeStatement
I had a hard time understanding this recursive parser rule. How does it match continuous pipe operators? And what is the Operator Precedence with mixed classic SQL query syntax and the new pipe syntax?
I'm not familiar enough with ANTLR. So this recursive parser rule matches the SQL string from the end? e.g. it finds the first `operatorPipeRightSide` from the end, and then tries to match a chain of pipe operators.
Sure, no problem, I can try to explain it.
ANTLR tokenizes each SQL query it receives, converting the input string into a sequence of tokens (using `SqlBaseLexer.g4`). Then the parser's job (in this file) is to convert that sequence of tokens into an initial unresolved logical plan representing the parse tree.

To do so, the parser checks each rule in the listed sequence, one by one, comparing the provided tokens at the current index in the sequence with the required tokens from the rule. If the rule matches, wherein all keywords and other components in the rule map to corresponding input tokens, then the parser generates the rule's unresolved logical plan tree using the logic in `AstBuilder.scala`.

In this case, we define the new token `OPERATOR_PIPE: '|>';` in `SqlBaseLexer.g4`. Then we add a new alternative to the existing `queryTerm` rule to allow any syntax matching an existing `queryTerm` to appear on the left side of this `|>` token and the syntax of `operatorPipeRightSide` on the right side (which in this PR is limited to only a `selectClause`).

The ANTLR grammar allows left-recursive rules wherein any alternative may begin with a reference to the same rule, so the `queryTerm` on the left side may match any valid existing syntax for a `queryTerm`, such as `TABLE t`, a table subquery, etc. Since we are extending `queryTerm` to also match against `queryTerm OPERATOR_PIPE operatorPipeRightSide`, this alternative implements the recursion whereby we may chain multiple pipe operators together. For example, in `TABLE t |> SELECT x |> LIMIT 2`, `TABLE t` matches a `queryTerm`, then `TABLE t |> SELECT x` matches another, and finally the entire query matches (using the new recursive `#operatorPipeStatement` alternative two times).

Otherwise, if the rule does not match, the parser moves on to try the next rule in the sequence, and so on, similar to a Scala pattern match. This defines the precedence of the rules amongst each other: the ones appearing first in the list in `SqlBaseParser.g4` apply first.
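The left-to-right grouping described above can be sketched with a tiny hand-rolled parser (a hypothetical illustration in Python, not the actual ANTLR machinery): it first matches a base `queryTerm`, then folds each subsequent `|>` segment into a nested `#operatorPipeStatement` node, which is how a chain of pipe operators nests.

```python
def parse_query_term(query: str):
    """Sketch of how ANTLR's rewrite of the left-recursive rule
    `queryTerm : ... | queryTerm OPERATOR_PIPE operatorPipeRightSide`
    behaves: match a base queryTerm once, then loop over trailing
    `|> rightSide` segments, wrapping the tree built so far."""
    parts = query.split("|>")
    # The leftmost segment is the base queryTerm (e.g. "TABLE t").
    tree = ("queryTerm", parts[0].strip())
    # Each further segment wraps the current tree, mirroring the
    # recursive #operatorPipeStatement alternative.
    for right_side in parts[1:]:
        tree = ("operatorPipeStatement", tree, ("rightSide", right_side.strip()))
    return tree

print(parse_query_term("TABLE t |> SELECT x |> LIMIT 2"))
# ('operatorPipeStatement',
#  ('operatorPipeStatement', ('queryTerm', 'TABLE t'),
#   ('rightSide', 'SELECT x')),
#  ('rightSide', 'LIMIT 2'))
```

Note how the innermost node is the base `TABLE t` and each `|>` adds one outer layer, so `TABLE t |> SELECT x` is fully parsed before `|> LIMIT 2` is applied to it.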
So the parser generates a basic parse tree, and `AstBuilder.scala` transforms that into an unresolved logical plan? Thanks for the clear and detailed explanation! I'm adding SQL syntax too and this is very helpful.
respond to code review comments
Thanks @cloud-fan for your review!!
@@ -880,4 +881,20 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession {
     parser.parsePlan("SELECT\u30001") // Unicode ideographic space
   }
   // scalastyle:on

+  test("Operator pipe SQL syntax") {
+    withSQLConf(SQLConf.OPERATOR_PIPE_SYNTAX_ENABLED.key -> "true") {
we don't need this now
@@ -103,7 +103,8 @@ class ThriftServerQueryTestSuite extends SQLQueryTestSuite with SharedThriftServ
     // SPARK-42921
     "timestampNTZ/datetime-special-ansi.sql",
     // SPARK-47264
-    "collations.sql"
+    "collations.sql",
+    "pipe-operators.sql"
Why doesn't it work in the thrift-server?
Good question; previously I found it was flaky, failing sometimes because it wasn't sorting the output result rows for some reason. But running it again now with a Python script 25 times locally, it seems to be passing. I can re-enable the test here.
Thanks all for your reviews :)
thanks, merging to master! |
What changes were proposed in this pull request?
This PR adds SQL pipe syntax support for the SELECT operator.
For example:
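(The original example snippet did not survive extraction; the following is a representative sketch reconstructed from the discussion in this PR, assuming a table `t` with columns `x` and `y`:)

```sql
-- Classic syntax:
SELECT x, y FROM t;

-- Equivalent SQL pipe syntax, as added by this PR:
TABLE t
|> SELECT x, y;
```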
Why are the changes needed?
The SQL pipe operator syntax will let users compose queries in a more flexible fashion.
Does this PR introduce any user-facing change?
Yes, see above.
How was this patch tested?
This PR adds a few unit test cases but mostly relies on golden file test coverage, both to make sure the answers are correct as this feature is implemented and so we can inspect the analyzer output plans to ensure they look right as well.
Was this patch authored or co-authored using generative AI tooling?
No