[SPARK-48985][CONNECT] Connect Compatible Expression Constructors #47464
Conversation
cc @beliefer and @zhengruifeng
@@ -519,6 +519,8 @@ object FunctionRegistry {
      expressionBuilder("mode", ModeBuilder),
      expression[HllSketchAgg]("hll_sketch_agg"),
      expression[HllUnionAgg]("hll_union_agg"),
+     expression[Product]("product"),
The changes in this file actually increase our surface area, so I'd like some feedback to understand if this is OK.
I remember there was some reason to not add `product` here @cloud-fan
I don't have a strong opinion now, as we have recently been going for feature parity between the SQL and DataFrame APIs. cc @HyukjinKwon
Actually yeah. Those methods were intentionally not added to SQL expressions.
Ok, let me try something different.
I will create a separate registry for these functions and only use them in DataFrame resolution.
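As a rough sketch of the idea: a registry consulted only during DataFrame/Connect resolution, so these functions never become visible to SQL. The `InternalFunctionRegistry` name, the builder map, and the `lookup` helper below are illustrative assumptions, not the PR's actual code:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.Product

// Hypothetical registry, separate from FunctionRegistry, used only
// when resolving DataFrame/Connect function invocations.
object InternalFunctionRegistry {
  private val builders: Map[String, Seq[Expression] => Expression] = Map(
    "product" -> { args => Product(args.head).toAggregateExpression() }
  )

  // Returns the constructed expression, or None so the caller can fall
  // back to the regular FunctionRegistry lookup.
  def lookup(name: String, args: Seq[Expression]): Option[Expression] =
    builders.get(name).map(_(args))
}
```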
@@ -2637,6 +2637,26 @@
    "message" : [
      "expects %1$, %2$ and so on, but got %0$."
    ]
  },
Not sure if this is the proper way of doing this.
      case other => SessionWindow(timeCol, other)
    }
    Some(
      Alias(sessionWindow, "session_window")(nonInheritableMetadataKeys =
        Seq(Dataset.DATASET_ID_KEY, Dataset.COL_POS_KEY)))
@zhengruifeng Why do you explicitly remove the `Dataset.DATASET_ID_KEY` and `Dataset.COL_POS_KEY`?
I think it was done to be consistent with Spark Classic at the time:
spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala, lines 5708 to 5712 in 15c98e0:

  def session_window(timeColumn: Column, gapDuration: String): Column = {
    withExpr {
      SessionWindow(timeColumn.expr, gapDuration)
    }.as("session_window")
  }
`as` is an alias for `name`, which removes them:
spark/sql/core/src/main/scala/org/apache/spark/sql/Column.scala, lines 1186 to 1192 in 9cf6dc8:

  def name(alias: String): Column = withExpr {
    // SPARK-33536: an alias is no longer a column reference. Therefore,
    // we should not inherit the column reference related metadata in an alias
    // so that it is not caught as a column reference in DetectAmbiguousSelfJoin.
    Alias(expr, alias)(
      nonInheritableMetadataKeys = Seq(Dataset.DATASET_ID_KEY, Dataset.COL_POS_KEY))
  }
Ah ok, that makes sense.
    }.as("session_window")
  }

  def session_window(timeColumn: Column, gapDuration: String): Column =
    session_window(timeColumn, lit(gapDuration))
This is now supported because of the small change I made to ResolveTimeWindows.
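For context, the string overload supports the usual sessionization pattern; a typical call might look like this (the `events` DataFrame and column names are just for illustration):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, session_window}

// Sessionize events with a 10-minute gap; the string duration is wrapped
// in lit(...) and resolved during analysis by ResolveTimeWindows.
def sessionCounts(events: DataFrame): DataFrame =
  events
    .groupBy(session_window(col("eventTime"), "10 minutes"), col("userId"))
    .count()
```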
@@ -5743,7 +5728,7 @@ object functions {
   * @since 3.2.0
   */
  def session_window(timeColumn: Column, gapDuration: Column): Column =
    Column.fn("session_window", timeColumn, gapDuration).as("session_window")
The alias is not needed; it is added by the ResolveTimeWindows rule.
      case Literal(interval: CalendarInterval, CalendarIntervalType) =>
        interval == null || interval.months + interval.days + interval.microseconds <= 0
      case _ => true

    val filterByTimeRange = if (gapDuration.foldable) {
This is a bit better than pattern matching on a literal since it supports more complex expressions, in particular `Cast(Literal("<some gap duration>", StringType), CalendarIntervalType)`. Normally you'd expect constant folding to take care of this, but that happens during optimization, and this rule is part of analysis; hence the need for manual constant folding.
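To make the difference concrete, here is a minimal sketch assuming Catalyst's `Expression` API (`foldable`, `eval`) and a gap expression already typed as `CalendarIntervalType`; it is illustrative, not the rule's actual code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.types.CalendarIntervalType
import org.apache.spark.unsafe.types.CalendarInterval

// Pattern matching only recognizes a bare literal:
def isInvalidGapLiteral(e: Expression): Boolean = e match {
  case Literal(i: CalendarInterval, CalendarIntervalType) =>
    i == null || i.months + i.days + i.microseconds <= 0
  case _ => false
}

// Checking foldability and evaluating also covers constant subtrees such
// as Cast(Literal("10 minutes"), CalendarIntervalType):
def isInvalidGapFoldable(e: Expression): Boolean =
  e.foldable && {
    // Assumes e has already been checked to be of CalendarIntervalType.
    val i = e.eval().asInstanceOf[CalendarInterval]
    i == null || i.months + i.days + i.microseconds <= 0
  }
```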
Merging.
Closes apache#47464 from hvanhovell/SPARK-48985.
Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
What changes were proposed in this pull request?
There are a number of hardcoded expressions in the SparkConnectPlanner. Most of them are hardcoded because they lack a proper constructor or because they are not registered in the FunctionRegistry. The Column API has a similar problem. This PR fixes most of these cases.
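As an illustration of the kind of hardcoding being removed, here is a simplified sketch of a name-based special case in the planner; the method shape and exact case below are assumptions, not a quote from SparkConnectPlanner:

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate

// Before: functions the FunctionRegistry could not resolve were
// special-cased by name inside the planner.
def transformUnregisteredFunction(
    name: String,
    children: Seq[Expression]): Option[Expression] = name match {
  case "product" =>
    Some(aggregate.Product(children.head).toAggregateExpression())
  case _ => None // fall back to the regular FunctionRegistry lookup
}
```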
Why are the changes needed?
Reduce the number of hardcoded expressions in the SparkConnectPlanner and the Column API. This will make it significantly easier to create an implementation-agnostic Column API.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.