
[SPARK-48985][CONNECT] Connect Compatible Expression Constructors #47464

Closed · wants to merge 11 commits

Conversation

hvanhovell (Contributor):

What changes were proposed in this pull request?

There are a number of hard-coded expressions in the SparkConnectPlanner. Most of them are hard-coded because they are missing a proper constructor, or because they are not registered in the FunctionRegistry. The Column API has a similar problem. This PR fixes most of these special cases.
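To illustrate the pattern being described, here is a self-contained sketch (names and shapes are hypothetical; the real planner transforms protobuf expressions into Catalyst trees, not strings):

```scala
// Hypothetical sketch only: strings stand in for protobuf and Catalyst
// expressions here.
object PlannerSketch {
  // Before: functions without a registry entry or a usable constructor
  // each need a dedicated case in the planner.
  def transformHardCoded(name: String, args: Seq[String]): String = name match {
    case "product" => s"Product(${args.head})" // hard coded
    case other     => s"UnresolvedFunction($other, ${args.mkString(", ")})"
  }

  // After: once every function is registered, one generic path suffices
  // and resolution is deferred to the analyzer.
  def transformGeneric(name: String, args: Seq[String]): String =
    s"UnresolvedFunction($name, ${args.mkString(", ")})"

  def main(a: Array[String]): Unit = {
    println(transformHardCoded("product", Seq("col1"))) // Product(col1)
    println(transformGeneric("product", Seq("col1")))   // UnresolvedFunction(product, col1)
  }
}
```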

Why are the changes needed?

Reduce the number of hard-coded expressions in the SparkConnectPlanner and the Column API. This will make it significantly easier to create an implementation-agnostic Column API.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon (Member):

cc @beliefer and @zhengruifeng

hvanhovell requested review from zhengruifeng, MaxGekk, cloud-fan and HyukjinKwon and removed request for zhengruifeng July 24, 2024 01:47
```diff
@@ -519,6 +519,8 @@ object FunctionRegistry {
     expressionBuilder("mode", ModeBuilder),
     expression[HllSketchAgg]("hll_sketch_agg"),
     expression[HllUnionAgg]("hll_union_agg"),
+    expression[Product]("product"),
```
hvanhovell (author):

The changes in this file actually increase our surface area, so I'd like some feedback on whether this is OK.
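For context, the "surface area" here is SQL: an entry in FunctionRegistry makes a function callable from SQL, not just from the DataFrame API. A rough illustration (the session setup is mine, not from the PR):

```scala
// Sketch: once expression[Product]("product") is registered, the function
// resolves in SQL too; previously, product was reachable only through the
// DataFrame API.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.range(1, 5).createOrReplaceTempView("t")

// Fails analysis with an "undefined function" error while Product is not
// in the registry; returns 1 * 2 * 3 * 4 = 24 once it is.
spark.sql("SELECT product(id) FROM t").show()
```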

Contributor:

I remember there was some reason to not add product here @cloud-fan

#38915 (comment)

Contributor:

I don't have a strong opinion now, as we have recently been pushing for feature parity between the SQL and DataFrame APIs. cc @HyukjinKwon

Member:

Actually yeah. Those methods were intentionally not added to SQL expressions.

hvanhovell (author):

Ok, let me try something different.

hvanhovell (author):

I will create a separate registry for these functions, and only use them in DataFrame resolution.
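A minimal, self-contained sketch of that idea (entirely hypothetical shapes; Spark's actual FunctionRegistry machinery is far more involved and builds Catalyst expressions):

```scala
// Hypothetical sketch of the "separate registry" idea; strings stand in
// for Catalyst expressions.
object RegistrySketch {
  type ExprBuilder = Seq[String] => String

  // Functions exposed to both SQL and the DataFrame API.
  val sqlRegistry: Map[String, ExprBuilder] =
    Map("sum" -> ((args: Seq[String]) => s"Sum(${args.mkString(", ")})"))

  // Functions intentionally kept out of SQL but usable from DataFrames.
  val dataFrameOnlyRegistry: Map[String, ExprBuilder] =
    Map("product" -> ((args: Seq[String]) => s"Product(${args.mkString(", ")})"))

  // DataFrame resolution consults both registries; SQL only the first one.
  def resolveForDataFrame(name: String, args: Seq[String]): Option[String] =
    sqlRegistry.get(name).orElse(dataFrameOnlyRegistry.get(name)).map(_(args))

  def main(a: Array[String]): Unit = {
    println(resolveForDataFrame("product", Seq("col1"))) // Some(Product(col1))
    println(sqlRegistry.get("product"))                  // None: hidden from SQL
  }
}
```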


```diff
@@ -2637,6 +2637,26 @@
       "message" : [
         "expects %1$, %2$ and so on, but got %0$."
       ]
     },
```
hvanhovell (author):

Not sure if this is the proper way of doing this.

HyukjinKwon previously approved these changes Jul 24, 2024

```scala
        case other => SessionWindow(timeCol, other)
      }
      Some(
        Alias(sessionWindow, "session_window")(nonInheritableMetadataKeys =
```
hvanhovell (author), Jul 26, 2024:

@zhengruifeng Why do you explicitly remove the Dataset.DATASET_ID_KEY and Dataset.COL_POS_KEY?

Contributor:

I think it was to be consistent with Spark Classic at the time:

```scala
def session_window(timeColumn: Column, gapDuration: String): Column = {
  withExpr {
    SessionWindow(timeColumn.expr, gapDuration)
  }.as("session_window")
}
```

`as` is an alias for `name`, which removes them:

```scala
def name(alias: String): Column = withExpr {
  // SPARK-33536: an alias is no longer a column reference. Therefore,
  // we should not inherit the column reference related metadata in an alias
  // so that it is not caught as a column reference in DetectAmbiguousSelfJoin.
  Alias(expr, alias)(
    nonInheritableMetadataKeys = Seq(Dataset.DATASET_ID_KEY, Dataset.COL_POS_KEY))
}
```

hvanhovell (author):

Ah ok, that makes sense.

```scala
  }.as("session_window")
}

def session_window(timeColumn: Column, gapDuration: String): Column =
  session_window(timeColumn, lit(gapDuration))
```
hvanhovell (author):

This is now supported because of the small change I made to ResolveTimeWindows.
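A usage sketch of the string overload (the rate source and column names are just a convenient example, not from this PR):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Both overloads now resolve through the same path in ResolveTimeWindows;
// the string form is sugar for lit("5 minutes").
val events = spark.readStream.format("rate").load() // columns: timestamp, value
val counts = events
  .groupBy(session_window(col("timestamp"), "5 minutes"))
  .count()
```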

```diff
@@ -5743,7 +5728,7 @@ object functions {
    * @since 3.2.0
    */
   def session_window(timeColumn: Column, gapDuration: Column): Column =
     Column.fn("session_window", timeColumn, gapDuration).as("session_window")
```
hvanhovell (author):

The alias is not needed here; it is added by the ResolveTimeWindows rule.

```scala
      case Literal(interval: CalendarInterval, CalendarIntervalType) =>
        interval == null || interval.months + interval.days + interval.microseconds <= 0
      case _ => true

    val filterByTimeRange = if (gapDuration.foldable) {
```
hvanhovell (author), Jul 26, 2024:

This is a bit better than pattern matching on a literal, since it supports more complex expressions, in particular Cast(Literal("<some gap duration>", StringType), CalendarIntervalType). Normally you would expect constant folding to take care of this, but that happens during optimization, and this rule is part of analysis; hence the need for manual constant folding.
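A condensed sketch of that manual folding (simplified; not the actual rule code, which also handles null and non-positive gaps):

```scala
// Simplified illustration of the manual constant folding described above.
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.CalendarIntervalType
import org.apache.spark.unsafe.types.CalendarInterval

def staticGap(gapDuration: Expression): Option[CalendarInterval] =
  if (gapDuration.foldable && gapDuration.dataType == CalendarIntervalType) {
    // Foldable expressions can be evaluated without an input row, which also
    // covers Cast(Literal("5 minutes", StringType), CalendarIntervalType),
    // not just bare interval literals.
    Option(gapDuration.eval().asInstanceOf[CalendarInterval])
  } else {
    None // dynamic gaps have to be filtered per-row at runtime instead
  }
```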

@hvanhovell (author):

Merging.

asfgit closed this in 80223bb Jul 28, 2024
ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
fusheng-rd pushed a commit to fusheng-rd/spark that referenced this pull request Aug 6, 2024