
Supports casting between ANSI interval types and integral types #5353

Merged

Conversation

res-life
Collaborator

@res-life res-life commented Apr 28, 2022

Contributes to #5113
Closes #5111
Supports the following casts (see the example after this list):

  • Cast year-month interval type to (byte | short | int | long) types
  • Cast (byte | short | int | long) types to year-month interval type
  • Cast day-time interval type to (byte | short | int | long) types
  • Cast (byte | short | int | long) types to day-time interval type
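
For example, the newly supported casts can be exercised like this (an illustrative PySpark sketch, not taken verbatim from this PR):

spark.sql("SELECT CAST(INTERVAL '2' YEAR AS INT)").show()   # year-month interval -> int
spark.sql("SELECT CAST(24 AS INTERVAL MONTH)").show()       # int -> year-month interval
spark.sql("SELECT CAST(INTERVAL '3' DAY AS LONG)").show()   # day-time interval -> long
spark.sql("SELECT CAST(3 AS INTERVAL DAY)").show()          # int -> day-time interval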

Signed-off-by: Chong Gao res_life@163.com

Chong Gao added 2 commits April 28, 2022 17:10
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life res-life force-pushed the cast-between-interval-and-integer branch from f570c04 to 98a9e29 on April 28, 2022 10:37
@res-life
Collaborator Author

res-life commented Apr 28, 2022

Depends on #5352; that PR has been cherry-picked into this branch.

The corresponding Spark change is apache/spark@9553ed7072.

The CPU throws an exception for the following scenario because startField != endField:

spark.sql("select cast(1 as interval year to month)").show()
===== CPU output: =====
pyspark.sql.utils.AnalysisException: cannot resolve 'CAST(1 AS INTERVAL YEAR TO MONTH)' due to data type mismatch: cannot cast int to interval year to month; line 1 pos 7;
'Project [unresolvedalias(cast(1 as interval year to month), None)]
+- OneRowRelation

// cast(1 as interval year) is OK.
// cast(1 as interval year to month) is not OK.
// This is the Spark code that checks `startField == endField`
// (the analogous case exists for YearMonthIntervalType):
case (_: IntegralType, DayTimeIntervalType(s, e)) if s == e => true

The GPU does not need to repeat this check; the CPU has already performed it during the analysis phase.

@res-life
Collaborator Author

build

@@ -561,6 +561,39 @@ object GpuCast extends Arm {
      GpuIntervalUtils.castStringToDayTimeIntervalWithThrow(
        input.asInstanceOf[ColumnVector], dayTime)

    // cast(`day time interval` as integral)
    case (dt: DataType, _: LongType) if GpuTypeShims.isSupportedDayTimeType(dt) =>
      GpuIntervalUtils.dayTimeIntervalToLong(input.asInstanceOf[ColumnVector], dt)
Collaborator

But `input` is not guaranteed to be a ColumnVector for nested types, like casting an array of DayTimeInterval to an array of longs. These need to be ColumnView, but you should be able to treat it exactly the same as a ColumnVector.
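
For illustration, the nested case being described can be hit with a query like the following (a hedged PySpark sketch, not from the PR itself): casting an array of day-time intervals to an array of longs means the cast runs on the array's child, which is a ColumnView rather than a top-level ColumnVector on the GPU.

# Hypothetical repro of the nested-type path:
spark.sql("SELECT CAST(array(INTERVAL '1' DAY, INTERVAL '2' DAY) AS ARRAY<BIGINT>) AS days").show()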

Collaborator Author

Done

@pytest.mark.parametrize('integral_type', integral_types)
def test_cast_day_time_interval_to_integral_no_overflow(integral_type):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, DayTimeIntervalGen(start_field='day', end_field='day',
                min_value=timedelta(seconds=-128 * 86400), max_value=timedelta(seconds=127 * 86400)))
            # assumed continuation; the diff excerpt was truncated at this point
            .selectExpr('cast(a as {})'.format(integral_type)))
Collaborator

nit: If we want to have these all as separate queries, then let's have a separate test for each one so we can parallelize the execution and not have one test failure keep another test from running. If we are okay with them being a single test where one failure can mask another, like it is here, then can we combine them into a single query so it can run faster?

Collaborator Author

Done, combined into a single query to run faster.
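
For reference, the combined version might look like the following (a hedged sketch, not the exact code merged in this PR; it assumes the helpers used above, unary_op_df, DayTimeIntervalGen, and assert_gpu_and_cpu_are_equal_collect, plus timedelta from the test module's imports):

def test_cast_day_time_interval_to_integrals_no_overflow():
    # One generator, one query casting to every integral type at once,
    # so a single collect covers byte/short/int/long.
    gen = DayTimeIntervalGen(start_field='day', end_field='day',
                             min_value=timedelta(seconds=-128 * 86400),
                             max_value=timedelta(seconds=127 * 86400))
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, gen).selectExpr(
            'cast(a as byte)', 'cast(a as short)', 'cast(a as int)', 'cast(a as long)'))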

@sameerz sameerz added the 'feature request' label Apr 28, 2022
@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented May 11, 2022

I think there is no need to update AnsiCast; please double-check, @revans2.
Spark does not allow inserting an interval column into an integral column (or vice versa):

spark.sql("create table t2(c1 interval year, c2 long) using parquet")
spark.sql("insert into t2 values (interval 1 year, 9223372036854775807)")
spark.sql("insert into t2 select c2, c1 from t2")
pyspark.sql.utils.AnalysisException: Cannot write incompatible data to table '`default`.`t2`':
- Cannot safely cast 'c1': bigint to interval year
- Cannot safely cast 'c2': interval year to bigint

@res-life
Collaborator Author

@revans2 Please help review, thanks.

@res-life res-life merged commit 78750ef into NVIDIA:branch-22.06 May 18, 2022
@res-life res-life deleted the cast-between-interval-and-integer branch May 18, 2022 01:28
@sameerz sameerz added this to the May 2 - May 20 milestone May 19, 2022
Successfully merging this pull request may close these issues.

[FEA] ANSI mode: CAST between ANSI intervals and IntegralType