[BUG] Support years with up to 7 digits when casting from String to Date in Spark 3.2 #3382

Closed
andygrove opened this issue Sep 3, 2021 · 6 comments · Fixed by #3439 or #3531
Labels
bug (Something isn't working), P0 (Must have for release), Spark 3.2+

Contributor

andygrove commented Sep 3, 2021

Describe the bug

We have test failures in CastOpSuite now that Spark 3.2 supports years with up to 7 digits when casting from string to date.

- Cast from string to date using random inputs *** FAILED ***
  Mismatch casting string [\n90713] to DateType. CPU: 713-01-01; GPU: null (CastOpSuite.scala:234)
- Cast from string to date using random inputs with valid year prefix *** FAILED ***
  Mismatch casting string [2021838] to DateType. CPU: 838-01-01; GPU: null (CastOpSuite.scala:234)

Steps/Code to reproduce bug
Run CastOpSuite with Spark 3.2

Expected behavior
Tests should pass.

Environment details (please complete the following information)
N/A

Additional context
Spark commit: apache/spark@c9813f7

See issue #3406 for supporting other changes in SPARK-35780.

@andygrove andygrove added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and Spark 3.2+ labels Sep 3, 2021
@Salonijain27 Salonijain27 removed the ? - Needs Triage (Need team to review and classify) label Sep 7, 2021
Contributor Author

andygrove commented Sep 7, 2021

The change here is that Spark 3.0 and 3.1 only support 4-digit years when casting from string to date, whereas Spark 3.2 supports years of between 4 and 7 digits. See apache/spark@c9813f7 for more information.

The test output is confusing because Date.toString does not show years with more than four digits correctly.
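To make the rendering problem concrete, here is a minimal sketch, assuming the CPU value in the failing test is a java.sql.Date (the thread only says "Date.toString"). It builds the same wide-year date two ways and prints both. The object and value names are illustrative only. On the Java 8 runtime used in this thread, java.sql.Date.toString formats the year into four character positions, which is consistent with 90713 appearing as 713 in the failure message; other JDK versions may render it differently, so no exact output is claimed for that line.

```scala
import java.sql.{Date => SqlDate}
import java.time.LocalDate

object DateRendering {
  // Year 90713 comes from the first failing test input in this issue.
  val local: LocalDate = LocalDate.of(90713, 1, 1)

  // Same calendar date held in the legacy java.sql.Date type.
  val legacy: SqlDate = SqlDate.valueOf(local)

  def main(args: Array[String]): Unit = {
    // LocalDate prints years above 9999 with a leading '+'.
    println(s"java.time.LocalDate: $local")
    // java.sql.Date has only four year positions in its string form.
    println(s"java.sql.Date:       $legacy")
    // The stored value is intact; only the string form loses digits.
    println(s"round trip:          ${legacy.toLocalDate}")
  }
}
```

The round trip through toLocalDate suggests that comparing the underlying values, rather than their toString output, gives a clearer test message.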

[Image: javadate]

Spark 3.0

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.3
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select cast('202' as date)").show
+-----------------+
|CAST(202 AS DATE)|
+-----------------+
|             null|
+-----------------+


scala> spark.sql("select cast('2021' as date)").show
+------------------+
|CAST(2021 AS DATE)|
+------------------+
|        2021-01-01|
+------------------+


scala> spark.sql("select cast('20212' as date)").show
+-------------------+
|CAST(20212 AS DATE)|
+-------------------+
|               null|
+-------------------+

Spark 3.2

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1-SNAPSHOT
      /_/
         
Using Scala version 2.12.14 (OpenJDK 64-Bit Server VM, Java 1.8.0_292)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select cast('202' as date)").show
+-----------------+
|CAST(202 AS DATE)|
+-----------------+
|             null|
+-----------------+


scala> spark.sql("select cast('2021' as date)").show
+------------------+
|CAST(2021 AS DATE)|
+------------------+
|        2021-01-01|
+------------------+


scala> spark.sql("select cast('20212' as date)").show
+-------------------+
|CAST(20212 AS DATE)|
+-------------------+
|       +20212-01-01|
+-------------------+
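The two sessions above differ only in how many year digits are accepted. As a rough model of that observed behavior (not Spark's actual parser), the sketch below handles year-only strings and takes the maximum digit count as a parameter. YearOnlyCast and maxYearDigits are hypothetical names; whitespace trimming and month/day segments are deliberately ignored.

```scala
import java.time.LocalDate

object YearOnlyCast {
  // maxYearDigits is a hypothetical knob:
  // 4 models the Spark 3.0/3.1 sessions, 7 models the Spark 3.2 session.
  def cast(s: String, maxYearDigits: Int): Option[LocalDate] = {
    val allDigits = s.nonEmpty && s.forall(c => c >= '0' && c <= '9')
    val lengthOk = s.length >= 4 && s.length <= maxYearDigits
    // A bare year is interpreted as January 1st of that year.
    if (allDigits && lengthOk) Some(LocalDate.of(s.toInt, 1, 1)) else None
  }

  def main(args: Array[String]): Unit = {
    for (input <- Seq("202", "2021", "20212")) {
      println(s"$input -> 4 digits max: ${cast(input, 4)}, 7 digits max: ${cast(input, 7)}")
    }
  }
}
```

With a limit of 4 the input '20212' is rejected, and with a limit of 7 it yields a date that LocalDate renders as +20212-01-01, matching the session output.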

@andygrove andygrove changed the title from "[BUG] Cast string to date not compatible with Spark 3.2" to "[BUG] Support years with up to 7 digits when casting from String to Date in Spark 3.2" Sep 7, 2021
@sperlingxx sperlingxx self-assigned this Sep 8, 2021
Collaborator

sperlingxx commented Sep 10, 2021

Hi @andygrove, I checked the timestamp implementation in cuDF. For now, it only supports parsing strings with fixed-length specifiers; for the specifier "Y", the length is 4. Supporting variable-length specifiers in cuDF does not look like a small amount of work. Shall I file an issue in the cuDF repo?

Collaborator

sperlingxx commented Sep 10, 2021

Alternatively, perhaps we can manually extract the year part and narrow the year values down to fit the 4-digit range. After casting from string to timestamp, we then add back the excess years through the cuDF API ColumnView.addCalendricalMonths.
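The idea can be sketched on the CPU with java.time standing in for cuDF. This is an illustration under stated assumptions, not the plugin's implementation: WideYearParse is a hypothetical helper, only the YYYY-MM-DD shape is handled, and plusMonths plays the role of addCalendricalMonths. Reducing the year by a multiple of 400 is a choice made for this sketch, because the Gregorian calendar repeats every 400 years, so leap days stay valid after narrowing.

```scala
import java.time.LocalDate
import scala.util.Try

object WideYearParse {
  // Accepts a 4 to 7 digit year followed by two-digit month and day.
  private val Pattern = """(\d{4,7})-(\d{2})-(\d{2})""".r

  def parse(s: String): Option[LocalDate] = s match {
    case Pattern(y, m, d) =>
      val year = y.toInt
      // Narrow the year into a 4-digit range (2000..2399) with the same
      // leap-year behavior, since the calendar repeats every 400 years.
      val baseYear = 2000 + year % 400
      val excessYears = year - baseYear // always a multiple of 400
      // Parse with a fixed-width, 4-digit-year parser, then add the
      // removed years back as calendar months.
      Try(LocalDate.parse(f"$baseYear%04d-$m-$d"))
        .toOption
        .map(_.plusMonths(excessYears * 12L))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    println(parse("20212-02-29"))
    println(parse("2021-02-30")) // invalid day, rejected by the parser
  }
}
```

Narrowing by an arbitrary offset instead of a multiple of 400 could turn a valid February 29 into an invalid date before the years are added back, which is why the sketch keeps the leap cycle aligned.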

Collaborator

revans2 commented Sep 10, 2021

I think this is another one of those places where we are going to need to do something custom to actually fix it. cuDF has been very adamant in the past that the date formats follow the C standard library formats and, as was stated before, variable length is not something that they are willing to support, both because of the possibility of ambiguity in the formats and because of being standards based. If variable width is supported and we have the pattern %Y%m%d, how do you parse out 1111121? Is it 111-11-21, 1111-12-1, or is it 11111-2-1?
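The ambiguity can be checked by brute force. The sketch below is an illustration only: AmbiguousSplits is a hypothetical helper, and it assumes month and day may each be one or two digits with the rest of the string taken as the year. It enumerates every split of a digit string that forms a valid calendar date. For 1111121 it finds the three readings listed above plus a fourth, 1111-1-21.

```scala
import java.time.LocalDate
import scala.util.Try

object AmbiguousSplits {
  // Every way to read `digits` as year+month+day with no separators,
  // where month and day are each 1 or 2 digits wide.
  def interpretations(digits: String): Seq[LocalDate] =
    for {
      monthLen <- Seq(1, 2)
      dayLen <- Seq(1, 2)
      yearLen = digits.length - monthLen - dayLen
      if yearLen >= 1
      year = digits.substring(0, yearLen)
      month = digits.substring(yearLen, yearLen + monthLen)
      day = digits.substring(yearLen + monthLen)
      // LocalDate.of rejects impossible months and days; Try drops them.
      date <- Try(LocalDate.of(year.toInt, month.toInt, day.toInt)).toOption.toSeq
    } yield date

  def main(args: Array[String]): Unit = {
    interpretations("1111121").foreach(println)
  }
}
```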

Perhaps in the short term we add in a config so users can opt in to the smaller date cast functionality in 3.2.

@abellina abellina reopened this Sep 17, 2021
@abellina
Collaborator

I still see these failures, even after @sperlingxx's PR went in.

Collaborator

revans2 commented Sep 17, 2021

I have a "fix" for this that has us fall back to the CPU in cases where we cannot support it, but gives the user a config to override it. I am working on updating the tests now. Not sure if I will get done before the end of day or not (meetings).
