
[FEA] custom kernel for date/timestamp formatting/parsing #10032
Open · revans2 opened this issue Dec 12, 2023 · 5 comments
Labels: feature request (New feature or request)

Comments

revans2 (Collaborator) commented Dec 12, 2023

Is your feature request related to a problem? Please describe.
Spark uses Java for date/timestamp parsing and formatting. We have been using a cuDF kernel that accepts formats compatible with Python/C++, but the Java format patterns are very different, so we have to map between them. Some Java patterns are unambiguous on their own but become ambiguous once mapped into the formats cuDF supports. We really should write our own kernel that does what Spark/Java does directly.
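For illustration, here is the kind of mismatch involved; the mapping shown is a hand-picked example, not the plugin's actual translation table:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class FormatMismatch {
    public static void main(String[] args) {
        // Java pattern "yyyy-MM-dd" roughly maps to cuDF/strptime "%Y-%m-%d",
        // but patterns like a bare "y" (one-or-more year digits) have no
        // exact strptime equivalent, so the mapped format can accept or
        // reject strings that Java would not.
        DateTimeFormatter javaFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        System.out.println(LocalDate.parse("2023-12-12", javaFmt)); // 2023-12-12
    }
}
```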

res-life (Collaborator) commented Dec 26, 2023

One related issue: #10083

For the date 10000-01-01 with format yyyy-MM-dd, the Java API gets +10000-01-01, while cuDF gets 0000-01-01.
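For reference, Java's DateTimeFormatter emits an explicit sign once the year no longer fits the four-digit pattern, which is where the leading + comes from; a minimal repro:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class YearOverflow {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        // Years wider than the pattern get a leading '+' in Java:
        System.out.println(LocalDate.of(10000, 1, 1).format(fmt)); // +10000-01-01
        System.out.println(LocalDate.of(2023, 1, 1).format(fmt));  // 2023-01-01
    }
}
```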

andygrove (Contributor) commented Jan 8, 2024

Some notes on parsing dates from JSON, based on #9975

Depending on the Spark version, there are different code paths based on whether a dateFormat is specified. Some of the differences we need to handle are:

  • We sometimes need to support single-digit months and days, and sometimes we require two digits (see the sketch after this list)
  • We sometimes need to trim all leading and trailing whitespace, sometimes we only trim specific whitespace chars, and sometimes we don't trim at all
  • Sometimes we perform a cast instead of a parse, and in that case we support the special values "epoch", "now", "today", "yesterday", and "tomorrow" (definitely an edge case, because it doesn't make much sense to store relative terms like these in a JSON file)
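As a sketch of the first bullet (illustrative only, not the plugin's code), Java expresses the one- vs. two-digit distinction through the pattern width:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DigitWidth {
    public static void main(String[] args) {
        // "M"/"d" accept one or two digits; "MM"/"dd" require exactly two.
        DateTimeFormatter lenient = DateTimeFormatter.ofPattern("yyyy-M-d");
        DateTimeFormatter strict  = DateTimeFormatter.ofPattern("yyyy-MM-dd");

        System.out.println(LocalDate.parse("2023-1-5", lenient)); // 2023-01-05
        try {
            LocalDate.parse("2023-1-5", strict);
        } catch (DateTimeParseException e) {
            System.out.println("two-digit pattern rejects 2023-1-5");
        }
    }
}
```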

res-life (Collaborator) commented:

> Sometimes we perform a cast instead of a parse, and in that case we support the special values "epoch", "now", "today", "yesterday", and "tomorrow" (definitely an edge case, because it doesn't make much sense to store relative terms like these in a JSON file)

When casting string to timestamp, only Spark 3.1.x supports the special values you mentioned; Spark 3.2.0 and later do not support them.

revans2 (Collaborator, Author) commented Jan 19, 2024

> only Spark 3.1.x supports the special values you mentioned

Are we just not going to support the special values in Spark 3.1 and document it, or are we going to do special post-processing to fix them up?

res-life (Collaborator) commented:

> Are we just not going to support the special values in Spark 3.1 and document it, or are we going to do special post-processing to fix them up?

I suggest doing special post-processing:
For "now" and "epoch", the values are not time zone aware.
For "today"/"tomorrow"/"yesterday", the values are time zone aware. Generate them in Java in the default time zone, then replace each matched string with the corresponding value.
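A minimal sketch of that post-processing for the DATE case (class and method names are hypothetical; for timestamps, "epoch" and "now" would instead resolve to zone-independent fixed/current instants):

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.HashMap;
import java.util.Map;

public class SpecialValueFixup {
    // Hypothetical helper: compute the replacement for each special string.
    // The zone only matters for the day-based values.
    static Map<String, LocalDate> specialDates(ZoneId zone) {
        Map<String, LocalDate> m = new HashMap<>();
        LocalDate today = LocalDate.now(zone);
        m.put("epoch", LocalDate.ofEpochDay(0)); // 1970-01-01, zone-independent
        m.put("today", today);
        m.put("now", today); // for the DATE type, "now" is also today's date
        m.put("yesterday", today.minusDays(1));
        m.put("tomorrow", today.plusDays(1));
        return m;
    }

    public static void main(String[] args) {
        specialDates(ZoneId.systemDefault())
            .forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```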
