[FEA] Take into account org.apache.spark.timeZone in Parquet/Avro from Spark 3.2 #9632

Closed
ttnghia opened this issue Nov 3, 2023 · 6 comments
Labels: feature request (New feature or request)

Comments

@ttnghia
Collaborator

ttnghia commented Nov 3, 2023

Starting with Spark 3.2, a new metadata key, org.apache.spark.timeZone, is written into the output Parquet/Avro file. That metadata is used to check and rebase datetime values.

We need to check that metadata while rebasing datetime values. In particular, we need to throw an exception if the file was written in a timezone other than UTC.

Ref: apache/spark#34973
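
To make the requested check concrete, here is a minimal sketch in Python, not the plugin's actual implementation. It assumes the file's key-value metadata has already been read into a dictionary. The function name `check_file_timezone`, the error class `UnsupportedTimeZoneError`, and the set of zone IDs treated as UTC are all hypothetical; only the metadata key comes from this issue. Treating a missing key as "nothing to check" is also an assumption.

```python
from typing import Mapping, Optional

# Metadata key named in this issue.
SPARK_TIMEZONE_KEY = "org.apache.spark.timeZone"

# Assumption: zone IDs this sketch treats as equivalent to UTC.
UTC_ZONE_IDS = frozenset({"UTC", "Etc/UTC", "Z"})


class UnsupportedTimeZoneError(ValueError):
    """Hypothetical error for files written in a non-UTC timezone."""


def check_file_timezone(metadata: Mapping[str, str]) -> Optional[str]:
    """Return the recorded zone ID, or None if the key is absent.

    Raises UnsupportedTimeZoneError when the file records a zone
    that is not in UTC_ZONE_IDS.
    """
    zone_id = metadata.get(SPARK_TIMEZONE_KEY)
    if zone_id is None:
        # Assumption: a file without the key is not flagged here.
        return None
    if zone_id not in UTC_ZONE_IDS:
        raise UnsupportedTimeZoneError(
            f"File was written with {SPARK_TIMEZONE_KEY}={zone_id!r}; "
            "only UTC is supported when rebasing datetime values"
        )
    return zone_id
```

With this sketch, a metadata dictionary recording "UTC" passes and returns the zone ID, while one recording "America/Los_Angeles" raises the error at read time.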

@ttnghia added the "feature request" and "? - Needs Triage" labels on Nov 3, 2023
@mattahrens
Collaborator

Scope for this issue: throw an exception at runtime if we detect a non-UTC timezone.

Also, make sure to file a separate issue to track fixing this once we have timezone support fully enabled.

@mattahrens removed the "? - Needs Triage" label on Nov 7, 2023
@ttnghia
Collaborator Author

ttnghia commented Nov 9, 2023

Reading Parquet now checks for the timezone. We only need additional work for Avro.

@winningsix
Collaborator

> Reading Parquet now checks for the timezone. We only need additional work for Avro.

I tested Avro in a round-trip integration test. Based on that test, our GPU Avro scan does not support timestamp or date types yet. So we should probably narrow the scope to Parquet only and file another ticket to track the Avro part. We can revisit it once there is customer need for Avro; it may require timestamp and date support in the Avro scan as a prerequisite.

23/11/10 01:31:08 WARN GpuOverrides: 
  !Exec <BatchScanExec> cannot run on GPU because not all scans can be replaced
    !Input <AvroScan> cannot run on GPU because unsupported data types TimestampType [_c0], DateType [_c1] in read for Avro
    @Expression <AttributeReference> _c0#396 could run on GPU
    @Expression <AttributeReference> _c1#397 could run on GPU

@ttnghia
Collaborator Author

ttnghia commented Nov 10, 2023

It seems that we don't support date/time in Avro yet. So I'm fine with closing this and we can reopen or file a new issue when it is needed.

@ttnghia ttnghia closed this as completed Nov 10, 2023
@winningsix
Collaborator

@ttnghia Do we have a PR link for this JIRA?

@ttnghia
Collaborator Author

ttnghia commented Nov 17, 2023

Yes, the config has been checked since this PR: #9631.
