
[QST] Recommended approach for a reference avro writer/reader #5381

Answered by revans2
galipremsagar asked this question in General

Pandas nullable dtypes written to a parquet file -> read parquet in pyspark -> write to orc in pyspark

I think you mean write in avro instead of orc, but I get the general idea.
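Editor's note: a minimal sketch of that pipeline (with avro as the final format) might look like the following. It assumes pandas with pyarrow installed, and pyspark launched with the spark-avro package on the classpath (e.g. --packages org.apache.spark:spark-avro_2.12:<spark version>); file paths and names are placeholders.

```python
import pandas as pd
from pyspark.sql import SparkSession

# Pandas nullable dtypes written to a parquet file (uses the pyarrow engine).
pdf = pd.DataFrame({
    "a": pd.array([1, None, 3], dtype="Int64"),
    "b": pd.array([True, None, False], dtype="boolean"),
})
pdf.to_parquet("ref_input.parquet")

# Read the parquet file in pyspark and write it back out as avro.
spark = SparkSession.builder.appName("reference-avro-writer").getOrCreate()
spark.read.parquet("ref_input.parquet") \
    .write.format("avro").mode("overwrite").save("ref_output_avro")
```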

My main concern with this is around date and time types. Parquet has had issues with date/time types related to Gregorian vs. Julian calendars. In fact, Spark 3.0+ has two separate modes that are controlled by a combination of Spark-specific metadata in the file and Spark configs when reading/writing the data. The same issue/feature exists in the Spark Avro code too. So if you want to test timestamps before 1900, you are likely going to run into these issues.
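Editor's note: concretely, in recent Spark releases (3.1+) this rebase behavior is controlled per format and per direction by configs like the ones below; Spark 3.0 used older spark.sql.legacy.* names for some of these, so treat this as a sketch rather than a version-exact recipe. CORRECTED keeps values in the proleptic Gregorian calendar as-is, LEGACY rebases to the hybrid Julian/Gregorian calendar, and EXCEPTION raises an error on ambiguous old dates/timestamps.

```python
from pyspark.sql import SparkSession

# Pin the calendar rebase behavior so old timestamps round-trip deterministically
# between parquet and avro in the reference pipeline.
spark = (SparkSession.builder
         .appName("rebase-mode-demo")
         .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
         .config("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
         .config("spark.sql.avro.datetimeRebaseModeInRead", "CORRECTED")
         .config("spark.sql.avro.datetimeRebaseModeInWrite", "CORRECTED")
         .getOrCreate())
```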

You also need to be aware that Spark only supports …

Replies: 6 comments

Answer selected by sameerz
Labels: question (Further information is requested)
3 participants

This discussion was converted from issue #927 on April 28, 2022 23:08.