[SPARK-40876][SQL] Widening type promotions in Parquet readers
### What changes were proposed in this pull request?
This change adds the following conversions to the vectorized and non-vectorized Parquet readers, corresponding to type promotions that are strictly widening without precision loss:
- Int -> Long
- Float -> Double
- Int -> Double
- Date -> TimestampNTZ (timestamp without time zone only, since a date has no time zone information)
- Decimal with higher precision (already supported in the non-vectorized reader)

### Why are the changes needed?
These type promotions support two similar use cases:
1. Reading a set of Parquet files with different types, e.g. a mix of Int and Long for a given column. If the read schema is Long, the reader should be able to read all files and promote Ints to Longs instead of failing.
2. Widening the type of a column in a table that already contains Parquet files, e.g. an Int column isn't large enough to accommodate IDs and is changed to Long. Existing Parquet files storing the value as Int should still be read correctly by upcasting values.

The second use case in particular will enable widening the type of columns or fields in existing Delta tables.

### Does this PR introduce _any_ user-facing change?
The following fails before this change:
```
Seq(0).toDF("a").write.parquet(path)
spark.read.schema("a LONG").parquet(path).where(s"a < ${Long.MaxValue}").collect()
```
With the Int -> Long promotion in both the vectorized and non-vectorized Parquet readers, it succeeds and produces correct results, without overflow or loss of precision. The same is true for Float -> Double, Int -> Double, Decimal with higher precision, and Date -> TimestampNTZ.

### How was this patch tested?
- Added `ParquetTypeWideningSuite` covering the promotions included in this change, in particular:
  - Non-dictionary encoded values / dictionary encoded values for each promotion
  - Timestamp rebase modes `LEGACY` and `CORRECTED` for Date -> TimestampNTZ
  - Promotions between decimal types with different physical storage: `INT32`, `INT64`, `BINARY`, `FIXED_LEN_BYTE_ARRAY`
  - Reading values written with Parquet V1 and V2 writers
- Updated/removed two tests that expected type promotion to fail.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44368 from johanl-db/SPARK-40876-parquet-type-promotion.

Authored-by: Johan Lasperas <johan.lasperas@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
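
The set of promotions described in this commit message can be summarized as a small lookup. The following is only an illustrative sketch in plain Scala, not the reader code from this change: the type names and the `Widening.isWidening` helper are hypothetical and do not correspond to Spark classes, and the decimal rule assumes the scale stays the same while the precision grows, which the message above does not spell out.

```scala
// Hypothetical model of the widening promotions listed in the commit message.
// None of these names are part of Spark's API; they exist only for illustration.
sealed trait ColumnType
case object IntType extends ColumnType
case object LongType extends ColumnType
case object FloatType extends ColumnType
case object DoubleType extends ColumnType
case object DateType extends ColumnType
case object TimestampNTZType extends ColumnType
final case class DecimalType(precision: Int, scale: Int) extends ColumnType

object Widening {
  // Returns true when a value stored as `from` can be read as `to`
  // under the promotions listed above (identical types trivially qualify).
  def isWidening(from: ColumnType, to: ColumnType): Boolean = (from, to) match {
    case (a, b) if a == b             => true
    case (IntType, LongType)          => true
    case (FloatType, DoubleType)      => true
    case (IntType, DoubleType)        => true
    case (DateType, TimestampNTZType) => true
    // Assumption: same scale, equal or higher precision.
    case (DecimalType(p1, s1), DecimalType(p2, s2)) => s1 == s2 && p2 >= p1
    case _                            => false
  }
}
```

In this sketch `Widening.isWidening(IntType, LongType)` holds while `Widening.isWidening(LongType, IntType)` does not. Long -> Double is also rejected because it is not in the list above; this is consistent with the "no precision loss" requirement, since a double's 53-bit significand cannot represent every 64-bit integer exactly, whereas every 32-bit Int fits.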
Showing 6 changed files with 596 additions and 36 deletions.