[SPARK-40876][SQL] Widening type promotion for decimals with larger scale in Parquet readers #44513

Conversation

johanl-db (Contributor):

What changes were proposed in this pull request?

This is a follow-up from #44368, implementing an additional type promotion to decimals with larger precision and scale.
As long as the precision increases by at least as much as the scale, the decimal values can be promoted without loss of precision, e.g. Decimal(6, 2) -> Decimal(8, 4): 1234.56 -> 1234.5600.

The non-vectorized reader (parquet-mr) can already do this type promotion; this PR implements it for the vectorized reader.
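
To make the rule concrete, here is a minimal sketch of the rescaling involved (illustrative names, not the reader's actual code):

```scala
// Minimal sketch, assuming decimals are carried as unscaled integers (as in
// Parquet's INT32/INT64 decimal encodings): widening Decimal(6, 2) ->
// Decimal(8, 4) multiplies the unscaled value by 10^(newScale - oldScale),
// which is lossless as long as precision grows by at least as much as scale.
def pow10(n: Int): Long = (0 until n).foldLeft(1L)((acc, _) => acc * 10)

def widenUnscaled(unscaled: Long, oldScale: Int, newScale: Int): Long = {
  require(newScale >= oldScale, "cannot shrink the scale losslessly")
  unscaled * pow10(newScale - oldScale)
}

// 1234.56 is unscaled 123456 at scale 2; at scale 4 it becomes 12345600,
// i.e. 1234.5600.
assert(widenUnscaled(123456L, 2, 4) == 12345600L)
```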

Why are the changes needed?

This allows reading multiple Parquet files that contain decimals with different precisions and scales.

Does this PR introduce any user-facing change?

Yes, the following now succeeds when using the vectorized Parquet reader:

```scala
Seq(20).toDF("a").select($"a".cast(DecimalType(4, 2)).as("a")).write.parquet(path)
spark.read.schema("a decimal(6, 4)").parquet(path).collect()
```

It failed before with the vectorized reader and succeeded with the non-vectorized reader.
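
For instance, the motivating scenario of mixed-scale files now works end to end (a hedged sketch; `path` and the values are illustrative):

```scala
import org.apache.spark.sql.types._
import spark.implicits._  // assumes a spark-shell / SparkSession in scope

val path = "/tmp/decimal-widening-demo"  // illustrative location
Seq("1.5").toDF("a").select($"a".cast(DecimalType(4, 2)).as("a"))
  .write.parquet(s"$path/f1")
Seq("2.25").toDF("a").select($"a".cast(DecimalType(6, 4)).as("a"))
  .write.parquet(s"$path/f2")
// Both files are read back under the single wider schema decimal(6, 4).
spark.read.schema("a decimal(6, 4)").parquet(s"$path/f1", s"$path/f2").collect()
```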

How was this patch tested?

  • Tests added to ParquetTypeWideningSuite to cover decimal promotion between decimals with different physical types: INT32, INT64 and FIXED_LEN_BYTE_ARRAY.

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label on Dec 27, 2023
Comment on lines +156 to +158:

```java
boolean needsUpcast = sparkType == LongType || sparkType == DoubleType ||
  (isDate && sparkType == TimestampNTZType) ||
  (isDecimal && !DecimalType.is32BitDecimalType(sparkType));
```
johanl-db (Contributor, author):

This fixes an issue from #44368: we were incorrectly disabling lazy dictionary decoding for all non-decimal INT32 types.
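
To restate the fixed condition, a hedged Scala paraphrase of the Java snippet above, with the internal `is32BitDecimalType` helper inlined as a precision check:

```scala
import org.apache.spark.sql.types._

// A decimal fits Parquet's INT32 encoding when its precision is at most 9.
def is32BitDecimal(dt: DataType): Boolean = dt match {
  case d: DecimalType => d.precision <= 9
  case _ => false
}

// Upcasting (and hence disabling lazy dictionary decoding) is needed only
// when the INT32 column is read as a wider Spark type; plain INT32 reads
// keep lazy decoding.
def needsUpcast(sparkType: DataType, isDate: Boolean, isDecimal: Boolean): Boolean =
  sparkType == LongType || sparkType == DoubleType ||
  (isDate && sparkType == TimestampNTZType) ||
  (isDecimal && !is32BitDecimal(sparkType))
```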

johanl-db force-pushed the SPARK-40876-parquet-type-promotion-decimal-scale branch from be33b03 to 2966027 on December 27, 2023
LuciferYang (Contributor):

```
- SPARK-34212 Parquet should read decimals correctly *** FAILED *** (364 milliseconds)
[info]   Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (ParquetQuerySuite.scala:1055)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:766)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1564)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.$anonfun$new$213(ParquetQuerySuite.scala:1055)
[info]   at scala.collection.immutable.List.foreach(List.scala:333)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.$anonfun$new$211(ParquetQuerySuite.scala:1054)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf(SQLConfHelper.scala:56)
[info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf$(SQLConfHelper.scala:38)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetQuerySuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(ParquetQuerySuite.scala:47)
```

The failed test case seems to be related to this PR; could you check it? @johanl-db

johanl-db (Contributor, author):

> - SPARK-34212 Parquet should read decimals correctly *** FAILED *** (364 milliseconds)
> […]
> The failed test case seems to be related to this PR; could you check it? @johanl-db

I fixed it; the last CI check timed out on another, unrelated test though.

johanl-db (Contributor, author):

@LuciferYang or @cloud-fan since you approved the previous similar change #44368, could you take a look at this PR?

```java
@@ -1444,14 +1641,29 @@ private static boolean isDateTypeMatched(ColumnDescriptor descriptor) {
}

private static boolean isDecimalTypeMatched(ColumnDescriptor descriptor, DataType dt) {
```
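
For context, this is the lossless-widening rule such a check encodes (a hedged sketch, not the method's actual body):

```scala
// A file decimal (p1, s1) can be read as (p2, s2) without precision loss when
// neither the scale nor the number of integral digits shrinks.
def isDecimalCompatible(p1: Int, s1: Int, p2: Int, s2: Int): Boolean =
  s2 >= s1 && (p2 - s2) >= (p1 - s1)

assert(isDecimalCompatible(6, 2, 8, 4))   // Decimal(6, 2) -> Decimal(8, 4): OK
assert(!isDecimalCompatible(6, 2, 7, 4))  // 3 integral digits < 4: lossy
```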
Contributor:

I think the name should be `isDecimalTypeCompatible`?


```java
@Override
public void skipValues(int total, VectorizedValuesReader valuesReader) {
  valuesReader.skipIntegers(total);
```
Contributor:

Suggested change:

```suggestion
valuesReader.skipIntegers(total);
```

cloud-fan (Contributor):

thanks, merging to master!

cloud-fan closed this in d439e34 on Jan 8, 2024
dongjoon-hyun pushed a commit that referenced this pull request on Jan 9, 2024:
… when ANSI mode is on

### What changes were proposed in this pull request?

This PR is a followup of #44513 that excludes `Decimal(5, 4)` for the value `10.34`, which cannot be represented when ANSI mode is on.

### Why are the changes needed?

ANSI build is broken (https://github.com/apache/spark/actions/runs/7455394893/job/20284415710):

```
org.apache.spark.SparkArithmeticException: [NUMERIC_VALUE_OUT_OF_RANGE] 10.34 cannot be represented as Decimal(5, 4). If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error, and return NULL instead. SQLSTATE: 22003
== DataFrame ==
"cast" was called from
org.apache.spark.sql.execution.datasources.parquet.ParquetTypeWideningSuite.writeParquetFiles(ParquetTypeWideningSuite.scala:113)

	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotChangeDecimalPrecisionError(QueryExecutionErrors.scala:116)
	at org.apache.spark.sql.errors.QueryExecutionErrors.cannotChangeDecimalPrecisionError(QueryExecutionErrors.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
```
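
The overflow is expected: `Decimal(5, 4)` keeps 4 fractional digits and therefore only one integral digit, so `10.34` cannot fit. A hedged repro sketch (assumes a spark-shell session):

```scala
import org.apache.spark.sql.types._
import spark.implicits._

spark.conf.set("spark.sql.ansi.enabled", "true")
// Decimal(5, 4) leaves 5 - 4 = 1 integral digit; 10.34 needs two, so the
// ANSI cast throws NUMERIC_VALUE_OUT_OF_RANGE instead of returning null.
Seq("10.34").toDF("a").select($"a".cast(DecimalType(5, 4))).collect()
```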

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test cases should cover this.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44632 from HyukjinKwon/SPARK-40876-followup.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request on Jan 23, 2024:
…n Parquet vectorized reader

### What changes were proposed in this pull request?
This is a follow-up from #44368 and #44513, implementing an additional type promotion from integers to decimals in the Parquet vectorized reader, bringing it to parity with the non-vectorized reader in that regard.

### Why are the changes needed?
This allows reading Parquet files that have different schemas and mix decimals and integers - e.g. reading files containing either `Decimal(15, 2)` or `INT32` values as `Decimal(15, 2)` - as long as the requested decimal type is large enough to accommodate the integer values without precision loss. A hedged sketch of what "large enough" means for `INT32` follows (illustrative, not Spark's actual check).
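
```scala
// Int.MaxValue = 2147483647 has 10 digits, so any INT32 fits losslessly in a
// decimal that keeps at least 10 integral digits; an INT64 similarly needs 19.
def int32FitsLosslessly(precision: Int, scale: Int): Boolean =
  precision - scale >= 10

assert(int32FitsLosslessly(15, 2))   // Decimal(15, 2) can hold any INT32
assert(!int32FitsLosslessly(12, 4))  // only 8 integral digits: may overflow
```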

### Does this PR introduce _any_ user-facing change?
Yes, the following now succeeds when using the vectorized Parquet reader:
```
  Seq(20).toDF("a").select($"a".cast(IntegerType).as("a")).write.parquet(path)
  spark.read.schema("a decimal(12, 0)").parquet(path).collect()
```
It failed before with the vectorized reader and succeeded with the non-vectorized reader.

### How was this patch tested?
- Tests added to `ParquetTypeWideningSuite`
- Updated relevant `ParquetQuerySuite` test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44803 from johanl-db/SPARK-40876-widening-promotion-int-to-decimal.

Authored-by: Johan Lasperas <johan.lasperas@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>