
[BUG] Support for wider types in read schemas for Parquet Reads #11512

Open
mythrocks opened this issue Sep 26, 2024 · 0 comments
Labels: bug (Something isn't working), Spark 4.0+ (Spark 4.0+ issues)

Comments

@mythrocks (Collaborator) commented:

TL;DR:

When running the plugin with Spark 4+, if a Parquet file is being read with a read-schema that contains wider types than the Parquet file's schema, the read should not fail.

Details:

This is with reference to apache/spark#44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.

For instance, a Parquet file with an integer column `a` should be readable with a read-schema that defines `a` as type `Long`.

Prior to Spark 4, this would yield a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After apache/spark#44368, if the read-schema uses a wider, compatible type, the values are implicitly converted to the wider type during the read. An incompatible type continues to fail as before.
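
By way of illustration, here is a minimal PySpark sketch of the widened read (the column name `a` follows the example above; the output path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.getOrCreate()

# Write a Parquet file whose column `a` is a 32-bit integer.
spark.range(10).selectExpr("CAST(id AS INT) AS a") \
    .write.mode("overwrite").parquet("/tmp/widen_demo")

# Read it back with a wider read-schema that declares `a` as LONG.
# On Spark 4 (apache/spark#44368) the values are implicitly widened
# to 64 bits; on earlier versions this raises
# SchemaColumnConvertNotSupportedException at read time.
wider_schema = StructType([StructField("a", LongType())])
spark.read.schema(wider_schema).parquet("/tmp/widen_demo").show()
```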

spark-rapids' `parquet_test.py::test_parquet_check_schema_compatibility` integration test currently looks as follows:

```python
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    assert_gpu_and_cpu_error(
        lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
        conf={},
        error_message='Parquet column cannot be converted')
```
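
Once the plugin supports the widened read, the test will presumably need to be split: on Spark 4 it should assert that GPU and CPU results match, while on earlier versions it should keep asserting the error. A rough sketch, assuming the suite's existing `assert_gpu_and_cpu_are_equal_collect` helper and an illustrative `is_spark_400_or_later` version gate:

```python
# Sketch only: `is_spark_400_or_later` is an assumed version-gate helper;
# substitute whatever the test suite actually provides.
def test_parquet_check_schema_compatibility(spark_tmp_path):
    data_path = spark_tmp_path + '/PARQUET_DATA'
    gen_list = [('int', int_gen), ('long', long_gen), ('dec32', decimal_gen_32bit)]
    with_cpu_session(lambda spark: gen_df(spark, gen_list).coalesce(1).write.parquet(data_path))

    read_int_as_long = StructType(
        [StructField('long', LongType()), StructField('int', LongType())])
    if is_spark_400_or_later():
        # Spark 4 widens INT to LONG implicitly; the read should succeed
        # and produce identical results on CPU and GPU.
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: spark.read.schema(read_int_as_long).parquet(data_path))
    else:
        assert_gpu_and_cpu_error(
            lambda spark: spark.read.schema(read_int_as_long).parquet(data_path).collect(),
            conf={},
            error_message='Parquet column cannot be converted')
```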

Spark 4's change in behaviour causes this test to fail as follows:

```
        """
>       with pytest.raises(Exception) as excinfo:
E       Failed: DID NOT RAISE <class 'Exception'>

../../../../integration_tests/src/main/python/asserts.py:650: Failed
```
@mythrocks added labels: ? - Needs Triage, feature request, Spark 4.0+ (Sep 26, 2024)
mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Sep 27, 2024
Fixes NVIDIA#11015.
Contributes to NVIDIA#11004.

This commit addresses the tests that fail in parquet_test.py when run
on Spark 4.

1. Some of the tests were failing as a result of NVIDIA#5114. Those tests
   have been disabled, at least until we get around to supporting
   aggregations with ANSI mode enabled.

2. `test_parquet_check_schema_compatibility` fails on Spark 4 regardless
   of ANSI mode, because it tests implicit type promotions where the read
   schema includes wider columns than the write schema. This will require
   new code. The test is disabled until NVIDIA#11512 is addressed.

3. `test_parquet_int32_downcast` had an erroneous setup phase that failed
   in ANSI mode. This has been corrected, and the test was refactored to
   run in both ANSI and non-ANSI mode.

Signed-off-by: MithunR <mithunr@nvidia.com>
@mattahrens added the bug label and removed the ? - Needs Triage and feature request labels (Oct 1, 2024)
@mattahrens changed the title from "[FEA] Support for wider types in read schemas for Parquet Reads" to "[BUG] Support for wider types in read schemas for Parquet Reads" (Oct 1, 2024)