Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check schema compatibility when building parquet readers #5434

Merged
merged 13 commits into from
May 12, 2022

Conversation

sperlingxx
Copy link
Collaborator

@sperlingxx sperlingxx commented May 7, 2022

Fixes #5200 #5445

Currently, there exists no check of schema compatibility between file schema (parquet type) and read schema (spark type) when building GPU parquet readers. Thus, reading parquet data with incompatilble spark types is actually allowed on the GPU (such as: reading int as long), which leads to a lot of underfined behaviors.

The purpose of this PR is to add the schema check, referring the checking process of Spark. Converters downcasting INT32 to Byte/Short/Date are added in this PR as well.

Note: This PR uses some deprecated API of parquet-mr, in order to accommodate Spark 3.1.

Signed-off-by: sperlingxx lovedreamf@gmail.com

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx requested review from wbo4958 and jlowe May 7, 2022 08:00
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@wbo4958
Copy link
Collaborator

wbo4958 commented May 9, 2022

LGTM

@sameerz sameerz added the bug Something isn't working label May 10, 2022
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Copy link
Collaborator Author

build

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
// }
// TODO: Add below converters for INT32. Converters work when evolving schema over cuDF
// table read from Parquet file. https://github.com/NVIDIA/spark-rapids/issues/5445
if (dt == DataTypes.ByteType || dt == DataTypes.ShortType || dt == DataTypes.DateType) {
Copy link
Collaborator Author

@sperlingxx sperlingxx May 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added downcast converters for INT_32 in this PR to close #5445, since UT cases of parquet writing fails if these combinations are disabled.

@sperlingxx
Copy link
Collaborator Author

build

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Copy link
Collaborator Author

build

jlowe
jlowe previously approved these changes May 11, 2022
@sperlingxx
Copy link
Collaborator Author

build

1 similar comment
@sperlingxx
Copy link
Collaborator Author

build

Copy link
Collaborator

@firestarman firestarman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restore the approval

@sperlingxx sperlingxx merged commit e41a6f3 into NVIDIA:branch-22.06 May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] More detailed logs to show which parquet file and which data type has mismatch.
5 participants