Support reading decimal columns from parquet files #1294

Merged
merged 19 commits into NVIDIA:branch-0.4 from dec_pq_read
Jan 5, 2021

sperlingxx
Collaborator

@sperlingxx sperlingxx commented Dec 7, 2020

This pull request enables reading decimal columns from Parquet files by setting allowDecimal to true. It also adds test coverage for decimal reading.
Decimal reading has some limits, however: we can only read decimal columns whose storage type is INT32 or INT64. For now, cuDF doesn't support FIXED_LEN_BYTE_ARRAY. INT32 columns will be read as INT64 because spark-rapids only supports DECIMAL64.
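
To make these limits concrete, here is a minimal Python sketch of the storage-type rule described above. It is not the plugin's code: the names `target_gpu_type` and `to_decimal` are hypothetical, and `FIXED_LEN_BYTE_ARRAY` is the Parquet format's name for the fixed-length byte array type. The sketch assumes a decimal value is stored as an unscaled integer plus a scale, which is how Parquet annotates INT32 and INT64 columns as decimals.

```python
from decimal import Decimal

# Physical types this sketch treats as readable for decimal columns.
SUPPORTED_PHYSICAL_TYPES = {"INT32", "INT64"}


def target_gpu_type(physical_type: str) -> str:
    """Return the (hypothetical) decimal type name used for a Parquet physical type."""
    if physical_type in SUPPORTED_PHYSICAL_TYPES:
        # INT32-backed decimals are widened, since only a 64-bit decimal is available.
        return "DECIMAL64"
    raise ValueError(f"unsupported decimal storage type: {physical_type}")


def to_decimal(unscaled: int, scale: int) -> Decimal:
    """Interpret an unscaled integer as a decimal with `scale` fractional digits."""
    return Decimal(unscaled).scaleb(-scale)
```

For example, `to_decimal(12345, 2)` yields `Decimal('123.45')`, while `target_gpu_type("FIXED_LEN_BYTE_ARRAY")` raises a `ValueError` instead of returning a type.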

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

build

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment

@sameerz @jlowe @tgravescs I personally feel that this is getting to be way too late to put this into the 0.3 release. We really should be looking at bug fixes instead of new feature work, especially because we know that rapidsai/cudf#6909 is still an outstanding issue. But if this is something we need to get in, then at a minimum we need to throw an exception if we get back a data type we don't expect for a decimal column.
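
As a rough illustration of the check requested here, the sketch below compares the type that comes back for each decimal column against the type that was expected and raises if they differ. It is plain Python with hypothetical names (`ColumnType`, `verify_decimal_columns`), not the plugin's code or the cuDF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnType:
    """Hypothetical stand-in for a column type descriptor."""
    kind: str       # e.g. "DECIMAL64" or "INT64"
    scale: int = 0  # number of fractional digits; only meaningful for decimals


def verify_decimal_columns(expected: dict, actual: dict) -> None:
    """Raise TypeError if any expected decimal column came back as another type."""
    for name, want in expected.items():
        if want.kind != "DECIMAL64":
            continue  # only decimal columns are checked in this sketch
        got = actual.get(name)
        if got != want:
            raise TypeError(
                f"column {name!r}: expected {want}, but the reader returned {got}"
            )
```

Failing loudly at this point means a mismatched column is reported as an error rather than silently producing wrong results downstream.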

@jlowe
Member

jlowe commented Dec 8, 2020

I personally feel that this is getting to be way too late to put this into the 0.3 release.

Agree. I would like to see the legacy decimal encoding supported before this goes in, otherwise we're left in a situation where the plugin crashes a query that used to work without it.

@sperlingxx sperlingxx changed the title from "Support reading decimal columns from parquet files" to "[WIP] Support reading decimal columns from parquet files" on Dec 9, 2020
@sperlingxx
Collaborator Author

build

1 similar comment
@sperlingxx
Collaborator Author

build

@jlowe
Member

jlowe commented Dec 9, 2020

@sperlingxx per the above discussion, I think this PR should be retargeted to branch-0.4.

@sperlingxx
Collaborator Author

@sperlingxx per the above discussion, I think this PR should be retargeted to branch-0.4.

Yes! So, I labeled this pull request with WIP.

@sameerz sameerz added the feature request New feature or request label Dec 10, 2020
Member

@jlowe jlowe left a comment

Yes! So, I labeled this pull request with WIP.

That's fine. I'm adding a request to retarget this PR to branch-0.4 so that it cannot be accidentally merged in the interim.

@sperlingxx
Collaborator Author

build

@sperlingxx sperlingxx changed the title from "[WIP] Support reading decimal columns from parquet files" to "[REVIEW] Support reading decimal columns from parquet files" on Dec 16, 2020
@sperlingxx
Collaborator Author

build

Member

@jlowe jlowe left a comment

The test failures are related to this change; I'm guessing it's because the 3.1.0 shims were not updated.

@sperlingxx
Collaborator Author

build

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Collaborator Author

build

@sameerz sameerz added this to the Jan 4 - Jan 15 milestone Dec 18, 2020
@jlowe
Member

jlowe commented Dec 18, 2020

I also noticed that this PR didn't generate a new docs/supported_ops.md as it should. We're not currently generating the documentation for scans from the ScanMeta; instead it is hardcoded in SupportedOpsDocs.help, which should be updated accordingly. After updating the help method, running the usual mvn verify build will generate a new supported_ops.md file that should be added to this PR.

@sperlingxx
Collaborator Author

build

Member

@jlowe jlowe left a comment

This is looking better, but the supported_ops.md still states at the bottom of the page that Parquet does not support reading the Decimal type. The static table in TypeChecks needs to be updated to reflect that decimal is supported for Parquet input, and the supported_ops.md file needs to be regenerated by mvn verify.

@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

This is looking better, but the supported_ops.md still states at the bottom of the page that Parquet does not support reading the Decimal type. The static table in TypeChecks needs to be updated to reflect that decimal is supported for Parquet input, and the supported_ops.md file needs to be regenerated by mvn verify.

The static table has been updated.

@jlowe jlowe changed the title from "[REVIEW] Support reading decimal columns from parquet files" to "Support reading decimal columns from parquet files" on Jan 5, 2021
@revans2 revans2 merged commit 6635370 into NVIDIA:branch-0.4 Jan 5, 2021
@sperlingxx sperlingxx deleted the dec_pq_read branch April 8, 2021 03:05
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#1294)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Labels
feature request New feature or request