[BUG] libcudf generates invalid offsets for LIST column when loading Parquet file with chunked reader #15306

Closed
jlowe opened this issue Mar 14, 2024 · 0 comments · Fixed by #15342
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

jlowe commented Mar 14, 2024

Describe the bug
The libcudf chunked Parquet reader can generate a LIST column with invalid offsets, in which an offset drops back to zero after row 0 (e.g., 0, 7, 23, 43, 53, 56, 0, 71). Loading the same file without the chunked reader does not produce invalid offsets.

Steps/Code to reproduce bug
Load the following file with the chunked Parquet reader and note that the offsets column for the new_10 column (column index 11 in the resulting table, a LIST of TIMESTAMP_DAYS) is incorrect.
1418348638.parquet.gz

Expected behavior
LIST offset column values should never be less than the previous value in the offset column.
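
The expected behavior above amounts to a monotonicity check on the offsets. The following is a minimal sketch of that check, assuming the offsets have already been copied to the host as a plain sequence of integers. The helper name `first_invalid_offset` is hypothetical and not part of libcudf or cuDF.

```python
from typing import Optional, Sequence


def first_invalid_offset(offsets: Sequence[int]) -> Optional[int]:
    """Return the index of the first offset that is smaller than its
    predecessor, or None if the offsets never decrease.

    Hypothetical helper for illustration; not a libcudf or cuDF API.
    """
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            return i
    return None


# Offsets quoted in the bug description: the value at index 6 drops back to 0.
bad_offsets = [0, 7, 23, 43, 53, 56, 0, 71]
print(first_invalid_offset(bad_offsets))  # 6
```

Running a check like this on the offsets of every chunk returned by the chunked reader is one way to pinpoint which chunk, and which row within it, first violates the invariant.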

rapids-bot bot pushed a commit that referenced this issue Mar 20, 2024
…nked reader. (#15342)

Fixes #15306

The core issue here was that, under certain conditions, the chunked reader could generate invalid page indices for list columns. This led to corruption in the decode kernels. The fix is fairly simple, but a decent amount of the delta in this PR is just name changes for clarity plus additional comments and docs.

The fix also changed the number of chunks generated in some of the very (unrealistically) constrained tests.

Authors:
  - https://github.com/nvdbaranec
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15342