[BUG] ORC writer does not correctly output statistics for empty data file #14675

Closed
ttnghia opened this issue Dec 27, 2023 · 4 comments · Fixed by #14707
Labels
bug Something isn't working

Comments

@ttnghia
Contributor

ttnghia commented Dec 27, 2023

From our ORC tests, we found that for an empty table the output file does not contain column statistics, and some file statistics are also incorrect:

File Version: 0.12 with FUTURE by Unknown(5) 
Rows: 0
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint,p:bigint>

Stripe Statistics:

File Statistics:

Stripes:

File length: 68 bytes
Padding length: 0 bytes
Padding ratio: 0%

So there are at least two (observable) problems:

  • Compression size: 262144: This is incorrect, since there is no data.
  • There are no stripes/column statistics.
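Both observations can be checked mechanically from the dump. The following is a minimal sketch, not part of orc-tools or cuDF: `parse_meta_dump` is a hypothetical helper, and it assumes that entries inside the `Stripe Statistics`, `File Statistics` and `Stripes` sections are indented, while top-level fields have the form `Key: value`.

```python
# Sample dump copied from the report above.
DUMP = """\
File Version: 0.12 with FUTURE by Unknown(5)
Rows: 0
Compression: SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint,p:bigint>

Stripe Statistics:

File Statistics:

Stripes:

File length: 68 bytes
Padding length: 0 bytes
Padding ratio: 0%
"""

# Section headers that appear in the dump above.
SECTION_HEADERS = ("Stripe Statistics", "File Statistics", "Stripes")


def parse_meta_dump(text):
    """Split a dump into top-level fields and per-section body lines.

    Hypothetical helper: assumes section bodies are indented and
    top-level fields look like "Key: value".
    """
    fields = {}
    sections = {name: [] for name in SECTION_HEADERS}
    current = None  # section currently being read, if any
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.endswith(":") and line[:-1] in SECTION_HEADERS:
            current = line[:-1]
            continue
        if current is not None and raw.startswith((" ", "\t")):
            sections[current].append(line.strip())
            continue
        # An unindented "Key: value" line ends any open section.
        current = None
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields, sections


fields, sections = parse_meta_dump(DUMP)
print(fields["Rows"], fields["Compression size"])
print({name: len(body) for name, body in sections.items()})
```

For the dump in this report, all three sections come back empty while `Compression size` is still reported as a non-zero value.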
@ttnghia
Contributor Author

ttnghia commented Dec 27, 2023

CC @vuule.

@vuule
Contributor

vuule commented Dec 27, 2023

I have a repro of "no stripes/column statistics". I think it's expected to have no stripe statistics, since there are no stripes. But we should probably include file statistics.

Where are you reading the compression size from? We do have a (passing) test that might be related, CompStatsEmptyTable.

@ttnghia
Contributor Author

ttnghia commented Dec 28, 2023

I'm reading the metadata with the ORC tools jar. This is my script:

java -jar orc-tools-1.8.4-uber.jar meta $1

@ttnghia
Contributor Author

ttnghia commented Dec 28, 2023

It seems that the orc tool is wrong about the compression size. I dug into its code and saw that the value it prints for that field is compressionBlockSize, i.e. the compression block size setting rather than the size of any data.

@ttnghia ttnghia changed the title from "[BUG] ORC writer does not correctly output statistics for empty file" to "[BUG] ORC writer does not correctly output statistics for empty data file" Dec 28, 2023
rapids-bot bot pushed a commit that referenced this issue Jan 8, 2024
…4707)

Fixes #14675
Write file-level statistics even when stripe-level statistics don't exist (no stripes).
Written statistics are in line with Pandas - zero sum, no min/max.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #14707
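
The behavior described in the commit message (file-level statistics are written even with no stripes, with a zero sum and no min/max) can be illustrated with a small sketch. This is not the cuDF implementation; `IntColumnStats` and `merge_stripe_stats` are hypothetical names, and a zero value count for the empty case is an assumption based on the file having 0 rows.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class IntColumnStats:
    """Hypothetical integer column statistics (not a cuDF or ORC type)."""
    count: int = 0                  # number of values (assumed 0 when empty)
    minimum: Optional[int] = None   # None means "no min recorded"
    maximum: Optional[int] = None   # None means "no max recorded"
    total: int = 0                  # the "sum" statistic


def merge_stripe_stats(stripes: Sequence[IntColumnStats]) -> IntColumnStats:
    """Fold per-stripe statistics into one file-level entry.

    With no stripes the result is still a valid entry:
    zero count, zero sum, and no min/max.
    """
    merged = IntColumnStats()
    for s in stripes:
        merged.count += s.count
        merged.total += s.total
        if s.minimum is not None:
            merged.minimum = (
                s.minimum if merged.minimum is None
                else min(merged.minimum, s.minimum)
            )
        if s.maximum is not None:
            merged.maximum = (
                s.maximum if merged.maximum is None
                else max(merged.maximum, s.maximum)
            )
    return merged


# Empty file: no stripes, but a file-level entry is still produced.
print(merge_stripe_stats([]))
```

The point of the sketch is that the file-level entry does not depend on any stripe existing: starting from a neutral value and folding in zero stripes leaves that neutral value, which is what gets written.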
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
3 participants