[BUG] ORC writer does not correctly output statistics for empty data file #14675
Comments
CC @vuule.
I have a repro of "no stripes/column statistics". I think it's expected to have no stripe statistics, since there are no stripes. But we should probably include file statistics. Where are you reading the compression size from? We do have a (passing) test that might be related,
I'm reading the metadata using the orc tool. This is my script:
It seems that the orc tool is wrong about compression size. I dug into its code and saw that it prints the value of
…4707)
Fixes #14675
Write file-level statistics even when stripe-level statistics don't exist (no stripes). Written statistics are in line with Pandas - zero sum, no min/max.
Authors:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- MithunR (https://github.com/mythrocks)
- Nghia Truong (https://github.com/ttnghia)
URL: #14707
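To illustrate the behaviour the fix describes (file-level statistics still being written when there are no stripes, with a zero sum and no min/max), here is a minimal Python sketch. It is not the cuDF implementation: `IntColumnStats`, `merge`, and `file_stats` are hypothetical names introduced only for this example, and it assumes file statistics are obtained by folding per-stripe statistics together.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class IntColumnStats:
    """Hypothetical stand-in for integer column statistics (not a cuDF or ORC type)."""
    number_of_values: int = 0
    total: int = 0                  # sum of the values
    minimum: Optional[int] = None   # None means "no min recorded"
    maximum: Optional[int] = None   # None means "no max recorded"


def merge(a: IntColumnStats, b: IntColumnStats) -> IntColumnStats:
    """Combine two statistics objects, ignoring absent min/max values."""
    mins = [m for m in (a.minimum, b.minimum) if m is not None]
    maxs = [m for m in (a.maximum, b.maximum) if m is not None]
    return IntColumnStats(
        number_of_values=a.number_of_values + b.number_of_values,
        total=a.total + b.total,
        minimum=min(mins) if mins else None,
        maximum=max(maxs) if maxs else None,
    )


def file_stats(stripe_stats: Iterable[IntColumnStats]) -> IntColumnStats:
    """Fold stripe-level statistics into one file-level statistics object.

    With no stripes the loop never runs, so the identity value is returned:
    zero values, zero sum, and no min/max.
    """
    result = IntColumnStats()
    for stats in stripe_stats:
        result = merge(result, stats)
    return result


# An empty file has no stripes, but still gets well-defined file statistics.
empty_file = file_stats([])
```

Starting the fold from an identity value means the empty case needs no special handling: the file-level entry exists even though no stripe contributed to it.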
From our ORC tests, we found that for an empty table the output file does not contain column statistics, and some file statistics are also incorrect:
So there are at least two (observable) problems:
Compression size: 262144: This is incorrect, since there is no data.