Improve the over-estimating size when doing ORC COALESCING reading #2945

Closed
wbo4958 opened this issue Jul 16, 2021 · 2 comments
Labels
improve P1 Nice to have for release

Comments

@wbo4958
Collaborator

wbo4958 commented Jul 16, 2021

As of #2909, ORC supports COALESCING reading.

Before coalescing small files, we need to estimate an initial size so that a HostMemoryBuffer can be allocated to hold the HEADER, STRIPES, and FOOTER. Testing on 5000 non-partitioned files totaling 1.3G shows that the initial estimated size is consistently larger than the true size.

stderr:21/07/16 09:08:38 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63929078, and the true size: 63160978
stderr:21/07/16 09:08:38 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63535693, and the true size: 62768848
stderr:21/07/16 09:08:38 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63159114, and the true size: 62391504
stderr:21/07/16 09:08:38 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63055310, and the true size: 62287543
stderr:21/07/16 09:08:38 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 64388818, and the true size: 63620556
stderr:21/07/16 09:08:39 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63271377, and the true size: 62502777
stderr:21/07/16 09:08:39 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63699389, and the true size: 62931986
stderr:21/07/16 09:08:39 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 63398089, and the true size: 62630814
stderr:21/07/16 09:08:41 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62961599, and the true size: 62193810
stderr:21/07/16 09:08:41 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62870696, and the true size: 62103385
stderr:21/07/16 09:08:42 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62784883, and the true size: 62017554
stderr:21/07/16 09:08:42 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62698196, and the true size: 61932040
stderr:21/07/16 09:08:43 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62590045, and the true size: 61822125
stderr:21/07/16 09:08:43 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62491433, and the true size: 61724033
stderr:21/07/16 09:08:43 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62394913, and the true size: 61629064
stderr:21/07/16 09:08:44 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62287476, and the true size: 61521093
stderr:21/07/16 09:08:44 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62168654, and the true size: 61402519
stderr:21/07/16 09:08:45 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 62029026, and the true size: 61262705
stderr:21/07/16 09:08:45 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 61874813, and the true size: 61108187
stderr:21/07/16 09:08:46 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 61643721, and the true size: 60877037
stderr:21/07/16 09:08:46 INFO MultiFileOrcPartitionReader: ORC Coalescing reading estimates the initTotalSize: 45697529, and the true size: 45092366

The over-estimate (initial size minus true size) is about 750K for each buffer of roughly 62M of data, which is a little over 1%.
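To check the numbers, here is a minimal Python sketch that pulls the two sizes out of log lines in the format quoted above and reports the difference. The names `LOG_PATTERN` and `overestimates` are made up for this illustration and are not part of the plugin; the two sample lines are copied from the first and last entries of the log.

```python
import re

# Matches the two sizes in the MultiFileOrcPartitionReader log message quoted above.
LOG_PATTERN = re.compile(r"initTotalSize: (\d+), and the true size: (\d+)")


def overestimates(log_lines):
    """Return (estimated, true_size, diff) for every log line that matches."""
    results = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            estimated = int(match.group(1))
            true_size = int(match.group(2))
            results.append((estimated, true_size, estimated - true_size))
    return results


# First and last lines of the log in this issue.
sample = [
    "stderr:21/07/16 09:08:38 INFO MultiFileOrcPartitionReader: ORC Coalescing "
    "reading estimates the initTotalSize: 63929078, and the true size: 63160978",
    "stderr:21/07/16 09:08:46 INFO MultiFileOrcPartitionReader: ORC Coalescing "
    "reading estimates the initTotalSize: 45697529, and the true size: 45092366",
]

for estimated, true_size, diff in overestimates(sample):
    # Report the absolute gap and the gap relative to the true size.
    print(f"estimated={estimated} true={true_size} over={diff} ({diff / true_size:.2%})")
```

For these two lines the gap is 768100 and 605163 bytes respectively.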

@wbo4958 wbo4958 added P1 Nice to have for release improve labels Jul 16, 2021
@wbo4958 wbo4958 self-assigned this Jul 16, 2021
@GaryShen2008
Collaborator

It would be better to have a design doc for ORC coalescing; then we can review the current approach.

@wbo4958
Collaborator Author

wbo4958 commented Aug 30, 2021

The PR has been merged. Closing this issue.

@wbo4958 wbo4958 closed this as completed Aug 30, 2021