[FEA] Sort the single block info by the header for AVRO coalescing reading. #5347

Open
firestarman opened this issue Apr 28, 2022 · 3 comments
Labels
performance A performance related task/issue

Comments

@firestarman
Collaborator

Is your feature request related to a problem? Please describe.
The parent of the coalescing reader sorts the blocks by file path, so blocks in the same file can at least be coalesced. For your case, if the blocks are sorted like

FileA(Header_same) -> FileB(Header_different) -> FileC(Header_same), then FileA, FileB, and FileC can't be coalesced.

What if we sort them as below?
FileA(Header_same) -> FileC(Header_same) -> FileB(Header_different)

Then FileA and FileC can be coalesced.
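
To illustrate the proposed ordering, here is a minimal Python sketch (the plugin itself is not written in Python; this is only for brevity). `BlockInfo`, `coalesce_runs`, and `sort_by_header` are hypothetical names, not the plugin's API. The sketch assumes that only adjacent blocks with equal headers can be coalesced, and it keeps path order within each header group.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Dict, List


@dataclass(frozen=True)
class BlockInfo:
    """Hypothetical stand-in for one file's block info: path plus header."""
    path: str
    header: str


def coalesce_runs(blocks: List[BlockInfo]) -> List[List[str]]:
    """Group adjacent blocks that share a header (the assumed coalescing rule)."""
    return [[b.path for b in run]
            for _, run in groupby(blocks, key=lambda b: b.header)]


def sort_by_header(blocks: List[BlockInfo]) -> List[BlockInfo]:
    """Sort by path, then stably regroup so equal headers become adjacent."""
    by_path = sorted(blocks, key=lambda b: b.path)
    first_seen: Dict[str, int] = {}
    for b in by_path:
        first_seen.setdefault(b.header, len(first_seen))
    # sorted() is stable, so path order is preserved within each header group.
    return sorted(by_path, key=lambda b: first_seen[b.header])


blocks = [
    BlockInfo("FileA", "Header_same"),
    BlockInfo("FileB", "Header_different"),
    BlockInfo("FileC", "Header_same"),
]

# Current behavior: path order only, so no two adjacent headers match.
path_order = coalesce_runs(sorted(blocks, key=lambda b: b.path))

# Proposed behavior: FileA and FileC end up in the same run.
header_order = coalesce_runs(sort_by_header(blocks))
```

Under these assumptions, `path_order` has three single-file runs, while `header_order` puts FileA and FileC in one run and leaves FileB on its own.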

This is from #5306 (comment)

@firestarman firestarman added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 28, 2022
@sameerz sameerz added performance A performance related task/issue and removed ? - Needs Triage Need team to review and classify feature request New feature or request labels May 3, 2022
@jlowe
Member

jlowe commented May 3, 2022

Is the reordering going to be OK in practice? I'm worried about a case where data was written out in sorted order and the user query assumes a task will read it back in that order. If we reorder the files, we'll come up with a different data order than the CPU would for the task, and I'm wondering if we could potentially break some expectations.
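
To make the concern concrete, here is a small Python sketch with made-up data. The file names, headers, and row values are all hypothetical, and path order is assumed here to stand in for the order a CPU task would read the files.

```python
# Hypothetical files written out in globally sorted order: (path, header, rows).
files = [
    ("FileA", "Header_same", [1, 2, 3]),
    ("FileB", "Header_different", [4, 5, 6]),
    ("FileC", "Header_same", [7, 8, 9]),
]


def read_rows(ordered_files):
    """Concatenate rows in the order the files are read."""
    return [row for _, _, rows in ordered_files for row in rows]


# Path order: rows come back in the order they were written.
cpu_rows = read_rows(sorted(files, key=lambda f: f[0]))

# Header-grouped order (FileA, FileC, FileB): the stable sort moves
# the file with the different header to the end.
regrouped = sorted(files, key=lambda f: f[1] != "Header_same")
gpu_rows = read_rows(regrouped)
```

Here `cpu_rows` stays sorted, while `gpu_rows` has FileC's rows ahead of FileB's, so a query that relies on the task seeing sorted input would observe a different order.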

@wbo4958
Collaborator

wbo4958 commented May 5, 2022

Yeah, it indeed breaks this case, as Jason said.

@wbo4958 wbo4958 closed this as completed May 5, 2022
@revans2 revans2 reopened this May 5, 2022
@revans2
Collaborator

revans2 commented May 5, 2022

I reopened this because we can at least let people opt into this. Also, if I remember correctly, Spark was sorting the files in an odd way: it grouped them by whole files as much as possible, but if they didn't fit, the bits and pieces were combined at the end in a single task. So if a user is relying on this behavior, they may be in for problems with Spark itself if the files ever grow larger than a single split.
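
An opt-in could look roughly like the Python sketch below. The flag name `sort_blocks_by_header` is purely hypothetical and is not an existing plugin config; the `(path, header)` tuple is also a simplification for illustration. The default keeps the current path order, so nothing changes unless a user asks for it.

```python
from typing import Dict, List, Tuple

Block = Tuple[str, str]  # (path, header), a hypothetical simplification


def order_blocks(blocks: List[Block],
                 sort_blocks_by_header: bool = False) -> List[Block]:
    """Return blocks in path order, or header-grouped order when opted in."""
    by_path = sorted(blocks, key=lambda b: b[0])
    if not sort_blocks_by_header:
        # Default: unchanged behavior, path order only.
        return by_path
    rank: Dict[str, int] = {}
    for _, header in by_path:
        rank.setdefault(header, len(rank))
    # Stable sort: path order is kept within each header group.
    return sorted(by_path, key=lambda b: rank[b[1]])


blocks = [
    ("FileA", "Header_same"),
    ("FileB", "Header_different"),
    ("FileC", "Header_same"),
]
default_order = [path for path, _ in order_blocks(blocks)]
opted_in_order = [path for path, _ in order_blocks(blocks, sort_blocks_by_header=True)]
```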


5 participants