[FEA] Sort the single block info by the header for AVRO coalescing reading. #5347

firestarman · 2022-04-28T01:22:01Z

Is your feature request related to a problem? Please describe.
The parent of the coalescing reader sorts the blocks by file path, so blocks from the same file can at least be coalesced. In your case, if the blocks are sorted like

FileA(Header_same) -> FileB(Header_different) -> FileC(Header_same), then FileA, FileB, and FileC can't be coalesced.

What if we sort them as below?
FileA(Header_same) -> FileC(Header_same) -> FileB(Header_different)

Then FileA and FileC can be coalesced.
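
To make the idea concrete, here is a minimal sketch in Python (not the plugin's actual Scala code). `BlockInfo`, `coalescible_runs`, and `sort_by_header` are hypothetical names, and the `header` string is only a stand-in for whatever decides that two Avro headers are compatible. It assumes that only adjacent blocks with the same header can be merged.

```python
from itertools import groupby
from typing import NamedTuple


class BlockInfo(NamedTuple):
    path: str    # file the block comes from
    header: str  # hypothetical stand-in for the Avro header identity


def coalescible_runs(blocks):
    """Split the sequence into runs of adjacent blocks sharing a header."""
    return [[b.path for b in run]
            for _, run in groupby(blocks, key=lambda b: b.header)]


def sort_by_header(blocks):
    """Move blocks with equal headers next to each other.

    Headers are ranked by first appearance, and Python's sort is stable,
    so the existing (path) order is kept inside each header group.
    """
    first_seen = {}
    for i, b in enumerate(blocks):
        first_seen.setdefault(b.header, i)
    return sorted(blocks, key=lambda b: first_seen[b.header])


blocks = [
    BlockInfo("FileA", "same"),
    BlockInfo("FileB", "different"),
    BlockInfo("FileC", "same"),
]

by_path = sorted(blocks, key=lambda b: b.path)
runs_by_path = coalescible_runs(by_path)
runs_by_header = coalescible_runs(sort_by_header(by_path))
```

With path order every file ends up in its own run, while the header order puts FileA and FileC in a single run.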

This is from #5306 (comment)

jlowe · 2022-05-03T20:36:37Z

Is the reordering going to be OK in practice? I'm worried about a case where the data was written out in sorted order and the user's query assumes that data read by a task will be read in that order. If we reorder the files, then we'll come up with a different data order than the CPU would for the task, and I'm wondering if we could potentially break some expectations.

wbo4958 · 2022-05-05T01:08:38Z

Yeah, it does indeed break this case, as Jason said.

revans2 · 2022-05-05T13:57:21Z

I reopened this because at least we can have people opt into this. Also, if I remember correctly, Spark was sorting the files in an odd way where it was grouping them by whole files as much as possible, but if they didn't fit then the bits and pieces were combined together at the end in a single task. So if a user is relying on this behavior, they may be in for problems with Spark itself if the files ever grow larger than a single split.
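
A rough sketch of what an opt-in could look like, again in Python rather than the plugin's Scala. The `sort_by_header` flag and the `order_blocks` function are assumptions made for illustration, not an existing config or API. By default the path order is left untouched, so users who depend on it are not affected.

```python
def order_blocks(blocks, sort_by_header=False):
    """Order (path, header) tuples for reading.

    sort_by_header is a hypothetical opt-in flag. When it is off, the
    blocks stay in path order. When it is on, blocks with equal headers
    are moved next to each other, keeping path order inside each group.
    """
    by_path = sorted(blocks, key=lambda b: b[0])
    if not sort_by_header:
        return by_path
    first_seen = {}
    for i, (_, header) in enumerate(by_path):
        first_seen.setdefault(header, i)
    # sorted() is stable, so ties keep their path order
    return sorted(by_path, key=lambda b: first_seen[b[1]])


blocks = [("FileB", "different"), ("FileC", "same"), ("FileA", "same")]

default_order = [path for path, _ in order_blocks(blocks)]
opt_in_order = [path for path, _ in order_blocks(blocks, sort_by_header=True)]
```

The default keeps FileA, FileB, FileC, and only the opt-in path changes the order that a task sees.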

firestarman added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels Apr 28, 2022

firestarman mentioned this issue Apr 28, 2022

Support coalescing reading for avro #5306

Merged

sameerz added the "performance" (A performance related task/issue) label and removed the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels May 3, 2022

firestarman mentioned this issue May 4, 2022

[FEA] Support reading Avro #4831

Open

30 tasks

wbo4958 closed this as completed May 5, 2022

revans2 reopened this May 5, 2022
