feat: Relaxed schema alignment for parquet file list read #18803

Merged
merged 1 commit into from
Sep 20, 2024

nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented Sep 18, 2024

Fixes #18568
Fixes #17254

  • Support reading from a list of parquet files that have the same set of columns, but not necessarily in the same order

    • The dtype of every column must be equal
  • When the files have differing columns:

    • If there are extra columns in files following the first file, those columns are silently ignored (not read)
      • Let me know if we want to error here instead
    • If a file after the first is missing a column that the first file has, an error is raised. I will open a follow-up PR adding an option to project NULL values for that column instead.
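
The rules above can be sketched in plain Python. This is only an illustration of the described behaviour, not the code in this PR: the function name `align_to_first_schema`, the `SchemaAlignmentError` exception, and the use of dicts mapping column names to dtype strings are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical stand-in for a parquet schema: column name -> dtype name.
Schema = Dict[str, str]


class SchemaAlignmentError(Exception):
    """Raised when a later file cannot be aligned to the first file's schema."""


def align_to_first_schema(first: Schema, later: Schema) -> List[str]:
    """Return the columns to read from `later`, in the first file's order.

    Rules illustrated (as described in the PR text):
      * column order in `later` does not matter
      * extra columns in `later` are silently ignored
      * a column missing from `later` is an error
      * a column whose dtype differs is an error
    """
    projection = []
    for name, dtype in first.items():
        if name not in later:
            raise SchemaAlignmentError(
                f"column {name!r} from the first file is missing"
            )
        if later[name] != dtype:
            raise SchemaAlignmentError(
                f"column {name!r} has dtype {later[name]}, expected {dtype}"
            )
        projection.append(name)
    return projection


first = {"a": "Int64", "b": "String"}
later = {"b": "String", "extra": "Float64", "a": "Int64"}
# Same columns in a different order plus one extra column: the extra
# column is dropped and the result follows the first file's order.
print(align_to_first_schema(first, later))
```

Treating the first file's schema as the reference keeps the output schema known before any later file is opened, which is why extra columns can be skipped but missing ones cannot.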

@github-actions bot added labels Sep 18, 2024: enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars)

codecov bot commented Sep 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@77b1e42). Learn more about missing BASE report.
Report is 12 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #18803   +/-   ##
=======================================
  Coverage        ?   79.86%           
=======================================
  Files           ?     1518           
  Lines           ?   205582           
  Branches        ?     2892           
=======================================
  Hits            ?   164180           
  Misses          ?    40854           
  Partials        ?      548           


@nameexhaustion nameexhaustion marked this pull request as ready for review September 18, 2024 07:56
@nameexhaustion nameexhaustion marked this pull request as draft September 18, 2024 11:17
@nameexhaustion nameexhaustion force-pushed the pq-col-unaligned branch 2 times, most recently from c248acb to bea8e17 Compare September 18, 2024 11:37
@nameexhaustion nameexhaustion marked this pull request as ready for review September 18, 2024 11:57
@kszlim
Contributor

kszlim commented Sep 18, 2024

Does "The dtype of every column must be equal" imply no upcasting if you have f32 in one file and f64 in another for one column?

@ion-elgreco
Contributor

kszlim wrote: Does "The dtype of every column must be equal" imply no upcasting if you have f32 in one file and f64 in another for one column?

This shouldn't be enforced; otherwise, type widening won't be possible.

@ion-elgreco
Contributor

Any chance you will allow passing a schema to use instead of fetching it from the metadata of the first file?

@ritchie46 ritchie46 merged commit 9f2e410 into pola-rs:main Sep 20, 2024
27 checks passed
@coastalwhite
Collaborator

ion-elgreco wrote: Any chance you will allow passing a schema to use instead of fetching it from the metadata of the first file?

This is tracked in #15111.

Labels
enhancement (New feature or an improvement of an existing feature), python (Related to Python Polars), rust (Related to Rust Polars)
5 participants