Add optional schema to scan_parquet #15111
Comments
I like this, or even just an option to default to it. A workaround to get part of the behavior is to scan each file separately and diagonally concatenate the results. Of course, this workaround goes in the opposite direction: instead of avoiding a scan of the first file's metadata, it scans every file's metadata.
@deanm0000 I don't really get the pushback from the core team on this. This strictness shouldn't exist: if you pass a schema, you are doing so willingly and know that it will fit. DataFusion and PyArrow have no issue handling this type of behavior. The workaround unfortunately ruins performance. I've done it here as well: https://github.com/ion-elgreco/polars-deltalake/blob/d9fcb4d9d7337bd163ce3ee344225516e53da4da/python/src/lib.rs#L126 It should be a single scan for the optimizer to work properly.
@deanm0000 Actually, a diagonal concat wouldn't be enough. Take this example: I have two parquet files, each with the columns ["foo", "bar"], and I pass a schema of ["foo", "bar", "baz"]. Simply reading and diagonally concatenating would ignore the column "baz". Since I provided it in the schema, this column should appear as a null array of the declared dtype.
Description
If a schema is passed to scan_parquet, it should be used to coerce every parquet file read into that schema. This also means Polars does not need to fetch the metadata of the first parquet file. There are many cases where you know the schema in advance, such as when using table formats (Delta, Iceberg, Hudi). If the API can be made more flexible here, it opens the door to fully native readers for these lakehouse formats, which are widely used.