Add optional schema to scan_parquet #15111

Open
ion-elgreco opened this issue Mar 17, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@ion-elgreco
Contributor

Description

If a schema is passed to scan_parquet, it should be used to coerce every Parquet file that is read into this schema. This also means Polars does not need to fetch the metadata of the first Parquet file.

There are many cases where you know the schema in advance, such as when using lakehouse formats (Delta, Iceberg, Hudi). Making the API more flexible here opens the door to fully native readers for these widely used formats. A sketch of the proposed usage is below.
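
For illustration, a minimal sketch of what the requested API could look like; the `schema=` argument is the proposed addition being asked for in this issue, not an existing parameter:

import polars as pl

# The caller-supplied schema is taken as the source of truth: no metadata
# read of the first file is needed, and every scanned file is coerced to it.
schema = {"foo": pl.Int64, "bar": pl.Utf8, "baz": pl.Float64}

# `schema=` is the hypothetical parameter this issue proposes.
lf = pl.scan_parquet("data/*.parquet", schema=schema)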

@ion-elgreco ion-elgreco added the enhancement New feature or an improvement of an existing feature label Mar 17, 2024
@deanm0000
Collaborator

I like this, or even just an option to default to the pl.concat(..., how='diagonal_relaxed') behavior (sketched below). I think similar requests have come up in the past (although I can't find them at the moment), but there has been pushback from the core team in that they want the strictness.
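
For reference, a minimal sketch of that diagonal_relaxed behavior: missing columns are filled with nulls and mismatched dtypes are supercast.

import polars as pl

# Two frames with overlapping but unequal schemas.
a = pl.LazyFrame({"foo": [1], "bar": ["x"]})
b = pl.LazyFrame({"foo": [2.0], "baz": [True]})

# diagonal_relaxed fills the missing "bar"/"baz" slots with nulls
# and supercasts "foo" from Int64 to Float64.
out = pl.concat([a, b], how="diagonal_relaxed").collect()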

A workaround to get part of the behavior:

import json
import polars as pl

your_schema = {}  # actual schema here, of course

df = pl.scan_parquet(...)  # your paths/glob here

# serialize() only emits JSON, so round-trip through json.loads
# to pull the resolved file list out of the serialized plan.
files = json.loads(df.serialize())['Scan']['paths']

# Re-scan each file individually, casting it to the target schema.
df = pl.concat([
    pl.scan_parquet(f)
    .select([pl.col(name).cast(dtype) for name, dtype in your_schema.items()])
    for f in files
])

Of course, this workaround goes in the opposite direction with respect to avoiding a scan of the first file: it scans every file's metadata instead. It's awkward that serialize doesn't have an option to output a dict, so we have to parse the JSON, but oh well.

@ion-elgreco
Contributor Author

@deanm0000 I don't really get the pushback from the core team on this. The strictness shouldn't apply here: if you are passing a schema, you are doing so willingly and know that it will fit. DataFusion and PyArrow have no issue handling this type of behavior.

It's a workaround that unfortunately ruins performance. I've done it here as well: https://github.com/ion-elgreco/polars-deltalake/blob/d9fcb4d9d7337bd163ce3ee344225516e53da4da/python/src/lib.rs#L126

It should be a single scan for the optimizer to properly work.

@ion-elgreco
Contributor Author

@deanm0000 Actually, a diagonal concat wouldn't be enough. Take this example:

I have two Parquet files, each with the columns ["foo", "bar"], and I pass a schema of ["foo", "bar", "baz"]. Simply reading and diagonally concatenating would ignore the column "baz". Since I provided it in the schema, that column should materialize as a null array of the specified dtype.
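
A minimal sketch of the fuller coercion being described, assuming the target schema is known up front (the collect_schema call and variable names are illustrative):

import polars as pl

target_schema = {"foo": pl.Int64, "bar": pl.Utf8, "baz": pl.Float64}

lf = pl.scan_parquet("file.parquet")  # file only contains "foo" and "bar"
present = lf.collect_schema().names()

# Cast the columns that exist and materialize the ones that don't as typed
# null columns, so the result always matches the target schema exactly.
lf = lf.select(
    [
        pl.col(name).cast(dtype) if name in present
        else pl.lit(None, dtype=dtype).alias(name)
        for name, dtype in target_schema.items()
    ]
)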
