Add optional schema to scan_parquet #15111

Open
ion-elgreco opened this issue Mar 17, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@ion-elgreco
Contributor

Description

If a schema is passed to scan_parquet, it should be used to coerce every Parquet file that is read into this schema. This also means Polars does not need to fetch the metadata of the first Parquet file.

There are many cases where you know the schema in advance, such as when using lakehouse formats (Delta, Iceberg, Hudi). Making the API more flexible here opens the door to fully native readers for these widely used formats. A sketch of the proposed usage is below.
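
For illustration, a minimal sketch of what the requested API could look like; the `schema=` argument is the proposed addition being asked for in this issue, not an existing parameter:

import polars as pl

# The caller-supplied schema is taken as the source of truth: no metadata
# read of the first file is needed, and every scanned file is coerced to it.
schema = {"foo": pl.Int64, "bar": pl.Utf8, "baz": pl.Float64}

# `schema=` is the hypothetical parameter this issue proposes.
lf = pl.scan_parquet("data/*.parquet", schema=schema)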

@ion-elgreco ion-elgreco added the enhancement New feature or an improvement of an existing feature label Mar 17, 2024
@deanm0000
Collaborator

I like this, or even just an option to default to the pl.concat(..., how='diagonal_relaxed') behavior (sketched below). I think similar requests have come up in the past (although I can't find them at the moment), but there has been pushback from the core team in that they want the strictness.
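
For reference, a minimal sketch of that diagonal_relaxed behavior: missing columns are filled with nulls and mismatched dtypes are supercast.

import polars as pl

# Two frames with overlapping but unequal schemas.
a = pl.LazyFrame({"foo": [1], "bar": ["x"]})
b = pl.LazyFrame({"foo": [2.0], "baz": [True]})

# diagonal_relaxed fills the missing "bar"/"baz" slots with nulls
# and supercasts "foo" from Int64 to Float64.
out = pl.concat([a, b], how="diagonal_relaxed").collect()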

A workaround to get part of the behavior:

import json
import polars as pl

your_schema = {}  # actual schema here, of course

df = pl.scan_parquet(...)  # your paths/glob here

# serialize() only emits JSON, so round-trip through json.loads
# to pull the resolved file list out of the serialized plan.
files = json.loads(df.serialize())['Scan']['paths']

# Re-scan each file individually, casting it to the target schema.
df = pl.concat([
    pl.scan_parquet(f)
    .select([pl.col(name).cast(dtype) for name, dtype in your_schema.items()])
    for f in files
])

Of course, this workaround goes in the opposite direction with respect to avoiding a scan of the first file: it scans every file's metadata instead. It's awkward that serialize doesn't have an option to output a dict, so we have to parse the JSON, but oh well.

@ion-elgreco
Contributor Author

@deanm0000 I don't really get the pushback from the core team on this. The strictness shouldn't apply here: if you are passing a schema, you are doing so willingly and know that it will fit. DataFusion and PyArrow have no issue handling this type of behavior.

It's a workaround that unfortunately ruins performance. I've done it here as well: https://github.com/ion-elgreco/polars-deltalake/blob/d9fcb4d9d7337bd163ce3ee344225516e53da4da/python/src/lib.rs#L126

It should be a single scan for the optimizer to properly work.

@ion-elgreco
Contributor Author

@deanm0000 Actually, a diagonal concat wouldn't be enough. Take this example:

I have two Parquet files, each with the columns ["foo", "bar"], and I pass a schema of ["foo", "bar", "baz"]. Simply reading and diagonally concatenating would ignore the column "baz". Since I provided it in the schema, that column should materialize as a null array of the specified dtype.
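
A minimal sketch of the fuller coercion being described, assuming the target schema is known up front (the collect_schema call and variable names are illustrative):

import polars as pl

target_schema = {"foo": pl.Int64, "bar": pl.Utf8, "baz": pl.Float64}

lf = pl.scan_parquet("file.parquet")  # file only contains "foo" and "bar"
present = lf.collect_schema().names()

# Cast the columns that exist and materialize the ones that don't as typed
# null columns, so the result always matches the target schema exactly.
lf = lf.select(
    [
        pl.col(name).cast(dtype) if name in present
        else pl.lit(None, dtype=dtype).alias(name)
        for name, dtype in target_schema.items()
    ]
)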
