[Parquet] BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 #43908
Comments
@emkornfield would you mind taking a quick glance at this?
@matteosdocsity I'm trying to reproduce this by creating the Parquet files locally (to see if there would be any noticeable difference between a file written by pyarrow 12 vs 13), but I get an error running your code to create the pyarrow table. To demonstrate with just the first Decimal value:

```python
>>> pa.array([Decimal('12345678901234567890.12345678901234567890')], type=pa.decimal128(38, 18))
...
ArrowInvalid: Rescaling Decimal128 value would cause data loss
```

If I don't specify a decimal type but let it be inferred, I get the following type:

```python
>>> pa.array([Decimal('12345678901234567890.12345678901234567890')]).type
Decimal256Type(decimal256(40, 20))
```
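The reason decimal128(38, 18) fails here can be seen without pyarrow: the value has 20 integer digits and 20 fractional digits, so it needs precision 40 and scale 20, which exceeds decimal128's 38-digit maximum precision. A small stdlib sketch (the helper name is mine, not an Arrow API):

```python
from decimal import Decimal

def precision_and_scale(value: str) -> tuple[int, int]:
    """Return the (precision, scale) needed to store a decimal string exactly."""
    sign, digits, exponent = Decimal(value).as_tuple()
    scale = max(0, -exponent)             # digits after the decimal point
    precision = max(len(digits), scale)   # total significant digits
    return precision, scale

p, s = precision_and_scale('12345678901234567890.12345678901234567890')
print(p, s)  # 40 20 -> needs decimal256; decimal128 tops out at precision 38
```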
@jorisvandenbossche yes, sorry, my mistake: it's `pa.decimal256(40, 20)`. I've updated the code.
Thanks. One more question: you are using a nested data type (the decimals are in a struct in a list type). Do you also see the issue with non-nested decimals, or only specifically with this nesting?
yes @jorisvandenbossche, exactly, it's an array of decimals
Comparing the two files (written with pyarrow 12.0.1 and with 14.0.2), the main difference I notice is the different name used for the list element. Comparing both Parquet schemas:
This was a deliberate change in pyarrow to follow the Parquet spec more closely (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists), and it can be controlled by passing `use_compliant_nested_types=False`. In the metadata I also see that the listed encodings have a different order (pyarrow 12 puts RLE_DICTIONARY first, while pyarrow 14 puts PLAIN first). Just to try to narrow down the issue, @matteosdocsity you could also try tweaking some parameters based on that (e.g. …)
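The naming difference described above can be illustrated without pyarrow: both schemas use the three-level LIST structure, but the legacy form names the element `item` while the spec-compliant form names it `element`. A hypothetical sketch (the rendered schema text is illustrative, not actual pyarrow output):

```python
def list_schema(column: str, element_type: str, compliant: bool) -> str:
    """Render a Parquet three-level LIST schema; only the element name differs."""
    element = "element" if compliant else "item"
    return (
        f"optional group {column} (LIST) {{\n"
        f"  repeated group list {{\n"
        f"    optional {element_type} {element};\n"
        f"  }}\n"
        f"}}"
    )

# pyarrow >= 13 default (compliant) vs pyarrow 12 default (legacy naming)
print(list_schema("sessions_array", "fixed_len_byte_array(32)", compliant=True))
print(list_schema("sessions_array", "fixed_len_byte_array(32)", compliant=False))
```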
This is in 13.0: #35758. Personally I don't think it would matter...
@jorisvandenbossche it works like a charm, thanks!

```python
# Define kwargs
kwargs = {
    'use_compliant_nested_type': False
}

# Save DataFrame to Parquet with dynamic kwargs
df.to_parquet('output.parquet', engine='pyarrow', **kwargs)
```

This makes pyarrow fall back to the legacy (non-compliant) nested type naming when saving DataFrames to Parquet with pandas and pyarrow.
It's strange that BigQuery does not read that properly, though... They even have a page explicitly mentioning the expected schema for list types: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type (where they are using "element" instead of "item"). If you need to specify …
I think there are two issues here.
I have seen a similar issue with a Parquet file created by Hudi. There is a nested list as below:

The C++ Parquet reader infers its schema as shown in arrow/cpp/src/parquet/arrow/schema.cc, lines 657 to 663 at commit 12dddfc.

I think we need to regard them as a nested two-level list, meaning that the correct interpretation is …
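The two-level vs three-level ambiguity mentioned above comes from the backward-compatibility rules in the Parquet LogicalTypes spec: the repeated field under a LIST group is the element itself when it is not a group, has multiple children, or carries a legacy name. A simplified sketch of those rules (the field representation and helper name are mine, not Arrow's schema inference code):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Field:
    name: str
    is_group: bool = False
    children: List["Field"] = field(default_factory=list)

def repeated_field_is_element(list_name: str, repeated: Field) -> bool:
    """Apply the LIST backward-compat rules from the Parquet spec."""
    if not repeated.is_group:
        return True   # two-level list: the repeated primitive is the element
    if len(repeated.children) != 1:
        return True   # repeated group with several fields is itself the element
    if repeated.name == "array" or repeated.name == f"{list_name}_tuple":
        return True   # legacy writer naming conventions
    return False      # standard three-level list: the single child is the element

# Compliant list: my_list (LIST) { repeated group list { optional ... element } }
three_level = Field("list", is_group=True, children=[Field("element")])
print(repeated_field_is_element("my_list", three_level))  # False

# Legacy two-level list: my_list (LIST) { repeated int32 my_list_tuple }
two_level = Field("my_list_tuple", is_group=False)
print(repeated_field_is_element("my_list", two_level))  # True
```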
@jorisvandenbossche I tested as you reported here, and via Terraform forced the BigQuery table schema to have the column be a record of a record of BIGNUMERIC, with element instead of item:

```json
{
  "name": "sessions_array",
  "type": "RECORD",
  "mode": "NULLABLE",
  "fields": [
    {
      "name": "list",
      "type": "RECORD",
      "mode": "REPEATED",
      "fields": [
        {
          "name": "element",
          "type": "BIGNUMERIC",
          "mode": "NULLABLE"
        }
      ]
    }
  ]
}
```

In combination with …
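The nesting in that schema can be checked mechanically with the standard library. A small sketch that walks the quoted schema down to its leaf (no BigQuery access assumed; the JSON literal is the schema shown above):

```python
import json

# The BigQuery column schema quoted above, as written via Terraform.
schema = json.loads("""
{
  "name": "sessions_array",
  "type": "RECORD",
  "mode": "NULLABLE",
  "fields": [
    {"name": "list", "type": "RECORD", "mode": "REPEATED",
     "fields": [
        {"name": "element", "type": "BIGNUMERIC", "mode": "NULLABLE"}
     ]}
  ]
}
""")

# Walk the nesting: outer NULLABLE record -> REPEATED "list" record -> leaf.
outer = schema
inner = outer["fields"][0]
leaf = inner["fields"][0]
print(outer["name"], inner["name"], inner["mode"], leaf["name"], leaf["type"])
# sessions_array list REPEATED element BIGNUMERIC
```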
Describe the bug, including details regarding any error messages, version, and platform.

When using `pyarrow` versions greater than `12.0.1` to write Parquet files that are then loaded into Google BigQuery, the fields containing `Decimal` values (used to represent BigQuery's `BIGNUMERIC` type) are read as `NULL` by BigQuery. This issue does not occur with `pyarrow==12.0.1`.

Environment

Steps to Reproduce
1. Create a dataset with columns containing `Decimal` values.
2. Write it to a Parquet file with `pyarrow==13.0.0` or later.

Expected Behavior
BigQuery should correctly read the `Decimal` values from the Parquet file and populate the corresponding fields in the table.

Actual Behavior
BigQuery reads the fields corresponding to `Decimal` values as `NULL`.

Code Example
Notes
This issue does not occur with pyarrow==12.0.1.
The problem seems to be related to how pyarrow serializes Decimal types in Parquet files and how BigQuery interprets them.
Workaround
Using pyarrow==12.0.1 and Python<3.12 resolves the issue, but this is not ideal as it requires using an older version of the library, which may lack other features or bug fixes.
Component(s)
Python