
[Parquet] BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 #43908

Closed
matteosdocsity opened this issue Sep 2, 2024 · 12 comments


@matteosdocsity

matteosdocsity commented Sep 2, 2024

Describe the bug, including details regarding any error messages, version, and platform.

When using pyarrow versions greater than 12.0.1 to write Parquet files that are then loaded into Google BigQuery, the fields containing Decimal values (used to represent BigQuery's BIGNUMERIC type) are being read as NULL by BigQuery. This issue does not occur with pyarrow==12.0.1.

Environment

  • Python Version: 3.12
  • pyarrow Version: 13.0.0 and above (issue observed)
  • Google BigQuery Version: N/A (BigQuery as the consumer of the Parquet files)
  • Operating System: macOS 14.6.1

Steps to Reproduce

  1. Create a Pandas DataFrame with a column containing lists of Decimal values.
  2. Write the DataFrame to a Parquet file using pyarrow==13.0.0 or later.
  3. Load the Parquet file into a Google BigQuery external table.
  4. Query the table in BigQuery.

Expected Behavior

BigQuery should correctly read the Decimal values from the Parquet file and populate the corresponding fields in the table.

Actual Behavior

BigQuery reads the fields corresponding to Decimal values as NULL.

Code Example

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from decimal import Decimal
from google.cloud import storage
import io

# Sample DataFrame
data = {
    'sessions_array': [
        [{'item': Decimal('12345678901234567890.12345678901234567890')}, {'item': Decimal('23456789012345678901.12345678901234567891')}],
        [{'item': Decimal('34567890123456789012.12345678901234567892')}]
    ]
}

df = pd.DataFrame(data)

# Define PyArrow schema
schema = pa.schema([
    pa.field('sessions_array', pa.list_(pa.struct([
        pa.field('item', pa.decimal256(40, 20))
    ])))
])

# Convert DataFrame to PyArrow Table with the defined schema
table = pa.Table.from_pandas(df, schema=schema)

# Write Parquet to buffer
buffer = io.BytesIO()
pq.write_table(table, buffer)
buffer.seek(0)

# Upload buffer to Google Cloud Storage or save locally
with open('/tmp/test.parquet', 'wb') as f:
    f.write(buffer.getbuffer())
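
To complete steps 3 and 4, here is a minimal sketch of exposing the file to BigQuery as an external table and querying it back. The project, dataset, table, and bucket names are placeholders, and this uses the google-cloud-bigquery client purely as an illustration of the setup, not the exact configuration used:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers, not the actual project/bucket.
table_id = "my-project.my_dataset.sessions_external"
gcs_uri = "gs://my-bucket/test.parquet"

# Define an external table over the Parquet file in GCS.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [gcs_uri]

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Query the decimal values back.
for row in client.query(f"SELECT sessions_array FROM `{table_id}`").result():
    print(row["sessions_array"])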

Notes

This issue does not occur with pyarrow==12.0.1.
The problem seems to be related to how pyarrow serializes Decimal types in Parquet files and how BigQuery interprets them.

Workaround

Using pyarrow==12.0.1 and Python<3.12 resolves the issue, but this is not ideal as it requires using an older version of the library, which may lack other features or bug fixes.

Component(s)

Python

@mapleFU
Member

mapleFU commented Sep 2, 2024

@emkornfield would you mind taking a quick glance at this?

@jorisvandenbossche
Member

@matteosdocsity I tried to reproduce this by creating the Parquet files locally (to see if there would be any noticeable difference between a file written by pyarrow 12 vs 13), but I get an error running your code to create the pyarrow table: ArrowInvalid: Rescaling Decimal128 value would cause data loss

To demonstrate with just the first Decimal value:

>>> pa.array([Decimal('12345678901234567890.12345678901234567890')], type=pa.decimal128(38, 18))
...
ArrowInvalid: Rescaling Decimal128 value would cause data loss

If I don't specify a specific decimal type but let it be inferred, I get the following type:

>>> pa.array([Decimal('12345678901234567890.12345678901234567890')]).type
Decimal256Type(decimal256(40, 20))

@matteosdocsity
Author

matteosdocsity commented Sep 3, 2024

@jorisvandenbossche yes, sorry, my mistake: it's pa.decimal256(40, 20). I've updated the code.

@jorisvandenbossche
Member

Thanks. One more question: you are using a nested data type (the decimals are in a struct in a list type). Do you also see the issue with non-nested decimals? Or only specifically with this nesting?

@matteosdocsity
Author

Yes @jorisvandenbossche, exactly, it's an array of decimals.

@jorisvandenbossche
Member

Comparing the two files (written with pyarrow 12.0.1 and with 14.0.2), the main difference I notice is the different name used for the list element. Comparing both Parquet schemas:

In [9]: meta12.schema
Out[9]: 
<pyarrow._parquet.ParquetSchema object at 0x7f542c306800>
required group field_id=-1 schema {
  optional group field_id=-1 sessions_array (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional fixed_len_byte_array(17) field_id=-1 item (Decimal(precision=40, scale=20));
      }
    }
  }
}

In [10]: meta14.schema
Out[10]: 
<pyarrow._parquet.ParquetSchema object at 0x7f542c4ac9c0>
required group field_id=-1 schema {
  optional group field_id=-1 sessions_array (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 element {     # <------------------- this is "element" and not "item"
        optional fixed_len_byte_array(17) field_id=-1 item (Decimal(precision=40, scale=20));
      }
    }
  }
}

This was a deliberate change in pyarrow to follow the Parquet spec more closely (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists), and it can be controlled by passing use_compliant_nested_type=True/False (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html).
While it is closer to the spec, maybe BigQuery does something different based on that. @matteosdocsity could you try again with that option set to False to see if that changes anything?

In the metadata I also see that the listed encodings have a different order (pyarrow 12 puts RLE_DICTIONARY first, while pyarrow 14 puts PLAIN first). Just to try to narrow down the issue, @matteosdocsity you could also try tweaking some parameters based on that (e.g. use_dictionary=False); a sketch combining both options follows below.
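
Just for concreteness, a minimal sketch of writing with both of those options changed (assuming table is the pyarrow Table from the reproduction above; trying them one at a time would narrow things down further):

import pyarrow.parquet as pq

# Write with the pre-13.0 "item" element name and without dictionary encoding.
pq.write_table(
    table,
    '/tmp/test_noncompliant.parquet',
    use_compliant_nested_type=False,  # keep the old "item" naming for list elements
    use_dictionary=False,             # rule out the encoding-order difference
)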

@mapleFU
Member

mapleFU commented Sep 3, 2024

In the metadata I also see that the listed encodings have a different order (pyarrow 12 puts RLE_DICTIONARY first, while pyarrow 14 puts PLAIN first). Just to try to narrow down the issue, @matteosdocsity you could also try tweaking some parameters based on that (e.g. use_dictionary=False)

This is in 13.0: #35758. Personally I don't think it would matter...

@matteosdocsity
Author

matteosdocsity commented Sep 4, 2024

@jorisvandenbossche it works like a charm, thanks!
The option can also be passed through Pandas' to_parquet via keyword arguments. Here's an example:

# Define kwargs
kwargs = {
    'use_compliant_nested_type': False
}

# Save DataFrame to Parquet with dynamic kwargs
df.to_parquet('output.parquet', engine='pyarrow', **kwargs)

This disables the compliant nested type naming (restoring the old "item" element name) when saving DataFrames to Parquet using Pandas and pyarrow.

@jorisvandenbossche jorisvandenbossche changed the title BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 [Parquet] BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 Sep 4, 2024
@jorisvandenbossche
Member

It's strange that BigQuery does not read that properly, though. They even have a page explicitly mentioning the expected schema for list types: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type (where they are using "element" instead of "item").

If you need to specify use_compliant_nested_type as False to have BigQuery read the data correctly, that might be worth reporting to BigQuery.

@emkornfield
Contributor

I think there are two issues here.

  1. The docs mentioned in the comment above require adding a flag to enable schema inference (enable_list_inference); otherwise the schema is copied directly from the Parquet file (i.e. with the intermediate "item"), so the new "element" field would be treated as a new field and ignored. This is intended behavior if the flag is not set (see the sketch after this list).
  2. I think there is a bug in enable_list_inference, which doesn't follow the backwards compatibility rules, specifically on the interchangeability of names.
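
A minimal sketch of setting that flag when loading with the google-cloud-bigquery Python client (the table id is a placeholder, and this assumes a reasonably recent client version):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table id.
table_id = "my-project.my_dataset.sessions"

parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True  # infer LIST columns instead of copying the raw Parquet schema

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

with open("/tmp/test.parquet", "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()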

@wgtmac
Member

wgtmac commented Sep 6, 2024

I have seen a similar issue with a Parquet file created by Hudi. There is a nested list as below:

  optional group a (LIST) {
    repeated group array (LIST) {
      repeated int32 array;
    }
  }

The C++ parquet reader infers its schema as array<struct<array:array<int>>>. The root cause is here:

// We distinguish the special case that we have
//
//   required/optional group name=whatever {
//     repeated group name=array or $SOMETHING_tuple {
//       required/optional TYPE item;
//     }
//   }

I think we need to regard them as a nested two-level list, meaning that the correct interpretation is array<array:array<int>>. Am I right? @emkornfield @mapleFU

@matteosdocsity
Author

@jorisvandenbossche I tested as you reported here, and via Terraform I forced the BigQuery table schema so that the column is a record containing a repeated record with a BIGNUMERIC field named element instead of item:

     {
        "name": "sessions_array",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
          {
            "name": "list",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "element",
                "type": "BIGNUMERIC",
                "mode": "NULLABLE"
              }
            ]
          }
        ]
      }

In combination with use_compliant_nested_type=True, the BigQuery read works.
Given the Parquet spec, I'm considering moving my BigQuery table to this schema.
