
[Parquet] BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 #43908

Closed
matteosdocsity opened this issue Sep 2, 2024 · 12 comments


@matteosdocsity

matteosdocsity commented Sep 2, 2024

Describe the bug, including details regarding any error messages, version, and platform.

When using pyarrow versions greater than 12.0.1 to write Parquet files that are then loaded into Google BigQuery, the fields containing Decimal values (used to represent BigQuery's BIGNUMERIC type) are being read as NULL by BigQuery. This issue does not occur with pyarrow==12.0.1.

Environment

  • Python Version: 3.12
  • pyarrow Version: 13.0.0 and above (issue observed)
  • Google BigQuery Version: N/A (BigQuery as the consumer of the Parquet files)
  • Operating System: macOS 14.6.1

Steps to Reproduce

  1. Create a Pandas DataFrame with a column containing lists of Decimal values.
  2. Write the DataFrame to a Parquet file using pyarrow==13.0.0 or later.
  3. Load the Parquet file into a Google BigQuery external table.
  4. Query the table in BigQuery.

Expected Behavior

BigQuery should correctly read the Decimal values from the Parquet file and populate the corresponding fields in the table.

Actual Behavior

BigQuery reads the fields corresponding to Decimal values as NULL.

Code Example

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from decimal import Decimal
from google.cloud import storage
import io

# Sample DataFrame
data = {
    'sessions_array': [
        [{'item': Decimal('12345678901234567890.12345678901234567890')}, {'item': Decimal('23456789012345678901.12345678901234567891')}],
        [{'item': Decimal('34567890123456789012.12345678901234567892')}]
    ]
}

df = pd.DataFrame(data)

# Define PyArrow schema
schema = pa.schema([
    pa.field('sessions_array', pa.list_(pa.struct([
        pa.field('item', pa.decimal256(40, 20))
    ])))
])

# Convert DataFrame to PyArrow Table with the defined schema
table = pa.Table.from_pandas(df, schema=schema)

# Write Parquet to buffer
buffer = io.BytesIO()
pq.write_table(table, buffer)
buffer.seek(0)

# Upload buffer to Google Cloud Storage or save locally
with open('/tmp/test.parquet', 'wb') as f:
    f.write(buffer.getbuffer())
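
To complete steps 3 and 4, here is a minimal sketch of exposing the file to BigQuery as an external table and querying it back. The project, dataset, table, and bucket names are placeholders, and this uses the google-cloud-bigquery client purely as an illustration of the setup, not the exact configuration used:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers, not the actual project/bucket.
table_id = "my-project.my_dataset.sessions_external"
gcs_uri = "gs://my-bucket/test.parquet"

# Define an external table over the Parquet file in GCS.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [gcs_uri]

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Query the decimal values back.
for row in client.query(f"SELECT sessions_array FROM `{table_id}`").result():
    print(row["sessions_array"])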

Notes

This issue does not occur with pyarrow==12.0.1.
The problem seems to be related to how pyarrow serializes Decimal types in Parquet files and how BigQuery interprets them.

Workaround

Using pyarrow==12.0.1 and Python<3.12 resolves the issue, but this is not ideal as it requires using an older version of the library, which may lack other features or bug fixes.

Component(s)

Python

@mapleFU
Member

mapleFU commented Sep 2, 2024

@emkornfield would you mind taking a quick glance at this?

@jorisvandenbossche
Member

@matteosdocsity I tried to reproduce this by creating the Parquet files locally (to see if there would be any noticeable difference between a file written by pyarrow 12 vs 13), but I get an error running your code to create the pyarrow table: ArrowInvalid: Rescaling Decimal128 value would cause data loss

To demonstrate with just the first Decimal value:

>>> pa.array([Decimal('12345678901234567890.12345678901234567890')], type=pa.decimal128(38, 18))
...
ArrowInvalid: Rescaling Decimal128 value would cause data loss

If I don't specify a specific decimal type but let it be inferred, I get the following type:

>>> pa.array([Decimal('12345678901234567890.12345678901234567890')]).type
Decimal256Type(decimal256(40, 20))

@matteosdocsity
Author

matteosdocsity commented Sep 3, 2024

@jorisvandenbossche yes, sorry, my mistake: it's pa.decimal256(40, 20). I've updated the code.

@jorisvandenbossche
Member

Thanks. One more question: you are using a nested data type (the decimals are in a struct in a list type). Do you also see the issue with non-nested decimals? Or only specifically with this nesting?

@matteosdocsity
Author

Yes @jorisvandenbossche, exactly, it's an array of decimals.

@jorisvandenbossche
Member

Comparing the two files (written with pyarrow 12.0.1 and with 14.0.2), the main difference I notice is the different name used for the list element. Comparing both Parquet schemas:

In [9]: meta12.schema
Out[9]: 
<pyarrow._parquet.ParquetSchema object at 0x7f542c306800>
required group field_id=-1 schema {
  optional group field_id=-1 sessions_array (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional fixed_len_byte_array(17) field_id=-1 item (Decimal(precision=40, scale=20));
      }
    }
  }
}

In [10]: meta14.schema
Out[10]: 
<pyarrow._parquet.ParquetSchema object at 0x7f542c4ac9c0>
required group field_id=-1 schema {
  optional group field_id=-1 sessions_array (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 element {     # <------------------- this is "element" and not "item"
        optional fixed_len_byte_array(17) field_id=-1 item (Decimal(precision=40, scale=20));
      }
    }
  }
}

This was a deliberate change in pyarrow to follow the Parquet spec more closely (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists), and it can be controlled by passing use_compliant_nested_type=True/False (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html).
While it is closer to the spec, maybe BigQuery does something different based on that. @matteosdocsity could you try again with that option set to False to see if that changes anything?

In the metadata I also see that the listed encodings have a different order (pyarrow 12 puts RLE_DICTIONARY first, while pyarrow 14 puts PLAIN first). Just to try to narrow down the issue, @matteosdocsity you could also try tweaking some parameters based on that (e.g. use_dictionary=False); a sketch combining both options follows below.
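
Just for concreteness, a minimal sketch of writing with both of those options changed (assuming table is the pyarrow Table from the reproduction above; trying them one at a time would narrow things down further):

import pyarrow.parquet as pq

# Write with the pre-13.0 "item" element name and without dictionary encoding.
pq.write_table(
    table,
    '/tmp/test_noncompliant.parquet',
    use_compliant_nested_type=False,  # keep the old "item" naming for list elements
    use_dictionary=False,             # rule out the encoding-order difference
)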

@mapleFU
Member

mapleFU commented Sep 3, 2024

In the metadata I also see that the listed encodings have a different order (pyarrow 12 puts RLE_DICTIONARY first, while pyarrow 14 puts PLAIN first). Just to try to narrow down the issue, @matteosdocsity you could also try tweaking some parameters based on that (e.g. use_dictionary=False)

This is in 13.0: #35758. Personally I don't think it would matter...

@matteosdocsity
Author

matteosdocsity commented Sep 4, 2024

@jorisvandenbossche it works like a charm, thanks!
The option can also be passed through Pandas' to_parquet via keyword arguments. Here's an example:

# Define kwargs
kwargs = {
    'use_compliant_nested_type': False
}

# Save DataFrame to Parquet with dynamic kwargs
df.to_parquet('output.parquet', engine='pyarrow', **kwargs)

This disables the compliant nested type naming (restoring the old "item" element name) when saving DataFrames to Parquet using Pandas and pyarrow.

@jorisvandenbossche jorisvandenbossche changed the title BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 [Parquet] BigQuery Reads Null Values from Parquet Files Generated with pyarrow Versions > 12.0.1 Sep 4, 2024
@jorisvandenbossche
Member

It's strange that BigQuery does not read that properly, though. They even have a page explicitly mentioning the expected schema for list types: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type (where they are using "element" instead of "item").

If you need to specify use_compliant_nested_type as False to have BigQuery read the data correctly, that might be worth reporting to BigQuery.

@emkornfield
Contributor

I think there are two issues here.

  1. The docs mentioned in the comment above require adding a flag to enable schema inference (enable_list_inference); otherwise the schema is copied directly from the Parquet file (i.e. with the intermediate "item"), so the new "element" field would be treated as a new field and ignored. This is intended behavior if the flag is not set (see the sketch after this list).
  2. I think there is a bug in enable_list_inference, which doesn't follow the backwards compatibility rules, specifically on the interchangeability of names.
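
A minimal sketch of setting that flag when loading with the google-cloud-bigquery Python client (the table id is a placeholder, and this assumes a reasonably recent client version):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table id.
table_id = "my-project.my_dataset.sessions"

parquet_options = bigquery.ParquetOptions()
parquet_options.enable_list_inference = True  # infer LIST columns instead of copying the raw Parquet schema

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)

with open("/tmp/test.parquet", "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()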

@wgtmac
Member

wgtmac commented Sep 6, 2024

I have seen a similar issue with a Parquet file created by Hudi. There is a nested list as below:

  optional group a (LIST) {
    repeated group array (LIST) {
      repeated int32 array;
    }
  }

The C++ parquet reader infers its schema as array<struct<array:array<int>>>. The root cause is here:

// We distinguish the special case that we have
//
//   required/optional group name=whatever {
//     repeated group name=array or $SOMETHING_tuple {
//       required/optional TYPE item;
//     }
//   }

I think we need to regard them as a nested two-level list, meaning that the correct interpretation is array<array:array<int>>. Am I right? @emkornfield @mapleFU

@matteosdocsity
Author

@jorisvandenbossche I tested as you reported here, and via Terraform I forced the BigQuery table schema so that the column is a record containing a repeated record with a BIGNUMERIC field named element instead of item:

     {
        "name": "sessions_array",
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
          {
            "name": "list",
            "type": "RECORD",
            "mode": "REPEATED",
            "fields": [
              {
                "name": "element",
                "type": "BIGNUMERIC",
                "mode": "NULLABLE"
              }
            ]
          }
        ]
      }

In combination with use_compliant_nested_type=True, the BigQuery read works.
Given the Parquet spec, I'm considering moving my BigQuery table to this schema.
