
[Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec #936

Open
HonahX opened this issue Jul 16, 2024 · 1 comment


HonahX commented Jul 16, 2024

Apache Iceberg version

main (development)

Please describe the bug 🐞

According to the Parquet data type mappings in the spec, DecimalType should map to INT32 when precision <= 9, INT64 when precision <= 18, and FIXED_LEN_BYTE_ARRAY otherwise.
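For reference, the spec's precision-based mapping can be sketched as a small helper (the function name is mine, not part of pyiceberg):

```python
def decimal_physical_type(precision: int) -> str:
    """Parquet physical type for a decimal of the given precision,
    per the Iceberg spec's Parquet data type mapping."""
    if precision <= 9:
        return "INT32"   # unscaled value fits in 4 bytes
    if precision <= 18:
        return "INT64"   # unscaled value fits in 8 bytes
    return "FIXED_LEN_BYTE_ARRAY"

# DecimalType(7, 0) from the repro below should therefore map to INT32.
print(decimal_physical_type(7))   # INT32
print(decimal_physical_type(18))  # INT64
print(decimal_physical_type(19))  # FIXED_LEN_BYTE_ARRAY
```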

However, PyArrow currently writes all decimal types as FIXED_LEN_BYTE_ARRAY in Parquet. This may not be a big issue since the logical type is correct, and fixing it may require upstream support:

Updated: thanks @syun64 for providing the link to the upstream PR that fixes this.

Simple test:

from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import DecimalType, NestedField
import pyarrow as pa

rest_catalog = load_catalog(
    "rest",
    **{
        ...
    },
)


decimal_schema = Schema(NestedField(1, "decimal", DecimalType(7, 0)))
decimal_arrow_schema = pa.schema(
    [
        ("decimal", pa.decimal128(7, 0)),
    ]
)

decimal_arrow_table = pa.Table.from_pylist(
    [
        {
            "decimal": 123,
        }
    ],
    schema=decimal_arrow_schema,
)

tbl = rest_catalog.create_table(
    "pyiceberg_test.test_decimal_type", schema=decimal_arrow_schema
)

tbl.append(decimal_arrow_table)
> parquet-tools inspect 00000-0-bff20a80-0e80-4b53-ba35-2c94498fa507.parquet

############ file meta data ############
created_by: parquet-cpp-arrow version 16.1.0
num_columns: 1
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 465


############ Columns ############
decimal

############ Column(decimal) ############
name: decimal
path: decimal
max_definition_level: 1
max_repetition_level: 0
physical_type: FIXED_LEN_BYTE_ARRAY
logical_type: Decimal(precision=7, scale=0)
converted_type (legacy): DECIMAL
compression: ZSTD (space_saved: -25%)
@HonahX HonahX changed the title [Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec [Spec][Upstream] Mapping from DecimalType to Parquet physical type not aligned with spec Jul 16, 2024

sungwy commented Jul 16, 2024

Hi @HonahX thank you for raising this issue! Having worked on type casting PRs recently, this one piqued my interest...

It looks like there was a PR merged recently that exposed the feature from C++ Arrow to the Python bindings through a flag store_decimal_as_integer, which was released in version 17.0.0:

>>> import pyarrow as pa
>>> schema = pa.schema([pa.field('decimal', pa.decimal128(precision=2))])
>>> table = pa.Table.from_pydict({"decimal": [1.1, 2.2]})
>>> table.cast(schema)
pyarrow.Table
decimal: decimal128(2, 0)
----
decimal: [[1,2]]
>>> import pyarrow.parquet as pq
>>> pq.write_table(table.cast(schema), "test.parquet", store_decimal_as_integer=True)
>>> pq.read_metadata("test.parquet")
<pyarrow._parquet.FileMetaData object at 0x730658de3600>
  created_by: parquet-cpp-arrow version 17.0.0
  num_columns: 1
  num_rows: 2
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 377
>>> pq.read_metadata("test.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x730656702200>
required group field_id=-1 schema {
  optional int32 field_id=-1 decimal (Decimal(precision=2, scale=0));
}
I just checked using the latest release, 17.0.0, and I've confirmed that the Parquet physical types are being written as integers.
