
Regression in 0.7.0 due to type coercion from "string" to "large_string" #1128

Open
maxfirman opened this issue Sep 3, 2024 · 5 comments
Labels: bug (Something isn't working)

@maxfirman

Apache Iceberg version

0.7.0

Please describe the bug 🐞

There is a regression introduced in version 0.7.0 where Arrow tables written with a "string" data type get cast to "large_string" when read back from Iceberg.

The code below reproduces the bug. The assertion succeeds in v0.6.1 but fails in v0.7.0 because the schema is changed from "string" to "large_string".

from tempfile import TemporaryDirectory

import pyarrow
from pyiceberg.catalog.sql import SqlCatalog


def main():
    with TemporaryDirectory() as warehouse_path:
        catalog = SqlCatalog(
            "default",
            **{
                "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
                "warehouse": f"file://{warehouse_path}",
            },
        )

        catalog.create_namespace("default")

        schema = pyarrow.schema(
            [
                pyarrow.field("foo", pyarrow.string(), nullable=True),
            ]
        )

        df = pyarrow.table(data={"foo": ["bar"]}, schema=schema)

        table = catalog.create_table(
            "default.test_table",
            schema=df.schema,
        )

        table.append(df)

        # read the arrow table back from Iceberg
        df2 = table.scan().to_arrow()

        # this assert succeeds with 0.6.1, but fails with 0.7.0 because the column type
        # has changed from "string" to "large_string"
        assert df.equals(df2)


if __name__ == "__main__":
    main()
@kevinjqliu
Contributor

To summarize, given a table created with string type schema and written to with string type data, reading the table back returns pyarrow dataframe with large_string type.

Expected: return pyarrow dataframe with string type, matching the table schema

Confirmed the above issue on main branch.

@kevinjqliu
Contributor

The issue above was mentioned in #986 (comment)

On read, pyarrow will use large type as default. It is controlled by this table property (courtesy of #986)

    @property
    def _use_large_types(self) -> bool:
        """Whether to represent data as large arrow types.

        Defaults to True.
        """
        return property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, True)

@kevinjqliu kevinjqliu added the bug Something isn't working label Sep 4, 2024
@kevinjqliu
Contributor

As a workaround, you can manually set the table property to force the read path to use the string type

        from pyiceberg.io import PYARROW_USE_LARGE_TYPES_ON_READ
        table = catalog.create_table(
            "default.test_table",
            schema=df.schema,
            properties={PYARROW_USE_LARGE_TYPES_ON_READ: False}
        )

@maxfirman
Author

Thanks @kevinjqliu. I can confirm that the workaround resolves the problem when using latest main branch but not v0.7.0 or v0.7.1.

Setting PYARROW_USE_LARGE_TYPES_ON_READ=False will cause the test to fail the other way around, i.e. a pyarrow table written with a large_string will be read back as a string. I'm guessing this is just a fundamental limitation in that Iceberg only has one string type.

I would be tempted to change the default value of PYARROW_USE_LARGE_TYPES_ON_READ to False, as I would consider pyarrow string the more commonly used type compared to large_string. This would also preserve backwards compatibility with pyiceberg <0.7.0.

A further improvement would be to write some kind of type hint into the iceberg metadata that would tell pyiceberg whether the string column was supposed to be interpreted as a pyarrow large_string.

@kevinjqliu
Contributor

I'm guessing this is just a fundamental limitation in that Iceberg only has one string type.

Yea, there's a separation between the Iceberg type and the Arrow/Parquet on-disk type. Iceberg has one string type; Arrow has two.
On Iceberg table write, an Iceberg string type can be written to disk as either the Arrow large type or the normal type.
On Iceberg table read, the Iceberg string type should be read back as either the Arrow large type or the normal type, based on the on-disk schema.

The problem here is that PYARROW_USE_LARGE_TYPES_ON_READ defaults to True, so in the scenario where an Iceberg string type is written as the normal string type on disk, it is still read back as the large string type.

Perhaps, instead of setting PYARROW_USE_LARGE_TYPES_ON_READ to True, we can leave it unset by default, which will then use the on-disk representation.

cc @sungwy / @Fokko
I think #929 will help resolve this issue (based on this comment #929 (comment))
