
Pyarrow IO property for configuring large v small types on read #986

Merged — 11 commits merged into apache:main on Aug 7, 2024

Conversation

sungwy
Collaborator

@sungwy sungwy commented Jul 31, 2024

This addresses the issue raised in the formal proposal in the Google Doc.

The current behavior of always casting to large types causes an explosion in RSS memory usage, as highlighted in the benchmark discussed in the documentation.

@sungwy sungwy requested a review from Fokko July 31, 2024 18:43
@sungwy
Collaborator Author

sungwy commented Jul 31, 2024

Once approved/merged, I'd like to bring this up on the discussion thread to add this item to the 0.7.1 patch release as well. It's a small feature, and it would help alleviate the memory issues we are running into (I expect it would for other users as well).

Contributor

@kevinjqliu kevinjqliu left a comment


Thanks for working on this, I left a few comments!

@@ -80,6 +80,7 @@
GCS_ENDPOINT = "gcs.endpoint"
GCS_DEFAULT_LOCATION = "gcs.default-bucket-location"
GCS_VERSION_AWARE = "gcs.version-aware"
PYARROW_USE_LARGE_TYPES_ON_READ = "pyarrow.use-large-types-on-read"
Contributor


nit: if this is a PyArrow-specific setting, let's move it to the pyarrow file

Collaborator Author


I thought of that, but decided to leave it here because I liked having the FileIO properties together in one place. WDYT?

Contributor


Pyarrow is one of the FileIO implementations, and this setting is specifically for Pyarrow. In the future, when we add more FileIO implementations, such as the Rust one, it'll be good to have a clear separation between the FileIO settings.

Contributor


I think it also makes more sense to move this inside of the Arrow file.

Collaborator Author

@sungwy sungwy Aug 7, 2024


Thanks @Fokko and @kevinjqliu - I'll keep this in mind the next time I touch these files 🙂

@@ -1146,6 +1152,31 @@ def primitive(self, primitive: pa.DataType) -> pa.DataType:
return primitive


class _ConvertToSmallTypes(PyArrowSchemaVisitor[Union[pa.DataType, pa.Schema]]):
def schema(self, schema: pa.Schema, struct_result: pa.StructType) -> pa.Schema:
Contributor


nit: looks like this is the same function definition as the one in _ConvertToLargeTypes, as with other functions here.
Perhaps abstract into a common class and extend/override specific functions.

Collaborator Author


I thought of that, but I didn't like naming one as _ConvertToLargeTypes and then having an arg like reverse: bool.

Contributor


I was thinking something with inheritance like:
_ConvertToArrowTypes
_ConvertToLargeTypes(_ConvertToArrowTypes)
_ConvertToSmallTypes(_ConvertToArrowTypes)

@@ -596,6 +597,11 @@ def test_pyarrow_schema_ensure_large_types(pyarrow_schema_nested_without_ids: pa
assert _pyarrow_schema_ensure_large_types(pyarrow_schema_nested_without_ids) == expected_schema


def test_pyarrow_schema_ensure_small_types(pyarrow_schema_nested_without_ids: pa.Schema) -> None:
schema_with_large_types = _pyarrow_schema_ensure_small_types(pyarrow_schema_nested_without_ids)
Contributor


nit: name is large_type, function is small_type

what is this function testing for?

Collaborator Author


This was for testing the roundtrip conversion - fixed it to use the correct function

Collaborator Author

@sungwy sungwy left a comment


Thank you for the review feedback @kevinjqliu ! Adopted most of the feedback and left some comments for the others.


Contributor

@kevinjqliu kevinjqliu left a comment


LGTM!


@sungwy
Collaborator Author

sungwy commented Aug 2, 2024

Thanks for the review @kevinjqliu ! Just updated it to make use of @ndrluis's cleaned-up function property_as_bool.
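For context, here is a rough sketch of the semantics a helper like property_as_bool provides (this is a hypothetical re-implementation for illustration only; pyiceberg's actual function may parse values differently):

```python
from typing import Dict

def property_as_bool(properties: Dict[str, str], key: str, default: bool) -> bool:
    """Read a string-valued IO property as a boolean, falling back to a default.

    Hypothetical sketch: treats (case-insensitive) "true" as True and any
    other present value as False.
    """
    value = properties.get(key)
    if value is None:
        return default
    return value.lower() == "true"

props = {"pyarrow.use-large-types-on-read": "False"}
print(property_as_bool(props, "pyarrow.use-large-types-on-read", True))  # False
print(property_as_bool({}, "pyarrow.use-large-types-on-read", True))     # True
```
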

Contributor

@kevinjqliu kevinjqliu left a comment


LGTM!

Contributor

@kevinjqliu kevinjqliu left a comment


LGTM!

@HonahX
Contributor

HonahX commented Aug 5, 2024

@sungwy Thanks for working on this!

It seems we also need to update schema_to_pyarrow/_cast_if_needed to honor the new property. Otherwise

def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array:
file_field = self._file_schema.find_field(field.field_id)
if field.field_type.is_primitive:
if field.field_type != file_field.field_type:
return values.cast(
schema_to_pyarrow(promote(file_field.field_type, field.field_type), include_field_ids=self._include_field_ids)
)

If we have a type promotion from string to binary, schema_to_pyarrow will convert BinaryType() to pa.large_binary.

Example to reproduce:

@pytest.mark.integration
@pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")])
def test_table_scan_override_with_small_types(catalog: Catalog) -> None:
    identifier = "default.test_table_scan_override_with_small_types"
    arrow_table = pa.Table.from_arrays(
        [pa.array(["a", "b", "c"]), pa.array([b"a", b"b", b"c"]), pa.array([["a", "b"], ["c", "d"], ["e", "f"]])],
        names=["string", "binary", "list"],
    )
    try:
        catalog.drop_table(identifier)
    except NoSuchTableError:
        pass

    tbl = catalog.create_table(
        identifier,
        schema=arrow_table.schema,
    )

    tbl.append(arrow_table)

    with tbl.update_schema() as update_schema:
        update_schema.update_column("string", BinaryType())

    tbl.io.properties[PYARROW_USE_LARGE_TYPES_ON_READ] = "False"
    result_table = tbl.scan().to_arrow()

    expected_schema = pa.schema([
        pa.field("string", pa.large_binary()), # should be pa.binary()
        pa.field("binary", pa.binary()),
        pa.field("list", pa.list_(pa.string())),
    ])
    assert result_table.schema.equals(expected_schema)

##### result_table.schema #####
string: large_binary
binary: binary
list: list<element: string>
  child 0, element: string

@sungwy
Collaborator Author

sungwy commented Aug 5, 2024

@sungwy Thanks for working on this!

It seems we also need to update schema_to_pyarrow/_cast_if_needed to honor the new property.

Thanks @HonahX ! I've updated the code in order to accommodate this edge case.

@sungwy sungwy added this to the PyIceberg 0.8.0 release milestone Aug 6, 2024

@@ -1303,6 +1345,8 @@ def project_table(
# When FsSpec is not installed
raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}") from e

use_large_types = property_as_bool(io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, True)
Contributor


This is the only part I don't like: we now force the table to use large or normal types. When we read record batches I agree that we need to force the schema, but for the table, we have to read all the footers anyway.

Once #929 goes in, I think we still need to change that, but let's defer that question for now.

@Fokko Fokko merged commit 8aeab49 into apache:main Aug 7, 2024
7 checks passed
@sungwy sungwy deleted the small-types-option branch August 7, 2024 12:39
@fusion2222

Does anyone know when this will be released?

@sungwy
Collaborator Author

sungwy commented Aug 12, 2024

Hi @fusion2222 - This will be released with 0.8.0, which will be a few months away (roughly 1~3 months)
