
Deleting with condition in partitioned tables is buggy #1044

Closed
panbamid-r opened this issue Aug 12, 2024 · 1 comment · Fixed by #1046

panbamid-r commented Aug 12, 2024

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

TL;DR: Deleting rows from a partitioned Iceberg table based on a condition yields incorrect results and deletes more records than expected.

I encountered the following issue while experimenting with pyiceberg and trying to delete records from a partitioned table based on a condition. I am using the latest release of pyiceberg (0.7.0) with the Glue catalog.
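
The catalog object used in the snippets below isn't shown here; with the Glue catalog it would be loaded roughly as follows (a sketch: the catalog name is arbitrary and AWS credentials/region are assumed to come from the environment):

from pyiceberg.catalog import load_catalog

# "glue" is just a catalog name; type="glue" selects the AWS Glue catalog implementation
catalog = load_catalog("glue", **{"type": "glue"})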

I have a pandas DataFrame with 100 records, populated as follows (a generation sketch follows the list):

  • id: random integer between 1 and 10,000
  • created_at: random date between 2024-01-01 and 2024-02-01
  • relevancy_score: random value between 0 and 1, drawn from a beta distribution with alpha=2, beta=20
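
The notebook itself isn't reproduced here, but a minimal sketch of how such a DataFrame could be generated (the seed and variable names are mine) is:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 100

# Candidate days between 2024-01-01 and 2024-02-01 (inclusive)
days = pd.date_range("2024-01-01", "2024-02-01").to_numpy()

df = pd.DataFrame({
    "id": rng.integers(1, 10_001, size=n),        # random ids in [1, 10000]
    "created_at": rng.choice(days, size=n),       # random day per record
    "relevancy_score": rng.beta(2, 20, size=n),   # Beta(alpha=2, beta=20) scores in (0, 1)
})

# Iceberg's DateType maps to an Arrow date type, so keep plain Python dates
df["created_at"] = pd.to_datetime(df["created_at"]).dt.date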

Having created the Iceberg table:

from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import DateType, DoubleType, LongType, NestedField

schema = Schema(
    NestedField(field_id=101, name="id", field_type=LongType(), required=True),
    NestedField(field_id=103, name="created_at", field_type=DateType(), required=False),
    NestedField(field_id=104, name="relevancy_score", field_type=DoubleType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(
        source_id=103, field_id=2000, transform=DayTransform(), name="created_at_day"
    )
)

catalog.create_table(
    identifier=f"{glue_database_name}.{table_name}",
    schema=schema,
    partition_spec=partition_spec,
)

I append my df to the table:

import pyarrow as pa

# iceberg_table is the table object, e.g. obtained via catalog.load_table(f"{glue_database_name}.{table_name}")
arrow_schema = iceberg_table.schema().as_arrow()
docs_table = pa.Table.from_pandas(df, schema=arrow_schema)

# Append the data to the Iceberg table
iceberg_table.append(docs_table)

and I get the following info:

from pyiceberg.expressions import GreaterThanOrEqual, LessThan

total_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after append: {total_records}")

lower_before = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 before deletion: {lower_before}")

greater_before = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 before deletion: {greater_before}")

Total records after append: (100, 3)
Records with relevancy_score < 0.1 before deletion: (64, 3)
Records with relevancy_score >= 0.1 before deletion: (36, 3)

Then I run the deletion:

delete_condition = GreaterThanOrEqual("relevancy_score", 0.1)
iceberg_table.delete(delete_condition)

and the results are quite unexpected:

# Verify the total number of records in the table after deletion
remaining_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after deletion: {remaining_records}")

# Verify the number of records that meet the relevancy_score condition after deletion
lower_after = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 after deletion: {lower_after}")

greater_after = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 after deletion: {greater_after}")

Total records after deletion: (26, 3)
Records with relevancy_score < 0.1 after deletion: (26, 3)
Records with relevancy_score >= 0.1 after deletion: (0, 3)

After inspecting the manifest files post-deletion, what seems to be happening is:

  • The snapshot correctly keeps the data files/days that contain no records with relevancy_score >= 0.1
  • The snapshot correctly marks as deleted the data files/days that contain no records with relevancy_score < 0.1
  • The snapshot seems to incorrectly ignore the data files/days that contain records both with relevancy_score >= 0.1 and with relevancy_score < 0.1: these files are dropped entirely instead of being rewritten with only the surviving records

For example, let's consider the following slice of the dataset:

id created_at relevancy_score
808 2024-01-02 0.18115229995825877
1424 2024-01-02 0.13995060752152988
8026 2024-01-02 0.05992063209461162

The post-deletion manifest .avro files do not reference any data files for the 2024-01-02 partition at all, even though the record with id=8026 (relevancy_score < 0.1) should have been kept.
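
For reference, one way to see which data files (and thus which partition days) each manifest entry references is pyiceberg's entries metadata table (a sketch, assuming inspect.entries() as available in 0.7.x; the exact column layout may differ by version):

# Each row is a manifest entry of the current snapshot;
# status is 0 = EXISTING, 1 = ADDED, 2 = DELETED
entries = iceberg_table.inspect.entries().to_pandas()

for _, row in entries.iterrows():
    data_file = row["data_file"]  # struct column, converted to a dict by to_pandas()
    print(row["status"], data_file["partition"], data_file["file_path"])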

I'm attaching an accompanying notebook which I used to reproduce the issue. I did not encounter a similar issue when working with a non-partitioned table.
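
For comparison, a sketch of the same flow against an otherwise identical but unpartitioned table (the table name suffix is mine), which behaves as expected:

# Same schema, but no partition spec
unpartitioned = catalog.create_table(
    identifier=f"{glue_database_name}.{table_name}_unpartitioned",
    schema=schema,
)
unpartitioned.append(docs_table)
unpartitioned.delete(GreaterThanOrEqual("relevancy_score", 0.1))

# Only the relevancy_score < 0.1 records remain, as expected
print(unpartitioned.scan().to_pandas().shape)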


Fokko commented Aug 12, 2024

@panbamid-r Thanks for raising this and for the comprehensive example. I'm able to reproduce it locally; let me dig into this right away.
