
Deleting with condition in partitioned tables is buggy #1044

Closed
panbamid-r opened this issue Aug 12, 2024 · 1 comment · Fixed by #1046

panbamid-r commented Aug 12, 2024

Apache Iceberg version

0.7.0 (latest release)

Please describe the bug 🐞

TL;DR: Deleting rows from a partitioned Iceberg table based on a condition yields incorrect results and deletes more records than expected.

I encountered the following issue while experimenting with pyiceberg and trying to delete records from a partitioned table based on a condition. I am using the latest release of pyiceberg (0.7.0) with the Glue catalog.
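
The catalog object used in the snippets below isn't shown here; with the Glue catalog it would be loaded roughly as follows (a sketch: the catalog name is arbitrary and AWS credentials/region are assumed to come from the environment):

from pyiceberg.catalog import load_catalog

# "glue" is just a catalog name; type="glue" selects the AWS Glue catalog implementation
catalog = load_catalog("glue", **{"type": "glue"})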

I have a pandas DataFrame with 100 records, populated as follows (a generation sketch follows the list):

  • id: random integer between 1 and 10,000
  • created_at: random date between 2024-01-01 and 2024-02-01
  • relevancy_score: random value between 0 and 1, drawn from a beta distribution with alpha=2, beta=20
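
The notebook itself isn't reproduced here, but a minimal sketch of how such a DataFrame could be generated (the seed and variable names are mine) is:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 100

# Candidate days between 2024-01-01 and 2024-02-01 (inclusive)
days = pd.date_range("2024-01-01", "2024-02-01").to_numpy()

df = pd.DataFrame({
    "id": rng.integers(1, 10_001, size=n),        # random ids in [1, 10000]
    "created_at": rng.choice(days, size=n),       # random day per record
    "relevancy_score": rng.beta(2, 20, size=n),   # Beta(alpha=2, beta=20) scores in (0, 1)
})

# Iceberg's DateType maps to an Arrow date type, so keep plain Python dates
df["created_at"] = pd.to_datetime(df["created_at"]).dt.date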

Having created the Iceberg table:

from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import DateType, DoubleType, LongType, NestedField

schema = Schema(
    NestedField(field_id=101, name="id", field_type=LongType(), required=True),
    NestedField(field_id=103, name="created_at", field_type=DateType(), required=False),
    NestedField(field_id=104, name="relevancy_score", field_type=DoubleType(), required=False),
)

partition_spec = PartitionSpec(
    PartitionField(
        source_id=103, field_id=2000, transform=DayTransform(), name="created_at_day"
    )
)

catalog.create_table(
    identifier=f"{glue_database_name}.{table_name}",
    schema=schema,
    partition_spec=partition_spec,
)

I append my df to the table:

import pyarrow as pa

# iceberg_table is the table object, e.g. obtained via catalog.load_table(f"{glue_database_name}.{table_name}")
arrow_schema = iceberg_table.schema().as_arrow()
docs_table = pa.Table.from_pandas(df, schema=arrow_schema)

# Append the data to the Iceberg table
iceberg_table.append(docs_table)

and I get the following info:

from pyiceberg.expressions import GreaterThanOrEqual, LessThan

total_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after append: {total_records}")

lower_before = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 before deletion: {lower_before}")

greater_before = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 before deletion: {greater_before}")

Total records after append: (100, 3)
Records with relevancy_score < 0.1 before deletion: (64, 3)
Records with relevancy_score >= 0.1 before deletion: (36, 3)

Then I run the deletion:

delete_condition = GreaterThanOrEqual("relevancy_score", 0.1)
iceberg_table.delete(delete_condition)

and the results are quite unexpected:

# Verify the total number of records in the table after deletion
remaining_records = iceberg_table.scan().to_pandas().shape
print(f"Total records after deletion: {remaining_records}")

# Verify the number of records that meet the relevancy_score condition after deletion
lower_after = iceberg_table.scan(row_filter=LessThan("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score < 0.1 after deletion: {lower_after}")

greater_after = iceberg_table.scan(row_filter=GreaterThanOrEqual("relevancy_score", 0.1)).to_pandas().shape
print(f"Records with relevancy_score >= 0.1 after deletion: {greater_after}")

Total records after deletion: (26, 3)
Records with relevancy_score < 0.1 after deletion: (26, 3)
Records with relevancy_score >= 0.1 after deletion: (0, 3)

After inspecting the manifest files post-deletion, what seems to be happening is:

  • The snapshot correctly keeps the data files/days that contain no records with relevancy_score >= 0.1
  • The snapshot correctly marks as deleted the data files/days that contain no records with relevancy_score < 0.1
  • The snapshot seems to incorrectly ignore the data files/days that contain records both with relevancy_score >= 0.1 and with relevancy_score < 0.1: these files are dropped entirely instead of being rewritten with only the surviving records

For example, let's consider the following slice of the dataset:

id created_at relevancy_score
808 2024-01-02 0.18115229995825877
1424 2024-01-02 0.13995060752152988
8026 2024-01-02 0.05992063209461162

The post-deletion manifest .avro files do not reference any data files for the 2024-01-02 partition at all, even though the record with id=8026 (relevancy_score < 0.1) should have been kept.
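
For reference, one way to see which data files (and thus which partition days) each manifest entry references is pyiceberg's entries metadata table (a sketch, assuming inspect.entries() as available in 0.7.x; the exact column layout may differ by version):

# Each row is a manifest entry of the current snapshot;
# status is 0 = EXISTING, 1 = ADDED, 2 = DELETED
entries = iceberg_table.inspect.entries().to_pandas()

for _, row in entries.iterrows():
    data_file = row["data_file"]  # struct column, converted to a dict by to_pandas()
    print(row["status"], data_file["partition"], data_file["file_path"])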

I'm attaching an accompanying notebook which I used to reproduce the issue. I did not encounter a similar issue when working with a non-partitioned table.
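
For comparison, a sketch of the same flow against an otherwise identical but unpartitioned table (the table name suffix is mine), which behaves as expected:

# Same schema, but no partition spec
unpartitioned = catalog.create_table(
    identifier=f"{glue_database_name}.{table_name}_unpartitioned",
    schema=schema,
)
unpartitioned.append(docs_table)
unpartitioned.delete(GreaterThanOrEqual("relevancy_score", 0.1))

# Only the relevancy_score < 0.1 records remain, as expected
print(unpartitioned.scan().to_pandas().shape)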


Fokko commented Aug 12, 2024

@panbamid-r Thanks for raising this and for the comprehensive example. I'm able to reproduce it locally; let me dig into this right away.
