
[BUG] delta_lake_delete_test.py failed assertion [DATAGEN_SEED=1701225104, IGNORE_ORDER... #9884

Closed
pxLi opened this issue Nov 29, 2023 · 3 comments
Labels
bug Something isn't working test Only impacts tests

Comments


pxLi commented Nov 29, 2023

Describe the bug
first seen in rapids_integration-dev-github, build ID 863 (jdk11 runtime + spark 330)

[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-True][DATAGEN_SEED=1701225104, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":537,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":541,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"þ\x7f³´ï\\u0014Ê1\\u000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}
[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_dataframe_api[None-False][DATAGEN_SEED=1701225104, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":443,"minValues":{"a":0,"b":"a","c":"\\u0001\\tÛá®oÔhÌ\\u0010yÎCÎô\\u0012\x94³EB\x82>à\\u00062%.o\\u001D¥"},"maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":15}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":447,"minValues":{"a":0,"b":"a","c":"\\u0001saD\\u000FzkÍÄ|¤ó_\x90ÑO¹¸åßÔ\x82|µ~\x91òbVå"},"maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":18}}'}
[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows[None-False][DATAGEN_SEED=1701225104, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000Ò@\\u0010Ò>-ªUË\\u0017\\u0001ð\\u0007Æ9blðl@æaB\\"|ÉùÐõ"},"maxValues":{"a":4,"b":"g","c":"ÿ) å.>^\\fc\\u0012Gl¥èÞþÎÌ>\\f²Óý¤\x9cÕ\\u0007¸\x99O"},"nullCount":{"a":0,"b":0,"c":21}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000Ò@\\u0010Ò>-ªUË\\u0017\\u0001ð\\u0007Æ9blðl@æaB\\"|ÉùÐõ"},"maxValues":{"a":4,"b":"g","c":"þë\\u0014øgb\\u001B\x88¢¢À\\u0006Ù\x9aõf\\u0013m^è³iöÈ?\\b¯v\\u0000Ù"},"nullCount":{"a":0,"b":0,"c":18}}'}
[2023-11-29T02:56:13.494Z] = 3 failed, 276 passed, 47 skipped, 4 xfailed, 40 warnings in 1468.57s (0:24:28) =

mismatched CPU and GPU output:

[2023-11-29T02:56:13.489Z] E           AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-11-29T02:56:13.489Z] E           CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":537,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}
[2023-11-29T02:56:13.489Z] E           GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":541,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"þ\x7f³´ï\\u0014Ê1\\u000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}

Additional traceback details:

[2023-11-29T02:56:13.488Z] filename = '00000000000000000002.json'
[2023-11-29T02:56:13.488Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":537,"minValu..."ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}}
[2023-11-29T02:56:13.488Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":541,"minValu...000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}}


[2023-11-29T02:56:13.490Z] filename = '00000000000000000001.json'
[2023-11-29T02:56:13.490Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":443,"minValu...maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":15}}'}}
[2023-11-29T02:56:13.490Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":447,"minValu...maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":18}}'}}


[2023-11-29T02:56:13.492Z] filename = '00000000000000000001.json'
[2023-11-29T02:56:13.492Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":1020,"minVal...Values":{"a":4,"b":"g","c":"ÿ) å.>^\\fc\\u0012Gl¥èÞþÎÌ>\\f²Óý¤\x9cÕ\\u0007¸\x99O"},"nullCount":{"a":0,"b":0,"c":21}}'}}
[2023-11-29T02:56:13.492Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":1020,"minVal...b":"g","c":"þë\\u0014øgb\\u001B\x88¢¢À\\u0006Ù\x9aõf\\u0013m^è³iöÈ?\\b¯v\\u0000Ù"},"nullCount":{"a":0,"b":0,"c":18}}'}}

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify test Only impacts tests labels Nov 29, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 5, 2023

jlowe commented Dec 6, 2023

I cannot reproduce this, and AFAIK it has not been seen since it occurred. My guess is that somehow the data was spread across tasks differently between the two runs: the data was verified to be correct (i.e., all rows were accounted for, ignoring ordering), yet the subsequent metadata equality check failed.


jlowe commented Dec 8, 2023

I found a way to reliably reproduce this:

SPARK_SUBMIT_FLAGS="--master local[2]" TEST_PARALLEL=0 \
SPARK_HOME=/home/jlowe/spark-3.2.1-bin-hadoop3.2/ \
DATAGEN_SEED=1702052986 \
PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.0.1 \
PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension \
PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
integration_tests/run_pyspark_from_build.sh -k "test_delta_delete_dataframe_api and None-True" --delta_lake --debug_tmp_path

As suspected, the table contents as a whole are correct, but some of the rows ended up in different output files between the CPU and GPU runs. AFAICT nothing is actually wrong semantically with the output produced by the GPU relative to the CPU, but it would be good to understand how we're reliably getting the rows crossed here. I suspect that for some odd reason the GPU run is assigning a different set of input files to its tasks than the CPU run does.
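A toy illustration of this failure mode (this is not the actual test code, and the row values are invented): splitting the same set of rows across files at different boundaries leaves the table contents identical while the per-file Delta stats (`numRecords`, `minValues`, `maxValues`, which is what the log comparison checks) come out different.

```python
import json

# Hypothetical rows; column names and values are illustrative only.
rows = [{"a": i % 5, "c": chr(65 + i)} for i in range(10)]

def file_stats(chunk):
    # Mimics the shape of the per-file Delta 'stats' payload.
    return {
        "numRecords": len(chunk),
        "minValues": {k: min(r[k] for r in chunk) for k in ("a", "c")},
        "maxValues": {k: max(r[k] for r in chunk) for k in ("a", "c")},
    }

# Two runs write the same rows but split them at different boundaries,
# e.g. because tasks received the input files in a different order.
cpu_files = [rows[:6], rows[6:]]
gpu_files = [rows[:4], rows[4:]]

# The table as a whole is identical (ignoring ordering)...
assert sorted(map(json.dumps, sum(cpu_files, []))) == \
       sorted(map(json.dumps, sum(gpu_files, [])))
# ...yet the per-file metadata differs, which is what the assertion flags.
assert file_stats(cpu_files[0]) != file_stats(gpu_files[0])
```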


jlowe commented Dec 11, 2023

Debugged why this is failing for certain datagen seeds. The problem can occur when a particular datagen seed causes two or more Parquet files within a table to be generated with the same file size. ext4 and most other Linux filesystems will return a directory listing in an order influenced by the order the files are written to the directory, and it's not deterministic which files will be written before other files as tasks execute in parallel. The input files are sorted in descending order by file size, but when two or more files are the same size, the sorted ordering can be different between two directories that contain the same files but were created in a different order.
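The described root cause can be sketched in a few lines of Python (file names and sizes here are invented): a sort keyed on size alone cannot distinguish equal-sized files, so a stable sort preserves whatever directory-listing order the filesystem happened to return, while adding a deterministic tie-breaker such as the file name restores a stable order.

```python
# Hypothetical directory listings of (name, size); two files tie on size.
listing_run1 = [("part-a.parquet", 1000), ("part-b.parquet", 2000), ("part-c.parquet", 2000)]
listing_run2 = [("part-a.parquet", 1000), ("part-c.parquet", 2000), ("part-b.parquet", 2000)]

# Sort descending by size only: Python's stable sort keeps the tied files
# in listing order, so the result depends on the directory-listing order.
def by_size(files):
    return sorted(files, key=lambda f: -f[1])

assert by_size(listing_run1) != by_size(listing_run2)

# Breaking ties deterministically (here, by name) makes both runs agree.
def by_size_then_name(files):
    return sorted(files, key=lambda f: (-f[1], f[0]))

assert by_size_then_name(listing_run1) == by_size_then_name(listing_run2)
```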

I do not see a way to really fix this other than to use a datagen seed that is known to produce files that can be deterministically sorted, as was done in #10009, or to change the test to not compare metadata at all. The latter would allow a lot of subtle bugs to slip through, so I think using a fixed datagen seed is the better route.

Closing this bug as setting a fixed seed as done by #10009 is the long-term solution.
