
[BUG] delta_lake_delete_test.py failed assertion [DATAGEN_SEED=1701225104, IGNORE_ORDER... #9884

Closed
pxLi opened this issue Nov 29, 2023 · 3 comments
Labels
bug Something isn't working test Only impacts tests

Comments


pxLi commented Nov 29, 2023

Describe the bug
first seen in rapids_integration-dev-github, build ID 863 (jdk11 runtime + spark 330)

[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-True][DATAGEN_SEED=1701225104, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":537,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":541,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"þ\x7f³´ï\\u0014Ê1\\u000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}
[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_dataframe_api[None-False][DATAGEN_SEED=1701225104, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":443,"minValues":{"a":0,"b":"a","c":"\\u0001\\tÛá®oÔhÌ\\u0010yÎCÎô\\u0012\x94³EB\x82>à\\u00062%.o\\u001D¥"},"maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":15}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":447,"minValues":{"a":0,"b":"a","c":"\\u0001saD\\u000FzkÍÄ|¤ó_\x90ÑO¹¸åßÔ\x82|µ~\x91òbVå"},"maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":18}}'}
[2023-11-29T02:56:13.493Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows[None-False][DATAGEN_SEED=1701225104, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-11-29T02:56:13.493Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000Ò@\\u0010Ò>-ªUË\\u0017\\u0001ð\\u0007Æ9blðl@æaB\\"|ÉùÐõ"},"maxValues":{"a":4,"b":"g","c":"ÿ) å.>^\\fc\\u0012Gl¥èÞþÎÌ>\\f²Óý¤\x9cÕ\\u0007¸\x99O"},"nullCount":{"a":0,"b":0,"c":21}}'}
[2023-11-29T02:56:13.493Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000Ò@\\u0010Ò>-ªUË\\u0017\\u0001ð\\u0007Æ9blðl@æaB\\"|ÉùÐõ"},"maxValues":{"a":4,"b":"g","c":"þë\\u0014øgb\\u001B\x88¢¢À\\u0006Ù\x9aõf\\u0013m^è³iöÈ?\\b¯v\\u0000Ù"},"nullCount":{"a":0,"b":0,"c":18}}'}
[2023-11-29T02:56:13.494Z] = 3 failed, 276 passed, 47 skipped, 4 xfailed, 40 warnings in 1468.57s (0:24:28) =

mismatched CPU and GPU output:

[2023-11-29T02:56:13.489Z] E           AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-11-29T02:56:13.489Z] E           CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":537,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}
[2023-11-29T02:56:13.489Z] E           GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":541,"minValues":{"a":0,"b":"d","c":"\\u0000¶\'\x86åJ\\u0000J\x94C\x8cï=\\"7¡àÛ+\x9aànÊ\x92±`Iì\\b\x99"},"maxValues":{"a":4,"b":"g","c":"þ\x7f³´ï\\u0014Ê1\\u000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}

Additional traceback details:

[2023-11-29T02:56:13.488Z] filename = '00000000000000000002.json'
[2023-11-29T02:56:13.488Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":537,"minValu..."ÿÆ\\u0012\x93Yп\\u000EwÐ.Ñ\x8a®tîá|\\u0017=8§¦\x98³þ÷\x9bܤ"},"nullCount":{"a":0,"b":0,"c":29,"_change_type":537}}'}}
[2023-11-29T02:56:13.488Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":541,"minValu...000Bï\x81\x83Ü^\x97´$\\u0017\x9a¡\\n\\"\\u000EÈ/\\t\x904\x97¨"},"nullCount":{"a":0,"b":0,"c":33,"_change_type":541}}'}}


[2023-11-29T02:56:13.490Z] filename = '00000000000000000001.json'
[2023-11-29T02:56:13.490Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":443,"minValu...maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":15}}'}}
[2023-11-29T02:56:13.490Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":447,"minValu...maxValues":{"a":4,"b":"c","c":"ÿ^±Úq\\u000Bx¼\\u0005ø趽ec\\ntà\x98ø[m̼Óå\\tXE¼"},"nullCount":{"a":0,"b":0,"c":18}}'}}


[2023-11-29T02:56:13.492Z] filename = '00000000000000000001.json'
[2023-11-29T02:56:13.492Z] c_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":1020,"minVal...Values":{"a":4,"b":"g","c":"ÿ) å.>^\\fc\\u0012Gl¥èÞþÎÌ>\\f²Óý¤\x9cÕ\\u0007¸\x99O"},"nullCount":{"a":0,"b":0,"c":21}}'}}
[2023-11-29T02:56:13.492Z] g_json = {'add': {'dataChange': True, 'partitionValues': {}, 'path': 'partsnappy.parquet', 'stats': '{"numRecords":1020,"minVal...b":"g","c":"þë\\u0014øgb\\u001B\x88¢¢À\\u0006Ù\x9aõf\\u0013m^è³iöÈ?\\b¯v\\u0000Ù"},"nullCount":{"a":0,"b":0,"c":18}}'}}

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

@pxLi pxLi added bug Something isn't working ? - Needs Triage Need team to review and classify test Only impacts tests labels Nov 29, 2023
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Dec 5, 2023

jlowe commented Dec 6, 2023

I cannot reproduce this, and AFAIK it has not been seen since it occurred. My guess is that somehow the data was spread across tasks differently between the two runs: the data was verified to be correct (i.e., all rows were accounted for, ignoring ordering), yet the subsequent metadata equality check failed.


jlowe commented Dec 8, 2023

I found a way to reliably reproduce this:

SPARK_SUBMIT_FLAGS="--master local[2]" TEST_PARALLEL=0 \
SPARK_HOME=/home/jlowe/spark-3.2.1-bin-hadoop3.2/ \
DATAGEN_SEED=1702052986 \
PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.0.1 \
PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension \
PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
integration_tests/run_pyspark_from_build.sh -k "test_delta_delete_dataframe_api and None-True" --delta_lake --debug_tmp_path

As suspected, the table contents as a whole are correct, but some of the rows ended up in different output files between the CPU and GPU runs. AFAICT nothing is actually wrong semantically with the output produced by the GPU relative to the CPU, but it would be good to understand how we're reliably getting the rows crossed here. I suspect that for some odd reason the GPU run is assigning a different set of input files to its tasks than the CPU run does.
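A toy illustration of this failure mode (this is not the actual test code, and the row values are invented): splitting the same set of rows across files at different boundaries leaves the table contents identical while the per-file Delta stats (`numRecords`, `minValues`, `maxValues`, which is what the log comparison checks) come out different.

```python
import json

# Hypothetical rows; column names and values are illustrative only.
rows = [{"a": i % 5, "c": chr(65 + i)} for i in range(10)]

def file_stats(chunk):
    # Mimics the shape of the per-file Delta 'stats' payload.
    return {
        "numRecords": len(chunk),
        "minValues": {k: min(r[k] for r in chunk) for k in ("a", "c")},
        "maxValues": {k: max(r[k] for r in chunk) for k in ("a", "c")},
    }

# Two runs write the same rows but split them at different boundaries,
# e.g. because tasks received the input files in a different order.
cpu_files = [rows[:6], rows[6:]]
gpu_files = [rows[:4], rows[4:]]

# The table as a whole is identical (ignoring ordering)...
assert sorted(map(json.dumps, sum(cpu_files, []))) == \
       sorted(map(json.dumps, sum(gpu_files, [])))
# ...yet the per-file metadata differs, which is what the assertion flags.
assert file_stats(cpu_files[0]) != file_stats(gpu_files[0])
```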


jlowe commented Dec 11, 2023

Debugged why this is failing for certain datagen seeds. The problem can occur when a particular datagen seed causes two or more Parquet files within a table to be generated with the same file size. ext4 and most other Linux filesystems will return a directory listing in an order influenced by the order the files are written to the directory, and it's not deterministic which files will be written before other files as tasks execute in parallel. The input files are sorted in descending order by file size, but when two or more files are the same size, the sorted ordering can be different between two directories that contain the same files but were created in a different order.
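The described root cause can be sketched in a few lines of Python (file names and sizes here are invented): a sort keyed on size alone cannot distinguish equal-sized files, so a stable sort preserves whatever directory-listing order the filesystem happened to return, while adding a deterministic tie-breaker such as the file name restores a stable order.

```python
# Hypothetical directory listings of (name, size); two files tie on size.
listing_run1 = [("part-a.parquet", 1000), ("part-b.parquet", 2000), ("part-c.parquet", 2000)]
listing_run2 = [("part-a.parquet", 1000), ("part-c.parquet", 2000), ("part-b.parquet", 2000)]

# Sort descending by size only: Python's stable sort keeps the tied files
# in listing order, so the result depends on the directory-listing order.
def by_size(files):
    return sorted(files, key=lambda f: -f[1])

assert by_size(listing_run1) != by_size(listing_run2)

# Breaking ties deterministically (here, by name) makes both runs agree.
def by_size_then_name(files):
    return sorted(files, key=lambda f: (-f[1], f[0]))

assert by_size_then_name(listing_run1) == by_size_then_name(listing_run2)
```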

I do not see a way to really fix this other than to use a datagen seed that is known to produce files that can be deterministically sorted, as was done in #10009, or to change the test to not compare metadata at all. The latter would allow a lot of subtle bugs to slip through, so I think using a fixed datagen seed is the better route.

Closing this bug as setting a fixed seed as done by #10009 is the long-term solution.
