[BUG] Some deltalake tests failed on ARM64 with DATAGEN_SEED=1702341898 #10025

Closed
revans2 opened this issue Dec 12, 2023 · 2 comments
Labels
bug Something isn't working

Comments

Collaborator

revans2 commented Dec 12, 2023

Describe the bug
I have not been able to reproduce this on my desktop, but an arm64 build with Spark 3.4.2 and deltalake 2.4.0 failed the following tests.

[2023-12-12T01:07:56.056Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.056Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":575,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.056Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":580,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.056Z] FAILED ../../src/main/python/delta_lake_delete_test.py::test_delta_delete_rows[None-False][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-12-12T01:07:56.056Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":580,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":575,"minValues":{"a":0,"b":"d","c":"\\u0001ÑKÃä¦ý\\u001BQî\x94³\\u001DB¨¾z¦!Âx`ÖæÜ\x88\x92±}"},"maxValues":{"a":4,"b":"g","c":"ÿú\x82h!ï\\u00051ç>3\xa0\\u0006³\x7fp\\u0017c\\u0003ÜüÔCæT¨\x85èe\x86"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_dataframe_api[None-False][DATAGEN_SEED=1702341898, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000001.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows_with_dv[True-None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,ColumnarToRowExec,RapidsDeltaWriteExec,GenerateExec,DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows[None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22}}'}
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_rows_with_dv[True-None-False][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(HashAggregateExec,ColumnarToRowExec,RapidsDeltaWriteExec,GenerateExec,DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22},"tightBounds":true}'}
[2023-12-12T01:07:56.057Z] Starting with datagen test seed: 1702341898. Set env variable SPARK_RAPIDS_TEST_DATAGEN_SEED to override.
[2023-12-12T01:07:56.057Z] Starting with OOM injection seed: 1702341898. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to override.
[2023-12-12T01:07:56.057Z] 2023-12-12 00:44:58 INFO     Executing global initialization tasks before test launches
[2023-12-12T01:07:56.057Z] 2023-12-12 00:44:58 INFO     Creating directory /home/jenkins/agent/workspace/rapids_it-arm64-dev/jars/integration_tests/target/run_dir-20231212004458-nCyn/hive with permissions 0o777
[2023-12-12T01:07:56.057Z] 2023-12-12 00:44:58 INFO     Skipping findspark init because on xdist master
[2023-12-12T01:07:56.057Z] FAILED ../../src/main/python/delta_lake_update_test.py::test_delta_update_dataframe_api[None-True][DATAGEN_SEED=1702341898, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] - AssertionError: Delta log 00000000000000000002.json is different at key 'add':
[2023-12-12T01:07:56.057Z] CPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000î]Î\x9e´® \\u0012;\x9cÉ\\u000Fy·þ/\x91\xa0÷h\x96\\u00194&M)¡c~"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":20}}'}
[2023-12-12T01:07:56.057Z] GPU: {'path': 'partsnappy.parquet', 'partitionValues': {}, 'dataChange': True, 'stats': '{"numRecords":1020,"minValues":{"a":0,"b":"a","c":"\\u0000\xa0`\x94?/n¦¿Ñ\x80©ëí3\\"¶<\x93Í\\\\h\\u0010L\x83ñ\\u0010r/è"},"maxValues":{"a":4,"b":"g","c":"ÿ\x8e^<æC_ÓAϽ£Ì\x98Í\x98r«3W¾äíj%Ëý\x83LÖ"},"nullCount":{"a":0,"b":0,"c":22}}'}

In all of these cases it appears that some of the data was partitioned slightly differently. I know that @jlowe was working on a fix for some deltalake issues where we got unlucky and the order in which the files were read was non-deterministic because their sizes matched exactly. I am not sure whether that is the case here or whether something else is happening, especially for the delete case.
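To make the differences in the log above easier to see, here is a minimal sketch that parses the `stats` JSON from a CPU and a GPU `add` entry and reports only the fields that differ. The helper name `diff_stats` is hypothetical and not part of the integration tests. The sample strings reuse the `numRecords` and `nullCount` values from the first `test_delta_delete_rows` failure and leave out the binary column `c` from `minValues` and `maxValues` for readability.

```python
import json


def _diff(cpu: dict, gpu: dict, prefix: str = "") -> dict:
    """Recursively collect keys whose values differ between two dicts."""
    diffs = {}
    for key in sorted(set(cpu) | set(gpu)):
        path = f"{prefix}{key}"
        c, g = cpu.get(key), gpu.get(key)
        if isinstance(c, dict) and isinstance(g, dict):
            diffs.update(_diff(c, g, path + "."))
        elif c != g:
            diffs[path] = (c, g)
    return diffs


def diff_stats(cpu_stats: str, gpu_stats: str) -> dict:
    """Hypothetical helper: compare two Delta log 'stats' JSON strings."""
    return _diff(json.loads(cpu_stats), json.loads(gpu_stats))


# Simplified stats taken from the first failure above (column "c" min/max omitted).
cpu_stats = ('{"numRecords":575,"minValues":{"a":0,"b":"d"},'
             '"maxValues":{"a":4,"b":"g"},"nullCount":{"a":0,"b":0,"c":22}}')
gpu_stats = ('{"numRecords":580,"minValues":{"a":0,"b":"d"},'
             '"maxValues":{"a":4,"b":"g"},"nullCount":{"a":0,"b":0,"c":20}}')

print(diff_stats(cpu_stats, gpu_stats))
# {'nullCount.c': (22, 20), 'numRecords': (575, 580)}
```

Only the record count and the null count of `c` differ in this pair. That fits rows being split across files differently, not the data itself changing.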

revans2 added the "bug" and "? - Needs Triage" labels Dec 12, 2023
mattahrens removed the "? - Needs Triage" label Dec 12, 2023
Member

jlowe commented Dec 27, 2023

This is the same root cause as described at #9884 (comment). Two files created during setup have exactly the same file size, so when the files are ordered by size, their relative order is non-deterministic.
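As an illustration of why equal sizes matter, here is a minimal sketch in plain Python. It is not the Spark or plugin code, and the file names, sizes, and largest-first ordering are assumptions made up for the example. Sorting by size alone leaves tied files in whatever order they arrived. Adding the path as a secondary key gives the same order every time.

```python
from typing import List, Tuple

# (path, size in bytes); a hypothetical stand-in for a file listing entry.
FileInfo = Tuple[str, int]


def order_by_size(files: List[FileInfo]) -> List[FileInfo]:
    """Largest first. Ties keep their input order, so the result depends on it."""
    return sorted(files, key=lambda f: f[1], reverse=True)


def order_by_size_then_path(files: List[FileInfo]) -> List[FileInfo]:
    """Largest first, with the path as a tie-breaker for a stable result."""
    return sorted(files, key=lambda f: (-f[1], f[0]))


# Two listings of the same files, differing only in arrival order.
listing_a = [("part-0000.parquet", 4096),
             ("part-0001.parquet", 4096),
             ("part-0002.parquet", 1024)]
listing_b = [listing_a[1], listing_a[0], listing_a[2]]

print(order_by_size(listing_a) == order_by_size(listing_b))
print(order_by_size_then_path(listing_a) == order_by_size_then_path(listing_b))
```

The first comparison prints `False` and the second prints `True`. With more than one thread writing or listing files, the arrival order can change from run to run.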

Member

jlowe commented Dec 27, 2023

Note that I can reproduce these issues by forcing two threads, e.g.:

SPARK_SUBMIT_FLAGS="--master local[2]" TEST_PARALLEL=0 SPARK_HOME=/home/jlowe/spark-3.4.1-bin-hadoop3/ DATAGEN_SEED=1702341898 PYSP_TEST_spark_jars_packages=io.delta:delta-core_2.12:2.4.0 PYSP_TEST_spark_sql_extensions=io.delta.sql.DeltaSparkSessionExtension PYSP_TEST_spark_sql_catalog_spark__catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog integration_tests/run_pyspark_from_build.sh -k "test_delta_update_dataframe and None-False" --delta_lake --debug_tmp_path

andygrove removed their assignment Apr 1, 2024
andygrove added the "? - Needs Triage" label Apr 1, 2024
mattahrens removed the "? - Needs Triage" label Apr 2, 2024
mattahrens closed this as not planned Apr 2, 2024