
[BUG] orc_test:test_pred_push_round_trip failed #3059

Closed
pxLi opened this issue Jul 28, 2021 · 8 comments · Fixed by #3079
Labels: bug (Something isn't working), test (Only impacts tests)

Comments

pxLi (Collaborator) commented Jul 28, 2021

Describe the bug
orc_test:test_pred_push_round_trip failed in the nightly integration test pipeline (failures seen with Spark 3.0.1, 3.0.2, 3.0.3, and Databricks).

17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Timestamp]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Timestamp]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Timestamp]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Timestamp]

Detailed log:

[2021-07-28T09:21:33.378Z] _ test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Short] _
[2021-07-28T09:21:33.378Z] 
[2021-07-28T09:21:33.378Z] spark_tmp_path = '/tmp/pyspark_tests//482405/', orc_gen = Short
[2021-07-28T09:21:33.378Z] read_func = <function read_orc_df at 0x7f8b0d826b80>, v1_enabled_list = ''
[2021-07-28T09:21:33.378Z] reader_confs = {'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}
[2021-07-28T09:21:33.378Z] 
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('orc_gen', orc_pred_push_gens, ids=idfn)
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('read_func', [read_orc_df, read_orc_sql])
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('v1_enabled_list', ["", "orc"])
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('reader_confs', reader_opt_confs, ids=idfn)
[2021-07-28T09:21:33.378Z]     def test_pred_push_round_trip(spark_tmp_path, orc_gen, read_func, v1_enabled_list, reader_confs):
[2021-07-28T09:21:33.378Z]         data_path = spark_tmp_path + '/ORC_DATA'
[2021-07-28T09:21:33.378Z]         # Append two struct columns to verify nested predicate pushdown.
[2021-07-28T09:21:33.378Z]         gen_list = [('a', RepeatSeqGen(orc_gen, 100)), ('b', orc_gen),
[2021-07-28T09:21:33.378Z]             ('s1', StructGen([['sa', orc_gen]])),
[2021-07-28T09:21:33.378Z]             ('s2', StructGen([['sa', StructGen([['ssa', orc_gen]])]]))]
[2021-07-28T09:21:33.378Z]         s0 = gen_scalar(orc_gen, force_no_nulls=True)
[2021-07-28T09:21:33.378Z]         with_cpu_session(
[2021-07-28T09:21:33.378Z]                 lambda spark : gen_df(spark, gen_list).orderBy('a').write.orc(data_path))
[2021-07-28T09:21:33.378Z]         all_confs = reader_confs.copy()
[2021-07-28T09:21:33.378Z]         all_confs.update({'spark.sql.sources.useV1SourceList': v1_enabled_list,
[2021-07-28T09:21:33.378Z]             'spark.sql.optimizer.nestedSchemaPruning.enabled': "false"})
[2021-07-28T09:21:33.378Z]         rf = read_func(data_path)
[2021-07-28T09:21:33.378Z] >       assert_gpu_and_cpu_are_equal_collect(
[2021-07-28T09:21:33.378Z]                 lambda spark: rf(spark).select(f.col('a') >= s0, f.col('s1.sa') >= s0, f.col('s2.sa.ssa') >= s0),
[2021-07-28T09:21:33.378Z]                 conf=all_confs)
[2021-07-28T09:21:33.378Z] 
[2021-07-28T09:21:33.378Z] src/main/python/orc_test.py:141: 
[2021-07-28T09:21:33.378Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-07-28T09:21:33.378Z] src/main/python/asserts.py:387: in assert_gpu_and_cpu_are_equal_collect
[2021-07-28T09:21:33.378Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2021-07-28T09:21:33.378Z] src/main/python/asserts.py:379: in _assert_gpu_and_cpu_are_equal
[2021-07-28T09:21:33.378Z]     assert_equal(from_cpu, from_gpu)
[2021-07-28T09:21:33.378Z] src/main/python/asserts.py:95: in assert_equal
[2021-07-28T09:21:33.378Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2021-07-28T09:21:33.379Z] src/main/python/asserts.py:42: in _assert_equal
[2021-07-28T09:21:33.379Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-07-28T09:21:33.379Z] src/main/python/asserts.py:35: in _assert_equal
[2021-07-28T09:21:33.379Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2021-07-28T09:21:33.379Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-07-28T09:21:33.379Z] 
[2021-07-28T09:21:33.379Z] cpu = False, gpu = None
[2021-07-28T09:21:33.379Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f8b02294ca0>
[2021-07-28T09:21:33.379Z] path = [184, '(s2.sa.ssa >= CAST(22357 AS SMALLINT))']
[2021-07-28T09:21:33.379Z] 
[2021-07-28T09:21:33.379Z]     def _assert_equal(cpu, gpu, float_check, path):
[2021-07-28T09:21:33.379Z]         t = type(cpu)
[2021-07-28T09:21:33.379Z]         if (t is Row):
[2021-07-28T09:21:33.379Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2021-07-28T09:21:33.379Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2021-07-28T09:21:33.379Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2021-07-28T09:21:33.379Z]                 for field in cpu.__fields__:
[2021-07-28T09:21:33.379Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2021-07-28T09:21:33.379Z]             else:
[2021-07-28T09:21:33.379Z]                 for index in range(len(cpu)):
[2021-07-28T09:21:33.379Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-07-28T09:21:33.379Z]         elif (t is list):
[2021-07-28T09:21:33.379Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2021-07-28T09:21:33.379Z]             for index in range(len(cpu)):
[2021-07-28T09:21:33.379Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-07-28T09:21:33.379Z]         elif (t is pytypes.GeneratorType):
[2021-07-28T09:21:33.379Z]             index = 0
[2021-07-28T09:21:33.379Z]             # generator has no zip :( so we have to do this the hard way
[2021-07-28T09:21:33.379Z]             done = False
[2021-07-28T09:21:33.379Z]             while not done:
[2021-07-28T09:21:33.379Z]                 sub_cpu = None
[2021-07-28T09:21:33.379Z]                 sub_gpu = None
[2021-07-28T09:21:33.379Z]                 try:
[2021-07-28T09:21:33.379Z]                     sub_cpu = next(cpu)
[2021-07-28T09:21:33.379Z]                 except StopIteration:
[2021-07-28T09:21:33.379Z]                     done = True
[2021-07-28T09:21:33.379Z]     
[2021-07-28T09:21:33.379Z]                 try:
[2021-07-28T09:21:33.379Z]                     sub_gpu = next(gpu)
[2021-07-28T09:21:33.379Z]                 except StopIteration:
[2021-07-28T09:21:33.379Z]                     done = True
[2021-07-28T09:21:33.379Z]     
[2021-07-28T09:21:33.379Z]                 if done:
[2021-07-28T09:21:33.379Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2021-07-28T09:21:33.379Z]                 else:
[2021-07-28T09:21:33.379Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2021-07-28T09:21:33.379Z]     
[2021-07-28T09:21:33.379Z]                 index = index + 1
[2021-07-28T09:21:33.379Z]         elif (t is dict):
[2021-07-28T09:21:33.379Z]             # TODO eventually we need to split this up so we can do the right thing for float/double
[2021-07-28T09:21:33.379Z]             # values stored under the map some where, especially for NaNs
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU map values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif (t is int):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif (t is float):
[2021-07-28T09:21:33.379Z]             if (math.isnan(cpu)):
[2021-07-28T09:21:33.379Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]             else:
[2021-07-28T09:21:33.379Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, str):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, datetime):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, date):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, bool):
[2021-07-28T09:21:33.379Z] >           assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
[2021-07-28T09:21:33.379Z] E           AssertionError: GPU and CPU boolean values are different at [184, '(s2.sa.ssa >= CAST(22357 AS SMALLINT))']
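The failure above reduces to a tiny case: for row 184, the CPU evaluated the nested-struct predicate to `False` while the GPU returned `None` (null). A minimal, self-contained sketch of the same recursive row comparison (simplified from the `_assert_equal` source shown above; `namedtuple` stands in for `pyspark.sql.Row`, and the field names are illustrative, not from the real test):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: field-named access like Row.__fields__.
Row = namedtuple("Row", ["pred_a", "pred_s1", "pred_s2"])

def row_mismatches(cpu, gpu, path=()):
    """Simplified version of the recursive comparison in _assert_equal:
    walk the fields of a CPU row and a GPU row, collecting any differing
    values together with the path at which they differ."""
    mismatches = []
    for field, c, g in zip(cpu._fields, cpu, gpu):
        if c != g:
            mismatches.append((path + (field,), c, g))
    return mismatches

cpu_row = Row(True, True, False)  # CPU evaluated the nested predicate to False
gpu_row = Row(True, True, None)   # GPU returned null for the same predicate
diffs = row_mismatches(cpu_row, gpu_row, path=(184,))
# Exactly one mismatch, at the nested-predicate field for row 184.
```

Running this yields a single mismatch whose path mirrors the `[184, '(s2.sa.ssa >= CAST(22357 AS SMALLINT))']` path in the traceback.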

Steps/Code to reproduce bug

# build the plugin against branch-21.08

# standalone mode
TEST_PARALLEL=0 SPARK_SUBMIT_FLAGS="--master spark://127.0.0.1:7077" integration_tests/run_pyspark_from_build.sh -k orc_test

# local mode
integration_tests/run_pyspark_from_build.sh -k orc_test
pxLi added the bug (Something isn't working) and test (Only impacts tests) labels on Jul 28, 2021
firestarman (Collaborator) commented Jul 28, 2021

Will check it tomorrow. It seems to be related to my most recently merged PR.

firestarman self-assigned this on Jul 28, 2021
firestarman (Collaborator) commented
I cannot reproduce this locally.
I am wondering whether it is related to parallel runs; I will dig in more tomorrow.

pxLi (Collaborator, Author) commented Jul 28, 2021

How did you run it locally?
I verified that only the pre-merge pipeline passes this test, so I am wondering whether it relies on temp files produced during mvn verify, or on side effects of non-deterministic test ordering.
In every other scenario (nightly Databricks with Spark local parallel; run_pyspark_from_build.sh run directly, in Spark local parallel or standalone non-parallel mode; our integration tests, standalone non-parallel), it fails with the same error.

pxLi (Collaborator, Author) commented Jul 29, 2021

Tried debugging with @firestarman, but no luck.
The test can both pass and fail across scenarios; we have checked the GPU type, NVIDIA driver version, CUDA runtime version, and Linux kernel version, but found no consistent pattern. It could also be related to the cluster setup or application parameters.

Moving this one to Release 21.10.

jlowe pushed a commit that referenced this issue Jul 29, 2021
We do this because of issue #3059, which does not look easy to fix in a short time.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
firestarman (Collaborator) commented
I eventually obtained an ORC file that reproduces this, and filed an issue against cudf (rapidsai/cudf#8910).

jlowe (Member) commented Jul 30, 2021

Any insight as to why this was so hard to reproduce locally but happens pretty much every time in nightly CI?

firestarman (Collaborator) commented

> Any insight as to why this was so hard to reproduce locally but happens pretty much every time in nightly CI?

That's what I am still trying to figure out ...

firestarman (Collaborator) commented Aug 3, 2021

> Any insight as to why this was so hard to reproduce locally but happens pretty much every time in nightly CI?
>
> That's what I am still trying to figure out ...

It turns out to be related to the value of N in the --master local[N] parameter.
On nodes with 12 CPU cores, the tests for the coalescing reader type always pass when N >= 12, and always fail for any N between 1 and 11 inclusive.

@wbo4958 told me that coalescing only actually happens when N < 12, so the likely reason the tests fail is that cudf fails to read the coalesced ORC file. When N >= 12 there is no coalescing, and the tests pass.
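Given that finding, one way to force the coalescing path locally would presumably be to pin the local master's thread count below the node's core count. This is a sketch reusing the repro command from this issue; local[2] is an illustrative value for a 12-core node, not a command taken from the thread, and it assumes SPARK_SUBMIT_FLAGS accepts a local master the same way it accepts the standalone one above:

```shell
# On a 12-core node, N < 12 in local[N] should trigger the COALESCING
# reader path (per the observation above); N=2 is an arbitrary example.
TEST_PARALLEL=0 SPARK_SUBMIT_FLAGS="--master local[2]" \
  integration_tests/run_pyspark_from_build.sh -k orc_test
```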
