
[BUG] orc_test:test_pred_push_round_trip failed #3059

Closed
pxLi opened this issue Jul 28, 2021 · 8 comments · Fixed by #3079
Labels: bug (Something isn't working), test (Only impacts tests)

Comments

pxLi (Collaborator) commented Jul 28, 2021

Describe the bug
orc_test:test_pred_push_round_trip failed in the nightly integration test pipeline (failures seen with Spark 3.0.1, 3.0.2, 3.0.3, and Databricks).

17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Timestamp]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_sql-Timestamp]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_df-Timestamp]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Byte]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Short]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Integer]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Long]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Float]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Double]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-String]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Date]
17:24:47  FAILED src/main/python/orc_test.py::test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}-orc-read_orc_sql-Timestamp]

Detailed log:

[2021-07-28T09:21:33.378Z] _ test_pred_push_round_trip[{'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}--read_orc_df-Short] _
[2021-07-28T09:21:33.378Z] 
[2021-07-28T09:21:33.378Z] spark_tmp_path = '/tmp/pyspark_tests//482405/', orc_gen = Short
[2021-07-28T09:21:33.378Z] read_func = <function read_orc_df at 0x7f8b0d826b80>, v1_enabled_list = ''
[2021-07-28T09:21:33.378Z] reader_confs = {'spark.rapids.sql.format.orc.reader.type': 'COALESCING'}
[2021-07-28T09:21:33.378Z] 
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('orc_gen', orc_pred_push_gens, ids=idfn)
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('read_func', [read_orc_df, read_orc_sql])
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('v1_enabled_list', ["", "orc"])
[2021-07-28T09:21:33.378Z]     @pytest.mark.parametrize('reader_confs', reader_opt_confs, ids=idfn)
[2021-07-28T09:21:33.378Z]     def test_pred_push_round_trip(spark_tmp_path, orc_gen, read_func, v1_enabled_list, reader_confs):
[2021-07-28T09:21:33.378Z]         data_path = spark_tmp_path + '/ORC_DATA'
[2021-07-28T09:21:33.378Z]         # Append two struct columns to verify nested predicate pushdown.
[2021-07-28T09:21:33.378Z]         gen_list = [('a', RepeatSeqGen(orc_gen, 100)), ('b', orc_gen),
[2021-07-28T09:21:33.378Z]             ('s1', StructGen([['sa', orc_gen]])),
[2021-07-28T09:21:33.378Z]             ('s2', StructGen([['sa', StructGen([['ssa', orc_gen]])]]))]
[2021-07-28T09:21:33.378Z]         s0 = gen_scalar(orc_gen, force_no_nulls=True)
[2021-07-28T09:21:33.378Z]         with_cpu_session(
[2021-07-28T09:21:33.378Z]                 lambda spark : gen_df(spark, gen_list).orderBy('a').write.orc(data_path))
[2021-07-28T09:21:33.378Z]         all_confs = reader_confs.copy()
[2021-07-28T09:21:33.378Z]         all_confs.update({'spark.sql.sources.useV1SourceList': v1_enabled_list,
[2021-07-28T09:21:33.378Z]             'spark.sql.optimizer.nestedSchemaPruning.enabled': "false"})
[2021-07-28T09:21:33.378Z]         rf = read_func(data_path)
[2021-07-28T09:21:33.378Z] >       assert_gpu_and_cpu_are_equal_collect(
[2021-07-28T09:21:33.378Z]                 lambda spark: rf(spark).select(f.col('a') >= s0, f.col('s1.sa') >= s0, f.col('s2.sa.ssa') >= s0),
[2021-07-28T09:21:33.378Z]                 conf=all_confs)
[2021-07-28T09:21:33.378Z] 
[2021-07-28T09:21:33.378Z] src/main/python/orc_test.py:141: 
[2021-07-28T09:21:33.378Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-07-28T09:21:33.378Z] src/main/python/asserts.py:387: in assert_gpu_and_cpu_are_equal_collect
[2021-07-28T09:21:33.378Z]     _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first)
[2021-07-28T09:21:33.378Z] src/main/python/asserts.py:379: in _assert_gpu_and_cpu_are_equal
[2021-07-28T09:21:33.378Z]     assert_equal(from_cpu, from_gpu)
[2021-07-28T09:21:33.378Z] src/main/python/asserts.py:95: in assert_equal
[2021-07-28T09:21:33.378Z]     _assert_equal(cpu, gpu, float_check=get_float_check(), path=[])
[2021-07-28T09:21:33.379Z] src/main/python/asserts.py:42: in _assert_equal
[2021-07-28T09:21:33.379Z]     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-07-28T09:21:33.379Z] src/main/python/asserts.py:35: in _assert_equal
[2021-07-28T09:21:33.379Z]     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2021-07-28T09:21:33.379Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2021-07-28T09:21:33.379Z] 
[2021-07-28T09:21:33.379Z] cpu = False, gpu = None
[2021-07-28T09:21:33.379Z] float_check = <function get_float_check.<locals>.<lambda> at 0x7f8b02294ca0>
[2021-07-28T09:21:33.379Z] path = [184, '(s2.sa.ssa >= CAST(22357 AS SMALLINT))']
[2021-07-28T09:21:33.379Z] 
[2021-07-28T09:21:33.379Z]     def _assert_equal(cpu, gpu, float_check, path):
[2021-07-28T09:21:33.379Z]         t = type(cpu)
[2021-07-28T09:21:33.379Z]         if (t is Row):
[2021-07-28T09:21:33.379Z]             assert len(cpu) == len(gpu), "CPU and GPU row have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2021-07-28T09:21:33.379Z]             if hasattr(cpu, "__fields__") and hasattr(gpu, "__fields__"):
[2021-07-28T09:21:33.379Z]                 assert cpu.__fields__ == gpu.__fields__, "CPU and GPU row have different fields at {} CPU: {} GPU: {}".format(path, cpu.__fields__, gpu.__fields__)
[2021-07-28T09:21:33.379Z]                 for field in cpu.__fields__:
[2021-07-28T09:21:33.379Z]                     _assert_equal(cpu[field], gpu[field], float_check, path + [field])
[2021-07-28T09:21:33.379Z]             else:
[2021-07-28T09:21:33.379Z]                 for index in range(len(cpu)):
[2021-07-28T09:21:33.379Z]                     _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-07-28T09:21:33.379Z]         elif (t is list):
[2021-07-28T09:21:33.379Z]             assert len(cpu) == len(gpu), "CPU and GPU list have different lengths at {} CPU: {} GPU: {}".format(path, len(cpu), len(gpu))
[2021-07-28T09:21:33.379Z]             for index in range(len(cpu)):
[2021-07-28T09:21:33.379Z]                 _assert_equal(cpu[index], gpu[index], float_check, path + [index])
[2021-07-28T09:21:33.379Z]         elif (t is pytypes.GeneratorType):
[2021-07-28T09:21:33.379Z]             index = 0
[2021-07-28T09:21:33.379Z]             # generator has no zip :( so we have to do this the hard way
[2021-07-28T09:21:33.379Z]             done = False
[2021-07-28T09:21:33.379Z]             while not done:
[2021-07-28T09:21:33.379Z]                 sub_cpu = None
[2021-07-28T09:21:33.379Z]                 sub_gpu = None
[2021-07-28T09:21:33.379Z]                 try:
[2021-07-28T09:21:33.379Z]                     sub_cpu = next(cpu)
[2021-07-28T09:21:33.379Z]                 except StopIteration:
[2021-07-28T09:21:33.379Z]                     done = True
[2021-07-28T09:21:33.379Z]     
[2021-07-28T09:21:33.379Z]                 try:
[2021-07-28T09:21:33.379Z]                     sub_gpu = next(gpu)
[2021-07-28T09:21:33.379Z]                 except StopIteration:
[2021-07-28T09:21:33.379Z]                     done = True
[2021-07-28T09:21:33.379Z]     
[2021-07-28T09:21:33.379Z]                 if done:
[2021-07-28T09:21:33.379Z]                     assert sub_cpu == sub_gpu and sub_cpu == None, "CPU and GPU generators have different lengths at {}".format(path)
[2021-07-28T09:21:33.379Z]                 else:
[2021-07-28T09:21:33.379Z]                     _assert_equal(sub_cpu, sub_gpu, float_check, path + [index])
[2021-07-28T09:21:33.379Z]     
[2021-07-28T09:21:33.379Z]                 index = index + 1
[2021-07-28T09:21:33.379Z]         elif (t is dict):
[2021-07-28T09:21:33.379Z]             # TODO eventually we need to split this up so we can do the right thing for float/double
[2021-07-28T09:21:33.379Z]             # values stored under the map some where, especially for NaNs
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU map values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif (t is int):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU int values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif (t is float):
[2021-07-28T09:21:33.379Z]             if (math.isnan(cpu)):
[2021-07-28T09:21:33.379Z]                 assert math.isnan(gpu), "GPU and CPU float values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]             else:
[2021-07-28T09:21:33.379Z]                 assert float_check(cpu, gpu), "GPU and CPU float values are different {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, str):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU string values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, datetime):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU timestamp values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, date):
[2021-07-28T09:21:33.379Z]             assert cpu == gpu, "GPU and CPU date values are different at {}".format(path)
[2021-07-28T09:21:33.379Z]         elif isinstance(cpu, bool):
[2021-07-28T09:21:33.379Z] >           assert cpu == gpu, "GPU and CPU boolean values are different at {}".format(path)
[2021-07-28T09:21:33.379Z] E           AssertionError: GPU and CPU boolean values are different at [184, '(s2.sa.ssa >= CAST(22357 AS SMALLINT))']
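The failure above reduces to a tiny case: for row 184, the CPU evaluated the nested-struct predicate to `False` while the GPU returned `None` (null). A minimal, self-contained sketch of the same recursive row comparison (simplified from the `_assert_equal` source shown above; `namedtuple` stands in for `pyspark.sql.Row`, and the field names are illustrative, not from the real test):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: field-named access like Row.__fields__.
Row = namedtuple("Row", ["pred_a", "pred_s1", "pred_s2"])

def row_mismatches(cpu, gpu, path=()):
    """Simplified version of the recursive comparison in _assert_equal:
    walk the fields of a CPU row and a GPU row, collecting any differing
    values together with the path at which they differ."""
    mismatches = []
    for field, c, g in zip(cpu._fields, cpu, gpu):
        if c != g:
            mismatches.append((path + (field,), c, g))
    return mismatches

cpu_row = Row(True, True, False)  # CPU evaluated the nested predicate to False
gpu_row = Row(True, True, None)   # GPU returned null for the same predicate
diffs = row_mismatches(cpu_row, gpu_row, path=(184,))
# Exactly one mismatch, at the nested-predicate field for row 184.
```

Running this yields a single mismatch whose path mirrors the `[184, '(s2.sa.ssa >= CAST(22357 AS SMALLINT))']` path in the traceback.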

Steps/Code to reproduce bug

# build the plugin against branch-21.08

# standalone mode
TEST_PARALLEL=0 SPARK_SUBMIT_FLAGS="--master spark://127.0.0.1:7077" integration_tests/run_pyspark_from_build.sh -k orc_test

# local mode
integration_tests/run_pyspark_from_build.sh -k orc_test
pxLi added the bug (Something isn't working) and test (Only impacts tests) labels on Jul 28, 2021
firestarman (Collaborator) commented Jul 28, 2021

Will check it tomorrow. It seems to be related to my most recently merged PR.

firestarman self-assigned this on Jul 28, 2021
firestarman (Collaborator) commented
I cannot reproduce this locally.
I am wondering whether it is related to parallel runs; I will dig in more tomorrow.

pxLi (Collaborator, Author) commented Jul 28, 2021

How did you run it locally?
I verified that only the pre-merge pipeline passes this test, so I am wondering whether it relies on temp files produced during mvn verify, or on side effects of non-deterministic test ordering.
In every other scenario (nightly Databricks with Spark local parallel; run_pyspark_from_build.sh run directly, in Spark local parallel or standalone non-parallel mode; our integration tests, standalone non-parallel), it fails with the same error.

pxLi (Collaborator, Author) commented Jul 29, 2021

Tried debugging with @firestarman, but no luck.
The test can both pass and fail across scenarios; we have checked the GPU type, NVIDIA driver version, CUDA runtime version, and Linux kernel version, but found no consistent pattern. It could also be related to the cluster setup or application parameters.

Moving this one to Release 21.10.

jlowe pushed a commit that referenced this issue Jul 29, 2021
We do this because of issue #3059, which does not look easy to fix in a short time.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
firestarman (Collaborator) commented
I eventually obtained an ORC file that reproduces this, and filed an issue against cudf (rapidsai/cudf#8910).

jlowe (Member) commented Jul 30, 2021

Any insight as to why this was so hard to reproduce locally but happens pretty much every time in nightly CI?

firestarman (Collaborator) commented

> Any insight as to why this was so hard to reproduce locally but happens pretty much every time in nightly CI?

That's what I am still trying to figure out ...

firestarman (Collaborator) commented Aug 3, 2021

> Any insight as to why this was so hard to reproduce locally but happens pretty much every time in nightly CI?
>
> That's what I am still trying to figure out ...

It turns out to be related to the value of N in the --master local[N] parameter.
On nodes with 12 CPU cores, the tests for the coalescing reader type always pass when N >= 12, and always fail for any N between 1 and 11 inclusive.

@wbo4958 told me that coalescing only actually happens when N < 12, so the likely reason the tests fail is that cudf fails to read the coalesced ORC file. When N >= 12 there is no coalescing, and the tests pass.
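Given that finding, one way to force the coalescing path locally would presumably be to pin the local master's thread count below the node's core count. This is a sketch reusing the repro command from this issue; local[2] is an illustrative value for a 12-core node, not a command taken from the thread, and it assumes SPARK_SUBMIT_FLAGS accepts a local master the same way it accepts the standalone one above:

```shell
# On a 12-core node, N < 12 in local[N] should trigger the COALESCING
# reader path (per the observation above); N=2 is an arbitrary example.
TEST_PARALLEL=0 SPARK_SUBMIT_FLAGS="--master local[2]" \
  integration_tests/run_pyspark_from_build.sh -k orc_test
```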
