[BUG] test_round_robin_sort_fallback failed #2407

Closed
jlowe opened this issue May 13, 2021 · 1 comment · Fixed by #2516
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

jlowe commented May 13, 2021

Last night's Dataproc integration test run saw a failure in test_round_robin_sort_fallback. From the failure log:

06:40:46  E                   : java.lang.AssertionError: assertion failed: Could not find ShuffleExchangeExec in the GPU plan
06:40:46  E                   GpuColumnarToRow false
06:40:46  E                   +- GpuShuffleCoalesce 2147483647
06:40:46  E                      +- GpuColumnarExchange gpuroundrobinpartitioning(2345), REPARTITION_WITH_NUM, [id=#616183]
06:40:46  E                         +- GpuProject [a#749732, 1 AS x#749734]
06:40:46  E                            +- GpuRowToColumnar TargetSize(2147483647)
06:40:46  E                               +- *(1) Scan ExistingRDD[a#749732]

Full log for reference:

06:40:46  ____________ test_round_robin_sort_fallback[[('a', Array(String))]] ____________
06:40:46  
06:40:46  data_gen = [('a', Array(String))]
06:40:46  
06:40:46      @allow_non_gpu('ShuffleExchangeExec', 'RoundRobinPartitioning')
06:40:46      @pytest.mark.parametrize('data_gen', [[('a', ArrayGen(string_gen))],
06:40:46          [('a', StructGen([
06:40:46            ('a_1', StructGen([
06:40:46              ('a_1_1', int_gen),
06:40:46              ('a_1_2', float_gen),
06:40:46              ('a_1_3', double_gen)
06:40:46            ])),
06:40:46            ('b_1', long_gen)
06:40:46          ]))],
06:40:46          [('a', simple_string_to_string_map_gen)]], ids=idfn)
06:40:46      @ignore_order(local=True) # To avoid extra data shuffle by 'sort on Spark' for this repartition test.
06:40:46      def test_round_robin_sort_fallback(data_gen):
06:40:46          from pyspark.sql.functions import lit
06:40:46  >       assert_gpu_fallback_collect(
06:40:46                  # Add a computed column to avoid shuffle being optimized back to a CPU shuffle like in test_repartition_df
06:40:46                  lambda spark : gen_df(spark, data_gen).withColumn('x', lit(1)).repartition(13),
06:40:46                  'ShuffleExchangeExec')
06:40:46  
06:40:46  integration_tests/src/main/python/repart_test.py:97: 
06:40:46  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
06:40:46  integration_tests/src/main/python/asserts.py:309: in assert_gpu_fallback_collect
06:40:46      jvm.com.nvidia.spark.rapids.ExecutionPlanCaptureCallback.assertCapturedAndGpuFellBack(cpu_fallback_class_name, 2000)
06:40:46  /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1620898185056_0001/container_e01_1620898185056_0001_01_000001/py4j-0.10.9-src.zip/py4j/java_gateway.py:1304: in __call__
06:40:46      return_value = get_return_value(
06:40:46  /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1620898185056_0001/container_e01_1620898185056_0001_01_000001/pyspark.zip/pyspark/sql/utils.py:111: in deco
06:40:46      return f(*a, **kw)
06:40:46  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
06:40:46  
06:40:46  answer = 'xro1463102'
06:40:46  gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f396e8889d0>
06:40:46  target_id = 'z:com.nvidia.spark.rapids.ExecutionPlanCaptureCallback'
06:40:46  name = 'assertCapturedAndGpuFellBack'
06:40:46  
06:40:46      def get_return_value(answer, gateway_client, target_id=None, name=None):
06:40:46          """Converts an answer received from the Java gateway into a Python object.
06:40:46
06:40:46          For example, string representation of integers are converted to Python
06:40:46          integer, string representation of objects are converted to JavaObject
06:40:46          instances, etc.
06:40:46
06:40:46          :param answer: the string returned by the Java gateway
06:40:46          :param gateway_client: the gateway client used to communicate with the Java
06:40:46              Gateway. Only necessary if the answer is a reference (e.g., object,
06:40:46              list, map)
06:40:46          :param target_id: the name of the object from which the answer comes from
06:40:46              (e.g., *object1* in `object1.hello()`). Optional.
06:40:46          :param name: the name of the member from which the answer comes from
06:40:46              (e.g., *hello* in `object1.hello()`). Optional.
06:40:46          """
06:40:46          if is_error(answer)[0]:
06:40:46              if len(answer) > 1:
06:40:46                  type = answer[1]
06:40:46                  value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
06:40:46                  if answer[1] == REFERENCE_TYPE:
06:40:46  >                   raise Py4JJavaError(
06:40:46                          "An error occurred while calling {0}{1}{2}.\n".
06:40:46                          format(target_id, ".", name), value)
06:40:46  E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.nvidia.spark.rapids.ExecutionPlanCaptureCallback.assertCapturedAndGpuFellBack.
06:40:46  E                   : java.lang.AssertionError: assertion failed: Could not find ShuffleExchangeExec in the GPU plan
06:40:46  E                   GpuColumnarToRow false
06:40:46  E                   +- GpuShuffleCoalesce 2147483647
06:40:46  E                      +- GpuColumnarExchange gpuroundrobinpartitioning(2345), REPARTITION_WITH_NUM, [id=#616183]
06:40:46  E                         +- GpuProject [a#749732, 1 AS x#749734]
06:40:46  E                            +- GpuRowToColumnar TargetSize(2147483647)
06:40:46  E                               +- *(1) Scan ExistingRDD[a#749732]
06:40:46  E
06:40:46  E                   	at scala.Predef$.assert(Predef.scala:223)
06:40:46  E                   	at com.nvidia.spark.rapids.ExecutionPlanCaptureCallback$.assertDidFallBack(Plugin.scala:318)
06:40:46  E                   	at com.nvidia.spark.rapids.ExecutionPlanCaptureCallback$.assertCapturedAndGpuFellBack(Plugin.scala:312)
06:40:46  E                   	at com.nvidia.spark.rapids.ExecutionPlanCaptureCallback.assertCapturedAndGpuFellBack(Plugin.scala)
06:40:46  E                   	at sun.reflect.GeneratedMethodAccessor278.invoke(Unknown Source)
06:40:46  E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
06:40:46  E                   	at java.lang.reflect.Method.invoke(Method.java:498)
06:40:46  E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
06:40:46  E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
06:40:46  E                   	at py4j.Gateway.invoke(Gateway.java:282)
06:40:46  E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
06:40:46  E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
06:40:46  E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
06:40:46  E                   	at java.lang.Thread.run(Thread.java:748)
06:40:46  
06:40:46  /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1620898185056_0001/container_e01_1620898185056_0001_01_000001/py4j-0.10.9-src.zip/py4j/protocol.py:326: Py4JJavaError
06:40:46  ----------------------------- Captured stdout call -----------------------------
06:40:46  ### CPU RUN ###
06:40:46  ### GPU RUN ###
jlowe added the bug and ? - Needs Triage labels May 13, 2021
jlowe added the P0 label May 14, 2021
jlowe self-assigned this May 14, 2021
jlowe removed the ? - Needs Triage label May 14, 2021

jlowe commented May 26, 2021

Finally found out what's happening with this issue. In #2503 I changed the column names in the test to avoid ambiguity with the immediately preceding test, test_repartition_df, which also tests an array of strings. The array column was renamed from a to ag, and the extra column from x to extra.

Last night's Dataproc integration test run saw this test fail again, but note the difference between the column names the test used and the ones logged in the error:

06:42:01  =================================== FAILURES ===================================
06:42:01  ___________ test_round_robin_sort_fallback[[('ag', Array(String))]] ____________
06:42:01  
06:42:01  data_gen = [('ag', Array(String))]
06:42:01  
[...]
06:42:01  E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.nvidia.spark.rapids.ExecutionPlanCaptureCallback.assertCapturedAndGpuFellBack.
06:42:01  E                   : java.lang.AssertionError: assertion failed: Could not find ShuffleExchangeExec in the GPU plan
06:42:01  E                   GpuColumnarToRow false
06:42:01  E                   +- GpuShuffleCoalesce 2147483647
06:42:01  E                      +- GpuColumnarExchange gpuroundrobinpartitioning(2345), REPARTITION_WITH_NUM, [id=#625927]
06:42:01  E                         +- GpuProject [a#753458, 1 AS x#753460]
06:42:01  E                            +- GpuRowToColumnar TargetSize(2147483647)
06:42:01  E                               +- *(1) Scan ExistingRDD[a#753458]
[...]
06:42:01  =========================== short test summary info ============================
06:42:01  FAILED integration_tests/src/main/python/repart_test.py::test_round_robin_sort_fallback[[('ag', Array(String))]][IGNORE_ORDER({'local': True}), ALLOW_NON_GPU(ShuffleExchangeExec,RoundRobinPartitioning)]

Note that the captured plan uses a schema with column names a and x, not ag and extra as expected. This proves we're capturing the previous test's plan rather than the intended plan.
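
As a rough illustration of the check described above (not part of the plugin or its test suite), the column names in a captured plan can be compared against the names the test actually used. The helper names plan_column_names and looks_stale are hypothetical, and the sketch assumes plan strings render attributes as name#id, as they do in the logged plan.

```python
import re

# Matches Spark attribute references such as "a#753458" and captures the name.
# "[id=#625927]" is not matched because "=" separates the name from "#".
_ATTR_REF = re.compile(r"\b([A-Za-z_]\w*)#\d+")


def plan_column_names(plan_text):
    """Return the set of attribute names referenced in a plan string."""
    return set(_ATTR_REF.findall(plan_text))


def looks_stale(plan_text, expected_columns):
    """Return True if any expected column name is missing from the plan."""
    return not set(expected_columns) <= plan_column_names(plan_text)


# Plan text copied from the second failure log, with timestamps removed.
captured_plan = """\
GpuColumnarToRow false
+- GpuShuffleCoalesce 2147483647
   +- GpuColumnarExchange gpuroundrobinpartitioning(2345), REPARTITION_WITH_NUM, [id=#625927]
      +- GpuProject [a#753458, 1 AS x#753460]
         +- GpuRowToColumnar TargetSize(2147483647)
            +- *(1) Scan ExistingRDD[a#753458]
"""

print(sorted(plan_column_names(captured_plan)))     # ['a', 'x']
print(looks_stale(captured_plan, ["ag", "extra"]))  # True
```

The test was generating ag and extra, so a plan that only mentions a and x cannot have come from this test's query.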
