
[BUILD] databricks IT tests should run in parallel #1499

Closed
tgravescs opened this issue Jan 12, 2021 · 22 comments
Labels
build Related to CI / CD or cleanly building

Comments

@tgravescs
Collaborator

Is your feature request related to a problem? Please describe.
We should enable the Databricks IT tests to run in parallel. We added support for it, but it's not used in the Databricks test scripts.

@tgravescs tgravescs added the build Related to CI / CD or cleanly building label Jan 12, 2021
@NvTimLiu
Collaborator

NvTimLiu commented Jan 13, 2021

I'll make it run in parallel with pytest-xdist, similar to what the Spark pre-merge build does.
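
For reference, a minimal sketch of such a parallel invocation, assuming runtests.py forwards extra arguments through to pytest and pytest-xdist is installed (the worker count of 4 is illustrative):

# -n is pytest-xdist's worker-count flag; runtests.py is assumed to pass it through to pytest
python ./runtests.py -n 4 --runtime_env="databricks" ./src/main/python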

@NvTimLiu
Collaborator

Still working on this issue. I can set up the environment to run the tests in parallel as the pre-merge build does. There are some failures; I'm checking whether some Python modules are missing.

@NvTimLiu
Collaborator

NvTimLiu commented Jan 19, 2021

Test pipeline : https://blossom.nvidia.com/sw-gpu-spark-jenkins/view/Testing/job/tim-db-build-0/

Build & test (~45 minutes) can be done within 1 hour.

Most of the tests PASS, but there are still some failures; tracking them down.

21:33:45 = 213 failed, 4164 passed, 130 skipped, 163 xfailed, 6 xpassed, 66 warnings in 2165.62s (0:36:05) =

@NvTimLiu
Collaborator

NvTimLiu commented Jan 19, 2021

@tgravescs @revans2 @jlowe
I tried to run the spark-rapids Databricks IT with pytest in parallel. The parallel pipeline finished in 50 minutes.

Blossom Jenkins: https://blossom.nvidia.com/sw-gpu-spark-jenkins/view/Testing/job/tim-db-build-0/7

PR1549: https://github.com/NVIDIA/spark-rapids/pull/1549/files

But the 3 modules below failed; could you please help check? Thanks!

21:38:38 = 195 failed, 4182 passed, 130 skipped, 163 xfailed, xpassed, 68 warnings in 2214.54s (0:36:54) =

window_function_test.py: FAILED in the full parallel tests, PASSES when its pipeline runs independently as below:
python "$SCRIPTPATH"/runtests.py --rootdir "$SCRIPTPATH" "$SCRIPTPATH"/src/main/python/window_function_test.py

tpch_test.py: FAILED in the parallel tests; without parallelism all of its tests are skipped and it PASSES, as below:
spark-submit ./runtests.py --runtime_env="databricks" src/main/python/tpch_test.py ssssssssssssssssssssssssssssssssssssssssssss [100%]

udf_test.py: FAILED with the parallel Python tests, PASSES if using spark-submit
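
For comparison, a hypothetical standalone spark-submit run of just this module, mirroring the tpch_test.py command above:

spark-submit ./runtests.py --runtime_env="databricks" src/main/python/udf_test.py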

I saw exceptions in the parallel log: integration_tests/target/run_dir/target/surefire-reports/scala-test-detailed-output.log
21/01/19 09:43:14.469 Thread-4 WARN SQLExecution: Error executing delta metering
java.lang.NullPointerException
at org.apache.spark.sql.execution.CacheManager$.getSessionUuidOpt(CacheManager.scala:461)
at org.apache.spark.sql.execution.CacheManager.$anonfun$lookupCachedData$1(CacheManager.scala:337)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.execution.CacheManager.lookupCachedData(CacheManager.scala:337)
at org.apache.spark.sql.execution.CacheManager$$anonfun$1.applyOrElse(CacheManager.scala:382)

@tgravescs
Collaborator Author

Oh, some of them are hitting the stack overflow issue in optimizer.SparkPlanStats.computeStats now for some reason. The script might have other options we weren't using before; need to look more.

@tgravescs
Collaborator Author

The cache tests are different; the error above is from TPCH. The window tests failed because the Spark context was already stopped, presumably from one of the other errors.

@tgravescs
Collaborator Author

The TPCH tests weren't running before because we didn't set --std_input_path.
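
A sketch of what enabling them could look like (the input path below is illustrative, not the one the scripts actually use):

# Hypothetical: point --std_input_path at the test input data so the TPCH tests are not skipped
spark-submit ./runtests.py --runtime_env="databricks" --std_input_path=./src/test/resources src/main/python/tpch_test.py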

@NvTimLiu
Collaborator

tpch_test.py gets SKIPPED by removing --std_input_path: https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L97

udf_test.py can PASS by removing: export PYSP_TEST_spark_driver_extraJavaOptions="-ea -Duser.timezone=UTC $COVERAGE_SUBMIT_FLAGS"
https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L85
We still got 18 FAILED as below:
https://blossom.nvidia.com/sw-gpu-spark-jenkins/view/Testing/job/tim-db-build-0/8/consoleText

FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Lower_Upper-Byte][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded-Short][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Lower_Upper-Short][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded-Integer][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Lower_Upper-Integer][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded_Following-Byte][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded_Following-Short][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded_Following-Integer][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded_Preceding-Byte][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded_Preceding-Short][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded_Preceding-Integer][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[No_Partition-Byte][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[No_Partition-Short][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[No_Partition-Integer][IGNORE_ORDER]
FAILED ../../src/main/python/udf_test.py::test_window_aggregate_udf_array_from_python[Unbounded-Byte][IGNORE_ORDER]

@NvTimLiu
Collaborator

g4dn.xlarge: 4 CPU(s), 16 GB memory

@NvTimLiu
Collaborator

NvTimLiu commented Jan 20, 2021

As the nightly Databricks pipeline runs the integration tests without any parameters, I also removed the configs below in the parallel script run_pyspark_from_build.sh, and got a PASS for the Databricks parallel IT.
============== 4304 passed, 207 skipped, 159 xfailed, 6 xpassed, 16 warnings in 2372.64s (0:39:32) ==============

https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L85-L88
https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L97
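
For context, a sketch of the kind of configuration those lines carry, based on the line quoted earlier in this thread (the executor-side export is an assumption, not a verbatim copy of the script):

export PYSP_TEST_spark_driver_extraJavaOptions="-ea -Duser.timezone=UTC $COVERAGE_SUBMIT_FLAGS"
export PYSP_TEST_spark_executor_extraJavaOptions='-ea -Duser.timezone=UTC'  # assumed counterpart to the driver-side export
# plus the --std_input_path argument at line 97 that enables the TPCH tests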

@revans2
Collaborator

revans2 commented Jan 20, 2021

Are you sure that they are all passing and not just being skipped? Some of the lines you removed are the ones that allow TPCH to run. You also removed the lines for setting the time zone to UTC which might cause all of the timestamp tests to be skipped if the time zone is not UTC by default.

@NvTimLiu
Collaborator

NvTimLiu commented Jan 20, 2021

@revans2
The skipped tests are listed below.
I suppose the nightly Databricks pipeline also skips these TPCH and some of the udf tests. That is, the nightly Databricks pipeline's build.sh also runs the integration tests without any parameters, and it got a PASS.

https://blossom.nvidia.com/sw-gpu-spark-jenkins/job/rapids_databricks301_nightly-dev-github/56/consoleFull
16:22:07 = 4397 passed, 207 skipped, 159 xfailed, 6 xpassed, 2 warnings in 13349.19s (3:42:29) =
XPASS ../../src/main/python/sort_test.py::test_multi_orderby[Double] Spark has -0.0 < 0.0 before Spark 3.1

SKIPPED [32] ../../src/main/python/conftest.py:169: std_input_path is not configured
SKIPPED [1] ../../src/main/python/conftest.py:352: Mortgage not configured to run
SKIPPED [1] ../../src/main/python/conftest.py:402: rapids_udf_example_native not configured to run
SKIPPED [105] ../../src/main/python/conftest.py:388: TPC-DS not configured to run
SKIPPED [44] ../../src/main/python/conftest.py:276: TPCH not configured to run
SKIPPED [4] ../../src/main/python/conftest.py:318: TPCxBB not configured to run
SKIPPED [11] ../../src/main/python/conftest.py:396: cudf_udf not configured to run
SKIPPED [4] ../../src/main/python/udf_test.py:89: #757
SKIPPED [5] ../../src/main/python/udf_test.py:136: #740
============== 4304 passed, 207 skipped, 159 xfailed, 6 xpassed, 16 warnings in 2372.64s (0:39:32) ==============

@NvTimLiu
Collaborator

https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L85-L88 (PASS udf_test.py)
https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L97 (Skip tpch_test.py)
• I did not observe window_function_test.py failures after removing the above configs; maybe the above changes impact the window_function_test.py tests.

@revans2
Collaborator

revans2 commented Jan 20, 2021

What happens if it is just lines 85, 86 and 97 that are removed?

@NvTimLiu
Collaborator

NvTimLiu commented Jan 20, 2021

I guess it can PASS too. Let me check it; I'll update the result here.

@revans2
Collaborator

revans2 commented Jan 20, 2021

Lines 85 and 86 make me think that we cannot set the Java command-line options with findspark on Databricks. I am not sure why this would cause some of the tests to fail, though. I think we need someone to actually debug and root-cause these issues at this point.

Removing line 97 just disables a lot of tests and sidesteps the problem.
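
To illustrate the distinction, a hedged sketch (using spark-submit's standard --driver-java-options flag; the working assumption is that a findspark-based local run starts the driver JVM in-process, where driver extraJavaOptions supplied through configuration may arrive too late to take effect):

# With spark-submit, driver JVM options are applied when the driver is launched:
spark-submit --driver-java-options "-ea -Duser.timezone=UTC" ./runtests.py --runtime_env="databricks" src/main/python/udf_test.py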

@NvTimLiu
Collaborator

What happens if it is just lines 85, 86 and 97 that are removed?

Also PASS

@NvTimLiu
Collaborator

NvTimLiu commented Jan 21, 2021

Lines 85 and 86 make me think that we cannot set the Java command-line options with findspark on Databricks. I am not sure why this would cause some of the tests to fail, though. I think we need someone to actually debug and root-cause these issues at this point.

Removing line 97 just disables a lot of tests and sidesteps the problem.

@revans2 @tgravescs Do we need to create an issue for the Databricks IT skipping the tests below?
SKIPPED [32] ../../src/main/python/conftest.py:169: std_input_path is not configured
SKIPPED [44] ../../src/main/python/conftest.py:276: TPCH not configured to run

@NvTimLiu
Collaborator

NvTimLiu commented Jan 21, 2021

pipeline: https://blossom.nvidia.com/sw-gpu-spark-jenkins/view/Testing/job/tim-db-build-0/9/console
18:04:03 = 4397 passed, 207 skipped, 159 xfailed, 6 xpassed, 16 warnings in 2517.03s (0:41:57) =
18:04:03 Setting default log level to "WARN".
18:04:03 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

@tgravescs
Collaborator Author

Yes, we need a specific issue to investigate those failures.

@tgravescs
Collaborator Author

Initial change for nightly fixed by #1645.
@NvTimLiu can you update the integration builds?

@NvTimLiu
Collaborator

NvTimLiu commented Feb 5, 2021

Closing the issue as #1645 has been merged.

NvTimLiu closed this as completed Feb 5, 2021