[BUG] Build Failure when building from source #2546

Closed
tregodev opened this issue Jun 1, 2021 · 16 comments · Fixed by #2768
tregodev commented Jun 1, 2021

Hey,

I am trying to build the main branch without any modifications to the code using

mvn verify

but unfortunately it fails with this error:

[ERROR] Failed to execute goal on project rapids-4-spark-shims-spark312_2.12: Could not resolve dependencies for project com.nvidia:rapids-4-spark-shims-spark312_2.12:jar:0.5.0: org.apache.spark:spark-sql_2.12:jar:3.1.2-SNAPSHOT was not found in https://oss.sonatype.org/content/repositories/snapshots during a previous attempt. This failure was cached in the local repository and resolution is not reattempted until the update interval of snapshots-repo has elapsed or updates are forced -> [

Is there any simple way to fix this?

Thanks!

tregodev added the '? - Needs Triage' (Need team to review and classify) and 'bug' (Something isn't working) labels on Jun 1, 2021
pxLi (Collaborator) commented Jun 1, 2021

Which branch are you working on? Spark has removed the 3.1.2-SNAPSHOT libraries from the snapshot Maven repo (https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/) and is going to release 3.1.2 shortly.

To build successfully, please upmerge/rebase against upstream branch-21.06. Thanks!
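
For reference, a minimal sketch of that upmerge, assuming the NVIDIA repository is configured as a remote named upstream (the remote name is an assumption; adjust to your setup):

  # Pull in the latest upstream history and rebase onto branch-21.06
  git fetch upstream
  git rebase upstream/branch-21.06
  # -U forces Maven to re-check remote repositories instead of reusing the
  # cached "was not found" result for the removed 3.1.2-SNAPSHOT artifact
  mvn -U clean verify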

tregodev (Author) commented Jun 1, 2021

Hi,

thank you for the quick answer.

I tried branch-21.06 before, but it led to the following error:

- GPU partition with compression *** FAILED ***
  ai.rapids.cudf.CudfException: CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-pre_release-github-28-cuda11/cpp/src/bitmask/null_mask.cu:93: 700 cudaErrorIllegalAddress an illegal memory access was encountered
  at ai.rapids.cudf.ColumnView.binaryOpVV(Native Method)
  at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1012)
  at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1005)
  at ai.rapids.cudf.BinaryOperable.equalToNullAware(BinaryOperable.java:542)
  at ai.rapids.cudf.BinaryOperable.equalToNullAware(BinaryOperable.java:549)
  at com.nvidia.spark.rapids.GpuPartitioningSuite.$anonfun$compareBatches$3(GpuPartitioningSuite.scala:97)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at com.nvidia.spark.rapids.GpuPartitioningSuite.$anonfun$compareBatches$2(GpuPartitioningSuite.scala:93)

pxLi (Collaborator) commented Jun 1, 2021

Which GPU device and CUDA toolkit do you have locally? branch-21.06 requires CUDA 11.0+ and driver 450.80.02+.
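
(Both can be checked from the command line with the standard NVIDIA tools; the outputs in the reply below come from these two commands.)

  nvidia-smi       # reports the driver version and GPU model
  nvcc --version   # reports the CUDA toolkit (compiler) version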

tregodev (Author) commented Jun 1, 2021

Hey,

CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

GPU device:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   27C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

sameerz added the 'build' (Related to CI / CD or cleanly building) label and removed the '? - Needs Triage' (Need team to review and classify) label on Jun 1, 2021
Salonijain27 (Contributor) commented:

Hi @tregodev,
As mentioned above by @pxLi, the requirements for spark-rapids branch-21.06 are:
CUDA 11.0+
NVIDIA driver 450.80.02+

Based on the information you provided, you need to update your NVIDIA driver from 450.51.05 to 450.80.02+ to resolve the issue.

For more information, please refer to the dependency list for libcudf here.

tregodev (Author) commented Jun 2, 2021

Hey all,

thank you for the help. I updated the drivers (NVIDIA-SMI 465.19.01, Driver Version 465.19.01, CUDA Version 11.3).

However, the error still persists:

GpuPartitioningSuite:
- GPU partition
- GPU partition with compression *** FAILED ***
  ai.rapids.cudf.CudfException: CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-pre_release-github-34-cuda11/cpp/src/bitmask/null_mask.cu:93: 700 cudaErrorIllegalAddress an illegal memory access was encountered
  at ai.rapids.cudf.ColumnView.binaryOpVV(Native Method)
  at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1012)
  at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1005)
  at ai.rapids.cudf.BinaryOperable.equalToNullAware(BinaryOperable.java:542)
  at ai.rapids.cudf.BinaryOperable.equalToNullAware(BinaryOperable.java:549)
  at com.nvidia.spark.rapids.GpuPartitioningSuite.$anonfun$compareBatches$3(GpuPartitioningSuite.scala:97)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at com.nvidia.spark.rapids.GpuPartitioningSuite.$anonfun$compareBatches$2(GpuPartitioningSuite.scala:93)
  ...
HashSortOptimizeSuite:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 21.06.0-SNAPSHOT:
[INFO] 
[INFO] RAPIDS Accelerator for Apache Spark Root Project ... SUCCESS [  2.307 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin ..... SUCCESS [01:56 min]
[INFO] RAPIDS Accelerator for Apache Spark Shuffle Plugin . SUCCESS [ 14.426 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Shims SUCCESS [  1.445 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.1 Shim SUCCESS [ 17.805 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.1 EMR Shim SUCCESS [  9.187 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.2 Shim SUCCESS [  8.897 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.1 Shim SUCCESS [ 23.266 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.2 Shim SUCCESS [  9.413 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.3 Shim SUCCESS [ 13.843 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.3 Shim SUCCESS [ 14.354 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Shim Aggregator SUCCESS [  1.265 s]
[INFO] RAPIDS Accelerator for Apache Spark Scala UDF Plugin SUCCESS [ 56.674 s]
[INFO] RAPIDS Accelerator for Apache Spark Distribution ... SUCCESS [ 11.587 s]
[INFO] rapids-4-spark-api-validation ...................... SUCCESS [  9.626 s]
[INFO] RAPIDS Accelerator for Apache Spark tools .......... SUCCESS [ 27.882 s]
[INFO] RAPIDS Accelerator for Apache Spark UDF Examples ... SUCCESS [  8.920 s]
[INFO] RAPIDS Accelerator for Apache Spark Tests .......... FAILURE [05:18 min]
[INFO] rapids-4-spark-integration-tests_2.12 .............. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11:06 min
[INFO] Finished at: 2021-06-02T07:26:32Z
[INFO] ------------------------------------------------------------------------

abellina (Collaborator) commented Jun 2, 2021

I tried running this locally but I can't get it to reproduce. Our CI jobs also seem to be OK.

The P100 is Pascal and is supported. Your driver seems fine as well according to https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

I assume you have CUDA runtime 11.0 or 11.2 installed, as 11.3 is not something we test with. Can you verify which runtime you are using? The runtime is normally installed in a path like /usr/local/cuda-11.0.

I am sorry to ask for more input, but here are some suggestions:

  1. Could you check tests/target/surefire-reports/ and grep for ERROR or Exception to see if there is anything interesting (likely before this test)?
  2. It would be interesting to run without this test to see whether the other tests succeed and the failure is isolated.
  3. If you do have 11.2 or 11.0 installed in addition to 11.3, can you try exporting LD_LIBRARY_PATH=/usr/local/cuda-11.0 or LD_LIBRARY_PATH=/usr/local/cuda-11.2 and then retrying the mvn verify command? (See the sketch below.)

Thanks for your patience.
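
For reference, a rough sketch of suggestions 1 and 3 as commands (the lib64 suffix is an assumption about where the CUDA runtime libraries live under the install path mentioned above):

  # 1. Search the surefire reports for earlier errors or exceptions
  grep -rnE 'ERROR|Exception' tests/target/surefire-reports/
  # 3. Point the dynamic loader at the CUDA 11.0 runtime and retry the build
  export LD_LIBRARY_PATH=/usr/local/cuda-11.0/lib64:$LD_LIBRARY_PATH
  mvn verify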

viadea (Collaborator) commented Jun 2, 2021

I just tested on an Azure VM using Standard_NC6s_v2, which has one P100.
Env:
ubuntu 20.04
Driver 465.19.01
CUDA Version: 11.3

It builds fine if we skip the tests: mvn clean verify -DskipTests

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 21.06.0-SNAPSHOT:
[INFO]
[INFO] RAPIDS Accelerator for Apache Spark Root Project ... SUCCESS [  7.354 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin ..... SUCCESS [01:44 min]
[INFO] RAPIDS Accelerator for Apache Spark Shuffle Plugin . SUCCESS [ 11.091 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Shims SUCCESS [  0.399 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.1 Shim SUCCESS [ 13.284 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.1 EMR Shim SUCCESS [  6.553 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.2 Shim SUCCESS [  6.908 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.1 Shim SUCCESS [ 23.966 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.2 Shim SUCCESS [  7.924 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.1 Cloudera Shim SUCCESS [ 24.506 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.3 Shim SUCCESS [ 13.477 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.3 Shim SUCCESS [ 12.769 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Shim Aggregator SUCCESS [  0.115 s]
[INFO] RAPIDS Accelerator for Apache Spark Scala UDF Plugin SUCCESS [ 34.466 s]
[INFO] RAPIDS Accelerator for Apache Spark Distribution ... SUCCESS [ 12.421 s]
[INFO] rapids-4-spark-api-validation ...................... SUCCESS [  6.475 s]
[INFO] RAPIDS Accelerator for Apache Spark tools .......... SUCCESS [ 40.835 s]
[INFO] RAPIDS Accelerator for Apache Spark UDF Examples ... SUCCESS [  7.581 s]
[INFO] RAPIDS Accelerator for Apache Spark Tests .......... SUCCESS [ 32.170 s]
[INFO] rapids-4-spark-integration-tests_2.12 .............. SUCCESS [ 26.995 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  06:33 min
[INFO] Finished at: 2021-06-02T23:38:57Z
[INFO] ------------------------------------------------------------------------

/mnt/git/spark-rapids# nvidia-smi
Wed Jun  2 23:39:14 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla P1...  On   | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, if we do not skip the tests, I can reproduce the same failure:

GpuPartitioningSuite:
- GPU partition
- GPU partition with compression *** FAILED ***
  ai.rapids.cudf.CudfException: CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-cudf_nightly-pre_release-github-35-cuda11/cpp/src/bitmask/null_mask.cu:93: 700 cudaErrorIllegalAddress an illegal memory access was encountered
  at ai.rapids.cudf.ColumnView.binaryOpVV(Native Method)
  at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1012)
  at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1005)
  at ai.rapids.cudf.BinaryOperable.equalToNullAware(BinaryOperable.java:542)
  at ai.rapids.cudf.BinaryOperable.equalToNullAware(BinaryOperable.java:549)
  at com.nvidia.spark.rapids.GpuPartitioningSuite.$anonfun$compareBatches$3(GpuPartitioningSuite.scala:97)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at com.nvidia.spark.rapids.GpuPartitioningSuite.$anonfun$compareBatches$2(GpuPartitioningSuite.scala:93)

together with another test failure:

WindowFunctionSuite:
- WITH DECIMALS: [Window] [ROWS] [-2, 3]
- WITH DECIMALS: [Window] [ROWS] [-2, CURRENT ROW]
- WITH DECIMALS: [Window] [ROWS] [-2, UNBOUNDED FOLLOWING]
- WITH DECIMALS: [Window] [ROWS] [CURRENT ROW, 3]
- WITH DECIMALS: [Window] [ROWS] [CURRENT ROW, CURRENT ROW]
- WITH DECIMALS: [Window] [ROWS] [CURRENT ROW, UNBOUNDED FOLLOWING]
- WITH DECIMALS: [Window] [ROWS] [UNBOUNDED PRECEDING, 3]  *** FAILED ***
  Running on the GPU and on the CPU did not match
  CPU: WrappedArray([43,13.0000000000,15.0000000000,3], [95,13.0000000000,52.0000000000,4], [168,13.0000000000,73.0000000000,5], [186,13.0000000000,73.0000000000,6], [272,13.0000000000,86.0000000000,7], [289,13.0000000000,86.0000000000,8], [350,13.0000000000,86.0000000000,9], [425,13.0000000000,86.0000000000,10], [523,13.0000000000,98.0000000000,11], [601,13.0000000000,98.0000000000,12], [686,13.0000000000,98.0000000000,13], [700,13.0000000000,98.0000000000,14], [772,13.0000000000,98.0000000000,15], [864,13.0000000000,98.0000000000,16], [948,13.0000000000,98.0000000000,17], [1013,13.0000000000,98.0000000000,18], [1066,13.0000000000,98.0000000000,19], [1066,13.0000000000,98.0000000000,19], [1066,13.0000000000,98.0000000000,19], [1066,13.0000000000,98.0000000000,19])

  GPU: WrappedArray([43,13.0000000000,15.0000000000,3], [60,13.0000000000,17.0000000000,4], [121,13.0000000000,61.0000000000,5], [196,13.0000000000,75.0000000000,6], [294,13.0000000000,98.0000000000,7], [386,13.0000000000,98.0000000000,8], [470,13.0000000000,98.0000000000,9], [535,13.0000000000,98.0000000000,10], [588,13.0000000000,98.0000000000,11], [640,13.0000000000,98.0000000000,12], [713,13.0000000000,98.0000000000,13], [731,13.0000000000,98.0000000000,14], [817,13.0000000000,98.0000000000,15], [895,13.0000000000,98.0000000000,16], [980,13.0000000000,98.0000000000,17], [994,13.0000000000,98.0000000000,18], [1066,13.0000000000,98.0000000000,19], [1066,13.0000000000,98.0000000000,19], [1066,13.0000000000,98.0000000000,19], [1066,13.0000000000,98.0000000000,19]) (SparkQueryCompareTestSuite.scala:737)

Changing LD_LIBRARY_PATH to CUDA 11.0 does not fix this issue.

viadea (Collaborator) commented Jun 3, 2021

I also tested the same thing on an Azure VM using Standard_NC6s_v3, which has one V100.
Env:
ubuntu 20.04
Driver 465.19.01
CUDA Version: 11.3

The V100 runs fine.

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 21.06.0-SNAPSHOT:
[INFO]
[INFO] RAPIDS Accelerator for Apache Spark Root Project ... SUCCESS [  0.999 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin ..... SUCCESS [01:33 min]
[INFO] RAPIDS Accelerator for Apache Spark Shuffle Plugin . SUCCESS [ 11.798 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Shims SUCCESS [  0.962 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.1 Shim SUCCESS [ 13.208 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.1 EMR Shim SUCCESS [  6.748 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.2 Shim SUCCESS [  6.457 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.1 Shim SUCCESS [ 17.421 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.2 Shim SUCCESS [  6.801 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.1 Cloudera Shim SUCCESS [ 12.238 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.0.3 Shim SUCCESS [  6.601 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Spark 3.1.3 Shim SUCCESS [  6.865 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin Shim Aggregator SUCCESS [  0.885 s]
[INFO] RAPIDS Accelerator for Apache Spark Scala UDF Plugin SUCCESS [ 39.868 s]
[INFO] RAPIDS Accelerator for Apache Spark Distribution ... SUCCESS [ 12.151 s]
[INFO] rapids-4-spark-api-validation ...................... SUCCESS [  7.061 s]
[INFO] RAPIDS Accelerator for Apache Spark tools .......... SUCCESS [ 44.687 s]
[INFO] RAPIDS Accelerator for Apache Spark UDF Examples ... SUCCESS [  7.488 s]
[INFO] RAPIDS Accelerator for Apache Spark Tests .......... SUCCESS [04:22 min]
[INFO] rapids-4-spark-integration-tests_2.12 .............. SUCCESS [ 58.838 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  10:16 min
[INFO] Finished at: 2021-06-03T01:17:01Z
[INFO] ------------------------------------------------------------------------

firestarman (Collaborator) commented:

Concerning the CUDA error, you could try removing the cache under $HOME/.cudf/ and running the build again (see the sketch below).
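
A sketch of that cleanup (the path comes from the comment above; double-check the directory contents before deleting):

  # Remove the cached files under the user's home directory
  rm -rf $HOME/.cudf
  # Rebuild from a clean state
  mvn clean verify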

tregodev (Author) commented Jun 3, 2021

Hey,

I can confirm that using mvn clean verify -DskipTests also made the build successful.

@firestarman I did not find any cache files under $HOME/.cudf/, only the three different cudf folders. For what it's worth, I got the same error on both Ubuntu 20.04 and CentOS 7.

@abellina I uploaded the surefire-reports grep contents to pastebin (Exception and ERROR).

Changing LD_LIBRARY_PATH to point to 11.3, 11.2, or 11.0 did not change the outcome.

mythrocks (Collaborator) commented:

Hmm. I can confirm that all the cuDF window function tests seem to pass against a P100 with CUDA 11.2.2:

mithunr@mithunr-p100-20:~/work/cudf$ ./ROLLING_TEST
...
[----------] 3 tests from RollingDictionaryTest
[ RUN      ] RollingDictionaryTest.Count
[       OK ] RollingDictionaryTest.Count (1 ms)
[ RUN      ] RollingDictionaryTest.MinMax
[       OK ] RollingDictionaryTest.MinMax (1 ms)
[ RUN      ] RollingDictionaryTest.LeadLag
[       OK ] RollingDictionaryTest.LeadLag (2 ms)
[----------] 3 tests from RollingDictionaryTest (4 ms total)

[----------] Global test environment tear-down
[==========] 2032 tests from 241 test suites ran. (3328 ms total)
[  PASSED  ] 2032 tests.

It is entirely possible that the specific spark-rapids WindowFunctionSuite tests are exercising a path that isn't covered by the cuDF tests. I'm investigating.

mythrocks (Collaborator) commented:

The behaviour does seem quite strange. If all tests except [Window] [ROWS] [CURRENT ROW, UNBOUNDED FOLLOWING] are commented out, the failing test seems to succeed. This is proving a little hard to narrow down.

mythrocks (Collaborator) commented:

This is beginning to look like memory corruption. After adding uid, dateLong, and dollars to the output projection, the failing test began to pass, and the following test failed as shown:

- WITH DECIMALS: [Window] [ROWS] [UNBOUNDED PRECEDING, 3]  *** FAILED ***
  Running on the GPU and on the CPU did not match
  CPU: WrappedArray([null,null,null,43,13.0000000000,15.0000000000,3], [null,null,13,95,13.0000000000,52.0000000000,4], [null,null,15,168,13.0000000000,73.0000000000,5], [null,null,15,186,13.0000000000,73.0000000000,6], [null,null,52,272,13.0000000000,86.0000000000,7], [null,null,73,289,13.0000000000,86.0000000000,8], [null,null,18,350,13.0000000000,86.0000000000,9], [null,null,86,425,13.0000000000,86.0000000000,10], [null,null,17,523,13.0000000000,98.0000000000,11], [null,null,61,601,13.0000000000,98.0000000000,12], [null,null,75,686,13.0000000000,98.0000000000,13], [null,null,98,700,13.0000000000,98.0000000000,14], [null,null,78,772,13.0000000000,98.0000000000,15], [null,null,85,864,13.0000000000,98.0000000000,16], [null,null,14,948,13.0000000000,98.0000000000,17], [null,null,72,1013,13.0000000000,98.0000000000,18], [null,null,92,1066,13.0000000000,98.0000000000,19], [null,null,84,1066,13.0000000000,98.0000000000,19], [null,null,65,1066,13.0000000000,98.0000000000,19], [null,null,53,1066,13.0000000000,98.0000000000,19])

  GPU: WrappedArray([null,null,null,43,13.0000000000,15.0000000000,3], [null,null,13,60,13.0000000000,17.0000000000,4], [null,null,15,121,13.0000000000,61.0000000000,5], [null,null,15,196,13.0000000000,75.0000000000,6], [null,null,17,294,13.0000000000,98.0000000000,7], [null,null,61,386,13.0000000000,98.0000000000,8], [null,null,75,470,13.0000000000,98.0000000000,9], [null,null,98,535,13.0000000000,98.0000000000,10], [null,null,92,588,13.0000000000,98.0000000000,11], [null,null,84,640,13.0000000000,98.0000000000,12], [null,null,65,713,13.0000000000,98.0000000000,13], [null,null,53,731,13.0000000000,98.0000000000,14], [null,null,52,817,13.0000000000,98.0000000000,15], [null,null,73,895,13.0000000000,98.0000000000,16], [null,null,18,980,13.0000000000,98.0000000000,17], [null,null,86,994,13.0000000000,98.0000000000,18], [null,null,78,1066,13.0000000000,98.0000000000,19], [null,null,85,1066,13.0000000000,98.0000000000,19], [null,null,14,1066,13.0000000000,98.0000000000,19], [null,null,72,1066,13.0000000000,98.0000000000,19]) (SparkQueryCompareTestSuite.scala:737)

Note that the first three columns are turning up as nulls. These should have simply been projections from the input. Something's fishy.

mythrocks (Collaborator) commented Jun 11, 2021

Note: at least part of the problem is that the tests do not order the output rows. It would be good to add ordering, perhaps based on uid, dateLong, and dollars. The results for nearly all the queries match between the CPU and GPU when the output is sorted.

But one or two tests (namely [CURRENT ROW, UNBOUNDED FOLLOWING] and [UNBOUNDED PRECEDING, 3]) indicate that there could be memory corruption. Still investigating.

mythrocks (Collaborator) commented Jun 21, 2021

Please pardon the delay; I've been barking up the wrong tree. The window function implementation isn't the problem: the tests produced different results on Pascal because of nondeterministic ordering.

#2768 should sort this out. Tested on Turing and Pascal.
