[BUG] java.lang.ClassCastException: GpuCompressedColumnVector cannot be cast to GpuColumnVector #2378

Closed · zhnin opened this issue May 10, 2021 · 6 comments · Fixed by #2396
Labels: bug (Something isn't working), P0 (Must have for release)

zhnin commented May 10, 2021

Describe the bug
Hi, when I run TPCx-BB Query7 and Query15 (SF1, using decimal) with

spark.rapids.sql.decimalType.enabled      true
spark.rapids.shuffle.compression.codec    lz4

I got this exception:

2021-05-10 11:26:20,143 ERROR executor.Executor: Exception in task 96.0 in stage 21.0 (TID 103)
java.lang.ClassCastException: com.nvidia.spark.rapids.GpuCompressedColumnVector cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
        at com.nvidia.spark.rapids.GpuColumnVector.extractColumns(GpuColumnVector.java:839)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$2(GpuColumnarToRowExec.scala:203)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$2$adapted(GpuColumnarToRowExec.scala:201)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.withResource(GpuColumnarToRowExec.scala:177)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$1(GpuColumnarToRowExec.scala:201)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.$anonfun$loadNextBatch$1$adapted(GpuColumnarToRowExec.scala:200)
        at scala.Option.foreach(Option.scala:407)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:200)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:238)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at com.nvidia.spark.rapids.RowToColumnarIterator.hasNext(GpuRowToColumnarExec.scala:564)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$iterHasNext$1(GpuCoalesceBatches.scala:175)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$iterHasNext$1$adapted(GpuCoalesceBatches.scala:174)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:133)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.iterHasNext(GpuCoalesceBatches.scala:174)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1(GpuCoalesceBatches.scala:183)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$hasNext$1$adapted(GpuCoalesceBatches.scala:182)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:133)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.hasNext(GpuCoalesceBatches.scala:182)
        at com.nvidia.spark.rapids.ConcatAndConsumeAll$.getSingleBatchWithVerification(GpuCoalesceBatches.scala:79)
        at com.nvidia.spark.rapids.GpuShuffledHashJoinBase.$anonfun$doExecuteColumnar$1(GpuShuffledHashJoinBase.scala:79)
        at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-05-10 11:26:20,175 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 146

Steps/Code to reproduce bug

  • TPCx-BB Query7 and Query15 (Data: SF1, use decimal)
  • spark-shell --master spark://master:7077

Environment details (please complete the following information)

  • Spark: 3.1.1
  • Driver & CUDA: 450.80.02 & V11.0
  • Spark-rapids: rapids-4-spark_2.12-0.6.0-SNAPSHOT.jar 7c7832a
  • Cudf: cudf-0.20-SNAPSHOT-cuda11.jar
  • Spark configuration
spark.master                                        spark://master:7077
spark.plugins                                       com.nvidia.spark.SQLPlugin
spark.executor.resource.gpu.amount                  1
spark.driver.memory                                 4g
spark.executor.memory                               16g
spark.executor.cores                                5
spark.task.cpus                                     1
spark.task.resource.gpu.amount                      0.2
spark.shuffle.manager                               com.nvidia.spark.rapids.spark311.RapidsShuffleManager
spark.shuffle.service.enabled                       false
spark.executorEnv.UCX_TLS                           cuda_copy,cuda_ipc,tcp
spark.executorEnv.UCX_ERROR_SIGNALS
spark.executorEnv.UCX_MAX_RNDV_RAILS                1
spark.executorEnv.UCX_MEMTYPE_CACHE                 n
spark.executorEnv.UCX_RNDV_SCHEME                   put_zcopy
spark.executorEnv.UCX_CUDA_IPC_CACHE                y
spark.driver.extraClassPath                         /mnt/vdb/0.6/jars/*:/mnt/vdb/0.6/lib/ucx/lib
spark.executor.extraClassPath                       /mnt/vdb/0.6/jars/*:/mnt/vdb/0.6/lib/ucx/lib

spark.rapids.sql.decimalType.enabled                true
spark.rapids.shuffle.compression.codec              lz4
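
For anyone reproducing: a quick sanity check from within spark-shell that the two implicated settings took effect. A sketch only; these configs must be set at application launch, this just reads them back:

// Sketch: confirm the two settings implicated in this bug are active.
// spark.conf.get throws if a key was never set, so set both at launch.
println(spark.conf.get("spark.rapids.sql.decimalType.enabled"))
println(spark.conf.get("spark.rapids.shuffle.compression.codec"))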

Additional context

  • If the decimal columns are cast to double (see the sketch below), the queries run.
  • If spark.rapids.sql.decimalType.enabled=false is set, the queries run.
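
A sketch of the first workaround, using a hypothetical DataFrame df and column name list_price (any decimal-typed input column would need the same treatment):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical workaround: cast a decimal column to double before running
// the query, so no decimal batches flow through the compressed shuffle.
val noDecimals = df.withColumn("list_price", col("list_price").cast(DoubleType))
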
zhnin added the "? - Needs Triage" and "bug" labels on May 10, 2021
abellina self-assigned this on May 11, 2021
abellina (Collaborator) commented

@zhnin thanks for the details in this issue. I'll try to reproduce this locally and get back to you.

sameerz removed the "? - Needs Triage" label on May 11, 2021
abellina (Collaborator) commented

@zhnin I was able to reproduce it, thanks to the well-documented issue. Looking into a fix.

abellina (Collaborator) commented

@andygrove suggested looking at this one rule, and removing it makes @zhnin's case pass.

@revans2 @jlowe @andygrove, we could either remove this, or perhaps test for compressed vectors in C2R (see the sketch after the diff below). Any strong feelings?

diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
index 6e3ef94b..c99ee4cf 100644
--- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
+++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
@@ -170,10 +170,10 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
    *       not unusual.
    */
   def optimizeCoalesce(plan: SparkPlan): SparkPlan = plan match {
-    case c2r: GpuColumnarToRowExecParent if c2r.child.isInstanceOf[GpuCoalesceBatches] =>
-      // Don't build a batch if we are just going to go back to ROWS
-      val co = c2r.child.asInstanceOf[GpuCoalesceBatches]
-      c2r.withNewChildren(co.children.map(optimizeCoalesce))
+    //case c2r: GpuColumnarToRowExecParent if c2r.child.isInstanceOf[GpuCoalesceBatches] =>
+    //  // Don't build a batch if we are just going to go back to ROWS
+    //  val co = c2r.child.asInstanceOf[GpuCoalesceBatches]
+    //  c2r.withNewChildren(co.children.map(optimizeCoalesce))
     case GpuCoalesceBatches(r2c: GpuRowToColumnarExec, goal: TargetSize) =>
       // TODO in the future we should support this for all goals, but
       // GpuRowToColumnarExec preallocates all of the memory, and the builder does not
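
For reference, a rough sketch of the "test for compressed vectors in C2R" alternative; decompressBatch here is a hypothetical helper, not an existing plugin API:

import com.nvidia.spark.rapids.GpuCompressedColumnVector
import org.apache.spark.sql.vectorized.ColumnarBatch

// Sketch: before extracting columns in the columnar-to-row path, detect a
// compressed batch and decompress it first. Not the fix that was merged.
def ensureDecompressed(batch: ColumnarBatch): ColumnarBatch =
  if (batch.numCols() > 0 &&
      batch.column(0).isInstanceOf[GpuCompressedColumnVector]) {
    decompressBatch(batch) // hypothetical helper wrapping the shuffle codec
  } else {
    batch
  }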

jlowe (Member) commented May 11, 2021

I think there are basically three ways to tackle this:

  1. Avoid "optimizing" a GPU coalesce if the preceding node is a shuffle
  2. Have GpuColumnarToRow expect and handle compressed batches
  3. Use a separate exec node to handle decompression separately from coalesce, i.e., similar to the GpuShuffleCoalesceExec that is used for legacy shuffle

I think the first option is best, at least in the short term. Getting good decompression performance does require some batching, which is what GpuCoalesceExec is already doing, and I'd rather not spread the knowledge and handling of compressed batches to something like GpuColumnarToRow. Having a separate exec for decompression could be a bit cleaner, but it's a more invasive change and would have a lot of overlap with the coalesce exec, since we want to build bigger batches for better decompression parallelism.

So my vote is to make the rule a bit smarter and have it skip optimizing a GpuCoalesceExec whose preceding node is a shuffle, as sketched below.
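
For illustration, a rough sketch of that smarter rule. The actual fix landed in #2396 and may differ; ShuffleExchangeLike stands in for whatever shuffle node type the plugin really matches on:

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.ShuffleExchangeLike

// Sketch: only elide the coalesce under a columnar-to-row node when its
// input does not come straight from a shuffle, since shuffle output may
// still be compressed and C2R cannot consume GpuCompressedColumnVector.
def optimizeCoalesce(plan: SparkPlan): SparkPlan = plan match {
  case c2r: GpuColumnarToRowExecParent
      if c2r.child.isInstanceOf[GpuCoalesceBatches] &&
        !c2r.child.asInstanceOf[GpuCoalesceBatches].child.isInstanceOf[ShuffleExchangeLike] =>
    // Safe to skip building the batch: the input is not compressed shuffle data
    val co = c2r.child.asInstanceOf[GpuCoalesceBatches]
    c2r.withNewChildren(co.children.map(optimizeCoalesce))
  case p =>
    p.withNewChildren(p.children.map(optimizeCoalesce))
}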

abellina (Collaborator) commented

@zhnin could you try again with the latest changes in branch-0.6 to make sure it works for you?

zhnin (Author) commented May 13, 2021

> @zhnin could you try again with the latest changes in branch-0.6 to make sure it works for you?

Yes, I have tested it, and it works for me.
