[BUG] TPC-DS-like query 59 at scale=3TB with AQE fails with join mismatch #1654

jlowe · 2021-02-02T22:07:55Z

Describe the bug
Running TpcdsLikeSpark.query("q59") at a large scale fails with a join mismatch error:

java.lang.IllegalStateException: Join needs to run on CPU but at least one input query stage ran on GPU
  at com.nvidia.spark.rapids.SparkPlanMeta.makeShuffleConsistent(RapidsMeta.scala:572)
  at com.nvidia.spark.rapids.SparkPlanMeta.fixUpJoinConsistencyIfNeeded(RapidsMeta.scala:587)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1$adapted(RapidsMeta.scala:584)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at com.nvidia.spark.rapids.SparkPlanMeta.fixUpJoinConsistencyIfNeeded(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1$adapted(RapidsMeta.scala:584)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at com.nvidia.spark.rapids.SparkPlanMeta.fixUpJoinConsistencyIfNeeded(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.runAfterTagRules(RapidsMeta.scala:640)
  at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:2633)
  at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:2612)
  at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:2608)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$1(AdaptiveSparkPlanExec.scala:599)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:89)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:599)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.reOptimize(AdaptiveSparkPlanExec.scala:503)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:219)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:159)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:255)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
  at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2940)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2940)
  ... 49 elided

Steps/Code to reproduce bug
With AQE enabled (i.e., spark.sql.adaptive.enabled=true), run TpcdsLikeSpark.query("q59")(spark).collect
Note that this error can be reproduced at much smaller dataset scales if broadcast joins are effectively disabled (e.g., by setting spark.sql.autoBroadcastJoinThreshold=1).
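
The reproduction steps above could be sketched in a spark-shell session roughly as follows. This is a minimal sketch, not a verified script: it assumes the RAPIDS Accelerator and its TPC-DS-like benchmark classes are on the classpath, that the import path shown is correct for the build in use, and that the TPC-DS tables have already been registered in the session.

```scala
// Sketch for spark-shell; `spark` is the session the shell provides.
// The import path below is an assumption and may differ between releases.
import com.nvidia.spark.rapids.tests.tpcds.TpcdsLikeSpark

// Enable adaptive query execution (AQE), as required by the report.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Optional: effectively disable broadcast joins so the failure can be
// reproduced at a much smaller scale than 3TB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1")

// Assumes the TPC-DS tables are already set up in this session.
// On affected builds this is expected to throw the IllegalStateException above.
TpcdsLikeSpark.query("q59")(spark).collect
```

The broadcast threshold setting is only needed when working with a small dataset; at the 3TB scale from the report, enabling AQE alone was enough to trigger the failure.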

jlowe added the bug (Something isn't working) and SQL (part of the SQL/Dataframe plugin) labels Feb 2, 2021

jlowe self-assigned this Feb 2, 2021

jlowe mentioned this issue Feb 3, 2021

Fix extraneous shuffles added by AQE #1655

Merged

jlowe closed this as completed in #1655 Feb 3, 2021
