[BUG] TPC-DS-like query 59 at scale=3TB with AQE fails with join mismatch #1654

Closed
jlowe opened this issue Feb 2, 2021 · 0 comments · Fixed by #1655
Labels: bug (Something isn't working), SQL (part of the SQL/Dataframe plugin)

jlowe commented Feb 2, 2021

Describe the bug
Running TpcdsLikeSpark.query("q59") at a large scale fails with a join mismatch error:

java.lang.IllegalStateException: Join needs to run on CPU but at least one input query stage ran on GPU
  at com.nvidia.spark.rapids.SparkPlanMeta.makeShuffleConsistent(RapidsMeta.scala:572)
  at com.nvidia.spark.rapids.SparkPlanMeta.fixUpJoinConsistencyIfNeeded(RapidsMeta.scala:587)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1$adapted(RapidsMeta.scala:584)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at com.nvidia.spark.rapids.SparkPlanMeta.fixUpJoinConsistencyIfNeeded(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.$anonfun$fixUpJoinConsistencyIfNeeded$1$adapted(RapidsMeta.scala:584)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at com.nvidia.spark.rapids.SparkPlanMeta.fixUpJoinConsistencyIfNeeded(RapidsMeta.scala:584)
  at com.nvidia.spark.rapids.SparkPlanMeta.runAfterTagRules(RapidsMeta.scala:640)
  at com.nvidia.spark.rapids.GpuOverrides.apply(GpuOverrides.scala:2633)
  at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:2612)
  at com.nvidia.spark.rapids.GpuQueryStagePrepOverrides.apply(GpuOverrides.scala:2608)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.$anonfun$applyPhysicalRules$1(AdaptiveSparkPlanExec.scala:599)
  at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
  at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
  at scala.collection.immutable.List.foldLeft(List.scala:89)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec$.applyPhysicalRules(AdaptiveSparkPlanExec.scala:599)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.reOptimize(AdaptiveSparkPlanExec.scala:503)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:219)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:159)
  at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:255)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
  at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2940)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2940)
  ... 49 elided

Steps/Code to reproduce bug
With AQE enabled (spark.sql.adaptive.enabled=true), run TpcdsLikeSpark.query("q59")(spark).collect
Note that this error can be reproduced at much smaller dataset scales if broadcast joins are effectively disabled, e.g. by setting spark.sql.autoBroadcastJoinThreshold=1
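
For convenience, the reproduction steps above might look like the following in a spark-shell session. This is only a sketch: it assumes the RAPIDS plugin is already loaded, that TpcdsLikeSpark has been imported from the plugin's benchmark classes, and that the TPC-DS-like tables have already been registered for the session. None of that setup is shown in this issue.

```scala
// Sketch of the repro in spark-shell. Assumptions (not shown in the issue):
// the RAPIDS plugin is active, TpcdsLikeSpark is imported, and the
// TPC-DS-like tables are already registered in this session.

// Enable adaptive query execution, as described in the repro steps.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Optional: effectively disable broadcast joins so the failure can be
// reproduced at a much smaller scale than 3TB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "1")

// Run query 59; on affected builds this throws the IllegalStateException above.
TpcdsLikeSpark.query("q59")(spark).collect
```

Both settings are standard Spark SQL configuration keys, so they could equally be passed as --conf options when launching the shell.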

jlowe added the bug and SQL labels Feb 2, 2021
jlowe self-assigned this Feb 2, 2021