
[BUG] Non-Deterministic expressions in an array_transform can cause errors #9472

Closed
revans2 opened this issue Oct 18, 2023 · 5 comments · Fixed by #10015
Labels
bug Something isn't working

Comments

revans2 (Collaborator) commented Oct 18, 2023

Describe the bug
In Spark 3.5.0 a change went in (we still need to track down exactly which one) that expanded where non-deterministic expressions are allowed to run, beyond the original Project, Filter, Aggregate, and Window operators.

https://github.com/apache/spark/blob/8c2b3319c6734250ff9d72f3d7e5cab56b142195/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L584-L589

As part of this, a bug fix for generate went into Spark 3.5.1.

https://issues.apache.org/jira/browse/SPARK-45171
apache/spark@9c0b803ba12

But that bug fix happens during execution of the plan, so if we run the query on 3.5.0 we get a different error.

23/10/18 14:02:47 ERROR Executor: Exception in task 6.0 in stage 2.0 (TID 11)   
java.lang.IllegalArgumentException: Mismatch in the number of rows in the data columns
	at com.nvidia.spark.rapids.GpuListUtils$.$anonfun$replaceListDataColumnAsView$1(GpuListUtils.scala:45)
	at com.nvidia.spark.rapids.GpuListUtils$.$anonfun$replaceListDataColumnAsView$1$adapted(GpuListUtils.scala:42)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuListUtils$.replaceListDataColumnAsView(GpuListUtils.scala:42)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.$anonfun$columnarEval$3(higherOrderFunctions.scala:283)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.$anonfun$columnarEval$1(higherOrderFunctions.scala:282)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.columnarEval(higherOrderFunctions.scala:278)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.columnarEval$(higherOrderFunctions.scala:277)
	at com.nvidia.spark.rapids.GpuArrayTransform.columnarEval(higherOrderFunctions.scala:291)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:111)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:220)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:217)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:217)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:252)
	at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:111)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$7(basicPhysicalOperators.scala:161)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:268)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$6(basicPhysicalOperators.scala:161)

We need to:

  1. Fix the bug that we see when we run this query (not on the CPU, because the CPU still has a bug there).
  2. Have a test that will work for 3.5.1, even if it is still a WIP, so that we get the same value as the CPU does.
  3. Figure out what other operators might now support non-deterministic expressions.
  4. File issues/add test cases for them so that we get proper coverage.

Steps/Code to reproduce bug

spark.range(1).selectExpr("explode(transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22))").show()

I had a few issues with the SQL as-is on some versions of Spark, so I went with this query, which appears to be a problem on every version that we support.
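To see the shape of the failure without Spark, here is a minimal pure-Python sketch (make_sequence is a hypothetical stand-in for sequence(0, cast(rand()*1000 as int) + 1), not plugin code) of why two independent evaluations of a non-deterministic expression disagree:

```python
import random

def make_sequence(rng):
    # Stand-in for the non-deterministic expression
    # sequence(0, cast(rand()*1000 as int) + 1): each evaluation draws a
    # fresh random length, so two evaluations rarely agree.
    num_entries = int(rng.random() * 1000) + 1
    return list(range(num_entries + 1))

rng = random.Random()
as_list_column = make_sequence(rng)   # 1st evaluation: builds the list column
as_lambda_input = make_sequence(rng)  # 2nd evaluation: re-run inside the transform's project
# Whenever the two lengths differ, stitching the transformed data back into
# the list column fails, which matches the error in the stack trace above.
print(len(as_list_column), len(as_lambda_input))
```

With a fixed seed the two evaluations can be made to agree, which is exactly why the bug only shows up with genuinely non-deterministic input.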

Expected behavior
We should be able to run the query correctly even if the CPU cannot.

revans2 added the bug and ? - Needs Triage labels Oct 18, 2023
mattahrens added the audit_3.5.1 label and removed the ? - Needs Triage label Oct 24, 2023
razajafri (Collaborator) commented:

According to this piece of code, I think two more operators were added to the original list: Expand and Generate.

Regarding the bug in the plugin: it is happening because we run the rand() function twice, once when the sequence is created and a second time as part of the Project when calling transform. I am not sure why we do it this way, although I understand there must be a good reason. The input to the transform function has already been evaluated, so why not just use that instead of re-running it? Whatever the reason, rand() generates a different length for the sequence each time, and when we try to replace the list child it complains about the lengths being different.
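The "Mismatch in the number of rows" message comes from a row-count check when replacing the list's data column with a view; a rough Python analog of that guard (illustrative only, not the actual GpuListUtils code) is:

```python
def replace_list_data(original_data, new_data):
    # Rough analog of the guard in GpuListUtils.replaceListDataColumnAsView:
    # the replacement data column must have exactly as many rows as the
    # original list data, or the list's offsets cannot be reused as-is.
    if len(original_data) != len(new_data):
        raise ValueError("Mismatch in the number of rows in the data columns")
    return new_data
```

With two independent rand() evaluations the lengths disagree, so this check fires and produces the exception seen in the stack trace.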

razajafri (Collaborator) commented:

Looking more into this, I found this comment, which hints at the deeper problem.

@revans2 can we talk a bit about this for me to understand this problem better?

revans2 (Collaborator, Author) commented Dec 11, 2023

@razajafri you are right this error has nothing to do with explode. It is a problem with non-determinism and higher order functions.

spark.range(1).selectExpr("transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) as t").show()

This caused the same error on Spark 3.3.0, but if we refactor it to pull the rand() out into a separate project, then we get an answer out.

> spark.range(1).selectExpr("cast(rand()*1000 as int) + 1 as num_entries").selectExpr("*", "transform(sequence(0, num_entries), x -> x * 22) as t").show()
+-----------+--------------------+
|num_entries|                   t|
+-----------+--------------------+
|        455|[0, 22, 44, 66, 8...|
+-----------+--------------------+

I don't know yet why we are calling rand twice. I need to look into it more.

razajafri (Collaborator) commented:

Thank you for confirming.

Just to add my findings: this is the line where we call it the first time; we then call makeElementProjectBatch, which calls project, which runs rand again here.

revans2 (Collaborator, Author) commented Dec 11, 2023

Okay, I figured it out and fixed it. It's my bad. It looks like I started down the route of computing it once, but then ended up doing it twice for some odd reason.
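The shape of the fix can be sketched in a few lines (again a hypothetical Python illustration, not the actual plugin change): evaluate the non-deterministic argument once, then feed that single result both to the list column and to the lambda projection.

```python
import random

def transform_once(rng, f):
    # Evaluate the non-deterministic argument exactly once, then apply the
    # lambda to that same buffer, so input and output row counts always match.
    num_entries = int(rng.random() * 1000) + 1
    values = list(range(num_entries + 1))   # computed exactly once
    return values, [f(x) for x in values]   # lambda sees the same data

vals, out = transform_once(random.Random(), lambda x: x * 22)
print(len(vals) == len(out))
```

Because there is only one evaluation, the row-count check that previously failed can never trip.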

@revans2 revans2 self-assigned this Dec 11, 2023
@revans2 revans2 changed the title [BUG] Non-Deterministic Expressions are not as limited in 3.5.0 [BUG] Non-Deterministic expressions in an array_transform can cause errors Dec 11, 2023
@revans2 revans2 removed the audit_3.5.1 Audit related tasks for 3.5.1 label Dec 11, 2023