
[BUG] Non-Deterministic expressions in an array_transform can cause errors #9472

Closed
revans2 opened this issue Oct 18, 2023 · 5 comments · Fixed by #10015
Labels
bug Something isn't working

Comments

revans2 (Collaborator) commented Oct 18, 2023

Describe the bug
In Spark 3.5.0 a change went in (we still need to track down exactly which one) that expanded where non-deterministic expressions are allowed to run, beyond the original Project, Filter, Aggregate, and Window operators.

https://github.com/apache/spark/blob/8c2b3319c6734250ff9d72f3d7e5cab56b142195/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L584-L589

As part of this, a bug fix for generate went into Spark 3.5.1.

https://issues.apache.org/jira/browse/SPARK-45171
apache/spark@9c0b803ba12

But that bug fix happens during execution of the plan, so if we run the query on 3.5.0 we get a different error.

23/10/18 14:02:47 ERROR Executor: Exception in task 6.0 in stage 2.0 (TID 11)   
java.lang.IllegalArgumentException: Mismatch in the number of rows in the data columns
	at com.nvidia.spark.rapids.GpuListUtils$.$anonfun$replaceListDataColumnAsView$1(GpuListUtils.scala:45)
	at com.nvidia.spark.rapids.GpuListUtils$.$anonfun$replaceListDataColumnAsView$1$adapted(GpuListUtils.scala:42)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuListUtils$.replaceListDataColumnAsView(GpuListUtils.scala:42)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.$anonfun$columnarEval$3(higherOrderFunctions.scala:283)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.$anonfun$columnarEval$1(higherOrderFunctions.scala:282)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:29)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.columnarEval(higherOrderFunctions.scala:278)
	at com.nvidia.spark.rapids.GpuArrayTransformBase.columnarEval$(higherOrderFunctions.scala:277)
	at com.nvidia.spark.rapids.GpuArrayTransform.columnarEval(higherOrderFunctions.scala:291)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:111)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:220)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:217)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:217)
	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:252)
	at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:111)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$7(basicPhysicalOperators.scala:161)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRestoreOnRetry(RmmRapidsRetryIterator.scala:268)
	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$projectWithRetrySingleBatch$6(basicPhysicalOperators.scala:161)

We need to:

  1. Fix the bug that we see when we run this query (not on the CPU, because the CPU still has a bug there).
  2. Have a test that will work for 3.5.1, even if it is still a WIP, so that we get the same value as the CPU does.
  3. Figure out what other operators might now support non-deterministic expressions.
  4. File issues/add test cases for them so that we get proper coverage.

Steps/Code to reproduce bug

spark.range(1).selectExpr("explode(transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22))").show()

I had a few issues with the SQL as-is on some versions of Spark, so I went with this query, which appears to be a problem on every version that we support.
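To see the shape of the failure without Spark, here is a minimal pure-Python sketch (make_sequence is a hypothetical stand-in for sequence(0, cast(rand()*1000 as int) + 1), not plugin code) of why two independent evaluations of a non-deterministic expression disagree:

```python
import random

def make_sequence(rng):
    # Stand-in for the non-deterministic expression
    # sequence(0, cast(rand()*1000 as int) + 1): each evaluation draws a
    # fresh random length, so two evaluations rarely agree.
    num_entries = int(rng.random() * 1000) + 1
    return list(range(num_entries + 1))

rng = random.Random()
as_list_column = make_sequence(rng)   # 1st evaluation: builds the list column
as_lambda_input = make_sequence(rng)  # 2nd evaluation: re-run inside the transform's project
# Whenever the two lengths differ, stitching the transformed data back into
# the list column fails, which matches the error in the stack trace above.
print(len(as_list_column), len(as_lambda_input))
```

With a fixed seed the two evaluations can be made to agree, which is exactly why the bug only shows up with genuinely non-deterministic input.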

Expected behavior
We should be able to run the query correctly even if the CPU cannot.

revans2 added the bug and ? - Needs Triage labels Oct 18, 2023
mattahrens added the audit_3.5.1 label and removed the ? - Needs Triage label Oct 24, 2023
razajafri (Collaborator) commented:

According to this piece of code, I think two more operators were added to the original list: Expand and Generate.

Regarding the bug in the plugin: it is happening because we run the rand() function twice, once when the sequence is created and a second time as part of the Project when calling transform. I am not sure why we do it this way, although I understand there must be a good reason. The input to the transform function has already been evaluated, so why not just use that instead of re-running it? Whatever the reason, rand() generates a different length for the sequence each time, and when we try to replace the list child it complains about the lengths being different.
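The "Mismatch in the number of rows" message comes from a row-count check when replacing the list's data column with a view; a rough Python analog of that guard (illustrative only, not the actual GpuListUtils code) is:

```python
def replace_list_data(original_data, new_data):
    # Rough analog of the guard in GpuListUtils.replaceListDataColumnAsView:
    # the replacement data column must have exactly as many rows as the
    # original list data, or the list's offsets cannot be reused as-is.
    if len(original_data) != len(new_data):
        raise ValueError("Mismatch in the number of rows in the data columns")
    return new_data
```

With two independent rand() evaluations the lengths disagree, so this check fires and produces the exception seen in the stack trace.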

razajafri (Collaborator) commented:

Looking more into this, I found this comment, which hints at the deeper problem.

@revans2 can we talk a bit about this for me to understand this problem better?

revans2 (Collaborator, Author) commented Dec 11, 2023

@razajafri you are right this error has nothing to do with explode. It is a problem with non-determinism and higher order functions.

spark.range(1).selectExpr("transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) as t").show()

This caused the same error on Spark 3.3.0, but if we refactor it to pull the rand() out into a separate project, then we get an answer out.

> spark.range(1).selectExpr("cast(rand()*1000 as int) + 1 as num_entries").selectExpr("*", "transform(sequence(0, num_entries), x -> x * 22) as t").show()
+-----------+--------------------+
|num_entries|                   t|
+-----------+--------------------+
|        455|[0, 22, 44, 66, 8...|
+-----------+--------------------+

I don't know yet why we are calling rand twice. I need to look into it more.

razajafri (Collaborator) commented:

Thank you for confirming.

Just to add my findings: this is the line where we call it the first time; we then call makeElementProjectBatch, which calls project, which runs rand again here.

revans2 (Collaborator, Author) commented Dec 11, 2023

Okay, I figured it out and fixed it. It's my bad. It looks like I started down the route of computing it once, but then ended up doing it twice for some odd reason.
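The shape of the fix can be sketched in a few lines (again a hypothetical Python illustration, not the actual plugin change): evaluate the non-deterministic argument once, then feed that single result both to the list column and to the lambda projection.

```python
import random

def transform_once(rng, f):
    # Evaluate the non-deterministic argument exactly once, then apply the
    # lambda to that same buffer, so input and output row counts always match.
    num_entries = int(rng.random() * 1000) + 1
    values = list(range(num_entries + 1))   # computed exactly once
    return values, [f(x) for x in values]   # lambda sees the same data

vals, out = transform_once(random.Random(), lambda x: x * 22)
print(len(vals) == len(out))
```

Because there is only one evaluation, the row-count check that previously failed can never trip.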

@revans2 revans2 self-assigned this Dec 11, 2023
@revans2 revans2 changed the title [BUG] Non-Deterministic Expressions are not as limited in 3.5.0 [BUG] Non-Deterministic expressions in an array_transform can cause errors Dec 11, 2023
@revans2 revans2 removed the audit_3.5.1 Audit related tasks for 3.5.1 label Dec 11, 2023