Document compatibility of operations with side effects. #3946

Merged 1 commit on Oct 28, 2021
52 changes: 51 additions & 1 deletion docs/compatibility.md
@@ -624,4 +624,54 @@ The GPU implementation of RLike does not support empty groups correctly.
| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `z()?` | `a` | No Match | Match |
| `z()*` | `a` | No Match | Match |

## Conditionals and operations with side effects (ANSI mode)

In Apache Spark, conditional operations like `if`, `coalesce`, and `case/when` lazily evaluate
their parameters on a row-by-row basis. On the GPU it is generally more efficient to
evaluate the parameters regardless of the condition and then select which result to return
based on the condition. This is fine so long as evaluating a parameter causes no side
effects. That is true of most expressions in Spark, but in ANSI mode many expressions can
throw exceptions, such as the `Add` expression when an overflow happens. The same applies to
UDFs, which by their nature are user defined and can have side effects like throwing
exceptions.
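
As a minimal illustration of such a side effect (a sketch, not from the original document;
it assumes a `spark-shell` session like the examples below), an overflowing `Add` in ANSI
mode throws instead of wrapping around:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

// With ANSI mode enabled, adding 1 to Long.MaxValue throws
// java.lang.ArithmeticException ("long overflow") instead of wrapping.
Seq(Long.MaxValue).toDF("val")
  .selectExpr("val + 1 as ret")
  .show()
```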

Currently, the RAPIDS Accelerator
[assumes that there are no side effects](https://github.com/NVIDIA/spark-rapids/issues/3849).
This can result in situations, specifically in ANSI mode, where the RAPIDS Accelerator will
always throw an exception, but Spark on the CPU will not. For example:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

Seq(0L, Long.MaxValue).toDF("val")
.repartition(1) // The repartition makes Spark not optimize selectExpr away
.selectExpr("IF(val > 1000, null, val + 1) as ret")
.show()
```

If the above example is run on the CPU you will get a result like:
```
+----+
| ret|
+----+
| 1|
|null|
+----+
```

But if it is run on the GPU an overflow exception is thrown. As explained above, this
is because the RAPIDS Accelerator evaluates both `val + 1` and `null` regardless of
the result of the condition. In some cases you can work around this. The above example
can be re-written so the `if` happens before the `Add` operation.

```scala
Seq(0L, Long.MaxValue).toDF("val")
.repartition(1) // The repartition makes Spark not optimize selectExpr away
.selectExpr("IF(val > 1000, null, val) + 1 as ret")
.show()
```

But this rewrite cannot be done generically; it requires detailed knowledge of
what can trigger a side effect.
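
The same rewrite should also apply to `case/when` (a hypothetical variation on the example
above, not from the original document), because the `Add` again only sees the value the
conditional has already replaced with `null`:

```scala
// The overflowing row is mapped to null before the addition, so the
// Add never sees Long.MaxValue and no exception is thrown on the GPU.
Seq(0L, Long.MaxValue).toDF("val")
  .repartition(1) // The repartition makes Spark not optimize selectExpr away
  .selectExpr("CASE WHEN val > 1000 THEN null ELSE val END + 1 as ret")
  .show()
```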