Update string to float compatibility doc[skip ci] #10156

Merged 2 commits on Jan 17, 2024

docs/compatibility.md: 8 additions & 4 deletions

```diff
@@ -722,14 +722,18 @@ This configuration is enabled by default. To disable this operation on the GPU s
 
 ### String to Float
 
-Casting from string to floating-point types on the GPU returns incorrect results when the string
-represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
-default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
```

**Contributor** commented on lines -726 to -727:

Our behavior is still inconsistent with Spark for this case, although the behavior does seem to have changed since this documentation was written. Perhaps we need to update this documentation rather than remove it?

The test below was run with Spark 3.1.1:

```
scala> val df = Seq("1.7976931348623158E308", "123").toDF("a").repartition(2)
scala> val df2 = df.withColumn("b", col("a").cast(DataTypes.DoubleType))
scala> spark.conf.set("spark.rapids.sql.enabled", false)
scala> df2.show
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|1.797693134862315...|1.797693134862315...|
|                 123|               123.0|
+--------------------+--------------------+

scala> spark.conf.set("spark.rapids.sql.enabled", true)

scala> df2.show
24/01/08 18:34:56 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(cast(a#4 as double) as string) AS b#30 will run on GPU
      *Expression <Cast> cast(cast(a#4 as double) as string) will run on GPU
        *Expression <Cast> cast(a#4 as double) will run on GPU
    *Exec <ShuffleExchangeExec> will run on GPU
      *Partitioning <RoundRobinPartitioning> will run on GPU
      ! <LocalTableScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
        @Expression <AttributeReference> a#4 could run on GPU

+--------------------+--------+
|                   a|       b|
+--------------------+--------+
|1.797693134862315...|Infinity|
|                 123|   123.0|
+--------------------+--------+

```

**Collaborator (Author)** replied:

Updated. Thanks for your test!

```diff
+Casting from string to double on the GPU returns incorrect results when the string represents any
+number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The default behavior
+in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
 
 - `1.7976931348623158E308 <= x < 1.7976931348623159E308`
 - `-1.7976931348623159E308 < x <= -1.7976931348623158E308`
 
-Also, the GPU does not support casting from strings containing hex values.
+Casting from string to double on the GPU could also sometimes return incorrect results if the string
+contains high precision values. Apache Spark rounds the values to the nearest double, while the GPU
+truncates the values directly.
+
+Also, the GPU does not support casting from strings containing hex values to floating-point types.
 
 This configuration is enabled by default. To disable this operation on the GPU set
 [`spark.rapids.sql.castStringToFloat.enabled`](additional-functionality/advanced_configs.md#sql.castStringToFloat.enabled) to `false`.
```

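The high-precision caveat added in this diff can be checked the same way the reviewer checked the overflow case: build one DataFrame and show it with the plugin toggled off and on. Below is a minimal sketch, not part of the PR, assuming spark-shell with the RAPIDS plugin on the classpath; the input literal is a hypothetical example value, not one taken from the PR.

```scala
// Minimal sketch (not from the PR), intended for spark-shell with the RAPIDS
// plugin loaded. The first literal is a hypothetical value with more
// significant digits than a double can hold, so the parsed result depends on
// whether the parser rounds (Spark CPU) or truncates (GPU).
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataTypes

val df = Seq("1.23456789012345678901", "123").toDF("a")
  .withColumn("b", col("a").cast(DataTypes.DoubleType))

spark.conf.set("spark.rapids.sql.enabled", false) // CPU: round to nearest double
df.show(false)

spark.conf.set("spark.rapids.sql.enabled", true)  // GPU: may truncate the extra digits
df.show(false)
```
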
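For reference, the opt-out named in the closing lines of the diff is an ordinary Spark conf, so it can be set per session or at submit time. A minimal sketch:

```scala
// Opt out of string-to-float casts on the GPU; the plugin will then not
// replace these casts, and the affected parts of the plan fall back to the CPU.
spark.conf.set("spark.rapids.sql.castStringToFloat.enabled", false)

// Equivalent at submit time:
//   --conf spark.rapids.sql.castStringToFloat.enabled=false
```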