Add Rlike support #3796

Merged 10 commits on Oct 19, 2021
54 changes: 51 additions & 3 deletions docs/compatibility.md
@@ -569,6 +569,54 @@ distribution. Because the results are not bit-for-bit identical with the Apache
`approximate_percentile`, this feature is disabled by default and can be enabled by setting
`spark.rapids.approxPercentileEnabled=true`.

There are known issues with the approximate percentile implementation
([#3706](https://github.com/NVIDIA/spark-rapids/issues/3706),
[#3692](https://github.com/NVIDIA/spark-rapids/issues/3692)) and the feature should be considered experimental.

## RLike

The GPU implementation of RLike has a number of known issues where behavior is not consistent with Apache Spark, so
this expression is disabled by default. It can be enabled by setting `spark.rapids.sql.expression.RLike=true`.

A summary of known issues is shown below, but it is not intended to be a comprehensive list. We recommend that you
do your own testing to verify whether the GPU implementation of `RLike` is suitable for your use case.
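
As an illustration, the setting can be passed on the command line through the `--conf key=value` option of `spark-submit`. The sketch below only assembles those arguments. The helper name `rapids_conf_args` and the application file `my_app.py` are hypothetical and are not part of the plugin.

```python
def rapids_conf_args(settings):
    """Turn a dict of Spark settings into spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(settings.items()):
        # Render booleans as lowercase true/false, the form used in this guide.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={text}"]
    return args


# Opt in to the GPU implementation of RLike, which is disabled by default.
args = rapids_conf_args({"spark.rapids.sql.expression.RLike": True})

# my_app.py is a placeholder application name used only for this example.
print(" ".join(["spark-submit"] + args + ["my_app.py"]))
```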

### Multi-line handling

The GPU implementation of RLike treats `^` and `$` as the start and end of each line within a string, whereas
Spark uses `^` and `$` to refer to the start and end of the entire string (equivalent to `\A` and `\Z`).

| Pattern | Input | Spark on CPU | Spark on GPU |
|---------|--------|--------------|--------------|
| `^A` | `A\nB` | Match | Match |
| `A$` | `A\nB` | No Match | Match |
| `^B` | `A\nB` | No Match | Match |
| `B$` | `A\nB` | Match | Match |
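
The two anchoring behaviors in the table above can be reproduced with a minimal sketch that uses Python's `re` module as a stand-in. This is an assumption for illustration only: Python's engine is neither the Java engine used by Spark on the CPU nor the GPU engine. Its default mode models whole-string anchors and `re.MULTILINE` models per-line anchors.

```python
import re

# (pattern, input) pairs taken from the table above.
CASES = [("^A", "A\nB"), ("A$", "A\nB"), ("^B", "A\nB"), ("B$", "A\nB")]


def matches(pattern, text, per_line):
    """Return True if `pattern` is found in `text`.

    per_line=False: ^ and $ anchor to the whole string (the CPU column).
    per_line=True:  ^ and $ anchor to each line (the GPU column).
    """
    flags = re.MULTILINE if per_line else 0
    return re.search(pattern, text, flags) is not None


results = []
for pattern, text in CASES:
    whole = matches(pattern, text, per_line=False)
    per_line = matches(pattern, text, per_line=True)
    results.append((whole, per_line))
    print(f"{pattern!r:6} whole-string={whole!s:5} per-line={per_line}")
```

Rows where the two values differ correspond to the rows where the CPU and GPU columns disagree.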

### Null character in input

The GPU implementation of RLike will not match anything after a null character within a string.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|-----------|--------------|--------------|
| `A` | `\u0000A` | Match | No Match |
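
One way to reason about this difference is to model the GPU behavior as searching only the text before the first null character. The sketch below does that with Python's `re` module. Both the truncation model and the helper name `search_up_to_null` are assumptions made for illustration and do not describe how the GPU implementation works internally.

```python
import re


def search_up_to_null(pattern, text):
    """Hypothetical model: search only the text before the first NUL character."""
    visible = text.split("\u0000", 1)[0]
    return re.search(pattern, visible) is not None


text = "\u0000A"

# Searching the full string finds the "A" after the NUL (the CPU column).
cpu_like = re.search("A", text) is not None

# Under the truncation model nothing after the NUL is visible (the GPU column).
gpu_like = search_up_to_null("A", text)

print(f"full search: {cpu_like}, truncated search: {gpu_like}")
```

Checking whether the input data contains `\u0000` at all is a simple way to find out if a workload is exposed to this difference.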

### Quantifiers with nothing to repeat

Spark supports quantifiers in cases where there is nothing to repeat. For example, Spark supports `a*+` and this
will match all inputs. The GPU implementation of RLike does not support this syntax and will throw an exception with
the message `nothing to repeat at position 0`.

### Stricter escaping requirements

The GPU implementation of RLike has stricter requirements around escaping special characters in some cases.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `a[-+]` | `a-` | Match | No Match |
| `a[\-\+]` | `a-` | Match | Match |
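
Because the escaped form matches on both CPU and GPU in the table above, writing character classes with explicit escapes is the more portable choice. The sketch below uses Python's `re` module as a stand-in engine (an assumption, since it is not the engine used by Spark) to check that the escaped and unescaped patterns accept the same sample inputs. The sample inputs are made up for this example.

```python
import re

unescaped = re.compile(r"a[-+]")
escaped = re.compile(r"a[\-\+]")

# Illustrative inputs; only "a-" appears in the table above.
samples = ["a-", "a+", "ab", "a"]

results = {}
for text in samples:
    plain = unescaped.search(text) is not None
    strict = escaped.search(text) is not None
    # Both spellings describe the same character class in this engine.
    assert plain == strict
    results[text] = strict

print(results)
```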

### Miscellaneous

Here are some other edge cases where results do not match between CPU and GPU.

| Pattern | Input | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `z()?1+` | `a12b` | No Match | Match |
| `z()*1+` | `a12b` | No Match | Match |
1 change: 1 addition & 0 deletions docs/configs.md
@@ -256,6 +256,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.PromotePrecision"></a>spark.rapids.sql.expression.PromotePrecision| |PromotePrecision before arithmetic operations between DecimalType data|true|None|
<a name="sql.expression.PythonUDF"></a>spark.rapids.sql.expression.PythonUDF| |UDF run in an external python process. Does not actually run on the GPU, but the transfer of data to/from it can be accelerated|true|None|
<a name="sql.expression.Quarter"></a>spark.rapids.sql.expression.Quarter|`quarter`|Returns the quarter of the year for date, in the range 1 to 4|true|None|
<a name="sql.expression.RLike"></a>spark.rapids.sql.expression.RLike|`rlike`|RLike|false|This is disabled by default because the GPU implementation of rlike is not compatible with Apache Spark. See the compatibility guide for more information.|
<a name="sql.expression.Rand"></a>spark.rapids.sql.expression.Rand|`random`, `rand`|Generate a random column with i.i.d. uniformly distributed values in [0, 1)|true|None|
<a name="sql.expression.Rank"></a>spark.rapids.sql.expression.Rank|`rank`|Window function that returns the rank value within the aggregation window|true|None|
<a name="sql.expression.RegExpReplace"></a>spark.rapids.sql.expression.RegExpReplace|`regexp_replace`|RegExpReplace support for string literal input patterns|true|None|
240 changes: 154 additions & 86 deletions docs/supported_ops.md
@@ -9611,16 +9611,21 @@ are limited.
<td> </td>
</tr>
<tr>
<td rowSpan="2">Rand</td>
<td rowSpan="2">`random`, `rand`</td>
<td rowSpan="2">Generate a random column with i.i.d. uniformly distributed values in [0, 1)</td>
<td rowSpan="2">None</td>
<td rowSpan="2">project</td>
<td>seed</td>
<td rowSpan="3">RLike</td>
<td rowSpan="3">`rlike`</td>
<td rowSpan="3">RLike</td>
<td rowSpan="3">This is disabled by default because the GPU implementation of rlike is not compatible with Apache Spark. See the compatibility guide for more information.</td>
<td rowSpan="3">project</td>
<td>str</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td>S</td>
<td> </td>
<td> </td>
@@ -9630,6 +9635,22 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>regexp</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>Literal value only</em></td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
@@ -9638,13 +9659,13 @@ are limited.
</tr>
<tr>
<td>result</td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
@@ -9684,6 +9705,53 @@ are limited.
<th>UDT</th>
</tr>
<tr>
<td rowSpan="2">Rand</td>
<td rowSpan="2">`random`, `rand`</td>
<td rowSpan="2">Generate a random column with i.i.d. uniformly distributed values in [0, 1)</td>
<td rowSpan="2">None</td>
<td rowSpan="2">project</td>
<td>seed</td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="2">Rank</td>
<td rowSpan="2">`rank`</td>
<td rowSpan="2">Window function that returns the rank value within the aggregation window</td>
@@ -9978,6 +10046,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="3">Round</td>
<td rowSpan="3">`round`</td>
<td rowSpan="3">Round an expression to d decimal places using HALF_UP rounding mode</td>
@@ -10046,32 +10140,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="1">RowNumber</td>
<td rowSpan="1">`row_number`</td>
<td rowSpan="1">Window function that returns the index for the row within the aggregation window</td>
@@ -10396,6 +10464,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="2">Signum</td>
<td rowSpan="2">`sign`, `signum`</td>
<td rowSpan="2">Returns -1.0, 0.0 or 1.0 as expr is negative, 0 or positive</td>
@@ -10443,32 +10537,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="4">Sin</td>
<td rowSpan="4">`sin`</td>
<td rowSpan="4">Sine</td>
@@ -10764,6 +10832,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="2">SortOrder</td>
<td rowSpan="2"> </td>
<td rowSpan="2">Sort order</td>
@@ -10811,32 +10905,6 @@ are limited.
<td><b>NS</b></td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="1">SparkPartitionID</td>
<td rowSpan="1">`spark_partition_id`</td>
<td rowSpan="1">Returns the current partition id</td>