Support str_to_map [databricks] #4636

Merged on Feb 23, 2022 (77 commits; changes shown from 73 commits)

Commits
77a760d
Add GpuOverride
ttnghia Jan 24, 2022
02cf8b8
Implement skeleton
ttnghia Jan 24, 2022
6d49cf6
Fix type check
ttnghia Jan 25, 2022
0ff0e3f
WIP
ttnghia Jan 25, 2022
5e6d23b
Simplify code
ttnghia Jan 25, 2022
7832760
Fix delimiter check
ttnghia Jan 25, 2022
6b4ffe5
Add comments
ttnghia Jan 26, 2022
2a64e97
Implement `GpuCreateMap.createMapFromKeysValuesAsStructs`
ttnghia Jan 26, 2022
08bb1b2
Using `GpuCreateMap.createMapFromKeysValuesAsStructs` to create map f…
ttnghia Jan 26, 2022
ac99eff
Update comment
ttnghia Jan 26, 2022
657fcc3
Change input for `createMapFromKeysValuesAsStructs`
ttnghia Jan 27, 2022
fca136a
Fix `toMap` method
ttnghia Jan 27, 2022
5c3e755
MISC
ttnghia Jan 28, 2022
a654cb2
Merge branch 'branch-22.04' into str_to_map
ttnghia Jan 28, 2022
310f331
Update copyright year
ttnghia Jan 28, 2022
a6d70ac
Remove unused constructors
ttnghia Jan 28, 2022
395e41f
Rename variables
ttnghia Jan 28, 2022
f3074e4
Fix type check
ttnghia Jan 28, 2022
c5d6780
Rename variable
ttnghia Jan 28, 2022
7d1ef76
Add test for scalar input
ttnghia Feb 2, 2022
6d314ad
Add test for column input
ttnghia Feb 2, 2022
64fb127
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 2, 2022
78b43c4
Tag for Gpu only if the input is not foldable
ttnghia Feb 3, 2022
e1f16e6
Working test
ttnghia Feb 3, 2022
a573f84
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 4, 2022
b459c2c
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 7, 2022
1cfc173
Draft implementation of string_split with regexp support
andygrove Feb 7, 2022
3929461
Rename maxSplit to limit and add TODO comment
andygrove Feb 7, 2022
92366bd
code cleanup and add separate tests for negative, zero, and positive …
andygrove Feb 7, 2022
51d870d
fall back to CPU for limit of 0 or 1
andygrove Feb 7, 2022
98f2e7e
WIP
ttnghia Feb 7, 2022
893705c
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 7, 2022
9c09603
fall back to CPU for split on regex containing string or line anchors
andygrove Feb 7, 2022
3d63d5b
Handle the case when all values are nulls
ttnghia Feb 7, 2022
d004361
Fix bug when cannot split string
ttnghia Feb 8, 2022
daa2bdb
Add tests
ttnghia Feb 8, 2022
caf598c
Fix delimiter check
ttnghia Feb 8, 2022
e686874
Change test
ttnghia Feb 8, 2022
fa3f4a9
Fix delimiter check
ttnghia Feb 8, 2022
5556717
move some logic from GpuStringSplit to GpuStringSplitMeta
andygrove Feb 8, 2022
14b873c
Additional tests
andygrove Feb 8, 2022
6773387
Merge branch 'string-split-regexp' into str_to_map
ttnghia Feb 8, 2022
5cc7ccd
update shims
andygrove Feb 8, 2022
19918c2
check that expression has been tagged
andygrove Feb 8, 2022
bc247e7
Fix compile error due to merging upstream
ttnghia Feb 8, 2022
481b356
update split_re tests to actually use regexp rather than simple strings
andygrove Feb 9, 2022
02fad12
merge from branch-22.04
andygrove Feb 10, 2022
0ef4830
fix merge issue
andygrove Feb 10, 2022
e396c69
update compatibility guide
andygrove Feb 10, 2022
e7dbe57
Merge branch 'string-split-regexp' into str_to_map
ttnghia Feb 10, 2022
7f1db1d
Update shims
ttnghia Feb 10, 2022
c70390f
fix incorrect imports in shim layer
andygrove Feb 11, 2022
b7f0cba
Revert "fix incorrect imports in shim layer"
andygrove Feb 11, 2022
4c94170
fix incorrect imports in shim layer
andygrove Feb 11, 2022
d5543d8
Merge branch 'string-split-regexp' into str_to_map
ttnghia Feb 11, 2022
aceea3d
Support regex
ttnghia Feb 11, 2022
e2ce6ea
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 14, 2022
b89beb6
Rewrite StringToMap
ttnghia Feb 14, 2022
f0c4a46
Fix tests
ttnghia Feb 14, 2022
c68b790
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 14, 2022
1090a50
Extract common code for GpuStringSplitMeta and GpuStringToMapMeta
ttnghia Feb 14, 2022
afe3f99
Rewrite function
ttnghia Feb 15, 2022
6f70de3
Fix tests
ttnghia Feb 15, 2022
1eefad7
Rename function
ttnghia Feb 15, 2022
afd1f16
Use `makeStructView` instead of `makeStruct` to avoid copying data
ttnghia Feb 15, 2022
6231d5b
Add regex tests
ttnghia Feb 15, 2022
dbfb5ac
Update `compatibility.md`
ttnghia Feb 15, 2022
b05de28
Update description
ttnghia Feb 15, 2022
e610f9b
Update docs
ttnghia Feb 15, 2022
656df3d
Update test
ttnghia Feb 15, 2022
db65868
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 17, 2022
57ba669
Rename test
ttnghia Feb 17, 2022
0bab1db
Inherit ShimExpression
ttnghia Feb 18, 2022
c237cb9
Update special pattern to almost always generate duplicate keys
ttnghia Feb 18, 2022
f08c093
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 18, 2022
4ef4a8e
Fix python import
ttnghia Feb 18, 2022
7c05021
Merge branch 'branch-22.04' into str_to_map
ttnghia Feb 23, 2022
3 changes: 2 additions & 1 deletion docs/compatibility.md
@@ -523,6 +523,7 @@ The following Apache Spark regular expression functions and expressions are supp
- `regexp_like`
- `regexp_replace`
- `string_split`
- `str_to_map`

Regular expression evaluation on the GPU can potentially have high memory overhead and cause out-of-memory errors. To
disable regular expressions on the GPU, set `spark.rapids.sql.regexp.enabled=false`.
@@ -536,7 +537,7 @@ Here are some examples of regular expression patterns that are not supported on
- Line anchor `$`
- String anchor `\Z`
- String anchor `\z` is not supported by `regexp_replace`
- Line and string anchors are not supported by `string_split`
- Line and string anchors are not supported by `string_split` and `str_to_map`
- Non-digit character class `\D`
- Non-word character class `\W`
- Word and non-word boundaries, `\b` and `\B`
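
The anchor restriction above for `string_split` and `str_to_map` can be illustrated with a small pre-check on the delimiter pattern. This is only a sketch: `has_line_or_string_anchor` is a hypothetical helper, not the plugin's actual check, and it assumes that "line anchors" means unescaped `^` and `$` and "string anchors" means `\A`, `\Z`, and `\z`. It ignores `\Q...\E` quoting and other regex corner cases.

```python
def has_line_or_string_anchor(pattern: str) -> bool:
    """Return True if the pattern appears to use a line or string anchor.

    Hypothetical helper for illustration only; a real implementation would
    parse the regular expression properly rather than scan characters.
    """
    in_class = False  # True while inside a [...] character class
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # An escape consumes the next character. Outside a character
            # class, \A, \Z and \z are treated here as string anchors.
            nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
            if not in_class and nxt in ("A", "Z", "z"):
                return True
            i += 2
            continue
        if in_class:
            # Inside a class, ^ means negation and $ is a literal.
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch in "^$":
            return True
        i += 1
    return False


if __name__ == "__main__":
    for p in [",", "[^,]+", r"\$", "^a", "a$", r"a\Z"]:
        print(repr(p), has_line_or_string_anchor(p))
```

A caller could use such a check to decide whether an expression should stay on the CPU when the delimiter contains an anchor.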
1 change: 1 addition & 0 deletions docs/configs.md
@@ -298,6 +298,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.StringRepeat"></a>spark.rapids.sql.expression.StringRepeat|`repeat`|StringRepeat operator that repeats the given strings with numbers of times given by repeatTimes|true|None|
<a name="sql.expression.StringReplace"></a>spark.rapids.sql.expression.StringReplace|`replace`|StringReplace operator|true|None|
<a name="sql.expression.StringSplit"></a>spark.rapids.sql.expression.StringSplit|`split`|Splits `str` around occurrences that match `regex`|true|None|
<a name="sql.expression.StringToMap"></a>spark.rapids.sql.expression.StringToMap|`str_to_map`|Creates a map after splitting the input string into pairs of key-value strings|true|None|
<a name="sql.expression.StringTrim"></a>spark.rapids.sql.expression.StringTrim|`trim`|StringTrim operator|true|None|
<a name="sql.expression.StringTrimLeft"></a>spark.rapids.sql.expression.StringTrimLeft|`ltrim`|StringTrimLeft operator|true|None|
<a name="sql.expression.StringTrimRight"></a>spark.rapids.sql.expression.StringTrimRight|`rtrim`|StringTrimRight operator|true|None|
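
For readers unfamiliar with `str_to_map`, the description "creates a map after splitting the input string into pairs of key-value strings" can be sketched in plain Python. This is a CPU-side illustration of the semantics as commonly documented for Spark (default delimiters `,` and `:`, both treated as regular expressions, a missing value becoming null); it is not the GPU implementation from this PR. The function name `str_to_map_sketch` is hypothetical, and duplicate keys are not modeled: a later pair simply overwrites an earlier one here, which may differ from Spark's configured behavior.

```python
import re
from typing import Dict, Optional


def str_to_map_sketch(
    text: Optional[str],
    pair_delim: str = ",",
    key_value_delim: str = ":",
) -> Optional[Dict[str, Optional[str]]]:
    """Illustrative emulation of str_to_map on a single string.

    Assumptions: both delimiters are regular expressions, a null input
    yields a null result, and a pair without a key-value delimiter maps
    its key to None.
    """
    if text is None:
        return None
    result: Dict[str, Optional[str]] = {}
    # First split the input into "key<delim>value" pairs.
    for pair in re.split(pair_delim, text):
        # Then split each pair at most once, so the value may itself
        # contain the key-value delimiter.
        parts = re.split(key_value_delim, pair, maxsplit=1)
        result[parts[0]] = parts[1] if len(parts) == 2 else None
    return result


if __name__ == "__main__":
    print(str_to_map_sketch("a:1,b:2,c"))
    print(str_to_map_sketch("k1=v1;k2=v2", ";", "="))
```

The three STRING parameters and the MAP result correspond to the `str`, `pairDelim`, `keyValueDelim`, and `result` rows added to `docs/supported_ops.md`.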
297 changes: 193 additions & 104 deletions docs/supported_ops.md
@@ -12207,6 +12207,95 @@ are limited.
<th>UDT</th>
</tr>
<tr>
<td rowSpan="4">StringToMap</td>
<td rowSpan="4">`str_to_map`</td>
<td rowSpan="4">Creates a map after splitting the input string into pairs of key-value strings</td>
<td rowSpan="4">None</td>
<td rowSpan="4">project</td>
<td>str</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>pairDelim</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>keyValueDelim</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>result</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowSpan="3">StringTrim</td>
<td rowSpan="3">`trim`</td>
<td rowSpan="3">StringTrim operator</td>
@@ -12500,6 +12589,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="4">SubstringIndex</td>
<td rowSpan="4">`substring_index`</td>
<td rowSpan="4">substring_index operator</td>
@@ -12589,32 +12704,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">Subtract</td>
<td rowSpan="6">`-`</td>
<td rowSpan="6">Subtraction</td>
@@ -12927,6 +13016,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="3">TimeAdd</td>
<td rowSpan="3"> </td>
<td rowSpan="3">Adds interval to timestamp</td>
@@ -12995,32 +13110,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="3">TimeSub</td>
<td rowSpan="3"> </td>
<td rowSpan="3">Subtracts interval from timestamp</td>
@@ -13319,6 +13408,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="3">TransformValues</td>
<td rowSpan="3">`transform_values`</td>
<td rowSpan="3">Transform values in a map using a transform function</td>
@@ -13387,32 +13502,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="4">UnaryMinus</td>
<td rowSpan="4">`negative`</td>
<td rowSpan="4">Negate a numeric value</td>
@@ -13713,6 +13802,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="2">UnscaledValue</td>
<td rowSpan="2"> </td>
<td rowSpan="2">Convert a Decimal to an unscaled long value for some aggregation optimizations</td>
@@ -13760,32 +13875,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="2">Upper</td>
<td rowSpan="2">`upper`, `ucase`</td>
<td rowSpan="2">String uppercase operator</td>