Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexp_replace support regexp [databricks] #4063

Merged
merged 25 commits into from
Nov 17, 2021

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Nov 9, 2021

Closes #4001

Note that this PR disables regexp_replace by default where it was enabled before and fell back to CPU if the pattern used any regular expression syntax, and performed a literal string replacement on GPU in other cases.

@andygrove andygrove force-pushed the regexp-replace-support-regexp branch from 951d0b7 to 98c415a Compare November 9, 2021 22:20
@andygrove
Copy link
Contributor Author

build

3 similar comments
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove changed the title WIP: Regexp_replace support regexp [databricks] Regexp_replace support regexp [databricks] Nov 12, 2021
@andygrove andygrove marked this pull request as ready for review November 12, 2021 04:28
case '^' | '$' if replace =>
// this is a bit extreme and it would be good to replace with finer-grained
// rules
throw new RegexUnsupportedException("regexp_replace on GPU does not support ^ or $")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I filed #4091 as a follow-on issue to improve support here

@andygrove andygrove changed the title Regexp_replace support regexp [databricks] WIP: Regexp_replace support regexp [databricks] Nov 12, 2021
@andygrove andygrove marked this pull request as draft November 12, 2021 04:53
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

Failed with:

 [2021-11-12T17:32:57.322Z] _______________ test_select[REGEXP_REPLACE(strF, 'Yu', 'Eric')] ________________
[2021-11-12T17:32:57.322Z] [gw3] linux -- Python 3.8.12 /databricks/conda/envs/cudf-udf/bin/python
[2021-11-12T17:32:57.322Z] 
[2021-11-12T17:32:57.322Z] sql_query_line = ("SELECT REGEXP_REPLACE(strF, 'Yu', 'Eric') FROM test_table", "REGEXP_REPLACE(strF, 'Yu', 'Eric')")
[2021-11-12T17:32:57.322Z] pytestconfig = <_pytest.config.Config object at 0x7f8c14b7b8e0>
[2021-11-12T17:32:57.322Z] 
[2021-11-12T17:32:57.322Z]     @approximate_float
[2021-11-12T17:32:57.322Z]     @incompat
[2021-11-12T17:32:57.322Z]     @qarun
[2021-11-12T17:32:57.322Z]     @pytest.mark.parametrize('sql_query_line', SELECT_SQL, ids=idfn)
[2021-11-12T17:32:57.322Z]     def test_select(sql_query_line, pytestconfig):
[2021-11-12T17:32:57.322Z]         sql_query = sql_query_line[0]
[2021-11-12T17:32:57.322Z]         if sql_query:
[2021-11-12T17:32:57.322Z]             print(sql_query)
[2021-11-12T17:32:57.322Z]             with_cpu_session(num_stringDf)
[2021-11-12T17:32:57.322Z] >           assert_gpu_and_cpu_are_equal_collect(lambda spark: spark.sql(sql_query), conf=_qa_conf)
[2021-11-12T17:32:57.322Z] 
[2021-11-12T17:32:57.322Z] ../../src/main/python/qa_nightly_select_test.py:169: 

@andygrove andygrove self-assigned this Nov 12, 2021
@andygrove andygrove added this to the Nov 1 - Nov 12 milestone Nov 12, 2021
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove marked this pull request as ready for review November 12, 2021 19:47
@andygrove andygrove changed the title WIP: Regexp_replace support regexp [databricks] Regexp_replace support regexp [databricks] Nov 12, 2021
@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build failed against spark 3.3.0

 Error: :24:28.617Z] [ERROR] [Error] /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3275/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala:371: value partitionSchemaOption is not a member of org.apache.spark.sql.execution.datasources.HadoopFsRelation
Error: :24:28.617Z] [ERROR] [Error] /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-3275/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuFileSourceScanExec.scala:348: value partitionSchemaOption is not a member of org.apache.spark.sql.execution.datasources.HadoopFsRelation

docs/compatibility.md Outdated Show resolved Hide resolved
docs/compatibility.md Outdated Show resolved Hide resolved
rep: Expression): GpuExpression = GpuStringReplace(lhs, regexp, rep)
}),
(a, conf, p, r) => new GpuRegExpReplaceMeta(a, conf, p, r)).disabledByDefault(
"The GPU implementation of regexp_replace is disabled by default. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalization is incorrect since this is tacked onto a prefix to make the sentence, and also this ends up being very redundant in the end as seen in the generated docs. Probably can just say "the implementation is not 100% compatible. See the compatibility guide for more details..

rep: Expression): GpuExpression = GpuStringReplace(lhs, regexp, rep)
}),
(a, conf, p, r) => new GpuRegExpReplaceMeta(a, conf, p, r)).disabledByDefault(
"The GPU implementation of regexp_replace is disabled by default. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same capitalization and redundancy issues here.

}
}),
(a, conf, p, r) => new GpuRegExpReplaceMeta(a, conf, p, r)).disabledByDefault(
"The GPU implementation of regexp_replace is disabled by default. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same capitalization and redundancy issues here.

}
}),
(a, conf, p, r) => new GpuRegExpReplaceMeta(a, conf, p, r)).disabledByDefault(
"The GPU implementation of regexp_replace is disabled by default. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same capitalization and redundancy issues here.

@@ -2950,8 +2950,8 @@ object GpuOverrides extends Logging {
("str", TypeSig.STRING, TypeSig.STRING),
("regexp", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING)),
(a, conf, p, r) => new GpuRLikeMeta(a, conf, p, r)).disabledByDefault(
"The GPU implementation of rlike is not " +
"compatible with Apache Spark. See the compatibility guide for more information."),
"The GPU implementation of rlike is disabled by default. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same capitalization and redundancy issues here.

override def tagExprForGpu(): Unit = {
expr.regexp match {
case Literal(null, _) =>
willNotWorkOnGpu(s"RegExpReplace with null pattern is not supported on GPU")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Catalyst class not being placed on the GPU should already be printed before this, so should be able to simplify to something like:

Suggested change
willNotWorkOnGpu(s"RegExpReplace with null pattern is not supported on GPU")
willNotWorkOnGpu(s"null pattern is not supported on GPU")

applies to all the other similar willNotWorkOnGpu calls added


GpuOverrides.extractLit(expr.pos).foreach { lit =>
if (lit.value.asInstanceOf[Int] != 1) {
willNotWorkOnGpu("Only a search starting position of 1 is supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
willNotWorkOnGpu("Only a search starting position of 1 is supported")
willNotWorkOnGpu("only a search starting position of 1 is supported")


GpuOverrides.extractLit(expr.pos).foreach { lit =>
if (lit.value.asInstanceOf[Int] != 1) {
willNotWorkOnGpu("Only a search starting position of 1 is supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
willNotWorkOnGpu("Only a search starting position of 1 is supported")
willNotWorkOnGpu("only a search starting position of 1 is supported")

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

buuild

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove
Copy link
Contributor Author

build

@andygrove andygrove merged commit ade6591 into NVIDIA:branch-21.12 Nov 17, 2021
@andygrove andygrove deleted the regexp-replace-support-regexp branch November 17, 2021 00:46
@sameerz sameerz added the feature request New feature or request label Dec 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add regexp support to regexp_replace
3 participants