[FEA] Support REGEXP_REPLACE to replace null values #3876

Closed
viadea opened this issue Oct 21, 2021 · 4 comments · Fixed by #3968
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
feature request: New feature or request
P1: Nice to have for release

viadea commented Oct 21, 2021

Is your feature request related to a problem? Please describe.
REGEXP_REPLACE should be supported on the GPU when the pattern is a null character, for example REGEXP_REPLACE(col, '\000', ''), so that null characters (\u0000) embedded in strings can be removed.

Here is a minimal reproduction:

// Sample rows; ids 2 and 3 contain an embedded null character.
val address = Seq(
  (1, "xxxx"),
  (2, "fdsfds \u0000dfsdfs"),
  (3, "sdfdsf sdfdsf \u0000sdf"))

import spark.implicits._
val df = address.toDF("id", "txt")
// Round-trip through Parquet so the query reads the data from a file.
df.write.mode("overwrite").format("parquet").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
// Strip the null character from txt.
spark.sql("select id, txt, REGEXP_REPLACE(txt, '\000', '') AS new_txt from df2").collect()

The expression falls back to the CPU, and the driver log shows:

!Expression <RegExpReplace> regexp_replace(txt#336, , , 1) cannot run on GPU because Only non-null, non-empty String literals that are not regex patterns are supported by RegExpReplace on the GPU
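
A message like the one above can be surfaced by turning on the plugin's explain output before running the query. This is a minimal sketch, assuming a spark-shell session where `spark` is the active SparkSession, the RAPIDS Accelerator is loaded, and the df2 view from the reproduction is registered; the exact log wording may differ between plugin versions.

```scala
// Assumption: `spark` is the SparkSession provided by spark-shell, started
// with the RAPIDS Accelerator plugin enabled.
// NOT_ON_GPU asks the plugin to log only the parts of the plan that stay on the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

// Re-run the query from the reproduction; any fallback reasons go to the driver log.
spark.sql("select id, txt, REGEXP_REPLACE(txt, '\000', '') AS new_txt from df2").collect()
```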

The query still returns the correct result:

res36: Array[org.apache.spark.sql.Row] = Array([3,sdfdsf sdfdsf ?sdf,sdfdsf sdfdsf sdf], [2,fdsfds ?dfsdfs,fdsfds dfsdfs], [1,xxxx,xxxx])
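
For reference, the new_txt values above can be reproduced without Spark. The sketch below assumes the CPU path behaves like java.util.regex, which Spark's RegExpReplace uses; the object and method names (NullCharReplace, stripNullOctal, stripNullRaw, stripNullLiteral) are hypothetical and are not part of Spark or the plugin.

```scala
import java.util.regex.Pattern

object NullCharReplace {
  // Regex octal escape: a backslash followed by 000 matches the null character.
  private val octalPattern: Pattern = Pattern.compile("\\000")
  // The raw null character used directly as the pattern.
  private val rawPattern: Pattern = Pattern.compile("\u0000")

  def stripNullOctal(s: String): String = octalPattern.matcher(s).replaceAll("")

  def stripNullRaw(s: String): String = rawPattern.matcher(s).replaceAll("")

  // Plain literal replacement, no regex engine involved.
  def stripNullLiteral(s: String): String = s.replace("\u0000", "")

  // The same rows as in the reproduction.
  val rows: Seq[(Int, String)] = Seq(
    (1, "xxxx"),
    (2, "fdsfds \u0000dfsdfs"),
    (3, "sdfdsf sdfdsf \u0000sdf"))

  // (id, txt, new_txt) triples, mirroring the columns of the SQL query.
  def cleaned: Seq[(Int, String, String)] =
    rows.map { case (id, txt) => (id, txt, stripNullOctal(txt)) }
}
```

Because the pattern is a single fixed character, all three helpers return the same string for any input, which suggests the replacement does not need full regex support.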

viadea added the feature request and ? - Needs Triage labels on Oct 21, 2021
sameerz added the cudf_dependency label on Oct 21, 2021

sameerz commented Oct 21, 2021

Related issue: rapidsai/cudf#6196

sameerz commented Oct 22, 2021

Correction: rapidsai/cudf#6196 is about treating a null character as the end of a string. This issue is about replacing null values.
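
The distinction can be illustrated with a small sketch. It is not taken from cudf or the plugin; the names (NullCharSemantics, truncateAtNull, removeNull) are hypothetical and only contrast the two behaviors on a plain JVM string.

```scala
object NullCharSemantics {
  private val Nul: Char = 0.toChar

  // Treats the null character as a terminator: everything after it is dropped.
  def truncateAtNull(s: String): String = s.takeWhile(_ != Nul)

  // Treats the null character as ordinary data and removes only that character.
  def removeNull(s: String): String = s.filter(_ != Nul)
}
```

For the second sample row, truncateAtNull keeps only "fdsfds " while removeNull yields "fdsfds dfsdfs", which is the result this issue asks for.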

Salonijain27 added the P1 label and removed the ? - Needs Triage label on Oct 26, 2021
andygrove self-assigned this on Oct 29, 2021
andygrove added this to the Oct 18 - Oct 29 milestone on Oct 29, 2021

viadea commented Oct 29, 2021

@andygrove another variation of the query is:

REGEXP_REPLACE(col, '\000', '')

viadea commented Nov 2, 2021

Confirmed it works.
