[FEA] Regular expressions - support line anchors in choice #6882

Closed
NVnavkumar opened this issue Oct 21, 2022 · 3 comments · Fixed by #8047
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
feature request: New feature or request
tech debt

Comments

@NVnavkumar
Collaborator

Is your feature request related to a problem? Please describe.
I wish the RAPIDS Accelerator for Apache Spark supported line anchors in Regular Expression choice (|) expressions.

Describe the solution you'd like

scala> spark.conf.set("spark.rapids.sql.enabled", "false")
scala> val df = spark.range(0, 1000).selectExpr("CAST(id AS STRING) as id")

scala> df.createOrReplaceTempView("table")

scala> spark.sql("select concat(a.id, '/', b.id, '\n') as id from table as a right join  table as b on a.id = b.id").write.mode("overwrite").parquet("/tmp/test-re-data-20221021")

scala> spark.conf.set("spark.rapids.sql.enabled", "true")

scala> spark.read.parquet("/tmp/test-re-data-20221021").selectExpr("regexp_extract(id, '\\\\d+(6$|7$)', 1)").collect()
22/10/21 17:40:44 WARN GpuOverrides:
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
  @Expression <Alias> regexp_extract(id#103, \d+(6$|7$), 1) AS regexp_extract(id, \d+(6$|7$), 1)#105 could run on GPU
    !Expression <RegExpExtract> regexp_extract(id#103, \d+(6$|7$), 1) cannot run on GPU because cuDF does not support terms ending with line anchors on one side of a choice near index 4; regex group count is 0, but the specified group index is 1
      @Expression <AttributeReference> id#103 could run on GPU
      @Expression <Literal> \d+(6$|7$) could run on GPU
      @Expression <Literal> 1 could run on GPU

res19: Array[org.apache.spark.sql.Row] = Array([], [], [], [6], [7], [], [], [], [6], [7], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [], [6], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [7], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [6], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [7], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [6], [7], [], [], [], [], [], [], [], [6], [7], [], [], [], [], [6], [7], [], ...

I would like the RegExpExtract to run on the GPU.
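
For reference, the CPU result above follows from java.util.regex semantics, which Spark's regexp_extract relies on: without MULTILINE, `$` matches at the end of input and also just before a final line terminator. Below is a minimal sketch in plain Scala, with no Spark or GPU involved. `LineAnchorChoiceDemo` and `extractGroup1` are illustrative names, not part of any library, and returning an empty string on no match is meant to mirror the empty rows in the output above.

```scala
import java.util.regex.Pattern

object LineAnchorChoiceDemo {
  // Same pattern as in the repro: one or more digits, then a 6 or a 7,
  // each side of the choice ending with the line anchor $.
  private val pattern = Pattern.compile("\\d+(6$|7$)")

  // Hypothetical helper: returns capture group 1 of the first match,
  // or "" when nothing matches (as regexp_extract does).
  def extractGroup1(input: String): String = {
    val m = pattern.matcher(input)
    if (m.find()) m.group(1) else ""
  }

  def main(args: Array[String]): Unit = {
    // Rows shaped like the test data: "<id>/<id>\n"
    Seq("16/16\n", "27/27\n", "6/6\n", "15/15\n").foreach { s =>
      println(s"'${s.trim}' -> '${extractGroup1(s)}'")
    }
  }
}
```

Rows such as "6/6\n" produce an empty result because `\d+` needs at least one digit before the 6 or 7.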

@NVnavkumar NVnavkumar added feature request New feature or request ? - Needs Triage Need team to review and classify labels Oct 21, 2022
@sameerz sameerz added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Oct 24, 2022
@NVnavkumar
Collaborator Author

NVnavkumar commented Oct 24, 2022

Filed rapidsai/cudf#11979 against cuDF.

That cuDF feature would let us pass in arbitrary line terminators and support all scenarios involving the line anchor $.
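
As a rough illustration of what "all scenarios" has to cover on the CPU side: unless UNIX_LINES is set, java.util.regex treats \n, \r, \r\n, \u0085, \u2028, and \u2029 as line terminators for `$`. The sketch below checks each one against the pattern from this issue. It only exercises Java's engine and says nothing about the eventual cuDF API; `LineTerminatorDemo` and `matchesBeforeTerminator` are hypothetical names.

```scala
import java.util.regex.Pattern

object LineTerminatorDemo {
  // Line terminators recognized by java.util.regex when UNIX_LINES is not set.
  val terminators: Seq[String] =
    Seq("\n", "\r", "\r\n", "\u0085", "\u2028", "\u2029")

  private val pattern = Pattern.compile("\\d+(6$|7$)")

  // Hypothetical helper: does `$` match just before the given trailing suffix?
  def matchesBeforeTerminator(suffix: String): Boolean =
    pattern.matcher("16" + suffix).find()

  def main(args: Array[String]): Unit = {
    terminators.foreach { t =>
      // Print code points, since the terminators themselves are not printable.
      val codePoints = t.map(c => "U+%04X".format(c.toInt)).mkString(" ")
      println(s"$codePoints -> ${matchesBeforeTerminator(t)}")
    }
  }
}
```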

@NVnavkumar NVnavkumar self-assigned this Oct 28, 2022
@NVnavkumar
Collaborator Author

Also, fixing this will probably ultimately involve using cuDF multiline mode, which means the handling of $ will require major updates.
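
To make the impact concrete: in java.util.regex, enabling MULTILINE makes `$` match before every line terminator rather than only at the end of input (or before a final terminator). Spark on the CPU compiles patterns in the default mode, so a GPU implementation built on a multiline mode would have to compensate for that difference when translating `$`. The sketch below shows only the Java side of this; how cuDF's multiline mode behaves is not shown here, and `MultilineAnchorDemo`, `defaultMode`, and `multilineMode` are made-up names.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

object MultilineAnchorDemo {
  private val regex = "\\d+(6$|7$)"

  // Collects capture group 1 for every match found in the input.
  private def allGroup1(flags: Int, input: String): List[String] = {
    val m = Pattern.compile(regex, flags).matcher(input)
    val out = ListBuffer.empty[String]
    while (m.find()) out += m.group(1)
    out.toList
  }

  // Default mode: $ matches only at end of input or before a final terminator.
  def defaultMode(input: String): List[String] = allGroup1(0, input)

  // MULTILINE: $ also matches before every interior line terminator.
  def multilineMode(input: String): List[String] =
    allGroup1(Pattern.MULTILINE, input)

  def main(args: Array[String]): Unit = {
    val input = "16\n27\n"
    println(s"default:   ${defaultMode(input)}")
    println(s"multiline: ${multilineMode(input)}")
  }
}
```

With the two-line input "16\n27\n", only the last line can satisfy `$` in default mode, while both lines can in MULTILINE mode.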

@NVnavkumar
Collaborator Author

Also, fixing this will probably ultimately involve using cuDF multiline mode, which means the handling of $ will require major updates.

Filed #7090 to track the updates to handle $.
