[FEA] Make line terminator sequences in regular expression using `$` a configurable option when `MULTILINE` flag is enabled #11979

NVnavkumar · 2022-10-24T20:26:16Z

Is your feature request related to a problem? Please describe.
In regular expression multiline mode, the $ anchor in cuDF currently matches at the position right before a newline character \n, which is the only line terminator cuDF recognizes. This is consistent with the Python implementation. However, Apache Spark uses the JDK (Java) implementation, where line terminators are more complex: the JDK regular expression library treats newline (\n), carriage return (\r), carriage return followed by newline (\r\n), and 3 other Unicode newline variants as line terminators.
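To make the mismatch concrete, here is a minimal sketch using Python's standard `re` module as a stand-in for the cuDF behavior described above. The sample string and the variable names are made up for illustration, and the set of terminators normalized here (\r\n, \r, U+0085, U+2028, U+2029) is my reading of the JDK list, so treat it as an assumption.

```python
import re

# Illustrative text mixing several terminators: "\r\n", "\r", and U+2028.
text = "abc\r\nfff\rabc\u2028end"

# Python's multiline `$` only recognizes "\n" as a line terminator,
# so neither "abc" here is considered to be at the end of a line.
before = re.findall(r"abc$", text, re.MULTILINE)

# After rewriting every terminator to "\n", both occurrences match.
normalized = re.sub(r"\r\n|[\r\u0085\u2028\u2029]", "\n", text)
after = re.findall(r"abc$", normalized, re.MULTILINE)

print(before, after)
```

Normalizing changes the string content, which is why it is only a workaround and not a substitute for configurable terminators.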

Describe the solution you'd like
It would be useful if we could configure the concept of line terminator sequences in cuDF. Ideally, this could be an optional parameter that accepts a simple array of strings as the line terminator sequences. An alternative could be a separate flag for line terminators or another type of MULTILINE flag.

Describe alternatives you've considered
Currently, spark-rapids handles $ by doing a heavy translation from the JDK regular expression to another regular expression that cuDF supports and that handles the multiple line terminator sequences the JDK uses. It cannot use cuDF MULTILINE mode because only the newline is handled there. With this translation, we are limited to using $ in simple scenarios at the end of the regular expression. Because of the complexity, we currently cannot use it in choice (|), among other constructions.
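For illustration only, the kind of rewrite involved can be sketched in Python's `re` by replacing `$` with a zero-width expression that accepts the JDK-style terminators. This is not the translation spark-rapids performs: `JDK_EOL` is a hypothetical name, and the sketch relies on lookahead and lookbehind, which the cuDF regex engine may not support.

```python
import re

# Hypothetical zero-width stand-in for a JDK-style multiline `$`:
#   - before "\r" or one of the Unicode terminators, or
#   - before "\n" unless it is the second half of "\r\n", or
#   - at the very end of the string.
JDK_EOL = r"(?:(?=[\r\u0085\u2028\u2029])|(?<!\r)(?=\n)|\Z)"

# The anchor is used inside a choice, the case called out as hard to translate.
pattern = r"\w+\.(html" + JDK_EOL + r"|txt" + JDK_EOL + r")"

text = "a.html\r\nb.txt\u2028c.html"
found = re.findall(pattern, text)

# The position between "\r" and "\n" is not treated as a line end.
between = re.search(r"\r" + JDK_EOL, "a\r\nb")

print(found, between)
```

Even in this small form, every `$` expands into a three-way alternation, which hints at why translating arbitrary patterns becomes complex.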


davidwendt · 2022-10-25T20:24:34Z

Could you provide some example use cases?
Only \n is supported for new-line but it may be possible to optionally support \r instead or to optionally support either \r or \n. But I do not think it will be possible to support \r\n as a single new-line entity(?).

Would it be possible to convert \r\n to \n before calling the regex APIs? This should work for matching but maybe there is an issue with extract or replace that I'm not seeing.

>>> import cudf
>>> import re
>>> s = cudf.Series(["abc\nfff\nabc", "fff\nabc\nlll", "abc", "", "abc\n"])
>>> s
0    abc\nfff\nabc
1    fff\nabc\nlll
2              abc
3                 
4            abc\n
dtype: object
>>> sr = cudf.Series(["abc\r\nfff\r\nabc", "fff\r\nabc\r\nlll", "abc", "", "abc\r\n"])
>>> sr
0    abc\r\nfff\r\nabc
1    fff\r\nabc\r\nlll
2                  abc
3                     
4              abc\r\n
dtype: object

>>> s.str.extract("(^abc$)", flags=re.MULTILINE)
      0
0   abc
1   abc
2   abc
3  <NA>
4   abc

>>> sr.str.replace('\r\n', '\n', regex=False).str.extract("(^abc$)", flags=re.MULTILINE)
      0
0   abc
1   abc
2   abc
3  <NA>
4   abc

>>> sr.str.replace('\r\n', '\n', regex=False).str.extract("(^abc$)")
      0
0  <NA>
1  <NA>
2   abc
3  <NA>
4  <NA>
>>> s.str.extract("(^abc$)")
      0
0  <NA>
1  <NA>
2   abc
3  <NA>
4  <NA>

(verified these results with Pandas as well)

Note that you get the same result for (^abc$) and ^(abc)$
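The MULTILINE results above can be reproduced without a GPU using plain `re`, as a rough sketch. `extract_first` is a hypothetical helper that mimics extract by returning the first capture group or `None`. It is not a cuDF or Pandas API.

```python
import re

rows = ["abc\r\nfff\r\nabc", "fff\r\nabc\r\nlll", "abc", "", "abc\r\n"]

def extract_first(pattern, text, flags=0):
    """Return capture group 1 of the first match, or None if there is none."""
    m = re.search(pattern, text, flags)
    return m.group(1) if m else None

# Replace "\r\n" with "\n" first, as suggested above.
normalized = [row.replace("\r\n", "\n") for row in rows]

# The group can wrap the anchors or sit between them; the result is the same.
outer = [extract_first(r"(^abc$)", row, re.MULTILINE) for row in normalized]
inner = [extract_first(r"^(abc)$", row, re.MULTILINE) for row in normalized]

print(outer, inner)
```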

The new-line cannot be captured with ^ or $ with extract. For example:

>>> s.str.extract("(^abc$)", re.MULTILINE)
      0
0   abc
1   abc
2   abc
3  <NA>
4   abc
>>> s.str.extract("(^abc$f)", re.MULTILINE)
      0
0  <NA>
1  <NA>
2  <NA>
3  <NA>
4  <NA>
>>> s.str.extract("(^abc$^f)", re.MULTILINE)
      0
0  <NA>
1  <NA>
2  <NA>
3  <NA>
4  <NA>
>>> s.str.extract("(^abc^f)", re.MULTILINE)
      0
0  <NA>
1  <NA>
2  <NA>
3  <NA>
4  <NA>

(same for Pandas too)
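The reason these patterns never match is that `^` and `$` are zero-width: they assert a position but do not consume the newline between lines. A small sketch with Python's `re` (variable names are illustrative) shows that the match only spans lines once the `\n` is written explicitly:

```python
import re

text = "abc\nfff\nabc"

# `$` asserts a position before "\n" but does not consume it,
# so "f" can never directly follow "abc" here.
no_match = re.search(r"(^abc$f)", text, re.MULTILINE)

# Spelling out the newline lets the match continue onto the next line.
spanning = re.search(r"(^abc$\n^f)", text, re.MULTILINE)

print(no_match, spanning.group(1) if spanning else None)
```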

Can you provide examples where replacing \r\n with \n would not work for you?

NVnavkumar · 2022-10-31T22:24:08Z

In this case, the use case is using extract, so replacing \r\n with \n will probably not work.

davidwendt · 2022-10-31T22:30:10Z

> In this case, the use case is using extract, so replacing \r\n with \n will probably not work.

Could you provide such an example where this would not work?
In my examples above, the extracted string never contains the \r or the \n, so there would be no need to fix up the result.

NVnavkumar · 2022-11-08T11:03:03Z

> > In this case, the use case is using extract, so replacing \r\n with \n will probably not work.
>
> Could you provide such an example where this would not work? In my examples above, the extracted string never contains the \r or the \n, so there would be no need to fix up the result.

So I did some initial testing. This substitution approach might work; however, I think there is a small unresolved issue. It looks like if we use re.MULTILINE, the \\Z cannot easily be used in combination with $ (because $ at that point can also match the end of the string). What's also interesting here is that in native Python, when NOT using re.MULTILINE, $ actually matches before a trailing \n:

>>> s = cudf.Series(["a.html\r\n", "b.txt\r\n", "c.html\r\nabcd", "d.txt"])
>>> r = s.str.replace("\r\n", "\n")
>>> r.str.extract("\\w+\\.(html$|txt$)")
      0
0  <NA>
1  <NA>
2  <NA>
3   txt

>>> for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]:
...     m = re.match("\\w+\\.(html$|txt$)", s)
...     if m:
...             m.group(1)
...     else:
...             '<NA>'
...
'html'
'txt'
'<NA>'
'txt'

Now with re.MULTILINE:

>>> r.str.extract("\\w+\\.(html$|txt$)", re.MULTILINE)
      0
0  html
1   txt
2  html
3   txt
>>> for s in ["a.html\n", "b.txt\n", "c.html\nabcd", "d.txt"]:
...     m = re.match("\\w+\\.(html$|txt$)", s, re.MULTILINE)
...     if m:
...             m.group(1)
...     else:
...             '<NA>'
...
'html'
'txt'
'html'
'txt'

So, it looks like cuDF is consistent with native Python in MULTILINE mode, but not in normal mode. This substitution strategy could work in normal mode if that issue is resolved.

davidwendt · 2022-11-10T11:47:52Z

The edge case of '$' matching "...\n" when \n is at the end of the string in non-MULTILINE mode is not Python specific: https://www.regular-expressions.info/refanchors.html
I can look at supporting this edge case in libcudf.
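As a reference point for this edge case, the sketch below shows how Python's `re` behaves in non-MULTILINE mode. The sample strings are adapted from the examples above. Note that `\Z` here is Python's strict end-of-string anchor; other regex flavors spell the strict form differently, so this is only a Python illustration.

```python
import re

samples = ["a.html\n", "c.html\nabcd", "d.txt"]

# `$` matches at the end of the string and also just before a trailing "\n".
dollar = [bool(re.search(r"\.(html|txt)$", s)) for s in samples]

# Python's `\Z` matches only at the very end of the string.
strict = [bool(re.search(r"\.(html|txt)\Z", s)) for s in samples]

print(dollar, strict)
```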

…2181)
Support regex EOL where the string ends with a new-line character. This matches the behavior for the EOL anchor `$` described here: https://www.regular-expressions.info/refanchors.html
Additional gtests are included. The doxygen for cudf regex support is also updated.
Close #11979
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #12181

NVnavkumar added Needs Triage Need team to review and classify feature request New feature or request labels Oct 24, 2022

NVnavkumar mentioned this issue Oct 24, 2022

[FEA] Regular expressions - support line anchors in choice NVIDIA/spark-rapids#6882

Closed

davidwendt added the strings strings issues (C++ and Python) label Oct 25, 2022

davidwendt self-assigned this Oct 26, 2022

GregoryKimball added 0 - Waiting on Author Waiting for author to respond to review and removed Needs Triage Need team to review and classify labels Oct 30, 2022

sameerz removed the 0 - Waiting on Author Waiting for author to respond to review label Nov 16, 2022

NVnavkumar mentioned this issue Nov 17, 2022

[BUG] Refactor line terminator handling code NVIDIA/spark-rapids#7090

Closed

davidwendt mentioned this issue Nov 17, 2022

Support regex EOL where the string ends with a new-line character #12181

Merged

3 tasks

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Nov 21, 2022

rapids-bot bot closed this as completed in #12181 Nov 28, 2022

andygrove mentioned this issue Jan 25, 2023

Fix regressions related to cuDF changes in handline of end-of-line/string anchors NVIDIA/spark-rapids#7211

Merged

5 tasks

davidwendt mentioned this issue Feb 14, 2023

[BUG] [Regexp] Line anchor '$' incorrect matching of unicode line terminators NVIDIA/spark-rapids#7585

Open

NVnavkumar mentioned this issue May 14, 2024

[FEA] Make line terminator sequence handling in regular expression engine a configurable option #15746

Open
