Exclude all unicode line terminator characters from matching dot #5424


@anthony-chang anthony-chang commented May 4, 2022

Signed-off-by: Anthony Chang antchang@nvidia.com

Fixes #5415

The . wildcard should not match any of the line terminators defined by Java (see the "Line terminators" section of the java.util.regex.Pattern documentation). This PR adds exclusions for the next-line (\u0085), line-separator (\u2028), and paragraph-separator (\u2029) characters.
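As a reference point for the CPU behavior described above, here is a minimal sketch that checks which single characters Java's regex engine treats as line terminators for the dot. The object and method names (DotLineTerminators, dotMatches) are made up for illustration and are not part of this PR; only java.util.regex.Pattern is a real API.

```scala
import java.util.regex.Pattern

// Illustrative helper, not code from this PR.
object DotLineTerminators {
  // Single-character line terminators from the java.util.regex.Pattern docs.
  // The two-character sequence "\r\n" is left out because "." consumes one character.
  val lineTerminators: Seq[Char] =
    Seq(0x0A, 0x0D, 0x85, 0x2028, 0x2029).map(_.toChar)

  // True when "." compiled with default flags (no DOTALL, no UNIX_LINES)
  // matches a string made of the single character c.
  def dotMatches(c: Char): Boolean =
    Pattern.compile(".").matcher(c.toString).matches()

  def main(args: Array[String]): Unit = {
    lineTerminators.foreach { c =>
      println(f"U+${c.toInt}%04X matched by dot: ${dotMatches(c)}")
    }
    println(s"'a' matched by dot: ${dotMatches('a')}")
  }
}
```

With default flags, none of the five characters should be matched by the dot, which is the behavior the GPU transpiler needs to reproduce.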

Signed-off-by: Anthony Chang <antchang@nvidia.com>

@NVnavkumar NVnavkumar left a comment

Can we add more unit tests for these cases? In particular, add cases where these characters appear in the input string, and check that the regular expression matches the same way on CPU and GPU.

@sameerz sameerz added the bug Something isn't working label May 5, 2022
@sameerz sameerz added this to the May 2 - May 20 milestone May 5, 2022
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@NVnavkumar

Changes look good. One more thing: can you try updating the variable below in RegularExpressionTranspilerSuite.scala to include the 3 Unicode line terminators, and re-run the suite? This should let the fuzz testing generate more examples with line terminators and confirm that no edge cases were missed.

  private val REGEXP_LIMITED_CHARS_COMMON = "|()[]{},.^$*+?abc123x\\ \t\r\n\f\u000bBsdwSDWzZ"
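
One way the suggested change could look is sketched below. This is an assumption about the shape of the update, not the code that was merged: the object name FuzzCharsSketch, the field names, and the randomInput helper are hypothetical, and only the original character pool is taken from the comment above.

```scala
import scala.util.Random

// Hypothetical sketch; the real suite may structure this differently.
object FuzzCharsSketch {
  // Character pool as quoted in the review comment.
  val original: String = "|()[]{},.^$*+?abc123x\\ \t\r\n\f\u000bBsdwSDWzZ"

  // Assumed extension: append next-line, line-separator and paragraph-separator.
  val extended: String = original + "\u0085\u2028\u2029"

  // Draws a random string of the given length from the extended pool,
  // so that generated inputs can contain the Unicode line terminators.
  def randomInput(rng: Random, len: Int): String =
    Seq.fill(len)(extended(rng.nextInt(extended.length))).mkString
}
```

Because the three characters are added to the pool that fuzz inputs are drawn from, a long enough run should produce inputs that exercise the new dot exclusions on both CPU and GPU.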

@anthony-chang

build

@NVnavkumar NVnavkumar merged commit 00282b3 into NVIDIA:branch-22.06 May 6, 2022
Development

Successfully merging this pull request may close these issues.

[BUG] Regular Expressions: matching the dot . doesn't fully exclude all unicode line terminator characters