[BUG] Regular Expressions: matching the dot . doesn't fully exclude all unicode line terminator characters #5415

Closed
NVnavkumar opened this issue May 3, 2022 · 0 comments · Fixed by #5424
Assignees
Labels
bug Something isn't working

Comments

@NVnavkumar
Collaborator

Describe the bug
When a pattern uses the wildcard ., the plugin transpiles it to [^\r\n], because . in Java excludes line terminator characters, while . in cuDF matches any character. However, this does not exclude the other Unicode line terminator characters described here in the section Line terminators. We should also exclude the next-line (\u0085), line-separator (\u2028), and paragraph-separator (\u2029) characters.

Expected behavior
The wildcard . should exclude these extra unicode line terminator characters in addition to carriage-return and newline.
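
For reference, the mismatch can be reproduced on the CPU side with plain java.util.regex. This is only a sketch: the class name, the helper names, and the widened character class are assumptions made for illustration, not the plugin's actual transpiler output.

```java
import java.util.regex.Pattern;

public class DotLineTerminators {
    // What the plugin currently emits for '.', per the description above.
    static final String CURRENT_DOT = "[^\\r\\n]";

    // Hypothetical widened class that also excludes U+0085, U+2028 and U+2029.
    static final String PROPOSED_DOT = "[^\\r\\n\\u0085\\u2028\\u2029]";

    // True if Java's own '.' (default flags, no DOTALL or UNIX_LINES) matches s entirely.
    static boolean javaDotMatches(String s) {
        return Pattern.matches(".", s);
    }

    // True if the given replacement character class matches s entirely.
    static boolean classMatches(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        String[] terminators = {"\n", "\r", "\u0085", "\u2028", "\u2029"};
        for (String t : terminators) {
            // Print, for each terminator, whether each pattern accepts it.
            System.out.printf("U+%04X java-dot=%b current=%b proposed=%b%n",
                    (int) t.charAt(0),
                    javaDotMatches(t),
                    classMatches(CURRENT_DOT, t),
                    classMatches(PROPOSED_DOT, t));
        }
    }
}
```

With default flags, Java's . rejects all five characters, so any row where the current class disagrees with java-dot is a case where GPU and CPU results could differ.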

Additional context
None

@NVnavkumar added the "bug" and "? - Needs Triage" labels May 3, 2022
@sameerz removed the "? - Needs Triage" label May 3, 2022
@anthony-chang anthony-chang self-assigned this May 3, 2022