[BUG] Regular expressions with hex digits not working as expected #4486

Closed
andygrove opened this issue Jan 10, 2022 · 3 comments · Fixed by #4869
Labels
bug Something isn't working

Comments

@andygrove
Contributor

Describe the bug

I am seeing cuDF fail to match some hex digits correctly. Here is one example. I do not yet know if this is related to #4409 or whether this is a separate issue.

javaPattern=\xA9, cudfPattern=\xA9, input='©', cpu=true, gpu=false

Steps/Code to reproduce bug

  test("compare CPU and GPU: hex") {
    val patterns = Seq(raw"\x61", raw"\xA9", raw"\xa9")
    val inputs = Seq("a", "b", "\u00A9")
    assertCpuGpuMatchesRegexpFind(patterns, inputs)
  }

Expected behavior
Test should pass.

Environment details (please complete the following information)
N/A

Additional context
N/A

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 10, 2022
@andygrove andygrove added this to the Jan 10 - Jan 28 milestone Jan 10, 2022
@andygrove andygrove self-assigned this Jan 10, 2022
@andygrove
Contributor Author

This is perhaps because cuDF requires UTF-8: the copyright symbol (U+00A9) is encoded as the two bytes 0xC2 0xA9 (11000010 10101001) in UTF-8, but as the single byte 0xA9 in single-byte encodings such as ISO-8859-1 (often loosely called ANSI).
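The byte-level difference described above can be checked on the CPU side with a small sketch. This only illustrates the encodings and the behavior of `java.util.regex`; it says nothing about how cuDF handles the escape internally, and the object and helper names (`HexEscapeDemo`, `hexBytes`) are made up for illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.util.regex.Pattern

// Hypothetical demo object, not part of the spark-rapids test suite.
object HexEscapeDemo {
  val copyright: String = "\u00A9"

  // Encode the string with the given charset and format each byte as hex.
  def hexBytes(s: String, cs: Charset): Seq[String] =
    s.getBytes(cs).toSeq.map(b => f"0x${b & 0xFF}%02X")

  // Two bytes in UTF-8, a single byte in ISO-8859-1.
  val utf8Bytes: Seq[String] = hexBytes(copyright, StandardCharsets.UTF_8)
  val latin1Bytes: Seq[String] = hexBytes(copyright, StandardCharsets.ISO_8859_1)

  // java.util.regex interprets \xA9 as the character U+00A9, not as a raw byte,
  // so the CPU side matches the copyright symbol.
  val cpuMatches: Boolean =
    Pattern.compile(raw"\xA9").matcher(copyright).find()

  def main(args: Array[String]): Unit = {
    println(s"UTF-8 bytes:      ${utf8Bytes.mkString(" ")}")
    println(s"ISO-8859-1 bytes: ${latin1Bytes.mkString(" ")}")
    println(s"CPU regex match:  $cpuMatches")
  }
}
```

If a regex engine compared `\xA9` against individual UTF-8 bytes rather than decoded characters, it would see 0xC2 first and the comparison would fail, which would be consistent with the `gpu=false` result reported above.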

@andygrove andygrove removed this from the Jan 10 - Jan 28 milestone Jan 10, 2022
@andygrove andygrove removed their assignment Jan 10, 2022
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 11, 2022
@andygrove
Contributor Author

Now that #4492 is merged, we fall back to the CPU for hex digits for the time being.

@NVnavkumar
Collaborator

NVnavkumar commented Feb 8, 2022

With cuDF PR #10220 now merged, at least 127 characters should be supported by the cuDF regex hex escape, so this should be re-enabled and tested.
