Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Find ways to convert regular expressions into faster operations #10600

Open
revans2 opened this issue Mar 15, 2024 · 1 comment
Open

[FEA] Find ways to convert regular expressions into faster operations #10600

revans2 opened this issue Mar 15, 2024 · 1 comment
Labels
feature request New feature or request performance A performance related task/issue

Comments

@revans2
Copy link
Collaborator

revans2 commented Mar 15, 2024

Is your feature request related to a problem? Please describe.
Processing regular expressions on the GPU is really expensive compared to purpose built kernels. We already have done some work to avoid running a regular expression in some cases with StringSplit when the result would just be a regular string.

#4854

And similarly for RegexpReplace when it would just be a list of choices for replacement

#7967

I would like to see us do similar things for RLike. Spark already optimizes LIKE so that if it sees %FOO, it is translated to an ends_with, FOO% to a starts_with, and %FOO% to a contains.

I want to do the same kind of thing but using our regexp transpiler.

We might also want to have some simplification rules for RLIKE in the transpile to make it easier to find these situations instead of hard coding a list of them.

  • FOO, .*FOO, FOO.*, ^.*FOO, \A.*FOO, ^.*FOO.*$, \A.*FOO.*\Z can be converted into contains
  • ^FOO, ^FOO.*, \AFOO, \AFOO.* can be be converted into starts_with
  • FOO$, .*FOO$, FOO\Z, .*FOO\Z can be be converted into ends_with

Note that these are after capture groups have been removed/ignored, and FOO represents any pattern that can be converted into a static string, similar to what we do for StringSplit and RegexpReplace. In fact the multi-match that regexp replace does should work for contains because we could use find_multiple and then check for any of the values in the array being >= 0

https://github.com/rapidsai/cudf/blob/610c022164b71614fb92366b3694df6d700283c8/cpp/include/cudf/strings/find_multiple.hpp#L57

We could also do a custom kernel if we see a need for it and want more speed.

There might be more, but a few of these we have see in real world queries and it would be good to try and speed these up. As in all things we should run some benchmarks to really see how much better the speedup is, and if it is worth the added complexity. My guess is that it will always be a win, and for long strings it can be a huge win.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify performance A performance related task/issue labels Mar 15, 2024
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Apr 4, 2024
@mattahrens
Copy link
Collaborator

Related to: #10741

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request performance A performance related task/issue
Projects
None yet
Development

No branches or pull requests

2 participants