Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Finish "simplified" regex codegen strategy #61902

Closed
8 of 10 tasks
stephentoub opened this issue Nov 22, 2021 · 4 comments · Fixed by #62318
Closed
8 of 10 tasks

Finish "simplified" regex codegen strategy #61902

stephentoub opened this issue Nov 22, 2021 · 4 comments · Fixed by #62318
Assignees
Labels
area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Nov 22, 2021

Prior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc.

To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality.

However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:

We're on track to have all but the last two (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking.

However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached new Regex(...) instance. This might represent a perf regression for that very small subset of use cases. However, a big benefit would be not having to fully implement RightToLeft in the new codegen strategy, getting to delete several thousand lines of difficult code from RegexCompiler and several thousand lines of difficult code from the source generator, and we could also stop running RegexWriter at all for RegexCompiler and the source generator. And we could mark various protected members of RegexRunner as obsolete and consider removing them in the future.

@stephentoub stephentoub added this to the 7.0.0 milestone Nov 22, 2021
@stephentoub stephentoub self-assigned this Nov 22, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Nov 22, 2021
@ghost
Copy link

ghost commented Nov 22, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Prior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc.

To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality.

However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:

  • Greedy, non-atomic node loops (RegexNode.Loop). Lazy loops are already supported, as are all forms of loops around a single character ("one", "notone", "set"), and all atomic loops.
  • Backreferences (RegexNode.Ref)
  • Backreference conditionals (RegexNode.Testref)
  • Expression conditionals (RegexNode.Testgroup)
  • Alternations that aren't top-level or atomic.
  • Captures in all locations. Currently only some locations for captures are supported.
  • Balancing groups.
  • RegexOptions.RightToLeft
  • Positive lookbehinds
  • Negative lookbehinds

We're on track to have all but the last three (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking.

However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached new Regex(...) instance. This might represent a perf regression for that very small subset of use cases. However, a big benefit would be not having to fully implement RightToLeft in the new codegen strategy, getting to delete several thousand lines of difficult code from RegexCompiler and several thousand lines of difficult code from the source generator, and we could also stop running RegexWriter at all for RegexCompiler and the source generator.

Author: stephentoub
Assignees: stephentoub
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@danmoseley
Copy link
Member

So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter

As discussed offline, I support this strategy. Although a quite small subset of patterns will likely see a performance impact, the effect on maintenance costs will be very substantial, and will likely open up other opportunities for optimization that aren't feasible today.

@danmoseley
Copy link
Member

cc @joperezr for his take.

@joperezr
Copy link
Member

@stephentoub briefly mentioned RightToLeft case when we initially chatted, I agree and think it makes sense to have the limitation of only supporting it using Interpreted for some cases when on RightToLeft.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 3, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 3, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jan 3, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants