Finish "simplified" regex codegen strategy #61902

stephentoub · 2021-11-22T02:17:09Z

Prior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc.

To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality.

However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:

We're on track to have all but the last two (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking.

However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached new Regex(...) instance. This might represent a perf regression for that very small subset of use cases. However, a big benefit would be not having to fully implement RightToLeft in the new codegen strategy, getting to delete several thousand lines of difficult code from RegexCompiler and several thousand lines of difficult code from the source generator, and we could also stop running RegexWriter at all for RegexCompiler and the source generator. And we could mark various protected members of RegexRunner as obsolete and consider removing them in the future.

The text was updated successfully, but these errors were encountered:

ghost · 2021-11-22T02:17:12Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Prior to .NET 5, both the RegexInterpreter and the RegexCompiler worked on the same strategy: RegexWriter would transform the RegexNode tree into a series of opcodes and operands, which would then be processed. RegexInterpreter would process them at match time, and RegexCompiler would emit equivalent code for them, except with various constructs hardcoded, evaluating at compile-time what would otherwise be evaluated at match-time. This included various optimizations, like unrolling loops. However, it also a) left a non-trivial amount of optimization opportunity on the table, and b) results in difficult to follow code; the crux of the issue is RegexCompiler still following the same plan with many of the same limitations as RegexInterpreter, just slightly optimized by avoiding various kinds of branches at match time. This is all the more impactful with the source generator added in .NET 7, which spits out C# code intended to be human-readable/debuggable/etc.

To help with this, in .NET 5 we started incrementally added a new code generation strategy for the matching engine. Rather than being based on the opcodes output by RegexWriter, it instead outputs code based on walking the original RegexNode tree, resulting in baking the structure of the processing into the code rather than relying on the structure being pushed and popped onto the runstack/runtrack arrays used by RegexInterpreter (and RegexCompiler prior). While it was originally added to aid in performance, its impact is an order of magnitude larger for readability for the source generator in .NET 7, as it results in code much closer to that which a dev would actually write if they were manually implementing the same functionality.

However, we only provided a subset of support in .NET 5. Based on our corpus of "real-world" regexes, the .NET 5 implementation was able to support ~40% of regexes, which those using the new strategy, and the remainder falling back to the old strategy. We've been picking away at this, and as of today, ~70% of regexes are now covered. There are only a few remaining categories of work remaining to address the remaining gap:

We're on track to have all but the last three (RightToLeft and lookbehinds) implemented for .NET 7. Today, lookbehinds are implemented on top of RegexOptions.RightToLeft support, so it's difficult to properly implement lookbehinds without RightToLeft, and RightToLeft is a large undertaking.

However, in our corpus of "real-world" regexes, only 0.6% have RegexOptions.RightToLeft defined at the top-level, and only 3.3% involve RegexOptions.RightToLeft anywhere (meaning they use either RightToLeft or lookbehinds, which are based on RightToLeft), and only 1.4% involve RegexOptions.RightToLeft and RegexOptions.Compiled. So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter, such that if you use either of those, even if you specify RegexOptions.Compiled (i.e. you're in that 1.4% of expressions), you'll get the interpreter, and if you use the source generator, we'd spit out the fallback path that just emits a cached new Regex(...) instance. This might represent a perf regression for that very small subset of use cases. However, a big benefit would be not having to fully implement RightToLeft in the new codegen strategy, getting to delete several thousand lines of difficult code from RegexCompiler and several thousand lines of difficult code from the source generator, and we could also stop running RegexWriter at all for RegexCompiler and the source generator.

Author:	stephentoub
Assignees:	stephentoub
Labels:	`area-System.Text.RegularExpressions`
Milestone:	7.0.0

danmoseley · 2021-11-23T04:16:04Z

So, it would be reasonable in the name of maintainability to say that we only support RightToLeft and lookbehinds in the interpreter

As discussed offline, I support this strategy. Although a quite small subset of patterns will likely see a performance impact, the effect on maintenance costs will be very substantial, and will likely open up other opportunities for optimization that aren't feasible today.

danmoseley · 2021-11-23T04:16:17Z

cc @joperezr for his take.

joperezr · 2021-11-23T19:31:30Z

@stephentoub briefly mentioned RightToLeft case when we initially chatted, I agree and think it makes sense to have the limitation of only supporting it using Interpreted for some cases when on RightToLeft.

stephentoub added the area-System.Text.RegularExpressions label Nov 22, 2021

stephentoub added this to the 7.0.0 milestone Nov 22, 2021

stephentoub self-assigned this Nov 22, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Nov 22, 2021

stephentoub mentioned this issue Nov 22, 2021

Add backtracking loops, backreferences, and if-then-else constructs to Regex "simplified" code gen #61906

Merged

This was referenced Dec 2, 2021

Finish new code gen approach for RegexCompiler / source generator #62268

Merged

Delete old code generation approach from RegexCompiler / source generator #62318

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 3, 2021

stephentoub closed this as completed in #62318 Dec 3, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 3, 2021

ghost locked as resolved and limited conversation to collaborators Jan 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finish "simplified" regex codegen strategy #61902

Finish "simplified" regex codegen strategy #61902

stephentoub commented Nov 22, 2021 •

edited

Loading

ghost commented Nov 22, 2021

danmoseley commented Nov 23, 2021

danmoseley commented Nov 23, 2021

joperezr commented Nov 23, 2021

Finish "simplified" regex codegen strategy #61902

Finish "simplified" regex codegen strategy #61902

Comments

stephentoub commented Nov 22, 2021 • edited Loading

ghost commented Nov 22, 2021

danmoseley commented Nov 23, 2021

danmoseley commented Nov 23, 2021

joperezr commented Nov 23, 2021

stephentoub commented Nov 22, 2021 •

edited

Loading