-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release/7.0] Fix handling of backtracking stack with some loops #79361
Conversation
With both RegexOptions.Compiled and the Regex source generator, Regex greedy loops with - a minimum bound of at least 2 - no child constructs that backtrack - and a child that's more than a one/notone/set (aka things that match a single character) are possibly leaving state on the backtracking stack when: - at least one iteration of the loop successfully matches - but not enough iterations match to make the loop successful such that matching the loop fails In that case, if a previous construct in the pattern pushed any state onto the backtracking stack such that it expects to be able to pop off and use that state upon backtracking to it, it will potentially pop the erroneously leftover state. This can then cause execution to go awry, as it's getting back an unexpected value. That can lead to false positives, false negatives, or exceptions such as an IndexOutOfRangeException due to trying to pop too much from the backtracking stack. We already have the ability to remember the backtracking stack position when we initially enter the loop so that we can reset to that position later on. The fix is simply to extend that to also perform that reset when failing the match of such a loop in such circumstances.
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsBackport of #79353 to release/7.0 /cc @stephentoub Customer ImpactTestingRiskIMPORTANT: Is this backport for a servicing release? If so and this change touches code that ships in a NuGet package, please make certain that you have added any necessary package authoring and gotten it explicitly reviewed.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation and code comments; helped me understand what otherwise would have been obscure to me.
This extends the fix made earlier around proper handling of the backtracking stack in EmitLoop. If this loop is inside of another loop, then we need to preserve the startingStackpos on the backtracking stack upon successfully completing the loop, as the outer loop might iterate and cause another occurrence of this loop to run, which will then overwrite our single startingStackpos local. If that iteration backtracks, we then need to be able to pop the previous iterations startingStackpos and restore its value.
Backport of #79353 to release/7.0
/cc @stephentoub
Customer Impact
Some more complicated regex patterns fail to execute properly when using RegexOptions.Compiled or the regex source generator. This can amount to false positive matches, false negative matches, or even unexpected exceptions. This was reported in an issue using a pattern
@"(ver\.? |[_ ]+)?\d+(\.\d+){2,3}$"
. It's a regression between .NET 6 and .NET 7 with RegexOptions.Compiled.This can affect a pattern containing a greedy loop with:
That loop can end up leaving extraneous state on the backtracking stack when:
In that case, if a previous construct in the pattern pushed any state onto the backtracking stack such that it expects to be able to pop off and use that state upon backtracking to it, it will potentially pop the erroneously leftover state and try to use it.
We already have the ability to remember the backtracking stack position when we initially enter the loop so that we can reset to that position later on. The fix is simply to extend that to also perform that reset when failing the match of such a loop in such circumstances.
Testing
Existing inner and outer loop regex tests, plus new test cases based on the repro case.
Risk
The change is very isolated and relatively straightforward to prove is correct from a reasoned perspective, but this is also a complicated region of code, and given all the possible combinations involved, it's feasible but unlikely something unexpected will be affected. More likely is there are additional such corner cases that we're still missing; this won't fix those, but it won't make them worse.