Several improvements / simplifications in Regex #100315

Merged
stephentoub merged 3 commits into dotnet:main from regextweaks on Apr 3, 2024

Conversation

stephentoub
Member

This started out as a small improvement for one thing and grew to be something else.

Initially, my intent was just to improve how `SearchValues<char>` applies to character classes with subtraction. Character class subtraction isn't frequently used, but it is a convenient way to express removing subsets of ranges, e.g. all ASCII other than digits `[\u0000-\u007F-[0-9]]`. Currently, when we go to enumerate the characters in a char class, for perf reasons we only do the enumeration if the set is actually enumerable and only up to the max space provided, in order to keep the time down. We immediately give up if the char class has subtraction, but given that we've already limited how many values we're enumerating, if there is subtraction we can afford to query for just those chars that would otherwise pass in order to enable the subtraction. So, with this PR, we can now support using SearchValues in this manner: **this means that whereas previously we would have generated an IndexOfAny for any of the ASCII characters or anything non-ASCII, with a fallback for when we hit something non-ASCII, now we'll just create an IndexOfAny for the full set**.
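As a rough sketch of what the full-set approach looks like at the SearchValues level (an illustration, not the actual generated code; the set is just the `[\u0000-\u007F-[0-9]]` example above):

```csharp
using System;
using System.Buffers;
using System.Linq;

// The characters matched by [\u0000-\u007F-[0-9]]: all ASCII except the digits.
char[] setChars = Enumerable.Range(0, 128)
    .Select(i => (char)i)
    .Where(c => c is < '0' or > '9')
    .ToArray();

// One SearchValues over the full set, rather than an "ASCII or anything non-ASCII"
// search plus a scalar fallback for when a non-ASCII character is hit.
SearchValues<char> sv = SearchValues.Create(setChars);

ReadOnlySpan<char> input = "0123 abc";
Console.WriteLine(input.IndexOfAny(sv)); // 4: the space is the first char in the set
```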

However, that triggered a (then defunct) assert which led me to see that we have a bunch of duplicated logic around asserts: we'd frequently be checking to see if a set contained at most 5 chars (in support of a time when we didn't have SearchValues and only optimized IndexOfAny for up to 5 chars) and then subsequently would see if it contained only ASCII. We no longer need that separation, especially since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion (or anything non-ASCII). **This then means we can delete a variety of duplicated code while also expanding what we recognize for use with SearchValues.**

This then led to seeing that in a variety of places we compute the set of chars in a set and then check whether it could instead be satisfied just as a range (but only when the set of chars isn't small). The former check is more expensive than the latter, but we were doing the first one first, presumably in order to be able to do the set size check as part of the latter. However, we don't need it for that, as a single subtraction gives us the size of the range, **so we can just do the range check first and skip the more expensive set check if it's not needed.**
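As a toy illustration of that ordering (illustrative values only, not the engine's internal set representation): the size of a contiguous range falls out of a single subtraction, whereas sizing an arbitrary set means enumerating candidates.

```csharp
// A contiguous class like [a-z]: its size is one subtraction, no enumeration needed.
char low = 'a', high = 'z';
int rangeSize = high - low + 1; // 26

// The set-based path has to walk candidates one by one to learn the same thing,
// which is the more expensive check that can now be skipped when the range check applies.
int enumeratedSize = 0;
for (int c = char.MinValue; c <= char.MaxValue; c++)
{
    if (c >= low && c <= high)
    {
        enumeratedSize++;
    }
}
```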

That then led to seeing that we're not using range-based searching in the interpreter or non-backtracking engines. **This adds that support, such that the interpreter/non-backtracking engines will now search for the next starting location using IndexOfAny{Except}InRange if appropriate.**
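For reference, a minimal usage sketch of the range-based span helpers referred to above (existing MemoryExtensions APIs; the inputs are arbitrary examples):

```csharp
using System;

// Next character inside the range [a-z], for a class that reduces to a single range.
int inRange = "___abc___".AsSpan().IndexOfAnyInRange('a', 'z'); // 3

// Next character outside the range, for a negated class like [^a-z].
int outOfRange = "abc123".AsSpan().IndexOfAnyExceptInRange('a', 'z'); // 3

Console.WriteLine((inRange, outOfRange));
```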
@MihaZupan (Member) left a comment

Nice!

since SearchValues will now both vectorize probabilistic map searches and will first do a search for the ASCII portion

The probabilistic map currently only has a vectorized path for IndexOfAny.
We could add LastIndexOfAny pretty easily (the interesting code would be shared with IndexOfAny). I take it this change might make that more important?

We don't, however, have a vectorized fallback for AnyExcept after falling off the Ascii fast path (no "inverse bloom filter").
Currently, we'll fall back to an O(i * m) loop for that. We could make it O(i), but it'd still be char-by-char.
Is that a concern here?
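Roughly the shape of the fallback being described, as a standalone sketch rather than the actual SearchValues internals: each position pays an O(m) membership test against the needle, hence O(i * m) overall.

```csharp
using System;

static int IndexOfAnyExceptScalar(ReadOnlySpan<char> text, ReadOnlySpan<char> needle)
{
    for (int i = 0; i < text.Length; i++)
    {
        // O(needle.Length) membership check per input character.
        if (!needle.Contains(text[i]))
        {
            return i;
        }
    }

    return -1;
}

Console.WriteLine(IndexOfAnyExceptScalar("aaab", "a")); // 3
```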

…r.Emitter.cs

Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
@stephentoub
Member Author

We could add LastIndexOfAny pretty easily (the interesting code would be shared with IndexOfAny). I take it this change might make that more important?

Yeah, that'd be useful to add in, in particular if we can do it without adding in much code.

Is that a concern here?

Why would it be O(i * m)? Don't we have some O(1) implementation for determining whether a character is in the set represented by the SearchValues, which would make it O(i)? The alternative to using SearchValues here is falling back to a linear walk, so I'm not too concerned about it as long as the overheads compared to that aren't meaningful.

@MihaZupan
Member

Don't we have some O(1) implementation for determining whether a character is in the set represented by the SearchValues, which would make it O(i)

The single-character check when using the probabilistic map is currently O(needle) when the character is in the set (which is the common case for -Except operations). We could of course switch to a HashSet/lookup table instead.
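A minimal sketch of the lookup-table direction (an illustration of the idea, not the actual SearchValues change): a 65,536-bit bitmap makes the per-character membership test O(1) at the cost of 8 KB per instance.

```csharp
using System;

var set = new CharBitmap("0123456789\u212A"); // digits plus the Kelvin sign, as an example set
Console.WriteLine(set.Contains('7')); // True
Console.WriteLine(set.Contains('x')); // False

// One bit per possible char value, so Contains costs the same no matter how many
// characters are in the set.
sealed class CharBitmap
{
    private readonly ulong[] _bits = new ulong[65536 / 64];

    public CharBitmap(ReadOnlySpan<char> chars)
    {
        foreach (char c in chars)
        {
            _bits[c >> 6] |= 1UL << (c & 63);
        }
    }

    public bool Contains(char c) => (_bits[c >> 6] & (1UL << (c & 63))) != 0;
}
```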

@stephentoub
Member Author

Don't we have some O(1) implementation for determining whether a character is in the set represented by the SearchValues, which would make it O(i)

The single-character check when using the probabilistic map is currently O(needle) when the character is in the set (which is the common case for -Except operations). We could of course switch to a HashSet/lookup table instead.

Ok, I hadn't realized that (or maybe I knew it at one point and forgot). It'd be good to switch it to using a set lookup.

@stephentoub merged commit b040ed6 into dotnet:main Apr 3, 2024
87 checks passed
@stephentoub deleted the regextweaks branch April 3, 2024 16:05
@TonyValenti

I just wanted y'all to know I love reading commit notes like this!

@matouskozak
Member

Possible related Mono regressions:

@stephentoub
Member Author

stephentoub commented Apr 10, 2024

The regressions mostly look to be places where we're now using `SearchValues<char>` more. For example, the Mono regression is because code that was previously generated to be:

```csharp
while ((uint)iteration < (uint)slice.Length && ((ch = slice[iteration]) < 128 ? ("\0\0怀Ͽ\ufffe\ufffe\u07ff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\0\f\0-/0:A[_`a{KÅ")))
{
    iteration++;
}
```

is now generated to be:

```csharp
int iteration = slice.IndexOfAnyExcept(Utilities.s_nonAscii_2D303132333435363738394142434445464748494A4B4C4D4E4F505152535455565758595A6162636465666768696A6B6C6D6E6F707172737475767778797AE284AA);
if (iteration < 0)
{
    iteration = slice.Length;
}

// ...

internal static readonly SearchValues<char> s_nonAscii_2D303132333435363738394142434445464748494A4B4C4D4E4F505152535455565758595A6162636465666768696A6B6C6D6E6F707172737475767778797AE284AA = SearchValues.Create("-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");
```

@MihaZupan, will this hit problematic paths? (Note the non-ASCII Kelvin sign at the end there.)

@MihaZupan
Member

MihaZupan commented Apr 10, 2024

Yes, IndexOfAnyExcept here could hit the slow O(i * m) fallback.

If we have explicit support for ISAs (IndexOfAnyAsciiSearcher.IsVectorizationSupported) we'll start with a vectorized ASCII scan, but as soon as we hit any non-ASCII char, we'll go back to that same slow fallback.

I've been playing around with a few options here (e.g. perfect hash / lookup table ...); I'll still have to throw a few more hours at it though.

@MihaZupan
Member

MihaZupan commented Apr 11, 2024

E.g. for "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK" if you hit a non-ASCII char:
(current vs. O(1) checks)

| Method | Toolchain | Length | TextHasNonAscii | Mean | Ratio |
|---|---|---|---|---|---|
| IndexOfAnyExcept | main | 1000 | True | 4,020.8 ns | 1.00 |
| IndexOfAnyExcept | pr | 1000 | True | 661.3 ns | 0.16 |

@stephentoub
Member Author

E.g. for "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK" if you hit a non-ASCII char: (current vs. O(1) checks)

| Method | Toolchain | Length | TextHasNonAscii | Mean | Ratio |
|---|---|---|---|---|---|
| IndexOfAnyExcept | main | 1000 | True | 4,020.8 ns | 1.00 |
| IndexOfAnyExcept | pr | 1000 | True | 661.3 ns | 0.16 |

"pr" here is with your upcoming fixes?

@MihaZupan
Member

Yeah, should be on that order of magnitude

matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
@github-actions bot locked and limited conversation to collaborators May 12, 2024