Add RegexFindOptimization for embedded strings #67907

Merged: stephentoub merged 1 commit into dotnet:main on Apr 13, 2022

Conversation

stephentoub
Member

Currently if a pattern begins with a multi-character string, we'll use IndexOf with that substring to find the next possible match location. But if that multi-character string is a non-zero fixed distance into the pattern, we won't see it. This changes that, letting us find such strings based on the fixed-distance sets we're already gathering, and using IndexOf for it.

From dotnet/performance:

| Method | Toolchain | Pattern | Options | Mean | Ratio |
|---|---|---|---|---|---|
| Count | \main\corerun.exe | [a-z]shing | None | 9.699 ms | 1.00 |
| Count | \pr\corerun.exe | [a-z]shing | None | 2.206 ms | 0.23 |
| Count | \main\corerun.exe | [a-z]shing | Compiled | 7.210 ms | 1.00 |
| Count | \pr\corerun.exe | [a-z]shing | Compiled | 2.155 ms | 0.30 |
| Count | \main\corerun.exe | [a-z]shing | NonBacktracking | 9.946 ms | 1.00 |
| Count | \pr\corerun.exe | [a-z]shing | NonBacktracking | 2.352 ms | 0.24 |

There are ~170 patterns in our corpus that benefit from this.
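
For readers unfamiliar with the technique, here is a minimal, language-neutral sketch of the idea in Python. It is not the code in this PR: the helper name `count_with_literal_prefilter` and its parameters are hypothetical, and the literal and its offset are supplied by hand, whereas the real change derives them from the fixed-distance sets. Python's `str.find` stands in for `IndexOf`.

```python
import re


def count_with_literal_prefilter(pattern, literal, offset, text):
    """Count non-overlapping matches of `pattern` in `text`.

    Assumes (hypothetically, supplied by the caller) that every match of
    `pattern` contains `literal` exactly `offset` characters after the
    match start. A fast substring search then proposes candidate
    positions, and the regex only runs at those positions.
    """
    regex = re.compile(pattern)
    count = 0
    start = 0  # earliest index at which the next match may begin
    while True:
        # Search for the literal no earlier than `offset` past `start`,
        # so the implied match start is never before `start`.
        hit = text.find(literal, start + offset)
        if hit < 0:
            return count
        candidate = hit - offset
        m = regex.match(text, candidate)  # anchored attempt at candidate
        if m:
            count += 1
            start = m.end()  # resume after this match
        else:
            start = candidate + 1  # reject this candidate, keep scanning


sample = "fishing, washing and shing; pushing"
result = count_with_literal_prefilter(r"[a-z]shing", "shing", 1, sample)
```

For `[a-z]shing`, the literal `shing` sits one character into the pattern, so each hit of the substring search implies a candidate start one character earlier. The ` shing` in the sample is found by the substring search but rejected by the regex, because a space precedes it.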

@ghost

ghost commented Apr 12, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

Member

@joperezr joperezr left a comment

Small comments but looks good otherwise. Great to see those nice improvements.

@stephentoub stephentoub merged commit f216e77 into dotnet:main Apr 13, 2022
@stephentoub stephentoub deleted the stringsearch branch April 13, 2022 20:35
@AndyAyersMS
Member

Improvements on ubuntu x64: dotnet/perf-autofiling-issues#4747

@dakersnar
Contributor

dakersnar commented May 13, 2022

Improvements on all configurations for a variety of tests. Here's the most drastic improvement:

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_Leipzig.Count(Pattern: "[a-z]shing", Options: None)

| Result | Ratio | Operating System | Bit | Processor Name |
|---|---|---|---|---|
| Faster | 3.05 | ubuntu 20.04 | Arm64 | Unknown processor |
| Faster | 3.05 | Windows 10 | Arm64 | Microsoft SQ1 3.0 GHz |
| Faster | 3.04 | Windows 11 | Arm64 | Microsoft SQ1 3.0 GHz |
| Faster | 2.72 | Windows 11 | Arm64 | Unknown processor |
| Faster | 3.72 | macOS Monterey 12.3 | Arm64 | Apple M1 Max |
| Faster | 5.53 | Windows 10 | X64 | Intel Xeon Platinum 8272CL CPU 2.60GHz |
| Faster | 3.96 | Windows 10 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Faster | 4.79 | Windows 10 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Faster | 4.44 | Windows 11 | X64 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Faster | 6.27 | Windows 11 | X64 | AMD Ryzen 9 5950X |
| Faster | 4.61 | Windows 11 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Faster | 4.50 | Windows 11 | X64 | Intel Core i9-9900T CPU 2.10GHz |
| Faster | 4.71 | alpine 3.13 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) |
| Faster | 4.62 | pop 22.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) |
| Faster | 3.99 | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Faster | 3.33 | ubuntu 18.04 | X64 | Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) |
| Faster | 8.98 | ubuntu 20.04 | X64 | AMD Ryzen 9 5900X |
| Faster | 4.82 | ubuntu 20.04 | X64 | Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) |
| Faster | 4.65 | Windows 10 | X86 | Intel Xeon CPU E5-1650 v4 3.60GHz |
| Faster | 5.05 | Windows 10 | X86 | Intel Core i7-6700 CPU 3.40GHz (Skylake) |
| Faster | 4.66 | Windows 11 | X86 | AMD Ryzen Threadripper PRO 3945WX 12-Cores |
| Faster | 4.63 | macOS Monterey 12.2.1 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) |
| Faster | 4.97 | macOS Monterey 12.3.1 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) |

@ghost ghost locked as resolved and limited conversation to collaborators Jun 12, 2022
4 participants