Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize String.Replace(string, string) #44088

Merged
merged 1 commit into from
Oct 31, 2020

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Oct 30, 2020

We can significantly improve its throughput for a few scenarios:

  • If both oldValue and newValue are single characters in the form of strings, we can just delegate to the Replace(char, char) overload, which is much faster.
  • If oldValue is a single character but newValue isn't, we can use IndexOf to find it, rather than an open-coded loop, making search times much faster for reasonably sized strings where the thing being searched for isn't immediate.
  • If oldValue is multiple characters and the string being searched for isn't super frequent (e.g. doesn't repeat every few characters), we can significantly speed up throughput by using IndexOf to search for the whole string. For example, replacing "\r\n" with "\n" in the contents of a typical file.

This does come at a measurable cost when the oldValue is really common and tightly packed, e.g. searching for "aa" in "aaaaaaaaaaaaaaaaaa", so we can decide whether the tradeoff is the right one.

Some example results:

  • "This is a test. This is only a test.".Replace("!", "") is 4x faster.
  • The source text for a 35-line .cs file containing the benchmarks I was running, with src.Replace("\r\n", "\n") is 2.3x faster.
  • guidWithDashes.Replace("-", "") is 10% faster.
  • But "aaaaaaaaaaaaaaaaaaaaaa".Replace("a", "bbb") and "aaaaaaaaaaaaaaaaaaaaaa".Replace("aa", "bbb") are both 2x slower, as we end up using IndexOf to search for a that's the very next character.

Basically it comes down to whether we believe vectorization is good or bad for this use case, whether the thing being searched for will be tightly packed or not.

Contributes to (or fixes?) #44061

We can significantly improve its throughput for a few scenarios:
- If both oldValue and newValue are single characters in the form of strings, we can just delegate to the Replace(char, char) overload, which is much faster.
- If oldValue is a single character but newValue isn't, we can use IndexOf to find it, rather than an open-coded loop, making search times much faster for reasonably sized strings.
- If oldValue is multiple characters and the string being searched for isn't super frequent (e.g. doesn't repeat every few characters), we can significantly speed up throughput by using IndexOf to search for the whole string.  For example, replacing "\r\n" with "\n" in the contents of a typical file.

This does come at a measurable cost when the oldValue is really common and tightly packed, e.g. searching for "aa" in "aaaaaaaaaaaaaaaaaa", so we can decide whether the tradeoff is the right one.
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It is reasonable to assume that the improved cases are the common usage pattern.

@stephentoub
Copy link
Member Author

Failures are #44037

@stephentoub stephentoub merged commit f29a272 into dotnet:master Oct 31, 2020
@stephentoub stephentoub deleted the stringreplaceperf branch October 31, 2020 02:22
@ghost ghost locked as resolved and limited conversation to collaborators Dec 6, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants