
Improve SpanHelpers.IndexOfAny throughput / vectorization #25023

Closed
stephentoub opened this issue Jan 31, 2020 · 8 comments · Fixed by #40729
stephentoub (Member) commented Jan 31, 2020

Our regex engine can now spend a decent amount of time inside of span helpers like:

public static unsafe int IndexOfAny(ref char searchSpace, char value0, char value1, int length)

and in particular, we seem to hit this path:
if (pCh[0] == value0 || pCh[0] == value1)
goto Found;
if (pCh[1] == value0 || pCh[1] == value1)
goto Found1;
if (pCh[2] == value0 || pCh[2] == value1)
goto Found2;
if (pCh[3] == value0 || pCh[3] == value1)
goto Found3;

fairly frequently. It'd be great to investigate whether we can do anything to improve the performance of these IndexOfAny helpers, whether by improving how we do the vectorization or by utilizing intrinsics directly, if that would help. I believe @tannergooding had some ideas.

For example, we spend ~30% of the time in the regex redux benchmark in this helper:
[profiler screenshot: SpanHelpers.IndexOfAny accounts for ~30% of regex-redux time]

cc: @danmosemsft
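
As a rough illustration of what "utilizing intrinsics directly" might look like, below is a minimal sketch of a two-value search over a single 128-bit block (8 chars) using SSE2. This is not the actual SpanHelpers code; the IndexOfAnySketch/IndexOfAnyInBlock names and the overall shape are assumptions for illustration only.

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class IndexOfAnySketch
{
    // Scans one 128-bit block (8 chars) starting at pCh; returns the offset of the
    // first char equal to value0 or value1 within that block, or -1 if none match.
    internal static int IndexOfAnyInBlock(char* pCh, char value0, char value1)
    {
        Vector128<ushort> v0 = Vector128.Create((ushort)value0); // broadcast value0
        Vector128<ushort> v1 = Vector128.Create((ushort)value1); // broadcast value1

        Vector128<ushort> data = Sse2.LoadVector128((ushort*)pCh); // unaligned load

        // Compare against both values and OR the per-element results together.
        Vector128<ushort> matches = Sse2.Or(Sse2.CompareEqual(data, v0),
                                            Sse2.CompareEqual(data, v1));

        // MoveMask yields one bit per byte; a matching char sets two adjacent bits.
        int mask = Sse2.MoveMask(matches.AsByte());
        if (mask == 0)
            return -1;

        return BitOperations.TrailingZeroCount(mask) >> 1; // bit index / 2 bytes per char
    }
}

With this shape, a miss over 8 characters costs a single mask test and branch instead of the 8 scalar comparisons and branches in the unrolled loop above.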

stephentoub added the area-System.Runtime, tenet-performance, and help wanted labels Jan 31, 2020
stephentoub added this to the 5.0 milestone Jan 31, 2020
stephentoub changed the title from "Improve SpanHelpers.IndexOfAny" to "Improve SpanHelpers.IndexOfAny throughput / vectorization" Jan 31, 2020
tannergooding (Member) commented Jan 31, 2020

From an initial glance, we never hit the vectorized path for less than 64 bytes on most modern machines (32 bytes on older machines without AVX/AVX2 support, or on ARM machines).
Further, we also enforce "alignment", so we do sequential scanning for the first few elements (the data is unlikely to be aligned properly) and likewise for the trailing elements.

I imagine the biggest benefit would come from handling the initial leading/trailing elements as unaligned vectors (doing at most 2 unaligned iterations, with the remainder aligned), which would avoid the small 1-4 iteration loops with many branches. The unaligned vector operation should be faster even if the load happens to cross a cache-line boundary, since it cuts out the 8 comparisons/branches and the loop.

Other than that, we'd likely get some benefit from switching to hardware intrinsics (rather than Vector<T>). This is mainly because Vector<T> doesn't currently support containment and isn't properly VEX-aware, so on modern x86 machines we emit more instructions than necessary and don't encode the operations efficiently. Hardware intrinsics fully support both and so can clean up the codegen a bit. Utilizing hardware intrinsics would also allow us to vectorize more cases, as we could cover lengths of 16 bytes and above, not just 32 or 64 bytes and above.
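
To make the leading/trailing idea concrete, here is a hypothetical outline of that shape: one unaligned head vector, an aligned body, and one unaligned tail vector that may overlap the body. Again, this is not the actual SpanHelpers implementation; it assumes length >= 8 chars, SSE2 support, at least 2-byte-aligned data (as char buffers in .NET are), and it reuses the hypothetical IndexOfAnyInBlock helper sketched earlier.

static unsafe int IndexOfAnyVectorized(char* searchSpace, char value0, char value1, int length)
{
    const int VectorChars = 8; // Vector128<ushort>.Count

    // Head: one unaligned vector over chars [0, 8) replaces the scalar lead-in loop.
    int found = IndexOfAnySketch.IndexOfAnyInBlock(searchSpace, value0, value1);
    if (found >= 0)
        return found;

    // Body: round (searchSpace + 8 chars) down to a 16-byte boundary so the main
    // loop's loads never split a cache line; the first aligned block may overlap
    // the head, re-checking a few characters.
    char* aligned = (char*)((nuint)(searchSpace + VectorChars) & ~(nuint)15);
    char* lastBlock = searchSpace + length - VectorChars;
    for (char* current = aligned; current < lastBlock; current += VectorChars)
    {
        found = IndexOfAnySketch.IndexOfAnyInBlock(current, value0, value1);
        if (found >= 0)
            return (int)(current - searchSpace) + found;
    }

    // Tail: one final unaligned vector over the last 8 chars, possibly overlapping
    // the body, instead of a scalar loop over the remaining elements.
    found = IndexOfAnySketch.IndexOfAnyInBlock(lastBlock, value0, value1);
    return found >= 0 ? (int)(lastBlock - searchSpace) + found : -1;
}

The worst case is two partially overlapping (redundant) vector compares at the edges, which is still cheaper than the branchy 1-4 element scalar loops.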

tannergooding (Member) commented:

Also cc @GrabYourPitchforks

nietras (Contributor) commented Apr 3, 2020

@stephentoub @tannergooding I would like to give this a shot; it seems like a good opportunity to sharpen my intrinsics skills. The focus seems to be on improving perf for less than 64 bytes, so the plan could be:

  • Ensure there are benchmarks in place for the <64 byte cases, ideally covering the patterns observed by @stephentoub (a sketch follows this list). What is currently in place?
  • First handling:

the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).

  • Then perhaps exploring (though possibly still falling back to Vector<T> for ARM if relevant)

switching to use Hardware Intrinsics (rather than Vector).
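
For the first bullet above, a minimal BenchmarkDotNet sketch along the following lines could cover the short-input cases. The lengths, needle characters, and type names are illustrative assumptions, not an existing dotnet/performance benchmark.

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class IndexOfAnyShortInputs
{
    private char[] _data = Array.Empty<char>();

    // Char counts below the ~32-char (64-byte) vectorization threshold.
    [Params(4, 8, 16, 31)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        // Fill with a character that never matches so the whole span is scanned.
        _data = new char[Length];
        _data.AsSpan().Fill('a');
    }

    [Benchmark]
    public int TwoValues() => _data.AsSpan().IndexOfAny('\r', '\n');

    [Benchmark]
    public int ThreeValues() => _data.AsSpan().IndexOfAny('\r', '\n', '\t');
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<IndexOfAnyShortInputs>();
}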

benaadams (Member) commented:
What is in place currently?

I have 4 PRs for these that I haven't moved over, as they had been open for about a year:

Intrinsicify SpanHelpers.IndexOfAny(char,char) dotnet/coreclr#22877

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) dotnet/coreclr#22878

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) dotnet/coreclr#22879

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) dotnet/coreclr#22880

So I was waiting for some of my other intrinsics changes to be merged first, e.g. #32371.

nietras (Contributor) commented Apr 3, 2020

Have 4 PRs for these

@benaadams ha of course you do! 👍 I'll let you finish then. Let me know if you need any help or something :)

danmoseley assigned nietras and then unassigned nietras Apr 3, 2020
danmoseley (Member) commented:
@GrabYourPitchforks it seems there's a PR dependency chain that includes #32371 -- could you help shuffle it along?

ericstj (Member) commented Aug 5, 2020

@benaadams were you still going to give this a try?

benaadams (Member) commented:
@benaadams were you still going to give this a try?

Intrinsicify SpanHelpers.IndexOfAny(char,char) #40589

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) #40590

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) #40591

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) #40592

danmoseley removed the help wanted label Aug 13, 2020
ghost locked as resolved and limited conversation to collaborators Dec 10, 2020