
Improve SpanHelpers.IndexOfAny throughput / vectorization #25023

Closed
stephentoub opened this issue Jan 31, 2020 · 8 comments · Fixed by #40729
stephentoub (Member) commented Jan 31, 2020

Our regex engine can now spend a decent amount of time inside of span helpers like:

public static unsafe int IndexOfAny(ref char searchSpace, char value0, char value1, int length)

and in particular, we seem to hit this path:
if (pCh[0] == value0 || pCh[0] == value1)
goto Found;
if (pCh[1] == value0 || pCh[1] == value1)
goto Found1;
if (pCh[2] == value0 || pCh[2] == value1)
goto Found2;
if (pCh[3] == value0 || pCh[3] == value1)
goto Found3;

fairly frequently. It'd be great to investigate whether we can do anything to improve the performance of these IndexOfAny helpers, whether by improving how we do the vectorization or by utilizing intrinsics directly, if that would help. I believe @tannergooding had some ideas.

For example, we spend ~30% of the time in the regex redux benchmark in this helper:
[profiler screenshot: SpanHelpers.IndexOfAny accounts for ~30% of regex-redux time]

cc: @danmosemsft
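
As a rough illustration of what "utilizing intrinsics directly" might look like, below is a minimal sketch of a two-value search over a single 128-bit block (8 chars) using SSE2. This is not the actual SpanHelpers code; the IndexOfAnySketch/IndexOfAnyInBlock names and the overall shape are assumptions for illustration only.

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static unsafe class IndexOfAnySketch
{
    // Scans one 128-bit block (8 chars) starting at pCh; returns the offset of the
    // first char equal to value0 or value1 within that block, or -1 if none match.
    internal static int IndexOfAnyInBlock(char* pCh, char value0, char value1)
    {
        Vector128<ushort> v0 = Vector128.Create((ushort)value0); // broadcast value0
        Vector128<ushort> v1 = Vector128.Create((ushort)value1); // broadcast value1

        Vector128<ushort> data = Sse2.LoadVector128((ushort*)pCh); // unaligned load

        // Compare against both values and OR the per-element results together.
        Vector128<ushort> matches = Sse2.Or(Sse2.CompareEqual(data, v0),
                                            Sse2.CompareEqual(data, v1));

        // MoveMask yields one bit per byte; a matching char sets two adjacent bits.
        int mask = Sse2.MoveMask(matches.AsByte());
        if (mask == 0)
            return -1;

        return BitOperations.TrailingZeroCount(mask) >> 1; // bit index / 2 bytes per char
    }
}

With this shape, a miss over 8 characters costs a single mask test and branch instead of the 8 scalar comparisons and branches in the unrolled loop above.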

stephentoub added the area-System.Runtime, tenet-performance, and help wanted labels Jan 31, 2020
stephentoub added this to the 5.0 milestone Jan 31, 2020
stephentoub changed the title from "Improve SpanHelpers.IndexOfAny" to "Improve SpanHelpers.IndexOfAny throughput / vectorization" Jan 31, 2020
tannergooding (Member) commented Jan 31, 2020

From an initial glance, we never hit the vectorized path for less than 64 bytes on most modern machines (32 bytes on older machines without AVX/AVX2 support, or on ARM machines).
Further, we also enforce "alignment", so we do sequential scanning for the first few elements (the data is unlikely to be aligned properly) and likewise for the trailing elements.

I imagine the biggest benefit would come from handling the initial leading/trailing elements as unaligned vectors (doing at most 2 unaligned iterations, with the remainder aligned), which would avoid the small 1-4 iteration loops with many branches. The unaligned vector operation should be faster even if the load happens to cross a cache-line boundary, since it cuts out the 8 comparisons/branches and the loop.

Other than that, we'd likely get some benefit from switching to hardware intrinsics (rather than Vector<T>). This is mainly because Vector<T> doesn't currently support containment and isn't properly VEX-aware, so on modern x86 machines we emit more instructions than necessary and don't encode the operations efficiently. Hardware intrinsics fully support both and so can clean up the codegen a bit. Utilizing hardware intrinsics would also allow us to vectorize more cases, as we could cover lengths of 16 bytes and above, not just 32 or 64 bytes and above.
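
To make the leading/trailing idea concrete, here is a hypothetical outline of that shape: one unaligned head vector, an aligned body, and one unaligned tail vector that may overlap the body. Again, this is not the actual SpanHelpers implementation; it assumes length >= 8 chars, SSE2 support, at least 2-byte-aligned data (as char buffers in .NET are), and it reuses the hypothetical IndexOfAnyInBlock helper sketched earlier.

static unsafe int IndexOfAnyVectorized(char* searchSpace, char value0, char value1, int length)
{
    const int VectorChars = 8; // Vector128<ushort>.Count

    // Head: one unaligned vector over chars [0, 8) replaces the scalar lead-in loop.
    int found = IndexOfAnySketch.IndexOfAnyInBlock(searchSpace, value0, value1);
    if (found >= 0)
        return found;

    // Body: round (searchSpace + 8 chars) down to a 16-byte boundary so the main
    // loop's loads never split a cache line; the first aligned block may overlap
    // the head, re-checking a few characters.
    char* aligned = (char*)((nuint)(searchSpace + VectorChars) & ~(nuint)15);
    char* lastBlock = searchSpace + length - VectorChars;
    for (char* current = aligned; current < lastBlock; current += VectorChars)
    {
        found = IndexOfAnySketch.IndexOfAnyInBlock(current, value0, value1);
        if (found >= 0)
            return (int)(current - searchSpace) + found;
    }

    // Tail: one final unaligned vector over the last 8 chars, possibly overlapping
    // the body, instead of a scalar loop over the remaining elements.
    found = IndexOfAnySketch.IndexOfAnyInBlock(lastBlock, value0, value1);
    return found >= 0 ? (int)(lastBlock - searchSpace) + found : -1;
}

The worst case is two partially overlapping (redundant) vector compares at the edges, which is still cheaper than the branchy 1-4 element scalar loops.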

tannergooding (Member) commented:

Also cc @GrabYourPitchforks

nietras (Contributor) commented Apr 3, 2020

@stephentoub @tannergooding I would like to give this a shot; it seems like a good opportunity to sharpen my intrinsics skills. The focus seems to be on improving perf for less than 64 bytes, so the plan could be:

  • Ensure there are benchmarks in place for the <64 byte cases, ideally covering the patterns observed by @stephentoub (a sketch follows this list). What is currently in place?
  • First handling:

the initial leading/trailing elements as unaligned (doing at most 2 unaligned iterations and the remaining aligned) which would avoid the small 1-4 iteration loops with many branches (the unaligned vector operation should be faster even if the load happens to cross a cache line boundary; since it will cut down the 8 comparisons/branches and the loop).

  • Then perhaps exploring (though possibly still falling back to Vector<T> for ARM if relevant)

switching to use Hardware Intrinsics (rather than Vector).
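
For the first bullet above, a minimal BenchmarkDotNet sketch along the following lines could cover the short-input cases. The lengths, needle characters, and type names are illustrative assumptions, not an existing dotnet/performance benchmark.

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class IndexOfAnyShortInputs
{
    private char[] _data = Array.Empty<char>();

    // Char counts below the ~32-char (64-byte) vectorization threshold.
    [Params(4, 8, 16, 31)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        // Fill with a character that never matches so the whole span is scanned.
        _data = new char[Length];
        _data.AsSpan().Fill('a');
    }

    [Benchmark]
    public int TwoValues() => _data.AsSpan().IndexOfAny('\r', '\n');

    [Benchmark]
    public int ThreeValues() => _data.AsSpan().IndexOfAny('\r', '\n', '\t');
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<IndexOfAnyShortInputs>();
}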

benaadams (Member) commented:
What is in place currently?

I have 4 PRs for these that I haven't moved over, as they had been open for about a year:

Intrinsicify SpanHelpers.IndexOfAny(char,char) dotnet/coreclr#22877

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) dotnet/coreclr#22878

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) dotnet/coreclr#22879

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) dotnet/coreclr#22880

So I was waiting for some of my other intrinsics changes to be merged first, e.g. #32371.

nietras (Contributor) commented Apr 3, 2020

Have 4 PRs for these

@benaadams ha of course you do! 👍 I'll let you finish then. Let me know if you need any help or something :)

danmoseley assigned nietras and then unassigned nietras Apr 3, 2020
danmoseley (Member) commented:
@GrabYourPitchforks it seems there's a PR dependency chain that includes #32371 -- could you help shuffle it along?

ericstj (Member) commented Aug 5, 2020

@benaadams were you still going to give this a try?

benaadams (Member) commented:
@benaadams were you still going to give this a try?

Intrinsicify SpanHelpers.IndexOfAny(char,char) #40589

Intrinsicify SpanHelpers.IndexOfAny(char,char,char) #40590

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char) #40591

Intrinsicify SpanHelpers.IndexOfAny(char,char,char,char,char) #40592

danmoseley removed the help wanted label Aug 13, 2020
ghost locked as resolved and limited conversation to collaborators Dec 10, 2020