
Use intrinsics for SequenceEqual<byte> vectorization to emit at R2R #32371

Merged: 4 commits from benaadams:SequenceEqual into dotnet:master, Apr 27, 2020

Conversation

@benaadams (Member) commented Feb 15, 2020:

Bit more stable across sizes; but mostly similar. The biggest win I'd highlight is that it's now emitted at R2R rather than always requiring JIT.

[benchmark result screenshots]

Gist: Benchmark + Results

Resolves #32363

/cc @ahsonkhan

@jkotas added the tenet-performance (Performance related issue) label Feb 15, 2020
@benaadams changed the title from "Use intrinsics for SequenceEqual<byte>" to "Use intrinsics for SequenceEqual<byte> and improve short lengths" Feb 15, 2020
@benaadams (Member, Author):

Redoing this on top of @ahsonkhan's change #32364, as that outperformed this in various areas.

@ahsonkhan (Member) commented Feb 15, 2020:

@benaadams, can you run the following benchmark with what's in master (with my recent change) vs. what's in this PR to measure/validate the small buffer perf?

I am asking because I noticed that using a local copy of SequenceEqual in the benchmark gives different results from calling the actually built method. The RuntimeHelpers.IsBitwiseEquatable<T> call with multiple return points should be part of the benchmark (it ends up changing the results noticeably).

using System;
using System.Text;
using BenchmarkDotNet.Attributes;
// Categories comes from the dotnet/performance MicroBenchmarks harness.

[BenchmarkCategory(Categories.CoreFX, Categories.JSON)]
[DisassemblyDiagnoser(printPrologAndEpilog: true, recursiveDepth: 5)]
public class SequenceEqualThreshold
{
    private byte[] _input;
    private byte[] _expected;

    [Params(0, 1, 2, 3, 4)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var builder = new StringBuilder();
        for (int i = 0; i < Length; i++)
        {
            builder.Append("a");
        }
        string input = builder.ToString();
        _input = Encoding.UTF8.GetBytes(input);
        // Use a distinct array with equal contents so the comparison
        // cannot short-circuit on reference equality.
        _expected = Encoding.UTF8.GetBytes(input);
        // Sanity check that the benchmark resolved the intended build:
        Console.WriteLine(typeof(Span<byte>).AssemblyQualifiedName);
        Console.WriteLine(typeof(Span<byte>).Assembly.Location);
    }

    [Benchmark]
    public bool SequenceEqual()
    {
        return _input.AsSpan().SequenceEqual(_expected);
    }
}
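(If you want to run this class outside the dotnet/performance harness, a minimal standalone host would be something like the sketch below, assuming plain BenchmarkDotNet bootstrapping and dropping the harness-specific Categories attribute:)

using BenchmarkDotNet.Running;

public class Program
{
    // Dispatches to the [Benchmark] classes in this assembly, e.g.
    //   dotnet run -c Release -- --filter *SequenceEqualThreshold*
    public static void Main(string[] args)
        => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}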

This is what I did here: #32363

Using the dotnet/performance repo (see https://github.com/dotnet/performance/blob/ca80d8e2886b583d0a69635740b188248d3d6fdd/src/benchmarks/micro/README.md#private-runtime-builds):

  1. Build dotnet/runtime master: build.cmd -subsetCategory coreclr -c Release && build.cmd -subsetCategory libraries /p:CoreCLRConfiguration=Release
  2. Create a copy of "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0" into a new directory called 5.0.0_Before.
  3. Copy the recently built relevant files from E:\GitHub\Fork\runtime\artifacts\bin\coreclr\Windows_NT.x64.Release (S.P.Corelib.dll, CoreRun.exe, etc.) into it (i.e. into 5.0.0_Before).
  4. Create another copy of "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0" into a new directory called 5.0.0_After.
  5. Make changes to the implementation in System.Private.Corelib (the optimization you are testing from this PR) and just rebuild the coreclr dlls:
    build.cmd -subsetCategory coreclr -c Release
  6. Copy the newly built dlls (including S.P.Corelib) from E:\GitHub\Fork\runtime\artifacts\bin\coreclr\Windows_NT.x64.Release into 5.0.0_After.
  7. In dotnet/performance repo: cd src\benchmarks\micro
  8. dotnet.exe run -c Release -f netcoreapp5.0 --filter SequenceEqualThreshold --corerun "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0_Before\CoreRun.exe" --artifacts "E:\results\before" && dotnet.exe run -c Release -f netcoreapp5.0 --filter SequenceEqualThreshold --corerun "E:\GitHub\Fork\runtime\artifacts\bin\testhost\netcoreapp5.0-Windows_NT-Release-x64\shared\Microsoft.NETCore.App\5.0.0_After\CoreRun.exe" --artifacts "E:\results\after"
  9. cd ..\..\tools\ResultsComparer
  10. dotnet run --base "E:\results\before" --diff "E:\results\after" --threshold 2%

Here are the dlls I copy/override:
[screenshot of the copied dlls]

If you have another, easier way to do it, please do that (and share) :) I probably made things more complicated than needed, so there's got to be a better way to do the perf measurements and speed up the inner dev loop.

Btw, @adamsitnik - the workflow instructions need to be updated. The testhost\corerun folder doesn't contain the latest built System.Private.Corelib.dll which is why I ended up having to manually copy the new dlls to that folder.

Also, we may want to see whether removing the multiple return statements in the main public method helps. (Apparently the if-branch is the special case, so putting the common code in the else branch or outside the if might be better for perf too, so I inverted the condition.) Maybe you can find ways to optimize that as well in different ways :)

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool SequenceEqual<T>(this Span<T> span, ReadOnlySpan<T> other) where T : IEquatable<T>
{
    int length = span.Length;

    bool result = length == other.Length;

    if (!RuntimeHelpers.IsBitwiseEquatable<T>())
    {
        result = result &&
            SpanHelpers.SequenceEqual(ref MemoryMarshal.GetReference(span), ref MemoryMarshal.GetReference(other), length);
    }
    else
    {
        nuint size = (nuint)Unsafe.SizeOf<T>();
        // If this multiplication overflows, the Span we got overflows the
        // entire address range. There's no happy outcome for this api in
        // such a case so we choose not to take the overhead of checking.
        result = result &&
            SpanHelpers.SequenceEqual(
                ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)),
                ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(other)),
                ((nuint)length) * size);
    }

    return result;
}
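(Since RuntimeHelpers.IsBitwiseEquatable<T>() is a JIT intrinsic that folds to a constant per instantiation, the expectation is that the T = byte instantiation of the method above reduces to roughly the following sketch; the method name is illustrative only, and SpanHelpers is CoreLib-internal:)

// Hypothetical shape of the T = byte instantiation once the JIT folds
// IsBitwiseEquatable<byte>() to true and removes the dead branch:
public static bool SequenceEqualByteSketch(Span<byte> span, ReadOnlySpan<byte> other)
{
    int length = span.Length;
    return length == other.Length &&
        SpanHelpers.SequenceEqual(
            ref MemoryMarshal.GetReference(span),
            ref MemoryMarshal.GetReference(other),
            (nuint)length); // sizeof(byte) == 1, so the size scaling also folds away
}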

@ahsonkhan added this to the 5.0 milestone Feb 15, 2020
{
    while (offset < lengthToExamine)
    {
        if (Unsafe.Add(ref first, (IntPtr)offset) != (Unsafe.Add(ref second, (IntPtr)offset)))
@jkotas (Member) commented Feb 16, 2020:

nint -> IntPtr casts are a performance trap. I believe it will widen to a 64-bit long first on 32-bit platforms, and the 64-bit long then gets cast back down to 32 bits using a checked cast.
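(A small sketch of the two conversion paths being discussed; the class and locals here are hypothetical:)

using System;

class CastPaths
{
    static void Demo()
    {
        int offset = 123;

        // int -> IntPtr picks the explicit operator IntPtr(int value)
        // directly; no 64-bit round trip on 32-bit platforms.
        IntPtr fromInt = (IntPtr)offset;

        // An unsigned 64-bit value has no direct operator, so it must go
        // through long; explicit operator IntPtr(long value) is
        // range-checked on 32-bit platforms and can throw
        // OverflowException -- the trap described above.
        ulong nativeOffset = (ulong)offset;
        IntPtr fromULong = (IntPtr)(long)nativeOffset;
    }
}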

@benaadams (Member, Author) replied:

Uses the explicit operator IntPtr(int value), so it should be ok? (rather than nuint, which would go via long)

@GrabYourPitchforks (Member) replied:

FYI what Jan said is the reason the UTF-8 transcoding logic uses void* as an intermediary when converting between IntPtr and (whatever integral type).

uint remainingInputBytes = (uint)(void*)Unsafe.ByteOffset(ref *pInputBuffer, ref *pFinalPosWhereCanReadDWordFromInputBuffer) + 4;

@jkotas (Member) replied:

You are right. This should be fine. I thought there was an unsigned/signed conversion too.

@jkotas (Member) added:

But as @GrabYourPitchforks noted, it is very easy to miss the cases where it is not fine. We had a number of 32-bit-specific perf bugs because of that.

@benaadams (Member, Author) replied:

Yeah, I previously had an issue because pointers are unsigned, so my less-than-zero tests always went the wrong way, which I hadn't expected :(
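(For illustration, a sketch of that failure mode with an unsigned native integer:)

using System;

class UnsignedGuard
{
    static void Demo()
    {
        nuint offset = 0;
        unchecked { offset -= 1; }  // wraps to nuint.MaxValue, not -1

        if (offset < 0)   // always false for an unsigned type, so a
        {                 // "went negative" guard silently never fires
            // unreachable
        }
    }
}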

@benaadams (Member, Author) commented Feb 16, 2020:

@ahsonkhan I was using local copies of SequenceEqual to compare previous vs. master vs. PR vs. a loop method; master outperforms the PR on short lengths, though I'm combining the two, which looks like it improves on both.

Doing it this way for a couple of reasons:

  1. Creating and passing the spans in the code below takes significantly longer than the SequenceEqual(ref, ref, nuint) method in its entirety (perhaps we need a .SequenceEqual extension on arrays so it can bypass span creation if you just have arrays, as string.Equals does? see the sketch after this list); whereas I wanted a fair comparison against a byte-wise loop (calling overheads are something to look at separately?)
public bool SequenceEqual()
{
   return _input.AsSpan().SequenceEqual(_expected);
}
  2. It's quite a pain to have 4 coreclrs for the 4 comparisons I'm making (as you detail above); and I'm not currently very efficient with the new runtime repo, so I have to keep looking up build steps and hunting for files, which slows down iterations.

  3. Disassembling the tiered/vectorized coreclr is more tricky... whereas it's quite easy to right-click on a method in an exe using @EgorBo's Disasmo to see the asm, make some changes, hit refresh, get new asm, etc.
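(A hypothetical shape for the array extension floated in 1., keeping span creation next to an inlined comparison instead of across a call boundary; the names here are illustrative, not from the PR:)

using System;

public static class ByteArrayExtensions
{
    // Hypothetical: compare arrays directly so callers never pass
    // Span<byte> structs across a non-inlined method boundary.
    public static bool SequenceEqualBytes(byte[]? first, byte[]? second)
    {
        if (ReferenceEquals(first, second))
            return true;
        if (first is null || second is null || first.Length != second.Length)
            return false;

        // Spans are created and consumed locally; the span-based comparison
        // inlines here, so no span structs cross a call boundary.
        return first.AsSpan().SequenceEqual(second);
    }
}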

> Also, we may want to see whether removing multiple return statements in the main public method helps:

Should branch-eliminate to only 1 return?

@benaadams (Member, Author):

Probably, workflow-wise, 3. (iterating on the asm) is the biggest factor, as you set the bar quite high with the last PR 😄


@ahsonkhan (Member) commented Feb 16, 2020:

> Should branch-eliminate to only 1 return?

I assume so since the check is an intrinsic. How can we test/verify that is indeed the case? Is there a way to observe that in the disassembly?

I will re-run the benchmark tomorrow to verify that it has no perf impact.
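(One way to check, as a sketch reusing the DisassemblyDiagnoser settings from the benchmark above: run it and inspect the exported asm for a single ret and no IsBitwiseEquatable call. The class name is illustrative:)

using System;
using BenchmarkDotNet.Attributes;

// The diagnoser writes per-method disassembly into the results folder;
// check it for a single return path in the inlined SequenceEqual.
[DisassemblyDiagnoser(printPrologAndEpilog: true, recursiveDepth: 5)]
public class BranchEliminationCheck
{
    private readonly byte[] _a = new byte[16];
    private readonly byte[] _b = new byte[16];

    [Benchmark]
    public bool SequenceEqual() => _a.AsSpan().SequenceEqual(_b);
}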

@benaadams force-pushed the SequenceEqual branch 2 times, most recently from 06a3478 to 99ca277 on February 16, 2020 03:43
@benaadams (Member, Author):

Span passing costs seem very high?

e.g. passing 2 Spans from one method to another is more expensive than comparing the whole 4096 byte spans for equality? (Windows)

|                   Method | Length |      Mean |     Error |    StdDev |
|------------------------- |------- |----------:|----------:|----------:|
|       UseSequenceEqualPR |   4096 | 14.618 ns | 0.0257 ns | 0.0215 ns |
| UseSequenceEqualPRDirect |   4096 |  6.812 ns | 0.0125 ns | 0.0105 ns |
// Passing spans as params. (_input/_expected are Memory<byte> fields, and
// SequenceEqualPR(ref, ref, nuint) is the PR's comparison helper; neither
// is shown here.)

[Benchmark]
[MethodImpl(MethodImplOptions.NoInlining)]
public bool UseSequenceEqualPR()
{
    return SequenceEqualPR(_input.Span, _expected.Span);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static bool SequenceEqualPR(Span<byte> input, Span<byte> expected)
{
    return input.Length == expected.Length && SequenceEqualPR(
        ref MemoryMarshal.GetReference(input),
        ref MemoryMarshal.GetReference(expected),
        (nuint)input.Length);
}

// Using direct

[Benchmark]
[MethodImpl(MethodImplOptions.NoInlining)]
public bool UseSequenceEqualPRDirect()
{
    return SequenceEqualPRDirect();
}

[MethodImpl(MethodImplOptions.NoInlining)]
private bool SequenceEqualPRDirect()
{
    Span<byte> input = _input.Span;
    Span<byte> expected = _expected.Span;

    return input.Length == expected.Length && SequenceEqualPR(
        ref MemoryMarshal.GetReference(input),
        ref MemoryMarshal.GetReference(expected),
        (nuint)input.Length);
}

@benaadams (Member, Author):

Updated the benchmark to show the Span passing cost: https://gist.github.com/benaadams/bf85405a5eae4c750cf6470a5506fd8d. I can make the SequenceEqual(ref, ref, nuint) method faster, but it's already less than 50% of the invocation cost of SequenceEqual<T>(this Span<T> span, Span<T> other), even up to 4096 bytes 🤔

@benaadams (Member, Author):

Raised an issue for the Span<byte> costs when used as parameters: #32396

@@ -1309,85 +1311,191 @@ public static unsafe int LastIndexOfAny(ref byte searchSpace, byte value0, byte

// Optimized byte-based SequenceEquals. The "length" parameter for this one is declared a nuint rather than int as we also use it for types other than byte
// where the length can exceed 2Gb once scaled by sizeof(T).
[MethodImpl(MethodImplOptions.AggressiveOptimization)]
@benaadams (Member, Author) commented:

As it now has Sse2 intrinsics, I removed AggressiveOptimization, which prevents the method from being emitted at R2R.

Pure Vector<T> methods are blocked from R2R, so they need AggressiveOptimization to bypass the inlining restrictions at Tier0.

Note this is a regression on Arm as it will run Tier0 code; however, I couldn't find a #if to put it behind, and it should get picked up by #33308
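(A minimal sketch of the trade-off, assuming the standard MethodImplOptions/tiering semantics; the class and method bodies are placeholders:)

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

static class R2RTradeoffSketch
{
    // Vector<T> code is not precompiled into R2R images, so this needs
    // AggressiveOptimization to skip Tier0 and jit at full opts right away;
    // the cost is that it always requires the JIT at startup.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static int WithVectorT() => Vector<byte>.Count;

    // Sse2 intrinsics can be emitted at R2R (SSE2 is in the x64 baseline),
    // so with no attribute this method gets precompiled code at startup
    // and can still be rejitted at Tier1 later.
    public static bool WithSse2() => Sse2.IsSupported;
}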

@benaadams (Member, Author) commented Mar 14, 2020:

R2R version

; Assembly listing for method SpanHelpers:SequenceEqual(byref,byref,long):bool
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ReadyToRun compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] ( 11, 10   )   byref  ->  rcx        
;  ...
;* V47 tmp27        [V47    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;
; Lcl frame size = 0

G_M37173_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M37173_IG02:
       cmp      r8, 8
       jae      SHORT G_M37173_IG07
						;; bbWeight=1    PerfScore 1.25
G_M37173_IG03:
       cmp      r8, 4
       jae      SHORT G_M37173_IG06
       xor      eax, eax
       mov      r9, r8
       and      r9, 2
       test     r9, r9
       je       SHORT G_M37173_IG04
       movzx    rax, word  ptr [rcx]
       movzx    r10, word  ptr [rdx]
       sub      eax, r10d
						;; bbWeight=0.50 PerfScore 3.75
G_M37173_IG04:
       test     r8b, 1
       je       SHORT G_M37173_IG05
       movzx    rcx, byte  ptr [rcx+r9]
       movzx    rdx, byte  ptr [rdx+r9]
       sub      ecx, edx
       or       ecx, eax
       mov      eax, ecx
						;; bbWeight=0.50 PerfScore 3.00
G_M37173_IG05:
       test     eax, eax
       sete     al
       movzx    rax, al
       jmp      SHORT G_M37173_IG08
						;; bbWeight=0.50 PerfScore 1.75
G_M37173_IG06:
       add      r8, -4
       mov      eax, dword ptr [rcx]
       sub      eax, dword ptr [rdx]
       mov      ecx, dword ptr [rcx+r8]
       sub      ecx, dword ptr [rdx+r8]
       or       eax, ecx
       test     eax, eax
       sete     al
       movzx    rax, al
       jmp      SHORT G_M37173_IG08
						;; bbWeight=0.50 PerfScore 6.00
G_M37173_IG07:
       cmp      rcx, rdx
       je       SHORT G_M37173_IG09
       jmp      SHORT G_M37173_IG11
						;; bbWeight=0.50 PerfScore 1.63
G_M37173_IG08:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
G_M37173_IG09:
       mov      eax, 1
						;; bbWeight=0.50 PerfScore 0.13
G_M37173_IG10:
       ret      
						;; bbWeight=0.50 PerfScore 0.50
G_M37173_IG11:
       cmp      r8, 16
       jb       SHORT G_M37173_IG14
       xor      rax, rax
       add      r8, -16
       test     r8, r8
       je       SHORT G_M37173_IG13
						;; bbWeight=0.50 PerfScore 1.50
G_M37173_IG12:
       movups   xmm0, xmmword ptr [rcx+rax]
       movups   xmm1, xmmword ptr [rdx+rax]
       pcmpeqb  xmm0, xmm1
       pmovmskb  r9d, xmm0
       cmp      r9d, 0xFFFF
       jne      SHORT G_M37173_IG15
       add      rax, 16
       cmp      r8, rax
       ja       SHORT G_M37173_IG12
						;; bbWeight=4    PerfScore 49.00
G_M37173_IG13:
       movups   xmm0, xmmword ptr [rcx+r8]
       movups   xmm1, xmmword ptr [rdx+r8]
       pcmpeqb  xmm0, xmm1
       pmovmskb  ecx, xmm0
       cmp      ecx, 0xFFFF
       jne      SHORT G_M37173_IG15
       jmp      SHORT G_M37173_IG09
						;; bbWeight=0.50 PerfScore 6.38
G_M37173_IG14:
       lea      rax, [r8-8]
       mov      r8, qword ptr [rcx]
       sub      r8, qword ptr [rdx]
       mov      rcx, qword ptr [rcx+rax]
       sub      rcx, qword ptr [rdx+rax]
       or       r8, rcx
       test     r8, r8
       sete     al
       movzx    rax, al
       jmp      SHORT G_M37173_IG08
						;; bbWeight=0.50 PerfScore 6.13
G_M37173_IG15:
       xor      eax, eax
						;; bbWeight=0.50 PerfScore 0.13
G_M37173_IG16:
       ret      
						;; bbWeight=0.50 PerfScore 0.50

; Total bytes of code 225, prolog size 0, PerfScore 104.63, (MethodHash=63986eca) for method SpanHelpers:SequenceEqual(byref,byref,long):bool
; ============================================================

@benaadams force-pushed the SequenceEqual branch 2 times, most recently from 858ef64 to 8111936 on March 16, 2020 00:38
@adamsitnik (Member):

> the workflow instructions need to be updated.

Please excuse me for the late response. Both the benchmarking and profiling docs were updated some time ago and are now up to date:

https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md
https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md

Please let me know if something does not work as expected.


@benaadams changed the title from "Use intrinsics for SequenceEqual<byte> and improve short lengths" to "Use intrinsics for SequenceEqual<byte> vectorization to emit at R2R" Mar 16, 2020
@benaadams (Member, Author):

/cc @GrabYourPitchforks any more to do here?

@GrabYourPitchforks merged commit 535b998 into dotnet:master Apr 27, 2020
@GrabYourPitchforks added the enhancement (Product code improvement that does NOT require public API changes/additions) label Apr 27, 2020
@danmoseley (Member):

@tannergooding do you have time to help get this reviewed? I know @GrabYourPitchforks is fully occupied with something critical. At least one other PR is blocked on this one.

@GrabYourPitchforks (Member):

@danmosemsft did you mean to comment on a different PR? This one is merged.

@benaadams deleted the SequenceEqual branch May 2, 2020 16:33
@danmoseley (Member):

Doh. My goal was to unblock @benaadams: #25023 (comment)

@benaadams (Member, Author):

Need to minimise code churn/merge clashes between the PRs, so I've done a cleanup PR to make it easier: #35765

@ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Labels: area-System.Memory, enhancement (Product code improvement that does NOT require public API changes/additions), tenet-performance (Performance related issue)

Projects: None yet

Development: Successfully merging this pull request may close these issues:

Can Span<T>.SequenceEqual be optimized further to be faster for small buffers (buffer.Length < 5)?

7 participants