
Accelerate Vector128<long>::op_Multiply on x64 #103555

Merged · 21 commits · Jun 28, 2024
Conversation

EgorBo (Member) commented Jun 17, 2024

This PR optimizes Vector128 and Vector256 multiplication for long/ulong when AVX-512 is not present on the system. It makes XxHash128 faster; see #103555 (comment)

public Vector128<long> Foo(Vector128<long> a, Vector128<long> b) => a * b;

Current codegen on an x64 CPU without AVX-512:

; Method MyBench:Foo
       push     rsi
       push     rbx
       sub      rsp, 104
       mov      rbx, rdx
       mov      rdx, qword ptr [r8]
       mov      qword ptr [rsp+0x58], rdx
       mov      rdx, qword ptr [r9]
       mov      qword ptr [rsp+0x50], rdx
       mov      rdx, qword ptr [rsp+0x58]
       imul     rdx, qword ptr [rsp+0x50]
       mov      qword ptr [rsp+0x60], rdx
       mov      rsi, qword ptr [rsp+0x60]
       mov      rdx, qword ptr [r8+0x08]
       mov      qword ptr [rsp+0x40], rdx
       mov      rdx, qword ptr [r9+0x08]
       mov      qword ptr [rsp+0x38], rdx
       mov      rcx, qword ptr [rsp+0x40]
       mov      rdx, qword ptr [rsp+0x38]
       call     [System.Runtime.Intrinsics.Scalar`1[long]:Multiply(long,long):long]   ;;; not inlined call!
       mov      qword ptr [rsp+0x48], rax
       mov      rax, qword ptr [rsp+0x48]
       mov      qword ptr [rsp+0x20], rsi
       mov      qword ptr [rsp+0x28], rax
       vmovaps  xmm0, xmmword ptr [rsp+0x20]
       vmovups  xmmword ptr [rbx], xmm0
       mov      rax, rbx
       add      rsp, 104
       pop      rbx
       pop      rsi
       ret      
; Total bytes of code: 120

New codegen:

; Method MyBench:Foo
       vmovups  xmm0, xmmword ptr [r8]
       vmovups  xmm1, xmmword ptr [r9]
       vpmuludq xmm2, xmm1, xmm0
       vpshufd  xmm1, xmm1, -79
       vpmulld  xmm0, xmm1, xmm0
       vxorps   xmm1, xmm1, xmm1
       vphaddd  xmm0, xmm0, xmm1
       vpshufd  xmm0, xmm0, 115
       vpaddq   xmm0, xmm0, xmm2
       vmovups  xmmword ptr [rdx], xmm0
       mov      rax, rdx
       ret      
; Total bytes of code: 50

Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

EgorBo (Member Author) commented Jun 17, 2024

Note: results should be better when we do it in the JIT, since that enables loop hoisting, CSE, etc. for the MUL.

neon-sunset (Contributor):

Note #103539 (comment) (and https://godbolt.org/z/eqsrf341M) from the xxHash128 issue.

EgorBo and others added 2 commits June 17, 2024 17:01
…sics/Vector128_1.cs

Co-authored-by: Tanner Gooding <tagoo@outlook.com>
@dotnet dotnet deleted a comment from EgorBot Jun 20, 2024
EgorBo (Member Author) commented Jun 20, 2024

@EgorBot -amd -intel -arm64 -profiler --envvars DOTNET_PreferredVectorBitWidth:128

using System.IO.Hashing;
using BenchmarkDotNet.Attributes;

public class Bench
{
    static readonly byte[] Data = new byte[1000000];

    [Benchmark]
    public byte[] BenchXxHash128()
    {
        XxHash128 hash = new();
        hash.Append(Data);
        return hash.GetHashAndReset();
    }
}

EgorBot commented Jun 20, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-ITXSAG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-XSORFZ : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
| Method         | Toolchain | Mean     | Error    | Ratio |
|----------------|-----------|----------|----------|-------|
| BenchXxHash128 | Main      | 43.41 μs | 0.087 μs | 1.00  |
| BenchXxHash128 | PR        | 43.33 μs | 0.009 μs | 1.00  |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

EgorBot commented Jun 20, 2024

Benchmark results on Amd
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 16 logical and 8 physical cores
  Job-SUBLYH : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-OPUYDY : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
| Method         | Toolchain | Mean     | Error    | Ratio |
|----------------|-----------|----------|----------|-------|
| BenchXxHash128 | Main      | 71.20 μs | 0.022 μs | 1.00  |
| BenchXxHash128 | PR        | 43.84 μs | 0.013 μs | 0.62  |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR


EgorBot commented Jun 20, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-EDPWDU : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TIALUR : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
EnvironmentVariables=DOTNET_PreferredVectorBitWidth=128
| Method         | Toolchain | Mean     | Error   | Ratio |
|----------------|-----------|----------|---------|-------|
| BenchXxHash128 | Main      | 116.9 μs | 0.11 μs | 1.00  |
| BenchXxHash128 | PR        | 116.8 μs | 0.07 μs | 1.00  |

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR


EgorBo (Member Author) commented Jun 21, 2024

/azp list


EgorBo (Member Author) commented Jun 21, 2024

/azp run runtime-coreclr jitstress-isas-x86

Azure Pipelines successfully started running 1 pipeline(s).

EgorBo (Member Author) commented Jun 21, 2024

@tannergooding PTAL. I'll add arm64 separately; I need to test different implementations.
I've expanded it in the importer, similar to the existing op_Multiply expansions.

Benchmark improvement: #103555 (comment)

@EgorBo EgorBo marked this pull request as ready for review June 24, 2024 14:26
Comment on lines +21627 to +21631
// Vector256<int> tmp3 = Avx2.HorizontalAdd(tmp2.AsInt32(), Vector256<int>.Zero);
GenTreeHWIntrinsic* tmp3 =
    gtNewSimdHWIntrinsicNode(type, tmp2, gtNewZeroConNode(type),
                             is256 ? NI_AVX2_HorizontalAdd : NI_SSSE3_HorizontalAdd,
                             CORINFO_TYPE_UINT, simdSize);
Member review comment:

I know in other places we've started avoiding hadd in favor of shuffle+add; it might be worth checking whether that's appropriate here too (low priority, non-blocking).

EgorBo (Member Author):

I tried benchmarking different implementations of it and they were all equally fast, e.g. #99871 (comment)

{
// TODO-XARCH-CQ: We should support long/ulong multiplication
// TODO-XARCH-CQ: 32bit support
Member review comment:

What's blocking 32-bit support? It doesn't look like we're using any _X64 intrinsics in the fallback logic?

EgorBo (Member Author):

Not sure, to be honest; that check was pre-existing, I only changed the comment.

@EgorBo EgorBo merged commit 33ca32d into dotnet:main Jun 28, 2024
167 of 173 checks passed
@EgorBo EgorBo deleted the arm-mul-64bit branch June 28, 2024 21:40
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2024