Accelerate additional cross platform hardware intrinsics #61649

tannergooding · 2021-11-16T02:27:20Z

This continues the work on #49397, which started with #53450 and #60094.

In particular, this moves

* IsHardwareAccelerated
* floating-point/integer conversions
* CompareAll and CompareAny APIs for each comparison type

to be implemented using the general SIMDAsHWIntrinsic logic, and then has the new APIs in Vector64/128/256 use the same shared entry points.
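As a rough scalar model of what the "All" and "Any" comparison families compute (an illustration only, not the JIT's implementation; `compare_all` and `compare_any` are hypothetical names), "All" requires the predicate to hold in every lane and "Any" requires it in at least one lane:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical scalar model: true only if pred holds for every pair of lanes.
template <typename T, std::size_t N, typename Pred>
bool compare_all(const std::array<T, N>& left, const std::array<T, N>& right, Pred pred) {
    for (std::size_t i = 0; i < N; ++i) {
        if (!pred(left[i], right[i])) {
            return false;
        }
    }
    return true;
}

// Hypothetical scalar model: true if pred holds for at least one pair of lanes.
template <typename T, std::size_t N, typename Pred>
bool compare_any(const std::array<T, N>& left, const std::array<T, N>& right, Pred pred) {
    for (std::size_t i = 0; i < N; ++i) {
        if (pred(left[i], right[i])) {
            return true;
        }
    }
    return false;
}
```

Passing `std::equal_to<int>()` models an EqualsAll/EqualsAny style check; other comparison types only swap the predicate.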

There will likely be one or two more PRs after this one covering:

* Native integer support
* Miscellaneous APIs that were recently added to Vector, such as Sum, or that are approved but not yet implemented, such as ShiftLeft/ShiftRight

Once this is in, the library side work to switch over to using the xplat APIs can also happen.

ghost · 2021-11-16T02:27:27Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author:	tannergooding
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

tannergooding · 2021-11-17T00:48:29Z

src/coreclr/jit/simdashwintrinsic.cpp

+                case NI_VectorT128_ConvertToUInt32:
+                {
+                    assert(simdBaseType == TYP_FLOAT);
+                    return gtNewSimdHWIntrinsicNode(retType, op1, NI_AdvSimd_ConvertToUInt32RoundToZero,
+                                                    simdBaseJitType, simdSize, /* isSimdAsHWIntrinsic */ true);
+                }
+
+                case NI_VectorT128_ConvertToUInt64:
+                {
+                    assert(simdBaseType == TYP_DOUBLE);
+                    return gtNewSimdHWIntrinsicNode(retType, op1, NI_AdvSimd_Arm64_ConvertToUInt64RoundToZero,
+                                                    simdBaseJitType, simdSize, /* isSimdAsHWIntrinsic */ true);
+                }


It's worth calling out that these weren't accelerated at all on ARM64 before; now they are, each via a single instruction.
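For readers less familiar with the "round to zero" conversions, here is a scalar sketch of the per-lane behaviour. It assumes the intrinsic lowers to the Arm FCVTZU instruction, which truncates toward zero, saturates out-of-range inputs, and converts NaN to 0. The function name is hypothetical, and the PR itself emits one vector instruction rather than anything like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar model of a saturating float -> uint32 conversion
// that rounds toward zero (one lane of the vector operation).
std::uint32_t convert_to_uint32_round_to_zero(float value) {
    if (std::isnan(value)) {
        return 0; // NaN converts to 0
    }
    if (value <= 0.0f) {
        return 0; // negative inputs saturate to the minimum
    }
    if (value >= 4294967296.0f) {
        // 2^32 and above saturate to the maximum
        return std::numeric_limits<std::uint32_t>::max();
    }
    // In range: a C++ float-to-integer cast truncates toward zero.
    return static_cast<std::uint32_t>(value);
}
```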

tannergooding · 2021-11-17T00:49:30Z

src/coreclr/jit/simdintrinsiclist.h

-SIMD_INTRINSIC("ConvertToInt32",            false,       ConvertToInt32,           "ConvertToInt32",         TYP_STRUCT,     1,      {TYP_STRUCT, TYP_UNDEF,  TYP_UNDEF},   {TYP_FLOAT, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF})
-// Convert double to long
-SIMD_INTRINSIC("ConvertToInt64",            false,       ConvertToInt64,           "ConvertToInt64",         TYP_STRUCT,     1,      {TYP_STRUCT, TYP_UNDEF,  TYP_UNDEF},   {TYP_DOUBLE, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF})
-


We're getting near the point that the rest of this "legacy" SIMD intrinsic support can be removed entirely as nearly everything has moved onto SIMDAsHWIntrinsic now.

tannergooding · 2021-11-17T00:50:54Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs

+            if (Sse2.IsSupported)
+            {
+                // Based on __m256d int64_to_double_fast_precise(const __m256i v)
+                // from https://stackoverflow.com/a/41223013/12860347. CC BY-SA 4.0


The three bits of code here are new algorithms and are significantly faster than the previous ones.

They are also correct, whereas the old algorithm would return a different result from the scalar versions for various inputs.
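The comment in the diff credits a Stack Overflow routine for the int64 to double case. The following is a scalar sketch of that general split-and-recombine idea, not the code added by this PR; the function names and the specific magic constants (2^52, 2^84 + 2^63, and their sum) are assumptions chosen so the arithmetic works out exactly:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a 64-bit integer as a double (no numeric conversion).
static double bits_to_double(std::uint64_t bits) {
    double result;
    std::memcpy(&result, &bits, sizeof(result));
    return result;
}

// Hypothetical scalar sketch of a split-and-recombine int64 -> double conversion.
double int64_to_double_split(std::int64_t value) {
    const std::uint64_t bits = static_cast<std::uint64_t>(value);

    // Low 32 bits dropped into the mantissa of 2^52: reads back as 2^52 + lo.
    const double lo = bits_to_double(0x4330000000000000ULL | (bits & 0xFFFFFFFFULL));

    // High 32 bits with the sign bit flipped, dropped into the mantissa of 2^84:
    // reads back as 2^84 + 2^63 + hi * 2^32, where hi is the signed high half.
    const double hi = bits_to_double((bits >> 32) ^ 0x4530000080000000ULL);

    // 2^84 + 2^63 + 2^52, the bias introduced by the two steps above.
    const double magic = bits_to_double(0x4530000080100000ULL);

    // (hi - magic) is exact, so the final addition is the only rounding step.
    // The order of operations matters; do not compile this with fast-math.
    return (hi - magic) + lo;
}
```

Because only one rounding happens, the result matches what a correctly rounded scalar conversion produces, which is the property the comment above is describing.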

tannergooding · 2021-11-17T02:02:25Z

The diffs for `--pmi --frameworks` only show the Vector/Vector128/Vector256.ConvertToDouble/ConvertToSingle changes in managed code.
There are no diffs for `--pmi --benchmarks`.

`--pmi --tests`

The regressions below are because we previously had a call to a software fallback and are now inlining intrinsic code.

Improvements are all the "standard" ones we've seen from previous moves from legacy "SIMD" to the "SIMDAsHWIntrinsic" support.

For example, we now support containment:

- vmovupd  ymm6, ymmword ptr[rsi]
- vcvttps2dq ymm6, ymm6
+ vcvttps2dq ymm6, ymmword ptr[rsi]

There are also algorithms that are smaller and now correct, such as ConvertToSingle(Vector128<uint>):

- vmovupd  ymm6, ymmword ptr[rsp+60H]
- vmovaps  ymm1, ymm6
- vpsrld   ymm6, 16
- vpslld   ymm1, 16
- vpsrld   ymm1, 16
- mov      rcx, 0xD1FFAB1E
- vmovd    xmm0, rcx
- vpbroadcastd ymm0, ymm0
- vorps    ymm6, ymm0
- vsubps   ymm6, ymm0
- vcvtdq2ps ymm1, ymm1
- vaddps   ymm6, ymm1
+ vmovupd  ymm0, ymmword ptr[rsp+60H]
+ vpand    ymm1, ymm0, ymmword ptr[reloc @RWD00]
+ vpsrld   ymm0, ymm0, 16
+ vcvtdq2ps ymm1, ymm1
+ vcvtdq2ps ymm6, ymm0
+ vfmadd132ps ymm6, ymm1, ymmword ptr[reloc @RWD32]

and

- vextracti128 xmm0, xmm6, 1
- vmovaps  ymm1, ymm0
- vpsrldq  ymm1, 8
- vmovd    rcx, xmm1
- vcvtsi2sd  xmm1, rcx
- vpslldq  ymm1, 8
- vmovd    rcx, xmm0
- vcvtsi2sd  xmm1, rcx
- vmovaps  ymm0, ymm6
- vpsrldq  ymm0, 8
- vmovd    rcx, xmm0
- vcvtsi2sd  xmm0, rcx
- vpslldq  ymm0, 8
- vmovd    rcx, xmm6
- vcvtsi2sd  xmm0, rcx
- vmovaps  ymm7, ymm0
- vinsertf128 xmm7, xmm1, 1
+ vmovupd  ymm0, ymmword ptr[rsp+60H]
+ vmovupd  ymm1, ymmword ptr[reloc @RWD00]
+ vpblendd ymm1, ymm1, ymm0, 5
+ vpsrlq   ymm0, ymm0, 32
+ vpxor    ymm0, ymm0, ymmword ptr[reloc @RWD32]
+ vsubpd   ymm0, ymm0, ymmword ptr[reloc @RWD64]
+ vaddpd   ymm6, ymm0, ymm1
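The first new sequence above (a mask, a shift right by 16, two `vcvtdq2ps`, and one `vfmadd132ps`) reads like a split of each 32-bit lane into two 16-bit halves that are converted separately and recombined with a fused multiply-add. A scalar sketch of that reading follows; the mask `0xFFFF` and the multiplier `65536.0f` are assumptions about the `RWD00` and `RWD32` constants, and the function name is hypothetical:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical scalar model of a uint32 -> float conversion built only from
// signed int32 -> float conversions, which is all that vcvtdq2ps provides.
float uint32_to_float_split(std::uint32_t value) {
    // Both halves are below 2^16, so the signed conversions are exact.
    const float lo = static_cast<float>(static_cast<std::int32_t>(value & 0xFFFFu));
    const float hi = static_cast<float>(static_cast<std::int32_t>(value >> 16));

    // hi * 65536 + lo is evaluated with a single rounding by the fused
    // multiply-add, so the result is the correctly rounded float.
    return std::fma(hi, 65536.0f, lo);
}
```

The point of the split is that neither half can be misread as a negative number, which is what goes wrong when a value at or above 2^31 is fed directly to a signed conversion.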

Notably, ConvertToInt64(Vector<double>) isn't "intrinsic" anymore on x64:

- vmovupd  ymm6, ymmword ptr[rsi]
- vextractf128 xmm0, xmm6, 1
- vmovaps  ymm1, ymm0
- vpsrldq  ymm1, 8
- vcvttsd2si  rcx, xmm1
- vmovd    xmm1, rcx
- vpslldq  ymm1, 8
- vcvttsd2si  rcx, xmm0
- vmovd    xmm0, rcx
- vpor     ymm1, ymm0
- vmovaps  ymm0, ymm6
- vpsrldq  ymm0, 8
- vcvttsd2si  rcx, xmm0
- vmovd    xmm0, rcx
- vpslldq  ymm0, 8
- vcvttsd2si  rcx, xmm6
- vmovd    xmm6, rcx
- vpor     ymm6, ymm0
- vinserti128 xmm6, xmm1, 1
+ lea      rcx, [rsp+60H]
+ vmovupd  ymm0, ymmword ptr[rsi]
+ vmovupd  ymmword ptr[rsp+40H], ymm0
+ lea      rdx, bword ptr [rsp+40H]
+ call     Vector:ConvertToInt64(Vector`1):Vector`1

It already wasn't intrinsic on x86, but on x64 it was really just an inlined and unrolled version of the software fallback. This really deserves proper vectorization and probably shouldn't have been inlined in the first place, given how big it was.

Diff

Total bytes of base: 163647693
Total bytes of diff: 163646334
Total bytes of delta: -1359 (-0.00 % of base)
Total relative delta: -2.94
    diff is an improvement.
    relative diff is an improvement.


Top file improvements (bytes):
        -273 : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm (-0.01% of base)
        -217 : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm (-0.01% of base)
        -201 : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm (-0.01% of base)
        -199 : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm (-0.01% of base)
        -191 : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm (-0.81% of base)
        -182 : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm (-0.15% of base)
         -96 : JIT\Directed\Convert\out_of_range_fp_to_int_conversions\out_of_range_fp_to_int_conversions.dasm (-0.64% of base)

7 total files with Code Size differences (7 improved, 0 regressed), 3575 unchanged.

Top method regressions (bytes):
          32 (13.11% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          20 ( 5.56% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          18 ( 5.39% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          12 ( 5.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          12 ( 3.79% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          10 ( 3.44% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          10 ( 2.76% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructFldScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructLclFldScenario():this
           8 ( 2.49% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToDoubleDouble):this
           8 ( 2.20% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassLclFldScenario():this
           3 ( 1.27% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunBasicScenario_UnsafeRead():this
           3 ( 0.95% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunLclVarScenario_UnsafeRead():this
           2 ( 0.63% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
           1 ( 0.34% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunClassFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructLclFldScenario():this

Top method improvements (bytes):
        -122 (-15.04% of base) : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
        -106 (-8.09% of base) : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
         -96 (-0.97% of base) : JIT\Directed\Convert\out_of_range_fp_to_int_conversions\out_of_range_fp_to_int_conversions.dasm - Program:TestBitValue(int,Nullable`1,Nullable`1)
         -46 (-4.08% of base) : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleUInt64(Vector`1):int
         -41 (-11.23% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -41 (-10.79% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -41 (-9.86% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -41 (-10.30% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -41 (-10.38% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -40 (-6.30% of base) : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleUInt64(Vector`1):int
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructFldScenario():this
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -39 (-14.03% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -39 (-12.19% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -38 (-13.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -38 (-8.78% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunLclVarScenario_UnsafeRead():this
         -37 (-11.60% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -33 (-10.15% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunBasicScenario_UnsafeRead():this
         -31 (-9.75% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -31 (-9.66% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this

Top method regressions (percentages):
          32 (13.11% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          20 ( 5.56% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          18 ( 5.39% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          12 ( 5.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          12 ( 3.79% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          10 ( 3.44% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructFldScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructLclFldScenario():this
          10 ( 2.76% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
           8 ( 2.49% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToDoubleDouble):this
           8 ( 2.20% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassLclFldScenario():this
           3 ( 1.27% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunBasicScenario_UnsafeRead():this
           3 ( 0.95% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunLclVarScenario_UnsafeRead():this
           2 ( 0.63% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
           1 ( 0.34% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunClassFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructLclFldScenario():this

Top method improvements (percentages):
        -122 (-15.04% of base) : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
         -39 (-14.03% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -38 (-13.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -39 (-12.19% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -37 (-11.60% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -41 (-11.23% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -41 (-10.79% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructFldScenario():this
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -41 (-10.38% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -41 (-10.30% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -33 (-10.15% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunBasicScenario_UnsafeRead():this
         -41 (-9.86% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -31 (-9.75% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -23 (-9.70% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunBasicScenario_UnsafeRead():this
         -31 (-9.66% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -29 (-9.39% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructFldScenario():this
         -29 (-9.39% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -31 (-9.31% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -30 (-8.98% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this

69 total methods with Code Size differences (52 improved, 17 regressed), 502905 unchanged.

tannergooding · 2021-11-17T07:09:17Z

Test failures are because the current RyuJIT algorithm for ulong->double conversion is incorrect.

This can be repro'd using 17178724614669765595 where:

RyuJIT: 17178724614669766656
C# + double.Parse + this PR: 17178724614669764608
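To see which of the two results is the correctly rounded one, note that doubles in the range [2^63, 2^64) are spaced 2^11 = 2048 apart, so the input sits between two representable neighbours. A small sketch (the struct and function names are hypothetical) that computes the lower neighbour and the distance to each side:

```cpp
#include <cassert>
#include <cstdint>

struct Neighbours {
    std::uint64_t lower;         // largest double-representable integer <= value
    std::uint64_t distance_down; // value - lower
    std::uint64_t distance_up;   // (lower + 2048) - value
};

// Hypothetical helper: only valid for value >= 2^63, where adjacent doubles
// are exactly 2048 apart.
Neighbours neighbours_above_2_63(std::uint64_t value) {
    assert(value >= (1ULL << 63));
    const std::uint64_t spacing = 1ULL << 11;
    const std::uint64_t lower = value & ~(spacing - 1);
    const std::uint64_t down = value - lower;
    return {lower, down, spacing - down};
}
```

For 17178724614669765595 the lower neighbour is 17178724614669764608, at a distance of 987, while the upper neighbour 17178724614669766656 is 1061 away, so round-to-nearest has to pick the lower value.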

tannergooding · 2022-01-03T18:19:40Z

Rebased onto dotnet/main; this is still pending review and would be very beneficial to get merged so the PRs adding new SIMD logic can trivially support ARM from the start.

echesakov

Have some questions, but, other than that, looks good.

src/coreclr/jit/simdashwintrinsic.cpp

src/libraries/System.Private.CoreLib/src/ILLink/ILLink.Substitutions.NoArmIntrinsics.xml

EgorBo · 2022-01-20T16:17:20Z

Improvement on win-arm64: dotnet/perf-autofiling-issues#2974

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 16, 2021

tannergooding commented Nov 17, 2021

tannergooding mentioned this pull request Nov 18, 2021

Updating the x64 casting behavior to be IEEE 754 compliant and to use saturation for overflow #61761

Closed

tannergooding force-pushed the xplat-hwintrin branch 3 times, most recently from ecbf81c to 073a445 Compare November 23, 2021 15:56

runfoapp bot mentioned this pull request Nov 23, 2021

System.Net.Sockets.Tests.SendPacketsAsync.SendPacketsElement_FileZeroCount_Success sometimes fails #60017

Closed

tannergooding force-pushed the xplat-hwintrin branch from 073a445 to e0b3289 Compare December 3, 2021 03:47

tannergooding marked this pull request as ready for review December 3, 2021 18:41

tannergooding changed the title ~~[WIP]: Accelerate additional cross platform hardware intrinsics~~ Accelerate additional cross platform hardware intrinsics Dec 6, 2021

tannergooding mentioned this pull request Jan 3, 2022

Improved HTTP/1 header parser #63295

Closed

tannergooding added 12 commits January 3, 2022 10:19

Updating Vector64/128/256.IsHardwareAccelerated to be treated as a co…

f84facc

…nstant and return true where supported

Accelerate the CmpOpAll intrinsics

acf45f1

Accelerate the CmpOpAny intrinsics

dc3cbfc

Accelerate the ConverToDouble/Int32/Int64/Single/UInt32/UInt64 intrin…

7c9701a

…sics

Applying formatting patch

3eda56e

Fixing ConvertToInt32 and ConvertToSingle to use the right intrinsic

fa2fd35

Fixing some issues and assert types are correct

a419973

Updating ConvertToDouble and ConvertToSingle to have correct vectoriz…

806ba01

…ed versions on x86/x64

Ensure Vector<T>.ConvertToDouble/Single are accelerated

7e69c8e

Swap operands and invert immediate so the constant can be contained o…

a6edef2

…n blend

Restrict ConvertToDouble(Vector128<UInt64>) tests to inputs no more t…

391d177

…han long.MaxValue

Ensure that we create a long/ulong rather than a uint

8202352

tannergooding force-pushed the xplat-hwintrin branch from e0b3289 to 8202352 Compare January 3, 2022 18:19

tannergooding mentioned this pull request Jan 3, 2022

Extend Vector64<T>, Vector128<T>, and Vector256<T> to support nint and nuint #52017

Closed

echesakov self-requested a review January 3, 2022 19:07

echesakov approved these changes Jan 4, 2022

tannergooding merged commit 7172c68 into dotnet:main Jan 4, 2022

tannergooding mentioned this pull request Jan 4, 2022

Expose cross-platform helpers for Vector64, Vector128, and Vector256 #49397

Closed

DrewScoggins mentioned this pull request Jan 13, 2022

[Perf] Changes at 1/4/2022 3:56:15 AM dotnet/perf-autofiling-issues#2706

Closed

kunalspathak mentioned this pull request Jan 13, 2022

[Perf] Changes at 1/4/2022 3:56:15 AM dotnet/perf-autofiling-issues#2833

Closed

JulieLeeMSFT added this to the 7.0.0 milestone Jan 25, 2022

JulieLeeMSFT mentioned this pull request Jan 25, 2022

What's new in .NET 7 Preview 1 [WIP] dotnet/core#7106

Closed

ghost locked as resolved and limited conversation to collaborators Feb 24, 2022
