Accelerate additional cross platform hardware intrinsics #61649

Merged
merged 12 commits into from
Jan 4, 2022


@tannergooding tannergooding commented Nov 16, 2021

This continues the work on #49397 which started with #53450 and #60094

In particular, this moves

  • IsHardwareAccelerated
  • floating-point/integer conversions
  • CompareAll and CompareAny APIs for each comparison type
    to be implemented using the general SIMDAsHWIntrinsic logic, and then has the new APIs in Vector64/128/256 use the same shared entry points.
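
For readers unfamiliar with the All/Any shape, here is a minimal scalar sketch in C++ of what an Equals-style All/Any pair computes across the lanes of a vector. The `Vec4` alias and the function names are hypothetical stand-ins chosen for illustration; they are not the JIT's entry points or the managed API surface.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical 4-lane integer vector, used only for illustration.
using Vec4 = std::array<int, 4>;

// "All" form: true only when every lane of a equals the matching lane of b.
bool equals_all(const Vec4& a, const Vec4& b) {
    return std::equal(a.begin(), a.end(), b.begin());
}

// "Any" form: true when at least one lane of a equals the matching lane of b.
bool equals_any(const Vec4& a, const Vec4& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == b[i]) {
            return true;
        }
    }
    return false;
}
```

The other comparison types (greater-than, less-than, and so on) would follow the same pattern with a different per-lane predicate.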

There will likely be one or two more PRs after this one covering:

  • Native Integer support
  • Misc APIs that were recently added to Vector, such as Sum, or that are approved but not yet implemented (NYI), such as ShiftLeft/ShiftRight

Once this is in, the library side work to switch over to using the xplat APIs can also happen.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 16, 2021

ghost commented Nov 16, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: tannergooding
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

Comment on lines +829 to +841
case NI_VectorT128_ConvertToUInt32:
{
assert(simdBaseType == TYP_FLOAT);
return gtNewSimdHWIntrinsicNode(retType, op1, NI_AdvSimd_ConvertToUInt32RoundToZero,
simdBaseJitType, simdSize, /* isSimdAsHWIntrinsic */ true);
}

case NI_VectorT128_ConvertToUInt64:
{
assert(simdBaseType == TYP_DOUBLE);
return gtNewSimdHWIntrinsicNode(retType, op1, NI_AdvSimd_Arm64_ConvertToUInt64RoundToZero,
simdBaseJitType, simdSize, /* isSimdAsHWIntrinsic */ true);
}
Member Author

It's worth calling out that these weren't accelerated at all on ARM64 before; now they are, via a single instruction.
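
To make the per-lane behavior concrete: the AdvSimd intrinsic used above rounds toward zero, and the underlying Arm instruction (FCVTZU) saturates out-of-range inputs and maps NaN to 0. Below is a minimal scalar model of one lane in C++. The name `fcvtzu_lane` is hypothetical, and this is a sketch of the instruction's semantics rather than anything taken from the JIT.

```cpp
#include <cmath>
#include <cstdint>

// Scalar model of one lane of a float -> uint32 convert that rounds toward
// zero and saturates (hypothetical helper, for illustration only).
std::uint32_t fcvtzu_lane(float v) {
    if (std::isnan(v)) {
        return 0;                          // NaN maps to 0
    }
    if (v <= 0.0f) {
        return 0;                          // negatives saturate to 0
    }
    if (v >= 4294967296.0f) {              // 2^32 and above
        return UINT32_MAX;                 // saturate to the largest value
    }
    return static_cast<std::uint32_t>(v);  // in range: truncates toward zero
}
```

For example, `fcvtzu_lane(3.9f)` yields 3 and `fcvtzu_lane(-1.5f)` yields 0.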

SIMD_INTRINSIC("ConvertToInt32", false, ConvertToInt32, "ConvertToInt32", TYP_STRUCT, 1, {TYP_STRUCT, TYP_UNDEF, TYP_UNDEF}, {TYP_FLOAT, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF})
// Convert double to long
SIMD_INTRINSIC("ConvertToInt64", false, ConvertToInt64, "ConvertToInt64", TYP_STRUCT, 1, {TYP_STRUCT, TYP_UNDEF, TYP_UNDEF}, {TYP_DOUBLE, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF, TYP_UNDEF})

Member Author

We're getting near the point that the rest of this "legacy" SIMD intrinsic support can be removed entirely as nearly everything has moved onto SIMDAsHWIntrinsic now.

if (Sse2.IsSupported)
{
// Based on __m256d int64_to_double_fast_precise(const __m256i v)
// from https://stackoverflow.com/a/41223013/12860347. CC BY-SA 4.0
Member Author

The three bits of code here are new algorithms and are significantly faster than the previous ones.

They are also correct, whereas the old algorithms would return a different result from the scalar versions for various inputs.

@tannergooding
Member Author

Diffs for --pmi --frameworks only show the Vector/Vector128/Vector256.ConvertToDouble/ConvertToSingle changes in managed code.
No diffs for --pmi --benchmarks

--pmi --tests

The regressions below are because we previously had a call to a software fallback and are now inlining intrinsic code.

Improvements are all the "standard" ones we've seen from previous moves from legacy "SIMD" to the "SIMDAsHWIntrinsic" support.

For example, we now support containment:

- vmovupd  ymm6, ymmword ptr[rsi]
- vcvttps2dq ymm6, ymm6
+ vcvttps2dq ymm6, ymmword ptr[rsi]

There are also algorithms that are smaller and now correct (ConvertToSingle(Vector128<uint>)):

- vmovupd  ymm6, ymmword ptr[rsp+60H]
- vmovaps  ymm1, ymm6
- vpsrld   ymm6, 16
- vpslld   ymm1, 16
- vpsrld   ymm1, 16
- mov      rcx, 0xD1FFAB1E
- vmovd    xmm0, rcx
- vpbroadcastd ymm0, ymm0
- vorps    ymm6, ymm0
- vsubps   ymm6, ymm0
- vcvtdq2ps ymm1, ymm1
- vaddps   ymm6, ymm1
+ vmovupd  ymm0, ymmword ptr[rsp+60H]
+ vpand    ymm1, ymm0, ymmword ptr[reloc @RWD00]
+ vpsrld   ymm0, ymm0, 16
+ vcvtdq2ps ymm1, ymm1
+ vcvtdq2ps ymm6, ymm0
+ vfmadd132ps ymm6, ymm1, ymmword ptr[reloc @RWD32]

and

- vextracti128 xmm0, xmm6, 1
- vmovaps  ymm1, ymm0
- vpsrldq  ymm1, 8
- vmovd    rcx, xmm1
- vcvtsi2sd  xmm1, rcx
- vpslldq  ymm1, 8
- vmovd    rcx, xmm0
- vcvtsi2sd  xmm1, rcx
- vmovaps  ymm0, ymm6
- vpsrldq  ymm0, 8
- vmovd    rcx, xmm0
- vcvtsi2sd  xmm0, rcx
- vpslldq  ymm0, 8
- vmovd    rcx, xmm6
- vcvtsi2sd  xmm0, rcx
- vmovaps  ymm7, ymm0
- vinsertf128 xmm7, xmm1, 1
+ vmovupd  ymm0, ymmword ptr[rsp+60H]
+ vmovupd  ymm1, ymmword ptr[reloc @RWD00]
+ vpblendd ymm1, ymm1, ymm0, 5
+ vpsrlq   ymm0, ymm0, 32
+ vpxor    ymm0, ymm0, ymmword ptr[reloc @RWD32]
+ vsubpd   ymm0, ymm0, ymmword ptr[reloc @RWD64]
+ vaddpd   ymm6, ymm0, ymm1
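
The two new sequences above look like a split-and-recombine pattern: convert each half of the integer exactly, then combine the halves so that rounding happens only once. Here is a scalar sketch of that idea in C++. The diff does not show the constants behind `@RWD00`/`@RWD32`/`@RWD64`, so the mask, the 65536.0f scale, and the 2^52 / 2^84 bit patterns below are assumptions based on the well-known technique, and the function names are hypothetical.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as an IEEE 754 double.
static double bits_to_double(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// uint32 -> float: both 16-bit halves convert exactly, and a single fused
// multiply-add recombines them, so the result is rounded once.
float u32_to_float_split(std::uint32_t v) {
    float lo = static_cast<float>(static_cast<std::int32_t>(v & 0xFFFFu));
    float hi = static_cast<float>(static_cast<std::int32_t>(v >> 16));
    return std::fma(hi, 65536.0f, lo);
}

// uint64 -> double: place each 32-bit half in the mantissa of a "magic"
// double, subtract the magic offsets exactly, and add once at the end.
// Assumes IEEE 754 double arithmetic without excess precision.
double u64_to_double_split(std::uint64_t v) {
    const std::uint64_t two52 = 0x4330000000000000ull;             // bits of 2^52
    const std::uint64_t two84 = 0x4530000000000000ull;             // bits of 2^84
    const std::uint64_t two84_plus_two52 = 0x4530000000100000ull;  // bits of 2^84 + 2^52

    double lo = bits_to_double(two52 | (v & 0xFFFFFFFFull));  // 2^52 + low half
    double hi = bits_to_double(two84 | (v >> 32));            // 2^84 + high half * 2^32

    // (hi - (2^84 + 2^52)) is exact; the final add is the only rounding step.
    return (hi - bits_to_double(two84_plus_two52)) + lo;
}
```

Because only the last operation rounds, both functions should agree with a direct scalar conversion on IEEE 754 targets.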

Notably, ConvertToInt64(Vector<double>) isn't "intrinsic" anymore on x64:

- vmovupd  ymm6, ymmword ptr[rsi]
- vextractf128 xmm0, xmm6, 1
- vmovaps  ymm1, ymm0
- vpsrldq  ymm1, 8
- vcvttsd2si  rcx, xmm1
- vmovd    xmm1, rcx
- vpslldq  ymm1, 8
- vcvttsd2si  rcx, xmm0
- vmovd    xmm0, rcx
- vpor     ymm1, ymm0
- vmovaps  ymm0, ymm6
- vpsrldq  ymm0, 8
- vcvttsd2si  rcx, xmm0
- vmovd    xmm0, rcx
- vpslldq  ymm0, 8
- vcvttsd2si  rcx, xmm6
- vmovd    xmm6, rcx
- vpor     ymm6, ymm0
- vinserti128 xmm6, xmm1, 1
+ lea      rcx, [rsp+60H]
+ vmovupd  ymm0, ymmword ptr[rsi]
+ vmovupd  ymmword ptr[rsp+40H], ymm0
+ lea      rdx, bword ptr [rsp+40H]
+ call     Vector:ConvertToInt64(Vector`1):Vector`1

It already wasn't intrinsic on x86, and on x64 it was really just an inlined and unrolled version of the software fallback. This really deserves proper vectorization, and it probably shouldn't have been inlined in the first place given how large it was.

Diff

Total bytes of base: 163647693
Total bytes of diff: 163646334
Total bytes of delta: -1359 (-0.00 % of base)
Total relative delta: -2.94
    diff is an improvement.
    relative diff is an improvement.


Top file improvements (bytes):
        -273 : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm (-0.01% of base)
        -217 : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm (-0.01% of base)
        -201 : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm (-0.01% of base)
        -199 : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm (-0.01% of base)
        -191 : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm (-0.81% of base)
        -182 : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm (-0.15% of base)
         -96 : JIT\Directed\Convert\out_of_range_fp_to_int_conversions\out_of_range_fp_to_int_conversions.dasm (-0.64% of base)

7 total files with Code Size differences (7 improved, 0 regressed), 3575 unchanged.

Top method regressions (bytes):
          32 (13.11% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          20 ( 5.56% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          18 ( 5.39% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          12 ( 5.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          12 ( 3.79% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          10 ( 3.44% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          10 ( 2.76% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructFldScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructLclFldScenario():this
           8 ( 2.49% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToDoubleDouble):this
           8 ( 2.20% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassLclFldScenario():this
           3 ( 1.27% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunBasicScenario_UnsafeRead():this
           3 ( 0.95% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunLclVarScenario_UnsafeRead():this
           2 ( 0.63% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
           1 ( 0.34% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunClassFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructLclFldScenario():this

Top method improvements (bytes):
        -122 (-15.04% of base) : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
        -106 (-8.09% of base) : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
         -96 (-0.97% of base) : JIT\Directed\Convert\out_of_range_fp_to_int_conversions\out_of_range_fp_to_int_conversions.dasm - Program:TestBitValue(int,Nullable`1,Nullable`1)
         -46 (-4.08% of base) : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleUInt64(Vector`1):int
         -41 (-11.23% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -41 (-10.79% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -41 (-9.86% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -41 (-10.30% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -41 (-10.38% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -40 (-6.30% of base) : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleUInt64(Vector`1):int
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructFldScenario():this
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -39 (-14.03% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -39 (-12.19% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -38 (-13.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -38 (-8.78% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunLclVarScenario_UnsafeRead():this
         -37 (-11.60% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -33 (-10.15% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunBasicScenario_UnsafeRead():this
         -31 (-9.75% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -31 (-9.66% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this

Top method regressions (percentages):
          32 (13.11% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          20 ( 5.56% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          18 ( 5.39% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          12 ( 5.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunBasicScenario_UnsafeRead():this
          12 ( 3.79% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunLclVarScenario_UnsafeRead():this
          10 ( 3.44% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassFldScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructFldScenario():this
          10 ( 3.24% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunStructLclFldScenario():this
          10 ( 2.76% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
           8 ( 2.49% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToDoubleDouble):this
           8 ( 2.20% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClassLclFldScenario():this
           3 ( 1.27% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunBasicScenario_UnsafeRead():this
           3 ( 0.95% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunLclVarScenario_UnsafeRead():this
           2 ( 0.63% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToDoubleDouble:RunClsVarScenario():this
           1 ( 0.34% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunClassFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructFldScenario():this
           1 ( 0.32% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToSingleSingle:RunStructLclFldScenario():this

Top method improvements (percentages):
        -122 (-15.04% of base) : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm - VectorConvertTest:VectorConvertDoubleInt64(Vector`1):int
         -39 (-14.03% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -38 (-13.06% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -39 (-12.19% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -37 (-11.60% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -41 (-11.23% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -41 (-10.79% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructFldScenario():this
         -40 (-10.67% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -41 (-10.38% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -41 (-10.30% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClsVarScenario():this
         -33 (-10.15% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunBasicScenario_UnsafeRead():this
         -41 (-9.86% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassLclFldScenario():this
         -31 (-9.75% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -23 (-9.70% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunBasicScenario_UnsafeRead():this
         -31 (-9.66% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorUnaryOpTest__ConvertToInt32Int32):this
         -29 (-9.39% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructFldScenario():this
         -29 (-9.39% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunStructLclFldScenario():this
         -31 (-9.31% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this
         -30 (-8.98% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorUnaryOpTest__ConvertToInt32Int32:RunClassFldScenario():this

69 total methods with Code Size differences (52 improved, 17 regressed), 502905 unchanged.

@tannergooding
Member Author

Test failures are because the current RyuJIT algorithm for ulong->double conversion is incorrect.

This can be repro'd using 17178724614669765595 where:

  • RyuJIT: 17178724614669766656
  • C# + double.Parse + this PR: 17178724614669764608
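
One way a discrepancy like this can arise is double rounding. The old RyuJIT sequence is not shown in this thread, so the two-step function below is an assumed illustration (convert as a signed value, then add 2^64 for inputs with the top bit set), not a transcription of the JIT's code. The function names are hypothetical. For this particular input, the two-step version lands exactly halfway between two doubles on the second step and rounds up, while a single correctly rounded conversion rounds down.

```cpp
#include <cstdint>
#include <cstring>

// Assumed legacy-style approach: reinterpret as signed, convert, then fix up
// negative results by adding 2^64. This rounds twice.
double u64_to_double_two_step(std::uint64_t v) {
    std::int64_t s;
    std::memcpy(&s, &v, sizeof s);       // same bits, viewed as signed
    double d = static_cast<double>(s);   // first rounding
    if (s < 0) {
        d += 18446744073709551616.0;     // + 2^64, second rounding
    }
    return d;
}

// Direct conversion: one correctly rounded step on mainstream IEEE 754
// toolchains.
double u64_to_double_direct(std::uint64_t v) {
    return static_cast<double>(v);
}
```

With `v = 17178724614669765595`, the two-step version returns 17178724614669766656 and the direct version returns 17178724614669764608, matching the two values listed above.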

@tannergooding tannergooding force-pushed the xplat-hwintrin branch 3 times, most recently from ecbf81c to 073a445 Compare November 23, 2021 15:56
@tannergooding tannergooding marked this pull request as ready for review December 3, 2021 18:41
@tannergooding tannergooding changed the title [WIP]: Accelerate additional cross platform hardware intrinsics Accelerate additional cross platform hardware intrinsics Dec 6, 2021
@tannergooding
Member Author

Rebased onto dotnet/main; this is still pending review and would be very beneficial to get merged so the PRs adding new SIMD logic can trivially support ARM from the start.

Contributor

@echesakov echesakov left a comment

Have some questions, but, other than that, looks good.

@EgorBo
Member

EgorBo commented Jan 20, 2022

Improvement on win-arm64: dotnet/perf-autofiling-issues#2974

@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Jan 25, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Feb 24, 2022