Vector{128,256}<T>.ToScalar suboptimal codegen \ { double } #12733

gfoidl · 2019-05-22T06:07:43Z

Vector128<long>.ToScalar() stores the xmm register to the stack and then reads the 64-bit value back into a general-purpose register with a mov:

vmovapd  xmmword ptr [rsp], xmm0
mov      rax, qword ptr [rsp]

Ideally this would use movq (C++ intrinsic: _mm_cvtsi128_si64), so the asm becomes:

-   vmovapd  xmmword ptr [rsp], xmm0
-   mov      rax, qword ptr [rsp]
+   movq     rax, xmm0

Vector128<double>.ToScalar() produces expected code (vmovsd) -- no issue there.
The same CQ issue exists for int and for Vector256<T>.
I didn't check types other than those noted here.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

gfoidl · 2019-05-22T06:08:15Z

/cc: @tannergooding please have a look here

fiigii · 2019-05-22T08:33:21Z

As far as I remember, ToScalar on integer types is not treated as an intrinsic at the moment. You can use Sse2.X64.ConvertToInt64 for Vector128<long>.
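A minimal sketch of that workaround, assuming a hypothetical helper named `LowerInt64` (not part of the BCL): it calls `Sse2.X64.ConvertToInt64` where the hardware and process support it, and falls back to `ToScalar()` elsewhere. Whether this yields better code than `ToScalar()` depends on the runtime version, so the disassembly should be checked.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class ScalarExtract
{
    // Hypothetical helper for illustration only.
    // Returns element 0 (the lower 64 bits) of the vector.
    public static long LowerInt64(Vector128<long> value)
    {
        // Sse2.X64 is only supported in 64-bit processes on x86-64 hardware.
        if (Sse2.X64.IsSupported)
        {
            return Sse2.X64.ConvertToInt64(value);
        }

        // Portable fallback for every other platform.
        return value.ToScalar();
    }
}
```

For example, `ScalarExtract.LowerInt64(Vector128.Create(1L, 2L))` returns the lower element, `1`.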

gfoidl · 2019-05-22T09:10:11Z

Thanks.

Sse2.X64.ConvertToInt64 is better, but still not ideal:

vmovupd  xmm0, xmmword ptr [rsp+08H]
vmovd    rax, xmm0

It is documented as __int64 _mm_cvtsi128_si64 (__m128i a) MOVQ reg/m64, xmm, but the JIT didn't emit movq.

mikedn · 2019-05-22T09:12:20Z

Is that output from the JIT's own disassembler? It's probably movq but displayed as movd.

gfoidl · 2019-05-22T09:16:34Z

I looked at the JIT dump and at the VS disassembly view (both in Release with optimizations on, tiering disabled).

SharpLab with CoreCLR shows the same.

fiigii · 2019-05-22T09:18:46Z

Right, vmovd rax, xmm0 is actually vmovq rax, xmm0.
movd is an alias of movq on r64.
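One way to confirm this from the encoded bytes, sketched under the assumption that the instruction carries a three-byte VEX prefix (as in the bytes `C4 E1 F9 7E C0` that the JIT listing in this thread shows for this instruction): bit 7 of the third prefix byte is VEX.W, and W=1 selects the 64-bit form of opcode 7E, which is movq. The helper name `HasVexW` is hypothetical.

```csharp
using System;

public static class VexDecode
{
    // Hypothetical helper for illustration only.
    // Returns true when a three-byte VEX prefix (C4 xx xx) has VEX.W set.
    public static bool HasVexW(byte[] code)
    {
        if (code == null || code.Length < 3 || code[0] != 0xC4)
        {
            throw new ArgumentException("Expected a three-byte VEX prefix (C4 ...).", nameof(code));
        }

        // Layout of the third prefix byte: W vvvv L pp, with W in bit 7.
        return (code[2] & 0x80) != 0;
    }
}
```

Calling `VexDecode.HasVexW(new byte[] { 0xC4, 0xE1, 0xF9, 0x7E, 0xC0 })` returns `true`, so the instruction moves 64 bits even though the disassembler prints vmovd.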

fiigii · 2019-05-22T09:25:46Z

> SharpLab with CoreCLR shows the same.

BTW, in this link, vzeroupper is generated for Vector128 code. That should not be there, I will take a look.

gfoidl · 2019-05-22T09:27:58Z

> movd is an alias of movq on r64.

👍

So codegen could be

vmovd    rax, xmmword ptr [rsp+08H]

or

vmovq    rax, xmmword ptr [rsp+08H]

There is the extra vmovupd (see https://github.com/dotnet/coreclr/issues/24710#issuecomment-494720476)

> I will take a look.

Thanks.

My code to show this:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            Vector128<sbyte> vec = Vector128.Create((byte)0x42).AsSByte();
            long l = ToLong(vec);
            double d = ToDouble(vec);

            if (double.IsNaN(d) || l == long.MaxValue)
                Environment.Exit(1);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static long ToLong(Vector128<sbyte> vec)
        {
            //return vec.AsInt64().ToScalar();
            return Sse2.X64.ConvertToInt64(vec.AsInt64());
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static double ToDouble(Vector128<sbyte> vec)
        {
            return vec.AsDouble().ToScalar();
        }
    }
}

Disassembly for that code:

; Assembly listing for method Program:ToLong(struct):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  1,  1   )  simd16  ->  [rsp+0x08]
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M6413_IG01:
       C5F877               vzeroupper
       6690                 nop

G_M6413_IG02:
       C5F910442408         vmovupd  xmm0, xmmword ptr [rsp+08H]
       C4E1F97EC0           vmovd    rax, xmm0

G_M6413_IG03:
       C3                   ret

; Total bytes of code 17, prolog size 5 for method Program:ToLong(struct):long
; ============================================================
; Assembly listing for method Program:ToDouble(struct):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  1,  1   )  simd16  ->  [rsp+0x08]
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M24050_IG01:
       C5F877               vzeroupper
       6690                 nop

G_M24050_IG02:
       C5FB10442408         vmovsd   xmm0, xmmword ptr [rsp+08H]

G_M24050_IG03:
       C3                   ret

; Total bytes of code 12, prolog size 5 for method Program:ToDouble(struct):double
; ============================================================

gfoidl · 2019-05-22T09:31:08Z

> vzeroupper is generated for Vector128 code. That should not be there,

Isn't this needed for VEX-encoded code, regardless of whether it uses Vector128 or Vector256?

fiigii · 2019-05-22T09:36:43Z

It is a bit complex; please see https://github.com/dotnet/coreclr/issues/21062.
But I am sure that something is wrong with the codegen.

AndyAyersMS · 2019-05-22T18:17:12Z

Marking as future; if there's something surgical we can fix, or there's a bug, we can move to 3.0.

omariom · 2019-05-24T16:57:49Z

It might be related. xmm0 is spilled to the stack.

private static unsafe long AsLong(double dbl)
{
    return *(long*)&dbl;
}

gfoidl · 2020-03-11T12:12:46Z

@omariom for reference: this is tracked by #11413 (thx @EgorBo for the reminder).

hypeartist · 2020-04-20T20:03:02Z

> It might be related. xmm0 is spilled to the stack.

@omariom What about this?

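using System.Runtime.CompilerServices;

unsafe class C
{
    private static long AsLong(in double dbl)
    {
        // Reinterpret the bits of the double through a pointer to the in parameter.
        return *(long*)Unsafe.AsPointer(ref Unsafe.AsRef(dbl));
    }
}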

Asm output:

C.AsLong(Double ByRef)
    L0000: mov rax, [rcx]
    L0003: ret

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNbrmwAzGE1IMhDGgG8aDKwzFReC7BgPMkDSRC4BzBgEFcAGQ9PAApeLgYAEwgOYGlI2IBKS2sLamt0pgB2BgAqYPcvHISAVS4dfRY/AAUIMKcoYNhdBlLywT82GF1giMSEjTTrAF9TaiGgA

gfoidl · 2020-04-20T20:09:47Z

@hypeartist this also uses the stack ([rcx]) and doesn't operate solely on registers.
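For comparison with the pointer-based versions above, here is a sketch of a variant written with the hardware intrinsics, so that the JIT at least has the option of moving the value from xmm to a general-purpose register directly. The class name `BitCast` is hypothetical, and the resulting codegen depends on the runtime version and should be verified in the disassembly.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BitCast
{
    // Hypothetical helper for illustration only.
    // Reinterprets the bits of a double as a long without taking its address.
    public static long AsLong(double value)
    {
        if (Sse2.X64.IsSupported)
        {
            // Place the double in element 0 of a vector; the upper element is left undefined.
            Vector128<long> vec = Vector128.CreateScalarUnsafe(value).AsInt64();

            // Read the lower 64 bits of the vector as an integer.
            return Sse2.X64.ConvertToInt64(vec);
        }

        // Portable fallback for platforms without SSE2 in a 64-bit process.
        return BitConverter.DoubleToInt64Bits(value);
    }
}
```

`BitCast.AsLong(1.0)` returns `0x3FF0000000000000`, the IEEE 754 bit pattern of 1.0.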

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

This was referenced Mar 4, 2020

Performance issues for Math.Min, Math.Max for double/float #33057

Closed

Use HW-intrinsics in BitConverter for double <-> long / float <-> int #33476

Merged

BruceForstall added the JitUntriaged label (CLR JIT issues needing additional triage) Oct 28, 2020

BruceForstall removed the JitUntriaged label (CLR JIT issues needing additional triage) Oct 30, 2022
