Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector{128,256}<T>.ToScalar suboptimal codegen \ { double } #12733

Open
gfoidl opened this issue May 22, 2019 · 15 comments
Open

Vector{128,256}<T>.ToScalar suboptimal codegen \ { double } #12733

gfoidl opened this issue May 22, 2019 · 15 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@gfoidl
Copy link
Member

gfoidl commented May 22, 2019

Vector128<long>.ToScalar() stores the xmm to the stack, then reads r64 from there via a mov.

vmovapd  xmmword ptr [rsp], xmm0
mov      rax, qword ptr [rsp]

Ideally this would use movq (c++ intrinsic: _mm_cvtsi128_si64), so asm becomes:

-   vmovapd  xmmword ptr [rsp], xmm0
-   mov      rax, qword ptr [rsp]
+   movq     rax, xmm0

Vector128<double>.ToScalar() produces expected code (vmovsd) -- no issue there.
Same CQ issue for int, and for Vector256<T>.
Didn't check other types, than noted here.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

@gfoidl
Copy link
Member Author

gfoidl commented May 22, 2019

/cc: @tannergooding please have a look here

@fiigii
Copy link
Contributor

fiigii commented May 22, 2019

I remember that ToScalar on integer types are not intrinsic now. You can use Sse2.X64.ConvertToInt64 for Vector128<long>.

@gfoidl
Copy link
Member Author

gfoidl commented May 22, 2019

Thanks.

Sse2.X64.ConvertToInt64

Is better, but still not ideal:

vmovupd  xmm0, xmmword ptr [rsp+08H]
vmovd    rax, xmm0

It is documented as __int64 _mm_cvtsi128_si64 (__m128i a) MOVQ reg/m64, xmm, but JIT didn't emit the movq.

@mikedn
Copy link
Contributor

mikedn commented May 22, 2019

Is that output from the JIT's own disassembler? It's probably movq but displayed as movd.

@gfoidl
Copy link
Member Author

gfoidl commented May 22, 2019

Looked at the JIT-dump and in VS-dissambly view (both in release with optimization on, tiering disabled).

SharpLab with CoreCLR shows the same.

@fiigii
Copy link
Contributor

fiigii commented May 22, 2019

Right, vmovd rax, xmm0 is actually vmovq rax, xmm0.
movd is an alias of movq on r64.

@fiigii
Copy link
Contributor

fiigii commented May 22, 2019

SharpLab with CoreCLR shows the same.

BTW, in this link, vzeroupper is generated for Vector128 code. That should not be there, I will take a look.

@gfoidl
Copy link
Member Author

gfoidl commented May 22, 2019

movd is an alias of movq on r64.

👍

So codegen could be

vmovd    rax, xmmword ptr [rsp+08H]

or

vmovq    rax, xmmword ptr [rsp+08H]

There is the extra vmovupd (see https://github.com/dotnet/coreclr/issues/24710#issuecomment-494720476)

I will take a look.

Thanks.

My code to show this
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            Vector128<sbyte> vec = Vector128.Create((byte)0x42).AsSByte();
            long l = ToLong(vec);
            double d = ToDouble(vec);

            if (double.IsNaN(d) || l == long.MaxValue)
                Environment.Exit(1);
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static long ToLong(Vector128<sbyte> vec)
        {
            //return vec.AsInt64().ToScalar();
            return Sse2.X64.ConvertToInt64(vec.AsInt64());
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static double ToDouble(Vector128<sbyte> vec)
        {
            return vec.AsDouble().ToScalar();
        }
    }
}
dasm for that code
; Assembly listing for method Program:ToLong(struct):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  1,  1   )  simd16  ->  [rsp+0x08]
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M6413_IG01:
       C5F877               vzeroupper
       6690                 nop

G_M6413_IG02:
       C5F910442408         vmovupd  xmm0, xmmword ptr [rsp+08H]
       C4E1F97EC0           vmovd    rax, xmm0

G_M6413_IG03:
       C3                   ret

; Total bytes of code 17, prolog size 5 for method Program:ToLong(struct):long
; ============================================================
; Assembly listing for method Program:ToDouble(struct):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  1,  1   )  simd16  ->  [rsp+0x08]
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M24050_IG01:
       C5F877               vzeroupper
       6690                 nop

G_M24050_IG02:
       C5FB10442408         vmovsd   xmm0, xmmword ptr [rsp+08H]

G_M24050_IG03:
       C3                   ret

; Total bytes of code 12, prolog size 5 for method Program:ToDouble(struct):double
; ============================================================

@gfoidl
Copy link
Member Author

gfoidl commented May 22, 2019

vzeroupper is generated for Vector128 code. That should not be there,

Isn't this needed for VEX? No matter if Vector128 or Vector256.

@fiigii
Copy link
Contributor

fiigii commented May 22, 2019

It is a bit complex, please see https://github.com/dotnet/coreclr/issues/21062.
But I am sure that codegen has something wrong.

@AndyAyersMS
Copy link
Member

Marking as future; if there's something surgical we can fix, or there's a bug, we can move to 3.0.

@omariom
Copy link
Contributor

omariom commented May 24, 2019

It might be related. xmm0 is spilled to the stack.

private static long AsLong(double dbl)
{
    return *(long*)&dbl;
} 

@gfoidl
Copy link
Member Author

gfoidl commented Mar 11, 2020

@omariom for reference: this is tracked by #11413 (thx @EgorBo for the remainder).

@hypeartist
Copy link

It might be related. xmm0 is spilled to the stack.

@omariom What about this?

unsafe class C 
{
    private static long AsLong(in double dbl)
    {
        return *(long*)Unsafe.AsPointer(ref Unsafe.AsRef(dbl));
    } 
}

Asm output:

C.AsLong(Double ByRef)
    L0000: mov rax, [rcx]
    L0003: ret

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgCUBXAOwwEt8YLAMIR8AB14AbGFADKMgG68wMXAG4aNbrmwAzGE1IMhDGgG8aDKwzFReC7BgPMkDSRC4BzBgEFcAGQ9PAApeLgYAEwgOYGlI2IBKS2sLamt0pgB2BgAqYPcvHISAVS4dfRY/AAUIMKcoYNhdBlLywT82GF1giMSEjTTrAF9TaiGgA

@gfoidl
Copy link
Member Author

gfoidl commented Apr 20, 2020

@hypeartist this uses also the stack [rcx] and doesn't operate with registers solely.

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Oct 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

8 participants