Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Arm64] Implement MultiplyHigh #47362

Merged
merged 10 commits into from
Jan 28, 2021
Merged

Conversation

echesakov
Copy link
Contributor

@echesakov echesakov commented Jan 23, 2021

Closes #43106

In addition to implementing the intrinsics I have updated System.Math:BigMul(long,long,byref):long implementation in System.Private.CoreLib. The following is the codegen of the methods:

; Assembly listing for method System.Math:BigMul(long,long,byref):long
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )    long  ->   x0        
;  V01 arg1         [V01,T01] (  4,  4   )    long  ->   x1        
;  V02 arg2         [V02,T02] (  3,  3   )   byref  ->   x2        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M18264_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M18264_IG02:              ;; offset=0008H
        9B017C03          mul     x3, x0, x1
        F9000043          str     x3, [x2]
        9BC17C00          umulh   x0, x0, x1
						;; bbWeight=1    PerfScore 8.00
G_M18264_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 14.30, instruction count 7, allocated bytes for code 28 (MethodHash=96edb8a7) for method System.Math:BigMul(long,long,byref):long
; ============================================================

; Assembly listing for method System.Math:BigMul(long,long,byref):long
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )    long  ->   x0        
;  V01 arg1         [V01,T01] (  4,  4   )    long  ->   x1        
;  V02 arg2         [V02,T02] (  3,  3   )   byref  ->   x2        
;* V03 loc0         [V03    ] (  0,  0   )    long  ->  zero-ref   
;* V04 loc1         [V04    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M18264_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M18264_IG02:              ;; offset=0008H
        9B017C03          mul     x3, x0, x1
        F9000043          str     x3, [x2]
        9B417C00          smulh   x0, x0, x1
						;; bbWeight=1    PerfScore 8.00
G_M18264_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 14.30, instruction count 7, allocated bytes for code 28 (MethodHash=96edb8a7) for method System.Math:BigMul(long,long,byref):long
; ============================================================

@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented Jan 23, 2021

Tagging subscribers to this area: @tannergooding
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #43106

Author: echesakovMSFT
Assignees: -
Labels:

area-System.Runtime.Intrinsics, new-api-needs-documentation

Milestone: -

@echesakov echesakov marked this pull request as ready for review January 26, 2021 02:43
@echesakov
Copy link
Contributor Author

@dotnet/jit-contrib PTAL
cc @tannergooding @TamarChristinaArm

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TamarChristinaArm
Copy link
Contributor

TamarChristinaArm commented Jan 26, 2021

@echesakovMSFT out of curiosity what does the JIT generate for

public static long BigMul(int a, int b)
{
    return ((long)a) * b;
}

that indicates the multiplication has to be done as long, just curious if it generates <U|S>MULL here or two extensions and then a normal MUL.

@@ -151,6 +151,11 @@ public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
low = tmp;
return high;
}
else if (ArmBase.Arm64.IsSupported)
{
low = a * b;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a code pattern that the JIT should handle and generate UMULL and SMULL

This has been previously blocked due to codegen issues, but I think the work Carol did around multi-reg returns unblocked that, in which case these should forward to an intrinsic (long, long) BigMul(long, long)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding I am not sure I understand you comment about handling and generating umull/smull in context of BigMul(long, long)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, I misread and thought UMULL had a variant that multiplies two 64-bit and returns the 128-bit result in 2x register like IMUL and MUL support on x86/x64.

It looks like it's just 32x32=64 and 64x64=upper 64 (the latter via UMULH). I don't see a variant that also returns the lower 64 with the same computation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually looks like you might be able to achieve it using PMULL and/or PMULL2, but it isn't clear if that's "always a win"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe PMULL{2} can be used at this context - results of the instructions (polynomial multiplication) does't correspond to the result of integer multiplications.

But there is a definitely room for improving 32-bit integer multiplication (see #47490) that would result in more efficient code for BigMul(int, int)

@@ -151,6 +151,11 @@ public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
low = tmp;
return high;
}
else if (ArmBase.Arm64.IsSupported)
Copy link
Contributor

@imhameed imhameed Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mono will need support for these. Alternatively, you'll need to make Mono return false for IsSupported. Otherwise, any use of BigMul on arm64 with LLVM JIT/AOT will make Mono crash with a stack overflow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @echesakovMSFT better just put 0 to

EMIT_NEW_ICONST (cfg, ins, supported ? 1 : 0);
and some of us will implement these for Mono later and enable it back

For reference, here is the IR we need to emit: https://godbolt.org/z/64rvjP

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if mono's support could be table driven like it is in RyuJIT.

Then it would (ideally) be relatively trivial most of the time to just say Arm64.MultiplyHigh maps to llvm.intrinsic....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding While I agree in general, I don't think there is an intrinsic for this particular thing, as you can see from my godbolt link you will have to emit at least 5 different statements for this operation so a table won't help here 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. ARM64 actually doesn't define a C++ intrinsic for this: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=UMULH, so LLVM won't have a corresponding llvm.intrinsic entry, 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it a try and implemented MultiplyHigh in mono.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EgorBo @imhameed Are these

<ExcludeList Include = "$(XunitTestBinBase)/JIT/HardwareIntrinsics/Arm/ArmBase.Arm64/ArmBase.Arm64_ro/**">

disabled on purpose with mono? If so, what would be an alternative way to verify that my changes are correct?

Copy link
Contributor

@imhameed imhameed Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if they work now. Presumably they failed to pass although nothing was never recorded. They should (and will) be reenabled. That work is being tracked here: #43842

You could try reenabling this specific test and seeing what happens. You could also build Mono as an arm64 AOT cross compiler if you don't have immediate access to any arm64 hardware. Here are some instructions for doing this: https://gist.github.com/imhameed/e4b246dbf0d0b247155fc9e3326194b6

This is not ideal, but this will get better soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your branch on an arm64 machine.

Given:

[MethodImpl(MethodImplOptions.NoInlining)]
public static ulong test_mulhi(ulong x, ulong y) {
    return ArmBase.Arm64.MultiplyHigh(x, y);
}

it yields:

0:»  9bc17c00 »      umulh»  x0, x0, x1
4:»  d65f03c0 »      ret

and given:

[MethodImpl(MethodImplOptions.NoInlining)]
public static long test_mulhi(long x, long y) {
    return ArmBase.Arm64.MultiplyHigh(x, y);
}

it yields:

0:»  9b417c00 »      smulh»  x0, x0, x1
4:»  d65f03c0 »      ret

And the intermediate mini IR and LLVM IR all looks reasonable. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your branch on an arm64 machine.

@imhameed Thanks a lot for validating the change!

@echesakov
Copy link
Contributor Author

echesakov commented Jan 26, 2021

@TamarChristinaArm For BigMul(int,int) the JIT generates two sign extensions and mul.

; Assembly listing for method System.Math:BigMul(int,int):long
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->   x0        
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->   x1        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M36318_IG01:              ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M36318_IG02:              ;; offset=0008H
        93407C00          sxtw    x0, w0
        93407C21          sxtw    x1, w1
        9B017C00          mul     x0, x0, x1
						;; bbWeight=1    PerfScore 3.00
G_M36318_IG03:              ;; offset=0014H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

Opened #47490 to track this

@echesakov echesakov merged commit 39d6396 into dotnet:master Jan 28, 2021
@echesakov echesakov deleted the Arm64-MultiplyHigh branch January 28, 2021 00:23
@echesakov echesakov mentioned this pull request Jan 28, 2021
29 tasks
@ghost ghost locked as resolved and limited conversation to collaborators Feb 27, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Arm64] MultiplyHigh
6 participants