Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Arm64] Implement MultiplyHigh #47362

Merged
merged 10 commits into from
Jan 28, 2021
12 changes: 12 additions & 0 deletions src/coreclr/jit/hwintrinsic.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1062,6 +1062,18 @@ GenTree* Compiler::impHWIntrinsic(NamedIntrinsic intrinsic,
retNode->AsHWIntrinsic()->SetAuxiliaryType(getBaseTypeOfSIMDType(sigReader.op2ClsHnd));
break;

case NI_ArmBase_Arm64_MultiplyHigh:
if (sig->retType == CORINFO_TYPE_ULONG)
{
retNode->AsHWIntrinsic()->gtSIMDBaseType = TYP_ULONG;
}
else
{
assert(sig->retType == CORINFO_TYPE_LONG);
retNode->AsHWIntrinsic()->gtSIMDBaseType = TYP_LONG;
}
break;

default:
break;
}
Expand Down
1 change: 1 addition & 0 deletions src/coreclr/jit/hwintrinsiclistarm64.h
Original file line number Diff line number Diff line change
Expand Up @@ -530,6 +530,7 @@ HARDWARE_INTRINSIC(ArmBase, ReverseElementBits,
// Base 64-bit only Intrinsics
HARDWARE_INTRINSIC(ArmBase_Arm64, LeadingSignCount, 0, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_cls, INS_invalid, INS_cls, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Scalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoFloatingPointUsed)
HARDWARE_INTRINSIC(ArmBase_Arm64, LeadingZeroCount, 0, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_clz, INS_clz, INS_invalid, INS_invalid}, HW_Category_Scalar, HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoFloatingPointUsed)
HARDWARE_INTRINSIC(ArmBase_Arm64, MultiplyHigh, 0, 2, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_smulh, INS_umulh, INS_invalid, INS_invalid}, HW_Category_Scalar, HW_Flag_NoFloatingPointUsed)
HARDWARE_INTRINSIC(ArmBase_Arm64, ReverseElementBits, 0, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_rbit, INS_rbit, INS_invalid, INS_invalid}, HW_Category_Scalar, HW_Flag_NoFloatingPointUsed)

// ***************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
Expand Down
11 changes: 11 additions & 0 deletions src/libraries/System.Private.CoreLib/src/System/Math.cs
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,11 @@ public static unsafe ulong BigMul(ulong a, ulong b, out ulong low)
low = tmp;
return high;
}
else if (ArmBase.Arm64.IsSupported)
Copy link
Contributor

@imhameed imhameed Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mono will need support for these. Alternatively, you'll need to make Mono return false for IsSupported. Otherwise, any use of BigMul on arm64 with LLVM JIT/AOT will make Mono crash with a stack overflow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @echesakovMSFT better just put 0 to

EMIT_NEW_ICONST (cfg, ins, supported ? 1 : 0);
and some of us will implement these for Mono later and enable it back

For reference, here is the IR we need to emit: https://godbolt.org/z/64rvjP

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if mono's support could be table driven like it is in RyuJIT.

Then it would (ideally) be relatively trivial most of the time to just say Arm64.MultiplyHigh maps to llvm.intrinsic....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding While I agree in general, I don't think there is an intrinsic for this particular thing, as you can see from my godbolt link you will have to emit at least 5 different statements for this operation so a table won't help here 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. ARM64 actually doesn't define a C++ intrinsic for this: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=UMULH, so LLVM won't have a corresponding llvm.intrinsic entry, 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it a try and implemented MultiplyHigh in mono.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EgorBo @imhameed Are these

<ExcludeList Include = "$(XunitTestBinBase)/JIT/HardwareIntrinsics/Arm/ArmBase.Arm64/ArmBase.Arm64_ro/**">

disabled on purpose with mono? If so, what would be an alternative way to verify that my changes are correct?

Copy link
Contributor

@imhameed imhameed Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if they work now. Presumably they failed to pass although nothing was never recorded. They should (and will) be reenabled. That work is being tracked here: #43842

You could try reenabling this specific test and seeing what happens. You could also build Mono as an arm64 AOT cross compiler if you don't have immediate access to any arm64 hardware. Here are some instructions for doing this: https://gist.github.com/imhameed/e4b246dbf0d0b247155fc9e3326194b6

This is not ideal, but this will get better soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your branch on an arm64 machine.

Given:

[MethodImpl(MethodImplOptions.NoInlining)]
public static ulong test_mulhi(ulong x, ulong y) {
    return ArmBase.Arm64.MultiplyHigh(x, y);
}

it yields:

0:»  9bc17c00 »      umulh»  x0, x0, x1
4:»  d65f03c0 »      ret

and given:

[MethodImpl(MethodImplOptions.NoInlining)]
public static long test_mulhi(long x, long y) {
    return ArmBase.Arm64.MultiplyHigh(x, y);
}

it yields:

0:»  9b417c00 »      smulh»  x0, x0, x1
4:»  d65f03c0 »      ret

And the intermediate mini IR and LLVM IR all looks reasonable. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried your branch on an arm64 machine.

@imhameed Thanks a lot for validating the change!

{
low = a * b;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a code pattern that the JIT should handle and generate UMULL and SMULL

This has been previously blocked due to codegen issues, but I think the work Carol did around multi-reg returns unblocked that, in which case these should forward to an intrinsic (long, long) BigMul(long, long)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding I am not sure I understand you comment about handling and generating umull/smull in context of BigMul(long, long)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake, I misread and thought UMULL had a variant that multiplies two 64-bit and returns the 128-bit result in 2x register like IMUL and MUL support on x86/x64.

It looks like it's just 32x32=64 and 64x64=upper 64 (the latter via UMULH). I don't see a variant that also returns the lower 64 with the same computation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually looks like you might be able to achieve it using PMULL and/or PMULL2, but it isn't clear if that's "always a win"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe PMULL{2} can be used at this context - results of the instructions (polynomial multiplication) does't correspond to the result of integer multiplications.

But there is a definitely room for improving 32-bit integer multiplication (see #47490) that would result in more efficient code for BigMul(int, int)

return ArmBase.Arm64.MultiplyHigh(a, b);
}

return SoftwareFallback(a, b, out low);

Expand Down Expand Up @@ -185,6 +190,12 @@ static ulong SoftwareFallback(ulong a, ulong b, out ulong low)
/// <returns>The high 64-bit of the product of the specied numbers.</returns>
public static long BigMul(long a, long b, out long low)
{
if (ArmBase.Arm64.IsSupported)
{
low = a * b;
return ArmBase.Arm64.MultiplyHigh(a, b);
}

ulong high = BigMul((ulong)a, (ulong)b, out ulong ulow);
low = (long)ulow;
return (long)high - ((a >> 63) & b) - ((b >> 63) & a);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,16 @@ internal Arm64() { }
/// </summary>
public static int LeadingZeroCount(ulong value) { throw new PlatformNotSupportedException(); }

/// <summary>
/// A64: SMULH Xd, Xn, Xm
/// </summary>
public static long MultiplyHigh(long left, long right) { throw new PlatformNotSupportedException(); }

/// <summary>
/// A64: UMULH Xd, Xn, Xm
/// </summary>
public static ulong MultiplyHigh(ulong left, ulong right) { throw new PlatformNotSupportedException(); }

/// <summary>
/// A64: RBIT Xd, Xn
/// </summary>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,16 @@ internal Arm64() { }
/// </summary>
public static int LeadingZeroCount(ulong value) => LeadingZeroCount(value);

/// <summary>
/// A64: SMULH Xd, Xn, Xm
/// </summary>
public static long MultiplyHigh(long left, long right) => MultiplyHigh(left, right);

/// <summary>
/// A64: UMULH Xd, Xn, Xm
/// </summary>
public static ulong MultiplyHigh(ulong left, ulong right) => MultiplyHigh(left, right);

/// <summary>
/// A64: RBIT Xd, Xn
/// </summary>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2563,6 +2563,8 @@ internal Arm64() { }
public static int LeadingSignCount(long value) { throw null; }
public static int LeadingZeroCount(long value) { throw null; }
public static int LeadingZeroCount(ulong value) { throw null; }
public static long MultiplyHigh(long left, long right) { throw null; }
public static ulong MultiplyHigh(ulong left, ulong right) { throw null; }
public static long ReverseElementBits(long value) { throw null; }
public static ulong ReverseElementBits(ulong value) { throw null; }
}
Expand Down
16 changes: 16 additions & 0 deletions src/mono/mono/mini/mini-llvm.c
Original file line number Diff line number Diff line change
Expand Up @@ -9086,6 +9086,22 @@ process_bb (EmitContext *ctx, MonoBasicBlock *bb)
values [ins->dreg] = LLVMBuildCall (builder, get_intrins (ctx, ins->opcode == OP_LSCNT32 ? INTRINS_CTLZ_I32 : INTRINS_CTLZ_I64), args, 2, "");
break;
}
case OP_ARM64_SMULH:
case OP_ARM64_UMULH: {
LLVMValueRef op1, op2;
if (ins->opcode == OP_ARM64_SMULH) {
op1 = LLVMBuildSExt (builder, lhs, LLVMInt128Type (), "");
op2 = LLVMBuildSExt (builder, rhs, LLVMInt128Type (), "");
} else {
op1 = LLVMBuildZExt (builder, lhs, LLVMInt128Type (), "");
op2 = LLVMBuildZExt (builder, rhs, LLVMInt128Type (), "");
}
LLVMValueRef mul = LLVMBuildMul (builder, op1, op2, "");
LLVMValueRef hi64 = LLVMBuildLShr (builder, mul,
LLVMConstInt (LLVMInt128Type (), 64, FALSE), "");
values [ins->dreg] = LLVMBuildTrunc (builder, hi64, LLVMInt64Type (), "");
break;
}
#endif

case OP_DUMMY_USE:
Expand Down
2 changes: 2 additions & 0 deletions src/mono/mono/mini/mini-ops.h
Original file line number Diff line number Diff line change
Expand Up @@ -1578,4 +1578,6 @@ MINI_OP(OP_POPCNT64, "popcnt64", LREG, LREG, NONE)
#ifdef TARGET_ARM64
MINI_OP(OP_LSCNT32, "lscnt32", IREG, IREG, NONE)
MINI_OP(OP_LSCNT64, "lscnt64", LREG, LREG, NONE)
MINI_OP(OP_ARM64_SMULH, "arm64_smulh", LREG, LREG, LREG)
MINI_OP(OP_ARM64_UMULH, "arm64_umulh", LREG, LREG, LREG)
#endif // TARGET_ARM64
4 changes: 4 additions & 0 deletions src/mono/mono/mini/simd-intrinsics-netcore.c
Original file line number Diff line number Diff line change
Expand Up @@ -807,6 +807,7 @@ emit_invalid_operation (MonoCompile *cfg, const char* message)
static SimdIntrinsic armbase_methods [] = {
{SN_LeadingSignCount},
{SN_LeadingZeroCount},
{SN_MultiplyHigh},
{SN_ReverseElementBits},
{SN_get_IsSupported}
};
Expand Down Expand Up @@ -847,6 +848,9 @@ emit_arm64_intrinsics (MonoCompile *cfg, MonoMethod *cmethod, MonoMethodSignatur
return emit_simd_ins_for_sig (cfg, klass, arg0_i32 ? OP_LZCNT32 : OP_LZCNT64, 0, arg0_type, fsig, args);
case SN_LeadingSignCount:
return emit_simd_ins_for_sig (cfg, klass, arg0_i32 ? OP_LSCNT32 : OP_LSCNT64, 0, arg0_type, fsig, args);
case SN_MultiplyHigh:
return emit_simd_ins_for_sig (cfg, klass,
(arg0_type == MONO_TYPE_I8 ? OP_ARM64_SMULH : OP_ARM64_UMULH), 0, arg0_type, fsig, args);
case SN_ReverseElementBits:
return emit_simd_ins_for_sig (cfg, klass,
(is_64bit ? OP_XOP_I8_I8 : OP_XOP_I4_I4),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@
<Compile Include="LeadingSignCount.Int64.cs" />
<Compile Include="LeadingZeroCount.Int64.cs" />
<Compile Include="LeadingZeroCount.UInt64.cs" />
<Compile Include="MultiplyHigh.Int64.cs" />
<Compile Include="MultiplyHigh.UInt64.cs" />
<Compile Include="ReverseElementBits.Int64.cs" />
<Compile Include="ReverseElementBits.UInt64.cs" />
<Compile Include="Program.ArmBase.Arm64.cs" />
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@
<Compile Include="LeadingSignCount.Int64.cs" />
<Compile Include="LeadingZeroCount.Int64.cs" />
<Compile Include="LeadingZeroCount.UInt64.cs" />
<Compile Include="MultiplyHigh.Int64.cs" />
<Compile Include="MultiplyHigh.UInt64.cs" />
<Compile Include="ReverseElementBits.Int64.cs" />
<Compile Include="ReverseElementBits.UInt64.cs" />
<Compile Include="Program.ArmBase.Arm64.cs" />
Expand Down
Loading