JIT: Added SVE APIs - Test*, ExtractVector #103739

Merged 11 commits on Jun 25, 2024

Conversation

TIHan
Contributor

@TIHan TIHan commented Jun 20, 2024

Contributes to #99957

Adds:

  • ExtractVector
  • TestAnyTrue
  • TestFirstTrue
  • TestLastTrue


Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Jun 20, 2024
…t coverage for TestAnyTrue, TestFirstTrue, TestLastTrue.
@TIHan TIHan changed the title JIT: Added SVE APIs - Test*, Extract* JIT: Added SVE APIs - Test*, ExtractVector Jun 21, 2024
@TIHan TIHan marked this pull request as ready for review June 21, 2024 00:54
@TIHan
Contributor Author

TIHan commented Jun 21, 2024

@dotnet/arm64-contrib @kunalspathak this is ready.

During stress testing, the test failures come from the handling of callee-saved predicate registers.
However, there was also this assertion:

Assert failure(PID 19096 [0x00004a98], Thread: 17844 [0x45b4]): Assertion failed 'unreached' in 'Program:<<Main>$>g__TestExecutor3559|0_3560(System.IO.StreamWriter,System.IO.StreamWriter,byref)' during 'Do value numbering' (IL size 264; hash 0x824a73e0; FullOpts)
 
    File: C:\work\runtime\src\coreclr\jit\valuenum.cpp:2119

This was in the main test wrapper, which had inlined all the methods. It occurs when TieredCompilation=0 and JitStress=2.

Basically, value numbering does not handle TYP_MASK on ARM64.
I have an experimental PR #103790 that tries to enable it. However, corerun.exe will hard-crash when running value numbering; definitely missing something.

@kunalspathak
Member

However, corerun.exe will hard-crash when running value numbering; definitely missing something.

I don't see that the changes from #99743 in emit.cpp, emit.h, gentree.cpp, and instr.cpp were ported to arm64.

@@ -1266,6 +1266,27 @@ GenTree* Lowering::LowerHWIntrinsic(GenTreeHWIntrinsic* node)
return LowerHWIntrinsicCmpOp(node, GT_NE);
}

case NI_Sve_TestAnyTrue:

Contributor

What is happening here? Is it ensuring the bool return value is set?

Contributor Author

TestAnyTrue maps to the ptest instruction, which has no destination register; it only sets the condition flags.

This lowering transformation consumes those condition flags and produces the 'bool' value we expect. Changing TestAnyTrue's gtType to TYP_VOID ensures we won't allocate a destination register for that particular node.

Member

Changing TestAnyTrue's gtType to TYP_VOID ensures we won't allocate a destination register for that particular node.

I think I get this part. What I am trying to understand is how we make sure that the underlying operation is doing what it is supposed to do:

  • TestAnyTrue: Return true if at least one element is active and if at least one active element of op is true.
  • TestFirstTrue: Return true if at least one element is active and if the first active element of op is true.
  • TestLastTrue: Return true if at least one element is active and if the last active element of op is true.

Can you share the disassembly of each of those?
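The three definitions in the list above can be read as the following reference model. This is only a sketch, not the JIT or the test-helper code: it assumes each predicate is modeled as a `std::vector<bool>` with one entry per element, that `mask` and `op` have the same length, and the function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One entry per predicate element (true = element is set). Modeling assumption.
using Pred = std::vector<bool>;

// True if at least one element is active in mask and also set in op.
bool testAnyTrue(const Pred& mask, const Pred& op)
{
    for (std::size_t e = 0; e < mask.size(); ++e)
        if (mask[e] && op[e])
            return true;
    return false;
}

// True if the first active element of mask is set in op.
bool testFirstTrue(const Pred& mask, const Pred& op)
{
    for (std::size_t e = 0; e < mask.size(); ++e)
        if (mask[e])
            return op[e];
    return false; // no active element at all
}

// True if the last active element of mask is set in op.
bool testLastTrue(const Pred& mask, const Pred& op)
{
    for (std::size_t e = mask.size(); e-- > 0;)
        if (mask[e])
            return op[e];
    return false; // no active element at all
}
```

A model like this makes it easy to compare expected results against the disassembly for specific mask patterns, for example a mask whose only active element is the last one.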

Member

@kunalspathak kunalspathak left a comment

I think we need a better understanding of how the API operates. I added comments about that.

("SveTestTest.template", new Dictionary<string, string> { ["TestName"] = "SveTestFirstTrue_short_custom1", ["Isa"] = "Sve", ["LoadIsa"] = "Sve", ["Method"] = "TestFirstTrue", ["MaskBaseType"] = "Int16", ["Op1Value"] = "Helpers.CreateAndFillMaskFromLastElement<Int16>(1)", ["Op2Value"] = "Helpers.CreateAndFillMaskFromSecondToLastElement<Int16>(1)", ["ValidateEntry"] = "result != true"}),
("SveTestTest.template", new Dictionary<string, string> { ["TestName"] = "SveTestLastTrue_short_custom1", ["Isa"] = "Sve", ["LoadIsa"] = "Sve", ["Method"] = "TestLastTrue", ["MaskBaseType"] = "Int16", ["Op1Value"] = "Helpers.CreateAndFillMaskFromFirstElement<Int16>(1)", ["Op2Value"] = "Helpers.CreateAndFillMaskFromSecondElement<Int16>(1)", ["ValidateEntry"] = "result != true"}),

("SveExtractVectorTest.template", new Dictionary<string, string> { ["TestName"] = "SveExtractVector_Byte_1", ["Isa"] = "Sve", ["LoadIsa"] = "Sve", ["Method"] = "ExtractVector", ["RetVectorType"] = "Vector", ["RetBaseType"] = "Byte", ["Op1VectorType"] = "Vector", ["Op1BaseType"] = "Byte", ["Op2VectorType"] = "Vector", ["Op2BaseType"] = "Byte", ["LargestVectorSize"] = "64", ["NextValueOp1"] = "TestLibrary.Generator.GetByte()", ["NextValueOp2"] = "TestLibrary.Generator.GetByte()", ["ElementIndex"] = "1", ["ValidateIterResult"] = "Helpers.ExtractVector(firstOp, secondOp, ElementIndex, i) != result[i]"}),

Member

You should reuse the SveVecImmTernOpFirstArgTest template that we have for dotproduct after you fix the Op2BaseType.

Contributor Author

This template was already based on ExtractVectorTest, which covers a very similar API.

{
assert(targetReg != op2Reg);

GetEmitter()->emitIns_Mov(INS_mov, emitTypeSize(node), targetReg, op1Reg, /* canSkip */ true);

Member

This has to be MOVPRFX because we are generating the destructive form. Please check https://docsmirror.github.io/A64/2023-06/ext_z_zi.html

Contributor Author

@TIHan TIHan Jun 22, 2024

Interesting, I didn't know about this; I'll fix it.
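To illustrate why the destructive form needs the first operand copied into the target first, and why the `assert(targetReg != op2Reg)` matters, here is a small simulation. It is a sketch under stated assumptions: a 128-bit vector length, a hypothetical four-entry register file, an immediate smaller than the vector size, and a plain copy standing in for the prefix move the reviewer asks for. None of these names come from the JIT.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVecBytes = 16;            // assumed 128-bit vector length
using Vec     = std::array<std::uint8_t, kVecBytes>;
using RegFile = std::array<Vec, 4>;              // hypothetical tiny register file

// Models the destructive form "ext zdn.b, zdn.b, zm.b, #imm":
// zdn is both the first source and the destination.
void extDestructive(RegFile& regs, int zdn, int zm, std::size_t imm)
{
    assert(imm < kVecBytes); // simplification of this model
    const Vec first  = regs[zdn];
    const Vec second = regs[zm];
    for (std::size_t i = 0; i < kVecBytes; ++i)
    {
        const std::size_t pos = imm + i;
        regs[zdn][i] = (pos < kVecBytes) ? first[pos] : second[pos - kVecBytes];
    }
}

// Computes target = Extract(op1, op2, imm) using only the destructive form.
void emitExtract(RegFile& regs, int target, int op1, int op2, std::size_t imm)
{
    // If target aliased op2, the copy below would overwrite op2 before it is read.
    assert(target != op2);
    if (target != op1)
    {
        regs[target] = regs[op1]; // stands in for the prefix move into the target
    }
    extDestructive(regs, target, op2, imm);
}
```

With `target`, `op1`, and `op2` all distinct, the source registers keep their values and only `target` changes.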

@TIHan
Contributor Author

TIHan commented Jun 22, 2024

@kunalspathak

Regarding the Test* APIs, I based the testing on what I read from the arm-docs:
https://developer.arm.com/documentation/ddi0602/2024-03/SVE-Instructions/PTEST--Set-condition-flags-for-predicate-?lang=en

PredTest:

Library pseudocode for aarch64/functions/sve/PredTest

// PredTest()
// ==========

bits(4) PredTest(bits(N) mask, bits(N) result, integer esize)
    bit n = FirstActive(mask, result, esize);
    bit z = NoneActive(mask, result, esize);
    bit c = NOT LastActive(mask, result, esize);
    bit v = '0';
    return n:z:c:v;

FirstActive:

// FirstActive()
// =============

bit FirstActive(bits(N) mask, bits(N) x, integer esize)
    integer elements = N DIV (esize DIV 8);
    for e = 0 to elements-1
        if ActivePredicateElement(mask, e, esize) then
            return PredicateElement(x, e, esize);
    return '0';

NoneActive:

// NoneActive()
// ============

bit NoneActive(bits(N) mask, bits(N) x, integer esize)
    integer elements = N DIV (esize DIV 8);
    for e = 0 to elements-1
        if ActivePredicateElement(mask, e, esize) && ActivePredicateElement(x, e, esize) then
            return '0';
    return '1';

LastActive:

// LastActiveElement()
// ===================

integer LastActiveElement(bits(N) mask, integer esize)
    integer elements = N DIV (esize DIV 8);
    for e = elements-1 downto 0
        if ActivePredicateElement(mask, e, esize) then return e;
    return -1;
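The pseudocode above can be transcribed into a small model that computes the four flags and then reads each Test* result back from them. This is a sketch, not the JIT lowering: predicates are modeled as `std::vector<bool>` with one entry per element, the names are hypothetical, and since the body of `LastActive` is not quoted above, `lastActive` is assumed to mirror `FirstActive` while scanning from the last element down.

```cpp
#include <cstddef>
#include <vector>

using Pred = std::vector<bool>; // one entry per predicate element (assumption)

struct Flags
{
    bool n, z, c, v;
};

// FirstActive: value of x at the first element that is active in mask.
bool firstActive(const Pred& mask, const Pred& x)
{
    for (std::size_t e = 0; e < mask.size(); ++e)
        if (mask[e])
            return x[e];
    return false;
}

// NoneActive: true when no element is set in both mask and x.
bool noneActive(const Pred& mask, const Pred& x)
{
    for (std::size_t e = 0; e < mask.size(); ++e)
        if (mask[e] && x[e])
            return false;
    return true;
}

// LastActive (assumed): value of x at the last element that is active in mask.
bool lastActive(const Pred& mask, const Pred& x)
{
    for (std::size_t e = mask.size(); e-- > 0;)
        if (mask[e])
            return x[e];
    return false;
}

// PredTest: n = FirstActive, z = NoneActive, c = NOT LastActive, v = 0.
Flags predTest(const Pred& mask, const Pred& result)
{
    return Flags{firstActive(mask, result), noneActive(mask, result), !lastActive(mask, result), false};
}

// How each API result follows from the flags in this model.
bool anyTrueFromFlags(const Flags& f)   { return !f.z; }
bool firstTrueFromFlags(const Flags& f) { return f.n; }
bool lastTrueFromFlags(const Flags& f)  { return !f.c; }
```

Note that when no element of the mask is active, the model yields `z = 1` and `c = 1`, so all three results are false, which matches the "at least one element is active" wording of the API descriptions.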

@kunalspathak
Member

Also, we should add unreached() where we set fmt = SVE_BQ_2A, to make sure we do not produce the "constructive" form until #103850 is fixed.

@TIHan
Contributor Author

TIHan commented Jun 24, 2024

@kunalspathak
ExtractVector using a constant index

; Assembly listing for method JIT.HardwareIntrinsics.Arm._Sve.ExtractVectorTest__SveExtractVector_Int64_1:Wrapper[long]():System.Numerics.Vector`1[long]:this (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Windows
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T04] (  3,  3   )     ref  ->   x0         this class-hnd single-def <JIT.HardwareIntrinsics.Arm._Sve.ExtractVectorTest__SveExtractVector_Int64_1>
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [sp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V02 tmp1         [V02,T08] (  2,  4   )  simd16  ->   d8         "impAppendStmt"
;  V03 tmp2         [V03,T00] (  4,  8   )   byref  ->  x20         single-def "Inlining Arg"
;* V04 tmp3         [V04    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inline stloc first use temp"
;  V05 tmp4         [V05,T05] (  2,  4   )    long  ->   x0         "Inlining Arg"
;  V06 tmp5         [V06,T02] (  3,  6   )    long  ->   x1         "Inlining Arg"
;  V07 tmp6         [V07,T01] (  3,  6   )   byref  ->  x19         single-def "Inlining Arg"
;* V08 tmp7         [V08    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inline stloc first use temp"
;  V09 tmp8         [V09,T06] (  2,  4   )    long  ->   x0         "Inlining Arg"
;  V10 tmp9         [V10,T03] (  3,  6   )    long  ->   x1         "Inlining Arg"
;  V11 cse0         [V11,T07] (  3,  3   )   byref  ->  x19         "CSE #01: aggressive"
;
; Lcl frame size = 0

G_M10163_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x30]!
            stp     d8, d9, [sp, #0x10]
            stp     x19, x20, [sp, #0x20]
            mov     fp, sp
                                                ;; size=16 bbWeight=1 PerfScore 3.50
G_M10163_IG02:  ;; offset=0x0010
            add     x19, x0, #48
            mov     x20, x19
            ldrsb   wzr, [x20]
            add     x0, x20, #32
            movz    x1, #0x72D0      // code for System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
            movk    x1, #584 LSL #16
            movk    x1, #0x7FFD LSL #32
            ldr     x1, [x1]
            blr     x1
            ldr     x1, [x20, #0x18]
            add     x0, x0, x1
            sub     x0, x0, #1
            sub     x1, x1, #1
            bic     x0, x0, x1
            ptrue   p0.d
            ld1d    { z8.d }, p0/z, [x0]
            add     x0, x19, #40
            movz    x1, #0x72D0      // code for System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
            movk    x1, #584 LSL #16
            movk    x1, #0x7FFD LSL #32
            ldr     x1, [x1]
            mov     v9.d[0], v8.d[1]
            blr     x1
            ldr     x1, [x19, #0x18]
            add     x0, x0, x1
            sub     x0, x0, #1
            sub     x1, x1, #1
            bic     x0, x0, x1
            ptrue   p0.d
            ld1d    { z0.d }, p0/z, [x0]
            mov     v8.d[1], v9.d[0]
            ext     z8.b, z8.b, z0.b, #8
            mov     v0.16b, v8.16b
                                                ;; size=132 bbWeight=1 PerfScore 50.50
G_M10163_IG03:  ;; offset=0x0094
            ldp     x19, x20, [sp, #0x20]
            ldp     d8, d9, [sp, #0x10]
            ldp     fp, lr, [sp], #0x30
            ret     lr
                                                ;; size=16 bbWeight=1 PerfScore 4.00

; Total bytes of code 164, prolog size 16, PerfScore 58.00, instruction count 41, allocated bytes for code 164 (MethodHash=f4b8d84c) for method JIT.HardwareIntrinsics.Arm._Sve.ExtractVectorTest__SveExtractVector_Int64_1:Wrapper[long]():System.Numerics.Vector`1[long]:this (FullOpts)
; ============================================================

ExtractVector using a non-constant index

; Assembly listing for method JIT.HardwareIntrinsics.Arm._Sve.ExtractVectorTest__SveExtractVector_Int64_1:WrapperWithIndex[long](ubyte):System.Numerics.Vector`1[long]:this (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Windows
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T04] (  3,  3   )     ref  ->   x0         this class-hnd single-def <JIT.HardwareIntrinsics.Arm._Sve.ExtractVectorTest__SveExtractVector_Int64_1>
;  V01 arg1         [V01,T05] (  3,  3   )   ubyte  ->  x19         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [sp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V03 tmp1         [V03,T09] (  2,  4   )  simd16  ->   d8         "impAppendStmt"
;  V04 tmp2         [V04,T00] (  4,  8   )   byref  ->  x21         single-def "Inlining Arg"
;* V05 tmp3         [V05    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inline stloc first use temp"
;  V06 tmp4         [V06,T06] (  2,  4   )    long  ->   x0         "Inlining Arg"
;  V07 tmp5         [V07,T02] (  3,  6   )    long  ->   x1         "Inlining Arg"
;  V08 tmp6         [V08,T01] (  3,  6   )   byref  ->  x20         single-def "Inlining Arg"
;* V09 tmp7         [V09    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inline stloc first use temp"
;  V10 tmp8         [V10,T07] (  2,  4   )    long  ->   x0         "Inlining Arg"
;  V11 tmp9         [V11,T03] (  3,  6   )    long  ->   x1         "Inlining Arg"
;  V12 cse0         [V12,T08] (  3,  3   )   byref  ->  x20         "CSE #01: aggressive"
;
; Lcl frame size = 8

G_M49680_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x40]!
            stp     d8, d9, [sp, #0x18]
            stp     x19, x20, [sp, #0x28]
            str     x21, [sp, #0x38]
            mov     fp, sp
            mov     w19, w1
                                                ;; size=24 bbWeight=1 PerfScore 5.00
G_M49680_IG02:  ;; offset=0x0018
            add     x20, x0, #48
            mov     x21, x20
            ldrsb   wzr, [x21]
            add     x0, x21, #32
            movz    x1, #0x72D0      // code for System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
            movk    x1, #582 LSL #16
            movk    x1, #0x7FFD LSL #32
            ldr     x1, [x1]
            blr     x1
            ldr     x1, [x21, #0x18]
            add     x0, x0, x1
            sub     x0, x0, #1
            sub     x1, x1, #1
            bic     x0, x0, x1
            ptrue   p0.d
            ld1d    { z8.d }, p0/z, [x0]
            add     x0, x20, #40
            movz    x1, #0x72D0      // code for System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
            movk    x1, #582 LSL #16
            movk    x1, #0x7FFD LSL #32
            ldr     x1, [x1]
            mov     v9.d[0], v8.d[1]
            blr     x1
            ldr     x1, [x20, #0x18]
            add     x0, x0, x1
            sub     x0, x0, #1
            sub     x1, x1, #1
            bic     x0, x0, x1
            ptrue   p0.d
            ld1d    { z1.d }, p0/z, [x0]
            uxtb    w0, w19
            mov     v8.d[1], v9.d[0]
            mov     v0.16b, v8.16b
            movz    x1, #0xEB08      // code for System.Runtime.Intrinsics.Arm.Sve:ExtractVector(System.Numerics.Vector`1[long],System.Numerics.Vector`1[long],ubyte):System.Numerics.Vector`1[long]
            movk    x1, #0x479 LSL #16
            movk    x1, #0x7FFD LSL #32
            ldr     x1, [x1]
            blr     x1
                                                ;; size=152 bbWeight=1 PerfScore 54.50
G_M49680_IG03:  ;; offset=0x00B0
            ldr     x21, [sp, #0x38]
            ldp     x19, x20, [sp, #0x28]
            ldp     d8, d9, [sp, #0x18]
            ldp     fp, lr, [sp], #0x40
            ret     lr
                                                ;; size=20 bbWeight=1 PerfScore 6.00

; Total bytes of code 196, prolog size 20, PerfScore 65.50, instruction count 49, allocated bytes for code 196 (MethodHash=b8cb3def) for method JIT.HardwareIntrinsics.Arm._Sve.ExtractVectorTest__SveExtractVector_Int64_1:WrapperWithIndex[long](ubyte):System.Numerics.Vector`1[long]:this (FullOpts)
; ============================================================

Without a constant index, rationalization turns the intrinsic into a call, whose implementation is this:

; Assembly listing for method System.Runtime.Intrinsics.Arm.Sve:ExtractVector(System.Numerics.Vector`1[long],System.Numerics.Vector`1[long],ubyte):System.Numerics.Vector`1[long] (FullOpts)
; Emitting BLENDED_CODE for generic ARM64 - Windows
; FullOpts code
; optimized code
; fp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T02] (  3,  3   )  simd16  ->   d0         HFA(simd16)  single-def <System.Numerics.Vector`1[long]>
;  V01 arg1         [V01,T03] (  3,  3   )  simd16  ->   d1         HFA(simd16)  single-def <System.Numerics.Vector`1[long]>
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0         single-def
;# V03 OutArgs      [V03    ] (  1,  1   )  struct ( 0) [sp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE #01: aggressive"
;
; Lcl frame size = 0

G_M4290_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
                                                ;; size=8 bbWeight=1 PerfScore 1.50
G_M4290_IG02:  ;; offset=0x0008
            uxtb    w0, w0
            cmp     w0, #2
            bhs     G_M4290_IG06
            cbnz    w0, G_M4290_IG04
                                                ;; size=16 bbWeight=1 PerfScore 3.00
G_M4290_IG03:  ;; offset=0x0018
            ext     z0.b, z0.b, z1.b, #0
            b       G_M4290_IG05
                                                ;; size=8 bbWeight=1 PerfScore 3.00
G_M4290_IG04:  ;; offset=0x0020
            ext     z0.b, z0.b, z1.b, #8
                                                ;; size=4 bbWeight=1 PerfScore 2.00
G_M4290_IG05:  ;; offset=0x0024
            ldp     fp, lr, [sp], #0x10
            ret     lr
                                                ;; size=8 bbWeight=1 PerfScore 2.00
G_M4290_IG06:  ;; offset=0x002C
            bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
            brk_windows #0
                                                ;; size=8 bbWeight=0 PerfScore 0.00

; Total bytes of code 52, prolog size 8, PerfScore 11.50, instruction count 13, allocated bytes for code 52 (MethodHash=cc42ef3d) for method System.Runtime.Intrinsics.Arm.Sve:ExtractVector(System.Numerics.Vector`1[long],System.Numerics.Vector`1[long],ubyte):System.Numerics.Vector`1[long] (FullOpts)
; ============================================================
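The listing above checks the index against 2, throws if it is out of range, and otherwise picks `ext ... #0` or `ext ... #8`. A rough element-level model of that behavior, for 64-bit lanes, might look like the following. It is a sketch under assumptions: a 128-bit vector (two lanes), hypothetical names, and `std::out_of_range` standing in for the throw helper seen in the disassembly.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

constexpr std::size_t kLanes = 2; // assumed: two 64-bit lanes in a 128-bit vector
using VecI64 = std::array<std::int64_t, kLanes>;

// Byte immediate a byte-wise EXT would need for a given 64-bit element index.
constexpr std::size_t byteImmediate(std::size_t elementIndex)
{
    return elementIndex * sizeof(std::int64_t);
}

// Element-level reference: lanes of (first ++ second) starting at index.
VecI64 extractVector(const VecI64& first, const VecI64& second, std::uint8_t index)
{
    if (index >= kLanes)
    {
        // Mirrors the range check that branches to the throw helper in the listing.
        throw std::out_of_range("index");
    }
    VecI64 result{};
    for (std::size_t i = 0; i < kLanes; ++i)
    {
        const std::size_t pos = index + i;
        result[i] = (pos < kLanes) ? first[pos] : second[pos - kLanes];
    }
    return result;
}
```

In this model, element index 1 corresponds to byte immediate 8, which lines up with the `#8` immediate in both the constant-index listing and the second branch of the fallback method.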

@TIHan
Contributor Author

TIHan commented Jun 24, 2024

The HWIntrinsicImmOpHelper takes into account containment, so it will handle the scenario of a constant index.

CodeGen::HWIntrinsicImmOpHelper::HWIntrinsicImmOpHelper(CodeGen* codeGen, GenTree* immOp, GenTreeHWIntrinsic* intrin)
    : codeGen(codeGen)
    , endLabel(nullptr)
    , nonZeroLabel(nullptr)
    , branchTargetReg(REG_NA)
{
    assert(codeGen != nullptr);
    assert(varTypeIsIntegral(immOp));

    if (immOp->isContainedIntOrIImmed())
    {
        nonConstImmReg = REG_NA;

        immValue      = (int)immOp->AsIntCon()->IconValue();
        immLowerBound = immValue;
        immUpperBound = immValue;
    }
    else

@TIHan
Contributor Author

TIHan commented Jun 24, 2024

@dotnet/arm64-contrib @kunalspathak this is ready again.

Ran SveTest APIs:

===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_sbyte() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_short() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_int() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_long() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_byte() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_ushort() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_uint() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestAnyTrue_ulong() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_sbyte() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_short() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_int() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_long() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_byte() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_ushort() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_uint() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestFirstTrue_ulong() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_sbyte() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_short() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_int() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_long() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_byte() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_ushort() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_uint() : 5
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.SveTestLastTrue_ulong() : 5
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@@ -69,6 +69,7 @@ HARDWARE_INTRINSIC(Sve, CreateWhileLessThanOrEqualMask8Bit,
HARDWARE_INTRINSIC(Sve, Divide, -1, 2, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_sdiv, INS_sve_udiv, INS_sve_sdiv, INS_sve_udiv, INS_sve_fdiv, INS_sve_fdiv}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, DotProduct, -1, 3, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_sdot, INS_sve_udot, INS_sve_sdot, INS_sve_udot, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_HasRMWSemantics)
HARDWARE_INTRINSIC(Sve, DotProductBySelectedScalar, -1, 4, true, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_sdot, INS_sve_udot, INS_sve_sdot, INS_sve_udot, INS_invalid, INS_invalid}, HW_Category_SIMDByIndexedElement, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_HasImmediateOperand|HW_Flag_HasRMWSemantics|HW_Flag_LowVectorOperation)
HARDWARE_INTRINSIC(Sve, ExtractVector, -1, 3, true, {INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext, INS_sve_ext}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_HasImmediateOperand|HW_Flag_HasRMWSemantics|HW_Flag_SpecialCodeGen)

Member

I think this should be of category HW_Category_SIMDByIndexedElement.

Contributor Author

The other ExtractVector APIs in AdvSimd are not marked with HW_Category_SIMDByIndexedElement.

@@ -198,6 +199,9 @@ HARDWARE_INTRINSIC(Sve, StoreNarrowing,
HARDWARE_INTRINSIC(Sve, StoreNonTemporal, -1, 3, true, {INS_sve_stnt1b, INS_sve_stnt1b, INS_sve_stnt1h, INS_sve_stnt1h, INS_sve_stnt1w, INS_sve_stnt1w, INS_sve_stnt1d, INS_sve_stnt1d, INS_sve_stnt1w, INS_sve_stnt1d}, HW_Category_MemoryStore, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_ExplicitMaskedOperation|HW_Flag_SpecialCodeGen|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, Subtract, -1, 2, true, {INS_sve_sub, INS_sve_sub, INS_sve_sub, INS_sve_sub, INS_sve_sub, INS_sve_sub, INS_sve_sub, INS_sve_sub, INS_sve_fsub, INS_sve_fsub}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_OptionalEmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, SubtractSaturate, -1, 2, true, {INS_sve_sqsub, INS_sve_uqsub, INS_sve_sqsub, INS_sve_uqsub, INS_sve_sqsub, INS_sve_uqsub, INS_sve_sqsub, INS_sve_uqsub, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_OptionalEmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, TestAnyTrue, -1, 2, true, {INS_sve_ptest, INS_sve_ptest, INS_sve_ptest, INS_sve_ptest, INS_sve_ptest, INS_sve_ptest, INS_sve_ptest, INS_sve_ptest, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_BaseTypeFromFirstArg|HW_Flag_SpecialCodeGen)

Member

Why does Test* need SpecialCodeGen?

Contributor Author

2 things:

  • Need to assert that the dst register is REG_NA
  • I need to pass INS_OPTS_SCALABLE_B on emitIns

Member

That makes sense then.


private static readonly int Op1ElementCount = Unsafe.SizeOf<{Op1VectorType}<{Op1BaseType}>>() / sizeof({Op1BaseType});
private static readonly int Op2ElementCount = Unsafe.SizeOf<{Op2VectorType}<{Op2BaseType}>>() / sizeof({Op2BaseType});
private static readonly int RetElementCount = Unsafe.SizeOf<{RetVectorType}<{RetBaseType}>>() / sizeof({RetBaseType});

Member

Is there any reason why this cannot be shared with existing templates?

test.RunStructFldScenario(this);
}

public void RunUnsupportedScenario()

Member

Same here. Can you please reuse the existing template? If not, can you confirm which template this was created from and what differences prevented us from reusing it?

Contributor Author

The template was ExtractVectorTest.template. The only differences are alignment and that, when using the Load APIs, I needed to create a mask.

}

/// Find any occurrence where both left and right are set
static bool TestAnyTrue(Vector<{MaskBaseType}> left, Vector<{MaskBaseType}> right)

Member

These helpers should be part of Helpers.cs.

Contributor Author

I would have to duplicate them 8 times and they would only be used for this specific test.

Member

@kunalspathak kunalspathak left a comment

Looks like you still need to fix the formatting.

@TIHan
Contributor Author

TIHan commented Jun 24, 2024

Seems the format jobs are having trouble:

Run python3 runtime/src/coreclr/scripts/jitformat.py --jitutils jitutils -r /home/runner/work/runtime/runtime/runtime -o linux -a x64 --cross
[20:31:01] Bad runtime path

I'm seeing it happen on other PRs.
