
Adjust impNormStructVal to not wrap in GT_OBJ #81636

Merged

tannergooding merged 3 commits into dotnet:main on Feb 6, 2023

Conversation

@tannergooding (Member) commented Feb 4, 2023

The ability to make this change comes down to months of work from @SingleAccretion to teach the rest of the JIT about struct typed locals.

Given the breadth of the impact here, that this is a pure improvement for minopts (no diffs plus a significant TP improvement), and that for fullopts the improvement bytes are over 4x the regression bytes, I think this is worth taking. We can then get in-depth perf numbers as part of perf triage, which will let us better understand where to focus efforts around handling said regressions (or, conversely, whether we should scope the fullopts scenario down).

@ghost ghost assigned tannergooding Feb 4, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 4, 2023
@ghost commented Feb 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: tannergooding
Assignees: tannergooding
Labels: area-CodeGen-coreclr
Milestone: -

@tannergooding (Member, Author) commented Feb 4, 2023

Edit: Updated diff comment is here: #81636 (comment)


diffs

Windows x64 and Arm64 show the two largest improvements, with Windows x64 being over -300k bytes. Linux and OSX x64/Arm64 are closer to Windows Arm64, at around -140k bytes (sometimes a bit more, sometimes a bit less).

There are no diffs for minopts in any context. There is, however, a TP improvement in all contexts, with minopts around -0.07% and fullopts around -0.34%.

Will post some analysis on the "improvements" and "regressions" shortly.

Windows x64

| Collection | Contexts with diffs | Improvements | Regressions | Same size | Improvements (bytes) | Regressions (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| benchmarks.run.windows.x64.checked.mch | 461 | 181 | 128 | 152 | -4,599 | +4,932 |
| coreclr_tests.run.windows.x64.checked.mch | 1,604 | 796 | 306 | 502 | -56,089 | +10,462 |
| libraries.crossgen2.windows.x64.checked.mch | 3,185 | 2,405 | 223 | 557 | -62,815 | +6,811 |
| libraries.pmi.windows.x64.checked.mch | 6,105 | 4,124 | 777 | 1,204 | -131,974 | +20,930 |
| libraries_tests.pmi.windows.x64.checked.mch | 8,299 | 5,587 | 1,414 | 1,298 | -232,310 | +72,586 |
| Total | 19,654 | 13,093 | 2,848 | 3,713 | -487,787 | +115,721 |

Windows Arm64

| Collection | Contexts with diffs | Improvements | Regressions | Same size | Improvements (bytes) | Regressions (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| benchmarks.run.windows.arm64.checked.mch | 395 | 127 | 69 | 199 | -2,392 | +2,420 |
| coreclr_tests.run.windows.arm64.checked.mch | 2,036 | 706 | 934 | 396 | -47,312 | +11,764 |
| libraries.crossgen2.windows.arm64.checked.mch | 1,918 | 890 | 101 | 927 | -23,040 | +1,760 |
| libraries.pmi.windows.arm64.checked.mch | 3,986 | 1,862 | 517 | 1,607 | -42,412 | +12,800 |
| libraries_tests.pmi.windows.arm64.checked.mch | 7,608 | 4,270 | 1,534 | 1,804 | -116,260 | +64,444 |
| Total | 15,943 | 7,855 | 3,155 | 4,933 | -231,416 | +93,188 |

@tannergooding (Member, Author) commented Feb 4, 2023

Going to start with a simple example:

public bool Equals(UInt128 other)
{
    return this == other;
}

public static bool operator ==(UInt128 left, UInt128 right)
{
    return (left._lower == right._lower)
        && (left._upper == right._upper);
}

In this case we previously had these locals, and we are now able to elide an "inlining arg" and therefore the extra promotion (of course, the actual diff is slightly different because the lclnums change):

 ;  V00 this         [V00,T01] (  4,  4   )   byref  ->  rcx         this single-def
 ;  V01 arg1         [V01,T00] (  4,  8   )   byref  ->  rdx         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
 ;  V03 tmp1         [V03,T02] (  3,  2   )    bool  ->  rax         "Inline return value spill temp"
 ;* V04 tmp2         [V04    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
-;* V05 tmp3         [V05    ] (  0,  0   )  struct (16) zero-ref    "Inlining Arg"
 ;* V06 tmp4         [V06    ] (  0,  0   )    long  ->  zero-ref    V12._lower(offs=0x00) P-INDEP "field V01._lower (fldOffset=0x0)"
 ;* V07 tmp5         [V07    ] (  0,  0   )    long  ->  zero-ref    V12._upper(offs=0x08) P-INDEP "field V01._upper (fldOffset=0x8)"
 ;  V08 tmp6         [V08,T03] (  2,  2   )    long  ->  rax         V04._lower(offs=0x00) P-INDEP "field V04._lower (fldOffset=0x0)"
 ;  V09 tmp7         [V09,T05] (  2,  1.50)    long  ->  rcx         V04._upper(offs=0x08) P-INDEP "field V04._upper (fldOffset=0x8)"
-;  V10 tmp8         [V10,T04] (  2,  2   )    long  ->   r8         V05._lower(offs=0x00) P-INDEP "field V05._lower (fldOffset=0x0)"
-;  V11 tmp9         [V11,T06] (  2,  1.50)    long  ->  rdx         V05._upper(offs=0x08) P-INDEP "field V05._upper (fldOffset=0x8)"
 ;* V12 tmp10        [V12    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"

This allows us to access the underlying data directly rather than pre-loading it into a register:

-       mov      r8, qword ptr [rdx]
-       mov      rdx, qword ptr [rdx+08H]
-       ; byrRegs -[rdx]
-       cmp      rax, r8
+       cmp      rax, qword ptr [rdx]
        ...
        xor      eax, eax
-       cmp      rcx, rdx
+       cmp      rcx, qword ptr [rdx+08H]
        ...
+       ; byrRegs -[rdx]

These cases look to be the bulk of many diffs, simply removing "inlining arg" and therefore removing an unnecessary copy.

@tannergooding (Member, Author) commented Feb 4, 2023

We have some cases like this:

private readonly SegmentedHashSet<T> _set;

public bool Contains(T value) => _set.Contains(value);

With T being Vector<float>, the dropped "Inlining Arg" is:

-;  V04 tmp2         [V04    ] (  2,  4   )  simd32  ->  [rsp+20H]   do-not-enreg[XS] addr-exposed "Inlining Arg"

Avoiding the copy means we no longer have SIMD code, which means we no longer emit vzeroupper:

-       sub      rsp, 88
-       vzeroupper 
+       sub      rsp, 40

Likewise, we no longer create a copy, and therefore no longer need to spill that copy for the method we're forwarding to:

-       vmovups  ymm0, ymmword ptr[rdx]
-       vmovups  ymmword ptr[rsp+20H], ymm0
        cmp      byte  ptr [rcx], cl
-       lea      rdx, [rsp+20H]
-       ; byrRegs -[rdx]
        call     [<unknown method>]
        ; gcrRegs -[rcx]
+       ; byrRegs -[rdx]

This is essentially just a variation on the first and again applies to many cases where we're simply passing along a value type across the inlining boundary.

We see similar improvements for Nullable<T> and effectively any value type that is passed via a byref to a shadow copy.
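
To make the shape of these cases concrete, here is a minimal hypothetical C# sketch (not taken from the diffs above; the type and method names are invented for illustration) of value types being forwarded across an inlining boundary:

// Hypothetical illustration: previously the inlined callee's struct argument got
// its own "Inlining Arg" temp plus a copy; with this change the caller's
// parameter can be read directly.
struct Pair
{
    public long Lower;
    public long Upper;
}

static class ForwardingExample
{
    static bool IsZero(Pair value) => (value.Lower | value.Upper) == 0;
    static bool HasValue(int? value) => value.HasValue;

    // Wrappers that simply pass a struct along, like Equals/Contains above.
    static bool Forward(Pair value) => IsZero(value);
    static bool ForwardNullable(int? value) => HasValue(value);
}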

@tannergooding (Member, Author) commented:

For a regression case:

public unsafe Matrix3x2 TransformElements
{
    set
    {
        IntPtr nativeMatrix = Matrix.CreateNativeHandle(value);

        try
        {
            Gdip.CheckStatus(Gdip.GdipSetWorldTransform(
                new HandleRef(this, NativeGraphics), new HandleRef(null, nativeMatrix)));
        }
        finally
        {
            if (nativeMatrix != IntPtr.Zero)
            {
                Gdip.GdipDeleteMatrix(new HandleRef(null, nativeMatrix));
            }
        }
    }
}

internal static void CheckStatus(int status)
{
    if (status != Ok)
        throw StatusException(status);
}

// Other calls are Library import based P/Invokes

We are able to drop a do-not-enreg inlining arg again:

-;  V08 tmp5         [V08,T00] (  7, 14   )  struct (24) [rbp-58H]   do-not-enreg[SF] "Inlining Arg"

This in turn means that other inlining args are no longer zero-ref:

-;* V13 tmp10        [V13    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
-;* V14 tmp11        [V14    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
-;* V15 tmp12        [V15    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
-;* V16 tmp13        [V16    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
-;* V17 tmp14        [V17    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
-;* V18 tmp15        [V18    ] (  0,  0   )   float  ->  zero-ref    "Inlining Arg"
+;  V12 tmp9         [V12,T19] (  2,  4   )   float  ->  mm6         "Inlining Arg"
+;  V13 tmp10        [V13,T20] (  2,  4   )   float  ->  mm7         "Inlining Arg"
+;  V14 tmp11        [V14,T21] (  2,  4   )   float  ->  mm8         "Inlining Arg"
+;  V15 tmp12        [V15,T22] (  2,  4   )   float  ->  mm9         "Inlining Arg"
+;  V16 tmp13        [V16,T23] (  2,  4   )   float  ->  mm10         "Inlining Arg"
+;  V17 tmp14        [V17,T24] (  2,  4   )   float  ->  mm11         "Inlining Arg"

This is because previously we were doing an efficient block copy of the Matrix3x2, but we are now loading the 6 fields directly. That causes more spilling in the prologue/epilogue, since xmm6-xmm11 are callee-saved registers, and it also loses containment:

-       vmovdqu  xmm0, xmmword ptr [rsi]
-       vmovdqu  xmmword ptr [rbp-58H], xmm0
-       mov      rcx, qword ptr [rsi+10H]
-       mov      qword ptr [rbp-48H], rcx
+       vmovss   xmm6, dword ptr [rsi]
+       vmovss   xmm7, dword ptr [rsi+04H]
+       vmovss   xmm8, dword ptr [rsi+08H]
+       vmovss   xmm9, dword ptr [rsi+0CH]
+       vmovss   xmm10, dword ptr [rsi+10H]
+       vmovss   xmm11, dword ptr [rsi+14H]
       mov      rcx, 0xD1FFAB1E
       mov      edx, 204
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       ; byrRegs -[rsi]
       ; gcr arg pop 0
-       lea      rax, bword ptr [rbp-60H]
+       lea      rax, bword ptr [rbp-B0H]
       ; byrRegs +[rax]
-       vmovss   xmm0, dword ptr [rbp-58H]
-       vmovss   xmm1, dword ptr [rbp-54H]
-       vmovss   xmm2, dword ptr [rbp-50H]
-       vmovss   xmm3, dword ptr [rbp-4CH]
-       vmovss   xmm4, dword ptr [rbp-48H]
-       vmovss   xmm5, dword ptr [rbp-44H]
-       vmovss   dword ptr [rsp+20H], xmm4
-       vmovss   dword ptr [rsp+28H], xmm5
+       vmovaps  xmm0, xmm6
+       vmovaps  xmm1, xmm7
+       vmovaps  xmm2, xmm8
+       vmovaps  xmm3, xmm9
+       vmovss   dword ptr [rsp+20H], xmm10
+       vmovss   dword ptr [rsp+28H], xmm11

Part of this appears to be "non-ideal" LSRA decisions. Given that xmm6-xmm9 are each used only once and then later copied into xmm0-xmm3, the values likely should have been loaded into xmm0-xmm3 in the first place. I think this is partially "blocked" by the call to CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE, but I haven't dug into the dump to confirm.

There are some other considerations here, in that the managed code could be tweaked slightly to better handle this scenario.

@tannergooding (Member, Author) commented:

Let's also take a look at:

private ValueTask<int> ReadBufferedAsync(Memory<byte> destination)
{
    // If the caller provided buffer, and thus the amount of data desired to be read,
    // is larger than the internal buffer, there's no point going through the internal
    // buffer, so just do an unbuffered read.
    // Also avoid using the internal buffer if the user requested a zero-byte read to allow
    // underlying streams to efficiently handle such a read (e.g. SslStream deferring buffer allocation).
    return destination.Length >= _readBuffer.Capacity || destination.Length == 0 ?
        ReadAsync(destination) :
        ReadBufferedAsyncCore(destination);
}

private async ValueTask<int> ReadAsync(Memory<byte> destination)
{
    // This is called when reading the response body.

    if (_readBuffer.ActiveLength > 0)
    {
        // We have data in the read buffer.  Return it to the caller.
        return ReadFromBuffer(destination.Span);
    }

    // No data in read buffer.
    // Do an unbuffered read directly against the underlying stream.
    Debug.Assert(_readAheadTask == default, "Read ahead task should have been consumed as part of the headers.");
    int count = await _stream.ReadAsync(destination).ConfigureAwait(false);
    if (NetEventSource.Log.IsEnabled()) Trace($"Received {count} bytes.");
    return count;
}

private async ValueTask<int> ReadBufferedAsyncCore(Memory<byte> destination)
{
    // This is called when reading the response body.

    if (_readBuffer.ActiveLength == 0)
    {
        // Do a buffered read directly against the underlying stream.
        Debug.Assert(_readAheadTask == default, "Read ahead task should have been consumed as part of the headers.");

        Debug.Assert(_readBuffer.AvailableLength == _readBuffer.Capacity);
        int bytesRead = await _stream.ReadAsync(_readBuffer.AvailableMemory).ConfigureAwait(false);
        _readBuffer.Commit(bytesRead);

        if (NetEventSource.Log.IsEnabled()) Trace($"Received {bytesRead} bytes.");
    }

    // Hand back as much data as we can fit.
    return ReadFromBuffer(destination.Span);
}

For this case, we change from a CSE to a few new "argument with side effect" temps instead.

Before:

;  V00 this         [V00,T01] (  5,  4   )     ref  ->  rcx         this class-hnd single-def
;  V01 RetBuf       [V01,T02] (  4,  3   )   byref  ->  rdx         single-def
;  V02 arg1         [V02,T00] (  5,  8   )   byref  ->   r8         ld-addr-op single-def
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V04 tmp1         [V04    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V05 tmp2         [V05    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V06 tmp3         [V06    ] (  0,  0   )     ref  ->  zero-ref    V09._object(offs=0x00) P-INDEP "field V02._object (fldOffset=0x0)"
;* V07 tmp4         [V07    ] (  0,  0   )     int  ->  zero-ref    V09._index(offs=0x08) P-INDEP "field V02._index (fldOffset=0x8)"
;* V08 tmp5         [V08    ] (  0,  0   )     int  ->  zero-ref    V09._length(offs=0x0c) P-INDEP "field V02._length (fldOffset=0xc)"
;* V09 tmp6         [V09    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V10 cse0         [V10,T03] (  3,  2.50)     int  ->  rax         "CSE - aggressive"

After:

;  V00 this         [V00,T02] (  5,  4   )     ref  ->  rcx         this class-hnd single-def
;  V01 RetBuf       [V01,T01] (  6,  4   )   byref  ->  rsi         single-def
;  V02 arg1         [V02,T00] (  5, 10   )   byref  ->   r8         ld-addr-op single-def
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;* V04 tmp1         [V04    ] (  0,  0   )     int  ->  zero-ref    "impAppendStmt"
;* V05 tmp2         [V05    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;  V06 tmp3         [V06,T04] (  3,  2   )     ref  ->  rdi         single-def V09._object(offs=0x00) P-INDEP "field V02._object (fldOffset=0x0)"
;  V07 tmp4         [V07,T05] (  3,  2   )     int  ->  rbx         V09._index(offs=0x08) P-INDEP "field V02._index (fldOffset=0x8)"
;  V08 tmp5         [V08,T03] (  5,  3.50)     int  ->  rbp         V09._length(offs=0x0c) P-INDEP "field V02._length (fldOffset=0xc)"
;* V09 tmp6         [V09    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref"
;  V10 tmp7         [V10    ] (  8,  8   )  struct (16) [rsp+28H]   do-not-enreg[XSF] must-init addr-exposed "by-value struct argument"
;* V11 tmp8         [V11,T06] (  0,  0   )     ref  ->  zero-ref    single-def "argument with side effect"
;* V12 tmp9         [V12,T07] (  0,  0   )   byref  ->  zero-ref    single-def "argument with side effect"
;* V13 tmp10        [V13,T08] (  0,  0   )     ref  ->  zero-ref    single-def "argument with side effect"
;* V14 tmp11        [V14,T09] (  0,  0   )   byref  ->  zero-ref    single-def "argument with side effect"

This ends up happening because rather than two tail.jmp calls:

G_M51019_IG04:        ; bbWeight=0.50, epilog, nogc, extend
       tail.jmp [System.Net.Http.HttpConnection:ReadBufferedAsyncCore(System.Memory`1[ubyte]):System.Threading.Tasks.ValueTask`1[int]:this]
       ; gcr arg pop 0
						;; size=6 bbWeight=0.50 PerfScore 1.00
G_M51019_IG05:        ; bbWeight=0.50, gcrefRegs=00000002 {rcx}, byrefRegs=00000104 {rdx r8}, byref, epilog, nogc
       tail.jmp [System.Net.Http.HttpConnection:ReadAsync(System.Memory`1[ubyte]):System.Threading.Tasks.ValueTask`1[int]:this]

We end up with real calls and corresponding logic:

call     [System.Net.Http.HttpConnection:ReadBufferedAsyncCore(System.Memory`1[ubyte]):System.Threading.Tasks.ValueTask`1[int]:this]

@tannergooding (Member, Author) commented:

Most regressions I've seen look to be along these lines, where the copy ended up being "beneficial" because it could be done as a SIMD copy and used fewer registers than the alternative.

They generally look to be cases where small tweaks might be able to win this back.

@tannergooding tannergooding marked this pull request as ready for review February 4, 2023 18:51
@jakobbotsch (Member) commented:

Some credit where credit is due: the ability to make this change comes down to months of work from @SingleAccretion to teach the rest of the JIT about struct typed locals.

@tannergooding (Member, Author) commented:

I left a note in the OP giving the well-deserved credit to Single, along with my view on taking this change:

Given the breadth of the impact here, that this is a pure improvement for minopts (no diffs plus a significant TP improvement), and that for fullopts the improvement bytes are over 4x the regression bytes, I think this is worth taking. We can then get in-depth perf numbers as part of perf triage, which will let us better understand where to focus efforts around handling said regressions (or, conversely, whether we should scope the fullopts scenario down).

If we were to take this change today/tomorrow, then we would be able to get concrete perf numbers on Tuesday (x64) and Thursday (Arm64) of next week.

Given this is an 11-line change and we aren't snapping again in the next two weeks, the "worst case outcome" is that we decide to "revert" the change for fullopts after getting the initial numbers (given there is no diff for minopts and a nice TP improvement, there is no need to also revert it there).

@tannergooding (Member, Author) commented:

In the meantime, I am going to work on looking more in depth at the JIT dumps for the regression cases to see if there is anything simple/obvious around them (such as the tail call scenario).

@tannergooding (Member, Author) commented:

Here's the base vs diff JitDump for Number.TryParseUInt64, where it drops the tail call optimization:

Base.txt
Diff.txt

The first part of the diff is straightforward: we just drop OBJ(LCL_VAR_ADDR) in favor of LCL_VAR.

This in turn means that when we get to morph and the LocalAddressVisitor, we no longer increment the weighted ref count (the base gets an ultimate weighted ref count of 3):

LocalAddressVisitor visiting statement:
STMT00004 ( 0x006[E-] ... 0x00F )
-               [000034] --C-G------                         *  RETURN    int   
-               [000031] --C-G------                         \--*  CALL      int    Test:TryParseUInt64IntegerStyle(System.ReadOnlySpan`1[ushort],int,System.Globalization.NumberFormatInfo,byref):int
-               [000033] n---------- arg0                       +--*  OBJ       struct<System.ReadOnlySpan`1, 16>
-               [000032] -----------                            |  \--*  LCL_VAR_ADDR byref  V00 arg0         
-                                                               |  \--*    byref  V00.System.ReadOnlySpan`1[ushort]:_reference (offs=0x00) -> V05 tmp1         
-                                                               |  \--*    int    V00.System.ReadOnlySpan`1[ushort]:_length (offs=0x08) -> V06 tmp2         
-               [000028] ----------- arg1                       +--*  LCL_VAR   int    V01 arg1         
-               [000029] ----------- arg2                       +--*  LCL_VAR   ref    V02 arg2         
-               [000030] ----------- arg3                       \--*  LCL_VAR   byref  V03 arg3         
-LocalAddressVisitor incrementing weighted ref count from 0 to 1 for implicit by-ref V00 arg passed to call
-LocalAddressVisitor modified statement:

The diff instead gets more done in Morph - Byrefs, with a new block covering a struct copy being inserted. That in turn results in Morph - Global making some additional transformations, and then the tail call being rejected due to the callee having a byref parameter.

@SingleAccretion (Contributor) commented:

UpdateEarlyRefCount needs to recognize LCL_VARs.

// is introduced (it was primarily from impNormStructVal). But until all cases are gone
// we still want to handle it as well.

if (m_ancestors.Height() < 2)
@tannergooding (Member, Author) replied Feb 5, 2023:


UpdateEarlyRefCount needs to recognize LCL_VARs.

Updated that here. I left in the OBJ(LCL_VAR_ADDR) handling since we still have that in some cases (even though the majority of OBJ wrapping is gone now that it's out of impNormStructVal).

Also changed it from a for loop to some explicit linear checks instead, which simplifies things and should improve the method's TP.

@tannergooding (Member, Author) replied:

I'm not the most familiar with LclMorph, so I went with the assumption that the early exit for !m_compiler->lvaIsImplicitByRefLocal(lclNum) is enough to ensure the LCL_VAR is something we want to handle, and that we don't also need to explicitly check that the LCL_VAR is a struct, or a struct that would actually be passed by ref (e.g. struct S { int x; } typically gets passed in a register instead).

Let me know if this assumption isn't correct and if more checks are needed.
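
For reference, here is a minimal sketch of the distinction assumed above, based on the Windows x64 calling convention (structs of size 1, 2, 4, or 8 bytes are passed in a register; larger structs are passed as an implicit byref to a caller-allocated copy). The type and method names are invented for illustration:

struct Small          // 4 bytes: passed directly in a register
{
    public int X;
}

struct Large          // 24 bytes: passed as an implicit byref to a caller-made copy
{
    public long A, B, C;
}

static class AbiExample
{
    static int UseSmall(Small s) => s.X;         // parameter is not an implicit byref local
    static long UseLarge(Large l) => l.A + l.B;  // parameter is an implicit byref local
}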

@tannergooding (Member, Author) replied:

I also did not handle LCL_FLD, since this is "pre-order" and I believe that means we should only see GT_FIELD at this point.

@tannergooding (Member, Author) commented Feb 5, 2023

New diffs

The TL;DR: more improvements, fewer regressions, and even better TP.

Windows x64

| Collection | Contexts with diffs | Improvements | Regressions | Same size | Improvements (bytes) | Regressions (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| benchmarks.run.windows.x64.checked.mch | 428 | 198 | 80 | 150 | -5,718 | +1,770 |
| coreclr_tests.run.windows.x64.checked.mch | 1,572 | 815 | 255 | 502 | -56,614 | +8,334 |
| libraries.crossgen2.windows.x64.checked.mch | 3,168 | 2,503 | 113 | 552 | -65,778 | +875 |
| libraries.pmi.windows.x64.checked.mch | 5,971 | 4,235 | 539 | 1,197 | -139,180 | +11,405 |
| libraries_tests.pmi.windows.x64.checked.mch | 8,308 | 5,678 | 1,361 | 1,269 | -238,215 | +69,148 |
| Total | 19,447 | 13,429 | 2,348 | 3,670 | -505,505 | +91,532 |

Windows Arm64

| Collection | Contexts with diffs | Improvements | Regressions | Same size | Improvements (bytes) | Regressions (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| benchmarks.run.windows.arm64.checked.mch | 404 | 135 | 70 | 199 | -2,532 | +1,984 |
| coreclr_tests.run.windows.arm64.checked.mch | 2,031 | 708 | 929 | 394 | -47,340 | +11,588 |
| libraries.crossgen2.windows.arm64.checked.mch | 1,923 | 897 | 101 | 925 | -23,200 | +1,508 |
| libraries.pmi.windows.arm64.checked.mch | 4,012 | 1,880 | 527 | 1,605 | -43,824 | +12,456 |
| libraries_tests.pmi.windows.arm64.checked.mch | 7,603 | 4,298 | 1,529 | 1,776 | -117,256 | +63,980 |
| Total | 15,973 | 7,918 | 3,156 | 4,899 | -234,152 | +91,516 |

@jakobbotsch (Member) left a comment

LGTM. Nice diffs!

@tannergooding (Member, Author) commented:

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines commented:
Azure Pipelines successfully started running 6 pipeline(s).

@tannergooding (Member, Author) commented Feb 6, 2023

libraries-jitstress failures are #81585.
outerloop, jitstress, and isas-x86 failures are #81669.
Fuzzlyn failure is pre-existing: #81675.

@tannergooding (Member, Author) commented:

Merging, given that this is a small and easily reverted change with a very large positive impact and only a small set of regressions.

This will allow us to get concrete perf numbers as part of perf triage on Tuesday (x64) and Thursday (Arm64), which will let us understand the impact of the diffs with more confidence.

@tannergooding tannergooding merged commit 58719ec into dotnet:main Feb 6, 2023
@tannergooding tannergooding deleted the impNormStructVal branch February 6, 2023 04:11
@ghost ghost locked as resolved and limited conversation to collaborators Mar 8, 2023