Adjust impNormStructVal to not wrap in GT_OBJ #81636
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
Force-pushed from 71481de to d55a004.
Edit: Updated diff comment is here: #81636 (comment)

Windows x64 and Arm64 are the two largest improvements, with Windows Arm64 being -300k bytes. Linux and OSX x64/Arm64 are smaller, being around -140k bytes (sometimes a bit more, sometimes a bit less).

There are no diffs for minopts in any context. There is, however, a TP improvement in all contexts, with minopts being around -0.07% and fullopts being around -0.34%.

Will post some analysis on the "improvements" and "regressions" shortly.

Windows x64

Windows Arm64
Going to start with a simple example:

```csharp
public bool Equals(UInt128 other)
{
    return this == other;
}

public static bool operator ==(UInt128 left, UInt128 right)
{
    return (left._lower == right._lower)
        && (left._upper == right._upper);
}
```

In this case we previously had these locals and are now able to elide an "inlining arg" and therefore the extra promotion (of course the actual diff is slightly different due to lclnums changing):

```diff
 ;  V00 this  [V00,T01] ( 4, 4   ) byref  -> rcx  this single-def
 ;  V01 arg1  [V01,T00] ( 4, 8   ) byref  -> rdx  single-def
 ;# V02 OutArgs [V02  ] ( 1, 1   ) lclBlk ( 0) [rsp+00H]  "OutgoingArgSpace"
 ;  V03 tmp1  [V03,T02] ( 3, 2   ) bool   -> rax  "Inline return value spill temp"
 ;* V04 tmp2  [V04    ] ( 0, 0   ) struct (16) zero-ref  "Inlining Arg"
-;* V05 tmp3  [V05    ] ( 0, 0   ) struct (16) zero-ref  "Inlining Arg"
 ;* V06 tmp4  [V06    ] ( 0, 0   ) long   -> zero-ref  V12._lower(offs=0x00) P-INDEP "field V01._lower (fldOffset=0x0)"
 ;* V07 tmp5  [V07    ] ( 0, 0   ) long   -> zero-ref  V12._upper(offs=0x08) P-INDEP "field V01._upper (fldOffset=0x8)"
 ;  V08 tmp6  [V08,T03] ( 2, 2   ) long   -> rax  V04._lower(offs=0x00) P-INDEP "field V04._lower (fldOffset=0x0)"
 ;  V09 tmp7  [V09,T05] ( 2, 1.50) long   -> rcx  V04._upper(offs=0x08) P-INDEP "field V04._upper (fldOffset=0x8)"
-;  V10 tmp8  [V10,T04] ( 2, 2   ) long   -> r8   V05._lower(offs=0x00) P-INDEP "field V05._lower (fldOffset=0x0)"
-;  V11 tmp9  [V11,T06] ( 2, 1.50) long   -> rdx  V05._upper(offs=0x08) P-INDEP "field V05._upper (fldOffset=0x8)"
 ;* V12 tmp10 [V12    ] ( 0, 0   ) struct (16) zero-ref  "Promoted implicit byref"
```

This allows us to access the underlying data directly rather than pre-loading it into a register:

```diff
- mov  r8, qword ptr [rdx]
- mov  rdx, qword ptr [rdx+08H]
- ; byrRegs -[rdx]
- cmp  rax, r8
+ cmp  rax, qword ptr [rdx]
  ...
  xor  eax, eax
- cmp  rcx, rdx
+ cmp  rcx, qword ptr [rdx+08H]
  ...
+ ; byrRegs -[rdx]
```

These cases look to be the bulk of many diffs: simply removing the "inlining arg" and therefore removing an unnecessary copy.
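For anyone wanting to reproduce this kind of diff locally, here is a minimal sketch along the same lines (the type and method names are ours, not from the PR); the codegen can be inspected with the standard DOTNET_JitDisasm knob:

```csharp
// Hypothetical repro sketch: a 16-byte struct compared field-by-field through
// an inlined helper, mirroring the UInt128 example above.
// Inspect codegen with: DOTNET_JitDisasm="CallerEquals" DOTNET_TieredCompilation=0
using System;

struct Pair128
{
    public ulong Lower;
    public ulong Upper;
}

static class Program
{
    // Before this change, inlining Equals128 introduced struct-typed
    // "Inlining Arg" temps that copied `a` and `b`; with the change the
    // fields can be read directly through the implicit byref arguments.
    static bool Equals128(Pair128 left, Pair128 right)
        => (left.Lower == right.Lower) && (left.Upper == right.Upper);

    static bool CallerEquals(Pair128 a, Pair128 b) => Equals128(a, b);

    static void Main() => Console.WriteLine(CallerEquals(default, default));
}
```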
We have some cases like this:

```csharp
private readonly SegmentedHashSet<T> _set;

public bool Contains(T value) => _set.Contains(value);
```

Where, given:

```diff
-;  V04 tmp2  [V04    ] ( 2, 4 ) simd32 -> [rsp+20H] do-not-enreg[XS] addr-exposed "Inlining Arg"
```

this avoidance of a copy means we no longer have SIMD code, which means we no longer emit:

```diff
- sub  rsp, 88
- vzeroupper
+ sub  rsp, 40
```

We likewise then no longer create a copy and don't need to spill said copy for the method we're forwarding to:

```diff
- vmovups ymm0, ymmword ptr [rdx]
- vmovups ymmword ptr [rsp+20H], ymm0
  cmp  byte ptr [rcx], cl
- lea  rdx, [rsp+20H]
- ; byrRegs -[rdx]
  call [<unknown method>]
  ; gcrRegs -[rcx]
+ ; byrRegs -[rdx]
```

This is essentially just a variation on the first and again applies to many cases where we're simply passing along a value type across the inlining boundary. We see similar improvements in other such forwarding cases.
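The shape involved is just a thin forwarding wrapper over a large struct. A hedged sketch of that shape (SegmentedHashSet swapped for a plain HashSet, and the struct layout is ours):

```csharp
// Sketch of the forwarding shape: Contains copies nothing itself and just
// passes `value` along, yet previously the inliner introduced a 32-byte
// "Inlining Arg" copy (spilled with SIMD moves on AVX-capable x64).
using System.Collections.Generic;

public struct Big32
{
    public double A, B, C, D; // 32 bytes total
}

public sealed class Wrapper
{
    private readonly HashSet<Big32> _set = new();

    // With the change, the incoming `value` is forwarded directly to the
    // inner call instead of being copied into a local first.
    public bool Contains(Big32 value) => _set.Contains(value);
}
```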
For a regression case:

```csharp
public unsafe Matrix3x2 TransformElements
{
    set
    {
        IntPtr nativeMatrix = Matrix.CreateNativeHandle(value);
        try
        {
            Gdip.CheckStatus(Gdip.GdipSetWorldTransform(
                new HandleRef(this, NativeGraphics), new HandleRef(null, nativeMatrix)));
        }
        finally
        {
            if (nativeMatrix != IntPtr.Zero)
            {
                Gdip.GdipDeleteMatrix(new HandleRef(null, nativeMatrix));
            }
        }
    }
}

internal static void CheckStatus(int status)
{
    if (status != Ok)
        throw StatusException(status);
}

// Other calls are LibraryImport-based P/Invokes
```

We are able to drop a struct-typed "Inlining Arg":

```diff
-;  V08 tmp5  [V08,T00] ( 7, 14 ) struct (24) [rbp-58H] do-not-enreg[SF] "Inlining Arg"
```

This in turn means that the other inlining args are no longer zero-ref:

```diff
-;* V13 tmp10 [V13    ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
-;* V14 tmp11 [V14    ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
-;* V15 tmp12 [V15    ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
-;* V16 tmp13 [V16    ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
-;* V17 tmp14 [V17    ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
-;* V18 tmp15 [V18    ] ( 0, 0 ) float -> zero-ref "Inlining Arg"
+;  V12 tmp9  [V12,T19] ( 2, 4 ) float -> mm6  "Inlining Arg"
+;  V13 tmp10 [V13,T20] ( 2, 4 ) float -> mm7  "Inlining Arg"
+;  V14 tmp11 [V14,T21] ( 2, 4 ) float -> mm8  "Inlining Arg"
+;  V15 tmp12 [V15,T22] ( 2, 4 ) float -> mm9  "Inlining Arg"
+;  V16 tmp13 [V16,T23] ( 2, 4 ) float -> mm10 "Inlining Arg"
+;  V17 tmp14 [V17,T24] ( 2, 4 ) float -> mm11 "Inlining Arg"
```

This is because previously we were doing an efficient block copy of the struct:

```diff
- vmovdqu xmm0, xmmword ptr [rsi]
- vmovdqu xmmword ptr [rbp-58H], xmm0
- mov  rcx, qword ptr [rsi+10H]
- mov  qword ptr [rbp-48H], rcx
+ vmovss xmm6, dword ptr [rsi]
+ vmovss xmm7, dword ptr [rsi+04H]
+ vmovss xmm8, dword ptr [rsi+08H]
+ vmovss xmm9, dword ptr [rsi+0CH]
+ vmovss xmm10, dword ptr [rsi+10H]
+ vmovss xmm11, dword ptr [rsi+14H]
  mov  rcx, 0xD1FFAB1E
  mov  edx, 204
  call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
  ; byrRegs -[rsi]
  ; gcr arg pop 0
- lea  rax, bword ptr [rbp-60H]
+ lea  rax, bword ptr [rbp-B0H]
  ; byrRegs +[rax]
- vmovss xmm0, dword ptr [rbp-58H]
- vmovss xmm1, dword ptr [rbp-54H]
- vmovss xmm2, dword ptr [rbp-50H]
- vmovss xmm3, dword ptr [rbp-4CH]
- vmovss xmm4, dword ptr [rbp-48H]
- vmovss xmm5, dword ptr [rbp-44H]
- vmovss dword ptr [rsp+20H], xmm4
- vmovss dword ptr [rsp+28H], xmm5
+ vmovaps xmm0, xmm6
+ vmovaps xmm1, xmm7
+ vmovaps xmm2, xmm8
+ vmovaps xmm3, xmm9
+ vmovss dword ptr [rsp+20H], xmm10
+ vmovss dword ptr [rsp+28H], xmm11
```

Part of this appears to be "non-ideal" LSRA decisions. There are also some other considerations here, in that the managed code could be tweaked slightly to better handle this scenario.
Let's also take a look at:

```csharp
private ValueTask<int> ReadBufferedAsync(Memory<byte> destination)
{
    // If the caller-provided buffer, and thus the amount of data desired to be read,
    // is larger than the internal buffer, there's no point going through the internal
    // buffer, so just do an unbuffered read.
    // Also avoid using the internal buffer if the user requested a zero-byte read to allow
    // underlying streams to efficiently handle such a read (e.g. SslStream deferring buffer allocation).
    return destination.Length >= _readBuffer.Capacity || destination.Length == 0 ?
        ReadAsync(destination) :
        ReadBufferedAsyncCore(destination);
}

private async ValueTask<int> ReadAsync(Memory<byte> destination)
{
    // This is called when reading the response body.

    if (_readBuffer.ActiveLength > 0)
    {
        // We have data in the read buffer. Return it to the caller.
        return ReadFromBuffer(destination.Span);
    }

    // No data in read buffer.
    // Do an unbuffered read directly against the underlying stream.
    Debug.Assert(_readAheadTask == default, "Read ahead task should have been consumed as part of the headers.");
    int count = await _stream.ReadAsync(destination).ConfigureAwait(false);
    if (NetEventSource.Log.IsEnabled()) Trace($"Received {count} bytes.");
    return count;
}

private async ValueTask<int> ReadBufferedAsyncCore(Memory<byte> destination)
{
    // This is called when reading the response body.

    if (_readBuffer.ActiveLength == 0)
    {
        // Do a buffered read directly against the underlying stream.
        Debug.Assert(_readAheadTask == default, "Read ahead task should have been consumed as part of the headers.");
        Debug.Assert(_readBuffer.AvailableLength == _readBuffer.Capacity);
        int bytesRead = await _stream.ReadAsync(_readBuffer.AvailableMemory).ConfigureAwait(false);
        _readBuffer.Commit(bytesRead);
        if (NetEventSource.Log.IsEnabled()) Trace($"Received {bytesRead} bytes.");
    }

    // Hand back as much data as we can fit.
    return ReadFromBuffer(destination.Span);
}
```

For this case, the before and after codegen diffs show that rather than two tail calls, we end up with real calls and corresponding logic.
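The general shape here is a method whose whole body is a conditional forward of a struct argument to one of two calls. A simplified sketch of that shape (ours, not the actual HttpConnection code; the tail-call observation is from the PR comment above, not something this sketch proves):

```csharp
// Conditional-forwarding shape: previously both branches could be emitted as
// tail calls; in the regressed codegen they become regular calls with the
// associated frame logic.
using System;
using System.Threading.Tasks;

sealed class Forwarder
{
    private static ValueTask<int> ReadDirect(Memory<byte> destination) => new(1);
    private static ValueTask<int> ReadBuffered(Memory<byte> destination) => new(2);

    public ValueTask<int> Read(Memory<byte> destination)
        => destination.Length == 0
            ? ReadDirect(destination)
            : ReadBuffered(destination);
}
```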
Most regressions I've seen look to be along these lines, where the copy ended up being "beneficial" due to its ability to be a SIMD copy and use fewer registers than the alternative. They generally look to be cases where small tweaks might be able to win this back.
Some credit where credit is due: the ability to make this change comes down to months of work from @SingleAccretion to teach the rest of the JIT about struct typed locals.
I left a note in the OP giving the well-deserved credit to Single, and also my view on taking this change:

If we were to take this change today/tomorrow, then we would be able to get concrete perf numbers Tuesday (x64) and Thursday (Arm64) of next week. Given that this is an 11-line change and we aren't snapping again in the next two weeks, the "worst case outcome" is that we decide to revert the change.
In the meantime, I am going to look more in depth at the JIT dumps for the regression cases to see if there is anything simple/obvious around them (such as the tail call scenario).
Here's the base vs diff JitDump. The first part of the diff is straightforward, just dropping the OBJ wrapping. This in turn changes what we see when we get to morph, with the LocalAddressVisitor visiting the statement:

```diff
 STMT00004 ( 0x006[E-] ... 0x00F )
-  [000034] --C-G------        *  RETURN    int
-  [000031] --C-G------        \--*  CALL      int    Test:TryParseUInt64IntegerStyle(System.ReadOnlySpan`1[ushort],int,System.Globalization.NumberFormatInfo,byref):int
-  [000033] n----------  arg0     +--*  OBJ       struct<System.ReadOnlySpan`1, 16>
-  [000032] -----------           |  \--*  LCL_VAR_ADDR  byref  V00 arg0
-                                 |     \--* byref  V00.System.ReadOnlySpan`1[ushort]:_reference (offs=0x00) -> V05 tmp1
-                                 |     \--* int    V00.System.ReadOnlySpan`1[ushort]:_length (offs=0x08) -> V06 tmp2
-  [000028] -----------  arg1     +--*  LCL_VAR   int    V01 arg1
-  [000029] -----------  arg2     +--*  LCL_VAR   ref    V02 arg2
-  [000030] -----------  arg3     \--*  LCL_VAR   byref  V03 arg3
-LocalAddressVisitor incrementing weighted ref count from 0 to 1 for implicit by-ref V00 arg passed to call
-LocalAddressVisitor modified statement:
```
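For context, here is our reconstruction of the C# shape that produces the statement above (signatures approximated from the dump; this is not the actual test source):

```csharp
// Approximation of the dump's caller: a ReadOnlySpan<char> struct argument
// forwarded to a parsing helper. Before the change, arg0 was wrapped as
// OBJ(LCL_VAR_ADDR V00); afterwards it is a plain LCL_VAR.
using System;
using System.Globalization;

static class Test
{
    static int TryParseUInt64IntegerStyle(
        ReadOnlySpan<char> value, NumberStyles styles, NumberFormatInfo info, out ulong result)
    {
        result = 0;
        return 0;
    }

    static int Caller(ReadOnlySpan<char> value, NumberStyles styles, NumberFormatInfo info, out ulong result)
        => TryParseUInt64IntegerStyle(value, styles, info, out result);
}
```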
```cpp
// is introduced (it was primarily from impNormStructVal). But until all cases are gone
// we still want to handle it as well.
...
if (m_ancestors.Height() < 2)
```
UpdateEarlyRefCount needs to recognize LCL_VARs.
Updated that here. I left in the OBJ(LCL_VAR_ADDR) handling since we still have that in some cases (even if the majority of OBJ wrapping is gone now that it's out of impNormStructVal).

Also changed it from a for loop to some explicit linear checks instead, which simplifies things and should improve the method's TP.
Not the most familiar with LclMorph, so I went with the assumption that the early exit for !m_compiler->lvaIsImplicitByRefLocal(lclNum) is enough to ensure the LCL_VAR is something we want to handle, and that we don't also need to explicitly check that the LCL_VAR is a struct, or a struct that would actually be passed by ref (e.g. struct S { int x; } typically gets passed in a register instead).

Let me know if this assumption isn't correct and if more checks are needed.
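For reference, a small sketch of the distinction in question on Windows x64 (the ABI rule as we understand it: structs whose size is not 1, 2, 4, or 8 bytes are passed by implicit byref; other targets differ, e.g. SysV x64 and Arm64 pass larger structs in registers):

```csharp
// Windows x64 ABI sketch: S1 (4 bytes) is passed by value in a register, so
// lvaIsImplicitByRefLocal would be false for it; S3 (12 bytes) is passed by
// implicit byref, which is what UpdateEarlyRefCount cares about.
struct S1 { public int X; }
struct S3 { public int X, Y, Z; }

static class AbiSketch
{
    static int Use(S1 a, S3 b) => a.X + b.X;
}
```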
Also did not handle LCL_FLD, since this is "pre-order" and I believe that means we should only see GT_FIELD instead at this point in time.
New diffs. The TL;DR is more improvements, fewer regressions, and even better TP.

Windows x64

Windows Arm64
LGTM. Nice diffs!
/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn
Azure Pipelines successfully started running 6 pipeline(s).
Merging, given that this is a small and easily reverted change, that it has a very large positive impact, and that it has a small set of regressions. This will allow us to get concrete perf numbers as part of perf triage on Tuesday (x64) and Thursday (Arm64), which will let us know with more confidence the impact of the diffs.
The ability to make this change comes down to months of work from @SingleAccretion to teach the rest of the JIT about struct typed locals.
Given the breadth of the impact here, that this is a pure improvement for minopts (no diff plus a significant TP improvement), and that for fullopts the improved bytes are over 4x the regressed bytes, I think this is worth taking and getting in-depth perf numbers as part of perf triage, which will let us better understand where to focus efforts around handling said regressions (or, inversely, whether we should scope the fullopts scenario down).