
JIT: Avoid xchg for resolution #81216

Merged (6 commits) on Feb 1, 2023

Conversation

@jakobbotsch (Member) commented Jan 26, 2023

LSRA today uses xchg for reg-reg resolution when there are cycles in the resolution graph. Benchmarks show that xchg is handled much less efficiently by Intel CPUs than a few movs through a temporary register. This PR enables using temporary registers for this kind of resolution on xarch (it was already enabled for non-xarch architectures). xchg is still used on xarch if no temporary register is available.

Additionally, this PR adds support for getting a temporary register even for shared critical edges. Before this change we would spill on non-xarch for this case.

Finally, we now try to prefer a non-callee-saved temporary register so that we don't need to save/restore it in the prolog/epilog.

This mostly fixes the string hashcode regressions from #80743.

Size-wise regressions are expected.
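
As background, here is a minimal, self-contained C++ sketch of the two strategies for breaking a resolution cycle. The emitMov/emitSwap helpers and the resolveCycle function are hypothetical stand-ins for illustration, not the JIT's actual LSRA/emitter code:

#include <cstddef>
#include <cstdio>
#include <vector>

static void emitMov(int dst, int src) { std::printf("mov  r%d, r%d\n", dst, src); }
static void emitSwap(int a, int b)    { std::printf("xchg r%d, r%d\n", a, b); }

// 'cycle' lists registers whose values must rotate by one position:
// cycle[0] receives cycle[1]'s value, cycle[1] receives cycle[2]'s, and so on,
// with the last register receiving cycle[0]'s original value.
static void resolveCycle(const std::vector<int>& cycle, int tempReg /* -1 if none free */)
{
    if (tempReg != -1)
    {
        // With a temp: N+1 plain movs, which Intel CPUs can eliminate in register rename.
        emitMov(tempReg, cycle[0]);              // save one end of the cycle
        for (std::size_t i = 0; i + 1 < cycle.size(); i++)
            emitMov(cycle[i], cycle[i + 1]);     // shift values around the cycle
        emitMov(cycle.back(), tempReg);          // place the saved value in the last slot
    }
    else
    {
        // Without a temp: N-1 swaps (xchg on xarch).
        for (std::size_t i = 0; i + 1 < cycle.size(); i++)
            emitSwap(cycle[i], cycle[i + 1]);
    }
}

int main()
{
    resolveCycle({3, 5}, 1);  // two-register cycle with a temp: three movs
    resolveCycle({3, 5}, -1); // same cycle without a temp: one xchg
}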

@ghost ghost assigned jakobbotsch Jan 26, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 26, 2023
@ghost commented Jan 26, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@jakobbotsch (Member Author)

The typical diff looks like:

 G_M14819_IG09:        ; bbWeight=2, gcVars=0000000000000000 {}, gcrefRegs=00000028 {rbx rbp}, byrefRegs=00000000 {}, gcvars, byref
        ; gcrRegs -[rax] +[rbp]
-       xchg     rbx, rbp
+       mov      rcx, rbx
+       ; gcrRegs +[rcx]
+       mov      rbx, rbp
+       mov      rbp, rcx
        jmp      G_M14819_IG03
-       xchg     rdx, rsi
-       xchg     rsi, r11
-       xchg     r9, r11
-       xchg     r10, r11
+       mov      ebx, edx
+       mov      edx, esi
+       mov      esi, r11d
+       mov      r11d, r10d
+       mov      r10d, r9d
+       mov      r9d, ebx

May need to look at TP (JIT throughput), let's see.

@jakobbotsch (Member Author)

Example ARM64 diff:

-            str     x24, [fp, #0x38]   // [V06 loc3]
-            ; GC ptr vars +{V06}
+            mov     x0, x24
+            ; gcrRegs +[x0]
             mov     x24, x25
-            ldr     x25, [fp, #0x38]   // [V06 loc3]
+            mov     x25, x0
             bge     G_M1573_IG30

@tannergooding (Member)

@jakobbotsch, what's the perf impact on AMD machines?

uops.info reports that on Intel, xchg reg, reg is a 1-2 cycle latency instruction with 0.75 reciprocal throughput and that it takes 3 micro-ops on ports 0, 1, 5, or 6. This is true of Haswell all the way up through Alder Lake. mov reg, reg is reported as a 0-1 cycle latency instruction with a 0.25 reciprocal throughput (or better) that takes 1 micro-op.

However, for AMD it reports that xchg reg, reg is a 0 latency instruction with a 0.33 reciprocal throughput and that it takes 2 micro-ops. The AMD architecture manual likewise explicitly states that it is handled as a "zero cycle move" and completely processed via register renaming.

The concern for the AMD side would be that not using xchg reg, reg increases the size of the instruction stream (and therefore limits CPU throughput, since more decoding is required) and may limit or hurt codegen in other ways due to the increased register pressure required.

My expectation is that while this will likely have a positive, measurable impact on Intel, it may have a correspondingly negative impact on AMD.

@jakobbotsch (Member Author)

> @jakobbotsch, what's the perf impact on AMD machines?

For the benchmark I am looking at, the impact on my AMD CPU is small. Of course, I cannot say anything absolute about it. Here are the Intel results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22623.1180)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-alpha.1.23068.3
  [Host]     : .NET 8.0.0 (8.0.23.6702), X64 RyuJIT AVX2
  Job-HIDAOL : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-VNSHQC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method            | Job        | Toolchain                   | BytesCount | Mean         | Error      | StdDev     | Median       | Min          | Max          | Ratio | Allocated | Alloc Ratio |
|------------------ |----------- |---------------------------- |-----------:|-------------:|-----------:|-----------:|-------------:|-------------:|-------------:|------:|----------:|------------:|
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 10         | 4.637 ns     | 0.0532 ns  | 0.0498 ns  | 4.657 ns     | 4.497 ns     | 4.681 ns     | 0.97  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 10         | 4.780 ns     | 0.0293 ns  | 0.0260 ns  | 4.789 ns     | 4.694 ns     | 4.793 ns     | 1.00  | -         | NA          |
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 100        | 38.339 ns    | 0.1742 ns  | 0.1455 ns  | 38.389 ns    | 38.016 ns    | 38.464 ns    | 0.84  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 100        | 45.441 ns    | 0.1471 ns  | 0.1304 ns  | 45.480 ns    | 44.994 ns    | 45.506 ns    | 1.00  | -         | NA          |
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 1000       | 394.159 ns   | 0.1664 ns  | 0.1475 ns  | 394.169 ns   | 393.965 ns   | 394.396 ns   | 0.84  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 1000       | 471.804 ns   | 2.3978 ns  | 2.0022 ns  | 472.371 ns   | 465.152 ns   | 472.554 ns   | 1.00  | -         | NA          |
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 10000      | 3,917.174 ns | 3.8283 ns  | 2.9889 ns  | 3,916.814 ns | 3,909.882 ns | 3,922.292 ns | 0.83  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 10000      | 4,700.966 ns | 25.4986 ns | 23.8514 ns | 4,709.461 ns | 4,633.248 ns | 4,716.183 ns | 1.00  | -         | NA          |

Here are the AMD (5950X) results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19044.2486/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-alpha.1.23063.5
  [Host]     : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2
  Job-IIILXS : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-HOELUT : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method            | Job        | Toolchain                   | BytesCount | Mean         | Error      | StdDev     | Median       | Min          | Max          | Ratio | Allocated | Alloc Ratio |
|------------------ |----------- |---------------------------- |-----------:|-------------:|-----------:|-----------:|-------------:|-------------:|-------------:|------:|----------:|------------:|
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 10         | 2.708 ns     | 0.0099 ns  | 0.0083 ns  | 2.706 ns     | 2.696 ns     | 2.723 ns     | 0.98  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 10         | 2.761 ns     | 0.0041 ns  | 0.0037 ns  | 2.760 ns     | 2.756 ns     | 2.768 ns     | 1.00  | -         | NA          |
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 100        | 25.109 ns    | 0.0346 ns  | 0.0307 ns  | 25.099 ns    | 25.070 ns    | 25.169 ns    | 1.03  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 100        | 24.387 ns    | 0.1667 ns  | 0.1477 ns  | 24.327 ns    | 24.277 ns    | 24.736 ns    | 1.00  | -         | NA          |
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 1000       | 261.517 ns   | 0.5098 ns  | 0.4769 ns  | 261.551 ns   | 260.746 ns   | 262.328 ns   | 1.00  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 1000       | 260.590 ns   | 0.3396 ns  | 0.3011 ns  | 260.595 ns   | 260.010 ns   | 261.082 ns   | 1.00  | -         | NA          |
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 10000      | 2,614.609 ns | 3.2482 ns  | 2.7124 ns  | 2,613.862 ns | 2,612.070 ns | 2,622.050 ns | 0.98  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 10000      | 2,654.345 ns | 33.1288 ns | 30.9887 ns | 2,657.437 ns | 2,607.028 ns | 2,693.691 ns | 1.00  | -         | NA          |

> hurt codegen in other ways due to the increased register pressure required

There is no increased register pressure: at resolution time all registers have already been assigned, and we know whether we have free registers. There may be new prolog/epilog saves, though.

> My expectation is that while this will likely have a positive, measurable impact on Intel, it may have a correspondingly negative impact on AMD.

I think on average this has better perf characteristics for the reasons you gave, with no or a very small negative impact on AMD machines and a large positive impact on Intel CPUs.
Of course, ideally we would just emit the code depending on the host/target CPU, but I think that's a bit heavy-handed to introduce for this (small) part of the JIT.

Anyway, we have benchmarks that are impacted by this so I think we can get better data from there when/if we merge this, and reevaluate. I'm not against reverting this change if we don't find the tradeoff to be good.

@tannergooding (Member) commented Jan 26, 2023

Thanks for checking and sharing results.

I agree that this looks reasonable, and as long as it's not increasing register pressure and thereby leading to significantly worse codegen (such as from additional spilling), I think it's fine. -- It might be nice if LSRA could understand xchg as being special, so that a temporary is not required in the case where spilling would otherwise be required. It might likewise be nice if it understood that the register is indeed a temporary and therefore never requires preserving its value across GT_CALL sites, etc. (and so it can be taken from a larger pool without impacting the main codegen).

I also agree that it would be nice to be able to do per-architecture micro-optimizations, but it's a broader issue and something that would need to be done separately (and with consideration for the test matrix complexity increase).

@VSadov (Member) commented Jan 26, 2023

Can this use an XOR swap when there is no free register?

@VSadov (Member) commented Jan 26, 2023

Actually, the same article mentions that 3 XORs will not be faster than one xchg:

> Even if there is not any architectural register available to use, the XCHG instruction will be at least as fast as the three XORs taken together.
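
For reference, the three-XOR swap under discussion is the classic idiom sketched below (illustrative C++ only; the JIT does not emit this, and as the next comments note it would also be a problem for GC refs):

// Swap two values without a temporary by XOR-ing them three times.
void xorSwap(unsigned& a, unsigned& b)
{
    a ^= b; // a = a0 ^ b0
    b ^= a; // b = b0 ^ (a0 ^ b0) = a0
    a ^= b; // a = (a0 ^ b0) ^ a0 = b0
}
// Caveat: if a and b refer to the same object, the result is 0 rather than a no-op swap.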

@tannergooding (Member)

> Can this use an XOR swap when there is no free register?

Wouldn't that be problematic for gcrefs?

@VSadov (Member) commented Jan 26, 2023

> Wouldn't that be problematic for gcrefs?

Yes, it would be a problem.
Also, it looks like there is no chance for it to be faster than just xchg.

@jakobbotsch jakobbotsch marked this pull request as ready for review January 27, 2023 14:36
@jakobbotsch (Member Author)

cc @dotnet/jit-contrib PTAL @kunalspathak

This SPMI replay failure looks like the timeouts we have been dealing with recently.

regNumber LinearScan::getTempRegForResolution(BasicBlock*      fromBlock,
                                              BasicBlock*      toBlock,
                                              var_types        type,
                                              VARSET_VALARG_TP sharedCriticalLiveSet)
Member

Please add a comment about sharedCriticalLiveSet in the method summary.

Member Author

Whoops, added.

@@ -7448,6 +7475,11 @@ regNumber LinearScan::getTempRegForResolution(BasicBlock* fromBlock, BasicBlock*
    }
    else
    {
        if ((freeRegs & RBM_CALLEE_TRASH) != 0)
Member

Can you add a comment saying that we prefer callee-trash registers among the available free temp registers, to avoid prolog/epilog save/restore?

Member Author

Done.
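
For context, the preference being documented boils down to something like the following paraphrase (hypothetical RegMask alias and function name, not the exact lsra.cpp code):

#include <cstdint>

using RegMask = std::uint64_t;

// If any callee-trash (caller-saved) register is free, restrict the candidates to those,
// so the chosen temp never forces an extra callee-saved save/restore in the prolog/epilog;
// otherwise fall back to whatever free registers remain.
RegMask preferCalleeTrash(RegMask freeRegs, RegMask calleeTrashMask)
{
    RegMask trash = freeRegs & calleeTrashMask;
    return (trash != 0) ? trash : freeRegs;
}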

@@ -8486,18 +8514,16 @@ void LinearScan::resolveEdge(BasicBlock* fromBlock,
             tempReg = tempRegFlt;
         }
 #ifdef TARGET_XARCH
-        else
+        else if (tempRegInt == REG_NA)
Member

I didn't follow this change. What is this about?

Member Author

Previously we would always use xchg/swap. Now we only do that if we weren't able to get a temporary register.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 28, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 28, 2023
@jakobbotsch (Member Author)

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@jakobbotsch (Member Author)

Looks like I have some Fuzzlyn failures to look at.

@jakobbotsch (Member Author)

The problem seems to be that getTempRegForResolution does not take registers consumed by the terminating node into account, e.g.:

             bl      Program:M58(S0,byte,I0):bool
-            str     w19, [fp, #0xD1FFAB1E]	// [V13 loc10]
+            mov     w0, w19
             mov     x19, x21
-            ldr     w21, [fp, #0xD1FFAB1E]	// [V13 loc10]
+            mov     w21, w0
             cbz     w0, G_M62392_IG163
             stp     xzr, xzr, [fp, #0xD1FFAB1E]
             stp     xzr, xzr, [fp, #0xD1FFAB1E]
@@ -3395,7 +3395,7 @@ G_M62392_IG162:
             mov     x0, x20
             mov     x2, xzr
             blr     x3
-						;; size=380 bbWeight=1 PerfScore 134.00
+						;; size=380 bbWeight=1 PerfScore 132.00
 G_M62392_IG163:
             ldr     x0, [fp, #0x10]	// [V528 cse0]
             ldr     x20, [x0, #0x10]
@@ -3725,7 +3725,7 @@ G_M62392_IG171:
             ret     lr
 						;; size=16 bbWeight=1 PerfScore 10.00

Note that we pick w0 as the temporary register, but it is used in the upcoming cbz.

Not sure why this would only be a problem after this PR. I'll dig a little deeper and push a fix.
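
The eventual fix threads the terminator's consumed registers into the temp-register selection; the idea is roughly the following (hypothetical RegMask alias and helper name, not the actual getTempRegForResolution code):

#include <cstdint>

using RegMask = std::uint64_t;

// Registers consumed by the block-ending node (the cbz operand w0 above) must be excluded
// from the temp-register candidates, because the resolution moves are inserted before that
// node executes and would otherwise clobber its input.
RegMask resolutionTempCandidates(RegMask freeRegs, RegMask terminatorConsumedRegs)
{
    return freeRegs & ~terminatorConsumedRegs;
}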

@jakobbotsch (Member Author)

/azp run runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch (Member Author)

/azp run runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

            assert(reg != REG_NA);
            if (reg != REG_STK)
            {
                freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType);
Member

Suggested change:
-                freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType);
+                freeRegs &= ~genRegMask(reg ARM_ARG(getIntervalForLocalVar(varIndex)->registerType));

// resolveType            - the type of resolution to be performed
// liveSet                - the set of tracked lclVar indices which may require resolution
// terminatorConsumedRegs - the registers consumed by the terminator node.
//                          These registers will be used after any resolution added to 'fromBlock'.
Member

Suggested change:
-//                          These registers will be used after any resolution added to 'fromBlock'.
+//                          These registers will be used after any resolution added at the end of the 'fromBlock'.

Probably you meant this, but I'm not sure I fully understand it. Inside getTempRegForResolution we say that they are not free to be used as temp registers, but why is that the case? They are consumed already and it should be fine to use them, right?

Member Author

Yes, I meant that. Will update.

> They are consumed already and it should be fine to use them, right?

No, resolution nodes are added before the terminating node and any registers it uses. The consumed registers are computed here:

// Next, if this blocks ends with a switch table, or for Arm64, ends with JCMP instruction,
// make sure to not copy into the registers that are consumed at the end of this block.
//
// Note: Only switches and JCMP (for Arm4) have input regs (and so can be fed by copies), so those
// are the only block-ending branches that need special handling.
regMaskTP consumedRegs = RBM_NONE;
if (block->bbJumpKind == BBJ_SWITCH)
{
    // At this point, Lowering has transformed any non-switch-table blocks into
    // cascading ifs.
    GenTree* switchTable = LIR::AsRange(block).LastNode();
    assert(switchTable != nullptr && switchTable->OperGet() == GT_SWITCH_TABLE);

    consumedRegs = switchTable->gtRsvdRegs;
    GenTree* op1 = switchTable->gtGetOp1();
    GenTree* op2 = switchTable->gtGetOp2();
    noway_assert(op1 != nullptr && op2 != nullptr);
    assert(op1->GetRegNum() != REG_NA && op2->GetRegNum() != REG_NA);

    // No floating point values, so no need to worry about the register type
    // (i.e. for ARM32, where we used the genRegMask overload with a type).
    assert(varTypeIsIntegralOrI(op1) && varTypeIsIntegralOrI(op2));
    consumedRegs |= genRegMask(op1->GetRegNum());
    consumedRegs |= genRegMask(op2->GetRegNum());

    // Special handling for GT_COPY to not resolve into the source
    // of switch's operand.
    if (op1->OperIs(GT_COPY))
    {
        GenTree* srcOp1 = op1->gtGetOp1();
        consumedRegs |= genRegMask(srcOp1->GetRegNum());
    }
}

#if defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64)
// Next, if this blocks ends with a JCMP, we have to make sure:
// 1. Not to copy into the register that JCMP uses
//    e.g. JCMP w21, BRANCH
// 2. Not to copy into the source of JCMP's operand before it is consumed
//    e.g. Should not use w0 since it will contain wrong value after resolution
//         call METHOD
//         ; mov w0, w19  <-- should not resolve in w0 here.
//         mov w21, w0
//         JCMP w21, BRANCH
// 3. Not to modify the local variable it must consume
// Note: GT_COPY has special handling in codegen and its generation is merged with the
// node that consumes its result. So both, the input and output regs of GT_COPY must be
// excluded from the set available for resolution.
LclVarDsc* jcmpLocalVarDsc = nullptr;
if (block->bbJumpKind == BBJ_COND)
{
    GenTree* lastNode = LIR::AsRange(block).LastNode();

    if (lastNode->OperIs(GT_JCMP))
    {
        GenTree* op1 = lastNode->gtGetOp1();
        consumedRegs |= genRegMask(op1->GetRegNum());

        if (op1->OperIs(GT_COPY))
        {
            GenTree* srcOp1 = op1->gtGetOp1();
            consumedRegs |= genRegMask(srcOp1->GetRegNum());
        }

        if (op1->IsLocal())
        {
            GenTreeLclVarCommon* lcl = op1->AsLclVarCommon();
            jcmpLocalVarDsc = &compiler->lvaTable[lcl->GetLclNum()];
        }
    }
}
#endif // defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64)

Member

> resolution nodes are added before the terminating node and any registers it uses.

Of course, that's right. I was thinking of something else. But then how did it work so far? Wasn't this always a problem?

Member Author

I'm not totally sure how it worked previously either. If I had to guess, the problematic cases (i.e. switch/JCMP blocks) always end up doing resolution in the target blocks (except for the shared critical edge cases, which were skipped before). I think that would make sense, since these kinds of blocks should always have 2 or more successors.

Member

That could be the case. So were you hitting an assert or something when you added support for shared critical edges, and that made you consider the terminatorConsumedRegs?

Member Author

No, it was a Fuzzlyn test case; here was the codegen for that case: #81216 (comment)

Member

> No, it was a Fuzzlyn test case; here was the codegen for that case: #81216 (comment)

Ah, I missed that comment and findings. That makes sense then.

Member

Can you add a test case for that as well?

Member Author

The test case is massive; I think the weekly Antigen/Fuzzlyn runs should be OK to cover it.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 31, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 31, 2023
@jakobbotsch (Member Author)

The Fuzzlyn failures are #81356.
The libraries-jitstress failures are #81366, plus a timeout in a run where the tests all succeeded according to the log, so I will disregard that.

@jakobbotsch (Member Author)

/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch (Member Author) commented Jan 31, 2023

The libraries-jitstressregs failures all look like #81366. Will rerun the failing installer job.

(Edit: It succeeded on a rerun)
