-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Avoid xchg for resolution #81216
JIT: Avoid xchg for resolution #81216
Conversation
LSRA today uses xchg for reg-reg resolution when there are cycles in the resolution graph. Benchmarks show that xchg is handled much less efficiently by Intel CPUs than using a few movs with a temporary register. This PR enables using temporary registers for this kind of resolution on xarch (it was already enabled for non-xarch architectures). xchg is still used on xarch if no temporary register is available. Additionally this PR adds support for getting a temporary register even for shared critical edges. Before this change we would spill on non-xarch for this case. Finally, we now try to prefer a non callee saved temporary register so that we don't need to save/restore it in prolog/epilog. This mostly fixes the string hashcode regressions from dotnet#80743.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak Issue DetailsLSRA today uses xchg for reg-reg resolution when there are cycles in the resolution graph. Benchmarks show that xchg is handled much less efficiently by Intel CPUs than using a few movs with a temporary register. This PR enables using temporary registers for this kind of resolution on xarch (it was already enabled for non-xarch architectures). xchg is still used on xarch if no temporary register is available. Additionally this PR adds support for getting a temporary register even for shared critical edges. Before this change we would spill on non-xarch for this case. Finally, we now try to prefer a non callee saved temporary register so that we don't need to save/restore it in prolog/epilog. This mostly fixes the string hashcode regressions from #80743.
|
The typical diff looks like: G_M14819_IG09: ; bbWeight=2, gcVars=0000000000000000 {}, gcrefRegs=00000028 {rbx rbp}, byrefRegs=00000000 {}, gcvars, byref
; gcrRegs -[rax] +[rbp]
- xchg rbx, rbp
+ mov rcx, rbx
+ ; gcrRegs +[rcx]
+ mov rbx, rbp
+ mov rbp, rcx
jmp G_M14819_IG03 - xchg rdx, rsi
- xchg rsi, r11
- xchg r9, r11
- xchg r10, r11
+ mov ebx, edx
+ mov edx, esi
+ mov esi, r11d
+ mov r11d, r10d
+ mov r10d, r9d
+ mov r9d, ebx May need to look at TP, let's see. |
Example ARM64 diff: - str x24, [fp, #0x38] // [V06 loc3]
- ; GC ptr vars +{V06}
+ mov x0, x24
+ ; gcrRegs +[x0]
mov x24, x25
- ldr x25, [fp, #0x38] // [V06 loc3]
+ mov x25, x0
bge G_M1573_IG30 |
@jakobbotsch, what's the perf impact to AMD machines? uops.info reports that on Intel However, for AMD it reports that The concern for the AMD side would be that not using My expectation is that while this will likely have a positive measurable impact to Intel, it may have an inversely negative measurable impact to AMD. |
For the benchmark I am looking at the impact on my AMD CPU is small. Of course I cannot say anything absolute about it. Here's the Intel results: BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22623.1180)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-alpha.1.23068.3
[Host] : .NET 8.0.0 (8.0.23.6702), X64 RyuJIT AVX2
Job-HIDAOL : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-VNSHQC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32 PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Here's the AMD (5950X) results: BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19044.2486/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-alpha.1.23063.5
[Host] : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2
Job-IIILXS : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-HOELUT : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32 PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
There is no increased register pressure, at resolution time all registers have already been assigned and we know if we have free registers. There may be new prolog/epilog saves though.
I think on average this has better perf characteristics due to what you said, with no or very small negative impact on AMD machines and large positive impact on Intel CPUs. Anyway, we have benchmarks that are impacted by this so I think we can get better data from there when/if we merge this, and reevaluate. I'm not against reverting this change if we don't find the tradeoff to be good. |
Thanks for checking and sharing results. I agree that this looks reasonable and as long as its not increasing register pressure and therefore leading to significantly worse codegen (such as due to additional spilling required/etc), I think it's fine. -- It might be nice if LSRA could understand I also agree that it would be nice to be able to do per-architecture micro-optimizations, but its a broader issue and something that would need to be done separately (and with consideration to the test matrix complexity increase). |
can this use XOR swap when no free register? |
Actually the same article mentions that 3 xors will not be faster than on xchg
|
Wouldn't that be problematic for gcrefs? |
yes, it would be a problem. |
cc @dotnet/jit-contrib PTAL @kunalspathak This SPMI replay failure looks like the timeouts we have been dealing with recently. |
src/coreclr/jit/lsra.cpp
Outdated
regNumber LinearScan::getTempRegForResolution(BasicBlock* fromBlock, | ||
BasicBlock* toBlock, | ||
var_types type, | ||
VARSET_VALARG_TP sharedCriticalLiveSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment about sharedCriticalLiveSet
in method summary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, added.
@@ -7448,6 +7475,11 @@ regNumber LinearScan::getTempRegForResolution(BasicBlock* fromBlock, BasicBlock* | |||
} | |||
else | |||
{ | |||
if ((freeRegs & RBM_CALLEE_TRASH) != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment saying we prefer callee-trash
registers among the available free temp registers to avoid prolog/epilog save/restore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -8486,18 +8514,16 @@ void LinearScan::resolveEdge(BasicBlock* fromBlock, | |||
tempReg = tempRegFlt; | |||
} | |||
#ifdef TARGET_XARCH | |||
else | |||
else if (tempRegInt == REG_NA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't follow this change. What is this about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously we would always use xchg
/swap. Now we only do that if we weren't able to get a temporary register.
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 4 pipeline(s). |
Looks like I have some Fuzzlyn failures to look at. |
The problem seems to be that bl Program:M58(S0,byte,I0):bool
- str w19, [fp, #0xD1FFAB1E] // [V13 loc10]
+ mov w0, w19
mov x19, x21
- ldr w21, [fp, #0xD1FFAB1E] // [V13 loc10]
+ mov w21, w0
cbz w0, G_M62392_IG163
stp xzr, xzr, [fp, #0xD1FFAB1E]
stp xzr, xzr, [fp, #0xD1FFAB1E]
@@ -3395,7 +3395,7 @@ G_M62392_IG162:
mov x0, x20
mov x2, xzr
blr x3
- ;; size=380 bbWeight=1 PerfScore 134.00
+ ;; size=380 bbWeight=1 PerfScore 132.00
G_M62392_IG163:
ldr x0, [fp, #0x10] // [V528 cse0]
ldr x20, [x0, #0x10]
@@ -3725,7 +3725,7 @@ G_M62392_IG171:
ret lr
;; size=16 bbWeight=1 PerfScore 10.00 Note that we pick Not sure why this would only be a problem after this PR. I'll dig a little deeper and push a fix. |
/azp run runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run runtime-coreclr libraries-jitstress, Fuzzlyn |
Azure Pipelines successfully started running 2 pipeline(s). |
src/coreclr/jit/lsra.cpp
Outdated
assert(reg != REG_NA); | ||
if (reg != REG_STK) | ||
{ | ||
freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType); | |
freeRegs &= ~genRegMask(reg ARM_ARG( getIntervalForLocalVar(varIndex)->registerType)); |
src/coreclr/jit/lsra.cpp
Outdated
// resolveType - the type of resolution to be performed | ||
// liveSet - the set of tracked lclVar indices which may require resolution | ||
// terminatorConsumedRegs - the registers consumed by the terminator node. | ||
// These registers will be used after any resolution added to 'fromBlock'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// These registers will be used after any resolution added to 'fromBlock'. | |
// These registers will be used after any resolution added at the end of the 'fromBlock'. |
Probably you meant this. But not sure if I fully understand this. Inside getTempRegForResolution0
we say that they are not free to be used as temp registers, but why is that the case? They are consumed already and should be fine to use them, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I meant that. Will update.
They are consumed already and should be fine to use them, right?
No, resolution nodes are added before the terminating node and any registers it uses. The consumed registers are computed here:
runtime/src/coreclr/jit/lsra.cpp
Lines 7655 to 7726 in f1bdd5a
// Next, if this blocks ends with a switch table, or for Arm64, ends with JCMP instruction, | |
// make sure to not copy into the registers that are consumed at the end of this block. | |
// | |
// Note: Only switches and JCMP (for Arm4) have input regs (and so can be fed by copies), so those | |
// are the only block-ending branches that need special handling. | |
regMaskTP consumedRegs = RBM_NONE; | |
if (block->bbJumpKind == BBJ_SWITCH) | |
{ | |
// At this point, Lowering has transformed any non-switch-table blocks into | |
// cascading ifs. | |
GenTree* switchTable = LIR::AsRange(block).LastNode(); | |
assert(switchTable != nullptr && switchTable->OperGet() == GT_SWITCH_TABLE); | |
consumedRegs = switchTable->gtRsvdRegs; | |
GenTree* op1 = switchTable->gtGetOp1(); | |
GenTree* op2 = switchTable->gtGetOp2(); | |
noway_assert(op1 != nullptr && op2 != nullptr); | |
assert(op1->GetRegNum() != REG_NA && op2->GetRegNum() != REG_NA); | |
// No floating point values, so no need to worry about the register type | |
// (i.e. for ARM32, where we used the genRegMask overload with a type). | |
assert(varTypeIsIntegralOrI(op1) && varTypeIsIntegralOrI(op2)); | |
consumedRegs |= genRegMask(op1->GetRegNum()); | |
consumedRegs |= genRegMask(op2->GetRegNum()); | |
// Special handling for GT_COPY to not resolve into the source | |
// of switch's operand. | |
if (op1->OperIs(GT_COPY)) | |
{ | |
GenTree* srcOp1 = op1->gtGetOp1(); | |
consumedRegs |= genRegMask(srcOp1->GetRegNum()); | |
} | |
} | |
#if defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) | |
// Next, if this blocks ends with a JCMP, we have to make sure: | |
// 1. Not to copy into the register that JCMP uses | |
// e.g. JCMP w21, BRANCH | |
// 2. Not to copy into the source of JCMP's operand before it is consumed | |
// e.g. Should not use w0 since it will contain wrong value after resolution | |
// call METHOD | |
// ; mov w0, w19 <-- should not resolve in w0 here. | |
// mov w21, w0 | |
// JCMP w21, BRANCH | |
// 3. Not to modify the local variable it must consume | |
// Note: GT_COPY has special handling in codegen and its generation is merged with the | |
// node that consumes its result. So both, the input and output regs of GT_COPY must be | |
// excluded from the set available for resolution. | |
LclVarDsc* jcmpLocalVarDsc = nullptr; | |
if (block->bbJumpKind == BBJ_COND) | |
{ | |
GenTree* lastNode = LIR::AsRange(block).LastNode(); | |
if (lastNode->OperIs(GT_JCMP)) | |
{ | |
GenTree* op1 = lastNode->gtGetOp1(); | |
consumedRegs |= genRegMask(op1->GetRegNum()); | |
if (op1->OperIs(GT_COPY)) | |
{ | |
GenTree* srcOp1 = op1->gtGetOp1(); | |
consumedRegs |= genRegMask(srcOp1->GetRegNum()); | |
} | |
if (op1->IsLocal()) | |
{ | |
GenTreeLclVarCommon* lcl = op1->AsLclVarCommon(); | |
jcmpLocalVarDsc = &compiler->lvaTable[lcl->GetLclNum()]; | |
} | |
} | |
} | |
#endif // defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolution nodes are added before the terminating node and any registers it uses.
Of course, that's right. I was thinking something else. But then how did it work so far? Wasn't this problem always?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally sure how it worked previously either. If I had to guess the problematic cases (i.e. switches/JCMP blocks) always end up doing resolution in the target blocks (except for the shared critical edges cases, which was skipped before). I think that would make sense since these kinds of blocks should always have 2 or more successors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That could be the case. So were you hitting assert or something when you added support for shared critical edges and that made you consider the terminatorConsumeRegs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it was a Fuzzlyn test case, here was the codegen for that case: #81216 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it was a Fuzzlyn test case, here was the codegen for that case: #81216 (comment)
Ah, I missed that comment and findings. That makes sense then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case for that as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case is massive, I think weekly Antigen/Fuzzlyn runs should be ok to cover it.
/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs |
Azure Pipelines successfully started running 2 pipeline(s). |
The libraries-jitstressregs failures all look like #81366. Will rerun the failing installer job. (Edit: It succeeded on a rerun) |
LSRA today uses xchg for reg-reg resolution when there are cycles in the resolution graph. Benchmarks show that xchg is handled much less efficiently by Intel CPUs than using a few movs with a temporary register. This PR enables using temporary registers for this kind of resolution on xarch (it was already enabled for non-xarch architectures). xchg is still used on xarch if no temporary register is available.
Additionally this PR adds support for getting a temporary register even for shared critical edges. Before this change we would spill on non-xarch for this case.
Finally, we now try to prefer a non callee saved temporary register so that we don't need to save/restore it in prolog/epilog.
This mostly fixes the string hashcode regressions from #80743.
Size-wise regressions are expected.