
JIT: Avoid xchg for resolution #81216

Merged (6 commits) on Feb 1, 2023

Conversation

@jakobbotsch (Member) commented Jan 26, 2023

LSRA today uses xchg for reg-reg resolution when there are cycles in the resolution graph. Benchmarks show that xchg is handled much less efficiently by Intel CPUs than a few movs through a temporary register. This PR enables using temporary registers for this kind of resolution on xarch (it was already enabled for non-xarch architectures). xchg is still used on xarch if no temporary register is available.

Additionally, this PR adds support for getting a temporary register even for shared critical edges. Before this change we would spill on non-xarch for this case.

Finally, we now try to prefer a non-callee-saved temporary register so that we don't need to save/restore it in the prolog/epilog.

This mostly fixes the string hashcode regressions from #80743.

Size-wise regressions are expected.
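
As background, here is a minimal, self-contained C++ sketch of the two strategies for breaking a resolution cycle. The emitMov/emitSwap helpers and the resolveCycle function are hypothetical stand-ins for illustration, not the JIT's actual LSRA/emitter code:

#include <cstddef>
#include <cstdio>
#include <vector>

static void emitMov(int dst, int src) { std::printf("mov  r%d, r%d\n", dst, src); }
static void emitSwap(int a, int b)    { std::printf("xchg r%d, r%d\n", a, b); }

// 'cycle' lists registers whose values must rotate by one position:
// cycle[0] receives cycle[1]'s value, cycle[1] receives cycle[2]'s, and so on,
// with the last register receiving cycle[0]'s original value.
static void resolveCycle(const std::vector<int>& cycle, int tempReg /* -1 if none free */)
{
    if (tempReg != -1)
    {
        // With a temp: N+1 plain movs, which Intel CPUs can eliminate in register rename.
        emitMov(tempReg, cycle[0]);              // save one end of the cycle
        for (std::size_t i = 0; i + 1 < cycle.size(); i++)
            emitMov(cycle[i], cycle[i + 1]);     // shift values around the cycle
        emitMov(cycle.back(), tempReg);          // place the saved value in the last slot
    }
    else
    {
        // Without a temp: N-1 swaps (xchg on xarch).
        for (std::size_t i = 0; i + 1 < cycle.size(); i++)
            emitSwap(cycle[i], cycle[i + 1]);
    }
}

int main()
{
    resolveCycle({3, 5}, 1);  // two-register cycle with a temp: three movs
    resolveCycle({3, 5}, -1); // same cycle without a temp: one xchg
}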

@ghost ghost assigned jakobbotsch Jan 26, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 26, 2023
@ghost commented Jan 26, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@jakobbotsch (Member Author)

The typical diff looks like:

 G_M14819_IG09:        ; bbWeight=2, gcVars=0000000000000000 {}, gcrefRegs=00000028 {rbx rbp}, byrefRegs=00000000 {}, gcvars, byref
        ; gcrRegs -[rax] +[rbp]
-       xchg     rbx, rbp
+       mov      rcx, rbx
+       ; gcrRegs +[rcx]
+       mov      rbx, rbp
+       mov      rbp, rcx
        jmp      G_M14819_IG03
-       xchg     rdx, rsi
-       xchg     rsi, r11
-       xchg     r9, r11
-       xchg     r10, r11
+       mov      ebx, edx
+       mov      edx, esi
+       mov      esi, r11d
+       mov      r11d, r10d
+       mov      r10d, r9d
+       mov      r9d, ebx

May need to look at TP (JIT throughput), let's see.

@jakobbotsch (Member Author)

Example ARM64 diff:

-            str     x24, [fp, #0x38]   // [V06 loc3]
-            ; GC ptr vars +{V06}
+            mov     x0, x24
+            ; gcrRegs +[x0]
             mov     x24, x25
-            ldr     x25, [fp, #0x38]   // [V06 loc3]
+            mov     x25, x0
             bge     G_M1573_IG30

@tannergooding (Member)

@jakobbotsch, what's the perf impact on AMD machines?

uops.info reports that on Intel, xchg reg, reg is a 1-2 cycle latency instruction with 0.75 reciprocal throughput and that it takes 3 micro-ops on ports 0, 1, 5, or 6. This is true of Haswell all the way up through Alder Lake. mov reg, reg is reported as a 0-1 cycle latency instruction with a 0.25 reciprocal throughput (or better) that takes 1 micro-op.

However, for AMD it reports that xchg reg, reg is a 0 latency instruction with a 0.33 reciprocal throughput and that it takes 2 micro-ops. The AMD architecture manual likewise explicitly states that it is handled as a "zero cycle move" and completely processed via register renaming.

The concern for the AMD side would be that not using xchg reg, reg increases the size of the instruction stream (and therefore limits CPU throughput, since more decoding is required) and may limit or hurt codegen in other ways due to the increased register pressure required.

My expectation is that while this will likely have a positive, measurable impact on Intel, it may have a correspondingly negative impact on AMD.

@jakobbotsch (Member Author)

> @jakobbotsch, what's the perf impact on AMD machines?

For the benchmark I am looking at, the impact on my AMD CPU is small. Of course, I cannot say anything absolute about it. Here are the Intel results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22623.1180)
Intel Core i9-10885H CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-alpha.1.23068.3
  [Host]     : .NET 8.0.0 (8.0.23.6702), X64 RyuJIT AVX2
  Job-HIDAOL : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-VNSHQC : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method            | Job        | Toolchain                   | BytesCount | Mean         | Error      | StdDev     | Median       | Min          | Max          | Ratio | Allocated | Alloc Ratio |
|------------------ |----------- |---------------------------- |-----------:|-------------:|-----------:|-----------:|-------------:|-------------:|-------------:|------:|----------:|------------:|
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 10         | 4.637 ns     | 0.0532 ns  | 0.0498 ns  | 4.657 ns     | 4.497 ns     | 4.681 ns     | 0.97  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 10         | 4.780 ns     | 0.0293 ns  | 0.0260 ns  | 4.789 ns     | 4.694 ns     | 4.793 ns     | 1.00  | -         | NA          |
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 100        | 38.339 ns    | 0.1742 ns  | 0.1455 ns  | 38.389 ns    | 38.016 ns    | 38.464 ns    | 0.84  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 100        | 45.441 ns    | 0.1471 ns  | 0.1304 ns  | 45.480 ns    | 44.994 ns    | 45.506 ns    | 1.00  | -         | NA          |
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 1000       | 394.159 ns   | 0.1664 ns  | 0.1475 ns  | 394.169 ns   | 393.965 ns   | 394.396 ns   | 0.84  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 1000       | 471.804 ns   | 2.3978 ns  | 2.0022 ns  | 472.371 ns   | 465.152 ns   | 472.554 ns   | 1.00  | -         | NA          |
| GetStringHashCode | Job-HIDAOL | \Core_Root\corerun.exe      | 10000      | 3,917.174 ns | 3.8283 ns  | 2.9889 ns  | 3,916.814 ns | 3,909.882 ns | 3,922.292 ns | 0.83  | -         | NA          |
| GetStringHashCode | Job-VNSHQC | \Core_Root_base\corerun.exe | 10000      | 4,700.966 ns | 25.4986 ns | 23.8514 ns | 4,709.461 ns | 4,633.248 ns | 4,716.183 ns | 1.00  | -         | NA          |

Here are the AMD (5950X) results:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 10 (10.0.19044.2486/21H2/November2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-alpha.1.23063.5
  [Host]     : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2
  Job-IIILXS : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-HOELUT : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

EnvironmentVariables=DOTNET_JitDisasm=ComputeHash32  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method            | Job        | Toolchain                   | BytesCount | Mean         | Error      | StdDev     | Median       | Min          | Max          | Ratio | Allocated | Alloc Ratio |
|------------------ |----------- |---------------------------- |-----------:|-------------:|-----------:|-----------:|-------------:|-------------:|-------------:|------:|----------:|------------:|
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 10         | 2.708 ns     | 0.0099 ns  | 0.0083 ns  | 2.706 ns     | 2.696 ns     | 2.723 ns     | 0.98  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 10         | 2.761 ns     | 0.0041 ns  | 0.0037 ns  | 2.760 ns     | 2.756 ns     | 2.768 ns     | 1.00  | -         | NA          |
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 100        | 25.109 ns    | 0.0346 ns  | 0.0307 ns  | 25.099 ns    | 25.070 ns    | 25.169 ns    | 1.03  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 100        | 24.387 ns    | 0.1667 ns  | 0.1477 ns  | 24.327 ns    | 24.277 ns    | 24.736 ns    | 1.00  | -         | NA          |
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 1000       | 261.517 ns   | 0.5098 ns  | 0.4769 ns  | 261.551 ns   | 260.746 ns   | 262.328 ns   | 1.00  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 1000       | 260.590 ns   | 0.3396 ns  | 0.3011 ns  | 260.595 ns   | 260.010 ns   | 261.082 ns   | 1.00  | -         | NA          |
| GetStringHashCode | Job-IIILXS | \Core_Root\corerun.exe      | 10000      | 2,614.609 ns | 3.2482 ns  | 2.7124 ns  | 2,613.862 ns | 2,612.070 ns | 2,622.050 ns | 0.98  | -         | NA          |
| GetStringHashCode | Job-HOELUT | \Core_Root_base\corerun.exe | 10000      | 2,654.345 ns | 33.1288 ns | 30.9887 ns | 2,657.437 ns | 2,607.028 ns | 2,693.691 ns | 1.00  | -         | NA          |

> hurt codegen in other ways due to the increased register pressure required

There is no increased register pressure: at resolution time all registers have already been assigned, and we know whether we have free registers. There may be new prolog/epilog saves, though.

> My expectation is that while this will likely have a positive, measurable impact on Intel, it may have a correspondingly negative impact on AMD.

I think on average this has better perf characteristics for the reasons you gave, with no or a very small negative impact on AMD machines and a large positive impact on Intel CPUs.
Of course, ideally we would just emit the code depending on the host/target CPU, but I think that's a bit heavy-handed to introduce for this (small) part of the JIT.

Anyway, we have benchmarks that are impacted by this so I think we can get better data from there when/if we merge this, and reevaluate. I'm not against reverting this change if we don't find the tradeoff to be good.

@tannergooding (Member) commented Jan 26, 2023

Thanks for checking and sharing results.

I agree that this looks reasonable, and as long as it's not increasing register pressure and thereby leading to significantly worse codegen (such as from additional spilling), I think it's fine. -- It might be nice if LSRA could understand xchg as being special, so that a temporary is not required in the case where spilling would otherwise be required. It might likewise be nice if it understood that the register is indeed a temporary and therefore never requires preserving its value across GT_CALL sites, etc. (and so it can be taken from a larger pool without impacting the main codegen).

I also agree that it would be nice to be able to do per-architecture micro-optimizations, but it's a broader issue and something that would need to be done separately (and with consideration for the test matrix complexity increase).

@VSadov (Member) commented Jan 26, 2023

Can this use an XOR swap when there is no free register?

@VSadov (Member) commented Jan 26, 2023

Actually, the same article mentions that 3 XORs will not be faster than one xchg:

> Even if there is not any architectural register available to use, the XCHG instruction will be at least as fast as the three XORs taken together.
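
For reference, the three-XOR swap under discussion is the classic idiom sketched below (illustrative C++ only; the JIT does not emit this, and as the next comments note it would also be a problem for GC refs):

// Swap two values without a temporary by XOR-ing them three times.
void xorSwap(unsigned& a, unsigned& b)
{
    a ^= b; // a = a0 ^ b0
    b ^= a; // b = b0 ^ (a0 ^ b0) = a0
    a ^= b; // a = (a0 ^ b0) ^ a0 = b0
}
// Caveat: if a and b refer to the same object, the result is 0 rather than a no-op swap.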

@tannergooding (Member)

> Can this use an XOR swap when there is no free register?

Wouldn't that be problematic for gcrefs?

@VSadov (Member) commented Jan 26, 2023

> Wouldn't that be problematic for gcrefs?

Yes, it would be a problem.
Also, it looks like there is no chance for it to be faster than just xchg.

@jakobbotsch jakobbotsch marked this pull request as ready for review January 27, 2023 14:36
@jakobbotsch (Member Author)

cc @dotnet/jit-contrib PTAL @kunalspathak

This SPMI replay failure looks like the timeouts we have been dealing with recently.

regNumber LinearScan::getTempRegForResolution(BasicBlock*      fromBlock,
                                              BasicBlock*      toBlock,
                                              var_types        type,
                                              VARSET_VALARG_TP sharedCriticalLiveSet)
Member

Please add a comment about sharedCriticalLiveSet in the method summary.

Member Author

Whoops, added.

@@ -7448,6 +7475,11 @@ regNumber LinearScan::getTempRegForResolution(BasicBlock* fromBlock, BasicBlock*
    }
    else
    {
        if ((freeRegs & RBM_CALLEE_TRASH) != 0)
Member

Can you add a comment saying that we prefer callee-trash registers among the available free temp registers, to avoid prolog/epilog save/restore?

Member Author

Done.
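
For context, the preference being documented boils down to something like the following paraphrase (hypothetical RegMask alias and function name, not the exact lsra.cpp code):

#include <cstdint>

using RegMask = std::uint64_t;

// If any callee-trash (caller-saved) register is free, restrict the candidates to those,
// so the chosen temp never forces an extra callee-saved save/restore in the prolog/epilog;
// otherwise fall back to whatever free registers remain.
RegMask preferCalleeTrash(RegMask freeRegs, RegMask calleeTrashMask)
{
    RegMask trash = freeRegs & calleeTrashMask;
    return (trash != 0) ? trash : freeRegs;
}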

@@ -8486,18 +8514,16 @@ void LinearScan::resolveEdge(BasicBlock* fromBlock,
             tempReg = tempRegFlt;
         }
 #ifdef TARGET_XARCH
-        else
+        else if (tempRegInt == REG_NA)
Member

I didn't follow this change. What is this about?

Member Author

Previously we would always use xchg/swap. Now we only do that if we weren't able to get a temporary register.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 28, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 28, 2023
@jakobbotsch (Member Author)

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 4 pipeline(s).

@jakobbotsch (Member Author)

Looks like I have some Fuzzlyn failures to look at.

@jakobbotsch (Member Author)

The problem seems to be that getTempRegForResolution does not take registers consumed by the terminating node into account, e.g.:

             bl      Program:M58(S0,byte,I0):bool
-            str     w19, [fp, #0xD1FFAB1E]	// [V13 loc10]
+            mov     w0, w19
             mov     x19, x21
-            ldr     w21, [fp, #0xD1FFAB1E]	// [V13 loc10]
+            mov     w21, w0
             cbz     w0, G_M62392_IG163
             stp     xzr, xzr, [fp, #0xD1FFAB1E]
             stp     xzr, xzr, [fp, #0xD1FFAB1E]
@@ -3395,7 +3395,7 @@ G_M62392_IG162:
             mov     x0, x20
             mov     x2, xzr
             blr     x3
-						;; size=380 bbWeight=1 PerfScore 134.00
+						;; size=380 bbWeight=1 PerfScore 132.00
 G_M62392_IG163:
             ldr     x0, [fp, #0x10]	// [V528 cse0]
             ldr     x20, [x0, #0x10]
@@ -3725,7 +3725,7 @@ G_M62392_IG171:
             ret     lr
 						;; size=16 bbWeight=1 PerfScore 10.00

Note that we pick w0 as the temporary register, but it is used in the upcoming cbz.

Not sure why this would only be a problem after this PR. I'll dig a little deeper and push a fix.
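
The eventual fix threads the terminator's consumed registers into the temp-register selection; the idea is roughly the following (hypothetical RegMask alias and helper name, not the actual getTempRegForResolution code):

#include <cstdint>

using RegMask = std::uint64_t;

// Registers consumed by the block-ending node (the cbz operand w0 above) must be excluded
// from the temp-register candidates, because the resolution moves are inserted before that
// node executes and would otherwise clobber its input.
RegMask resolutionTempCandidates(RegMask freeRegs, RegMask terminatorConsumedRegs)
{
    return freeRegs & ~terminatorConsumedRegs;
}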

@jakobbotsch (Member Author)

/azp run runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch (Member Author)

/azp run runtime-coreclr libraries-jitstress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

            assert(reg != REG_NA);
            if (reg != REG_STK)
            {
                freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType);
Member

Suggested change:
-                freeRegs &= ~genRegMask(reg, getIntervalForLocalVar(varIndex)->registerType);
+                freeRegs &= ~genRegMask(reg ARM_ARG(getIntervalForLocalVar(varIndex)->registerType));

// resolveType            - the type of resolution to be performed
// liveSet                - the set of tracked lclVar indices which may require resolution
// terminatorConsumedRegs - the registers consumed by the terminator node.
//                          These registers will be used after any resolution added to 'fromBlock'.
Member

Suggested change:
-//                          These registers will be used after any resolution added to 'fromBlock'.
+//                          These registers will be used after any resolution added at the end of the 'fromBlock'.

Probably you meant this, but I'm not sure I fully understand it. Inside getTempRegForResolution we say that they are not free to be used as temp registers, but why is that the case? They are consumed already and it should be fine to use them, right?

Member Author

Yes, I meant that. Will update.

> They are consumed already and it should be fine to use them, right?

No, resolution nodes are added before the terminating node and any registers it uses. The consumed registers are computed here:

// Next, if this blocks ends with a switch table, or for Arm64, ends with JCMP instruction,
// make sure to not copy into the registers that are consumed at the end of this block.
//
// Note: Only switches and JCMP (for Arm4) have input regs (and so can be fed by copies), so those
// are the only block-ending branches that need special handling.
regMaskTP consumedRegs = RBM_NONE;
if (block->bbJumpKind == BBJ_SWITCH)
{
    // At this point, Lowering has transformed any non-switch-table blocks into
    // cascading ifs.
    GenTree* switchTable = LIR::AsRange(block).LastNode();
    assert(switchTable != nullptr && switchTable->OperGet() == GT_SWITCH_TABLE);

    consumedRegs = switchTable->gtRsvdRegs;
    GenTree* op1 = switchTable->gtGetOp1();
    GenTree* op2 = switchTable->gtGetOp2();
    noway_assert(op1 != nullptr && op2 != nullptr);
    assert(op1->GetRegNum() != REG_NA && op2->GetRegNum() != REG_NA);

    // No floating point values, so no need to worry about the register type
    // (i.e. for ARM32, where we used the genRegMask overload with a type).
    assert(varTypeIsIntegralOrI(op1) && varTypeIsIntegralOrI(op2));
    consumedRegs |= genRegMask(op1->GetRegNum());
    consumedRegs |= genRegMask(op2->GetRegNum());

    // Special handling for GT_COPY to not resolve into the source
    // of switch's operand.
    if (op1->OperIs(GT_COPY))
    {
        GenTree* srcOp1 = op1->gtGetOp1();
        consumedRegs |= genRegMask(srcOp1->GetRegNum());
    }
}

#if defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64)
// Next, if this blocks ends with a JCMP, we have to make sure:
// 1. Not to copy into the register that JCMP uses
//    e.g. JCMP w21, BRANCH
// 2. Not to copy into the source of JCMP's operand before it is consumed
//    e.g. Should not use w0 since it will contain wrong value after resolution
//         call METHOD
//         ; mov w0, w19  <-- should not resolve in w0 here.
//         mov w21, w0
//         JCMP w21, BRANCH
// 3. Not to modify the local variable it must consume
// Note: GT_COPY has special handling in codegen and its generation is merged with the
// node that consumes its result. So both, the input and output regs of GT_COPY must be
// excluded from the set available for resolution.
LclVarDsc* jcmpLocalVarDsc = nullptr;
if (block->bbJumpKind == BBJ_COND)
{
    GenTree* lastNode = LIR::AsRange(block).LastNode();

    if (lastNode->OperIs(GT_JCMP))
    {
        GenTree* op1 = lastNode->gtGetOp1();
        consumedRegs |= genRegMask(op1->GetRegNum());

        if (op1->OperIs(GT_COPY))
        {
            GenTree* srcOp1 = op1->gtGetOp1();
            consumedRegs |= genRegMask(srcOp1->GetRegNum());
        }

        if (op1->IsLocal())
        {
            GenTreeLclVarCommon* lcl = op1->AsLclVarCommon();
            jcmpLocalVarDsc = &compiler->lvaTable[lcl->GetLclNum()];
        }
    }
}
#endif // defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64)

Member

> resolution nodes are added before the terminating node and any registers it uses.

Of course, that's right. I was thinking of something else. But then how did it work so far? Wasn't this always a problem?

Member Author

I'm not totally sure how it worked previously either. If I had to guess, the problematic cases (i.e. switch/JCMP blocks) always end up doing resolution in the target blocks (except for the shared critical edge cases, which were skipped before). I think that would make sense, since these kinds of blocks should always have 2 or more successors.

Member

That could be the case. So were you hitting an assert or something when you added support for shared critical edges, and that made you consider the terminatorConsumedRegs?

Member Author

No, it was a Fuzzlyn test case; here was the codegen for that case: #81216 (comment)

Member

> No, it was a Fuzzlyn test case; here was the codegen for that case: #81216 (comment)

Ah, I missed that comment and findings. That makes sense then.

Member

Can you add a test case for that as well?

Member Author

The test case is massive; I think the weekly Antigen/Fuzzlyn runs should be OK to cover it.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 31, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 31, 2023
@jakobbotsch (Member Author)

The Fuzzlyn failures are #81356.
The libraries-jitstress failures are #81366, plus a timeout in a run where the tests all succeeded according to the log, so I will disregard that.

@jakobbotsch (Member Author)

/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch (Member Author) commented Jan 31, 2023

The libraries-jitstressregs failures all look like #81366. Will rerun the failing installer job.

(Edit: It succeeded on a rerun)
