[mono] Fix SpanHelpers.Reverse regression #70650
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory

Issue Details

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the unvectorized path (which is used by wasm, for example) became slower. This change uses the old code pattern to reverse the array in the non-vectorized case (or the rest of the array in the vectorized case). This is 2-3 times faster on wasm, for example.

#64412
ref byte last = ref Unsafe.Add(ref buf, length - 1 - i);
(last, first) = (first, last);
}
ReverseInner(ref buf, length);
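For context, a rough sketch of the restored pattern that ReverseInner stands for: two refs walk toward each other from both ends instead of recomputing both positions from an index on every iteration (illustrative method and variable names, not necessarily the PR's exact code):

using System.Runtime.CompilerServices;

static void ReverseSketch<T>(ref T buf, nuint length)
{
    if (length <= 1)
        return; // nothing to do for empty or single-element input

    ref T first = ref buf;
    ref T last = ref Unsafe.Add(ref buf, length - 1);
    do
    {
        T temp = first;  // swap through a temp
        first = last;
        last = temp;
        first = ref Unsafe.Add(ref first, 1);      // advance both ends directly,
        last = ref Unsafe.Subtract(ref last, 1);   // no index arithmetic per iteration
    } while (Unsafe.IsAddressLessThan(ref first, ref last));
}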
Could you help elaborate why this is faster now? ReverseInner is effectively doing the same thing as this, just with a temp rather than using a tuple. There is a slight difference in how the addressing will end up as well, but none of this looks like it should be that impactful.
If there is something specifically impacting WASM, then it would be good to ensure we have a bug tracking that, because these two loops should be effectively the same. Likewise, we should likely add AggressiveInlining to ReverseInner to ensure we aren't paying the cost of a call to handle the at most 7-8 trailing loop iterations on x64/Arm64.
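If that suggestion were taken, the only addition would be the attribute on the helper; a sketch, with the loop body elided since it matches the pattern sketched above:

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)] // suggested: avoid a call for the handful of trailing iterations
private static void ReverseInner<T>(ref T elements, nuint length)
{
    // ... two-ref swap loop as sketched earlier in the thread ...
}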
Could you help elaborate why this is faster now?
It avoids this issue #64412 (comment) among other things.
I remember that minor issue, but I wouldn't have expected a 2-3x perf difference there. It's worse IL, but the JIT is also able to optimize it to the same native codegen in many cases (especially for primitives like we're using here).
This seems like a case where there is likely some WASM-specific inefficiency, and that's what I'd like to better understand.
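For reference, the two swap forms being compared: the tuple form is what the removed loop used and compiles to extra hidden temporaries in IL, while the temp form is what ReverseInner uses (a sketch; as noted above, RyuJIT generally produces the same native code for both):

static void SwapWithTuple<T>(ref T a, ref T b)
{
    (b, a) = (a, b); // the compiler introduces hidden temps for the right-hand side
}

static void SwapWithTemp<T>(ref T a, ref T b)
{
    T temp = a; // single explicit temporary
    a = b;
    b = temp;
}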
Note that this change also introduced a 1.67x regression on arm64: #68667.
I agree that it would be nice for the JIT to optimize this to have the same performance as the original loop. None of our code-generators are there (each code-generator for different reasons).
Can we get some numbers for this change on Arm64 to see if it also addresses the regression?
-- Just noting I'm not pushing back against this change; I think it's fine and simplifies things. I just want to better understand why there is such a drastic difference here so we know what to look out for in the future.
Can we get some numbers for this change on Arm64 to see if it also addresses the regression?
Yes, it fixes the regression on coreclr arm64. For reference, here is the disassembly of the core loop for Reverse<int>:
Current main:
0x280e91070: lsl x1, x24, #2
0x280e91074: add x1, x22, x1
0x280e91078: sub x3, x2, x24
0x280e9107c: lsl x3, x3, #2
0x280e91080: add x3, x3, x22
0x280e91084: ldr w4, [x1]
0x280e91088: ldr w5, [x3]
0x280e9108c: str w4, [x3]
0x280e91090: str w5, [x1]
0x280e91094: add x24, x24, #0x1
0x280e91098: cmp x25, x24
0x280e9109c: b.hi 0x280e91070
This change:
0x280f41240: ldr w2, [x0]
0x280f41244: ldr w3, [x1]
0x280f41248: str w3, [x0]
0x280f4124c: str w2, [x1]
0x280f41250: add x0, x0, #0x4
0x280f41254: sub x1, x1, #0x4
0x280f41258: cmp x0, x1
0x280f4125c: b.lo 0x280f41240
The regression is caused by an extra induction variable. RyuJIT is able to optimize out the extra local, but I do not expect that Mono is always able to do so (the interpreter in particular).
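To make the induction-variable difference concrete outside of SpanHelpers, here is the same contrast on a plain int[] (a simplified sketch with illustrative names, not the runtime code):

// Index-based shape: i is the induction variable, and a second position has to be
// rederived from it every iteration (the lsl/add/sub sequence in the first listing).
static void ReverseByIndex(int[] a)
{
    for (int i = 0; i < a.Length / 2; i++)
    {
        int j = a.Length - 1 - i;
        (a[j], a[i]) = (a[i], a[j]);
    }
}

// Two-cursor shape: the positions themselves advance; nothing is derived from a
// separate index each iteration (matching the tighter ldr/str/add/sub/cmp loop above).
static void ReverseByCursors(int[] a)
{
    int lo = 0, hi = a.Length - 1;
    while (lo < hi)
    {
        (a[hi], a[lo]) = (a[lo], a[hi]);
        lo++;
        hi--;
    }
}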
For reference, here is the code generated by the interpreter for the loop in question. The generated code is fairly similar to the arm64 code Jan posted above, with some additional redundancy in the unoptimized case. The main problem is that first and last are recomputed on every loop iteration, relative to i.
Current main:
IR_0007: conv.i8.i4 [48 <- 16], // Computation of `first`, extra calculation due to non-constant indexer
IR_000a: mul.i8.imm [48 <- 48], 4 //
IR_000e: add.i8 [24 <- 0 48], /////////////////////
IR_0012: sub1.i4 [48 <- 8], // Computation of `last`, extra calculation due to non-constant indexer
IR_0015: sub.i4 [48 <- 48 16], //
IR_0019: conv.i8.i4 [48 <- 48], //
IR_001c: mul.i8.imm [48 <- 48], 4 //
IR_0020: add.i8 [48 <- 0 48], ///////////////////
IR_0024: ldind.i4 [32 <- 24],
IR_0027: ldind.i4 [40 <- 48],
IR_002a: stind.i4 [nil <- 48 32],
IR_002d: stind.i4 [nil <- 24 40],
IR_0030: add1.i4 [16 <- 16], // index must be incremented
IR_0033: ldc.i4.2 [48 <- nil], // Division of length should be hoisted out of the loop, optimization not supported by interp yet
IR_0035: div.i4 [48 <- 8 48], //
IR_0039: blt.i4.sp [nil <- 16 48], IR_0007
This change:
IR_0015: ldind.i4 [32 <- 16],
IR_0018: ldind.i4 [40 <- 24],
IR_001b: stind.i4 [nil <- 16 40],
IR_001e: stind.i4 [nil <- 24 32],
IR_0021: add.i8.imm [16 <- 16], 4
IR_0025: add.i8.imm [24 <- 24], -4
IR_0029: clt.un.i8 [40 <- 16 24],
IR_002d: brtrue.i4.sp [nil <- 40], IR_0015
Looks like you need to guard for the length == 0 case
/azp run runtime-wasm
Azure Pipelines successfully started running 1 pipeline(s).
This method can now be called with length 0
Force-pushed from 3ec80c0 to 80f701c.
private static void ReverseInner<T>(ref T elements, nuint length)
{
Debug.Assert(length > 0);
if (length <= 1)
This can be if (length == 0) { return; } since zero is the only other value it could be here. It will also be a more efficient check.
I added the comparison with 1 since reversing one element is a no-op, but it probably doesn't really make a difference.
#70944 is refactoring this in more significant ways.
Since this seems ready, should we go ahead and merge this?
LGTM, thank you @BrzVlad !
Improvements on win-arm64: dotnet/perf-autofiling-issues#6342
#64412
dotnet/perf-autofiling-issues#5014