[mono] Fix SpanHelpers.Reverse regression #70650
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory

Issue Details

SpanHelpers.Reverse was optimized recently using vectorized operations. However, the unvectorized path (which is used by wasm, for example) became slower. This change uses the old code pattern to reverse the array in the non-vectorized case (or the rest of the array in the vectorized case). This is 2-3 times faster on wasm, for example.

#64412
ref byte last = ref Unsafe.Add(ref buf, length - 1 - i);
(last, first) = (first, last);
}
ReverseInner(ref buf, length);
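For context, a rough sketch of the restored pattern that ReverseInner stands for: two refs walk toward each other from both ends instead of recomputing both positions from an index on every iteration (illustrative method and variable names, not necessarily the PR's exact code):

using System.Runtime.CompilerServices;

static void ReverseSketch<T>(ref T buf, nuint length)
{
    if (length <= 1)
        return; // nothing to do for empty or single-element input

    ref T first = ref buf;
    ref T last = ref Unsafe.Add(ref buf, length - 1);
    do
    {
        T temp = first;  // swap through a temp
        first = last;
        last = temp;
        first = ref Unsafe.Add(ref first, 1);      // advance both ends directly,
        last = ref Unsafe.Subtract(ref last, 1);   // no index arithmetic per iteration
    } while (Unsafe.IsAddressLessThan(ref first, ref last));
}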
Could you help elaborate why this is faster now? ReverseInner is effectively doing the same thing as this, just with a temp rather than using a tuple. There is a slight difference in how the addressing will end up as well, but none of this looks like it should be that impactful.
If there is something specifically impacting WASM, then it would be good to ensure we have a bug tracking that, because these two loops should be effectively the same. Likewise, we should likely add AggressiveInlining to ReverseInner to ensure we aren't paying the cost of a call to handle the at most 7-8 trailing loop iterations on x64/Arm64.
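If that suggestion were taken, the only addition would be the attribute on the helper; a sketch, with the loop body elided since it matches the pattern sketched above:

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)] // suggested: avoid a call for the handful of trailing iterations
private static void ReverseInner<T>(ref T elements, nuint length)
{
    // ... two-ref swap loop as sketched earlier in the thread ...
}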
Could you help elaborate why this is faster now?
It avoids this issue #64412 (comment) among other things.
I remember that minor issue, but I wouldn't have expected a 2-3x perf difference there. It's worse IL, but the JIT is also able to optimize it to the same native codegen in many cases (especially for primitives like we're using here).
This seems like a case where there is likely some WASM-specific inefficiency, and that's what I'd like to better understand.
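For reference, the two swap forms being compared: the tuple form is what the removed loop used and compiles to extra hidden temporaries in IL, while the temp form is what ReverseInner uses (a sketch; as noted above, RyuJIT generally produces the same native code for both):

static void SwapWithTuple<T>(ref T a, ref T b)
{
    (b, a) = (a, b); // the compiler introduces hidden temps for the right-hand side
}

static void SwapWithTemp<T>(ref T a, ref T b)
{
    T temp = a; // single explicit temporary
    a = b;
    b = temp;
}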
Note that this change also introduced a 1.67x regression on arm64: #68667.
I agree that it would be nice for the JIT to optimize this to have the same performance as the original loop. None of our code-generators are there (each code-generator for different reasons).
Can we get some numbers for this change on Arm64 to see if it also addresses the regression?
-- Just noting I'm not pushing back against this change; I think it's fine and simplifies things. I just want to better understand why there is such a drastic difference here so we know what to look out for in the future.
Can we get some numbers for this change on Arm64 to see if it also addresses the regression?
Yes, it fixes the regression on coreclr arm64. For reference, here is the disassembly of the core loop for Reverse<int>:
Current main:
0x280e91070: lsl x1, x24, #2
0x280e91074: add x1, x22, x1
0x280e91078: sub x3, x2, x24
0x280e9107c: lsl x3, x3, #2
0x280e91080: add x3, x3, x22
0x280e91084: ldr w4, [x1]
0x280e91088: ldr w5, [x3]
0x280e9108c: str w4, [x3]
0x280e91090: str w5, [x1]
0x280e91094: add x24, x24, #0x1
0x280e91098: cmp x25, x24
0x280e9109c: b.hi 0x280e91070
This change:
0x280f41240: ldr w2, [x0]
0x280f41244: ldr w3, [x1]
0x280f41248: str w3, [x0]
0x280f4124c: str w2, [x1]
0x280f41250: add x0, x0, #0x4
0x280f41254: sub x1, x1, #0x4
0x280f41258: cmp x0, x1
0x280f4125c: b.lo 0x280f41240
The regression is caused by an extra induction variable. RyuJIT is able to optimize out the extra local, but I do not expect that Mono is always able to do so (the interpreter in particular).
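To make the induction-variable difference concrete outside of SpanHelpers, here is the same contrast on a plain int[] (a simplified sketch with illustrative names, not the runtime code):

// Index-based shape: i is the induction variable, and a second position has to be
// rederived from it every iteration (the lsl/add/sub sequence in the first listing).
static void ReverseByIndex(int[] a)
{
    for (int i = 0; i < a.Length / 2; i++)
    {
        int j = a.Length - 1 - i;
        (a[j], a[i]) = (a[i], a[j]);
    }
}

// Two-cursor shape: the positions themselves advance; nothing is derived from a
// separate index each iteration (matching the tighter ldr/str/add/sub/cmp loop above).
static void ReverseByCursors(int[] a)
{
    int lo = 0, hi = a.Length - 1;
    while (lo < hi)
    {
        (a[hi], a[lo]) = (a[lo], a[hi]);
        lo++;
        hi--;
    }
}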
For reference, here is the code generated by the interpreter for the loop in question. The generated code is fairly similar to the arm64 code Jan posted above, with some additional redundancy in the unoptimized case. The main problem is that first and last are recomputed on every loop iteration, relative to i.
Current main:
IR_0007: conv.i8.i4 [48 <- 16], // Computation of `first`, extra calculation due to non-constant indexer
IR_000a: mul.i8.imm [48 <- 48], 4 //
IR_000e: add.i8 [24 <- 0 48], /////////////////////
IR_0012: sub1.i4 [48 <- 8], // Computation of `last`, extra calculation due to non-constant indexer
IR_0015: sub.i4 [48 <- 48 16], //
IR_0019: conv.i8.i4 [48 <- 48], //
IR_001c: mul.i8.imm [48 <- 48], 4 //
IR_0020: add.i8 [48 <- 0 48], ///////////////////
IR_0024: ldind.i4 [32 <- 24],
IR_0027: ldind.i4 [40 <- 48],
IR_002a: stind.i4 [nil <- 48 32],
IR_002d: stind.i4 [nil <- 24 40],
IR_0030: add1.i4 [16 <- 16], // index must be incremented
IR_0033: ldc.i4.2 [48 <- nil], // Division of length should be hoisted out of the loop, optimization not supported by interp yet
IR_0035: div.i4 [48 <- 8 48], //
IR_0039: blt.i4.sp [nil <- 16 48], IR_0007
This change:
IR_0015: ldind.i4 [32 <- 16],
IR_0018: ldind.i4 [40 <- 24],
IR_001b: stind.i4 [nil <- 16 40],
IR_001e: stind.i4 [nil <- 24 32],
IR_0021: add.i8.imm [16 <- 16], 4
IR_0025: add.i8.imm [24 <- 24], -4
IR_0029: clt.un.i8 [40 <- 16 24],
IR_002d: brtrue.i4.sp [nil <- 40], IR_0015
Looks like you need to guard for the length == 0 case
/azp run runtime-wasm
Azure Pipelines successfully started running 1 pipeline(s).
This method can now be called with length 0
Force-pushed from 3ec80c0 to 80f701c.
private static void ReverseInner<T>(ref T elements, nuint length)
{
Debug.Assert(length > 0);
if (length <= 1)
This can be if (length == 0) { return; } since zero is the only other value it could be here. It will also be a more efficient check.
I added the comparison with 1 since reversing one element is a no-op, but it probably doesn't really make a difference.
#70944 is refactoring this in more significant ways.
Since this seems ready, should we go ahead and merge this?
LGTM, thank you @BrzVlad !
Improvements on win-arm64: dotnet/perf-autofiling-issues#6342
#64412
dotnet/perf-autofiling-issues#5014