
[mono] Fix SpanHelpers.Reverse regression #70650

Merged
adamsitnik merged 3 commits into dotnet:main on Jun 22, 2022

Conversation

@BrzVlad (Member) commented Jun 13, 2022

SpanHelpers.Reverse was recently optimized using vectorized operations. However, the unvectorized path (which is used by wasm, for example) became slower. This change uses the old code pattern to reverse the array in the non-vectorized case (or the rest of the array in the vectorized case). On wasm, for example, this is 2-3 times faster.

#64412
dotnet/perf-autofiling-issues#5014
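
For illustration, here is a rough C# sketch of the two scalar loop shapes being compared. The class and method names (ReverseSketch, ReverseByIndex, ReverseByRefs) are made up for this sketch and the code is simplified; it is not the exact runtime source.

    using System.Runtime.CompilerServices;

    internal static class ReverseSketch
    {
        // Index-based shape (the scalar tail after the vectorized rewrite):
        // first/last are recomputed from i on every iteration, which introduces
        // an extra induction variable and per-iteration address arithmetic.
        public static void ReverseByIndex<T>(ref T buf, nuint length)
        {
            for (nuint i = 0; i < length / 2; i++)
            {
                ref T first = ref Unsafe.Add(ref buf, (nint)i);
                ref T last = ref Unsafe.Add(ref buf, (nint)(length - 1 - i));
                (last, first) = (first, last);
            }
        }

        // Two-ref shape (the old pattern this PR restores): the refs simply step
        // toward each other, so each iteration is just two loads, two stores and
        // two pointer adjustments.
        public static void ReverseByRefs<T>(ref T buf, nuint length)
        {
            if (length <= 1)
                return;

            ref T first = ref buf;
            ref T last = ref Unsafe.Add(ref buf, (nint)(length - 1));
            do
            {
                T temp = first;
                first = last;
                last = temp;
                first = ref Unsafe.Add(ref first, 1);
                last = ref Unsafe.Add(ref last, -1);
            } while (Unsafe.IsAddressLessThan(ref first, ref last));
        }
    }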

@ghost commented Jun 13, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: BrzVlad
Assignees: -
Labels: area-System.Memory
Milestone: -

@BrzVlad (Member Author) commented Jun 13, 2022

ref byte last = ref Unsafe.Add(ref buf, length - 1 - i);
(last, first) = (first, last);
}
ReverseInner(ref buf, length);
Member

Could you help elaborate why this is faster now?

ReverseInner is effectively doing the same thing as this, just with a temp rather than using a tuple. There is a slight difference in how the addressing will end up as well, but none of this looks like it should be that impactful.

If there is something specifically impacting WASM, then it would be good to ensure we have a bug tracking that, because these two loops should be effectively the same. Likewise, we should likely add AggressiveInlining to ReverseInner to ensure we aren't paying the cost of a call to handle the at most 7-8 trailing loop iterations on x64/Arm64.
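
As a hedged illustration of that inlining suggestion (this PR does not actually add the attribute; the helper body is elided and only the attribute placement is shown):

    using System.Runtime.CompilerServices;

    internal static partial class SpanHelpers
    {
        // Suggested hint: let the JIT inline the scalar helper so the vectorized
        // path does not pay a call just to reverse the few trailing elements.
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static void ReverseInner<T>(ref T elements, nuint length)
        {
            // scalar two-ref reversal of `length` elements (body elided)
        }
    }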

Member

> Could you help elaborate why this is faster now?

It avoids this issue #64412 (comment) among other things.

Member

I remember that minor issue, but I wouldn't have expected a 2-3x perf difference there. It's worse IL, but the JIT is also able to optimize it to the same native codegen in many cases (especially for primitives like the ones we're using here).

This seems like a case where there is likely some WASM specific inefficiency and that's what I'd like to better understand.

Member

Note that this change also introduced 1.67x regression on arm64: #68667.

I agree that it would be nice for the JIT to optimize this to have the same performance as the original loop. None of our code generators are there yet (each for different reasons).

Member

Can we get some numbers for this change on Arm64 to see if it also addresses the regression?

-- Just noting I'm not pushing back against this change; I think it's fine and simplifies things. I just want to better understand why there is such a drastic difference here so we know what to look out for in the future.

Member

> Can we get some numbers for this change on Arm64 to see if it also addresses the regression?

Yes, it fixes the regression on coreclr arm64. For reference, here is the disassembly of the core loop for Reverse<int>:

Current main:

    0x280e91070: lsl    x1, x24, #2
    0x280e91074: add    x1, x22, x1
    0x280e91078: sub    x3, x2, x24
    0x280e9107c: lsl    x3, x3, #2
    0x280e91080: add    x3, x3, x22
    0x280e91084: ldr    w4, [x1]
    0x280e91088: ldr    w5, [x3]
    0x280e9108c: str    w4, [x3]
    0x280e91090: str    w5, [x1]
    0x280e91094: add    x24, x24, #0x1
    0x280e91098: cmp    x25, x24
    0x280e9109c: b.hi   0x280e91070

This change:

    0x280f41240: ldr    w2, [x0]
    0x280f41244: ldr    w3, [x1]
    0x280f41248: str    w3, [x0]
    0x280f4124c: str    w2, [x1]
    0x280f41250: add    x0, x0, #0x4
    0x280f41254: sub    x1, x1, #0x4
    0x280f41258: cmp    x0, x1
    0x280f4125c: b.lo   0x280f41240

The regression is caused by an extra induction variable. RyuJIT is able to optimize out the extra local, but I do not expect Mono (the interpreter in particular) to always be able to do the same.

@BrzVlad (Member Author) commented Jun 14, 2022

For reference, here is the code generated by the interpreter for the loop in question. The generated code is fairly similar to the arm64 code Jan posted above, with some additional redundancy in the unoptimized case. The main problem is that first and last are recomputed on every loop iteration, relative to i.

Current main:

IR_0007: conv.i8.i4     [48 <- 16],     // Computation of `first`, extra calculation due to non-constant indexer
IR_000a: mul.i8.imm     [48 <- 48], 4   //
IR_000e: add.i8         [24 <- 0 48],   /////////////////////
IR_0012: sub1.i4        [48 <- 8],      // Computation of `last`, extra calculation due to non-constant indexer
IR_0015: sub.i4         [48 <- 48 16],  //
IR_0019: conv.i8.i4     [48 <- 48],     //
IR_001c: mul.i8.imm     [48 <- 48], 4   //
IR_0020: add.i8         [48 <- 0 48],   ///////////////////
IR_0024: ldind.i4       [32 <- 24],
IR_0027: ldind.i4       [40 <- 48],
IR_002a: stind.i4       [nil <- 48 32],
IR_002d: stind.i4       [nil <- 24 40],
IR_0030: add1.i4        [16 <- 16],     // index must be incremented
IR_0033: ldc.i4.2       [48 <- nil],    // Division of length should be hoisted out of the loop, optimization not supported by interp yet
IR_0035: div.i4         [48 <- 8 48],   //
IR_0039: blt.i4.sp      [nil <- 16 48], IR_0007

This change:

IR_0015: ldind.i4       [32 <- 16],
IR_0018: ldind.i4       [40 <- 24],
IR_001b: stind.i4       [nil <- 16 40],
IR_001e: stind.i4       [nil <- 24 32],
IR_0021: add.i8.imm     [16 <- 16], 4
IR_0025: add.i8.imm     [24 <- 24], -4
IR_0029: clt.un.i8      [40 <- 16 24],
IR_002d: brtrue.i4.sp   [nil <- 40], IR_0015

@lewing (Member) left a comment

Looks like you need to guard for the length == 0 case

@radical (Member) commented Jun 16, 2022

/azp run runtime-wasm

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

SpanHelpers.Reverse was recently optimized using vectorized operations. However, the slow path (which is used by wasm, for example) became slower. This change uses the old code pattern to reverse the array in the non-vectorized scenario (or the rest of the array in the vectorized scenario). This is 2-3 times faster on wasm, for example.
This method can now be called with length 0
private static void ReverseInner<T>(ref T elements, nuint length)
{
    Debug.Assert(length > 0);
    if (length <= 1)
Member

This can be if (length == 0) { return } since zero is the only other value it could be here. It will also be a more efficient check.

Member Author

I added comparison with 1 since reversing one element is a nop, but it probably doesn't really make a difference.
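
For completeness, a small sketch of the two guard shapes discussed here. The ReverseInner signature matches the diff above; the wrapper class GuardSketch is hypothetical and the reversal body is elided.

    internal static class GuardSketch
    {
        private static void ReverseInner<T>(ref T elements, nuint length)
        {
            // As merged: also skips the single-element case, which is a no-op anyway.
            if (length <= 1)
                return;

            // Alternative suggested in review (zero is the only other value it could
            // be here, and == 0 is a cheaper check):
            //     if (length == 0)
            //         return;

            // ... two-ref reversal of the remaining elements ...
        }
    }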

@jkotas (Member) commented Jun 20, 2022

#70944 is refactoring this in more significant ways.

@SamMonoRT (Member) commented

> #70944 is refactoring this in more significant ways.

Since this seems ready, should we go ahead and merge this?

@adamsitnik (Member) left a comment

LGTM, thank you @BrzVlad !

@adamsitnik merged commit 2a680c1 into dotnet:main Jun 22, 2022
@adamsitnik added this to the 7.0.0 milestone Jun 22, 2022
@EgorBo (Member) commented Jun 23, 2022

Improvements on win-arm64: dotnet/perf-autofiling-issues#6342

@ghost locked as resolved and limited conversation to collaborators Jul 23, 2022