[ARM64] Possible perf regression: slicing #41704

adamsitnik · 2020-09-01T21:18:04Z

After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found few regressions related to slicing.

It looks like these are ARM64 specific regressions, I was not able to reproduce it for ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
py ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'System.Memory.Slice*'

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PVNQZA : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PXIHWO : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

Type	Method	Toolchain	Mean	Ratio
Slice<Byte>	SpanStart	netcoreapp3.1	3.831 ns	1.00
Slice<Byte>	SpanStart	netcoreapp5.0	2.550 ns	0.67

Slice<String>	SpanStart	netcoreapp3.1	11.526 ns	1.00
Slice<String>	SpanStart	netcoreapp5.0	16.482 ns	1.43

Slice<Byte>	SpanStartLength	netcoreapp3.1	3.782 ns	1.00
Slice<Byte>	SpanStartLength	netcoreapp5.0	3.202 ns	0.85

Slice<String>	SpanStartLength	netcoreapp3.1	11.720 ns	1.00
Slice<String>	SpanStartLength	netcoreapp5.0	16.823 ns	1.44

Slice<Byte>	ReadOnlySpanStart	netcoreapp3.1	3.801 ns	1.00
Slice<Byte>	ReadOnlySpanStart	netcoreapp5.0	2.867 ns	0.75

Slice<String>	ReadOnlySpanStart	netcoreapp3.1	9.144 ns	1.00
Slice<String>	ReadOnlySpanStart	netcoreapp5.0	16.039 ns	1.75

Slice<Byte>	ReadOnlySpanStartLength	netcoreapp3.1	3.779 ns	1.00
Slice<Byte>	ReadOnlySpanStartLength	netcoreapp5.0	3.156 ns	0.83

Slice<String>	ReadOnlySpanStartLength	netcoreapp3.1	9.279 ns	1.00
Slice<String>	ReadOnlySpanStartLength	netcoreapp5.0	16.418 ns	1.77

Slice<Byte>	MemoryStart	netcoreapp3.1	3.779 ns	1.00
Slice<Byte>	MemoryStart	netcoreapp5.0	6.418 ns	1.70

Slice<String>	MemoryStart	netcoreapp3.1	12.952 ns	1.00
Slice<String>	MemoryStart	netcoreapp5.0	25.550 ns	1.97

Slice<Byte>	MemoryStartSpan	netcoreapp3.1	6.416 ns	1.00
Slice<Byte>	MemoryStartSpan	netcoreapp5.0	10.515 ns	1.64

Slice<String>	MemoryStartSpan	netcoreapp3.1	24.265 ns	1.00
Slice<String>	MemoryStartSpan	netcoreapp5.0	33.036 ns	1.36

Slice<Byte>	MemoryStartLength	netcoreapp3.1	3.805 ns	1.00
Slice<Byte>	MemoryStartLength	netcoreapp5.0	5.899 ns	1.55

Slice<String>	MemoryStartLength	netcoreapp3.1	12.285 ns	1.00
Slice<String>	MemoryStartLength	netcoreapp5.0	18.409 ns	1.50

Slice<Byte>	MemoryStartLengthSpan	netcoreapp3.1	6.245 ns	1.00
Slice<Byte>	MemoryStartLengthSpan	netcoreapp5.0	9.975 ns	1.60

Slice<String>	MemoryStartLengthSpan	netcoreapp3.1	23.963 ns	1.00
Slice<String>	MemoryStartLengthSpan	netcoreapp5.0	31.040 ns	1.30

Slice<Byte>	ReadOnlyMemoryStart	netcoreapp3.1	3.807 ns	1.00
Slice<Byte>	ReadOnlyMemoryStart	netcoreapp5.0	6.394 ns	1.68

Slice<String>	ReadOnlyMemoryStart	netcoreapp3.1	9.371 ns	1.00
Slice<String>	ReadOnlyMemoryStart	netcoreapp5.0	22.140 ns	2.36

Slice<Byte>	ReadOnlyMemoryStartSpan	netcoreapp3.1	6.379 ns	1.00
Slice<Byte>	ReadOnlyMemoryStartSpan	netcoreapp5.0	9.150 ns	1.43

Slice<String>	ReadOnlyMemoryStartSpan	netcoreapp3.1	22.247 ns	1.00
Slice<String>	ReadOnlyMemoryStartSpan	netcoreapp5.0	32.229 ns	1.45

Slice<Byte>	ReadOnlyMemoryStartLength	netcoreapp3.1	3.733 ns	1.00
Slice<Byte>	ReadOnlyMemoryStartLength	netcoreapp5.0	6.005 ns	1.61

Slice<String>	ReadOnlyMemoryStartLength	netcoreapp3.1	8.833 ns	1.00
Slice<String>	ReadOnlyMemoryStartLength	netcoreapp5.0	16.008 ns	1.81

Slice<Byte>	ReadOnlyMemoryStartLengthSpan	netcoreapp3.1	6.087 ns	1.00
Slice<Byte>	ReadOnlyMemoryStartLengthSpan	netcoreapp5.0	9.430 ns	1.55

Slice<String>	ReadOnlyMemoryStartLengthSpan	netcoreapp3.1	22.489 ns	1.00
Slice<String>	ReadOnlyMemoryStartLengthSpan	netcoreapp5.0	29.502 ns	1.31

Slice<Byte>	MemorySpanStart	netcoreapp3.1	8.985 ns	1.00
Slice<Byte>	MemorySpanStart	netcoreapp5.0	11.703 ns	1.31

Slice<String>	MemorySpanStart	netcoreapp3.1	23.013 ns	1.00
Slice<String>	MemorySpanStart	netcoreapp5.0	23.544 ns	1.02

Slice<Byte>	MemorySpanStartLength	netcoreapp3.1	8.289 ns	1.00
Slice<Byte>	MemorySpanStartLength	netcoreapp5.0	9.989 ns	1.21

Slice<String>	MemorySpanStartLength	netcoreapp3.1	23.611 ns	1.00
Slice<String>	MemorySpanStartLength	netcoreapp5.0	23.401 ns	0.99

Slice<Byte>	ReadOnlyMemorySpanStart	netcoreapp3.1	8.519 ns	1.00
Slice<Byte>	ReadOnlyMemorySpanStart	netcoreapp5.0	11.698 ns	1.37

Slice<String>	ReadOnlyMemorySpanStart	netcoreapp3.1	19.716 ns	1.00
Slice<String>	ReadOnlyMemorySpanStart	netcoreapp5.0	22.038 ns	1.12

Slice<Byte>	ReadOnlyMemorySpanStartLength	netcoreapp3.1	6.770 ns	1.00
Slice<Byte>	ReadOnlyMemorySpanStartLength	netcoreapp5.0	10.912 ns	1.61

Slice<String>	ReadOnlyMemorySpanStartLength	netcoreapp3.1	21.624 ns	1.00
Slice<String>	ReadOnlyMemorySpanStartLength	netcoreapp5.0	22.292 ns	1.03

@kunalspathak is there any chance you could take a look at the produced assembly code and verify if this is an actual regression in code gen or not?

category:cq
theme:ssa
skill-level:expert
cost:large

JulieLeeMSFT · 2020-09-01T23:08:17Z

@kunalspathak please look into this.

CC @dotnet/jit-contrib

kunalspathak · 2020-09-02T21:11:25Z

Noting down some observations, not necessarily related to the regression.
For Slice.ReadOnlySpanStartLength() , we don't inline calls to Slice() because of following code:

runtime/src/coreclr/src/jit/importer.cpp

Line 1917 in 6072e4d

// Runtime does not support inlining of all shapes of runtime lookups

Another observation is in .NET 3.1, we didn't JIT SequenceEqual(), but I see that we JIT that method in .NET 5. It is contrary to the fact that it was marked with AggresiveOptimization in the past and was removed in #32371 to get it JIT during R2R.

kunalspathak · 2020-09-04T01:56:25Z

The benchmarks that regressed operates on Slice<string> and Slice<byte>. The calls to Slice<string> don't get inlined as expected, but the ones for Slice<byte> gets inlined. My following analysis is for benchmarks for Slice<string> particularly ReadOnlyMemoryStart() but most likely it is the cause for other benchmarks for Slice<string>. I will share my findings for benchmarks under Slice<byte> separately, once I do that analysis.

The jit assembly for Slice() is unchanged, but the regression is inside the benchmark code ReadOnlyMemoryStart(). The underlying issue is that in .NET 3.1, we do a null check for obj and if it is null, call HELPER_RUNTIMELOOKUP. If not, we will just operate on that object. This is how we generated code in .NET 3.1 for such checks:

G_M18749_IG07:
            add     x0, fp, #32	// [V01 loc0]
            mov     w2, #5
            bl      ReadOnlyMemory`1:Slice(int):struct:this
            str     x0, [fp,#16]	// [V02 loc1]
            str     x1, [fp,#24]	// [V02 loc1+0x08]
            mov     x0, x20
            ldr     x2, [x21,#16]
            cbnz    x2, G_M18749_IG08    ; <----------- This condition checks if obj != null
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            mov     x2, x0

G_M18749_IG08:
            add     x1, fp, #16	// [V02 loc1]
            mov     x0, x2
            bl      Slice`1:Consume(byref)
            mov     x0, x20
            ldr     x1, [x21,#24]
            cbnz    x1, G_M18749_IG09
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            mov     x1, x0

For happy path (where obj != null), we would jump to G_M18749_IG08 and use the obj. Otherwise, we would call CORINFO_HELP_RUNTIMEHANDLE_CLASS. However, in .NET 5, we have flipped the condition of this check:

G_M6359_IG14:
            add     x0, fp, #64	// [V01 loc0]
            mov     w2, #5
            bl      ReadOnlyMemory`1:Slice(int):ReadOnlyMemory`1:this
            str     x0, [fp,#48]	// [V02 loc1]
            str     x1, [fp,#56]	// [V02 loc1+0x08]
            mov     x0, x20
            ldr     x1, [x21,#24]
            cbz     x1, G_M6359_IG16    ; <----------- This condition now checks if obj == null
						;; bbWeight=1    PerfScore 8.50
G_M6359_IG15:
            str     x1, [fp,#32]	// [V14 tmp11]
            b       G_M6359_IG17
						;; bbWeight=0.25 PerfScore 0.50
G_M6359_IG16:
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            str     x0, [fp,#32]	// [V14 tmp11]
						;; bbWeight=0.25 PerfScore 0.88
G_M6359_IG17:
            add     x1, fp, #48	// [V02 loc1]
            ldr     x0, [fp,#32]	// [V14 tmp11]
            bl      Slice`1:Consume(byref)
            mov     x0, x20
            ldr     x1, [x21,#32]
            cbz     x1, G_M6359_IG19

Now, in happy path, we check for obj == null condition and if true, would call CORINFO_HELP_RUNTIMEHANDLE_CLASS, otherwise, we would goto G_M6359_IG15 and do a jump to G_M6359_IG17. Not only we introduce extra jumps in .NET 5, we also have to spill the values on stack. The local frame size in .NET 3.1 is just 40 vs. 168 for .NET 5.0.

The branches in G_M6359_IG15 and G_M6359_IG16 are weighted same as 0.25. As per @AndyAyersMS , I tried to bump up the weight of non-call branch higher, but that doesn't change the condition. I will further see what else we can do here.

Edit:

In .NET 3.1, the condition would just have 1 arm and so it gets flipped, however in .NET 5, we have 2 arms on that condition, one coming from "expandable generic dictionaries" work that we did in .NET 5, so most like given that, bumping the non-call branch won't make a difference.

.NET 3.1 tree:

               [000163] ------------              *  STMT      void  (IL   ???...  ???)
               [000162] -AC-G-------              \--*  ASG       long  
               [000161] D------N----                 +--*  LCL_VAR   long   V13 tmp9         
               [000160] --C-G-------                 \--*  QMARK     long  
               [000157] Q-----------    if              +--*  NE        int   
               [000150] ------------                    |  +--*  LCL_VAR   long   V13 tmp9         
               [000156] ------------                    |  \--*  CNS_INT   long   0
               [000159] --C-G-------    if              \--*  COLON     long  
               [000155] --C-G------- else                  +--*  CALL help long   HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
               [000138] ------------ arg0                  |  +--*  LCL_VAR   long   V12 tmp8         
               [000152] ------------ arg1                  |  \--*  CNS_INT(h) long   0xd1ffab1e token
               [000158] ------------ then                  \--*  NOP       void

.NET 5.0 tree

               [000109] -AC-G+------              *  ASG       long  
               [000108] D----+-N----              +--*  LCL_VAR   long   V10 tmp7         
               [000107] --C-G+------              \--*  QMARK     long  
               [000097] J----+-N----    if           +--*  EQ        int   
               [000093] n----+------                 |  +--*  IND       long  
               [000092] -----+------                 |  |  \--*  ADD       long  
               [000090] #----+------                 |  |     +--*  IND       long  
               [000089] #----+------                 |  |     |  \--*  IND       long  
               [000088] -----+------                 |  |     |     \--*  ADD       long  
               [000086] -----+------                 |  |     |        +--*  LCL_VAR   long   V09 tmp6         
               [000087] -----+------                 |  |     |        \--*  CNS_INT   long   48
               [000091] -----+------                 |  |     \--*  CNS_INT   long   24
               [000096] -----+------                 |  \--*  CNS_INT   long   0
               [000106] --C-G+?-----    if           \--*  COLON     long  
               [000095] --C-G+?----- else               +--*  CALL help long   HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
               [000085] -----+?----- arg0 in x0         |  +--*  LCL_VAR   long   V09 tmp6         
               [000094] -----+?----- arg1 in x1         |  \--*  CNS_INT(h) long   0x7ffb0671e8e8 token
               [000098] n----+?----- then               \--*  IND       long  
               [000099] -----+?-----                       \--*  ADD       long  
               [000100] #----+?-----                          +--*  IND       long  
               [000101] #----+?-----                          |  \--*  IND       long  
               [000102] -----+?-----                          |     \--*  ADD       long  
               [000103] -----+?-----                          |        +--*  LCL_VAR   long   V09 tmp6         
               [000104] -----+?-----                          |        \--*  CNS_INT   long   48
               [000105] -----+?-----                          \--*  CNS_INT   long   24

kunalspathak · 2020-09-04T18:21:57Z

Here is my analysis for benchmarks in Slice<byte>. The regression is coming from the inlined Slice<byte>() and the ctor for ReadOnlyMemory<byte> present inside Slice() as seen here.

Below is the assembly code for

Consume(memory.Slice(Size / 2)); // private const int Size = 10;

In .NET 3.1, here is disassembly. IG06 does this check and if within bounds, proceed to creating the ReadOnlyMemory() object in IG07 as seen here.

G_M45388_IG06:
        B94037B3          ldr     w19, [fp,#52]	// [V01 loc0+0x0c]
        7100167F          cmp     w19, #5
        540005A3          blo     G_M45388_IG12

G_M45388_IG07:
        F94017B4          ldr     x20, [fp,#40]	// [V01 loc0]
        AA1403E0          mov     x0, x20
        51001675          sub     w21, w19, #5
        2A1503E1          mov     w1, w21
        528000A2          mov     w2, #5
        F9000FA0          str     x0, [fp,#24]	// [V21 tmp18] ; this._object = _object
        B90023A2          str     w2, [fp,#32]	// [V22 tmp19] ; this._index = _index
        B90027A1          str     w1, [fp,#36]	// [V23 tmp20] ; this._length = _length
        910063A0          add     x0, fp, #24	// [V02 loc1]
        94000000          bl      Slice`1:Consume(byref)
        7100167F          cmp     w19, #5
        54000483          blo     G_M45388_IG13

In .NET 5, here is disassembly. There are more memory access in IG06 (4 vs. 7).

G_M6359_IG05:
        B9402FA0          ldr     w0, [fp,#44]	// [V23 tmp20]
        7100141F          cmp     w0, #5
        540006C3          blo     G_M6359_IG11
						;; bbWeight=0.50 PerfScore 1.75

G_M6359_IG06:
        B9402BA0          ldr     w0, [fp,#40]	// [V22 tmp19]
        11001400          add     w0, w0, #5  ; <-- In .NET 3.1, we constant prop value of _index and turns this to " = 5"
        B9402FA1          ldr     w1, [fp,#44]	// [V23 tmp20] ; <-- could have skipped because we already load it in IG05, the way it is skipped in .NET 3.1
        51001421          sub     w1, w1, #5
        F94013A2          ldr     x2, [fp,#32]	// [V21 tmp18]
        F9000BA2          str     x2, [fp,#16]	// [V24 tmp21] ; this._object = _object
        B9001BA0          str     w0, [fp,#24]	// [V25 tmp22] ; this._index = _index
        B9001FA1          str     w1, [fp,#28]	// [V26 tmp23] ; this._length = _length
        910043A0          add     x0, fp, #16	// [V02 loc1]
        94000000          bl      Slice`1:Consume(byref)
        B9402FA0          ldr     w0, [fp,#44]	// [V23 tmp20] ; <-- could have been skipped
        7100141F          cmp     w0, #5
        54000523          blo     G_M6359_IG11

I believe some of them are happening because we fail to constant propagate the value of _index which is zero. Instead we load the value from stack and add 5 to it. In .NET 3.1, we detected this and converted it to assigment.

.NET 3.1 dump:


***** BB05, stmt 11 (before)
N005 (  5,  6) [000118] -A------R---              *  ASG       int   
N004 (  1,  1) [000117] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2
N003 (  5,  6) [000073] ------------              \--*  ADD       int   
N001 (  3,  4) [000070] ------------                 +--*  LCL_FLD   int    V01 loc0         u:6[+8] Fseq[_index]
N002 (  1,  1) [000072] ------------                 \--*  CNS_INT   int    5

  VNApplySelectors:
    VNForHandle(_index) is $142, fieldType is int
      AX2: $142 != $143 ==> select([$283]store($281, $143, $2c1), $142) ==> select($281, $142).
      AX1: select([$241]store($281, $142, $40), $142) ==> $40.
    VNForMapSelect($380, $142):int returns $40 {IntCns 0}
  VNApplySelectors:
    VNForHandle(_index) is $142, fieldType is int
    VNForMapSelect($380, $142):int returns $40 {IntCns 0}
N001 [000070]   LCL_FLD   V01 loc0         u:6[+8] Fseq[_index] => $40 {IntCns 0}
N002 [000072]   CNS_INT   5 => $42 {IntCns 5}
N003 [000073]   ADD       => $42 {IntCns 5}
N004 [000117]   LCL_VAR   V07 tmp4         d:2 => $42 {IntCns 5}
N005 [000118]   ASG       => $42 {IntCns 5}

***** BB05, stmt 11 (after)
N005 (  5,  6) [000118] -A------R---              *  ASG       int    $42
N004 (  1,  1) [000117] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2 $42
N003 (  5,  6) [000073] ------------              \--*  ADD       int    $42
N001 (  3,  4) [000070] ------------                 +--*  LCL_FLD   int    V01 loc0         u:6[+8] Fseq[_index] $40
N002 (  1,  1) [000072] ------------                 \--*  CNS_INT   int    5 $42

.NET 5.0 dump:


***** BB04, STMT00019(before)
N005 (  3,  4) [000092] -A------R---              *  ASG       int   
N004 (  1,  1) [000091] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2
N003 (  3,  4) [000058] ------------              \--*  ADD       int   
N001 (  1,  1) [000056] ------------                 +--*  LCL_VAR   int    V10 tmp7         
N002 (  1,  2) [000057] ------------                 \--*  CNS_INT   int    5

N001 [000056]   LCL_VAR   V10 tmp7          => $282 {282}
N002 [000057]   CNS_INT   5 => $42 {IntCns 5}
N003 [000058]   ADD       => $206 {ADD($42, $282)}
N004 [000091]   LCL_VAR   V07 tmp4         d:2 => $206 {ADD($42, $282)}
N005 [000092]   ASG       => $206 {ADD($42, $282)}

***** BB04, STMT00019(after)
N005 (  3,  4) [000092] -A------R---              *  ASG       int    $206
N004 (  1,  1) [000091] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2 $206
N003 (  3,  4) [000058] ------------              \--*  ADD       int    $206
N001 (  1,  1) [000056] ------------                 +--*  LCL_VAR   int    V10 tmp7          $282
N002 (  1,  2) [000057] ------------                 \--*  CNS_INT   int    5 $42

I am still trying to see why we fail to detect _index being constant.

kunalspathak · 2020-09-04T21:21:07Z

@CarolEidt pointed out that in .NET 3.1, this._index remains LCL_FLD and gets to SSA form while in .NET 5, we do struct promotion, but because of this condition, we don't convert it to SSA form. Because of this, we don't do constant propagation. We need to support multi-reg defs for SSA and might not be something we want to fix as part of perf regression fix. We should port this to .NET 6 (at least for benchmarks under Slice<byte>()).

kunalspathak · 2020-09-04T22:05:49Z

Also, it turns out that the fix for Slice<string>() needs more thinking given the introduction of runtime lookup in .NET 5. We need to have a broader fix for this which should be done in .NET 6

kunalspathak · 2020-09-10T16:53:11Z

Here are the benchmarks that regressed (taken from the description but filtered just the ones that regressed)

Class	Method	.NET5 / .NET 3.0 ratio
Slice	ReadOnlyMemoryStart	2.36
Slice	MemoryStart	1.97
Slice	ReadOnlyMemoryStartLength	1.81
Slice	ReadOnlySpanStartLength	1.77
Slice	ReadOnlySpanStart	1.75
Slice	MemoryStart	1.7
Slice	ReadOnlyMemoryStart	1.68
Slice	MemoryStartSpan	1.64
Slice	ReadOnlyMemoryStartLength	1.61
Slice	ReadOnlyMemorySpanStartLength	1.61
Slice	MemoryStartLengthSpan	1.6
Slice	MemoryStartLength	1.55
Slice	ReadOnlyMemoryStartLengthSpan	1.55
Slice	MemoryStartLength	1.5
Slice	ReadOnlyMemoryStartSpan	1.45
Slice	SpanStartLength	1.44
Slice	SpanStart	1.43
Slice	ReadOnlyMemoryStartSpan	1.43
Slice	ReadOnlyMemorySpanStart	1.37
Slice	MemoryStartSpan	1.36
Slice	ReadOnlyMemoryStartLengthSpan	1.31
Slice	MemorySpanStart	1.31
Slice	MemoryStartLengthSpan	1.3
Slice	MemorySpanStartLength	1.21
Slice	ReadOnlyMemorySpanStart	1.12
Slice	ReadOnlyMemorySpanStartLength	1.03
Slice	MemorySpanStart	1.02

kunalspathak · 2021-04-01T17:20:39Z

Actionable item: Need to double check the numbers if we did anything in .NET 6.0 to address #41704 (comment).

kunalspathak · 2021-04-30T23:45:14Z

Here is my analysis after comparing the assembly of .NET 6 vs. .NET 3.1. Here is the diff for ReadOnlyMemory.Slice<byte> benchmark.

No zero register used

            mov     x1, #0
            str     x1, [fp,#48]	// [V133 tmp130]
            str     w1, [fp,#56]	// [V134 tmp131]
            str     w1, [fp,#60]	// [V135 tmp132]
            b       G_M31903_IG05

Update: This PR is out - #52269

Redundant ldr from same location

I noticed redundant ldr [fp, #48] and they could have been replaced with mov x1, x19. It can vary from case to case depending on the register availability, but clearly in below case, CSE could have avoided extra load from memory.

            blo     G_M31903_IG88
            ldr     x19, [fp,#48]	// [V133 tmp130]
            ldr     w1, [fp,#56]	// [V134 tmp131]
            add     w20, w1, #5
            ldr     w1, [fp,#60]	// [V135 tmp132]
            sub     w21, w1, #5
            ldr     x1, [fp,#48]	// [V133 tmp130]
            cbz     x1, G_M31903_IG13

Update: Related issue: #6761

Repeated loading of `._object` field

We load value of ._object again and again in .NET6

            ldr     x19, [fp,#48]	// [V133 tmp130]
            ...
            ...
            str     x19, [fp,#32]	// [V136 tmp133]
            ...
            ; occurs 16 times

But in .NET 3.1 we load it in x20 once and do mov x21, x20 to store it in final destination, although instead of mov, we could have just done str x20, [fp, #24].

            ldr     x20, [fp,#40]	// [V01 loc0]
            mov     x21, x20
            ...
            str     x21, [fp,#24]	// [V149 tmp146]
            ...
            ...
            mov     x21, x20
            ...
            str     x21, [fp,#24]	// [V149 tmp146]
            ...
            occurs 16 times.

Constant propagation

Next, we still have problem of not const proping 5 in .NET 6 as discussed in #41704 (comment).

            ldr     w1, [fp,#56]	// [V134 tmp131]
            add     w20, w1, #5
            
            ...
            occurs 16 times.

But in .NEt3.1, we const prop and directly move 5:

            mov     w0, #5
            str     x21, [fp,#24]	// [V149 tmp146]  ; ._object
            str     w0, [fp,#32]	// [V150 tmp147]

Repeatative `sub` operation

Lastly, in .NET 6, we do not CSE _length - start and do sub every time.

            ldr     w1, [fp,#60]	// [V135 tmp132]
            sub     w21, w1, #5
            ...
            str     w21, [fp,#44]	// [V138 tmp135]
            ...
            ...
            sub     w21, w1, #5
            ...
            str     w21, [fp,#44]	// [V138 tmp135]
            ... 
            occurs 15 times

But in .NET3.1, we do subtraction once and CSE the result.

            sub     w22, w19, #5
            mov     w23, w22
            ...
            str     w23, [fp,#36]	// [V151 tmp148]
            ...
            ...
            mov     w23, w22
            ...
            str     w23, [fp,#36]	// [V151 tmp148]
            ...
            15 times

I will try to investigate little more and open separate issues for each of them.

kunalspathak · 2021-07-02T22:24:22Z

So most of the regression is happening due to CSE not working for add and sub. During value numbering, we name V22 with unique VN making it not possible to CSE Add(V22, 5) operation.


***** BB05, STMT00027(before)
N005 (  3,  4) [000138] -A------R---              *  ASG       int   
N004 (  1,  1) [000137] D------N----              +--*  LCL_VAR   int    V08 tmp5         d:2
N003 (  3,  4) [000069] ------------              \--*  ADD       int   
N001 (  1,  1) [000067] ------------                 +--*  LCL_VAR   int    V22 tmp19        
N002 (  1,  2) [000068] ------------                 \--*  CNS_INT   int    5

N001 [000067]   LCL_VAR   V22 tmp19         => $242 {242}
N002 [000068]   CNS_INT   5 => $42 {IntCns 5}
N003 [000069]   ADD       => $206 {ADD($42, $242)}
N004 [000137]   LCL_VAR   V08 tmp5         d:2 => $206 {ADD($42, $242)}
N005 [000138]   ASG       => $206 {ADD($42, $242)}

***** BB05, STMT00027(after)
N005 (  3,  4) [000138] -A------R---              *  ASG       int    $206
N004 (  1,  1) [000137] D------N----              +--*  LCL_VAR   int    V08 tmp5         d:2 $206
N003 (  3,  4) [000069] ------------              \--*  ADD       int    $206
N001 (  1,  1) [000067] ------------                 +--*  LCL_VAR   int    V22 tmp19         $242
N002 (  1,  2) [000068] ------------                 \--*  CNS_INT   int    5 $42

====================================================================================================================================================================

***** BB12, STMT00048(before)
N005 (  3,  4) [000250] -A------R---              *  ASG       int   
N004 (  1,  1) [000249] D------N----              +--*  LCL_VAR   int    V16 tmp13        d:2
N003 (  3,  4) [000181] ------------              \--*  ADD       int   
N001 (  1,  1) [000179] ------------                 +--*  LCL_VAR   int    V22 tmp19        
N002 (  1,  2) [000180] ------------                 \--*  CNS_INT   int    5

N001 [000179]   LCL_VAR   V22 tmp19         => $24d {24d}
N002 [000180]   CNS_INT   5 => $42 {IntCns 5}
N003 [000181]   ADD       => $20f {ADD($42, $24d)}
N004 [000249]   LCL_VAR   V16 tmp13        d:2 => $20f {ADD($42, $24d)}
N005 [000250]   ASG       => $20f {ADD($42, $24d)}

***** BB12, STMT00048(after)
N005 (  3,  4) [000250] -A------R---              *  ASG       int    $20f
N004 (  1,  1) [000249] D------N----              +--*  LCL_VAR   int    V16 tmp13        d:2 $20f
N003 (  3,  4) [000181] ------------              \--*  ADD       int    $20f
N001 (  1,  1) [000179] ------------                 +--*  LCL_VAR   int    V22 tmp19         $24d
N002 (  1,  2) [000180] ------------                 \--*  CNS_INT   int    5 $42

The reason is what I mentioned in #41704 (comment). Since it is related to the struct multireg return, I would assign this to @sandreenko .

sandreenko · 2021-07-09T16:44:30Z

Multi-def SSA is a large work item that won't happen in 6.0. Marking it as Future for now.

adamsitnik added arch-arm64 tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Sep 1, 2020

adamsitnik added this to the 5.0.0 milestone Sep 1, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 1, 2020

JulieLeeMSFT assigned kunalspathak Sep 1, 2020

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 1, 2020

adamsitnik mentioned this issue Sep 4, 2020

.NET 5.0 Microbenchmarks Performance Study Report #41871

Closed

21 tasks

kunalspathak modified the milestones: 5.0.0, 6.0.0 Sep 4, 2020

adamsitnik mentioned this issue Sep 9, 2020

Ensure building dotnet/runtime works on Windows ARM64 #42008

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 10, 2020

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021

kunalspathak removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 1, 2021

This was referenced May 1, 2021

Prepare JIT backend for structs in registers. #52039

Merged

Arm64: Use zero register wherever possible #52269

Merged

LSRA: Avoiding reg spill to memory when reg-value is consistent with memory #6761

Open

kunalspathak assigned sandreenko and unassigned kunalspathak Jul 2, 2021

sandreenko modified the milestones: 6.0.0, Future Jul 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ARM64] Possible perf regression: slicing #41704

[ARM64] Possible perf regression: slicing #41704

adamsitnik commented Sep 1, 2020 •

edited by BruceForstall

Loading

JulieLeeMSFT commented Sep 1, 2020

kunalspathak commented Sep 2, 2020

kunalspathak commented Sep 4, 2020 •

edited

Loading

kunalspathak commented Sep 4, 2020

kunalspathak commented Sep 4, 2020

kunalspathak commented Sep 4, 2020

kunalspathak commented Sep 10, 2020

kunalspathak commented Apr 1, 2021

kunalspathak commented Apr 30, 2021 •

edited

Loading

kunalspathak commented Jul 2, 2021

sandreenko commented Jul 9, 2021

[ARM64] Possible perf regression: slicing #41704

[ARM64] Possible perf regression: slicing #41704

Comments

adamsitnik commented Sep 1, 2020 • edited by BruceForstall Loading

Repro

JulieLeeMSFT commented Sep 1, 2020

kunalspathak commented Sep 2, 2020

kunalspathak commented Sep 4, 2020 • edited Loading

kunalspathak commented Sep 4, 2020

kunalspathak commented Sep 4, 2020

kunalspathak commented Sep 4, 2020

kunalspathak commented Sep 10, 2020

kunalspathak commented Apr 1, 2021

kunalspathak commented Apr 30, 2021 • edited Loading

No zero register used

Redundant ldr from same location

Repeated loading of ._object field

Constant propagation

Repeatative sub operation

kunalspathak commented Jul 2, 2021

sandreenko commented Jul 9, 2021

adamsitnik commented Sep 1, 2020 •

edited by BruceForstall

Loading

kunalspathak commented Sep 4, 2020 •

edited

Loading

kunalspathak commented Apr 30, 2021 •

edited

Loading

Repeated loading of `._object` field

Repeatative `sub` operation