Regressions in System.Memory.ReadOnlySpan #61679

performanceautofiler · 2021-11-09T08:55:53Z

Run Information

Architecture	x64
OS	Windows 10.0.18362
Baseline	6aaee1622b87d8a8d1cf38c235e9c37f2c622df1
Compare	1e48eca9577433526c5c649a87ca2f2545d35958
Diff	Diff

Regressions in System.Memory.ReadOnlySpan

Benchmark	Baseline	Test	Test/Base	Test Quality	Edge Detector	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
IndexOfString - Duration of single invocation	217.46 ns	238.56 ns	1.10	0.03	False

Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

The text was updated successfully, but these errors were encountered:

kunalspathak · 2021-11-09T16:14:32Z

Most likely from 7308c4f but revisit next week.

ghost · 2021-11-16T16:57:09Z

Tagging subscribers to this area: @GrabYourPitchforks, @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture	x64
OS	Windows 10.0.18362
Baseline	6aaee1622b87d8a8d1cf38c235e9c37f2c622df1
Compare	1e48eca9577433526c5c649a87ca2f2545d35958
Diff	Diff

Regressions in System.Memory.ReadOnlySpan

Benchmark	Baseline	Test	Test/Base	Test Quality	Edge Detector	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
IndexOfString - Duration of single invocation	217.46 ns	238.56 ns	1.10	0.03	False

Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author:	performanceautofiler[bot]
Assignees:	-
Labels:	`area-System.Memory`, `untriaged`
Milestone:	-

EgorBo · 2021-11-16T16:58:03Z

Suspect: #61023

ghost · 2021-11-16T17:53:28Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture	x64
OS	Windows 10.0.18362
Baseline	6aaee1622b87d8a8d1cf38c235e9c37f2c622df1
Compare	1e48eca9577433526c5c649a87ca2f2545d35958
Diff	Diff

Regressions in System.Memory.ReadOnlySpan

Benchmark	Baseline	Test	Test/Base	Test Quality	Edge Detector	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
IndexOfString - Duration of single invocation	217.46 ns	238.56 ns	1.10	0.03	False

Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author:	performanceautofiler[bot]
Assignees:	-
Labels:	`os-windows`, `tenet-performance`, `tenet-performance-benchmarks`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

AndyAyersMS · 2021-11-20T21:00:18Z

Can repro locally:

BenchmarkDotNet=v0.13.1.1620-nightly, OS=Windows 10.0.22000
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Job-TPBGBC : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Job-KJPLXN : .NET 7.0.0 (7.0.21.56701), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method	Job	Runtime	Toolchain	input	value	comparisonType	Mean	Error	StdDev	Median	Min	Max	Ratio	Allocated
IndexOfString	Job-TPBGBC	.NET 6.0	net6.0	AAAAAAAAAAAA(...)AAAAAAAAAAAA [100]	x	OrdinalIgnoreCase	194.2 ns	1.72 ns	1.52 ns	194.1 ns	192.1 ns	197.2 ns	1.00	-
IndexOfString	Job-KJPLXN	.NET 7.0	net7.0	AAAAAAAAAAAA(...)AAAAAAAAAAAA [100]	x	OrdinalIgnoreCase	222.9 ns	2.37 ns	1.98 ns	223.0 ns	220.3 ns	227.7 ns	1.15	-

AndyAyersMS · 2021-11-20T21:18:22Z

.NET 6

.NET 7 ()

AndyAyersMS · 2021-11-20T23:02:36Z

Can't repro the above numbers with locally built 6.0 and 7.0, however -- I get around 207ns for both.

AndyAyersMS · 2021-11-21T01:32:45Z

The local 6.0 and 7.0 codegen is identical to the respective SDK versions, so not clear why the perf differs.

The 6.0 and 7.0 codegen for OrdinalCasing.IndexOf differs in two aspects. One is clearly from #61023:

;; 6.0
       cmp       r13d,eax
       sete      al
       movzx     eax,al
       test      eax,eax
       je        short M02_L15

;; 7.0
       cmp       r13d,eax
       jne       short M02_L15

and there are 6 places where we no longer zero extend index values (possibly from #57970):

;; 6.0
       cmp       r13d,[rcx+8]
       jae       near ptr M02_L19
       movsxd    rax,r13d
       movzx     r13d,word ptr [rcx+rax*2+10]

;; 7.0
       cmp       r13d,[rcx+8]
       jae       near ptr M02_L19
       mov       eax,r13d
       movzx     r13d,word ptr [rcx+rax*2+10]

These both happen in blocks within doubly nested loops. Source code here is:

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/OrdinalCasing.Icu.cs

Lines 292 to 354 in 1d065b6

    
           internal static unsafe int IndexOf(ReadOnlySpan<char> source, ReadOnlySpan<char> value) 
        
           { 
        
               Debug.Assert(value.Length > 0); 
        
               Debug.Assert(value.Length <= source.Length); 
        
               Debug.Assert(!GlobalizationMode.Invariant); 
        
               Debug.Assert(!GlobalizationMode.UseNls); 
        
               fixed (char* pSource = &MemoryMarshal.GetReference(source)) 
        
               fixed (char* pValue  = &MemoryMarshal.GetReference(value)) 
        
               { 
        
                   char* pSourceLimit = pSource + (source.Length - value.Length); 
        
                   char* pValueLimit = pValue + value.Length - 1; 
        
                   char* pCurrentSource = pSource; 
        
                   while (pCurrentSource <= pSourceLimit) 
        
                   { 
        
                       char *pVal = pValue; 
        
                       char *pSrc = pCurrentSource; 
        
                       while (pVal <= pValueLimit) 
        
                       { 
        
                           if (!char.IsHighSurrogate(*pVal) || pVal == pValueLimit) 
        
                           { 
        
                               if (*pVal != *pSrc && ToUpper(*pVal) != ToUpper(*pSrc)) 
        
                                   break; // no match 
        
                               pVal++; 
        
                               pSrc++; 
        
                               continue; 
        
                           } 
        
                           if (char.IsHighSurrogate(*pSrc) && char.IsLowSurrogate(*(pSrc + 1)) && char.IsLowSurrogate(*(pVal + 1))) 
        
                           { 
        
                               // Well formed surrogates 
        
                               // both the source and the Value have well-formed surrogates. 
        
                               if (!SurrogateCasing.Equal(*pSrc, *(pSrc + 1), *pVal, *(pVal + 1))) 
        
                                   break; // no match 
        
                               pSrc += 2; 
        
                               pVal += 2; 
        
                               continue; 
        
                           } 
        
                           if (*pVal != *pSrc) 
        
                               break; // no match 
        
                           pSrc++; 
        
                           pVal++; 
        
                       } 
        
                       if (pVal > pValueLimit) 
        
                       { 
        
                           // Found match. 
        
                           return (int) (pCurrentSource - pSource); 
        
                       } 
        
                       pCurrentSource++; 
        
                   } 
        
                   return -1; 
        
               } 
        
           }

In particular the test simplification happens at the end of the inner while loop

       cmp       r13d,eax                 // simplified test
       jne       short M02_L15
       add       r12,4                    // pSrc++; pDst++;
       add       r15,4
       jmp       short M02_L14
M02_L13:
       cmp       r13d,eax
       jne       short M02_L15
       add       r12,2
       add       r15,2
M02_L14:
       cmp       r15,rbp
       jbe       near ptr M02_L01         // branch to top of inner while loop

and one possible explanation for the perf impact here is that by simplifying this compare we've altered the offset of the M02_L13 label that follows. This is likely a frequent branch target as we only make it to the lower part of the inner while loop in rare cases.

I don't have offsets from BDN disassembly but can get them from my local builds. So will double-check if the above is plausible.

AndyAyersMS · 2021-11-21T02:09:04Z

Hmm, that doesn't seem to be holding up. There are some 32 byte jcc branches higher up in 7.0 (from the movsxd -> mov diffs) but lower down the alignment actually looks like it might work out better.

;; 6.0 tail
G_M62321_IG18:              ;; offset=02B2H
       448B6C242C           mov      r13d, dword ptr [rsp+2CH]
       443BE8               cmp      r13d, eax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
; ............................... 32B boundary ...............................
       85C0                 test     eax, eax
       7420                 je       SHORT G_M62321_IG21
       4983C404             add      r12, 4
       4983C704             add      r15, 4
       EB0D                 jmp      SHORT G_M62321_IG20
						;; bbWeight=8    PerfScore 50.00
G_M62321_IG19:              ;; offset=02CEH
       443BE8               cmp      r13d, eax
       7511                 jne      SHORT G_M62321_IG21
       4983C402             add      r12, 2
       4983C702             add      r15, 2
						;; bbWeight=8    PerfScore 14.00
G_M62321_IG20:              ;; offset=02DBH
       4C3BFD               cmp      r15, rbp
       0F867AFDFFFF         jbe      G_M62321_IG04
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jbe: 4 ; jcc erratum) 32B boundary ...............................
						;; bbWeight=16    PerfScore 20.00
G_M62321_IG21:              ;; offset=02E4H
       4C3BFD               cmp      r15, rbp
       7723                 ja       SHORT G_M62321_IG24
       4983C602             add      r14, 2
       4C3BF3               cmp      r14, rbx
       0F8659FDFFFF         jbe      G_M62321_IG03

;; 7.0 tail
G_M62321_IG18:              ;; offset=02AEH
       448B6C242C           mov      r13d, dword ptr [rsp+2CH]
       443BE8               cmp      r13d, eax
       7520                 jne      SHORT G_M62321_IG21
       4983C404             add      r12, 4
       4983C704             add      r15, 4
; ............................... 32B boundary ...............................
       EB0D                 jmp      SHORT G_M62321_IG20
						;; bbWeight=8    PerfScore 38.00
G_M62321_IG19:              ;; offset=02C2H
       443BE8               cmp      r13d, eax
       7511                 jne      SHORT G_M62321_IG21
       4983C402             add      r12, 2
       4983C702             add      r15, 2
						;; bbWeight=8    PerfScore 14.00
G_M62321_IG20:              ;; offset=02CFH
       4C3BFD               cmp      r15, rbp
       0F8686FDFFFF         jbe      G_M62321_IG04
						;; bbWeight=16    PerfScore 20.00
G_M62321_IG21:              ;; offset=02D8H
       4C3BFD               cmp      r15, rbp
       7723                 ja       SHORT G_M62321_IG24
       4983C602             add      r14, 2
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (add: 1) 32B boundary ...............................
       4C3BF3               cmp      r14, rbx
       0F8665FDFFFF         jbe      G_M62321_IG03
						;; bbWeight=4    PerfScore 11.00

AndyAyersMS · 2022-03-02T03:11:00Z

Looking again there is still a regression here:

No new insights, will need to find time to look at this one.

AndyAyersMS · 2022-05-02T18:05:20Z

Things have since improved dramatically. The second big drop was from #67758, but the earlier smaller drop (which basically undid the regression reported here) looks like it might be from #65738. Reason for the regression/first improvement still a bit of mystery.

So, going to close this one.

EgorBo changed the title ~~[Perf] Changes at 11/3/2021 6:40:28 PM~~ Regressions in System.Memory.ReadOnlySpan Nov 16, 2021

EgorBo transferred this issue from dotnet/perf-autofiling-issues Nov 16, 2021

dotnet-issue-labeler bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels Nov 16, 2021

EgorBo added os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark labels Nov 16, 2021

GrabYourPitchforks added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Memory labels Nov 16, 2021

AndyAyersMS self-assigned this Nov 20, 2021

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Nov 22, 2021

JulieLeeMSFT added this to the 7.0.0 milestone Nov 22, 2021

AndyAyersMS closed this as completed May 2, 2022

ghost locked as resolved and limited conversation to collaborators Jun 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regressions in System.Memory.ReadOnlySpan #61679

Regressions in System.Memory.ReadOnlySpan #61679

performanceautofiler bot commented Nov 9, 2021

Payloads

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

kunalspathak commented Nov 9, 2021

ghost commented Nov 16, 2021

Run Information

Regressions in System.Memory.ReadOnlySpan

Repro

Payloads

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

EgorBo commented Nov 16, 2021

ghost commented Nov 16, 2021

Run Information

Regressions in System.Memory.ReadOnlySpan

Repro

Payloads

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

AndyAyersMS commented Nov 20, 2021

AndyAyersMS commented Nov 20, 2021

AndyAyersMS commented Nov 20, 2021

AndyAyersMS commented Nov 21, 2021 •

edited

Loading

AndyAyersMS commented Nov 21, 2021

AndyAyersMS commented Mar 2, 2022

AndyAyersMS commented May 2, 2022

Regressions in System.Memory.ReadOnlySpan #61679

Regressions in System.Memory.ReadOnlySpan #61679

Comments

performanceautofiler bot commented Nov 9, 2021

Run Information

Regressions in System.Memory.ReadOnlySpan

Repro

Payloads

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

kunalspathak commented Nov 9, 2021

ghost commented Nov 16, 2021

Run Information

Regressions in System.Memory.ReadOnlySpan

Repro

Payloads

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

EgorBo commented Nov 16, 2021

ghost commented Nov 16, 2021

Run Information

Regressions in System.Memory.ReadOnlySpan

Repro

Payloads

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)

Docs

AndyAyersMS commented Nov 20, 2021

AndyAyersMS commented Nov 20, 2021

AndyAyersMS commented Nov 20, 2021

AndyAyersMS commented Nov 21, 2021 • edited Loading

AndyAyersMS commented Nov 21, 2021

AndyAyersMS commented Mar 2, 2022

AndyAyersMS commented May 2, 2022

AndyAyersMS commented Nov 21, 2021 •

edited

Loading