Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regressions in System.Memory.ReadOnlySpan #61679

Closed
performanceautofiler bot opened this issue Nov 9, 2021 · 11 comments
Closed

Regressions in System.Memory.ReadOnlySpan #61679

performanceautofiler bot opened this issue Nov 9, 2021 · 11 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Milestone

Comments

@performanceautofiler
Copy link

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 6aaee1622b87d8a8d1cf38c235e9c37f2c622df1
Compare 1e48eca9577433526c5c649a87ca2f2545d35958
Diff Diff

Regressions in System.Memory.ReadOnlySpan

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
IndexOfString - Duration of single invocation 217.46 ns 238.56 ns 1.10 0.03 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@kunalspathak
Copy link
Member

Most likely from 7308c4f but revisit next week.

@EgorBo EgorBo changed the title [Perf] Changes at 11/3/2021 6:40:28 PM Regressions in System.Memory.ReadOnlySpan Nov 16, 2021
@EgorBo EgorBo transferred this issue from dotnet/perf-autofiling-issues Nov 16, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Memory untriaged New issue has not been triaged by the area owner labels Nov 16, 2021
@ghost
Copy link

ghost commented Nov 16, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 6aaee1622b87d8a8d1cf38c235e9c37f2c622df1
Compare 1e48eca9577433526c5c649a87ca2f2545d35958
Diff Diff

Regressions in System.Memory.ReadOnlySpan

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
IndexOfString - Duration of single invocation 217.46 ns 238.56 ns 1.10 0.03 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author: performanceautofiler[bot]
Assignees: -
Labels:

area-System.Memory, untriaged

Milestone: -

@EgorBo EgorBo added os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark labels Nov 16, 2021
@EgorBo
Copy link
Member

EgorBo commented Nov 16, 2021

Suspect: #61023

@GrabYourPitchforks GrabYourPitchforks added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Memory labels Nov 16, 2021
@ghost
Copy link

ghost commented Nov 16, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Run Information

Architecture x64
OS Windows 10.0.18362
Baseline 6aaee1622b87d8a8d1cf38c235e9c37f2c622df1
Compare 1e48eca9577433526c5c649a87ca2f2545d35958
Diff Diff

Regressions in System.Memory.ReadOnlySpan

Benchmark Baseline Test Test/Base Test Quality Edge Detector Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
IndexOfString - Duration of single invocation 217.46 ns 238.56 ns 1.10 0.03 False

graph
Test Report

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net6.0 --filter 'System.Memory.ReadOnlySpan*'

Payloads

Baseline
Compare

Histogram

System.Memory.ReadOnlySpan.IndexOfString(input: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAXAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", value: "x", comparisonType: OrdinalIgnoreCase)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

Author: performanceautofiler[bot]
Assignees: -
Labels:

os-windows, tenet-performance, tenet-performance-benchmarks, area-CodeGen-coreclr, untriaged

Milestone: -

@AndyAyersMS AndyAyersMS self-assigned this Nov 20, 2021
@AndyAyersMS
Copy link
Member

Can repro locally:

BenchmarkDotNet=v0.13.1.1620-nightly, OS=Windows 10.0.22000
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Job-TPBGBC : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
Job-KJPLXN : .NET 7.0.0 (7.0.21.56701), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method Job Runtime Toolchain input value comparisonType Mean Error StdDev Median Min Max Ratio Allocated
IndexOfString Job-TPBGBC .NET 6.0 net6.0 AAAAAAAAAAAA(...)AAAAAAAAAAAA [100] x OrdinalIgnoreCase 194.2 ns 1.72 ns 1.52 ns 194.1 ns 192.1 ns 197.2 ns 1.00 -
IndexOfString Job-KJPLXN .NET 7.0 net7.0 AAAAAAAAAAAA(...)AAAAAAAAAAAA [100] x OrdinalIgnoreCase 222.9 ns 2.37 ns 1.98 ns 223.0 ns 220.3 ns 227.7 ns 1.15 -

@AndyAyersMS
Copy link
Member

.NET 6
image

.NET 7 ()

image

@AndyAyersMS
Copy link
Member

Can't repro the above numbers with locally built 6.0 and 7.0, however -- I get around 207ns for both.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Nov 21, 2021

The local 6.0 and 7.0 codegen is identical to the respective SDK versions, so not clear why the perf differs.

The 6.0 and 7.0 codegen for OrdinalCasing.IndexOf differs in two aspects. One is clearly from #61023:

;; 6.0
       cmp       r13d,eax
       sete      al
       movzx     eax,al
       test      eax,eax
       je        short M02_L15

;; 7.0
       cmp       r13d,eax
       jne       short M02_L15

and there are 6 places where we no longer zero extend index values (possibly from #57970):

;; 6.0
       cmp       r13d,[rcx+8]
       jae       near ptr M02_L19
       movsxd    rax,r13d
       movzx     r13d,word ptr [rcx+rax*2+10]

;; 7.0
       cmp       r13d,[rcx+8]
       jae       near ptr M02_L19
       mov       eax,r13d
       movzx     r13d,word ptr [rcx+rax*2+10]

These both happen in blocks within doubly nested loops. Source code here is:

internal static unsafe int IndexOf(ReadOnlySpan<char> source, ReadOnlySpan<char> value)
{
Debug.Assert(value.Length > 0);
Debug.Assert(value.Length <= source.Length);
Debug.Assert(!GlobalizationMode.Invariant);
Debug.Assert(!GlobalizationMode.UseNls);
fixed (char* pSource = &MemoryMarshal.GetReference(source))
fixed (char* pValue = &MemoryMarshal.GetReference(value))
{
char* pSourceLimit = pSource + (source.Length - value.Length);
char* pValueLimit = pValue + value.Length - 1;
char* pCurrentSource = pSource;
while (pCurrentSource <= pSourceLimit)
{
char *pVal = pValue;
char *pSrc = pCurrentSource;
while (pVal <= pValueLimit)
{
if (!char.IsHighSurrogate(*pVal) || pVal == pValueLimit)
{
if (*pVal != *pSrc && ToUpper(*pVal) != ToUpper(*pSrc))
break; // no match
pVal++;
pSrc++;
continue;
}
if (char.IsHighSurrogate(*pSrc) && char.IsLowSurrogate(*(pSrc + 1)) && char.IsLowSurrogate(*(pVal + 1)))
{
// Well formed surrogates
// both the source and the Value have well-formed surrogates.
if (!SurrogateCasing.Equal(*pSrc, *(pSrc + 1), *pVal, *(pVal + 1)))
break; // no match
pSrc += 2;
pVal += 2;
continue;
}
if (*pVal != *pSrc)
break; // no match
pSrc++;
pVal++;
}
if (pVal > pValueLimit)
{
// Found match.
return (int) (pCurrentSource - pSource);
}
pCurrentSource++;
}
return -1;
}
}

In particular the test simplification happens at the end of the inner while loop

       cmp       r13d,eax                 // simplified test
       jne       short M02_L15
       add       r12,4                    // pSrc++; pDst++;
       add       r15,4
       jmp       short M02_L14
M02_L13:
       cmp       r13d,eax
       jne       short M02_L15
       add       r12,2
       add       r15,2
M02_L14:
       cmp       r15,rbp
       jbe       near ptr M02_L01         // branch to top of inner while loop

and one possible explanation for the perf impact here is that by simplifying this compare we've altered the offset of the M02_L13 label that follows. This is likely a frequent branch target as we only make it to the lower part of the inner while loop in rare cases.

I don't have offsets from BDN disassembly but can get them from my local builds. So will double-check if the above is plausible.

@AndyAyersMS
Copy link
Member

Hmm, that doesn't seem to be holding up. There are some 32 byte jcc branches higher up in 7.0 (from the movsxd -> mov diffs) but lower down the alignment actually looks like it might work out better.

;; 6.0 tail
G_M62321_IG18:              ;; offset=02B2H
       448B6C242C           mov      r13d, dword ptr [rsp+2CH]
       443BE8               cmp      r13d, eax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
; ............................... 32B boundary ...............................
       85C0                 test     eax, eax
       7420                 je       SHORT G_M62321_IG21
       4983C404             add      r12, 4
       4983C704             add      r15, 4
       EB0D                 jmp      SHORT G_M62321_IG20
						;; bbWeight=8    PerfScore 50.00
G_M62321_IG19:              ;; offset=02CEH
       443BE8               cmp      r13d, eax
       7511                 jne      SHORT G_M62321_IG21
       4983C402             add      r12, 2
       4983C702             add      r15, 2
						;; bbWeight=8    PerfScore 14.00
G_M62321_IG20:              ;; offset=02DBH
       4C3BFD               cmp      r15, rbp
       0F867AFDFFFF         jbe      G_M62321_IG04
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jbe: 4 ; jcc erratum) 32B boundary ...............................
						;; bbWeight=16    PerfScore 20.00
G_M62321_IG21:              ;; offset=02E4H
       4C3BFD               cmp      r15, rbp
       7723                 ja       SHORT G_M62321_IG24
       4983C602             add      r14, 2
       4C3BF3               cmp      r14, rbx
       0F8659FDFFFF         jbe      G_M62321_IG03

;; 7.0 tail
G_M62321_IG18:              ;; offset=02AEH
       448B6C242C           mov      r13d, dword ptr [rsp+2CH]
       443BE8               cmp      r13d, eax
       7520                 jne      SHORT G_M62321_IG21
       4983C404             add      r12, 4
       4983C704             add      r15, 4
; ............................... 32B boundary ...............................
       EB0D                 jmp      SHORT G_M62321_IG20
						;; bbWeight=8    PerfScore 38.00
G_M62321_IG19:              ;; offset=02C2H
       443BE8               cmp      r13d, eax
       7511                 jne      SHORT G_M62321_IG21
       4983C402             add      r12, 2
       4983C702             add      r15, 2
						;; bbWeight=8    PerfScore 14.00
G_M62321_IG20:              ;; offset=02CFH
       4C3BFD               cmp      r15, rbp
       0F8686FDFFFF         jbe      G_M62321_IG04
						;; bbWeight=16    PerfScore 20.00
G_M62321_IG21:              ;; offset=02D8H
       4C3BFD               cmp      r15, rbp
       7723                 ja       SHORT G_M62321_IG24
       4983C602             add      r14, 2
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (add: 1) 32B boundary ...............................
       4C3BF3               cmp      r14, rbx
       0F8665FDFFFF         jbe      G_M62321_IG03
						;; bbWeight=4    PerfScore 11.00

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Nov 22, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Nov 22, 2021
@AndyAyersMS
Copy link
Member

Looking again there is still a regression here:
newplot - 2022-03-01T190932 613

No new insights, will need to find time to look at this one.

@AndyAyersMS
Copy link
Member

Things have since improved dramatically. The second big drop was from #67758, but the earlier smaller drop (which basically undid the regression reported here) looks like it might be from #65738. Reason for the regression/first improvement still a bit of mystery.

newplot - 2022-05-02T110234 248

So, going to close this one.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 1, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Projects
None yet
Development

No branches or pull requests

5 participants