Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Perf] Regressions in Burgers.Test0 #57087

Closed
DrewScoggins opened this issue Aug 9, 2021 · 16 comments · Fixed by #57438
Closed

[Perf] Regressions in Burgers.Test0 #57087

DrewScoggins opened this issue Aug 9, 2021 · 16 comments · Fixed by #57438
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Milestone

Comments

@DrewScoggins
Copy link
Member

DrewScoggins commented Aug 9, 2021

This was found in a 5.0 -> 6.0 comparison. I noticed that this was never copied over to be investigated.

Run Information

Architecture x64
OS ubuntu 18.04
Baseline c7fa2955e3f860f8070adea115f7277281228a2e
Compare ee3f7daed083477689a4c6240025afa45ffa3352

Regressions in Burgers

Benchmark Baseline Test Test/Base Baseline IR Compare IR IR Ratio Baseline ETL Compare ETL
Test0 227.13 ms 300.60 ms 1.32

graph
Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Burgers*'

.

Payloads

Baseline
Compare

Histogram

Burgers.Test2

[173077567.563 ; 180818538.259) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[180818538.259 ; 188559508.954) | 
[188559508.954 ; 196300479.650) | 
[196300479.650 ; 204041450.346) | 
[204041450.346 ; 211782421.042) | 
[211782421.042 ; 217157535.712) | 
[217157535.712 ; 224898506.407) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[224898506.407 ; 235629468.257) | 
[235629468.257 ; 243618923.481) | @@@@@@@@@@@

Burgers.Test0

[221800734.528 ; 229054539.920) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[229054539.920 ; 237545194.113) | @@@@@@@
[237545194.113 ; 246815564.573) | 
[246815564.573 ; 253034981.489) | 
[253034981.489 ; 262781984.527) | @@@@
[262781984.527 ; 272052354.987) | 
[272052354.987 ; 281322725.447) | 
[281322725.447 ; 290593095.907) | 
[290593095.907 ; 297131428.042) | 
[297131428.042 ; 303545307.882) | @@@@@@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels Aug 9, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 9, 2021
@danmoseley danmoseley added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 10, 2021
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Aug 10, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Aug 10, 2021
@briansull
Copy link
Contributor

Compare ee3f7da

Baseline / Compare change hash indicates that this PR was responsible. (Or possibly a previous PR on the same day or two)

Separate refactoring changes in 43250 (#48199)
main (#48199)
v6.0.0-preview.7.21377.19

v6.0.0-preview.3.21201.4
@echesakovMSFT
echesakovMSFT committed on Feb 26

@briansull
Copy link
Contributor

briansull commented Aug 10, 2021

Possibly related regression:
#49163 [Perf] Regression in System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches

@EgorBo
Copy link
Member

EgorBo commented Aug 11, 2021

I'll try to bisect the exact commit and reproduce locally

@EgorBo EgorBo assigned EgorBo and unassigned briansull Aug 11, 2021
@EgorBo
Copy link
Member

EgorBo commented Aug 12, 2021

This was found in a 5.0 -> 6.0 comparison. I noticed that this was never copied over to be investigated.

@DrewScoggins that comparison was also made on Ubuntu?
I can't reproduce this issue locally on both ubuntu-x64 and win-x64 🙁

The histogram states that the only commit that could affect this benchmark was Enable CSE for floating-point constants (#44419)

image

and indeed, Burgers benchmark do a lot of floating-point operations with constants so, in theory, CSE could lead to spills inside loops for loop-carried dependencies, but I wasn't able to reproduce any noticeable differences before after that commit (the issue states that it became 30% slower). Neither I could reproduce it for .NET 5.0 vs .NET 6.0P7, performance results are the same.

Here is the codegen diff: https://www.diffchecker.com/XtHoG3rP
there are some oddities like a spill and some loads aren't "contained" inside the main loop, but I still don't see any difference in final report BDN makes for me.

@JulieLeeMSFT
Copy link
Member

@DrewScoggins that comparison was also made on Ubuntu?

Ping @DrewScoggins since we are trying to fix issues before RC1 snap on 8/17.

@DrewScoggins
Copy link
Member Author

I have been working with @EgorBo on this offline. I think he has everything he needs from me, but he can let me know if he needs anything else.

@EgorBo
Copy link
Member

EgorBo commented Aug 14, 2021

I managed to get a stable repro (10-15% regression) after some tuning in my BIOS, so here is the asm diff: https://www.diffchecker.com/9StkXdFY so it's indeed a regression caused by 70676fd

we had

vmulsd    xmm0,xmm0,qword ptr [7F1BA7074F90] ; last one is a const that is loaded from data section

but after CSE-for-floats PR we now have:

vmovsd    xmm6,qword ptr [rbp-50]
vmulsd    xmm0,xmm0,xmm6

So we hoisted the const but it was spilled to stack so in theory we shouldn't have hoisted/CSE it at all in these conditions (lots of computations/registers are busy)

/cc @briansull @kunalspathak

I'm still looking on what we can do here (e.g. why it was spilled in the first place - we have plenty of spare xmm registers) but this regression is definitely not a reason to revert CSE-for-floats so this issue might be marked as not-actionable for .NET 6.0

@AndyAyersMS
Copy link
Member

There are no callee-save XMMs on SysV, so on such platforms we should be pretty reluctant to do a float CSE that crosses calls.

Are we sure we're checking the right conditions in the CSE heuristics?

@JulieLeeMSFT
Copy link
Member

Are we sure we're checking the right conditions in the CSE heuristics?

CC @briansull

@EgorBo
Copy link
Member

EgorBo commented Aug 14, 2021

image

while on Windows ABI we save xmm0 to xmm6 before the loop.

@AndyAyersMS
Copy link
Member

More apropos:

using System;
using System.Runtime.CompilerServices;

class X
{
     public static float F(int N)
     {
          float result = 0.0F;

          for (int i = 0; i < N; i++)
          {
             result += 3.14F;
             G();
             result += 3.14F;
          }
          return result;
           
     }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void G() {}
}

Win X64

; Assembly listing for method X:F(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  4,  7   )     int  ->  rsi         single-def
;  V01 loc0         [V01,T02] (  6, 18   )   float  ->  mm6        
;  V02 loc1         [V02,T00] (  4, 13   )     int  ->  rdi        
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V04 cse0         [V04,T03] (  3,  9   )   float  ->  mm7         "CSE - aggressive"
;
; Lcl frame size = 72

G_M28134_IG01:
       push     rdi
       push     rsi
       sub      rsp, 72
       vzeroupper 
       vmovaps  qword ptr [rsp+30H], xmm6
       vmovaps  qword ptr [rsp+20H], xmm7
       mov      esi, ecx
						;; bbWeight=1    PerfScore 9.50
G_M28134_IG02:
       vxorps   xmm6, xmm6
       xor      edi, edi
       test     esi, esi
       jle      SHORT G_M28134_IG04
       vmovss   xmm7, dword ptr [reloc @RWD00]
						;; bbWeight=1    PerfScore 3.83
G_M28134_IG03:
       vaddss   xmm6, xmm6, xmm7
       call     X:G()
       vaddss   xmm6, xmm6, xmm7
       inc      edi
       cmp      edi, esi
       jl       SHORT G_M28134_IG03
						;; bbWeight=4    PerfScore 34.00
G_M28134_IG04:
       vmovaps  xmm0, xmm6
						;; bbWeight=1    PerfScore 0.25
G_M28134_IG05:
       vmovaps  xmm6, qword ptr [rsp+30H]
       vmovaps  xmm7, qword ptr [rsp+20H]
       add      rsp, 72
       pop      rsi
       pop      rdi
       ret      
						;; bbWeight=1    PerfScore 10.25
RWD00  	dd	4048F5C3h		;      3.14

SysV X64

; Assembly listing for method X:F(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  4,  7   )     int  ->  rbx         single-def
;  V01 loc0         [V01,T02] (  6, 18   )   float  ->  [rbp-14H]  
;  V02 loc1         [V02,T00] (  4, 13   )     int  ->  r14        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V04 cse0         [V04,T03] (  3, 12   )   float  ->  [rbp-18H]   spill-single-def "CSE - aggressive"
;
; Lcl frame size = 16

G_M28134_IG01:
       push     rbp
       push     r14
       push     rbx
       sub      rsp, 16
       vzeroupper 
       lea      rbp, [rsp+20H]
       mov      ebx, edi
						;; bbWeight=1    PerfScore 5.00
G_M28134_IG02:
       vxorps   xmm0, xmm0
       xor      r14d, r14d
       test     ebx, ebx
       jle      SHORT G_M28134_IG04
						;; bbWeight=1    PerfScore 1.83
G_M28134_IG03:
       vmovss   xmm1, dword ptr [reloc @RWD00]
       vmovss   dword ptr [rbp-18H], xmm1
       vaddss   xmm0, xmm0, xmm1
       vmovss   dword ptr [rbp-14H], xmm0
       call     X:G()
       vmovss   xmm1, dword ptr [rbp-18H]
       vaddss   xmm0, xmm1, dword ptr [rbp-14H]
       inc      r14d
       cmp      r14d, ebx
       jl       SHORT G_M28134_IG03
						;; bbWeight=4    PerfScore 62.00
G_M28134_IG04:
       add      rsp, 16
       pop      rbx
       pop      r14
       pop      rbp
       ret      
						;; bbWeight=1    PerfScore 2.75
RWD00  	dd	4048F5C3h		;      3.14

Not only is the latter CSE harmful, it's made even worse as we don't hoist its def out of the loop either....

@EgorBo
Copy link
Member

EgorBo commented Aug 14, 2021

@AndyAyersMS in your case it seems we give up on hoisting the def here:

if (loopVarCount >= availRegCount)
{
// Don't hoist expressions that are not heavy: tree->GetCostEx() < (2*IND_COST_EX)
if (tree->GetCostEx() < (2 * IND_COST_EX))
{
return false;
}
}

where availRegCount is 0 because CNT_CALLEE_SAVED_FLOAT is 0 for SysV and we don't use CNT_CALLEE_TRASH_FLOAT because we have a call in the loop.

@EgorBo
Copy link
Member

EgorBo commented Aug 14, 2021

@AndyAyersMS
Copy link
Member

Right, not hoisting is reasonable. But CSEing is not. Not hoisting and then CSEing is the worst of all. Hoist and CSE need to be on the same page.

optcse.cpp never refers to CNT_CALLEE_SAVED_FLOAT -- seems like it should.

@AndyAyersMS
Copy link
Member

Blocking all non-hoistable CSEs is probably sub-optimal -- if all uses are one one side of the call or the other then a CSE will likely pay off even if the def is not hoisted. But we should not be doing cross-call CSEs of cheap float trees on SysV.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 15, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 17, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Sep 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants