[Perf] Regressions in Burgers.Test0 #57087

DrewScoggins · 2021-08-09T20:37:25Z

This was found in a 5.0 -> 6.0 comparison. I noticed that this was never copied over to be investigated.

Run Information

Architecture	x64
OS	ubuntu 18.04
Baseline	c7fa2955e3f860f8070adea115f7277281228a2e
Compare	ee3f7daed083477689a4c6240025afa45ffa3352

Regressions in Burgers

Benchmark	Baseline	Test	Test/Base	Baseline IR	Compare IR	IR Ratio	Baseline ETL	Compare ETL
Test0	227.13 ms	300.60 ms	1.32

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'Burgers*'

.

Payloads

Baseline
Compare

Histogram

Burgers.Test2

[173077567.563 ; 180818538.259) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[180818538.259 ; 188559508.954) | 
[188559508.954 ; 196300479.650) | 
[196300479.650 ; 204041450.346) | 
[204041450.346 ; 211782421.042) | 
[211782421.042 ; 217157535.712) | 
[217157535.712 ; 224898506.407) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[224898506.407 ; 235629468.257) | 
[235629468.257 ; 243618923.481) | @@@@@@@@@@@

Burgers.Test0

[221800734.528 ; 229054539.920) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[229054539.920 ; 237545194.113) | @@@@@@@
[237545194.113 ; 246815564.573) | 
[246815564.573 ; 253034981.489) | 
[253034981.489 ; 262781984.527) | @@@@
[262781984.527 ; 272052354.987) | 
[272052354.987 ; 281322725.447) | 
[281322725.447 ; 290593095.907) | 
[290593095.907 ; 297131428.042) | 
[297131428.042 ; 303545307.882) | @@@@@@@@@@@

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

The text was updated successfully, but these errors were encountered:

dotnet-issue-labeler · 2021-08-09T20:37:27Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

briansull · 2021-08-10T21:26:43Z

Compare	`ee3f7da`

Baseline / Compare change hash indicates that this PR was responsible. (Or possibly a previous PR on the same day or two)

Separate refactoring changes in 43250 (#48199)
main (#48199)
v6.0.0-preview.7.21377.19
…
v6.0.0-preview.3.21201.4
@echesakovMSFT
echesakovMSFT committed on Feb 26

briansull · 2021-08-10T21:30:12Z

Possibly related regression:
#49163 [Perf] Regression in System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches

EgorBo · 2021-08-11T12:50:09Z

I'll try to bisect the exact commit and reproduce locally

EgorBo · 2021-08-12T07:01:19Z

This was found in a 5.0 -> 6.0 comparison. I noticed that this was never copied over to be investigated.

@DrewScoggins that comparison was also made on Ubuntu?
I can't reproduce this issue locally on both ubuntu-x64 and win-x64 🙁

The histogram states that the only commit that could affect this benchmark was Enable CSE for floating-point constants (#44419)

and indeed, Burgers benchmark do a lot of floating-point operations with constants so, in theory, CSE could lead to spills inside loops for loop-carried dependencies, but I wasn't able to reproduce any noticeable differences before after that commit (the issue states that it became 30% slower). Neither I could reproduce it for .NET 5.0 vs .NET 6.0P7, performance results are the same.

Here is the codegen diff: https://www.diffchecker.com/XtHoG3rP
there are some oddities like a spill and some loads aren't "contained" inside the main loop, but I still don't see any difference in final report BDN makes for me.

JulieLeeMSFT · 2021-08-13T22:54:49Z

@DrewScoggins that comparison was also made on Ubuntu?

Ping @DrewScoggins since we are trying to fix issues before RC1 snap on 8/17.

DrewScoggins · 2021-08-13T22:57:16Z

I have been working with @EgorBo on this offline. I think he has everything he needs from me, but he can let me know if he needs anything else.

EgorBo · 2021-08-14T08:43:14Z

I managed to get a stable repro (10-15% regression) after some tuning in my BIOS, so here is the asm diff: https://www.diffchecker.com/9StkXdFY so it's indeed a regression caused by 70676fd

we had

vmulsd    xmm0,xmm0,qword ptr [7F1BA7074F90] ; last one is a const that is loaded from data section

but after CSE-for-floats PR we now have:

vmovsd    xmm6,qword ptr [rbp-50]
vmulsd    xmm0,xmm0,xmm6

So we hoisted the const but it was spilled to stack so in theory we shouldn't have hoisted/CSE it at all in these conditions (lots of computations/registers are busy)

/cc @briansull @kunalspathak

I'm still looking on what we can do here (e.g. why it was spilled in the first place - we have plenty of spare xmm registers) but this regression is definitely not a reason to revert CSE-for-floats so this issue might be marked as not-actionable for .NET 6.0

AndyAyersMS · 2021-08-14T16:06:52Z

There are no callee-save XMMs on SysV, so on such platforms we should be pretty reluctant to do a float CSE that crosses calls.

Are we sure we're checking the right conditions in the CSE heuristics?

JulieLeeMSFT · 2021-08-14T16:33:41Z

Are we sure we're checking the right conditions in the CSE heuristics?

CC @briansull

EgorBo · 2021-08-14T19:18:43Z

while on Windows ABI we save xmm0 to xmm6 before the loop.

AndyAyersMS · 2021-08-14T21:39:03Z

More apropos:

using System;
using System.Runtime.CompilerServices;

class X
{
     public static float F(int N)
     {
          float result = 0.0F;

          for (int i = 0; i < N; i++)
          {
             result += 3.14F;
             G();
             result += 3.14F;
          }
          return result;
           
     }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void G() {}
}

Win X64

; Assembly listing for method X:F(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  4,  7   )     int  ->  rsi         single-def
;  V01 loc0         [V01,T02] (  6, 18   )   float  ->  mm6        
;  V02 loc1         [V02,T00] (  4, 13   )     int  ->  rdi        
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V04 cse0         [V04,T03] (  3,  9   )   float  ->  mm7         "CSE - aggressive"
;
; Lcl frame size = 72

G_M28134_IG01:
       push     rdi
       push     rsi
       sub      rsp, 72
       vzeroupper 
       vmovaps  qword ptr [rsp+30H], xmm6
       vmovaps  qword ptr [rsp+20H], xmm7
       mov      esi, ecx
						;; bbWeight=1    PerfScore 9.50
G_M28134_IG02:
       vxorps   xmm6, xmm6
       xor      edi, edi
       test     esi, esi
       jle      SHORT G_M28134_IG04
       vmovss   xmm7, dword ptr [reloc @RWD00]
						;; bbWeight=1    PerfScore 3.83
G_M28134_IG03:
       vaddss   xmm6, xmm6, xmm7
       call     X:G()
       vaddss   xmm6, xmm6, xmm7
       inc      edi
       cmp      edi, esi
       jl       SHORT G_M28134_IG03
						;; bbWeight=4    PerfScore 34.00
G_M28134_IG04:
       vmovaps  xmm0, xmm6
						;; bbWeight=1    PerfScore 0.25
G_M28134_IG05:
       vmovaps  xmm6, qword ptr [rsp+30H]
       vmovaps  xmm7, qword ptr [rsp+20H]
       add      rsp, 72
       pop      rsi
       pop      rdi
       ret      
						;; bbWeight=1    PerfScore 10.25
RWD00  	dd	4048F5C3h		;      3.14

SysV X64

; Assembly listing for method X:F(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  4,  7   )     int  ->  rbx         single-def
;  V01 loc0         [V01,T02] (  6, 18   )   float  ->  [rbp-14H]  
;  V02 loc1         [V02,T00] (  4, 13   )     int  ->  r14        
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V04 cse0         [V04,T03] (  3, 12   )   float  ->  [rbp-18H]   spill-single-def "CSE - aggressive"
;
; Lcl frame size = 16

G_M28134_IG01:
       push     rbp
       push     r14
       push     rbx
       sub      rsp, 16
       vzeroupper 
       lea      rbp, [rsp+20H]
       mov      ebx, edi
						;; bbWeight=1    PerfScore 5.00
G_M28134_IG02:
       vxorps   xmm0, xmm0
       xor      r14d, r14d
       test     ebx, ebx
       jle      SHORT G_M28134_IG04
						;; bbWeight=1    PerfScore 1.83
G_M28134_IG03:
       vmovss   xmm1, dword ptr [reloc @RWD00]
       vmovss   dword ptr [rbp-18H], xmm1
       vaddss   xmm0, xmm0, xmm1
       vmovss   dword ptr [rbp-14H], xmm0
       call     X:G()
       vmovss   xmm1, dword ptr [rbp-18H]
       vaddss   xmm0, xmm1, dword ptr [rbp-14H]
       inc      r14d
       cmp      r14d, ebx
       jl       SHORT G_M28134_IG03
						;; bbWeight=4    PerfScore 62.00
G_M28134_IG04:
       add      rsp, 16
       pop      rbx
       pop      r14
       pop      rbp
       ret      
						;; bbWeight=1    PerfScore 2.75
RWD00  	dd	4048F5C3h		;      3.14

Not only is the latter CSE harmful, it's made even worse as we don't hoist its def out of the loop either....

EgorBo · 2021-08-14T22:21:46Z

@AndyAyersMS in your case it seems we give up on hoisting the def here:

runtime/src/coreclr/jit/optimizer.cpp

Lines 6013 to 6020 in 53ebce3

    
           if (loopVarCount >= availRegCount) 
        
           { 
        
               // Don't hoist expressions that are not heavy: tree->GetCostEx() < (2*IND_COST_EX) 
        
               if (tree->GetCostEx() < (2 * IND_COST_EX)) 
        
               { 
        
                   return false; 
        
               } 
        
           }

where availRegCount is 0 because CNT_CALLEE_SAVED_FLOAT is 0 for SysV and we don't use CNT_CALLEE_TRASH_FLOAT because we have a call in the loop.

EgorBo · 2021-08-14T22:28:54Z

A fix? EgorBo@00f9eab, Diff: https://www.diffchecker.com/wakOETMr

AndyAyersMS · 2021-08-14T22:29:01Z

Right, not hoisting is reasonable. But CSEing is not. Not hoisting and then CSEing is the worst of all. Hoist and CSE need to be on the same page.

optcse.cpp never refers to CNT_CALLEE_SAVED_FLOAT -- seems like it should.

AndyAyersMS · 2021-08-14T22:32:56Z

Blocking all non-hoistable CSEs is probably sub-optimal -- if all uses are one one side of the call or the other then a CSE will likely pay off even if the def is not hoisted. But we should not be doing cross-call CSEs of cheap float trees on SysV.

DrewScoggins added os-linux Linux OS (any supported distro) tenet-performance Performance related issue tenet-performance-benchmarks Issue from performance benchmark arch-x64 labels Aug 9, 2021

DrewScoggins assigned briansull Aug 9, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Aug 9, 2021

danmoseley added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 10, 2021

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Aug 10, 2021

JulieLeeMSFT added this to the 6.0.0 milestone Aug 10, 2021

briansull mentioned this issue Aug 10, 2021

[Perf] Regression in System.Linq.Tests.Perf_Enumerable.FirstWithPredicate_LastElementMatches #49163

Closed

EgorBo assigned EgorBo and unassigned briansull Aug 11, 2021

JulieLeeMSFT assigned DrewScoggins Aug 13, 2021

JulieLeeMSFT unassigned DrewScoggins Aug 14, 2021

EgorBo mentioned this issue Aug 15, 2021

Skip CSE for non-hoistable floats #57438

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Aug 15, 2021

EgorBo closed this as completed in #57438 Aug 17, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Aug 17, 2021

ghost locked as resolved and limited conversation to collaborators Sep 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Perf] Regressions in Burgers.Test0 #57087

[Perf] Regressions in Burgers.Test0 #57087

DrewScoggins commented Aug 9, 2021 •

edited

Loading

Payloads

Histogram

Burgers.Test2

Burgers.Test0

Docs

dotnet-issue-labeler bot commented Aug 9, 2021

briansull commented Aug 10, 2021

briansull commented Aug 10, 2021 •

edited

Loading

EgorBo commented Aug 11, 2021 •

edited

Loading

EgorBo commented Aug 12, 2021 •

edited

Loading

JulieLeeMSFT commented Aug 13, 2021

DrewScoggins commented Aug 13, 2021

EgorBo commented Aug 14, 2021 •

edited

Loading

AndyAyersMS commented Aug 14, 2021

JulieLeeMSFT commented Aug 14, 2021

EgorBo commented Aug 14, 2021 •

edited

Loading

AndyAyersMS commented Aug 14, 2021

EgorBo commented Aug 14, 2021

EgorBo commented Aug 14, 2021

AndyAyersMS commented Aug 14, 2021

AndyAyersMS commented Aug 14, 2021

[Perf] Regressions in Burgers.Test0 #57087

[Perf] Regressions in Burgers.Test0 #57087

Comments

DrewScoggins commented Aug 9, 2021 • edited Loading

Run Information

Regressions in Burgers

Repro

Payloads

Histogram

Burgers.Test2

Burgers.Test0

Docs

dotnet-issue-labeler bot commented Aug 9, 2021

briansull commented Aug 10, 2021

briansull commented Aug 10, 2021 • edited Loading

EgorBo commented Aug 11, 2021 • edited Loading

EgorBo commented Aug 12, 2021 • edited Loading

JulieLeeMSFT commented Aug 13, 2021

DrewScoggins commented Aug 13, 2021

EgorBo commented Aug 14, 2021 • edited Loading

AndyAyersMS commented Aug 14, 2021

JulieLeeMSFT commented Aug 14, 2021

EgorBo commented Aug 14, 2021 • edited Loading

AndyAyersMS commented Aug 14, 2021

EgorBo commented Aug 14, 2021

EgorBo commented Aug 14, 2021

AndyAyersMS commented Aug 14, 2021

AndyAyersMS commented Aug 14, 2021

DrewScoggins commented Aug 9, 2021 •

edited

Loading

briansull commented Aug 10, 2021 •

edited

Loading

EgorBo commented Aug 11, 2021 •

edited

Loading

EgorBo commented Aug 12, 2021 •

edited

Loading

EgorBo commented Aug 14, 2021 •

edited

Loading

EgorBo commented Aug 14, 2021 •

edited

Loading