-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Perf] Regressions in Burgers.Test0 #57087
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Baseline / Compare change hash indicates that this PR was responsible. (Or possibly a previous PR on the same day or two) Separate refactoring changes in 43250 (#48199) |
Possibly related regression: |
I'll try to bisect the exact commit and reproduce locally |
@DrewScoggins that comparison was also made on Ubuntu? The histogram states that the only commit that could affect this benchmark was Enable CSE for floating-point constants (#44419) and indeed, Burgers benchmark do a lot of floating-point operations with constants so, in theory, CSE could lead to spills inside loops for loop-carried dependencies, but I wasn't able to reproduce any noticeable differences before after that commit (the issue states that it became 30% slower). Neither I could reproduce it for .NET 5.0 vs .NET 6.0P7, performance results are the same. Here is the codegen diff: https://www.diffchecker.com/XtHoG3rP |
Ping @DrewScoggins since we are trying to fix issues before RC1 snap on 8/17. |
I have been working with @EgorBo on this offline. I think he has everything he needs from me, but he can let me know if he needs anything else. |
I managed to get a stable repro (10-15% regression) after some tuning in my BIOS, so here is the asm diff: https://www.diffchecker.com/9StkXdFY so it's indeed a regression caused by 70676fd we had vmulsd xmm0,xmm0,qword ptr [7F1BA7074F90] ; last one is a const that is loaded from data section but after CSE-for-floats PR we now have: vmovsd xmm6,qword ptr [rbp-50]
vmulsd xmm0,xmm0,xmm6 So we hoisted the const but it was spilled to stack so in theory we shouldn't have hoisted/CSE it at all in these conditions (lots of computations/registers are busy) I'm still looking on what we can do here (e.g. why it was spilled in the first place - we have plenty of spare xmm registers) but this regression is definitely not a reason to revert CSE-for-floats so this issue might be marked as not-actionable for .NET 6.0 |
There are no callee-save XMMs on SysV, so on such platforms we should be pretty reluctant to do a float CSE that crosses calls. Are we sure we're checking the right conditions in the CSE heuristics? |
CC @briansull |
More apropos: using System;
using System.Runtime.CompilerServices;
class X
{
public static float F(int N)
{
float result = 0.0F;
for (int i = 0; i < N; i++)
{
result += 3.14F;
G();
result += 3.14F;
}
return result;
}
[MethodImpl(MethodImplOptions.NoInlining)]
public static void G() {}
} Win X64 ; Assembly listing for method X:F(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 4, 7 ) int -> rsi single-def
; V01 loc0 [V01,T02] ( 6, 18 ) float -> mm6
; V02 loc1 [V02,T00] ( 4, 13 ) int -> rdi
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V04 cse0 [V04,T03] ( 3, 9 ) float -> mm7 "CSE - aggressive"
;
; Lcl frame size = 72
G_M28134_IG01:
push rdi
push rsi
sub rsp, 72
vzeroupper
vmovaps qword ptr [rsp+30H], xmm6
vmovaps qword ptr [rsp+20H], xmm7
mov esi, ecx
;; bbWeight=1 PerfScore 9.50
G_M28134_IG02:
vxorps xmm6, xmm6
xor edi, edi
test esi, esi
jle SHORT G_M28134_IG04
vmovss xmm7, dword ptr [reloc @RWD00]
;; bbWeight=1 PerfScore 3.83
G_M28134_IG03:
vaddss xmm6, xmm6, xmm7
call X:G()
vaddss xmm6, xmm6, xmm7
inc edi
cmp edi, esi
jl SHORT G_M28134_IG03
;; bbWeight=4 PerfScore 34.00
G_M28134_IG04:
vmovaps xmm0, xmm6
;; bbWeight=1 PerfScore 0.25
G_M28134_IG05:
vmovaps xmm6, qword ptr [rsp+30H]
vmovaps xmm7, qword ptr [rsp+20H]
add rsp, 72
pop rsi
pop rdi
ret
;; bbWeight=1 PerfScore 10.25
RWD00 dd 4048F5C3h ; 3.14 SysV X64 ; Assembly listing for method X:F(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rbp based frame
; partially interruptible
; No PGO data
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 4, 7 ) int -> rbx single-def
; V01 loc0 [V01,T02] ( 6, 18 ) float -> [rbp-14H]
; V02 loc1 [V02,T00] ( 4, 13 ) int -> r14
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V04 cse0 [V04,T03] ( 3, 12 ) float -> [rbp-18H] spill-single-def "CSE - aggressive"
;
; Lcl frame size = 16
G_M28134_IG01:
push rbp
push r14
push rbx
sub rsp, 16
vzeroupper
lea rbp, [rsp+20H]
mov ebx, edi
;; bbWeight=1 PerfScore 5.00
G_M28134_IG02:
vxorps xmm0, xmm0
xor r14d, r14d
test ebx, ebx
jle SHORT G_M28134_IG04
;; bbWeight=1 PerfScore 1.83
G_M28134_IG03:
vmovss xmm1, dword ptr [reloc @RWD00]
vmovss dword ptr [rbp-18H], xmm1
vaddss xmm0, xmm0, xmm1
vmovss dword ptr [rbp-14H], xmm0
call X:G()
vmovss xmm1, dword ptr [rbp-18H]
vaddss xmm0, xmm1, dword ptr [rbp-14H]
inc r14d
cmp r14d, ebx
jl SHORT G_M28134_IG03
;; bbWeight=4 PerfScore 62.00
G_M28134_IG04:
add rsp, 16
pop rbx
pop r14
pop rbp
ret
;; bbWeight=1 PerfScore 2.75
RWD00 dd 4048F5C3h ; 3.14 Not only is the latter CSE harmful, it's made even worse as we don't hoist its def out of the loop either.... |
@AndyAyersMS in your case it seems we give up on hoisting the def here: runtime/src/coreclr/jit/optimizer.cpp Lines 6013 to 6020 in 53ebce3
where availRegCount is 0 because |
A fix? EgorBo@00f9eab, Diff: https://www.diffchecker.com/wakOETMr |
Right, not hoisting is reasonable. But CSEing is not. Not hoisting and then CSEing is the worst of all. Hoist and CSE need to be on the same page. optcse.cpp never refers to |
Blocking all non-hoistable CSEs is probably sub-optimal -- if all uses are one one side of the call or the other then a CSE will likely pay off even if the def is not hoisted. But we should not be doing cross-call CSEs of cheap float trees on SysV. |
This was found in a 5.0 -> 6.0 comparison. I noticed that this was never copied over to be investigated.
Run Information
Regressions in Burgers
Historical Data in Reporting System
Repro
.
Payloads
Baseline
Compare
Histogram
Burgers.Test2
Burgers.Test0
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
The text was updated successfully, but these errors were encountered: