Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Extend escape analysis to account for arrays with non-gcref elements #104906

Open
wants to merge 59 commits into
base: main
Choose a base branch
from

Conversation

hez2010
Copy link
Contributor

@hez2010 hez2010 commented Jul 15, 2024

Positive case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[1] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;* V03 tmp2         [V03    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V05 tmp4         [V05    ] (  0,  0   )   short  ->  zero-ref    "V02.[018..020)"
;
; Lcl frame size = 40

G_M25548_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M25548_IG02:  ;; offset=0x0004
       mov      ecx, 84
       call     [System.Console:WriteLine(int)]
       nop
                                                ;; size=12 bbWeight=1 PerfScore 3.50
G_M25548_IG03:  ;; offset=0x0010
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 21, prolog size 4, PerfScore 5.00, instruction count 6, allocated bytes for code 21 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)

Negative case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[42] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;  V03 tmp2         [V03,T00] (  1,  0   )   byref  ->  rbx         must-init "dummy temp of must thrown exception"
;* V04 tmp3         [V04    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V05 tmp4         [V05    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V06 tmp5         [V06    ] (  0,  0   )   short  ->  zero-ref    single-def "V02.[018..020)"
;
; Lcl frame size = 32

G_M25548_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 32
       xor      ebx, ebx
                                                ;; size=7 bbWeight=0 PerfScore 0.00
G_M25548_IG02:  ;; offset=0x0007
       call     CORINFO_HELP_RNGCHKFAIL
       movsx    rcx, word  ptr [rbx]
       call     [System.Console:WriteLine(int)]
       int3
                                                ;; size=16 bbWeight=0 PerfScore 0.00

; Total bytes of code 23, prolog size 5, PerfScore 0.00, instruction count 7, allocated bytes for code 23 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)
; ============================================================

Benchmark on Mandelbrot:

Method Job Mean Error StdDev Code Size Allocated
MandelBrot NoStackAllocationArray 199.7 us 1.30 us 1.22 us 1,996 B 2.49 KB
MandelBrot StackAllocationArray 195.8 us 1.16 us 1.08 us 2,414 B 1.14 KB

Diff: https://www.diffchecker.com/bNP4qHdF/

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 15, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 15, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For arrays (and also perhaps boxes and ref classes) we ought to have some kind of size limit... possibly similar to the one we use for stackallocs.

We need to be careful we don't allocate a lot of stack for an object that might not be heavily used, as we'll pay per-call prolog zeroing costs.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to handle the align8 arrays.

But if you do, if you're on a 32 bit platform you probably need to mark the stack allocation local with lvStructDoubleAlign. And run some tests to make sure the right things actually happen.

@AndyAyersMS
Copy link
Member

Thank you for working on this. I don't see us taking this for .NET 9 but perhaps we can soon thereafter. I will take a closer look next week sometime.

The stats above are not all that encouraging. I think for both arrays and ref classes we likely need to take a more holistic look at what is blocking us from finding more cases (or if indeed, if there are not more cases to be had). Eugene did this sort of thing when the initial support was implemented and found similarly less-than-encouraging results (which is why we never enabled it), but it is worth taking a fresh look.

Did you put up a PR for the field-wise analysis anywhere? That may be one of the key missing pieces.

@hez2010
Copy link
Contributor Author

hez2010 commented Jul 19, 2024

Did you put up a PR for the field-wise analysis anywhere? That may be one of the key missing pieces.

It's on my branch. Will open a draft PR soon to validate CI and see how much benefits we can gain from the field-wise analysis.

@hez2010

This comment was marked as outdated.

@hez2010

This comment was marked as resolved.

@hez2010

This comment was marked as resolved.

@hez2010
Copy link
Contributor Author

hez2010 commented Jul 20, 2024

Tests (modulus unrelated) are passing.
PTAL @jakobbotsch

name arrays (%) arrays with field-wise analysis (%)
benchmarks.run.windows.x64.checked 14/4163 (0.34%) 14/4084 (0.34%)
benchmarks.run_pgo.windows.x64.checked 25/1562 (1.60%) 36/1550 (2.32%)
benchmarks.run_tiered.windows.x64.checked 14/378 (3.70%) 10/368 (2.72%)
coreclr_tests.run.windows.x64.checked 162/60525 (0.27%) 183/60271 (0.30%)
libraries.crossgen2.windows.x64.checked 13/11330 (0.11%) 48/11300 (0.42%)
libraries.pmi.windows.x64.checked 16/14840 (0.11%) 325/14577 (2.23%)
libraries_tests.run.windows.x64.Release 79/16584 (0.48%) 124/16167 (0.77%)
libraries_tests_no_tiered_compilation.run.windows.x64.Release 325/109551 (0.30%) 1340/108652 (1.23%)
realworld.run.windows.x64.checked 1/3438 (0.03%) 3/3331 (0.09%)
smoke_tests.nativeaot.windows.x64.checked 0/857 (0.00%) 2/855 (0.23%)

with field-wise analysis we are missing context for more methods, so it's possible that spmi reports worse stats.

@hez2010
Copy link
Contributor Author

hez2010 commented Jul 20, 2024

I collected the method context against BCL by myself to prevent missing method context.

name ref-classes (%) boxes (%) arrays (%)
w/o field-wise 177/49224 (0.36%) 380/7015 (5.42%) 19/5195 (0.37%)
with field-wise 407/55667 (0.73%) 375/7031 (5.33%) 38/5823 (0.65%)

@hez2010
Copy link
Contributor Author

hez2010 commented Jul 22, 2024

I don't see us taking this for .NET 9 but perhaps we can soon thereafter. I will take a closer look next week sometime.

The stats above are not all that encouraging. I think for both arrays and ref classes we likely need to take a more holistic look at what is blocking us from finding more cases (or if indeed, if there are not more cases to be had). Eugene did this sort of thing when the initial support was implemented and found similarly less-than-encouraging results (which is why we never enabled it), but it is worth taking a fresh look.

It's actually a small change as it relies on the existing analysis and added the treatment to arrays as well, which is a "free" improvement. I personally would like to see it goes to .NET 9.

@JulieLeeMSFT
Copy link
Member

This is optimization, not correctness issue. We don't have time to work on this for .NET 9, so we will review it in .NET 10.

@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Aug 12, 2024
@JulieLeeMSFT
Copy link
Member

@AndyAyersMS please review this community PR.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions along with the particular notes:

  • do you see diffs with these changes when DOTNET_JitObjectStackAllocationArray=0?
  • are stack allocated arrays amenable to physical promotion? If not, why not?
  • what does codegen look like for a locally allocated array consumed by a loop? Is it at least as good as the heap case?
  • we need a tighter size limit, 8192 is too high. We likely need one both for individual allocations (especially if conditional) and for the overall amount of stack size increase. For individual allocations we should either match the stackalloc -> local size limit or gather data suggesting why a different choice is the right one. For the overall limit I'm less sure, but something around 8192 for all stack allocated locals seems about right.
  • given that, we might need a prioritization scheme, if we have so many allocation opportunities that some get left out. But I expect this will be rare.

src/coreclr/jit/gentree.h Show resolved Hide resolved
src/coreclr/jit/lclmorph.cpp Show resolved Hide resolved
src/coreclr/jit/lclmorph.cpp Show resolved Hide resolved
}
else
{
*use = m_compiler->gtNewOperNode(GT_COMMA, node->TypeGet(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to handle the out of bounds cases here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC not handing it will make the negative case address exposed, while inserting a BOUND_CHECK node will lead to an assert in later phase.

@@ -8358,10 +8358,20 @@ GenTree* Compiler::fgMorphSmpOp(GenTree* tree, MorphAddrContext* mac, bool* optA
return tree;
}

#ifdef DEBUG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted over in #107542 we don't want to have control flow logic conditional on DEBUG like this... we need to find a better fix.

@hez2010
Copy link
Contributor Author

hez2010 commented Sep 18, 2024

  • do you see diffs with these changes when DOTNET_JitObjectStackAllocationArray=0?

I will test it and post the result later.

  • are stack allocated arrays amenable to physical promotion? If not, why not?

I think yes?

int[] a = new[] { 1,2,3,4 };
return a[2];

will produce the following codegen:

mov rax, 3
ret
  • what does codegen look like for a locally allocated array consumed by a loop? Is it at least as good as the heap case?

We need #102965 to get better codegen for this case. Otherwise it gets address exposed.

  • we need a tighter size limit, 8192 is too high. We likely need one both for individual allocations (especially if conditional) and for the overall amount of stack size increase. For individual allocations we should either match the stackalloc -> local size limit or gather data suggesting why a different choice is the right one. For the overall limit I'm less sure, but something around 8192 for all stack allocated locals seems about right.

Agree. I think we can do this in a following PR?

@hez2010
Copy link
Contributor Author

hez2010 commented Sep 20, 2024

do you see diffs with these changes when DOTNET_JitObjectStackAllocationArray=0?

Just checked superpmi and there's no stack allocated array when DOTNET_JitObjectStackAllocationArray=0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants