
JIT: make suitably optimistic prejit inline assessments #14850

Merged

Conversation

AndyAyersMS

When prejitting, the jit assesses whether each root method is a potential
inline candidate for any possible caller. Methods deemed un-inlinable in any
caller are marked in the runtime as "noinline" to save the jit some work
later on when it sees calls to these methods.

This assessment was too conservative and led to prejit-ordering dependences
for inlines. It also meant that prejitting was missing some inlines that
would have happened had we not done the prejit root assessment.

This change removes some of the prejit observation blockers. These mostly
will enable more prejit methods to become candidates. We also now track when
a method argument reaches a test.

When assessing profitability for a prejit root, we now assume the best
possible call site.
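
A minimal sketch of the idea (names and boost values here are assumptions for illustration; the real logic lives in the jit's inline policies):

    // Sketch only: for a prejit root there is no concrete call site, so we
    // credit the candidate with the most favorable call-site-dependent
    // boosts any future caller could supply.
    double EstimatePrejitRootMultiplier(double baseMultiplier)
    {
        double multiplier = baseMultiplier;
        multiplier += 1.0; // assume some argument reaches a test
        multiplier += 1.0; // assume a constant argument feeds a test
        return multiplier;
    }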

Also, update the inline XML to capture the prejit assessments.

This increases the number of inlines considered and performed when prejitting
and will also lead to slightly more inlining when jitting. However, we do not
expect a large throughput impact when jitting -- the main impact of this change
is that inlining when prejitting is now just as aggressive as inlining when
jitting, and the decisions no longer depend on the order in which methods are
prejitted.

Closes #14441.
Closes #3482.

@AndyAyersMS

@briansull PTAL
cc @dotnet/jit-contrib

Verified that if I disable the propagation of the noinline bit into the runtime we get the same inlines in corelib as we do when we propagate.

Size impact from jit-diffs. Note the size increase is expected, as this makes the prejit inliner as aggressive as the jit inliner.

I plan to follow-up with some commentary on the largest regressions.

Total bytes of diff: 179408 (0.79 % of base)
    diff is a regression.

Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions by size (bytes):
       36286 : Microsoft.CodeAnalysis.CSharp.dasm (1.67 % of base)
       28841 : System.Private.Xml.dasm (0.99 % of base)
       26980 : System.Private.CoreLib.dasm (0.77 % of base)
       21377 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.92 % of base)
        9905 : System.Data.Common.dasm (0.92 % of base)

Top file improvements by size (bytes):
         -11 : System.Net.WebSockets.dasm (-0.03 % of base)

57 total files with size differences (1 improved, 56 regressed), 73 unchanged.

Top method regressions by size (bytes):
        3847 : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (1 methods)
        2093 : System.Private.DataContractSerialization.dasm - CriticalHelper:WriteMembers(ref,ref,ref):int:this (2 methods)
        1384 : System.Private.Xml.dasm - System.Xml.XmlSqlBinaryReader:RescanNextToken():int:this (1 methods)
        1303 : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Binder:ReduceFrom(ref,ref,ref):this (1 methods)
        1189 : System.Private.Xml.dasm - <ParseElementMixedContentAsync>d__167:MoveNext():this (1 methods)

Top method improvements by size (bytes):
         -86 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LambdaRewriter:IntroduceFrame(ref,ref,ref,ref):ref:this (1 methods)
         -58 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.SourceNamedTypeSymbol:AddSyntheticMyGroupCollectionProperty(ref,bool,ref,ref,ref,ref,ref,ref,ref):this (1 methods)
         -38 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LocalRewriter:RewriteObjectInitializerExpression(ref,ref,ref):ref:this (1 methods)
         -37 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LocalRewriter:VisitSyncLockStatement(ref):ref:this (1 methods)
         -37 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LocalRewriter:RewriteWinRtEvent(ref,ref,bool):ref:this (1 methods)

2379 total methods with size differences (86 improved, 2293 regressed), 140651 unchanged.

Some data on the impact to inlining in corelib below. We see more candidates, and so more inline attempts and successes. There are fewer failures from "noinline" and more from "unprofitable". Unprofitable rejects are still somewhat cheap (they require an IL scan but no importation).

About 10% more methods in corelib are now considered inline candidates.

Version   Methods   w/Inlines   Attempts   Successes   Failures   Extras   PrejitCand
Base        27079       13151     103663       46278      57385      3602        15663
Diff        27079       13222     105614       47454      58160      3611        18452
Delta           0          71       1951        1176        775         9         2789

Success Breakdown

Base Count   Base Percent   Reason                        Diff Count   Diff Percent
     31585        68.25 %   below ALWAYS_INLINE size           31937        67.30 %
     12009        25.95 %   profitable inline                  12830        27.04 %
      2684         5.80 %   aggressive inline attribute         2687         5.66 %

Failure Breakdown

Base Count   Base Percent   Reason                           Diff Count   Diff Percent
     25470        44.38 %   noinline per IL/cached result         23262        40.00 %
     10700        18.65 %   unprofitable inline                   13413        23.06 %
      9091        15.84 %   target not direct                      9132        15.70 %
      6668        11.62 %   does not return                        6869        11.81 %
      1541         2.69 %   within catch region                    1541         2.65 %
      1309         2.28 %   too many il bytes                      1308         2.25 %
       977         1.70 %   cannot get method info                  977         1.68 %
       477         0.83 %   has exception handling                  476         0.82 %
       402         0.70 %   ldfld needs helper                      402         0.69 %
       298         0.52 %   runtime dictionary lookup               299         0.51 %
       125         0.22 %   not inline candidate                    131         0.23 %
       105         0.18 %   rarely called, has gc struct            120         0.21 %
       104         0.18 %   too many basic blocks                   105         0.18 %
        31         0.05 %   this pointer argument is null            31         0.05 %
        27         0.05 %   implicit recursive tail call             27         0.05 %
        23         0.04 %   complex handle access                    23         0.04 %
        12         0.02 %   ldsfld of value class                    14         0.02 %
         8         0.01 %   inline exceeds budget                    13         0.02 %
         5         0.01 %   uses stack crawl mark                     5         0.01 %
         4         0.01 %   has switch                               4         0.01 %
         4         0.01 %   unsupported opcode                       3         0.01 %
         2         0.00 %   recursive                                3         0.01 %
         1         0.00 %   too many locals                          1         0.00 %
         1         0.00 %   delegate invoke                          1         0.00 %

Extra Breakdown

Base Count   Base Percent   Reason                        Diff Count   Diff Percent
      2684         4.68 %   aggressive inline attribute         2687         4.62 %
       483         0.84 %   below ALWAYS_INLINE size             485         0.83 %
       435         0.76 %   profitable inline                    439         0.75 %

"Extras" are methods inlined because some inline parent is an aggressive inline, eg we have a call chain A->B->C where B is an aggressive inline. Once the jit inlines B into A it now may be required to do even more inlines. So think of it as the assessment of the knock-on effect from an aggressive non-leaf inline.

@@ -474,6 +474,12 @@ void InlineContext::DumpXml(FILE* file, unsigned indent)
m_Sibling->DumpXml(file, indent);
}

// Optionally suppress failing inline records
if ((JitConfig.JitInlineDumpXml() == 3) && !m_Success)
{

@briansull

Should we add an enum for the three values used by JitInlineDumpXml()?

@AndyAyersMS

There are a bunch of enhancements I have in mind for the inline XML, though as far as I know nobody but me has ever used it. I also have some analysis tooling lying about that I should publish somewhere, maybe over in jitutils.

I should really change this into a flags-style thing; you should be able to choose:

  • get XML for all methods or only those with inlines or inline attempts
  • get XML for all inline attempts or only those with successful inlines
  • get XML with full details or just summary details
  • specify filename to contain the results (currently always goes to stderr)

So maybe this is best done as a follow-up change.
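
For instance, a flags layout along these lines (a sketch with illustrative names, not a committed design):

    // Hypothetical flag values replacing the current 1/2/3 integer settings
    // of JitInlineDumpXml.
    enum InlineXmlFlags
    {
        IXF_ALL_METHODS    = 0x0, // XML for every method
        IXF_INLINES_ONLY   = 0x1, // only methods with inline attempts
        IXF_SUCCESSES_ONLY = 0x2, // only successful inlines
        IXF_SUMMARY_ONLY   = 0x4, // summary rather than full details
        IXF_TO_FILE        = 0x8, // write to a named file instead of stderr
    };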

@briansull left a comment

Looks Good

@AndyAyersMS

Looking at the larger regressions, so far they fall into two related patterns:

  • Top-level method has many call sites to mid-sized methods that now get inlined. For instance, RegexInterpreter:Go has a large switch to handle the various regex cases, most switch arms end up calling Advance, and this now gets inlined.
  • Top-level method has a few call sites, but the call tree at those sites fans out into sets of mid-sized methods, as in CriticalHelper:WriteMembers.

These cases highlight the fact that in the current inline heuristics, the profitability model does not depend on the overall set of inline candidates or on the history of inlines done so far. Each inline is assessed independently of all others. So if a method presents the inliner with a large number of marginal candidates, the inliner can grow code size fairly aggressively, since each candidate passes the bar when considered on its own. There is a limiter that can kick in if things get too far out of hand (the "inline exceeds budget" rejection case), and we do see an uptick in this in the data, but it is mainly there to prevent pathological behavior.
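
To make the limiter concrete, here is a simplified sketch of a budget-style check; the names and the scale factor are assumptions, and the actual accounting in the jit is more involved:

    // Each candidate passes or fails on its own merits, but a global budget
    // proportional to the root method's IL size can veto further inlining
    // once the method has grown too much.
    bool ExceedsInlineBudget(unsigned rootILSize,
                             unsigned inlinedILSoFar,
                             unsigned candidateILSize)
    {
        const unsigned budgetScale = 10; // assumed scale factor
        const unsigned budget      = budgetScale * rootILSize;
        return (inlinedILSoFar + candidateILSize) > budget;
    }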

As I have stated elsewhere, the current profitability heuristics are not well founded, and it is likely they give improper weight to the various facts available to the inliner. The upshot is that many marginal candidates appear more attractive than they ought to be. While some tweaking is possible, the biggest missing factor is some kind of call site frequency estimate, which is entirely lacking*. For example, in the case of RegexInterpreter:Go the call sites in the switch arms would likely be assigned fairly low frequencies and so would become much less attractive inline candidates.

;;; Before
       mov      rcx, rsi
       mov      edx, 2
       call     [System.Text.RegularExpressions.RegexInterpreter:Advance(int):this]

;;; After
       add      dword ptr [rsi+124], 3               // rsi == 'this'; need it in rcx below
       mov      gword ptr [rsp+68H], rsi             // questionable spill
       mov      rcx, gword ptr [rsi+104]
       mov      rcx, gword ptr [rcx+8]
       mov      edx, dword ptr [rsi+124]
       cmp      edx, dword ptr [rcx+8]
       jae      G_M25344_IG182
       movsxd   rdx, edx
       mov      ecx, dword ptr [rcx+4*rdx+16]       // would like to see this go into rdx
       mov      dword ptr [rsp+64H], ecx            // to avoid this spill...
       mov      rcx, gword ptr [rsp+68H]            // rsi still has this value!
       mov      edx, dword ptr [rsp+64H]            // ... and reload
       call     [System.Text.RegularExpressions.RegexInterpreter:SetOperator(int):this]

Finally, the profitability model must take into account the capabilities of the rest of the optimization pipeline: what looks like a good inline may in fact produce suboptimal code. The Go example provides some evidence of this, as the inlines of Advance end up being larger than they should be.

At any rate, the behavior seen here is "by design": things are working the way they are supposed to work.

*Such data is available with IBC (profile feedback) but the inliner doesn't handle this very well either.

@AndyAyersMS

@dotnet-bot retest Perf Build and Test

@jorive This leg is now 3h30m or longer. Can we split it into multiple legs (say one for x64 and one for x86) to try and cut down the latency?

@benaadams

For instance RegexInterpreter:Go has a large switch to handle the various regex cases, and most switch arms end up calling Advance, and this now gets inlined

Don't think it does this, and it would likely be a large undertaking if not.

However, if the same method is inlined multiple times and only the parameters are changing, could it be inlined once, with the setup register values changed prior to the jmp?

Downside would be reduced optimization from dead code elimination (as branches may be active on multiple paths, so indeterminate)

@michellemcdaniel

@AndyAyersMS All of the builds/tests run in parallel, as it's a pipeline job. So breaking it into different jobs will not make a difference in terms of time spent. It is running long because we have some serious backlog on the linux jobs, so the machines are slammed. Strangely, it still shows as pending, but the Perf leg has actually completed and failed due to not being able to restore the opt data packages.

@AndyAyersMS

@adiaaida recent history for the leg not looking promising...
[image: recent CI run history for the Perf Build and Test leg]

Thinking I should just ignore this failure and merge. Does that seem reasonable?

@briansull

Could you wait until the build break fix is checked in:
#14888

@AndyAyersMS

@benaadams you mean treat repeated inlines like some kind of local procedure call? That idea has floated about at times but I've never seen it pursued seriously.

Doing this presents modelling challenges: simple flow graph representations now contain many "infeasible" flow paths that can seriously degrade the quality and running time of analyses.

@AndyAyersMS

@briansull sure, I'm in no hurry to merge; I just hate rerunning failing jobs at random thinking that for some reason they might now pass.

@michellemcdaniel

I think you can ignore the failure, but if you're interested in the perf implications of the change, you should rerun.

@briansull

I merged the build break fix, so you are all clear from my point of view

@benaadams


@benaadams you mean treat repeated inlines like some kind of local procedure call?

Yes; using an in-method jmp, returning to a jmp location held in a register.

May allow for deeper call stack inlining, as the repetition will be less (e.g. inlines into the functions being inlined, etc).

Probably would only want to do it for methods above the always-inline size (e.g. wrappers and properties would likely be worse off not being directly inlined).

Doing this presents modelling challenges: simple flow graph representations now contain many "infeasible" flow paths that can seriously degrade the quality and running time of analyses.

For running time: it could be crossgen-only, for example?

However, it may also reduce CQ by keeping multiple paths alive rather than branch-eliminating them (i.e. constants becoming variables), so it may not be worth it. Just musing...

@AndyAyersMS

I do care about perf; in particular this is going to look like it causes TP regressions.

The last 30 or so runs of "perf build and test" have all failed, so it seems unlikely triggering a rerun will do anything useful.

@AndyAyersMS

@benaadams the quality of the solutions can degrade pretty badly if you just wire everything up like it is normal dataflow. To get good results you typically need techniques like those developed for interprocedural dataflow, and that gets tricky with things like SSA.

Large fan-ins (say at the "entry" to the local procedure) are troublemakers since they commonly represent places where facts must merge, and it requires thoughtful engineering to avoid quadratically bad behavior in dataflow models. This fan-in issue also plagues SSA, which generally copes very nicely with large methods. So there are lots of places where things can go off the rails.

A compromise is to "fix" an ABI and decouple some of what is going on in the local procedure from the main method, and give special treatment to things that are live across the boundaries. This is more or less what we do for catches and finallys that are compiled as funclets and (for finallys anyways) invoked as local procedures.

Or you can just let the compiler do its thing and then, very late, try to pattern match and combine the disparate copies back into common code.

However all of these sorts of things are pretty far down there in priority.

In the case of the Regex interpreter it is pretty common that the call to Advance is the last thing done in a switch case (though not all cases call Advance, and some do so only conditionally). Some kind of conditional tail merge at the end of the switch would also be a way of commoning this path. But this is perhaps something better done at the source level.
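
For illustration, the source-level reshaping might look like this (a C++ sketch with invented names, not the actual RegexInterpreter code):

    // Record whether to advance in each arm and make a single call to
    // Advance after the switch, merging what were many separate call sites.
    void HandleNop();
    void HandleBol();
    void HandleStop();
    void Advance();

    void Step(int op)
    {
        bool advance = true;
        switch (op)
        {
            case 0: HandleNop();  break;
            case 1: HandleBol();  break;
            case 2: HandleStop(); advance = false; break; // no advance here
            default:              break;
        }
        if (advance)
        {
            Advance();
        }
    }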

@benaadams

@dotnet-bot test Perf Build and Test

@benaadams

But this is perhaps something better done at the source level.

That is probably true 😁

@benaadams

@AndyAyersMS just had a look at System.Text.RegularExpressions.RegexInterpreter:Go and see what you mean; that's a very large function, and trying to use a funclet would probably be extraordinarily complex.

@benaadams

Added PR for RegexInterpreter:Go dotnet/corefx#25096

@benaadams

It is running long because we have some serious backlog on the linux jobs, so the machines are slammed.

Over 6 hr 40 min for "Perf Build and Test" so far; might need some more machines?

@AndyAyersMS

Perf build and test actually had a few passing runs recently, but some legs are now at 16hrs+ and still running.

Seems like we are having some challenges figuring out adequate provisioning for some of the tests we'd really like to run here. Perf runs are a bit more complex to set up since we run on dedicated HW and not VMs. @jorive suggested to me privately that perhaps we move the Linux perf to a "rolling" basis -- triggered only by commits instead of PRs -- which I think is sensible if we can't handle the per-PR traffic load yet.

@RussKeldorph we should really settle on a policy here, as having long-running CI legs that are likely to fail is not helpful to anyone. I would like to see us be more proactive and nimble with legs that start failing with any frequency.

[image: CI status history showing repeated failures]

@michellemcdaniel

I can definitely move the linux perf out of the innerloop/per PR test loop. That's an easy change.

@AndyAyersMS

Odds of a successful perf run still not good.

I am going to merge and let the rolling CI perf runs pick this up.

@AndyAyersMS AndyAyersMS merged commit 176dfc8 into dotnet:master Nov 8, 2017
@AndyAyersMS AndyAyersMS deleted the FixPrejitInlineClassification branch November 8, 2017 16:48
@benaadams

Perf runs are a bit more complex to set up since we run on dedicated HW and not VMs.

Bit of an aside, but interestingly .NET Core runs happily on a 272-thread Xeon Phi (https://twitter.com/damageboy/status/928208914759462912); it might be an interesting testbed and would highlight contention issues well.
