
JIT: make suitably optimistic prejit inline assessments #14850

Merged

Conversation

AndyAyersMS

When prejitting, the jit assesses whether each root method is a potential
inline candidate for any possible caller. Methods deemed un-inlinable in any
caller are marked in the runtime as "noinline" to save the jit some work
later on when it sees calls to these methods.

This assessment was too conservative and led to prejit-ordering dependences
for inlines. It also meant that prejitting was missing some inlines that
would have happened had we not done the prejit root assessment.

This change removes some of the prejit observation blockers. These mostly
will enable more prejit methods to become candidates. We also now track when
a method argument reaches a test.

When assessing profitability for a prejit root, we now assume the best
possible call site.
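
A minimal sketch of the idea (names and boost values here are assumptions for illustration; the real logic lives in the jit's inline policies):

    // Sketch only: for a prejit root there is no concrete call site, so we
    // credit the candidate with the most favorable call-site-dependent
    // boosts any future caller could supply.
    double EstimatePrejitRootMultiplier(double baseMultiplier)
    {
        double multiplier = baseMultiplier;
        multiplier += 1.0; // assume some argument reaches a test
        multiplier += 1.0; // assume a constant argument feeds a test
        return multiplier;
    }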

Also, update the inline XML to capture the prejit assessments.

This increases the number of inlines considered and performed when prejitting
and will also lead to slightly more inlining when jitting. However, we do not
expect a large throughput impact when jitting -- the main impact of this change
is that inlining when prejitting is now just as aggressive as inlining when
jitting, and the decisions no longer depend on the order in which methods are
prejitted.

Closes #14441.
Closes #3482.

@AndyAyersMS

@briansull PTAL
cc @dotnet/jit-contrib

Verified that if I disable the propagation of the noinline bit into the runtime we get the same inlines in corelib as we do when we propagate.

Size impact from jit-diffs. Note the size increase is expected, as this makes the prejit inliner as aggressive as the jit inliner.

I plan to follow-up with some commentary on the largest regressions.

Total bytes of diff: 179408 (0.79 % of base)
    diff is a regression.

Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions by size (bytes):
       36286 : Microsoft.CodeAnalysis.CSharp.dasm (1.67 % of base)
       28841 : System.Private.Xml.dasm (0.99 % of base)
       26980 : System.Private.CoreLib.dasm (0.77 % of base)
       21377 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.92 % of base)
        9905 : System.Data.Common.dasm (0.92 % of base)

Top file improvements by size (bytes):
         -11 : System.Net.WebSockets.dasm (-0.03 % of base)

57 total files with size differences (1 improved, 56 regressed), 73 unchanged.

Top method regressions by size (bytes):
        3847 : System.Text.RegularExpressions.dasm - System.Text.RegularExpressions.RegexInterpreter:Go():this (1 methods)
        2093 : System.Private.DataContractSerialization.dasm - CriticalHelper:WriteMembers(ref,ref,ref):int:this (2 methods)
        1384 : System.Private.Xml.dasm - System.Xml.XmlSqlBinaryReader:RescanNextToken():int:this (1 methods)
        1303 : Microsoft.CodeAnalysis.CSharp.dasm - Microsoft.CodeAnalysis.CSharp.Binder:ReduceFrom(ref,ref,ref):this (1 methods)
        1189 : System.Private.Xml.dasm - <ParseElementMixedContentAsync>d__167:MoveNext():this (1 methods)

Top method improvements by size (bytes):
         -86 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LambdaRewriter:IntroduceFrame(ref,ref,ref,ref):ref:this (1 methods)
         -58 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.SourceNamedTypeSymbol:AddSyntheticMyGroupCollectionProperty(ref,bool,ref,ref,ref,ref,ref,ref,ref):this (1 methods)
         -38 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LocalRewriter:RewriteObjectInitializerExpression(ref,ref,ref):ref:this (1 methods)
         -37 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LocalRewriter:VisitSyncLockStatement(ref):ref:this (1 methods)
         -37 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.LocalRewriter:RewriteWinRtEvent(ref,ref,bool):ref:this (1 methods)

2379 total methods with size differences (86 improved, 2293 regressed), 140651 unchanged.

Some data on the impact to inlining in corelib below. We see more candidates, and so more inline attempts and successes. There are fewer failures from "noinline" and more from "unprofitable". Unprofitable rejects are still somewhat cheap (they require an IL scan but no importation).

About 10% more methods in corelib are now considered inline candidates.

Version   Methods   w/Inlines   Attempts   Successes   Failures   Extras   PrejitCand
Base        27079       13151     103663       46278      57385      3602        15663
Diff        27079       13222     105614       47454      58160      3611        18452
Delta           0          71       1951        1176        775         9         2789

Success Breakdown

Base Count   Base Percent   Reason                        Diff Count   Diff Percent
     31585        68.25 %   below ALWAYS_INLINE size           31937        67.30 %
     12009        25.95 %   profitable inline                  12830        27.04 %
      2684         5.80 %   aggressive inline attribute         2687         5.66 %

Failure Breakdown

Base Count   Base Percent   Reason                           Diff Count   Diff Percent
     25470        44.38 %   noinline per IL/cached result         23262        40.00 %
     10700        18.65 %   unprofitable inline                   13413        23.06 %
      9091        15.84 %   target not direct                      9132        15.70 %
      6668        11.62 %   does not return                        6869        11.81 %
      1541         2.69 %   within catch region                    1541         2.65 %
      1309         2.28 %   too many il bytes                      1308         2.25 %
       977         1.70 %   cannot get method info                  977         1.68 %
       477         0.83 %   has exception handling                  476         0.82 %
       402         0.70 %   ldfld needs helper                      402         0.69 %
       298         0.52 %   runtime dictionary lookup               299         0.51 %
       125         0.22 %   not inline candidate                    131         0.23 %
       105         0.18 %   rarely called, has gc struct            120         0.21 %
       104         0.18 %   too many basic blocks                   105         0.18 %
        31         0.05 %   this pointer argument is null            31         0.05 %
        27         0.05 %   implicit recursive tail call             27         0.05 %
        23         0.04 %   complex handle access                    23         0.04 %
        12         0.02 %   ldsfld of value class                    14         0.02 %
         8         0.01 %   inline exceeds budget                    13         0.02 %
         5         0.01 %   uses stack crawl mark                     5         0.01 %
         4         0.01 %   has switch                               4         0.01 %
         4         0.01 %   unsupported opcode                       3         0.01 %
         2         0.00 %   recursive                                3         0.01 %
         1         0.00 %   too many locals                          1         0.00 %
         1         0.00 %   delegate invoke                          1         0.00 %

Extra Breakdown

Base Count   Base Percent   Reason                        Diff Count   Diff Percent
      2684         4.68 %   aggressive inline attribute         2687         4.62 %
       483         0.84 %   below ALWAYS_INLINE size             485         0.83 %
       435         0.76 %   profitable inline                    439         0.75 %

"Extras" are methods inlined because some inline parent is an aggressive inline, eg we have a call chain A->B->C where B is an aggressive inline. Once the jit inlines B into A it now may be required to do even more inlines. So think of it as the assessment of the knock-on effect from an aggressive non-leaf inline.

@@ -474,6 +474,12 @@ void InlineContext::DumpXml(FILE* file, unsigned indent)
m_Sibling->DumpXml(file, indent);
}

// Optionally suppress failing inline records
if ((JitConfig.JitInlineDumpXml() == 3) && !m_Success)
{

@briansull

Should we add an enum for the three values used by JitInlineDumpXml()?

@AndyAyersMS

There are a bunch of enhancements I have in mind for the inline XML, though as far as I know nobody but me has ever used it. I also have some analysis tooling lying about that I should publish somewhere, maybe over in jitutils.

I should really change this into a flags-style thing; you should be able to choose:

  • get XML for all methods or only those with inlines or inline attempts
  • get XML for all inline attempts or only those with successful inlines
  • get XML with full details or just summary details
  • specify filename to contain the results (currently always goes to stderr)

So maybe this is best done as a follow-up change.
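
For instance, a flags layout along these lines (a sketch with illustrative names, not a committed design):

    // Hypothetical flag values replacing the current 1/2/3 integer settings
    // of JitInlineDumpXml.
    enum InlineXmlFlags
    {
        IXF_ALL_METHODS    = 0x0, // XML for every method
        IXF_INLINES_ONLY   = 0x1, // only methods with inline attempts
        IXF_SUCCESSES_ONLY = 0x2, // only successful inlines
        IXF_SUMMARY_ONLY   = 0x4, // summary rather than full details
        IXF_TO_FILE        = 0x8, // write to a named file instead of stderr
    };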

@briansull left a comment

Looks Good

@AndyAyersMS

Looking at the larger regressions, so far they fall into two related patterns:

  • Top-level method has many call sites to mid-sized methods that now get inlined. For instance, RegexInterpreter:Go has a large switch to handle the various regex cases, most switch arms end up calling Advance, and this now gets inlined.
  • Top-level method has a few call sites, but the call tree at those sites fans out into sets of mid-sized methods, as in CriticalHelper:WriteMembers.

These cases highlight the fact that in the current inline heuristics, the profitability model does not depend on the overall set of inline candidates or on the history of inlines done so far. Each inline is assessed independently of all others. So if a method presents the inliner with a large number of marginal candidates, the inliner can grow code size fairly aggressively, since each candidate passes the bar when considered on its own. There is a limiter that can kick in if things get too far out of hand (the "inline exceeds budget" rejection case), and we do see an uptick in this in the data, but it is mainly there to prevent pathological behavior.
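
To make the limiter concrete, here is a simplified sketch of a budget-style check; the names and the scale factor are assumptions, and the actual accounting in the jit is more involved:

    // Each candidate passes or fails on its own merits, but a global budget
    // proportional to the root method's IL size can veto further inlining
    // once the method has grown too much.
    bool ExceedsInlineBudget(unsigned rootILSize,
                             unsigned inlinedILSoFar,
                             unsigned candidateILSize)
    {
        const unsigned budgetScale = 10; // assumed scale factor
        const unsigned budget      = budgetScale * rootILSize;
        return (inlinedILSoFar + candidateILSize) > budget;
    }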

As I have stated elsewhere, the current profitability heuristics are not well founded, and it is likely they give improper weight to the various facts available to the inliner. The upshot is that many marginal candidates appear more attractive than they ought to be. While some tweaking is possible, the biggest missing factor is some kind of call site frequency estimate, which is entirely lacking*. For example, in the case of RegexInterpreter:Go the call sites in the switch arms would likely be assigned fairly low frequencies and so would become much less attractive inline candidates.

;;; Before
       mov      rcx, rsi
       mov      edx, 2
       call     [System.Text.RegularExpressions.RegexInterpreter:Advance(int):this]

;;; After
       add      dword ptr [rsi+124], 3               // rsi == 'this'; need it in rcx below
       mov      gword ptr [rsp+68H], rsi             // questionable spill
       mov      rcx, gword ptr [rsi+104]
       mov      rcx, gword ptr [rcx+8]
       mov      edx, dword ptr [rsi+124]
       cmp      edx, dword ptr [rcx+8]
       jae      G_M25344_IG182
       movsxd   rdx, edx
       mov      ecx, dword ptr [rcx+4*rdx+16]       // would like to see this go into rdx
       mov      dword ptr [rsp+64H], ecx            // to avoid this spill...
       mov      rcx, gword ptr [rsp+68H]            // rsi still has this value!
       mov      edx, dword ptr [rsp+64H]            // ... and reload
       call     [System.Text.RegularExpressions.RegexInterpreter:SetOperator(int):this]

Finally, the profitability model must take into account the capabilities of the rest of the optimization pipeline: what looks like a good inline may in fact produce suboptimal code. The Go example provides some evidence of this, as the inlines of Advance end up being larger than they should be.

At any rate, the behavior seen here is "by design": things are working the way they are supposed to work.

*Such data is available with IBC (profile feedback) but the inliner doesn't handle this very well either.

@AndyAyersMS

@dotnet-bot retest Perf Build and Test

@jorive This leg is now 3h30m or longer. Can we split it into multiple legs (say one for x64 and one for x86) to try and cut down the latency?

@benaadams

For instance RegexInterpreter:Go has a large switch to handle the various regex cases, and most switch arms end up calling Advance, and this now gets inlined

Don't think it does this, and it would likely be a large undertaking if not.

However, if the same method is inlined multiple times and only the parameters are changing, could it be inlined once, with the setup register values changed prior to the jmp?

Downside would be reduced optimization from dead code elimination (as branches may be active on multiple paths, so indeterminate)

@michellemcdaniel

@AndyAyersMS All of the builds/tests run in parallel, as it's a pipeline job. So breaking it into different jobs will not make a difference in terms of time spent. It is running long because we have some serious backlog on the linux jobs, so the machines are slammed. Strangely, it still shows as pending, but the Perf leg has actually completed and failed due to not being able to restore the opt data packages.

@AndyAyersMS

@adiaaida recent history for the leg not looking promising...
[image: recent CI run history for the Perf Build and Test leg]

Thinking I should just ignore this failure and merge. Does that seem reasonable?

@briansull

Could you wait until the build break fix is checked in:
#14888

@AndyAyersMS

@benaadams you mean treat repeated inlines like some kind of local procedure call? That idea has floated about at times but I've never seen it pursued seriously.

Doing this presents modelling challenges: simple flow graph representations now contain many "infeasible" flow paths that can seriously degrade the quality and running time of analyses.

@AndyAyersMS

@briansull sure, I'm in no hurry to merge; I just hate rerunning failing jobs at random thinking that for some reason they might now pass.

@michellemcdaniel

I think you can ignore the failure, but if you're interested in the perf implications of the change, you should rerun.

@briansull

I merged the build break fix, so you are all clear from my point of view

@benaadams


@benaadams you mean treat repeated inlines like some kind of local procedure call?

Yes; using an in-method jmp, returning to a jmp location held in a register.

May allow for deeper call stack inlining, as the repetition will be less (e.g. inlines into the functions being inlined, etc).

Probably would only want to do it for methods above the always-inline size (e.g. wrappers and properties would likely be worse off not being directly inlined).

Doing this presents modelling challenges: simple flow graph representations now contain many "infeasible" flow paths that can seriously degrade the quality and running time of analyses.

For running time: it could be crossgen-only, for example?

However, it may also reduce CQ by keeping multiple paths alive rather than branch-eliminating them (i.e. constants becoming variables), so it may not be worth it. Just musing...

@AndyAyersMS

I do care about perf; in particular this is going to look like it causes TP regressions.

The last 30 or so runs of "perf build and test" have all failed, so it seems unlikely triggering a rerun will do anything useful.

@AndyAyersMS

@benaadams the quality of the solutions can degrade pretty badly if you just wire everything up like it is normal dataflow. To get good results you typically need techniques like those developed for interprocedural dataflow, and that gets tricky with things like SSA.

Large fan-ins (say at the "entry" to the local procedure) are troublemakers since they commonly represent places where facts must merge, and it requires thoughtful engineering to avoid quadratically bad behavior in dataflow models. This fan-in issue also plagues SSA, which generally copes very nicely with large methods. So there are lots of places where things can go off the rails.

A compromise is to "fix" an ABI and decouple some of what is going on in the local procedure from the main method, and give special treatment to things that are live across the boundaries. This is more or less what we do for catches and finallys that are compiled as funclets and (for finallys anyways) invoked as local procedures.

Or you can just let the compiler do its thing and then, very late, try to pattern match and combine the disparate copies back into common code.

However all of these sorts of things are pretty far down there in priority.

In the case of the Regex interpreter it is pretty common that the call to Advance is the last thing done in a switch case (though not all cases call Advance, and some do so only conditionally). Some kind of conditional tail merge at the end of the switch would also be a way of commoning this path. But this is perhaps something better done at the source level.
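
For illustration, the source-level reshaping might look like this (a C++ sketch with invented names, not the actual RegexInterpreter code):

    // Record whether to advance in each arm and make a single call to
    // Advance after the switch, merging what were many separate call sites.
    void HandleNop();
    void HandleBol();
    void HandleStop();
    void Advance();

    void Step(int op)
    {
        bool advance = true;
        switch (op)
        {
            case 0: HandleNop();  break;
            case 1: HandleBol();  break;
            case 2: HandleStop(); advance = false; break; // no advance here
            default:              break;
        }
        if (advance)
        {
            Advance();
        }
    }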

@benaadams

@dotnet-bot test Perf Build and Test

@benaadams

But this is perhaps something better done at the source level.

That is probably true 😁

@benaadams

@AndyAyersMS just had a look at System.Text.RegularExpressions.RegexInterpreter:Go and see what you mean; that's a very large function, and trying to use a funclet would probably be extraordinarily complex.

@benaadams

Added PR for RegexInterpreter:Go dotnet/corefx#25096

@benaadams

It is running long because we have some serious backlog on the linux jobs, so the machines are slammed.

Over 6 hr 40 min for "Perf Build and Test" so far; might need some more machines?

@AndyAyersMS

Perf build and test actually had a few passing runs recently, but some legs are now at 16hrs+ and still running.

Seems like we are having some challenges figuring out adequate provisioning for some of the tests we'd really like to run here. Perf runs are a bit more complex to set up since we run on dedicated HW and not VMs. @jorive suggested to me privately that perhaps we move the Linux perf to a "rolling" basis -- triggered only by commits instead of PRs -- which I think is sensible if we can't handle the per-PR traffic load yet.

@RussKeldorph we should really settle on a policy here, as having long-running CI legs that are likely to fail is not helpful to anyone. I would like to see us be more proactive and nimble with legs that start failing with any frequency.

[image: CI status history showing repeated failures]

@michellemcdaniel

I can definitely move the linux perf out of the innerloop/per PR test loop. That's an easy change.

@AndyAyersMS

Odds of a successful perf run still not good.

I am going to merge and let the rolling CI perf runs pick this up.

@AndyAyersMS AndyAyersMS merged commit 176dfc8 into dotnet:master Nov 8, 2017
@AndyAyersMS AndyAyersMS deleted the FixPrejitInlineClassification branch November 8, 2017 16:48
@benaadams

Perf runs are a bit more complex to set up since we run on dedicated HW and not VMs.

Bit of an aside, but interestingly .NET Core runs happily on a 272-thread Xeon Phi (https://twitter.com/damageboy/status/928208914759462912); it might be an interesting testbed and would highlight contention issues well.
