
JIT: initial implementation of profile synthesis #82926

Merged
merged 4 commits into dotnet:main
Mar 6, 2023

Conversation

AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented Mar 3, 2023

Implements a profile synthesis algorithm based on the classic Wu-Larus paper "Static branch frequency and program profile analysis" (MICRO-27, 1994), with a simple set of heuristics.

First step is construction of a depth-first spanning tree (DFST) for the flowgraph, and corresponding reverse postorder (RPO). Together these drive loop recognition; currently we only recognize reducible loops. We use DFST (non-) ancestry as a proxy for (non-) domination: the dominator of a node is required to be a DFST ancestor. So no explicit dominance computation is needed. Irreducible loops are noted but ignored. Loop features like entry, back, and exit edges, body sets, and nesting are computed and saved.

Next step is assignment of edge likelihoods. Here we use some simple local heuristics based on loop structure, returns, and throws. A final heuristic gives slight preference to conditional branches that fall through to the next IL offset.

After that we use loop nesting to compute the "cyclic probability" $cp$ for each loop, working inner to outer in loops and RPO within loops. $cp$ summarizes the effect of flow through the loop and around loop back edges. We cap $cp$ at no more than 1000. When finding $cp$ for outer loops we use $cp$ for inner loops.

Once all $cp$ values are known, we assign "external" input weights to method and EH entry points, and then a final RPO walk computes the expected weight of each block (and, via edge likelihoods, each edge).

We use the existing DFS code to build the DFST and the RPO, augmented by some fixes to ensure all blocks (even ones in isolated cycles) get numbered.

This initial version is intended to establish the right functionality, enable wider correctness testing, and to provide a foundation for refinement of the heuristics. It is not yet as efficient as it could be.

The loop recognition and recording done here overlaps with similar code elsewhere in the JIT. The version here is structural and not sensitive to IL patterns, so is arguably more general and I believe a good deal simpler than the lexically driven recognition we use for the current loop optimizer. I aspire to reconcile this somehow in future work.

All this is disabled by default; a new config option either enables using synthesis to set block weights for all root methods or just those without PGO data.

Synthesis for inlinees is not yet enabled; progress here is blocked by #82755.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 3, 2023
@ghost ghost assigned AndyAyersMS Mar 3, 2023
@ghost

ghost commented Mar 3, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@AndyAyersMS
Member Author

@BruceForstall PTAL
cc @dotnet/jit-contrib

Was about to post a sample result but realized to my dismay the final weight computation is off. So will fix that and then put up some examples.

@AndyAyersMS
Member Author

AndyAyersMS commented Mar 3, 2023

Was about to post a sample result but realized to my dismay the final weight computation is off. So will fix that and then put up some examples.

This looks like a bug in the DFS traversal (which is surprising since we rely on it for many other things). Given the following graph, it produces the DFST shown in blue. Each node is annotated with its preorder and postorder numbers.

[image: flow graph with the DFST shown in blue; each node annotated with preorder/postorder numbers]

The RPO from this is

01 -> BB01[1, 9]
02 -> BB06[2, 8]
03 -> BB07[7, 7]
04 -> BB08[9, 6]
05 -> BB09[8, 5]
06 -> BB02[3, 4]
07 -> BB03[4, 3]
08 -> BB04[5, 2]
09 -> BB05[6, 1]

Note that BB07 appears too early in the RPO: it should appear after both of its preds, BB06 and BB04. So when we do our final RPO walk to set profile counts, we don't include the count from BB04.

The issue seems to be that when we first visit BB06 we push BB07 onto the pending-visit stack and mark it, so when we later reach BB04 we never visit BB07 from there.

@AndyAyersMS
Member Author

This looks like a bug in the DFS traversal (which is surprising since we rely on it for many things)

I have a fix that I will PR separately. Somewhat surprisingly, it produces no diffs (*). What we were doing produced a spanning tree, but not a proper depth-first spanning tree; the existing uses are apparently ok with that.

With things fixed we get

After doing a post order traversal of the BB graph, this is the ordering:
01 -> BB01[1, 9]
02 -> BB06[2, 8]
03 -> BB02[6, 7]
04 -> BB03[7, 6]
05 -> BB04[8, 5]
06 -> BB05[9, 4]
07 -> BB07[3, 3]
08 -> BB08[4, 2]
09 -> BB09[5, 1]

and now we see BB07 properly comes after BB06 and BB04.

Resulting synthesized profile is:
[image: flow graph annotated with the resulting synthesized block weights]

(*) caveat -- the method above fails in SPMI during importation because of missing info, so perhaps there would be diffs on a clean collection?

@@ -593,7 +593,10 @@ CONFIG_INTEGER(JitCrossCheckDevirtualizationAndPGO, W("JitCrossCheckDevirtualiza
CONFIG_INTEGER(JitNoteFailedExactDevirtualization, W("JitNoteFailedExactDevirtualization"), 0)
CONFIG_INTEGER(JitRandomlyCollect64BitCounts, W("JitRandomlyCollect64BitCounts"), 0) // Collect 64-bit counts randomly
// for some methods.
#endif // debug
// 1: profile synthesis for root methods
// 2: profile synthesis for foot methods w/o pgo data
Member

Suggested change
// 2: profile synthesis for foot methods w/o pgo data
// 2: profile synthesis for root methods w/o pgo data

@@ -113,6 +113,7 @@ set( JIT_SOURCES
fginline.cpp
fgopt.cpp
fgprofile.cpp
fgprofilesynthesis.cpp
Member

I think you need to add fgprofilesynthesis.h in the JIT_HEADERS section, below

Member Author

added

{
JITDUMP("Synthesizing profile data\n");
ProfileSynthesis::Run(this, ProfileSynthesisOption::AssignLikelihoods);
fgPgoHaveWeights = false;
Member

Should ProfileSynthesis::Run() set this instead?

Member Author

Yeah, that makes more sense.

Member Author

Over time this flag will either change into something more complex or go away.


//------------------------------------------------------------------------
// AssignLikelihoods: update edge likelihoods and block counts based
// entrely on heuristics.
Member

Suggested change
// entrely on heuristics.
// entirely on heuristics.

// Returns:
// loop headed by block, or nullptr
//
SimpleLoop* ProfileSynthesis::IsLoopHeader(BasicBlock* block)
Member

Is implies a bool return. Maybe GetLoopFromHeaderBlock?

Member Author

Sure

}

//------------------------------------------------------------------------
// AssignLikelihoodCond: update edge likelihood for a block that
Member

Suggested change
// AssignLikelihoodCond: update edge likelihood for a block that
// AssignLikelihoodSwitch: update edge likelihood for a block that

{
// Assume each switch case is equally probable
//
const unsigned n = block->NumSucc();
Member

I presume we have no degenerate (n == 0) cases here?

Member Author

Seems like we don't? Even if a switch is just default it has 1 case. I'll add an assert for now.

{
// Assume each switch case is equally probable
//
const unsigned n = block->NumSucc();
Member

Suggested change
const unsigned n = block->NumSucc();
const unsigned n = block->NumSucc(m_comp);

?

Member Author

This one is intentional -- if there are $n$ switch cases then each case gets $1/n$ of the flow. In the subsequent loop we just look at targets and give each target edge a likelihood of $dupCount/n$.

Member

Ok, makes sense. Might want to add a comment to ensure that is perfectly clear, since we normally pair NumSucc()/GetSucc()/Succs() and NumSucc(Compiler*)/GetSucc(Compiler*,x)/Succs(Compiler*) and seeing them "unbalanced" is weird.

for (BasicBlock* const succ : block->Succs(m_comp))
{
FlowEdge* const edge = m_comp->fgGetPredForBlock(succ, block);
edge->setLikelihood(p * edge->getDupCount());
Member

Wouldn't all the heuristics in AssignLikelihoodCond apply? And need to get blended somehow?

Member Author

Probably something like that. I kept the heuristics fairly simple for now. Plan is to revise them once we start comparing the synthesized results to real profile data.

{
BasicBlock* const loopBlock = m_bbNumToBlockMap[bbNum];

for (BasicBlock* const succBlock : loopBlock->Succs())
Member

Suggested change
for (BasicBlock* const succBlock : loopBlock->Succs())
for (BasicBlock* const succBlock : loopBlock->Succs(m_comp))

?

Member Author

Fixed

@BruceForstall
Member

When you assign likelihoods from the heuristics, you have a fixed ordering of heuristics (and fixed likelihoods), and the first heuristic that matches determines the resulting likelihoods. In Wu/Larus, if I read it correctly, they combine likelihoods from all applicable heuristics. Is that something that should be done?

@AndyAyersMS
Member Author

When you assign likelihoods from the heuristics, you have a fixed ordering of heuristics (and fixed likelihoods), and the first heuristic that matches determines the resulting likelihoods. In Wu/Larus, if I read it correctly, they combine likelihoods from all applicable heuristics. Is that something that should be done?

My main goal here is to get the solver framework in place; to me, that's the interesting part of the change.

The heuristics will evolve, but whether they end up looking like they do now, or more like what's in the paper, or something else remains to be determined. For instance, we're currently running before importation so (for the most part) can't use any heuristic that relies on knowing the computations in each block.

This is somewhat intentional, as the importer is sensitive to profile data, so we need some notion of profile beforehand, but also somewhat unfortunate, because once we start importing we'll likely uncover things that would lead us to make different likelihood projections. I haven't yet pinned down when or how often we might revisit / rerun synthesis.

@AndyAyersMS AndyAyersMS merged commit 73b2a0f into dotnet:main Mar 6, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 6, 2023