Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial version of class profiling for PGO #45133

Merged
merged 10 commits into from
Nov 25, 2020

Conversation

AndyAyersMS
Copy link
Member

Add support to the jit and runtime so that PGO can determine the distribution of
classes at virtual and indirect call sites.

Use this information when jitting to enable guarded devirtualization, if there
is a suitably likely class to guess for.

Enable by setting:

COMPlus_TieredCompilation=1
COMPlus_TieredPGO=1
COMPlus_JitClassProfiling=1
COMPlus_JitEnableGuardedDevirtualization=1

impact can be enhanced by also setting

COMPlus_TC_QuickJitForLoops=1

to allow more methods to pass through Tier0.

Add support to the jit and runtime so that PGO can determine the distribution of
classes at virtual and indirect call sites.

Use this information when jitting to enable guarded devirtualization, if there
is a suitably likely class to guess for.

Enable by setting:
```
COMPlus_TieredCompilation=1
COMPlus_TieredPGO=1
COMPlus_JitClassProfiling=1
COMPlus_JitEnableGuardedDevirtualization=1
```
impact can be enhanced by also setting
```
COMPlus_TC_QuickJitForLoops=1
```
to allow more methods to pass through Tier0.
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 23, 2020
@AndyAyersMS
Copy link
Member Author

cc @dotnet/jit-contrib @davidwrighton

Jit and jit interface changes should be fairly close to their final form. Runtime side will eventually be replaced as we build out more of the PGO story over there. But this is a start.

Still missing: crossgen2 "support", and some way to invalidate class profile entries for unloadable classes.

@AndyAyersMS
Copy link
Member Author

Not surprisingly, crossgen2 is unhappy; I need to stub out getLikelyClass.

VALIDATEOBJECTREF(objRef);

ICorJitInfo::ClassProfile* const classProfile = (ICorJitInfo::ClassProfile*) tableAddress;
const int count = classProfile->Count++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to use volatile and/or interlocked operations?

In !_DEBUG, it is a valid C/C++ optimization to get rid of the count local and re-fetch it below, ie transform the line below to classProfile->ClassTable[classProfile->Count]. I would potentially lead to buffer overruns or skipped slots in ClassProfile

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be volatile so it's just read once. Likewise for s_rng.

It is OK to drop updates because of races, and we're trying to keep overhead to a minimum, so no need to interlock.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments/questions

@@ -623,6 +623,7 @@ enum CorInfoHelpFunc
CORINFO_HELP_STACK_PROBE, // Probes each page of the allocated stack frame

CORINFO_HELP_PATCHPOINT, // Notify runtime that code has reached a patchpoint
CORINFO_HELP_CLASSPROFILE, // Update class profile for a call site
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to change the JIT-EE GUID

@@ -251,6 +251,14 @@ class ICorJitInfo : public ICorDynamicInfo
UINT32 ExecutionCount;
};

struct ClassProfile
{
enum { SIZE = 8, SAMPLE_INTERVAL = 32 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to comment this structure and its fields. E.g., what is SAMPLE_INTERVAL? What is ClassProfile used for?

Is UINT32 big enough for Count (and ExecutionCount above)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some description.

For count overflow I currently don't have great answers; it should be rare (we'd need methods that spent a lot of cycles in Tier0) but obviously it's not impossible. Right now overflow won't cause any crashes but it might lead to nonsensical profile data.

CORINFO_METHOD_HANDLE ftnHnd,
CORINFO_CLASS_HANDLE baseHnd,
UINT32 ilOffset,
UINT32 * pLikelihood, // estimated likelihood of the class (0...100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify this is a decimal percentage? i.e. 100 = 100% certainty.

CORINFO_METHOD_HANDLE ftnHnd,
CORINFO_CLASS_HANDLE baseHnd,
UINT32 ilOffset,
UINT32 * pLikelihood, // estimated likelihood of the class (0...100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify these are OUT params

@@ -993,6 +993,15 @@ HRESULT getMethodBlockCounts(CORINFO_METHOD_HANDLE ftnHnd,
BlockCounts** pBlockCounts,
UINT32 * pNumRuns);

// Get the likely implementing class for a virtual call or interface call made by ftnHnd
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be "almost diffable" with icorjitinfo.h, so add ClassProfile struct as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out, like BlockCounts is ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented out, like BlockCounts is ?

If necessary. I can't remember why BlockCounts is commented out; if it's required or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if #45044 is merged, it will be unnecessary.

src/coreclr/src/jit/block.h Outdated Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
src/coreclr/src/jit/flowgraph.cpp Outdated Show resolved Hide resolved
{
enum {
SIZE = 8,
SAMPLE_INTERVAL = 32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these values need to be powers of two? For example could SIZE be 7 here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They can be any positive value provided N >= S.

const unsigned count = *pCount++;
const unsigned S = ICorJitInfo::ClassProfile::SIZE;
const unsigned N = ICorJitInfo::ClassProfile::SAMPLE_INTERVAL;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert(N >= S); // SAMPLE_INTERVAL must be >= SIZE.

and possibly assert that N is a power of two

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N doesn't need to a power of 2, though it probably will be. in practice.

//
if ((x % N) < S)
{
unsigned i = x % S;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a divide by an unsigned constant the compiler would use a reciprocal multiply. So this could be written using modulus and we wouldn't require S to be a power of two. I think that the replacement of values in the ClassTable[] would also be more random when N is not divisible by S.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already written using a modulus...?

In the most general form we should actually be checking (x % count) < S so the divisor would be variable. I'm experimenting with using a fixed window size here to bias the sample somewhat towards more recent observations.

Using a prime S (say) is an interesting idea, but S needs to be very small, so perhaps 7 or 11 might be worth a try.

@@ -5233,6 +5234,74 @@ void JIT_Patchpoint(int* counter, int ilOffset)

#endif // FEATURE_ON_STACK_REPLACEMENT

HCIMPL2(void, JIT_ClassProfile, Object *obj, void* tableAddress)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an method header comment staying that this is a profiling/instrumentation method and needs to be optimized to minimize the cost of runtime profiling.

We might want to explicitly tell the compiler to optimize this method (in retail builds) as we might not normally gather any C++ PGO data for this case if this feature it isn't enabled by default.

// intentionally simple so we can have multithreaded
// access w/o tearing state.
//
static volatile unsigned s_rng = 100;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If C++ supports thread local statics this would be a good use case.
Alternatively the runtime has support for per thread data, I believe.
Otherwise each thread has to grab the cache line for writing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class profile data is shared so I'm not sure there's much to be gained by having per-thread RNG state.

struct HistogramEntry
{
CORINFO_CLASS_HANDLE m_mt;
unsigned m_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document the meanings of these fields:

m_count is the number of occurrences of m_mt in ClassTable[]

// Yep, found data. See if there is a suitable class profile.
//
// This bit is currently somewhat hacky ... we scan the records, the count records come
// first and are in increasing IL offset order. Class profiles have inverted IL offsets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is an "inverted" IL offset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set some flags bits in the high part of the IL offset so the runtime side can distinguish class profiles from regular block counts.

This is a temporary measure needed until the runtime side evolves to understand there are different kinds of profile data.

// Need to make sure this is even divisor
//
j += sizeof(ICorJitInfo::ClassProfile) / sizeof(ICorJitInfo::BlockCounts);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need to make sure this is an even divisor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the underlying data is allocated in BlockCount increments, we need to make sure ClassProfile is a multiple, so that this update to j gets us to the next ClassProfile record.

@@ -342,6 +479,166 @@ HRESULT PgoManager::getMethodBlockCounts(MethodDesc* pMD, unsigned ilSize, UINT3
return E_NOTIMPL;
}

CORINFO_CLASS_HANDLE PgoManager::getLikelyClass(MethodDesc* pMD, unsigned ilSize, unsigned ilOffset, UINT32* pLikelihood, UINT32* pNumberOfClasses)
{
*pLikelihood = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pLikelyhood couls also be a 32-bit float.
As it currently stands is is an integer percentage, where 100 means 100% and 33 means 33%

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need that sort of precision; our histograms are approximate.

CORINFO_CLASS_HANDLE baseHnd,
UINT32 ilOffset,
UINT32 * pLikelihood,
UINT32 * pNumberOfClasses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pLikelyhood couls also be a 32-bit float.
As it currently stands it is an integer percentage, where 100 means 100% and 33 means 33%

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes here, histogram data is approximate.

What mostly matters is that for monomorphic types we get it right and predict high likelihood. For other cases the payoff is smaller.

… type

to work with shared classes (like __Canon). Also note we can reach this
stage for both virtual and interface calls.
Copy link
Contributor

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good

@AndyAyersMS
Copy link
Member Author

I was looking at #4489 and realized it was a good test case for GDV via TieredPGO. Drilling in, I realized the logic I had for determining when to guess for a class wasn't as general as it could be and was missing out on guessing for calls dispatched through some shared classes (notably, __Canon).

So I've updated that part of impDevirtualizeCall and fixed a couple other small issues.

impDevirtualizeCall: Trying to devirtualize interface call:
    class for 'this' is __Canon (attrib 20020000)
    base method is ICalls::Execute
--- base class is interface
--- no derived method
    Class not final or exact
Consdering guarded devirt...
Likely class for 00007FF8C4640D10 (__Canon) is 00007FF8C48393E8 (ClassCalls) [likelihood:100 classes seen:1]
interface call would invoke method Execute
Marking call [000008] as guarded devirtualization candidate; will guess for class ClassCalls

...

Inlines into 06000003 [via DefaultPolicy] Runtime4489:UseInterfaceCalls():this
  [1 IL=0012 TR=000004 0600000E] [aggressive inline attribute] Executer`1:Execute(__Canon):this
    [2 IL=0008 TR=000024 0600000A] [aggressive inline attribute guarded devirt] ClassCalls:Execute():this
    [0 IL=0008 TR=000008 06000009] [FAILED: not inline candidate guarded] ICalls:Execute():this

resulting assembly code is:

; Assembly listing for method Runtime4489:UseInterfaceCalls():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 15837975
; invoked as altjit
; Final local variable assignments
;
;  V00 this         [V00,T00] (  4,  4   )     ref  ->  rcx         this class-hnd
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T02] (  2,  4   )     ref  ->  rcx         ld-addr-op class-hnd "Inlining Arg"
;  V03 tmp2         [V03,T01] (  3,  4   )     ref  ->  rcx         "guarded devirt this temp"
;* V04 tmp3         [V04    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "guarded devirt this exact temp"
;  V05 tmp4         [V05,T03] (  2,  4   )     ref  ->  r11         class-hnd "Inlining Arg"
;
; Lcl frame size = 40

G_M18455_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M18455_IG02:              ;; offset=0004H
       4C8B5928             mov      r11, gword ptr [rcx+40]
       488B4910             mov      rcx, gword ptr [rcx+16]
       45391B               cmp      dword ptr [r11], r11d
       49BBF0684EC0F87F0000 mov      r11, 0x7FF8C04E68F0
       4C3919               cmp      qword ptr [rcx], r11       // GDV check
       7517                 jne      SHORT G_M18455_IG04
       48B918B54DC0F87F0000 mov      rcx, 0x7FF8C04DB518        // correct guess
       4533DB               xor      r11d, r11d
       448919               mov      dword ptr [rcx], r11d
       FF01                 inc      dword ptr [rcx]
						;; bbWeight=1    PerfScore 15.75
G_M18455_IG03:              ;; offset=0030H
       4883C428             add      rsp, 40
       C3                   ret      
						;; bbWeight=1    PerfScore 1.25
G_M18455_IG04:              ;; offset=0035H
       49BB580521C0F87F0000 mov      r11, 0x7FF8C0210558     // incorrect guess
       48B8580521C0F87F0000 mov      rax, 0x7FF8C0210558
       FF10                 call     qword ptr [rax]ICalls:Execute():this
       EBE3                 jmp      SHORT G_M18455_IG03
						;; bbWeight=0    PerfScore 0.00

I also verified I can drop a checked clrjit.dll next to the release one (calling it say clrjit.chk.dll) and invoke it selectively via the altjit mechanism; very handy for experimenting without having to swap jits around (and, drive all this via BDN).

@kunalspathak
Copy link
Member

I also verified I can drop a checked clrjit.dll next to the release one (calling it say clrjit.chk.dll) and invoke it selectively via the altjit mechanism; very handy for experimenting without having to swap jits around (and, drive all this via BDN).

@DrewScoggins , @adamsitnik - As we discussed this in the past, Andy confirmed that it works, so something we can try to get the disassembly more reliably?

@AndyAyersMS
Copy link
Member Author

Failure is known #45168. Mono build is a timeout (unrelated).

@AndyAyersMS AndyAyersMS merged commit 5a6c21c into dotnet:master Nov 25, 2020
@AndyAyersMS AndyAyersMS mentioned this pull request Nov 25, 2020
54 tasks
@ghost ghost locked as resolved and limited conversation to collaborators Dec 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants