
Support 32 byte alignment of code on xarch #2249

Merged (4 commits, Feb 15, 2020)

Conversation

@AndyAyersMS

Update the jit and runtime to allow the jit to ask for code to be 32 byte aligned.
Request 32 byte alignment for Tier1 methods on x86/x64.

Add minimal crossgen support; one can imagine requesting or choosing 32 byte
alignment for crossgenned code, but that is left as future work.

This should provide some measure of performance stability, in particular for
microbenchmarks or other code where performance depends crucially on a few
branches.

It may or may not improve performance. If/when there are regressions we can
contemplate updating the jit to add intra-method padding to address alignment
sensitive code layout (e.g. dotnet/coreclr#11607).

This will require a jit GUID update in addition to the changes here.
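
For illustration, here is a minimal sketch of the allocation-side mechanics under the new flag, assuming the host honors it by over-allocating and rounding the code pointer up. The flag name matches the CorJitAllocMemFlag this PR adds, but its numeric value and the helper itself are assumptions, not code from the PR:

```cpp
// Sketch only: how a host might honor a 32 byte code-alignment request.
#include <cstdint>

constexpr uint32_t CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN = 0x00000004; // value illustrative

uint8_t* AlignCodeBlock(uint8_t* block, uint32_t flags)
{
    if (flags & CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN)
    {
        // Round up to the next 32 byte boundary; the allocator must have
        // reserved up to 16 extra bytes (given the existing 16 byte
        // alignment on x64) to leave room for this padding.
        uintptr_t addr = reinterpret_cast<uintptr_t>(block);
        addr = (addr + 31) & ~static_cast<uintptr_t>(31);
        block = reinterpret_cast<uint8_t*>(addr);
    }
    return block;
}
```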

@AndyAyersMS

cc @jkotas @billwert @dotnet/jit-contrib

The amount of extra allocation should be limited, since this applies only to Tier1 methods.

@jkotas commented Jan 27, 2020

Do you have any data that show this actually helps, or is this written as a recommendation in some of the Intel docs?

@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 27, 2020
@AndyAyersMS

I have some locally gathered data and there it helps quite a bit. We have also seen this in various coreclr issues, like dotnet/coreclr#22920.

Intel docs suggest that 64 byte alignment might be preferable. I haven't tried this.

3.4.1.4 Code Alignment
Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks should be laid out contiguously in memory. This may involve removing unlikely code, such as code to handle error conditions, from the sequence. See Section 3.7, “Prefetching,” on optimizing the instruction prefetcher.

Assembly/Compiler Coding Rule 11. (M impact, H generality) When executing code from the
Decoded Icache, direct branches that are mostly taken should have all their instruction bytes in a 64B cache line and nearer the end of that cache line. Their targets should be at or near the beginning of a 64B cache line.

When executing code from the legacy decode pipeline, direct branches that are mostly taken should have all their instruction bytes in a 16B aligned chunk of memory and nearer the end of that 16B aligned chunk. Their targets should be at or near the beginning of a 16B aligned chunk of memory.

My thought was to add support for 32 byte alignment and then do a series of perf lab runs over time to prove or disprove the "improved stability" hypothesis.

```diff
@@ -1104,7 +1108,11 @@ void ZapInfo::allocMem(

     if (roDataSize > 0)
     {
-        if (flag & CORJIT_ALLOCMEM_FLG_RODATA_16BYTE_ALIGN)
+        if (flag & CORJIT_ALLOCMEM_FLG_RODATA_32BYTE_ALIGN)
```
@jkotas (review comment)
I would not bother changing anything for the zapper. We are trying not to churn the old crossgen, and a number of other changes would be needed to make this actually work there (e.g. in NewAlignedBlob).

@jkotas commented Jan 28, 2020

I understand why this helps micro-benchmark stability. I am worried about regressions in real-world workloads. Real-world workloads never fit in the instruction cache and hit instruction cache misses constantly. In the worst case, this makes the code 2x bigger, and thus only half of the code will fit into the caches. What is the worst case regression we can expect from this?

@jkotas commented Jan 28, 2020

I would be fine with having this under a switch that is turned on for micro-benchmark runs or investigations, similar to the existing JitAlignLoops.
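
A sketch of what such an opt-in knob could look like, patterned after how existing switches like JitAlignLoops are consulted. The environment-variable name and this helper are hypothetical, not the PR's actual wiring:

```cpp
// Hypothetical opt-in switch, modeled on the JitAlignLoops pattern:
// request 32 byte alignment for Tier1 code only when the knob is nonzero.
#include <cstdint>
#include <cstdlib>

constexpr uint32_t CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN = 0x00000004; // value illustrative

uint32_t ComputeAllocMemFlags(bool isTier1)
{
    uint32_t flags = 0;
    const char* knob = std::getenv("COMPlus_JitAlign32ByteMethods"); // hypothetical switch
    if (isTier1 && knob != nullptr && std::atoi(knob) != 0)
    {
        flags |= CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN;
    }
    return flags;
}
```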

@AndyAyersMS commented Jan 28, 2020

Off by default seems sensible; we can give BDN the ability to turn this on and see if it fixes stability issues.

But, for the sake of argument, if we wanted this on by default:

  • We already 16 byte align all jitted methods on x64. So the absolute worst case would be lots of small (sub 16 byte) methods that all get rejitted at Tier1. Roughly half of these would already be 32 byte aligned, so say 50% size overhead for those methods. But it seems unlikely that many apps look like this.
  • We could restrict 32 byte alignment to methods with loops, since the main microarchitectural actors seem to be the various chip front-end caches (the LSD, the µop cache, etc.). Statically, methods with loops are about 5% of all methods; I'm not sure if that changes much for methods that make it to Tier1. We could also only 32 byte align methods that are larger than 16 bytes. The combination would probably cut the worst case size overhead down in practice to the low single digits (a sketch of this gating follows the list).
  • If an app is I-cache bound, tiering may already be causing cache/page density issues -- we potentially generate a lot more code. Do we keep Tier1 code in its own region or let it mix with Tier0 code? If we let it mix, we should think about splitting it off, since we'd expect most Tier0 code to eventually go cold.
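
Putting the second bullet into code, a rough sketch of the gating heuristic (every name here is hypothetical):

```cpp
// Rough sketch: 32 byte align only Tier1 methods that contain loops
// and whose code is larger than 16 bytes.
#include <cstdint>

bool ShouldAlign32(bool isTier1, bool hasLoops, uint32_t codeSize)
{
    // A method of at most 16 bytes that is already 16 byte aligned can
    // never cross a 32 byte boundary, so it gains nothing from the
    // stronger alignment; loop-free methods are unlikely to be sensitive
    // to the front-end caches discussed above.
    return isTier1 && hasLoops && (codeSize > 16);
}
```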

@jkotas commented Jan 28, 2020

Doing this for methods with loops by default sounds reasonable to me.

@jkotas commented Jan 28, 2020

> We could also only 32 byte align methods that are larger than 16 bytes.

When we did a similar exercise for x86 years back, we found that it is most profitable to align methods that are smaller than 16 bytes, which is not what one would expect intuitively. You can still find the code that does this in the JIT if you look for CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN.
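
The approximate shape of that old small-method heuristic, for concreteness; the real logic lives behind CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN in the JIT, and the cutoff and helper below are illustrative only:

```cpp
// Approximate sketch of the x86 small-method alignment heuristic.
#include <cstdint>

constexpr uint32_t CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN = 0x00000001; // value illustrative

uint32_t SmallMethodAlignFlags(uint32_t codeSize)
{
    // A method of 16 bytes or less, once 16 byte aligned, sits entirely
    // inside a single 16 byte fetch/decode chunk, which is where the
    // win came from.
    return (codeSize <= 16) ? CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN : 0;
}
```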

@tannergooding

For reference, the AMD Family 17h (Zen) microarchitecture optimization manuals (https://developer.amd.com/resources/developer-guides-manuals/) also have notes on 16 vs 32 vs 64 byte code/branch/loop alignment and packing.

@AndyAyersMS commented Jan 28, 2020

Here's some local data, where I modified the jit to insert nop padding just after the prolog in a method with a simple 8 byte loop. Roughly speaking, perf suffers once the loop body crosses a 32 byte boundary.

[chart: benchmark time vs. bytes of nop padding; slowdowns appear at paddings that push the loop body across a 32 byte boundary]

This is from an i7-8650U (Kaby Lake). I see similar results on an i9-9900T (Coffee Lake) and on an i7-6700 (Skylake).

I can't quite explain why there are only 6 bad alignments instead of 7 (it seems like 25 should be slow too), but this gives you an idea of what can happen.

With current master, each method is somewhat randomly either 16 byte or 32 byte aligned. Over time (as other methods' code sizes change, etc.), perf of the loop can fluctuate even though codegen for the method containing the loop is identical. We see this in the perf lab runs.

[chart: perf lab history for the benchmark, fluctuating across runs despite unchanged codegen]

And even within a single run, perf of two otherwise identical methods can differ.

@erozenfeld

I saw exactly the same effects when analyzing Windows and Linux code for the same benchmarks. If the body of a tight loop crosses a 32-byte boundary, there is a significant penalty. This is an effect of the µop cache. Chapter 9.3 here has a good description of what's happening: https://www.agner.org/optimize/microarchitecture.pdf
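
For concreteness, the condition behind the penalty is simply whether the loop body spans more than one 32 byte window (and therefore more than one µop-cache line's worth of fetch). A purely illustrative check:

```cpp
// Does a loop body starting at 'startOffset' with 'sizeInBytes' bytes
// span more than one 32 byte window?
#include <cstdint>

bool CrossesThirtyTwoByteWindow(uint32_t startOffset, uint32_t sizeInBytes)
{
    uint32_t lastByte = startOffset + sizeInBytes - 1;
    return (startOffset / 32) != (lastByte / 32);
}
```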

@AndyAyersMS

Note SPMI currently ignores the alignment flag passed to allocMem -- which is benign, since jit behavior currently does not appear to depend on the alignment of the code buffer.

Failures look unrelated.

I'm going to hold off on a merge for a few days since the GUID bump will invalidate SPMI collections and we're just getting back to leveraging those.

@BruceForstall

We're ready to take JIT-EE interface changes now (some internal processes dependent on the JIT-EE interface GUID are now stable).

@AndyAyersMS Can you fix the conflicts and then merge this?

@AndyAyersMS

Rebased to get around merge conflict on jit GUID.

@BruceForstall

Failures are infra. Merging.

kunalspathak added a commit that referenced this pull request Oct 7, 2020
In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, a method with loops bypasses tiering, so this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
kunalspathak added a commit to kunalspathak/runtime that referenced this pull request Oct 23, 2020
In dotnet#2249, we started aligning methods to a 32-byte boundary for Tier1. However, a method with loops bypasses tiering, so this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020