
Support 32 byte alignment of code on xarch #2249

Merged (4 commits, Feb 15, 2020)

Conversation

@AndyAyersMS

Update the jit and runtime to allow the jit to ask for code to be 32 byte aligned.
Request 32 byte alignment for Tier1 methods on x86/x64.

Add minimal crossgen support; one can imagine requesting or choosing 32 byte
alignment for crossgenned code, but that is left as future work.

This should provide some measure of performance stability, in particular for
microbenchmarks or other code where performance depends crucially on a few
branches.

It may or may not improve performance. If/when there are regressions we can
contemplate updating the jit to add intra-method padding to address alignment
sensitive code layout (e.g. dotnet/coreclr#11607).

This will require a jit GUID update in addition to the changes here.
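
For illustration, here is a minimal sketch of the allocation-side mechanics under the new flag, assuming the host honors it by over-allocating and rounding the code pointer up. The flag name matches the CorJitAllocMemFlag this PR adds, but its numeric value and the helper itself are assumptions, not code from the PR:

```cpp
// Sketch only: how a host might honor a 32 byte code-alignment request.
#include <cstdint>

constexpr uint32_t CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN = 0x00000004; // value illustrative

uint8_t* AlignCodeBlock(uint8_t* block, uint32_t flags)
{
    if (flags & CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN)
    {
        // Round up to the next 32 byte boundary; the allocator must have
        // reserved up to 16 extra bytes (given the existing 16 byte
        // alignment on x64) to leave room for this padding.
        uintptr_t addr = reinterpret_cast<uintptr_t>(block);
        addr = (addr + 31) & ~static_cast<uintptr_t>(31);
        block = reinterpret_cast<uint8_t*>(addr);
    }
    return block;
}
```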

@AndyAyersMS

cc @jkotas @billwert @dotnet/jit-contrib

The amount of extra allocation should be limited, since this applies only to Tier1 methods.

@jkotas commented Jan 27, 2020

Do you have any data that show this actually helps, or is this written as a recommendation in some of the Intel docs?

@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 27, 2020
@AndyAyersMS

I have some locally gathered data and there it helps quite a bit. We have also seen this in various coreclr issues, like dotnet/coreclr#22920.

Intel docs suggest that 64 byte alignment might be preferable. I haven't tried this.

3.4.1.4 Code Alignment
Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks should be laid out contiguously in memory. This may involve removing unlikely code, such as code to handle error conditions, from the sequence. See Section 3.7, “Prefetching,” on optimizing the instruction prefetcher.

Assembly/Compiler Coding Rule 11. (M impact, H generality) When executing code from the
Decoded Icache, direct branches that are mostly taken should have all their instruction bytes in a 64B cache line and nearer the end of that cache line. Their targets should be at or near the beginning of a 64B cache line.

When executing code from the legacy decode pipeline, direct branches that are mostly taken should have all their instruction bytes in a 16B aligned chunk of memory and nearer the end of that 16B aligned chunk. Their targets should be at or near the beginning of a 16B aligned chunk of memory.

My thought was to add support for 32 byte alignment and then do a series of perf lab runs over time to prove or disprove the "improved stability" hypothesis.

```diff
@@ -1104,7 +1108,11 @@ void ZapInfo::allocMem(

     if (roDataSize > 0)
     {
-        if (flag & CORJIT_ALLOCMEM_FLG_RODATA_16BYTE_ALIGN)
+        if (flag & CORJIT_ALLOCMEM_FLG_RODATA_32BYTE_ALIGN)
```
@jkotas (review comment)
I would not bother changing anything for the zapper. We are trying not to churn the old crossgen, and a number of other changes would be needed to make this actually work there (e.g. in NewAlignedBlob).

@jkotas commented Jan 28, 2020

I understand why this helps micro-benchmark stability. I am worried about regressions in real-world workloads. Real-world workloads never fit in the instruction cache and hit instruction cache misses constantly. In the worst case, this makes the code 2x bigger, and thus only half of the code will fit into the caches. What is the worst case regression we can expect from this?

@jkotas commented Jan 28, 2020

I would be fine with having this under a switch that is turned on for micro-benchmark runs or investigations, similar to the existing JitAlignLoops.
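
A sketch of what such an opt-in knob could look like, patterned after how existing switches like JitAlignLoops are consulted. The environment-variable name and this helper are hypothetical, not the PR's actual wiring:

```cpp
// Hypothetical opt-in switch, modeled on the JitAlignLoops pattern:
// request 32 byte alignment for Tier1 code only when the knob is nonzero.
#include <cstdint>
#include <cstdlib>

constexpr uint32_t CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN = 0x00000004; // value illustrative

uint32_t ComputeAllocMemFlags(bool isTier1)
{
    uint32_t flags = 0;
    const char* knob = std::getenv("COMPlus_JitAlign32ByteMethods"); // hypothetical switch
    if (isTier1 && knob != nullptr && std::atoi(knob) != 0)
    {
        flags |= CORJIT_ALLOCMEM_FLG_32BYTE_ALIGN;
    }
    return flags;
}
```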

@AndyAyersMS commented Jan 28, 2020

Off by default seems sensible; we can give BDN the ability to turn this on and see if it fixes stability issues.

But, for the sake of argument, if we wanted this on by default:

  • We already 16 byte align all jitted methods on x64. So the absolute worst case would be lots of small (sub 16 byte) methods that all get rejitted at Tier1. Roughly half of these would already be 32 byte aligned, so say 50% size overhead for those methods. But it seems unlikely that many apps look like this.
  • We could restrict 32 byte alignment to methods with loops, since the main microarchitectural actors seem to be the various chip front-end caches (the LSD, the µop cache, etc.). Statically, methods with loops are about 5% of all methods; I'm not sure if that changes much for methods that make it to Tier1. We could also only 32 byte align methods that are larger than 16 bytes. The combination would probably cut the worst case size overhead down in practice to the low single digits (a sketch of this gating follows the list).
  • If an app is I-cache bound, tiering may already be causing cache/page density issues -- we potentially generate a lot more code. Do we keep Tier1 code in its own region or let it mix with Tier0 code? If we let it mix, we should think about splitting it off, since we'd expect most Tier0 code to eventually go cold.
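
Putting the second bullet into code, a rough sketch of the gating heuristic (every name here is hypothetical):

```cpp
// Rough sketch: 32 byte align only Tier1 methods that contain loops
// and whose code is larger than 16 bytes.
#include <cstdint>

bool ShouldAlign32(bool isTier1, bool hasLoops, uint32_t codeSize)
{
    // A method of at most 16 bytes that is already 16 byte aligned can
    // never cross a 32 byte boundary, so it gains nothing from the
    // stronger alignment; loop-free methods are unlikely to be sensitive
    // to the front-end caches discussed above.
    return isTier1 && hasLoops && (codeSize > 16);
}
```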

@jkotas commented Jan 28, 2020

Doing this for methods with loops by default sounds reasonable to me.

@jkotas commented Jan 28, 2020

> We could also only 32 byte align methods that are larger than 16 bytes.

When we did a similar exercise for x86 years back, we found that it is most profitable to align methods that are smaller than 16 bytes, which is not what one would expect intuitively. You can still find the code that does this in the JIT if you look for CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN.
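
The approximate shape of that old small-method heuristic, for concreteness; the real logic lives behind CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN in the JIT, and the cutoff and helper below are illustrative only:

```cpp
// Approximate sketch of the x86 small-method alignment heuristic.
#include <cstdint>

constexpr uint32_t CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN = 0x00000001; // value illustrative

uint32_t SmallMethodAlignFlags(uint32_t codeSize)
{
    // A method of 16 bytes or less, once 16 byte aligned, sits entirely
    // inside a single 16 byte fetch/decode chunk, which is where the
    // win came from.
    return (codeSize <= 16) ? CORJIT_ALLOCMEM_FLG_16BYTE_ALIGN : 0;
}
```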

@tannergooding

For reference, the AMD Family 17h (Zen) microarchitecture optimization manuals (https://developer.amd.com/resources/developer-guides-manuals/) also have notes on 16 vs 32 vs 64 byte code/branch/loop alignment and packing.

@AndyAyersMS commented Jan 28, 2020

Here's some local data, where I modified the jit to insert nop padding just after the prolog in a method with a simple 8 byte loop. Roughly speaking, perf suffers once the loop body crosses a 32 byte boundary.

[chart: benchmark time vs. bytes of nop padding; slowdowns appear at paddings that push the loop body across a 32 byte boundary]

This is from an i7-8650U (Kaby Lake). I see similar results on an i9-9900T (Coffee Lake) and on an i7-6700 (Skylake).

I can't quite explain why there are only 6 bad alignments instead of 7 (it seems like 25 should be slow too), but this gives you an idea of what can happen.

With current master, each method is somewhat randomly either 16 byte or 32 byte aligned. Over time (as other methods' code sizes change, etc.), perf of the loop can fluctuate even though codegen for the method containing the loop is identical. We see this in the perf lab runs.

[chart: perf lab history for the benchmark, fluctuating across runs despite unchanged codegen]

And even within a single run, perf of two otherwise identical methods can differ.

@erozenfeld

I saw exactly the same effects when analyzing Windows and Linux code for the same benchmarks. If the body of a tight loop crosses a 32-byte boundary, there is a significant penalty. This is an effect of the µop cache. Chapter 9.3 here has a good description of what's happening: https://www.agner.org/optimize/microarchitecture.pdf
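
For concreteness, the condition behind the penalty is simply whether the loop body spans more than one 32 byte window (and therefore more than one µop-cache line's worth of fetch). A purely illustrative check:

```cpp
// Does a loop body starting at 'startOffset' with 'sizeInBytes' bytes
// span more than one 32 byte window?
#include <cstdint>

bool CrossesThirtyTwoByteWindow(uint32_t startOffset, uint32_t sizeInBytes)
{
    uint32_t lastByte = startOffset + sizeInBytes - 1;
    return (startOffset / 32) != (lastByte / 32);
}
```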

@AndyAyersMS

Note SPMI currently ignores the alignment flag passed to allocMem -- which is benign, since jit behavior currently does not appear to depend on the alignment of the code buffer.

Failures look unrelated.

I'm going to hold off on a merge for a few days since the GUID bump will invalidate SPMI collections and we're just getting back to leveraging those.

@BruceForstall

We're ready to take JIT-EE interface changes now (some internal processes dependent on the JIT-EE interface GUID are now stable).

@AndyAyersMS Can you fix the conflicts and then merge this?

@AndyAyersMS

Rebased to get around merge conflict on jit GUID.

@BruceForstall

Failures are infra. Merging.

kunalspathak added a commit that referenced this pull request Oct 7, 2020
In #2249, we started aligning methods to a 32-byte boundary for Tier1. However, a method with loops bypasses tiering, so this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
kunalspathak added a commit to kunalspathak/runtime that referenced this pull request Oct 23, 2020
In dotnet#2249, we started aligning methods to a 32-byte boundary for Tier1. However, a method with loops bypasses tiering, so this condition was never executed. Fixed it to make sure we do the alignment for optimized methods only and don't do it for prejitted methods.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020