Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PGO work planned for .NET 8 #74873

Closed
28 of 44 tasks
jakobbotsch opened this issue Aug 31, 2022 · 16 comments
Closed
28 of 44 tasks

PGO work planned for .NET 8 #74873

jakobbotsch opened this issue Aug 31, 2022 · 16 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:3 Work that is nice to have User Story A single user-facing feature. Can be grouped under an epic.
Milestone

Comments

@jakobbotsch
Copy link
Member

jakobbotsch commented Aug 31, 2022

This issue captures the planned work for .NET 8. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope. Note that we have not finished .NET 8 planning so most items are only under consideration.

Completed in .NET 8

Opportunistic for .NET 8

Deferred

category:planning
theme:profile-feedback
skill-level:expert
cost:large
impact:large

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Aug 31, 2022
@jakobbotsch jakobbotsch added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Aug 31, 2022
@jakobbotsch jakobbotsch added this to the 8.0.0 milestone Aug 31, 2022
@ghost
Copy link

ghost commented Aug 31, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This issue captures the planned work for .NET 8. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope. Note that we have not finished .NET 8 planning so most items are only under consideration.

Planned for .NET 8

Under Consideration

Author: jakobbotsch
Assignees: -
Labels:

area-CodeGen-coreclr, untriaged

Milestone: -

@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib

@JulieLeeMSFT JulieLeeMSFT added the User Story A single user-facing feature. Can be grouped under an epic. label Aug 31, 2022
@EgorBo
Copy link
Member

EgorBo commented Sep 1, 2022

I'd add context-sensitive instrumentation, I do think it should be beneficial

@jakobbotsch
Copy link
Member Author

@EgorBo feel free to edit the issue with anything you can think of 🙂

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jan 11, 2023
Enable edge based profiles for OSR, partial compilation, and optimized plus
instrumented cases.

For OSR this requires deferring flow graph modifications until after we have
built the initial probe list, so that the initial list reflects the entirety
of the method. This set of candidate edge probes is thus the same no matter how
the method is compiled. A given compile may schematize a subset of these probes
and materialize a subset of what gets schematized; this is tolerated by the PGO
mechanism provided that the initial instrumented jitting produces a schema which
is a superset of the schema produced by any subsequent instrumented rejitting.
This is normally the case.

Partial compilation may still need some work to ensure full schematization but
it is currently off by default. Will address this subsequently.

For optimized compiles we give the EfficientEdgeCountInstrumentor the same kind
of probe relocation abilities that we have in the BlockCountInstrumentor. In
particular we need to move probes that might appear in return blocks that follow
implicit tail call blocks, since those return blocks must remain empty.

The details on how we do this are a bit different but the idea is the same: we
create duplicate copies of any probe that was going to appear in the return block
and instead instrument each pred. If the pred reached the return via a critical
edge, we split the edge and put the probe there. This analysis relies on cheap
preds, so to ensure we can use them we move all the critial edge splitting so it
happens before we need the cheap pred lists.

The ability to do block profiling is retained but will no longer be used without
special config settings.

There were also a few bug fixes in the spanning tree visitor. It must visit a
superset of the blocks we end up importing and was missing visits in some cases.

This should improve jit time and code quality for instrumented code.

Fixes dotnet#47942.
Fixes dotnet#66101.
Contributes to dotnet#74873.
AndyAyersMS added a commit that referenced this issue Jan 13, 2023
Enable edge based profiles for OSR, partial compilation, and optimized plus
instrumented cases.

For OSR this requires deferring flow graph modifications until after we have
built the initial probe list, so that the initial list reflects the entirety
of the method. This set of candidate edge probes is thus the same no matter how
the method is compiled. A given compile may schematize a subset of these probes
and materialize a subset of what gets schematized; this is tolerated by the PGO
mechanism provided that the initial instrumented jitting produces a schema which
is a superset of the schema produced by any subsequent instrumented rejitting.
This is normally the case.

Partial compilation may still need some work to ensure full schematization but
it is currently off by default. Will address this subsequently.

For optimized compiles we give the EfficientEdgeCountInstrumentor the same kind
of probe relocation abilities that we have in the BlockCountInstrumentor. In
particular we need to move probes that might appear in return blocks that follow
implicit tail call blocks, since those return blocks must remain empty.

The details on how we do this are a bit different but the idea is the same: we
create duplicate copies of any probe that was going to appear in the return block
and instead instrument each pred. If the pred reached the return via a critical
edge, we split the edge and put the probe there. This analysis relies on cheap
preds, so to ensure we can use them we move all the critial edge splitting so it
happens before we need the cheap pred lists.

The ability to do block profiling is retained but will no longer be used without
special config settings.

There were also a few bug fixes in the spanning tree visitor. It must visit a
superset of the blocks we end up importing and was missing visits in some cases.

This should improve jit time and code quality for instrumented code.

Fixes #47942.
Fixes #66101.
Contributes to #74873.
@AndyAyersMS
Copy link
Member

Rough plan of attack for profile maintenance: enable profile post-phase checks for likelihoods (see #81738) by default and fix issues that arise. To approach this incrementally, I will start by enabling the checks for just the first phase, then fix all those issues, then move the disabling point back one more phase, etc.

Depending on the amount of churn this causes I may commit this work via a series of interim PRs where the checks are partially enabled. Right now I am still working on getting things clean after the profile incorporation phase, so can't say now many PRs will make sense.

OSR + PGO is continuing to cause problems. We expect the OSR method to see an inconsistent profile, because the profile data (assuming it comes from Tier0) will quite likely represent partial method executions. I have been trying to fix these since the SPMI collection I'm using to drive the profile work includes lots of OSR instances, but I may decide to defer checking for these methods initially as well.

Inconsistencies can arise in all methods, not just OSR methods, either because of exceptions or because the Tier1 rejitting happens while the Tier0 method instances may still be live and executing. So we need to tolerate inconsistencies and handle them, while at the same time also spotting flaws and incorrect behavior from profile reconstruction and maintenance in cases that are (or should be) fully consistent.

The profile checks will mainly focus on fixing issues with our new edge likelihood representation; the idea is to get this checked and plausibly clean through the important phases and then remove the current manipulations of block and edge weights and have everything rely on the new weights. This change will be quite disruptive.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Feb 9, 2023

Looking at current ASP.NET SPMI collection, we reconstruct edge profiles for 31161 root methods, and have inconsistencies right after reconstruction in 2297 of them, around 7.3%. The OSR subset of this has slightly higher rate of issues (131 out of 1354, 10.2%). Both these are higher than I'd expect. Possible explanations:

  • loss of counter updates because we're not doing interlocked adds
  • profile data set tearing when tier1 rejit accesses live tier0 counter data
  • bugs in instrumentation or reconstruction
  • un-modelled control flow (OSR transfers, exceptions)
  • version skew (older static profile read into method with newer IL stream)

Example problem from very simple method. Here there are two counters, one for the 3->4 edge and one for the pseudo 4->1 edge (which would end up in block 4).

image - 2023-02-08T162535 073

Note because of placement of the sparse counters all we really know is that the internal counter was hit about more often than the method return counter. We deduce BB01's weight from BB04.

This example is reading dynamic PGO data so version skew should not be possible. Since there are just two counters, it is hard to imagine how Tier1 rejit copying them both asynchronously could lead to such a large count discrepancy. Doesn't seem too likely that 1% of all calls to this method would throw exceptions and so not return. Reconstruction is simple and not buggy. This is not an OSR method.

So the most plausible explanation from the list above is that we are seeing count update losses because the instrumentation probes are not doing interlocked updates (lock inc say). Might be worth creating a mode that does these more costly updates just to see if it really does explain the high numbers of inconsistent profiles.

Also worth noting that (in general) we are using a very tight acceptance criteria here and even in a case like this, counts are in relatively good shape (just off by a bit more than 1%, so fgProfileWeightsConsistent is flagging them). So it would also be interesting to see how the numbers of failures change as we adjust the leniency factor there up and down.

@MichalPetryka
Copy link
Contributor

Are there any plans to default Tiered PGO to being enabled in .NET 8 or 9?

@AndyAyersMS
Copy link
Member

Here is a more detailed breakdown of consistency issues seen in the current aspnet.run SPMI collection.

PGO Mode Methods Inconsistent Consistent %Inconsistent
Dynamic PGO Tier0-MinOpts 9 9 0%
  Tier1 25737 985 24752 4%
  Tier1-OSR 1396 142 1254 10%
Dynamic PGO Total   27142 1127 26015 4%
Static PGO FullOpts 28 12 16 43%
  Instrumented Tier1 2000 487 1513 24%
  Tier0-MinOpts 2 0   2 0%
  Tier1 2494 685 1809 27%
Static PGO Total   4524 1184 3340 26%
Grand Total   31666 2311 29355 7%

Looks like consistency of static PGO is considerably worse than dynamic PGO.

@AndyAyersMS
Copy link
Member

Are there any plans to default Tiered PGO to being enabled in .NET 8 or 9?

I would not say "plan", but it's certainly our ambition. Currently there are some rough edges still needing attention. If and when we get those addressed, we will make a case for it to be on by default.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Feb 10, 2023

With lock add (see #81934) I get no consistency issues from a bespoke ASP.NET SPMI collection for any of the dynamic PGO methods. So this suggests loss of counter updates because of update races is the main culprit there (and possibly for static PGO too, though that is harder to test).

@AndyAyersMS
Copy link
Member

Given the larger than expected consistency issues I have shifted my plan of attack and will focus first on profile synthesis and repair. When an initial PGO profile is read, if it is inconsistent, we'll use the repair code to patch it up.

First bit of this work appears in #82926.

@AndyAyersMS
Copy link
Member

We collect microbenchmark data in the lab for PGO for windows x64. Here are the 50 worst performing benchmarks (all regressions) with PGO enabled (data from last 7 days, ratio of median result with pgo to median without, for benchmarks that run for at least 2ns)

Benchmark Pgo/noPgo Pgo noPgo
PerfLabTests.CastingPerf.CheckArrayIsInterfaceNo 0.08 398849.04 31215.90
PerfLabTests.CastingPerf.CheckArrayIsInterfaceYes 0.08 374212.91 31223.53
PerfLabTests.LowLevelPerf.TypeReflectionObjectGetType 0.13 234851.64 31199.62
PerfLabTests.LowLevelPerf.TypeReflectionArrayGetType 0.13 233117.95 31200.07
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo 0.23 272851.55 62375.07
PerfLabTests.CastingPerf.CheckObjIsInterfaceNo 0.23 272819.54 62375.07
PerfLabTests.CastingPerf.CheckObjIsInterfaceYes 0.29 218302.32 62375.47
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceYes 0.29 218295.23 62375.70
PerfLabTests.LowLevelPerf.ForeachOverList100Elements 0.32 22945136.54 7379551.48
SIMD.ConsoleMandel.ScalarDoubleSinglethreadADT 0.35 3020632385.00 1061635800.00
System.Text.Json.Tests.Perf_Get.GetByte 0.38 1534.81 580.82
System.Text.Json.Tests.Perf_Get.GetUInt16 0.40 1535.19 609.48
System.Text.Json.Tests.Perf_Get.GetInt16 0.41 1566.20 649.24
System.Text.Json.Tests.Perf_Get.GetSByte 0.43 1630.13 705.13
System.Text.Json.Tests.Perf_Get.GetUInt32 0.49 1194.25 580.18
System.Text.Json.Tests.Perf_Get.GetInt32 0.50 1232.12 613.24
PerfLabTests.LowLevelPerf.GenericClassWithIntGenericInstanceField 0.50 62373.91 31194.27
System.Collections.IterateFor.ImmutableArray(Size: 512) 0.52 324.57 169.43
PerfLabTests.CastingPerf2.CastingPerf.ScalarValueTypeObj 0.63 444482.56 281009.89
PerfLabTests.CastingPerf.FooObjCastIfIsa 0.69 528580.17 363016.07
PerfLabTests.CastingPerf2.CastingPerf.FooObjCastIfIsa 0.69 532391.38 367533.28
System.Collections.Sort.Array(Size: 512) 0.69 6653.33 4606.52
System.IO.Tests.Perf_Directory.Exists 0.71 15955.69 11325.38
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToWriter(Mode: SourceGen) 0.71 16719.32 11946.75
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToStream(Mode: SourceGen) 0.72 17271.54 12376.91
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToUtf8Bytes(Mode: SourceGen) 0.72 17532.93 12566.95
System.Collections.ContainsKeyTrue<Int32, Int32>.ConcurrentDictionary(Size: 512) 0.72 3844.47 2761.13
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeToString(Mode: SourceGen) 0.72 18159.52 13103.63
System.Formats.Tar.Tests.Perf_TarFile.CreateFromDirectory_Stream_Async 0.72 269463.63 195243.43
System.Text.Json.Serialization.Tests.WriteJson<ImmutableDictionary<String, String>>.SerializeObjectProperty(Mode: SourceGen) 0.73 18248.82 13402.00
PerfLabTests.LowLevelPerf.ObjectStringIsString 0.75 125009.63 93555.58
LinqBenchmarks.Count00ForX 0.75 163504176.92 123252745.00
System.IO.Tests.Perf_Path.GetDirectoryName 0.76 72.90 55.55
System.Memory.Span.Clear(Size: 4) 0.76 3.36 2.57
System.Text.Json.Tests.Perf_Segment.ReadMultiSegmentSequence(segmentSize: 4096, TestCase: Json4KB) 0.78 9855.06 7690.25
MicroBenchmarks.Serializers.Xml_ToStream.XmlSerializer_ 0.78 546014.06 427784.01
System.Formats.Tar.Tests.Perf_TarFile.CreateFromDirectory_Stream 0.79 181593.06 143366.95
System.Text.Json.Tests.Perf_Get.GetUInt64 0.79 771.31 609.72
Benchstone.BenchI.MulMatrix.Test 0.80 495584423.08 395630186.67
System.Collections.Sort.List(Size: 512) 0.83 6379.94 5270.91
Benchstone.BenchF.NewtR.Test 0.83 99796125.00 82667905.00
ByteMark.BenchLUDecomp 0.83 1522751180.00 1261430178.57
System.Text.Json.Document.Tests.Perf_DocumentParse.Parse(IsDataIndented: False, TestRandomAccess: True, TestCase: Json400KB) 0.83 7413636.67 6143975.93
System.Buffers.Tests.RentReturnArrayPoolTests.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) 0.83 512.94 426.19
PerfLabTests.LowLevelPerf.GenericGenericMethod 0.83 187104.80 156016.86
System.Memory.Span.SequenceCompareToDifferent(Size: 4) 0.83 5.91 4.93
System.Perf_Convert.FromBase64String 0.84 90.49 75.64
System.Text.Tests.Perf_StringBuilder.Append_Char(length: 100000) 0.84 199735.26 167488.12
System.Collections.IndexerSet.Dictionary(Size: 512) 0.84 4257.67 3577.62
System.Tests.Perf_Char.Char_ToLowerInvariant(input: "Hello World!") 0.85 15.15 12.81

@EgorBo
Copy link
Member

EgorBo commented Apr 3, 2023

@AndyAyersMS let me know if you want me to look at some of these on your choice

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Apr 4, 2023

@AndyAyersMS let me know if you want me to look at some of these on your choice

If you want to get a head start perhaps look into SIMD.ConsoleMandel.ScalarDoubleSinglethreadADT.

Otherwise let me reassess after I get a fix in for this first issue -- it seems pretty widespread. Still working on it.

@AndyAyersMS AndyAyersMS added the Priority:3 Work that is nice to have label Jun 5, 2023
@AndyAyersMS
Copy link
Member

At this point we've wrapped up all the planned work on Dynamic PGO for .NET 8.

So closing this; we'll revisit some of the open items here in the next release's planning.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:3 Work that is nice to have User Story A single user-facing feature. Can be grouped under an epic.
Projects
Archived in project
Development

No branches or pull requests

5 participants