Removed unnecessary MemoryMarshal.GetReference #32669

Closed · wants to merge 6 commits

Conversation

@lkts (Author) commented Oct 7, 2018

Fixes https://github.com/dotnet/corefx/issues/32099

There are a few places where this change breaks the logic; I marked them with comments and will highlight them here in the PR.

@JeremyKuhne @danmosemsft @jkotas

@@ -707,6 +707,7 @@ public static unsafe void ImplicitCast_ResultingSpanMatches(string s)
ReadOnlySpan<char> span = s;
Assert.Equal(s.Length, span.Length);
fixed (char* stringPtr = s)
// removing MemoryMarshal.GetReference breaks this test
@lkts (Author)

Here we compare a pointer to an empty string with a pointer to an empty span; without GetReference the span side pins to null while the string side does not, so the comparison fails.
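
For illustration, a minimal, self-contained sketch of the situation (names here are illustrative, and the assertions are simplified rather than quoted from the test): pinning an empty string yields a non-null pointer to its terminating '\0', and MemoryMarshal.GetReference preserves that pointer for the string-backed empty span, whereas the default fixed over the span pins to null.

using System;
using System.Runtime.InteropServices;

static class EmptyStringPinSketch
{
    static unsafe void Main()
    {
        string s = string.Empty;
        ReadOnlySpan<char> span = s; // string-backed span of length 0

        fixed (char* stringPtr = s)                                       // non-null: points at the terminating '\0'
        fixed (char* viaGetReference = &MemoryMarshal.GetReference(span)) // same non-null address
        fixed (char* viaDefaultPin = span)                                // null: GetPinnableReference returns null for an empty span
        {
            Console.WriteLine(stringPtr == viaGetReference); // True
            Console.WriteLine(stringPtr == viaDefaultPin);   // False
        }
    }
}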

@@ -525,6 +525,11 @@ public override void Write(ReadOnlySpan<byte> buffer)

internal void WriteCore(ReadOnlySpan<byte> buffer)
{
if (buffer.IsEmpty)
@lkts (Author)

Nothing to write in this case.

@@ -1541,6 +1541,8 @@ private void WriteCharacterStringCore(Asn1Tag tag, Text.Encoding encoding, ReadO
int size = -1;

// T-REC-X.690-201508 sec 9.2
// MemoryMarshal.GetReference is used here to handle the empty-string case correctly
@lkts (Author) Oct 7, 2018

Here an empty string is a valid input (there are tests for it), and removing GetReference breaks the logic, since we get null for an empty span.
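
To make the failure mode concrete, a hedged sketch (the GetByteCount call is an assumed stand-in for what the writer does with the pinned pointer, not a quote of the ASN.1 code): the pointer-based Encoding APIs throw on a null pointer even when the count is zero, so an empty but string-backed input must still pin to a non-null pointer.

using System;
using System.Runtime.InteropServices;
using System.Text;

static class EmptyInputMeasureSketch
{
    // Works for an empty, string-backed span: GetReference returns the (non-null)
    // data pointer, so GetByteCount(ptr, 0) simply returns 0.
    static unsafe int MeasureWithGetReference(Encoding encoding, ReadOnlySpan<char> str)
    {
        fixed (char* ptr = &MemoryMarshal.GetReference(str))
        {
            return encoding.GetByteCount(ptr, str.Length);
        }
    }

    // Breaks for an empty span: GetPinnableReference pins to null, and
    // Encoding.GetByteCount(char*, int) throws ArgumentNullException for a
    // null pointer even when the count is 0.
    static unsafe int MeasureWithDefaultPin(Encoding encoding, ReadOnlySpan<char> str)
    {
        fixed (char* ptr = str)
        {
            return encoding.GetByteCount(ptr, str.Length);
        }
    }
}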

@stephentoub (Member) commented Oct 7, 2018

Are all of these changes necessary to fix bugs? Or if not, is it possible any will regress performance on hot paths, due to the extra checks involved?

For ones that are fixing bugs, we should make sure we have tests that would have failed without these changes.

@stephentoub (Member)

Changes under CoreLib should be made in the coreclr repo. I know it's a little confusing, but that code isn't necessarily used/compiled/tested in corefx; making the changes in coreclr will help validate it in CI.

@lkts (Author) commented Oct 7, 2018

@stephentoub this is just a cleanup; the new code is equivalent except for the empty-span case, and the majority of places already have precondition checks for it (hence the tests are passing).

@stephentoub (Member) commented Oct 7, 2018

> this is just a cleanup

Thanks (and for working on this). My concern still stands as to a) some of these changes being made in corefx instead of coreclr, and b) whether some of these should remain as they were in order to avoid the extra checks. In many/most cases, the extra checks won't matter and the simpler code you're changing it to is better, but I question whether a few cases might regress perf?

@lkts (Author) commented Oct 7, 2018

@stephentoub I will port the needed parts to coreclr if we decide to proceed with such changes. Regarding b): that's why I added comments in the problematic places; the other places are just the result of "Find and Replace" :)

@stephentoub (Member)

> the other places are just the result of "Find and Replace" :)

Those are the places I'm talking about. Currently it's using:

public static ref T GetReference<T>(Span<T> span) => ref span._pointer.Value;

and your change makes it use:

public unsafe ref T GetPinnableReference() => ref (_length != 0) ? ref _pointer.Value : ref Unsafe.AsRef<T>(null);

which involves an extra branch to ensure that null is returned when the length is 0. In the vast majority of uses, that extra branch won't matter. But there are some places in these code bases where methods are used on such hot paths and in such tight loops that extra branches like that, and increases in code size, do matter. It's quite possible that all of your changes fall into the "this doesn't matter" case, in which case great. But that's my question.
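
Concretely, the difference at a pin site looks roughly like this (a minimal sketch; Consume is a hypothetical stand-in for whatever hot-path work uses the pointer):

using System;
using System.Runtime.InteropServices;

static class PinSiteSketch
{
    static unsafe void Consume(byte* p, int length) { } // hypothetical hot-path consumer

    static unsafe void PinViaGetReference(ReadOnlySpan<byte> buffer)
    {
        // Takes the address of the first element unconditionally; no length branch.
        fixed (byte* p = &MemoryMarshal.GetReference(buffer))
        {
            Consume(p, buffer.Length);
        }
    }

    static unsafe void PinViaDefault(ReadOnlySpan<byte> buffer)
    {
        // fixed (byte* p = buffer) binds to GetPinnableReference, which adds a
        // length branch so that an empty span pins to null.
        fixed (byte* p = buffer)
        {
            Consume(p, buffer.Length);
        }
    }
}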

@lkts (Author) commented Oct 8, 2018

@stephentoub Okay, I get it now, thank you for the detailed explanation. That totally makes sense, but how do I determine what to revert?

@stephentoub (Member)

> That totally makes sense, but how do I determine what to revert?

@JeremyKuhne, you opened #32099... what did you have in mind here?

@eerhardt just faced a case in his work where he saw upwards of a 20% dip in perf due to relying on the default use of GetPinnableReference rather than MemoryMarshal.GetReference, so a sweeping PR that switches from the latter to the former makes me nervous, even though I think it's unlikely to affect the majority of these uses.

@lkts, regardless, the changes under src/Common/src/CoreLib should be reverted in this PR. Even if it's decided those changes should be made, they should be done via a PR to the coreclr repo rather than modifying the mirrored code in corefx.

Thanks.

@JeremyKuhne (Member) commented Oct 13, 2018

> you opened #32099... what did you have in mind here?

Consistency. I opened it based on @jkotas's feedback: #32072 (comment)

@eerhardt (Member)

> @eerhardt just faced a case in his work where he saw upwards of a 20% dip in perf due to relying on the default use of GetPinnableReference rather than MemoryMarshal.GetReference

See dotnet/machinelearning@ec4c943, where I am adding MemoryMarshal.GetReference to all of ML.NET's span-pinning calls inside the CpuMath library.

On my machine the K-Means benchmark went from:

Using arrays:
           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 869.8 ms | 83.13 ms | 95.73 ms |            - |


Using/pin Span with no MemoryMarshal.GetReference:
           Method |    Mean |    Error |   StdDev | Extra Metric |
----------------- |--------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 1.110 s | 0.2186 s | 0.2339 s |            - |


Pin Span with MemoryMarshal.GetReference:
           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 868.9 ms | 87.16 ms | 100.4 ms |            - |

@lkts (Author) commented Oct 13, 2018

@stephentoub @eerhardt I see, thank you. I guess this PR and the issue should be closed then, and such changes should be made in the scope of other changes to these files where applicable.

@JeremyKuhne (Member)

@eerhardt Do you know why it is so much slower? Is that just a bug that needs to be fixed?

The big issue here is that pinning spans via MemoryMarshal.GetReference does not match the behavior of pinning an array directly when the input is empty.

@jkotas, can you give your two bits here?

@jkotas (Member) commented Oct 13, 2018

Array pinning has the extra checks for empty/null by default, just like pinning a Span without MemoryMarshal.GetReference. There is no fundamental reason why regular pinning of a Span should be slower than regular pinning of an array. I agree that it would be a good idea to understand the root cause of the performance difference in the CpuMath library, since you do not see the performance regression with array pinning, which has these empty/null checks as well. E.g. you may just be seeing side effects of cache or loop alignment that have nothing to do with pinning. Or you may be seeing a JIT deficiency that we want to fix.

And since array pinning has the extra checks for empty/null by default, folks sometimes micro-optimize it via patterns that avoid them. For example, if you know that the array is not empty, you can do this:

fixed (byte* p = &array[0])
{
   ...
}

This pattern tends to produce slightly smaller and faster code than regular array pinning. It is similar to the MemoryMarshal.GetReference micro-optimization for Span.
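
For comparison, a sketch of the two array-pinning forms being contrasted here (Process is a hypothetical consumer):

using System;

static class ArrayPinSketch
{
    static unsafe void Process(byte* p, int length) { } // hypothetical consumer

    static unsafe void PinRegular(byte[] array)
    {
        // Regular array pinning: the compiler emits the empty/null checks, and a
        // zero-length array pins to a null pointer.
        fixed (byte* p = array)
        {
            Process(p, array.Length);
        }
    }

    static unsafe void PinKnownNonEmpty(byte[] array)
    {
        // The micro-optimization above: skips those checks, but is only valid when
        // the caller guarantees the array is non-null and non-empty (otherwise
        // &array[0] throws).
        fixed (byte* p = &array[0])
        {
            Process(p, array.Length);
        }
    }
}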

@stephentoub (Member)

> For example, if you know that the array is not empty, you can do this

That's what Eric's array code was doing. The GetPinnableReference added checks not present in the array code.

@jkotas (Member) commented Oct 15, 2018

Even with everything mentioned here so far, the performance results in #32669 (comment) do not quite make sense to me.

@eerhardt Could you please confirm:

  • What was the exact version of the runtime you ran this benchmark on? Note that 2.1.x had a number of Span-related performance bugs that are fixed in 3.0 preview builds.
  • For the "Using arrays" case, was the array pinning done as fixed (void* p = array) or as fixed (void* p = &array[0])?

@eerhardt (Member)

> What was the exact version of the runtime you ran this benchmark on? Note that 2.1.x had a number of Span-related performance bugs that are fixed in 3.0 preview builds.

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009622
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-RTCIWG : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

For the "Using arrays" case, was the array pinned done as fixed (void p = array) or as fixed (void p = &array[0])?

All the existing CpuMath code pins the array with
fixed (float* pa = &array[0])

See https://github.com/dotnet/machinelearning/blob/7e9e4687284786ae241b2e8c8457a02686e639e9/src/Microsoft.ML.CpuMath/Sse.cs#L1120-L1134

@jkotas (Member) commented Oct 15, 2018

> .NET Core 2.1.5

That can explain it. @eerhardt Could you please run the same benchmark on the .NET Core 3.0 runtime?

@stephentoub (Member)

> Note that 2.1.x had a number of Span-related performance bugs that are fixed in 3.0 preview builds.

@jkotas, which issues were those?

Should I infer from your comments that you believe it's safe to search/remove GetReference usage in exchange for the default GetPinnableReference, without further perf validation (in particular for the coreclr cases that were removed from this PR)?

@jkotas (Member) commented Oct 15, 2018

> @jkotas, which issues were those?

E.g. https://github.com/dotnet/coreclr/issues/18270

> Should I infer from your comments that you believe it's safe to search/remove GetReference usage in exchange for the default GetPinnableReference, without further perf validation

I agree with you that we should not do a blanket search/remove without any perf validation. This change should be split into multiple PRs:

  1. We know about several cases where the GetReference pinning is likely causing real product bugs. There should be separate PRs with fixes for these, ideally with regression tests.
  2. Go over the GetReference uses for pinning and flip the ones that wrap an expensive operation (e.g. a P/Invoke), since that is unlikely to cause a regression; a sketch of such a flip follows this list. Sanity-check the performance of a few randomly selected ones. Submit a PR with this delta (it can be split into multiple PRs).
  3. For the remaining cases, go over them one by one and validate that there is no material perf regression when they are flipped.
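
For item 2, a hedged sketch of such a flip (the native import is hypothetical and stands in for any call expensive enough to hide an extra branch at the pin site):

using System;
using System.Runtime.InteropServices;

static class PinFlipSketch
{
    // Hypothetical native function; any expensive interop call would do.
    [DllImport("somenativelib")]
    static extern unsafe int native_process(byte* data, int length);

    // Before: micro-optimized pin via MemoryMarshal.GetReference.
    static unsafe int Before(ReadOnlySpan<byte> buffer)
    {
        fixed (byte* p = &MemoryMarshal.GetReference(buffer))
        {
            return native_process(p, buffer.Length);
        }
    }

    // After the flip: the default pin. The extra empty-check branch is negligible
    // next to the interop transition, but note that it passes null for an empty
    // span, which the callee must tolerate.
    static unsafe int After(ReadOnlySpan<byte> buffer)
    {
        fixed (byte* p = buffer)
        {
            return native_process(p, buffer.Length);
        }
    }
}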

My most recent comments were trying to suggest that the ML.NET results posted above do not represent the current state of .NET Core 3.0 well. The difference between the array pinning case (even the micro-optimized one with array[0]) and the regular Span pinning case should be non-existent or much smaller for ML.NET. And if that is not the case, there is likely a RyuJIT bug that we want to understand and fix.

@eerhardt (Member)

> @eerhardt Could you please run the same benchmark on the .NET Core 3.0 runtime?

I've run each commit 5 times using the latest 3.0 from core-sdk:

Arrays: (the "base" commit)

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-KKDUIV : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 760.0 ms | 46.96 ms | 46.12 ms |            - |
 TrainKMeansAndLR | 748.2 ms | 80.66 ms | 92.89 ms |            - |
 TrainKMeansAndLR | 766.3 ms | 95.77 ms | 106.4 ms |            - |
 TrainKMeansAndLR | 810.7 ms | 86.67 ms | 96.34 ms |            - |
 TrainKMeansAndLR | 794.4 ms | 69.48 ms | 77.23 ms |            - |

Pin Span (default): (Changing to use Spans commit)

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-LPAJUQ : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 840.4 ms | 67.70 ms | 72.44 ms |            - |
 TrainKMeansAndLR | 830.0 ms | 81.18 ms | 93.49 ms |            - |
 TrainKMeansAndLR | 905.2 ms | 91.83 ms | 98.26 ms |            - |
 TrainKMeansAndLR | 902.4 ms | 117.0 ms | 134.8 ms |            - |
 TrainKMeansAndLR | 826.2 ms | 44.83 ms | 46.04 ms |            - |

Pin MemoryMarshal.GetReference(Span):(Changing to call MemoryMarshal commit)

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-THNURV : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 744.2 ms | 87.17 ms | 93.27 ms |            - |
 TrainKMeansAndLR | 778.3 ms | 64.35 ms | 71.53 ms |            - |
 TrainKMeansAndLR | 776.7 ms | 67.13 ms | 77.30 ms |            - |
 TrainKMeansAndLR | 735.3 ms | 57.70 ms | 61.74 ms |            - |
 TrainKMeansAndLR | 827.8 ms | 125.0 ms | 143.9 ms |            - |

You can see there is some variation run-to-run, but the times for pinning the span in the "default" way are consistently higher.

@eerhardt (Member)

Last week I also wrote up a micro benchmark, which I believe repros the core issue.

Here is the change with the benchmark, and a side-by-side "Span" overload of DotProduct:
eerhardt/machinelearning@56b849c

Running the above benchmark produces:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-OWUCOK : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

               Method |     Mean |     Error |    StdDev | Extra Metric |
--------------------- |---------:|----------:|----------:|-------------:|
   TestDotProductSpan | 61.28 ns | 1.5821 ns | 1.8220 ns |            - |
 TestDotProductNoSpan | 44.95 ns | 0.8567 ns | 0.7594 ns |            - |

And here is the change to call MemoryMarshal.GetReference:
eerhardt/machinelearning@5197bfa

Running the benchmark again:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-EHBIXH : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

               Method |     Mean |     Error |    StdDev | Extra Metric |
--------------------- |---------:|----------:|----------:|-------------:|
   TestDotProductSpan | 47.93 ns | 0.9330 ns | 0.8727 ns |            - |
 TestDotProductNoSpan | 44.85 ns | 1.3614 ns | 1.5132 ns |            - |

@jkotas (Member) commented Oct 15, 2018

Thanks. I have tweaked the GetPinnableReference implementation in dotnet/coreclr#20428 to make it much closer to pinning via MemoryMarshal.GetReference.
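
One shape such a tweak can take, sketched here only for illustration (it reuses the fields quoted earlier in the thread and is not a verbatim quote of that PR): replace the ref-conditional with a single forward branch so the common non-empty case falls through.

public unsafe ref T GetPinnableReference()
{
    // Default to a null reference, then overwrite it for a non-empty span, so the
    // JIT emits one forward branch instead of a conditional ref select.
    ref T ret = ref Unsafe.AsRef<T>(null);
    if (_length != 0)
        ret = ref _pointer.Value;
    return ref ret;
}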

@lkts (Author) commented Oct 16, 2018

Thanks everyone for the productive discussion! Glad that an improvement opportunity was discovered here. As a result, I suggest updating https://github.com/dotnet/corefx/issues/32099 with the steps provided by @jkotas and other relevant information (or creating a new issue). The code changes here no longer make sense, so I am closing this PR.

@lkts lkts closed this Oct 16, 2018
@lkts lkts deleted the get-reference-cleanup branch October 16, 2018 07:42
@karelz karelz added this to the 3.0 milestone Nov 15, 2018