Removed unnecessary MemoryMarshal.GetReference #32669

Closed · wants to merge 6 commits

Conversation

@lkts (Author) commented Oct 7, 2018

Fixes https://github.com/dotnet/corefx/issues/32099

There are a few places where this change breaks the logic; I marked them with comments and will highlight them here in the PR.

@JeremyKuhne @danmosemsft @jkotas

@@ -707,6 +707,7 @@ public static unsafe void ImplicitCast_ResultingSpanMatches(string s)
ReadOnlySpan<char> span = s;
Assert.Equal(s.Length, span.Length);
fixed (char* stringPtr = s)
// removing MemoryMarshal.GetReference breaks this test
@lkts (Author)

Here we compare a pointer to an empty string with a pointer to an empty span; without GetReference the span side pins to null while the string side does not, so the comparison fails.
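
For illustration, a minimal, self-contained sketch of the situation (names here are illustrative, and the assertions are simplified rather than quoted from the test): pinning an empty string yields a non-null pointer to its terminating '\0', and MemoryMarshal.GetReference preserves that pointer for the string-backed empty span, whereas the default fixed over the span pins to null.

using System;
using System.Runtime.InteropServices;

static class EmptyStringPinSketch
{
    static unsafe void Main()
    {
        string s = string.Empty;
        ReadOnlySpan<char> span = s; // string-backed span of length 0

        fixed (char* stringPtr = s)                                       // non-null: points at the terminating '\0'
        fixed (char* viaGetReference = &MemoryMarshal.GetReference(span)) // same non-null address
        fixed (char* viaDefaultPin = span)                                // null: GetPinnableReference returns null for an empty span
        {
            Console.WriteLine(stringPtr == viaGetReference); // True
            Console.WriteLine(stringPtr == viaDefaultPin);   // False
        }
    }
}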

@@ -525,6 +525,11 @@ public override void Write(ReadOnlySpan<byte> buffer)

internal void WriteCore(ReadOnlySpan<byte> buffer)
{
if (buffer.IsEmpty)
@lkts (Author)

Nothing to write in this case.

@@ -1541,6 +1541,8 @@ private void WriteCharacterStringCore(Asn1Tag tag, Text.Encoding encoding, ReadO
int size = -1;

// T-REC-X.690-201508 sec 9.2
// MemoryMarshal.GetReference is used here to handle the empty-string case correctly
@lkts (Author) Oct 7, 2018

Here an empty string is a valid input (there are tests for it), and removing GetReference breaks the logic, since we get null for an empty span.
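
To make the failure mode concrete, a hedged sketch (the GetByteCount call is an assumed stand-in for what the writer does with the pinned pointer, not a quote of the ASN.1 code): the pointer-based Encoding APIs throw on a null pointer even when the count is zero, so an empty but string-backed input must still pin to a non-null pointer.

using System;
using System.Runtime.InteropServices;
using System.Text;

static class EmptyInputMeasureSketch
{
    // Works for an empty, string-backed span: GetReference returns the (non-null)
    // data pointer, so GetByteCount(ptr, 0) simply returns 0.
    static unsafe int MeasureWithGetReference(Encoding encoding, ReadOnlySpan<char> str)
    {
        fixed (char* ptr = &MemoryMarshal.GetReference(str))
        {
            return encoding.GetByteCount(ptr, str.Length);
        }
    }

    // Breaks for an empty span: GetPinnableReference pins to null, and
    // Encoding.GetByteCount(char*, int) throws ArgumentNullException for a
    // null pointer even when the count is 0.
    static unsafe int MeasureWithDefaultPin(Encoding encoding, ReadOnlySpan<char> str)
    {
        fixed (char* ptr = str)
        {
            return encoding.GetByteCount(ptr, str.Length);
        }
    }
}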

@stephentoub (Member) commented Oct 7, 2018

Are all of these changes necessary to fix bugs? Or if not, is it possible any will regress performance on hot paths, due to the extra checks involved?

For ones that are fixing bugs, we should make sure we have tests that would have failed without these changes.

@stephentoub (Member)

Changes under CoreLib should be made in the coreclr repo. I know it's a little confusing, but that code isn't necessarily used/compiled/tested in corefx; making the changes in coreclr will help validate it in CI.

@lkts (Author) commented Oct 7, 2018

@stephentoub this is just a cleanup; the new code is equivalent except for the empty-span case, and the majority of places already have precondition checks for it (hence the tests are passing).

@stephentoub (Member) commented Oct 7, 2018

> this is just a cleanup

Thanks (and for working on this). My concern still stands as to a) some of these changes being made in corefx instead of coreclr, and b) whether some of these should remain as they were in order to avoid the extra checks. In many/most cases, the extra checks won't matter and the simpler code you're changing it to is better, but I question whether a few cases might regress perf?

@lkts (Author) commented Oct 7, 2018

@stephentoub I will port the needed parts to coreclr if we decide to proceed with such changes. Regarding b): that's why I added comments in the problematic places; the other places are just the result of "Find and Replace" :)

@stephentoub (Member)

> the other places are just the result of "Find and Replace" :)

Those are the places I'm talking about. Currently it's using:

public static ref T GetReference<T>(Span<T> span) => ref span._pointer.Value;

and your change makes it use:

public unsafe ref T GetPinnableReference() => ref (_length != 0) ? ref _pointer.Value : ref Unsafe.AsRef<T>(null);

which involves an extra branch to ensure that null is returned when the length is 0. In the vast majority of uses, that extra branch won't matter. But there are some places in these code bases where methods are used on such hot paths and in such tight loops that extra branches like that, and increases in code size, do matter. It's quite possible that all of your changes fall into the "this doesn't matter" case, in which case great. But that's my question.
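
Concretely, the difference at a pin site looks roughly like this (a minimal sketch; Consume is a hypothetical stand-in for whatever hot-path work uses the pointer):

using System;
using System.Runtime.InteropServices;

static class PinSiteSketch
{
    static unsafe void Consume(byte* p, int length) { } // hypothetical hot-path consumer

    static unsafe void PinViaGetReference(ReadOnlySpan<byte> buffer)
    {
        // Takes the address of the first element unconditionally; no length branch.
        fixed (byte* p = &MemoryMarshal.GetReference(buffer))
        {
            Consume(p, buffer.Length);
        }
    }

    static unsafe void PinViaDefault(ReadOnlySpan<byte> buffer)
    {
        // fixed (byte* p = buffer) binds to GetPinnableReference, which adds a
        // length branch so that an empty span pins to null.
        fixed (byte* p = buffer)
        {
            Consume(p, buffer.Length);
        }
    }
}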

@lkts (Author) commented Oct 8, 2018

@stephentoub Okay, I get it now, thank you for the detailed explanation. That totally makes sense, but how do I determine what to revert?

@stephentoub (Member)

> That totally makes sense, but how do I determine what to revert?

@JeremyKuhne, you opened #32099... what did you have in mind here?

@eerhardt just faced a case in his work where he saw upwards of a 20% dip in perf due to relying on the default use of GetPinnableReference rather than MemoryMarshal.GetReference, so a sweeping PR that switches from the latter to the former makes me nervous, even though I think it's unlikely to affect the majority of these uses.

@lkts, regardless, the changes under src/Common/src/CoreLib should be reverted in this PR. Even if it's decided those changes should be made, they should be done via a PR to the coreclr repo rather than modifying the mirrored code in corefx.

Thanks.

@JeremyKuhne (Member) commented Oct 13, 2018

> you opened #32099... what did you have in mind here?

Consistency. I opened it based on @jkotas's feedback: #32072 (comment)

@eerhardt (Member)

> @eerhardt just faced a case in his work where he saw upwards of a 20% dip in perf due to relying on the default use of GetPinnableReference rather than MemoryMarshal.GetReference

See dotnet/machinelearning@ec4c943, where I am adding MemoryMarshal.GetReference to all of ML.NET's span-pinning calls inside the CpuMath library.

On my machine the K-Means benchmark went from:

Using arrays:
           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 869.8 ms | 83.13 ms | 95.73 ms |            - |


Using/pin Span with no MemoryMarshal.GetReference:
           Method |    Mean |    Error |   StdDev | Extra Metric |
----------------- |--------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 1.110 s | 0.2186 s | 0.2339 s |            - |


Pin Span with MemoryMarshal.GetReference:
           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 868.9 ms | 87.16 ms | 100.4 ms |            - |

@lkts (Author) commented Oct 13, 2018

@stephentoub @eerhardt I see, thank you. I guess this PR and the issue should be closed then, and such changes should be made in the scope of other changes to these files where applicable.

@JeremyKuhne (Member)

@eerhardt Do you know why it is so much slower? Is that just a bug that needs to be fixed?

The big issue here is that pinning spans via MemoryMarshal.GetReference does not match the behavior of pinning an array directly when the input is empty.

@jkotas, can you give your two bits here?

@jkotas (Member) commented Oct 13, 2018

Array pinning has the extra checks for empty/null by default, just like pinning a Span without MemoryMarshal.GetReference. There is no fundamental reason why regular pinning of a Span should be slower than regular pinning of an array. I agree that it would be a good idea to understand the root cause of the performance difference in the CpuMath library, since you do not see the performance regression with array pinning, which has these empty/null checks as well. E.g. you may just be seeing side effects of cache or loop alignment that have nothing to do with pinning. Or you may be seeing a JIT deficiency that we want to fix.

And since array pinning has the extra checks for empty/null by default, folks sometimes micro-optimize it via patterns that avoid them. For example, if you know that the array is not empty, you can do this:

fixed (byte* p = &array[0])
{
   ...
}

This pattern tends to produce slightly smaller and faster code than regular array pinning. It is similar to the MemoryMarshal.GetReference micro-optimization for Span.
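
For comparison, a sketch of the two array-pinning forms being contrasted here (Process is a hypothetical consumer):

using System;

static class ArrayPinSketch
{
    static unsafe void Process(byte* p, int length) { } // hypothetical consumer

    static unsafe void PinRegular(byte[] array)
    {
        // Regular array pinning: the compiler emits the empty/null checks, and a
        // zero-length array pins to a null pointer.
        fixed (byte* p = array)
        {
            Process(p, array.Length);
        }
    }

    static unsafe void PinKnownNonEmpty(byte[] array)
    {
        // The micro-optimization above: skips those checks, but is only valid when
        // the caller guarantees the array is non-null and non-empty (otherwise
        // &array[0] throws).
        fixed (byte* p = &array[0])
        {
            Process(p, array.Length);
        }
    }
}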

@stephentoub (Member)

> For example, if you know that the array is not empty, you can do this

That's what Eric's array code was doing. The GetPinnableReference added checks not present in the array code.

@jkotas (Member) commented Oct 15, 2018

Even with everything mentioned here so far, the performance results in #32669 (comment) do not quite make sense to me.

@eerhardt Could you please confirm:

  • What was the exact version of the runtime you ran this benchmark on? Note that 2.1.x had a number of Span-related performance bugs that are fixed in 3.0 preview builds.
  • For the "Using arrays" case, was the array pinning done as fixed (void* p = array) or as fixed (void* p = &array[0])?

@eerhardt (Member)

> What was the exact version of the runtime you ran this benchmark on? Note that 2.1.x had a number of Span-related performance bugs that are fixed in 3.0 preview builds.

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009622
  [Host]     : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT
  Job-RTCIWG : .NET Core 2.1.5 (CoreCLR 4.6.26919.02, CoreFX 4.6.26919.02), 64bit RyuJIT

For the "Using arrays" case, was the array pinned done as fixed (void p = array) or as fixed (void p = &array[0])?

All the existing CpuMath code pins the array with
fixed (float* pa = &array[0])

See https://github.com/dotnet/machinelearning/blob/7e9e4687284786ae241b2e8c8457a02686e639e9/src/Microsoft.ML.CpuMath/Sse.cs#L1120-L1134

@jkotas (Member) commented Oct 15, 2018

> .NET Core 2.1.5

That can explain it. @eerhardt Could you please run the same benchmark on the .NET Core 3.0 runtime?

@stephentoub (Member)

> Note that 2.1.x had a number of Span-related performance bugs that are fixed in 3.0 preview builds.

@jkotas, which issues were those?

Should I infer from your comments that you believe it's safe to search/remove GetReference usage in exchange for the default GetPinnableReference, without further perf validation (in particular for the coreclr cases that were removed from this PR)?

@jkotas (Member) commented Oct 15, 2018

> @jkotas, which issues were those?

E.g. https://github.com/dotnet/coreclr/issues/18270

> Should I infer from your comments that you believe it's safe to search/remove GetReference usage in exchange for the default GetPinnableReference, without further perf validation

I agree with you that we should not do a blanket search/remove without any perf validation. This change should be split into multiple PRs:

  1. We know about several cases where the GetReference pinning is likely causing real product bugs. There should be separate PRs with fixes for these, ideally with regression tests.
  2. Go over the GetReference uses for pinning and flip the ones that wrap an expensive operation (e.g. a P/Invoke), since that is unlikely to cause a regression; a sketch of such a flip follows this list. Sanity-check the performance of a few randomly selected ones. Submit a PR with this delta (it can be split into multiple PRs).
  3. For the remaining cases, go over them one by one and validate that there is no material perf regression when they are flipped.
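
For item 2, a hedged sketch of such a flip (the native import is hypothetical and stands in for any call expensive enough to hide an extra branch at the pin site):

using System;
using System.Runtime.InteropServices;

static class PinFlipSketch
{
    // Hypothetical native function; any expensive interop call would do.
    [DllImport("somenativelib")]
    static extern unsafe int native_process(byte* data, int length);

    // Before: micro-optimized pin via MemoryMarshal.GetReference.
    static unsafe int Before(ReadOnlySpan<byte> buffer)
    {
        fixed (byte* p = &MemoryMarshal.GetReference(buffer))
        {
            return native_process(p, buffer.Length);
        }
    }

    // After the flip: the default pin. The extra empty-check branch is negligible
    // next to the interop transition, but note that it passes null for an empty
    // span, which the callee must tolerate.
    static unsafe int After(ReadOnlySpan<byte> buffer)
    {
        fixed (byte* p = buffer)
        {
            return native_process(p, buffer.Length);
        }
    }
}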

My most recent comments were trying to suggest that the ML.NET results posted above do not represent the current state of .NET Core 3.0 well. The difference between the array pinning case (even the micro-optimized one with array[0]) and the regular Span pinning case should be non-existent or much smaller for ML.NET. And if that is not the case, there is likely a RyuJIT bug that we want to understand and fix.

@eerhardt (Member)

> @eerhardt Could you please run the same benchmark on the .NET Core 3.0 runtime?

I've run each commit 5 times using the latest 3.0 from core-sdk:

Arrays: (the "base" commit)

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-KKDUIV : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 760.0 ms | 46.96 ms | 46.12 ms |            - |
 TrainKMeansAndLR | 748.2 ms | 80.66 ms | 92.89 ms |            - |
 TrainKMeansAndLR | 766.3 ms | 95.77 ms | 106.4 ms |            - |
 TrainKMeansAndLR | 810.7 ms | 86.67 ms | 96.34 ms |            - |
 TrainKMeansAndLR | 794.4 ms | 69.48 ms | 77.23 ms |            - |

Pin Span (default): (Changing to use Spans commit)

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-LPAJUQ : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 840.4 ms | 67.70 ms | 72.44 ms |            - |
 TrainKMeansAndLR | 830.0 ms | 81.18 ms | 93.49 ms |            - |
 TrainKMeansAndLR | 905.2 ms | 91.83 ms | 98.26 ms |            - |
 TrainKMeansAndLR | 902.4 ms | 117.0 ms | 134.8 ms |            - |
 TrainKMeansAndLR | 826.2 ms | 44.83 ms | 46.04 ms |            - |

Pin MemoryMarshal.GetReference(Span):(Changing to call MemoryMarshal commit)

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-THNURV : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

           Method |     Mean |    Error |   StdDev | Extra Metric |
----------------- |---------:|---------:|---------:|-------------:|
 TrainKMeansAndLR | 744.2 ms | 87.17 ms | 93.27 ms |            - |
 TrainKMeansAndLR | 778.3 ms | 64.35 ms | 71.53 ms |            - |
 TrainKMeansAndLR | 776.7 ms | 67.13 ms | 77.30 ms |            - |
 TrainKMeansAndLR | 735.3 ms | 57.70 ms | 61.74 ms |            - |
 TrainKMeansAndLR | 827.8 ms | 125.0 ms | 143.9 ms |            - |

You can see there is some variation run-to-run, but the times for pinning the span in the "default" way are consistently higher.

@eerhardt (Member)

Last week I also wrote up a micro benchmark, which I believe repros the core issue.

Here is the change with the benchmark, and a side-by-side "Span" overload of DotProduct:
eerhardt/machinelearning@56b849c

Running the above benchmark produces:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-OWUCOK : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

               Method |     Mean |     Error |    StdDev | Extra Metric |
--------------------- |---------:|----------:|----------:|-------------:|
   TestDotProductSpan | 61.28 ns | 1.5821 ns | 1.8220 ns |            - |
 TestDotProductNoSpan | 44.95 ns | 0.8567 ns | 0.7594 ns |            - |

And here is the change to call MemoryMarshal.GetReference:
eerhardt/machinelearning@5197bfa

Running the benchmark again:

BenchmarkDotNet=v0.11.1, OS=Windows 10.0.17763
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.0.100-alpha1-009670
  [Host]     : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT
  Job-EHBIXH : .NET Core 3.0.0-preview1-27009-01 (CoreCLR 4.6.27008.04, CoreFX 4.6.27008.04), 64bit RyuJIT

BuildConfiguration=Release  Toolchain=netcoreapp3.0  MaxIterationCount=20
WarmupCount=1

               Method |     Mean |     Error |    StdDev | Extra Metric |
--------------------- |---------:|----------:|----------:|-------------:|
   TestDotProductSpan | 47.93 ns | 0.9330 ns | 0.8727 ns |            - |
 TestDotProductNoSpan | 44.85 ns | 1.3614 ns | 1.5132 ns |            - |

@jkotas (Member) commented Oct 15, 2018

Thanks. I have tweaked the GetPinnableReference implementation in dotnet/coreclr#20428 to make it much closer to pinning via MemoryMarshal.GetReference.
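
One shape such a tweak can take, sketched here only for illustration (it reuses the fields quoted earlier in the thread and is not a verbatim quote of that PR): replace the ref-conditional with a single forward branch so the common non-empty case falls through.

public unsafe ref T GetPinnableReference()
{
    // Default to a null reference, then overwrite it for a non-empty span, so the
    // JIT emits one forward branch instead of a conditional ref select.
    ref T ret = ref Unsafe.AsRef<T>(null);
    if (_length != 0)
        ret = ref _pointer.Value;
    return ref ret;
}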

@lkts (Author) commented Oct 16, 2018

Thanks everyone for the productive discussion! Glad that an improvement opportunity was discovered here. As a result, I suggest updating https://github.com/dotnet/corefx/issues/32099 with the steps provided by @jkotas and other relevant information (or creating a new issue). The code changes here no longer make sense, so I am closing this PR.

@lkts lkts closed this Oct 16, 2018
@lkts lkts deleted the get-reference-cleanup branch October 16, 2018 07:42
@karelz karelz added this to the 3.0 milestone Nov 15, 2018