Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Vectorize some Guid APIs (ctor, TryWriteBytes) #21336

Merged
merged 9 commits into from
Dec 3, 2018

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Dec 3, 2018

  1. Guid.TryWriteBytes(Span<byte>):
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public Span<byte> TryWriteBytes(Guid id)
{
    Span<byte> s = new byte[20];
    id.TryWriteBytes(s);
    return s;
}
Method Mean Error StdDev
TryWriteBytes_new 4.960 ns 0.0818 ns 0.0292 ns
TryWriteBytes_old 8.541 ns 0.0362 ns 0.0129 ns

  1. new Guid(byte[]) and new Guid(ReadOnlySpan<byte>):
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public Guid GuidCtor(byte[] data)
{
    return new Guid(data);
}
Method Mean Error StdDev
GuidCtor_new 2.615 ns 0.0275 ns 0.0098 ns
GuidCtor_old 7.483 ns 0.0146 ns 0.0052 ns

  1. Guid.ToByteArray():
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] ToByteArray(Guid id)
{
    return id.ToByteArray();
}
Method Mean Error StdDev
ToByteArray_new 3.569 ns 0.0192 ns 0.0068 ns
ToByteArray_old 6.265 ns 0.0776 ns 0.0202 ns

Environment:

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.407 (1803/April2018Update/Redstone4)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
Frequency=3609372 Hz, Resolution=277.0565 ns, Timer=TSC
.NET Core SDK=3.0.100-preview-009812
  [Host]     : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
  Job-CIYVXD : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT

IterationCount=6  WarmupCount=6  

@@ -841,8 +841,12 @@ public bool Equals(Guid g)
// Now compare each of the elements
return g._a == _a &&
Unsafe.Add(ref g._a, 1) == Unsafe.Add(ref _a, 1) &&
#if BIT64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to do this for the first check as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas in case if two guids are different (both random) - it becomes slower (see Equals_2 in the table - it does two ulong-checks).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • changed to be two ulong *

@@ -841,8 +841,12 @@ public bool Equals(Guid g)
// Now compare each of the elements
return g._a == _a &&
Unsafe.Add(ref g._a, 1) == Unsafe.Add(ref _a, 1) &&
#if BIT64
Unsafe.Add(ref Unsafe.As<int, ulong>(ref g._a), 1) == Unsafe.Add(ref Unsafe.As<int, ulong>(ref _a), 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has potential portability problem: It assumes that accessing misaligned ulong is ok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 🙁

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this should rather use Unsafe.ReadUnaligned

@jkotas
Copy link
Member

jkotas commented Dec 3, 2018

https://gist.github.com/EgorBo/75348d5f194cbe91342d1b77d73b6c04

I has AggressiveInlining on everything. The actual implementation does not have it. Are the results representative?

@EgorBo
Copy link
Member Author

EgorBo commented Dec 3, 2018

@jkotas oh, indeed, let me re-run the benchmark without it. Initially I wanted to replace duplicated code to compare guids with Equals with [AggressiveInline] e.g. https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Guid.cs#L833-L845 (+ operator ==/!=)

@jkotas
Copy link
Member

jkotas commented Dec 3, 2018

cc @OlegAxenow

@EgorBo
Copy link
Member Author

EgorBo commented Dec 3, 2018

hm.. without AggressiveInlining results are a little bit different:
image
comparing Guids as two ulongs is the most efficient way now (tried several times on a clean env).

PS: probably it makes sense to mark this method with AggressiveInlining anyway.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 3, 2018

Turns out WriteByteHelper can be optimized with SSE2 (it makes it twice faster in my cases).

destination[13] = _i;
destination[14] = _j;
destination[15] = _k;
if (Sse2.IsSupported)
Copy link
Member

@jkotas jkotas Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it make sense to be special casing a simple copying like this for SSE2. Instead, this should be special cased for all little-endian platforms, like:

if (BitConverter.IsLittleEndian)
{
    MemoryMarshal.Write(destination, ref this);
}
else
{
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, it's even faster than the sse impl!

Copy link
Member

@jkotas jkotas Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folding WriteByteHelper into TryWriteBytes would make this even leaner for little endian platforms:

public byte[] ToByteArray()
{
    var g = new byte[16];
    if (BitConverter.IsLittleEndian)
    {
        MemoryMarshal.TryWrite<Guid>(g, ref this);
    }
    else
    {
        TryWriteBytes(g);
    }
    return g;
}

public bool TryWriteBytes(Span<byte> destination)
{
    if (BitConverter.IsLittleEndian)
    {
        return MemoryMarshal.TryWrite<Guid>(destination, ref this);    
    }
    else
    {
        ... the slower path to write the bytes one by one ...
        ... you make consider using BinaryPrimitives.WriteInt32LittleEndian and BinaryPrimitives.WriteInt16LittleEndian to implement it ...
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call TryWriteBytes from ToByteArray directly. ToByteArray is slow allocating method. It is not going to be used on performance critical paths. It is better to have smaller code for it.

// Returns whether bytes are sucessfully written to given span.
public bool TryWriteBytes(Span<byte> destination)
{
if ((uint)destination.Length < 16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this bounds check duplicated in MemoryMarshal.Write ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas I guess MemoryMarshal.Write has it's own bounds checks (according to output) and this uint-hack doesn't help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So why not to use TryWrite for the little endian path like I have suggested above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas oh, I didn't notice it, thanks!

}

// slower path for BigEndian
if ((uint)destination.Length >= 16)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Does writing it this way help anything? It is generally more readable to use early returns, like:

 if (destination.Length < 16)
    return false;

destination[0] = ...;
....

WriteByteHelper(destination);
return true;
if (BitConverter.IsLittleEndian)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor that takes Span can be optimized the same way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but will it compile? C# requires stucts' constructors to init all fields by hands (however Decimal compiles fine hm...).

Copy link
Member Author

@EgorBo EgorBo Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok it works

this = MemoryMarshal.Read<MyGuid>(b);

destination[12] = _h;
destination[13] = _i;
destination[14] = _j;
destination[15] = _k;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to write this one first like you are doing in other places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought since it's surrounded with if ((uint)destination.Length >= 16) JIT would not insert bounds checks but turns out it inserts them anyway (I think there is a feature-request for it somewhere...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this optimization is very sensitive to exact pattern today.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it still write bounds checks if you specify 16u?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@grant-d unfortunately yes it ignores (uint) > 16 (and 16u)

@EgorBo
Copy link
Member Author

EgorBo commented Dec 3, 2018

@jkotas I've removed Equals fix (will do that in a separate PR with other places).
I'll rewrite the PR description in an hour and attach better benchmarks for what we've done now 🙂

@jkotas
Copy link
Member

jkotas commented Dec 3, 2018

Could you please also add a short comment like // Hoist bounds checks next to all of the out-of-order writes so that somebody looking at this will get a hint that it is done intentionally this way?

@EgorBo EgorBo changed the title Optimize Guid.Equals and remove some bounds checks Vectorize some Guid APIs (ctor, TryWriteBytes) Dec 3, 2018
@jkotas jkotas merged commit 248449d into dotnet:master Dec 3, 2018
Anipik pushed a commit to dotnet/corert that referenced this pull request Dec 4, 2018
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@@ -841,8 +841,12 @@ public bool Equals(Guid g)
// Now compare each of the elements
return g._a == _a &&

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a dumb question that I could have just looked up, but I wanted to ask here so I was sure... What part of the GUID structure is the LSB here? Is it different based on big-endian vs. little-endian? The reason I ask is that if someone is comparing lots of sequentially assigned GUIDs, it's probably faster to compare LSB first, rather than MSB, right? Seems like that would be faster on average than the other short-circuit eval and not slower in the non-sequential case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kellypleahy it depends on what you mean by "sequentially assigned GUIDs". For instance MS SQL Server increment first int field something like this:
00000000-0000-0000-0000-000000000000
00010000-0000-0000-0000-000000000000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks. It appears that UuidCreateSequential at least seems to update _a when generating sequential guids (and SQL Server just reverses the bytes of _a to big-endian when storing so as to ensure sequential order of the byte array) so in any case, comparing _a first is best by my argument above. Thanks for indulging me.

stephentoub pushed a commit to dotnet/corefx that referenced this pull request Dec 4, 2018
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jlennox pushed a commit to jlennox/corefx that referenced this pull request Dec 16, 2018
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Commit migrated from dotnet/coreclr@248449d
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants