Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Blake2b intrinsics POC #398

Draft
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

TimothyMakkison
Copy link

Adapted @saucecontrol's Blake2Fast intrinsics code to work with the Konscious.Security.Cryptography library and thought it would work here.

The benchmark measured a 6x-13x speed up without additional memory usage. Blake2s or Blake2xs would probably experience similar benefits.

Non intrinsics

Method Size Mean Error StdDev Median Gen0 Allocated
GetHash 3 2.270 μs 0.0804 μs 0.2372 μs 2.337 μs 0.1106 176 B
GetHash 100 2.082 μs 0.0241 μs 0.0214 μs 2.083 μs 0.1106 176 B
GetHash 1000 15.085 μs 0.8234 μs 2.4279 μs 15.903 μs 0.0916 176 B
GetHash 10000 155.035 μs 2.8955 μs 4.1527 μs 155.360 μs - 176 B

Intrinsics

Method Size Mean Error StdDev Median Gen0 Allocated
GetHash 3 356.7 ns 11.23 ns 33.12 ns 367.6 ns 0.1121 176 B
GetHash 100 341.0 ns 6.96 ns 20.09 ns 343.6 ns 0.1121 176 B
GetHash 1000 1,439.0 ns 27.08 ns 22.61 ns 1,434.5 ns 0.1106 176 B
GetHash 10000 12,796.6 ns 259.71 ns 765.76 ns 13,144.9 ns 0.1068 176 B

Note that I'm new to intrinsics and CompilerServices.Unsafe, this shouldn't be considered complete or safe. Do you have a CI pipeline for testing intrinsic code?

@peterdettman
Copy link
Collaborator

Great numbers, but suggest the following changes:

  • Move all intrinsics code to a separate file (Blake2b_X86).
  • Put intrinsics code under "#if NETCOREAPP3_0_OR_GREATER", which AFAIK is the correct minimal version.
  • Add a licensing comment if the code is a substantial copy of either of the mentioned projects.
  • Add IsSupported method to the new class and use that from Blake2bDigest
  • Extract round function into a method (AggressiveInlining is still fine)
  • Probably can replace most of the Unsafe methods with MemoryMarshal methods

At least add the license comment, the rest I can take care of if necessary.

@peterdettman peterdettman self-assigned this Nov 19, 2022
@TimothyMakkison
Copy link
Author

TimothyMakkison commented Nov 20, 2022

  • Move all intrinsics code to a separate file (Blake2b_X86).
  • Put intrinsics code under "#if NETCOREAPP3_0_OR_GREATER", which AFAIK is the correct minimal version.
  • Add a licensing comment if the code is a substantial copy of either of the mentioned projects.
  • Add IsSupported method to the new class and use that from Blake2bDigest
  • Extract round function into a method (AggressiveInlining is still fine)
  • Probably can replace most of the Unsafe methods with MemoryMarshal methods

Added Blake2s_X86.

Blake2s

Non intrinsics

Method Size Mean Error StdDev Median Gen0 Allocated
GetHash 3 1.872 μs 0.0935 μs 0.2758 μs 1.987 μs 0.1450 232 B
GetHash 100 3.686 μs 0.0737 μs 0.1523 μs 3.720 μs 0.1450 232 B
GetHash 1000 28.101 μs 0.5315 μs 1.0857 μs 28.234 μs 0.1221 232 B
GetHash 10000 204.365 μs 9.7001 μs 28.6010 μs 200.846 μs - 232 B

Intrinsics

Method Size Mean Error StdDev Gen0 Allocated
GetHash 3 347.4 ns 13.51 ns 39.85 ns 0.1478 232 B
GetHash 100 501.3 ns 12.88 ns 37.99 ns 0.1478 232 B
GetHash 1000 2,545.4 ns 38.75 ns 34.35 ns 0.1450 232 B
GetHash 10000 23,501.2 ns 465.85 ns 653.06 ns 0.1221 232 B

Blake2xs

Non intrinsics

Method Size Mean Error StdDev Gen0 Allocated
GetHash 3 2.306 μs 0.0315 μs 0.0295 μs 0.5684 896 B
GetHash 100 3.201 μs 0.0413 μs 0.0366 μs 0.5684 896 B
GetHash 1000 15.635 μs 0.1960 μs 0.1834 μs 0.5493 896 B
GetHash 10000 135.593 μs 1.6090 μs 1.5051 μs 0.4883 896 B

Intrinsics

Method Size Mean Error StdDev Median Gen0 Allocated
GetHash 3 1.072 μs 0.0193 μs 0.0198 μs 1.070 μs 0.5703 896 B
GetHash 100 1.091 μs 0.0307 μs 0.0906 μs 1.109 μs 0.5703 896 B
GetHash 1000 2.858 μs 0.1685 μs 0.4969 μs 3.035 μs 0.5684 896 B
GetHash 10000 16.398 μs 0.1364 μs 0.1675 μs 16.444 μs 0.5493 896 B

There was no overlap between Blake2b and Blake2s so I deleted VectorExtensions and inlined the methods.

Suggestions

Add a length check to DoFinal to ensure that output spans and arrays are long enough.

There are several repeat implementations of Store128, Load128, Store128_UInt32, perhaps a shared helper method would help.

@peterdettman
Copy link
Collaborator

Note that intrinsics aren't available for NETSTANDARD2_1_OR_GREATER, so those directives should be removed (just NETCOREAPP3_0_OR_GREATER).

I'd still like methods on the _X86 classes, e.g.:
public static bool IsSupported => Avx2.IsSupported && BitConverter.IsLittleEndian;

although ideally BitConverter.IsLittleEndian isn't a requirement and only affects Load/Store (this could fall out later when I factor out all the load/store methods and support bigendian).

Also if all the rounds have the same structure (or a small set thereof), please factor out a method for 1 round (using ref parameters where each round applies to different parts of the state).

I've added length checks that you suggested also (this mirror can take a few days to be updated though).

@TimothyMakkison
Copy link
Author

TimothyMakkison commented Nov 21, 2022

  • Note that intrinsics aren't available for NETSTANDARD2_1_OR_GREATER, so those directives should be removed (just NETCOREAPP3_0_OR_GREATER).
  • I'd still like methods on the _X86 classes, e.g.:
public static bool IsSupported => Avx2.IsSupported && BitConverter.IsLittleEndian;

Also if all the rounds have the same structure (or a small set thereof), please factor out a method for 1 round (using ref parameters where each round applies to different parts of the state).

I've created a Round function containing the diagonalization and G calls, The remaining logic is loading the b0-b4 values, as far as I can tell this can't be fruther simplified as each round is subtly different.

I've added length checks that you suggested also (this mirror can take a few days to be updated though).

Thanks, it was driving me nuts when debugging the benchmarks

Should I update the xml documentation? It looks like the docs haven't been updated since being converted from java. Are you okay with using <inheritdoc />?

/// <inheritdoc />
public void Update(byte b)
{

@TimothyMakkison
Copy link
Author

I've converted the javadocs to xml form. I've used <inheritdoc /> where possible and added summaries when the existing javadoc differed from the IDigest description.

Nitpick: I've noticed that a couple of functions such as Update(byte b) have different parameter names to the interface IDigest Update(byte input). I haven't altered the names as this would be a breaking change, but it might be worth updating them to match the interface.

@peterdettman
Copy link
Collaborator

Is this ready for review/merge?

@TimothyMakkison
Copy link
Author

TimothyMakkison commented Dec 22, 2022

I'll hold off for and convert this to a draft. Want to double check I haven't added a regression for .Net 3

@TimothyMakkison TimothyMakkison marked this pull request as draft December 22, 2022 10:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants