Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize Crc32 for arm #1987

Merged
merged 8 commits into from
Feb 22, 2022
Merged

Optimize Crc32 for arm #1987

merged 8 commits into from
Feb 22, 2022

Conversation

Bond-009
Copy link

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Adds a fast path in Crc32 for arm and arm64

Some benchmarks for the arm64 path:

Current (LUT):

Method Count Mean Error StdDev
SixLaborsCalculate 1024 14.54 us 0.001 us 0.001 us
SixLaborsCalculate 2048 29.07 us 0.003 us 0.002 us
SixLaborsCalculate 4096 58.14 us 0.008 us 0.008 us

arm crc32 instructions without loop unrolling:

Method Count Mean Error StdDev
SixLaborsCalculate 1024 833.0 ns 7.47 ns 6.62 ns
SixLaborsCalculate 2048 1,557.8 ns 0.17 ns 0.14 ns
SixLaborsCalculate 4096 3,044.1 ns 0.75 ns 0.70 ns

This PR:

Method Count Mean Error StdDev
SixLaborsCalculate 1024 584.8 ns 0.07 ns 0.06 ns
SixLaborsCalculate 2048 1,145.8 ns 0.12 ns 0.11 ns
SixLaborsCalculate 4096 2,262.1 ns 0.35 ns 0.29 ns

@CLAassistant
Copy link

CLAassistant commented Feb 10, 2022

CLA assistant check
All committers have signed the CLA.

}
#else
return ~CalculateScalar(~crc, buffer);
else if (ArmCrc32.IsSupported)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be if (ArmCrc32.IsSupported) instead?

@antonfirsov
Copy link
Member

antonfirsov commented Feb 11, 2022

Hmmm .. this is difficult.

On one hand I really appreciate and value the contribution, and would be happy to see us getting better on ARM. On the other hand I'm afraid we lack some prerequisites that are necessary to take it in:

  1. We don't run CI tests against ARM, and we don't have easy access to ARM in other ways. This means that there is no way to maintain this optimization guaranteeing correctness.
  2. The .NET 5/6 targets are test-only meaning they do not actually make it to our NuGet package, the highest version is still 3.1. We will update as part of our 3.0 roadmap, but it's not high priority right now, so I don't think it will happen very soon.

As such I'm leaning towards not merging this to main but shelving it to a separate branch for future adaption.

@SixLabors/core thoughts?

@JimBobSquarePants
Copy link
Member

JimBobSquarePants commented Feb 11, 2022

@antonfirsov Can we use this to emulate the environment with QEMU?
https://github.com/uraimo/run-on-arch-action

@Bond-009 I'd love to know how you manage ARM testing with Jellyfin.

@brianpopow
Copy link
Collaborator

@antonfirsov Can we use this to emulate the environment with QEMU?
https://github.com/uraimo/run-on-arch-action

For this to work we would need also a docker image with ImageMagick build for ARM, otherwise most of the test wont work.
That's why I was experimenting building ImageMagick for ARM a while ago.

While this can work, it will be very slow. My very rough estimate would be, that running all test with QEMU would take something like 40 minutes (just a wild guess, but i doubt it will be much faster).

Also what I am not sure about is, if QEMU will emulate hardware intrinsics, too.

In my opinion we should keep the changes from this PR in a separate branch until we figure out how to run the tests properly on the CI.

@JimBobSquarePants
Copy link
Member

@dlemstra got ARM builds working I think. By my understanding QEMU does emulate hardware intrinsics but I don’t know if there’s limitations.

Alternatively we can look at purchasing a device and wiring up CI to talk to that (I’ve read that’s possible but no idea how).

Re branching… yeah. Very wise.

@brianpopow
Copy link
Collaborator

Alternatively we can look at purchasing a device and wiring up CI to talk to that (I’ve read that’s possible but no idea how).

Yes a selfhosted CI runner would work, but this device then must run all the time somewhere and always be reachable to the internet. Not sure how feasible that would be.

@JimBobSquarePants
Copy link
Member

I have a static IP so possible…

I’d like to explore QEMU first though

@brianpopow
Copy link
Collaborator

brianpopow commented Feb 12, 2022

I have tried out https://github.com/uraimo/run-on-arch-action with a simple example.
So far without success. When I run any dotnet command, I get:

Fatal error. System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at System.DateTime.op_Subtraction(System.DateTime, System.DateTime)
   at Microsoft.DotNet.Cli.Program.Main(System.String[])

Besides that, I have tried running a simple unit test which uses ARM intrinsics with docker (which also uses QEMU when the host is not ARM) and it seems to work. So in theory QEMU should work. Not sure why run-on-arch-action does not work. I guess it uses a different version of QEMU.

edit:
I managed to setup a branch which runs all the tests with docker and QEMU. As expected all tests which rely on ImageMagick fail: https://github.com/SixLabors/ImageSharp/runs/5169376026?check_suite_focus=true

Failed!  - Failed:  1356, Passed: 33733, Skipped:    18, Total: 35107, Duration: 31 m 3 s - /app/artifacts/bin/tests/ImageSharp.Tests/Release/net5.0/SixLabors.ImageSharp.Tests.dll (net5.0)
imagesharp_imagesharptests_1 exited with code 1

If @dlemstra got ARM builds working for imagemagick, I could try those in the next steps.

Adds a fast path in Crc32 for arm and arm64
@Bond-009
Copy link
Author

@JimBobSquarePants We don't have automated arm tests for jellyfin. (We also don't have any arm specific code)

Automated tests would only help with correctness regressions, not performance regressions, so you'd still have to check that manually.

@JimBobSquarePants
Copy link
Member

Automated tests would only help with correctness regressions, not performance regressions, so you'd still have to check that manually.

We've actually developed a technique here that allows us to use automated testing against intrinsics features. We're able to switch off intrinsics features to ensure correctness for all implementations and fallbacks.

public static void RunWithHwIntrinsicsFeature(

@brianpopow
Copy link
Collaborator

Maybe we should adjust the Crc32Tests to use the FeatureTestRunner, its currently not.

@JimBobSquarePants
Copy link
Member

Good idea. I need to check the runtime also to see if there’s additional flags now for arm intrinsics

@codecov
Copy link

codecov bot commented Feb 15, 2022

Codecov Report

Merging #1987 (46b0bf0) into arm (f24b02f) will decrease coverage by 0%.
The diff coverage is n/a.

❗ Current head 46b0bf0 differs from pull request most recent head b6bb006. Consider uploading reports for the commit b6bb006 to get more accurate results

Impacted file tree graph

@@          Coverage Diff          @@
##             arm   #1987   +/-   ##
=====================================
- Coverage     88%     88%   -1%     
=====================================
  Files        979     979           
  Lines      51939   51935    -4     
  Branches    6499    6499           
=====================================
- Hits       45755   45740   -15     
- Misses      5112    5119    +7     
- Partials    1072    1076    +4     
Flag Coverage Δ
unittests 88% <0%> (-1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../Memory/Allocators/Internals/UnmanagedBuffer{T}.cs 88% <0%> (-12%) ⬇️
.../Formats/Webp/Lossless/BackwardReferenceEncoder.cs 92% <0%> (-3%) ⬇️
src/ImageSharp/Image.Decode.cs 86% <0%> (-1%) ⬇️
...eg/Components/Decoder/SpectralConverter{TPixel}.cs 100% <0%> (ø)
...mats/Jpeg/Components/Decoder/HuffmanScanDecoder.cs 92% <0%> (+<1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f24b02f...b6bb006. Read the comment docs.

@JimBobSquarePants
Copy link
Member

Found the new collection of constants. We'll have to figure out which ones are most relevant.

@brianpopow
Copy link
Collaborator

Found the new collection of constants. We'll have to figure out which ones are most relevant.

Not sure howto decide which once are relevant yet. Should we add all of them or should we wait with that until we have more ARM specific code in the repo? We can disable intrinsics complete for now.

@brianpopow
Copy link
Collaborator

@Bond-009 could you please change the target branch from main to arm (I hope the PR can stay open with this change). As @antonfirsov suggested, its best to keep the ARM related changes separated for a while until we decide how to run tests on the CI.

@Bond-009 Bond-009 changed the base branch from main to arm February 19, 2022 22:32
@JimBobSquarePants
Copy link
Member

@tannergooding Would it be possible for you to advise which constants we should add to our HwIntrinsics enum for feature testing ARM intrinsics? Thanks!

public enum HwIntrinsics
#pragma warning restore RCS1135 // Declare enum member with zero value (when enum has FlagsAttribute).
{
// Use flags so we can pass multiple values without using params.
// Don't base on 0 or use inverse for All as that doesn't translate to string values.
DisableSIMD = 1 << 0,
DisableHWIntrinsic = 1 << 1,
DisableSSE = 1 << 2,
DisableSSE2 = 1 << 3,
DisableAES = 1 << 4,
DisablePCLMULQDQ = 1 << 5,
DisableSSE3 = 1 << 6,
DisableSSSE3 = 1 << 7,
DisableSSE41 = 1 << 8,
DisableSSE42 = 1 << 9,
DisablePOPCNT = 1 << 10,
DisableAVX = 1 << 11,
DisableFMA = 1 << 12,
DisableAVX2 = 1 << 13,
DisableBMI1 = 1 << 14,
DisableBMI2 = 1 << 15,
DisableLZCNT = 1 << 16,
AllowAll = 1 << 17

@tannergooding
Copy link
Contributor

@JimBobSquarePants, sure thing!

The full list of available flags is here: https://github.com/dotnet/runtime/blob/baae8fa837198fbc1cd42dfee951579e877745f1/src/coreclr/inc/clrconfigvalues.h#L736-L769

Since the list for x86/x64 and Arm64 is large, I've given a more "in depth" explanation of what is likely meaningful to target below. Feel free to ignore if its not important or irrelevant here.


The recommendation for x86/x64 would be to target the following configurations:

  • Software fallback
    • Test with COMPlus_EnableHWIntrinsic=0
  • SSE2
    • This is the x86/x64 baseline and is required by the .NET Runtime (where x86 refers to 32-bit and x64 to 64-bit Intel/AMD based CPUs)
    • This is required by all x64 CPUs and has been in almost every x86 CPU since the Pentium 4 released in 2000
    • This gives access to hardware acceleration for Vector (which will be 16-bytes) and Vector128
    • Test with COMPlus_EnableSSE3=0
    • This gives access to X86Base, SSE (1999) and SSE2 (2000)
  • SSE41
    • This has been in almost all x86/x64 CPUs since late 2007
    • This is the ISA that x86/x64 emulation that Windows/MacOS on ARM support
    • Test with COMPlus_EnableSSE42=0
    • This gives access to X86Base, SSE, SSE2, SSE3 (2004), SSSE3 (2006), and SSE4.1 (2007)
  • AVX2
    • This has been in most x86/x64 CPUs since 2013, with the main exception here is low power/budget devices like the Atom CPUs
    • This gives access to hardware acceleration for Vector (which will be 32-bytes) and Vector256
    • There isn't really a switch to toggle for testing this, just have hardware that supports it
    • This gives access to X86Base, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 (2008), AVX (2011), AVX2 (2013)
    • This, while not represented in the class hierarchy, also gives access to AES, PCLMULQDQ, and POPCNT (all 2008)
      • AVX adds a new encoding for AES/PCLMUL, making them a requirement for AVX support. POPCNT is complex to explain (minor Intel vs AMD diff) but is effectively everywhere SSE4.2 is
    • Due to what processors have actually shipped with this also effectively gives access to BMI1, BMI2, LZCNT, and FMA (all 2013)

There is a little bit of nuance and leeway here, and of course you're free to target whatever actual instruction sets fit you and your customer's needs. However, the above provides the biggest major differences in functionality and what actual hardware has shipped with up until this point so you'll likely end up with the broadest "meaningful" coverage and the least amount of code paths to support by targeting these.


For Arm64, the consideration would be:

  • Software fallback
    • Test with COMPlus_EnableHWIntrinsic=0
  • AdvSimd
    • This is the Arm64 baseline (and ArmBase itself) and is required by the .NET Runtime
    • This gives access to hardware acceleration for Vector (which will be 16-bytes), Vector64, and Vector128
    • There isn't really a switch to toggle for testing this, just have Arm64 hardware

Since Arm licenses the spec and there can be many vendors producing their own chips and deciding what ISAs are meaningful to them or not, the coverage of what's actually supported beyond this is a bit hazy and a couple of them (like Atomics and Dczva) don't have public hardware intrinsics.

A summary is:

  • AdvSimd is required in ARMv8.0
  • Atomics (internally only) and Rdm (rounding double multiply) are required in ARMv8.1
  • Crc32 is required in ARMv8.1
  • Dp (dot product) is required in ARMv8.2
  • Aes, Sha1, and Sha256 are part of the ARMv8.0 Cryptographic Extension
    • These are not strictly bundled and having one doesn't imply the others
    • ARMv8.2 requires SHA1/SHA256 to be bundled together if its provided

An annotation on the above is that something required in ARMv8.? is generally optional in previous versions.

@brianpopow brianpopow added this to the Future milestone Feb 20, 2022
@JimBobSquarePants
Copy link
Member

@tannergooding Thanks so much for all the info, this is perfect!

@brianpopow
Copy link
Collaborator

I think this is ready to be merged into the arm branch. Anyone wants to give it another review before that?

@@ -198,6 +208,103 @@ private static unsafe uint CalculateSse(uint crc, ReadOnlySpan<byte> buffer)
}
}
}

#if NET5_0_OR_GREATER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to actually target a .NET 5+ runtime to take advantage of this. I want to add an explicit target for .NET 6 for v3.0.0

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks everyone 👍

@brianpopow brianpopow merged commit 8d831be into SixLabors:arm Feb 22, 2022
@Bond-009 Bond-009 deleted the arm branch February 22, 2022 10:53
@Bond-009 Bond-009 mentioned this pull request Jul 3, 2023
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants