Enable Avx2 optimizations on Porter-Duff operations. #2359

Merged

25 commits merged from js/avx2-porter-duff into main on Feb 20, 2023

Conversation

Member

@JimBobSquarePants commented Feb 17, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Adds Vector256<float> implementations of the Porter-Duff functions to speed up our blending operations. Fixes #2340

  • Complete function component set within PorterDuffFunctions.cs.
  • Update the PorterDuffFunctions.Generated.tt to include the Vector256<float> variants.
  • Update DefaultPixelBlenders.Generated.tt to use the Vector256<float> variants when appropriate.
  • Update the unit tests to use RemoteExecutor.

@antonfirsov @tannergooding @saucecontrol @br3aker I'll take any help you can offer pointing me in the right direction to finish porting these over.

/// <returns>The <see cref="Vector256{Single}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Lighten(Vector256<float> backdrop, Vector256<float> source)
=> Avx.Max(backdrop, source);

Member Author

I don't know how to do OverlayValueFunction below.

Contributor

@br3aker commented Feb 17, 2023

I'm a bit rusty but something like this should work:

private static readonly Vector256<float> Half = Vector256.Create(0.5f);
private static readonly Vector256<float> One = Vector256.Create(1f);
private static readonly Vector256<float> Two = Vector256.Create(2f);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
{
    // 2 * backdrop * source
    Vector256<float> left = Avx.Multiply(Avx.Multiply(Two, backdrop), source);

    // 1 - 2 * (1 - source) * (1 - backdrop)
    Vector256<float> right = Avx.Subtract(One, Avx.Multiply(Avx.Multiply(Two, Avx.Subtract(One, source)), Avx.Subtract(One, backdrop)));

    // Per element: select 'right' where backdrop > 0.5, otherwise 'left'.
    Vector256<float> cmp = Avx.CompareGreaterThan(backdrop, Half);
    return Avx.BlendVariable(left, right, cmp);
}

The SIMD implementation must calculate both cases, but it's still a bit faster than scalar:

BenchmarkDotNet=v0.13.4, OS=Windows 11 (10.0.22621.1265)
Unknown processor
.NET SDK=7.0.100
  [Host]     : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT AVX2
  DefaultJob : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT AVX2


| Method |      Mean |    Error |   StdDev |
|------- |----------:|---------:|---------:|
| Scalar | 119.12 ns | 0.342 ns | 0.320 ns |
|    Avx |  35.76 ns | 0.208 ns | 0.195 ns |
public class Benchmark
{
    private float[] _backdrop;
    private float[] _source;

    [GlobalSetup]
    public void Setup()
    {
        _backdrop = new float[8 * 20];
        _source = new float[8 * 20];

        FillRandom(_backdrop);
        FillRandom(_source);
    }

    private static void FillRandom(float[] arr)
    {
        var rng = new Random();
        for(int i = 0; i < arr.Length; i++)
        {
            arr[i] = rng.NextSingle();
        }
    }

    [Benchmark(Description = "Scalar")]
    public float OverlayValueFunction_Scalar()
    {
        float result = default;
        for (int i = 0; i < _backdrop.Length; i++)
        {
            result += Program.OverlayValueFunction(_backdrop[i], _source[i]);
        }

        return result;
    }

    [Benchmark(Description = "Avx")]
    public Vector256<float> OverlayValueFunction_Avx()
    {
        ref Vector256<float> backdrop = ref Unsafe.As<float, Vector256<float>>(ref MemoryMarshal.GetReference<float>(_backdrop));
        ref Vector256<float> source = ref Unsafe.As<float, Vector256<float>>(ref MemoryMarshal.GetReference<float>(_source));

        Vector256<float> result = default;

        int count = _backdrop.Length / Vector256<float>.Count;
        for (int i = 0; i < count; i++)
        {
            result = Avx.Add(result, Program.OverlayValueFunction(Unsafe.Add(ref backdrop, i), Unsafe.Add(ref source, i)));
        }

        return result;
    }
}

P.S.

I didn't test it properly though; I just tried a couple of random numbers and compared the results to the scalar implementation.
It can most likely be done a bit better without BlendVariable, but it's a good starting point IMO.

Member Author

Ah that's awesome. Thanks! That lets me complete the first objective.

Comment on lines 26 to 28
private static readonly Vector256<float> Vector256Half = Vector256.Create(0.5F);
private static readonly Vector256<float> Vector256One = Vector256.Create(1F);
private static readonly Vector256<float> Vector256Two = Vector256.Create(2F);

Contributor

This is going to be worse than direct usage for .NET 5+

Starting in .NET 5 there was some special support added for Vector###.Create(cns) and Vector###.Create(cns, ..., cns). These APIs will now generate a "method local constant" which makes it quite a bit more efficient than even a static readonly will be.

In .NET 5 this support only existed in the late phases of the JIT (lowering). That support was improved a bit more in .NET 6, 7, and now in 8 as well. Starting in 7 in particular we now have a direct node (GT_CNS_VEC) which allows other phases of the JIT to take advantage of this. In .NET 8 we've started adding constant folding support as well.

Contributor

If you want to centralize such "constants", I'd recommend returning them from a property instead:

private static Vector256<float> Vector256Half => Vector256.Create(0.5F);
private static Vector256<float> Vector256One => Vector256.Create(1F);
private static Vector256<float> Vector256Two => Vector256.Create(2F);

Member Author

Oh that's very interesting. Will revert and look into other examples in the codebase.

What about multiple references to a local variable within a single method? Should I inline them?

public static Vector256<float> Screen(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));
}

Contributor

Not sure off the top of my head. In .NET 7/8 I'd expect us to be doing the right thing and recognizing them as a "common subexpression". I'd also expect the same for .NET 6, but the support wasn't quite as good and so manually hoisting it to a local may provide better results and shouldn't hurt on any of the target frameworks.

In general you can assume that most Create(cns) and Create(cns, ..., cns) come from memory and so using a local can help ensure there is only one memory access. The special consideration is Zero and AllBitsSet which are generated dynamically using a "zero cost" instruction instead.

public static Vector256<float> Overlay(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> color = OverlayValueFunction(backdrop, source);
    return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl));
}

Contributor

@tannergooding commented Feb 19, 2023

We're "missing" a JIT optimization here so swapping the parameter order for vblendps can be more efficient

Namely do Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11);. This is special to Zero and AllBitsSet since they aren't "normal" constants but are instead always generated into register directly (rather than loaded from memory). For other constants the original ordering you had would've been better.
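
Applied to the Overlay snippet above, that would look roughly like this (a sketch; the swapped control value is the one from this comment, not necessarily what shipped in the PR):

// Zero is always generated directly into a register rather than loaded
// from memory, so it is cheaper as the first operand here.
return Avx.Min(Vector256One, Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11));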

Contributor

I'll log a JIT issue to track ensuring we support commutativity for Blend

Member Author

Awesome. Thanks!

Contributor

@tannergooding commented Feb 19, 2023

dotnet/runtime#82365 for recognizing more types of commutative operations, such as for Blend

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> left = Avx.Multiply(Avx.Multiply(Vector256Two, backdrop), source);

Contributor

Multiply(Two, backdrop) is better as Add(backdrop, backdrop).

This is something I'm working on recognizing implicitly in the JIT as part of the constant folding support being added in .NET 8
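
For example, the rewrite here would be (a sketch; the same form appears in a later revision of the PR):

// Avx.Multiply(Vector256Two, backdrop) computes the same value, but the
// add avoids loading the constant 2 from memory.
Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);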

Contributor

You could then consider a helper method here:

public static Vector256<float> MultiplyAddEstimate(Vector256<float> left, Vector256<float> right, Vector256<float> addend)
{
    if (Fma.IsSupported)
    {
        return Fma.MultiplyAdd(left, right, addend);
    }
    else
    {
        return Avx.Add(Avx.Multiply(left, right), addend);
    }
}

The reason that it's an "estimate" is that FMA does a more precise computation and gives a more accurate result. This can also subtly change the answer in some edge cases.

Most notable is that, for double, (1e308 * 2) - 1e308 returns infinity, but FusedMultiplyAdd(1e308, 2, -1e308) returns 1e308 (there is a similar scenario for float, but I don't remember the exact inputs off the top of my head).

Most other inputs are less drastic in their differences. They'll either produce the same result (with Fma typically being faster) or be off by 1 ULP (if you looked at the raw bits, the least significant bit would typically differ by 1).
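
The double edge case above is easy to reproduce with Math.FusedMultiplyAdd (a minimal sketch):

// 1e308 * 2 overflows double.MaxValue before the subtraction runs,
// so the unfused expression produces PositiveInfinity.
Console.WriteLine((1e308 * 2) - 1e308);

// The fused form rounds only once, so there is no intermediate
// overflow and the exact answer, 1e308, comes back.
Console.WriteLine(Math.FusedMultiplyAdd(1e308, 2, -1e308));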

Member Author

We actually have a helper that does exactly that in SimdUtils.HwIntrinsics.MultiplyAdd

Contributor

@tannergooding commented Feb 19, 2023

dotnet/runtime#82366 for the vector * 2 to vector + vector transform being implicitly recognized

@JimBobSquarePants added this to the 3.0.0 milestone Feb 19, 2023
// calculate weights
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);

Contributor

Since it's the same input twice, using Avx.Permute should be preferred. The same shuffle control is used.
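
That is, for the excerpt above (a sketch):

// Same shuffle control, single-input permute (vpermilps) instead of vshufps.
Vector256<float> sW = Avx.Permute(source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Permute(destination, ShuffleAlphaControl);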

Contributor

vpermilps is 1 byte larger than vshufps. The only reason to prefer it is that it supports a memory operand where shuffle doesn't, but that doesn't apply here.

Contributor

@tannergooding commented Feb 19, 2023

It's 1-byte larger but also preferred by Clang/GCC in practically all cases (both Intel and AMD have their instruction selection metadata set to prefer it).

There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.

It ultimately shouldn't make too much of a difference (it's been the same, or with shufps taking 1 cycle more, everywhere so far), but I'd typically side with what Intel/AMD have made Clang/GCC emit.

Contributor

(both Intel and AMD have their instruction selection metadata set to prefer it).

I don't know what this means. The two instructions mostly do different things, with one overlapping case. Can you point me at the metadata in question? And any example where Clang does the substitution?

There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.

Ignoring Knight's Landing since it's not a real CPU, this argument doesn't make any sense to me. This is an in-lane operation, so there's no reason splitting it in two would affect one and not the other. In fact, the two existing architectures that do split AVX into 2x128 (Zen+ and Gracemont) have identical measured latency and throughput numbers.

Not that 1 byte here or there is especially worth arguing over, but since you've suggested it "should be preferred", I'd like to see something concrete.

Contributor

You can trivially see this in godbolt via:

#include <immintrin.h>

__m256 Permute(__m256 x)
{
    return _mm256_shuffle_ps(x, x, 0xFF);
}

and that Clang and GCC both universally replace this with vpermilps regardless of ISAs or target -march=....

I'll see if I can dig up, again, exactly where the instruction selection for this happens, but might take me a bit given the size/complexity of LLVM in general.

Contributor

Well that's interesting. At least trusty old MSVC does what it's told 😉

https://godbolt.org/z/3eTKGqchf

I'd appreciate it if you can find the reference for why they do it. I've seen Clang make some odd decisions before, but with GCC doing it too there's got to be something.

Comment on lines 294 to 296
Vector256<float> blendW = Avx.Multiply(sW, dW);

Vector256<float> dstW = Avx.Subtract(dW, blendW);

Contributor

@tannergooding commented Feb 19, 2023

You should be able to build a MultiplyAddNegatedEstimate helper using Fma.MultiplyAddNegated, since that does -(a * b) + c, which should be equivalent to c - (a * b).

Member Author

I'd need Vector256.Negate<T> for that, wouldn't I?

Contributor

You'd use Fma.MultiplyAddNegated(a, b, c) and fall back to Avx.Subtract(c, Avx.Multiply(a, b)) otherwise.

This is because (x - y) == (-y + x), so you're just taking advantage of the specialized Fma instruction that negates the multiplied result itself.
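
A minimal sketch of such a helper, following the MultiplyAddEstimate pattern shown earlier (the name is illustrative, not an existing API):

public static Vector256<float> MultiplyAddNegatedEstimate(Vector256<float> a, Vector256<float> b, Vector256<float> c)
{
    if (Fma.IsSupported)
    {
        // Fused -(a * b) + c, computed with a single rounding step.
        return Fma.MultiplyAddNegated(a, b, c);
    }

    // Equivalent (up to rounding): c - (a * b).
    return Avx.Subtract(c, Avx.Multiply(a, b));
}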

Member Author

@JimBobSquarePants commented Feb 19, 2023

Yep, I see that, thanks. In this instance I use blendW further down, so it looks like I'd be better off keeping this as is, since I would otherwise be introducing duplicate multiplications.

This reverts commit 907400f.
Comment on lines 393 to 395
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(sW, dW);

Contributor

You should be able to change this to:

Vector256<float> alpha = Avx.Permute(Avx.Multiply(source, destination), ShuffleAlphaControl);

That will do a single shuffle/permute rather than two of them. In general, if you can do the element-wise ops first and shuffle less afterwards, it's "better", because shuffle ports are typically limited.

In this case it's safe since all the operations are element-wise: even though you're multiplying source[i] * destination[i], the results for i = 0, 1, 2 don't matter and just get ignored in the subsequent permute.

It's possible there are similar opportunities in the shuffle ops in the methods above here, but they are "more complex" than this one, so I didn't do any in-depth analysis to look for the same simplifications/optimizations in them.

result = PorterDuffFunctions.NormalSrcOver(Unsafe.Add(ref backdrop, i), Unsafe.Add(ref source, i), opacity);
}

return result;

Member Author

@JimBobSquarePants commented Feb 19, 2023

Is there a bug in my benchmark?

Numbers seem too good to be true.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT


| Method |       Mean |   Error |  StdDev |
|------- |-----------:|--------:|--------:|
| Scalar | 2,176.5 ns | 4.48 ns | 3.97 ns |
|    Avx |   231.3 ns | 0.87 ns | 0.82 ns |

// * Hints *
Outliers
  PorterDuffBulkVsSingleVector.Scalar: Default -> 1 outlier  was  removed (2.20 us)

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 ns   : 1 Nanosecond (0.000000001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:00:26 (26.39 sec), executed benchmarks: 2

Global total time: 00:00:47 (47.57 sec), executed benchmarks: 2
// * Artifacts cleanup *

@jnyrup commented Feb 19, 2023

I notice that both backdrop and source reference this.backdrop?

Member Author

Good catch! Doesn't affect the results though.

Contributor

Just peeking at .NET 7, and changing the vector algorithm to be V128 to simplify the diff a bit, we have:

Vector4

; Assembly listing for method C:NormalSrcOver(System.Numerics.Vector4,System.Numerics.Vector4,float):System.Numerics.Vector4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC18             sub      rsp, 24
       C5F877               vzeroupper 
       C5F8293424           vmovaps  xmmword ptr [rsp], xmm6
 
G_M000_IG02:                ;; offset=000CH
       498D400C             lea      rax, bword ptr [r8+0CH]
       C5E25900             vmulss   xmm0, xmm3, dword ptr [rax]
       C5FA1100             vmovss   dword ptr [rax], xmm0
       C4C1781000           vmovups  xmm0, xmmword ptr [r8]
       C5F828C8             vmovaps  xmm1, xmm0
       C5F81012             vmovups  xmm2, xmmword ptr [rdx]
       C5E8C6DAFF           vshufps  xmm3, xmm2, xmm2, -1
       C5F0C6E1FF           vshufps  xmm4, xmm1, xmm1, -1
       C5E259EC             vmulss   xmm5, xmm3, xmm4
       C5E25CDD             vsubss   xmm3, xmm3, xmm5
       C5DA5CF5             vsubss   xmm6, xmm4, xmm5
       C5E258E4             vaddss   xmm4, xmm3, xmm4
       C4E27918DB           vbroadcastss xmm3, xmm3
       C5E859D3             vmulps   xmm2, xmm2, xmm3
       C4E27918DE           vbroadcastss xmm3, xmm6
       C5F059CB             vmulps   xmm1, xmm1, xmm3
       C5E858C9             vaddps   xmm1, xmm2, xmm1
       C4E27918D5           vbroadcastss xmm2, xmm5
       C5F859C2             vmulps   xmm0, xmm0, xmm2
       C5F058C0             vaddps   xmm0, xmm1, xmm0
       C5F828CC             vmovaps  xmm1, xmm4
       C5F8101532000000     vmovups  xmm2, xmmword ptr [reloc @RWD00]
       C5E85FC9             vmaxps   xmm1, xmm2, xmm1
       C4E27918C9           vbroadcastss xmm1, xmm1
       C5F85EC1             vdivps   xmm0, xmm0, xmm1
       C4E37921C430         vinsertps xmm0, xmm0, xmm4, 48
       C5F81101             vmovups  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0088H
       C5F8283424           vmovaps  xmm6, xmmword ptr [rsp]
       4883C418             add      rsp, 24
       C3                   ret   

Vector128

G_M000_IG01:                ;; offset=0000H
       4883EC18             sub      rsp, 24
       C5F877               vzeroupper 
       C5F8293424           vmovaps  qword ptr [rsp], xmm6
 
G_M000_IG02:                ;; offset=000CH
       C4C1791000           vmovupd  xmm0, xmmword ptr [r8]
       C4C1785909           vmulps   xmm1, xmm0, xmmword ptr [r9]
       C4E3790CC188         vblendps xmm0, xmm0, xmm1, -120
       C4C1791100           vmovupd  xmmword ptr [r8], xmm0
       C4C1791000           vmovupd  xmm0, xmmword ptr [r8]
       C5F828C8             vmovaps  xmm1, xmm0
       C5F91012             vmovupd  xmm2, xmmword ptr [rdx]
       C4E37904D8FF         vpermilps xmm3, xmm0, -1
       C4E37904E2FF         vpermilps xmm4, xmm2, -1
       C5E059EC             vmulps   xmm5, xmm3, xmm4
       C5D85CE5             vsubps   xmm4, xmm4, xmm5
       C5E05CF5             vsubps   xmm6, xmm3, xmm5
       C5D858DB             vaddps   xmm3, xmm4, xmm3
       C5E859D4             vmulps   xmm2, xmm2, xmm4
       C4E279B8D6           vfmadd231ps xmm2, xmm0, xmm6
       C4E271B8D5           vfmadd231ps xmm2, xmm1, xmm5
       C5E05F0530000000     vmaxps   xmm0, xmm3, xmmword ptr [reloc @RWD00]
       C5E85ED0             vdivps   xmm2, xmm2, xmm0
       C4E3690CC388         vblendps xmm0, xmm2, xmm3, -120
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0071H
       C5F8283424           vmovaps  xmm6, qword ptr [rsp]
       4883C418             add      rsp, 24
       C3                   ret

The V256 code is effectively identical to the V128 code, just using ymm instead of xmm, and we can assume it is roughly 2x the speed.

For the differences between Vector4 and V128...

Contributor

The initial source.W *= opacity isn't great, it's generating:

lea      rax, bword ptr [r8+0CH]
vmulss   xmm0, xmm3, dword ptr [rax]
vmovss   dword ptr [rax], xmm0
vmovupd  xmm0, xmmword ptr [r8]

If NormalSrcOver was inlined and source was in register, it'd also be generating a spill before this:

vmovapd  xmmword ptr [r8], xmm0

That basically means it's doing:

  • write source to memory
  • get the address of w
  • load w from memory and multiply by opacity
  • store result into w
  • load entire vector from memory

Compare this with the V128 approach:

vmovupd  xmm0, xmmword ptr [r8]
vmulps   xmm1, xmm0, xmmword ptr [r9]
vblendps xmm0, xmm0, xmm1, 8

Which is:

  • load source from memory
  • load opacity from memory and multiply by source
  • blend so that you get original source plus new W

If you instead did something roughly equivalent, making opacity a Vector4 and then doing source.W = (source * opacity).W;, you get this instead:

vmovupd  xmm0, xmmword ptr [r8]
vmulps   xmm1, xmm0, xmmword ptr [r9]
vshufps  xmm1, xmm1, xmm1, -1
vinsertps xmm0, xmm0, xmm1, 48

This is nearly ideal and does the same thing; there's just a stray vshufps that I need to remove on the JIT side >.>

Contributor

Insanely detailed writeup, learnt a lot, thank you!

Member Author

This is incredible @tannergooding Thanks so much!

I'm gonna have a tinker and see where I can get. NormalSrcOver is currently 9.2X faster using AVX which is absolutely bonkers. CC @antonfirsov

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT


| Method |       Mean |   Error |  StdDev | Ratio |
|------- |-----------:|--------:|--------:|------:|
| Scalar | 2,177.0 ns | 4.38 ns | 3.66 ns |  1.00 |
|    Avx |   233.5 ns | 1.25 ns | 1.17 ns |  0.11 |

Member Author

@JimBobSquarePants commented Feb 20, 2023

MAJOR improvement for the Vector4 path. @tannergooding

I implemented the WithW change and the time immediately dropped, making it nearly 4X faster. PermuteW brought it down even further.

The only thing I couldn't figure out was how to fall back to Sse for WithW. I'm not great with shuffling.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT


| Method |     Mean |   Error |  StdDev | Ratio |
|------- |---------:|--------:|--------:|------:|
| Scalar | 415.5 ns | 1.36 ns | 1.06 ns |  1.00 |
|    Avx | 234.4 ns | 0.44 ns | 0.35 ns |  0.56 |

Contributor

@tannergooding commented Feb 20, 2023

MAJOR improvement for the Vector4 path.

Awesome! Much closer to the 1.5-2x difference we'd otherwise expect

The only thing I couldn't figure out was how to fallback to Sse for WithW.

Think you want:

Vector128<float> tmp = Sse.Shuffle(w.AsVector128(), value.AsVector128(), 0b00_10_11_11);
return Sse.Shuffle(value.AsVector128(), tmp, 0b00_10_01_00);

That will create tmp as <w[3], w[3], value[2], value[0]>
Then return <value[0], value[1], tmp[2], tmp[0]> (which is <value[0], value[1], value[2], w[3]>)
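
Putting that together, a WithW-style helper might look like this (a sketch; the name, the Vector4 signature, and the Sse41 fast path are illustrative rather than what the PR ended up with):

private static Vector4 WithW(Vector4 value, Vector4 w)
{
    if (Sse41.IsSupported)
    {
        // insertps copies element 3 of w directly into element 3 of value.
        return Sse41.Insert(value.AsVector128(), w.AsVector128(), 0b11_11_0000).AsVector4();
    }

    if (Sse.IsSupported)
    {
        // tmp = <w[3], w[3], value[2], value[0]>
        Vector128<float> tmp = Sse.Shuffle(w.AsVector128(), value.AsVector128(), 0b00_10_11_11);

        // <value[0], value[1], tmp[2], tmp[0]> == <value[0], value[1], value[2], w[3]>
        return Sse.Shuffle(value.AsVector128(), tmp, 0b00_10_01_00).AsVector4();
    }

    return new Vector4(value.X, value.Y, value.Z, w.W);
}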

Member Author

I would not have figured that one out. Thanks!

@JimBobSquarePants changed the title from "WIP. Enable Avx2 optimizations on Porter-Duff operations." to "Enable Avx2 optimizations on Porter-Duff operations." Feb 19, 2023
@JimBobSquarePants marked this pull request as ready for review February 19, 2023 11:58
{
    return Vector4.One - ((Vector4.One - backdrop) * (Vector4.One - source));
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));

Contributor

@saucecontrol commented Feb 19, 2023

Pretty much anywhere you have a multiply followed by add or multiply followed by subtract, you can use FMA. This one would be

Fma.MultiplyAddNegated(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source), vOne);

With the appropriate helper for non-FMA, ofc.

Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> vTwo = Vector256.Create(2F);
Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);
Vector256<float> right = Avx.Subtract(vOne, Avx.Multiply(Avx.Multiply(vTwo, Avx.Subtract(vOne, source)), Avx.Subtract(vOne, backdrop)));

Contributor

@saucecontrol commented Feb 19, 2023

Another MultiplyAddNegated opportunity here. You can also remove the vTwo constant (which will be a memory load) by saving Avx.Subtract(vOne, source) into a temp and adding it to itself instead of multiplying by 2.

Add will be faster or have increased throughput over multiply on some processors as well.
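
A sketch of the snippet with both suggestions applied, assuming an FMA helper like the MultiplyAddNegatedEstimate sketched earlier in this thread:

Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> invSource = Avx.Subtract(vOne, source);

Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);

// right = 1 - (2 * (1 - source)) * (1 - backdrop); the doubling is an add,
// and the multiply-subtract fuses into one instruction where FMA is available.
Vector256<float> right = MultiplyAddNegatedEstimate(Avx.Add(invSource, invSource), Avx.Subtract(vOne, backdrop), vOne);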

// calculate alpha
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256.Create(1F), dW), sW);

Contributor

Thought I saw Tanner comment on this yesterday, but it's disappeared. You can save an instruction by doing the alpha shuffle after the multiply.
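
Presumably along the same lines as that earlier comment, i.e. something like (a sketch, safe for the same element-wise reason):

// Multiply first, then broadcast the alpha lane: one permute instead of two shuffles.
Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> alpha = Avx.Permute(Avx.Multiply(Avx.Subtract(vOne, destination), source), ShuffleAlphaControl);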

Contributor

My comment should be here, hopefully it hasn't actually disappeared #2359 (comment) 😅

Contributor

Ha, yeah, it just got marked as outdated so it doesn't show up in the code view anymore. Saw it after I submitted the comments :)

Member Author

I've implemented that now.

Member Author

@JimBobSquarePants commented Feb 20, 2023

OK. I think I'm happy here unless there are any objections.

The old Vector4 path is ~5.1x faster than before.
The new AVX path is ~9.2x faster than before.

Comment on lines +590 to +592
in Vector256<float> a,
in Vector256<float> b,
in Vector256<float> c)

Contributor

Particularly since this is inlined, I wouldn't expect you to need in here (or on the other examples above).

It shouldn't actually make a difference since the JIT should inline things and elide the "address taken", but it's typically easier on the JIT if you don't mark them unnecessarily for "enregisterable" types.

Contributor

@saucecontrol commented Feb 20, 2023

IIRC, without the in, the JIT could/would lose track of the fact that memory-load operands could be contained when inlining these small helpers, at least in netcoreapp3.1. I don't know if or when that was fixed, but I'd double-check before removing them.

Member Author

I haven't looked at the JIT output, but I am seeing a 2% slowdown in my benchmark when I remove them. It could be measurement error, but I'll leave them for now.

Contributor

@tannergooding left a comment

Changes look generally good/correct to me.

There's cleanup/simplification that can be done long term once you're able to move on to .NET 7 (or .NET 8 after that ships).

@JimBobSquarePants merged commit ae0a51c into main Feb 20, 2023
@JimBobSquarePants deleted the js/avx2-porter-duff branch February 20, 2023 23:47
Vector4 result = default;
for (int i = 0; i < this.backdrop.Length; i++)
{
    result = PorterDuffFunctions.NormalSrcOver(this.backdrop[i], this.source[i], .5F);

Member

The only concern with these benchmarks is that - unlike the real span code - they do not walk the source and amount buffers, meaning there are fewer cache misses.

I would have created a parametric benchmark (N=2,16,128,512,1024) that uses the Span APIs, and run it against the code before/after the changes. It shouldn't significantly impact the result, but it would paint a more nuanced picture of the improvements.

Anyways, amazing job!

Member Author

They actually used to walk those buffers. I was surprised by that, so I replicated it in the benchmark.

protected override void BlendFunction(Span<Vector4> destination, ReadOnlySpan<Vector4> background, ReadOnlySpan<Vector4> source, float amount)
{
    amount = Numerics.Clamp(amount, 0, 1);
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = PorterDuffFunctions.<#=blender_composer#>(background[i], source[i], amount);
    }
}

/// <inheritdoc />
protected override void BlendFunction(Span<Vector4> destination, ReadOnlySpan<Vector4> background, ReadOnlySpan<Vector4> source, ReadOnlySpan<float> amount)
{
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = PorterDuffFunctions.<#=blender_composer#>(background[i], source[i], Numerics.Clamp(amount[i], 0, 1));
    }
}

I've actually realized that the benchmarks were slightly generous to the old approach, as I'd forgotten the clamp on amount.
