Enable Avx2 optimizations on Porter-Duff operations. #2359

Merged: 25 commits, Feb 20, 2023
Commits
517ec80
Port most of the function components.
JimBobSquarePants Feb 17, 2023
6a4dcd7
Merge branch 'main' into js/avx2-porter-duff
JimBobSquarePants Feb 17, 2023
746b34d
Finish porting function components
JimBobSquarePants Feb 19, 2023
4c546d7
Update the PorterDuffFunctions.Generated.tt to include the Vector256<…
JimBobSquarePants Feb 19, 2023
ef34960
Fix code generation
JimBobSquarePants Feb 19, 2023
9f8bcc4
Respond to feedback
JimBobSquarePants Feb 19, 2023
5fedca8
Respond to feedback
JimBobSquarePants Feb 19, 2023
907400f
Use Permute
JimBobSquarePants Feb 19, 2023
9a552f1
Revert "Use Permute"
JimBobSquarePants Feb 19, 2023
bde9324
Use Permute
JimBobSquarePants Feb 19, 2023
41cfa9b
Port DefaultPixelBlenders
JimBobSquarePants Feb 19, 2023
b4ff1e4
Fix issues
JimBobSquarePants Feb 19, 2023
c58be60
Add additional PD tests
JimBobSquarePants Feb 19, 2023
dff381f
Fix amount span assignment
JimBobSquarePants Feb 19, 2023
6cb6bd4
Better clamp, fix offset (again)
JimBobSquarePants Feb 19, 2023
c06da8c
Add NormalSrcOver benchmark
JimBobSquarePants Feb 19, 2023
b05b25b
Use RemoteExecutor for composition tests
JimBobSquarePants Feb 19, 2023
916084c
Fix field assignment in benchmark
JimBobSquarePants Feb 19, 2023
8ffec30
Make Scalar default
JimBobSquarePants Feb 19, 2023
a666372
Use FMA where possible.
JimBobSquarePants Feb 19, 2023
afdc53c
Tanners Top Tips!!
JimBobSquarePants Feb 20, 2023
7309b6e
Merge branch 'main' into js/avx2-porter-duff
JimBobSquarePants Feb 20, 2023
78eb2f1
Use WithW
JimBobSquarePants Feb 20, 2023
ac0d27d
Provide Sse fallback for WithW
JimBobSquarePants Feb 20, 2023
9752566
Merge branch 'main' into js/avx2-porter-duff
JimBobSquarePants Feb 20, 2023
110 changes: 69 additions & 41 deletions src/ImageSharp/PixelFormats/PixelBlenders/PorterDuffFunctions.cs
@@ -21,6 +21,12 @@ namespace SixLabors.ImageSharp.PixelFormats.PixelBlenders;
/// </remarks>
internal static partial class PorterDuffFunctions
{
+ private const int BlendAlphaControl = 0b_10_00_10_00;
+ private const int ShuffleAlphaControl = 0b_11_11_11_11;
+ private static readonly Vector256<float> Vector256Half = Vector256.Create(0.5F);
+ private static readonly Vector256<float> Vector256One = Vector256.Create(1F);
+ private static readonly Vector256<float> Vector256Two = Vector256.Create(2F);
Contributor

This is going to be worse than direct usage for .NET 5+

Starting in .NET 5 there was some special support added for Vector###.Create(cns) and Vector###.Create(cns, ..., cns). These APIs will now generate a "method local constant" which makes it quite a bit more efficient than even a static readonly will be.

In .NET 5 this support only existed in the late phases of the JIT (lowering). That support was improved a bit more in .NET 6, 7, and now in 8 as well. Starting in 7 in particular we now have a direct node (GT_CNS_VEC) which allows other phases of the JIT to take advantage of this. In .NET 8 we've started adding constant folding support as well.

Contributor

If you want to centralize such "constants", I'd recommend returning them from a property instead:

private static Vector256<float> Vector256Half => Vector256.Create(0.5F);
private static Vector256<float> Vector256One => Vector256.Create(1F);
private static Vector256<float> Vector256Two => Vector256.Create(2F);
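
A minimal sketch of how a call site might consume such a property. The class name `VectorConstantsSketch` is hypothetical, and the code assumes the caller has already checked `Avx.IsSupported`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical container class, for illustration only.
internal static class VectorConstantsSketch
{
    // Re-created on every access rather than loaded from a static readonly field.
    private static Vector256<float> One => Vector256.Create(1F);

    // Same shape as the Add overload in this file; callers must check Avx.IsSupported.
    public static Vector256<float> Add(Vector256<float> backdrop, Vector256<float> source)
        => Avx.Min(One, Avx.Add(backdrop, source));
}
```

Whether this beats a `static readonly` field on a given runtime is best confirmed with a benchmark or by inspecting the generated assembly.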

Member Author

Oh that's very interesting. Will revert and look into other examples in the codebase.

What about multiple references to a local variable within a single method? Should I inline them?

public static Vector256<float> Screen(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));
}

Contributor

Not sure off the top of my head. In .NET 7/8 I'd expect us to be doing the right thing and recognizing them as a "common subexpression". I'd also expect the same for .NET 6, but the support wasn't quite as good and so manually hoisting it to a local may provide better results and shouldn't hurt on any of the target frameworks.

In general you can assume that most Create(cns) and Create(cns, ..., cns) come from memory and so using a local can help ensure there is only one memory access. The special consideration is Zero and AllBitsSet which are generated dynamically using a "zero cost" instruction instead.


/// <summary>
/// Returns the result of the "Normal" compositing equation.
/// </summary>
@@ -79,7 +85,7 @@ public static Vector4 Add(Vector4 backdrop, Vector4 source)
/// <returns>The <see cref="Vector256{Single}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Add(Vector256<float> backdrop, Vector256<float> source)
- => Avx.Min(Vector256.Create(1F), Avx.Add(backdrop, source));
+ => Avx.Min(Vector256One, Avx.Add(backdrop, source));

/// <summary>
/// Returns the result of the "Subtract" compositing equation.
@@ -99,7 +105,7 @@ public static Vector4 Subtract(Vector4 backdrop, Vector4 source)
/// <returns>The <see cref="Vector256{Single}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Subtract(Vector256<float> backdrop, Vector256<float> source)
- => Avx.Min(Vector256.Create(1F), Avx.Subtract(backdrop, source));
+ => Avx.Min(Vector256One, Avx.Subtract(backdrop, source));

/// <summary>
/// Returns the result of the "Screen" compositing equation.
@@ -119,10 +125,7 @@ public static Vector4 Screen(Vector4 backdrop, Vector4 source)
/// <returns>The <see cref="Vector256{Single}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Screen(Vector256<float> backdrop, Vector256<float> source)
- {
- Vector256<float> vOne = Vector256.Create(1F);
- return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));
- }
+ => Avx.Subtract(Vector256One, Avx.Multiply(Avx.Subtract(Vector256One, backdrop), Avx.Subtract(Vector256One, source)));

/// <summary>
/// Returns the result of the "Darken" compositing equation.
@@ -179,6 +182,19 @@ public static Vector4 Overlay(Vector4 backdrop, Vector4 source)
return Vector4.Min(Vector4.One, new Vector4(cr, cg, cb, 0));
}

+ /// <summary>
+ /// Returns the result of the "Overlay" compositing equation.
+ /// </summary>
+ /// <param name="backdrop">The backdrop vector.</param>
+ /// <param name="source">The source vector.</param>
+ /// <returns>The <see cref="Vector4"/>.</returns>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ public static Vector256<float> Overlay(Vector256<float> backdrop, Vector256<float> source)
+ {
+ Vector256<float> color = OverlayValueFunction(backdrop, source);
+ return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl));
Contributor @tannergooding, Feb 19, 2023

We're "missing" a JIT optimization here so swapping the parameter order for vblendps can be more efficient

Namely do Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11);. This is special to Zero and AllBitsSet since they aren't "normal" constants but are instead always generated into register directly (rather than loaded from memory). For other constants the original ordering you had would've been better.
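
The two operand orders select the same elements, which a small sketch can confirm. The class and method names here are hypothetical; the control bytes are the ones quoted in this thread, and the caller is assumed to have checked `Avx.IsSupported`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, for illustration only.
internal static class BlendOrderSketch
{
    // Original order: a set control bit picks the element from the second operand (Zero),
    // so elements 3 and 7 (the alpha slots) become zero.
    public static Vector256<float> ZeroAlphaOriginal(Vector256<float> color)
        => Avx.Blend(color, Vector256<float>.Zero, 0b_10_00_10_00);

    // Swapped order with the inverted mask: every element except 3 and 7 comes from color.
    public static Vector256<float> ZeroAlphaSwapped(Vector256<float> color)
        => Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11);
}
```

Both methods return the input with its two alpha elements cleared; only the instruction operand order differs.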

Contributor

I'll log a JIT issue to track ensuring we support commutativity for Blend

Member Author

Awesome. Thanks!

Contributor @tannergooding, Feb 19, 2023

dotnet/runtime#82365 for recognizing more types of commutative operations, such as for Blend

+ }

/// <summary>
/// Returns the result of the "HardLight" compositing equation.
/// </summary>
@@ -195,6 +211,19 @@ public static Vector4 HardLight(Vector4 backdrop, Vector4 source)
return Vector4.Min(Vector4.One, new Vector4(cr, cg, cb, 0));
}

+ /// <summary>
+ /// Returns the result of the "HardLight" compositing equation.
+ /// </summary>
+ /// <param name="backdrop">The backdrop vector.</param>
+ /// <param name="source">The source vector.</param>
+ /// <returns>The <see cref="Vector4"/>.</returns>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ public static Vector256<float> HardLight(Vector256<float> backdrop, Vector256<float> source)
+ {
+ Vector256<float> color = OverlayValueFunction(source, backdrop);
+ return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl));
+ }

/// <summary>
/// Helper function for Overlay and HardLight modes
/// </summary>
@@ -205,6 +234,22 @@ public static Vector4 HardLight(Vector4 backdrop, Vector4 source)
private static float OverlayValueFunction(float backdrop, float source)
=> backdrop <= 0.5f ? (2 * backdrop * source) : 1 - (2 * (1 - source) * (1 - backdrop));

+ /// <summary>
+ /// Helper function for Overlay and HardLight modes
+ /// </summary>
+ /// <param name="backdrop">Backdrop color element</param>
+ /// <param name="source">Source color element</param>
+ /// <returns>Overlay value</returns>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
+ {
+ Vector256<float> left = Avx.Multiply(Avx.Multiply(Vector256Two, backdrop), source);
Contributor

Multiply(Two, backdrop) is better as Add(backdrop, backdrop).

This is something I'm working on recognizing implicitly in the JIT as part of the constant folding support being added in .NET 8

Contributor

You could then consider a helper method here:

public static Vector256<float> MultiplyAddEstimate(Vector256<float> left, Vector256<float> right, Vector256<float> addend)
{
    if (Fma.IsSupported)
    {
        return Fma.MultiplyAdd(left, right, addend);
    }
    else
    {
        return Avx.Add(Avx.Multiply(left, right), addend);
    }
}

The reason that it's an "estimate" is that FMA does a more precise computation and gives a more accurate result. This can also subtly change the answer in some edge cases.

The most notable case is that for double, (1e308 * 2) - 1e308 returns infinity, but FusedMultiplyAdd(1e308, 2, -1e308) returns 1e308 (there is a similar scenario for float, but I don't remember the exact inputs off the top of my head).

Most other inputs are less drastic in their differences. They'll either produce the same result (with Fma typically being faster) or be off by 1 ULP (if you looked at the raw bits, the least significant bit would typically differ by 1).
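
The overflow case described above can be reproduced in scalar code. This sketch uses `Math.FusedMultiplyAdd` as a scalar stand-in for the vector `Fma` intrinsic (an assumption made for brevity); the class and method names are hypothetical:

```csharp
using System;

// Hypothetical helper class, for illustration only.
internal static class FmaRoundingSketch
{
    // Separate multiply and add: the intermediate product is rounded first,
    // so 1e308 * 2 overflows to infinity before the addition happens.
    public static double Unfused(double a, double b, double c) => (a * b) + c;

    // Fused: the product is kept at full precision and only the final result is rounded.
    public static double Fused(double a, double b, double c) => Math.FusedMultiplyAdd(a, b, c);
}
```

With `a = 1e308`, `b = 2`, `c = -1e308`, `Unfused` yields positive infinity while `Fused` yields `1e308`.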

Member Author

We actually have a helper that does exactly that in SimdUtils.HwIntrinsics.MultiplyAdd

Contributor @tannergooding, Feb 19, 2023

dotnet/runtime#82366 for the vector * 2 to vector + vector transform being implicitly recognized
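
As an illustration of the `Multiply(Two, backdrop)` versus `Add(backdrop, backdrop)` suggestion in this thread, the sketch below shows both forms side by side. The names are hypothetical and the caller is assumed to have checked `Avx.IsSupported`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, for illustration only.
internal static class DoubleVectorSketch
{
    // Needs a constant vector of 2s in addition to the input.
    public static Vector256<float> TimesTwo(Vector256<float> x)
        => Avx.Multiply(Vector256.Create(2F), x);

    // Needs only the input; doubling a float this way gives the same value as multiplying by 2.
    public static Vector256<float> AddSelf(Vector256<float> x)
        => Avx.Add(x, x);
}
```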

+ Vector256<float> right = Avx.Subtract(Vector256One, Avx.Multiply(Avx.Multiply(Vector256Two, Avx.Subtract(Vector256One, source)), Avx.Subtract(Vector256One, backdrop)));
+
+ Vector256<float> cmp = Avx.CompareGreaterThan(backdrop, Vector256Half);
+ return Avx.BlendVariable(left, right, cmp);
+ }

/// <summary>
/// Returns the result of the "Over" compositing equation.
/// </summary>
@@ -243,12 +288,9 @@ public static Vector4 Over(Vector4 destination, Vector4 source, Vector4 blend)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Over(Vector256<float> destination, Vector256<float> source, Vector256<float> blend)
{
- const int blendAlphaControl = 0b_10_00_10_00;
- const int shuffleAlphaControl = 0b_11_11_11_11;
-
// calculate weights
- Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl);
- Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl);
+ Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Contributor

Since it's the same input twice, using Avx.Permute should be preferred. The same shuffle control is used.

Contributor

vpermilps is 1 byte larger than vshufps. The only reason to prefer it is that it supports a memory operand where shuffle doesn't, but that doesn't apply here.

Contributor @tannergooding, Feb 19, 2023

It's 1-byte larger but also preferred by Clang/GCC in practically all cases (both Intel and AMD have their instruction selection metadata set to prefer it).

There have been CPUs in the past (Knights Landing most notably) where there was an actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.

It ultimately shouldn't make too much of a difference (it's been the same or with shufps taking 1-cycle more everywhere so far), but I'd typically side with what Intel/AMD have made Clang/GCC emit.

Contributor

> (both Intel and AMD have their instruction selection metadata set to prefer it).

I don't know what this means. The two instructions mostly do different things, with one overlapping case. Can you point me at the metadata in question? And any example where Clang does the substitution?

> There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.

Ignoring Knight's Landing since it's not a real CPU, this argument doesn't make any sense to me. This is an in-lane operation, so there's no reason splitting it in two would affect one and not the other. In fact, the two existing architectures that do split AVX into 2x128 (Zen+ and Gracemont) have identical measured latency and throughput numbers.

Not that 1 byte here or there is especially worth arguing over, but since you've suggested it "should be preferred", I'd like to see something concrete.

Contributor

You can trivially see this in godbolt via:

#include <immintrin.h>

__m256 Permute(__m256 x)
{
    return _mm256_shuffle_ps(x, x, 0xFF);
}

and that Clang and GCC both universally replace this with vpermilps regardless of ISAs or target -march=....

I'll see if I can dig up, again, exactly where the instruction selection for this happens, but might take me a bit given the size/complexity of LLVM in general.

Contributor

Well that's interesting. At least trusty old MSVC does what it's told 😉

https://godbolt.org/z/3eTKGqchf

I'd appreciate it if you can find the reference for why they do it. I've seen Clang make some odd decisions before, but with GCC doing it too there's got to be something.
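
For reference, the two forms debated in this thread return the same vector when both `Shuffle` operands are the same register. A minimal sketch with hypothetical names, assuming the caller has checked `Avx.IsSupported`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, for illustration only.
internal static class AlphaBroadcastSketch
{
    // Selects element 3 of each 128-bit lane for all four positions in that lane.
    private const byte ShuffleAlphaControl = 0b_11_11_11_11;

    // Two-operand shuffle with the same vector passed twice.
    public static Vector256<float> ViaShuffle(Vector256<float> v)
        => Avx.Shuffle(v, v, ShuffleAlphaControl);

    // One-operand in-lane permute with the same control byte.
    public static Vector256<float> ViaPermute(Vector256<float> v)
        => Avx.Permute(v, ShuffleAlphaControl);
}
```

This only shows that the results match; it says nothing about which encoding is faster on a given CPU.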

+ Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> blendW = Avx.Multiply(sW, dW);

Vector256<float> dstW = Avx.Subtract(dW, blendW);
Contributor @tannergooding, Feb 19, 2023

You should be able to build a MultiplyAddNegatedEstimate helper using Fma.MultiplyAddNegated, since that computes -(a * b) + c, which is equivalent to c - (a * b).

Member Author

I'd need Vector256.Negate<T> for that wouldn't I?

Contributor

You'd use Fma.MultiplyAddNegated(a, b, c) and fall back to Avx.Subtract(c, Avx.Multiply(a, b)) otherwise.

This is because (x - y) == (-y + x), so you're just taking advantage of the specialized Fma instruction that negates the multiplied result itself.
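
A minimal sketch of such a helper, following the pattern of the `MultiplyAddEstimate` example earlier in this review. The class name is hypothetical, and the caller is assumed to have checked `Avx.IsSupported`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, for illustration only.
internal static class FmaHelpersSketch
{
    // Computes c - (a * b), using a fused instruction when the CPU supports it.
    public static Vector256<float> MultiplyAddNegatedEstimate(
        Vector256<float> a,
        Vector256<float> b,
        Vector256<float> c)
    {
        if (Fma.IsSupported)
        {
            // -(a * b) + c with a single rounding step.
            return Fma.MultiplyAddNegated(a, b, c);
        }

        // Fallback: the product is rounded before the subtraction.
        return Avx.Subtract(c, Avx.Multiply(a, b));
    }
}
```

As with the other "estimate" helper, the fused and unfused paths can differ in the last bit for some inputs.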

Member Author @JimBobSquarePants, Feb 19, 2023

Yep, I see that, thanks. In this instance I use blendW further down, so it looks like I'd be better off keeping it as is, since I would otherwise be introducing duplicate multiplications.

@@ -264,7 +306,7 @@ public static Vector256<float> Over(Vector256<float> destination, Vector256<floa

// unpremultiply
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256));
- return Avx.Blend(color, alpha, blendAlphaControl);
+ return Avx.Blend(color, alpha, BlendAlphaControl);
}

/// <summary>
@@ -304,15 +346,11 @@ public static Vector4 Atop(Vector4 destination, Vector4 source, Vector4 blend)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Atop(Vector256<float> destination, Vector256<float> source, Vector256<float> blend)
{
- // calculate weights
- const int blendAlphaControl = 0b_10_00_10_00;
- const int shuffleAlphaControl = 0b_11_11_11_11;
-
// calculate final alpha
- Vector256<float> alpha = Avx.Shuffle(destination, destination, shuffleAlphaControl);
+ Vector256<float> alpha = Avx.Shuffle(destination, destination, ShuffleAlphaControl);

// calculate weights
- Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl);
+ Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> blendW = Avx.Multiply(sW, alpha);
Vector256<float> dstW = Avx.Subtract(alpha, blendW);

@@ -321,7 +359,7 @@ public static Vector256<float> Atop(Vector256<float> destination, Vector256<floa

// unpremultiply
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256));
- return Avx.Blend(color, alpha, blendAlphaControl);
+ return Avx.Blend(color, alpha, BlendAlphaControl);
}

/// <summary>
@@ -351,20 +389,17 @@ public static Vector4 In(Vector4 destination, Vector4 source)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> In(Vector256<float> destination, Vector256<float> source)
{
- const int blendAlphaControl = 0b_10_00_10_00;
- const int shuffleAlphaControl = 0b_11_11_11_11;
-
// calculate alpha
- Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl);
- Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl);
+ Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
+ Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(sW, dW);
Contributor

You should be able to change this to:

Vector256<float> alpha = Avx.Permute(Avx.Multiply(source, destination), ShuffleAlphaControl);

That will do a single shuffle/permute rather than two of them. In general, doing the arithmetic first and shuffling less afterwards is "better" because shuffle ports are typically limited.

In this case it's safe since all the operations are element-wise, so even though you're multiplying source[i] * destination[i], the ones for i = 0, 1, 2 don't matter and just get ignored in the subsequent permute.

It's possible there are similar opportunities in the shuffle ops in the methods above here, but they are "more complex" than this one so I didn't do any in depth analysis to look for the same simplification/optimizations in them
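
The equivalence claimed above can be checked with a short sketch. The class and method names are hypothetical, and the caller is assumed to have checked `Avx.IsSupported`:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper class, for illustration only.
internal static class InAlphaSketch
{
    private const byte ShuffleAlphaControl = 0b_11_11_11_11;

    // Form used in the diff: broadcast each alpha, then multiply.
    public static Vector256<float> TwoShuffles(Vector256<float> source, Vector256<float> destination)
        => Avx.Multiply(
            Avx.Shuffle(source, source, ShuffleAlphaControl),
            Avx.Shuffle(destination, destination, ShuffleAlphaControl));

    // Suggested form: multiply element-wise, then broadcast the alpha product once.
    public static Vector256<float> OnePermute(Vector256<float> source, Vector256<float> destination)
        => Avx.Permute(Avx.Multiply(source, destination), ShuffleAlphaControl);
}
```

Both forms multiply the same pair of alpha values per pixel, so the results are bit-identical; the second simply discards the unused colour products.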


// premultiply
Vector256<float> color = Avx.Multiply(source, alpha);

// unpremultiply
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256));
- return Avx.Blend(color, alpha, blendAlphaControl);
+ return Avx.Blend(color, alpha, BlendAlphaControl);
}

/// <summary>
@@ -394,20 +429,17 @@ public static Vector4 Out(Vector4 destination, Vector4 source)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Out(Vector256<float> destination, Vector256<float> source)
{
- const int blendAlphaControl = 0b_10_00_10_00;
- const int shuffleAlphaControl = 0b_11_11_11_11;
-
// calculate alpha
- Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl);
- Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl);
- Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256.Create(1F), dW), sW);
+ Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
+ Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
+ Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256One, dW), sW);

// premultiply
Vector256<float> color = Avx.Multiply(source, alpha);

// unpremultiply
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256));
- return Avx.Blend(color, alpha, blendAlphaControl);
+ return Avx.Blend(color, alpha, BlendAlphaControl);
}

/// <summary>
@@ -441,24 +473,20 @@ public static Vector4 Xor(Vector4 destination, Vector4 source)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Xor(Vector256<float> destination, Vector256<float> source)
{
- const int blendAlphaControl = 0b_10_00_10_00;
- const int shuffleAlphaControl = 0b_11_11_11_11;
-
// calculate weights
- Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl);
- Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl);
+ Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
+ Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);

- Vector256<float> vOne = Vector256.Create(1F);
- Vector256<float> srcW = Avx.Subtract(vOne, dW);
- Vector256<float> dstW = Avx.Subtract(vOne, sW);
+ Vector256<float> srcW = Avx.Subtract(Vector256One, dW);
+ Vector256<float> dstW = Avx.Subtract(Vector256One, sW);

// calculate alpha
Vector256<float> alpha = SimdUtils.HwIntrinsics.MultiplyAdd(sW, srcW, Avx.Multiply(dW, dstW));
Vector256<float> color = SimdUtils.HwIntrinsics.MultiplyAdd(Avx.Multiply(sW, source), srcW, Avx.Multiply(Avx.Multiply(dW, destination), dstW));

// unpremultiply
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256));
- return Avx.Blend(color, alpha, blendAlphaControl);
+ return Avx.Blend(color, alpha, BlendAlphaControl);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]