Enable Avx2 optimizations on Porter-Duff operations. #2359
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,12 @@ namespace SixLabors.ImageSharp.PixelFormats.PixelBlenders; | |
/// </remarks> | ||
internal static partial class PorterDuffFunctions | ||
{ | ||
private const int BlendAlphaControl = 0b_10_00_10_00; | ||
private const int ShuffleAlphaControl = 0b_11_11_11_11; | ||
private static readonly Vector256<float> Vector256Half = Vector256.Create(0.5F); | ||
private static readonly Vector256<float> Vector256One = Vector256.Create(1F); | ||
private static readonly Vector256<float> Vector256Two = Vector256.Create(2F); | ||
|
||
/// <summary> | ||
/// Returns the result of the "Normal" compositing equation. | ||
/// </summary> | ||
|
@@ -79,7 +85,7 @@ public static Vector4 Add(Vector4 backdrop, Vector4 source) | |
/// <returns>The <see cref="Vector256{Single}"/>.</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Add(Vector256<float> backdrop, Vector256<float> source) | ||
=> Avx.Min(Vector256.Create(1F), Avx.Add(backdrop, source)); | ||
=> Avx.Min(Vector256One, Avx.Add(backdrop, source)); | ||
|
||
/// <summary> | ||
/// Returns the result of the "Subtract" compositing equation. | ||
|
@@ -99,7 +105,7 @@ public static Vector4 Subtract(Vector4 backdrop, Vector4 source) | |
/// <returns>The <see cref="Vector256{Single}"/>.</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Subtract(Vector256<float> backdrop, Vector256<float> source) | ||
=> Avx.Min(Vector256.Create(1F), Avx.Subtract(backdrop, source)); | ||
=> Avx.Min(Vector256One, Avx.Subtract(backdrop, source)); | ||
|
||
/// <summary> | ||
/// Returns the result of the "Screen" compositing equation. | ||
|
@@ -119,10 +125,7 @@ public static Vector4 Screen(Vector4 backdrop, Vector4 source) | |
/// <returns>The <see cref="Vector256{Single}"/>.</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Screen(Vector256<float> backdrop, Vector256<float> source) | ||
{ | ||
Vector256<float> vOne = Vector256.Create(1F); | ||
return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source))); | ||
} | ||
=> Avx.Subtract(Vector256One, Avx.Multiply(Avx.Subtract(Vector256One, backdrop), Avx.Subtract(Vector256One, source))); | ||
|
||
/// <summary> | ||
/// Returns the result of the "Darken" compositing equation. | ||
|
@@ -179,6 +182,19 @@ public static Vector4 Overlay(Vector4 backdrop, Vector4 source) | |
return Vector4.Min(Vector4.One, new Vector4(cr, cg, cb, 0)); | ||
} | ||
|
||
/// <summary> | ||
/// Returns the result of the "Overlay" compositing equation. | ||
/// </summary> | ||
/// <param name="backdrop">The backdrop vector.</param> | ||
/// <param name="source">The source vector.</param> | ||
/// <returns>The <see cref="Vector256{Single}"/>.</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Overlay(Vector256<float> backdrop, Vector256<float> source) | ||
{ | ||
Vector256<float> color = OverlayValueFunction(backdrop, source); | ||
return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl)); | ||
We're "missing" a JIT optimization here so swapping the parameter order for […] Namely do […]

I'll log a JIT issue to track ensuring we support commutativity for […]

Awesome. Thanks!

dotnet/runtime#82365 for recognizing more types of commutative operations, such as for […]
||
} | ||
|
||
/// <summary> | ||
/// Returns the result of the "HardLight" compositing equation. | ||
/// </summary> | ||
|
@@ -195,6 +211,19 @@ public static Vector4 HardLight(Vector4 backdrop, Vector4 source) | |
return Vector4.Min(Vector4.One, new Vector4(cr, cg, cb, 0)); | ||
} | ||
|
||
/// <summary> | ||
/// Returns the result of the "HardLight" compositing equation. | ||
/// </summary> | ||
/// <param name="backdrop">The backdrop vector.</param> | ||
/// <param name="source">The source vector.</param> | ||
/// <returns>The <see cref="Vector256{Single}"/>.</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> HardLight(Vector256<float> backdrop, Vector256<float> source) | ||
{ | ||
Vector256<float> color = OverlayValueFunction(source, backdrop); | ||
return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl)); | ||
} | ||
|
||
/// <summary> | ||
/// Helper function for Overlay and HardLight modes | ||
/// </summary> | ||
|
@@ -205,6 +234,22 @@ public static Vector4 HardLight(Vector4 backdrop, Vector4 source) | |
private static float OverlayValueFunction(float backdrop, float source) | ||
=> backdrop <= 0.5f ? (2 * backdrop * source) : 1 - (2 * (1 - source) * (1 - backdrop)); | ||
|
||
/// <summary> | ||
/// Helper function for Overlay and HardLight modes | ||
/// </summary> | ||
/// <param name="backdrop">Backdrop color element</param> | ||
/// <param name="source">Source color element</param> | ||
/// <returns>Overlay value</returns> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source) | ||
{ | ||
Vector256<float> left = Avx.Multiply(Avx.Multiply(Vector256Two, backdrop), source); | ||
This is something I'm working on recognizing implicitly in the JIT as part of the constant folding support being added in .NET 8.

You could then consider a helper method here:

    public static Vector256<float> MultiplyAddEstimate(Vector256<float> left, Vector256<float> right, Vector256<float> addend)
    {
        if (Fma.IsSupported)
        {
            return Fma.MultiplyAdd(left, right, addend);
        }
        else
        {
            return Avx.Add(Avx.Multiply(left, right), addend);
        }
    }

The reason that it's an "estimate" is that […] The most notably is that for […] Most other inputs are less drastic in their differences. They'll either produce the same result (with […]

We actually have a helper that does exactly that in […]

dotnet/runtime#82366 for the […]
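As a scalar illustration of why a fused and an unfused multiply-add can disagree: a fused operation rounds once, after the addition, while the fallback rounds the product first. The sketch below is not from the PR; it assumes .NET Core 3.0 or later for `MathF.FusedMultiplyAdd`, and the input values are chosen purely for illustration.

```csharp
using System;

// a = 1 + 2^-12 and c = -(1 + 2^-11) are both exactly representable as floats.
float a = 1f + (1f / 4096f);
float c = -(1f + (1f / 2048f));

// The exact product a * a = 1 + 2^-11 + 2^-24 does not fit in a float.
// Unfused: the product is rounded to a float first, then c is added.
float product = (float)(a * a);
float unfused = product + c;

// Fused: the exact product and the addend are combined, then rounded once.
float fused = MathF.FusedMultiplyAdd(a, a, c);

// With IEEE 754 round-to-nearest-even, unfused should be 0 and fused should be 2^-24.
Console.WriteLine($"unfused = {unfused}, fused = {fused}");
```

A helper that picks `Fma.MultiplyAdd` when available and `Avx.Add(Avx.Multiply(...))` otherwise can therefore return slightly different results on different hardware, which is acceptable for blending but worth signalling in the name.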
||
Vector256<float> right = Avx.Subtract(Vector256One, Avx.Multiply(Avx.Multiply(Vector256Two, Avx.Subtract(Vector256One, source)), Avx.Subtract(Vector256One, backdrop))); | ||
|
||
Vector256<float> cmp = Avx.CompareGreaterThan(backdrop, Vector256Half); | ||
return Avx.BlendVariable(left, right, cmp); | ||
} | ||
|
||
/// <summary> | ||
/// Returns the result of the "Over" compositing equation. | ||
/// </summary> | ||
|
@@ -243,12 +288,9 @@ public static Vector4 Over(Vector4 destination, Vector4 source, Vector4 blend) | |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Over(Vector256<float> destination, Vector256<float> source, Vector256<float> blend) | ||
{ | ||
const int blendAlphaControl = 0b_10_00_10_00; | ||
const int shuffleAlphaControl = 0b_11_11_11_11; | ||
|
||
// calculate weights | ||
Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl); | ||
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl); | ||
Since it's the same input twice, using […]

It's 1-byte larger but also preferred by Clang/GCC in practically all cases (both Intel and AMD have their instruction selection metadata set to prefer it). There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle. It ultimately shouldn't make too much a difference (it's been the same or with shufps taking 1-cycle more everywhere so far), but I'd typically side with what Intel/AMD have made Clang/GCC emit.

I don't know what this means. The two instructions mostly do different things, with one overlapping case. Can you point me at the metadata in question? And any example where Clang does the substitution?
Ignoring Knight's Landing since it's not a real CPU, this argument doesn't make any sense to me. This is an in-lane operation, so there's no reason splitting it in two would affect one and not the other. In fact, the two existing architectures that do split AVX into 2x128 (Zen+ and Gracemont) have identical measured latency and throughput numbers. Not that 1 byte here or there is especially worth arguing over, but since you've suggested it "should be preferred", I'd like to see something concrete.

You can trivially see this in godbolt via:

    #include <immintrin.h>
    __m256 Permute(__m256 x)
    {
        return _mm256_shuffle_ps(x, x, 0xFF);
    }

and that Clang and GCC both universally replace this with […] I'll see if I can dig up, again, exactly where the instruction selection for this happens, but might take me a bit given the size/complexity of LLVM in general.

Well that's interesting. At least trusty old MSVC does what it's told 😉 https://godbolt.org/z/3eTKGqchf I'd appreciate it if you can find the reference for why they do it. I've seen Clang make some odd decisions before, but with GCC doing it too there's got to be something.
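For the one overlapping case discussed in this thread (the same vector passed as both shuffle operands), a small C# check can confirm that `Avx.Shuffle(x, x, control)` and `Avx.Permute(x, control)` agree. This is a sketch, not code from the PR; the helper names `AlphaViaShuffle` and `AlphaViaPermute` are hypothetical, and it only runs the comparison on hardware where `Avx.IsSupported` is true.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Broadcasts element 3 of each 128-bit lane (the alpha of each pixel) using the two-operand shuffle.
Vector256<float> AlphaViaShuffle(Vector256<float> x) => Avx.Shuffle(x, x, 0b_11_11_11_11);

// Same selection using the single-operand in-lane permute.
Vector256<float> AlphaViaPermute(Vector256<float> x) => Avx.Permute(x, 0b_11_11_11_11);

if (Avx.IsSupported)
{
    // Two RGBA pixels, one per 128-bit lane.
    Vector256<float> pixels = Vector256.Create(0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f);

    // Both forms should yield the alpha of each pixel repeated across its lane.
    Console.WriteLine(AlphaViaShuffle(pixels));
    Console.WriteLine(AlphaViaPermute(pixels));
    Console.WriteLine(AlphaViaShuffle(pixels).Equals(AlphaViaPermute(pixels)));
}
```

This only demonstrates functional equivalence; it says nothing about which encoding is faster on a given CPU, which is the open question in the comments above.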
||
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl); | ||
Vector256<float> blendW = Avx.Multiply(sW, dW); | ||
|
||
Vector256<float> dstW = Avx.Subtract(dW, blendW); | ||
You should be able to build a […]

I'd need […]

You'd use […] This is because […]

Yep. I see that thanks. In this instance I use […]
||
|
@@ -264,7 +306,7 @@ public static Vector256<float> Over(Vector256<float> destination, Vector256<floa | |
|
||
// unpremultiply | ||
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256)); | ||
return Avx.Blend(color, alpha, blendAlphaControl); | ||
return Avx.Blend(color, alpha, BlendAlphaControl); | ||
} | ||
|
||
/// <summary> | ||
|
@@ -304,15 +346,11 @@ public static Vector4 Atop(Vector4 destination, Vector4 source, Vector4 blend) | |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Atop(Vector256<float> destination, Vector256<float> source, Vector256<float> blend) | ||
{ | ||
// calculate weights | ||
const int blendAlphaControl = 0b_10_00_10_00; | ||
const int shuffleAlphaControl = 0b_11_11_11_11; | ||
|
||
// calculate final alpha | ||
Vector256<float> alpha = Avx.Shuffle(destination, destination, shuffleAlphaControl); | ||
Vector256<float> alpha = Avx.Shuffle(destination, destination, ShuffleAlphaControl); | ||
|
||
// calculate weights | ||
Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl); | ||
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl); | ||
Vector256<float> blendW = Avx.Multiply(sW, alpha); | ||
Vector256<float> dstW = Avx.Subtract(alpha, blendW); | ||
|
||
|
@@ -321,7 +359,7 @@ public static Vector256<float> Atop(Vector256<float> destination, Vector256<floa | |
|
||
// unpremultiply | ||
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256)); | ||
return Avx.Blend(color, alpha, blendAlphaControl); | ||
return Avx.Blend(color, alpha, BlendAlphaControl); | ||
} | ||
|
||
/// <summary> | ||
|
@@ -351,20 +389,17 @@ public static Vector4 In(Vector4 destination, Vector4 source) | |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> In(Vector256<float> destination, Vector256<float> source) | ||
{ | ||
const int blendAlphaControl = 0b_10_00_10_00; | ||
const int shuffleAlphaControl = 0b_11_11_11_11; | ||
|
||
// calculate alpha | ||
Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl); | ||
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl); | ||
Vector256<float> alpha = Avx.Multiply(sW, dW); | ||
You should be able to change this to:

    Vector256<float> alpha = Avx.Permute(Avx.Multiply(source, destination), ShuffleAlphaControl);

That will do a single shuffle/permute, rather than two of them. In general, if you can do the ops first and shuffle less afterwards, it's "better" because shuffle ports are typically limited. In this case it's safe since all the operations are […]

It's possible there are similar opportunities in the shuffle ops in the methods above here, but they are "more complex" than this one so I didn't do any in-depth analysis to look for the same simplifications/optimizations in them.
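The suggested rewrite can be checked against the original shape with a short sketch. It is not part of the PR; the function names `AlphaShuffleFirst` and `AlphaMultiplyFirst` are hypothetical, and the sample values are arbitrary. Because the multiply works element by element, element 3 of each lane holds the same product whether the broadcast happens before or after it.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Shape of the code under review: two shuffles, then one multiply.
Vector256<float> AlphaShuffleFirst(Vector256<float> source, Vector256<float> destination)
{
    Vector256<float> sW = Avx.Shuffle(source, source, 0b_11_11_11_11);
    Vector256<float> dW = Avx.Shuffle(destination, destination, 0b_11_11_11_11);
    return Avx.Multiply(sW, dW);
}

// Suggested shape: one multiply, then a single permute of the product.
Vector256<float> AlphaMultiplyFirst(Vector256<float> source, Vector256<float> destination)
    => Avx.Permute(Avx.Multiply(source, destination), 0b_11_11_11_11);

if (Avx.IsSupported)
{
    // Two RGBA pixels per vector; the alpha channel is element 3 of each lane.
    Vector256<float> src = Vector256.Create(0.9f, 0.8f, 0.7f, 0.5f, 0.1f, 0.2f, 0.3f, 0.25f);
    Vector256<float> dst = Vector256.Create(0.2f, 0.4f, 0.6f, 0.75f, 0.6f, 0.5f, 0.4f, 1f);

    Console.WriteLine(AlphaShuffleFirst(src, dst));
    Console.WriteLine(AlphaMultiplyFirst(src, dst));
}
```

The second form spends three extra multiplies per lane on values that are then discarded, in exchange for one fewer shuffle; whether that is a net win depends on port pressure in the surrounding code.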
||
|
||
// premultiply | ||
Vector256<float> color = Avx.Multiply(source, alpha); | ||
|
||
// unpremultiply | ||
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256)); | ||
return Avx.Blend(color, alpha, blendAlphaControl); | ||
return Avx.Blend(color, alpha, BlendAlphaControl); | ||
} | ||
|
||
/// <summary> | ||
|
@@ -394,20 +429,17 @@ public static Vector4 Out(Vector4 destination, Vector4 source) | |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Out(Vector256<float> destination, Vector256<float> source) | ||
{ | ||
const int blendAlphaControl = 0b_10_00_10_00; | ||
const int shuffleAlphaControl = 0b_11_11_11_11; | ||
|
||
// calculate alpha | ||
Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl); | ||
Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256.Create(1F), dW), sW); | ||
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl); | ||
Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256One, dW), sW); | ||
|
||
// premultiply | ||
Vector256<float> color = Avx.Multiply(source, alpha); | ||
|
||
// unpremultiply | ||
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256)); | ||
return Avx.Blend(color, alpha, blendAlphaControl); | ||
return Avx.Blend(color, alpha, BlendAlphaControl); | ||
} | ||
|
||
/// <summary> | ||
|
@@ -441,24 +473,20 @@ public static Vector4 Xor(Vector4 destination, Vector4 source) | |
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static Vector256<float> Xor(Vector256<float> destination, Vector256<float> source) | ||
{ | ||
const int blendAlphaControl = 0b_10_00_10_00; | ||
const int shuffleAlphaControl = 0b_11_11_11_11; | ||
|
||
// calculate weights | ||
Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl); | ||
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl); | ||
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl); | ||
|
||
Vector256<float> vOne = Vector256.Create(1F); | ||
Vector256<float> srcW = Avx.Subtract(vOne, dW); | ||
Vector256<float> dstW = Avx.Subtract(vOne, sW); | ||
Vector256<float> srcW = Avx.Subtract(Vector256One, dW); | ||
Vector256<float> dstW = Avx.Subtract(Vector256One, sW); | ||
|
||
// calculate alpha | ||
Vector256<float> alpha = SimdUtils.HwIntrinsics.MultiplyAdd(sW, srcW, Avx.Multiply(dW, dstW)); | ||
Vector256<float> color = SimdUtils.HwIntrinsics.MultiplyAdd(Avx.Multiply(sW, source), srcW, Avx.Multiply(Avx.Multiply(dW, destination), dstW)); | ||
|
||
// unpremultiply | ||
color = Avx.Divide(color, Avx.Max(alpha, Constants.Epsilon256)); | ||
return Avx.Blend(color, alpha, blendAlphaControl); | ||
return Avx.Blend(color, alpha, BlendAlphaControl); | ||
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
|
This is going to be worse than direct usage for .NET 5+.

Starting in .NET 5 there was some special support added for Vector###.Create(cns) and Vector###.Create(cns, ..., cns). These APIs will now generate a "method local constant" which makes it quite a bit more efficient than even a static readonly will be.

In .NET 5 this support only existed in the late phases of the JIT (lowering). That support was improved a bit more in .NET 6, 7, and now in 8 as well. Starting in 7 in particular we now have a direct node (GT_CNS_VEC) which allows other phases of the JIT to take advantage of this. In .NET 8 we've started adding constant folding support as well.

If you want to centralize such "constants", I'd recommend returning them from a property instead:
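A minimal sketch of what such a property could look like, assuming the goal is to keep each `Vector256.Create(cns)` call visible to the JIT at the point of use. The class name `BlendConstants` and the property names are hypothetical and are not taken from ImageSharp.

```csharp
using System;
using System.Runtime.Intrinsics;

// Each access evaluates Vector256.Create(...) again, so after inlining the JIT
// sees a constant-creation call rather than a load from a static readonly field.
Console.WriteLine(BlendConstants.One.GetElement(0));
Console.WriteLine(BlendConstants.Half.GetElement(0));

// Hypothetical holder type for shared vector constants.
internal static class BlendConstants
{
    public static Vector256<float> Half => Vector256.Create(0.5F);

    public static Vector256<float> One => Vector256.Create(1F);

    public static Vector256<float> Two => Vector256.Create(2F);
}
```

Callers would then write `BlendConstants.One` where the diff above uses the `Vector256One` field.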
Oh that's very interesting. Will revert and look into other examples in the codebase.
What about multiple references to a local variable within a single method? Should I inline them?
Not sure off the top of my head. In .NET 7/8 I'd expect us to be doing the right thing and recognizing them as a "common subexpression". I'd also expect the same for .NET 6, but the support wasn't quite as good and so manually hoisting it to a local may provide better results and shouldn't hurt on any of the target frameworks.
In general you can assume that most Create(cns) and Create(cns, ..., cns) come from memory and so using a local can help ensure there is only one memory access. The special consideration is Zero and AllBitsSet which are generated dynamically using a "zero cost" instruction instead.