Enable Avx2 optimizations on Porter-Duff operations. #2359

Merged

25 commits merged from js/avx2-porter-duff into main on Feb 20, 2023

Conversation

Member

@JimBobSquarePants commented Feb 17, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Adds Vector256<float> implementations of the Porter-Duff functions to speed up our blending operations. Fixes #2340

  • Complete function component set within PorterDuffFunctions.cs.
  • Update the PorterDuffFunctions.Generated.tt to include the Vector256<float> variants.
  • Update DefaultPixelBlenders.Generated.tt to use the Vector256<float> variants when appropriate.
  • Update the unit tests to use RemoteExecutor.

@antonfirsov @tannergooding @saucecontrol @br3aker I'll take any help you can offer pointing me in the right direction to finish porting these over.

/// <returns>The <see cref="Vector256{Single}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Lighten(Vector256<float> backdrop, Vector256<float> source)
=> Avx.Max(backdrop, source);

Member Author

I don't know how to do OverlayValueFunction below.

Contributor

@br3aker commented Feb 17, 2023

I'm a bit rusty but something like this should work:

private static readonly Vector256<float> Half = Vector256.Create(0.5f);
private static readonly Vector256<float> One = Vector256.Create(1f);
private static readonly Vector256<float> Two = Vector256.Create(2f);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
{
    // 2 * backdrop * source
    Vector256<float> left = Avx.Multiply(Avx.Multiply(Two, backdrop), source);

    // 1 - 2 * (1 - source) * (1 - backdrop)
    Vector256<float> right = Avx.Subtract(One, Avx.Multiply(Avx.Multiply(Two, Avx.Subtract(One, source)), Avx.Subtract(One, backdrop)));

    // Per element: select 'right' where backdrop > 0.5, otherwise 'left'.
    Vector256<float> cmp = Avx.CompareGreaterThan(backdrop, Half);
    return Avx.BlendVariable(left, right, cmp);
}

The SIMD implementation must calculate both cases, but it's still a bit faster than scalar:

BenchmarkDotNet=v0.13.4, OS=Windows 11 (10.0.22621.1265)
Unknown processor
.NET SDK=7.0.100
  [Host]     : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT AVX2
  DefaultJob : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT AVX2


| Method |      Mean |    Error |   StdDev |
|------- |----------:|---------:|---------:|
| Scalar | 119.12 ns | 0.342 ns | 0.320 ns |
|    Avx |  35.76 ns | 0.208 ns | 0.195 ns |
public class Benchmark
{
    private float[] _backdrop;
    private float[] _source;

    [GlobalSetup]
    public void Setup()
    {
        _backdrop = new float[8 * 20];
        _source = new float[8 * 20];

        FillRandom(_backdrop);
        FillRandom(_source);
    }

    private static void FillRandom(float[] arr)
    {
        var rng = new Random();
        for(int i = 0; i < arr.Length; i++)
        {
            arr[i] = rng.NextSingle();
        }
    }

    [Benchmark(Description = "Scalar")]
    public float OverlayValueFunction_Scalar()
    {
        float result = default;
        for (int i = 0; i < _backdrop.Length; i++)
        {
            result += Program.OverlayValueFunction(_backdrop[i], _source[i]);
        }

        return result;
    }

    [Benchmark(Description = "Avx")]
    public Vector256<float> OverlayValueFunction_Avx()
    {
        ref Vector256<float> backdrop = ref Unsafe.As<float, Vector256<float>>(ref MemoryMarshal.GetReference<float>(_backdrop));
        ref Vector256<float> source = ref Unsafe.As<float, Vector256<float>>(ref MemoryMarshal.GetReference<float>(_source));

        Vector256<float> result = default;

        int count = _backdrop.Length / Vector256<float>.Count;
        for (int i = 0; i < count; i++)
        {
            result = Avx.Add(result, Program.OverlayValueFunction(Unsafe.Add(ref backdrop, i), Unsafe.Add(ref source, i)));
        }

        return result;
    }
}

P.S.

I didn't test it properly though; I just tried a couple of random numbers and compared the results to the scalar implementation.
It can most likely be done a bit better without BlendVariable, but it's a good starting point IMO.

Member Author

Ah that's awesome. Thanks! That lets me complete the first objective.

Comment on lines 26 to 28
private static readonly Vector256<float> Vector256Half = Vector256.Create(0.5F);
private static readonly Vector256<float> Vector256One = Vector256.Create(1F);
private static readonly Vector256<float> Vector256Two = Vector256.Create(2F);

Contributor

This is going to be worse than direct usage for .NET 5+

Starting in .NET 5 there was some special support added for Vector###.Create(cns) and Vector###.Create(cns, ..., cns). These APIs will now generate a "method local constant" which makes it quite a bit more efficient than even a static readonly will be.

In .NET 5 this support only existed in the late phases of the JIT (lowering). That support was improved a bit more in .NET 6, 7, and now in 8 as well. Starting in 7 in particular we now have a direct node (GT_CNS_VEC) which allows other phases of the JIT to take advantage of this. In .NET 8 we've started adding constant folding support as well.

Contributor

If you want to centralize such "constants", I'd recommend returning them from a property instead:

private static Vector256<float> Vector256Half => Vector256.Create(0.5F);
private static Vector256<float> Vector256One => Vector256.Create(1F);
private static Vector256<float> Vector256Two => Vector256.Create(2F);

Member Author

Oh that's very interesting. Will revert and look into other examples in the codebase.

What about multiple references to a local variable within a single method? Should I inline them?

public static Vector256<float> Screen(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));
}

Contributor

Not sure off the top of my head. In .NET 7/8 I'd expect us to be doing the right thing and recognizing them as a "common subexpression". I'd also expect the same for .NET 6, but the support wasn't quite as good and so manually hoisting it to a local may provide better results and shouldn't hurt on any of the target frameworks.

In general you can assume that most Create(cns) and Create(cns, ..., cns) come from memory and so using a local can help ensure there is only one memory access. The special consideration is Zero and AllBitsSet which are generated dynamically using a "zero cost" instruction instead.

public static Vector256<float> Overlay(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> color = OverlayValueFunction(backdrop, source);
    return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl));
}

Contributor

@tannergooding commented Feb 19, 2023

We're "missing" a JIT optimization here so swapping the parameter order for vblendps can be more efficient

Namely do Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11);. This is special to Zero and AllBitsSet since they aren't "normal" constants but are instead always generated into register directly (rather than loaded from memory). For other constants the original ordering you had would've been better.
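
Applied to the Overlay snippet above, that would look roughly like this (a sketch; the swapped control value is the one from this comment, not necessarily what shipped in the PR):

// Zero is always generated directly into a register rather than loaded
// from memory, so it is cheaper as the first operand here.
return Avx.Min(Vector256One, Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11));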

Contributor

I'll log a JIT issue to track ensuring we support commutativity for Blend

Member Author

Awesome. Thanks!

Contributor

@tannergooding commented Feb 19, 2023

dotnet/runtime#82365 for recognizing more types of commutative operations, such as for Blend

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> left = Avx.Multiply(Avx.Multiply(Vector256Two, backdrop), source);

Contributor

Multiply(Two, backdrop) is better as Add(backdrop, backdrop).

This is something I'm working on recognizing implicitly in the JIT as part of the constant folding support being added in .NET 8
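
For example, the rewrite here would be (a sketch; the same form appears in a later revision of the PR):

// Avx.Multiply(Vector256Two, backdrop) computes the same value, but the
// add avoids loading the constant 2 from memory.
Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);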

Contributor

You could then consider a helper method here:

public static Vector256<float> MultiplyAddEstimate(Vector256<float> left, Vector256<float> right, Vector256<float> addend)
{
    if (Fma.IsSupported)
    {
        return Fma.MultiplyAdd(left, right, addend);
    }
    else
    {
        return Avx.Add(Avx.Multiply(left, right), addend);
    }
}

The reason that it's an "estimate" is that FMA does a more precise computation and gives a more accurate result. This can also subtly change the answer in some edge cases.

Most notable is that, for double, (1e308 * 2) - 1e308 returns infinity, but FusedMultiplyAdd(1e308, 2, -1e308) returns 1e308 (there is a similar scenario for float, but I don't remember the exact inputs off the top of my head).

Most other inputs are less drastic in their differences. They'll either produce the same result (with Fma typically being faster) or be off by 1 ULP (if you looked at the raw bits, the least significant bit would typically differ by 1).
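
The double edge case above is easy to reproduce with Math.FusedMultiplyAdd (a minimal sketch):

// 1e308 * 2 overflows double.MaxValue before the subtraction runs,
// so the unfused expression produces PositiveInfinity.
Console.WriteLine((1e308 * 2) - 1e308);

// The fused form rounds only once, so there is no intermediate
// overflow and the exact answer, 1e308, comes back.
Console.WriteLine(Math.FusedMultiplyAdd(1e308, 2, -1e308));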

Member Author

We actually have a helper that does exactly that in SimdUtils.HwIntrinsics.MultiplyAdd

Contributor

@tannergooding commented Feb 19, 2023

dotnet/runtime#82366 for the vector * 2 to vector + vector transform being implicitly recognized

@JimBobSquarePants added this to the 3.0.0 milestone Feb 19, 2023
// calculate weights
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);

Contributor

Since it's the same input twice, using Avx.Permute should be preferred. The same shuffle control is used.
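
That is, for the excerpt above (a sketch):

// Same shuffle control, single-input permute (vpermilps) instead of vshufps.
Vector256<float> sW = Avx.Permute(source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Permute(destination, ShuffleAlphaControl);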

Contributor

vpermilps is 1 byte larger than vshufps. The only reason to prefer it is that it supports a memory operand where shuffle doesn't, but that doesn't apply here.

Contributor

@tannergooding commented Feb 19, 2023

It's 1-byte larger but also preferred by Clang/GCC in practically all cases (both Intel and AMD have their instruction selection metadata set to prefer it).

There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.

It ultimately shouldn't make too much of a difference (it's been the same, or with shufps taking 1 cycle more, everywhere so far), but I'd typically side with what Intel/AMD have made Clang/GCC emit.

Contributor

(both Intel and AMD have their instruction selection metadata set to prefer it).

I don't know what this means. The two instructions mostly do different things, with one overlapping case. Can you point me at the metadata in question? And any example where Clang does the substitution?

There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.

Ignoring Knight's Landing since it's not a real CPU, this argument doesn't make any sense to me. This is an in-lane operation, so there's no reason splitting it in two would affect one and not the other. In fact, the two existing architectures that do split AVX into 2x128 (Zen+ and Gracemont) have identical measured latency and throughput numbers.

Not that 1 byte here or there is especially worth arguing over, but since you've suggested it "should be preferred", I'd like to see something concrete.

Contributor

You can trivially see this in godbolt via:

#include <immintrin.h>

__m256 Permute(__m256 x)
{
    return _mm256_shuffle_ps(x, x, 0xFF);
}

and that Clang and GCC both universally replace this with vpermilps regardless of ISAs or target -march=....

I'll see if I can dig up, again, exactly where the instruction selection for this happens, but might take me a bit given the size/complexity of LLVM in general.

Contributor

Well that's interesting. At least trusty old MSVC does what it's told 😉

https://godbolt.org/z/3eTKGqchf

I'd appreciate it if you can find the reference for why they do it. I've seen Clang make some odd decisions before, but with GCC doing it too there's got to be something.

Comment on lines 294 to 296
Vector256<float> blendW = Avx.Multiply(sW, dW);

Vector256<float> dstW = Avx.Subtract(dW, blendW);

Contributor

@tannergooding commented Feb 19, 2023

You should be able to build a MultiplyAddNegatedEstimate helper using Fma.MultiplyAddNegated, since that does -(a * b) + c, which should be equivalent to c - (a * b).

Member Author

I'd need Vector256.Negate<T> for that, wouldn't I?

Contributor

You'd use Fma.MultiplyAddNegated(a, b, c) and fall back to Avx.Subtract(c, Avx.Multiply(a, b)) otherwise.

This is because (x - y) == (-y + x), so you're just taking advantage of the specialized Fma instruction that negates the multiplied result itself.
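
A minimal sketch of such a helper, following the MultiplyAddEstimate pattern shown earlier (the name is illustrative, not an existing API):

public static Vector256<float> MultiplyAddNegatedEstimate(Vector256<float> a, Vector256<float> b, Vector256<float> c)
{
    if (Fma.IsSupported)
    {
        // Fused -(a * b) + c, computed with a single rounding step.
        return Fma.MultiplyAddNegated(a, b, c);
    }

    // Equivalent (up to rounding): c - (a * b).
    return Avx.Subtract(c, Avx.Multiply(a, b));
}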

Member Author

@JimBobSquarePants commented Feb 19, 2023

Yep, I see that, thanks. In this instance I use blendW further down, so it looks like I'd be better off keeping this as is, since I would otherwise be introducing duplicate multiplications.

This reverts commit 907400f.
Comment on lines 393 to 395
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(sW, dW);

Contributor

You should be able to change this to:

Vector256<float> alpha = Avx.Permute(Avx.Multiply(source, destination), ShuffleAlphaControl);

That will do a single shuffle/permute rather than two of them. In general, if you can do the element-wise ops first and shuffle less afterwards, it's "better", because shuffle ports are typically limited.

In this case it's safe since all the operations are element-wise: even though you're multiplying source[i] * destination[i], the results for i = 0, 1, 2 don't matter and just get ignored in the subsequent permute.

It's possible there are similar opportunities in the shuffle ops in the methods above here, but they are "more complex" than this one, so I didn't do any in-depth analysis to look for the same simplifications/optimizations in them.

result = PorterDuffFunctions.NormalSrcOver(Unsafe.Add(ref backdrop, i), Unsafe.Add(ref source, i), opacity);
}

return result;

Member Author

@JimBobSquarePants commented Feb 19, 2023

Is there a bug in my benchmark?

Numbers seem too good to be true.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT


| Method |       Mean |   Error |  StdDev |
|------- |-----------:|--------:|--------:|
| Scalar | 2,176.5 ns | 4.48 ns | 3.97 ns |
|    Avx |   231.3 ns | 0.87 ns | 0.82 ns |

// * Hints *
Outliers
  PorterDuffBulkVsSingleVector.Scalar: Default -> 1 outlier  was  removed (2.20 us)

// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  1 ns   : 1 Nanosecond (0.000000001 sec)

// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:00:26 (26.39 sec), executed benchmarks: 2

Global total time: 00:00:47 (47.57 sec), executed benchmarks: 2
// * Artifacts cleanup *

@jnyrup commented Feb 19, 2023

I notice that both backdrop and source reference this.backdrop?

Member Author

Good catch! Doesn't affect the results though.

Contributor

Just peeking at .NET 7, and changing the vector algorithm to be V128 to simplify the diff a bit, we have:

Vector4

; Assembly listing for method C:NormalSrcOver(System.Numerics.Vector4,System.Numerics.Vector4,float):System.Numerics.Vector4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC18             sub      rsp, 24
       C5F877               vzeroupper 
       C5F8293424           vmovaps  xmmword ptr [rsp], xmm6
 
G_M000_IG02:                ;; offset=000CH
       498D400C             lea      rax, bword ptr [r8+0CH]
       C5E25900             vmulss   xmm0, xmm3, dword ptr [rax]
       C5FA1100             vmovss   dword ptr [rax], xmm0
       C4C1781000           vmovups  xmm0, xmmword ptr [r8]
       C5F828C8             vmovaps  xmm1, xmm0
       C5F81012             vmovups  xmm2, xmmword ptr [rdx]
       C5E8C6DAFF           vshufps  xmm3, xmm2, xmm2, -1
       C5F0C6E1FF           vshufps  xmm4, xmm1, xmm1, -1
       C5E259EC             vmulss   xmm5, xmm3, xmm4
       C5E25CDD             vsubss   xmm3, xmm3, xmm5
       C5DA5CF5             vsubss   xmm6, xmm4, xmm5
       C5E258E4             vaddss   xmm4, xmm3, xmm4
       C4E27918DB           vbroadcastss xmm3, xmm3
       C5E859D3             vmulps   xmm2, xmm2, xmm3
       C4E27918DE           vbroadcastss xmm3, xmm6
       C5F059CB             vmulps   xmm1, xmm1, xmm3
       C5E858C9             vaddps   xmm1, xmm2, xmm1
       C4E27918D5           vbroadcastss xmm2, xmm5
       C5F859C2             vmulps   xmm0, xmm0, xmm2
       C5F058C0             vaddps   xmm0, xmm1, xmm0
       C5F828CC             vmovaps  xmm1, xmm4
       C5F8101532000000     vmovups  xmm2, xmmword ptr [reloc @RWD00]
       C5E85FC9             vmaxps   xmm1, xmm2, xmm1
       C4E27918C9           vbroadcastss xmm1, xmm1
       C5F85EC1             vdivps   xmm0, xmm0, xmm1
       C4E37921C430         vinsertps xmm0, xmm0, xmm4, 48
       C5F81101             vmovups  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0088H
       C5F8283424           vmovaps  xmm6, xmmword ptr [rsp]
       4883C418             add      rsp, 24
       C3                   ret   

Vector128

G_M000_IG01:                ;; offset=0000H
       4883EC18             sub      rsp, 24
       C5F877               vzeroupper 
       C5F8293424           vmovaps  qword ptr [rsp], xmm6
 
G_M000_IG02:                ;; offset=000CH
       C4C1791000           vmovupd  xmm0, xmmword ptr [r8]
       C4C1785909           vmulps   xmm1, xmm0, xmmword ptr [r9]
       C4E3790CC188         vblendps xmm0, xmm0, xmm1, -120
       C4C1791100           vmovupd  xmmword ptr [r8], xmm0
       C4C1791000           vmovupd  xmm0, xmmword ptr [r8]
       C5F828C8             vmovaps  xmm1, xmm0
       C5F91012             vmovupd  xmm2, xmmword ptr [rdx]
       C4E37904D8FF         vpermilps xmm3, xmm0, -1
       C4E37904E2FF         vpermilps xmm4, xmm2, -1
       C5E059EC             vmulps   xmm5, xmm3, xmm4
       C5D85CE5             vsubps   xmm4, xmm4, xmm5
       C5E05CF5             vsubps   xmm6, xmm3, xmm5
       C5D858DB             vaddps   xmm3, xmm4, xmm3
       C5E859D4             vmulps   xmm2, xmm2, xmm4
       C4E279B8D6           vfmadd231ps xmm2, xmm0, xmm6
       C4E271B8D5           vfmadd231ps xmm2, xmm1, xmm5
       C5E05F0530000000     vmaxps   xmm0, xmm3, xmmword ptr [reloc @RWD00]
       C5E85ED0             vdivps   xmm2, xmm2, xmm0
       C4E3690CC388         vblendps xmm0, xmm2, xmm3, -120
       C5F91101             vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0071H
       C5F8283424           vmovaps  xmm6, qword ptr [rsp]
       4883C418             add      rsp, 24
       C3                   ret

The V256 code is effectively identical to the V128 code, just using ymm instead of xmm, and we can assume it is roughly 2x the speed.

For the differences between Vector4 and V128...

Contributor

The initial source.W *= opacity isn't great, it's generating:

lea      rax, bword ptr [r8+0CH]
vmulss   xmm0, xmm3, dword ptr [rax]
vmovss   dword ptr [rax], xmm0
vmovupd  xmm0, xmmword ptr [r8]

If NormalSrcOver was inlined and source was in register, it'd also be generating a spill before this:

vmovapd  xmmword ptr [r8], xmm0

That basically means it's doing:

  • write source to memory
  • get the address of w
  • load w from memory and multiply by opacity
  • store result into w
  • load entire vector from memory

Compare this with the V128 approach:

vmovupd  xmm0, xmmword ptr [r8]
vmulps   xmm1, xmm0, xmmword ptr [r9]
vblendps xmm0, xmm0, xmm1, 8

Which is:

  • load source from memory
  • load opacity from memory and multiply by source
  • blend so that you get original source plus new W

If you instead did something roughly equivalent, making opacity a Vector4 and then doing source.W = (source * opacity).W;, you get this instead:

vmovupd  xmm0, xmmword ptr [r8]
vmulps   xmm1, xmm0, xmmword ptr [r9]
vshufps  xmm1, xmm1, xmm1, -1
vinsertps xmm0, xmm0, xmm1, 48

This is nearly ideal and does the same thing; there's just a stray vshufps that I need to remove on the JIT side >.>

Contributor

Insanely detailed writeup, learnt a lot, thank you!

Member Author

This is incredible @tannergooding Thanks so much!

I'm gonna have a tinker and see where I can get. NormalSrcOver is currently 9.2X faster using AVX which is absolutely bonkers. CC @antonfirsov

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT


| Method |       Mean |   Error |  StdDev | Ratio |
|------- |-----------:|--------:|--------:|------:|
| Scalar | 2,177.0 ns | 4.38 ns | 3.66 ns |  1.00 |
|    Avx |   233.5 ns | 1.25 ns | 1.17 ns |  0.11 |

Member Author

@JimBobSquarePants commented Feb 20, 2023

MAJOR improvement for the Vector4 path. @tannergooding

I implemented the WithW change and the time immediately dropped, making it nearly 4X faster. PermuteW brought it down even further.

The only thing I couldn't figure out was how to fall back to Sse for WithW. I'm not great with shuffling.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT


| Method |     Mean |   Error |  StdDev | Ratio |
|------- |---------:|--------:|--------:|------:|
| Scalar | 415.5 ns | 1.36 ns | 1.06 ns |  1.00 |
|    Avx | 234.4 ns | 0.44 ns | 0.35 ns |  0.56 |

Contributor

@tannergooding commented Feb 20, 2023

MAJOR improvement for the Vector4 path.

Awesome! Much closer to the 1.5-2x difference we'd otherwise expect

The only thing I couldn't figure out was how to fallback to Sse for WithW.

Think you want:

Vector128<float> tmp = Sse.Shuffle(w.AsVector128(), value.AsVector128(), 0b00_10_11_11);
return Sse.Shuffle(value.AsVector128(), tmp, 0b00_10_01_00);

That will create tmp as <w[3], w[3], value[2], value[0]>
Then return <value[0], value[1], tmp[2], tmp[0]> (which is <value[0], value[1], value[2], w[3]>)
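
Putting that together, a WithW-style helper might look like this (a sketch; the name, the Vector4 signature, and the Sse41 fast path are illustrative rather than what the PR ended up with):

private static Vector4 WithW(Vector4 value, Vector4 w)
{
    if (Sse41.IsSupported)
    {
        // insertps copies element 3 of w directly into element 3 of value.
        return Sse41.Insert(value.AsVector128(), w.AsVector128(), 0b11_11_0000).AsVector4();
    }

    if (Sse.IsSupported)
    {
        // tmp = <w[3], w[3], value[2], value[0]>
        Vector128<float> tmp = Sse.Shuffle(w.AsVector128(), value.AsVector128(), 0b00_10_11_11);

        // <value[0], value[1], tmp[2], tmp[0]> == <value[0], value[1], value[2], w[3]>
        return Sse.Shuffle(value.AsVector128(), tmp, 0b00_10_01_00).AsVector4();
    }

    return new Vector4(value.X, value.Y, value.Z, w.W);
}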

Member Author

I would not have figured that one out. Thanks!

@JimBobSquarePants changed the title from "WIP. Enable Avx2 optimizations on Porter-Duff operations." to "Enable Avx2 optimizations on Porter-Duff operations." Feb 19, 2023
@JimBobSquarePants marked this pull request as ready for review February 19, 2023 11:58
{
    return Vector4.One - ((Vector4.One - backdrop) * (Vector4.One - source));
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));

Contributor

@saucecontrol commented Feb 19, 2023

Pretty much anywhere you have a multiply followed by add or multiply followed by subtract, you can use FMA. This one would be

Fma.MultiplyAddNegated(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source), vOne);

With the appropriate helper for non-FMA, ofc.

Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> vTwo = Vector256.Create(2F);
Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);
Vector256<float> right = Avx.Subtract(vOne, Avx.Multiply(Avx.Multiply(vTwo, Avx.Subtract(vOne, source)), Avx.Subtract(vOne, backdrop)));

Contributor

@saucecontrol commented Feb 19, 2023

Another MultiplyAddNegated opportunity here. You can also remove the vTwo constant (which will be a memory load) by saving Avx.Subtract(vOne, source) into a temp and adding it to itself instead of multiplying by 2.

Add will be faster or have increased throughput over multiply on some processors as well.
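
A sketch of the snippet with both suggestions applied, assuming an FMA helper like the MultiplyAddNegatedEstimate sketched earlier in this thread:

Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> invSource = Avx.Subtract(vOne, source);

Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);

// right = 1 - (2 * (1 - source)) * (1 - backdrop); the doubling is an add,
// and the multiply-subtract fuses into one instruction where FMA is available.
Vector256<float> right = MultiplyAddNegatedEstimate(Avx.Add(invSource, invSource), Avx.Subtract(vOne, backdrop), vOne);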

// calculate alpha
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256.Create(1F), dW), sW);

Contributor

Thought I saw Tanner comment on this yesterday, but it's disappeared. You can save an instruction by doing the alpha shuffle after the multiply.
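
Presumably along the same lines as that earlier comment, i.e. something like (a sketch, safe for the same element-wise reason):

// Multiply first, then broadcast the alpha lane: one permute instead of two shuffles.
Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> alpha = Avx.Permute(Avx.Multiply(Avx.Subtract(vOne, destination), source), ShuffleAlphaControl);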

Contributor

My comment should be here, hopefully it hasn't actually disappeared #2359 (comment) 😅

Contributor

Ha, yeah, it just got marked as outdated so it doesn't show up in the code view anymore. Saw it after I submitted the comments :)

Member Author

I've implemented that now.

Member Author

@JimBobSquarePants commented Feb 20, 2023

OK. I think I'm happy here unless there are any objections.

The old Vector4 path is ~5.1x faster than before.
The new AVX path is ~9.2x faster than before.

Comment on lines +590 to +592
in Vector256<float> a,
in Vector256<float> b,
in Vector256<float> c)

Contributor

Particularly since this is inlined, I wouldn't expect you to need in here (or on the other examples above).

It shouldn't actually make a difference since the JIT should inline things and elide the "address taken", but it's typically easier on the JIT if you don't mark them unnecessarily for "enregisterable" types.

Contributor

@saucecontrol commented Feb 20, 2023

IIRC, without the in, the JIT could/would lose track of the fact that memory-load operands could be contained when inlining these small helpers, at least in netcoreapp3.1. I don't know if or when that was fixed, but I'd double-check before removing them.

Member Author

I haven't looked at the JIT output, but I am seeing a 2% slowdown in my benchmark when I remove them. It could be measurement error, but I'll leave them for now.

Contributor

@tannergooding left a comment

Changes look generally good/correct to me.

There's cleanup/simplification that can be done long term once you're able to move on to .NET 7 (or .NET 8 after that ships).

@JimBobSquarePants merged commit ae0a51c into main Feb 20, 2023
@JimBobSquarePants deleted the js/avx2-porter-duff branch February 20, 2023 23:47
Vector4 result = default;
for (int i = 0; i < this.backdrop.Length; i++)
{
    result = PorterDuffFunctions.NormalSrcOver(this.backdrop[i], this.source[i], .5F);

Member

The only concern with these benchmarks is that - unlike the real span code - they do not walk the source and amount buffers, meaning there are fewer cache misses.

I would have created a parametric benchmark (N=2,16,128,512,1024) that uses the Span APIs, and run it against the code before/after the changes. It shouldn't significantly impact the result, but it would paint a more nuanced picture of the improvements.

Anyways, amazing job!

Member Author

They actually used to walk those buffers. I was surprised by that, so I replicated it in the benchmark.

protected override void BlendFunction(Span<Vector4> destination, ReadOnlySpan<Vector4> background, ReadOnlySpan<Vector4> source, float amount)
{
    amount = Numerics.Clamp(amount, 0, 1);
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = PorterDuffFunctions.<#=blender_composer#>(background[i], source[i], amount);
    }
}

/// <inheritdoc />
protected override void BlendFunction(Span<Vector4> destination, ReadOnlySpan<Vector4> background, ReadOnlySpan<Vector4> source, ReadOnlySpan<float> amount)
{
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = PorterDuffFunctions.<#=blender_composer#>(background[i], source[i], Numerics.Clamp(amount[i], 0, 1));
    }
}

I've actually realized that the benchmarks were slightly generous to the old approach, as I'd forgotten the clamp on amount.
