Enable Avx2 optimizations on Porter-Duff operations. #2359
Conversation
/// <returns>The <see cref="Vector256{Single}"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> Lighten(Vector256<float> backdrop, Vector256<float> source)
    => Avx.Max(backdrop, source);
I don't know how to do OverlayValueFunction below.
I'm a bit rusty but something like this should work:
private static readonly Vector256<float> Half = Vector256.Create(0.5f);
private static readonly Vector256<float> One = Vector256.Create(1f);
private static readonly Vector256<float> Two = Vector256.Create(2f);
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
{
    var left = Avx.Multiply(Avx.Multiply(Two, backdrop), source);
    var right = Avx.Subtract(One, Avx.Multiply(Avx.Multiply(Two, Avx.Subtract(One, source)), Avx.Subtract(One, backdrop)));
    var cmp = Avx.CompareGreaterThan(backdrop, Half);
    return Avx.BlendVariable(left, right, cmp);
}
The SIMD implementation must calculate both cases, but it's still a bit faster than scalar:
BenchmarkDotNet=v0.13.4, OS=Windows 11 (10.0.22621.1265)
Unknown processor
.NET SDK=7.0.100
[Host] : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT AVX2
DefaultJob : .NET 6.0.12 (6.0.1222.56807), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev |
|------- |----------:|---------:|---------:|
| Scalar | 119.12 ns | 0.342 ns | 0.320 ns |
| Avx | 35.76 ns | 0.208 ns | 0.195 ns |
public class Benchmark
{
    private float[] _backdrop;
    private float[] _source;

    [GlobalSetup]
    public void Setup()
    {
        _backdrop = new float[8 * 20];
        _source = new float[8 * 20];
        FillRandom(_backdrop);
        FillRandom(_source);
    }

    private static void FillRandom(float[] arr)
    {
        var rng = new Random();
        for (int i = 0; i < arr.Length; i++)
        {
            arr[i] = rng.NextSingle();
        }
    }

    [Benchmark(Description = "Scalar")]
    public float OverlayValueFunction_Scalar()
    {
        float result = default;
        for (int i = 0; i < _backdrop.Length; i++)
        {
            result += Program.OverlayValueFunction(_backdrop[i], _source[i]);
        }

        return result;
    }

    [Benchmark(Description = "Avx")]
    public Vector256<float> OverlayValueFunction_Avx()
    {
        ref Vector256<float> backdrop = ref Unsafe.As<float, Vector256<float>>(ref MemoryMarshal.GetReference<float>(_backdrop));
        ref Vector256<float> source = ref Unsafe.As<float, Vector256<float>>(ref MemoryMarshal.GetReference<float>(_source));
        Vector256<float> result = default;
        int count = _backdrop.Length / Vector256<float>.Count;
        for (int i = 0; i < count; i++)
        {
            result = Avx.Add(result, Program.OverlayValueFunction(Unsafe.Add(ref backdrop, i), Unsafe.Add(ref source, i)));
        }

        return result;
    }
}
P.S.
I didn't test it properly though; I tried a couple of random numbers and compared the results to the scalar implementation.
Most likely this can be done a bit better without BlendVariable, but it's a good starting point IMO.
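For sanity-checking the vector math, here is a scalar reference of the overlay value function it implements (an illustrative sketch, not code from the thread; OverlayScalar is a made-up name):

```csharp
using System;

// Scalar reference of the overlay blend value function: 2*b*s for the lower
// half, 1 - 2*(1-s)*(1-b) otherwise, matching the BlendVariable selection
// above (CompareGreaterThan picks the "right" branch when backdrop > 0.5).
static float OverlayScalar(float backdrop, float source)
    => backdrop <= 0.5f
        ? 2f * backdrop * source
        : 1f - 2f * (1f - source) * (1f - backdrop);

Console.WriteLine(OverlayScalar(0.25f, 0.5f)); // 0.25
Console.WriteLine(OverlayScalar(0.75f, 0.5f)); // 0.75
```

Comparing the vector output lane-by-lane against this scalar form over random inputs is a quick way to validate the BlendVariable selection logic.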
Ah that's awesome. Thanks! That lets me complete the first objective.
private static readonly Vector256<float> Vector256Half = Vector256.Create(0.5F);
private static readonly Vector256<float> Vector256One = Vector256.Create(1F);
private static readonly Vector256<float> Vector256Two = Vector256.Create(2F);
This is going to be worse than direct usage for .NET 5+.
Starting in .NET 5 there was some special support added for Vector###.Create(cns) and Vector###.Create(cns, ..., cns). These APIs will now generate a "method local constant" which makes it quite a bit more efficient than even a static readonly will be.
In .NET 5 this support only existed in the late phases of the JIT (lowering). That support was improved a bit more in .NET 6, 7, and now in 8 as well. Starting in 7 in particular we now have a direct node (GT_CNS_VEC) which allows other phases of the JIT to take advantage of this. In .NET 8 we've started adding constant folding support as well.
If you want to centralize such "constants", I'd recommend returning them from a property instead:
private static Vector256<float> Vector256Half => Vector256.Create(0.5F);
private static Vector256<float> Vector256One => Vector256.Create(1F);
private static Vector256<float> Vector256Two => Vector256.Create(2F);
Oh that's very interesting. Will revert and look into other examples in the codebase.
What about multiple references to a local variable within a single method? Should I inline them?
public static Vector256<float> Screen(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));
}
Not sure off the top of my head. In .NET 7/8 I'd expect us to be doing the right thing and recognizing them as a "common subexpression". I'd also expect the same for .NET 6, but the support wasn't quite as good and so manually hoisting it to a local may provide better results and shouldn't hurt on any of the target frameworks.
In general you can assume that most Create(cns) and Create(cns, ..., cns) come from memory and so using a local can help ensure there is only one memory access. The special consideration is Zero and AllBitsSet, which are generated dynamically using a "zero cost" instruction instead.
public static Vector256<float> Overlay(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> color = OverlayValueFunction(backdrop, source);
    return Avx.Min(Vector256One, Avx.Blend(color, Vector256<float>.Zero, BlendAlphaControl));
We're "missing" a JIT optimization here so swapping the parameter order for vblendps
can be more efficient
Namely do Avx.Blend(Vector256<float>.Zero, color, 0b_01_11_01_11);
. This is special to Zero
and AllBitsSet
since they aren't "normal" constants but are instead always generated into register directly (rather than loaded from memory). For other constants the original ordering you had would've been better.
I'll log a JIT issue to track ensuring we support commutativity for Blend
Awesome. Thanks!
Logged dotnet/runtime#82365 for recognizing more types of commutative operations, such as for Blend.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> OverlayValueFunction(Vector256<float> backdrop, Vector256<float> source)
{
    Vector256<float> left = Avx.Multiply(Avx.Multiply(Vector256Two, backdrop), source);
Multiply(Two, backdrop) is better as Add(backdrop, backdrop).
This is something I'm working on recognizing implicitly in the JIT as part of the constant folding support being added in .NET 8.
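A quick sketch of the equivalence (illustrative, and hedged on AVX being available at runtime):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

if (Avx.IsSupported)
{
    Vector256<float> backdrop = Vector256.Create(3f);

    // x * 2 requires materializing the Two constant...
    Vector256<float> viaMultiply = Avx.Multiply(Vector256.Create(2f), backdrop);

    // ...while x + x produces the same values with no constant at all
    Vector256<float> viaAdd = Avx.Add(backdrop, backdrop);

    Console.WriteLine(viaMultiply.Equals(viaAdd)); // True
}
```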
You could then consider a helper method here:
public static Vector256<float> MultiplyAddEstimate(Vector256<float> left, Vector256<float> right, Vector256<float> addend)
{
    if (Fma.IsSupported)
    {
        return Fma.MultiplyAdd(left, right, addend);
    }
    else
    {
        return Avx.Add(Avx.Multiply(left, right), addend);
    }
}
The reason that it's an "estimate" is that FMA does a more precise computation and gives a more accurate result. This can also subtly change the answer in some edge cases.
The most notable is that for double, (1e308 * 2) - 1e308 returns infinity, but FusedMultiplyAdd(1e308, 2, -1e308) returns 1e308 (there is a similar scenario for float, but I don't remember the exact inputs off the top of my head).
Most other inputs are less drastic in their differences. They'll either produce the same result (with Fma typically being faster) or be off by 1 ULP (if you looked at the raw bits, the least significant bit would typically differ by 1).
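The double example above can be demonstrated directly with Math.FusedMultiplyAdd (a small sketch, not from the thread):

```csharp
using System;

// Unfused: 1e308 * 2 overflows to +infinity before the subtraction happens.
double unfused = (1e308 * 2) - 1e308;

// Fused: the multiply-add is computed as one exact step, so the
// intermediate 2e308 never has to be rounded to a representable double.
double fused = Math.FusedMultiplyAdd(1e308, 2, -1e308);

Console.WriteLine(double.IsPositiveInfinity(unfused)); // True
Console.WriteLine(fused == 1e308);                     // True
```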
We actually have a helper that does exactly that in SimdUtils.HwIntrinsics.MultiplyAdd
Logged dotnet/runtime#82366 for the vector * 2 to vector + vector transform being implicitly recognized.
// calculate weights
Vector256<float> sW = Avx.Shuffle(source, source, shuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, shuffleAlphaControl);
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Since it's the same input twice, using Avx.Permute should be preferred. The same shuffle control is used.
vpermilps is 1 byte larger than vshufps. The only reason to prefer it is that it supports a memory operand where shuffle doesn't, but that doesn't apply here.
It's 1 byte larger but also preferred by Clang/GCC in practically all cases (both Intel and AMD have their instruction selection metadata set to prefer it).
There have been CPUs in the past (Knight's Landing most notably) where there was an actual cost difference between them. Any CPU that decides to emulate YMM-based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.
It ultimately shouldn't make too much of a difference (it's been the same, or with shufps taking 1 cycle more, everywhere so far), but I'd typically side with what Intel/AMD have made Clang/GCC emit.
(both Intel and AMD have their instruction selection metadata set to prefer it).
I don't know what this means. The two instructions mostly do different things, with one overlapping case. Can you point me at the metadata in question? And any example where Clang does the substitution?
There have been CPUs in the past (Knight's Landing most notably) where there was actual cost difference between them. Any CPU that decides to emulate YMM based operations via 2x 128-bit ops internally may likewise see worse performance for shuffle.
Ignoring Knight's Landing since it's not a real CPU, this argument doesn't make any sense to me. This is an in-lane operation, so there's no reason splitting it in two would affect one and not the other. In fact, the two existing architectures that do split AVX into 2x128 (Zen+ and Gracemont) have identical measured latency and throughput numbers.
Not that 1 byte here or there is especially worth arguing over, but since you've suggested it "should be preferred", I'd like to see something concrete.
You can trivially see this in godbolt via:
#include <immintrin.h>

__m256 Permute(__m256 x)
{
    return _mm256_shuffle_ps(x, x, 0xFF);
}
and that Clang and GCC both universally replace this with vpermilps regardless of ISA or target -march=....
I'll see if I can dig up, again, exactly where the instruction selection for this happens, but it might take me a bit given the size/complexity of LLVM in general.
Well that's interesting. At least trusty old MSVC does what it's told 😉
https://godbolt.org/z/3eTKGqchf
I'd appreciate it if you can find the reference for why they do it. I've seen Clang make some odd decisions before, but with GCC doing it too there's got to be something.
Vector256<float> blendW = Avx.Multiply(sW, dW);
Vector256<float> dstW = Avx.Subtract(dW, blendW);
You should be able to build a MultiplyAddNegatedEstimate helper using Fma.MultiplyAddNegated, since that does -(a * b) + c, which should be equivalent to c - (a * b).
I'd need Vector256.Negate<T> for that, wouldn't I?
You'd use Fma.MultiplyAddNegated(a, b, c) and fall back to Avx.Subtract(c, Avx.Multiply(a, b)) otherwise.
This is because (x - y) == (-y + x), so you're just taking advantage of the specialized Fma instruction that negates the multiplied result itself.
Yep, I see that, thanks. In this instance I use blendW further down, so it looks like I'd be better off keeping it as is, since I would be introducing duplicate multiplications.
This reverts commit 907400f.
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(sW, dW);
You should be able to change this to:
Vector256<float> alpha = Avx.Permute(Avx.Multiply(source, destination), ShuffleAlphaControl);
That will do a single shuffle/permute, rather than two of them. In general, if you can do the ops first and shuffle less afterwards, it's "better" because shuffle ports are typically limited.
In this case it's safe since all the operations are element-wise, so even though you're multiplying source[i] * destination[i], the products for i = 0, 1, 2 don't matter and just get ignored in the subsequent permute.
It's possible there are similar opportunities in the shuffle ops in the methods above here, but they are "more complex" than this one, so I didn't do any in-depth analysis to look for the same simplifications/optimizations in them.
result = PorterDuffFunctions.NormalSrcOver(Unsafe.Add(ref backdrop, i), Unsafe.Add(ref source, i), opacity);
}

return result;
Is there a bug in my benchmark? The numbers seem too good to be true.
BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
[Host] : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
| Method | Mean | Error | StdDev |
|------- |-----------:|--------:|--------:|
| Scalar | 2,176.5 ns | 4.48 ns | 3.97 ns |
| Avx | 231.3 ns | 0.87 ns | 0.82 ns |
// * Hints *
Outliers
PorterDuffBulkVsSingleVector.Scalar: Default -> 1 outlier was removed (2.20 us)
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
1 ns : 1 Nanosecond (0.000000001 sec)
// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:00:26 (26.39 sec), executed benchmarks: 2
Global total time: 00:00:47 (47.57 sec), executed benchmarks: 2
// * Artifacts cleanup *
I notice that both backdrop and source reference this.backdrop?
Good catch! Doesn't affect the results though.
Just peeking at .NET 7, and changing the vector algorithm to be V128 to simplify the diff a bit, we have:
Vector4
; Assembly listing for method C:NormalSrcOver(System.Numerics.Vector4,System.Numerics.Vector4,float):System.Numerics.Vector4
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
4883EC18 sub rsp, 24
C5F877 vzeroupper
C5F8293424 vmovaps xmmword ptr [rsp], xmm6
G_M000_IG02: ;; offset=000CH
498D400C lea rax, bword ptr [r8+0CH]
C5E25900 vmulss xmm0, xmm3, dword ptr [rax]
C5FA1100 vmovss dword ptr [rax], xmm0
C4C1781000 vmovups xmm0, xmmword ptr [r8]
C5F828C8 vmovaps xmm1, xmm0
C5F81012 vmovups xmm2, xmmword ptr [rdx]
C5E8C6DAFF vshufps xmm3, xmm2, xmm2, -1
C5F0C6E1FF vshufps xmm4, xmm1, xmm1, -1
C5E259EC vmulss xmm5, xmm3, xmm4
C5E25CDD vsubss xmm3, xmm3, xmm5
C5DA5CF5 vsubss xmm6, xmm4, xmm5
C5E258E4 vaddss xmm4, xmm3, xmm4
C4E27918DB vbroadcastss xmm3, xmm3
C5E859D3 vmulps xmm2, xmm2, xmm3
C4E27918DE vbroadcastss xmm3, xmm6
C5F059CB vmulps xmm1, xmm1, xmm3
C5E858C9 vaddps xmm1, xmm2, xmm1
C4E27918D5 vbroadcastss xmm2, xmm5
C5F859C2 vmulps xmm0, xmm0, xmm2
C5F058C0 vaddps xmm0, xmm1, xmm0
C5F828CC vmovaps xmm1, xmm4
C5F8101532000000 vmovups xmm2, xmmword ptr [reloc @RWD00]
C5E85FC9 vmaxps xmm1, xmm2, xmm1
C4E27918C9 vbroadcastss xmm1, xmm1
C5F85EC1 vdivps xmm0, xmm0, xmm1
C4E37921C430 vinsertps xmm0, xmm0, xmm4, 48
C5F81101 vmovups xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0088H
C5F8283424 vmovaps xmm6, xmmword ptr [rsp]
4883C418 add rsp, 24
C3 ret
Vector128
G_M000_IG01: ;; offset=0000H
4883EC18 sub rsp, 24
C5F877 vzeroupper
C5F8293424 vmovaps qword ptr [rsp], xmm6
G_M000_IG02: ;; offset=000CH
C4C1791000 vmovupd xmm0, xmmword ptr [r8]
C4C1785909 vmulps xmm1, xmm0, xmmword ptr [r9]
C4E3790CC188 vblendps xmm0, xmm0, xmm1, -120
C4C1791100 vmovupd xmmword ptr [r8], xmm0
C4C1791000 vmovupd xmm0, xmmword ptr [r8]
C5F828C8 vmovaps xmm1, xmm0
C5F91012 vmovupd xmm2, xmmword ptr [rdx]
C4E37904D8FF vpermilps xmm3, xmm0, -1
C4E37904E2FF vpermilps xmm4, xmm2, -1
C5E059EC vmulps xmm5, xmm3, xmm4
C5D85CE5 vsubps xmm4, xmm4, xmm5
C5E05CF5 vsubps xmm6, xmm3, xmm5
C5D858DB vaddps xmm3, xmm4, xmm3
C5E859D4 vmulps xmm2, xmm2, xmm4
C4E279B8D6 vfmadd231ps xmm2, xmm0, xmm6
C4E271B8D5 vfmadd231ps xmm2, xmm1, xmm5
C5E05F0530000000 vmaxps xmm0, xmm3, xmmword ptr [reloc @RWD00]
C5E85ED0 vdivps xmm2, xmm2, xmm0
C4E3690CC388 vblendps xmm0, xmm2, xmm3, -120
C5F91101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M000_IG03: ;; offset=0071H
C5F8283424 vmovaps xmm6, qword ptr [rsp]
4883C418 add rsp, 24
C3 ret
The V256 code is effectively identical to the V128 code, just using ymm instead of xmm, and we can assume it is roughly 2x the speed.
For the differences between Vector4 and V128...
The initial source.W *= opacity isn't great; it's generating:
lea rax, bword ptr [r8+0CH]
vmulss xmm0, xmm3, dword ptr [rax]
vmovss dword ptr [rax], xmm0
vmovupd xmm0, xmmword ptr [r8]
If NormalSrcOver was inlined and source was in register, it'd also be generating a spill before this:
vmovapd xmmword ptr [r8], xmm0
That basically means it's doing:
- write source to memory
- get the address of w
- load w from memory and multiply by opacity
- store the result into w
- load the entire vector from memory
Compare this with the V128 approach:
vmovupd xmm0, xmmword ptr [r8]
vmulps xmm1, xmm0, xmmword ptr [r9]
vblendps xmm0, xmm0, xmm1, 8
Which is:
- load source from memory
- load opacity from memory and multiply by source
- blend so that you get original source plus new W
If you instead did something roughly equivalent, making it Vector4 opacity and then doing source.W = (source * opacity).W;, you get this instead:
vmovupd xmm0, xmmword ptr [r8]
vmulps xmm1, xmm0, xmmword ptr [r9]
vshufps xmm1, xmm1, xmm1, -1
vinsertps xmm0, xmm0, xmm1, 48
Which is nearly ideal and doing the same thing; there's just a stray vshufps that I need to remove on the JIT side >.>
Insanely detailed writeup, learnt a lot, thank you!
This is incredible @tannergooding Thanks so much!
I'm gonna have a tinker and see where I can get. NormalSrcOver is currently 9.2x faster using AVX, which is absolutely bonkers. CC @antonfirsov
BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
[Host] : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
| Method | Mean | Error | StdDev | Ratio |
|------- |-----------:|--------:|--------:|------:|
| Scalar | 2,177.0 ns | 4.38 ns | 3.66 ns | 1.00 |
| Avx | 233.5 ns | 1.25 ns | 1.17 ns | 0.11 |
MAJOR improvement for the Vector4 path. @tannergooding
I implemented the WithW and the timings immediately dropped to nearly 4x faster. PermuteW brought it down even further.
The only thing I couldn't figure out was how to fall back to Sse for WithW. I'm not great with shuffling.
BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.102
[Host] : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
| Method | Mean | Error | StdDev | Ratio |
|------- |---------:|--------:|--------:|------:|
| Scalar | 415.5 ns | 1.36 ns | 1.06 ns | 1.00 |
| Avx | 234.4 ns | 0.44 ns | 0.35 ns | 0.56 |
MAJOR improvement for the Vector4 path.
Awesome! Much closer to the 1.5-2x difference we'd otherwise expect
The only thing I couldn't figure out was how to fallback to Sse for WithW.
Think you want:
Vector128<float> tmp = Sse.Shuffle(w.AsVector128(), value.AsVector128(), 0b00_10_11_11);
return Sse.Shuffle(value.AsVector128(), tmp, 0b00_10_01_00);
That will create tmp as <w[3], w[3], value[2], value[0]>.
Then return <value[0], value[1], tmp[2], tmp[0]> (which is <value[0], value[1], value[2], w[3]>).
I would not have figured that one out. Thanks!
{
    return Vector4.One - ((Vector4.One - backdrop) * (Vector4.One - source));
    Vector256<float> vOne = Vector256.Create(1F);
    return Avx.Subtract(vOne, Avx.Multiply(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source)));
Pretty much anywhere you have a multiply followed by add or multiply followed by subtract, you can use FMA. This one would be
Fma.MultiplyAddNegated(Avx.Subtract(vOne, backdrop), Avx.Subtract(vOne, source), vOne);
With the appropriate helper for non-FMA, ofc.
Vector256<float> vOne = Vector256.Create(1F);
Vector256<float> vTwo = Vector256.Create(2F);
Vector256<float> left = Avx.Multiply(Avx.Add(backdrop, backdrop), source);
Vector256<float> right = Avx.Subtract(vOne, Avx.Multiply(Avx.Multiply(vTwo, Avx.Subtract(vOne, source)), Avx.Subtract(vOne, backdrop)));
Another MultiplyAddNegated opportunity here. You can also remove the vTwo constant (which will be a memory load) by saving Avx.Subtract(vOne, source) into a temp and adding it to itself instead of multiplying by 2.
Add will be faster or have increased throughput over multiply on some processors as well.
// calculate alpha
Vector256<float> sW = Avx.Shuffle(source, source, ShuffleAlphaControl);
Vector256<float> dW = Avx.Shuffle(destination, destination, ShuffleAlphaControl);
Vector256<float> alpha = Avx.Multiply(Avx.Subtract(Vector256.Create(1F), dW), sW);
Thought I saw Tanner comment on this yesterday, but it's disappeared. You can save an instruction by doing the alpha shuffle after the multiply.
My comment should be here, hopefully it hasn't actually disappeared #2359 (comment) 😅
Ha, yeah, it just got marked as outdated so it doesn't show up in the code view anymore. Saw it after I submitted the comments :)
I've implemented that now.
OK. I think I'm happy here unless there are any objections. The old Vector4 path is ~5.1x faster than before.
in Vector256<float> a,
in Vector256<float> b,
in Vector256<float> c)
Particularly since this is inlined, I wouldn't expect you to need in here (or on the other examples above).
It shouldn't actually make a difference since the JIT should inline things and elide the "address taken", but it's typically easier on the JIT if you don't mark them unnecessarily for "enregisterable" types.
IIRC, without the in the JIT could/would lose track of the fact that memory load operands could be contained when inlining these small helpers, at least in netcoreapp3.1. I don't know if or when that was fixed, but I'd double-check before removing them.
I haven't looked at the JIT but am seeing a 2% slowdown when running my benchmark when I remove them. Could be error but I'll leave them for now.
Changes look generally good/correct to me.
There's cleanup/simplification that can be done long term when you're able to move onto .NET 7 (or .NET 8 after that ships)
Vector4 result = default;
for (int i = 0; i < this.backdrop.Length; i++)
{
    result = PorterDuffFunctions.NormalSrcOver(this.backdrop[i], this.source[i], .5F);
The only concern with these benchmarks is that - unlike the real span code - they do not walk the source and amount buffers, meaning there are fewer cache misses.
I would have created a parametric benchmark (N=2,16,128,512,1024) that uses the Span APIs and run it against the code before/after the changes. It shouldn't significantly impact the result, but it would paint a more nuanced picture of the improvements.
Anyways, amazing job!
They actually used to walk those buffers. I was surprised by that, so I replicated it in the benchmark.
ImageSharp/src/ImageSharp/PixelFormats/PixelBlenders/DefaultPixelBlenders.Generated.tt
Lines 87 to 103 in f53c78d
protected override void BlendFunction(Span<Vector4> destination, ReadOnlySpan<Vector4> background, ReadOnlySpan<Vector4> source, float amount)
{
    amount = Numerics.Clamp(amount, 0, 1);
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = PorterDuffFunctions.<#=blender_composer#>(background[i], source[i], amount);
    }
}

/// <inheritdoc />
protected override void BlendFunction(Span<Vector4> destination, ReadOnlySpan<Vector4> background, ReadOnlySpan<Vector4> source, ReadOnlySpan<float> amount)
{
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = PorterDuffFunctions.<#=blender_composer#>(background[i], source[i], Numerics.Clamp(amount[i], 0, 1));
    }
}
I have realized that the benchmarks were actually slightly generous to the old approach, as I'd forgotten the clamp on amount.
Prerequisites
Description
Adds Vector256<float> implementations of the Porter-Duff functions to speed up our blending operations. Fixes #2340
- Adds the Vector256<float> implementations to PorterDuffFunctions.cs.
- Updates PorterDuffFunctions.Generated.tt to include the Vector256<float> variants.
- Updates DefaultPixelBlenders.Generated.tt to use the Vector256<float> variants when appropriate.
- RemoteExecutor
@antonfirsov @tannergooding @saucecontrol @br3aker I'll take any help you can offer pointing me in the right direction to finish porting these over.