Fix and normalize Vector4 UnPremultiply #2369

Merged: 7 commits, Feb 26, 2023
115 changes: 92 additions & 23 deletions src/ImageSharp/Common/Helpers/Numerics.cs
@@ -474,21 +474,10 @@ private static void ClampImpl<T>(Span<T> span, T min, T max)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Premultiply(ref Vector4 source)
{
float w = source.W;
source *= w;
source.W = w;
}

/// <summary>
/// Reverses the result of premultiplying a vector via <see cref="Premultiply(ref Vector4)"/>.
/// </summary>
/// <param name="source">The <see cref="Vector4"/> to premultiply</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void UnPremultiply(ref Vector4 source)
{
float w = source.W;
source /= w;
source.W = w;
// Load into a local variable to prevent accessing the source from memory multiple times.
Vector4 src = source;
Vector4 alpha = PermuteW(src);
source = WithW(src * alpha, alpha);
}

/// <summary>
@@ -498,7 +487,7 @@ public static void UnPremultiply(ref Vector4 source)
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void Premultiply(Span<Vector4> vectors)
{
if (Avx2.IsSupported && vectors.Length >= 2)
if (Avx.IsSupported && vectors.Length >= 2)
{
// Divide by 2 as 4 elements per Vector4 and 8 per Vector256<float>
ref Vector256<float> vectorsBase = ref Unsafe.As<Vector4, Vector256<float>>(ref MemoryMarshal.GetReference(vectors));
@@ -507,8 +496,8 @@ public static void Premultiply(Span<Vector4> vectors)
while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
{
Vector256<float> source = vectorsBase;
Vector256<float> multiply = Avx.Shuffle(source, source, ShuffleAlphaControl);
vectorsBase = Avx.Blend(Avx.Multiply(source, multiply), source, BlendAlphaControl);
Vector256<float> alpha = Avx.Permute(source, ShuffleAlphaControl);
vectorsBase = Avx.Blend(Avx.Multiply(source, alpha), source, BlendAlphaControl);
vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
}

@@ -532,24 +521,49 @@ public static void Premultiply(Span<Vector4> vectors)
}
}

/// <summary>
/// Reverses the result of premultiplying a vector via <see cref="Premultiply(ref Vector4)"/>.
/// </summary>
/// <param name="source">The <see cref="Vector4"/> to premultiply</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void UnPremultiply(ref Vector4 source)
{
Vector4 alpha = PermuteW(source);
UnPremultiply(ref source, alpha);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void UnPremultiply(ref Vector4 source, Vector4 alpha)
{
if (alpha == Vector4.Zero)
{
return;
}

// Divide source by alpha if alpha is nonzero, otherwise set all components to match the source value
// Blend the result with the alpha vector to ensure that the alpha component is unchanged
source = WithW(source / alpha, alpha);
Comment on lines +538 to +545

@tannergooding (Contributor) commented on Feb 24, 2023:
Branching in SIMD code is "generally" undesirable.

It's probably worth comparing this to an implementation that does CompareEqual + Blend instead (much like you're already doing for the Avx2 path) to see which is better.

You should be able to use the AsVector128() again. With Sse.CompareEqual and using Sse41.BlendVariable.

If Sse41 isn't supported (which is unlikely since it's been around since 2007), you can replace it with Sse.Or(Sse.And(left, mask), Sse.AndNot(mask, right))

-- That is: (left & mask) | (right & ~mask)
-- Sse.AndNot(x, y) does ~x & y, despite its name implying x & ~y; a weird quirk of the instruction
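The fallback blend described above can be modeled lane by lane. The sketch below is Python, not SSE, using integers as 32-bit lanes; `blend_variable` is an illustrative name for the `(left & mask) | (right & ~mask)` idiom, with the second term written as `~mask & right` to mirror the `Sse.AndNot(mask, right)` quirk:

```python
# Scalar model of the SSE fallback blend: each mask lane is all-ones
# (0xFFFFFFFF) to select the `left` lane, or all-zeros to select `right`.
ALL_ONES = 0xFFFFFFFF

def blend_variable(left, right, mask):
    """Per-lane select: (left & mask) | (~mask & right)."""
    return [(l & m) | ((~m & ALL_ONES) & r)
            for l, r, m in zip(left, right, mask)]

left = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
right = [0xAAAAAAAA, 0xBBBBBBBB, 0xCCCCCCCC, 0xDDDDDDDD]
mask = [ALL_ONES, 0, ALL_ONES, 0]
print([hex(x) for x in blend_variable(left, right, mask)])
```

Each mask lane must be all-ones or all-zeros, which is exactly what a SIMD compare such as `Sse.CompareEqual` produces.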

@tannergooding (Contributor) added:

-- left here represents the value to use if the comparison result (mask) was true for that element
-- right correspondingly is the value to use if it was false

@JimBobSquarePants (Member, Author) replied on Feb 24, 2023:

Would you believe the version I had was actually faster!?

UnPremultiply using Vector4 and branching.

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|----------|-----------|-----------|-------|
| UnPremultiplyBaseline | 1.545 us | 0.0070 us | 0.0066 us | 1.00  |
| UnPremultiply         | 1.827 us | 0.0039 us | 0.0037 us | 1.18  |
| UnPremultiplyBulk     | 1.267 us | 0.0033 us | 0.0027 us | 0.82  |

UnPremultiply using Sse41.

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|----------|-----------|-----------|-------|
| UnPremultiplyBaseline | 1.546 us | 0.0077 us | 0.0068 us | 1.00  |
| UnPremultiply         | 2.058 us | 0.0043 us | 0.0036 us | 1.33  |
| UnPremultiplyBulk     | 1.268 us | 0.0025 us | 0.0022 us | 0.82  |
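For reference, the per-pixel math these benchmarks exercise is small enough to sketch outright. This is a Python model of the scalar operation, not ImageSharp code: premultiply scales the color channels by alpha, and the fixed unpremultiply leaves fully transparent pixels untouched instead of dividing by zero:

```python
def premultiply(r, g, b, a):
    # Scale the color channels by alpha; alpha itself is unchanged.
    return (r * a, g * a, b * a, a)

def unpremultiply(r, g, b, a):
    # The fix under review: zero alpha would divide by zero, so
    # fully transparent pixels are returned unchanged.
    if a == 0:
        return (r, g, b, a)
    return (r / a, g / a, b / a, a)

# Round trip: unpremultiply reverses premultiply for nonzero alpha.
print(unpremultiply(*premultiply(0.5, 0.25, 1.0, 0.5)))  # (0.5, 0.25, 1.0, 0.5)
```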

@tannergooding (Contributor) replied:

I'd believe it.

The general reason to avoid branching is that it often slows down the overall operation. However, the exact difference depends on the scenario being tested and how frequently that branch will be hit as compared to the cost of computing and blending both paths.

If I'm reading your table right, then in this case, they look practically identical with a little less error and deviation for blending in Bulk. But the branching ends up winning out for single operations.
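The two strategies being compared can be modeled side by side. In this Python sketch (function names are illustrative), the branchy form early-outs on zero alpha as the scalar path does, while the blend form computes the quotient for every lane and then selects per a zero mask, as the AVX path does with `CompareEqual` plus `BlendVariable`:

```python
def unpremultiply_branchy(src, alpha):
    # Early-out on fully transparent pixels, as in the scalar Vector4 path.
    if alpha == 0:
        return src
    return [c / alpha for c in src]

def unpremultiply_blend(src, alpha):
    # Branchless form: compute the division unconditionally (guarded here
    # only because Python raises where SIMD hardware would not), then
    # select the source lanes wherever the alpha compared equal to zero.
    divided = [c / alpha if alpha != 0 else 0.0 for c in src]
    mask_zero = (alpha == 0)
    return [s if mask_zero else d for s, d in zip(src, divided)]

for alpha in (0.0, 0.25, 1.0):
    assert unpremultiply_branchy([0.5, 0.25, 0.125], alpha) == \
           unpremultiply_blend([0.5, 0.25, 0.125], alpha)
```

Both forms agree on every input; the benchmarks above measure only which one is cheaper per call.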

@JimBobSquarePants (Member, Author) replied:

Yeah. We'd only ever use it for single operations or for overlaps in the AVX run.

}

/// <summary>
/// Bulk variant of <see cref="UnPremultiply(ref Vector4)"/>
/// </summary>
/// <param name="vectors">The span of vectors</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void UnPremultiply(Span<Vector4> vectors)
{
if (Avx2.IsSupported && vectors.Length >= 2)
if (Avx.IsSupported && vectors.Length >= 2)
{
// Divide by 2 as 4 elements per Vector4 and 8 per Vector256<float>
ref Vector256<float> vectorsBase = ref Unsafe.As<Vector4, Vector256<float>>(ref MemoryMarshal.GetReference(vectors));
ref Vector256<float> vectorsLast = ref Unsafe.Add(ref vectorsBase, (IntPtr)((uint)vectors.Length / 2u));
Vector256<float> epsilon = Vector256.Create(Constants.Epsilon);

while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
{
Vector256<float> source = vectorsBase;
Vector256<float> multiply = Avx.Shuffle(source, source, ShuffleAlphaControl);
vectorsBase = Avx.Blend(Avx.Divide(source, multiply), source, BlendAlphaControl);
Vector256<float> alpha = Avx.Permute(source, ShuffleAlphaControl);
vectorsBase = UnPremultiply(source, alpha);
vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
}

@@ -573,6 +587,61 @@ public static void UnPremultiply(Span<Vector4> vectors)
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256<float> UnPremultiply(Vector256<float> source, Vector256<float> alpha)
{
// Check if alpha is zero to avoid division by zero
Vector256<float> zeroMask = Avx.CompareEqual(alpha, Vector256<float>.Zero);

// Divide source by alpha if alpha is nonzero, otherwise set all components to match the source value
Vector256<float> result = Avx.BlendVariable(Avx.Divide(source, alpha), source, zeroMask);

// Blend the result with the alpha vector to ensure that the alpha component is unchanged
return Avx.Blend(result, alpha, BlendAlphaControl);
}

/// <summary>
/// Permutes the given vector, returning a new instance with all the values set to <see cref="Vector4.W"/>.
/// </summary>
/// <param name="value">The vector.</param>
/// <returns>The <see cref="Vector4"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4 PermuteW(Vector4 value)
{
if (Sse.IsSupported)
{
return Sse.Shuffle(value.AsVector128(), value.AsVector128(), ShuffleAlphaControl).AsVector4();
}

return new(value.W);
}

/// <summary>
/// Sets the W component of the given vector <paramref name="value"/> to the given value from <paramref name="w"/>.
/// </summary>
/// <param name="value">The vector to set.</param>
/// <param name="w">The vector containing the W value.</param>
/// <returns>The <see cref="Vector4"/>.</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4 WithW(Vector4 value, Vector4 w)
{
if (Sse41.IsSupported)
{
return Sse41.Insert(value.AsVector128(), w.AsVector128(), 0b11_11_0000).AsVector4();
}

if (Sse.IsSupported)
{
// Create tmp as <w[3], w[0], value[2], value[0]>
// Then return <value[0], value[1], tmp[2], tmp[0]> (which is <value[0], value[1], value[2], w[3]>)
Vector128<float> tmp = Sse.Shuffle(w.AsVector128(), value.AsVector128(), 0b00_10_00_11);
return Sse.Shuffle(value.AsVector128(), tmp, 0b00_10_01_00).AsVector4();
}

value.W = w.W;
return value;
}
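The two-shuffle `WithW` fallback above is easy to check against a scalar model of `Sse.Shuffle`. In this Python sketch (`sse_shuffle` and `with_w` are illustrative names), the low two result lanes come from the first operand and the high two from the second, each chosen by a 2-bit field of the control byte:

```python
def sse_shuffle(a, b, control):
    # Model of Sse.Shuffle(a, b, control):
    # result = <a[c0], a[c1], b[c2], b[c3]> where c_i is bits 2i..2i+1.
    idx = [(control >> (2 * i)) & 0b11 for i in range(4)]
    return [a[idx[0]], a[idx[1]], b[idx[2]], b[idx[3]]]

def with_w(value, w):
    # The two-shuffle fallback from the diff above.
    tmp = sse_shuffle(w, value, 0b00_10_00_11)     # <w[3], w[0], value[2], value[0]>
    return sse_shuffle(value, tmp, 0b00_10_01_00)  # <value[0], value[1], value[2], w[3]>

print(with_w([1.0, 2.0, 3.0, 4.0], [9.0, 9.0, 9.0, 0.5]))  # [1.0, 2.0, 3.0, 0.5]
```

Working through the control bytes confirms the comments in the diff: the intermediate `tmp` carries `w[3]` in lane 0 and `value[2]` in lane 2, and the second shuffle reassembles `<value[0], value[1], value[2], w[3]>`.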

/// <summary>
/// Calculates the cube pow of all the XYZ channels of the input vectors.
/// </summary>
@@ -586,7 +655,7 @@ public static unsafe void CubePowOnXYZ(Span<Vector4> vectors)
while (Unsafe.IsAddressLessThan(ref baseRef, ref endRef))
{
Vector4 v = baseRef;
float a = v.W;
Vector4 a = PermuteW(v);

// Fast path for the default gamma exposure, which is 3. In this case we can skip
// calling Math.Pow 3 times (one per component), as the method is an internal call and
@@ -595,7 +664,7 @@ public static unsafe void CubePowOnXYZ(Span<Vector4> vectors)
// back to the target index in the temporary span. The whole iteration will get completely
// inlined and translated into vectorized instructions, with much better performance.
v = v * v * v;
v.W = a;
v = WithW(v, a);

baseRef = v;
baseRef = ref Unsafe.Add(ref baseRef, 1);
5 changes: 1 addition & 4 deletions src/ImageSharp/Common/Helpers/Shuffle/IComponentShuffle.cs
@@ -47,10 +47,7 @@ internal interface IShuffle4 : IComponentShuffle
internal readonly struct DefaultShuffle4 : IShuffle4
{
public DefaultShuffle4(byte control)
{
DebugGuard.MustBeBetweenOrEqualTo<byte>(control, 0, 3, nameof(control));
this.Control = control;
}
=> this.Control = control;

public byte Control { get; }

5 changes: 1 addition & 4 deletions src/ImageSharp/Common/Helpers/Shuffle/IPad3Shuffle4.cs
@@ -15,10 +15,7 @@ internal interface IPad3Shuffle4 : IComponentShuffle
internal readonly struct DefaultPad3Shuffle4 : IPad3Shuffle4
{
public DefaultPad3Shuffle4(byte control)
{
DebugGuard.MustBeBetweenOrEqualTo<byte>(control, 0, 3, nameof(control));
this.Control = control;
}
=> this.Control = control;

public byte Control { get; }

5 changes: 1 addition & 4 deletions src/ImageSharp/Common/Helpers/Shuffle/IShuffle3.cs
@@ -15,10 +15,7 @@ internal interface IShuffle3 : IComponentShuffle
internal readonly struct DefaultShuffle3 : IShuffle3
{
public DefaultShuffle3(byte control)
{
DebugGuard.MustBeBetweenOrEqualTo<byte>(control, 0, 3, nameof(control));
this.Control = control;
}
=> this.Control = control;

public byte Control { get; }

5 changes: 1 addition & 4 deletions src/ImageSharp/Common/Helpers/Shuffle/IShuffle4Slice3.cs
@@ -15,10 +15,7 @@ internal interface IShuffle4Slice3 : IComponentShuffle
internal readonly struct DefaultShuffle4Slice3 : IShuffle4Slice3
{
public DefaultShuffle4Slice3(byte control)
{
DebugGuard.MustBeBetweenOrEqualTo<byte>(control, 0, 3, nameof(control));
this.Control = control;
}
=> this.Control = control;

public byte Control { get; }
