Fix and normalize Vector4 UnPremultiply #2369
Conversation
Floating-point division is quite a bit slower than multiplication (although not quite as bad as integral division).
Edit: I had looked at the question and sample asm without looking at the new implementation. Looks like you're already doing this, just without also removing the duplicate memory access. This is notably another case where you'd probably see some benefit in using the `Vector128` APIs.

What you have now is roughly:

```asm
C.Premultiply(System.Numerics.Vector4 ByRef)
L0000: vzeroupper                        ;
L0003: vmovss xmm0, [rcx+0xc]            ; 4-7 cycles; port 2 or 3
L0008: vmovupd xmm1, [rcx]               ; 4-7 cycles; port 2 or 3
L000c: vbroadcastss xmm2, xmm0           ; 1 cycle; port 5
L0011: vmulps xmm1, xmm1, xmm2           ; 4 cycles; port 0 or 1
L0015: vmovupd [rcx], xmm1               ; 4-10 cycles; port 2 or 3 or 7 + port 4
L0019: vmovss [rcx+0xc], xmm0            ; 4-10 cycles; port 2 or 3 or 7 + port 4
L001e: ret
```

Given dependencies, parallel dispatch, and other bits, this will take around … However, if you change the implementation to something like:

```csharp
private static void Premultiply1(ref Vector4 source)
{
    var vsource = source.AsVector128();
    Vector4 w = PermuteW(vsource);
    vsource *= w;
    source = WithW(vsource, w);
}
```

We instead generate:

```asm
L0000: vzeroupper                        ;
L0003: vmovupd xmm0, [rcx]               ; 4-7 cycles; port 2 or 3
L0007: vpermilps xmm1, xmm0, 0xff        ; 1 cycle; port 5
L000d: vmulps xmm0, xmm0, xmm1           ; 4 cycles; port 0 or 1
L0011: vinsertps xmm0, xmm0, xmm1, 0xf0  ; 1 cycle; port 5
L0017: vmovupd [rcx], xmm0               ; 4-10 cycles; port 2 or 3 or 7 + port 4
L001b: ret
```

Which is around …
```csharp
Vector4 alpha = PermuteW(source);
source = WithW(source * alpha, alpha);
```
You'll want to load `source` into a local here.
```diff
- Vector4 alpha = PermuteW(source);
- source = WithW(source * alpha, alpha);
+ Vector4 src = source;
+ Vector4 alpha = PermuteW(src);
+ source = WithW(src * alpha, alpha);
```
The reasoning is that since `source` is a `ref`, each access ends up coming from memory. However, we're using the value multiple times (twice given the implementation), and so accessing it twice is unnecessary.
Since it's `AggressiveInlining`, it's possible the JIT is already optimizing this almost equivalently. However, it might also make it more likely to spill the value, since it's "address taken" with multiple uses, so it's probably better to be "safe".
```csharp
if (alpha == Vector4.Zero)
{
    return;
}

// Divide source by alpha if alpha is nonzero, otherwise set all components to match the source value
// Blend the result with the alpha vector to ensure that the alpha component is unchanged
source = WithW(source / alpha, alpha);
```
Branching in SIMD code is "generally" undesirable. It's probably worth comparing this to an implementation that does `CompareEqual` + `Blend` instead (much like you're already doing for the `Avx2` path) to see which is better.

You should be able to use `AsVector128()` again, with `Sse.CompareEqual` and `Sse41.BlendVariable`.
If `Sse41` isn't supported (which is unlikely, since it's been around since 2007), you can replace it with `Sse.Or(Sse.And(left, mask), Sse.AndNot(mask, right))` -- that is, `(left & mask) | (right & ~mask)`. Note that `Sse.AndNot(x, y)` does `~x & y`, despite its name implying `x & ~y`; a weird quirk of the instruction.
`left` here represents the value to use if the comparison result (`mask`) was true for that element; `right`, correspondingly, is the value to use if it was false.
Would you believe the version I had was actually faster!?
UnPremultiply using Vector4 and branching:

| Method | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|
| UnPremultiplyBaseline | 1.545 us | 0.0070 us | 0.0066 us | 1.00 |
| UnPremultiply | 1.827 us | 0.0039 us | 0.0037 us | 1.18 |
| UnPremultiplyBulk | 1.267 us | 0.0033 us | 0.0027 us | 0.82 |

UnPremultiply using Sse41:

| Method | Mean | Error | StdDev | Ratio |
|---|---|---|---|---|
| UnPremultiplyBaseline | 1.546 us | 0.0077 us | 0.0068 us | 1.00 |
| UnPremultiply | 2.058 us | 0.0043 us | 0.0036 us | 1.33 |
| UnPremultiplyBulk | 1.268 us | 0.0025 us | 0.0022 us | 0.82 |
I'd believe it.
The general reason to avoid branching is that it often slows down the overall operation. However, the exact difference depends on the scenario being tested and how frequently that branch will be hit as compared to the cost of computing and blending both paths.
If I'm reading your table right, then in this case they look practically identical, with a little less error and deviation for blending in `Bulk`. But the branching ends up winning out for single operations.
Yeah. We'd only ever use it for single operations or for overlaps in the AVX run.

Thanks for all the info there, much appreciated. What I'm seeing, though, is that the division is faster. A lot faster: 1.538 us vs 31.17 us. It makes no sense to me.

Ah yes. I misread the numbers initially. This isn't clear to me either, and I'd probably end up doing a deeper profile. My gut instinct would be a difference in input data or in how BDN is measuring the two profiles.
Description

Our existing `UnPremultiply` methods suffered from an issue in that they allowed dividing values by zero. This was masked by our tests, since the transform to byte would lead to zero in most cases, but it could cause issues when chaining certain operations requiring premultiplication using single precision pixel formats, since interim values for the XYZ components would be ∞.

The Porter-Duff operators contained a bad fix for this by dividing by the max of the W component vs an epsilon of 0.001F. This would lead to out-of-range values if the X, Y, Z components had a non-zero value. This was generally masked by clamping, but again could cause issues for single precision pixel formats.

The fix is simple: compare the alpha component to zero and return the source value if so. We return the source rather than an explicit zeroed value (`Vector4.Zero`) because it is possible to have an alpha of zero but non-zero RGB component values; this allows us to maintain the integrity of those values. (Note the reference decoder makes the same mistake we did before this fix.)

The fix does not affect the performance of `UnPremultiply`. Migrating the Porter-Duff operators to use the fixed method, however, yields an additional 1.55x performance boost over the work done in #2359. This brings an overall performance boost of 14.4x over our v2 version of the operators.
Benchmark results (Main vs PR) were included for: Porter-Duff NormalSrcOver, PremultiplyVector4, and UnPremultiplyVector4.
One thing I do not understand, though, is why there is such a difference between `Premultiply` vs `UnPremultiply`. The baseline implementations in the benchmark use virtually identical code and generate virtually identical ASM.
@tannergooding would you have any insight?