Add SSE41 version of Quantize block #1811
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master    #1811      +/-   ##
==========================================
- Coverage   87.33%   87.13%   -0.21%
==========================================
  Files         936      936
  Lines       48085    48128      +43
  Branches     6035     6037       +2
==========================================
- Hits        41994    41935      -59
- Misses       5092     5190      +98
- Partials      999     1003       +4
```
@brianpopow I wonder whether your profiler results are for a single image, or for a multi-image run triggering different code paths?
No, it's just a single image encoded as lossy. All should be the same except the part in QuantizeBlock. Edit: all method call counts should be exactly the same before and after (which they are not). I will try to re-run the test.
I have re-run the profiling tests and I can replicate the numbers, but I have trouble explaining why there are a few more calls to
I also struggle to get consistent results from dotTrace these days; it definitely does some things wrong, but I was never able to figure out what exactly, or why. If you want an accurate comparison between the two, I'm afraid you'll have to move the SSE41 / scalar logic into separate methods and run them with BenchmarkDotNet (BDN).
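For reference, a minimal BenchmarkDotNet comparison along those lines could look like this (a sketch only: the method names and bodies below are placeholders, not the actual PR code):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Sketch: compare the scalar and SSE4.1 paths once they are split into
// separate methods. The quantize implementations below are stubs.
public class QuantizeBlockBenchmark
{
    private readonly short[] input = new short[16];
    private readonly short[] output = new short[16];

    [Benchmark(Baseline = true)]
    public int Scalar() => QuantizeBlockScalar(this.input, this.output);

    [Benchmark]
    public int Sse41() => QuantizeBlockSse41(this.input, this.output);

    // Stubs standing in for the real scalar / SSE4.1 implementations.
    private static int QuantizeBlockScalar(short[] input, short[] output) => 0;
    private static int QuantizeBlockSse41(short[] input, short[] output) => 0;

    public static void Main() => BenchmarkRunner.Run<QuantizeBlockBenchmark>();
}
```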
Not strictly related to the PR, but you may also want to turn `Vp8Matrix` into a struct with all the `Q`, `IQ`, etc. fields being fixed-size buffers. Helping cache coherence by keeping the data close together in memory should alone bring some visible perf boost.
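A rough sketch of that shape (field names are taken from this thread; the 16-entry sizes are an assumption, one per coefficient of a 4x4 block):

```csharp
// Sketch only: Vp8Matrix as an unsafe struct with fixed-size buffers so
// all quantizer data sits contiguously in memory.
public unsafe struct Vp8Matrix
{
    public fixed ushort Q[16];      // quantizer steps
    public fixed ushort IQ[16];     // reciprocals (fixed point)
    public fixed uint Bias[16];     // rounding bias
    public fixed short Sharpen[16]; // frequency boosters
}
```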
```csharp
[Fact]
public void QuantizeBlock_WithoutSSE2_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSE2);

[Fact]
public void QuantizeBlock_WithoutSSSE3_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSSE3);

[Fact]
public void QuantizeBlock_WithoutSSE2AndSSSE3_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSE2 | HwIntrinsics.DisableSSSE3);
```
I don't see why we need all these variants.
You think it's enough to test only without SSE2? Can it not be that the CPU supports SSE2, but not SSSE3?
What I meant is that the feature needs SSE4.1, so unless we really believe we will add SSE2 and SSSE3 versions of the algorithm later, I don't see why it is necessary to test switching off those particular instruction sets. We should just switch off SSE4.1, or all intrinsics in general, instead I think.
Ok, I have changed it to test with and without intrinsics.
```csharp
fixed (ushort* mtxIqPtr = mtx.IQ)
fixed (ushort* mtxQPtr = mtx.Q)
fixed (uint* biasQPtr = mtx.Bias)
fixed (short* sharpenPtr = mtx.Sharpen)
fixed (short* inputPtr = input)
fixed (short* outputPtr = output)
```
We can avoid pinning by loading with an unsafe conversion instead of `LoadVector128`:

```csharp
Vector128<ushort> v = Unsafe.As<ushort, Vector128<ushort>>(ref MemoryMarshal.GetReference(input));
```

You can even expose helper methods on `Vp8Matrix` to load these vectors, in order to keep the code here simple.
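Such a helper might look roughly like this (a sketch under the assumption that `IQ` is a plain `ushort[]`; the method name `LoadIqVector` is made up for illustration):

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public class Vp8Matrix
{
    public ushort[] IQ = new ushort[16]; // assumed layout

    // Reinterprets the first 8 IQ entries as a Vector128<ushort>.
    // No pinning required, since no raw pointer is ever produced:
    // the managed reference stays tracked by the GC for the duration.
    public Vector128<ushort> LoadIqVector()
        => Unsafe.As<ushort, Vector128<ushort>>(ref this.IQ[0]);
}
```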
We still need to pin for the final `Sse2.Store`, right?
```csharp
fixed (short* outputPtr = output)
{
    Sse2.Store(outputPtr, outZ0.AsInt16());
    Sse2.Store(outputPtr + 8, outZ8.AsInt16());
}
```
> We still need to pin for the final `Sse2.Store`, right?

Haven't checked if the following compiles & works, but since `MemoryMarshal.GetReference` and `Unsafe.As` return references, you can use them on the left side of an assignment:

```diff
-fixed (short* outputPtr = output)
-{
-    Sse2.Store(outputPtr, outZ0.AsInt16());
-    Sse2.Store(outputPtr + 8, outZ8.AsInt16());
-}
+ref short outputRef = ref MemoryMarshal.GetReference(output);
+Unsafe.As<short, Vector128<short>>(ref outputRef) = outZ0.AsInt16();
+Unsafe.As<short, Vector128<short>>(ref Unsafe.Add(ref outputRef, 8)) = outZ8.AsInt16();
```
Yeah, this works 👍

I have tried to change the
I probably should not have used public fields, but what would be the right way here to make that work without them?

Sometimes it's possible to juggle the fields in a way that keeps them private and yet avoids a perf regression; not sure if that's the case here, but IMO it's a waste of time. I would recommend disabling the warning and moving on. StyleCop was not designed for performance-critical code.
```diff
@@ -510,51 +527,150 @@ public static void RefineUsingDistortion(Vp8EncIterator it, Vp8SegmentInfo[] seg
 [MethodImpl(InliningOptions.ShortMethod)]
 public static int Quantize2Blocks(Span<short> input, Span<short> output, Vp8Matrix mtx)
```
You need to pass `Vp8Matrix` by reference now, to avoid a full stack copy.
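To illustrate the difference with a hypothetical struct (the names below are illustrative, not from the PR):

```csharp
// A stand-in for a large struct like Vp8Matrix.
public struct BigMatrix
{
    public int A, B, C, D; // imagine many more fields; copying is costly
}

public static class Example
{
    // By value: the entire struct is copied onto the callee's stack.
    public static int ByValue(BigMatrix m) => m.A;

    // By reference: only a pointer-sized reference is passed, no copy.
    public static int ByRef(ref BigMatrix m) => m.A;

    // 'in' also avoids the copy and documents read-only intent.
    public static int ByIn(in BigMatrix m) => m.A;
}
```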
Ah yes, good spot.
```diff
-int nz = QuantizeBlock(input, output, mtx) << 0;
-nz |= QuantizeBlock(input.Slice(1 * 16), output.Slice(1 * 16), mtx) << 1;
+int nz = QuantizeBlock(input.Slice(0, 16), output.Slice(0, 16), mtx) << 0;
+nz |= QuantizeBlock(input.Slice(1 * 16, 16), output.Slice(1 * 16, 16), mtx) << 1;
 return nz;
 }

 public static int QuantizeBlock(Span<short> input, Span<short> output, Vp8Matrix mtx)
```
Here too: `ref Vp8Matrix mtx`.
Looks good!
Prerequisites
Description
This PR adds an SSE4.1 version of QuantizeBlock, which is used during lossy encoding.
Related to #1786
Profiling results
Before: (profiler screenshot)
After: (profiler screenshot)