
Add SSE41 version of Quantize block #1811

Merged
merged 12 commits into master from bp/quantizeblocksse on Nov 9, 2021
Conversation

@brianpopow (Collaborator)

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR adds an SSE41 version of the Quantize block, which is used during lossy encoding.
Related to #1786
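For context, here is a minimal scalar sketch of the per-coefficient quantization that the vector path parallelizes, based on libwebp's reference quantizer which this encoder ports. The constant names QFix and MaxLevel and the standalone helper are illustrative, not the PR's exact code:

using System;

internal static class QuantizeSketch
{
    private const int QFix = 17;       // fixed-point precision of the reciprocal (IQ) values
    private const int MaxLevel = 2047; // VP8 caps coefficient levels at 2047

    // level = ((|v| + sharpen) * iq + bias) >> QFix approximates |v| / q in pure
    // integer arithmetic; the SSE4.1 path computes a whole 16-coefficient block at once.
    public static short QuantizeCoeff(short v, ushort iq, uint bias, short sharpen)
    {
        int sign = v < 0 ? -1 : 1;
        uint coeff = (uint)(Math.Abs((int)v) + sharpen);
        int level = (int)((coeff * iq + bias) >> QFix);
        return (short)(sign * Math.Min(level, MaxLevel));
    }
}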

Profiling results

Before (scalar): [profiler screenshot: quantize_block]

After (SSE4.1): [profiler screenshot: quantize_block_sse]

codecov bot commented Nov 7, 2021

Codecov Report

Merging #1811 (55f8596) into master (ce7687b) will decrease coverage by 0.20%.
The diff coverage is 96.84%.


@@            Coverage Diff             @@
##           master    #1811      +/-   ##
==========================================
- Coverage   87.33%   87.13%   -0.21%     
==========================================
  Files         936      936              
  Lines       48085    48128      +43     
  Branches     6035     6037       +2     
==========================================
- Hits        41994    41935      -59     
- Misses       5092     5190      +98     
- Partials      999     1003       +4     
Flag Coverage Δ
unittests 87.13% <96.84%> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ImageSharp/Formats/Webp/Lossy/Vp8Encoder.cs 93.60% <ø> (-0.05%) ⬇️
...rc/ImageSharp/Formats/Webp/Lossy/Vp8SegmentInfo.cs 100.00% <ø> (ø)
src/ImageSharp/Formats/Webp/Lossy/QuantEnc.cs 96.95% <96.77%> (+0.25%) ⬆️
src/ImageSharp/Formats/Webp/Lossy/Vp8Matrix.cs 100.00% <100.00%> (ø)
...ageSharp/Formats/Webp/Lossless/PredictorEncoder.cs 89.22% <0.00%> (-9.08%) ⬇️
.../ImageSharp/Formats/Webp/Lossless/LosslessUtils.cs 88.68% <0.00%> (-8.86%) ⬇️
...rc/ImageSharp/Formats/Webp/Lossless/Vp8LEncoder.cs 97.51% <0.00%> (+0.12%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@antonfirsov (Member)

@brianpopow I wonder whether your profiler results are for a single image, or for a multi-image run triggering different code paths?

@brianpopow (Collaborator, Author) commented Nov 7, 2021

> @brianpopow I wonder whether your profiler results are for a single image, or for a multi-image run triggering different code paths?

No, it's just a single image encoded as lossy. Everything should be the same except the if (Sse41.IsSupported) check in QuantizeBlock, which I forced to false for the comparison run.
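Presumably the dispatch looks roughly like this (the split-out method names are illustrative, not the PR's actual code):

public static int QuantizeBlock(Span<short> input, Span<short> output, Vp8Matrix mtx)
{
    if (Sse41.IsSupported) // forced to false for the scalar comparison run
    {
        return QuantizeBlockSse41(input, output, mtx); // hypothetical name for the new path
    }

    return QuantizeBlockScalar(input, output, mtx); // hypothetical name for the existing path
}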

Edit: all method call counts should be exactly the same before and after (which they are not). I will try to re-run the test.

@brianpopow (Collaborator, Author)

I have re-run the profiling tests and I can replicate the numbers, but I have trouble explaining why there are a few more calls to QuantizeBlock now.
To make sure the quantization results are the same, I compared the SSE QuantizeBlock output against the scalar QuantizeBlock output during one encoding run, and they are exactly the same.

@antonfirsov (Member)

I also struggle to get consistent results from dotTrace these days; it definitely does something wrong, but I was never able to figure out what exactly, or why.

If you want an accurate comparison between the two, I'm afraid you'll have to move the SSE41 / scalar logic to separate methods and run them with BDN.
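A rough sketch of such a BDN (BenchmarkDotNet) comparison, assuming the logic is split into hypothetical QuantizeBlockScalar / QuantizeBlockSse41 methods on QuantEnc:

using System;
using BenchmarkDotNet.Attributes;

public class QuantizeBlockBenchmarks
{
    private short[] input;
    private short[] output;
    private Vp8Matrix mtx;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        this.input = new short[16];
        this.output = new short[16];
        for (int i = 0; i < this.input.Length; i++)
        {
            this.input[i] = (short)rng.Next(-255, 256);
        }

        this.mtx = new Vp8Matrix(); // a real run would fill Q, IQ, Bias and Sharpen first
    }

    [Benchmark(Baseline = true)]
    public int Scalar() => QuantEnc.QuantizeBlockScalar(this.input, this.output, this.mtx);

    [Benchmark]
    public int Sse41() => QuantEnc.QuantizeBlockSse41(this.input, this.output, this.mtx);
}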

@antonfirsov (Member) left a comment

Not strictly related to the PR, but you may also want to turn Vp8Matrix into a struct with all the Q, IQ, etc. fields being fixed-size buffers.

Helping cache locality by keeping the data close together in memory should alone bring a visible perf boost.
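Something along these lines, presumably (field names taken from the diff in this PR; the exact set of fields may differ):

// Hedged sketch: a blittable Vp8Matrix keeps all quantization tables contiguous,
// so one block's worth of Q/IQ/Bias/Sharpen likely shares a handful of cache lines.
public unsafe struct Vp8Matrix
{
    public fixed ushort Q[16];      // quantizer steps
    public fixed ushort IQ[16];     // reciprocals in fixed point
    public fixed uint Bias[16];     // rounding bias
    public fixed short Sharpen[16]; // frequency boosters for slight sharpening
}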

Comment on lines 47 to 53
public void QuantizeBlock_WithoutSSE2_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSE2);

[Fact]
public void QuantizeBlock_WithoutSSSE3_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSSE3);

[Fact]
public void QuantizeBlock_WithoutSSE2AndSSSE3_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSE2 | HwIntrinsics.DisableSSSE3);
antonfirsov (Member):

I don't see why we need all these variants.

brianpopow (Collaborator, Author):

You think it's enough to test only without SSE2? Couldn't the CPU support SSE2 but not SSSE3?

antonfirsov (Member):

What I meant is that the feature needs SSE4.1, so unless we really believe we will add SSE2 and SSSE3 versions of the algorithm later, I don't see why it is necessary to test switching off those particular instruction sets.

We should just switch off SSE4.1 or all intrinsics in general instead I think.

brianpopow (Collaborator, Author):

OK, I have changed it to test with and without intrinsics.

Comment on lines 540 to 545
fixed (ushort* mtxIqPtr = mtx.IQ)
fixed (ushort* mtxQPtr = mtx.Q)
fixed (uint* biasQPtr = mtx.Bias)
fixed (short* sharpenPtr = mtx.Sharpen)
fixed (short* inputPtr = input)
fixed (short* outputPtr = output)
antonfirsov (Member):

We can avoid pinning by loading with an unsafe conversion instead of LoadVector128:

Unsafe.As<ushort, Vector128<ushort>>(ref MemoryMarshal.GetReference(input));

You can even expose helper methods on Vp8Matrix to load these vectors in order to keep code simple here.
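A possible shape for such a helper, assuming IQ is a ushort[] (LoadIq is a hypothetical name; an untested sketch, shown as partial only for illustration):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public partial class Vp8Matrix
{
    // Reinterprets the first 8 IQ entries as one Vector128<ushort> without pinning;
    // no GC hole, because only managed references are ever held.
    public Vector128<ushort> LoadIq()
        => Unsafe.As<ushort, Vector128<ushort>>(ref MemoryMarshal.GetReference(this.IQ.AsSpan()));
}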

brianpopow (Collaborator, Author):

We still need to pin for the final Sse2.Store, right?

Comment on lines 623 to 627
fixed (short* outputPtr = output)
{
Sse2.Store(outputPtr, outZ0.AsInt16());
Sse2.Store(outputPtr + 8, outZ8.AsInt16());
}
antonfirsov (Member):

> We still need to pin for the final Sse2.Store, right?

Haven't checked if the following compiles & works, but since MemoryMarshal.GetReference and Unsafe.As return references, you can use them on the left side of an assignment:

Suggested change:

-fixed (short* outputPtr = output)
-{
-    Sse2.Store(outputPtr, outZ0.AsInt16());
-    Sse2.Store(outputPtr + 8, outZ8.AsInt16());
-}
+ref short outputRef = ref MemoryMarshal.GetReference(output);
+Unsafe.As<short, Vector128<short>>(ref outputRef) = outZ0.AsInt16();
+Unsafe.As<short, Vector128<short>>(ref Unsafe.Add(ref outputRef, 8)) = outZ8.AsInt16();

brianpopow (Collaborator, Author):

Yeah, this works 👍

@brianpopow (Collaborator, Author)

> Not strictly related to the PR, but you may also want to turn Vp8Matrix into a struct with all the Q, IQ, etc. fields being fixed-size buffers.
>
> Helping cache locality by keeping the data close together in memory should alone bring a visible perf boost.

I have tried changing Vp8Matrix into a struct with fixed-size buffers. To make it work, I had to make the Vp8Matrix fields in Vp8SegmentInfo public, to be able to access the fixed buffers. I probably should not use public fields, but what would be the right way to make this work without them?

@antonfirsov (Member) left a comment

> I probably should not use public fields, but what would be the right way to make this work without them?

Sometimes it's possible to juggle the fields in a way that keeps them private and yet avoids a perf regression. I'm not sure if that's the case here, but IMO it's a waste of time. I would recommend disabling the warning and moving on; StyleCop was not designed for performance-critical code.
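For example, with a targeted suppression around the offending fields (SA1401 is StyleCop's "fields should be private" rule; the field names below are guesses for illustration):

#pragma warning disable SA1401 // Fields should be private
public Vp8Matrix Y1; // luma quantization matrix (name assumed)
public Vp8Matrix Y2; // second-order luma matrix (name assumed)
public Vp8Matrix Uv; // chroma matrix (name assumed)
#pragma warning restore SA1401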

@@ -510,51 +527,150 @@ public static void RefineUsingDistortion(Vp8EncIterator it, Vp8SegmentInfo[] seg
[MethodImpl(InliningOptions.ShortMethod)]
public static int Quantize2Blocks(Span<short> input, Span<short> output, Vp8Matrix mtx)
antonfirsov (Member) commented Nov 9, 2021:

You need to pass Vp8Matrix by reference now to avoid a full stack copy.

brianpopow (Collaborator, Author):

Ah yes, good spot.

-int nz = QuantizeBlock(input, output, mtx) << 0;
-nz |= QuantizeBlock(input.Slice(1 * 16), output.Slice(1 * 16), mtx) << 1;
+int nz = QuantizeBlock(input.Slice(0, 16), output.Slice(0, 16), mtx) << 0;
+nz |= QuantizeBlock(input.Slice(1 * 16, 16), output.Slice(1 * 16, 16), mtx) << 1;
 return nz;
 }

public static int QuantizeBlock(Span<short> input, Span<short> output, Vp8Matrix mtx)
antonfirsov (Member):

Here too: ref Vp8Matrix mtx
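Taken together with the previous comment, the resulting shape would be roughly this (a sketch mirroring the diff above, not the exact final code):

[MethodImpl(InliningOptions.ShortMethod)]
public static int Quantize2Blocks(Span<short> input, Span<short> output, ref Vp8Matrix mtx)
{
    // Passing the (now struct) matrix by ref avoids copying all of its
    // fixed-size tables on every call.
    int nz = QuantizeBlock(input.Slice(0, 16), output.Slice(0, 16), ref mtx) << 0;
    nz |= QuantizeBlock(input.Slice(1 * 16, 16), output.Slice(1 * 16, 16), ref mtx) << 1;
    return nz;
}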

@antonfirsov (Member) left a comment

Looks good!

@brianpopow brianpopow merged commit 7495a91 into master Nov 9, 2021
@brianpopow brianpopow deleted the bp/quantizeblocksse branch November 9, 2021 13:54