
Add SSE41 version of Quantize block #1811

Merged
merged 12 commits into master from bp/quantizeblocksse on Nov 9, 2021
Conversation

@brianpopow (Collaborator)

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR adds an SSE41 version of the Quantize block, which is used during lossy encoding.
Related to #1786
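For context, here is a minimal scalar sketch of the per-coefficient quantization that the vector path parallelizes, based on libwebp's reference quantizer which this encoder ports. The constant names QFix and MaxLevel and the standalone helper are illustrative, not the PR's exact code:

using System;

internal static class QuantizeSketch
{
    private const int QFix = 17;       // fixed-point precision of the reciprocal (IQ) values
    private const int MaxLevel = 2047; // VP8 caps coefficient levels at 2047

    // level = ((|v| + sharpen) * iq + bias) >> QFix approximates |v| / q in pure
    // integer arithmetic; the SSE4.1 path computes a whole 16-coefficient block at once.
    public static short QuantizeCoeff(short v, ushort iq, uint bias, short sharpen)
    {
        int sign = v < 0 ? -1 : 1;
        uint coeff = (uint)(Math.Abs((int)v) + sharpen);
        int level = (int)((coeff * iq + bias) >> QFix);
        return (short)(sign * Math.Min(level, MaxLevel));
    }
}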

Profiling results

Before (scalar): [profiler screenshot: quantize_block]

After (SSE4.1): [profiler screenshot: quantize_block_sse]

codecov bot commented Nov 7, 2021

Codecov Report

Merging #1811 (55f8596) into master (ce7687b) will decrease coverage by 0.20%.
The diff coverage is 96.84%.


@@            Coverage Diff             @@
##           master    #1811      +/-   ##
==========================================
- Coverage   87.33%   87.13%   -0.21%     
==========================================
  Files         936      936              
  Lines       48085    48128      +43     
  Branches     6035     6037       +2     
==========================================
- Hits        41994    41935      -59     
- Misses       5092     5190      +98     
- Partials      999     1003       +4     
Flag Coverage Δ
unittests 87.13% <96.84%> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/ImageSharp/Formats/Webp/Lossy/Vp8Encoder.cs 93.60% <ø> (-0.05%) ⬇️
...rc/ImageSharp/Formats/Webp/Lossy/Vp8SegmentInfo.cs 100.00% <ø> (ø)
src/ImageSharp/Formats/Webp/Lossy/QuantEnc.cs 96.95% <96.77%> (+0.25%) ⬆️
src/ImageSharp/Formats/Webp/Lossy/Vp8Matrix.cs 100.00% <100.00%> (ø)
...ageSharp/Formats/Webp/Lossless/PredictorEncoder.cs 89.22% <0.00%> (-9.08%) ⬇️
.../ImageSharp/Formats/Webp/Lossless/LosslessUtils.cs 88.68% <0.00%> (-8.86%) ⬇️
...rc/ImageSharp/Formats/Webp/Lossless/Vp8LEncoder.cs 97.51% <0.00%> (+0.12%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@antonfirsov (Member)

@brianpopow I wonder whether your profiler results are for a single image, or for a multi-image run triggering different code paths?

@brianpopow (Collaborator, Author) commented Nov 7, 2021

> @brianpopow I wonder whether your profiler results are for a single image, or for a multi-image run triggering different code paths?

No, it's just a single image encoded as lossy. Everything should be the same except the if (Sse41.IsSupported) check in QuantizeBlock, which I forced to false for the comparison run.
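Presumably the dispatch looks roughly like this (the split-out method names are illustrative, not the PR's actual code):

public static int QuantizeBlock(Span<short> input, Span<short> output, Vp8Matrix mtx)
{
    if (Sse41.IsSupported) // forced to false for the scalar comparison run
    {
        return QuantizeBlockSse41(input, output, mtx); // hypothetical name for the new path
    }

    return QuantizeBlockScalar(input, output, mtx); // hypothetical name for the existing path
}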

Edit: all method call counts should be exactly the same before and after (which they are not). I will try to re-run the test.

@brianpopow (Collaborator, Author)

I have re-run the profiling tests and I can replicate the numbers, but I have trouble explaining why there are a few more calls to QuantizeBlock now.
To make sure the quantization results are the same, I compared the SSE QuantizeBlock output against the scalar QuantizeBlock output during one encoding run, and they are exactly the same.

@antonfirsov (Member)

I also struggle to get consistent results from dotTrace these days; it definitely does something wrong, but I was never able to figure out what exactly, or why.

If you want an accurate comparison between the two, I'm afraid you'll have to move the SSE41 / scalar logic to separate methods and run them with BDN.
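A rough sketch of such a BDN (BenchmarkDotNet) comparison, assuming the logic is split into hypothetical QuantizeBlockScalar / QuantizeBlockSse41 methods on QuantEnc:

using System;
using BenchmarkDotNet.Attributes;

public class QuantizeBlockBenchmarks
{
    private short[] input;
    private short[] output;
    private Vp8Matrix mtx;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        this.input = new short[16];
        this.output = new short[16];
        for (int i = 0; i < this.input.Length; i++)
        {
            this.input[i] = (short)rng.Next(-255, 256);
        }

        this.mtx = new Vp8Matrix(); // a real run would fill Q, IQ, Bias and Sharpen first
    }

    [Benchmark(Baseline = true)]
    public int Scalar() => QuantEnc.QuantizeBlockScalar(this.input, this.output, this.mtx);

    [Benchmark]
    public int Sse41() => QuantEnc.QuantizeBlockSse41(this.input, this.output, this.mtx);
}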

@antonfirsov (Member) left a comment

Not strictly related to the PR, but you may also want to turn Vp8Matrix into a struct with all the Q, IQ, etc. fields being fixed-size buffers.

Helping cache locality by keeping the data close together in memory should alone bring a visible perf boost.
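Something along these lines, presumably (field names taken from the diff in this PR; the exact set of fields may differ):

// Hedged sketch: a blittable Vp8Matrix keeps all quantization tables contiguous,
// so one block's worth of Q/IQ/Bias/Sharpen likely shares a handful of cache lines.
public unsafe struct Vp8Matrix
{
    public fixed ushort Q[16];      // quantizer steps
    public fixed ushort IQ[16];     // reciprocals in fixed point
    public fixed uint Bias[16];     // rounding bias
    public fixed short Sharpen[16]; // frequency boosters for slight sharpening
}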

Comment on lines 47 to 53
public void QuantizeBlock_WithoutSSE2_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSE2);

[Fact]
public void QuantizeBlock_WithoutSSSE3_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSSE3);

[Fact]
public void QuantizeBlock_WithoutSSE2AndSSSE3_Works() => FeatureTestRunner.RunWithHwIntrinsicsFeature(RunQuantizeBlockTest, HwIntrinsics.DisableSSE2 | HwIntrinsics.DisableSSSE3);
antonfirsov (Member):

I don't see why we need all these variants.

brianpopow (Collaborator, Author):

You think it's enough to test only without SSE2? Couldn't the CPU support SSE2 but not SSSE3?

antonfirsov (Member):

What I meant is that the feature needs SSE4.1, so unless we really believe we will add SSE2 and SSSE3 versions of the algorithm later, I don't see why it is necessary to test switching off those particular instruction sets.

We should just switch off SSE4.1 or all intrinsics in general instead I think.

brianpopow (Collaborator, Author):

OK, I have changed it to test with and without intrinsics.

Comment on lines 540 to 545
fixed (ushort* mtxIqPtr = mtx.IQ)
fixed (ushort* mtxQPtr = mtx.Q)
fixed (uint* biasQPtr = mtx.Bias)
fixed (short* sharpenPtr = mtx.Sharpen)
fixed (short* inputPtr = input)
fixed (short* outputPtr = output)
antonfirsov (Member):

We can avoid pinning by loading with an unsafe conversion instead of LoadVector128:

Unsafe.As<ushort, Vector128<ushort>>(ref MemoryMarshal.GetReference(input));

You can even expose helper methods on Vp8Matrix to load these vectors in order to keep code simple here.
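A possible shape for such a helper, assuming IQ is a ushort[] (LoadIq is a hypothetical name; an untested sketch, shown as partial only for illustration):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public partial class Vp8Matrix
{
    // Reinterprets the first 8 IQ entries as one Vector128<ushort> without pinning;
    // no GC hole, because only managed references are ever held.
    public Vector128<ushort> LoadIq()
        => Unsafe.As<ushort, Vector128<ushort>>(ref MemoryMarshal.GetReference(this.IQ.AsSpan()));
}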

brianpopow (Collaborator, Author):

We still need to pin for the final Sse2.Store, right?

Comment on lines 623 to 627
fixed (short* outputPtr = output)
{
Sse2.Store(outputPtr, outZ0.AsInt16());
Sse2.Store(outputPtr + 8, outZ8.AsInt16());
}
antonfirsov (Member):

> We still need to pin for the final Sse2.Store, right?

Haven't checked if the following compiles & works, but since MemoryMarshal.GetReference and Unsafe.As return references, you can use them on the left side of an assignment:

Suggested change:

-fixed (short* outputPtr = output)
-{
-    Sse2.Store(outputPtr, outZ0.AsInt16());
-    Sse2.Store(outputPtr + 8, outZ8.AsInt16());
-}
+ref short outputRef = ref MemoryMarshal.GetReference(output);
+Unsafe.As<short, Vector128<short>>(ref outputRef) = outZ0.AsInt16();
+Unsafe.As<short, Vector128<short>>(ref Unsafe.Add(ref outputRef, 8)) = outZ8.AsInt16();

brianpopow (Collaborator, Author):

Yeah, this works 👍

@brianpopow (Collaborator, Author)

> Not strictly related to the PR, but you may also want to turn Vp8Matrix into a struct with all the Q, IQ, etc. fields being fixed-size buffers.
>
> Helping cache locality by keeping the data close together in memory should alone bring a visible perf boost.

I have tried changing Vp8Matrix into a struct with fixed-size buffers. To make it work, I had to make the Vp8Matrix fields in Vp8SegmentInfo public, to be able to access the fixed buffers. I probably should not use public fields, but what would be the right way to make this work without them?

@antonfirsov (Member) left a comment

> I probably should not use public fields, but what would be the right way to make this work without them?

Sometimes it's possible to juggle the fields in a way that keeps them private and yet avoids a perf regression. I'm not sure if that's the case here, but IMO it's a waste of time. I would recommend disabling the warning and moving on; StyleCop was not designed for performance-critical code.
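For example, with a targeted suppression around the offending fields (SA1401 is StyleCop's "fields should be private" rule; the field names below are guesses for illustration):

#pragma warning disable SA1401 // Fields should be private
public Vp8Matrix Y1; // luma quantization matrix (name assumed)
public Vp8Matrix Y2; // second-order luma matrix (name assumed)
public Vp8Matrix Uv; // chroma matrix (name assumed)
#pragma warning restore SA1401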

@@ -510,51 +527,150 @@ public static void RefineUsingDistortion(Vp8EncIterator it, Vp8SegmentInfo[] seg
[MethodImpl(InliningOptions.ShortMethod)]
public static int Quantize2Blocks(Span<short> input, Span<short> output, Vp8Matrix mtx)
antonfirsov (Member) commented Nov 9, 2021:

You need to pass Vp8Matrix by reference now to avoid a full stack copy.

brianpopow (Collaborator, Author):

Ah yes, good spot.

-int nz = QuantizeBlock(input, output, mtx) << 0;
-nz |= QuantizeBlock(input.Slice(1 * 16), output.Slice(1 * 16), mtx) << 1;
+int nz = QuantizeBlock(input.Slice(0, 16), output.Slice(0, 16), mtx) << 0;
+nz |= QuantizeBlock(input.Slice(1 * 16, 16), output.Slice(1 * 16, 16), mtx) << 1;
 return nz;
 }

public static int QuantizeBlock(Span<short> input, Span<short> output, Vp8Matrix mtx)
antonfirsov (Member):

Here too: ref Vp8Matrix mtx
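Taken together with the previous comment, the resulting shape would be roughly this (a sketch mirroring the diff above, not the exact final code):

[MethodImpl(InliningOptions.ShortMethod)]
public static int Quantize2Blocks(Span<short> input, Span<short> output, ref Vp8Matrix mtx)
{
    // Passing the (now struct) matrix by ref avoids copying all of its
    // fixed-size tables on every call.
    int nz = QuantizeBlock(input.Slice(0, 16), output.Slice(0, 16), ref mtx) << 0;
    nz |= QuantizeBlock(input.Slice(1 * 16, 16), output.Slice(1 * 16, 16), ref mtx) << 1;
    return nz;
}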

@antonfirsov (Member) left a comment

Looks good!

@brianpopow brianpopow merged commit 7495a91 into master Nov 9, 2021
@brianpopow brianpopow deleted the bp/quantizeblocksse branch November 9, 2021 13:54