Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX conversions for Luminance and Rgb #2039

Merged
merged 4 commits into from
Mar 8, 2022

Conversation

ynse01
Copy link
Contributor

@ynse01 ynse01 commented Mar 6, 2022

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided benchmark coverage for my change (where applicable)

Description

This PR fixed #1763

Implemented AVX versions of Luminance and RGB conversions in JPEG Encoding, as described in the issue.
Added Benchmarks for these cases.

Results with and without AVX:

Method Quality Mean Error StdDev
ImageSharp (greyscale) Jpeg 4:0:0 75 12.90 ms 0.074 ms 0.069 ms
ImageSharp (greyscale) Jpeg 4:0:0 AVX 75 11.96 ms 0.076 ms 0.072 ms
ImageSharp Jpeg rgb 75 28.45 ms 0.130 ms 0.115 ms
ImageSharp Jpeg rgb AVX 75 26.22 ms 0.051 ms 0.046 ms
ImageSharp (greyscale) Jpeg 4:0:0 90 16.23 ms 0.273 ms 0.255 ms
ImageSharp (greyscale) Jpeg 4:0:0 AVX 90 15.15 ms 0.234 ms 0.219 ms
ImageSharp Jpeg rgb 90 36.84 ms 0.258 ms 0.241 ms
ImageSharp Jpeg rgb AVX 90 34.56 ms 0.111 ms 0.104 ms
ImageSharp (greyscale) Jpeg 4:0:0 100 20.57 ms 0.080 ms 0.075 ms
ImageSharp (greyscale) Jpeg 4:0:0 AVX 100 19.55 ms 0.070 ms 0.086 ms
ImageSharp Jpeg rgb 100 54.64 ms 0.102 ms 0.090 ms
ImageSharp Jpeg rgb AVX 100 51.67 ms 0.112 ms 0.105 ms

@br3aker
Copy link
Contributor

br3aker commented Mar 6, 2022

Looks good overall but benchmark results look strange, ran your fork on my machine:

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19044
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100-preview.3.21202.5
  [Host]     : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
  DefaultJob : .NET 5.0.12 (5.0.1221.52207), X64 RyuJIT
Method Quality Mean Error StdDev Median Ratio RatioSD
'System.Drawing Jpeg 4:2:0' 75 27.73 ms 0.189 ms 0.168 ms 27.75 ms 1.00 0.00
'ImageSharp (greyscale) Jpeg 4:0:0' 75 13.13 ms 0.235 ms 0.525 ms 12.91 ms 0.48 0.03
'ImageSharp Jpeg 4:2:0' 75 20.48 ms 0.168 ms 0.149 ms 20.44 ms 0.74 0.01
'ImageSharp Jpeg 4:4:4' 75 23.60 ms 0.467 ms 0.985 ms 23.38 ms 0.85 0.04
'ImageSharp Jpeg rgb' 75 33.51 ms 0.664 ms 1.072 ms 33.87 ms 1.23 0.02
'System.Drawing Jpeg 4:2:0' 90 31.28 ms 0.604 ms 0.565 ms 31.14 ms 1.00 0.00
'ImageSharp (greyscale) Jpeg 4:0:0' 90 17.17 ms 0.195 ms 0.163 ms 17.12 ms 0.55 0.01
'ImageSharp Jpeg 4:2:0' 90 25.36 ms 0.276 ms 0.230 ms 25.27 ms 0.81 0.01
'ImageSharp Jpeg 4:4:4' 90 30.48 ms 0.600 ms 1.126 ms 30.54 ms 0.94 0.03
'ImageSharp Jpeg rgb' 90 43.32 ms 0.337 ms 0.281 ms 43.28 ms 1.38 0.03
'System.Drawing Jpeg 4:2:0' 100 37.89 ms 0.529 ms 0.469 ms 37.79 ms 1.00 0.00
'ImageSharp (greyscale) Jpeg 4:0:0' 100 26.05 ms 0.438 ms 0.366 ms 25.91 ms 0.69 0.01
'ImageSharp Jpeg 4:2:0' 100 37.09 ms 0.734 ms 1.565 ms 36.88 ms 0.98 0.05
'ImageSharp Jpeg 4:4:4' 100 48.73 ms 0.347 ms 0.289 ms 48.74 ms 1.29 0.02
'ImageSharp Jpeg rgb' 100 71.04 ms 1.349 ms 1.262 ms 70.54 ms 1.87 0.03

Note that I'm using win10. There's no way System.Drawing got significantly faster purely because it uses libgdiplus which had a jpeg-related commit like 3 years ago:

Because System.Drawing.Common was designed to be a thin wrapper over Windows technologies, its cross-platform implementation is subpar.

libgdiplus is the main provider of the cross-platform implementation of System.Drawing.Common on the native side. libgdiplus is effectively a reimplementation of the parts of Windows that System.Drawing.Common depends on.

We had similar speedup of the System.Drawing in the past - it resaved same byte stream without any reencoding so maybe internal libgdiplus does the same, maybe it's not but I don't trust these results. Most likely we should redo this benchmark to test against other supported libraries (Skia/NetVips) on win/linux platforms and System.Drawing only on Windows but I won't be able to work on this for some time.

P.S. It's also worth checking whether System.Drawing on linux actually produces expected ouput.

@ynse01
Copy link
Contributor Author

ynse01 commented Mar 6, 2022

Using Linux, I converted a BMP image into Jpeg, using System.Drawing.Common package. Looks decent to me:
output

System.Drawing.Common is indeed using Libgdiplus, which in turn is using libjpeg. See:
https://github.com/mono/libgdiplus/blob/94a49875487e296376f209fe64b921c6020f74c0/src/jpegcodec.c#L40
Libjpeg is still active, last release Jan 2022 according to Wikipedia

Btw, I ran my banchmark on this configuration:

BenchmarkDotNet=v0.13.0, OS=linuxmint 20.3
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.200
[Host] : .NET Core 3.1.22 (CoreCLR 4.700.21.56803, CoreFX 4.700.21.57101), X64 RyuJIT
DefaultJob : .NET Core 3.1.22 (CoreCLR 4.700.21.56803, CoreFX 4.700.21.57101), X64 RyuJIT

ref Vector256<float> destRef = ref yBlock.V0;

const int bytesPerL8Stride = 8;
for (int i = 0; i < 8; i++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (int i = 0; i < 8; i++)
for (nint i = 0; i < bytesPerL8Stride; i++)

Is 8 from the condition by coincidence the same as bytesPerL8Stride? If not, just use 8, but change to nint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it is not. 8 is the number of floats in the destination Vector256. Using the nint type does indeed eliminate the cast to IntPtr.
Thanks for the feedback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gfoidl for the reviews, always welcome!

const int bytesPerL8Stride = 8;
for (int i = 0; i < 8; i++)
{
Unsafe.Add(ref destRef, i) = Avx2.ConvertToVector256Single(Avx2.ConvertToVector256Int32(Unsafe.AddByteOffset(ref l8ByteSpan, (IntPtr)(bytesPerL8Stride * i))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Unsafe.Add(ref destRef, i) = Avx2.ConvertToVector256Single(Avx2.ConvertToVector256Int32(Unsafe.AddByteOffset(ref l8ByteSpan, (IntPtr)(bytesPerL8Stride * i))));
Unsafe.Add(ref destRef, i) = Avx2.ConvertToVector256Single(Avx2.ConvertToVector256Int32(Unsafe.AddByteOffset(ref l8ByteSpan, bytesPerL8Stride * i)));

The cast to IntPtr can go away with nint.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Vector256<byte> rgb, rg, bx;

const int bytesPerRgbStride = 24;
for (int i = 0; i < 8; i++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (int i = 0; i < 8; i++)
for (nint i = 0; i < 8; i++)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const int bytesPerRgbStride = 24;
for (int i = 0; i < 8; i++)
{
rgb = Avx2.PermuteVar8x32(Unsafe.AddByteOffset(ref rgbByteSpan, (IntPtr)(bytesPerRgbStride * i)).AsUInt32(), extractToLanesMask).AsByte();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rgb = Avx2.PermuteVar8x32(Unsafe.AddByteOffset(ref rgbByteSpan, (IntPtr)(bytesPerRgbStride * i)).AsUInt32(), extractToLanesMask).AsByte();
rgb = Avx2.PermuteVar8x32(Unsafe.AddByteOffset(ref rgbByteSpan, bytesPerRgbStride * i).AsUInt32(), extractToLanesMask).AsByte();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@br3aker
Copy link
Contributor

br3aker commented Mar 6, 2022

Using Linux, I converted a BMP image into Jpeg, using System.Drawing.Common package. Looks decent to me: output

I can't judge output quality without original input image, encoding parameters, System.Drawing output and ImageSharp output. I can't do this myself because I don't have a Linux machine at my hand. Looks like this happens only for Linux.

Libjpeg is, of course, faster than ImageSharp but encoder part of the jpeg codec is almost complete wiht all the optimizations from libjpeg (not even counting stuff which ImageSharp has and libjpeg doesn't). All I want is to be sure that these results are true so 'ImageSharp is slow' won't happen again due to some incorrect encoding setup.

IMO we should either resolve these performance questions here or at least open an issue/discussion.
@JimBobSquarePants @antonfirsov

@JimBobSquarePants
Copy link
Member

IMO we should either resolve these performance questions here or at least open an issue/discussion.

@br3aker Expanded benchmarks including a minimum of SkiaSharp is probably wise.

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you help here!

@JimBobSquarePants JimBobSquarePants merged commit 913ce38 into SixLabors:main Mar 8, 2022
@ynse01 ynse01 deleted the jpeg-intrinsics branch March 8, 2022 19:09
@br3aker br3aker mentioned this pull request Mar 9, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Jpeg RGB & grayscale encoders can be accelerated with intrinsics
4 participants