
Vectorize TrimTransparentPixels in GifEncoderCore #2500

Merged


@gfoidl (Contributor) commented Jul 27, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

See #2455 (comment)

A simple benchmark of just the inner loop yields:

|     Method |      Mean |    Error |   StdDev | Ratio |
|----------- |----------:|---------:|---------:|------:|
|    Default | 102.33 ns | 2.073 ns | 2.973 ns |  1.00 |
| Vectorized |  16.53 ns | 0.065 ns | 0.055 ns |  0.16 |

This was measured with .NET 7, but the codegen for .NET 6 is very similar.

benchmark code

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

// Quick sanity check: both variants should print the same (left, right, isTransparentRow).
Bench bench = new();
bench.Setup();
Console.WriteLine(bench.Default());
Console.WriteLine(bench.Vectorized());

#if !DEBUG
BenchmarkDotNet.Running.BenchmarkRunner.Run<Bench>();
#endif

public class Bench
{
    private byte[] _rowSpan = null!;
    private byte _trimmableIndex;

    [GlobalSetup]
    public void Setup()
    {
        // 100 trimmable bytes (42) with a non-trimmable run at indices 25..33.
        _rowSpan = new byte[100];
        _rowSpan.AsSpan().Fill(42);
        _rowSpan.AsSpan(25, 9).Clear();

        _trimmableIndex = 42;
    }

    [Benchmark(Baseline = true)]
    public (int left, int right, bool isTransparentRow) Default()
    {
        Span<byte> rowSpan = _rowSpan;
        byte trimmableIndex = _trimmableIndex;

        int left = int.MaxValue;
        int right = int.MinValue;
        bool isTransparentRow = true;

        for (int x = 0; x < rowSpan.Length; ++x)
        {
            if (rowSpan[x] != trimmableIndex)
            {
                isTransparentRow = false;
                left = Math.Min(left, x);
                right = Math.Max(right, x);
            }
        }

        if (left == int.MaxValue)
        {
            left = 0;
        }

        if (right == int.MinValue)
        {
            right = rowSpan.Length;
        }

        return (left, right, isTransparentRow);
    }

    [Benchmark]
    public (int left, int right, bool isTransparentRow) Vectorized()
    {
        Span<byte> rowSpan = _rowSpan;
        byte trimmableIndex = _trimmableIndex;

        int left = int.MaxValue;
        int right = int.MinValue;
        bool isTransparentRow = true;

        ref byte rowPtr = ref MemoryMarshal.GetReference(rowSpan);
        nint rowLength = (nint)(uint)rowSpan.Length;
        nint x = 0;

        if (Vector128.IsHardwareAccelerated && rowLength >= Vector128<byte>.Count)
        {
            Vector256<byte> trimmableVec256 = Vector256.Create(trimmableIndex);

            if (Vector256.IsHardwareAccelerated && rowLength >= Vector256<byte>.Count)
            {
                do
                {
                    Vector256<byte> vec = Vector256.LoadUnsafe(ref rowPtr, (nuint)x);
                    Vector256<byte> notEquals = ~Vector256.Equals(vec, trimmableVec256);

                    if (notEquals != Vector256<byte>.Zero)
                    {
                        isTransparentRow = false;
                        uint mask = notEquals.ExtractMostSignificantBits();

                        // Index of the first non-trimmable byte in this block.
                        nint start = x + (nint)uint.TrailingZeroCount(mask);

                        // 32 lanes fill all 32 bits of the mask, so no correction is needed here.
                        nint end = (nint)uint.LeadingZeroCount(mask);
                        // end is from the end, but we need the index from the beginning
                        end = x + Vector256<byte>.Count - 1 - end;

                        left = Math.Min(left, (int)start);
                        right = Math.Max(right, (int)end);
                    }

                    x += Vector256<byte>.Count;
                }
                while (x <= rowLength - Vector256<byte>.Count);
            }

            // Reuse the lower half of the 256-bit broadcast when Vector256 is accelerated.
            Vector128<byte> trimmableVec = Vector256.IsHardwareAccelerated
                ? trimmableVec256.GetLower()
                : Vector128.Create(trimmableIndex);

            while (x <= rowLength - Vector128<byte>.Count)
            {
                Vector128<byte> vec = Vector128.LoadUnsafe(ref rowPtr, (nuint)x);
                Vector128<byte> notEquals = ~Vector128.Equals(vec, trimmableVec);

                if (notEquals != Vector128<byte>.Zero)
                {
                    isTransparentRow = false;
                    uint mask = notEquals.ExtractMostSignificantBits();
                    nint start = x + (nint)uint.TrailingZeroCount(mask);

                    // Only the low 16 bits of the mask are used; subtract the 16 always-zero upper bits.
                    nint end = (nint)uint.LeadingZeroCount(mask) - Vector128<byte>.Count;
                    // end is from the end, but we need the index from the beginning
                    end = x + Vector128<byte>.Count - 1 - end;

                    left = Math.Min(left, (int)start);
                    right = Math.Max(right, (int)end);
                }

                x += Vector128<byte>.Count;
            }
        }

        // Scalar loop for the remaining elements that don't fill a whole vector.
        for (; x < rowLength; ++x)
        {
            if (Unsafe.Add(ref rowPtr, x) != trimmableIndex)
            {
                isTransparentRow = false;
                left = Math.Min(left, (int)x);
                right = Math.Max(right, (int)x);
            }
        }

        if (left == int.MaxValue)
        {
            left = 0;
        }

        if (right == int.MinValue)
        {
            right = (int)rowLength;
        }

        return (left, right, isTransparentRow);
    }
}
```

@gfoidl (Contributor, Author) left a comment

Some notes for review.

```csharp
Vector256<byte> vec = Vector256.LoadUnsafe(ref rowPtr, (nuint)x);
Vector256<byte> notEquals = ~Vector256.Equals(vec, trimmableVec256);

if (notEquals != Vector256<byte>.Zero)
```

At the moment I have no idea how to make this branchless.
isTransparentRow could be tracked in a vector, but left and right could not, because of the mismatch in element types, namely byte and int.

A quite complicated approach would be to use VectorXYZ<byte> to track left and right, merging them back into the scalar left and right just before they could overflow, and then starting over. But I suspect the bookkeeping costs more than it saves, so I'm not sure it would actually be faster. It would certainly make the code painful.
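As a minimal sketch of the first half of that remark (only `isTransparentRow`, not `left`/`right`), the per-block comparison result can be OR-accumulated into a vector and tested once after the loop, so that part of the loop body has no branch. The class and method names are hypothetical, it assumes the .NET 7 `Vector128` APIs, and it is not the code merged in this PR.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class BranchlessTransparencySketch
{
    // Hypothetical helper: true when every byte in row equals trimmableIndex.
    public static bool IsTransparentRow(ReadOnlySpan<byte> row, byte trimmableIndex)
    {
        ref byte rowPtr = ref MemoryMarshal.GetReference(row);
        nint length = row.Length;
        nint x = 0;

        Vector128<byte> trimmable = Vector128.Create(trimmableIndex);
        Vector128<byte> anyNotEqual = Vector128<byte>.Zero;

        // No branch inside the loop: mismatching lanes (all bits set) are OR-ed in.
        for (; x <= length - Vector128<byte>.Count; x += Vector128<byte>.Count)
        {
            Vector128<byte> vec = Vector128.LoadUnsafe(ref rowPtr, (nuint)x);
            anyNotEqual |= ~Vector128.Equals(vec, trimmable);
        }

        // A single check after the vectorized part.
        bool isTransparentRow = anyNotEqual == Vector128<byte>.Zero;

        // Scalar tail for the remaining (fewer than 16) bytes.
        for (; x < length; x++)
        {
            isTransparentRow &= row[(int)x] == trimmableIndex;
        }

        return isTransparentRow;
    }
}
```

Tracking `left` and `right` the same way is where the byte/int mismatch described above gets in the way.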

```csharp
        }
    }
#endif
    for (; x < rowLength; ++x)
```

The remainder could be vectorized too, by shifting the most-significant-bits mask by the number of elements left in the final vector.
I tried this elsewhere and the cost of that bookkeeping isn't negligible, so I didn't do it here (maybe I'll try it later).
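One possible reading of that idea, sketched with `Vector128` only and assuming a row of at least 16 bytes: after the full blocks, the last 16 bytes are loaded once more (overlapping data already examined), and the mask bits of the overlapping lanes are shifted out before `left`/`right` are updated. All names are hypothetical and this is not the merged code. Because `left`/`right` are a min/max, re-examining the overlap would give the same answer anyway; the shift only becomes essential for reductions that aren't idempotent.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class RemainderSketch
{
    // Hypothetical helper: first and last index whose value != trimmableIndex,
    // or (int.MaxValue, int.MinValue) when there is none. Requires row.Length >= 16.
    public static (int Left, int Right) FindBounds(ReadOnlySpan<byte> row, byte trimmableIndex)
    {
        int count = Vector128<byte>.Count;
        if (row.Length < count)
        {
            throw new ArgumentException("Sketch assumes at least one full vector.", nameof(row));
        }

        ref byte rowPtr = ref MemoryMarshal.GetReference(row);
        Vector128<byte> trimmable = Vector128.Create(trimmableIndex);
        int left = int.MaxValue;
        int right = int.MinValue;
        int x = 0;

        // Full, non-overlapping blocks.
        for (; x <= row.Length - count; x += count)
        {
            Update(Mask(ref rowPtr, x, trimmable), x, ref left, ref right);
        }

        if (x < row.Length)
        {
            // Load the last full vector; its first `overlap` lanes were already examined.
            int offset = row.Length - count;
            int overlap = x - offset;
            uint mask = Mask(ref rowPtr, offset, trimmable);

            // Shift the bits of the already-processed lanes out of the mask.
            mask = (mask >> overlap) << overlap;
            Update(mask, offset, ref left, ref right);
        }

        return (left, right);
    }

    // One bit per lane, set where the byte differs from trimmableIndex.
    private static uint Mask(ref byte rowPtr, int offset, Vector128<byte> trimmable)
    {
        Vector128<byte> vec = Vector128.LoadUnsafe(ref rowPtr, (nuint)offset);
        return (~Vector128.Equals(vec, trimmable)).ExtractMostSignificantBits();
    }

    private static void Update(uint mask, int offset, ref int left, ref int right)
    {
        if (mask != 0)
        {
            left = Math.Min(left, offset + BitOperations.TrailingZeroCount(mask));
            right = Math.Max(right, offset + BitOperations.Log2(mask));
        }
    }
}
```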

Hm, I remembered that movemask isn't the fastest and that ptest (TestZ in .NET terms) should be faster, but current benchmarks didn't confirm this, and Intel's instruction tables show no benefit in latency or throughput either.
So I simplified that check.
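For comparison, the candidate checks can be written side by side. All three report whether any lane of `notEquals` is set, which works because `Vector128.Equals` produces lanes that are either all ones or all zeros. `ExtractMostSignificantBits` is the movemask form and `Sse41.TestZ` the ptest form; the class name is hypothetical, and the sketch makes no claim about which instruction the JIT actually emits for the plain `!=` comparison.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class AnyLaneSetSketch
{
    // The form kept in the PR: compare the whole vector with zero.
    public static bool ViaEquality(Vector128<byte> notEquals)
        => notEquals != Vector128<byte>.Zero;

    // movemask form: gather the top bit of each byte lane into a 16-bit mask.
    public static bool ViaMoveMask(Vector128<byte> notEquals)
        => notEquals.ExtractMostSignificantBits() != 0;

    // ptest form: TestZ(a, b) is true when (a & b) == 0, so negate it.
    // Falls back to the equality form on hardware without SSE4.1.
    public static bool ViaTestZ(Vector128<byte> notEquals)
        => Sse41.IsSupported ? !Sse41.TestZ(notEquals, notEquals) : ViaEquality(notEquals);
}
```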
@JimBobSquarePants (Member)

Oof! Look at those numbers!! Thanks so much for looking at this. I'm going to have a good dig through it to wrap my head round what you have done.

@gfoidl (Contributor, Author) commented Jul 29, 2023

😃
I think the easiest way to understand and check it is to run the benchmark code (see the top comment) in a simple console app and step through it with the debugger (maybe change the size of _rowSpan to 20 or so).
Calculating the correct index for end is the strangest part, IMO.

PS: I'm back on Tuesday, so I may be slow to respond in the meantime.
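The end-index calculation can be worked through with small, hypothetical helpers. `ExtractMostSignificantBits` on a `Vector128<byte>` fills only the low 16 bits of the `uint`, so `LeadingZeroCount` always counts 16 extra zeros; that is why the `Vector128` path subtracts `Vector128<byte>.Count` and the `Vector256` path (32 lanes, 32 bits) does not. For a mask with bits 2 and 9 set (0x204) in a block starting at x = 32: LeadingZeroCount is 22, minus 16 gives 6, and end = 32 + 16 - 1 - 6 = 41, i.e. lane 9. Both paths reduce to x + Log2(mask), which may be easier to read.

```csharp
using System;
using System.Numerics;

public static class EndIndexSketch
{
    // Mirrors the Vector128 path: the mask uses only 16 of the 32 bits, so the
    // 16 always-zero upper bits are removed from LeadingZeroCount first.
    public static int EndVector128(uint mask, int x)
    {
        int lzcntWithinVector = BitOperations.LeadingZeroCount(mask) - 16;
        return x + 16 - 1 - lzcntWithinVector;
    }

    // Mirrors the Vector256 path: 32 lanes use the whole 32-bit mask.
    public static int EndVector256(uint mask, int x)
        => x + 32 - 1 - BitOperations.LeadingZeroCount(mask);

    // Equivalent closed form for both: index of the highest set bit.
    public static int EndLog2(uint mask, int x) => x + BitOperations.Log2(mask);
}
```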

@JimBobSquarePants (Member)

This is fantastic stuff. After reading some of the source for Span.IndexOf, I figured that, in theory, masking with bit counting would be the vectorized solution; I just had no idea how I'd actually implement it.

Tip of the cap to you, sir.
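On .NET 7 and later the BCL also exposes vectorized `IndexOfAnyExcept` and `LastIndexOfAnyExcept` on spans, so the same bounds can be sketched in two calls. This is a swapped-in alternative, not what the PR does, and whether it beats the hand-written loop for short GIF rows would have to be benchmarked. The helper name is hypothetical; the fallback for an all-trimmable row mirrors the benchmark's `Default()`.

```csharp
using System;

public static class SpanSearchSketch
{
    public static (int Left, int Right, bool IsTransparentRow) TrimBounds(ReadOnlySpan<byte> row, byte trimmableIndex)
    {
        // First index whose value differs from trimmableIndex, or -1 if none.
        int left = row.IndexOfAnyExcept(trimmableIndex);
        if (left < 0)
        {
            // Same fallback as Default(): whole row, marked transparent.
            return (0, row.Length, true);
        }

        // Last differing index; it exists because left >= 0.
        int right = row.LastIndexOfAnyExcept(trimmableIndex);
        return (left, right, false);
    }
}
```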

@JimBobSquarePants merged commit 949e6ad into SixLabors:js/gif-fixes Aug 9, 2023
@gfoidl deleted the git-transparency-simd branch August 9, 2023 12:14

@gfoidl (Contributor, Author) commented Aug 9, 2023

Thanks for the kind words ❤️
