Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor improvement on the dataLength % 3. #114

Merged
merged 8 commits into from
Mar 17, 2020
Original file line number Diff line number Diff line change
Expand Up @@ -231,8 +231,30 @@ private static int GetNumBase64PaddingCharsAddedByEncode(int dataLength)
// 0 -> 0
// 1 -> 2
// 2 -> 1
uint mod3 = FastMod3((uint)dataLength);
return mod3 == 0 ? 0 : (int)(3 - mod3);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the JIT can emit cmov here, so it gets branchless. Unfortunately there is no "trick" to force the JIT to do so, except some very obscure hacks (which we shouldn't use here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious to know the obscure hacks...

Copy link
Owner

@gfoidl gfoidl Feb 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactely a cmov (which would be desired), but still a branchless version with some bit-twiddling and the knowledge that true / false is represented as 1 / 0 -> snippet in sharplab
It the stack isn't used, it would be even better.

The main advantage will be when the branch-predictor can't work reliable due to "random" inputs, i.e. unknown patterns for mod3 == 0. Then this approach may gain some perf.
Contrast this with clang, so it could be pretty concise branchless code.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting: if the bool is "casted" to byte instead of int the stack-usage goes away and code looks pretty good.

Can you give this a try with the above benchmark? If it has potential to be faster, we should go with such a approach (although state otherwise above 😉).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried with the BranchlessSelect casted as byte. The results are interesting:

Method Value Mean Error StdDev Median Ratio RatioSD
Mod3 64 2.607 ns 0.0991 ns 0.2890 ns 2.519 ns 1.00 0.00
FastMod3 64 2.208 ns 0.0491 ns 0.1351 ns 2.165 ns 0.85 0.11
FastMod3_Branchless 64 2.611 ns 0.0502 ns 0.0445 ns 2.606 ns 1.07 0.06
FastMod3_BranchlessReversed 64 2.299 ns 0.0066 ns 0.0051 ns 2.300 ns 0.94 0.05
Mod3 65 2.357 ns 0.0494 ns 0.0462 ns 2.350 ns 1.00 0.00
FastMod3 65 2.108 ns 0.0411 ns 0.0365 ns 2.094 ns 0.89 0.02
FastMod3_Branchless 65 2.622 ns 0.0626 ns 0.0615 ns 2.607 ns 1.11 0.04
FastMod3_BranchlessReversed 65 2.313 ns 0.0467 ns 0.0437 ns 2.298 ns 0.98 0.02
Mod3 66 2.157 ns 0.0346 ns 0.0289 ns 2.166 ns 1.00 0.00
FastMod3 66 1.920 ns 0.0396 ns 0.0471 ns 1.918 ns 0.89 0.02
FastMod3_Branchless 66 2.637 ns 0.0588 ns 0.0577 ns 2.631 ns 1.23 0.04
FastMod3_BranchlessReversed 66 2.334 ns 0.0617 ns 0.0606 ns 2.327 ns 1.09 0.03

The 'branchless' is not as efficient as expected .
When reverting the branch BranchlessSelect(mod3 != 0, 3 - mod3, 0) instead of BranchlessSelect(mod3 == 0, 0, 3 - mod3), the result are better but still under the branched one.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the numbers. 👍

Some thoughts -- but I don't know what matches here...

The effect of cmov et.al. depends on the distribution of taken / not taken branches, i.e. on how effecient the branch predictor can work. If the branch predictor does a good job, then the "branchy code" may be more efficient, as the pipeline can speculatively process the following steps, whereas with cmov the rest of the pipeline needs to wait for the result of cmov.

The code for the branchless version is quite a bit, so it may be more efficient just to use simple and compact code with a branch -- which can be predicted, as in the benchmarks where for constant input length the result will always be the same.
I'm pretty sure that there will be a benchmark constellation where the branchless version will be quite a bit faster.

In effect I think we should go with the code as is. It's easy and not hacky...and as the numbers show quite good. Furthermore I doubt a slight improvement here will show any effect on overall perf.

Once the runtime (JIT) will detect a pattern like return cond ? a : b and emit better code (maybe with tiering) we'll gain the perf-win for free.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More or less just for fun I tried a vectorized approach to make it branchless. I show it here because you are curios 😉

Benchmark is laid out so that (hopefully) the branch-predictor won't do a good job.

|      Method |      Mean |     Error |    StdDev | Ratio | RatioSD |
|------------ |----------:|----------:|----------:|------:|--------:|
|     Default | 1.0953 ns | 0.0213 ns | 0.0199 ns |  1.00 |    0.00 |
|  Branchless | 0.8785 ns | 0.0136 ns | 0.0127 ns |  0.80 |    0.02 |
|  Vectorized | 0.9657 ns | 0.0210 ns | 0.0186 ns |  0.88 |    0.02 |
| Vectorized1 | 0.8840 ns | 0.0156 ns | 0.0146 ns |  0.81 |    0.02 |
Benchmark-code
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;

#if !DEBUG
using BenchmarkDotNet.Running;
#endif

namespace ConsoleApp4
{
    class Program
    {
        static void Main(string[] args)
        {
            var bench = new Bench();

            int mod3 = 0;
            Console.WriteLine(bench.Default(mod3));
            Console.WriteLine(bench.Branchless(mod3));
            Console.WriteLine(bench.Vectorized(mod3));
            Console.WriteLine(bench.Vectorized1(mod3));
            Console.WriteLine();

            mod3 = 1;
            Console.WriteLine(bench.Default(mod3));
            Console.WriteLine(bench.Branchless(mod3));
            Console.WriteLine(bench.Vectorized(mod3));
            Console.WriteLine(bench.Vectorized1(mod3));

            Console.WriteLine();

            mod3 = 2;
            Console.WriteLine(bench.Default(mod3));
            Console.WriteLine(bench.Branchless(mod3));
            Console.WriteLine(bench.Vectorized(mod3));
            Console.WriteLine(bench.Vectorized1(mod3));

#if !DEBUG
            BenchmarkRunner.Run<Bench>();
#endif
        }
    }

    public class Bench
    {
        [Benchmark(Baseline = true, OperationsPerInvoke = 3)]
        public int Default()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Default(i);
            }

            return s;
        }

        [Benchmark(OperationsPerInvoke = 3)]
        public int Branchless()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Branchless(i);
            }

            return s;
        }

        [Benchmark(OperationsPerInvoke = 3)]
        public int Vectorized()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Vectorized(i);
            }

            return s;
        }

        [Benchmark(OperationsPerInvoke = 3)]
        public int Vectorized1()
        {
            int s = 0;

            for (int i = 0; i < 3; ++i)
            {
                s += this.Vectorized1(i);
            }

            return s;
        }

        public int Default(int mod3)    => mod3 == 0 ? 0 : 3 - mod3;
        public int Branchless(int mod3) => BranchlessSelect(mod3 != 0, 3 - mod3, 0);

        public int Vectorized(int mod3)
        {
            Vector128<int> vMod3       = Vector128.CreateScalarUnsafe(mod3);
            Vector128<int> zero        = Vector128<int>.Zero;
            Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3);

            Vector128<int> eqZero  = Sse2.CompareEqual(vMod3, zero);
            Vector128<int> neqZero = Sse2.CompareEqual(eqZero, zero);
            Vector128<int> result  = Sse2.And(neqZero, valueIfTrue);

            return Sse2.ConvertToInt32(result);
        }

        public int Vectorized1(int mod3)
        {
            Vector128<int> vMod3       = Vector128.CreateScalarUnsafe(mod3);
            Vector128<int> zero        = Vector128<int>.Zero;
            Vector128<int> valueIfTrue = Vector128.CreateScalarUnsafe(3 - mod3);

            Vector128<int> eqZero = Sse2.CompareEqual(vMod3, zero);
            Vector128<int> result = Sse2.AndNot(eqZero, valueIfTrue);

            return Sse2.ConvertToInt32(result);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static int BranchlessSelect(bool condition, int valueIfTrue, int valueIfFalse)
        {
            return ((-Unsafe.As<bool, byte>(ref condition)) & (valueIfTrue - valueIfFalse)) + valueIfFalse;
        }
    }
}

Codegen

}
//---------------------------------------------------------------------
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private unsafe static uint FastMod3(uint value)
{
if (sizeof(IntPtr) == 8)
ycrumeyrolle marked this conversation as resolved.
Show resolved Hide resolved
{
// We use modified Daniel Lemire's fastmod algorithm (https://github.com/dotnet/runtime/pull/406),
// which allows to avoid the long multiplication if the divisor is less than 2**31.
ulong lowbits = (ulong.MaxValue / 3 + 1) * value;
ycrumeyrolle marked this conversation as resolved.
Show resolved Hide resolved

return dataLength % 3 == 0 ? 0 : 3 - dataLength % 3;
// 64bit * 64bit => 128bit isn't currently supported by Math https://github.com/dotnet/corefx/issues/41822
ycrumeyrolle marked this conversation as resolved.
Show resolved Hide resolved
// otherwise we'd want this to be (uint)Math.BigMul(lowbits, divisor, out _)
uint highbits = (uint)((((lowbits >> 32) + 1) * 3) >> 32);

Debug.Assert(highbits == value % 3);
return highbits;
}
else
{
return value % 3;
}
}
//---------------------------------------------------------------------
// internal because tests use this map too
Expand Down