
[API Proposal]: Vectorized span-processing helpers #93217

Open
stephentoub opened this issue Oct 9, 2023 · 5 comments


@stephentoub
Member

This is more of a tracking issue right now than a proper API proposal.

In TensorPrimitives, we now have ~40 routines, most of which are vectorized across Vector128, Vector256, Vector512, plus a scalar fallback. Rather than duplicate all of the boilerplate across them, we employ several helper functions that are parameterized via generic method parameters constrained to specific interfaces, with a caller supplying the kernel of the processing via a struct that implements the relevant interfaces. For example, TensorPrimitives.Abs (which computes the absolute value of each input element and stores the result into the corresponding destination slot) is implemented as:

        public static void Abs(ReadOnlySpan<float> x, Span<float> destination) =>
            InvokeSpanIntoSpan<AbsoluteOperator>(x, destination);

and TensorPrimitives.Add (which adds each pair of elements and stores the result for each into the corresponding destination slot) is implemented as:

        public static void Add(ReadOnlySpan<float> x, ReadOnlySpan<float> y, Span<float> destination) =>
            InvokeSpanSpanIntoSpan<AddOperator>(x, y, destination);

There are helpers for various shapes of these element-wise operations, as well as helpers for performing various forms of aggregations. For example, TensorPrimitives.ProductOfSums (which sums each element-wise pair from the two inputs and multiplies together all of those sums) is implemented as:

        public static float ProductOfSums(ReadOnlySpan<float> x, ReadOnlySpan<float> y)
        {
            ...
            return Aggregate<AddOperator, MultiplyOperator>(x, y);
        }

These "Operator"s each implement various interfaces such as IUnaryOperator, IBinaryOperator, etc. For example, AddMultiplyOperator (which implements ITernaryOperator) is defined as:

        readonly struct AddMultiplyOperator : ITernaryOperator
        {
            public static float Invoke(float x, float y, float z) => (x + y) * z;
            public static Vector128<float> Invoke(Vector128<float> x, Vector128<float> y, Vector128<float> z) => (x + y) * z;
            public static Vector256<float> Invoke(Vector256<float> x, Vector256<float> y, Vector256<float> z) => (x + y) * z;
            public static Vector512<float> Invoke(Vector512<float> x, Vector512<float> y, Vector512<float> z) => (x + y) * z;
        }
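For illustration, here is a much-simplified sketch of the kind of helper these operator structs plug into. This is not the actual runtime implementation: it handles only Vector128 plus a scalar tail (rather than all three widths), and the IUnaryOp interface and NegateOp struct are hypothetical stand-ins.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical simplified versions of the internal interface/operator pattern.
interface IUnaryOp
{
    static abstract float Invoke(float x);
    static abstract Vector128<float> Invoke(Vector128<float> x);
}

readonly struct NegateOp : IUnaryOp
{
    public static float Invoke(float x) => -x;
    public static Vector128<float> Invoke(Vector128<float> x) => -x;
}

static class SpanHelpers
{
    // One generic loop shared by every unary operation.
    public static void InvokeSpanIntoSpan<TOp>(ReadOnlySpan<float> x, Span<float> destination)
        where TOp : struct, IUnaryOp
    {
        if (x.Length > destination.Length)
            throw new ArgumentException("Destination is too short.");

        int i = 0;

        // Vectorized main loop: process full Vector128-sized chunks.
        if (Vector128.IsHardwareAccelerated)
        {
            for (; i <= x.Length - Vector128<float>.Count; i += Vector128<float>.Count)
            {
                TOp.Invoke(Vector128.Create(x.Slice(i, Vector128<float>.Count)))
                   .CopyTo(destination.Slice(i));
            }
        }

        // Scalar fallback for the remaining tail elements.
        for (; i < x.Length; i++)
        {
            destination[i] = TOp.Invoke(x[i]);
        }
    }
}
```

Because TOp is a struct constrained to the interface, the JIT specializes the loop per operator and the Invoke calls inline, so the shared helper costs nothing over a hand-written loop.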

We should consider exposing a set of such helpers and interfaces to enable developers to express efficient vectorized implementations across various shapes without having to write all the boilerplate themselves. With an IVector as proposed in #76244 and as prototyped internally in #90764, and if we exposed a Scalar<T> that also implemented this interface, this could be consolidated down to just a single method, which would also be more future-proof for vector types that might be added later, e.g.

        readonly struct AddMultiplyOperator<T> : ITernaryOperator<T>
        {
            public static TVector Invoke<TVector>(TVector x, TVector y, TVector z) where TVector : ISimdVector<TVector, T> =>
                (x + y) * z;
        }
@ghost

ghost commented Oct 9, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
See info in area-owners.md if you want to be subscribed.

Author: stephentoub
Assignees: -
Labels: area-System.Numerics.Tensors
Milestone: 9.0.0

@asmirnov82

It would be very useful to have extended versions of helpers like InvokeSpanSpanIntoSpan that can work efficiently with spans of different element types:

static void InvokeSpanSpanIntoSpan<TResult, TBinaryOperator, T, TConverter>(ReadOnlySpan<TResult> x, ReadOnlySpan<T> y, Span<TResult> destination)
      where TBinaryOperator : struct, IBinaryOperator<TResult>
      where TConverter : struct, IConverter<TResult, T>
static void InvokeSpanSpanIntoSpan<TResult, TBinaryOperator, T, TConverter>(ReadOnlySpan<T> x, ReadOnlySpan<TResult> y, Span<TResult> destination)
      where TBinaryOperator : struct, IBinaryOperator<TResult>
      where TConverter : struct, IConverter<TResult, T>

and

static void InvokeSpanSpanIntoSpan<TResult, TBinaryOperator, TX, TY, TConverterX, TConverterY>(ReadOnlySpan<TX> x, ReadOnlySpan<TY> y, Span<TResult> destination)
      where TBinaryOperator : struct, IBinaryOperator<TResult>
      where TConverterX : struct, IConverter<TResult, TX>
      where TConverterY : struct, IConverter<TResult, TY>

where IConverter<TResult,T> can be defined somewhat like this:

interface IConverter<TResult, T>
    where TResult : struct
    where T : struct
{

    /// <summary>
    /// True if conversion is suitable for low-level optimization using SIMD instructions.
    /// </summary>
    static abstract bool SupportVectorization { get; }

    /// <summary>
    /// True if vectorized conversion requires widening.
    /// </summary>
    static abstract bool RequiresWidening { get; }

    /// <summary>
    /// Performs non-vectorized conversion.
    /// </summary>
    static abstract TResult Convert(T value);

    /// <summary>
    /// Performs vectorized conversion.
    /// </summary>
    static abstract ISimdVector<TResult> Convert(ISimdVector<T> vector);

    /// <summary>
    /// Performs vectorized conversion that requires widening.
    /// </summary>
    static abstract (ISimdVector<TResult> Lower, ISimdVector<TResult> Upper) ConvertWithWidening(ISimdVector<T> vector);
}
This functionality is required for the DataFrame generic math: dotnet/machinelearning#6905

@tannergooding
Member

Why would the IConverter be required here? You could always define a TBinaryOperator that takes a T, returns a TResult, and does the conversion itself directly?
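As an illustration of this suggestion (a hypothetical sketch, not code from the library), a Vector<T>-based operator could take mixed argument types and widen the narrower operand inline:

```csharp
using System;
using System.Numerics;

// Hypothetical operator that performs the conversion itself: it adds an int
// operand to a double operand by widening inline, so no separate IConverter
// abstraction is needed.
readonly struct AddIntToDoubleOperator
{
    public static double Invoke(double x, int y) => x + y;

    public static Vector<double> Invoke(Vector<double> x, Vector<int> y)
    {
        // Widen int -> long, then convert long -> double. Only the lower
        // half of the widened input lines up with one double vector, so a
        // real helper would consume two double vectors per int vector.
        Vector.Widen(y, out Vector<long> lower, out Vector<long> _);
        return x + Vector.ConvertToDouble(lower);
    }
}
```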

@asmirnov82

asmirnov82 commented Jan 5, 2024

Sorry, I don't get your idea. It seems we may be talking about different things. Let me try to explain the issue I faced and would like to solve with the new TensorPrimitives API. The main idea of my proposal was to reuse the existing generic binary operators and helper methods like InvokeSpanSpanIntoSpan for vectorized operations on arguments of different types. Since DataFrame allows users to do math operations with every column inside it, different combinations of input types should be supported, such as:

| Input types    | Output type |
| -------------- | ----------- |
| int64, int64   | int64       |
| int64, int32   | int64       |
| int64, int16   | int64       |
| int32, int32   | int32       |
| int32, uint32  | int64       |
| uint32, int16  | int64       |
| double, int32  | double      |
| double, float  | double      |
| float, int32   | float       |
| float, double  | double      |
| float, float   | float       |

and so on.

I just don't see how this conversion can be defined in TBinaryOperator using generic math, especially for vectors. The only way I see is to have concrete implementations of each operator for all existing type combinations (like AddOperator_IntLong, AddOperator_DoubleFloat, etc.), and the resulting number of operators would be huge. My thought was to keep generic operators whose arguments all have the same type, as we have now, and to have separate converters for each type. In that case every conversion is defined only once.

Converters know the type details: converting float to double or int to double requires vector widening, vectorized conversion of long or ulong to double can be performed directly, and byte to double or byte to long can't be vectorized.

For conversions that require widening, InvokeSpanSpanIntoSpan would of course need a different implementation, as it has to read a pair of vectors on one side for each vector from the other side, but this can be hidden from the public API.

Am I missing something?
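As a concrete illustration of the widening case, here is a hedged sketch using Vector<T> (since the proposed ISimdVector interface isn't available yet); the converter name and member shapes are assumed, not taken from any library:

```csharp
using System;
using System.Numerics;

// Hypothetical converter sketch: float -> double requires widening, so one
// input vector produces a pair of output vectors.
readonly struct FloatToDoubleConverter
{
    public static bool SupportVectorization => true;
    public static bool RequiresWidening => true;

    // Non-vectorized (scalar) conversion.
    public static double Convert(float value) => value;

    // Vectorized conversion with widening: each Vector<float> yields a
    // lower and an upper Vector<double>.
    public static (Vector<double> Lower, Vector<double> Upper) ConvertWithWidening(Vector<float> vector)
    {
        Vector.Widen(vector, out Vector<double> lower, out Vector<double> upper);
        return (lower, upper);
    }
}
```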

@stephentoub stephentoub modified the milestones: 9.0.0, Future Feb 8, 2024
@aalmada

aalmada commented Feb 16, 2024

I've been working on an open-source geometry library that relies on generic math principles, aiming to enhance performance using SIMD technology. One specific area I've been focusing on is optimizing the processing of collections of geometric entities. After experimenting with various concept prototypes and consulting with @tannergooding, I delved into exploring System.Numerics.Tensors. However, it didn't fulfill my needs at the time, so I decided to develop my own version of the library based on Vector<T> instead. Although it shares similarities with System.Numerics.Tensors, there are notable distinctions.

A significant difference lies in the fact that my implementation handles all the same operations with fewer special cases, resulting in a more concise interface surface. I've implemented all operators using the following interfaces:

public interface IOperator
{
    static virtual bool IsVectorizable => true;
}

public interface IUnaryOperator<T, TResult> 
    : IOperator
    where T : struct
    where TResult : struct
{
    static abstract TResult Invoke(T x);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T> x);
}

public interface IBinaryOperator<T1, T2, TResult> 
    : IOperator
    where T1 : struct
    where T2 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, ref readonly Vector<T2> y);
}

public interface IBinaryScalarOperator<T1, T2, TResult> 
    : IOperator
    where T1 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, T2 y);
}

public interface ITernaryOperator<T1, T2, T3, TResult> 
    : IOperator
    where T1 : struct
    where T2 : struct
    where T3 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y, T3 z);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, ref readonly Vector<T2> y, ref readonly Vector<T3> z);
}

public interface IAggregationOperator<T, TResult> 
    : IBinaryOperator<TResult, T, TResult>
    where T : struct
    where TResult : struct
{
    static virtual TResult Seed => Throw.NotSupportedException<TResult>();
    static abstract TResult Invoke(TResult x, TResult y);
}
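To make the pattern concrete, an operator implemented against these interfaces might look like the following sketch (AddOperator here is illustrative, not copied from the library; the interfaces are repeated so the sample is self-contained):

```csharp
using System;
using System.Numerics;

public interface IOperator
{
    static virtual bool IsVectorizable => true;
}

public interface IBinaryOperator<T1, T2, TResult>
    : IOperator
    where T1 : struct
    where T2 : struct
    where TResult : struct
{
    static abstract TResult Invoke(T1 x, T2 y);
    static abstract Vector<TResult> Invoke(ref readonly Vector<T1> x, ref readonly Vector<T2> y);
}

// Illustrative operator: element-wise addition for any type supporting '+'.
public readonly struct AddOperator<T> : IBinaryOperator<T, T, T>
    where T : struct, IAdditionOperators<T, T, T>
{
    public static T Invoke(T x, T y) => x + y;
    public static Vector<T> Invoke(ref readonly Vector<T> x, ref readonly Vector<T> y) => x + y;
}
```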

The only operations I haven't implemented yet, which will require another interface, are the IndexOf operations.

It's worth mentioning that each parameter in these interfaces employs a different generic type, enabling broader coverage of cases and potentially greater adaptability in the future. This also solves the issues mentioned above by @asmirnov82.

To simplify usage, I've included multiple overloads for just two methods, Apply() and Aggregate(), rather than creating unique method names for each parameter combination. This enhances user experience, as users only need to choose whether to apply or aggregate the operator and provide the relevant parameters.

An exception to this approach is AggregatePropagateNan(), which acts as a generalized version of your MinMaxCore(), expanding its usability to other scenarios. Operators must adhere to the existing IAggregationOperator interface. Consequently, its property is named Seed instead of Identity. Also, my implementation is much simpler as you can see here.

Additionally, I've introduced Apply2 and AggregatePropagateNan2(), which can apply two operators in a single span iteration, facilitating operations like SinCos() and MinMax() respectively.

Finally, I've introduced the aggregation methods Aggregate2D(), Aggregate3D(), and Aggregate4D(). These methods are designed to aggregate data from spans containing elements with multiple values stored consecutively in memory. They prove particularly useful for managing aggregation operations on value-type vectors with multiple dimensions.
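The description of Aggregate2D() suggests something like the following scalar sketch (the name, signature, and use of a delegate are assumed from the description, not taken from the library, which presumably uses the static-abstract operator interfaces instead):

```csharp
using System;

static class AggregateSketch
{
    // Scalar sketch of a 2D aggregation: the span holds interleaved (x, y)
    // values and each component is aggregated independently, e.g. summing a
    // span of 2D points. A vectorized version would consume whole vectors of
    // interleaved data per iteration.
    public static (T X, T Y) Aggregate2D<T>(ReadOnlySpan<T> source, Func<T, T, T> op, T seed)
    {
        if (source.Length % 2 != 0)
            throw new ArgumentException("Expected interleaved pairs.", nameof(source));

        (T x, T y) = (seed, seed);
        for (int i = 0; i < source.Length; i += 2)
        {
            x = op(x, source[i]);
            y = op(y, source[i + 1]);
        }
        return (x, y);
    }
}
```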

Lastly, I've organized my implementation into smaller files for easier navigation. 😅

Comprehensive documentation accompanies my implementation, providing detailed insights into these concepts. Benchmarking data within the documentation suggests that System.Numerics.Tensors may not always be the optimal solution across various scenarios. You can find the documentation here and the benchmarking results here. The repository is available here.

In summary, I pursued this alternative because System.Numerics.Tensors didn't meet my requirements, particularly its lack of extensibility. I'm still open to leveraging System.Numerics.Tensors and contributing to its improvement efforts.
