
[Proposal] DataFrame Arithmetic and Computation API #6905

Open
asmirnov82 opened this issue Dec 11, 2023 · 2 comments
Labels
enhancement New feature or request untriaged New issue has not been triaged

Comments

@asmirnov82
Contributor

asmirnov82 commented Dec 11, 2023

Background and motivation

The current arithmetic and computation API of DataFrame is inconsistent and quite slow in scenarios that involve columns of different types, because casting a column to a different type requires copying the entire column data.

Moreover, some methods of the computation API may produce inaccurate results (due to overflow).

Motivation for this change includes:

  1. reusing generic operators from the future TensorPrimitives API for fast computation over DataFrame columns of different types (and possibly even chaining such operations to avoid the intensive memory usage of storing intermediate calculation results)
  2. accurate calculation of aggregation functions
  3. increased performance, using SIMD instructions for a significant part of DataFrame operations
  4. reduced code duplication in DataFrame

Details of current implementation limitations

  1. Inconsistency of the API concerns the types returned by arithmetic operations over PrimitiveDataFrameColumn<T> instances and their concrete aliases (such as DoubleDataFrameColumn or Int32DataFrameColumn).

For example:

var left_column = new ByteDataFrameColumn("Left", new byte[] { 1, 2, 3 });
var right_column = new Int16DataFrameColumn("Right", new short[] { 1, 2, 3 });

var sum = left_column + right_column;

results in a sum column containing int values.

The same code, but referring to the columns via their parent type,

PrimitiveDataFrameColumn<byte> left_column = new ByteDataFrameColumn("Left", new byte[] { 1, 2, 3 });
PrimitiveDataFrameColumn<short> right_column = new Int16DataFrameColumn("Right", new short[] { 1, 2, 3 });

var sum = left_column + right_column;

results in a sum column containing double values.

  2. Inaccurate results of aggregation functions. Currently aggregation functions accumulate using the same type as the input column; for a byte column, for example, the accumulator is also byte. As a result,
var column = new ByteDataFrameColumn("Column", new byte[] { 128, 129 });

var sum = column.Sum();
var mean = column.Mean();

results in sum equal to 1 and mean equal to 0.5, which is obviously incorrect (the true values are 257 and 128.5).
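The underlying cause can be reproduced in plain C#, with no DataFrame dependency: a byte accumulator wraps modulo 256, while a widened accumulator stays exact. A minimal sketch:

```csharp
using System;

byte a = 128, b = 129;

// 128 + 129 = 257, which does not fit in a byte and wraps to 1
byte byteSum = unchecked((byte)(a + b));

// Widening the accumulator (as the proposal suggests) keeps the sum exact
long wideSum = (long)a + b;

Console.WriteLine(byteSum); // 1
Console.WriteLine(wideSum); // 257
```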

  3. Performance and excessive copying issues are well known and are mentioned in "Reduce the number of copies in binary operations in DataFrame" #5663.

Proposal

The proposal is:

  1. to use a common numeric type for arithmetic output: the smallest numeric type that can accommodate any value of any input. If any input is a floating point type, the common numeric type is the widest floating point type among the inputs; otherwise the common numeric type is integral, and is signed if any input is signed.

  2. to return the widest possible type from accumulating aggregation functions (such as sum and product): double for floating point inputs (double, float, Half), long for signed integers (long, int, short, sbyte) and ulong for unsigned integers (ulong, uint, ushort, byte); other aggregation functions (such as min, max and first) keep the input type.

This approach is used by the Apache Arrow Compute functions.
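The common-type rule from point 1 can be sketched as a standalone helper. This is a hypothetical illustration, not part of the DataFrame API (Half is omitted for brevity, and the signed/unsigned 64-bit edge case is capped rather than widened, since no wider exact type exists):

```csharp
using System;

// Hypothetical sketch of the proposed common numeric type resolution.
static Type CommonNumericType(Type l, Type r)
{
    // If any input is floating point, the widest floating point type wins.
    if (l == typeof(double) || r == typeof(double)) return typeof(double);
    if (l == typeof(float) || r == typeof(float)) return typeof(float);

    // (isSigned, bitWidth) for the integral types.
    static (bool Signed, int Bits) Info(Type t) =>
        t == typeof(sbyte)  ? (true, 8)   :
        t == typeof(byte)   ? (false, 8)  :
        t == typeof(short)  ? (true, 16)  :
        t == typeof(ushort) ? (false, 16) :
        t == typeof(int)    ? (true, 32)  :
        t == typeof(uint)   ? (false, 32) :
        t == typeof(long)   ? (true, 64)  :
        (false, 64); // ulong

    var (ls, lb) = Info(l);
    var (rs, rb) = Info(r);
    bool signed = ls || rs;

    // An unsigned input needs one extra bit to fit into a signed result,
    // so promote it to the next wider width.
    int bits = Math.Max(!ls && signed ? lb * 2 : lb,
                        !rs && signed ? rb * 2 : rb);
    bits = Math.Min(bits, 64); // ulong + signed has no exact common type

    return (signed, bits) switch
    {
        (false, 8)  => typeof(byte),
        (false, 16) => typeof(ushort),
        (false, 32) => typeof(uint),
        (false, _)  => typeof(ulong),
        (true, 8)   => typeof(sbyte),
        (true, 16)  => typeof(short),
        (true, 32)  => typeof(int),
        _           => typeof(long),
    };
}

Console.WriteLine(CommonNumericType(typeof(byte), typeof(short))); // System.Int16
Console.WriteLine(CommonNumericType(typeof(uint), typeof(int)));   // System.Int64
```

Under this rule the ByteDataFrameColumn + Int16DataFrameColumn example above would yield a short column (short accommodates every byte value), resolving the inconsistency described in point 1 of the limitations.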

Implementation

The implementation can reuse the low-level parts of the future generic TensorPrimitives API, such as IUnaryOperator<T>, IBinaryOperator<T> and their concrete implementations.

These operators invoke highly efficient vectorized calculations over arguments of the same type. The new implementation should extend this functionality with efficient vectorized converters and conversion rules for cases where the arguments have different data types.
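As a rough illustration of the convert-then-operate idea (using .NET 7 generic math rather than the actual TensorPrimitives operator interfaces; the method name is hypothetical), both inputs can be widened to the resolved common type before the inner loop:

```csharp
using System;
using System.Numerics;

// Hypothetical sketch: widen both inputs to the common type TOut, then add.
// A real implementation would use SIMD converters (e.g. Vector.Widen)
// instead of this scalar loop; this only illustrates the conversion step.
static TOut[] AddWidened<TLeft, TRight, TOut>(TLeft[] left, TRight[] right)
    where TLeft : INumber<TLeft>
    where TRight : INumber<TRight>
    where TOut : INumber<TOut>
{
    var result = new TOut[left.Length];
    for (int i = 0; i < left.Length; i++)
        result[i] = TOut.CreateChecked(left[i]) + TOut.CreateChecked(right[i]);
    return result;
}

// byte + short -> short, per the proposed common-type rule
short[] sum = AddWidened<byte, short, short>(
    new byte[] { 1, 2, 3 }, new short[] { 1, 2, 3 });
Console.WriteLine(string.Join(", ", sum)); // 2, 4, 6
```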

Example

I created a POC of a possible implementation.
Apache Arrow doesn't have a .NET implementation of the Compute Functions, so I took that as an exercise for the POC. DataFrame follows the same paradigm, so all of the ideas can be applied to it as well.

I extended several operators from the TensorPrimitives API with generic versions (to illustrate what a future generic Tensor API could look like and how it could be used). I also created several OperatorExecutors and Functions classes that provide the argument type conversion rules for several cases (binary arithmetic calculations and two kinds of aggregation functions), as well as a list of wideners and converters that leverage SIMD vectorized calculation as an example.

@asmirnov82 asmirnov82 added the enhancement New feature or request label Dec 11, 2023
@ghost ghost added the untriaged New issue has not been triaged label Dec 11, 2023
@davesearle

This is great. Could the DataFrame also support streaming operations (similar to how Apache Arrow Acero is architected)? I've recently been looking to implement support for Acero in the C# Arrow client: apache/arrow#37544

@asmirnov82
Contributor Author

@davesearle DataFrame was designed to handle situations where all the data is in memory, so streaming was not the primary goal. However, DataFrame allows conversion to a collection of Arrow RecordBatches (in most cases without a memory copy), which is why it can potentially be used as an input for an Acero record_batch_source.
