
Column reductions should return 1-row Column #297

Closed
MarcoGorelli opened this issue Oct 24, 2023 · 5 comments

Comments

@MarcoGorelli
Contributor

MarcoGorelli commented Oct 24, 2023

Let's talk column reductions.

I see two use cases for them:

  • a user wants the exact value of a scalar, right now:
    df: DataFrame
    df.col('a').mean()
  • a user just needs to use the scalar as part of another operation, so it can stay lazy if necessary:
    df: DataFrame
    df.assign((df.col('a') - df.col('a').mean()).rename('a_centered'))

The Standard currently defines the return value of Column.mean to be Scalar. Implementations are supposed to figure out which of the two cases above the user wants.

I have two problems with this:

  • I really don't like anything related to implicit materialisation (so long as we're defining a top-level Python API)
  • we have an inconsistency with the DataFrame case:
    • DataFrame.mean returns a 1-row DataFrame
    • Column.mean returns a Scalar

Proposal

Column reductions return 1-row Columns (just like how DataFrame reductions return 1-row DataFrames).

Broadcasting rules: in a binary operation between an n-row Column and a 1-row Column, the 1-row Column is broadcast to length n. So column - column.mean() is well-defined, and everything can stay lazy if necessary.
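For illustration, a rough sketch of how this would look in user code (the method names follow the ones already used in this issue; the broadcasting itself is the part being proposed, so treat this as pseudo-usage rather than current behaviour):

df: DataFrame
col = df.col('a')                              # n-row Column
mean = col.mean()                              # 1-row Column under this proposal
centered = (col - mean).rename('a_centered')   # 1-row Column broadcast to length n, stays lazy
df = df.assign(centered)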

If someone really needs the value of a reduction now, they can call .get_value(0). The behaviour of scalars may vary across implementations, but I think that's fine.
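For example (sketch, under the proposal above):

exact_mean = df.col('a').mean().get_value(0)   # concrete value, computed now; materialises lazy backends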

At least for the (much more common) case where reductions are used as part of other operations, everything can stay completely within the DataFrame API, the rules become predictable, and the result is well-defined.

@MarcoGorelli
Contributor Author

@kkraus14 @shwina @rgommers @cbourjau I'd really like to discuss this one on Thursday - gentle ping to have a read through before then if you have time, thanks 🙏

@cbourjau
Contributor

I totally agree that this is a problematic point in the API as it stands now. Thanks for the suggestion!

I am a bit concerned about making Column objects broadcastable, though. The behavior would be rather surprising coming from Pandas (even though I realize that the idea of the standard is not to copy what Pandas is doing):

>>> import pandas as pd
>>> s = pd.Series([1, 2])
>>> s + pd.Series([1])
0    2.0
1    NaN
dtype: float64

I'd rather raise an exception in such a case.
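For what it's worth, a minimal sketch of that stricter behaviour, using a hypothetical helper inside an implementation (not part of the Standard):

def _binary_op(left, right, op):
    # No broadcasting between Columns: lengths must match exactly.
    if len(left) != len(right):
        raise ValueError(
            f"Column length mismatch: {len(left)} != {len(right)}; "
            "Columns are not broadcast against each other"
        )
    return op(left, right)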

I think this may be an instance of a more general question that I am wondering about: To what extent should the data frame standard piggy-back on the array-api? If the dataframe-api were to fully embrace the array-api, then the answer to the problem at hand would be clear: Reduction functions on Column such as mean and std return a rank-0 standard array (of an array-api-implementation chosen by the dataframe implementation). This would mean that the Column's dtype must have a "physical" dtype which is representable by an array-api dtype (maybe a good general restriction?).
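To make that concrete, a purely illustrative sketch, assuming NumPy as the implementation-chosen array-api namespace and assuming the conversion method's signature (neither is defined by the Standard today):

import numpy as np

def column_mean(column) -> np.ndarray:
    # Convert the Column's "physical" data to the chosen array namespace
    # (to_array_object's exact signature is an assumption here).
    arr = np.asarray(column.to_array_object())
    return np.asarray(arr.mean())   # rank-0 (shape ()) array, not a Python scalar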

Alternatively, one may consider that reduction functions are always meant to be called on an array object (i.e. column.to_array_object()). However I think that would pose two problems:

  1. I could imagine that this would cause issues for an SQL-backed implementation?
  2. It would be an awkward user experience since a standard array object currently has no member function mean. Instead, the free function of the array namespace would need to be called.

@MarcoGorelli
Contributor Author

Thanks for looking into this

The behavior would be rather surprising coming from Pandas

It's surprising only if you expect index alignment, which the consortium is explicitly forbidding (it's not even optional, it's strictly forbidden 😄 )

So if you think of columns as array-like, without an index, then this broadcasting behaviour is exactly what you'd expect:

In [4]: import numpy as np

In [5]: s = np.array([1, 2])

In [6]: s + np.array([1])
Out[6]: array([2, 3])

@kkraus14
Collaborator

I think this may be an instance of a more general question that I am wondering about: To what extent should the data frame standard piggy-back on the array-api? If the dataframe-api were to fully embrace the array-api, then the answer to the problem at hand would be clear: Reduction functions on Column such as mean and std return a rank-0 standard array (of an array-api-implementation chosen by the dataframe implementation). This would mean that the Column's dtype must have a "physical" dtype which is representable by an array-api dtype (maybe a good general restriction?).

This doesn't work for String, date, datetime, etc. dtype columns that the array API doesn't support or situations where you need to return NULL.

I believe the decision as of now was that Scalars were to duck-type Python scalars, which allows preventing implicit materialization. I.e., in your example above of df.col('a').mean() vs df.assign((df.col('a') - df.col('a').mean()).rename('a_centered')): assuming df.col('a').mean() returns a Scalar, there's nothing forcing materialization in the second case, and in the first case what forces materialization is __repr__, which is where I think we should focus discussion.
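To illustrate the kind of duck-typed Scalar in question, here's a minimal hypothetical sketch (not the Standard's definition; a real implementation would wrap a backend expression rather than a thunk):

class Scalar:
    def __init__(self, compute):
        self._compute = compute      # thunk producing the concrete value on demand

    def __repr__(self):
        # Printing is what forces materialization in the first case above.
        return repr(self._compute())

    def __float__(self):
        # Explicit conversion to a Python float also materializes.
        return float(self._compute())

In the second case, Column.__sub__ would accept such a Scalar and fold it into its own lazy expression, so nothing is computed until the pipeline actually executes.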

I'm -1 on broadcasting columns because people will run into issues where some data unexpectedly ends up being 1 row, and things get broadcast instead of failing due to a shape mismatch. Scalars allow us to handle that behavior more nicely; we just need to align on what type of behavior we want to have for things like __repr__, __bool__, __int__, etc.

@MarcoGorelli
Contributor Author

thanks for taking a look

we just need to align on what type of behavior we want to have for things like __repr__, __bool__, __int__, etc.

ok, let's pivot to that (or rather, punt on it). I'll just raise for now
