
RFC, WIP: demo: use dataframe api in datetimeencoder #786

Closed
wants to merge 10 commits

Conversation

MarcoGorelli

I'm not expecting this to be merged, but following some conversation with @Vincent-Maladiere I just wanted to demo how the dataframe-api could work here, and this can get the ball rolling for how to proceed

Some questions which I thought of while doing this:

  • will you always be returning arrays from transformers? If so, then I will need to call .collect in order to materialise the dataframe to an array - but I can do that as late as possible in the function. See transform for an example (and the sketch below):
  • each column is constructed lazily (using namespace.col)
  • then, the new columns are all inserted at once (with X.select), and the insertion can happen in parallel in some implementations (e.g. Polars)
  • a restriction of the dataframe-api is that only string column names are supported - would you be OK with requiring this for your users?

Note that, in Polars, converting a single column to a 1-d ndarray can be zero-copy (numeric data with no nulls), but converting an entire dataframe to a 2-d ndarray always copies data.
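
For illustration, a rough Polars-only sketch of the pattern described above (the column names and the final to-array step are assumptions for the example, not code from this PR):

    from datetime import date

    import polars as pl

    lf = pl.LazyFrame({"date": [date(2023, 10, 10), date(2023, 10, 11)]})

    # each derived column is described lazily - nothing is computed yet
    year = pl.col("date").dt.year().alias("date_year")
    month = pl.col("date").dt.month().alias("date_month")

    # all new columns are materialised in one select, which Polars can run in parallel
    out = lf.select(year, month)

    # only at the very end do we collect and convert to a 2-d numpy array (this copies)
    X = out.collect().to_numpy()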

@Vincent-Maladiere
Member

Vincent-Maladiere commented Oct 10, 2023

Hey @MarcoGorelli, thank you for this proof of concept!

  1. will you always be returning arrays from transformers?
    This depends on the transformer. Conceptually, we either work on single columns (with e.g. the GapEncoder) or on entire dataframes (with e.g. Joiner). Transformers working on dataframes will always output dataframes, but transformers working on single columns will preferably output arrays like scikit-learn does.

    In scikit-learn, transformers output arrays, but the user can decide to output pandas dataframes instead, using transformer.set_output(transform="pandas") (see the snippet after this list). We copy this behavior in skrub by inheriting from sklearn.base.TransformerMixin.

  2. Column transformers like GapEncoder are written in numpy. Therefore, we will collect lazy frames or columns before running those transformers. Would this fit with your dataframe API implementation?

  3. would you be OK with requiring this for your users?
    We want to make skrub accessible to a large user base that may not be familiar with concepts like dtypes. Also, even for more experienced programmers, converting categories or objects to strings is a showstopper.

    An alternative solution would be for us to detect objects and categories and attempt to convert them to strings. What do you think?
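
To make point 1 concrete, here is a small illustration of the set_output mechanism (standard scikit-learn API; the choice of StandardScaler is arbitrary):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    X = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

    # default behaviour: transform returns a numpy array
    print(type(StandardScaler().fit_transform(X)))  # <class 'numpy.ndarray'>

    # opt in to pandas output
    scaler = StandardScaler().set_output(transform="pandas")
    print(type(scaler.fit_transform(X)))  # <class 'pandas.core.frame.DataFrame'>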

@MarcoGorelli
Author

MarcoGorelli commented Oct 10, 2023

Also, even for more experienced programmers, converting categories or objects to strings is a showstopper

Sorry, I just meant that the column names need to be strings. In pandas you can have literally anything as a column name, but in polars (and some other packages) column names have to be strings, so that's what the dataframe api requires

Anyway, I could always work around this in the pandas implementation so that it would be a strict superset of the dataframe api standard... sorry for the noise then, this point may have been irrelevant 😄
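
For reference, the pandas-side workaround for non-string column names is a one-liner (an illustrative snippet, not part of this PR):

    import pandas as pd

    df = pd.DataFrame({0: [1, 2], 1.5: [3, 4]})  # non-string column labels

    # coerce every column label to a string before passing the frame along
    df = df.rename(columns=str)
    print(df.columns.tolist())  # ['0', '1.5']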

@jeromedockes
Member

IMO it's reasonable to require column names to be strings. pandas supporting any hashable object as a column name is rarely useful and sometimes a source of bugs in client code in my experience

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 12, 2023 via email

Author

@MarcoGorelli MarcoGorelli left a comment

I've updated based on recent Dataframe API changes, and added some commentary

The hardest part about the Dataframe API is .collect. For now I'm just doing

        if hasattr(X, 'collect'):
            X = X.collect()

If/when the Consortium agrees on something, we can use that instead.


Also, I've updated dataframe-api-compat to allow for non-string column names, so long as they're unique - technically outside the standard, but it means adoption would result in less breakage

setup.cfg Outdated
@@ -28,7 +28,8 @@ install_requires =
     scikit-learn>=1.2.1
     numpy>=1.23.5
     scipy>=1.9.3
-    pandas>=1.5.3
+    pandas>=2.1.0
+    dataframe-api-compat>=0.1.23
Author

this is extremely light, it's just a bunch of pure-Python files which wrap around the pandas and Polars apis https://pypi.org/project/dataframe-api-compat/0.1.23/#files

Comment on lines +153 to +159
if hasattr(date_series, 'nanosecond'):
    return date_series.nanosecond()
else:
    raise AttributeError(
        "`nanosecond` is not part of the DataFrame API and so support is not guaranteed across all libraries. "
        f"In particular, it is not supported for {date_series.__class__.__name__}."
    )
Author

technically, nanosecond isn't part of the API standard (yet?). But I think you're primarily interested in pandas-Polars compatibility anyway (please correct me if I'm wrong)

Member

Could you move the nanosecond handling into the dataframe API? It would make more sense to have it there and would help your users, IMHO.

@ogrisel ogrisel Oct 30, 2023

What do you mean by "into the dataframe API"? Into the standard? Or the dataframe-api-compat package? It cannot be done without having the stakeholders of the specification process agree on that first.

To move it into the dataframe-api-compat package would be helpful but maybe the dataframe-api-compat package does not want to expose extensions to the standard in its public API. So it would have to wait for this new method to become part of the standard first.

I think keeping this code there is pragmatic for the time being (to avoid introducing a regression w.r.t. what is already done in the current code base).

Author

@MarcoGorelli MarcoGorelli Oct 30, 2023

dataframe-api-compat currently only handles implementations of the Dataframe API for pandas and Polars. For some other libraries, like cudf, they plan to expose the dataframe API methods from their main namespace directly, so they wouldn't have a separate package

I've added nanosecond to dataframe-api-compat, but unless nanosecond is added to https://data-apis.org/dataframe-api/draft/index.html, then there's no guarantee that other libraries (e.g. cudf) would add it

I'll try to get nanoseconds added to the spec though, would make things cleaner

Member

Thank you for the clarification @MarcoGorelli

Comment on lines 189 to 190
if hasattr(X, 'collect'):
    X = X.collect()
Author

Now this is where it gets properly controversial, because .collect isn't part of the Dataframe API

There's a discussion here if you're interested: data-apis/dataframe-api#294 (comment), hope we can find a resolution soon-ish. Any input there would be greatly appreciated

The reason collect is necessary here is that, a few lines down, you're calling float on a scalar

Member

This makes sense. You're anticipating the reflection on the lazy API on our part, which is great!

Ideally, we would like to release skrub soon with weak support of Polars (meaning we will accept it as input but may convert it to numpy or pandas). Later, we'll move to stronger support (limiting Polars conversion to the strictly necessary, e.g., before calling a scikit-learn estimator).

Comment on lines 196 to 201
-if np.nanstd(self._extract_from_date(X[:, i], "total_time")) > 0:
+if float(self._extract_from_date(column, "total_time").std()) > 0:
Author

in the Dataframe API, column reductions (e.g. Column.std) return ducktyped Scalars. If you want to use them within Python (e.g. here within an if-then statement), then you need to call float on them to materialise them to Python
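
As a rough analogue of what that materialisation looks like, here is how the same thing plays out with a plain Polars LazyFrame (the column name is an assumption for the example):

    import polars as pl

    lf = pl.LazyFrame({"total_time": [1.0, 2.0, 3.0]})

    std_expr = pl.col("total_time").std()  # still a lazy expression, no value yet

    # using the reduction inside an if-statement forces computation of a Python scalar
    std_value = lf.select(std_expr).collect().item()
    if std_value > 0:
        print("column has non-zero spread")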

Member

That's very enlightening. I guess this will encourage us to use branchless tricks when possible to keep the laziness.

Author

thanks - correction here, this can be simplified, because if we write

if self._extract_from_date(column, "total_time").std():

then Python will implicitly call

if self._extract_from_date(column, "total_time").std().__bool__():

which will materialise the scalar to Python anyway

Yes, in general, if there's a way to avoid working with Python scalars, that's better

But my suspicion is that, in general, for fit methods you'll need to materialise at some point (counter-examples welcome), whereas transform ones may be able to stay lazy (at least, the ones that don't go via numpy)

Comment on lines 275 to 276
if hasattr(X, 'collect'):
    X = X.collect()
Author

note how this time, in transform, I only needed to call .collect much later, right at the end of the function. I need to call it because you want to return a numpy array, but I can put it as late as possible and previous operations can be done lazily

Member

Ok, there's a discussion we need to have on returning dataframes by default as we do with Joiner or AggJoiner. If a user inputs a LazyFrame, they expect a LazyFrame in return when possible. When it's not, we should follow the scikit-learn design and return a numpy array by default but allow the user to set the output (either Pandas or Polars).

Since outputting Polars dataframes is still under discussion at scikit-learn/scikit-learn#25896, we could implement a workaround with skrub.

@ogrisel ogrisel Oct 30, 2023

When it's not, we should follow the scikit-learn design and return a numpy array by default but allow the user to set the output (either Pandas or Polars).

Since skrub does not have to deal with backward compat and will primarily be used with dataframe inputs, maybe its default policy could be to return an output with the same container type as its input (at least for the estimators for which it would be efficient to do so)?

Are there transformers in skrub that output data structures that could not be written efficiently by composing column-wise dataframe operations? (e.g. row-wise operations on contiguous rows in numpy arrays).

Are there transformers in skrub that cannot output dataframes (e.g. because they can only output scipy sparse matrices that cannot be efficiently represented by a dataframe)?

Author

if you could return a dataframe as output, that would be 110% awesome, as then .transform could be completely lazy (on the other hand, I think .fit would practically always need to materialise data at some point, at least for the parts of skrub I've looked at so far)

Member

Vincent and I were just talking about this. It would be nice if skrub could work on dataframes as much as possible and delay converting to numpy until it is really necessary (eg in a scikit-learn estimator), so that:

  • columns of different types can move through the pipeline without needing to cast everything to a common type
  • when the dataframe contains several pandas blocks or arrow arrays we don't need to copy everything into a numpy array
  • in the long term there is some chance of taking advantage of polars LazyFrames

maybe its default policy could be to return an output with the same container type as its input (at least for the estimators for which it would be efficient to do so)?

yes and ideally wherever possible use the original container internally rather than converting to a different one (and most likely incurring a copy) and then converting back

Are there transformers in skrub that output data structures that could not be written efficiently by composing column-wise dataframe operations? (e.g. row-wise operations on contiguous rows in numpy arrays).

ATM I don't see any but I guess there will be a point in most pipelines where a numpy array is needed, eg when we reach some scikit-learn estimator. Still it would be useful to delay that creation of a (contiguous, homogeneous) numpy array as much as possible

Are there transformers in skrub that cannot output dataframes (e.g. because they can only output scipy sparse matrices that cannot be efficiently represented by a dataframe)?

It could be the case for the TableVectorizer as it allows users to choose the transformers, eg a user could use a TfidfVectorizer as the high_card_cat_transformer. But I think (I may be wrong) that dense data is the main envisioned use case, at least the default transformers output dense arrays. Also IIUC the TableVectorizer is likely to go later in a pipeline than some other skrub transformers as its goal is to transform the input dataframe into a numerical array that can then be fed to scikit-learn.

Member

With Jérôme, we identified two main groups of estimators in skrub: encoders (GapEncoder, MinHashEncoder, SimilarityEncoder) that could have been implemented in scikit-learn, and "assemblers" (naming TBD) which only work with dataframes as input (Joiner, TableVectorizer, ColsSelector).

Comment on lines +185 to +187
if not hasattr(X, "__dataframe_consortium_standard__"):
    X = check_input(X)
    X = pd.DataFrame(X, columns=[str(i) for i in range(X.shape[1])])
Author

this requires some discussion too

currently, you're allowing numpy arrays as inputs, but you're doing operations which numpy doesn't support (e.g. dayofweek) - if the input is a numpy array, is it OK to just pick a dataframe library and construct the input using that? here I'm just picking pandas

Member

That's a good point. We're refactoring the DatetimeEncoder in #784, and we don't plan to have efficient support of Polars in this PR since we're converting the input to numpy right away.

So, to answer your question, you won't need to call a dataframe constructor. However, we will revisit this so that Polars and Pandas can be used without numpy conversions.


I am wondering: do you expect many users to use skrub with non-dataframe inputs? If not, maybe it's still time to stop accepting numpy arrays as input and only accept dataframes. That might make the code simpler and easier to maintain.

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey Marco, thank you for the heads-up :) I think it's only a matter of time before we jump on the dataframe-api bandwagon. Is the join method functional already?


@MarcoGorelli
Author

Is the join method functional already?

yup, should be! at least, it's in both the spec and the implementation

if hasattr(X, 'collect'):
    X = X.collect()
X = X.__dataframe_consortium_standard__(api_version='2023.11-beta')
X = X.persist()
Member

What is the idea behind persist()?

Author

It makes sure that computations prior to this step won't be repeated https://data-apis.org/dataframe-api/draft/API_specification/dataframe_object.html#dataframe_api.DataFrame.persist . In the end, the consortium preferred this to collect. It does pretty much the same thing anyway (in Polars it does .collect().lazy())

Below, the check if df.col(col_name).std() > 0 will trigger computation because Python calls (df.col(col_name).std() > 0).__bool__(). That call produces a Python scalar, so compute has to be triggered here (as opposed to keeping things lazy)

By putting persist, we ensure that the compute which gets triggered only happens from this point until the __bool__ call
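
A rough Polars-only analogue of the behaviour described above (illustrative, not the dataframe-api-compat internals):

    import polars as pl

    lf = pl.LazyFrame({"a": [1, 2, 3]}).with_columns(pl.col("a") * 2)

    # "persist": run the pending work once and re-wrap the result lazily, so that
    # later operations stay lazy but don't repeat the earlier computation
    lf = lf.collect().lazy()

    # only the std reduction below triggers compute now, not the with_columns above
    if lf.select(pl.col("a").std()).collect().item() > 0:
        print("non-constant column")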

Just saw you've already done a lot of work here to support Polars - well done! I'll rebase then

@MarcoGorelli
Author

closing for now as this is really out-of-date

I'm trying to turn things around in the dataframe api because I'm not exactly ecstatic about how it's going (data-apis/dataframe-api#346). Trying to squash the concepts of "column" and "expression" into a single API leads to some very awkward code and error messages, as you've seen

thanks anyway for your feedback - if you're still interested, we can revisit if/when it's in better shape

@jeromedockes
Member

Hi @MarcoGorelli , I'm really sorry for the late reply! somehow I missed this comment. From your comments here and our experimentations with the dataframe API (eg #885 , #827), it seems it might be a little while before we can heavily rely on the dataframe API in skrub -- we'll definitely keep monitoring its progress closely.

In the meantime, support for polars remains a high priority. As in the short term we are targeting pandas and polars only, it should be manageable to write the utilities we need once for polars and once for pandas. So the proposed solution for now is to write 2 implementations for the skrub internal functions that need to be specialized and use single dispatch (#888) - I'd be interested to know what you think. A rough sketch of the idea is below.
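
A minimal sketch of the single-dispatch pattern described above (the function name and behaviour are made up for the example, not skrub's actual internals):

    from functools import singledispatch

    import pandas as pd
    import polars as pl

    @singledispatch
    def n_unique(column):
        # fallback for unsupported container types
        raise TypeError(f"unsupported column type: {type(column)}")

    @n_unique.register
    def _(column: pd.Series) -> int:
        # pandas-specific implementation
        return column.nunique()

    @n_unique.register
    def _(column: pl.Series) -> int:
        # polars-specific implementation
        return column.n_unique()

    print(n_unique(pd.Series([1, 1, 2])))  # 2
    print(n_unique(pl.Series([1, 1, 2])))  # 2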

As we extend our support for polars dataframes I think we may need to adapt a bit the way data is processed in skrub / scikit-learn pipelines to be more compatible with polars expressions and have a chance of eventually benefiting from the lazy api where feasible. This is super vague 😅 but I know we could benefit a lot from your advice on how to use polars well in skrub and I'll try to ping you with more specific questions in other issues/PRs

@MarcoGorelli
Author

Hey @jeromedockes !

I'm afraid I've given up on the dataframe standard project (see data-apis/dataframe-api#346)

I know we could benefit a lot from your advice on how to use polars well in skrub and I'll try to ping you with more specific questions in other issues/PRs

great! and always feel free to book some time if you'd like to chat about anything https://calendly.com/marcogorelli

@jeromedockes
Member

jeromedockes commented Feb 15, 2024 via email
