
ENH: support the Arrow PyCapsule Interface for importing data #59631

Open
jorisvandenbossche opened this issue Aug 27, 2024 · 11 comments
Labels
Arrow (pyarrow functionality), Enhancement

Comments

@jorisvandenbossche
Member

We have #56587 and #59518 now for exporting pandas DataFrame and Series through the Arrow PyCapsule Interface (i.e. adding __arrow_c_stream__ methods), but we don't yet have the import counterpart.

For importing, the specification doesn't provide any API guidelines on what this should look like, so we have a couple of options. The two main ones I can think of:

  • Add a dedicated from_arrow() method, which could be top level (pd.from_arrow(..)) or as class methods (pd.DataFrame.from_arrow(..))
  • Support such objects directly in the main constructors (pd.DataFrame(..))

In pandas itself we already have a couple of from_.. class methods (from_dict/from_records), often for objects that the main constructor also accepts (at least in the dict case). The main differentiator is that the dedicated class methods have more specialized keyword arguments (and can therefore handle a wider variety of input).
So following that pattern, we could do both: add a DataFrame.from_arrow() class method, and also accept such objects in pd.DataFrame(), passing through to from_arrow() (which could have more custom options to control exactly how the conversion from Arrow to pandas is done).
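
A minimal sketch of what such a classmethod could do under the hood, assuming pyarrow performs the actual conversion (dataframe_from_arrow is a hypothetical stand-in for the proposed DataFrame.from_arrow, and dtype_backend mirrors the keyword pandas uses elsewhere):

import pandas as pd
import pyarrow as pa

def dataframe_from_arrow(data, dtype_backend=None):
    # accept any object exporting the Arrow PyCapsule stream interface
    if not hasattr(data, "__arrow_c_stream__"):
        raise TypeError("expected an object implementing __arrow_c_stream__")
    # pa.table() consumes __arrow_c_stream__ exporters (pyarrow >= 14)
    table = pa.table(data)
    if dtype_backend == "pyarrow":
        return table.to_pandas(types_mapper=pd.ArrowDtype)
    return table.to_pandas()  # default: NumPy-backed dtypes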

Looking at polars, it seems they also have both, but I am not entirely sure how the two are connected. pl.from_arrow already existed but might be more specific to pyarrow? And then pola-rs/polars#17693 added support to the main pl.DataFrame(..) constructor (@kylebarron)

For geopandas, I added a GeoDataFrame.from_arrow() method.

(to be clear, everything said above also applies to Series() / Series.from_arrow() etc)

cc @MarcoGorelli @WillAyd

@jorisvandenbossche added the Enhancement and Arrow (pyarrow functionality) labels Aug 27, 2024
@mroeschke
Member

xref #54057 where a user expected pandas.DataFrame to return an ArrowDtype after passing a pyarrow object.

I would be +1 for a from_arrow constructor for objects with an Arrow PyCapsule Interface

@WillAyd
Member

WillAyd commented Aug 27, 2024

Do you know why the pycapsule interface chose not to specify anything around imports? I vaguely recall some upstream conversations about that, but I'm not sure where they landed

My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like Series.from_arrow(capsule, dtype_backend="numpy") seems a bit strange

@kylebarron
Contributor

Looking at polars, it seems they also have both, but I am not entirely sure about the connection between both.

polars has a module-level polars.from_arrow.

The main problem with a module-level from_arrow is that there's no reliable way to know whether PyCapsule input that emits struct arrays is meant to be a Series or a DataFrame. In the Arrow C data interface a struct array is overloaded for both uses, so we need the user to state which target class should be constructed.
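
For illustration, the same struct layout serving both interpretations (pyarrow is used here only to construct the example):

import pyarrow as pa

struct_arr = pa.array([{"x": 1, "y": "a"}, {"x": 2, "y": "b"}])

# read as a DataFrame: each struct field becomes a column
df_view = pa.table({"x": struct_arr.field("x"), "y": struct_arr.field("y")})

# read as a Series: one column whose values are the structs themselves
series_view = struct_arr

# both travel as the identical C struct array, so a generic from_arrow
# cannot tell which interpretation the caller wants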

My PR didn't touch the module level from_arrow. That constructor still only supports known inputs.

@jorisvandenbossche
Member Author

Do you know why the pycapsule interface chose not to specify anything around imports?

It says "this is left up to individual libraries".
The dunder method is not user-visible API, so it is fine to impose requirements there. But for a public import function or method, a library might want to make certain choices to be consistent within their own library.

For example, polars now uses pl.DataFrame, which would work for pandas as well, but the spec can't really require a library.DataFrame(..) usage (not every library uses that name, or uses class constructors, etc)
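
For illustration, the constructor pattern polars ended up with (a pyarrow table standing in here for any __arrow_c_stream__ exporter):

import polars as pl
import pyarrow as pa

tbl = pa.table({"a": [1, 2, 3]})

# the constructor recognizes any object exporting the Arrow C stream
df = pl.DataFrame(tbl)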

(now, while we speak about a public import method, it might certainly be a valid question whether there should be a protocol for import as well, so that you could roundtrip, but that's a different topic I think)

My concern about the Python API is overloading the specification with a bunch of pandas-specific functionality. Maybe that is by design, but having something like Series.from_arrow(capsule, dtype_backend="numpy") seems a bit strange

Why does that seem strange? We have such a keyword in other functions, so why not here? I would say the point of a dedicated from_arrow(..) method is that it makes it easier to add custom keywords when required.

xref #54057 where a user expected pandas.DataFrame to return an ArrowDtype after passing a pyarrow object.

Interesting reference. Personally, I think that by default a method to consume Arrow data should return default data types (and not ArrowDtype). We can give users control over that, though (like with the dtype backend in other IO methods)
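
The existing IO precedent for that kind of control, for reference (dtype_backend is a real keyword on pandas readers; the file name is illustrative):

import pandas as pd

# default: NumPy-backed dtypes
df_default = pd.read_parquet("data.parquet")

# opt-in: ArrowDtype-backed columns
df_arrow = pd.read_parquet("data.parquet", dtype_backend="pyarrow")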

@WillAyd
Member

WillAyd commented Aug 27, 2024

Why does that seem strange? We have such a keyword in other functions, so why not here? I would say the point of a dedicated from_arrow(..) method is that it makes it easier to add custom keywords when required.

I think because this blurs the line between the PyCapsule interface as an exchange mechanism and that same interface as an end-user API. I'm of the impression our target audience is other developers and their libraries, not necessarily an end user using this like it's an I/O method

@WillAyd
Member

WillAyd commented Aug 27, 2024

To give a real use case: I've needed this in a library I created called pantab:

https://github.com/innobi/pantab/blob/ce3dc034102a506c2348de71169859c84c3be231/src/pantab/_reader.py#L13

At least from the perspective of that library, I would ideally want the dataframe libraries to all have one consistent interface. That way, my third-party library could just say "ok, whatever dataframe library you are using, I'm going to send this capsule through to X and you will get back the result you want"

If each library overloads their import mechanisms and offers different features, then third party producers of Arrow data aren't any better off than they are today
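
A rough sketch of what that consistency would let a third-party producer write (from_arrow here is the uniform entry point being proposed, not an existing API):

def frame_from_capsule(stream, target_cls):
    # target_cls is whatever dataframe class the user works with,
    # e.g. pd.DataFrame or pl.DataFrame; the producer never needs
    # library-specific code paths
    return target_cls.from_arrow(stream)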

@kylebarron
Contributor

kylebarron commented Aug 28, 2024

The PyCapsule Interface is focused on use cases around importing foreign data into your library. I think the right way forward is not to specify a particular import API, but rather to advocate for more libraries to look for and understand pycapsule objects.

In your case where you have return_type, I'd argue that's an anti-pattern here. Instead, as long as you return any class that also implements the PyCapsule Interface, users can pass that returned object into whatever library they want.

import polars as pl
from arro3.compute import take
import pyarrow as pa
import pyarrow.parquet as pq

# creates a polars object
df = pl.DataFrame({"a": [1, 2, 3, 4, 5, 6]})

# understands the polars object via the Arrow C stream interface;
# returns an arro3 RecordBatchReader
filtered = take(df, [1, 4, 2, 5])

# understands the arro3 object via the Arrow C stream interface
pq.write_table(pa.table(filtered), "filtered.parquet")

If each library overloads their import mechanisms and offers different features, then third party producers of Arrow data aren't any better off than they are today

In particular, my argument is that an arrow producer should not choose the user-facing API but rather just expose the data protocol. Then the user can choose how to import the data as they wish.
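
A sketch of that producer-side pattern, assuming the result wraps a pyarrow.RecordBatchReader (the class and attribute names here are illustrative, not pantab's actual API):

import pyarrow as pa

class QueryResult:
    def __init__(self, reader: pa.RecordBatchReader):
        self._reader = reader

    def __arrow_c_stream__(self, requested_schema=None):
        # delegate capsule export to the Arrow-native reader; consumers
        # then decide the destination, e.g. pl.DataFrame(result) today
        # or the pd.DataFrame.from_arrow(result) proposed in this issue
        return self._reader.__arrow_c_stream__(requested_schema)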

@WillAyd
Member

WillAyd commented Aug 28, 2024

In your case where you have return_type, I'd argue that's an anti-pattern here

Absolutely. To be clear, that code was written 7 months ago, before any library (except pyarrow) started supporting imports. I am definitely trying to move away from that pattern, not promote its usage

I think the right way forward is not to specify a particular import API, but rather to advocate for more libraries to look for and understand pycapsule objects.

Is the Python capsule available at runtime? I thought it was just for extension authors and not really even inspectable (i.e. can you even do an isinstance check on one?), but maybe that knowledge is outdated

I really like the code that you have there @kylebarron, but the arro3.RecordBatchReader is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?

@kylebarron
Contributor

Is the Python capsule available at runtime?

Sorry, by "pycapsule objects" I meant to say "instances of classes that have Arrow PyCapsule Interface dunder methods and can export PyCapsules".
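
In code terms, detection is duck typing on the dunder rather than an isinstance check on the capsule itself (a minimal sketch):

def exports_arrow_stream(obj) -> bool:
    # the protocol is discoverable with hasattr; a capsule is only
    # created once the dunder is actually called
    return hasattr(obj, "__arrow_c_stream__")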

I really like the code that you have there @kylebarron, but the arro3.RecordBatchReader is the piece that I think we are missing in pandas. Maybe we need something like that instead of just passing around raw capsules?

Well, that's why I created arro3 😄. I wanted a lightweight (~7MB compared to pyarrow's >100MB) library that can manage Arrow data in a compliant way between libraries, but with nicer high-level APIs than nanoarrow. It has wheels for every platform, including pyodide.

@WillAyd
Member

WillAyd commented Aug 28, 2024

Well, I don't want to try and boil the ocean here, but I wonder whether, since we don't require pyarrow, we should look at requiring arro3 as a fallback. I think there's good value in having another library provide a consistent object like a RecordBatchReader for data exchange like this, and we could just accept that in our Series / DataFrame constructors rather than building it ourselves

@kylebarron
Contributor

kylebarron commented Aug 28, 2024

Well, I'd say the point of arro3 is to handle cases like this. But at the same time, being stable enough to be a required pandas dependency is a pretty high bar...

I'd say that in managing Arrow data, arro3 is relatively stable, but that in managing interop with pandas and numpy it's less stable.
