
ENH: support the Arrow PyCapsule Interface on pandas.DataFrame (export) #56587

Merged

Conversation

@jorisvandenbossche (Member)
See apache/arrow#39195 for some context, and https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html for the new Arrow specification.

For now this PR just implements the stream support on DataFrame (using pyarrow under the hood to do the actual conversion). We should also consider adding the array and schema protocol methods.

We could add similar methods on Series, but a Series has less of an exact equivalent in Arrow terms (e.g. it would lose the index).

This PR also only implements exporting a pandas DataFrame through the protocol, not adding support to our constructors to consume (import) any object supporting the protocol.

@WillAyd (Member) commented Jan 5, 2024

Very cool. Is this ready for review, or is it sitting in draft for further development?

@jorisvandenbossche (Member, Author)

Whatever is here is already ready for review.

As mentioned in the top post, we should also consider adding __arrow_c_schema__, because we don't have a custom schema object that could expose it (although a DataFrame is of course not a schema object, so it's not an exact fit).

I should probably also add some details about how the conversion is done (essentially the defaults of pyarrow.Table.from_pandas, at the moment). For example, we can document that the index is converted into a column, except for a RangeIndex.

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

capsule = df.__arrow_c_stream__()
assert (
    ctypes.pythonapi.PyCapsule_IsValid(
        ctypes.py_object(capsule), b"arrow_array_stream"
    )
    == 1
)
@WillAyd (Member) commented Jan 10, 2024

Can we use ctypes to test this a little more deeply? I had in mind something like this:

import pyarrow as pa
import ctypes

tbl = pa.Table.from_pydict({"col": [1, 2, 3]})
stream = tbl.__arrow_c_stream__()

class ArrowSchema(ctypes.Structure):
    pass

ArrowSchema._fields_ = [
    ("format", ctypes.POINTER(ctypes.c_char)),
    ("name", ctypes.POINTER(ctypes.c_char)),
    ("metadata", ctypes.POINTER(ctypes.c_char)),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    # NB there are more members
    # not sure how to define release callback, but probably not important
]

ctypes.pythonapi.PyCapsule_GetName.restype = ctypes.c_char_p
ctypes.pythonapi.PyCapsule_GetName.argtypes = [ctypes.py_object]
nm = ctypes.pythonapi.PyCapsule_GetName(stream)
#assert nm == b"arrow_schema"  # TODO: this actually returns arrow_array_stream

capsule_name = ctypes.create_string_buffer("arrow_array_stream".encode())
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# TODO: not sure why the below isn't working?
#void_ptr = ctypes.pythonapi.PyCapsule_GetPointer(
#    stream,
#    capsule_name
#)
#obj = ctypes.cast(void_ptr, ctypes.POINTER(ArrowSchema))[0]
#assert obj.n_children == 1

I commented out the parts that weren't working. I'm less sure what is going on in the last section, but at the very least there is a problem with the capsule name: it returns b"arrow_array_stream", yet the documentation says it should be "arrow_schema".

@WillAyd (Member)

Ignore what I said before; I mistakenly didn't realize this was returning a stream. This all looks good to me. I think the ctypes approach would get a little too wonky to deal with: here's something I stubbed out, but I'm not sure how ctypes would sanely deal with struct members that are function pointers. Probably too much detail for us to get into on our end.

import pyarrow as pa
import ctypes

tbl = pa.Table.from_pydict({"col": [1, 2, 3]})
stream = tbl.__arrow_c_stream__()

class ArrowSchema(ctypes.Structure):
    pass

class ArrowArray(ctypes.Structure):
    pass

class ArrowArrayStream(ctypes.Structure):
    pass


schema_release_func = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrowSchema._fields_ = [
    ("format", ctypes.POINTER(ctypes.c_char)),
    ("name", ctypes.POINTER(ctypes.c_char)),
    ("metadata", ctypes.POINTER(ctypes.c_char)),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", schema_release_func),
]

array_release_func = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", array_release_func),
]

get_schema_func = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ArrowArrayStream), ctypes.POINTER(ArrowSchema))
get_next_func = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ArrowArrayStream), ctypes.POINTER(ArrowArray))
get_last_error_func = ctypes.CFUNCTYPE(ctypes.c_char_p, ctypes.POINTER(ArrowArrayStream))
stream_release_func = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArrayStream))
ArrowArrayStream._fields_ = [
    ("get_schema", get_schema_func),
    ("get_next", get_next_func),
    ("get_last_error", get_last_error_func),
    ("release", stream_release_func),
]


ctypes.pythonapi.PyCapsule_GetName.restype = ctypes.c_char_p
ctypes.pythonapi.PyCapsule_GetName.argtypes = [ctypes.py_object]
nm = ctypes.pythonapi.PyCapsule_GetName(stream)
assert nm == b"arrow_array_stream"

capsule_name = ctypes.create_string_buffer("arrow_array_stream".encode())
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

void_ptr = ctypes.pythonapi.PyCapsule_GetPointer(
    stream,
    capsule_name
)
stream_obj = ctypes.cast(void_ptr, ctypes.POINTER(ArrowArrayStream))[0]

@jorisvandenbossche (Member, Author)

I also think that, because we use pyarrow here, such detailed testing isn't necessary. We can assume the struct's contents are thoroughly tested on the Arrow side; we mostly need to test that we return the correct capsule (and there is already a test that checks the capsule name with ctypes.pythonapi.PyCapsule_IsValid).

If at some point we would implement our own version of the C Data Interface, then for sure it would need a lot more testing.

@MarcoGorelli (Member) left a comment

anything missing to be able to merge this?

"""
pa = import_optional_dependency("pyarrow", min_version="14.0.0")
if requested_schema is not None:
    requested_schema = pa.Schema._import_from_c_capsule(requested_schema)
Member

Question: Will _import_from_c_capsule become public in the future?

Contributor

Can't you use pa.schema() directly instead of pa.Schema._import_from_c_capsule?

@jorisvandenbossche (Member, Author)

I don't think it will necessarily become public in its current form (but the _import_from_c version has been used in many other external projects, so we won't just change those methods in pyarrow).

Can't you use pa.schema() directly instead of pa.Schema._import_from_c_capsule?

Not directly, because we receive a capsule here (we are inside the low-level dunder), and pa.schema() doesn't accept capsules, only objects implementing __arrow_c_schema__. Of course we could use a small wrapper object that has the dunder method and returns the capsule, if we want to avoid _import_from_c_capsule.

I brought up on the pyarrow side whether we need an "official" way to import capsules (see the last paragraph of apache/arrow#38010), but we should maybe discuss that a bit more, or decide whether we simply "bless" _import_from_c_capsule as the official way to do this.

@MarcoGorelli MarcoGorelli added this to the 2.2 milestone Jan 18, 2024
@MarcoGorelli (Member)

merging then, as discussed

@MarcoGorelli MarcoGorelli merged commit 7212ecd into pandas-dev:main Jan 18, 2024
50 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 18, 2024
MarcoGorelli pushed a commit that referenced this pull request Jan 18, 2024
…Interface on pandas.DataFrame (export)) (#56944)

Backport PR #56587: ENH: support the Arrow PyCapsule Interface on pandas.DataFrame (export)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche deleted the arrow-capsule-interface branch January 18, 2024 22:10
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
…t) (pandas-dev#56587)

* ENH: support the Arrow PyCapsule Interface on pandas.DataFrame

* expand documentation on how index is handled
Labels: Arrow (pyarrow functionality)