Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Document limitation using cudf.pandas proxy arrays #16955

Open
wants to merge 3 commits into
base: branch-24.12
Choose a base branch
from

Conversation

Matt711
Copy link
Contributor

@Matt711 Matt711 commented Sep 30, 2024

Description

When instantiating a cudf.pandas proxy array, a DtoH transfer occurs so that the data buffer is set correctly. We do this because functions which utilize NumPy's C API can utilize the data buffer directly instead of going through __array__. This PR documents this limitation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added the doc Documentation label Sep 30, 2024
@Matt711 Matt711 self-assigned this Sep 30, 2024
@Matt711 Matt711 added non-breaking Non-breaking change cudf.pandas Issues specific to cudf.pandas labels Sep 30, 2024
below produces a NumPy proxy array that

```python
arr = pd.DataFrame("a":range(10)).values() # implicit DtoH transfer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
arr = pd.DataFrame("a":range(10)).values() # implicit DtoH transfer
arr = pd.DataFrame("a":range(10)).values # implicit DtoH transfer

Comment on lines 184 to 186
- `cudf.pandas` can interface with functions that utilize NumPy's C API, but doing so requires
a data transfer from device to host to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays)(aka the underlying C array) is set correctly. For example, calling `.values`
below produces a NumPy proxy array that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This phrasing is a little confusing: "doing so requires" seems to imply that the user has to take some action, when really this is being done under the hood. Moreover, compatibility with the C API wasn't really the primary goal of #16601, but rather enabling instance checks (although to your point I think you're right and we may have gotten this for free). I would remove discussion of the NumPy C API here and instead focus on the fact that "in order to handle all cases where users of cudf.pandas might expect to see something than ducktypes as a numpy array, we have to actually return a numpy array and cannot keep the data on device with a cupy array".

Copy link
Contributor Author

@Matt711 Matt711 Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with changing the wording for now and updating it again after more testing described here.

Edit: We probably don't need to mention the C API at all, if you're okay with the latest changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's what I was thinking. We just talk about the conversion happening as the limitation. Once we address the testing you linked, we can add a note here saying that we work with the C API because we convert eagerly.

@github-actions github-actions bot removed the cudf.pandas Issues specific to cudf.pandas label Oct 3, 2024
@Matt711 Matt711 changed the title [DOC] Document limitation using cudf.pandas and functions using NumPy's C API [DOC] Document limitation using cudf.pandas proxy arrays Oct 3, 2024
@Matt711 Matt711 requested a review from vyasr October 3, 2024 17:53
- In order for `cudf.pandas` to produce a proxy array that ducktypes as a `np.ndarray`, we actually have to wrap a valid `np.ndarray` and cannot keep the data on device with a `cupy` array. This approach incurs the overhead of an initial device-to-host (DtoH) transfer when creating a proxy array. For example,

```python
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a %load_ext cudf.pandas? It seems like we do that elsewhere in this file to be clear that we're using cudf.pandas and not vanilla pandas (please correct me if I'm wrong though).

Comment on lines +193 to +198
The reason why we do the data transfer from device to host is to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays) is set correctly. With the data buffer set, we can utilize other functions which require a valid data buffer.

```python
import torch
x = torch.from_numpy(arr)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite the right phrasing, especially for consistency with above. Above we say that we do this to have proxies ducktype as numpy arrays. I think a more accurate way to frame the whole thing would be for line 184 to say roughly "in order for cudf.pandas to produce a proxy array that ducktypes as a numpy array, we create a proxy type that actually subclasses numpy.ndarray" and remove line 190 showing the transfer. Then, here we can say "because the proxy type ducktypes as a numpy array, there are many situations where numpy will attempt to access internal members like the data buffer that make assumptions about the data being on the host or use the C API, and since we cannot proxy all of those we have to do an eager DtoH copy" and then show that in the second example. Then you can say that "with this approach we also get compatibility with third party libraries like torch". The last bit we can either add now, or after we've done the more thorough C API testing that we're discussing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc Documentation non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants