Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Document limitation using cudf.pandas proxy arrays #16955

Open
wants to merge 4 commits into
base: branch-24.12
Choose a base branch
from
Open
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions docs/cudf/source/cudf_pandas/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -181,6 +181,21 @@ There are a few known limitations that you should be aware of:
```
- `cudf.pandas` (and cuDF in general) is only compatible with pandas 2. Version
24.02 of cudf was the last to support pandas 1.5.x.
- In order for `cudf.pandas` to produce a proxy array that ducktypes as a `np.ndarray`, we actually have to wrap a valid `np.ndarray` and cannot keep the data on device with a `cupy` array. This approach incurs the overhead of an initial device-to-host (DtoH) transfer when creating a proxy array. For example,

```python
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a %load_ext cudf.pandas? It seems like we do that elsewhere in this file to be clear that we're using cudf.pandas and not vanilla pandas (please correct me if I'm wrong though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I added it back to be consistent. I like including it in code snippets anyway.

import numpy as np

arr = pd.DataFrame("a":range(10)).values # implicit DtoH transfer
isinstance(arr, np.ndarrray) # returns True
```
The reason why we do the data transfer from device to host is to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays) is set correctly. With the data buffer set, we can utilize other functions which require a valid data buffer.

```python
import torch
x = torch.from_numpy(arr)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite the right phrasing, especially for consistency with above. Above we say that we do this to have proxies ducktype as numpy arrays. I think a more accurate way to frame the whole thing would be for line 184 to say roughly "in order for cudf.pandas to produce a proxy array that ducktypes as a numpy array, we create a proxy type that actually subclasses numpy.ndarray" and remove line 190 showing the transfer. Then, here we can say "because the proxy type ducktypes as a numpy array, there are many situations where numpy will attempt to access internal members like the data buffer that make assumptions about the data being on the host or use the C API, and since we cannot proxy all of those we have to do an eager DtoH copy" and then show that in the second example. Then you can say that "with this approach we also get compatibility with third party libraries like torch". The last bit we can either add now, or after we've done the more thorough C API testing that we're discussing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is clearer. This part: "that make assumptions about the data being on the host or use the C API" is not as clear to me. I think you mean the buffer assumes the data is in host memory, but it doesn't assume you're using the C API? I thinks its numpy functions that assume they can access the buffer directly. That's the assumption that does not work with our proxy arrays.


## Can I force running on the CPU?

Expand Down
Loading