Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOC] Document limitation using cudf.pandas proxy arrays #16955

Open
wants to merge 4 commits into
base: branch-24.12
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions docs/cudf/source/cudf_pandas/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -181,6 +181,19 @@ There are a few known limitations that you should be aware of:
```
- `cudf.pandas` (and cuDF in general) is only compatible with pandas 2. Version
24.02 of cudf was the last to support pandas 1.5.x.
- `cudf.pandas` can interface with functions that utilize NumPy's C API, but doing so requires
a data transfer from device to host to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays)(aka the underlying C array) is set correctly. For example, calling `.values`
below produces a NumPy proxy array that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This phrasing is a little confusing: "doing so requires" seems to imply that the user has to take some action, when really this is being done under the hood. Moreover, compatibility with the C API wasn't really the primary goal of #16601, but rather enabling instance checks (although to your point I think you're right and we may have gotten this for free). I would remove discussion of the NumPy C API here and instead focus on the fact that "in order to handle all cases where users of cudf.pandas might expect to see something than ducktypes as a numpy array, we have to actually return a numpy array and cannot keep the data on device with a cupy array".

Copy link
Contributor Author

@Matt711 Matt711 Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with changing the wording for now and updating it again after more testing described here.

Edit: We probably don't need to mention the C API at all, if you're okay with the latest changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's what I was thinking. We just talk about the conversion happening as the limitation. Once we address the testing you linked, we can add a note here saying that we work with the C API because we convert eagerly.


```python
arr = pd.DataFrame("a":range(10)).values() # implicit DtoH transfer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
arr = pd.DataFrame("a":range(10)).values() # implicit DtoH transfer
arr = pd.DataFrame("a":range(10)).values # implicit DtoH transfer

```
With the data buffer set, other functions which require the data buffer can be used. For example,

```python
import torch
x = torch.from_numpy(arr)
```

## Can I force running on the CPU?

Expand Down
Loading