-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOC] Document limitation using cudf.pandas
proxy arrays
#16955
base: branch-24.12
Are you sure you want to change the base?
[DOC] Document limitation using cudf.pandas
proxy arrays
#16955
Conversation
docs/cudf/source/cudf_pandas/faq.md
Outdated
below produces a NumPy proxy array that | ||
|
||
```python | ||
arr = pd.DataFrame("a":range(10)).values() # implicit DtoH transfer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arr = pd.DataFrame("a":range(10)).values() # implicit DtoH transfer | |
arr = pd.DataFrame("a":range(10)).values # implicit DtoH transfer |
docs/cudf/source/cudf_pandas/faq.md
Outdated
- `cudf.pandas` can interface with functions that utilize NumPy's C API, but doing so requires | ||
a data transfer from device to host to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays)(aka the underlying C array) is set correctly. For example, calling `.values` | ||
below produces a NumPy proxy array that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This phrasing is a little confusing: "doing so requires" seems to imply that the user has to take some action, when really this is being done under the hood. Moreover, compatibility with the C API wasn't really the primary goal of #16601, but rather enabling instance checks (although to your point I think you're right and we may have gotten this for free). I would remove discussion of the NumPy C API here and instead focus on the fact that "in order to handle all cases where users of cudf.pandas might expect to see something than ducktypes as a numpy array, we have to actually return a numpy array and cannot keep the data on device with a cupy array".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with changing the wording for now and updating it again after more testing described here.
Edit: We probably don't need to mention the C API at all, if you're okay with the latest changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's what I was thinking. We just talk about the conversion happening as the limitation. Once we address the testing you linked, we can add a note here saying that we work with the C API because we convert eagerly.
cudf.pandas
and functions using NumPy's C APIcudf.pandas
proxy arrays
- In order for `cudf.pandas` to produce a proxy array that ducktypes as a `np.ndarray`, we actually have to wrap a valid `np.ndarray` and cannot keep the data on device with a `cupy` array. This approach incurs the overhead of an initial device-to-host (DtoH) transfer when creating a proxy array. For example, | ||
|
||
```python | ||
import pandas as pd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a %load_ext cudf.pandas
? It seems like we do that elsewhere in this file to be clear that we're using cudf.pandas and not vanilla pandas (please correct me if I'm wrong though).
The reason why we do the data transfer from device to host is to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays) is set correctly. With the data buffer set, we can utilize other functions which require a valid data buffer. | ||
|
||
```python | ||
import torch | ||
x = torch.from_numpy(arr) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't quite the right phrasing, especially for consistency with above. Above we say that we do this to have proxies ducktype as numpy arrays. I think a more accurate way to frame the whole thing would be for line 184 to say roughly "in order for cudf.pandas to produce a proxy array that ducktypes as a numpy array, we create a proxy type that actually subclasses numpy.ndarray" and remove line 190 showing the transfer. Then, here we can say "because the proxy type ducktypes as a numpy array, there are many situations where numpy will attempt to access internal members like the data buffer that make assumptions about the data being on the host or use the C API, and since we cannot proxy all of those we have to do an eager DtoH copy" and then show that in the second example. Then you can say that "with this approach we also get compatibility with third party libraries like torch". The last bit we can either add now, or after we've done the more thorough C API testing that we're discussing.
Description
When instantiating a
cudf.pandas
proxy array, a DtoH transfer occurs so that the data buffer is set correctly. We do this because functions which utilize NumPy's C API can utilize the data buffer directly instead of going through__array__
. This PR documents this limitation.Checklist