-
Notifications
You must be signed in to change notification settings - Fork 887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DOC] Document limitation using cudf.pandas
proxy arrays
#16955
base: branch-24.12
Are you sure you want to change the base?
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -181,6 +181,21 @@ There are a few known limitations that you should be aware of: | |
``` | ||
- `cudf.pandas` (and cuDF in general) is only compatible with pandas 2. Version | ||
24.02 of cudf was the last to support pandas 1.5.x. | ||
- In order for `cudf.pandas` to produce a proxy array that ducktypes as a `np.ndarray`, we actually have to wrap a valid `np.ndarray` and cannot keep the data on device with a `cupy` array. This approach incurs the overhead of an initial device-to-host (DtoH) transfer when creating a proxy array. For example, | ||
|
||
```python | ||
import pandas as pd | ||
import numpy as np | ||
|
||
arr = pd.DataFrame("a":range(10)).values # implicit DtoH transfer | ||
isinstance(arr, np.ndarrray) # returns True | ||
``` | ||
The reason why we do the data transfer from device to host is to ensure that the [data buffer](https://numpy.org/doc/stable/dev/internals.html#internal-organization-of-numpy-arrays) is set correctly. With the data buffer set, we can utilize other functions which require a valid data buffer. | ||
|
||
```python | ||
import torch | ||
x = torch.from_numpy(arr) | ||
``` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This isn't quite the right phrasing, especially for consistency with above. Above we say that we do this to have proxies ducktype as numpy arrays. I think a more accurate way to frame the whole thing would be for line 184 to say roughly "in order for cudf.pandas to produce a proxy array that ducktypes as a numpy array, we create a proxy type that actually subclasses numpy.ndarray" and remove line 190 showing the transfer. Then, here we can say "because the proxy type ducktypes as a numpy array, there are many situations where numpy will attempt to access internal members like the data buffer that make assumptions about the data being on the host or use the C API, and since we cannot proxy all of those we have to do an eager DtoH copy" and then show that in the second example. Then you can say that "with this approach we also get compatibility with third party libraries like torch". The last bit we can either add now, or after we've done the more thorough C API testing that we're discussing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Overall, this is clearer. This part: "that make assumptions about the data being on the host or use the C API" is not as clear to me. I think you mean the buffer assumes the data is in host memory, but it doesn't assume you're using the C API? I thinks its numpy functions that assume they can access the buffer directly. That's the assumption that does not work with our proxy arrays. |
||
|
||
## Can I force running on the CPU? | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a
%load_ext cudf.pandas
? It seems like we do that elsewhere in this file to be clear that we're using cudf.pandas and not vanilla pandas (please correct me if I'm wrong though).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I added it back to be consistent. I like including it in code snippets anyway.