-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix constructing Column from column_view with expired mask #11354
Fix constructing Column from column_view with expired mask #11354
Conversation
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.4]}) | ||
df = cudf.from_pandas(pdf) | ||
|
||
assert_eq(pdf.iloc[0], df.iloc[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm almost inclined to just make this
assert_eq(pdf.iloc[0], df.iloc[0]) | |
df.iloc[0] |
The assertion obscures what we're really testing, which is that we have a valid, indexable object at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would check that the result is correct too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it doesn't hurt to do both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, here's a slightly more minimal exhibitor of the issue:
from cudf import _lib as libcudf
column = cudf.Series([1.0, None])[:1]._column
transpose_column = libcudf.transpose.transpose(column) => segfault
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no problem with checking both, but I want it to be obvious why this is different from every other iloc
test without people having to open their browser to the GH issue. Otherwise this test just looks redundant (assuming that people don't read the issue, which probably is true about half the time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix'd.
Honestly, I think iloc
just happened to be where this bug manifested and this isn't the most appropriate place to test this fix. At the same time, I can't think of any test of our public API that would reliably test against failures like this.
One more reason to have a stable, tested, public, set of wrappers around libcudf
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I started that conversation it seems like we come up with new reasons every day :) All the more reason to get started around that sooner!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some queries for my own edification mostly
# 1) `cv` is constructed as a view into a | ||
# `cudf::column` that is nullable (i.e., it has | ||
# a null mask), but contains no nulls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so cv.has_nulls()
is True
, but cv.null_count() == 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. cv.has_nulls
is true because it's a libcudf object tied to a nullable cudf::column
. cuDF Python actually never holds on to cudf::column
objects, though. It just extracts the underlying rmm buffer and stores it into a Column
object (this is to support other owners of the data like cupy arrays, see https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design.html#buffer). Once that Column
object owns the data, the original cudf::column
goes out of scope, and in possibly its null mask too if the null count was zero. The Column
is not guaranteed to hold on to the null mask of the cudf::column
if the null count was zero. I assume that decision was originally made for performance reasons, but @shwina could correct me if there are more important reasons as well not to keep that null mask around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Actually, my analysis was a little off cv.has_nulls()
is False
(because although the column is nullable, the null_count is 0, has_nulls()
is just {return null_count() > 0}
).
This is a pretty subtle footgun, such that I worry that while this fix handles the problem, it's entirely possible that a similar problem is lurking elsewhere.
In this case, the calling sequence that led to the problems is via libcudf.transpose
:
def transpose(list source_columns):
"""Transpose m n-row columns into n m-row columns
"""
cdef pair[unique_ptr[column], table_view] c_result
cdef table_view c_input = table_view_from_columns(source_columns)
with nogil:
c_result = move(cpp_transpose(c_input))
result_owner = Column.from_unique_ptr(move(c_result.first))
return columns_from_table_view(
c_result.second,
owners=[result_owner] * c_result.second.num_columns()
)
This code looks benign, but the problem is that any cudf::column_view
into a cudf::column
is in the general case invalid after Column.from_unique_ptr
. In this case, cpp_transpose
returns a pair of a unique_ptr<cudf::column>
and a cudf::table_view
(which is a wrapper around column_view
s) into the returned cudf::column
.
So in this case c_result.first
is a cudf::column
which is nullable
(so _null_mask != NULL
), but has_nulls
is False
.
So now we move
the result out and construct a Column
from from_unique_ptr
which does:
has_nulls = c_col.get()[0].has_nulls()
# After call to release(), c_col is unusable
cdef column_contents contents = move(c_col.get()[0].release())
data = DeviceBuffer.c_from_unique_ptr(move(contents.data))
data = Buffer(data)
if has_nulls:
mask = DeviceBuffer.c_from_unique_ptr(move(contents.null_mask))
mask = Buffer(mask)
null_count = c_col.get()[0].null_count()
else:
mask = None
null_count = 0
So the null_mask
is dropped, and I guess, RAII means it is collected at this point. But, uh-oh, the column_view
s we have lying around still point at the null_mask
(so we have use-after-free potential).
I think that getting this fix in should not wait on this discussion, but shall we move it to an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that getting this fix in should not wait on this discussion, but shall we move it to an issue?
I had a look through the callers from XXX_from_table_view
and this looks like the only bad place (and the fix in this PR handles any other potential issues I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you @wence- - there's no reason the same pattern couldn't be lurking elsewhere.
We should consider not dropping the nullmask in from_unique_ptr
. Memory usage is one reason we chose to discard "all valid" null masks, but there could be other breakages: for example, I believe CuPy will complain when consuming a CUDA array interface that contains a mask.
I'd rather we take that on early in the next release than late in this one. Does that sound alright?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to propose not dropping the nullmask as well, but decided to wait until next release since that's definitely a bigger change that requires some thought. I definitely think that's the right way to go, though. We should try to represent the original column
as faithfully as possible to avoid creating these kinds of holes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #11384.
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.4]}) | ||
df = cudf.from_pandas(pdf) | ||
|
||
assert_eq(pdf.iloc[0], df.iloc[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, here's a slightly more minimal exhibitor of the issue:
from cudf import _lib as libcudf
column = cudf.Series([1.0, None])[:1]._column
transpose_column = libcudf.transpose.transpose(column) => segfault
@shwina there are a couple of open discussions, but nothing crucial. If you don't think they're worth addressing (especially my question around tests) feel free to merge when ready. I'm not overly concerned with urgency now that we have the fix ready to go, though. |
@gpucibot merge |
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-22.08 #11354 +/- ##
================================================
+ Coverage 86.37% 86.43% +0.05%
================================================
Files 143 144 +1
Lines 22753 22808 +55
================================================
+ Hits 19653 19713 +60
+ Misses 3100 3095 -5
Continue to review full report at Codecov.
|
Description
This PR fixes a subtle bug leading to a segfault when constructing a
Column
from acolumn_view
.Closes #11349
Checklist