Fix constructing Column from column_view with expired mask #11354

shwina · 2022-07-26T16:57:24Z

Description

This PR fixes a subtle bug leading to a segfault when constructing a Column from a column_view.

Closes #11349

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

python/cudf/cudf/_lib/column.pyx

vyasr · 2022-07-26T17:16:52Z

python/cudf/cudf/tests/test_indexing.py

+    pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.4]})
+    df = cudf.from_pandas(pdf)
+
+    assert_eq(pdf.iloc[0], df.iloc[0])


I'm almost inclined to just make this

Suggested change

assert_eq(pdf.iloc[0], df.iloc[0])

df.iloc[0]

The assertion obscures what we're really testing, which is that we have a valid, indexable object at the end.

I would check that the result is correct too

I think it doesn't hurt to do both?

FWIW, here's a slightly more minimal exhibitor of the issue:

from cudf import _lib as libcudf column = cudf.Series([1.0, None])[:1]._column transpose_column = libcudf.transpose.transpose(column) => segfault

I have no problem with checking both, but I want it to be obvious why this is different from every other iloc test without people having to open their browser to the GH issue. Otherwise this test just looks redundant (assuming that people don't read the issue, which probably is true about half the time).

Fix'd.

Honestly, I think iloc just happened to be where this bug manifested and this isn't the most appropriate place to test this fix. At the same time, I can't think of any test of our public API that would reliably test against failures like this.

One more reason to have a stable, tested, public, set of wrappers around libcudf :)

Now that I started that conversation it seems like we come up with new reasons every day :) All the more reason to get started around that sooner!

wence-

Some queries for my own edification mostly

wence- · 2022-07-26T17:30:12Z

python/cudf/cudf/_lib/column.pyx

+                    # 1) `cv` is constructed as a view into a
+                    #    `cudf::column` that is nullable (i.e., it has
+                    #    a null mask), but contains no nulls.


OK, so cv.has_nulls() is True, but cv.null_count() == 0?

Correct. cv.has_nulls is true because it's a libcudf object tied to a nullable cudf::column. cuDF Python actually never holds on to cudf::column objects, though. It just extracts the underlying rmm buffer and stores it into a Column object (this is to support other owners of the data like cupy arrays, see https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design.html#buffer). Once that Column object owns the data, the original cudf::column goes out of scope, and in possibly its null mask too if the null count was zero. The Column is not guaranteed to hold on to the null mask of the cudf::column if the null count was zero. I assume that decision was originally made for performance reasons, but @shwina could correct me if there are more important reasons as well not to keep that null mask around.

Thanks!

Actually, my analysis was a little off cv.has_nulls() is False (because although the column is nullable, the null_count is 0, has_nulls() is just {return null_count() > 0}).

This is a pretty subtle footgun, such that I worry that while this fix handles the problem, it's entirely possible that a similar problem is lurking elsewhere.

In this case, the calling sequence that led to the problems is via libcudf.transpose:

def transpose(list source_columns): """Transpose m n-row columns into n m-row columns """ cdef pair[unique_ptr[column], table_view] c_result cdef table_view c_input = table_view_from_columns(source_columns) with nogil: c_result = move(cpp_transpose(c_input)) result_owner = Column.from_unique_ptr(move(c_result.first)) return columns_from_table_view( c_result.second, owners=[result_owner] * c_result.second.num_columns() )

This code looks benign, but the problem is that any cudf::column_view into a cudf::column is in the general case invalid after Column.from_unique_ptr. In this case, cpp_transpose returns a pair of a unique_ptr<cudf::column> and a cudf::table_view (which is a wrapper around column_views) into the returned cudf::column.

So in this case c_result.first is a cudf::column which is nullable (so _null_mask != NULL), but has_nulls is False.

So now we move the result out and construct a Column from from_unique_ptr which does:

has_nulls = c_col.get()[0].has_nulls() # After call to release(), c_col is unusable cdef column_contents contents = move(c_col.get()[0].release()) data = DeviceBuffer.c_from_unique_ptr(move(contents.data)) data = Buffer(data) if has_nulls: mask = DeviceBuffer.c_from_unique_ptr(move(contents.null_mask)) mask = Buffer(mask) null_count = c_col.get()[0].null_count() else: mask = None null_count = 0

So the null_mask is dropped, and I guess, RAII means it is collected at this point. But, uh-oh, the column_views we have lying around still point at the null_mask (so we have use-after-free potential).

I think that getting this fix in should not wait on this discussion, but shall we move it to an issue?

I think that getting this fix in should not wait on this discussion, but shall we move it to an issue?

I had a look through the callers from XXX_from_table_view and this looks like the only bad place (and the fix in this PR handles any other potential issues I think).

I agree with you @wence- - there's no reason the same pattern couldn't be lurking elsewhere.

We should consider not dropping the nullmask in from_unique_ptr. Memory usage is one reason we chose to discard "all valid" null masks, but there could be other breakages: for example, I believe CuPy will complain when consuming a CUDA array interface that contains a mask.

I'd rather we take that on early in the next release than late in this one. Does that sound alright?

I was going to propose not dropping the nullmask as well, but decided to wait until next release since that's definitely a bigger change that requires some thought. I definitely think that's the right way to go, though. We should try to represent the original column as faithfully as possible to avoid creating these kinds of holes.

Opened #11384.

wence- · 2022-07-26T17:54:34Z

python/cudf/cudf/tests/test_indexing.py

+    pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.4]})
+    df = cudf.from_pandas(pdf)
+
+    assert_eq(pdf.iloc[0], df.iloc[0])


FWIW, here's a slightly more minimal exhibitor of the issue:

from cudf import _lib as libcudf column = cudf.Series([1.0, None])[:1]._column transpose_column = libcudf.transpose.transpose(column) => segfault

python/cudf/cudf/_lib/column.pyx

vyasr · 2022-07-27T00:45:55Z

@shwina there are a couple of open discussions, but nothing crucial. If you don't think they're worth addressing (especially my question around tests) feel free to merge when ready. I'm not overly concerned with urgency now that we have the fix ready to go, though.

vyasr · 2022-07-27T18:01:38Z

@gpucibot merge

shwina · 2022-07-28T10:40:59Z

rerun tests

codecov · 2022-07-28T12:31:24Z

Codecov Report

Merging #11354 (127a7b9) into branch-22.08 (204218a) will increase coverage by 0.05%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.08   #11354      +/-   ##
================================================
+ Coverage         86.37%   86.43%   +0.05%     
================================================
  Files               143      144       +1     
  Lines             22753    22808      +55     
================================================
+ Hits              19653    19713      +60     
+ Misses             3100     3095       -5

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/column.py	`88.60% <0.00%> (-1.18%)`	⬇️
python/cudf/cudf/core/indexed_frame.py	`92.07% <0.00%> (-0.41%)`	⬇️
python/dask_cudf/dask_cudf/tests/test_groupby.py	`99.68% <0.00%> (-0.32%)`	⬇️
python/cudf/cudf/core/column/numerical.py	`95.90% <0.00%> (-0.30%)`	⬇️
python/cudf/cudf/options.py	`86.66% <0.00%> (ø)`
python/dask_cudf/dask_cudf/groupby.py	`97.44% <0.00%> (+0.01%)`	⬆️
python/cudf/cudf/core/column/string.py	`88.80% <0.00%> (+0.12%)`	⬆️
python/cudf/cudf/__init__.py	`90.90% <0.00%> (+0.21%)`	⬆️
python/cudf/cudf/core/index.py	`92.35% <0.00%> (+0.22%)`	⬆️
python/cudf/cudf/core/dataframe.py	`93.78% <0.00%> (+0.25%)`	⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 204218a...127a7b9. Read the comment docs.

Fix constructing Column from column_view with expired mask

e706306

shwina requested a review from a team as a code owner July 26, 2022 16:57

shwina requested review from mroeschke and skirui-source July 26, 2022 16:57

shwina added bug Something isn't working Python Affects Python cuDF API. labels Jul 26, 2022

vyasr added the non-breaking Non-breaking change label Jul 26, 2022

brandon-b-miller reviewed Jul 26, 2022

View reviewed changes

python/cudf/cudf/_lib/column.pyx Show resolved Hide resolved

brandon-b-miller approved these changes Jul 26, 2022

View reviewed changes

vyasr approved these changes Jul 26, 2022

View reviewed changes

mroeschke approved these changes Jul 26, 2022

View reviewed changes

wence- reviewed Jul 26, 2022

View reviewed changes

shwina added 2 commits July 26, 2022 18:45

owner -> mask_owner

fb46b45

Revert change to CategoricalColumn

95961dc

wence- mentioned this pull request Jul 27, 2022

column: calculate null_count before release()ing the cudf::column #11365

Merged

3 tasks

wence- approved these changes Jul 27, 2022

View reviewed changes

First test that we can iloc at all

127a7b9

wence- mentioned this pull request Jul 28, 2022

Constructing Column objects from cudf::column should always take ownership of the null mask #11384

Open

rapids-bot bot merged commit c11b780 into rapidsai:branch-22.08 Jul 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix constructing Column from column_view with expired mask #11354

Fix constructing Column from column_view with expired mask #11354

shwina commented Jul 26, 2022

vyasr Jul 26, 2022

wence- Jul 26, 2022

shwina Jul 26, 2022

wence- Jul 26, 2022

vyasr Jul 26, 2022

shwina Jul 27, 2022

vyasr Jul 27, 2022

wence- left a comment

wence- Jul 26, 2022

vyasr Jul 26, 2022 •

edited

Loading

wence- Jul 27, 2022 •

edited

Loading

wence- Jul 27, 2022

shwina Jul 27, 2022

vyasr Jul 27, 2022

wence- Jul 28, 2022

wence- Jul 26, 2022

vyasr commented Jul 27, 2022

vyasr commented Jul 27, 2022

shwina commented Jul 28, 2022

codecov bot commented Jul 28, 2022 •

edited

Loading

Fix constructing Column from column_view with expired mask #11354

Fix constructing Column from column_view with expired mask #11354

Conversation

shwina commented Jul 26, 2022

Description

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wence- left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyasr Jul 26, 2022 • edited Loading

Choose a reason for hiding this comment

wence- Jul 27, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vyasr commented Jul 27, 2022

vyasr commented Jul 27, 2022

shwina commented Jul 28, 2022

codecov bot commented Jul 28, 2022 • edited Loading

Codecov Report

vyasr Jul 26, 2022 •

edited

Loading

wence- Jul 27, 2022 •

edited

Loading

codecov bot commented Jul 28, 2022 •

edited

Loading