Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix constructing Column from column_view with expired mask #11354

Merged
merged 4 commits into from
Jul 28, 2022

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Jul 26, 2022

Description

This PR fixes a subtle bug leading to a segfault when constructing a Column from a column_view.

Closes #11349

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested a review from a team as a code owner July 26, 2022 16:57
@shwina shwina added bug Something isn't working Python Affects Python cuDF API. labels Jul 26, 2022
@vyasr vyasr added the non-breaking Non-breaking change label Jul 26, 2022
pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.4]})
df = cudf.from_pandas(pdf)

assert_eq(pdf.iloc[0], df.iloc[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost inclined to just make this

Suggested change
assert_eq(pdf.iloc[0], df.iloc[0])
df.iloc[0]

The assertion obscures what we're really testing, which is that we have a valid, indexable object at the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check that the result is correct too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it doesn't hurt to do both?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, here's a slightly more minimal exhibitor of the issue:

from cudf import _lib as libcudf

column = cudf.Series([1.0, None])[:1]._column
transpose_column = libcudf.transpose.transpose(column) => segfault

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with checking both, but I want it to be obvious why this is different from every other iloc test without people having to open their browser to the GH issue. Otherwise this test just looks redundant (assuming that people don't read the issue, which probably is true about half the time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix'd.

Honestly, I think iloc just happened to be where this bug manifested and this isn't the most appropriate place to test this fix. At the same time, I can't think of any test of our public API that would reliably test against failures like this.

One more reason to have a stable, tested, public, set of wrappers around libcudf :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I started that conversation it seems like we come up with new reasons every day :) All the more reason to get started around that sooner!

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some queries for my own edification mostly

Comment on lines +539 to +541
# 1) `cv` is constructed as a view into a
# `cudf::column` that is nullable (i.e., it has
# a null mask), but contains no nulls.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so cv.has_nulls() is True, but cv.null_count() == 0?

Copy link
Contributor

@vyasr vyasr Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. cv.has_nulls is true because it's a libcudf object tied to a nullable cudf::column. cuDF Python actually never holds on to cudf::column objects, though. It just extracts the underlying rmm buffer and stores it into a Column object (this is to support other owners of the data like cupy arrays, see https://docs.rapids.ai/api/cudf/stable/developer_guide/library_design.html#buffer). Once that Column object owns the data, the original cudf::column goes out of scope, and in possibly its null mask too if the null count was zero. The Column is not guaranteed to hold on to the null mask of the cudf::column if the null count was zero. I assume that decision was originally made for performance reasons, but @shwina could correct me if there are more important reasons as well not to keep that null mask around.

Copy link
Contributor

@wence- wence- Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Actually, my analysis was a little off cv.has_nulls() is False (because although the column is nullable, the null_count is 0, has_nulls() is just {return null_count() > 0}).

This is a pretty subtle footgun, such that I worry that while this fix handles the problem, it's entirely possible that a similar problem is lurking elsewhere.

In this case, the calling sequence that led to the problems is via libcudf.transpose:

def transpose(list source_columns):
    """Transpose m n-row columns into n m-row columns
    """
    cdef pair[unique_ptr[column], table_view] c_result
    cdef table_view c_input = table_view_from_columns(source_columns)

    with nogil:
        c_result = move(cpp_transpose(c_input))

    result_owner = Column.from_unique_ptr(move(c_result.first))
    return columns_from_table_view(
        c_result.second,
        owners=[result_owner] * c_result.second.num_columns()
    )

This code looks benign, but the problem is that any cudf::column_view into a cudf::column is in the general case invalid after Column.from_unique_ptr. In this case, cpp_transpose returns a pair of a unique_ptr<cudf::column> and a cudf::table_view (which is a wrapper around column_views) into the returned cudf::column.

So in this case c_result.first is a cudf::column which is nullable (so _null_mask != NULL), but has_nulls is False.

So now we move the result out and construct a Column from from_unique_ptr which does:

        has_nulls = c_col.get()[0].has_nulls()

        # After call to release(), c_col is unusable
        cdef column_contents contents = move(c_col.get()[0].release())

        data = DeviceBuffer.c_from_unique_ptr(move(contents.data))
        data = Buffer(data)

        if has_nulls:
            mask = DeviceBuffer.c_from_unique_ptr(move(contents.null_mask))
            mask = Buffer(mask)
            null_count = c_col.get()[0].null_count()
        else:
            mask = None
            null_count = 0

So the null_mask is dropped, and I guess, RAII means it is collected at this point. But, uh-oh, the column_views we have lying around still point at the null_mask (so we have use-after-free potential).

I think that getting this fix in should not wait on this discussion, but shall we move it to an issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that getting this fix in should not wait on this discussion, but shall we move it to an issue?

I had a look through the callers from XXX_from_table_view and this looks like the only bad place (and the fix in this PR handles any other potential issues I think).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you @wence- - there's no reason the same pattern couldn't be lurking elsewhere.

We should consider not dropping the nullmask in from_unique_ptr. Memory usage is one reason we chose to discard "all valid" null masks, but there could be other breakages: for example, I believe CuPy will complain when consuming a CUDA array interface that contains a mask.

I'd rather we take that on early in the next release than late in this one. Does that sound alright?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to propose not dropping the nullmask as well, but decided to wait until next release since that's definitely a bigger change that requires some thought. I definitely think that's the right way to go, though. We should try to represent the original column as faithfully as possible to avoid creating these kinds of holes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #11384.

pdf = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0.1, 0.2, None, 0.4]})
df = cudf.from_pandas(pdf)

assert_eq(pdf.iloc[0], df.iloc[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, here's a slightly more minimal exhibitor of the issue:

from cudf import _lib as libcudf

column = cudf.Series([1.0, None])[:1]._column
transpose_column = libcudf.transpose.transpose(column) => segfault

python/cudf/cudf/_lib/column.pyx Show resolved Hide resolved
@vyasr
Copy link
Contributor

vyasr commented Jul 27, 2022

@shwina there are a couple of open discussions, but nothing crucial. If you don't think they're worth addressing (especially my question around tests) feel free to merge when ready. I'm not overly concerned with urgency now that we have the fix ready to go, though.

@vyasr
Copy link
Contributor

vyasr commented Jul 27, 2022

@gpucibot merge

@shwina
Copy link
Contributor Author

shwina commented Jul 28, 2022

rerun tests

@codecov
Copy link

codecov bot commented Jul 28, 2022

Codecov Report

Merging #11354 (127a7b9) into branch-22.08 (204218a) will increase coverage by 0.05%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.08   #11354      +/-   ##
================================================
+ Coverage         86.37%   86.43%   +0.05%     
================================================
  Files               143      144       +1     
  Lines             22753    22808      +55     
================================================
+ Hits              19653    19713      +60     
+ Misses             3100     3095       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 88.60% <0.00%> (-1.18%) ⬇️
python/cudf/cudf/core/indexed_frame.py 92.07% <0.00%> (-0.41%) ⬇️
python/dask_cudf/dask_cudf/tests/test_groupby.py 99.68% <0.00%> (-0.32%) ⬇️
python/cudf/cudf/core/column/numerical.py 95.90% <0.00%> (-0.30%) ⬇️
python/cudf/cudf/options.py 86.66% <0.00%> (ø)
python/dask_cudf/dask_cudf/groupby.py 97.44% <0.00%> (+0.01%) ⬆️
python/cudf/cudf/core/column/string.py 88.80% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/__init__.py 90.90% <0.00%> (+0.21%) ⬆️
python/cudf/cudf/core/index.py 92.35% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.25%) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 204218a...127a7b9. Read the comment docs.

@rapids-bot rapids-bot bot merged commit c11b780 into rapidsai:branch-22.08 Jul 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] SIGSEGV with sample iloc
5 participants