Add transpose API to pylibcudf #16749

Merged (15 commits) on Sep 25, 2024

Conversation

mroeschke
Contributor

Description

Contributes to #15162

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke added the improvement, non-breaking, and pylibcudf labels Sep 4, 2024
@github-actions bot added the Python and CMake labels and removed the pylibcudf label Sep 4, 2024
@mroeschke marked this pull request as ready for review September 5, 2024 20:19
@mroeschke requested a review from a team as a code owner September 5, 2024 20:19
Contributor Author

mroeschke commented Sep 5, 2024

It looks like this test expects the data pointer to be exposed after transpose:

______________________________ test_df_transpose _______________________________
[gw7] linux -- Python 3.10.14 /opt/conda/envs/test/bin/python3.10

manager = <SpillManager device_memory_limit=N/A | 0B spilled | 57B (28%) unspilled (unspillable)>

    def test_df_transpose(manager: SpillManager):
        df1 = cudf.DataFrame({"a": [1, 2]})
        df2 = df1.transpose()
        # For now, all buffers are marked as exposed
        assert df1._data._data["a"].data.owner.exposed
>       assert df2._data._data[0].data.owner.exposed
E       assert False
E        +  where False = <cudf.core.buffer.spillable_buffer.SpillableBufferOwner object at 0x7f6b538952d0>.exposed
E        +    where <cudf.core.buffer.spillable_buffer.SpillableBufferOwner object at 0x7f6b538952d0> = SpillableBuffer(owner=<cudf.core.buffer.spillable_buffer.SpillableBufferOwner object at 0x7f6b538952d0>, offset=0, size=8).owner
E        +      where SpillableBuffer(owner=<cudf.core.buffer.spillable_buffer.SpillableBufferOwner object at 0x7f6b538952d0>, offset=0, size=8) = <cudf.core.column.numerical.NumericalColumn object at 0x7f6b53886830>\n[\n  1\n]\ndtype: int64.data

tests/test_spilling.py:580: AssertionError

Would this require cudf._lib.column.Column.from_pylibcudf to implement the data_ptr_exposed keyword?

https://github.com/rapidsai/cudf/blob/branch-24.10/python/cudf/cudf/_lib/column.pyx#L602-L605
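
For readers following the failure log, a minimal pure-Python sketch of the shape change may help. It is not the pylibcudf or cuDF implementation, and `toy_transpose` is a hypothetical name introduced only for illustration. It shows why `df2` in the test has two columns of one row each, which is consistent with the single-element column and `size=8` (one int64) reported in the assertion output.

```python
def toy_transpose(columns):
    """Transpose a list of equal-length columns (toy model, not pylibcudf)."""
    if not columns:
        return []
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    # Each input row becomes one output column.
    return [list(row) for row in zip(*columns)]


# Mirrors the test input: one column "a" holding two rows.
source = [[1, 2]]
result = toy_transpose(source)
print(result)
```

The result holds two columns with one value each, so the failing assertion is inspecting the first of those one-row columns.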

python/pylibcudf/pylibcudf/transpose.pyx (outdated)
Comment on lines 24 to 15
-    # Notice, the data pointer of `result_owner` has been exposed
-    # through `c_result.second` at this point.
-    result_owner = Column.from_unique_ptr(
-        move(c_result.first), data_ptr_exposed=True
-    )
-    return columns_from_table_view(
-        c_result.second,
-        owners=[result_owner] * c_result.second.num_columns()
+    input_table = plc.table.Table(
+        [col.to_pylibcudf(mode="read") for col in source_columns]
+    )
+    _, result_table = plc.transpose.transpose(input_table)
+    return [Column.from_pylibcudf(col) for col in result_table.columns()]
Contributor

@madsbk: can you remind me what it means that the result_owner is exposed through the table (c_result.second)?

Is it that we have, now, two Buffers that point to the same data, and therefore if we were to spill one, we would need to spill the other?

I think this is right, and so yes, I think we do need (@mroeschke) to have a way of marking a column's data as exposed when we import it from pylibcudf.

Member

> Is it that we have, now, two Buffers that point to the same data, and therefore if we were to spill one, we would need to spill the other?

Yes
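
To make the exchange above concrete, here is a toy model of the idea, assuming that a buffer whose raw pointer has been handed out must not be spilled because another object may still be reading the same memory. `ToyOwner` and its methods are hypothetical names for illustration only; this is not cuDF's `SpillableBufferOwner`.

```python
class ToyOwner:
    """Hypothetical stand-in for a spillable buffer owner (not cuDF's class)."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self.exposed = False
        self.spilled = False

    def expose_ptr(self) -> int:
        # Handing out a raw pointer means an outside reference may exist,
        # so the owner is marked as exposed from this point on.
        self.exposed = True
        return id(self._data)

    def spill(self) -> bool:
        # Moving exposed data would leave the outside reference dangling,
        # so this toy model refuses to spill it.
        if self.exposed:
            return False
        self.spilled = True
        return True


private = ToyOwner(b"\x01" * 8)
shared = ToyOwner(b"\x02" * 8)
shared.expose_ptr()  # a second object now aliases this memory

results = (private.spill(), shared.spill())
print(results)
```

In this sketch only the unexposed owner is spilled. Forgetting to mark an aliased buffer as exposed would let the toy manager move memory that another object still points to.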

Contributor Author

> I think this is right, and so yes, I think we do need (@mroeschke) to have a way of marking a column's data as exposed when we import it from pylibcudf.

There is a data_ptr_exposed keyword in from_pylibcudf that currently isn't implemented. I think we need to pass that parameter through to the exposed keyword in as_buffer?

Member

Yes, sounds right
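
The pass-through being discussed can be sketched in isolation as follows. Everything here is hypothetical: `ToyBuffer`, `toy_as_buffer`, and `toy_from_external` only mimic the shape of the idea (a conversion function forwarding its `data_ptr_exposed` argument to an `exposed` keyword), and the default of `False` is an assumption, not a statement about cuDF's actual signatures.

```python
from dataclasses import dataclass


@dataclass
class ToyBuffer:
    """Hypothetical buffer record; not a cuDF type."""

    payload: bytes
    exposed: bool = False


def toy_as_buffer(payload: bytes, *, exposed: bool) -> ToyBuffer:
    # Stand-in for a buffer factory that accepts an `exposed` keyword.
    return ToyBuffer(payload, exposed=exposed)


def toy_from_external(payload: bytes, *, data_ptr_exposed: bool = False) -> ToyBuffer:
    # Forward the flag instead of silently dropping it, so the caller's
    # knowledge about aliasing reaches the buffer that is created.
    return toy_as_buffer(payload, exposed=data_ptr_exposed)


default_buf = toy_from_external(b"\x00" * 8)
exposed_buf = toy_from_external(b"\x00" * 8, data_ptr_exposed=True)
print(default_buf.exposed, exposed_buf.exposed)
```

If the keyword were accepted but ignored, both buffers in this sketch would report `exposed=False`, which is the kind of mismatch the failing `test_df_transpose` assertion surfaced.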

Contributor

It looks like this was addressed in #16760

Contributor

Yes, it looks like the necessary parameter was handled there so this PR should be safe to merge now.

@Matt711 added the pylibcudf label Sep 19, 2024
@galipremsagar
Contributor

/merge

@rapids-bot bot merged commit 503ce03 into rapidsai:branch-24.10 Sep 25, 2024
99 checks passed
@mroeschke deleted the pylibcudf/transpose branch September 25, 2024 22:13
Labels: CMake, improvement, non-breaking, pylibcudf, Python
Projects
Status: Done
6 participants