Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate string find operations to pylibcudf #15604

Merged

Conversation

brandon-b-miller
Copy link
Contributor

This PR implements libcudf's string find.hpp and migrates existing cuDF cython to leverage it.

@brandon-b-miller brandon-b-miller added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change labels Apr 26, 2024
@brandon-b-miller brandon-b-miller self-assigned this Apr 26, 2024
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner April 26, 2024 21:21
@github-actions github-actions bot added the CMake CMake build issue label Apr 26, 2024
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just some feedback on the tests. In general I think you can find a lot more pyarrow equivalents than you think! Also, be on the lookout for potential fixtures!

python/cudf/cudf/_lib/pylibcudf/strings/__init__.pxd Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/strings/find.pxd Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/strings/find.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/strings/find.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/strings/find.pyx Outdated Show resolved Hide resolved
target_col = plc.interop.from_arrow(
pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
)
expected = pa.array([0, 0, 0, 0, 0, 0, None, 0, None, 0], type=pa.int32())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 47 to 49
expected = pa.Array.from_pandas(
string_col.to_pandas().str.rfind(target), type=pa.int32()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to avoid using pandas. I'd rather hardcode an expected output than use pandas for this testing, I think? Another option could be doing something clever like reversing the string (pyarrow does support that) and then doing a find_substring. I'm not sure, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped to python stringops in 6cd8747

Comment on lines 59 to 61
expected = pa.Array.from_pandas(
string_col.to_pandas().str.contains(target)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to check count_substring(...) > 1 roughly.

Comment on lines 70 to 72
expected = pa.array(
[False, True, True, True, True, True, None, True, None, True]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 92 to 94
expected = pa.array(
[True, True, True, True, True, True, None, True, None, True]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call starts_with once per row? It's obviously inefficient, but would work and save you from hardcoding the outputs in case we change the input. Same goes for test_ends_with_column.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there; I have a couple more suggestions for improving the test suite. Some of them are also about building in the expectations for how tests should look so that we stick to the right patterns going forward.

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated Show resolved Hide resolved
Comment on lines 20 to 45
@pytest.fixture
def find_target_column():
return plc.interop.from_arrow(
pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
)


@pytest.fixture
def contains_target_column():
return plc.interop.from_arrow(
pa.array(["a", "d", "F", "j", "m", "q", None, "R", None, "w"])
)


@pytest.fixture
def starts_with_target_column():
return plc.interop.from_arrow(
pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
)


@pytest.fixture
def ends_with_target_column():
return plc.interop.from_arrow(
pa.array(["C", "e", "I", "j", "m", "q", None, "T", None, "w"])
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these actually need to be different? If it's easy to rewrite the inputs to the tests to use the same fixture, we might as well. It makes patterns easier to spot when there's only a single piece of data used for all the testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concatenated these into a mega fixture

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated Show resolved Hide resolved
@pytest.mark.parametrize("target", ["a", ""])
def test_rfind(plc_col, target):
got = plc.strings.find.rfind(
plc_col, DeviceScalar(target, dtype=np.dtype("object")).c_value, 0, -1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be create cudf._lib.DeviceScalar objects here. Can we just make a pylibcudf Scalar directly? For now, it's OK if that requires using the interop module. We'll eventually need to add proper constructors but for now removing usage of cudf from pylibcudf tests is a higher priority.

Also, while you're at it... the scalar could be another fixture :) You can parametrize the fixture directly like this. This would again require ensuring that the same fixture can be used for all of the tests, but the way that you've written the tests I'm fairly certain that they should pass generically for arbitrary input data. This way you can also have a single place where you add new edge cases you want to test (like the empty string above).

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated Show resolved Hide resolved
python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated Show resolved Hide resolved
brandon-b-miller and others added 2 commits May 6, 2024 13:22
Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>
@brandon-b-miller
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 5f02cb8 into rapidsai:branch-24.06 May 6, 2024
70 checks passed
@vyasr vyasr added the pylibcudf Issues specific to the pylibcudf package label May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Archived in project
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants