Migrate string `find` operations to `pylibcudf` #15604

Merged

rapids-bot merged 23 commits into rapidsai:branch-24.06 from brandon-b-miller:pylibcudf-strings-find

May 6, 2024

Contributor

brandon-b-miller commented Apr 26, 2024

This PR implements pylibcudf bindings for libcudf's strings find.hpp and migrates the existing cuDF Cython code to use them.

brandon-b-miller added 3 commits

April 25, 2024 15:30


          initial

d6afb86


          implement find and tests

67153b7


          pass cudf tests

3687e4f

brandon-b-miller added feature request Python non-breaking labels

brandon-b-miller self-assigned this

brandon-b-miller requested a review from a team as a code owner

April 26, 2024 21:21

brandon-b-miller requested review from vyasr and mroeschke

April 26, 2024 21:21

github-actions bot added the CMake label

brandon-b-miller added 4 commits

April 26, 2024 16:24


          formatting

478fc01


          docs

36b4f4b


          fix up docs

c59a97f


          minor fix

14690f1

brandon-b-miller mentioned this pull request

[FEA] Implement all libcudf modules required by cuDF Python in pylibcudf #15162

Open

vyasr requested changes

Contributor

vyasr left a comment

Looks great, just some feedback on the tests. In general I think you can find a lot more pyarrow equivalents than you think! Also, be on the lookout for potential fixtures!

python/cudf/cudf/_lib/pylibcudf/strings/__init__.pxd (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/strings/find.pxd (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/strings/find.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/strings/find.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/strings/find.pyx (outdated, resolved)

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

+                  target_col = plc.interop.from_arrow(
+                      pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
+                  )
+                  expected = pa.array([0, 0, 0, 0, 0, 0, None, 0, None, 0], type=pa.int32())

Contributor

vyasr Apr 30, 2024

https://arrow.apache.org/docs/python/generated/pyarrow.compute.find_substring.html#pyarrow.compute.find_substring

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

Comment on lines 47 to 49

+                  expected = pa.Array.from_pandas(
+                      string_col.to_pandas().str.rfind(target), type=pa.int32()
+                  )

Contributor

vyasr Apr 30, 2024

I'd love to avoid using pandas. I'd rather hardcode an expected output than use pandas for this testing, I think? Another option could be doing something clever like reversing the string (pyarrow does support that) and then doing a find_substring. I'm not sure, WDYT?

Contributor Author

brandon-b-miller Apr 30, 2024

Dropped down to Python string operations in 6cd8747
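A minimal sketch of what a pandas-free expected value for a reverse find might look like, using only Python's built-in `str.rfind`. The helper name `expected_rfind` and the sample data are hypothetical, not the code from the commit above.

```python
def expected_rfind(values, target):
    """Reference result for a reverse find using Python's str.rfind.

    Null rows (None) stay null; rows without a match yield -1.
    """
    return [None if v is None else v.rfind(target) for v in values]


# Illustrative data (an assumption, not the PR's fixture).
data = ["abcab", "xyz", None, ""]
last_a = expected_rfind(data, "a")
last_empty = expected_rfind(data, "")
```

The resulting list can then be wrapped in `pa.array(..., type=pa.int32())` for comparison against the pylibcudf output.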

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

Comment on lines 59 to 61

+                  expected = pa.Array.from_pandas(
+                      string_col.to_pandas().str.contains(target)
+                  )

Contributor

vyasr Apr 30, 2024

You should be able to check count_substring(...) > 0, roughly.

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

Comment on lines 70 to 72

+                  expected = pa.array(
+                      [False, True, True, True, True, True, None, True, None, True]
+                  )

Contributor

vyasr Apr 30, 2024

https://arrow.apache.org/docs/python/generated/pyarrow.compute.count_substring.html#pyarrow.compute.match_substring

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

Comment on lines 92 to 94

+                  expected = pa.array(
+                      [True, True, True, True, True, True, None, True, None, True]
+                  )

Contributor

vyasr Apr 30, 2024

Maybe just call starts_with once per row? It's obviously inefficient, but would work and save you from hardcoding the outputs in case we change the input. Same goes for test_ends_with_column.

brandon-b-miller and others added 6 commits

April 30, 2024 10:59


          Apply suggestions from code review

81d2b7e

Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>


          Merge branch 'branch-24.06' into pylibcudf-strings-find

5ccc77b


          update docs

97f7bef


          fixturize the plc col instead

b27944a


          excise pandas

6cd8747


          fixturize target columns

2d3d2d8

vyasr requested changes

Contributor

vyasr left a comment

Almost there; I have a couple more suggestions for improving the test suite. Some of them are also about building in the expectations for how tests should look so that we stick to the right patterns going forward.

python/cudf/cudf/pylibcudf_tests/test_string_find.py (outdated, resolved)

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

Comment on lines 20 to 45

+              @pytest.fixture
+              def find_target_column():
+                  return plc.interop.from_arrow(
+                      pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
+                  )
+              @pytest.fixture
+              def contains_target_column():
+                  return plc.interop.from_arrow(
+                      pa.array(["a", "d", "F", "j", "m", "q", None, "R", None, "w"])
+                  )
+              @pytest.fixture
+              def starts_with_target_column():
+                  return plc.interop.from_arrow(
+                      pa.array(["A", "d", "F", "j", "k", "n", None, "R", None, "u"])
+                  )
+              @pytest.fixture
+              def ends_with_target_column():
+                  return plc.interop.from_arrow(
+                      pa.array(["C", "e", "I", "j", "m", "q", None, "T", None, "w"])
+                  )

Contributor

vyasr May 3, 2024

Do these actually need to be different? If it's easy to rewrite the inputs to the tests to use the same fixture, we might as well. It makes patterns easier to spot when there's only a single piece of data used for all the testing.

Contributor Author

brandon-b-miller May 3, 2024

Concatenated these into a mega fixture

python/cudf/cudf/pylibcudf_tests/test_string_find.py (outdated, resolved)

python/cudf/cudf/pylibcudf_tests/test_string_find.py Outdated

+              @pytest.mark.parametrize("target", ["a", ""])
+              def test_rfind(plc_col, target):
+                  got = plc.strings.find.rfind(
+                      plc_col, DeviceScalar(target, dtype=np.dtype("object")).c_value, 0, -1

Contributor

vyasr May 3, 2024

We shouldn't be creating cudf._lib.DeviceScalar objects here. Can we just make a pylibcudf Scalar directly? For now, it's OK if that requires using the interop module. We'll eventually need to add proper constructors but for now removing usage of cudf from pylibcudf tests is a higher priority.

Also, while you're at it... the scalar could be another fixture :) You can parametrize the fixture directly like this. This would again require ensuring that the same fixture can be used for all of the tests, but the way that you've written the tests I'm fairly certain that they should pass generically for arbitrary input data. This way you can also have a single place where you add new edge cases you want to test (like the empty string above).

brandon-b-miller added 5 commits

May 3, 2024 10:48


          Merge branch 'branch-24.06' into pylibcudf-strings-find

1ed340e


          address reviews on one function

4a544b9


          update rfind

3e4b728


          refactor all tests

ec6d9ba


          fixture()

a19dea1

brandon-b-miller requested a review from vyasr

May 3, 2024 20:59

brandon-b-miller added 3 commits

May 6, 2024 10:02


          module scope fixtures

c0e08c5


          fixturize pyarrow scalar

63dc774


          colwise apply helper

086de7c

vyasr approved these changes

python/cudf/cudf/pylibcudf_tests/test_string_find.py (outdated, resolved)

python/cudf/cudf/pylibcudf_tests/test_string_find.py (outdated, resolved)

brandon-b-miller and others added 2 commits

May 6, 2024 13:22


          Apply suggestions from code review

39945b3

Co-authored-by: Vyas Ramasubramani <vyas.ramasubramani@gmail.com>


          Merge branch 'branch-24.06' into pylibcudf-strings-find

Contributor Author

brandon-b-miller commented May 6, 2024

/merge

rapids-bot bot merged commit 5f02cb8 into rapidsai:branch-24.06

70 checks passed

vyasr added the pylibcudf label

Labels

CMake feature request non-breaking pylibcudf Python