Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better slicing via union_offsets #776

Conversation

thomcom
Copy link
Contributor

@thomcom thomcom commented Nov 3, 2022

Description

Closes #771

This PR modifies the production of geometry_offset, part_offset, and ring_offset by sampling the existing values in a GeoSeries before returning their various offsets. This has the effect of using cudf.ListSeries to re-pack any features into a new, dense GeoColumn, then returning the offsets based on it.

Previously, GeoSeries that had been modified by slicing would have the appearance of the sliced elements, but when _offset buffers were used they would return the full original offset buffer that the sliced GeoSeries had originated from. This was a problem because it made slicing useless for our algorithms.

I also modify the core/spatial/distance.py and core/spatial/nearest_points.py files to use as_column(linestrings.lines.geometry_offset) instead of linestrings.lines.geometry_offset._column because there doesn't appear to be a reason to use a cudf.Series to wrap the offset buffers. They are private methods essentially, don't need indexes, and will eventually be factored out so that they're hidden from the user.

I wrote a new test file test_geocolumn_accessor.py to exercise the new {geometry_buffer...} accessors for all geometry types.

Finally I added tests for a base case, a more complicated case, and a case with noncontiugous slices to the inputs of test_point_linestring_distance.py, validating that the changes have exactly the effect we need.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the Python Related to Python code label Nov 3, 2022
@thomcom thomcom self-assigned this Nov 3, 2022
@thomcom thomcom added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 3, 2022
@thomcom thomcom added this to the DE-9IM milestone Nov 3, 2022
@thomcom thomcom marked this pull request as ready for review November 3, 2022 20:19
@thomcom thomcom requested a review from a team as a code owner November 3, 2022 20:19
@thomcom thomcom requested a review from isVoid November 3, 2022 20:19
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is the solution to get the right offsets and coordinate arrays. Do you think there's a chance for optimization if the slices are simple contiguous range that can be represented by offset and size, the resulting child column slices can also be zero-copy?

python/cuspatial/cuspatial/tests/conftest.py Show resolved Hide resolved
Comment on lines +217 to +220
@pytest.mark.parametrize("slice_index", [0, 1, 2, 3, 4])
def test_multipoint_multilinestring_sliced_many(
multipoint_generator, multilinestring_generator, slice_twenty, slice_index
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you make changes according to my suggestion of fixture above, you don't need slice_index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reflects a pattern that I'm trying to build for some other test cases as well. There are a few tests where I use premade slicing patterns, then iterate over them to get a fair coverage of ways a column can be sliced. In those tests I just rewrite the same slice group a few different times inside of the test file, because you can't parameterize a feature in pytest. I realized while writing this that I could put the slices in conftest.py as a fixture, then simply iterate over the fixture using a @pytest.mark.parametrize over a range, as here. The goal is not fundamentally to iterate the segments of 0,4...4,8... but to iterate over any predefined slicing range. Given our limited resources I didn't opt to refactor those tests yet, but I'd like to leave this, unless you know of a better way to iterate over a fixture in multiple tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is not fundamentally to iterate the segments of 0,4...4,8... but to iterate over any predefined slicing range.

If you need other slicing range, can't you just create other fixtures?

@harrism
Copy link
Member

harrism commented Nov 7, 2022

rerun tests

@thomcom
Copy link
Contributor Author

thomcom commented Nov 8, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 29c3d5d into rapidsai:branch-22.12 Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to Python code
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] GeoColumnAccessors should respect slicing
3 participants