
[FEA] refactor separable convolution for improved performance #321

Closed
grlee77 opened this issue Jun 28, 2022 · 3 comments · Fixed by #355
grlee77 commented Jun 28, 2022

Benchmarking vs. VPI indicates that a several-fold improvement may be possible. I will try to work on an implementation making better use of shared memory. Ideally this would be an n-d version of the row and column filters implemented here: https://github.com/opencv/opencv_contrib/tree/4.x/modules/cudafilters/src/cuda

grlee77 added the performance Performance improvement label on Jun 28, 2022
grlee77 added this to the v22.08.00 milestone on Jun 28, 2022
@jakirkham

There may be some overlapping interest on the Dask side. In particular there is some planned work around shared memory ( https://github.com/orgs/dask/projects/3/views/1 ). Happy to discuss further if it's of interest

@grlee77
grlee77 commented Jul 13, 2022

By shared, I meant small memory allocations declared as __shared__ within a CUDA kernel. The use case is still for running on a single GPU, not distributing across multiple GPUs and/or nodes.

I think that Dask issue is more about the latter?

@jakirkham

Ah ok. Yeah that's right

rapids-bot closed this as completed in #355 on Aug 3, 2022
rapids-bot bot pushed a commit that referenced this issue Aug 3, 2022
closes #321 

This PR adds a different implementation of separable convolution that was adapted from `opencv_contrib`. It is not yet ready for review.

Still needs:
- [x] support other boundary modes
- [x] tests
- [x] ~extension to nD~  extension to 3D
- [x] wrappers for existing filters to call this one instead of CuPy's

The key changes in this PR are the new files:
`skimage/filters/_separable_filtering.py`
`skimage/filters/tests/test_separable_filtering.py`

It is based on the approach taken in OpenCV-contrib's [row](https://github.com/opencv/opencv_contrib/blob/4.6.0/modules/cudafilters/src/cuda/row_filter.hpp) and [column](https://github.com/opencv/opencv_contrib/blob/4.6.0/modules/cudafilters/src/cuda/column_filter.hpp) filters, but also supports:
- 3D images
- additional dtypes (e.g. complex64)
- all boundary modes from SciPy
- kernel sizes larger than 32
- output casting behavior that matches SciPy conventions rather than OpenCV's
 
A simpler version of the same approach has long been in the CUDA samples [convolutionSeparable.cu example](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/2_Concepts_and_Techniques/convolutionSeparable/convolutionSeparable.cu). The basic idea is:
1. The first stage loads the current patch of the image, plus its boundary region, into shared memory.
2. After synchronization, the convolution is computed from the shared-memory array.
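To illustrate why the two-pass structure pays off, here is a minimal host-side NumPy sketch of separable convolution (the function name and looping style are illustrative only; the actual implementation generates CUDA kernels and uses shared-memory tiles rather than Python loops):

```python
import numpy as np

def convolve_separable(image, kernel_1d, mode="reflect"):
    # Convolve `image` with the 2D kernel np.outer(kernel_1d, kernel_1d)
    # as two 1D passes: O(2k) instead of O(k^2) multiplies per pixel.
    k = len(kernel_1d)
    pad = k // 2
    flipped = kernel_1d[::-1]  # true convolution flips the kernel

    # Row pass: convolve each row with the 1D kernel.
    padded = np.pad(image.astype(np.float64), ((0, 0), (pad, pad)), mode=mode)
    rows = np.empty(image.shape, dtype=np.float64)
    for i in range(image.shape[1]):
        rows[:, i] = padded[:, i:i + k] @ flipped

    # Column pass: convolve each column of the row-pass result.
    padded = np.pad(rows, ((pad, pad), (0, 0)), mode=mode)
    out = np.empty_like(rows)
    for j in range(image.shape[0]):
        out[j, :] = flipped @ padded[j:j + k, :]
    return out
```

Filtering an impulse image recovers the outer product of the 1D kernel with itself, which is exactly the separability property the kernel relies on.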

A lot of CuPy's ndimage code is vendored here for the following reasons:
- We need any `cupyx.scipy.ndimage` functions we call, like `gaussian_filter`, `uniform_filter`, etc., to dispatch to the new convolution implementation when possible.
- The `_get_weights_dtype` utility was changed to promote 8- and 16-bit integers to float32 instead of float64 during convolutions.
- The `_get_output` utility was changed to allocate output arrays with `empty` rather than `zeros`, which is more efficient.
- The `_run_1d_filters` utility was improved so that it avoids an extra copy when the number of filters is even.

I will submit PRs for these changes to CuPy. The `_get_output` change in particular also affects other morphology and interpolation functions used in cuCIM, so I have vendored those here as well. For reference, these non-filtering changes give a modest performance improvement: grayscale `erosion` on a 4K image became 10% faster, and `resize` of a 4K image to HD became 20% faster.
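The even/odd detail in the `_run_1d_filters` change comes down to choosing the first destination buffer by the parity of the filter count, so the final pass always writes directly into the output. A hypothetical NumPy sketch of that ping-pong pattern (the real CuPy helper differs in signature and details):

```python
import numpy as np

def run_1d_filters(filters, image, output):
    # Chain 1D filter passes through two buffers, alternating the
    # destination each pass.  Picking the first destination by the
    # parity of len(filters) guarantees the last pass writes straight
    # into `output`, so no trailing copy is needed.
    temp = np.empty_like(output)
    # Odd count:  output, temp, output, ... ends in `output`.
    # Even count: temp, output, temp, ...  also ends in `output`.
    dst, other = (output, temp) if len(filters) % 2 else (temp, output)
    src = image
    for f in filters:
        f(src, dst)          # each filter reads `src`, writes `dst`
        src = dst
        dst, other = other, dst
    return output
```

Because the destinations strictly alternate, `src` and `dst` are always distinct buffers, so each pass can read its input while writing its output.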

It is not easy to review kernels based on generated code strings, so I have tried to add pretty comprehensive tests covering many kernel sizes, all boundary modes, various dtype combinations, etc.

Authors:
  - Gregory Lee (https://github.com/grlee77)

Approvers:
  - Gigon Bae (https://github.com/gigony)
  - https://github.com/jakirkham

URL: #355