-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEA] refactor separable convolution for improved performance #321
Comments
There may be some overlapping interest on the Dask side. In particular there is some planned work around shared memory ( https://github.com/orgs/dask/projects/3/views/1 ). Happy to discuss further if it's of interest |
By shared, I meant small memory allocations declared as I think that Dask issue is more about the latter? |
Ah ok. Yeah that's right |
closes #321 This PR adds a different implementation of separable convolution that was adapted from `opencv_contrib`. It is not yet ready for review. Still needs: - [x] support other boundary modes - [x] tests - [x] ~extension to nD~ extension to 3D - [x] wrappers for existing filters to call this one instead of CuPy's The key changes in this PR are the new files: `skimage/filters/_separable_filtering.py` `skimage/filters/tests/test_separable_filtering.py` It is based on the approach taken in OpenCV-contrib's [row](https://github.com/opencv/opencv_contrib/blob/4.6.0/modules/cudafilters/src/cuda/row_filter.hpp) and [column]([https://github.com/opencv/opencv_contrib/blob/4.6.0/modules/cudafilters/src/cuda/column_filter.hpp] ) filters, but also supports: - 3D - additional dtypes (e.g. complex64) - all boundary modes from SciPy - not restricted to kernel size <= 32 - casting behavior to the output matches SciPy rather than OpenCV conventions. A simpler version of the same approach has long been in the CUDA samples [convolutionSeparable.cu example](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/2_Concepts_and_Techniques/convolutionSeparable/convolutionSeparable.cu). The basic idea is: 1.) First stage loads the current patch of the image and its boundaries into shared memory 2.) After synchronization, convolution is performed on the shared memory array. A lot of CuPy's ndimage code is vendored here for the following reasons: - We need any `cupyx.scipy.ndimage` functions we call, like `gaussian_filter`, `uniform_filter`, etc. to dispatch to the new convolution implementation when possible. - the `_get_weights_dtype` utility was changed to promote 8 and 16-bit integers to float32 instead of float64 during convolutions. - the `_get_output` utility was changed to allocate the output arrays as `empty` rather than `zeros` which is more efficient. - the `_run_1d_filters` utility was improved so that it avoids an extra copy when the number of filters is even I will submit PRs for these to CuPy. The `_get_output` change in particular also impacts other morphology and interpolation functions that we use in cuCIM, so I have vendored those here as well. For reference regarding these non-filtering changes, there is modest performance improvement. Grayscale `erosion` on a 4k image became 10% faster and `resize` of a 4k image to HD became 20% faster. It is not easy to review kernels based on generated code strings, so I have tried to add pretty comprehensive tests covering many kernel sizes, all boundary modes, various dtype combinations, etc. Authors: - Gregory Lee (https://github.com/grlee77) Approvers: - Gigon Bae (https://github.com/gigony) - https://github.com/jakirkham URL: #355
Benchmarking vs. VPI indicates that a several-fold improvement may be possible. I will try to work on an implementation making better use of shared memory. Ideally this would be an n-d version of the row and column filters implemented here: https://github.com/opencv/opencv_contrib/tree/4.x/modules/cudafilters/src/cuda
The text was updated successfully, but these errors were encountered: