
[FEA] refactor separable convolution for improved performance #321

Closed
grlee77 opened this issue Jun 28, 2022 · 3 comments · Fixed by #355
grlee77 commented Jun 28, 2022

Benchmarking vs. VPI indicates that a several-fold improvement may be possible. I will try to work on an implementation making better use of shared memory. Ideally this would be an n-d version of the row and column filters implemented here: https://github.com/opencv/opencv_contrib/tree/4.x/modules/cudafilters/src/cuda

grlee77 added the performance Performance improvement label on Jun 28, 2022
grlee77 added this to the v22.08.00 milestone on Jun 28, 2022
@jakirkham

There may be some overlapping interest on the Dask side. In particular there is some planned work around shared memory ( https://github.com/orgs/dask/projects/3/views/1 ). Happy to discuss further if it's of interest

@grlee77
grlee77 commented Jul 13, 2022

By shared, I meant small memory allocations declared as __shared__ within a CUDA kernel. The use case is still for running on a single GPU, not distributing across multiple GPUs and/or nodes.

I think that Dask issue is more about the latter?

@jakirkham

Ah ok. Yeah that's right

rapids-bot closed this as completed in #355 on Aug 3, 2022
rapids-bot bot pushed a commit that referenced this issue Aug 3, 2022
closes #321 

This PR adds a different implementation of separable convolution that was adapted from `opencv_contrib`. It is not yet ready for review.

Still needs:
- [x] support other boundary modes
- [x] tests
- [x] ~extension to nD~  extension to 3D
- [x] wrappers for existing filters to call this one instead of CuPy's

The key changes in this PR are the new files:
`skimage/filters/_separable_filtering.py`
`skimage/filters/tests/test_separable_filtering.py`

It is based on the approach taken in OpenCV-contrib's [row](https://github.com/opencv/opencv_contrib/blob/4.6.0/modules/cudafilters/src/cuda/row_filter.hpp) and [column](https://github.com/opencv/opencv_contrib/blob/4.6.0/modules/cudafilters/src/cuda/column_filter.hpp) filters, but also supports:
- 3D images
- additional dtypes (e.g. complex64)
- all boundary modes from SciPy
- kernel sizes larger than 32
- output casting behavior that matches SciPy conventions rather than OpenCV's
 
A simpler version of the same approach has long been in the CUDA samples [convolutionSeparable.cu example](https://github.com/NVIDIA/cuda-samples/blob/master/Samples/2_Concepts_and_Techniques/convolutionSeparable/convolutionSeparable.cu). The basic idea is:
1. The first stage loads the current patch of the image, plus its boundary region, into shared memory.
2. After synchronization, the convolution is computed from the shared-memory array.
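To illustrate why the two-pass structure pays off, here is a minimal host-side NumPy sketch of separable convolution (the function name and looping style are illustrative only; the actual implementation generates CUDA kernels and uses shared-memory tiles rather than Python loops):

```python
import numpy as np

def convolve_separable(image, kernel_1d, mode="reflect"):
    # Convolve `image` with the 2D kernel np.outer(kernel_1d, kernel_1d)
    # as two 1D passes: O(2k) instead of O(k^2) multiplies per pixel.
    k = len(kernel_1d)
    pad = k // 2
    flipped = kernel_1d[::-1]  # true convolution flips the kernel

    # Row pass: convolve each row with the 1D kernel.
    padded = np.pad(image.astype(np.float64), ((0, 0), (pad, pad)), mode=mode)
    rows = np.empty(image.shape, dtype=np.float64)
    for i in range(image.shape[1]):
        rows[:, i] = padded[:, i:i + k] @ flipped

    # Column pass: convolve each column of the row-pass result.
    padded = np.pad(rows, ((pad, pad), (0, 0)), mode=mode)
    out = np.empty_like(rows)
    for j in range(image.shape[0]):
        out[j, :] = flipped @ padded[j:j + k, :]
    return out
```

Filtering an impulse image recovers the outer product of the 1D kernel with itself, which is exactly the separability property the kernel relies on.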

A lot of CuPy's ndimage code is vendored here for the following reasons:
- We need any `cupyx.scipy.ndimage` functions we call, like `gaussian_filter`, `uniform_filter`, etc., to dispatch to the new convolution implementation when possible.
- The `_get_weights_dtype` utility was changed to promote 8- and 16-bit integers to float32 instead of float64 during convolutions.
- The `_get_output` utility was changed to allocate output arrays with `empty` rather than `zeros`, which is more efficient.
- The `_run_1d_filters` utility was improved so that it avoids an extra copy when the number of filters is even.

I will submit PRs for these changes to CuPy. The `_get_output` change in particular also affects other morphology and interpolation functions used in cuCIM, so I have vendored those here as well. For reference, these non-filtering changes give a modest performance improvement: grayscale `erosion` on a 4K image became 10% faster, and `resize` of a 4K image to HD became 20% faster.
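The even/odd detail in the `_run_1d_filters` change comes down to choosing the first destination buffer by the parity of the filter count, so the final pass always writes directly into the output. A hypothetical NumPy sketch of that ping-pong pattern (the real CuPy helper differs in signature and details):

```python
import numpy as np

def run_1d_filters(filters, image, output):
    # Chain 1D filter passes through two buffers, alternating the
    # destination each pass.  Picking the first destination by the
    # parity of len(filters) guarantees the last pass writes straight
    # into `output`, so no trailing copy is needed.
    temp = np.empty_like(output)
    # Odd count:  output, temp, output, ... ends in `output`.
    # Even count: temp, output, temp, ...  also ends in `output`.
    dst, other = (output, temp) if len(filters) % 2 else (temp, output)
    src = image
    for f in filters:
        f(src, dst)          # each filter reads `src`, writes `dst`
        src = dst
        dst, other = other, dst
    return output
```

Because the destinations strictly alternate, `src` and `dst` are always distinct buffers, so each pass can read its input while writing its output.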

It is not easy to review kernels based on generated code strings, so I have tried to add pretty comprehensive tests covering many kernel sizes, all boundary modes, various dtype combinations, etc.

Authors:
  - Gregory Lee (https://github.com/grlee77)

Approvers:
  - Gigon Bae (https://github.com/gigony)
  - https://github.com/jakirkham

URL: #355