
Faster hessian_matrix_* and structure_tensor_eigvals via analytical eigenvalues for the 3D case #434

Merged

Conversation

@grlee77 (Contributor) commented Nov 15, 2022

closes #354

This PR implements faster 2D and 3D pixelwise eigenvalue computations for hessian_matrix_eigvals, structure_tensor_eigvals and hessian_matrix_det. The 2D case already had a fairly fast code path, but it is further improved here by switching from a fused kernel to an elementwise kernel, which removes the need for a separate call to cupy.stack. In 3D, runtime is reduced by ~30x for float32 and by more than 100x for float64. The 3D case also uses much less GPU memory than before (a more than 20x reduction). For example, computing the eigenvalues of a (128, 128, 128) float32 array previously ran out of memory even on an A6000 (48 GB). With the changes here, it works even for 64x larger data of shape (512, 512, 512).
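For background, the "analytical eigenvalues" in the PR title refer to the closed-form (trigonometric) eigenvalue solution for symmetric 3x3 matrices. Below is a minimal NumPy sketch of that formula (an illustration of the math only, not the PR's actual CUDA kernel; it assumes the matrix is not already diagonal, so the scale factor p is nonzero):

```python
import numpy as np


def eigvals_symmetric_3x3(a00, a01, a02, a11, a12, a22):
    """Closed-form eigenvalues of the symmetric matrix
    [[a00, a01, a02], [a01, a11, a12], [a02, a12, a22]],
    returned in descending order (trigonometric method, Smith 1961).

    Assumes the matrix is not already diagonal (so p > 0 below).
    """
    p1 = a01 * a01 + a02 * a02 + a12 * a12
    q = (a00 + a11 + a22) / 3.0  # mean of the eigenvalues (trace / 3)
    p2 = (a00 - q) ** 2 + (a11 - q) ** 2 + (a22 - q) ** 2 + 2.0 * p1
    p = np.sqrt(p2 / 6.0)
    # B = (A - q * I) / p has eigenvalues in [-2, 2]; r = det(B) / 2
    b00, b11, b22 = (a00 - q) / p, (a11 - q) / p, (a22 - q) / p
    b01, b02, b12 = a01 / p, a02 / p, a12 / p
    r = (
        b00 * (b11 * b22 - b12 * b12)
        - b01 * (b01 * b22 - b12 * b02)
        + b02 * (b01 * b12 - b11 * b02)
    ) / 2.0
    # clamp guards against rounding pushing r just outside [-1, 1]
    phi = np.arccos(np.clip(r, -1.0, 1.0)) / 3.0
    lam1 = q + 2.0 * p * np.cos(phi)
    lam3 = q + 2.0 * p * np.cos(phi + 2.0 * np.pi / 3.0)
    lam2 = 3.0 * q - lam1 - lam3  # the eigenvalues sum to the trace
    return lam1, lam2, lam3
```

Because each voxel's Hessian needs only its six unique components, a formula like this maps naturally onto an elementwise GPU kernel without ever materializing a (..., 3, 3) matrix stack, which is where the memory savings come from.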

Functions that benefit from this are:

  • cucim.skimage.feature.hessian_matrix_det
  • cucim.skimage.feature.hessian_matrix_eigvals
  • cucim.skimage.feature.structure_tensor_eigenvalues
  • cucim.skimage.feature.shape_index
  • cucim.skimage.feature.multiscale_basic_features
  • cucim.skimage.filters.meijering
  • cucim.skimage.filters.sato
  • cucim.skimage.filters.frangi
  • cucim.skimage.filters.hessian

Independently of the above, the function cucim.skimage.measure.inertia_tensor_eigvals was updated with custom kernels so that it can operate purely on the GPU in the 2D and 3D cases (formerly these used copies to/from the host). These kernels operate on tiny arrays, so they use only a single GPU thread. Despite the lack of parallelism, this has lower overhead than a round-trip host/device transfer. This will also speed up region properties that make use of these eigenvalues (e.g. the axis_major_length and axis_minor_length properties for regionprops_table).
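For the 2D inertia tensor the closed form is even simpler. As a rough host-side illustration (a hypothetical NumPy analogue, not the PR's single-thread GPU kernel), the eigenvalues follow directly from the quadratic formula:

```python
import numpy as np


def inertia_tensor_eigvals_2d(t00, t01, t11):
    """Eigenvalues of the symmetric 2x2 matrix [[t00, t01], [t01, t11]],
    returned in descending order (closed form via the quadratic formula)."""
    mean = (t00 + t11) / 2.0
    # distance from the mean to each of the two eigenvalues
    delta = np.sqrt(((t00 - t11) / 2.0) ** 2 + t01 * t01)
    return mean + delta, mean - delta
```

With so little arithmetic per tensor, a single GPU thread evaluating this is cheaper than shipping the tiny array to the host, calling a LAPACK routine, and copying the result back.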

@grlee77 grlee77 added non-breaking Introduces a non-breaking change performance Performance improvement labels Nov 15, 2022
@grlee77 grlee77 added this to the v22.12.00 milestone Nov 15, 2022
@grlee77 grlee77 requested a review from a team as a code owner November 15, 2022 18:50
@grlee77 grlee77 added the improvement Improves an existing functionality label Nov 15, 2022
@gigony (Contributor) left a comment


Thank you!
I can't go over this PR in detail due to a lack of knowledge of this algorithm, but it looks good to me.

@grlee77 (Contributor, Author) commented Nov 16, 2022

Benchmark results for hessian_matrix_eigvals when run on a pre-computed Hessian matrix, H. I see a 1.5-2x speedup for the 2D cases and >100x for larger 3D images. I could only benchmark the old implementation up to (96, 96, 96), as larger sizes ran out of GPU memory.

| function | shape | acceleration vs. old implementation |
| --- | --- | --- |
| hessian_matrix_eigvals | (256, 256) | 2.2693 |
| hessian_matrix_eigvals | (512, 512) | 1.5650 |
| hessian_matrix_eigvals | (1024, 1024) | 1.5719 |
| hessian_matrix_eigvals | (2048, 2048) | 1.7348 |
| hessian_matrix_eigvals | (4096, 4096) | 1.7687 |
| hessian_matrix_eigvals | (16, 16, 16) | 23.3574 |
| hessian_matrix_eigvals | (32, 32, 32) | 83.6846 |
| hessian_matrix_eigvals | (64, 64, 64) | 178.3335 |
| hessian_matrix_eigvals | (96, 96, 96) | 203.8865 |

If we instead benchmark the full process of computing the eigenvalues from the original image rather than from its Hessian (i.e. benchmarking hessian_matrix_eigvals(hessian_matrix(img))), we still see a benefit, even though the hessian_matrix computation itself is not improved in this PR.

| function | shape | acceleration vs. old implementation |
| --- | --- | --- |
| hessian_matrix + hessian_matrix_eigvals | (256, 256) | 1.0459 |
| hessian_matrix + hessian_matrix_eigvals | (512, 512) | 1.0340 |
| hessian_matrix + hessian_matrix_eigvals | (1024, 1024) | 1.0446 |
| hessian_matrix + hessian_matrix_eigvals | (2048, 2048) | 1.0862 |
| hessian_matrix + hessian_matrix_eigvals | (4096, 4096) | 1.0880 |
| hessian_matrix + hessian_matrix_eigvals | (16, 16, 16) | 1.4988 |
| hessian_matrix + hessian_matrix_eigvals | (32, 32, 32) | 3.8251 |
| hessian_matrix + hessian_matrix_eigvals | (64, 64, 64) | 20.7775 |
| hessian_matrix + hessian_matrix_eigvals | (96, 96, 96) | 57.2562 |
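As a back-of-the-envelope sanity check of my own (not from the PR), Amdahl's law connects the per-step and end-to-end numbers: if the eigenvalue step takes a fraction f of the old pipeline time and is sped up by s, the overall speedup is 1 / ((1 - f) + f / s). Inverting that for the (4096, 4096) rows suggests the eigenvalue step was only a modest share of the old 2D pipeline:

```python
def eig_time_fraction(step_speedup, overall_speedup):
    """Fraction f of the old total runtime spent in the accelerated step,
    solved from Amdahl's law: overall = 1 / ((1 - f) + f / step)."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / step_speedup)


# speedups from the (4096, 4096) rows of the two tables above
f = eig_time_fraction(1.7687, 1.0880)
print(f"eigenvalue step was ~{100 * f:.0f}% of the old 2D pipeline time")
# → eigenvalue step was ~19% of the old 2D pipeline time
```

This is consistent with the 2D end-to-end gains being capped near 1.05-1.09x: the derivative filtering in hessian_matrix dominates there, while in 3D the eigenvalue step dominated, so the end-to-end speedups grow toward the per-step ones.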

@grlee77 force-pushed the symmetric-3x3-analytical-eigenvalues branch from 87f6dab to 26982a2 on November 16, 2022.
@grlee77 (Contributor, Author) commented Nov 16, 2022

I had to update the eigenvalue kernels to always use double precision internally to avoid NaN values in a few ridge filter test cases. The benchmark results reported above are nearly identical after this change, so we must be limited more by memory bandwidth than by compute here. Note that the eigenvalue outputs are NOT always double precision; double precision is only used internally for the computation.
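A toy NumPy illustration (my own, not the kernel code) of the kind of float32 failure mode that internal double precision avoids: rounding can push the arccos argument of the trigonometric eigenvalue formula just outside [-1, 1], yielding NaN, while doing the intermediate arithmetic in float64 (here combined with clamping) keeps the float32 output finite.

```python
import numpy as np

# float32 rounding can leave a value intended to be <= 1.0 slightly above it
r = np.float32(1.0) + np.float32(1e-7)  # rounds to just above 1.0
with np.errstate(invalid="ignore"):
    bad = np.arccos(r)  # domain error in float32 arithmetic -> nan

# performing the intermediate computation in float64 (with clamping) and
# casting the result back keeps the float32 output finite
good = np.float32(np.arccos(np.clip(np.float64(r), -1.0, 1.0)))

print(bad, good)  # → nan 0.0
```

The outputs can stay float32; only the intermediate arithmetic needs the extra precision, which matches the observation that the change cost essentially nothing in a bandwidth-bound kernel.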

@grlee77 (Contributor, Author) commented Nov 16, 2022

@gpucibot merge

Successfully merging this pull request may close these issues:

  • [FEA] Reduce memory overhead of nD hessian_matrix_eigenvalues