IDWT 5x3 single-pass lifting and SSE2/AVX2 implementation #957

Merged (6 commits) on Jun 26, 2017

rouault commented Jun 21, 2017

Implements #953

With the new bench_dwt utility, on x86_64:

  • before changes: 3.356 s
  • with SSE2 optimization (the default on x86_64): 0.992 s
  • with AVX2 optimization (requested at compilation time): 0.744 s

SSE2/AVX2 is used in the vertical pass to handle several columns at the same time. This avoids a lot of CPU cache thrashing.
Note: I tried an SSE2-optimized version of opj_idwt53_h_cas0(), but the gain is almost unnoticeable, so it is not included in this PR.
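
To illustrate why handling several columns at once helps, here is a minimal sketch of one vertical lifting step (the "update" of an even row) applied across adjacent columns. It is not the code from this PR: the function name and its parameters are hypothetical, and it assumes a row-major buffer where each row pointer addresses `width` contiguous 32-bit coefficients. A column-by-column walk would touch a different cache line at every step, whereas this form consumes each loaded cache line fully. The sketch handles 4 columns per SSE2 register; the PR processes 8.

```c
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper (not an OpenJPEG function): computes, for each column c,
 *   even_row[c] = s_row[c] - ((d_above[c] + d_below[c] + 2) >> 2)
 * which is the 5/3 inverse "update" step applied vertically.
 * Assumes >> on negative values is an arithmetic shift. */
void idwt53_v_update_row(const int32_t *s_row,
                         const int32_t *d_above,
                         const int32_t *d_below,
                         int32_t *even_row,
                         int width)
{
    int c = 0;
#ifdef __SSE2__
    const __m128i two = _mm_set1_epi32(2);
    /* 4 adjacent columns per iteration */
    for (; c + 4 <= width; c += 4) {
        __m128i s  = _mm_loadu_si128((const __m128i *)(s_row + c));
        __m128i da = _mm_loadu_si128((const __m128i *)(d_above + c));
        __m128i db = _mm_loadu_si128((const __m128i *)(d_below + c));
        __m128i t  = _mm_add_epi32(_mm_add_epi32(da, db), two);
        t = _mm_srai_epi32(t, 2);
        _mm_storeu_si128((__m128i *)(even_row + c), _mm_sub_epi32(s, t));
    }
#endif
    /* scalar tail (and full fallback when SSE2 is not available) */
    for (; c < width; c++) {
        even_row[c] = s_row[c] - ((d_above[c] + d_below[c] + 2) >> 2);
    }
}
```

The scalar tail loop doubles as a reference for what each vector lane computes.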

* Use single-pass lifting inverse wavelet transform.
* For the vertical pass, use SSE2 when available so as to process 8 columns
  in parallel. This is the most beneficial improvement, since the
  vertical pass involves a lot of cache thrashing.
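
As a rough illustration of what "single-pass lifting" can look like for the reversible 5x3 filter, the sketch below interleaves the low-pass and high-pass halves and applies both lifting steps in one loop, instead of interleaving first and then sweeping the line twice. It is a simplified scalar version under stated assumptions, not the PR's code: the function name is hypothetical, the signal is assumed to start on an even sample, `len` must be at least 2, and boundaries use symmetric extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenJPEG function).
 * low  : (len + 1) / 2 low-pass coefficients  s[n]
 * high : len / 2       high-pass coefficients d[n]
 * out  : len reconstructed samples, interleaved
 *   x[2n]   = s[n] - ((d[n-1] + d[n] + 2) >> 2)
 *   x[2n+1] = d[n] + ((x[2n] + x[2n+2]) >> 1)
 * Assumes >> on negative values is an arithmetic shift. */
void idwt53_1d_single_pass(const int32_t *low, const int32_t *high,
                           int32_t *out, int len)
{
    int sn = (len + 1) / 2;   /* number of low-pass coefficients  */
    int dn = len / 2;         /* number of high-pass coefficients */
    int i;
    int32_t d_prev, d_cur, even_prev, even_next;

    assert(len >= 2);

    /* Left boundary: d[-1] mirrors d[0]. */
    d_cur = high[0];
    even_prev = low[0] - ((d_cur + d_cur + 2) >> 2);

    for (i = 1; i < sn; i++) {
        d_prev = d_cur;
        /* Right boundary (odd len): d[dn] mirrors d[dn - 1]. */
        d_cur = (i < dn) ? high[i] : high[dn - 1];
        even_next = low[i] - ((d_prev + d_cur + 2) >> 2);
        /* Both neighbours of the odd sample are now known. */
        out[2 * (i - 1)]     = even_prev;
        out[2 * (i - 1) + 1] = d_prev + ((even_prev + even_next) >> 1);
        even_prev = even_next;
    }
    out[2 * (sn - 1)] = even_prev;

    if (len % 2 == 0) {
        /* Last odd sample: its right neighbour mirrors the left one. */
        out[len - 1] = d_cur + even_prev;
    }
}
```

For example, `low = {2, 3, 6}` and `high = {2, 3, -1}` with `len = 6` reconstruct the samples `1, 3, 2, 6, 5, 4`.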

With the bench_dwt utility with default arguments (16383x16383 image),
time goes from 4.064 s to 1.212 s.
Thanks to our macros that abstract SSE use, the functions can use
AVX2 when available (at compile time).

This brings an extra 23% speed improvement on bench_dwt in 64-bit builds
with AVX2 compared to SSE2.
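
The idea of macros that abstract the SIMD width could be sketched as follows. The `EX_*` macro names and the function are made up for this example and are not the names used in the PR; the selection relies on the `__AVX2__` and `__SSE2__` macros that GCC and Clang predefine according to the compilation flags. The same kernel body then compiles to 8 lanes with AVX2, 4 lanes with SSE2, or plain scalar code otherwise.

```c
#include <stdint.h>

/* Illustrative macro layer; names are hypothetical. */
#if defined(__AVX2__)
#include <immintrin.h>
#define EX_VREG          __m256i
#define EX_LANES         8
#define EX_LOADU(p)      _mm256_loadu_si256((const __m256i *)(p))
#define EX_STOREU(p, v)  _mm256_storeu_si256((__m256i *)(p), (v))
#define EX_ADD(a, b)     _mm256_add_epi32((a), (b))
#define EX_SAR(a, n)     _mm256_srai_epi32((a), (n))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define EX_VREG          __m128i
#define EX_LANES         4
#define EX_LOADU(p)      _mm_loadu_si128((const __m128i *)(p))
#define EX_STOREU(p, v)  _mm_storeu_si128((__m128i *)(p), (v))
#define EX_ADD(a, b)     _mm_add_epi32((a), (b))
#define EX_SAR(a, n)     _mm_srai_epi32((a), (n))
#endif

/* Hypothetical helper: vertical 5/3 inverse "predict" step for one odd row,
 *   odd_row[c] = d_row[c] + ((even_above[c] + even_below[c]) >> 1)
 * Assumes >> on negative values is an arithmetic shift. */
void idwt53_v_predict_row(const int32_t *d_row,
                          const int32_t *even_above,
                          const int32_t *even_below,
                          int32_t *odd_row,
                          int width)
{
    int c = 0;
#ifdef EX_VREG
    /* EX_LANES adjacent columns per iteration, whatever the register width */
    for (; c + EX_LANES <= width; c += EX_LANES) {
        EX_VREG d = EX_LOADU(d_row + c);
        EX_VREG a = EX_LOADU(even_above + c);
        EX_VREG b = EX_LOADU(even_below + c);
        EX_STOREU(odd_row + c, EX_ADD(d, EX_SAR(EX_ADD(a, b), 1)));
    }
#endif
    /* scalar tail, or full scalar path when no SIMD macro set is defined */
    for (; c < width; c++) {
        odd_row[c] = d_row[c] + ((even_above[c] + even_below[c]) >> 1);
    }
}
```

Because the choice is made by the preprocessor, an AVX2 build of such code only runs on AVX2-capable CPUs; there is no runtime dispatch in this sketch.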
…able tests since Travis doesn't have AVX2 compatible machines)
rouault commented Jun 21, 2017

Note: the failure in AppVeyor is a network flake. Passes on the same commit pushed to my account: https://ci.appveyor.com/project/rouault/openjpeg/build/2.1.1.15

rouault commented Jun 26, 2017

Results on opj_decompress time on 8c05f00a-ae05-4dd5-bdc7-a1b5eed4ebfb.jp2 from testovani : 15595 wide x 11128 tall x 3 components

idwt_53_improvements branch, SSE2 : 48.698s
idwt_53_improvements branch, AVX2 : 48.050s
master branch, SSE2: 55.759s
master branch, AVX2: 55.294s

So a global decrease of 12.6% (7.061 s) from master to the idwt_53_improvements branch with SSE2, and an extra decrease of 1.3% from SSE2 to AVX2 within the idwt_53_improvements branch.
Note: the SSE2->AVX2 improvement here combines a gain from recompiling the whole code base with AVX2 (55.759 - 55.294 = 0.465 s, i.e. 465 ms) and a specific improvement due to the IDWT 5x3 AVX2 optimization (48.698 - 48.050 - 0.465 = 0.183 s, i.e. 183 ms).

2 participants