IDWT 5x3 single-pass lifting and SSE2/AVX2 implementation #957

rouault · 2017-06-21T11:36:54Z

Implements #953

With the new bench_dwt utility, on x86_64:

before changes: 3.356 s
with SSE2 optimization (default for an x86_64 build): 0.992 s
with AVX2 optimization (requested at compilation time): 0.744 s

SSE2/AVX2 is used in the vertical pass to handle several columns at the same time. This avoids a lot of CPU cache thrashing.
Note: I tried an SSE2-optimized version of opj_idwt53_h_cas0(), but the gain is almost unnoticeable, so it is not included in this PR.
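
To illustrate why handling several columns together helps: in a row-major image, walking down one column touches a different cache line at every step, whereas applying a lifting step to a whole row of coefficients at once reads adjacent memory. The following is only a sketch of that idea, not the code in dwt.c. The function name `idwt53_v_even_row` and its parameters are hypothetical, it handles 4 columns per SSE2 register, and it assumes `>>` on negative `int32_t` values is an arithmetic shift (true for the usual compilers).

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: inverse 5x3 "even row" lifting step,
 *   out = low - ((high_above + high_below + 2) >> 2),
 * applied to every column of one row. With SSE2, 4 adjacent columns
 * are processed per iteration; the tail falls back to scalar code. */
void idwt53_v_even_row(const int32_t *low_row,
                       const int32_t *high_above,
                       const int32_t *high_below,
                       int32_t *out_row, size_t ncols)
{
    size_t c = 0;
#if defined(__SSE2__)
    const __m128i two = _mm_set1_epi32(2);
    for (; c + 4 <= ncols; c += 4) {
        __m128i s  = _mm_loadu_si128((const __m128i *)(low_row + c));
        __m128i d0 = _mm_loadu_si128((const __m128i *)(high_above + c));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(high_below + c));
        /* (d0 + d1 + 2) >> 2, arithmetic shift on each 32-bit lane */
        __m128i t = _mm_srai_epi32(_mm_add_epi32(_mm_add_epi32(d0, d1), two), 2);
        _mm_storeu_si128((__m128i *)(out_row + c), _mm_sub_epi32(s, t));
    }
#endif
    /* Scalar tail (or the whole row when SSE2 is not available). */
    for (; c < ncols; c++) {
        out_row[c] = low_row[c] - ((high_above[c] + high_below[c] + 2) >> 2);
    }
}
```

Unaligned loads and stores are used here to keep the sketch free of alignment assumptions about the caller's buffers.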

* Use single-pass lifting inverse wavelet transform.
* For vertical pass, use SSE2 when available so as to process 8 columns in parallel. This is the most beneficial improvement, since the vertical pass involves a lot of cache thrashing. With the bench_dwt utility with default arguments (16383x16383 image), time goes from 4.064 s to 1.212 s.
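
For readers unfamiliar with the term, single-pass lifting means that the even samples and the odd samples of the reversible 5x3 inverse transform are produced in the same loop, instead of one loop over all even samples followed by a second loop over all odd samples. A minimal 1D scalar sketch, assuming the first output sample is an even (low-pass) one and that `>>` on negative values is an arithmetic shift; the function name `idwt53_1d_single_pass` and the separate `low`/`high` input arrays are hypothetical and do not mirror the layout used in dwt.c:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: inverse reversible 5x3 transform of one line.
 *   x[2n]   = low[n]  - ((high[n-1] + high[n] + 2) >> 2)
 *   x[2n+1] = high[n] + ((x[2n] + x[2n+2]) >> 1)
 * with symmetric extension at both ends. low holds (len+1)/2 values,
 * high holds len/2 values, out receives len samples. */
void idwt53_1d_single_pass(const int32_t *low, const int32_t *high,
                           int32_t *out, int len)
{
    int sn = (len + 1) / 2;   /* number of low-pass coefficients  */
    int dn = len / 2;         /* number of high-pass coefficients */
    int i;
    int32_t d_prev, d_cur, even_prev, even_next;

    assert(len >= 2);

    /* First even sample: high[-1] is mirrored onto high[0]. */
    d_prev = high[0];
    even_prev = low[0] - ((d_prev + d_prev + 2) >> 2);
    out[0] = even_prev;

    for (i = 1; i < sn; i++) {
        /* For odd len, the last high[i] is mirrored onto high[dn - 1]. */
        d_cur = (i < dn) ? high[i] : high[dn - 1];
        even_next = low[i] - ((d_prev + d_cur + 2) >> 2);
        out[2 * i] = even_next;
        /* The odd sample between the two even samples is now computable. */
        out[2 * i - 1] = d_prev + ((even_prev + even_next) >> 1);
        d_prev = d_cur;
        even_prev = even_next;
    }

    if (len % 2 == 0) {
        /* Last odd sample: x[len] is mirrored onto x[len - 2]. */
        out[len - 1] = d_prev + even_prev;
    }
}
```

Each output sample is written exactly once and each coefficient is read while its neighbours are still in registers, which is the point of merging the two passes.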

rouault · 2017-06-21T12:24:02Z

Note: the failure in AppVeyor is a network flake. Passes on the same commit pushed to my account: https://ci.appveyor.com/project/rouault/openjpeg/build/2.1.1.15

rouault · 2017-06-26T10:45:20Z

opj_decompress timing results on 8c05f00a-ae05-4dd5-bdc7-a1b5eed4ebfb.jp2 from testovani (15595 wide x 11128 tall x 3 components):

idwt_53_improvements branch, SSE2: 48.698s
idwt_53_improvements branch, AVX2: 48.050s
master branch, SSE2: 55.759s
master branch, AVX2: 55.294s

So a global decrease of 12.6% (7.061 s) from master to the idwt_53_improvements branch with SSE2, and an extra decrease of 1.3% from SSE2 to AVX2 in the idwt_53_improvements branch.
Note: the SSE2->AVX2 improvement here consists of the gain from recompiling the whole code base with AVX2 (55.759 - 55.294 = 0.465 s) plus a specific improvement due to the IDWT 5x3 AVX2 optimization (48.698 - 48.050 - 0.465 = 0.183 s).

rouault added 6 commits June 20, 2017 17:56

Add bench_dwt program (compiled only if BUILD_BENCH_DWT=ON)

919ed5f

Enable __SSE__ / __SSE2__ with Visual Studio

f06cfad
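
For context, Visual Studio does not predefine `__SSE__` / `__SSE2__` the way GCC and Clang do, so code guarded by those macros is silently skipped there. One way to enable them, sketched below, is to derive them from the macros MSVC does provide (`_M_X64`, and `_M_IX86_FP` on 32-bit x86). This is an illustration of the approach and not necessarily the exact preprocessor logic of the commit; `simd_level` is a hypothetical helper added only to make the effect observable.

```c
/* MSVC: _M_X64 is defined for x64 targets, where SSE2 is always present.
 * On 32-bit x86, _M_IX86_FP is 1 with /arch:SSE and 2 with /arch:SSE2
 * or a higher /arch setting. */
#if defined(_MSC_VER)
#  if !defined(__SSE__) && \
      (defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1))
#    define __SSE__ 1
#  endif
#  if !defined(__SSE2__) && \
      (defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2))
#    define __SSE2__ 1
#  endif
#endif

/* Hypothetical helper: reports which code path the guards would select
 * (2 = SSE2, 1 = SSE only, 0 = neither). */
int simd_level(void)
{
#if defined(__SSE2__)
    return 2;
#elif defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}
```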

dwt.c: small cleanup

f6e3475

IDWT 5x3: generalize SSE2 version for AVX2

fd0dc53

Thanks to our macros that abstract SSE use, the functions can use AVX2 when available (at compile time). This brings an extra 23% speed improvement on bench_dwt in 64-bit builds with AVX2 compared to SSE2.
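
A rough sketch of what such an abstraction can look like: the lifting code is written once against a small set of macros, and the macros expand to 256-bit AVX2 intrinsics or 128-bit SSE2 intrinsics depending on what the compiler was asked to target. The macro names (`VEC_T`, `VEC_LANES`, `VEC_LOADU`, ...) and the function `idwt53_v_odd_row` are hypothetical and are not the names used in dwt.c; arithmetic right shift of negative values is assumed in the scalar tail.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#define VEC_T            __m256i
#define VEC_LANES        8
#define VEC_LOADU(p)     _mm256_loadu_si256((const __m256i *)(p))
#define VEC_STOREU(p, v) _mm256_storeu_si256((__m256i *)(p), (v))
#define VEC_ADD(a, b)    _mm256_add_epi32((a), (b))
#define VEC_SAR(a, n)    _mm256_srai_epi32((a), (n))
#elif defined(__SSE2__)
#include <emmintrin.h>
#define VEC_T            __m128i
#define VEC_LANES        4
#define VEC_LOADU(p)     _mm_loadu_si128((const __m128i *)(p))
#define VEC_STOREU(p, v) _mm_storeu_si128((__m128i *)(p), (v))
#define VEC_ADD(a, b)    _mm_add_epi32((a), (b))
#define VEC_SAR(a, n)    _mm_srai_epi32((a), (n))
#endif

/* Hypothetical helper: inverse 5x3 "odd row" lifting step,
 *   out = high + ((even_above + even_below) >> 1),
 * applied to every column of one row. The loop body is identical for
 * SSE2 and AVX2; only the macro expansions and lane count differ. */
void idwt53_v_odd_row(const int32_t *high_row,
                      const int32_t *even_above,
                      const int32_t *even_below,
                      int32_t *out_row, size_t ncols)
{
    size_t c = 0;
#if defined(VEC_LANES)
    for (; c + VEC_LANES <= ncols; c += VEC_LANES) {
        VEC_T d = VEC_LOADU(high_row + c);
        VEC_T a = VEC_LOADU(even_above + c);
        VEC_T b = VEC_LOADU(even_below + c);
        VEC_STOREU(out_row + c, VEC_ADD(d, VEC_SAR(VEC_ADD(a, b), 1)));
    }
#endif
    /* Scalar tail (or the whole row when no SIMD target is enabled). */
    for (; c < ncols; c++) {
        out_row[c] = high_row[c] + ((even_above[c] + even_below[c]) >> 1);
    }
}
```

Because the choice is made by the preprocessor, the AVX2 path only exists in binaries compiled with AVX2 enabled (for example `-mavx2` with GCC or Clang).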

.travis.yml: add a configuration to test compilation of AVX2 (but dis…

4fe7620

…able tests since Travis doesn't have AVX2 compatible machines)

rouault requested review from detonin and CharlesBuysschaertIntopix June 21, 2017 11:36

rouault merged commit 533fa2f into uclouvain:master Jun 26, 2017

This was referenced Jun 26, 2017

Port single-pass & SSE2/AVX2 optimizations of IDWT 5x3 to forward DWT 5x3 (compression) #959

Open

SSE2 optimization for horizontal pass of IDWT 5x3 #960

Open

Dynamic switch at runtime between SSE2 and AVX2 optim of IDWT 5x3 #961

Open

rouault added a commit that referenced this pull request Jun 29, 2017

IDWT 5x3: fix bug in AVX2 implementation (#953, #957)

8fa405e

detonin added the enhancement label Aug 3, 2017
