
Eigen backend optimization for Dense, GRU and LSTM layers #108

Conversation

@IHorvalds (Contributor) commented Aug 6, 2023

Hello,

Recently, I read an article on optimising LSTMs and was looking over the RTNeural implementations. This PR applies some of the ideas in the article to the Eigen backend.

Essentially, since Eigen vectorizes operations, it's advantageous to do fewer operations on larger operands than many operations on smaller ones.
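For context, here is a rough sketch of the general technique (an illustrative example only, not the code in this PR; the `FusedLSTMGates` struct and its member names are made up):

```cpp
#include <Eigen/Dense>

// Sketch: stack the four LSTM gate weight matrices so that one large
// matrix-vector product replaces four small ones. Eigen can vectorize
// the single big product more effectively than the four separate ones.
struct FusedLSTMGates
{
    Eigen::MatrixXf combinedW; // (4 * hidden_size) x in_size, rows stacked as [i; f; g; o]
    Eigen::MatrixXf combinedU; // (4 * hidden_size) x hidden_size
    Eigen::VectorXf bias;      // 4 * hidden_size

    // Pre-activations for all four gates, computed in one pass.
    Eigen::VectorXf gatePreActivations(const Eigen::VectorXf& x,
                                       const Eigen::VectorXf& hPrev) const
    {
        return combinedW * x + combinedU * hPrev + bias;
    }
};
```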

I only modified the dense, GRU and LSTM layers because those are the ones I'm familiar with.

Benchmark results

I ran the benchmark with a 10s signal, 16 inputs and 16 outputs, in the Release configuration, on a MacBook Air (2022, M2, 16GB) and an 11th-gen Intel Core i5 with 32GB RAM running Windows 11.

Numbers are "times faster than real-time".

M2 (Apple Clang)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 414.80 | 693.66 |
| Dense (templated) | 720.45 | 826.06 |
| GRU (dynamic) | 47.53 | 67.16 |
| GRU (templated) | 59.83 | 68.43 |
| LSTM (dynamic) | 33.62 | 46.53 |
| LSTM (templated) | 45.10 | 49.18 |

Intel (No AVX2, MSVC)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 170.89 | 262.39 |
| Dense (templated) | 295.52 | 292.04 |
| GRU (dynamic) | 19.22 | 27.83 |
| GRU (templated) | 27.37 | 28.00 |
| LSTM (dynamic) | 13.50 | 20.50 |
| LSTM (templated) | 20.20 | 21.23 |

Intel (No AVX2, Clang-cl)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 188.74 | 305.48 |
| Dense (templated) | 314.58 | 327.81 |
| GRU (dynamic) | 21.12 | 29.46 |
| GRU (templated) | 29.07 | 31.44 |
| LSTM (dynamic) | 14.82 | 20.97 |
| LSTM (templated) | 20.71 | 22.86 |

Intel (with AVX2, MSVC)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 206.34 | 350.06 |
| Dense (templated) | 355.337 | 344.14 |
| GRU (dynamic) | 24.51 | 40.77 |
| GRU (templated) | 37.54 | 44.16 |
| LSTM (dynamic) | 17.76 | 30.95 |
| LSTM (templated) | 28.12 | 32.62 |

Intel (with AVX2, Clang-cl)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 217.82 | 400.09 |
| Dense (templated) | 433.96 | 470.44 |
| GRU (dynamic) | 27.96 | 48.41 |
| GRU (templated) | 45.27 | 50.16 |
| LSTM (dynamic) | 19.84 | 34.03 |
| LSTM (templated) | 33.09 | 37.35 |

I did not try it on a desktop Linux or a Raspberry Pi.

Note on readability

The current implementation of the Eigen backend on main is extremely easy to read; for the most part one can probably just look at it and understand the operations and their order in these layers (this is what I did about a year ago).

This new implementation, on the other hand, is quite difficult to read, I think. I left comments that hopefully explain what's going on, but they're too fresh in my eyes for me to tell whether they're useful. Naming suggestions and feedback on the comments would be very, very helpful!

Thanks for reading this and hopefully it's useful!

Make the Eigen implementation around 30-40% faster by fusing matrix operations.
Added the templated implementation for the GRU layer.
Currently, on an Apple MBA 2022, 10s 16in 16out, Release config:
- dynamic implementation is ~67x faster than realtime
- templated implementation is ~69x faster than realtime
@codecov-commenter commented Aug 7, 2023

Codecov Report

Merging #108 (c726e13) into main (fce3f70) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

```diff
@@            Coverage Diff             @@
##             main     #108      +/-   ##
==========================================
- Coverage   95.75%   95.74%   -0.01%     
==========================================
  Files          55       55              
  Lines        3953     3949       -4     
==========================================
- Hits         3785     3781       -4     
  Misses        168      168              
```
| Files Changed | Coverage Δ |
| --- | --- |
| RTNeural/dense/dense_eigen.h | 95.55% <100.00%> (+0.43%) ⬆️ |
| RTNeural/gru/gru_eigen.h | 96.55% <100.00%> (+1.81%) ⬆️ |
| RTNeural/gru/gru_eigen.tpp | 100.00% <100.00%> (ø) |
| RTNeural/lstm/lstm_eigen.h | 96.29% <100.00%> (-0.32%) ⬇️ |
| RTNeural/lstm/lstm_eigen.tpp | 100.00% <100.00%> (ø) |


@jatinchowdhury18 (Owner)

Thanks for the pull request! I'm still going over all the changes, but they look great at first glance.

The failing tests are related to the sample-rate correction modes for the recurrent layers. I should be able to have a closer look at those tomorrow, but if you get to them first that's fine too :)

The performance numbers look fantastic. I've been using the XSIMD backend for most recurrent networks, but I should test again with the new changes (from the CI runs, XSIMD appears to still be a little bit faster, but benchmarks on the CI runners aren't very reliable). I did notice that the Dense templated layer appears to be a little bit slower on the Intel CPU with the new changes... The difference is small enough that I think the win on ARM is worth doing the change, but I'd like to test at a few more layer sizes first.

I wouldn't worry too much about readability for now... I would still consider the STL implementation quite readable, and the comments that you left with your changes look very helpful as well.

@IHorvalds (Contributor, Author)

Thanks! I'll try to fix the failing tests sometime today. :)

GRU:
The issue was that for the case where SampleRateCorrectionMode was not None,
the values for h(t-1) in extendedHt1 weren't actually the ones in the `outs`
vector as set by `processDelay`.

LSTM:
Similar issue. When SampleRateCorrectionMode was not None, the h(t-1)
in subsequent `forward` calls did not contain the same values as the
`outs` vector as set by `processDelay`.

The `cVec` did not have the same issue because after setting the values of
`ct_delayed[delayWriteIdx]` in `computeOutputsInternal`, `processDelay` puts
the right value in `cVec`.
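A minimal sketch of the kind of fix described above (the identifiers `extendedHt1` and `outs` come from this discussion; the helper below is hypothetical and the actual change may look different):

```cpp
#include <Eigen/Dense>

// Hypothetical helper: copy the delayed recurrent state written by
// processDelay() back into the "extended" h(t-1) vector used for the
// fused matrix multiply, and restore the trailing 1 used for the bias.
inline void syncExtendedState(Eigen::VectorXf& extendedHt1,
                              const Eigen::VectorXf& outs)
{
    const Eigen::Index n = outs.size();
    extendedHt1.head(n) = outs; // h(t-1) entries must match outs
    extendedHt1(n) = 1.0f;      // constant 1 that multiplies the bias column
}
```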
@IHorvalds (Contributor, Author) commented Aug 7, 2023

Fixed it. rtneural_tests all passed on my machine.

I'll try tomorrow to profile the templated dense implementation on Intel and figure out if I can make it any faster there.
I'll also try with Clang and report if it's any faster than MSVC.

@jatinchowdhury18 (Owner) left a comment

Looking good! There's a few spots where data is being copied that I wish we could avoid, but if we can't, that's not the end of the world!

Comment on lines +152 to +153:

```cpp
for (int i = 0; i < in_size; ++i)
    ins_internal(i, 0) = ins(i, 0);
```
@jatinchowdhury18 (Owner)

It would be great if there were a way to do this kind of "one-multiply" dense layer operation without requiring the copy here... Unfortunately I can't think of an easy way at the moment.

Maybe it would be possible to set up something at a higher level which makes sure that all the layer input/output matrices have enough memory available to be extended to be one row longer? That way we could do something like an Eigen::Map to "grow" the input vector by one row, and just put a 1 in that row, which would avoid the copy. That seems a bit cumbersome though, so it would be cool if there were an easier way.
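For what it's worth, a rough sketch of that Map idea, assuming the caller over-allocates the input buffer by one element (nothing like this exists in the PR; `denseOneMultiply` and its parameters are hypothetical):

```cpp
#include <Eigen/Dense>

// combinedWeights is out_size x (in_size + 1), with the bias in the last column.
// inData points to in_size valid samples followed by one spare slot.
Eigen::VectorXf denseOneMultiply(const Eigen::MatrixXf& combinedWeights,
                                 float* inData, int in_size)
{
    inData[in_size] = 1.0f; // the constant row that picks up the bias column
    Eigen::Map<const Eigen::VectorXf> extendedIn(inData, in_size + 1);
    return combinedWeights * extendedIn; // single multiply, no copy
}
```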

Comment on lines +277 to +280:

```cpp
for (int i = 0; i < out_sizet; ++i)
{
    outs(i) = extendedHt1(i);
}
```
@jatinchowdhury18 (Owner)

Similarly, I think there should be a way to avoid this copy... I'll try out a couple of ideas.

@IHorvalds (Contributor, Author) commented Aug 8, 2023

I compiled it with clang-cl on Windows with AVX2, Release config and this is the performance I got for the dense layer:
[benchmark screenshot]

Much better than MSVC somehow, but I really don't understand why...

I've also run it with a profiler to see where most of the time is being spent and the templated implementation doesn't appear in the hot path.
The dynamic implementation seems to spend quite a bit of time doing the copy though:
[profiler screenshot]

I'll come back to this this weekend, re-run the benchmarks with clang-cl and update the description of the PR.

I agree with your comments, @jatinchowdhury18; it really would be nice not to have to copy the inputs/outputs into the extended arrays, but I can't come up with an elegant way to do it right now.
Also, strangely enough, the fastest way I measured to do the copies for the input vector was actually the plain for-loops. memcpy, std::copy (with iterators and with pointers), and things like ins_internal.segment(0, in_sizet) = ... were all measurably slower.
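For reference, a hedged sketch of the copy variants being compared (the Eigen types and names here are assumptions, not the PR's actual declarations):

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cstring>

void copyVariants(const Eigen::VectorXf& ins, Eigen::VectorXf& ins_internal, int in_size)
{
    // 1. plain for-loop (fastest in these measurements)
    for (int i = 0; i < in_size; ++i)
        ins_internal(i) = ins(i);

    // 2. memcpy on the underlying contiguous storage
    std::memcpy(ins_internal.data(), ins.data(), in_size * sizeof(float));

    // 3. std::copy over raw pointers
    std::copy(ins.data(), ins.data() + in_size, ins_internal.data());

    // 4. Eigen segment assignment
    ins_internal.segment(0, in_size) = ins.head(in_size);
}
```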

@sharp-trickster

That's beautiful.
Great work.

@jatinchowdhury18 (Owner)

> I compiled it with clang-cl on Windows with AVX2, Release config and this is the performance I got for the dense layer...
> Much better than MSVC somehow, but I really don't understand why...

This is really interesting! I have seen several cases where clang-cl does a better job of optimizing some code than MSVC, but the difference in this case is a bit more extreme than I would expect. (As an aside, I've been using clang-cl more and more lately, but that's mainly because I like being able to use the same compiler across Windows/Mac/Linux.)

> I've also run it with a profiler to see where most of the time is being spent and the templated implementation doesn't appear in the hot path. The dynamic implementation seems to spend quite a bit of time doing the copy though: [profiler screenshot]

For the templated implementation the compiler is usually able to inline the entire forward() method for each layer into whatever context is calling it, so it's usually hard to get direct profiling information (sometimes it's possible to coax some information out of the generated assembly, but that can be tricky). Even for the dynamic implementation, I don't always trust which lines of code the profiler claims are eating up all the time. In this case, I would imagine that the memcpy and the for loop copies are both taking up 0.49%, and the matrix multiply is using 5.42% + 4.93%, but for some reason the profiler is splitting this up across two lines.

> I'll come back to this this weekend, re-run the benchmarks with clang-cl and update the description of the PR.

Sounds good! I'll leave the PR open for you to merge whenever you like. Since I've approved the PR you should have permission to merge it, but please let me know if that's not the case.

> I agree with your comments, @jatinchowdhury18; it really would be nice not to have to copy the inputs/outputs into the extended arrays, but I can't come up with an elegant way to do it right now.

For now I think it's fine to merge these changes as-is, and then we can think more about how to remove some of the copy steps later on (unless some inspiration strikes one of us in the coming days).

> Also, strangely enough, the fastest way I measured to do the copies for the input vector was actually the plain for-loops. memcpy, std::copy (with iterators and with pointers), and things like ins_internal.segment(0, in_sizet) = ... were all measurably slower.

This is interesting... I'm curious how much the difference ended up being?

@IHorvalds (Contributor, Author)

The difference in performance for the dense layer on M2, 10s 16in 16out:

| Copy method | Dynamic | Templated |
| --- | --- | --- |
| for-loop | 707.72x | 874.091x |
| memcpy | 658.057x | 861.577x |
| std::copy with pointers | 693.393x | 864.482x |
| std::copy with iterators (using std::span) | Unstable (between 588x and 660x) | 847.775x (also unstable, but not quite as much) |
| `.segment<in_sizet>(0) = ...` | - | 862.968x |

This is just for the input vector.

I realize these numbers are slightly higher than in the PR description. They go down a bit when the laptop heats up (passive cooling...), but the proportions hold.

Apparently, the += operator with NoAlias is about as fast as just the = operator.

So I figured I'd give it a shot. On M2 it's a little faster still.
I'll try it on Windows in a minute.
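In Eigen terms, the observation above amounts to the accumulating form being about as fast as plain assignment once aliasing is ruled out; a hedged sketch with illustrative names (not the PR's identifiers):

```cpp
#include <Eigen/Dense>

// acc, W, U, x, and h are illustrative names only.
void gatePreActivation(Eigen::VectorXf& acc,
                       const Eigen::MatrixXf& W, const Eigen::MatrixXf& U,
                       const Eigen::VectorXf& x, const Eigen::VectorXf& h)
{
    acc.noalias() = W * x;  // plain assignment, no aliasing temporary
    acc.noalias() += U * h; // accumulation, measured to be about as fast
}
```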
@IHorvalds (Contributor, Author)

Hey @jatinchowdhury18, the last 2 commits revoked the permission to merge. When you have time, could you approve it again?
PS. I updated the description with the clang-cl benchmarks :)

@jatinchowdhury18 (Owner) left a comment

Looking good! Feel free to merge when you're ready.

@IHorvalds (Contributor, Author)

It still doesn't let me merge:
[screenshot]

@jatinchowdhury18 (Owner)

Hmm, must be some weird GitHub permissions thing. I'll get that worked out, but for now I can just merge it myself.

Thanks again!

@jatinchowdhury18 jatinchowdhury18 merged commit e802121 into jatinchowdhury18:main Aug 14, 2023
20 of 21 checks passed
@IHorvalds (Contributor, Author)

Thank you! :)
