
Eigen backend optimization for Dense, GRU and LSTM layers #108

Conversation

@IHorvalds (Contributor) commented Aug 6, 2023

Hello,

Recently, I read an article on optimising LSTMs and was looking over the RTNeural implementations. This PR applies some of the ideas in the article to the Eigen backend.

Essentially, since Eigen vectorizes operations, it's advantageous to do fewer operations on larger operands than many operations on smaller ones.
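For context, here is a rough sketch of the general technique (an illustrative example only, not the code in this PR; the `FusedLSTMGates` struct and its member names are made up):

```cpp
#include <Eigen/Dense>

// Sketch: stack the four LSTM gate weight matrices so that one large
// matrix-vector product replaces four small ones. Eigen can vectorize
// the single big product more effectively than the four separate ones.
struct FusedLSTMGates
{
    Eigen::MatrixXf combinedW; // (4 * hidden_size) x in_size, rows stacked as [i; f; g; o]
    Eigen::MatrixXf combinedU; // (4 * hidden_size) x hidden_size
    Eigen::VectorXf bias;      // 4 * hidden_size

    // Pre-activations for all four gates, computed in one pass.
    Eigen::VectorXf gatePreActivations(const Eigen::VectorXf& x,
                                       const Eigen::VectorXf& hPrev) const
    {
        return combinedW * x + combinedU * hPrev + bias;
    }
};
```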

I only modified the dense, GRU and LSTM layers because those are the ones I'm familiar with.

Benchmark results

I ran the benchmark with a 10s signal, 16 inputs and 16 outputs, in the Release configuration, on a MacBook Air (2022, M2, 16GB) and an 11th-gen Intel Core i5 with 32GB RAM running Windows 11.

Numbers are "times faster than real-time".

M2 (Apple Clang)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 414.80 | 693.66 |
| Dense (templated) | 720.45 | 826.06 |
| GRU (dynamic) | 47.53 | 67.16 |
| GRU (templated) | 59.83 | 68.43 |
| LSTM (dynamic) | 33.62 | 46.53 |
| LSTM (templated) | 45.10 | 49.18 |

Intel (No AVX2, MSVC)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 170.89 | 262.39 |
| Dense (templated) | 295.52 | 292.04 |
| GRU (dynamic) | 19.22 | 27.83 |
| GRU (templated) | 27.37 | 28.00 |
| LSTM (dynamic) | 13.50 | 20.50 |
| LSTM (templated) | 20.20 | 21.23 |

Intel (No AVX2, Clang-cl)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 188.74 | 305.48 |
| Dense (templated) | 314.58 | 327.81 |
| GRU (dynamic) | 21.12 | 29.46 |
| GRU (templated) | 29.07 | 31.44 |
| LSTM (dynamic) | 14.82 | 20.97 |
| LSTM (templated) | 20.71 | 22.86 |

Intel (with AVX2, MSVC)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 206.34 | 350.06 |
| Dense (templated) | 355.337 | 344.14 |
| GRU (dynamic) | 24.51 | 40.77 |
| GRU (templated) | 37.54 | 44.16 |
| LSTM (dynamic) | 17.76 | 30.95 |
| LSTM (templated) | 28.12 | 32.62 |

Intel (with AVX2, Clang-cl)

| Layer | main | this branch |
| --- | --- | --- |
| Dense (dynamic) | 217.82 | 400.09 |
| Dense (templated) | 433.96 | 470.44 |
| GRU (dynamic) | 27.96 | 48.41 |
| GRU (templated) | 45.27 | 50.16 |
| LSTM (dynamic) | 19.84 | 34.03 |
| LSTM (templated) | 33.09 | 37.35 |

I did not try it on a desktop Linux or a Raspberry Pi.

Note on readability

The current implementation of the Eigen backend on main is extremely easy to read; for the most part one can probably just look at it and understand the operations and their order in these layers (this is what I did about a year ago).

This new implementation, on the other hand, is quite difficult to read, I think. I left comments that hopefully explain what's going on, but they're too fresh in my eyes for me to tell whether they're useful. Naming suggestions and feedback on the comments would be very, very helpful!

Thanks for reading this and hopefully it's useful!

Make the Eigen implementation around 30-40% faster by fusing matrix operations.
Added the templated implementation for the GRU layer.
Currently, on an Apple MBA 2022, 10s 16in 16out, Release config:
- dynamic implementation is ~67x faster than realtime
- templated implementation is ~69x faster than realtime
@codecov-commenter commented Aug 7, 2023

Codecov Report

Merging #108 (c726e13) into main (fce3f70) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

```diff
@@            Coverage Diff             @@
##             main     #108      +/-   ##
==========================================
- Coverage   95.75%   95.74%   -0.01%     
==========================================
  Files          55       55              
  Lines        3953     3949       -4     
==========================================
- Hits         3785     3781       -4     
  Misses        168      168              
```
| Files Changed | Coverage Δ |
| --- | --- |
| RTNeural/dense/dense_eigen.h | 95.55% <100.00%> (+0.43%) ⬆️ |
| RTNeural/gru/gru_eigen.h | 96.55% <100.00%> (+1.81%) ⬆️ |
| RTNeural/gru/gru_eigen.tpp | 100.00% <100.00%> (ø) |
| RTNeural/lstm/lstm_eigen.h | 96.29% <100.00%> (-0.32%) ⬇️ |
| RTNeural/lstm/lstm_eigen.tpp | 100.00% <100.00%> (ø) |


@jatinchowdhury18 (Owner)

Thanks for the pull request! I'm still going over all the changes, but they look great at first glance.

The failing tests are related to the sample-rate correction modes for the recurrent layers. I should be able to have a closer look at those tomorrow, but if you get to them first that's fine too :)

The performance numbers look fantastic. I've been using the XSIMD backend for most recurrent networks, but I should test again with the new changes (from the CI runs, XSIMD appears to still be a little bit faster, but benchmarks on the CI runners aren't very reliable). I did notice that the Dense templated layer appears to be a little bit slower on the Intel CPU with the new changes... The difference is small enough that I think the win on ARM is worth doing the change, but I'd like to test at a few more layer sizes first.

I wouldn't worry too much about readability for now... I would still consider the STL implementation quite readable, and the comments that you left with your changes look very helpful as well.

@IHorvalds (Contributor, Author)

Thanks! I'll try to fix the failing tests sometime today. :)

GRU:
The issue was that for the case where SampleRateCorrectionMode was not None,
the values for h(t-1) in extendedHt1 weren't actually the ones in the `outs`
vector as set by `processDelay`.

LSTM:
Similar issue. When SampleRateCorrectionMode was not None, the h(t-1)
in subsequent `forward` calls did not contain the same values as the
`outs` vector as set by `processDelay`.

The `cVec` did not have the same issue because after setting the values of
`ct_delayed[delayWriteIdx]` in `computeOutputsInternal`, `processDelay` puts
the right value in `cVec`.
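A minimal sketch of the kind of fix described above (the identifiers `extendedHt1` and `outs` come from this discussion; the helper below is hypothetical and the actual change may look different):

```cpp
#include <Eigen/Dense>

// Hypothetical helper: copy the delayed recurrent state written by
// processDelay() back into the "extended" h(t-1) vector used for the
// fused matrix multiply, and restore the trailing 1 used for the bias.
inline void syncExtendedState(Eigen::VectorXf& extendedHt1,
                              const Eigen::VectorXf& outs)
{
    const Eigen::Index n = outs.size();
    extendedHt1.head(n) = outs; // h(t-1) entries must match outs
    extendedHt1(n) = 1.0f;      // constant 1 that multiplies the bias column
}
```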
@IHorvalds (Contributor, Author) commented Aug 7, 2023

Fixed it. rtneural_tests all passed on my machine.

I'll try tomorrow to profile the templated dense implementation on Intel and figure out if I can make it any faster there.
I'll also try with Clang and report if it's any faster than MSVC.

@jatinchowdhury18 (Owner) left a comment

Looking good! There's a few spots where data is being copied that I wish we could avoid, but if we can't, that's not the end of the world!

Comment on lines +152 to +153:

```cpp
for (int i = 0; i < in_size; ++i)
    ins_internal(i, 0) = ins(i, 0);
```
@jatinchowdhury18 (Owner)

It would be great if there were a way to do this kind of "one-multiply" dense layer operation without requiring the copy here... Unfortunately I can't think of an easy way at the moment.

Maybe it would be possible to set up something at a higher level which makes sure that all the layer input/output matrices have enough memory available to be extended to be one row longer? That way we could do something like an Eigen::Map to "grow" the input vector by one row, and just put a 1 in that row, which would avoid the copy. That seems a bit cumbersome though, so it would be cool if there were an easier way.
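For what it's worth, a rough sketch of that Map idea, assuming the caller over-allocates the input buffer by one element (nothing like this exists in the PR; `denseOneMultiply` and its parameters are hypothetical):

```cpp
#include <Eigen/Dense>

// combinedWeights is out_size x (in_size + 1), with the bias in the last column.
// inData points to in_size valid samples followed by one spare slot.
Eigen::VectorXf denseOneMultiply(const Eigen::MatrixXf& combinedWeights,
                                 float* inData, int in_size)
{
    inData[in_size] = 1.0f; // the constant row that picks up the bias column
    Eigen::Map<const Eigen::VectorXf> extendedIn(inData, in_size + 1);
    return combinedWeights * extendedIn; // single multiply, no copy
}
```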

Comment on lines +277 to +280:

```cpp
for (int i = 0; i < out_sizet; ++i)
{
    outs(i) = extendedHt1(i);
}
```
@jatinchowdhury18 (Owner)

Similarly, I think there should be a way to avoid this copy... I'll try out a couple of ideas.

@IHorvalds (Contributor, Author) commented Aug 8, 2023

I compiled it with clang-cl on Windows with AVX2, Release config and this is the performance I got for the dense layer:
[benchmark screenshot]

Much better than MSVC somehow, but I really don't understand why...

I've also run it with a profiler to see where most of the time is being spent and the templated implementation doesn't appear in the hot path.
The dynamic implementation seems to spend quite a bit of time doing the copy though:
[profiler screenshot]

I'll come back to this this weekend, re-run the benchmarks with clang-cl and update the description of the PR.

I agree with your comments, @jatinchowdhury18; it really would be nice not to have to copy the inputs/outputs into the extended arrays, but I can't come up with an elegant way to do it right now.
Also, strangely enough, the fastest way I measured to do the copies for the input vector was actually the plain for-loops. memcpy, std::copy (with iterators and with pointers), and things like ins_internal.segment(0, in_sizet) = ... were all measurably slower.
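For reference, a hedged sketch of the copy variants being compared (the Eigen types and names here are assumptions, not the PR's actual declarations):

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cstring>

void copyVariants(const Eigen::VectorXf& ins, Eigen::VectorXf& ins_internal, int in_size)
{
    // 1. plain for-loop (fastest in these measurements)
    for (int i = 0; i < in_size; ++i)
        ins_internal(i) = ins(i);

    // 2. memcpy on the underlying contiguous storage
    std::memcpy(ins_internal.data(), ins.data(), in_size * sizeof(float));

    // 3. std::copy over raw pointers
    std::copy(ins.data(), ins.data() + in_size, ins_internal.data());

    // 4. Eigen segment assignment
    ins_internal.segment(0, in_size) = ins.head(in_size);
}
```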

@sharp-trickster

That's beautiful.
Great work.

@jatinchowdhury18 (Owner)

> I compiled it with clang-cl on Windows with AVX2, Release config and this is the performance I got for the dense layer...
> Much better than MSVC somehow, but I really don't understand why...

This is really interesting! I have seen several cases where clang-cl does a better job of optimizing some code than MSVC, but the difference in this case is a bit more extreme than I would expect. (As an aside, I've been using clang-cl more and more lately, but that's mainly because I like being able to use the same compiler across Windows/Mac/Linux.)

> I've also run it with a profiler to see where most of the time is being spent and the templated implementation doesn't appear in the hot path. The dynamic implementation seems to spend quite a bit of time doing the copy though: [profiler screenshot]

For the templated implementation the compiler is usually able to inline the entire forward() method for each layer into whatever context is calling it, so it's usually hard to get direct profiling information (sometimes it's possible to coax some information out of the generated assembly, but that can be tricky). Even for the dynamic implementation, I don't always trust which lines of code the profiler claims are eating up all the time. In this case, I would imagine that the memcpy and the for loop copies are both taking up 0.49%, and the matrix multiply is using 5.42% + 4.93%, but for some reason the profiler is splitting this up across two lines.

> I'll come back to this this weekend, re-run the benchmarks with clang-cl and update the description of the PR.

Sounds good! I'll leave the PR open for you to merge whenever you like. Since I've approved the PR you should have permission to merge it, but please let me know if that's not the case.

> I agree with your comments, @jatinchowdhury18; it really would be nice not to have to copy the inputs/outputs into the extended arrays, but I can't come up with an elegant way to do it right now.

For now I think it's fine to merge these changes as-is, and then we can think more about how to remove some of the copy steps later on (unless some inspiration strikes one of us in the coming days).

> Also, strangely enough, the fastest way I measured to do the copies for the input vector was actually the plain for-loops. memcpy, std::copy (with iterators and with pointers), and things like ins_internal.segment(0, in_sizet) = ... were all measurably slower.

This is interesting... I'm curious how much the difference ended up being?

@IHorvalds (Contributor, Author)

The difference in performance for the dense layer on M2, 10s 16in 16out:

| Copy method | Dynamic | Templated |
| --- | --- | --- |
| for-loop | 707.72x | 874.091x |
| memcpy | 658.057x | 861.577x |
| std::copy with pointers | 693.393x | 864.482x |
| std::copy with iterators (using std::span) | Unstable (between 588x and 660x) | 847.775x (also unstable, but not quite as much) |
| `.segment<in_sizet>(0) = ...` | - | 862.968x |

This is just for the input vector.

I realize these numbers are slightly higher than in the PR description. They go down a bit when the laptop heats up (passive cooling...), but the proportions hold.

Apparently, the += operator with NoAlias is about as fast as just the = operator.

So I figured I'd give it a shot. On M2 it's a little faster still.
I'll try it on Windows in a minute.
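In Eigen terms, the observation above amounts to the accumulating form being about as fast as plain assignment once aliasing is ruled out; a hedged sketch with illustrative names (not the PR's identifiers):

```cpp
#include <Eigen/Dense>

// acc, W, U, x, and h are illustrative names only.
void gatePreActivation(Eigen::VectorXf& acc,
                       const Eigen::MatrixXf& W, const Eigen::MatrixXf& U,
                       const Eigen::VectorXf& x, const Eigen::VectorXf& h)
{
    acc.noalias() = W * x;  // plain assignment, no aliasing temporary
    acc.noalias() += U * h; // accumulation, measured to be about as fast
}
```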
@IHorvalds (Contributor, Author)

Hey @jatinchowdhury18, the last 2 commits revoked the permission to merge. When you have time, could you approve it again?
PS. I updated the description with the clang-cl benchmarks :)

@jatinchowdhury18 (Owner) left a comment

Looking good! Feel free to merge when you're ready.

@IHorvalds (Contributor, Author)

It still doesn't let me merge:
[screenshot]

@jatinchowdhury18 (Owner)

Hmm, must be some weird GitHub permissions thing. I'll get that worked out, but for now I can just merge it myself.

Thanks again!

@jatinchowdhury18 jatinchowdhury18 merged commit e802121 into jatinchowdhury18:main Aug 14, 2023
20 of 21 checks passed
@IHorvalds (Contributor, Author)

Thank you! :)
