Eigen backend optimization for Dense, GRU and LSTM layers #108
Conversation
Make the Eigen implementation around 30-40% faster by fusing matrix operations.
Added the templated implementation for the GRU layer. Currently, on an Apple MBA 2022 (10s signal, 16 in, 16 out, Release config):
- dynamic implementation is ~67x faster than realtime
- templated implementation is ~69x faster than realtime
Codecov Report
@@ Coverage Diff @@
## main #108 +/- ##
==========================================
- Coverage 95.75% 95.74% -0.01%
==========================================
Files 55 55
Lines 3953 3949 -4
==========================================
- Hits 3785 3781 -4
Misses 168 168
Thanks for the pull request! I'm still going over all the changes, but they look great at first glance. The failing tests are related to the sample-rate correction modes for the recurrent layers. I should be able to have a closer look at those tomorrow, but if you get to them first that's fine too :)

The performance numbers look fantastic. I've been using the XSIMD backend for most recurrent networks, but I should test again with the new changes (from the CI runs, XSIMD appears to still be a little bit faster, but benchmarks on the CI runners aren't very reliable). I did notice that the Dense templated layer appears to be a little bit slower on the Intel CPU with the new changes... The difference is small enough that I think the win on ARM is worth doing the change, but I'd like to test at a few more layer sizes first.

I wouldn't worry too much about readability for now... I would still consider the STL implementation quite readable, and the comments that you left with your changes look very helpful as well.
Thanks! I'll try to fix the failing tests sometime today. :)
GRU: The issue was that for the case where SampleRateCorrectionMode was not None, the values for h(t-1) in extendedHt1 weren't actually the ones in the `outs` vector as set by `processDelay`.

LSTM: Similar issue. When SampleRateCorrectionMode was not None, the h(t-1) in subsequent `forward` calls did not contain the same values as the `outs` vector as set by `processDelay`. The `cVec` did not have the same issue because after setting the values of `ct_delayed[delayWriteIdx]` in `computeOutputsInternal`, `processDelay` puts the right value in `cVec`.
Fixed it. I'll try tomorrow to profile the templated dense implementation on Intel and figure out if I can make it any faster there.
Looking good! There are a few spots where data is being copied that I wish we could avoid, but if we can't, that's not the end of the world!
for (int i = 0; i < in_size; ++i)
    ins_internal(i, 0) = ins(i, 0);
It would be great if there were a way to do this kind of "one-multiply" dense layer operation without requiring the copy here... Unfortunately I can't think of an easy way at the moment.
Maybe it would be possible to set up something at a higher level which makes sure that all the layer input/output matrices have enough memory available to be extended to be one row longer? That way we could do something like an Eigen::Map to "grow" the input vector by one row, and just put a 1 in that row, which would avoid the copy. That seems a bit cumbersome though, so it would be cool if there were an easier way.
for (int i = 0; i < out_sizet; ++i)
{
    outs(i) = extendedHt1(i);
}
Similarly, I think there should be a way to avoid this copy... I'll try out a couple of ideas.
I compiled it with clang-cl on Windows with AVX2, Release config, and this is the performance I got for the dense layer:

Much better than MSVC somehow, but I really don't understand why... I've also run it with a profiler to see where most of the time is being spent, and the templated implementation doesn't appear in the hot path. I'll come back to this this weekend, re-run the benchmarks with clang-cl, and update the description of the PR.

I agree with your comments @jatinchowdhury18, it really would be nice to not have to copy the inputs/outputs into the extended arrays, but I can't come up with an elegant way to do it right now.
That's beautiful.
This is really interesting! I have seen several cases where clang-cl does a better job of optimizing some code than MSVC, but the difference in this case is a bit more extreme than I would expect. (As an aside, I've been using clang-cl more and more lately, but that's mainly because I like being able to use the same compiler across Windows/Mac/Linux.)
For the templated implementation the compiler is usually able to inline the entire
Sounds good! I'll leave the PR open for you to merge whenever you like. Since I've approved the PR you should have permission to merge it, but please let me know if that's not the case.
For now I think it's fine to merge these changes as-is, and then we can think more about how to remove some of the copy steps later on (unless some inspiration strikes one of us in the coming days).
This is really interesting! I'm curious how much the difference ended up being?
The difference in performance for the dense layer on M2, 10s 16in 16out:
This is just for the input vector. I realize it's slightly faster than in the PR description. These numbers go down a bit when the laptop heats up (passive cooling...), but the proportions hold.
Apparently, the += operator with noalias() is about as fast as just the = operator. So I figured I'd give it a shot. On M2 it's yet a little faster. I'll try it on Windows in a minute.
This reverts commit 29bd1da.
Hey @jatinchowdhury18, the last two commits revoked the permission to merge. When you have time, could you approve it again?
Looking good! Feel free to merge when you're ready.
Hmm, must be some weird GitHub permissions thing. I'll get that worked out, but for now I can just merge it myself. Thanks again!
Merged commit e802121 into jatinchowdhury18:main
Thank you! :)
Hello,
Recently, I read an article on optimising LSTMs and was looking over the RTNeural implementations. This PR applies some of the ideas in the article to the Eigen backend.
Essentially, since Eigen vectorizes operations, it's advantageous to do fewer operations on larger operands, than many operations on smaller ones.
I only modified the dense, GRU and LSTM layers because those are the ones I'm familiar with.
Benchmark results
I ran the benchmark with a 10s signal, 16 in and 16 out, with the Release configuration, on a MacBook Air (2022, M2, 16GB) and an Intel Core i5 11th gen with 32GB RAM running Windows 11.
Numbers are "times faster than real-time".
M2 (Apple Clang)
Intel (No AVX2, MSVC)
Intel (No AVX2, Clang-cl)
Intel (with AVX2, MSVC)
Intel (with AVX2, Clang-cl)
I did not try it on a desktop Linux or a Raspberry Pi.
Note on readability
The current implementation on main for the Eigen backend is extremely easy to read, and for the most part one could probably just look at it and understand the operations and their order in these layers (this is what I did about a year ago).
This new implementation, on the other hand, is quite difficult to read, I think. I left comments that hopefully explain what's going on, but they're too fresh in my eyes for me to know whether they're useful. Naming suggestions and feedback on the comments would be very helpful!
Thanks for reading, and hopefully this is useful!