Use muladd in matmul and improved operation order. #818

Open · chriselrod wants to merge 3 commits into master

Conversation

@chriselrod (Contributor) commented Aug 24, 2020

Performance should generally be better under this PR for Float64 inputs on CPUs with AVX(2).

Here, I ran the benchmark in different Julia sessions and then printed the results from one:

using StaticArrays, BenchmarkTools

function benchesmat(rn)
    res = Vector{Float64}(undef, length(rn));
    for (i, r) in enumerate(rn)
        A = @SMatrix rand(r,r)
        B = @SMatrix rand(r,r)
        res[i] = @belapsed $(Ref(A))[] * $(Ref(B))[]
    end
    res
end

res = benchesmat(1:24);

Benefits are minor with AVX512, but I'll run AVX2 benchmarks too:
[benchmark plot: AVX-512 results, GFLOPS vs. matrix size]

@chriselrod (Contributor, Author) commented Aug 24, 2020

AVX2 results:
[benchmark plot: AVX2 results, GFLOPS vs. matrix size]

@chriselrod (Contributor, Author) commented

Overall it is an improvement, but if you actually want it to be fast, it needs SIMD intrinsics.
To be optimal, it would also have to change behavior in a hardware- and eltype-dependent manner.

@mateuszbaran (Collaborator) commented

This partially conflicts with #814; did you look at my PR? It is more or less complete now, and I'm just working on slightly better heuristics for picking the right multiplication method. Now I'm wondering whether I should integrate this PR first.

@chriselrod (Contributor, Author) commented

This PR has two changes: changing the order of expressions in the generated code, and using muladd. We can start Julia with -Civybridge to get an idea of the performance impact of the new order in isolation, because ivybridge does not have fma instructions:
[benchmark plot: -Civybridge target, new expression order vs. old]
There is a performance regression at 8x8 for some reason, but otherwise performance is either equal or better.

I think muladd is a straightforward and obvious optimization. Make of the reordering what you will.
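To make the muladd change concrete, here is a minimal sketch of the accumulation for a single output element, before and after, using a hypothetical 3x3 case (illustrative only, not the code this PR actually generates):

using StaticArrays

A = @SMatrix rand(3, 3)
B = @SMatrix rand(3, 3)

# Before: a plain multiply/add chain for C[1,1]; whether LLVM fuses these into
# fma instructions is left up to the optimizer.
c11_old = A[1,1] * B[1,1] + A[1,2] * B[2,1] + A[1,3] * B[3,1]

# After: explicit muladd, which may fuse into single fma instructions when the
# hardware supports them.
c11_new = muladd(A[1,3], B[3,1], muladd(A[1,2], B[2,1], A[1,1] * B[1,1]))

c11_old ≈ c11_new  # true up to rounding differences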

If you just use muladd in your PR instead of a * b + c, I'd be fine with closing this PR. The reordering changes help much less than I expected.

@mateuszbaran (Collaborator) commented

I can definitely change my PR to use muladd; it's a good idea. Reordering in mul_loop is also something that I can easily do, because I didn't touch that function. I can also keep your version of mul_unrolled_chunks for non-wrapped matrices. By the way, did you check whether using mul_unrolled and mul_unrolled_chunks for smaller matrices is actually beneficial? In #514 it was suggested to use mul_loop for all sizes, and from my tests (on an i7-4800MQ and recently also on an i7-9700KF) Julia 1.5 is pretty good at unrolling mul_loop; the other variants aren't consistently faster for any matrix sizes I've tried. Your changes may of course change the outcome here.

In any case I will merge your branch into mine.
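Since the thread compares mul_loop, mul_unrolled, and mul_unrolled_chunks, here is a minimal loop-style sketch in the spirit of mul_loop with the muladd accumulation applied. mul_loop_sketch is a made-up name for illustration; the real functions in StaticArrays are @generated and more involved:

using StaticArrays

# Illustrative loop-style kernel, not the actual StaticArrays implementation.
function mul_loop_sketch(A::SMatrix{N,K,T}, B::SMatrix{K,M,T}) where {N,K,M,T}
    C = MMatrix{N,M,T}(undef)
    @inbounds for m in 1:M, n in 1:N
        acc = zero(T)
        for k in 1:K
            # muladd accumulation, as proposed in this PR
            acc = muladd(A[n, k], B[k, m], acc)
        end
        C[n, m] = acc
    end
    return SMatrix{N,M,T}(C)
end

A = @SMatrix rand(4, 4)
B = @SMatrix rand(4, 4)
mul_loop_sketch(A, B) ≈ A * B  # true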

@chriselrod (Contributor, Author) commented

The reordering was supposed to make things a little more similar to PaddedMatrices, which is far faster at most sizes: https://github.com/chriselrod/PaddedMatrices.jl
The problem is we can't count on LLVM doing what we want.

But good idea to test the three different implementations.
Ivybridge (AVX, no FMA):
[benchmark plot]

Haswell (AVX, FMA):
[benchmark plot]

Skylake-X (AVX512):
[benchmark plot]
Note that in this last plot, at 14x14, where we're at around 25 GFLOPS, PaddedMatrices was at 90 GFLOPS for statically sized arrays on the same computer.

Worth pointing out that all of these were run on the same computer by starting Julia with different -C targets, so they're roughly comparable between charts as well.
Unfortunately that trick isn't compatible with LoopVectorization (it doesn't realize the architecture has been "demoted" and thus makes bad decisions), otherwise I'd have added it as a baseline.

Script:

using StaticArrays, BenchmarkTools, Plots; plotly();

function bench_muls(sizerange)
    res = Matrix{Float64}(undef, length(sizerange), 3)
    for (i, r) in enumerate(sizerange)
        A = @SMatrix rand(r,r)
        B = @SMatrix rand(r,r)
        res[i,1] = @belapsed StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i,2] = @belapsed StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i,3] = @belapsed StaticArrays.mul_loop($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
    end
    res
end

benchtimes = bench_muls(1:24);

plot(1:24, 2e-9 .* (1:24) .^ 3 ./ benchtimes, labels = ["mul_unrolled" "mul_unrolled_chunks" "mul_loop"]);  # 2n^3 flops per n-by-n matmul, scaled to GFLOPS
xlabel!("Matrix Dimensions"); ylabel!("GFLOPS")

@mateuszbaran (Collaborator) commented

Hmm... I'll have to re-run my benchmarks with muladd.

@chriselrod (Contributor, Author) commented

Also, worth pointing out that a difference between running a recent CPU with -Chaswell and an actual Haswell is that on actual Haswell, addition has half the instruction throughput of fma, meaning a separate addition + multiplication has 1/3 the throughput of fma instead of half.

On Skylake, addition's throughput was doubled, so separate addition + multiplication only drops to half of fma throughput instead of 1/3. That means a real Haswell should benefit a lot from muladd.
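A back-of-the-envelope version of that counting, using the throughput ratios assumed above (2 fma and 2 mul per cycle, 1 add per cycle on Haswell, 2 adds per cycle on Skylake; illustrative numbers, not measurements):

# Cycles to retire N multiply-add pairs, given instructions-per-cycle throughputs
# and assuming the multiplies and adds don't overlap.
cycles_fma(N, fma_tp) = N / fma_tp
cycles_separate(N, mul_tp, add_tp) = N / mul_tp + N / add_tp

N = 120
cycles_fma(N, 2) / cycles_separate(N, 2, 1)  # 1/3: separate mul+add at 1/3 of fma throughput (Haswell-like)
cycles_fma(N, 2) / cycles_separate(N, 2, 2)  # 1/2: doubling add throughput recovers half (Skylake-like)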
