
Restructuring in blax() #58

Closed
wants to merge 5 commits into from

Conversation

@ngomezve (Contributor) commented Jun 4, 2024

This PR addresses a potential issue in some processors, in which the LU decomposition in blax() could take several orders of magnitude longer than expected.

The LU matrix is also restructured to enable future performance improvements.

@codecov-commenter commented Jun 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.95%. Comparing base (78ba4b2) to head (9a1f258).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
- Coverage   76.41%   75.95%   -0.46%     
==========================================
  Files          69       71       +2     
  Lines       13169    13401     +232     
==========================================
+ Hits        10063    10179     +116     
- Misses       3106     3222     +116     


@askprash (Member) commented Jun 5, 2024

Great that we are looking into this! How did you benchmark your code? I ran into something strange when I tried to benchmark it...

I find that upstream/main has the following performance (benchmarked using BenchmarkTools on a Mac M2):

Benchmarking... blax
---------------------------------------
blax (FORTRAN on MacPro M2 ~ 1.9 ms)
---------------------------------------
BenchmarkTools.Trial: 3525 samples with 5 evaluations.
 Range (min … max):  1.509 ms …   9.381 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.596 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.672 ms ± 258.716 μs  ┊ GC (mean ± σ):  3.84% ± 8.09%

    ▁▆███▇▅▄▂                  ▁  ▁             ▁             ▁
  ▃▇██████████▇▅▅▄▄▄▄▄▄▄▄▄▆▇███████▆█▇▇▇▄▅▅▇▇▇█████▇██▆▇▅▅▅▅▄ █
  1.51 ms      Histogram: log(frequency) by time      2.32 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241. 

Whereas benchmarking this commit gives me:

---------------------------------------
blax (FORTRAN on MacPro M2 ~ 1.9 ms)
---------------------------------------
BenchmarkTools.Trial: 1439 samples with 5 evaluations.
 Range (min … max):  3.787 ms …   5.184 ms  ┊ GC (min … max): 0.00% … 8.34%
 Time  (median):     4.068 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.117 ms ± 220.747 μs  ┊ GC (mean ± σ):  1.42% ± 2.54%

        █▃▁       ▃                                            
  ▄▇██▆▅███▆█▇▇▇███▇▆▆▇▅▄▅▄▅▅▆▆█▆█▇█▆▆█▆▆▄▄▄▄▄▃▃▃▃▃▂▃▃▂▂▁▁▂▂▃ ▄
  3.79 ms         Histogram: frequency by time        4.71 ms <

 Memory estimate: 5.08 MiB, allocs estimate: 44881. 

I am surprised by the results: this commit appears to be significantly slower (~4 ms for de462ed vs ~1.6 ms for main).

My initial thought was that some unnecessary allocations were the culprit, e.g. here:

TASOPT.jl/src/aero/blax.jl

Lines 357 to 359 in de462ed

#---- clear system matrix and righthand side
asys = spzeros(nsys, nsys)
rsys = zeros(nsys)

But replacing that with something like:

rsys[:] .= 0.0
nonzeros(asys) .= 0.0

does improve the performance but it is still somehow slower than the dense system.

BenchmarkTools.Trial: 2695 samples with 5 evaluations.
 Range (min … max):  2.043 ms …   4.906 ms  ┊ GC (min … max): 0.00% … 9.60%
 Time  (median):     2.174 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.197 ms ± 154.466 μs  ┊ GC (mean ± σ):  1.44% ± 1.98%

     ▁▄██▂ ▂▄▅▃▃▂▃▂▁▁▂  ▁                                      
  ▃▄███████████████████▇█▇▇▆▇▅▆▆▄▄▄▄▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂ ▄
  2.04 ms         Histogram: frequency by time        2.59 ms <

 Memory estimate: 4.43 MiB, allocs estimate: 44781
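
For context, a minimal sketch of the reuse-and-zero pattern suggested above (assuming asys and rsys are allocated once with a fixed sparsity pattern and reused across iterations; nsys and the loop bounds are illustrative):

using SparseArrays

nsys = 100                    # illustrative size; the real value comes from blax()
asys = spzeros(nsys, nsys)    # allocate once, outside the iteration loop
rsys = zeros(nsys)

for iter in 1:10
    nonzeros(asys) .= 0.0     # zero stored entries in place; the sparsity pattern is kept
    rsys .= 0.0
    # ... assemble and solve the system here ...
end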

Any thoughts? I'll investigate more later.

@askprash (Member) commented Jun 5, 2024

Also, looking at this part of the code now makes me cringe (and I'm the one to blame). Let's just say there is a lot of cleanup that can be done here. 🧹

@ngomezve (Contributor, Author) commented Jun 6, 2024

I profiled wsize() using Profile and ProfileView. In particular, I did this for a hydrogen aircraft sizing case, where blax() is called in every sizing loop. I also benchmarked the entire wsize() and it was significantly faster after the change.
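
For reference, a minimal sketch of that profiling workflow (the wsize call and its argument are schematic placeholders; ProfileView is assumed to be installed):

using Profile, ProfileView

wsize(ac)              # warm-up call so compilation is excluded (ac is a placeholder)
Profile.clear()
@profile wsize(ac)     # collect samples during a sizing run
ProfileView.view()     # inspect the flame graph interactively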

@askprash (Member) commented Jun 12, 2024

Did a little more digging here, and it seems like this comes down to the BLAS library that LinearAlgebra uses (OpenBLAS by default) and the number of threads that library defaults to.

Default setup on hex

On the Intel Xeon processors, the difference in performance between using 1 CPU and 24 CPUs is drastic, and I think this is because BLAS defaults to using half the total number of CPUs as threads, even if that many threads are not actually available.

CPUs   median time
  1    8.6 s
  4    5.0 s
 16    1.8 s
 24    5.1 ms

In all these cases it looks like BLAS.get_num_threads() returns 12.

Enforcing BLAS.set_num_threads(1) gives vastly different performance from the above, roughly constant at ~3.5 ms:

Performance specifying a single thread:

CPUs   median time
  1    3.5 ms
  4    3.5 ms
 16    3.5 ms
 24    3.5 ms
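
For reference, a minimal sketch of this check (only LinearAlgebra and BenchmarkTools are assumed; the blax call itself is schematic):

using LinearAlgebra, BenchmarkTools

@show BLAS.get_num_threads()   # returned 12 here in every CPU configuration
BLAS.set_num_threads(1)        # pin BLAS to a single thread
# @btime blax(args...)         # schematic: re-run the benchmark after pinning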

Repeating this exercise on an AMD node gives consistently faster results, with median times around ~2.7 ms across all the CPU counts tested (1, 4, 16, 24). Interestingly, in all these cases the sparse calculations are slower, at ~4.8 ms.

Other BLAS implementations (MKL)

I found some discussion about this on Julia Discourse (this post in particular was very interesting). One of the things I tried was using other BLAS libraries. MKL's default thread count seems to be better tuned: BLAS.get_num_threads() returns the number of cores, and blax.jl takes roughly the same time regardless of the core count, comparable to single-threaded OpenBLAS (~3.2 ms).

I also tried this on my M2 Mac with AppleAccelerate.jl, a BLAS backend that uses Apple's Accelerate framework. It seems like our matrix is probably too small to really take advantage of it:

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.dylib

BenchmarkTools.Trial: 4083 samples with 5 evaluations.
 Range (min … max):  1.332 ms …  10.070 ms  ┊ GC (min … max): 0.00% … 18.14%
 Time  (median):     1.389 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.443 ms ± 274.095 μs  ┊ GC (mean ± σ):  2.85% ±  5.56%

     ▂▄▆██▇▅▃▂▁               ▁▁▁▂▂▂▃▂▂▂▁▂▁▁                  ▂
  ▅█████████████▇▇█▇▆▅▅▅▅▄▃▅▆█████████████████▆▆▆▆▅▅▅▁▄▅▅▄▅▃▄ █
  1.33 ms      Histogram: log(frequency) by time      1.78 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241.

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ LP64] Accelerate
└ [ILP64] Accelerate

BenchmarkTools.Trial: 4758 samples with 5 evaluations.
 Range (min … max):  1.163 ms …  2.120 ms  ┊ GC (min … max): 0.00% … 38.15%
 Time  (median):     1.225 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.260 ms ± 88.886 μs  ┊ GC (mean ± σ):  3.04% ±  6.10%

      ▂▃▄▅▇██▆▄▂                          ▁▂▂▃▂▂▂▂▂▁▁ ▁      ▂
  ▃▅▇▇████████████▇▅▁▃▃▃▃▄▃▁▁▁▁▁▁▁▁▆▄▆▆▇▆█████████████████▇▇ █
  1.16 ms      Histogram: log(frequency) by time     1.52 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241. 
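
For completeness, swapping the BLAS backend is a one-line change (a sketch; MKL.jl and AppleAccelerate.jl both hook in through libblastrampoline):

using LinearAlgebra
using AppleAccelerate   # or `using MKL` on Intel; replaces the default OpenBLAS backend
BLAS.get_config()       # confirm which backend is now active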

Way forward?

Perhaps we just set BLAS.set_num_threads(1) as the default (and document this) and stick to the dense matrix calculations, since they are about 40% faster (I still need to understand why the sparse version isn't as fast).

@ngomezve (Contributor, Author)

Thanks for looking into this @askprash.

I agree that keeping the dense operation is the way forward for now; I'll update the PR with that. I'll keep the restructuring of the matrix into a tetradiagonal form with a block, as it does not appear to make a difference in allocations, memory, or speed, but it may enable other methods in the future.

@ngomezve changed the title from "Sparse matrix in blax()" to "Restructuring in blax()" on Jun 12, 2024
@ngomezve (Contributor, Author) commented Jun 14, 2024

Upon further testing, it appears that BLAS.set_num_threads(1) only gives a performance improvement when it is called from the script that runs TASOPT, rather than placed at the top level of the TASOPT source (likely because top-level module code runs during precompilation rather than at load time). As a workaround, I added an __init__ function to the aero module; this seems to work.
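
A minimal sketch of that kind of module-level hook (the module name and layout are illustrative, not the actual TASOPT.jl source):

module aerodynamics

using LinearAlgebra

function __init__()
    # __init__ runs every time the module is loaded (unlike top-level code,
    # which runs at precompile time), so the setting reaches user sessions.
    BLAS.set_num_threads(1)
end

end # module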
