
Restructuring in blax() #58

Closed
wants to merge 5 commits into from

Conversation

@ngomezve (Contributor) commented Jun 4, 2024

This PR addresses a potential issue in some processors, in which the LU decomposition in blax() could take several orders of magnitude longer than expected.

The LU matrix is also restructured to enable future performance improvements.

@codecov-commenter commented Jun 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.95%. Comparing base (78ba4b2) to head (9a1f258).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
- Coverage   76.41%   75.95%   -0.46%     
==========================================
  Files          69       71       +2     
  Lines       13169    13401     +232     
==========================================
+ Hits        10063    10179     +116     
- Misses       3106     3222     +116     


@askprash (Member) commented Jun 5, 2024

Great that we are looking into this! How did you benchmark your code? I ran into something strange when I tried to benchmark it...

I find that upstream/main has the following performance (benchmarked using BenchmarkTools on a Mac M2):

Benchmarking... blax
---------------------------------------
blax (FORTRAN on MacPro M2 ~ 1.9 ms)
---------------------------------------
BenchmarkTools.Trial: 3525 samples with 5 evaluations.
 Range (min … max):  1.509 ms …   9.381 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.596 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.672 ms ± 258.716 μs  ┊ GC (mean ± σ):  3.84% ± 8.09%

    ▁▆███▇▅▄▂                  ▁  ▁             ▁             ▁
  ▃▇██████████▇▅▅▄▄▄▄▄▄▄▄▄▆▇███████▆█▇▇▇▄▅▅▇▇▇█████▇██▆▇▅▅▅▅▄ █
  1.51 ms      Histogram: log(frequency) by time      2.32 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241. 

Whereas benchmarking this commit gives me:

---------------------------------------
blax (FORTRAN on MacPro M2 ~ 1.9 ms)
---------------------------------------
BenchmarkTools.Trial: 1439 samples with 5 evaluations.
 Range (min … max):  3.787 ms …   5.184 ms  ┊ GC (min … max): 0.00% … 8.34%
 Time  (median):     4.068 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.117 ms ± 220.747 μs  ┊ GC (mean ± σ):  1.42% ± 2.54%

        █▃▁       ▃                                            
  ▄▇██▆▅███▆█▇▇▇███▇▆▆▇▅▄▅▄▅▅▆▆█▆█▇█▆▆█▆▆▄▄▄▄▄▃▃▃▃▃▂▃▃▂▂▁▁▂▂▃ ▄
  3.79 ms         Histogram: frequency by time        4.71 ms <

 Memory estimate: 5.08 MiB, allocs estimate: 44881. 

I am surprised by the results: this commit appears to be significantly slower (~4 ms for de462ed vs ~1.6 ms for main).

My initial thought was that some unnecessary allocations were the culprit, e.g. here:

TASOPT.jl/src/aero/blax.jl

Lines 357 to 359 in de462ed

#---- clear system matrix and righthand side
asys = spzeros(nsys, nsys)
rsys = zeros(nsys)

But replacing that with something like:

rsys[:] .= 0.0
nonzeros(asys) .= 0.0

does improve the performance but it is still somehow slower than the dense system.

BenchmarkTools.Trial: 2695 samples with 5 evaluations.
 Range (min … max):  2.043 ms …   4.906 ms  ┊ GC (min … max): 0.00% … 9.60%
 Time  (median):     2.174 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.197 ms ± 154.466 μs  ┊ GC (mean ± σ):  1.44% ± 1.98%

     ▁▄██▂ ▂▄▅▃▃▂▃▂▁▁▂  ▁                                      
  ▃▄███████████████████▇█▇▇▆▇▅▆▆▄▄▄▄▄▄▃▃▃▃▃▂▃▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂ ▄
  2.04 ms         Histogram: frequency by time        2.59 ms <

 Memory estimate: 4.43 MiB, allocs estimate: 44781
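
For context, a minimal sketch of the reuse-and-zero pattern suggested above (assuming asys and rsys are allocated once with a fixed sparsity pattern and reused across iterations; nsys and the loop bounds are illustrative):

using SparseArrays

nsys = 100                    # illustrative size; the real value comes from blax()
asys = spzeros(nsys, nsys)    # allocate once, outside the iteration loop
rsys = zeros(nsys)

for iter in 1:10
    nonzeros(asys) .= 0.0     # zero stored entries in place; the sparsity pattern is kept
    rsys .= 0.0
    # ... assemble and solve the system here ...
end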

Any thoughts? I'll investigate more later.

@askprash (Member) commented Jun 5, 2024

Also, looking at this part of the code now makes me cringe (and I'm the one to blame). Let's just say there is a lot of cleanup that can be done here. 🧹

@ngomezve (Contributor, Author) commented Jun 6, 2024

I profiled wsize() using Profile and ProfileView. In particular, I did this for a hydrogen aircraft sizing case, where blax() is called in every sizing loop. I also benchmarked the entire wsize() and it was significantly faster after the change.
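
For reference, a minimal sketch of that profiling workflow (the wsize call and its argument are schematic placeholders; ProfileView is assumed to be installed):

using Profile, ProfileView

wsize(ac)              # warm-up call so compilation is excluded (ac is a placeholder)
Profile.clear()
@profile wsize(ac)     # collect samples during a sizing run
ProfileView.view()     # inspect the flame graph interactively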

@askprash (Member) commented Jun 12, 2024

Did a little more digging here, and it seems like this comes down to the BLAS library that LinearAlgebra uses (OpenBLAS by default) and the number of threads that library defaults to.

Default setup on hex

On the Intel Xeon processors, the difference in performance between using 1 CPU and 24 CPUs is drastic, and I think this is because BLAS defaults to using half the total number of CPUs as threads, even if that many threads are not actually available.

CPUs   median time
  1    8.6 s
  4    5.0 s
 16    1.8 s
 24    5.1 ms

In all these cases it looks like BLAS.get_num_threads() returns 12.

Enforcing BLAS.set_num_threads(1) gives vastly different performance from the above, roughly constant at ~3.5 ms:

Performance specifying a single thread:

CPUs   median time
  1    3.5 ms
  4    3.5 ms
 16    3.5 ms
 24    3.5 ms
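
For reference, a minimal sketch of this check (only LinearAlgebra and BenchmarkTools are assumed; the blax call itself is schematic):

using LinearAlgebra, BenchmarkTools

@show BLAS.get_num_threads()   # returned 12 here in every CPU configuration
BLAS.set_num_threads(1)        # pin BLAS to a single thread
# @btime blax(args...)         # schematic: re-run the benchmark after pinning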

Repeating this exercise on an AMD node gives consistently faster results, with median times around ~2.7 ms across all the CPU counts tested (1, 4, 16, 24). Interestingly, in all these cases the sparse calculations are slower, at ~4.8 ms.

Other BLAS implementations (MKL)

I found some discussion about this on Julia Discourse (this post in particular was very interesting). One of the things I tried was using other BLAS libraries. MKL's default thread count seems to be better tuned: BLAS.get_num_threads() returns the number of cores, and blax.jl takes roughly the same time regardless of the core count, comparable to single-threaded OpenBLAS (~3.2 ms).

I also tried this on my M2 Mac with AppleAccelerate.jl, a BLAS backend that uses Apple's Accelerate framework. It seems like our matrix is probably too small to really take advantage of it:

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.dylib

BenchmarkTools.Trial: 4083 samples with 5 evaluations.
 Range (min … max):  1.332 ms …  10.070 ms  ┊ GC (min … max): 0.00% … 18.14%
 Time  (median):     1.389 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.443 ms ± 274.095 μs  ┊ GC (mean ± σ):  2.85% ±  5.56%

     ▂▄▆██▇▅▃▂▁               ▁▁▁▂▂▂▃▂▂▂▁▂▁▁                  ▂
  ▅█████████████▇▇█▇▆▅▅▅▅▄▃▅▆█████████████████▆▆▆▆▅▅▅▁▄▅▅▄▅▃▄ █
  1.33 ms      Histogram: log(frequency) by time      1.78 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241.

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ LP64] Accelerate
└ [ILP64] Accelerate

BenchmarkTools.Trial: 4758 samples with 5 evaluations.
 Range (min … max):  1.163 ms …  2.120 ms  ┊ GC (min … max): 0.00% … 38.15%
 Time  (median):     1.225 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.260 ms ± 88.886 μs  ┊ GC (mean ± σ):  3.04% ±  6.10%

      ▂▃▄▅▇██▆▄▂                          ▁▂▂▃▂▂▂▂▂▁▁ ▁      ▂
  ▃▅▇▇████████████▇▅▁▃▃▃▃▄▃▁▁▁▁▁▁▁▁▆▄▆▆▇▆█████████████████▇▇ █
  1.16 ms      Histogram: log(frequency) by time     1.52 ms <

 Memory estimate: 2.11 MiB, allocs estimate: 44241. 
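
For completeness, swapping the BLAS backend is a one-line change (a sketch; MKL.jl and AppleAccelerate.jl both hook in through libblastrampoline):

using LinearAlgebra
using AppleAccelerate   # or `using MKL` on Intel; replaces the default OpenBLAS backend
BLAS.get_config()       # confirm which backend is now active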

Way forward?

Perhaps we just set BLAS.set_num_threads(1) as the default (and document this) and stick to the dense matrix calculations, since they are about 40% faster (I still need to understand why the sparse version isn't as fast).

@ngomezve (Contributor, Author)

Thanks for looking into this @askprash.

I agree that keeping the dense operation is the way forward for now; I'll update the PR with that. I'll keep the restructuring of the matrix into a tetradiagonal form with a block, as it does not appear to make a difference in allocations, memory, or speed, but it may enable other methods in the future.

@ngomezve changed the title from "Sparse matrix in blax()" to "Restructuring in blax()" on Jun 12, 2024
@ngomezve (Contributor, Author) commented Jun 14, 2024

Upon further testing, it appears that BLAS.set_num_threads(1) only gives a performance improvement when it is called from the script that runs TASOPT, rather than placed at the top level of the TASOPT source (likely because top-level module code runs during precompilation rather than at load time). As a workaround, I added an __init__ function to the aero module; this seems to work.
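
A minimal sketch of that kind of module-level hook (the module name and layout are illustrative, not the actual TASOPT.jl source):

module aerodynamics

using LinearAlgebra

function __init__()
    # __init__ runs every time the module is loaded (unlike top-level code,
    # which runs at precompile time), so the setting reaches user sessions.
    BLAS.set_num_threads(1)
end

end # module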
