Restructuring in blax() #58
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@ Coverage Diff @@
##             main      #58      +/-   ##
==========================================
- Coverage   76.41%   75.95%   -0.46%
==========================================
  Files          69       71       +2
  Lines       13169    13401     +232
==========================================
+ Hits        10063    10179     +116
- Misses       3106     3222     +116
```

☔ View full report in Codecov by Sentry.
Great that we are looking into this! How did you benchmark your code? I'm running into something strange as I tried to benchmark... I find that:

Whereas benchmarking this commit gives me:

I am surprised by the results. This seems to suggest that it is somehow significantly slower (~4 ms for de462ed vs 1.6 ms). My initial thought was that some unnecessary allocations were the culprit, like here (Lines 357 to 359 in de462ed).

But replacing that with something like:

```julia
rsys[:] .= 0.0
nonzeros(asys) .= 0.0
```

does improve the performance, but it is still somehow slower than the dense system.

Any thoughts? I'll investigate more later.
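For context, a minimal sketch of the in-place zeroing idea above (the names `asys`/`rsys` are taken from the snippet; the sizes and sparsity here are made up for illustration):

```julia
using SparseArrays

# Hypothetical stand-ins for the solver's sparse system matrix and
# residual vector (sizes chosen arbitrarily for this sketch).
asys = sprand(100, 100, 0.05)
rsys = rand(100)

# Zero the values in place without reallocating and without changing
# the sparsity pattern: nonzeros() returns the underlying value vector,
# so fill!() overwrites the stored entries directly.
rsys .= 0.0
fill!(nonzeros(asys), 0.0)
```

Because the sparsity structure is preserved, subsequent assembly into `asys` hits the same stored entries and avoids the allocation churn of rebuilding the matrix each iteration.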
Also, looking at this part of the code now makes me cringe (and I'm the one to blame). Let's just say there is a lot of cleanup that can be done here. 🧹
I profiled wsize() using Profile and ProfileView. In particular, I did this for a hydrogen aircraft sizing case, where blax() is called on every sizing iteration. I also benchmarked the entire wsize(), and it was significantly faster after the change.
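The dense-vs-sparse timing comparison discussed in this thread can be reproduced with something like the following (a sketch using the BenchmarkTools.jl package; the matrix here is a stand-in, not the actual blax() system):

```julia
using BenchmarkTools, LinearAlgebra, SparseArrays

# Stand-in linear system: compare dense vs sparse LU factorization,
# mirroring the dense/sparse comparison in this thread.
n = 500
A_dense = rand(n, n) + n * I      # diagonally dominant, so LU is well behaved
A_sparse = sparse(A_dense)

@btime lu($A_dense);    # dense LU via LAPACK/BLAS
@btime lu($A_sparse);   # sparse LU via UMFPACK
```

Interpolating the arguments with `$` keeps BenchmarkTools from timing global-variable access, which would otherwise skew small benchmarks like these.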
Did a little more digging here and it seems like this comes down to the BLAS library (default is OpenBLAS) being used.

**Default setup on hex**

On the Intel Xeon processors the difference in performance between using 1 CPU and 24 is drastic, and I think this is because the default number of threads that BLAS tries to use is half the total number of CPUs, even if that many threads are not actually available.

In all these cases it looks like:

**Enforcing performance: specifying a single thread**

Repeating this exercise on an AMD node is consistently faster, with median times of about ~2.7 ms across the numbers of CPUs tested (1, 4, 16, 24). Interestingly, in all these cases the sparse calculations are slower, at ~4.8 ms.

**Other BLAS implementations (MKL)**

I found some discussion about this on Julia Discourse (this post in particular was very interesting). One of the things I tried was to use other libraries. I found that the

I also tried to test this on my M2 Mac with the

**Way forward?**

Perhaps we just set
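For reference, the BLAS backend and thread count discussed above can be inspected and pinned from Julia like this (a sketch; whether single-threading actually helps depends on the processor, as the Xeon-vs-AMD results above show):

```julia
using LinearAlgebra

# Inspect which BLAS library is currently loaded. OpenBLAS is the
# default; loading the MKL.jl package (`using MKL`) before this point
# would swap in Intel MKL via libblastrampoline.
println(BLAS.get_config())

# OpenBLAS can default to many threads, which hurts repeated small
# factorizations. Pin the BLAS thread count to one:
BLAS.set_num_threads(1)
@assert BLAS.get_num_threads() == 1
```

Note this only controls BLAS-internal threading; it is independent of Julia's own thread count (`Threads.nthreads()`).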
Thanks for looking into this @askprash. I agree that keeping the dense operation is the way forward for now; I'll update the PR with that. I'll keep the restructuring into a tetradiagonal matrix with a block, as it does not appear to make a difference in allocations, memory, or speed, but it may enable other methods in the future.
Upon further testing, it appears that the
This PR addresses a potential issue on some processors, where the LU decomposition in blax() could take several orders of magnitude longer than expected.
The LU matrix is also restructured to enable future performance improvements.