Multithreaded symmetric Gauss-Seidel performance for matrices with a few dense rows #923

Closed
jhux2 opened this issue Mar 30, 2021 · 2 comments

jhux2 commented Mar 30, 2021

I’ve been testing multithreaded symmetric Gauss-Seidel (MTSGS) through Ifpack2 on some application matrices on Vortex. These matrices have ~699K rows, and most of the row stencil sizes are 50 or less, but there are some non-sparse rows (called “bulk rows”). Here are the nonzero counts by row, sorted largest first:

42254 31980 5088 237 48 47 …
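For anyone who wants to produce the same kind of listing, here is a minimal sketch of how per-row nonzero counts can be read off a CSR row-pointer array. The function name and the tiny example matrix are hypothetical and are not taken from the application or from Ifpack2/KokkosKernels.

```python
def row_nnz_sorted(row_ptr):
    """Return the per-row nonzero counts of a CSR matrix, largest first."""
    # Row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1.
    counts = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    return sorted(counts, reverse=True)


# Hypothetical 4-row matrix whose rows hold 1, 4, 2 and 1 nonzeros.
example_row_ptr = [0, 1, 5, 7, 8]
print(row_nnz_sorted(example_row_ptr))
```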

Those first four bulk rows appear to be hurting the performance of MTSGS. I did some experiments to see what effect the bulk rows have. In each experiment, the linear system is solved 10 times with GMRES preconditioned by 3 MTSGS sweeps. There are MPI barriers before and after each call to the MTSGS kernel, as well as timers for the barriers themselves. What I found is that removing the bulk rows yields about a 13x speedup in KokkosSparse::Experimental::symmetric_gauss_seidel_apply.
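As a quick check on the 13x figure, the ratio can be recomputed from the third timing column of the symmetric_gauss_seidel_apply lines in experiments #1 and #3. The variable names are only illustrative; the two numbers are copied from the tables in this issue.

```python
# Third-column timings (seconds) for symmetric_gauss_seidel_apply,
# copied from the experiment #1 and experiment #3 tables in this issue.
with_bulk_rows = 65.07
without_bulk_rows = 4.918

speedup = with_bulk_rows / without_bulk_rows
print(f"speedup: {speedup:.1f}x")  # prints "speedup: 13.2x"
```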

Here's a summary of the experiments:

experiment #1: run with application rowmap

Driver: 5 - Belos Solve                                                    69.17 (10)        69.17 (10)        69.17 (10)
Driver: S - Global Time                                                    112.6 (1)         112.6 (1)         112.6 (1)
Ifpack2::Relaxation::ApplyInverseMTGS_CrsMatrix : import                   0.2921 (1820)     0.298 (1820)      0.3063 (1820)
Ifpack2::Relaxation::apply                                                 65.57 (910)       65.58 (910)       65.59 (910)
Ifpack2::Relaxation::compute                                               0.003969 (1)      0.006031 (1)      0.008348 (1)
Ifpack2::Relaxation::initialize                                            0.04399 (1)       0.05346 (1)       0.06291 (1)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply                   4.963 (2730)      29.64 (2730)      65.07 (2730)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply barrier (post)    0.00592 (2730)    35.44 (2730)      60.11 (2730)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply barrier (pre)     0.09387 (2730)    0.1073 (2730)     0.1223 (2730)

experiment #2: run with uniform map (so each GPU has about the same #nonzeros)

Driver: 5 - Belos Solve                                                    101.9 (10)        101.9 (10)        101.9 (10)
Driver: S - Global Time                                                    140.9 (1)         140.9 (1)         140.9 (1)
Ifpack2::Relaxation::ApplyInverseMTGS_CrsMatrix : import                   0.4447 (2280)     0.4612 (2280)     0.4784 (2280)
Ifpack2::Relaxation::apply                                                 97.44 (1140)      97.47 (1140)      97.52 (1140)
Ifpack2::Relaxation::compute                                               0.004134 (1)      0.004919 (1)      0.006527 (1)
Ifpack2::Relaxation::initialize                                            0.04344 (1)       0.05109 (1)       0.06242 (1)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply                   6.22 (3420)       28.94 (3420)      96.77 (3420)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply barrier (post)    0.01009 (3420)    67.83 (3420)      90.56 (3420)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply barrier (pre)     0.05467 (3420)    0.1058 (3420)     0.1649 (3420)
MueLu: Hierarchy: Setup (total)
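To make "about the same #nonzeros" concrete, here is a rough sketch of one way to cut rows into contiguous blocks with balanced nonzero counts. It is an assumption about how such a map could be built, not the code used for this experiment, and `split_rows_by_nnz` is a hypothetical helper.

```python
from bisect import bisect_left


def split_rows_by_nnz(row_ptr, num_parts):
    """Cut rows into contiguous blocks with roughly equal nonzero counts.

    Returns num_parts + 1 row offsets; part p owns rows
    offsets[p] .. offsets[p + 1] - 1.
    """
    num_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    offsets = [0]
    for p in range(1, num_parts):
        target = total_nnz * p / num_parts
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Step back one row if the previous boundary is closer to the target.
        if cut > 0 and target - row_ptr[cut - 1] < row_ptr[cut] - target:
            cut -= 1
        # Keep offsets non-decreasing so every part is a valid row range.
        offsets.append(max(cut, offsets[-1]))
    offsets.append(num_rows)
    return offsets


# Hypothetical matrix with rows of 1, 4, 2 and 1 nonzeros, split two ways.
print(split_rows_by_nnz([0, 1, 5, 7, 8], 2))
```

Because the cut points fall on row boundaries, a single row is never split: with a hypothetical row pointer of `[0, 10, 11, 12, 13]` and two parts, the first part still holds all 10 nonzeros of the dense row.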

experiment #3: run with application's rowmap, but zero out the three matrix rows with the largest #nonzeros and put a 1 on the diagonal for those rows (i.e., make them Dirichlet rows)

Driver: 5 - Belos Solve                                                    9.082 (10)        9.082 (10)        9.083 (10)
Driver: S - Global Time                                                    50.11 (1)         50.11 (1)         50.11 (1)
Ifpack2::Relaxation::ApplyInverseMTGS_CrsMatrix : import                   0.3822 (1840)     0.4599 (1840)     0.5211 (1840)
Ifpack2::Relaxation::apply                                                 5.655 (920)       5.665 (920)       5.681 (920)
Ifpack2::Relaxation::compute                                               0.004102 (1)      0.004201 (1)      0.004295 (1)
Ifpack2::Relaxation::initialize                                            0.04331 (1)       0.04612 (1)       0.04872 (1)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply                   4.712 (2760)      4.816 (2760)      4.918 (2760)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply barrier (post)    0.03125 (2760)    0.1326 (2760)     0.2372 (2760)
KokkosSparse::Experimental::symmetric_gauss_seidel_apply barrier (pre)     0.08867 (2760)    0.1597 (2760)     0.2545 (2760)
MueLu: Hierarchy: Setup (total)
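For reference, the row modification in experiment #3 can be sketched on a plain CSR matrix as follows. This is a minimal illustration with a hypothetical helper name. It drops the off-diagonal entries of the chosen rows rather than storing explicit zeros, which is an assumption, since the issue does not say which was done.

```python
def make_dirichlet_rows(row_ptr, col_idx, values, num_rows_to_replace):
    """Replace the rows with the most nonzeros by identity (Dirichlet) rows.

    Returns new (row_ptr, col_idx, values) lists; the inputs are not modified.
    """
    num_rows = len(row_ptr) - 1
    # Row indices ordered by nonzero count, largest first.
    by_size = sorted(range(num_rows),
                     key=lambda i: row_ptr[i + 1] - row_ptr[i],
                     reverse=True)
    chosen = set(by_size[:num_rows_to_replace])

    new_ptr, new_col, new_val = [0], [], []
    for i in range(num_rows):
        if i in chosen:
            # Dirichlet row: a single 1 on the diagonal.
            new_col.append(i)
            new_val.append(1.0)
        else:
            # Copy every other row unchanged.
            new_col.extend(col_idx[row_ptr[i]:row_ptr[i + 1]])
            new_val.extend(values[row_ptr[i]:row_ptr[i + 1]])
        new_ptr.append(len(new_col))
    return new_ptr, new_col, new_val


# Hypothetical 3x3 matrix; row 0 is the densest and becomes a Dirichlet row.
ptr, col, val = make_dirichlet_rows(
    [0, 3, 4, 6], [0, 1, 2, 1, 0, 2], [2.0, 1.0, 1.0, 4.0, 1.0, 3.0], 1)
print(ptr, col, val)
```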

@srajama1 @brian-kelley @lucbv

jhux2 changed the title from "Multithreaded Symmetric Gauss-Seidel performance for matrices with a few dense rows" to "Multithreaded symmetric Gauss-Seidel performance for matrices with a few dense rows" on Mar 30, 2021

jhux2 commented May 6, 2021

Any updates? In particular, when do you think you might have an algorithm that I could start kicking the tires on? Thanks!

jhux2 commented Aug 3, 2021

Closing as fixed.
