Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Program crash when calling DNEUPD after exiting DNAUPD with info=0 and iparam(5) > NEV #465

Closed
JarRosenberg opened this issue Aug 29, 2024 · 15 comments · Fixed by #468
Closed

Comments

@JarRosenberg
Copy link

Hello, I've run into what I think is a bug using the DNAUPD and DNEUPD functions in shift-invert mode to calculate eigenvalues of some admittedly dificult matrices. The short of it is that sometimes calling DNEUPD after DNAUPD leads to a program crash, despite the lack of an error on return from DNAUPD. I've found the when this occurs, iparam(5), the number of converged eigenvalues, is greater than NEV, the number of requested eigenvalues. I have attached a txt containing a sample matrix (failed_mat.txt), and a simple program that triggers the crash. More details provided below.

Expected behavior

The info variable would be non-zero after calling DNAUPD, indicating an error has occured, when trying to solve the provided matrix and others like it. Alternatively, calling DNEUPD would return the correct eigenvalue.

Actual behavior

Calling DNEUPD after a "sucessful" exit from DNAUPD causes the program to crash with the error shown below solving the attached matrix.

Where/how to reproduce the problem

  • arpack-ng: current release
  • OS: macOS Sonoma 14.5
  • compiler: gfortran version 13.2.0
  • environment: FFLAGS = -std=f2018 -finit-real=snan -fbacktrace -fmax-errors=25 -fPIC -ffpe-trap=invalid,overflow,zero -fopenmp --Og -ggdb -fcheck=all -Wall -Wno-unused-dummy-argument -Wno-maybe-uninitialized -finline-limit=0
  • configure: Unkown, though likely just the minimum settings needed to run on a Mac with an M1 chop.
  • input data: See the attached file failed_mat.txt for the matrix in question.

Steps to reproduce the problem

  • Run the following program, attempting to solve the provided matrix (failed_mat.txt) in shift-invert mode for a few eigenvalues with the largest magnitude, in the same directory as failed_mat.txt. Note that the calls to ARPACK functions are in the subroutine eigen_shift_invert.
program arpack_test

   use ISO_FORTRAN_ENV
   use f95_lapack

   implicit none

   integer, parameter :: RD = REAL64
   integer, parameter :: sys_size = 42
   complex(RD), parameter :: sig_shift = (2.34381384001574511E-009_RD, -3.63893604681407892E-007_RD)

   real(RD) :: A(sys_size, sys_size)
   complex(RD)  :: sig_max
   integer  :: info

   call read_matrix('failed_mat.txt', A)

   call eigen_shift_inv(A, sig_shift, sys_size, sig_max, info)
	
	
contains
	
   subroutine read_matrix(fname, A)
      ! reads in matrix from a file
      character(*), intent(in) :: fname
      real(RD), intent(out) :: A(:,:)
      integer :: io, i
      
      open(newunit=io, file=fname, action='read')
      do i = 1, SIZE(A, 1)
         read(io, *) A(i, :)
      end do
      close(io)

   end subroutine read_matrix

   subroutine apply_shift(sig_shift, A)
      real(RD), intent(in)    :: sig_shift
      real(RD), intent(inout) :: A(:,:)
      
      integer :: i
      
      do i = 1, SIZE(A, 1)
         A(i,i) = A(i,i) - sig_shift
      end do

   end subroutine apply_shift

   subroutine eigen_shift_inv(A, sig_shift, sys_size, sig_max, info)

      real(RD), intent(inout)  :: A(:,:)
      complex(RD), intent(in)  :: sig_shift ! e-value shift
      integer, intent(in)  :: sys_size ! size of sytem, should be equal to
      complex(RD), intent(out) :: sig_max
      integer, intent(out) :: info
      
      ! ARPACK parameters
      integer, parameter          :: NEV = 4
      integer, parameter          :: NCV = (2*NEV + 1) * 1
      integer, parameter          :: LWORKL = 3*NCV**2 + 6*NCV
      real(RD), parameter         :: TOL = 1E-12
      character(len=2), parameter :: which = 'LM'
      character, parameter        :: bmat = 'I'
      
      integer  :: ido
      integer  :: iparam(11)
      integer  :: ipntr(14)
      real(RD) :: resid(sys_size)
      real(RD) :: v(sys_size,NCV)
      real(RD) :: workd(3*sys_size)
      real(RD) :: workl(LWORKL)
      real(RD) :: dr(NEV+1)
      real(RD) :: di(NEV+1)
      real(RD) :: z(sys_size,NEV+1)
      real(RD) :: workev(3*NCV)
      logical  :: select(NCV)
      
      ! matrix solving variables
      integer  :: IPIV(sys_size)
      integer  :: la_info
      
      real(RD), allocatable :: A_copy(:,:) ! copy of A before any transformations are done
      
      ! Find the maximal eigenvalue using shift-invert mode
      
      ! First set up parameters
      
      iparam = 0
      
      iparam(1) = 1
      iparam(3) = 1000 ! max iterations
      iparam(7) = 3 ! 3 sets to shift inver mode
      
      ido  = 0
      info = 0 ! 0 means normal start, not 0 means giving an inital resid vector
      
      ! initialize these arrays
      dr = -HUGE(dr)
      di = -HUGE(di)
      
      A_copy = A ! keep a copy for later, as A will be changed by LAPACK
      
      ! factor the matrix
      
      ! apply shift
      call apply_shift(sig_shift%re, A)
      
      ! perform PLU factorization
      call DGETRF(sys_size, sys_size, A, sys_size, IPIV, la_info)
      if (la_info /= 0) then
         print *, 'Error with PLU factorisation, info = ', la_info
         stop
      end if
      
      
      iter_loop : do
      
         ! call to ARPACK
         call DNAUPD(ido, bmat, sys_size, which, NEV, TOL, resid, &
            NCV, v, SIZE(v, 1), iparam, ipntr, workd, workl, LWORKL, &
            info)
         
         if (ido == -1 .OR. ido == 1) then
         
         associate( &
            x => workd(ipntr(1) : ipntr(1) + sys_size - 1), &
            y => workd(ipntr(2) : ipntr(2) + sys_size - 1))
         
            y = x
         
            ! now solve the system (A - sig_shift*I) x = y
            call DGETRS('N', sys_size, 1, A, sys_size, IPIV, y, sys_size, la_info)
         
            if (la_info /= 0) then
               print *, 'Error with linear solve, info = ', la_info
               stop
            end if
         
         end associate
         
         else
         
            exit iter_loop
         
         end if
      
      end do iter_loop
      
      ! Check for convergence
      
      if (info /= 0) then
      
         print *, ' '
         print *, ' Error with _naupd, info = ', info
         print *, ' Check the documentation of _naupd'
         print *, ' '
         stop
      
      ! else if (iparam(5) /= NEV) then
      !    print *, 'NCONV does not match NEV. NCONV = ', iparam(5)
      !    stop
      end if
      
      print *, 'DNAUPD converged, now calling DNEUPD'
      
      call DNEUPD(.TRUE., 'A', select, dr, di, z, SIZE(z, 1), &
            sig_shift%re, sig_shift%im, workev, bmat, sys_size, which, NEV, TOL, &
            resid, NCV, v, SIZE(v, 1), iparam, ipntr, workd, workl, &
            LWORKL, info)
            
      
      if (info /= 0) then
      
            print *, ' '
            print *, ' Error with _neupd, info = ', info
            print *, ' Check the documentation of _neupd. '
            print *, ' '
            stop
      
      end if
      
      print *, 'DNEUPD Succeeded'

   end subroutine eigen_shift_inv

end program arpack_test

Error message

The exact error message:

Program received signal SIGBUS: Access to an undefined portion of a memory object.

Backtrace for this error:
#0  0x106298ac7
#1  0x106297b13
#2  0x18c1bb583
#3  0x105131887
Bus error: 10

Callstack

Backtrace from lldb:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x1000070f0)
  * frame #0: 0x000000010063d8d4 libarpack.2.dylib`dngets_ + 404
    frame #1: 0x000000010063c330 libarpack.2.dylib`dneupd_ + 592
    frame #2: 0x0000000100004c04 thermo`eigen_shift_inv.0 at thermo.f90:169:25
    frame #3: 0x0000000100005014 thermo`MAIN__ at thermo.f90:18:62
    frame #4: 0x0000000100006b68 thermo`main at thermo.f90:4:7
    frame #5: 0x000000018be020e0 dyld`start + 2360

Notes, remarks

It seems from experimenting that the error occurs when iparam(5) > NEV after calling DNAUPD. Handling this case seems to prevent the SIGBUS crash.

@fghoussen
Copy link
Collaborator

Do you set iparam(5) = 0 before calling dnaupd?
https://github.com/opencollab/arpack-ng/blob/0b3038d139f874a423ba6a1cdf5098fc15336848/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L757C12-L757C41

Would not be surprised this may fix the problem.

@JarRosenberg
Copy link
Author

Thank you for the reply! Yes, I do set iparam(5) = 0 before calling dnaupd. Should I not?

I also forgot to mention, the same code works for several other matrices. The error only occurs with certain matrices.

@fghoussen
Copy link
Collaborator

fghoussen commented Aug 30, 2024

Lessons I learned from arpack is that you should preferably initialize all data (iparam, inptr) passed to it and check every possible case of ido in the iterative loop.

Feel free to PR a fix if you found one.

x => workd(ipntr(1) : ipntr(1) + sys_size - 1), &

This looks wrong to me: https://github.com/opencollab/arpack-ng/blob/0b3038d139f874a423ba6a1cdf5098fc15336848/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L803C22-L803C26

@fghoussen
Copy link
Collaborator

complex(RD), parameter :: sig_shift = (2.34381384001574511E-009_RD, -3.63893604681407892E-007_RD)
real(RD) :: A(sys_size, sys_size)

Not sure arpack can handle real matrix with complex shift.

I guess, like often, that the problem may lie in the way you call/use arpack: try to compare with other solver implementation like arpackmm shipped in the repo

>> cmake -D EXAMPLES=ON -D ICB=ON -D EIGEN=ON ..

>> cat main.cpp 
#include <fstream>
#include <sstream>
#include <string>

int main() {
  std::ifstream input( "failed_mat.txt" );
  std::ofstream output( "failed_mat.mtx" );
  output << "42 42" << std::endl;
  int i = 0;
  for( std::string line; getline( input, line ); )
  {
    int j = 0;
    std::stringstream str(line);
    std::string value;
    while (str >> value) {
      output << i << " " << j << " " << value << std::endl;
      j++;
    }
    i++;
  }
  return 0;
}

>> g++ -o main main.cpp

>> ./main

>> ./EXAMPLES/MATRIX_MARKET/arpackmm --A failed_mat.mtx --shiftReal 2.34381384001574511E-009 --invert --nbEV 4
...
OUT: mode 1, nb EV found 4, nb iterations 3

>> for i in {1..100}; do ./EXAMPLES/MATRIX_MARKET/arpackmm --A failed_mat.mtx --shiftReal 2.34381384001574511E-009 --invert --nbEV 4; done

@fghoussen
Copy link
Collaborator

Not sure arpack can handle real matrix with complex shift.

Indeed, seems you can: this works fine

>> for i in {1..100}; do ./EXAMPLES/MATRIX_MARKET/arpackmm --A failed_mat.mtx --shiftReal 2.34381384001574511E-009 --shiftImag -3.63893604681407892E-007 --invert --nbEV 4; done

@fghoussen
Copy link
Collaborator

Indeed, seems you can: this works fine

... Nope ! If I remember well, for standard problem, only mode 1 (you used mode 3) is allowed with real shift only

if (shiftReal && !shiftImag) {

You may have some clue here https://github.com/opencollab/arpack-ng/blob/master/README.md#faq (Lehoucq97.pdf)

@fghoussen fghoussen linked a pull request Sep 15, 2024 that will close this issue
@fghoussen
Copy link
Collaborator

fghoussen commented Sep 17, 2024

... Nope ! If I remember well, for standard problem, only mode 1 (you used mode 3) is allowed with real shift only

According to tests in the associated PR, it seems that if you need to shift, the back transform must be done only for mode 1.

If you always back transform whatever the mode:

diff --git a/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp b/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp
index 53b89c4..e0bbd5c 100644
--- a/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp
+++ b/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp
@@ -253,17 +253,18 @@ class arpackSolver {
       bool shiftReal = (fabs(sigmaReal) > eps) ? true : false;
       bool shiftImag = (fabs(sigmaImag) > eps) ? true : false;
       bool backTransform = false;
+      if (shiftReal || shiftImag) {
+        EM I(A.rows(), A.cols());
+        I.setIdentity();
+        RC sigma; makeSigma(sigma);
+        A -= sigma*I;
+        backTransform = true;
+      }
+
       mode = 0;
       if (stdPb) {
         mode = 1;
         // CAUTION: back transform must be done only if mode = 1.
-        if (shiftReal || shiftImag) {
-          EM I(A.rows(), A.cols());
-          I.setIdentity();
-          RC sigma; makeSigma(sigma);
-          A -= sigma*I;
-          backTransform = true;
-        }
       }
       else {
         // CAUTION: back transform must NOT be done if mode > 1.

Then, for generalized problem (mode 3) you get an incorrect eigen value of 1. (instead of 3.) and the residual goes up to 1.2 (instead of 5.e-14).

Not sure arpack can handle real matrix with complex shift.

Mixing type (real/double) is likely not supported by the arpack API: you likely need to use real shift for real matrices.

I guess this issue is indeed a misuse of arpack as often as the doc is not clear to anybody !... :)

@JarRosenberg
Copy link
Author

Thank you for looking in to the issue!

Not sure arpack can handle real matrix with complex shift.

I'll keep this in mind going forward, thank you.

I guess this issue is indeed a misuse of arpack as often as the doc is not clear to anybody

If using complex shifts is a misuse of arpack, then should the documentation in dneupd.f be updated? The comments on the input variable SIGMAI and the remark on post-processing seem, to me, to imply that complex shifts in mode 3 is supposed to be supported.

@fghoussen
Copy link
Collaborator

If using complex shifts is a misuse of arpack, then should the documentation in dneupd.f be updated?

I guess nobody could really do that... That's arpack's curse... Eigen problems are as difficult to solve as they are at the heart of many other scientific problems, but the original dev team do not more maintain this cumbersome F77 codebase so it's difficult to maintain / extend / document / use the arpack library... In the end, the only question is: how to use the damn thing which lasts up-to-now to be a reference?!...

The comments on the input variable SIGMAI

Good point! If, and only if, the matrix is non sym, then only, it sounds like eigen values may be complex... And that is what you find here:

dneupd_c(rvec, howmny, select, dr, di, z, ldz, sigmaReal, sigmaImag, workev,
where you'll get only half spectrum.

Also, from what I experienced / tested: you should only shift when it helps. In practice, you shift by order of magnitude like 1, 10 or 100, or, by 42 if you can estimate your eigen value is about 42 (= not sure if shifting 1.e-5 does mean something nor help - In case you are stuck in the zero staring vector problem, try to shift 1 to "get out into a convergence region").

So maybe you crash because you back transform by hand while you shouldn't?... Are your eigen values correct if you don't shift? BTW, increase CV may help convergence too

HTH

@JarRosenberg
Copy link
Author

So maybe you crash because you back transform by hand while you shouldn't?... Are your eigen values correct if you don't shift?

I have a lot of trouble trying to get it to converge at all without shifting. What I actually need in my use-case is the eigenvalue with the largest real part, which, given the spectrum of the matrices I'm solving, seems to be pretty tough for ARPACK. Since I'm solving a series of parametrized matrices, I've been trying to use the largest real eigenvalue from the previous matrix as a shift to help convergence, which has generally been doing better but, as we've been discussing, still has some problems.

@JarRosenberg
Copy link
Author

In fact, I just reproduced the same error in mode 1 without using shifts at all. I got it this matrix (failed_mtx_no_shift.txt) setting NEV = 25, TOL = 1.E-6, and which = LR, and doing regular matrix multiplication instead of doing the PLU factorization and solve.

@fghoussen
Copy link
Collaborator

fghoussen commented Sep 18, 2024

Can you compile both arpack and your code in debug -g -O0: what stack do you get? Can you get some clue from gdb?

seems to be pretty tough for ARPACK

Give a chance to slepc too

@JarRosenberg
Copy link
Author

JarRosenberg commented Sep 18, 2024

My apologies, I spoke too soon. In the mode 1 case, I am now finding instances where iparam(5) > NEV, but the subsequent call to dneupd doesn't crash like it did in mode 3, and even returns the correct result.

Give a chance to slepc too

I will look in to it, thanks!

@fghoussen
Copy link
Collaborator

I've just tested some run with your matrice. This works if you allow the max residual norm to be 2.1 (the more eigen value you compute, the more you'll be imprecise so you need to relax the allowed residual norm)

>> ./arpackmm --A failed_mtx_no_shift.mtx --nbEV 25 --mag LM --maxResNorm 2.1
OPT: A failed_mtx_no_shift.mtx, dense no, nbEV 25, nbCV 51, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag LM
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxResNorm 2.1, maxIt 100, Ritz vectors
OPT: slv BiCG, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0.005 s

OUT: mode 1, nb EV found 25, nb iterations 1
OUT: init mode solver 0 s, RCI time 0 s
OUT: full time 0.025 s

STAT: total number of user OP*x operation                         52
STAT: total number of user  B*x operation                         0
STAT: total number of reorthogonalization steps taken             50
STAT: total number of it. refinement steps in reorthogonalization 0
STAT: total number of restart steps                               0

But LR doesn't work

>> ./arpackmm --A failed_mtx_no_shift.mtx --nbEV 25 --mag LR --maxResNorm 2.1
OPT: A failed_mtx_no_shift.mtx, dense no, nbEV 25, nbCV 51, stdPb yes, symPb yes, cpxPb no, simplePrec no, mag LR
OPT: shiftReal no, sigmaReal 0, shiftImag no, sigmaImag 0, invert no, tol 1e-06, maxResNorm 2.1, maxIt 100, Ritz vectors
OPT: slv BiCG, slvItrPC Diag, slvItrTol 1e-06, slvItrMaxIt 100
OPT: check yes, verbose 0, debug 0, restart no

INP: create A 0.005 s
Error: [dz][sn]aupd - KO with info -5, nbIt 100
Error: arpack solve KO
Error: solve KO
Error: arpack solve KO

@fghoussen
Copy link
Collaborator

Pushing a few things I forgot to push that may help testing.

Since I'm solving a series of parametrized matrices, I've been trying to use the largest real eigenvalue from the previous matrix as a shift

Just realizing: in your case, you may have more luck using restart instead of shift?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants