espresso-4.1.3: Many tests fail on Fedora 33 due to IndexError: _Map_base::at #3853

Closed · jngrad opened this issue Aug 6, 2020 · 21 comments

jngrad commented Aug 6, 2020

Same problem as #3396 but for Fedora 33, originally reported by @junghans in #3396 (comment).

The investigation below was conducted with this dockerfile and script.

TL;DR: the MPICH build fails because the old 4.1.3 ScriptInterface does not support link-time optimization (LTO), while the OpenMPI build fails due to issues with electrostatics. The new 4.2.0-dev ScriptInterface has additional issues in OpenMPI builds.

ESPResSo 4.1.3 compiled with MPICH

Most Python tests fail. This usually happens when creating an espressomd.system.System object. Error message:

Traceback (most recent call last):
  File "/home/espresso/rpmbuild/BUILD/espresso-098adfe327aba153d88604bbe23e313b4d3ddc15/mpich/x86_64-redhat-linux-gnu/testsuite/python/time_series.py", line 42, in test_time_series
    system = espressomd.System(box_l=3 * [1.])
  File "system.pyx", line 144, in espressomd.system.System.__init__
  File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
  File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
  File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
IndexError: _Map_base::at

The corresponding Cython code is:

self.set_sip(make_shared(to_char_pointer(name), policy_))

which is called from:

def __init__(self, **kwargs):
    super().__init__(self._so_name, policy=self._so_creation_policy,
                     **kwargs)

The ScriptInterface code is currently being refactored in #3794. The new interface in 4.2.0 doesn't have this issue with MPICH (tested with commit 9064d69, requires patch in #3852).
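
For context, the _Map_base::at message is what libstdc++ puts into the std::out_of_range exception thrown by std::unordered_map::at when a key is missing, and Cython translates std::out_of_range into a Python IndexError, which is how it surfaces in the traceback above. A plausible (but unconfirmed) failure mode is that LTO drops the static registration of a ScriptInterface class name, so the later factory lookup fails. A minimal standalone sketch that only reproduces the error message, not the ESPResSo code:

// sketch.cpp -- illustrative only, not ESPResSo code
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

int main() {
  // stand-in for a name -> factory map whose registration was dropped
  std::unordered_map<std::string, int> factories;
  try {
    factories.at("CollisionDetection"); // key was never registered
  } catch (std::out_of_range const &err) {
    std::cerr << err.what() << '\n'; // prints "_Map_base::at" with libstdc++
  }
}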

Like in the openSUSE case, this issue in 4.1.3 stems from LTO, which is enabled by default on Fedora 33 (see LTOByDefault). Contrary to what was done on openSUSE (#3396 (comment)) and what can be read on the Fedora bug tracker (1863059#c0 or 1789137#c5), adding %define _lto_cflags %{nil} to the specfile did not remove the -flto -ffat-lto-objects flags during the RPM build in my docker image. @junghans any idea how to opt out of LTO?

ESPResSo 4.1.3 compiled with OpenMPI

Only 3 tests fail with OpenMPI: the two coulomb_cloud_wall* tests and the domain_decomposition test.

 43/147 Test  #23: coulomb_cloud_wall ................................***Failed    1.57 sec
Complex value is not zero (i=39,data=170.329)!!!
Complex value is not zero (i=40,data=195.229)!!!
terminate called after throwing an instance of 'std::runtime_error'
  what():  Complex value is not zero
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 5] /lib64/libstdc++.so.6()
[ 6] terminate called after throwing an instance of 'std::runtime_error'
[ 8] .../site-packages/openmpi/espressomd/EspressoCore.so(p3m_calc_kspace_forces(bool, bool, ParticleRange const&))
[10] .../site-packages/openmpi/espressomd/EspressoCore.so(force_calc(CellStructure&))
*** End of error message ***

 87/147 Test  #71: domain_decomposition ..............................***Failed    2.12 sec
          sock.c:344  UCX  ERROR recv(fd=31) failed: Bad address
          sock.c:344  UCX  ERROR recv(fd=28) failed: Connection reset by peer
          sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
terminate called after throwing an instance of 'boost::wrapexcept<boost::mpi::exception>'
  what():  MPI_Test: MPI_ERR_INTERN: internal error
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 6] /lib64/libstdc++.so.6(+0xaa4d9)[0x7f54c36344d9]
[ 7] /usr/lib64/openmpi/lib/libboost_mpi.so.1.73.0()
[ 8] /usr/lib64/openmpi/lib/libboost_mpi.so.1.73.0(boost::mpi::request::trivial_handler::test())
[ 9] .../site-packages/openmpi/espressomd/EspressoCore.so()
[10] .../site-packages/openmpi/espressomd/EspressoCore.so(dd_exchange_and_sort_particles(int, ParticleList*, Utils::Vector<int, 3ul> const&))
[11] .../site-packages/openmpi/espressomd/EspressoCore.so(cells_resort_particles(int))
[12] .../site-packages/openmpi/espressomd/EspressoCore.so(mpi_resort_particles_slave(int, int))
[13] .../site-packages/openmpi/espressomd/EspressoCore.so(mpi_loop())
*** End of error message ***

These errors seem unrelated to the old ScriptInterface code (full log in rpmbuild-4.1.3-openmpi.txt). The new interface in #3794 compiled with OpenMPI has extra issues in a few checkpointing tests (tested with commit 9064d69, requires patch in #3852), which seem to involve the ScriptInterface boost variant visitor pattern (full log in rpmbuild-4.2-dev-openmpi.txt).
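
For readers unfamiliar with the pattern mentioned above: the ScriptInterface passes dynamically typed parameters around as a boost::variant and dispatches on the held type with a visitor. A generic sketch of that pattern (illustrative only, not the actual ESPResSo types):

// variant_visitor_sketch.cpp -- illustrative only, not the ESPResSo types
#include <boost/variant.hpp>
#include <iostream>
#include <string>

using Variant = boost::variant<int, double, std::string>;

struct Printer : boost::static_visitor<void> {
  void operator()(int i) const { std::cout << "int: " << i << '\n'; }
  void operator()(double d) const { std::cout << "double: " << d << '\n'; }
  void operator()(std::string const &s) const {
    std::cout << "string: " << s << '\n';
  }
};

int main() {
  Variant v = 3.14;
  Printer printer;
  boost::apply_visitor(printer, v); // dispatches to the double overload
}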

junghans commented Aug 6, 2020

junghans commented Aug 7, 2020

I made some progress: https://koji.fedoraproject.org/koji/taskinfo?taskID=48848410. Now it fails on x86_64 with:

The following tests FAILED:
	 23 - coulomb_cloud_wall (Failed)
	 72 - layered (Timeout)
	 83 - coulomb_cloud_wall_duplicated (Failed)
Errors while running CTest

build.log.txt

jngrad commented Aug 7, 2020

> There is a bug in the mpich package that makes %define _lto_cflags %{nil} get ignored

Thanks! This explains why I couldn't opt out of LTO.

> now it fails on x86_64

I have the same log on x86_64. This is independent of LTO. The layered test hangs with a UCX error message.

junghans commented Aug 7, 2020

Have a look at https://github.com/junghans/espresso-rpm/blob/f33/espresso.spec for my latest changes.

jngrad commented Aug 7, 2020

I had tried %global _lto_cflags %{nil} as well, as suggested in one of the Fedora tickets mentioned in the first post, but MPICH still injected the LTO flags: -flto -ffat-lto-objects -fexceptions -fasynchronous-unwind-tables. I've re-built espresso with your new spec file, but the MPICH issue persists on my side.

However, the OpenMPI UCX timeout is now gone on my side with your new specfile (junghans/espresso-rpm@12b1bb2ca680868e02 for future reference). You simplified a lot of paths in the build and test sections. Is the layered test passing on your side too?

junghans commented Aug 7, 2020

jngrad commented Aug 7, 2020

The layered test passes reproducibly on my side on x86_64, but it doesn't pass on the x86_64 Koji job you just re-triggered.

MPICH doesn't inject the -flto flag anymore on Koji, so I guess your LTO fix that was merged recently is already deployed.

junghans commented Aug 7, 2020

@opoplawski any idea about the UCX error on F33?

@opoplawski

I haven't seen that before. I would bring it up with UCX directly - they were responsive to another UCX issue I just reported.

jngrad commented Aug 15, 2020

Could be a false lead, but printing the values of data[0] and data[1] (the real and imaginary components of the first data point, respectively) in fft_perform_back() shows a discrepancy between ES built on Fedora 33 (columns 2-3) and Ubuntu 18.04 (columns 4-5):

step            Fedora 33 (real, imaginary)   Ubuntu 18.04 (real, imaginary)
input 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
FFT 1 -115.884 0.000 -115.884 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
Communicate 1 -115.884 0.000 -115.884 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
FFT 2 -115.884 0.000 -115.884 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
Communicate 2 -952.797 -91.189 817.912 0.000
121.787 -123.919 133.162 0.000
281.095 217.920 -161.609 0.000
-168.084 119.324 -1617.001 0.000
FFT 3 -262.081 175.674 -185.928 0.000
-83.281 -358.977 448.158 0.000
372.641 362.332 -1248.708 0.000
-993.044 -471.697 441.617 0.000
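
The exact print statements used to produce the table are not shown here; they amount to dumping the first few points of the interleaved FFT buffer after each stage, along the lines of the following sketch (hypothetical helper; the layout with data[2*i] holding the real part and data[2*i+1] the imaginary part matches the check patched below):

#include <cstdio>

// Hypothetical helper sketching the instrumentation in fft_perform_back()
// (src/core/electrostatics_magnetostatics/fft.cpp); the buffer holds
// interleaved complex values: data[2*i] is the real part and data[2*i+1]
// the imaginary part of mesh point i.
static void print_first_points(const char *stage, const double *data,
                               int npoints = 4) {
  std::printf("%s\n", stage);
  for (int i = 0; i < npoints; i++)
    std::printf("  %.3f %.3f\n", data[2 * i], data[2 * i + 1]);
}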

The exception Complex value is not zero is triggered on 4 MPI ranks. On 2 MPI ranks, the error is too small to be detected in the core, but the python test detects a measurable deviation from the reference values. To get a GDB session, the code needs an MPI barrier, otherwise the exception is always thrown on ranks != 0, which makes it more difficult to work with GDB. Here's an MWE:

--- a/src/core/electrostatics_magnetostatics/fft.cpp
+++ b/src/core/electrostatics_magnetostatics/fft.cpp
@@ -740,10 +740,11 @@ void fft_perform_back(double *data, bool check_complex, fft_data_struct &fft,
   for (int i = 0; i < fft.plan[1].new_size; i++) {
     fft.data_buf[i] = data[2 * i]; /* real value */
     // Vincent:
+    MPI_Barrier(comm);
     if (check_complex && (data[2 * i + 1] > 1e-5)) {
-      printf("Complex value is not zero (i=%d,data=%g)!!!\n", i,
+      printf("rank %i: Complex value is not zero (i=%d,data=%g)!!!\n", comm.rank(), i,
              data[2 * i + 1]);
-      if (i > 100)
+      if (i > 1000 or comm.rank() == 0)
         throw std::runtime_error("Complex value is not zero");
     }
   }
git clone --depth=1 --recursive -b python https://github.com/espressomd/espresso.git
cd espresso
git apply patch.diff # contains the patch above
export with_ccache=true build_procs=$(nproc) check_procs=$(nproc)
export with_cuda=false myconfig=maxset make_check_python=false
bash maintainer/CI/build_cmake.sh
cd build
make python_test_data
module load mpi
mpiexec -n 1 ./pypresso --gdb testsuite/python/coulomb_cloud_wall.py : -n 3 ./pypresso testsuite/python/coulomb_cloud_wall.py
(gdb) catch throw
(gdb) run
Thread 1 "python3" hit Catchpoint 1 (exception thrown), 0x00007fc36f7e94a2 in __cxa_throw () from /lib64/libstdc++.so.6
(gdb) f 1
#1  0x00007fc36ffdaa23 in fft_perform_back (data=0x7fc34c35d040, 
    check_complex=<optimized out>, fft=..., comm=...)
    at /home/espresso/test/espresso/src/core/electrostatics_magnetostatics/fft.cpp:748
748	        throw std::runtime_error("Complex value is not zero");
(gdb) print data[0]
$1 = -214.95799046091787

The value in data[0] seems random. The values (data[0], data[1]) sent and received in back_grid_comm() check out, so it's not an issue with MPI_Sendrecv. I'm not familiar enough with the FFT code in espresso to continue this investigation further. @fweik Would you like to have a look? It's reproducible on the python branch with this dockerfile.
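
For reference, checking back_grid_comm() boiled down to comparing what each rank sends with what its partner receives around MPI_Sendrecv; a standalone sketch of that kind of verification (hypothetical, not the actual instrumentation used):

// sendrecv_checksum_sketch.cpp -- hypothetical verification of an exchange
#include <mpi.h>
#include <cstdio>
#include <numeric>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Pair up neighboring ranks (0<->1, 2<->3, ...); a rank without a partner
  // (odd communicator size) simply skips the exchange.
  int partner = rank ^ 1;
  if (partner < size) {
    std::vector<double> sendbuf(8, rank + 1.0), recvbuf(8, 0.0);
    MPI_Sendrecv(sendbuf.data(), 8, MPI_DOUBLE, partner, 0,
                 recvbuf.data(), 8, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double sent = std::accumulate(sendbuf.begin(), sendbuf.end(), 0.0);
    double received = std::accumulate(recvbuf.begin(), recvbuf.end(), 0.0);
    std::printf("rank %d: sent checksum %.3f, received checksum %.3f\n",
                rank, sent, received);
  }
  MPI_Finalize();
}

Run with e.g. mpiexec -n 4 ./a.out; matching checksums between paired ranks rule out the transport for that particular exchange.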

jngrad commented Aug 15, 2020

Additional data point: the bug is reproducible when compiling OpenMPI 4.0.4 + UCX 1.8.1 + boost 1.73 from source (dockerfile), but not when compiling OpenMPI without UCX (dockerfile).

fweik commented Aug 15, 2020

I'm not familiar with the FFT code. It does have a complex communication pattern, which will make it hard to debug. Let's discuss this on Monday; maybe I'll have an idea of how to approach this by then. Generally I'd say it will be easier to re-implement the algorithm or use PFFT (a library solution) than to understand the existing code. PFFT is bus-factor-1 domain-scientist code too, iirc, but it has more clients than our code...

jngrad commented Aug 15, 2020

Yeah, I'm also not sure how to approach this. We would need to extract a self-contained MWE that reproduces the issue, but this means first finding out where in the FFT workflow things go wrong. Since it's only reproducible with a specific version of UCX, there is also a possibility that nothing is wrong on our side (it already happened once with an older OpenMPI+UCX version).

The UCX repository has merged around 100 PRs since the 1.8.1 release a month ago, and around 40 more PRs are waiting to be merged. Using the latest commit (openucx/ucx@6cd8b3cee2) instead of the latest release 1.8.1, the domain_decomposition and coulomb_* tests pass, so the underlying issue in UCX that blocks espresso 4.1.3 might very well be fixed already. However, with that commit, 44 other python tests fail... We might have to wait for the next UCX release and try building espresso again.

fweik commented Aug 15, 2020

Well, I made the FFT an isolated component, so testing whether it actually does an FFT shouldn't be that hard. The error that is eventually raised means that either the FFT algorithm doesn't work as expected, or the Fourier coefficients that went into the transform were not those of a real function, in which case the problem is in the P3M or elsewhere. Tests for the FFT would help to locate the issue. It could also very well be a technical issue, as you were saying. I'd say an error in the P3M code is unlikely, because the P3M draws its local k coordinates from the FFT component, which actually decides how the mesh is split across the nodes... These are tests that we should have anyway, and this would also be a good opportunity to move the FFT out of the core :-)
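
As an illustration of the kind of standalone test meant above, here is a minimal sketch using plain FFTW3 rather than ESPResSo's distributed FFT (the library choice, build line and tolerance are assumptions): transform a real signal forward, then backward, and verify that the imaginary parts of the round-tripped data stay near zero, which is the property the check in fft_perform_back() asserts.

// fft_roundtrip_check.cpp -- illustrative only, not ESPResSo's distributed FFT
// Build (assumption): g++ fft_roundtrip_check.cpp -lfftw3 -lm
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>
#include <fftw3.h>

int main() {
  const int n = 64;
  std::vector<std::complex<double>> in(n), mid(n), out(n);
  fftw_plan fwd = fftw_plan_dft_1d(
      n, reinterpret_cast<fftw_complex *>(in.data()),
      reinterpret_cast<fftw_complex *>(mid.data()), FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_plan bwd = fftw_plan_dft_1d(
      n, reinterpret_cast<fftw_complex *>(mid.data()),
      reinterpret_cast<fftw_complex *>(out.data()), FFTW_BACKWARD, FFTW_ESTIMATE);

  // purely real input: its Fourier coefficients are those of a real function
  for (int i = 0; i < n; i++)
    in[i] = {std::sin(2.0 * M_PI * i / n), 0.0};

  fftw_execute(fwd);
  fftw_execute(bwd);

  // FFTW's backward transform is unnormalized, hence the division by n.
  for (int i = 0; i < n; i++) {
    const double imag = out[i].imag() / n;
    if (std::abs(imag) > 1e-5) // same tolerance as the check in fft.cpp
      std::printf("Complex value is not zero (i=%d,data=%g)!!!\n", i, imag);
  }
  fftw_destroy_plan(fwd);
  fftw_destroy_plan(bwd);
  return 0;
}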

jngrad commented Aug 15, 2020

New data point: with UCX 1.9.0rc1 we have 55 failing python tests. Most print a UCX backtrace, but not the ones I'm investigating.

jngrad commented Aug 16, 2020

After a lengthy bisection through recent merge commits, I found that openucx/ucx#5473 introduced the fix for the 3 failing tests, but I couldn't backport it onto 1.8.1 due to non-trivial merge conflicts. The PR is also quite substantial, so I can't tell what's responsible for the fix. The merge commit cannot be used to build espresso 4.1 with OpenMPI+UCX, as it contains all PRs since 1.8.1, some of which introduced changes that break other espresso python tests.

@junghans

Is there already a UCX release containing the fix?

jngrad commented Aug 16, 2020

The fix hasn't been milestoned yet. The release candidate 1.9.0rc1 doesn't include it, and causes other espresso tests to fail.

@junghans

> The fix hasn't been milestoned yet. The release candidate 1.9.0rc1 doesn't include it, and causes other espresso tests to fail.

Hmm, not great. @opoplawski can we backport that UCX fix?

jngrad commented Nov 2, 2020

Should be fixed by UCX 1.9.0: #3905 (comment). Closing.

jngrad closed this as completed on Nov 2, 2020

@opoplawski

I've built ucx 1.9.0 in Fedora rawhide now.
