espresso-4.1.3: Many tests fail on Fedora 33 due to IndexError: _Map_base::at #3853

Closed · jngrad opened this issue Aug 6, 2020 · 21 comments

jngrad commented Aug 6, 2020

Same problem as #3396 but for Fedora 33, originally reported by @junghans in #3396 (comment).

The investigation below was conducted with this dockerfile and script.

TL;DR: the MPICH build fails because the old 4.1.3 ScriptInterface does not support link-time optimization (LTO), while the OpenMPI build fails due to issues with electrostatics. The new 4.2.0-dev ScriptInterface has additional issues in OpenMPI builds.

ESPResSo 4.1.3 compiled with MPICH

Most Python tests fail. This usually happens when creating an espressomd.system.System object. Error message:

Traceback (most recent call last):
  File "/home/espresso/rpmbuild/BUILD/espresso-098adfe327aba153d88604bbe23e313b4d3ddc15/mpich/x86_64-redhat-linux-gnu/testsuite/python/time_series.py", line 42, in test_time_series
    system = espressomd.System(box_l=3 * [1.])
  File "system.pyx", line 144, in espressomd.system.System.__init__
  File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
  File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
  File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
IndexError: _Map_base::at

The corresponding Cython code is:

self.set_sip(make_shared(to_char_pointer(name), policy_))

which is called from:

def __init__(self, **kwargs):
    super().__init__(self._so_name, policy=self._so_creation_policy,
                     **kwargs)

The ScriptInterface code is currently being refactored in #3794. The new interface in 4.2.0 doesn't have this issue with MPICH (tested with commit 9064d69, requires patch in #3852).
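
For context, the _Map_base::at message is what libstdc++ puts into the std::out_of_range exception thrown by std::unordered_map::at when a key is missing, and Cython translates std::out_of_range into a Python IndexError, which is how it surfaces in the traceback above. A plausible (but unconfirmed) failure mode is that LTO drops the static registration of a ScriptInterface class name, so the later factory lookup fails. A minimal standalone sketch that only reproduces the error message, not the ESPResSo code:

// sketch.cpp -- illustrative only, not ESPResSo code
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

int main() {
  // stand-in for a name -> factory map whose registration was dropped
  std::unordered_map<std::string, int> factories;
  try {
    factories.at("CollisionDetection"); // key was never registered
  } catch (std::out_of_range const &err) {
    std::cerr << err.what() << '\n'; // prints "_Map_base::at" with libstdc++
  }
}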

Like in the openSUSE case, this issue in 4.1.3 stems from LTO, which is enabled by default on Fedora 33 (see LTOByDefault). Contrary to what was done on openSUSE (#3396 (comment)) and what can be read on the Fedora bug tracker (1863059#c0 or 1789137#c5), adding %define _lto_cflags %{nil} to the specfile did not remove the -flto -ffat-lto-objects flags during the RPM build in my docker image. @junghans any idea how to opt out of LTO?

ESPResSo 4.1.3 compiled with OpenMPI

Only 3 tests fail with OpenMPI: the two coulomb_cloud_wall* tests and the domain_decomposition test.

 43/147 Test  #23: coulomb_cloud_wall ................................***Failed    1.57 sec
Complex value is not zero (i=39,data=170.329)!!!
Complex value is not zero (i=40,data=195.229)!!!
terminate called after throwing an instance of 'std::runtime_error'
  what():  Complex value is not zero
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 5] /lib64/libstdc++.so.6()
[ 6] terminate called after throwing an instance of 'std::runtime_error'
[ 8] .../site-packages/openmpi/espressomd/EspressoCore.so(p3m_calc_kspace_forces(bool, bool, ParticleRange const&))
[10] .../site-packages/openmpi/espressomd/EspressoCore.so(force_calc(CellStructure&))
*** End of error message ***

 87/147 Test  #71: domain_decomposition ..............................***Failed    2.12 sec
          sock.c:344  UCX  ERROR recv(fd=31) failed: Bad address
          sock.c:344  UCX  ERROR recv(fd=28) failed: Connection reset by peer
          sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
terminate called after throwing an instance of 'boost::wrapexcept<boost::mpi::exception>'
  what():  MPI_Test: MPI_ERR_INTERN: internal error
*** Process received signal ***
Signal: Aborted (6)
Signal code:  (-6)
[ 6] /lib64/libstdc++.so.6(+0xaa4d9)[0x7f54c36344d9]
[ 7] /usr/lib64/openmpi/lib/libboost_mpi.so.1.73.0()
[ 8] /usr/lib64/openmpi/lib/libboost_mpi.so.1.73.0(boost::mpi::request::trivial_handler::test())
[ 9] .../site-packages/openmpi/espressomd/EspressoCore.so()
[10] .../site-packages/openmpi/espressomd/EspressoCore.so(dd_exchange_and_sort_particles(int, ParticleList*, Utils::Vector<int, 3ul> const&))
[11] .../site-packages/openmpi/espressomd/EspressoCore.so(cells_resort_particles(int))
[12] .../site-packages/openmpi/espressomd/EspressoCore.so(mpi_resort_particles_slave(int, int))
[13] .../site-packages/openmpi/espressomd/EspressoCore.so(mpi_loop())
*** End of error message ***

These errors seem unrelated to the old ScriptInterface code (full log in rpmbuild-4.1.3-openmpi.txt). The new interface in #3794 compiled with OpenMPI has extra issues in a few checkpointing tests (tested with commit 9064d69, requires patch in #3852), which seem to involve the ScriptInterface boost variant visitor pattern (full log in rpmbuild-4.2-dev-openmpi.txt).
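
For readers unfamiliar with the pattern mentioned above: the ScriptInterface passes dynamically typed parameters around as a boost::variant and dispatches on the held type with a visitor. A generic sketch of that pattern (illustrative only, not the actual ESPResSo types):

// variant_visitor_sketch.cpp -- illustrative only, not the ESPResSo types
#include <boost/variant.hpp>
#include <iostream>
#include <string>

using Variant = boost::variant<int, double, std::string>;

struct Printer : boost::static_visitor<void> {
  void operator()(int i) const { std::cout << "int: " << i << '\n'; }
  void operator()(double d) const { std::cout << "double: " << d << '\n'; }
  void operator()(std::string const &s) const {
    std::cout << "string: " << s << '\n';
  }
};

int main() {
  Variant v = 3.14;
  Printer printer;
  boost::apply_visitor(printer, v); // dispatches to the double overload
}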

junghans commented Aug 6, 2020

junghans commented Aug 7, 2020

I made some progress: https://koji.fedoraproject.org/koji/taskinfo?taskID=48848410. Now it fails on x86_64 with:

The following tests FAILED:
	 23 - coulomb_cloud_wall (Failed)
	 72 - layered (Timeout)
	 83 - coulomb_cloud_wall_duplicated (Failed)
Errors while running CTest

build.log.txt

jngrad commented Aug 7, 2020

> There is a bug in the mpich package that makes %define _lto_cflags %{nil} get ignored

Thanks! This explains why I couldn't opt out of LTO.

> now it fails on x86_64

I have the same log on x86_64. This is independent of LTO. The layered test hangs with a UCX error message.

junghans commented Aug 7, 2020

Have a look at https://github.com/junghans/espresso-rpm/blob/f33/espresso.spec for my latest changes.

jngrad commented Aug 7, 2020

I had tried %global _lto_cflags %{nil} as well, as suggested in one of the Fedora tickets mentioned in the first post, but MPICH still injected the LTO flags: -flto -ffat-lto-objects -fexceptions -fasynchronous-unwind-tables. I've re-built espresso with your new spec file, but the MPICH issue persists on my side.

However, the OpenMPI UCX timeout is now gone on my side with your new specfile (junghans/espresso-rpm@12b1bb2ca680868e02 for future reference). You simplified a lot of paths in the build and test sections. Is the layered test passing on your side too?

junghans commented Aug 7, 2020

jngrad commented Aug 7, 2020

The layered test passes reproducibly on my side on x86_64, but it doesn't pass on the x86_64 Koji job you just re-triggered.

MPICH doesn't inject the -flto flag anymore on Koji, so I guess your LTO fix that was merged recently is already deployed.

junghans commented Aug 7, 2020

@opoplawski any idea about the UCX error on F33?

@opoplawski

I haven't seen that before. I would bring it up with UCX directly - they were responsive to another UCX issue I just reported.

jngrad commented Aug 15, 2020

Could be a false lead, but printing the values of data[0] and data[1] (the real and imaginary components of the first data point, respectively) in fft_perform_back() shows a discrepancy between ES built on Fedora 33 (columns 2-3) and Ubuntu 18.04 (columns 4-5):

step            Fedora 33 (real, imaginary)   Ubuntu 18.04 (real, imaginary)
input 0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
FFT 1 -115.884 0.000 -115.884 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
Communicate 1 -115.884 0.000 -115.884 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
FFT 2 -115.884 0.000 -115.884 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
Communicate 2 -952.797 -91.189 817.912 0.000
121.787 -123.919 133.162 0.000
281.095 217.920 -161.609 0.000
-168.084 119.324 -1617.001 0.000
FFT 3 -262.081 175.674 -185.928 0.000
-83.281 -358.977 448.158 0.000
372.641 362.332 -1248.708 0.000
-993.044 -471.697 441.617 0.000
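
The exact print statements used to produce the table are not shown here; they amount to dumping the first few points of the interleaved FFT buffer after each stage, along the lines of the following sketch (hypothetical helper; the layout with data[2*i] holding the real part and data[2*i+1] the imaginary part matches the check patched below):

#include <cstdio>

// Hypothetical helper sketching the instrumentation in fft_perform_back()
// (src/core/electrostatics_magnetostatics/fft.cpp); the buffer holds
// interleaved complex values: data[2*i] is the real part and data[2*i+1]
// the imaginary part of mesh point i.
static void print_first_points(const char *stage, const double *data,
                               int npoints = 4) {
  std::printf("%s\n", stage);
  for (int i = 0; i < npoints; i++)
    std::printf("  %.3f %.3f\n", data[2 * i], data[2 * i + 1]);
}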

The exception Complex value is not zero is triggered on 4 MPI ranks. On 2 MPI ranks, the error is too small to be detected in the core, but the python test detects a measurable deviation from the reference values. To get a GDB session, the code needs an MPI barrier, otherwise the exception is always thrown on ranks != 0, which makes it more difficult to work with GDB. Here's an MWE:

--- a/src/core/electrostatics_magnetostatics/fft.cpp
+++ b/src/core/electrostatics_magnetostatics/fft.cpp
@@ -740,10 +740,11 @@ void fft_perform_back(double *data, bool check_complex, fft_data_struct &fft,
   for (int i = 0; i < fft.plan[1].new_size; i++) {
     fft.data_buf[i] = data[2 * i]; /* real value */
     // Vincent:
+    MPI_Barrier(comm);
     if (check_complex && (data[2 * i + 1] > 1e-5)) {
-      printf("Complex value is not zero (i=%d,data=%g)!!!\n", i,
+      printf("rank %i: Complex value is not zero (i=%d,data=%g)!!!\n", comm.rank(), i,
              data[2 * i + 1]);
-      if (i > 100)
+      if (i > 1000 or comm.rank() == 0)
         throw std::runtime_error("Complex value is not zero");
     }
   }
git clone --depth=1 --recursive -b python https://github.com/espressomd/espresso.git
cd espresso
git apply patch.diff # contains the patch above
export with_ccache=true build_procs=$(nproc) check_procs=$(nproc)
export with_cuda=false myconfig=maxset make_check_python=false
bash maintainer/CI/build_cmake.sh
cd build
make python_test_data
module load mpi
mpiexec -n 1 ./pypresso --gdb testsuite/python/coulomb_cloud_wall.py : -n 3 ./pypresso testsuite/python/coulomb_cloud_wall.py
(gdb) catch throw
(gdb) run
Thread 1 "python3" hit Catchpoint 1 (exception thrown), 0x00007fc36f7e94a2 in __cxa_throw () from /lib64/libstdc++.so.6
(gdb) f 1
#1  0x00007fc36ffdaa23 in fft_perform_back (data=0x7fc34c35d040, 
    check_complex=<optimized out>, fft=..., comm=...)
    at /home/espresso/test/espresso/src/core/electrostatics_magnetostatics/fft.cpp:748
748	        throw std::runtime_error("Complex value is not zero");
(gdb) print data[0]
$1 = -214.95799046091787

The value in data[0] seems random. The values (data[0], data[1]) sent and received in back_grid_comm() check out, so it's not an issue with MPI_Sendrecv. I'm not familiar enough with the FFT code in espresso to continue this investigation further. @fweik Would you like to have a look? It's reproducible on the python branch with this dockerfile.
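
For reference, checking back_grid_comm() boiled down to comparing what each rank sends with what its partner receives around MPI_Sendrecv; a standalone sketch of that kind of verification (hypothetical, not the actual instrumentation used):

// sendrecv_checksum_sketch.cpp -- hypothetical verification of an exchange
#include <mpi.h>
#include <cstdio>
#include <numeric>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Pair up neighboring ranks (0<->1, 2<->3, ...); a rank without a partner
  // (odd communicator size) simply skips the exchange.
  int partner = rank ^ 1;
  if (partner < size) {
    std::vector<double> sendbuf(8, rank + 1.0), recvbuf(8, 0.0);
    MPI_Sendrecv(sendbuf.data(), 8, MPI_DOUBLE, partner, 0,
                 recvbuf.data(), 8, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double sent = std::accumulate(sendbuf.begin(), sendbuf.end(), 0.0);
    double received = std::accumulate(recvbuf.begin(), recvbuf.end(), 0.0);
    std::printf("rank %d: sent checksum %.3f, received checksum %.3f\n",
                rank, sent, received);
  }
  MPI_Finalize();
}

Run with e.g. mpiexec -n 4 ./a.out; matching checksums between paired ranks rule out the transport for that particular exchange.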

jngrad commented Aug 15, 2020

Additional data point: the bug is reproducible when compiling OpenMPI 4.0.4 + UCX 1.8.1 + boost 1.73 from source (dockerfile), but not when compiling OpenMPI without UCX (dockerfile).

fweik commented Aug 15, 2020

I'm not familiar with the FFT code. It does have a complex communication pattern, which will make it hard to debug. Let's discuss this on Monday; maybe I'll have an idea of how to approach this by then. Generally I'd say it will be easier to re-implement the algorithm or use PFFT (a library solution) than to understand the existing code. PFFT is bus-factor-1 domain-scientist code too, iirc, but it has more clients than our code...

jngrad commented Aug 15, 2020

Yeah, I'm also not sure how to approach this. We would need to extract a self-contained MWE that reproduces the issue, but this means first finding out where in the FFT workflow things go wrong. Since it's only reproducible with a specific version of UCX, there is also a possibility that nothing is wrong on our side (it already happened once with an older OpenMPI+UCX version).

The UCX repository has merged around 100 PRs since the 1.8.1 release a month ago, and around 40 more PRs are waiting to be merged. Using the latest commit (openucx/ucx@6cd8b3cee2) instead of the latest release 1.8.1, the domain_decomposition and coulomb_* tests pass, so the underlying issue in UCX that blocks espresso 4.1.3 might very well be fixed already. However, with that commit, 44 other python tests fail... We might have to wait for the next UCX release and try building espresso again.

fweik commented Aug 15, 2020

Well, I made the FFT an isolated component, so testing whether it actually does an FFT shouldn't be that hard. The error that is eventually raised means that either the FFT algorithm doesn't work as expected, or the Fourier coefficients that went into the transform were not those of a real function, in which case the problem is in the P3M or elsewhere. Tests for the FFT would help to locate the issue. It could also very well be a technical issue, as you were saying. I'd say an error in the P3M code is unlikely, because the P3M draws its local k coordinates from the FFT component, which actually decides how the mesh is split across the nodes... These are tests that we should have anyway, and this would also be a good opportunity to move the FFT out of the core :-)
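
As an illustration of the kind of standalone test meant above, here is a minimal sketch using plain FFTW3 rather than ESPResSo's distributed FFT (the library choice, build line and tolerance are assumptions): transform a real signal forward, then backward, and verify that the imaginary parts of the round-tripped data stay near zero, which is the property the check in fft_perform_back() asserts.

// fft_roundtrip_check.cpp -- illustrative only, not ESPResSo's distributed FFT
// Build (assumption): g++ fft_roundtrip_check.cpp -lfftw3 -lm
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>
#include <fftw3.h>

int main() {
  const int n = 64;
  std::vector<std::complex<double>> in(n), mid(n), out(n);
  fftw_plan fwd = fftw_plan_dft_1d(
      n, reinterpret_cast<fftw_complex *>(in.data()),
      reinterpret_cast<fftw_complex *>(mid.data()), FFTW_FORWARD, FFTW_ESTIMATE);
  fftw_plan bwd = fftw_plan_dft_1d(
      n, reinterpret_cast<fftw_complex *>(mid.data()),
      reinterpret_cast<fftw_complex *>(out.data()), FFTW_BACKWARD, FFTW_ESTIMATE);

  // purely real input: its Fourier coefficients are those of a real function
  for (int i = 0; i < n; i++)
    in[i] = {std::sin(2.0 * M_PI * i / n), 0.0};

  fftw_execute(fwd);
  fftw_execute(bwd);

  // FFTW's backward transform is unnormalized, hence the division by n.
  for (int i = 0; i < n; i++) {
    const double imag = out[i].imag() / n;
    if (std::abs(imag) > 1e-5) // same tolerance as the check in fft.cpp
      std::printf("Complex value is not zero (i=%d,data=%g)!!!\n", i, imag);
  }
  fftw_destroy_plan(fwd);
  fftw_destroy_plan(bwd);
  return 0;
}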

jngrad commented Aug 15, 2020

New data point: with UCX 1.9.0rc1 we have 55 failing python tests. Most print a UCX backtrace, but not the ones I'm investigating.

jngrad commented Aug 16, 2020

After a lengthy bisection through recent merge commits, I found that openucx/ucx#5473 introduced the fix for the 3 failing tests, but I couldn't backport it onto 1.8.1 due to non-trivial merge conflicts. The PR is also quite substantial, so I can't tell what's responsible for the fix. The merge commit cannot be used to build espresso 4.1 with OpenMPI+UCX, as it contains all PRs since 1.8.1, some of which introduced changes that break other espresso python tests.

@junghans

Is there already a UCX release containing the fix?

jngrad commented Aug 16, 2020

The fix hasn't been milestoned yet. The release candidate 1.9.0rc1 doesn't include it, and causes other espresso tests to fail.

@junghans

> The fix hasn't been milestoned yet. The release candidate 1.9.0rc1 doesn't include it, and causes other espresso tests to fail.

Hmm, not great. @opoplawski can we backport that UCX fix?

jngrad commented Nov 2, 2020

Should be fixed by UCX 1.9.0: #3905 (comment). Closing.

jngrad closed this as completed on Nov 2, 2020

@opoplawski

I've built ucx 1.9.0 in Fedora rawhide now.
