espresso-4.1.3: Many tests fail on Fedora 33 due to IndexError: _Map_base::at #3853
Comments
There is a bug in the mpich package that makes
I made some progress: https://koji.fedoraproject.org/koji/taskinfo?taskID=48848410; now it fails on x86_64 with:
Thanks! This explains why I couldn't opt out of LTO.
I have the same log on x86_64. This is independent of LTO. The
Have a look at https://github.com/junghans/espresso-rpm/blob/f33/espresso.spec for my latest changes.
I had tried that as well. However, the OpenMPI UCX timeout is now gone with your new specfile (junghans/espresso-rpm@12b1bb2ca680868e02 for future reference) from my side. You simplified a lot of paths in the build and test sections. Is the
The build of junghans/espresso-rpm@12b1bb2ca680868e02 was https://koji.fedoraproject.org/koji/taskinfo?taskID=48848410, but I retriggered it as https://koji.fedoraproject.org/koji/taskinfo?taskID=48889098
MPICH doesn't inject the
@opoplawski any idea about the UCX error on F33?
I haven't seen that before. I would bring it up with UCX directly - they were responsive to another UCX issue I just reported.
Could be a false lead, but printing the value of
The exception:

```diff
--- a/src/core/electrostatics_magnetostatics/fft.cpp
+++ b/src/core/electrostatics_magnetostatics/fft.cpp
@@ -740,10 +740,11 @@ void fft_perform_back(double *data, bool check_complex, fft_data_struct &fft,
   for (int i = 0; i < fft.plan[1].new_size; i++) {
     fft.data_buf[i] = data[2 * i]; /* real value */
     // Vincent:
+    MPI_Barrier(comm);
     if (check_complex && (data[2 * i + 1] > 1e-5)) {
-      printf("Complex value is not zero (i=%d,data=%g)!!!\n", i,
+      printf("rank %i: Complex value is not zero (i=%d,data=%g)!!!\n", comm.rank(), i,
              data[2 * i + 1]);
-      if (i > 100)
+      if (i > 1000 or comm.rank() == 0)
         throw std::runtime_error("Complex value is not zero");
     }
   }
```

Steps to reproduce:

```sh
git clone --depth=1 --recursive -b python https://github.com/espressomd/espresso.git
cd espresso
git apply patch.diff  # contains the patch above
export with_ccache=true build_procs=$(nproc) check_procs=$(nproc)
export with_cuda=false myconfig=maxset make_check_python=false
bash maintainer/CI/build_cmake.sh
cd build
make python_test_data
module load mpi
mpiexec -n 1 ./pypresso --gdb testsuite/python/coulomb_cloud_wall.py : -n 3 ./pypresso testsuite/python/coulomb_cloud_wall.py
```

```
(gdb) catch throw
(gdb) run
Thread 1 "python3" hit Catchpoint 1 (exception thrown), 0x00007fc36f7e94a2 in __cxa_throw () from /lib64/libstdc++.so.6
(gdb) f 1
#1  0x00007fc36ffdaa23 in fft_perform_back (data=0x7fc34c35d040,
    check_complex=<optimized out>, fft=..., comm=...)
    at /home/espresso/test/espresso/src/core/electrostatics_magnetostatics/fft.cpp:748
748         throw std::runtime_error("Complex value is not zero");
(gdb) print data[0]
$1 = -214.95799046091787
```

The value in
Additional data point: the bug is reproducible when compiling OpenMPI 4.0.4 + UCX 1.8.1 + boost 1.73 from sources (dockerfile), but not reproducible when compiling OpenMPI without UCX (dockerfile).
I'm not familiar with the FFT code. It does have a complex communication pattern, which will make it hard to debug. Let's discuss this on Monday; maybe I'll have an idea how to approach this by then. Generally I'd say it will be easier to re-implement the algorithm or use PFFT (a library solution) than to understand the existing code. PFFT is bus-factor-1 domain-scientist code too, IIRC, but it has more clients than our code...
Yeah, I'm also not sure how to approach this. We would need to extract a self-contained MWE that reproduces the issue, but this means first finding out where in the FFT workflow things go wrong. Since it's only reproducible with a specific version of UCX, there is also a possibility that nothing is wrong on our side (it already happened once with an older OpenMPI+UCX version). The UCX repository has merged around 100 PRs since the 1.8.1 release a month ago, and they still have around 40 PRs waiting to be merged. Using the latest commit (openucx/ucx@6cd8b3cee2) instead of the latest release 1.8.1, the
Well, I made the fft an isolated component, so testing whether it actually performs an FFT shouldn't be that hard. The error that is eventually raised means that either the fft algorithm doesn't work as expected, or the Fourier coefficients that went into the transform were not those of a real function, in which case the problem is in the P3M or elsewhere. Tests for the fft would help to locate the issue. It could also very well be a technical issue, as you were saying. I'd say an error in the P3M code is unlikely, because the P3M draws its local k coordinates from the fft component, which actually decides how the mesh is split across the nodes... These are tests that we should have anyway; it would also be a good opportunity to move the fft out of the core :-)
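A standalone round-trip check of the kind suggested here could look like the following sketch (a naive DFT with hypothetical helper names, not the actual `fft.cpp` API): transform real data forward and backward, and assert that the imaginary parts of the reconstruction stay near zero, which is the same invariant the "Complex value is not zero" guard enforces.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) DFT, sign = -1 for forward, +1 for backward (unnormalized).
std::vector<cplx> dft(const std::vector<cplx> &in, int sign) {
  const std::size_t n = in.size();
  std::vector<cplx> out(n);
  for (std::size_t k = 0; k < n; ++k)
    for (std::size_t j = 0; j < n; ++j)
      out[k] += in[j] * std::polar(1.0, sign * 2.0 * M_PI * k * j / n);
  return out;
}

// Largest |imaginary part| (normalized by n) after a forward+backward
// round trip of a purely real signal; should be at numerical noise level.
double max_imag_after_roundtrip(const std::vector<double> &signal) {
  std::vector<cplx> in(signal.begin(), signal.end());
  auto spectrum = dft(in, -1);
  auto back = dft(spectrum, +1);
  double worst = 0.0;
  for (const auto &c : back)
    worst = std::max(worst, std::abs(c.imag()) / double(back.size()));
  return worst;
}
```

A failure of this invariant at the component level would point at the transform itself rather than at the P3M input data.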
New data point: with UCX 1.9.0rc1 we have 55 failing python tests. Most print a UCX backtrace, but not the ones I'm investigating.
After a lengthy bisection through recent merge commits, I found out openucx/ucx#5473 introduced the fix for the 3 failing tests, but I couldn't backport it onto 1.8.1 due to non-trivial merge conflicts. The PR is also quite substantial, so I can't tell what's responsible for the fix. The merge commit cannot be used to build espresso 4.1 with OpenMPI+UCX, as it contains all PRs since 1.8.1, some of which introduced changes that break other espresso python tests.
Is there already a UCX release containing the fix?
The fix hasn't been milestoned yet. The release candidate 1.9.0rc1 doesn't include it, and causes other espresso tests to fail. |
Hmm, not great. @opoplawski can we backport that UCX fix?
Should be fixed by UCX 1.9.0: #3905 (comment). Closing. |
I've built ucx 1.9.0 in Fedora rawhide now. |
Same problem as #3396 but for Fedora 33, originally reported by @junghans in #3396 (comment).
The investigation below was conducted with this dockerfile and script.
TL;DR: the MPICH build fails due to the old 4.1.3 ScriptInterface not supporting link time optimization (LTO), the OpenMPI build fails due to issues with electrostatics. The new 4.2.0-dev ScriptInterface has additional issues in OpenMPI builds.
ESPResSo 4.1.3 compiled with MPICH
Most Python tests fail. This usually happens when creating an
espressomd.system.System
object. Error message:

The corresponding Cython code is:
espresso/src/python/espressomd/script_interface.pyx, line 78 in 098adfe,
which is called from:
espresso/src/python/espressomd/script_interface.pyx, lines 296 to 298 in 098adfe
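As background on the error text itself: "_Map_base::at" is the internal libstdc++ name that appears in the what() message when std::unordered_map::at() is called with a missing key, and Cython translates the resulting std::out_of_range into a Python IndexError. A minimal sketch (the map contents here are made up, not the actual ScriptInterface parameter map):

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Returns the what() text produced when unordered_map::at() misses a key.
// With libstdc++ this is "_Map_base::at", which is exactly the string that
// surfaces in the Python-level "IndexError: _Map_base::at".
std::string message_for_missing_key() {
  std::unordered_map<std::string, int> params{{"some_parameter", 1}};
  try {
    (void)params.at("missing_key"); // key absent -> throws std::out_of_range
    return "";                      // not reached
  } catch (const std::out_of_range &err) {
    return err.what();
  }
}
```

The exact message text is an implementation detail of libstdc++; other standard libraries word it differently, so the error string alone only tells us a map lookup failed somewhere in the ScriptInterface layer.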
The ScriptInterface code is currently being refactored in #3794. The new interface in 4.2.0 doesn't have this issue with MPICH (tested with commit 9064d69, requires patch in #3852).
Like in the OpenSUSE case, this issue in 4.1.3 stems from LTO, which is enabled by default on Fedora 33 (see LTOByDefault). Contrary to what was done on OpenSUSE (#3396 (comment)) and what can be read on the Fedora bug tracker (1863059#c0 or 1789137#c5), adding `%define _lto_cflags %{nil}` in the specfile did not remove the `-flto -ffat-lto-objects` flags during the RPM build in my docker image. @junghans any idea how to opt out of LTO?
ESPResSo 4.1.3 compiled with OpenMPI
Only 3 tests fail with OpenMPI: the two `coulomb_cloud_wall*` tests and the `domain_decomposition` test. These errors seem unrelated to the old ScriptInterface code (full log in rpmbuild-4.1.3-openmpi.txt). The new interface in #3794 compiled with OpenMPI has extra issues in a few checkpointing tests (tested with commit 9064d69, requires patch in #3852), which seem to involve the ScriptInterface boost variant visitor pattern (full log in rpmbuild-4.2-dev-openmpi.txt).