-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
epresso-4.1.2: Many tests fail on OpenSuse tumbleweed due to IndexError: _Map_base::at #3396
Comments
@mcepl saved us from OpenSUSE removal. |
Regarding
```
[ 1116s] 135/145 Test #131: lb_momentum_conservation ..........................***Failed 2.53 sec
[ 1116s] Traceback (most recent call last):
[ 1116s] File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 40, in <module>
[ 1116s] class Momentum(object):
[ 1116s] File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 48, in Momentum
[ 1116s] system = espressomd.System(box_l=[BOX_SIZE] * 3)
[ 1116s] File "system.pyx", line 149, in espressomd.system.System.__init__
[ 1116s] File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
[ 1116s] File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
[ 1116s] File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
[ 1116s] IndexError: _Map_base::at
```
I have no idea. The trace seems to be incomplete. To my understanding, no map lookup happens at PScriptinterface.get_sip().
In case you can interact with (or download and install) the resulting build, could you please run
```
./pypresso
import espressomd.script_interface
and check whether CollisionDetection::CollisionDetection is in the
script_interface._python_class_by_so_name dict?
It also needs to be determined, whether the script interface class for collision detection is registered in src/scriptinterface/collision_detection/initialize.cpp.
The only reason I can imagine for this not to happen is inconsistent cached versions between config.hpp and myconfig.pxi. The latter contains the definition of active features for Python/Cython.
Could you re-run this in a clean (uncached) environment?
Surprisingly on i586, only matrix_vector_product fails:
With regards to
```
[ 765s] 48/68 Test #48: matrix_vector_product ............***Failed 0.00 sec
[ 765s] Running 1 test case...
[ 765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{4.31155e-17} between result[i]{30.900000000000002} and boost::inner_product(matrix[i], vector, 0.0){30.900000000000002} exceeds 2.22045e-16%[0;39;49m
```
I don’t think there is a guarantee that the difference will be only once std::numeric_limits<double>::epsilon for calculations involving several operations.
I think we can just increase the tolerance by a factor of 5 for the vector and matrix_vector tests in src/utils/tests
|
This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works). I can maybe have a look next week. The question is also how urgent this is, because some of the relevant code is also up for replacement, so the problem could also go away by itself given some time. |
not super urgent … if we know that somebody actually works on this issue, I can keep espresso put in openSUSE. |
I was able to reproduce this on OpenSuse tumbleweed with:
(Not sure why |
This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global >static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works).
So, we cannot safely declare Python script interface classes in pyx files.
IIRC, the only reason I made collision_detection.pyx a Cython file was not to copy the values of the collision mode enum. It should be easily separable. The long-term solution will come with the refactoring of collision detection and its interface.
Due to the seriousness of the bugs fixed in 4.1.2, I think we need to fix the build.
We’ll sort it out this week.
|
@RudolfWeeber, if you make a clean patch, I will just apply it in the 4.1.2 rpm for OpenSUSE. |
Could not reproduce the error in a docker container with the |
Did you use |
For me it still persist using |
In an openSUSE Docker container:
@junghans I thought @RudolfWeeber @fweik In GDB, pypresso throws at the call to |
It does, but first it generates isolated environment. Would diff between output of |
@mcepl @junghans I think there is something wrong in the CMake command, either an incorrect linking flag or an Rpath error. Within cd /home/abuild/rpmbuild/BUILD/espresso/build
rm -rf *
rm -rf /home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd/*
cmake .. -DCMAKE_BUILD_TYPE=Debug -DMPIEXEC_PREFLAGS=--allow-run-as-root
make -j24 check_python ARGS=-j24 In that shell, I also tried the fully expanded CMake command used by |
@jngrad yeah rpm hates rpath hence |
@junghans Just had a look with cd build
rm -rf *
rm -rf /tmp/espresso-unit-tests
cp ../maintainer/configs/empty.hpp myconfig.hpp
cmake .. '-GUnix Makefiles' -DCMAKE_INSTALL_PREFIX:PATH=/tmp/espresso-unit-tests \
-DCMAKE_INSTALL_LIBDIR:PATH=/usr/lib -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_C_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
-DCMAKE_CXX_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
-DCMAKE_EXE_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
-DCMAKE_MODULE_LINKER_FLAGS='-flto=auto -Wl,--as-needed' \
-DCMAKE_SHARED_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
-DCMAKE_SKIP_RPATH:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON \
-DCMAKE_COLOR_MAKEFILE:BOOL=OFF \
-DCMAKE_SHARED_LINKER_FLAGS='-Wl,--as-needed -Wl,-z,now' \
-DPYTHON_EXECUTABLE=/usr/bin/python3 -DINSTALL_PYPRESSO=OFF
make -j$(nproc) install
# trigger index error
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" make -j$(nproc) check_python ARGS="-R ^lb_momentum_conservation$"
# GDB backtrace
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" ./pypresso --gdb "testsuite/python/lb_momentum_conservation.py"
(gdb) catch throw
(gdb) run
(gdb) bt For some reason, with the Ubuntu build the GDB backtrace doesn't allow printing the value of Cython variables (either empty or optimized out), while in the OpenSUSE build most of them can be printed. |
Just a note, the hate towards rpath is not only openSUSE-thing, Fedora Packaging Guidelines forbids it as well, we are just a way more thorough about actual checking it (running rpmlint on every package build). |
@RudolfWeeber using the procedure above for Ubuntu 18: >>> from espressomd import script_interface
>>> script_interface._python_class_by_so_name['CollisionDetection::CollisionDetection']
<class 'espressomd.collision_detection.CollisionDetection'>
>>> from espressomd import System
>>> s = System(box_l=(1,1,1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "system.pyx", line 149, in espressomd.system.System.__init__
File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
IndexError: _Map_base::at When compiling espresso with the minimal config, the issue happens for the ComFixed attribute, as described earlier. I don't get why the Rpath would have any impact on ComFixed, which is defined in a plain python file. Could it be another issue? The GDB trace for both
which calls espresso/src/core/MpiCallbacks.hpp Line 494 in 50b25e6
with fp = (void (*)(void)) <(anonymous namespace)::make_remote_handle()> and m_func_ptr_to_id a container of type std::unordered_map<void (*)(), int> . The container throws because the fp callback is not registered. Adding the following assertion catches the mistake:
assert(m_func_ptr_to_id.find(reinterpret_cast<void (*)()>(fp))!=m_func_ptr_to_id.end()); |
@jngrad can you please try giving the function |
Adding I think the root cause is the LTO (link time optimization), enabled through the @junghans Until espresso gets LTO'ed, a temporary fix would be to manually remove the --- python3-espressomd.spec (revision fdbeb87bede60ca9aa7dac9bebb5ab7f)
+++ python3-espressomd.spec (working copy)
@@ -81,7 +82,13 @@
'-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now' \
-DLIBDIR=%{_lib} \
-DPYTHON_EXECUTABLE=%{_bindir}/python3 \
- -DINSTALL_PYPRESSO=OFF
+ -DINSTALL_PYPRESSO=OFF \
+ '-DCMAKE_C_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+ '-DCMAKE_CXX_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+ '-DCMAKE_Fortran_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+ '-DCMAKE_EXE_LINKER_FLAGS=-Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
+ '-DCMAKE_MODULE_LINKER_FLAGS=-Wl,--as-needed' \
+ '-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now'
%make_jobs
%install
|
Ok, x86_64 is ok now, i586 still fails on matrix_vector_product. |
On i586 in addition:
|
On making espresso LTO-compliant: I originally thought the We will not move forward with LTO-compliance for the moment. The specfile @junghans Errors of the type @fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like: #ifdef __i586__
auto constexpr epsilon = 3 * std::numeric_limits<double>::epsilon();
#else
auto constexpr epsilon = std::numeric_limits<double>::epsilon();
#endif |
<https://github.com/fweik> @fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it >in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like:
#ifdef __i586__
auto constexpr epsilon = 3 * std::numeric_limits<double>::epsilon();
#else
auto constexpr epsilon = std::numeric_limits<double>::epsilon();
#endif
Please don’t use Gnu specifics. No idea if any of that works on MSVC.
I also think, this is out of scope for us. The architecture is from the mid 90s.
That said,
I don’t understand why the error of something that involves more than one operation should be <= 1 epsilon, in the first place.
Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.
|
So how does one find error limits for floating point calculations? Is the error bounded? Introducing random error bounds with magic numbers does not increase confidence IMHO... In general, |
Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.
There are adaptative epsilon check algorithms that take the magnitude of the floats into account and can deal with zero, but they're quite involved and need to be used with care, because some mathematical operations can lead to large deviations (e.g. |
I don’t understand, what’s so complicated about adding |
On Fri, Jan 17, 2020 at 02:03:12AM -0800, Jean-Noël Grad wrote:
> Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.
Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.
I'm in favor of looking at the specific test rather than discussing the abstract question of floating point accuracy.
Looking at src/utils/tests/matrix_vector_product.cpp,
the involved matrix and vector are of order 1 to 100.
Expected and observed results agreeing to 10E-16 should be convincing enough.
|
why? |
Update: the epsilon value in the boost unit tests was wrong and got fixed in #3427, which is now a patch in the openSUSE package (openSUSE:Factory/python3-espressomd/3427.patch). Other failing tests were patched (#3315 (comment)), excepted |
Now we have that problem on Fedora 33 as well. |
Due to https://build.opensuse.org/request/show/760741, I tried to bump espresso to v4.1.2 on OpenSUSE, but many tests fail on x86_64 with:
Details here.
Surprisingly on i586, only
matrix_vector_product
fails:Details here
The text was updated successfully, but these errors were encountered: