Fix MPI on Ubuntu 18.04 with CUDA #2271

Merged: 2 commits merged into espressomd:python from ubuntu1804 on Sep 18, 2018

Member

@mkuron mkuron commented Sep 18, 2018

OpenMPI 2.1.1 has a broken vader BTL (byte transfer layer), which requires us to disable its single-copy mode. The bug manifests itself in error messages like

Read -1, expected 4000000, errno = 14

and leads to broken MPI communication between multiple ranks on the same machine. This appears to be CUDA-related, as Espresso compiled without CUDA support does not have the issue. As Ubuntu hasn't backported the fix from 2.1.3 (https://www.mail-archive.com/users@lists.open-mpi.org/msg32357.html), we have to resort to disabling vader's single-copy mode. This makes vader behave similarly to the old sm BTL, which was the default before OpenMPI 2.

This patch has the same effect as setting the environment variable OMPI_MCA_btl_vader_single_copy_mechanism=none on OpenMPI 2.0-2.1.2 and 3.0.0. However, we cannot set that from inside Espresso, so we use the MPI_T interface instead.
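For anyone who wants to apply the equivalent environment-variable workaround from a launcher script rather than through the patch, the version range above can be sketched in Python as follows. This is only an illustration, not the code in this PR: the helper names (parse_version, needs_vader_workaround, launch_env) are hypothetical, and it assumes the OpenMPI version is available as a plain numeric string such as "2.1.1".

```python
import os

# Environment variable named in the PR description.
ENV_VAR = "OMPI_MCA_btl_vader_single_copy_mechanism"


def parse_version(text):
    """Turn '2.1.1' into (2, 1, 1); missing parts default to 0.

    Assumes a purely numeric version string (no 'rc1' suffixes).
    """
    parts = [int(p) for p in text.split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def needs_vader_workaround(version):
    """True for OpenMPI 2.0 through 2.1.2 and for 3.0.0."""
    v = parse_version(version)
    return (2, 0, 0) <= v <= (2, 1, 2) or v == (3, 0, 0)


def launch_env(version, base_env=None):
    """Return a copy of the environment with the workaround applied if needed.

    A value already set by the user is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    if needs_vader_workaround(version):
        env.setdefault(ENV_VAR, "none")
    return env
```

The returned dictionary could then be passed as the env argument of subprocess.run when starting the MPI job. The variable has to be in place before MPI is initialized, which is why this sketch targets the launcher and not Espresso itself.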

Reported by @pkreissl.

OpenMPI 2.1.1 has a broken vader BTL, requiring us to disable its single-copy mode
@mkuron mkuron changed the title Fix MPI on Ubuntu 18.04 with CUDA [WIP] Fix MPI on Ubuntu 18.04 with CUDA Sep 18, 2018
@mkuron mkuron changed the title [WIP] Fix MPI on Ubuntu 18.04 with CUDA Fix MPI on Ubuntu 18.04 with CUDA Sep 18, 2018
Contributor

@pkreissl pkreissl left a comment

Commit 43b49c3 fixes the issue.

Member Author

mkuron commented Sep 18, 2018

Please milestone for Espresso 4.0.1

@fweik fweik added the BugFix label Sep 18, 2018
@fweik fweik added this to the Espresso 4.0.1 milestone Sep 18, 2018
@fweik fweik merged commit 3059d2d into espressomd:python Sep 18, 2018
RudolfWeeber pushed a commit to RudolfWeeber/espresso that referenced this pull request Oct 15, 2018
@mkuron mkuron deleted the ubuntu1804 branch November 26, 2018 10:11
3 participants