machines: eccc: set I_MPI_CBWR for BASEGEN/BASECOM runs
Intel MPI, in contrast to OpenMPI (as far as I was able to test, and see
[1], [2]), does not (by default) guarantee that repeated runs of the same
code on the same machine with the same number of MPI ranks yield the
same results when collective operations (e.g. 'MPI_ALLREDUCE') are used.

Since the VP solver uses MPI_ALLREDUCE in its algorithm, repeated runs
of the code give different answers, and baseline comparisons fail even
between runs of code built from the same commit.
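
The root cause is that floating-point addition is not associative, so
the order in which per-rank partial sums are combined changes the
rounded result. The sketch below is only an illustration in Python: the
function names 'reduce_linear' and 'reduce_tree' and the sample values
are hypothetical, and they are not the collective algorithms that Intel
MPI actually selects.

```python
import math


def reduce_linear(parts):
    """Combine partial sums left to right, like a chain over ranks."""
    total = 0.0
    for p in parts:
        total += p
    return total


def reduce_tree(parts):
    """Combine partial sums pairwise, like a binary-tree reduction."""
    parts = list(parts)
    while len(parts) > 1:
        paired = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            # Odd element out is carried to the next level unchanged.
            paired.append(parts[-1])
        parts = paired
    return parts[0]


# Hypothetical per-rank partial sums, chosen so that rounding is visible.
partial_sums = [1.0e16, 1.0, -1.0e16, 1.0]

linear = reduce_linear(partial_sums)
tree = reduce_tree(partial_sums)
exact = math.fsum(partial_sums)  # correctly rounded reference sum

print(linear, tree, exact)  # 1.0 0.0 2.0
```

Both orderings are valid ways to sum the same four numbers, yet they
disagree with each other and with the correctly rounded sum. A
bit-for-bit baseline comparison sees the same kind of difference when
the reduction order changes between runs.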

When generating a baseline or comparing against an existing baseline,
set the environment variable 'I_MPI_CBWR' to 1 in the ECCC machine files
that use Intel MPI [3], so that (processor) topology-aware collective
algorithms are not used and results are reproducible.

Note that we do not need to set this variable on robert or underhill, on
which jobs have exclusive node access and thus job placement (on
processors) is guaranteed to be reproducible.

[1] https://stackoverflow.com/a/45916859/
[2] https://scicomp.stackexchange.com/a/2386/
[3] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html#i-mpi-adjust-family-environment-variables_GUID-A5119508-5588-4CF5-9979-8D60831D1411

(cherry picked from commit 1f46305)
(this commit is available via

    git fetch https://github.com/CICE-Consortium/CICE refs/pull/774/head

and was squash-merged as part of 16b78da (ice_dyn_vp: allow for
bit-for-bit reproducibility under `bfbflag` (CICE-Consortium#774), 2022-10-20)).
phil-blain committed Sep 5, 2023
1 parent ac4fa80 commit 739d815
Showing 4 changed files with 16 additions and 0 deletions.
4 changes: 4 additions & 0 deletions configuration/scripts/machines/env.ppp5_intel
@@ -18,6 +18,10 @@ source $ssmuse -d /fs/ssm/main/opt/intelcomp/inteloneapi-2022.1.2/intelcomp+mpi+
 # module load -s icc mpi
 setenv FOR_DUMP_CORE_FILE 1
 setenv I_MPI_DEBUG_COREDUMP 1
+# Reproducible collectives
+if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
+setenv I_MPI_CBWR 1
+endif
 # Stop being buggy
 setenv I_MPI_FABRICS ofi
 # NetCDF
4 changes: 4 additions & 0 deletions configuration/scripts/machines/env.ppp6_gnu-impi
@@ -18,6 +18,10 @@ setenv I_MPI_F90 gfortran
 setenv I_MPI_FC gfortran
 setenv I_MPI_CC gcc
 setenv I_MPI_CXX g++
+# Reproducible collectives
+if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
+setenv I_MPI_CBWR 1
+endif
 # Stop being buggy
 setenv I_MPI_FABRICS ofi

4 changes: 4 additions & 0 deletions configuration/scripts/machines/env.ppp6_intel
@@ -18,6 +18,10 @@ source $ssmuse -d /fs/ssm/main/opt/intelcomp/inteloneapi-2022.1.2/intelcomp+mpi+
 # module load -s icc mpi
 setenv FOR_DUMP_CORE_FILE 1
 setenv I_MPI_DEBUG_COREDUMP 1
+# Reproducible collectives
+if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
+setenv I_MPI_CBWR 1
+endif
 # Stop being buggy
 setenv I_MPI_FABRICS ofi
 # NetCDF
4 changes: 4 additions & 0 deletions configuration/scripts/machines/env.ppp6_intel19
@@ -17,6 +17,10 @@ setenv FOR_DUMP_CORE_FILE 1
 source $ssmuse -d /fs/ssm/hpco/exp/intelpsxe-impi-19.0.3.199
 setenv FI_PROVIDER verbs
 setenv I_MPI_DEBUG_COREDUMP 1
+# Reproducible collectives
+if (${ICE_BASEGEN} != ${ICE_SPVAL} || ${ICE_BASECOM} != ${ICE_SPVAL}) then
+setenv I_MPI_CBWR 1
+endif
 # Stop being buggy
 setenv I_MPI_FABRICS ofi
 # NetCDF