ice_dyn_vp: allow for bit-for-bit reproducibility under bfbflag #774

Merged
17 commits merged into CICE-Consortium:main on Oct 20, 2022

Conversation

@phil-blain (Member) commented on Oct 14, 2022

PR checklist

  • Short (1 sentence) summary of your PR:
    title
  • Developer(s):
    me
  • Suggest PR reviewers from list in the column to the right.
  • Please copy the PR test results link or provide a summary of testing completed below.
350 measured results of 350 total results
348 of 350 tests PASSED
0 of 350 tests PENDING
0 of 350 tests MISSING data
2 of 350 tests FAILED
$ ./results.csh|\grep FAIL
FAIL ppp6_intel_smoke_gx3_4x1_dynpicard complog 007fbff different-data
FAIL ppp6_intel_smoke_gx3_4x1_dynpicard compare 007fbff 2.58 0.75 1.14 different-data

More details below.

  • How much do the PR code changes differ from the unmodified code?
    • bit for bit
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR update the Icepack submodule? If so, the Icepack submodule must point to a hash on Icepack's main branch.
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

Summary

This PR refactors the VP solver to allow bit-for-bit reproducibility when using bfbflag. Currently on main, global sums are accumulated locally before calling global_sum on a scalar value. With the new code, when bfbflag is not off, we call global_sum on an array, allowing the existing algorithms (reprosum, ddpdd, etc) to be used for b4b reproducibility. See also previous implementations and design decisions in #763. In this new version, the code has the same number of global sums as the main version.
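
To make the difference concrete, here is a minimal sketch of the two call forms (argument lists abridged; global_sum is the existing generic interface from ice_global_reductions, and the surrounding variable names are illustrative):

    ! main: reduce a locally accumulated scalar; the order of the local
    ! accumulation depends on the block / MPI decomposition, so the result
    ! is not reproducible across decompositions
    norm2 = global_sum(local_sum, distrb_info)

    ! this PR, when bfbflag is not 'off': pass the element-by-element
    ! product array to global_sum, so that the selected algorithm
    ! (reprosum, ddpdd, lsum16, ...) controls the summation order
    norm2 = global_sum(prod, distrb_info, field_loc_NEcorner)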

PR overview

First we update the docs a little:

  • 1/15 c9144ea doc: fix typo in index (bfbflag)
  • 2/15 545377e doc: correct default value of 'maxits_nonlin'
  • 3/15 399d873 doc: VP solver is validated with OpenMP

Then we update our machines files:

  • 4/15 75b934a machines: eccc : add ICE_MACHINE_MAXRUNLENGTH to ppp[56]
  • 5/15 405df63 machines: eccc: use PBS-enabled OpenMPI for 'ppp6_gnu'
  • 6/15 de08b34 machines: eccc: set I_MPI_FABRICS=ofi
  • 7/15 f7cbcb2 machines: eccc: set I_MPI_CBWR for BASEGEN/BASECOM runs

Then we fix a bug we discovered along the way:

  • 8/15 fd54d5d ice_dyn_vp: fgmres: exit early if right-hand-side vector is zero

Then the core of the changes, first introducing two new functions and then refactoring the code to use them:

  • 9/15 b96ea5b ice_dyn_vp: add global_norm, global_dot_product functions
  • 10/15 c37c645 ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit output reproducibility
  • 11/15 33ccce2 ice_dyn_vp: do not skip halo updates in 'pgmres' under 'bfbflag'
  • 12/15 6aab357 ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit log reproducibility

Some clean-up and doc updates:

  • 13/15 5a8b2d1 ice_dyn_vp: remove unused subroutine and cleanup interfaces
  • 14/15 8901899 doc: mention VP solver is only reproducible using 'bfbflag'

And finally a tweak to the default solver settings:

  • 15/15 9f1efbb ice_dyn_vp: improve default parameters for VP solver

Testing

  • I ran the decomp_suite with -s dynpicard,reprosum (i.e. bfbflag=reprosum) and all tests pass
    • I also ran it with bfbflag=ddpdd, all tests pass
    • I also ran it with bfbflag=lsum16, all tests pass
  • I ran the nothread_suite with -s dynpicard,reprosum and all tests pass
  • I also ran both these suites twice from the same commit and compared the results using bgen/bcmp to also ensure reproducibility on the same decomposition + MPI distribution. All pass.
  • I ran the QC test for the failing test in the base_suite. I ran it on 4 pairs of results.
    • main code vs new code
    • main code vs new code with bfbflag=lsum8, so exercising the array version of global_sum
    • main code with the same changes to the solver parameters as in 15/15 vs new code
    • main code with the same changes to the solver parameters as in 15/15 vs new code with bfbflag=lsum8

The first two pairs pass the 2-stage test and the NH quadratic skill test but fail in the SH.
The last two pass all 3 tests (2-stage, quadratic skill NH, quadratic skill SH).

I don't think it's too surprising that the first two pairs fail the test, as with 4 nonlinear iterations the solution is not really converged at all from what we've seen. With 10 iterations we usually get at least one significant digit of convergence.

Performance

  • The old and new code both took 3 hours to simulate the 5-year QC data (with the new solver parameters)
  • The new code with bfbflag=lsum8 (i.e. the array version of global_sum) and the new solver parameters took 2.5 hours, so it's actually faster in this case!

This is not what I was expecting, since my previous timings showed a modest time increase with the lsum8 case. However, those timings were done on single-day runs, so the 5-year runs are probably more "realistic" timings. So in the end we might want to default to the array version? What do others think?

Questions

  • After 10/15, the global_allreduce_sum function is not used anymore. It was added by me for the original implementation of the VP solver but is not used anywhere else in the code. I can remove it if we feel we do not need it. My opinion is it does not hurt to keep it for now.

  • The doc mentions

    For best performance, the off (or lsum8 which is equivalent) setting is recommended.

    but after these changes, off is not the same as lsum8 for the VP solver, since with lsum8 we use the array version of global_sum... I don't think it matters that much, but I could also just remove that parenthesis from the doc.

@apcraig (Contributor) commented on Oct 14, 2022

This all looks great! A couple questions/comments,

Under performance you say the old code time was 3 hours and the new 3.5 hours and then suggest new is faster. Something doesn't seem consistent, am I misunderstanding?

Overall, I sort of vote for removing the global_sum_prod subroutines if we don't need them anymore, and we don't really see a need for them at this point. It's just more code to test/maintain if we don't need it. Also, they exist on both the serial and MPI sides, so removing them reduces the burden of keeping the two versions synced up a bit. They are tested in the sumchk unit tester, so if we remove them, we need to update the sumchk test. However, if others prefer we keep them, I'm OK with that. Should we create an issue and defer for now?

With regard to documentation of global sums, I'm trying to think about the best way to update the usage and documentation overall. In some ways, we've overloaded the bfbflag. It means one thing in global_sums and a slightly different but mostly overlapping thing in the vp solver. Thinking out loud, do we need a new bfbflag option (offvp?) that sets bfbflag=off and uses the scalar global sum in vp while all other bfbflag settings (including off) would call into the standard global_sum from vp? Will there ever be a time where we want the vp global_sum to be set different from the diagnostics global sum (maybe vp=off but diagnostics=bfb)? If so, then we need a new namelist in the dynamics for the vp global sum setting. I'm largely happy with what's there now, but just throwing out some ideas. If we keep what we have, then I suggest we keep the documentation simple and clean up what we can (remove suggestion that sum8==off).

Have all the OpenMP comments been turned back on? Are all block loops shared-parallel now?

@apcraig (Contributor) commented on Oct 14, 2022

Quick follow up on OpenMP, what about line 337 in ice_dyn_vp.F90?

@phil-blain (Member, Author) commented on Oct 14, 2022

Under performance you say the old code time was 3 hours and the new 3.5 hours and then suggest new is faster. Something doesn't seem consistent, am I misunderstanding?

Sorry, my mistake. The new code with bfbflag=lsum8 actually took 2.5 hours, so yes, it's faster. EDIT: I've edited the PR description.

Overall, I sort of vote for removing the global_sum_prod subroutines if we don't need them anymore, and we don't really see a need for them at this point. It's just more code to test/maintain if we don't need it. Also, they exist on both the serial and MPI sides, so removing them reduces the burden of keeping the two versions synced up a bit. They are tested in the sumchk unit tester, so if we remove them, we need to update the sumchk test. However, if others prefer we keep them, I'm OK with that. Should we create an issue and defer for now?

Just to be clear, here I'm talking about the global_allreduce_sum module procedure, not global_sum_prod. global_sum_prod is used at a few places (ice_diagnostics, ice_transport_driver). global_allreduce_sum is not used anymore and could be removed if we wish.

With regard to documentation of global sums, I'm trying to think about the best way to update the usage and documentation overall. In some ways, we've overloaded the bfbflag. It means one thing in global_sums and a slightly different but mostly overlapping thing in the vp solver. Thinking out loud, do we need a new bfbflag option (offvp?) that sets bfbflag=off and uses the scalar global sum in vp while all other bfbflag settings (including off) would call into the standard global_sum from vp? Will there ever be a time where we want the vp global_sum to be set different from the diagnostics global sum (maybe vp=off but diagnostics=bfb)? If so, then we need a new namelist in the dynamics for the vp global sum setting. I'm largely happy with what's there now, but just throwing out some ideas. If we keep what we have, then I suggest we keep the documentation simple and clean up what we can (remove suggestion that sum8==off).

I think I would keep things as simple as possible and just remove the parenthesis in the doc :)

Have all the OpenMP comments been turned back on? Are all block loops shared-parallel now?
Quick follow up on OpenMP, what about line 337 in ice_dyn_vp.F90?

Oh yeah, this:

! tcraig, tcx, threading here leads to some non-reproducbile results and failures in icepack_ice_strength
! need to do more debugging
!$TCXOMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block,ij,i,j)
do iblk = 1, nblocks
!-----------------------------------------------------------------
! more preparation for dynamics
!-----------------------------------------------------------------

This dates back to when ice_dyn_vp was created by copying ice_dyn_evp. Then when you recently fixed a bunch of OpenMP loops in d1e972a (Update OMP (#680), 2022-02-18), you fixed the private variables in ice_dyn_vp for this loop as in ice_dyn_evp but kept the loop unthreaded, since (I'm guessing) we already knew that the VP solver had other b4b issues. Now that you mention it, I checked and that line is the only remaining TCXOMP loop in the code.

I'll do a few more tests with an actual !$OMP directive over that loop and report back.

@apcraig (Contributor) commented on Oct 14, 2022

Thanks for clarifications.

Sorry for my confusion, I would get rid of the global_allreduce_sum methods. I don't think those are going to be needed in the future. This is tested in sumchk, so we would need to update that as well. Again, maybe it's best to open a new issue and do this separately.

I'm fine with the plan for bfbflag and documentation of it.

Correct, when I worked on the OpenMP validation, vp had other problems that made it difficult for me to validate the OpenMP in vp. I commented out OpenMP directives that were clearly problematic in vp to improve overall robustness and then I think vp passed the OpenMP testing generally. But we need to revisit that. I think the first question is that one open OpenMP loop that is commented out. Also, we should make sure there are no other iblk loops in vp that should have OpenMP added. And we should try to get all loops on and validated with OpenMP. If a loop cannot have OpenMP, let's try to add a comment why. Again, this could be a separate PR but would be great for it to be settled now. Thanks!

@phil-blain (Member, Author) commented

Sorry for my confusion, I would get rid of the global_allreduce_sum methods. I don't think those are going to be needed in the future. This is tested in sumchk, so we would need to update that as well. Again, maybe it's best to open a new issue and do this separately.

OK. I can remove them and update the unit test. And it's relatively easy to re-add them if needed from the Git history :)

I'm fine with the plan for bfbflag and documentation of it.

OK. Will update the doc.

Correct, when I worked on the OpenMP validation, vp had other problems that made it difficult for me to validate the OpenMP in vp. I commented out OpenMP directives that were clearly problematic in vp to improve overall robustness and then I think vp passed the OpenMP testing generally. But we need to revisit that.

My understanding is that after this PR, everything is validated as far as I know. The decomp suite has a lot of different MPI x OpenMP distributions in addition to the various block decompositions, and as mentioned all these tests pass with bfbflag=reprosum.

I think the first question is that one open OpenMP loop that is commented out.

Yes, I'm testing that and will update the PR with the results.

Also, we should make sure there are no other iblk loops in vp that should have OpenMP added. And we should try to get all loops on and validated with OpenMP. If a loop cannot have OpenMP, let's try to add a comment why. Again, this could be a separate PR but would be great for it to be settled now.

I just had a look and the only loops that are not threaded are those that compute ntot and the ones in vec_to_arrays and arrays_to_vec. I'm planning to remove these subroutines when I go back to work on the parallelization of the Anderson solver. Right now these loops are only involved in the Anderson solver code, not in the Picard solver. So I would keep them as is for now.

@phil-blain (Member, Author) commented

I think the first question is that one open OpenMP loop that is commented out.

Yes, I'm testing that and will update the PR with the results.

Great news: the decomp suite passes with that loop OpenMP'd (with bfbflag=reprosum, as in the rest of my tests). So I'll rework my first commits a bit to include that change (next week). Have a good weekend! :)

@apcraig (Contributor) commented on Oct 14, 2022

Thanks @phil-blain. All sounds good. If you have questions/problems with the unit tester, feel free to defer. The unit testers are very useful, but pretty rough in some ways.

@eclare108213 (Contributor) left a comment


This all looks good to me. I'll let the experts review/approve/merge.

The "Table of namelist options" in the user guide lists 'maxits_nonlin'
as having a default value of 1000, whereas its actual default is 4, both
in the namelist and in 'ice_init.F90'. This has been the case since the
original implementation of the implicit solver in f7fd063 (dynamics: add
implicit VP solver (CICE-Consortium#491), 2020-09-22).

Fix the documentation.
When the implicit VP solver was added in f7fd063 (dynamics: add implicit
VP solver (CICE-Consortium#491), 2020-09-22), it had not yet been tested with OpenMP
enabled.

The OpenMP implementation was carefully reviewed and then fixed in
d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), which led to all runs of the
'decomp' suite completing and all restart tests passing. The 'bfbcomp'
tests are still failing, but this is due to the code not using the CICE
global sum implementation correctly, which will be fixed in the next
commits.

Update the documentation accordingly.
When the OpenMP implementation was reviewed and fixed in d1e972a (Update
OMP (CICE-Consortium#680), 2022-02-18), the 'PRIVATE' clause of the OpenMP directive
for the loop where 'dyn_prep2' is called in 'implicit_solver' was
corrected in line with what was done in 'ice_dyn_evp', but OpenMP was
left unactivated for this loop (the 'TCXOMP' was not changed to a real
'OMP' directive).

Activate OpenMP for this loop. All runs and restart tests of the
'decomp_suite' still pass with this change.
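
For reference, a minimal sketch of what the activated loop looks like (the PRIVATE clause is the one already present on the commented-out 'TCXOMP' line; the loop body is elided):

    !$OMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block,ij,i,j)
    do iblk = 1, nblocks
       ! ... dyn_prep2 and the rest of the preparation work, unchanged ...
    enddo
    !$OMP END PARALLEL DO
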
The system installation of OpenMPI at /usr/mpi/gcc/openmpi-4.1.2a1/ is
not compiled with support for PBS. This leads to failures as the MPI
runtime does not have the same view of the number of available processors
as the job scheduler.

Use our own build of OpenMPI, compiled with PBS support, for the
'ppp6_gnu'  environment, which uses OpenMPI.
Intel MPI 2021.5.1, which comes with oneAPI 2022.1.2, seems to have an
intermittent bug where a call to 'MPI_Waitall' fails with:

    Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code

and no core dump is produced. This affects at least these cases of the
'decomp' suite:

- *_*_restart_gx3_16x2x1x1x800_droundrobin
- *_*_restart_gx3_16x2x2x2x200_droundrobin

This was reported to Intel and they suggested setting the variable
'I_MPI_FABRICS' to 'ofi' (the default being 'shm:ofi' [1]). This
disables shared memory transport and indeed fixes the failures.

Set this variable for all ECCC machine files using Intel MPI.

[1] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-fabrics-control/communication-fabrics-control.html
Intel MPI, in contrast to OpenMPI (as far as I was able to test, and see
[1], [2]), does not (by default) guarantee that repeated runs of the same
code on the same machine with the same number of MPI ranks yield the
same results when collective operations (e.g. 'MPI_ALLREDUCE') are used.

Since the VP solver uses MPI_ALLREDUCE in its algorithm, this leads to
repeated runs of the code giving different answers, and baseline
comparing runs with code built from the same commit failing.

When generating a baseline or comparing against an existing baseline,
set the environment variable 'I_MPI_CBWR' to 1 for ECCC machine files
using Intel MPI [3], so that (processor) topology-aware collective
algorithms are not used and results are reproducible.

Note that we do not need to set this variable on robert or underhill, on
which jobs have exclusive node access and thus job placement (on
processors) is guaranteed to be reproducible.

[1] https://stackoverflow.com/a/45916859/
[2] https://scicomp.stackexchange.com/a/2386/
[3] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html#i-mpi-adjust-family-environment-variables_GUID-A5119508-5588-4CF5-9979-8D60831D1411
If starting a run with "ice_ic='none'" (no ice), the linearized
problem for the ice velocity A x = b will have b = 0, since all terms in
the right hand side vector will be zero:

- strint[xy] is zero because the velocity is zero
- tau[xy] is zero because the ocean velocity is also zero
- [uv]vel_init is zero
- strair[xy] is zero because the concentration is zero
- strtlt[xy] is zero because the ocean velocity is zero

We thus have a linear system A x = b with b=0, so we
must have x=0.

In the FGMRES linear solver, this special case is not taken into
account, and so we end up with an all-zero initial residual since
workspace_[xy] is also zero because of the all-zero initial guess
'sol[xy]', which corresponds to the initial ice velocity. This then
leads to a division by zero when normalizing the first Arnoldi vector.

Fix this special case by computing the norm of the right-hand-side
vector before starting the iterations, and exiting early if it is zero.
This is in line with the GMRES implementation in SciPy [1].

[1] https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/sparse/linalg/_isolve/iterative.py#L620-L628

Close: #42
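
A minimal sketch of the early exit, assuming a helper that returns the global L2 norm of the right-hand-side vector (names and argument lists are illustrative, not the exact ones used in 'fgmres'):

    rhs_norm = global_norm(bx, by)   ! ||b|| over ice points, MPI-reduced
    if (rhs_norm == c0) then         ! b = 0 implies x = 0
       solx(:,:,:) = c0
       soly(:,:,:) = c0
       nbiter = 0                    ! no iterations performed
       return
    endif
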
The VP solver uses a linear solver, FGMRES, as part of the non-linear
iteration. The FGMRES algorithm involves computing the norm of a
distributed vector field, thus performing global sums.

These norms are computed by first summing the squared X and Y components
of a vector field in subroutine 'calc_L2norm_squared', summing these
over the local blocks, and then doing a global (MPI) sum using
'global_sum'.

This approach does not lead to reproducible results when the MPI
distribution, or the number of local blocks, is changed, for reasons
explained in the "Reproducible sums" section of the Developer Guide
(mostly, floating point addition is not associative). This was partly
pointed out in [1] but I failed to realize it at the time.

Introduce a new function, 'global_dot_product', to encapsulate the
computation of the dot product of two grid vectors, each split into two
arrays (for the X and Y components).

Compute the reduction locally as is done in 'calc_L2norm_squared', but
throw away the result and use the existing 'global_sum' function when
'bfbflag' is active, passing it the temporary array used to compute the
element-by-element product.

This approach avoids a performance regression from the added work done
in 'global_sum', such that non-bfbflag runs are as fast as before.

Note that since 'global_sum' loops over the whole array (and not just ice
points, as 'global_dot_product' does), we make sure to zero-initialize the
'prod' local array.

Also add a 'global_norm' function implemented using
'global_dot_product'. Both functions will be used in subsequent commits
to ensure bit-for-bit reproducibility.

[1] CICE-Consortium#491 (comment)
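
A sketch of the two helpers as described above (single-block loop shown; the block loop, declarations and argument lists are abridged and illustrative):

    prod(:,:,:) = c0          ! global_sum loops over the whole array,
                              ! so non-ice points must be zeroed
    local_dot = c0
    do ij = 1, icellu(iblk)
       i = indxui(ij,iblk)
       j = indxuj(ij,iblk)
       prod(i,j,iblk) = ux(i,j,iblk)*vx(i,j,iblk) + uy(i,j,iblk)*vy(i,j,iblk)
       local_dot      = local_dot + prod(i,j,iblk)
    enddo
    if (bfbflag == 'off') then
       dot = global_sum(local_dot, distrb_info)                 ! scalar, fast path
    else
       dot = global_sum(prod, distrb_info, field_loc_NEcorner)  ! reproducible path
    endif

and the norm is then simply global_norm(ux,uy) = sqrt(global_dot_product(ux,uy,ux,uy)).
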
ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit output reproducibility

Make the results of the VP solver reproducible if desired by refactoring
the code to use the subroutines 'global_norm' and 'global_dot_product'
added in the previous commit.

The same pattern appears in the FGMRES solver (subroutine 'fgmres'), the
preconditioner 'pgmres' which uses the same algorithm, and the
Classical and Modified Gram-Schmidt algorithms in 'orthogonalize'.

These modifications do not change the number of global sums in the
fgmres, pgmres and the MGS algorithm. For the CGS algorithm, there is
(in theory) a slight performance impact as 'global_dot_product' is
called inside the loop, whereas previously we called
'global_allreduce_sum' after the loop to compute all 'initer' sums at
the same time.

To keep that optimization, we would have to implement a new interface
'global_allreduce_sum' which would take an array of shape
(nx_block,ny_block,max_blocks,k) and sum over their first three
dimensions before performing the global reduction over the k dimension.

We choose not to go that route for now, mostly because in any case the CGS
algorithm is (by default) only used for the PGMRES preconditioner, and
so the cost should be relatively low as 'initer' corresponds to
'dim_pgmres' in the namelist, which should be kept low for efficiency
(default 5).

These changes lead to bit-for-bit reproducibility (the decomp_suite
passes) when using 'precond=ident' and 'precond=diag' along with
'bfbflag=reprosum'. 'precond=pgmres' is still not bit-for-bit because
some halo updates are skipped for efficiency. This will be addressed in
a following commit.

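A sketch of the CGS loop after this change (variable names are illustrative; 'initer' and 'global_dot_product' are as discussed above):

    ! one reproducible reduction per previous Arnoldi vector, instead of a
    ! single batched 'global_allreduce_sum' over all 'initer' coefficients
    do it = 1, initer
       hessenberg(it,initer) = global_dot_product(arnoldi_x(:,:,:,it), &
                                                  arnoldi_y(:,:,:,it), &
                                                  workspace_x, workspace_y)
    enddo
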
The 'pgmres' subroutine implements a separate GMRES solver and is used
as a preconditioner for the FGMRES linear solver. Since it is only a
preconditioner, it was decided to skip the halo updates after computing
the matrix-vector product (in 'matvec'), for efficiency.

This leads to non-reproducibility since the content of the non-updated
halos depends on the block / MPI distribution.

Add the required halo updates, but only perform them when we are
explicitly asking for bit-for-bit global sums, i.e. when 'bfbflag' is
set to something other than 'off'.

Adjust the interfaces of 'pgmres' and 'precondition' (from which
'pgmres' is called) to accept 'halo_info_mask', since it is needed for
masked updates.

Closes CICE-Consortium#518
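
A minimal sketch of the added conditional update ('ice_HaloUpdate' shown schematically; the field location/type arguments and the handling of masked versus unmasked halo info are abridged):

    if (bfbflag /= 'off') then
       call ice_HaloUpdate (workspace_x, halo_info_mask, &
                            field_loc_NEcorner, field_type_vector)
       call ice_HaloUpdate (workspace_y, halo_info_mask, &
                            field_loc_NEcorner, field_type_vector)
    endif
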
ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit log reproducibility

In the previous commits we ensured bit-for-bit reproducibility of the
outputs when using the VP solver.

Some global norms computed during the nonlinear iteration still use the
same non-reproducible pattern of summing over blocks locally before
performing the reduction. However, these norms are used only to monitor
the convergence in the log file, as well as to exit the iteration when
the required convergence level is reached ('nlres_norm'). Only
'nlres_norm' could (in theory) influence the output, but it is unlikely
that a difference due to floating point errors would influence the 'if
(nlres_norm < tol_nl)' condition used to exit the nonlinear iteration.

Change these remaining cases to also use 'global_norm', leading to
bit-for-bit log reproducibility.
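
Schematically, the monitoring and the exit check now go through the same reproducible path (a sketch only; 'nlres_norm' and 'tol_nl' are from the text above, while the monitoring flag and write statement are illustrative):

    nlres_norm = global_norm(Fx, Fy)        ! nonlinear residual norm
    if (monitor_nonlin .and. my_task == master_task) &
       write(nu_diag, *) 'nonlinear residual norm =', nlres_norm
    if (nlres_norm < tol_nl) exit           ! converged: leave the nonlinear loop
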
The previous commit removed the last caller of 'calc_L2norm_squared'.
Remove the subroutine.

Also, do not compute 'sum_squared' in 'residual_vec', since the variable
'L2norm' which receives this value is also unused in 'anderson_solver'
since the previous commit. Remove that variable, and adjust the
interface of 'residual_vec' accordingly.
In a previous commit, we removed the sole caller of
'global_allreduce_sum' (in ice_dyn_vp::orthogonalize). We do not
anticipate that function being used elsewhere in the code, so remove it
from ice_global_reductions. Update the 'sumchk' unit test accordingly.
The previous commits made sure that the model outputs as well as the log
file output are bit-for-bit reproducible when using the VP solver by
refactoring the code to use the existing 'global_sum' subroutine.

Add a note in the documentation mentioning that 'bfbflag' is required to
get bit-for-bit reproducible results under different decompositions /
MPI counts when using the VP solver.

Also, adjust the doc about 'bfbflag=lsum8' being the same as
'bfbflag=off', since this is not the case for the VP solver: in the first
case we use the array version of 'global_sum', while in the second case
we use the scalar version.
During QC testing of the previous commit, the 5-year QC test with the
updated VP solver failed twice with "bad departure points" after a few
years of simulation. Simply bumping the number of nonlinear iterations
(maxits_nonlin) from 4 to 5 makes these failures disappear and allows the
simulations to run to completion, suggesting the solution is not
converged enough with 4 iterations.

We also noticed that in these failing cases, the relative tolerance for
the linear solver (reltol_fgmres = 1E-2) is too small to be reached in
fewer than 50 iterations (maxits_fgmres), and that's the case at each
nonlinear iteration. Other papers mention a relative tolerance of 1E-1
for the linear solver, and using this value also allows both cases to
run to completion (even without changing maxits_nonlin).

Let's set the default tolerance for the linear solver to 1E-1, and let's
be conservative and bump the number of nonlinear iterations to 10. This
should give us a more converged solution and add robustness to the
default settings.
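
An illustrative namelist excerpt with the new defaults (assuming the usual dynamics_nml variable names; everything else unchanged):

    &dynamics_nml
      maxits_nonlin = 10     ! previously 4
      reltol_fgmres = 1e-1   ! previously 1e-2
    /
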
@phil-blain (Member, Author) commented

I've just pushed a v2 addressing all feedback.

@apcraig (Contributor) commented on Oct 18, 2022

This looks great to me, I'm going to run a few tests and then will formally approve and merge.

@apcraig (Contributor) left a comment


All my testing looks good. Very happy this is all sorted out!! Thanks.

@apcraig (Contributor) commented on Oct 19, 2022

I will merge in the next day or so unless someone says "not yet". Thanks!

@apcraig merged commit 16b78da into CICE-Consortium:main on Oct 20, 2022
@phil-blain deleted the repro-vp branch on February 19, 2024