
ice_dyn_vp: allow for bit-for-bit reproducibility under bfbflag #774

Merged (17 commits, Oct 20, 2022)

Commits on Oct 17, 2022

  1. Commit 782d19e
  2. doc: correct default value of 'maxits_nonlin'

    The "Table of namelist options" in the user guide lists 'maxits_nonlin'
    as having a default value of 1000, whereas its actual default is 4, both
    in the namelist and in 'ice_init.F90'. This has been the case since the
    original implementation of the implicit solver in f7fd063 (dynamics: add
    implicit VP solver (CICE-Consortium#491), 2020-09-22).
    
    Fix the documentation.
    phil-blain committed Oct 17, 2022
    Commit 5263344
  3. doc: VP solver is validated with OpenMP

    When the implicit VP solver was added in f7fd063 (dynamics: add implicit
    VP solver (CICE-Consortium#491), 2020-09-22), it had not yet been tested with OpenMP
    enabled.
    
    The OpenMP implementation was carefully reviewed and then fixed in
    d1e972a (Update OMP (CICE-Consortium#680), 2022-02-18), which led to all runs of the
    'decomp' suite completing and all restart tests passing. The 'bfbcomp'
    tests are still failing, but this is due to the code not using the CICE
    global sum implementation correctly, which will be fixed in the next
    commits.
    
    Update the documentation accordingly.
    phil-blain committed Oct 17, 2022
    Commit edd1876
  4. ice_dyn_vp: activate OpenMP in 'dyn_prep2' loop

    When the OpenMP implementation was reviewed and fixed in d1e972a (Update
    OMP (CICE-Consortium#680), 2022-02-18), the 'PRIVATE' clause of the OpenMP directive
    for the loop where 'dyn_prep2' is called in 'implicit_solver' was
    corrected in line with what was done in 'ice_dyn_evp', but OpenMP was
    left inactive for this loop (the 'TCXOMP' sentinel was not changed to a
    real 'OMP' directive).
    
    Activate OpenMP for this loop. All runs and restart tests of the
    'decomp_suite' still pass with this change.
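    
    For illustration, this is the shape of the change (a sketch; the
    'PRIVATE' variable list here is illustrative, not the exact one used
    in 'implicit_solver'):
    
        ! Before: '!$TCXOMP' is not a recognized OpenMP sentinel, so
        ! the directive is ignored and the loop runs serially:
        !$TCXOMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block)
    
        ! After: a real '!$OMP' sentinel activates the parallel loop:
        !$OMP PARALLEL DO PRIVATE(iblk,ilo,ihi,jlo,jhi,this_block)
        do iblk = 1, nblocks
           ! ... dyn_prep2 is called here for each block ...
        enddo
        !$OMP END PARALLEL DO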
    phil-blain committed Oct 17, 2022
    Commit 04c668e
  5. Commit 714afe8
  6. machines: eccc: use PBS-enabled OpenMPI for 'ppp6_gnu'

    The system installation of OpenMPI at /usr/mpi/gcc/openmpi-4.1.2a1/ is
    not compiled with support for PBS. This leads to failures as the MPI
    runtime does not have the same view of the number of available processors
    as the job scheduler.
    
    Use our own build of OpenMPI, compiled with PBS support, for the
    'ppp6_gnu' environment.
    phil-blain committed Oct 17, 2022
    Commit cf96fc2
  7. machines: eccc: set I_MPI_FABRICS=ofi

    Intel MPI 2021.5.1, which comes with oneAPI 2022.1.2, seems to have an
    intermittent bug where a call to 'MPI_Waitall' fails with:
    
        Abort(17) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Waitall: See the MPI_ERROR field in MPI_Status for the error code
    
    and no core dump is produced. This affects at least these cases of the
    'decomp' suite:
    
    - *_*_restart_gx3_16x2x1x1x800_droundrobin
    - *_*_restart_gx3_16x2x2x2x200_droundrobin
    
    This was reported to Intel and they suggested setting the variable
    'I_MPI_FABRICS' to 'ofi' (the default being 'shm:ofi' [1]). This
    disables shared memory transport and indeed fixes the failures.
    
    Set this variable for all ECCC machine files using Intel MPI.
    
    [1] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/environment-variables-for-fabrics-control/communication-fabrics-control.html
    phil-blain committed Oct 17, 2022
    Commit 1110dcb
  8. machines: eccc: set I_MPI_CBWR for BASEGEN/BASECOM runs

    Intel MPI, in contrast to OpenMPI (as far as I was able to test, and see
    [1], [2]), does not (by default) guarantee that repeated runs of the same
    code on the same machine with the same number of MPI ranks yield the
    same results when collective operations (e.g. 'MPI_ALLREDUCE') are used.
    
    Since the VP solver uses MPI_ALLREDUCE in its algorithm, this leads to
    repeated runs of the code giving different answers, and baseline
    comparing runs with code built from the same commit failing.
    
    When generating a baseline or comparing against an existing baseline,
    set the environment variable 'I_MPI_CBWR' to 1 for ECCC machine files
    using Intel MPI [3], so that (processor) topology-aware collective
    algorithms are not used and results are reproducible.
    
    Note that we do not need to set this variable on robert or underhill, on
    which jobs have exclusive node access and thus job placement (on
    processors) is guaranteed to be reproducible.
    
    [1] https://stackoverflow.com/a/45916859/
    [2] https://scicomp.stackexchange.com/a/2386/
    [3] https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html#i-mpi-adjust-family-environment-variables_GUID-A5119508-5588-4CF5-9979-8D60831D1411
    phil-blain committed Oct 17, 2022
    Commit 1f46305
  9. ice_dyn_vp: fgmres: exit early if right-hand-side vector is zero

    If starting a run with "ice_ic='none'" (no ice), the linearized
    problem for the ice velocity A x = b will have b = 0, since all terms in
    the right hand side vector will be zero:
    
    - strint[xy] is zero because the velocity is zero
    - tau[xy] is zero because the ocean velocity is also zero
    - [uv]vel_init is zero
    - strair[xy] is zero because the concentration is zero
    - strtlt[xy] is zero because the ocean velocity is zero
    
    We thus have a linear system A x = b with b=0, so we
    must have x=0.
    
    In the FGMRES linear solver, this special case is not taken into
    account, and so we end up with an all-zero initial residual since
    workspace_[xy] is also zero because of the all-zero initial guess
    'sol[xy]', which corresponds to the initial ice velocity. This then
    leads to a division by zero when normalizing the first Arnoldi vector.
    
    Fix this special case by computing the norm of the right-hand-side
    vector before starting the iterations, and exiting early if it is zero.
    This is in line with the GMRES implementation in SciPy [1].
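    
    As an illustrative sketch (not the exact code: 'rhs_norm' and the
    local helper 'local_sum_squared' are placeholders, and 'c0' is CICE's
    zero constant):
    
        ! compute ||b|| once, before starting the Arnoldi iterations
        rhs_norm = sqrt(global_sum(local_sum_squared(bx, by), distrb_info))
        ! if b = 0 then x = 0 solves A x = b; return it directly
        ! instead of dividing by a zero residual norm later on
        if (rhs_norm == c0) then
           solx = c0
           soly = c0
           return
        endif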
    
    [1] https://github.com/scipy/scipy/blob/651a9b717deb68adde9416072c1e1d5aa14a58a1/scipy/sparse/linalg/_isolve/iterative.py#L620-L628
    
    Close: #42
    phil-blain committed Oct 17, 2022
    Commit f3160a5
  10. ice_dyn_vp: add global_norm, global_dot_product functions

    The VP solver uses a linear solver, FGMRES, as part of the non-linear
    iteration. The FGMRES algorithm involves computing the norm of a
    distributed vector field, thus performing global sums.
    
    These norms are computed by first summing the squared X and Y components
    of a vector field in subroutine 'calc_L2norm_squared', summing these
    over the local blocks, and then doing a global (MPI) sum using
    'global_sum'.
    
    This approach does not lead to reproducible results when the MPI
    distribution, or the number of local blocks, is changed, for reasons
    explained in the "Reproducible sums" section of the Developer Guide
    (mostly, floating point addition is not associative). This was partly
    pointed out in [1] but I failed to realize it at the time.
    
    Introduce a new function, 'global_dot_product', to encapsulate the
    computation of the dot product of two grid vectors, each split into two
    arrays (for the X and Y components).
    
    Compute the reduction locally as is done in 'calc_L2norm_squared', but
    throw away the result and use the existing 'global_sum' function when
    'bfbflag' is active, passing it the temporary array used to compute the
    element-by-element product.
    
    This approach avoids a performance regression in non-bfbflag runs,
    which would otherwise pay for the added work done in the array version
    of 'global_sum'; they remain as fast as before.
    
    Note that since 'global_sum' loops over the whole array (and not just
    ice points, as 'global_dot_product' does), the 'prod' local array must
    be zero-initialized.
    
    Also add a 'global_norm' function implemented using
    'global_dot_product'. Both functions will be used in subsequent commits
    to ensure bit-for-bit reproducibility.
    
    [1] CICE-Consortium#491 (comment)
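    
    A rough sketch of the pattern described above (the interface is
    simplified; the actual function also takes block dimensions and the
    ice-point index arrays):
    
        function global_dot_product (x1, y1, x2, y2) result(dot)
           real (dbl_kind), dimension(:,:,:), intent(in) :: x1, y1, x2, y2
           real (dbl_kind), dimension(size(x1,1),size(x1,2),size(x1,3)) :: prod
           real (dbl_kind) :: dot
    
           ! zero-initialize 'prod': the array version of 'global_sum'
           ! sums the whole array, while only ice points are filled
           prod = c0
           ! element-by-element product, reduced locally (in the actual
           ! code this is a loop over ice points only)
           prod = x1*x2 + y1*y2
           dot  = sum(prod)
    
           if (bfbflag == 'off') then
              ! fast path: scalar global sum of the local result
              dot = global_sum(dot, distrb_info)
           else
              ! reproducible path: discard the local result and use
              ! the array version of 'global_sum' on 'prod'
              dot = global_sum(prod, distrb_info, field_loc_NEcorner)
           endif
        end function global_dot_product
    
        ! 'global_norm' then follows directly as
        !    norm = sqrt(global_dot_product(x, y, x, y))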
    phil-blain committed Oct 17, 2022
    Commit 58bd7c4
  11. ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit output reproducibility
    
    Make the results of the VP solver reproducible if desired by refactoring
    the code to use the subroutines 'global_norm' and 'global_dot_product'
    added in the previous commit.
    
    The same pattern appears in the FGMRES solver (subroutine 'fgmres'), the
    preconditioner 'pgmres' which uses the same algorithm, and the
    Classical and Modified Gram-Schmidt algorithms in 'orthogonalize'.
    
    These modifications do not change the number of global sums in the
    fgmres and pgmres solvers or in the MGS algorithm. For the CGS
    algorithm, there is
    (in theory) a slight performance impact as 'global_dot_product' is
    called inside the loop, whereas previously we called
    'global_allreduce_sum' after the loop to compute all 'initer' sums at
    the same time.
    
    To keep that optimization, we would have to implement a new interface
    'global_allreduce_sum' which would take an array of shape
    (nx_block,ny_block,max_blocks,k) and sum over its first three
    dimensions before performing the global reduction over the k dimension.
    
    We choose not to go that route for now, mostly because the CGS
    algorithm is (by default) only used for the PGMRES preconditioner, so
    the cost should be relatively low: 'initer' corresponds to
    'dim_pgmres' in the namelist, which should be kept low for efficiency
    (default 5).
    
    These changes lead to bit-for-bit reproducibility (the decomp_suite
    passes) when using 'precond=ident' and 'precond=diag' along with
    'bfbflag=reprosum'. 'precond=pgmres' is still not bit-for-bit because
    some halo updates are skipped for efficiency. This will be addressed in
    a following commit.
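    
    For illustration, the CGS loop then looks roughly like this (a
    sketch; the names follow 'ice_dyn_vp' but details are simplified):
    
        do it = 1, initer
           ! one reproducible global sum per basis vector, instead of
           ! a single 'global_allreduce_sum' over all 'initer' sums
           hessenberg(it,initer) = global_dot_product(arnoldi_basis_x(:,:,:,it), &
                                                      arnoldi_basis_y(:,:,:,it), &
                                                      workspace_x, workspace_y)
        enddo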
    
    phil-blain committed Oct 17, 2022
    Commit 02eeb42
  12. ice_dyn_vp: do not skip halo updates in 'pgmres' under 'bfbflag'

    The 'pgmres' subroutine implements a separate GMRES solver and is used
    as a preconditioner for the FGMRES linear solver. Since it is only a
    preconditioner, it was decided to skip the halo updates after computing
    the matrix-vector product (in 'matvec'), for efficiency.
    
    This leads to non-reproducibility since the content of the non-updated
    halos depends on the block / MPI distribution.
    
    Add the required halo updates, but only perform them when we are
    explicitly asking for bit-for-bit global sums, i.e. when 'bfbflag' is
    set to something other than 'off'.
    
    Adjust the interfaces of 'pgmres' and 'precondition' (from which
    'pgmres' is called) to accept 'halo_info_mask', since it is needed for
    masked updates.
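    
    A sketch of the conditional update (simplified; the actual code also
    handles the masked updates via 'halo_info_mask'):
    
        if (bfbflag /= 'off') then
           call ice_HaloUpdate (workspace_x, halo_info, &
                                field_loc_NEcorner, field_type_vector)
           call ice_HaloUpdate (workspace_y, halo_info, &
                                field_loc_NEcorner, field_type_vector)
        endif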
    
    Closes CICE-Consortium#518
    phil-blain committed Oct 17, 2022
    Commit 762a915
  13. ice_dyn_vp: use global_{norm,dot_product} for bit-for-bit log reproducibility
    
    In the previous commits we ensured bit-for-bit reproducibility of the
    outputs when using the VP solver.
    
    Some global norms computed during the nonlinear iteration still use the
    same non-reproducible pattern of summing over blocks locally before
    performing the reduction. However, these norms are used only to monitor
    the convergence in the log file, as well as to exit the iteration when
    the required convergence level is reached ('nlres_norm'). Only
    'nlres_norm' could (in theory) influence the output, but it is unlikely
    that a difference due to floating point errors would influence the 'if
    (nlres_norm < tol_nl)' condition used to exit the nonlinear iteration.
    
    Change these remaining cases to also use 'global_norm', leading to
    bit-for-bit log reproducibility.
    phil-blain committed Oct 17, 2022
    Commit c1c7c0b
  14. ice_dyn_vp: remove unused subroutine and cleanup interfaces

    The previous commit removed the last caller of 'calc_L2norm_squared'.
    Remove the subroutine.
    
    Also, do not compute 'sum_squared' in 'residual_vec', since the
    variable 'L2norm' which receives this value has been unused in
    'anderson_solver' since the previous commit. Remove that variable, and
    adjust the interface of 'residual_vec' accordingly.
    phil-blain committed Oct 17, 2022
    Commit 962df1a
  15. ice_global_reductions: remove 'global_allreduce_sum'

    In a previous commit, we removed the sole caller of
    'global_allreduce_sum' (in ice_dyn_vp::orthogonalize). We do not
    anticipate that this function will be used elsewhere in the code, so
    remove it from ice_global_reductions. Update the 'sumchk' unit test
    accordingly.
    phil-blain committed Oct 17, 2022
    Commit 108fa0e
  16. doc: mention VP solver is only reproducible using 'bfbflag'

    The previous commits made sure that the model outputs as well as the log
    file output are bit-for-bit reproducible when using the VP solver by
    refactoring the code to use the existing 'global_sum' subroutine.
    
    Add a note in the documentation mentioning that 'bfbflag' is required to
    get bit-for-bit reproducible results under different decompositions /
    MPI counts when using the VP solver.
    
    Also, adjust the doc about 'bfbflag=lsum8' being the same as
    'bfbflag=off', since this is not the case for the VP solver: with
    'lsum8' we use the array version of 'global_sum', whereas with 'off'
    we use the scalar version.
    phil-blain committed Oct 17, 2022
    Commit d8bc816
  17. ice_dyn_vp: improve default parameters for VP solver

    During QC testing of the previous commit, the 5-year QC test with the
    updated VP solver failed twice with "bad departure points" after a few
    years of simulation. Simply bumping the number of nonlinear iterations
    (maxits_nonlin) from 4 to 5 makes these failures disappear and allows
    the simulations to run to completion, suggesting the solution is not
    converged enough with 4 iterations.
    
    We also noticed that in these failing cases, the relative tolerance
    for the linear solver (reltol_fgmres = 1E-2) is too tight to be reached
    in fewer than 50 iterations (maxits_fgmres), and that this happens at
    each nonlinear iteration. Other papers mention a relative tolerance of
    1E-1 for the linear solver, and using this value also allows both cases
    to run to completion (even without changing maxits_nonlin).
    
    Let's set the default tolerance for the linear solver to 1E-1, and let's
    be conservative and bump the number of nonlinear iterations to 10. This
    should give us a more converged solution and add robustness to the
    default settings.
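    
    In namelist form, the new defaults are (a sketch, assuming the
    'dynamics_nml' group that holds the VP solver settings):
    
        &dynamics_nml
            maxits_nonlin = 10      ! previously 4
            reltol_fgmres = 1e-1    ! previously 1e-2
        /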
    phil-blain committed Oct 17, 2022
    Commit 4f7f7cf