
Fixes for uninitialized variables (snan testing) #579

Merged
apcraig merged 11 commits into CICE-Consortium:master on Mar 25, 2021

Conversation

apcraig (Contributor) commented Mar 16, 2021

PR checklist

This replaces #560.

See also #572, #247

This fixes several issues found by turning on snan (signaling NaN) debug tests with the Intel/GNU compilers. There are still outstanding issues, so snan is not on by default. Specifically, this PR:

  • adds several tests, including decomp tests with debug on, padded decomps with a 1-gridcell-wide active region in the i and j directions, a 1x1 decomp test, and some serial decomp tests.
  • fixes the halo fill in the halo update, which was not working correctly for padded blocks. The fix required storing some additional information in the halo datatype at initialization.
  • fixes issues with uvm and tarea, resolved by the halo update fill fix.
  • pulls in fixes from cicecore: fix some use of uninitialized variables #560 (sst initialization, Lsub initialization)
  • fixes a bug in the size 1 padded grid cell decomp (ice_blocks)
  • fixes a spacecurve issue
  • adds snan debug options for several machines, though many are still commented out due to outstanding issues (a sketch of how snan initialization traps uninitialized reads follows this list).
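The following is a minimal sketch (not CICE code) of the mechanism the snan testing relies on: with options such as Intel's -init=snan,arrays, reals start out as signaling NaNs, so any arithmetic on an uninitialized value raises an invalid-operation exception instead of silently propagating garbage. The program and values below are illustrative only.

! Minimal sketch: an explicitly created signaling NaN stands in for an
! uninitialized variable that -init=snan,arrays would pre-fill.
program snan_demo
   use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_signaling_nan
   implicit none
   real :: work

   ! Stand-in for memory the compiler would pre-fill with signaling NaNs.
   work = ieee_value(work, ieee_signaling_nan)

   ! With invalid-operation trapping enabled, this use of the value aborts
   ! the run, pointing directly at the uninitialized read.
   print *, 'result = ', work + 1.0
end program snan_demo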

@@ -252,8 +252,8 @@ subroutine create_blocks(nx_global, ny_global, ew_boundary_type, &
!*** set last physical point if padded domain

          else if (j_global(j,n) == ny_global .and. &
-                  j >  all_blocks(n)%jlo .and. &
+                  j >= all_blocks(n)%jlo .and. &
                   j <  all_blocks(n)%jhi) then
Reviewer (Contributor):

If I understand correctly, this change allows jlo=jhi, i.e. 1x1 domains, which weren't allowed before?

apcraig (author):

Correct. This is where the 1x1 blocks failed. The 1x1 and 2x2 are still producing different results from the other block sizes and decomps, but they are running and part of our test suite now.

apcraig (author):

Just to elaborate a bit further. Before this fix, jhi was not set properly if the active block size was 1. It remained the default (ny_block - nghost), see line 194. This section of code is where the lo/hi indices are set for padded blocks. It worked fine as long as the block size was at least 2.
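A standalone sketch of that failure mode, using simplified, assumed block sizes (this is not the actual create_blocks code):

! Hedged sketch: with a single active row, the row where j_global equals
! ny_global is exactly jlo, so the old test 'j > jlo' never fires and jhi
! keeps its (wrong) default of ny_block - nghost.
program padded_block_sketch
   implicit none
   integer, parameter :: nghost = 1, ny_block = 6   ! assumed sizes
   integer, parameter :: ny_global = 1              ! one active row in j
   integer :: j, jlo, jhi
   integer :: j_global(ny_block)

   jlo = nghost + 1                 ! first (and only) active row
   jhi = ny_block - nghost          ! default value, as noted above
   j_global = 0
   j_global(jlo) = ny_global        ! the single active row is the last global row

   do j = 1, ny_block
      ! old test: j > jlo  -> never true here, jhi stays at 5
      ! new test: j >= jlo -> fires at j = jlo and sets jhi = 2
      if (j_global(j) == ny_global .and. j >= jlo .and. j < jhi) jhi = j
   end do

   print *, 'jlo =', jlo, '  jhi =', jhi
end program padded_block_sketch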

@@ -1684,11 +1682,14 @@ subroutine makemask
tarean(:,:,iblk) = c0
tareas(:,:,iblk) = c0

-      do j = 1, ny_block
-      do i = 1, nx_block
+      do j = jlo, jhi
Reviewer (Contributor):

Are the ghost cells for these masks filled later?

apcraig (author):

They are not, but the halos are never accessed, so there is no error when we initialize with snan. There are probably many arrays that have uninitialized memory, but that's fine. In many ways, that's what we want. We want to only initialize memory that we think is used. That way we can trap code (or learn more about our codes) when an snan actually triggers an error. If we initialized everything explicitly to zero just for the sake of doing so, the snan testing wouldn't tell us anything.

enddo
enddo
!$OMP END PARALLEL DO
hm (:,:,:) = c0
TillRasmussen (Contributor) commented Mar 18, 2021:

Is it necessary to initialize to zero over the whole block (this also applies to kmt and angle)?
Wouldn't it be more correct to fill from ilo to ihi and jlo to jhi and then use ice_HaloUpdate?

apcraig (author):

There are a few different grid initializations (popgrid, popgrid_nc, latlon, rectangular, etc.), and they don't all initialize the variables in the same way. I believe zeroing these out here is required for the rectangular grid option that is used for the box setups; it is not needed for the popgrid option that we use for our more standard gx3+ testing. A better solution might be to make all the grid options behave in the same way, but it wasn't clear to me what those requirements should be, so I went this direction.
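For reference, a hedged sketch of the two options being weighed here; the array name and sizes are illustrative, and the halo-update call is only a placeholder comment, not the actual ice_HaloUpdate interface:

program grid_init_sketch
   implicit none
   integer, parameter :: nghost = 1, nx_block = 8, ny_block = 8
   integer, parameter :: ilo = nghost+1, ihi = nx_block-nghost
   integer, parameter :: jlo = nghost+1, jhi = ny_block-nghost
   real :: hm(nx_block,ny_block)
   integer :: i, j

   ! Option kept in this PR: zero the whole block up front so every grid
   ! option (popgrid, rectangular, ...) starts from defined memory.
   hm(:,:) = 0.0

   ! Alternative raised in the review: fill only the physical cells ...
   do j = jlo, jhi
      do i = ilo, ihi
         hm(i,j) = 1.0              ! illustrative interior value
      end do
   end do
   ! ... then fill the ghost cells with a halo update, e.g.
   ! call ice_HaloUpdate(...)       ! placeholder, actual signature not shown

   print *, 'corner ghost cell after interior fill:', hm(1,1)
end program grid_init_sketch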

phil-blain (Member) left a comment:

OK, so I took a look at each commit. If I understand correctly, the issues with tarea and uvm highlighted in #560 are fixed, and so #572 would be closed as well. Is that right?

However, I'm not sure I completely understand how the fix to the fill values in the halo update routine fixes the problem for uvm, since fillValue is not used for uvm... unless there is something I'm missing.

On the Git side of things, I would prefer that we not keep my temporary fixes of initializing tarea and uvm to zero, only to remove them in later commits. Even if we use a squash-merge workflow, the messages of all commits on the branch are still kept (if not manually edited out) in the merge commit message, so it can be very confusing when you are reading the commit log to try to understand the evolution of the codebase.

Comment on lines +1451 to +1453
! tcx, if l is 0, then fact has no factors, just return
if (l == 0) return

Reviewer (Member):

So if I read the diff for this commit correctly, this is the entirety of the bug fix for spacecurve, right? I'm not familiar with the algorithm...

Comment on lines 2410 to 2404
-               k = kmt(i,j,iblk)
+               k = min(nint(kmt(i,j,iblk)),nlevel)
Reviewer (Member):

Is this also something that was found with signalling NaNs? I'm not sure what this is changing...

apcraig commented Mar 22, 2021

@phil-blain, thanks for having a look. I was hoping you'd find a few minutes. uvm=0 and tarea=0 are no longer being set; I think the halo fill fixed this. The halo fill operates even if no fill value is passed, and the fill value is set internally to zero by default, so the fix does matter in those cases. So yes, this addresses #560 and #572.
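A minimal sketch of that behaviour (a hypothetical routine, not the actual ice_boundary interface): the fill value is an optional argument that falls back to zero when the caller does not pass one.

program halo_fill_default
   implicit none
   real :: halo(4)

   call fill_halo(halo)                    ! no fill value passed -> zeros
   print *, halo
   call fill_halo(halo, fillValue=-999.0)  ! explicit fill value
   print *, halo

contains

   subroutine fill_halo(h, fillValue)
      real, intent(out)          :: h(:)
      real, intent(in), optional :: fillValue
      real :: fval
      fval = 0.0                           ! internal default, as described above
      if (present(fillValue)) fval = fillValue
      h = fval
   end subroutine fill_halo

end program halo_fill_default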

The ice_spacecurve mod does fix the spacecurve error. The spacecurve implementation is not particularly robust in general, so my fix isn't wrong, but may not be how a robust implementation might deal with the issue.

The k = "...nint(kmt)..." change is needed in cases where kmt can be greater than nlevel. In that case, you'll end up referencing an element in the bathy arrays that is out of bounds. This whole implementation is a little dicey, as nlevel is hardwired to 40 with fixed depths, which may have no relationship to the kmt file used by the CICE/ocean model or the depths defined there. But the idea is that you would normally read the depth file separately if you really had that information and were coupling to an ocean model. That's what we do in RASM when we use the bathymetry/landfast ice. For landfast ice, it also probably doesn't matter much what the depth is once it's greater than a few tens of meters. Anyway, a better fix would be to get rid of the hardwired 40 levels and require that a depth profile be input when landfast ice is on. Also, kmt in CICE is a real, so we need to nint it to use the min function with nlevel.

Some of the errors that this PR fixes appeared when I turned on the signalling nans and then ran the full test suite. There are still several outstanding issues that I couldn't fix quickly, so those are deferred and the snan is NOT on by default on the machines I test on. There is still work to do in this regard. #247 is still open.

Finally, I understand the concerns about the git commit history, but I don't think it's worth making an effort to create a new branch or to try to clean up the history. I can edit some of it when I squash merge. In some ways, the squash merge does what we want, and is why I advocate for its use: it largely eliminates all that potentially chaotic history within the master branch commit series. I understand it does still appear within an individual commit unless it's reviewed and cleaned up when we squash.

Let me know if there are any other questions. Thanks!

TillRasmussen commented Mar 23, 2021

I think you could argue that use_bathymetry should be true by default. This reads in a depth from a bathymetry file. use_bathymetry=false is for a specific ocean setup used by Environment Canada (correct me if I am wrong, @phil-blain). In the case where kmt > nlevel, I think I would prefer an abort message.

eclare108213 (Contributor) commented:
I agree with @TillRasmussen that an abort is appropriate if kmt > bathymetry file levels, when bathymetry is being read in. My general philosophy is that all namelist parameters relative to a given 'master namelist flag' should be set by default to the values desired when the master flag is turned on, but it sounds like the bathymetry file should be off by default, except for our tests, since coupled models read it in outside of the sea ice component. Is that correct?

apcraig commented Mar 24, 2021

I have updated the bathymetry implementation to abort if kmt > nlevel. That seems like a good implementation.
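For reference, a hedged sketch of that kind of check (variable names follow the discussion above; the error call is a stand-in, not the actual CICE abort routine):

program kmt_abort_sketch
   implicit none
   integer, parameter :: nlevel = 40   ! hardwired level count discussed above
   real    :: kmt_val
   integer :: k

   kmt_val = 42.0                      ! example value exceeding nlevel
   k = nint(kmt_val)                   ! kmt is a real in CICE, so round it
   if (k > nlevel) then
      ! stand-in for the model's abort call
      error stop 'bathymetry error: kmt exceeds the number of levels (nlevel)'
   end if
   print *, 'bathymetry level index k =', k
end program kmt_abort_sketch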

phil-blain and others added 10 commits March 24, 2021 12:18
When CICE is compiled with the 'CICE_IN_NEMO' CPP macro, and 'calc_Tsfc'
is false, ice_step::coupling_prep calls CICE_RunMod::srcflux_to_ocn to
transfer heat fluxes on ice-free grid points to the ocean.

The calculation in srcflux_to_ocn uses the Icepack parameter 'Lsub', but
without declaring this variable or querying it from Icepack, which
results in a compile-time failure when using the standalone driver with
'CICE_IN_NEMO' defined.

Fix that by correctly declaring 'Lsub' and querying it using the Icepack
interface.

In subroutine 'init_grid2', the array 'tarea' is initialized in a loop
on indices {ilo, ihi} and {jlo, jhi}, but is then used in subroutine
'makemask' in a loop on indices {1, nx_block} and {1, ny_block},
potentially causing an uninitialized value to be used (at line 1693).

Initialize the whole array to zero first, to avoid using uninitialized values.
The initialization to zero is actually commented out, following
10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so
uncomment it.

[1] CICE-Consortium#180

In subroutine 'makemask', the array 'uvm' is initialized in a loop on indices
{ilo, ihi} and {jlo, jhi}, but is then used in a loop on indices {1, nx_block} and
{1, ny_block}, potentially causing an uninitialized value to be used (at line 1672).

Initialize the whole array to zero first, to avoid using uninitialized values.
The initialization to zero is actually commented out, following
10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so uncomment it.

[1] CICE-Consortium#180

In subroutine 'init_coupler_flux', the initialization of the 'sst' array
to 'Tf' is protected by a '#ifndef CICE_IN_NEMO' preprocessor directive.
This has been the case since CICE-Consortium/CICE-svn-trunk@151b9af
(cice-4.0~17, 2008-04-28), though that commit does not explain why.
This is probably because in some conditions (depending on active CPPs at
NEMO compilation time, as well as NEMO namelist parameters), the SST
needs to be passed early to CICE, before calling CICE_Initialize (see
the CICE coupling interface in NEMO [1]), and initializing it to 'Tf' in
'init_coupler_flux' would erase the values already passed from the
ocean.

If however the SST is *not* passed to CICE before CICE_Initialize is
called, and 'CICE_IN_NEMO' is in use, then 'ice_init::set_state_var'
reads from the uninitialized 'sst' array when placing ice where the
ocean surface is cold.

In ECCC's in-house version of the coupling interface (sbc_ice_cice),
extensive changes have been made to work around bugs in the original
version and limitations due to the sequence in which CICE and NEMO fields
are initialized, such that we manually call 'ice_init::init_state' a
second time at the end of subroutine sbc_ice_cice::cice_sbc_init.

To avoid using an uninitialized array, remove the '#ifdef
CICE_IN_NEMO' around the initialization of the 'sst' array to 'Tf', so
that it is always initialized. These values will anyway be replaced by
the "correct" ones when init_state is called a second time in our
in-house version.

[1] http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/releases/release-3.6/NEMOGCM/NEMO/OPA_SRC/SBC/sbcice_cice.F90#L176

Picked up by testing with initialized snan and debug options in decomps
Update cheyenne debug compiler options to add snan initialization
Update test suite to add debug tests for multiple decomps
- Fix bug in the halo fill in ice_boundary; it was not taking padding into account correctly
- Fix bug in ice_blocks for cases where the active size is 1 gridcell wide
- Update ice_grid initialization to clean up prior workarounds.
- Update some intel compilers to add -init=snan,arrays for debug builds
- Add new tests to the decomp test suite including tests with debugging on,
  tests with the active size 1 gridcell wide, and tests with a single block
  that's bigger than the whole grid.
Back off on some of the full debug compiler flags.
phil-blain (Member) commented:

> The k = "...nint(kmt)..." change is needed in cases where kmt can be greater than nlevel. In that case, you'll end up referencing an element in the bathy arrays that is out of bounds. This whole implementation is a little dicey, as nlevel is hardwired to 40 with fixed depths, which may have no relationship to the kmt file used by the CICE/ocean model or the depths defined there. But the idea is that you would normally read the depth file separately if you really had that information and were coupling to an ocean model. That's what we do in RASM when we use the bathymetry/landfast ice. For landfast ice, it also probably doesn't matter much what the depth is once it's greater than a few tens of meters. Anyway, a better fix would be to get rid of the hardwired 40 levels and require that a depth profile be input when landfast ice is on. Also, kmt in CICE is a real, so we need to nint it to use the min function with nlevel.

OK, I have read that part of the code more carefully and understand better what it's doing and what you are changing. If I understand correctly, the conversion to integer was already implicit since k is declared as an integer, but it's more readable if it's explicit. I also agree with Till and Elizabeth that it makes more sense to just abort if kmt > nlevel. While we're there, I think we could change k > puny to k > 0, since k is an integer but puny is a double; it would be less confusing, I think (and it's already written that way in get_bathymetry_popfile). As an aside, the whole last part of these two subroutines is in fact duplicated, so a further clean-up could be to refactor this; but that can come later.


> I think you could argue that use_bathymetry should be true by default. This reads in a depth from a bathymetry file. use_bathymetry=false is for a specific ocean setup used by Environment Canada (correct me if I am wrong, @phil-blain).

> [...] My general philosophy is that all namelist parameters relative to a given 'master namelist flag' should be set by default to the values desired when the master flag is turned on, but it sounds like the bathymetry file should be off by default, except for our tests, since coupled models read it in outside of the sea ice component. Is that correct?

To clarify: in our in-house CICE4 implementation we pass the dynamic water height (bathy + SSH) from NEMO to CICE, so no, we do not read the bathymetry field from CICE. I've not yet gotten around to revisiting that in my work on coupling our NEMO with CICE 6, but the same approach will be used. I was planning to do that soon.

With regard to what the default for use_bathymetry should be in our namelist, I think both approaches are valid. I'd say it depends on whether we want the namelist to have good defaults for coupled setups or standalone setups: for coupled runs, use_bathymetry = false (as it is now) is OK, because usually the ocean model is in charge of reading the bathymetry, but for standalone runs use_bathymetry = true would make more sense, since we do provide bathymetry files for our 3 grids, and that should give a more detailed field than the hardcoded levels...
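For concreteness, the choice being discussed would look roughly like this in the namelist; the group name here is assumed for illustration, and only use_bathymetry itself comes from the discussion:

&forcing_nml                      ! group name assumed for illustration
    use_bathymetry = .false.      ! coupled setups: the ocean model provides the bathymetry
!   use_bathymetry = .true.       ! standalone runs: read the provided bathymetry file
/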

apcraig commented Mar 24, 2021

Thanks @phil-blain, there are a few things about the bathymetry implementation that I haven't understood well. I really don't understand why we have a 40-level default at all; I guess it's for testing. The get_bathymetry_popfile was added by RASM because POP has a well-defined bathymetry file different from whatever was implemented before. I thought it was easier to create a new "read" method rather than duplicate the file in another format. Hence, it's largely a duplicate of the default implementation.

k = kmt() is not the same as k = nint(kmt()). The former is doing an "int(kmt())". We really do want an explicit nint in this case, although in practice, they seem to be producing the same result for our test cases (which may not have any landfast ice unfortunately).
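A small demonstration of that distinction (illustrative value only):

program int_vs_nint
   implicit none
   real    :: kmt_val
   integer :: k_assign, k_nint

   kmt_val  = 39.6
   k_assign = kmt_val           ! implicit conversion truncates (int) -> 39
   k_nint   = nint(kmt_val)     ! rounds to nearest                   -> 40
   print *, 'int-style assignment:', k_assign, '  nint:', k_nint
end program int_vs_nint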

I will modify the k > puny vs k > 0; I agree that's a more robust implementation. kmt is normally thought of as an integer, but it's a real in CICE. I think that's where the k/kmt type problems are showing up in the bathymetry: confusion about the types. I suspect we have (kmt() > puny) elsewhere too; when kmt is converted to an integer it should be k = nint(kmt()) with k > 0, as opposed to k = kmt() with k > puny, which is what seems to have been done.

I also agree we may want to rethink use of the 40 level hardcoded default, but as a separate PR.

apcraig commented Mar 24, 2021

I have rerun the full test suite on cheyenne after rebasing the code onto the current master. Results are as expected and are posted here: https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#52e8102274c8a948f43e3e4e6b0428250706df58

apcraig commented Mar 25, 2021

I updated the k > puny line and ran a quick suite to make sure it didn't create any problems. All seems fine with that change.

apcraig merged commit 6b399d1 into CICE-Consortium:master on Mar 25, 2021
apcraig deleted the initialize-before-use branch on August 17, 2022
Successfully merging this pull request may close these issues: Halo Update Fill Implementation Bug with Padded Blocks

4 participants