cicecore: fix some use of uninitialized variables #560

Closed

Conversation

@phil-blain (Member) commented Jan 28, 2021

PR checklist

  • Short (1 sentence) summary of your PR:
    cicecore: fix some use of uninitialized variables
  • Developer(s):
    P.Blain
  • Suggest PR reviewers from list in the column to the right.
  • Please copy the PR test results link or provide a summary of testing completed below.
    in progress
  • How much do the PR code changes differ from the unmodified code?
    • bit for bit
    • different at roundoff level
    • more substantial
  • Does this PR create or have dependencies on Icepack or any other models?
    • Yes
    • No
  • Does this PR add any new test cases?
    • Yes
    • No
  • Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
    • Yes
    • No, does the documentation need to be updated at a later time?
      • Yes
      • No
  • Please provide any additional information or relevant details below:

This is the first one of a series of small PRs I'm preparing to upstream some changes I've had to make to CICE6 during my work on coupling with NEMO3.6. This one deals with use of uninitialized values, which I've detected by compiling with -init=snan,arrays (Intel). I will start the base_suite right away and update the description with the test results once they are finished running.

See the commit messages for complete details.

When CICE is compiled with the 'CICE_IN_NEMO' CPP macro, and 'calc_Tsfc'
is false, ice_step::coupling_prep calls CICE_RunMod::srcflux_to_ocn to
transfer heat fluxes on ice-free grid points to the ocean.

The calculation in srcflux_to_ocn uses the Icepack parameter 'Lsub', but
without declaring this variable or querying it from Icepack, which
results in a compile-time failure when using the standalone driver with
'CICE_IN_NEMO' defined.

Fix that by correctly declaring 'Lsub' and querying it using the Icepack
interface.
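
For context, a minimal sketch of that fix, assuming the usual CICE pattern for querying Icepack parameters (the exact declarations and error handling in the committed code may differ):

      use ice_kinds_mod,  only: dbl_kind
      use ice_exit,       only: abort_ice
      use ice_fileunits,  only: nu_diag
      use icepack_intfc,  only: icepack_query_parameters, &
          icepack_warnings_flush, icepack_warnings_aborted

      real (kind=dbl_kind) :: Lsub   ! latent heat of sublimation (J/kg)

      ! query the parameter from Icepack instead of assuming it is in scope
      call icepack_query_parameters(Lsub_out=Lsub)
      call icepack_warnings_flush(nu_diag)
      if (icepack_warnings_aborted()) &
         call abort_ice('srcflux_to_ocn: ERROR in icepack_query_parameters')
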
In subroutine 'init_grid2', the array 'tarea' is initialized in a loop
on indices {ilo, ihi} and {jlo, jhi}, but is then used in subroutine
'makemask' in a loop on indices {1, nx_block} and {1, ny_block},
potentially causing an uninitialized value to be used (at line 1693).

Initialize the whole array to zero first, to avoid using uninitialized values.
The initialization to zero is actually commented out, following
10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so
uncomment it.

[1] CICE-Consortium#180
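
Schematically, the index-range mismatch looks like this (a simplified sketch, not the actual code; both loop bodies are paraphrased):

      ! init_grid2: only the physical cells of the block are set
      do j = jlo, jhi
      do i = ilo, ihi
         tarea(i,j,iblk) = dxt(i,j,iblk)*dyt(i,j,iblk)
      enddo
      enddo

      ! makemask: the loops also cover ghost and padded cells, so tarea
      ! can be read where it was never written (the real code also
      ! applies the land and hemisphere masks)
      do j = 1, ny_block
      do i = 1, nx_block
         tarean(i,j,iblk) = tarea(i,j,iblk)
      enddo
      enddo
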
In subroutine 'makemask', the array 'uvm' is initialized in a loop on indices
{ilo, ihi} and {jlo, jhi}, but is then used in a loop on indices {1, nx_block} and
{1, ny_block}, potentially causing an uninitialized value to be used (at line 1672).

Initialize the whole array to zero first, to avoid using uninitialized values.
The initialization to zero is actually commented out, following
10c446a (Dummy and unused variables. (CICE-Consortium#180), 2018-09-22) [1], so uncomment it.

[1] CICE-Consortium#180

In subroutine 'init_coupler_flux', the initialization of the 'sst' array
to 'Tf' is protected by a '#ifndef CICE_IN_NEMO' preprocessor directive.
This has been the case since CICE-Consortium/CICE-svn-trunk@151b9af
(cice-4.0~17, 2008-04-28), though that commit does not explain why.
This is probably because in some conditions (depending on active CPPs at
NEMO compilation time, as well as NEMO namelist parameters), the SST
needs to be passed early to CICE, before calling CICE_Initialize (see
the CICE coupling interface in NEMO [1]), and initializing it to 'Tf' in
'init_coupler_flux' would erase the values already passed from the
ocean.

If however the SST is *not* passed to CICE before CICE_Initialize is
called, and 'CICE_IN_NEMO' is in use, then 'ice_init::set_state_var'
reads from the uninitialized 'sst' array when placing ice where the
ocean surface is cold.

In ECCC's in-house version of the coupling interface (sbc_ice_cice),
extensive changes have been made to work around bugs in the original
version and limitations due to the sequence in which CICE and NEMO fields
are initialized, such that we manually call 'ice_init::init_state' a
second time at the end of subroutine sbc_ice_cice::cice_sbc_init.

To avoid using an uninitialized array, remove the '#ifndef
CICE_IN_NEMO' around the initialization of the 'sst' array to 'Tf', so
that it is always initialized. These values will be replaced by the
"correct" ones anyway when init_state is called a second time in our
in-house version.

[1] http://forge.ipsl.jussieu.fr/nemo/browser/NEMO/releases/release-3.6/NEMOGCM/NEMO/OPA_SRC/SBC/sbcice_cice.F90#L176
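
For reference, the change amounts to dropping the preprocessor guard around the assignment in init_coupler_flux, roughly as follows (the assignment line is paraphrased, not quoted from the source):

-#ifndef CICE_IN_NEMO
       sst (:,:,:) = Tf (:,:,:)   ! sea surface temperature (C)
-#endif
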
@eclare108213 (Contributor) left a comment

These changes might affect other coupled models, e.g. for restarts, so I'll ask @apcraig and @dabail10 to take a look.

@TillRasmussen (Contributor)

Generally fine. A few comments.
Tarea: loops from 1 to nx_block when calculating the northern or southern hemisphere average. This is used for diagnostics and global sums. My question is rather why it is necessary to include halo zones in a global sum of e.g. ice concentration. Hence, should the loop indices be changed instead?
uvm: There is a call to ice_halo_update with uvm just after its initialization. I would assume that the halos are updated there.
An offline topic: we should compare sbcice_cice between NEMO 3.6 and NEMO 4.

@phil-blain (Member, Author)

Thanks for taking a look, Till!

> Tarea: loops from 1 to nx_block when calculating the northern or southern hemisphere average. This is used for diagnostics and global sums. My question is rather why it is necessary to include halo zones in a global sum of e.g. ice concentration. Hence, should the loop indices be changed instead?

I think I agree. I did not look into it further when I saw that the initialization was commented out; I just uncommented it. What you suggest seems more sensible.

> uvm: There is a call to ice_halo_update with uvm just after its initialization. I would assume that the halos are updated there.

You mean that the initialization should not be necessary? Well, at least on our "CREG025-Pacific extension" grid, some values were uninitialized in the condition at line 1673 if I did not initialize uvm to zero... maybe something else is hiding with the halos on that grid...

> An offline topic: we should compare sbcice_cice between NEMO 3.6 and NEMO 4.

Yes, I looked at the changes to sbc_ice_cice in your nemo4cice project and they were mostly the same as mine, although our version has a lot of in-house modifications that I don't yet fully understand. Some things were implemented in a not-so-clean way for lack of time, and one of the mid-term goals of my project is to clean that up. We will definitely let you know.

@@ -384,7 +384,7 @@ subroutine init_grid2
! T-grid cell and U-grid cell quantities
!-----------------------------------------------------------------

-! tarea(:,:,:) = c0
+  tarea(:,:,:) = c0
Contributor:

I think this is unnecessary as tarea is in a HaloUpdate shortly after it is computed. I don't think you want any zero values in tarea at all.

@phil-blain (Member, Author):

Thanks for the review. I agree that it should not be necessary, but as I wrote, if I do not initialize here, an uninitialized value gets used... So that would mean that not all halo cells are initialized by the halo update. I'll have to investigate that further on the specific grid we are using.

Contributor:

It could be that your compiler is being very picky. What line does it abort on exactly with the uninitialized variable? I've done the check with NAG and Intel and it runs fine. As you say, it might be a particular feature of your grid. There was an issue a while ago from one of the European groups that trapped this same error. I am fine with initializing uvm to zero, but tarea should never be zero.

@phil-blain (Member, Author), Jan 29, 2021

It aborts at line 1694 in makemask

Contributor:

Ok. A cleaner initialization would be as follows:

do j = 1, ny_block
do i = 1, nx_block

  im = max(i,ilo)
  jm = max(j,jlo)
  ip = min(i,ihi)
  jp = min(j,jhi)

Then replace all of the i,j indexes on the right with ip/jp. This will extrapolate the tarea values to the ghost cells including the four corners. Thoughts?

Contributor:

Sorry, these should all be im/jm or ip/jp.
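
Put together, the extrapolation being suggested here would look roughly like the following (a hypothetical sketch that was never merged; ic and jc are illustrative names for clamped indices, and the loop body is taken from the makemask diff shown later in this thread):

      do j = 1, ny_block
      do i = 1, nx_block
         ! clamp to the block's physical range so that ghost and padded
         ! cells reuse the nearest interior value
         ic = min(max(i,ilo), ihi)
         jc = min(max(j,jlo), jhi)
         if (ULAT(ic,jc,iblk) >= -puny) lmask_n(i,j,iblk) = .true. ! N. Hem.
         if (ULAT(ic,jc,iblk) <  -puny) lmask_s(i,j,iblk) = .true. ! S. Hem.
      enddo
      enddo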

Contributor:

If the maskhalo is turned on, it's possible that it is also masking part of the initialization. Just speculation on my part; we'd have to look into it further.

Contributor:

Actually, the maskhalo is not being used for the halo update of tarea, which is correct, so forget I said that. I assume tarea is not initialized on the entire halo because of the land mask? We could also fix the loop in makemask, but that's likely to affect other loops in other parts of the code in the same way. This is almost certainly a case of variables not being initialized in places they are not needed, but where they are referenced at some point later. The proper fix, of consistently implementing all loops, is probably not worth the effort, but if we want to update a loop here or there, that's probably OK too. I'm not opposed to this type of fix in this case.

@@ -1636,7 +1636,7 @@ subroutine makemask
!-----------------------------------------------------------------

bm = c0
-! uvm = c0
+  uvm = c0
Contributor:

Similarly here. The HaloUpdate should take care of uvm values in the ghost cells.

@phil-blain (Member, Author):

It aborts at line 1673 in makemask

@dabail10 (Contributor) left a comment

No, I meant the Lsub initialization. This is done in ice_forcing.F90.

@TillRasmussen (Contributor)

The European group may have been DMI. I should look at the report Jacob wrote some time ago

@dabail10 (Contributor)

It was Issue #61. I have reopened it and linked it here.

@dabail10 dabail10 linked an issue Jan 29, 2021 that may be closed by this pull request
@phil-blain phil-blain mentioned this pull request Jan 29, 2021
@apcraig (Contributor) commented Feb 4, 2021

@phil-blain, is testing complete, is this ready to merge?

@phil-blain (Member, Author)

Not ready to merge at all. I've yet to go back into the code and come up with a better solution for tarea, and to understand why this issue comes up only on the grid we're using. I'm working on that.

@phil-blain (Member, Author) commented Mar 2, 2021

OK, I'm beginning to understand a little bit more. It seems to be due to the fact that the domain decomposition we are using has some padded blocks.

I was able to plot tarea(:,:,iblk) from a core dump for some values of iblk and indeed some values are missing (here iblk = 33 and 34):

[plots of tarea(:,:,iblk) for iblk = 33 and iblk = 34]

So my guess at this point is that the halo updates are not correctly updating padded blocks? The darker line around the edge of the plots has the value 1.0; I don't know yet where this value is coming from...

@TillRasmussen (Contributor)

I think that the 1.0 originates from the ice_HaloUpdate call:

      call ice_HaloUpdate (tarea, halo_info, &
           field_loc_center, field_type_scalar, &
           fillValue=c1)

@dabail10 (Contributor) commented Mar 2, 2021

Good catch. Cell padding has always been a headache for me; I try to avoid it. I think that the cells in the padded area are not even treated as ghost cells, so halo updates do not touch them. However, the model should never reach into them. In my opinion, the compiler is just being pedantic about uninitialized variables here.

@TillRasmussen (Contributor)

If I understand the code correctly, the block sizes (nx_block and ny_block) are constant, whereas jhi and ihi are adjusted to the global domain size. Then, in the last block towards the east and north, a loop from 1 to ny_block or 1 to nx_block may run out of the global domain (padding from e.g. ihi to nx_block).
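
In code terms, that picture would look roughly like this (an illustrative sketch only; the names follow the block data structure shown in the next comment, and the 'padded' flag is not a CICE variable):

      type (block) :: this_block   ! from ice_blocks
      logical      :: padded       ! illustrative flag only

      this_block = get_block(blocks_ice(iblk), iblk)
      ! physical cells: ilo:ihi, jlo:jhi
      ! ghost cells   : the nghost rows/columns directly outside the physical cells
      ! padded cells  : anything beyond ihi+nghost or jhi+nghost, up to
      !                 nx_block, ny_block (eastern-/northern-most blocks only)
      padded = (this_block%ihi + nghost < nx_block) .or. &
               (this_block%jhi + nghost < ny_block)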

@apcraig (Contributor) commented Mar 2, 2021

I believe that is all correct. Each block has its own ilo, ihi, jlo, jhi defined. That's what this is all about:

      do iblk = 1, nblocks
         this_block = get_block(blocks_ice(iblk),iblk)         
         ilo = this_block%ilo
         ihi = this_block%ihi
         jlo = this_block%jlo
         jhi = this_block%jhi

The padded areas should not be used in general. If they are, we might want to try to fix that. I assume the halos do work correctly on the padded blocks. We do have some bit-for-bit checks for different decomps, some padded. Is the padded cell halo index ihi+1 or is it still nx+1? I actually don't know.

@eclare108213 (Contributor)

Not looking at the code, so someone needs to confirm this, but what I recall is that the padded cells start outside of the halo cells, and they should be treated as 'missing'. Entire missing (land) blocks are never accessed, but it's possible that 'missing' padded cells still get noticed by compilers, even if they are never accessed. If they are being accessed, then there's definitely a bug.

@dabail10 (Contributor) commented Mar 2, 2021

Very good points. Don't we use nghost since we might have more than one ghost cell in each direction? The ilo, ihi, jlo, jhi and nx_block, ny_block are unique for each block, including the padded areas. There could be loops that only go to +/- nghost relative to the lo/hi indices. The padded areas should still be "interior" cells, just masked out, right? This has always confused me.

@eclare108213 (Contributor) commented Mar 2, 2021

The padded cells are assigned a global index of zero, which can never occur for physical cells. This is how they are identified and avoided for boundary conditions, parallelization, etc. ilo:ihi, jlo:jhi index only the physical domain, and nx_block, ny_block are those physical cells plus ONE row or column of ghost cells [EDIT: on each side of the physical domain]. The code was written in a general way to allow more ghost cells, but this has never been tried and there are comments in the code saying that some routines will need to be updated to use more than one. Padded cells lie beyond nx_block, ny_block for the eastern- and northern-most blocks, and they should never be accessed in the main code, once the decomposition has been done. It's possible that the haloes along those edges aren't right (especially the two tripole forms), but the code has been exercised enough that I trust those pretty well.

@eclare108213 (Contributor)

I agree. There are a lot of problems or inconsistencies in the code that we are gradually fixing, as we come across them. Let's definitely fix the loops that are causing the crashes in this case, and initialize stuff better if that's what's needed. Otherwise we can keep our eyes open for similar issues as new code modifications go in.

As an aside, it would be interesting to know how much of the code has 'aged' versus getting recoded or modified somehow. Probably a lot of it turns over, so to speak, every 10 years or so. Although the 'blame' button gives some info, it doesn't necessarily capture whitespace changes... I'm not proposing we figure this out! But I've also wondered how many lines of code we actually have (as opposed to comments and blank lines). Just curious.

@phil-blain (Member, Author)

[I thought I had hit the "Comment" button on Friday, but it seems not!]

Yeah, after thinking about this since I posted my response above, I also think that fixing the loop indices where they need to be fixed is a better approach. This way we should easily find other places in the code that need to be fixed, whereas initializing the padded cells would "hide" possible loop-index "errors" in those other places.

I will also have a closer look at the halo update code to understand what's going on with the padded cells.

@apcraig (Contributor) commented Mar 8, 2021

I just ran the following test on cheyenne with 3 compilers,

smoke_gx3_8x2x8x10x20_debug_diag1_droundrobin_run2day

and all pass. There is a long-standing issue to update the debug flags to check more things and initialize data to NaNs. @phil-blain, can you tell me what compiler flags you're using for your testing? This is with an Intel compiler, correct?

It might be worthwhile to expand this PR to extend the debug flags, add a debug padded test to the base suite, test with multiple compilers, and fix loops. I think that requires some comprehensive effort, and I'm willing to take that on. Maybe we can include it by the end-of-the-month release? I'm also happy to just merge this PR as a starting point to fix the issues that @phil-blain has run into. We can then continue addressing the issue moving forward as I've suggested.

@phil-blain (Member, Author)

[@apcraig I was writing this as I received your answer, so I'll read it and answer separately]

OK, some updates now:

I ran the decomp_suite with arrays initialized to NaNs (without changing the loop indices and without initializing tarea) and I can reproduce the failure that I get on our grid:

$ cat results.log | \grep FAIL
FAIL daley_intel_restart_gx3_6x2x4x29x18_debug_dspacecurve run
FAIL daley_intel_restart_gx3_6x2x4x29x18_debug_dspacecurve test 
FAIL daley_intel_restart_gx3_8x2x8x10x20_debug_droundrobin run
FAIL daley_intel_restart_gx3_8x2x8x10x20_debug_droundrobin test 
FAIL daley_intel_restart_gx3_4x2x19x19x10_debug_droundrobin run
FAIL daley_intel_restart_gx3_4x2x19x19x10_debug_droundrobin test 
FAIL daley_intel_restart_gx3_16x2x5x10x20_debug_drakeX2 run
FAIL daley_intel_restart_gx3_16x2x5x10x20_debug_drakeX2 test 
FAIL daley_intel_restart_gx3_8x2x8x10x20_debug_droundrobin_maskhalo run
FAIL daley_intel_restart_gx3_8x2x8x10x20_debug_droundrobin_maskhalo test 

So 5 tests are failing. Apart from the first one ("dspacecurve"; see below), they all abort at the same place in makemask when they access tarea (the same place as on our custom grid).

Next, I tried to change the loop indices for that loop in makemask:

diff --git i/cicecore/cicedynB/infrastructure/ice_grid.F90 w/cicecore/cicedynB/infrastructure/ice_grid.F90
index 5308236..33cae7f 100644
--- i/cicecore/cicedynB/infrastructure/ice_grid.F90
+++ w/cicecore/cicedynB/infrastructure/ice_grid.F90
@@ -1684,8 +1684,14 @@ subroutine makemask
          tarean(:,:,iblk) = c0
          tareas(:,:,iblk) = c0
 
-         do j = 1, ny_block
-         do i = 1, nx_block
+         this_block = get_block(blocks_ice(iblk),iblk)
+         ilo = this_block%ilo
+         ihi = this_block%ihi
+         jlo = this_block%jlo
+         jhi = this_block%jhi
+
+         do j = jlo, jhi
+         do i = ilo, ihi
 
             if (ULAT(i,j,iblk) >= -puny) lmask_n(i,j,iblk) = .true. ! N. Hem.
             if (ULAT(i,j,iblk) <  -puny) lmask_s(i,j,iblk) = .true. ! S. Hem.

Here I used [ilo, ihi], [jlo, jhi]: I don't see any reason to use [ilo - nghost, ihi + nghost], [jlo - nghost, jhi + nghost], since we should not be accessing the tarea[ns] arrays outside of [ilo, ihi], [jlo, jhi].

I then reran daley_intel_restart_gx3_8x2x8x10x20_debug_droundrobin and it failed with a floating invalid error (again accessing a NaN-initialized location), in to_ugrid at line 1929:

      jhi = this_block%jhi
      do j = jlo, jhi
      do i = ilo, ihi
         work2(i,j,iblk) = p25 * &
             (work1(i,  j,  iblk)*tarea(i,  j,  iblk) &
            + work1(i+1,j,  iblk)*tarea(i+1,j,  iblk) &
            + work1(i,  j+1,iblk)*tarea(i,  j+1,iblk) &
            + work1(i+1,j+1,iblk)*tarea(i+1,j+1,iblk)) &
            / uarea(i,  j,  iblk)
      enddo
      enddo
      enddo
      !$OMP END PARALLEL DO

      end subroutine to_ugrid
!=======================================================================
! Transfer from U-cell centers to T-cell centers. Writes work into

so it fails when accessing tarea(i, j+1, iblk). This really makes me think that something is fishy with the halo updates in the case of padded cells.


One last thing, about the "dspacecurve" test: it's failing in a different way, so this might be a different bug, but just to document it here (we might want to open a separate issue for that):

 ice_domain work_unit, max_work_unit =          608          10
 ice_domain nocn =            0        6078      261744
 ice_domain work_per_block =            0          11         570
forrtl: severe (408): fort: (3): Subscript #1 of the array FACTORS has value 0 which is less than the lower bound of 1

Image              PC                Routine            Line        Source
cice               00000000022AD476  Unknown               Unknown  Unknown
cice               00000000014E2A96  ice_spacecurve_mp        1451  ice_spacecurve.F90
cice               000000000150E9DA  ice_spacecurve_mp        1727  ice_spacecurve.F90
cice               00000000007327C0  ice_distribution_        2190  ice_distribution.F90
cice               00000000006FF72B  ice_distribution_         121  ice_distribution.F90
cice               000000000074A6B9  ice_domain_mp_ini         551  ice_domain.F90
cice               0000000000DC3AC4  ice_grid_mp_init_         298  ice_grid.F90
cice               0000000000402032  cice_initmod_mp_c         112  CICE_InitMod.F90
cice               0000000000401EB3  cice_initmod_mp_c          52  CICE_InitMod.F90
cice               000000000040168B  MAIN__                     43  CICE.F90
cice               00000000004015F2  Unknown               Unknown  Unknown
cice               0000000002392A4F  Unknown               Unknown  Unknown
cice               00000000004014DA  Unknown               Unknown  Unknown

@phil-blain (Member, Author)

@apcraig yes, this is with Intel. The flags that I use are:

FFLAGS += -O0 -g -check -fpe0 -ftrapuv -fp-model except -check noarg_temp_created -init=snan,arrays

GCC/GFortran should have equivalent flags that I do not remember off the top of my head.
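
For reference, a roughly comparable set of GNU Fortran options (a sketch only, not tested as part of this work) would be along these lines:

FFLAGS += -O0 -g -fcheck=all -ffpe-trap=invalid,zero,overflow -finit-real=snan -fbacktrace

with -finit-real=snan standing in for -init=snan,arrays and -ffpe-trap for -fpe0.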

I think we should fix the loops (or the halo updates if that is the problem - although as @JFLemieux73 and I were discussing offline last Friday, if the decomp suite is bit for bit, I have trouble understanding how they could be wrong... meaning that if we are accessing non-initialized locations and still getting bit-for-bit results even between padded and non-padded decompositions, then maybe it does not influence the outputs in the end somehow...)

Anyway, I think it's better not to do a band-aid tarea(:,:,:) = c0 patch and to really understand what's going on. I can help with that.

@apcraig (Contributor) commented Mar 8, 2021

I think we're converging on a plan. Let's try not to duplicate efforts: do you want me to look into the halo update? Or how can I help best?

The spacecurve thing has always been a little dicey in its implementation. It might be better as a separate issue. I think it depends on what tests we plan to add.

I also agree that the fact that we're getting bit-for-bit answers with different decompositions strongly suggests these problems are not influencing model results. I think that's allowed us to get away with some of this.

@TillRasmussen (Contributor)

spacecurve failed the same way when I used the machine freya with debug:
freya_intel_logbfb_gx3_6x2x4x29x18_debug_diag1_dspacecurve_long_maskhalo_reprosum

On topic: I think that it is dangerous to initialize to 0. If anything, use a huge number, in order to make the simulation fail in case someone creates a loop that enters the padding zone or other areas that are not initialized. As you have agreed, I much prefer not to initialize, and then to fix the problem when issues are found.

@apcraig (Contributor) commented Mar 8, 2021

I have duplicated @phil-blain's results. I have also looked at the halo implementation, at least initially. The halo points are always next to the ilo/ihi/jlo/jhi indices. One thing is that the fill in the halo seems to operate assuming no padding,

   do j = 1,nghost
      array(1:nx_block,           j,:) = fill
      array(1:nx_block,ny_block-j+1,:) = fill
   enddo
   do i = 1,nghost
      array(i,           1:ny_block,:) = fill
      array(nx_block-i+1,1:ny_block,:) = fill
   enddo

should probably be

   do j = 1,nghost
      array(1:nx_block, jlo-j,:) = fill
      array(1:nx_block, jhi+j,:) = fill
   enddo
   do i = 1,nghost
      array(ilo-i, 1:ny_block,:) = fill
      array(ihi+i, 1:ny_block,:) = fill
   enddo

I'm looking at the error in the to_ugrid subroutine now.

@TillRasmussen (Contributor)

I think that the 1:ny_block should also be changed to jlo-nghost:jhi+nghost, and the same in the other dimension.
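
Combining the two suggestions, the prefill would look something like the sketch below (illustrative only; as the next comment notes, ilo/ihi/jlo/jhi are not currently available where this code lives, and the whole halo, padding included, ultimately needs the fill value):

   do j = 1,nghost
      array(ilo-nghost:ihi+nghost, jlo-j, :) = fill
      array(ilo-nghost:ihi+nghost, jhi+j, :) = fill
   enddo
   do i = 1,nghost
      array(ilo-i, jlo-nghost:jhi+nghost, :) = fill
      array(ihi+i, jlo-nghost:jhi+nghost, :) = fill
   enddo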

@apcraig (Contributor) commented Mar 8, 2021

@TillRasmussen, that's correct. The protocode above is wrong. What's even more difficult is that we don't know ilo/ihi/jlo/jhi in ice_boundary.F90, where the halo update is done and prefilled, and I'm not sure we can query it because we'll end up with circular logic. The halo does a bunch of precomputation based on neighbor info and global info, but I'm not sure there is an awareness of its local list of blocks. The problem is that you want to fill the entire halo with the fill value, not just the part of the halo that's going to be filled by the haloupdate.

@apcraig (Contributor) commented Mar 9, 2021

So, I'm pretty sure that the bug in the fill setting in ice_boundary is causing the problems in to_ugrid. Unfortunately, I haven't found a way to fix that bug yet because we don't have access to the list of local blocks, so that will take a bit more work. I can create an issue.

I have duplicated all of @phil-blain's issues and fixes. I think we should proceed as follows:

  • Set tarea = 0 before it's initialized, as @phil-blain has proposed, until we can get a proper fix into ice_boundary.
  • Update the loop indices in makemask to operate only in the non-halo non-padded area
  • Add at least one test of debug with padding to our suite, maybe more
  • Update cheyenne to include snan initialization with intel
  • Look into new debug flags for gnu on cheyenne
  • Create an issue about the fill value problems in ice_boundary with padded blocks

@phil-blain, I can do a bunch of this and then PR a change onto your branch. Or, if you prefer to delete this PR, I can create a new one. We could also merge this PR with what you like, and then I can create a new PR with some additional stuff for cheyenne and maybe some new tests. I do want to test across the full test suite on cheyenne, and cheyenne is going to be down for the next 2 days, so that will slow me down a bit. But I'll start to expand my testing and see how that goes.

@apcraig (Contributor) commented Mar 9, 2021

It looks like some of these changes in makemask are changing answers. I need to dig a bit further.

@TillRasmussen TillRasmussen mentioned this pull request Mar 9, 2021
@apcraig (Contributor) commented Mar 15, 2021

I've extended this PR in a branch in my fork: I added some new tests turning on debug with various decompositions, updated the cheyenne debug options to add signaling NaNs, and fixed the spacecurve bug. I also did a bunch of validation testing for bit-for-bit. There are several subtle issues. I will run a full test suite soon, and then maybe we can merge my changes to this branch or create a new PR.

I am going to look at #572 again to see if I can get that implemented properly on the branch. If I can figure it out quickly, I'll include it. Otherwise, we can defer it a bit.

@phil-blain (Member, Author)

Thanks Tony. Sorry for not being very responsive - I'm only working half days and have a lot going on both at and outside of work... It might be simpler in a new PR. Let's see what you come up with, and then we can decide which commits we keep in which PR.

If you update the machines files, could you add the signaling NaNs also for banting and daley? I thought I had them activated there, but it turns out they are not.

Nice to have the spacecurve bug also fixed!

@apcraig (Contributor) commented Mar 15, 2021

@phil-blain, sounds good and will do.

@apcraig (Contributor) commented Mar 16, 2021

I ran a full test suite with the "full" debug options and there are a bunch of other failing tests now. I have fixed a couple things, but can't take everything on at the moment. I will create a new PR and we need to continue to work on this over time. I am going to back off on the debug compiler options on cheyenne until all the tests can pass. I will update the issue related to extending the debug compiler options as well.

@apcraig (Contributor) commented Mar 16, 2021

Closing for #579

@apcraig apcraig closed this Mar 16, 2021