
Add MPI-IO collective reads of NetCDF-4 files #1405

Closed
wants to merge 20 commits

Conversation

@dkokron (Contributor) commented Nov 7, 2023:

NetCDF-4, using the HDF5 file layout, supports parallel I/O in two modes, referred to as “independent” and “collective”. The collective mode has been tested with a few NOAA workloads and shown to provide a substantial improvement in job startup time while reducing the negative impact on the underlying Lustre file system. Collective mode is not currently available in FMS (October 2023).

This PR does not address parallel I/O via pNetCDF.

This PR adds an option to enable collective reads. The user controls that option via settings in input.nml; the default behavior is unchanged, and the user has to explicitly activate collective reads through those settings.
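For illustration only, a hypothetical input.nml fragment is sketched below. The group name and the count variable are assumptions made for this sketch (only an fn_collective array appears in the code quoted later in this conversation), and the file names are placeholders:

     &fms2_io_nml                                              ! group name is an assumption
         nfiles_collective = 2                                  ! assumed count of files to read collectively
         fn_collective = 'fv_core.res.tile1.nc', 'phy_data.nc'  ! placeholder file names
     /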

Fixes #1322

How Has This Been Tested?
I have run an RRFS case on WCOSS2 with and without collective reads activated. The resulting binary restart files are zero-diff.

The compile-time environment used to build FMS was:
Currently Loaded Modules:

   1) craype-x86-rome (H)    2) libfabric/1.11.0.0 (H)   3) craype-network-ofi (H)   4) hpc/1.2.0
   5) hpc-intel/19.1.3.304   6) hpc-cray-mpich/8.1.19    7) hdf5/1.14.1              8) netcdf/4.9.2

The compile and runtime environment for RRFS was

   1) craype-x86-rome (H)      2) libfabric/1.11.0.0 (H)    3) craype-network-ofi (H)
   4) envvar/1.0               5) PrgEnv-intel/8.1.0        6) intel/19.1.3.304
   7) craype/2.7.13            8) cray-mpich/8.1.12         9) cmake/3.20.2
  10) jasper/2.0.25           11) zlib/1.2.11              12) libpng/1.6.37
  13) hdf5/1.14.0             14) netcdf/4.9.2             15) pio/2.5.10
  16) esmf/8.4.2              17) bacio/2.4.1              18) crtm/2.4.0
  19) g2/3.4.5                20) g2tmpl/1.10.2            21) ip/3.3.3
  22) sp/2.3.3                23) w3emc/2.9.2              24) gftl-shared/v1.5.0
  25) mapl/2.35.2-esmf-8.4.2  26) ufs_common               27) ufs_wcoss2.intel
  28) cray-pals/1.2.2


Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • New check tests, if applicable, are included
    The current CMake test for Parallel NetCDF via (nc-config --has-parallel) works in my case. Testing via (nc-config --has-parallel4) is preferred.
  • make distcheck passes

Dan Kokron and others added 5 commits August 8, 2023 16:15
@bensonr (Contributor) left a comment:

Besides the inline comment, please remove the commented out debug statements.

if(string_compare(trim(fileobj%path), trim(fn_collective(i)), .true.)) fileobj%use_collective = .true.
enddo
if(fileobj%use_collective) then
err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, comm=mpp_comm_private, info=MPI_INFO_NULL)
Contributor:

Does the use of mpp_comm_private here imply that all ranks associated with reads and writes to the named file will block until all MPI ranks associated with that communicator have contributed?

Contributor Author:

I'm only working with reads, but yes those reads are collective.

Contributor Author:

Debugging prints removed from fms2_io/include/netcdf_read_data.inc and fms2_io/netcdf_io.F90

@bensonr (Contributor) commented Nov 8, 2023:

Of course, the NF90_NOWRITE specifies read-only. My concern is that mpp_comm_private is not the correct communicator to be using here. For general use of libFMS, mpp_comm_private is the MPI_COMM_WORLD global communicator, i.e. all MPI ranks for the job as a whole. For the UFS, mpp_comm_private in libFMS is initialized with a local communicator that is a subset of the global MPI ranks. I am trying to ensure the proper communicator will be in effect for all uses of parallel NetCDF within libFMS.

Contributor Author:

One intended use case involves a GCM with nested domains, thus the communicator used in this nf90_open() call might change. I have not done any testing with that case.

Contributor Author:

From line 862 of FV3/io/fv3atm_restart_io.F90
call read_restart(Phy_restart, ignore_checksum=ignore_rst_cksum)

The read_restart() routine is from fms2_io_mod

Contributor:

@dkokron Thanks.
I am assuming Phy_restart is defined as a FmsNetcdfDomainFile_t?
I think the code is missing a call to mpp_set_current_pelist. If you can push what you currently have, I will take a better look. I think I can test it out with our unit tests.

Do you get better performance for the HAFS nested grid case?

Contributor Author:

Yes, Phy_restart is of type FmsNetcdfDomainFile_t.

I am still investigating.

Contributor Author:

@uramirez8707 Do you have access to the WCOSS2 test system 'acorn'?

Contributor:

I do not.

@dkokron (Contributor Author) commented Nov 8, 2023:

Pete Johnsen had these questions.

  1. Why do we need to know the number of files being read with collective reads ahead of time (and why does the user have to put this in the namelist)?
  2. What about integer variables and collective reads? There are some integer fields, like land/surface masks, that could be read collectively.
  3. We might need to issue a warning and fall back to the original read method if the input file is not in NetCDF-4 format.

My responses:

  1. I need to allocate an array to hold the file names.
  2. Integer variables could be added. None of the variables read through this code path were integers.
  3. I did run into this situation with some of the oro_* files, which were originally in NetCDF Classic format. I had to convert them to NetCDF-4 to get them working with this code. I'll add some testing for this (a fallback is sketched below).
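A minimal sketch of that fallback, assuming the collective open is simply attempted first and the ordinary serial open is used when it fails (the names follow the snippets quoted elsewhere in this thread; this is not the PR's final code):

     ! Sketch only: attempt a parallel-collective open; if the file cannot be
     ! opened that way (e.g. it is NetCDF Classic rather than NetCDF-4), warn
     ! and fall back to the default read path.
     err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, &
                     comm=fileobj%TileComm, info=MPI_INFO_NULL)
     if (err /= NF90_NOERR) then
       call mpp_error(NOTE, "Falling back to non-collective reads for "//trim(fileobj%path))
       fileobj%use_collective = .false.
       err = nf90_open(trim(fileobj%path), NF90_NOWRITE, fileobj%ncid)
     endif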

@dkokron (Contributor Author) commented Nov 27, 2023:

I'd need to mimic the following code using only FMS calls.

     color=domain%tile_id
     call MPI_Comm_split(mpp_get_current_pelist_comm(), color(1), mpp_pe(), domain%TileComm, err)

What I need is an MPI communicator that contains the ranks associated with each tile of a domain2d. I think the following should accomplish the task. Is there a better way?

     !allocate(pelist(mpp_get_tile_npes(domain)))
     !call mpp_get_tile_pelist(domain, pelist)
     !call mpp_declare_pelist(pelist)
     !domain%TileComm = peset(get_peset(pelist))%id
     !deallocate(pelist)

@bensonr (Contributor) commented Nov 28, 2023:

@dkokron - I'm poking around in FMS and the domains structure to find the info you are asking about. We've been following your commits and discussing them internally. I'd like to propose another meeting to share some of our observations.

@dkokron (Contributor Author) commented Nov 28, 2023:

@bensonr I'm free 11am to 2pm central every day. Send me an invite.

@dkokron (Contributor Author) commented Nov 29, 2023:

How does one prepare and run the unit test suite when FMS is compiled using CMake?

@rem1776 (Contributor) commented Nov 30, 2023:

How does one prepare and run the unit test suite when FMS is compiled using CMake?

Unfortunately our test suite is currently only available via the autotools build since it's our main supported build system. We have an open issue for it here: #827.

@uramirez8707 (Contributor) left a comment:

Some comments for discussion

@@ -362,8 +362,12 @@ subroutine netcdf_read_data_2d(fileobj, variable_name, buf, unlim_dim_level, &
type is (integer(kind=i8_kind))
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
type is (real(kind=r4_kind))
! NetCDF does not have the ability to specify collective I/O at the file basis
! so we must activate at the variable level
if(fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
Contributor:

This is inside an if (fileobj%is_root) then block, so only the first rank in the pelist is going to get to the nf90_var_par_access and nf90_get_var calls.

@@ -362,8 +362,12 @@ subroutine netcdf_read_data_2d(fileobj, variable_name, buf, unlim_dim_level, &
type is (integer(kind=i8_kind))
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
type is (real(kind=r4_kind))
! NetCDF does not have the ability to specify collective I/O at the file basis
! so we must activate at the variable level
if(fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
Contributor:

For domain-decomposed files, c and e correspond to the section of the data for all of the PEs in the pelist:

call netcdf_read_data(fileobj, variable_name, buf_i4_kind, &
unlim_dim_level=unlim_dim_level, &
corner=c, edge_lengths=e, broadcast=.false.)

Contributor:

The netcdf_read_data calls will need to be modified in domain_read_*d so that each rank reads its own section.
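A rough sketch of that change, assuming a domain2d handle is reachable from the file object (mpp_get_compute_domain and mpp_get_global_domain are standard mpp_domains_mod accessors, but how they are wired into domain_read_*d here is an assumption):

     ! Sketch only: each rank derives the corner and edge lengths of its own
     ! compute domain and reads just that section, with no broadcast afterwards.
     call mpp_get_compute_domain(fileobj%domain, isc, iec, jsc, jec)
     call mpp_get_global_domain(fileobj%domain, isg, ieg, jsg, jeg)
     c(1) = isc - isg + 1;  c(2) = jsc - jsg + 1   ! start indices of this rank's section
     e(1) = iec - isc + 1;  e(2) = jec - jsc + 1   ! extents of this rank's section
     call netcdf_read_data(fileobj, variable_name, buf_i4_kind, &
                           unlim_dim_level=unlim_dim_level, &
                           corner=c, edge_lengths=e, broadcast=.false.)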

Contributor Author:

@uramirez8707 Those c and e arrays have had the appropriate values during all of my testing so far. We'll see if that changes after removing the io_layout=1,1 'fix'.

diff --git a/tools/fv_mp_mod.F90 b/tools/fv_mp_mod.F90
index 4f2c0f2..f1e6012 100644
--- a/tools/fv_mp_mod.F90
+++ b/tools/fv_mp_mod.F90
@@ -658,12 +658,12 @@ contains
             !--- if io_layout\=(1,1) then read io_layout=io_layout (no change)
             l_layout = mpp_get_io_domain_layout(domain)
             call mpp_copy_domain(domain, domain_for_read)
-            if (ALL(l_layout == 1)) then
-              call mpp_get_layout(domain, l_layout)
+            !if (ALL(l_layout == 1)) then
+            !  call mpp_get_layout(domain, l_layout)
+            !  call mpp_define_io_domain(domain_for_read, l_layout)
+            !else
               call mpp_define_io_domain(domain_for_read, l_layout)
-            else
-              call mpp_define_io_domain(domain_for_read, l_layout)
-            endif
+            !endif
          endif
        deallocate(pe_start,pe_end)

err = nf90_get_var(fileobj%ncid, varid, buf, start=c, count=e)
type is (real(kind=r8_kind))
if(fileobj%use_collective) err = nf90_var_par_access(fileobj%ncid, varid, nf90_collective)
Contributor:

I think if use_collective is true then bcast needs to be set to false:

if (present(broadcast)) then
bcast = broadcast
else
bcast = .true.
endif
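For example, the default could become something like the following sketch (whether every collective read should skip the broadcast unconditionally is an assumption to be confirmed):

     if (present(broadcast)) then
       bcast = broadcast
     else
       ! Sketch only: after a collective read every rank already holds its data,
       ! so no broadcast is needed.
       bcast = .not. fileobj%use_collective
     endif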

Contributor Author:

Good point. Thank you.

! so we must activate at the variable level in netcdf_read_data_2d() and netcdf_read_data_3d()
if(fileobj%use_collective .and. fileobj%TileComm < 0) then
!write(6,'("netcdf_file_open: Open for collective read "A,I4)') trim(fileobj%path), szTile
err = nf90_open(trim(fileobj%path), ior(NF90_NOWRITE, NF90_MPIIO), fileobj%ncid, comm=fileobj%TileComm, info=MPI_INFO_NULL)
Contributor:

How will this work for reading metadata? Currently a root rank reads the data and broadcasts it to the other ranks. Is this still needed?

Contributor Author:

@bensonr Do you want to maintain support for the following read scenarios?

  1. fileobj%is_root (a single rank) reads and then broadcasts. I believe this is the situation when the io_layout=(1,1) work-around isn't active.
  2. All ranks associated with a file read independently (parallel-independent). This is the situation when the io_layout=(1,1) work-around is active.
  3. All ranks associated with a file read collectively (parallel-collective).

Contributor Author:

Put another way, are you proposing that UFS eliminate the io_layout=(1,1) work-around from its code?

Contributor:

@dkokron - I thought I answered this, but I see I never hit the comment button. I am proposing that the UFS eliminate the work-around, as the new path through FMS with NetCDF-4 files should behave the same way. Let me know your thoughts.

@dkokron (Contributor Author) commented Dec 11, 2023:

I'm only testing for NetCDF-4 format after a failed attempt to open for parallel-collective.

In order to obtain the new FMS path, the user needs to modify their code before each open_file() call for which they want the new path (currently 4 file types out of many tens).

e.g.

+    Phy_restart%use_collective = .true.
+    Phy_restart%TileComm = mpp_get_tile_comm(fv_domain)
amiopen=open_file(Phy_restart, trim(fname), 'read', domain=fv_domain, is_restart=.true., dont_add_res_to_filename=.true.)

The way things are coded now (after removing the new path from those is_root blocks and removing the io_layout=(1,1) workaround), the default path is still read-broadcast.

Let me think about this some more.

@@ -32,6 +32,7 @@ module netcdf_io_mod
use mpp_mod
use fms_io_utils_mod
use platform_mod
use mpi, only: MPI_INFO_NULL
Member:

This MPI dependency should be removed. It can be made public through mpp

Contributor Author:

I will remove this dependency once FMS provides the needed functionality via some other means.

@MatthewPyle-NOAA commented:
Does this PR still have a path forward? The RRFS development team would love to utilize it within a true FMS release.

@bensonr (Contributor) commented Jan 30, 2024:

@MatthewPyle-NOAA - yes, this still has a path forward. I estimate having something you can test with by Feb 8th.

@dkokron (Contributor Author) commented Jan 30, 2024 via email.

@MatthewPyle-NOAA commented:
@dkokron I just pinged Shun via e-mail - hopefully one way or another we'll be able to point you at an updated model code by next week.

@bensonr (Contributor) commented Feb 9, 2024:

@dkokron @MatthewPyle-NOAA - see PR #1457

@MatthewPyle-NOAA commented:
Thanks @bensonr. @dkokron do you still need fresh RRFS code to test with?

@dkokron (Contributor Author) commented Feb 13, 2024 via email:

Haven't heard back from Shun.

@ShunLiu-NOAA commented:

Dan, we tested these modifications in the RRFS real-time parallel. With these changes, the run time of a 1-hour forecast is reduced from 1350 s to 950 s. I think these changes should be merged into develop. Thank you so much for your help.

@dkokron (Contributor Author) commented Feb 22, 2024 via email.

@bensonr (Contributor) commented Feb 23, 2024:

@dkokron - there is now an alpha tag with the ability to declare a pelist and get the communicator directly as well as to query a domain to get individual tile communicators or the full domain communicator.

We'd like to include your PR in the release, so let us know how much time you think you'll need to merge/implement within the current working branch. Don't forget to include logic specific unit tests for the MPI-IO capabilities.

@dkokron (Contributor Author) commented Feb 26, 2024:

As far as I am aware, the only place where I am allowed to do this sort of development is on Acorn. That system is down for software work until at least 1 March.

@dkokron (Contributor Author) commented Mar 4, 2024:

@bensonr Do you suggest overlaying the parallel IO work on top of 2024.01-alpha4?

@rem1776 (Contributor) commented Mar 4, 2024:

@dkokron We discussed this a bit with Rusty; it would be best to put the updates on the main branch instead.

@dkokron (Contributor Author) commented Mar 4, 2024:

I will overlay on main.

@dkokron (Contributor Author) commented Mar 4, 2024:

@bensonr @rem1776 The user application needs some way to tell FMS which MPI communicator is associated with which file. I did that in UFS using the following code. I don't see the equivalent of mpp_set_tile_comm() in the current FMS 'main'. Which FMS function should I be using to achieve the functionality of mpp_set_tile_comm()?

    call mpp_get_layout(Atm%domain,read_layout)
    call mpp_copy_domain(Atm%domain, domain_for_read)
    call mpp_define_io_domain(domain_for_read, read_layout)
    call mpp_set_tile_comm(domain_for_read)

    GFS_restart%use_collective = .true.
    GFS_restart%TileComm = mpp_get_tile_comm(domain_for_read)

@bensonr (Contributor) commented Mar 4, 2024:

@dkokron - there are two new interfaces you can use to get the current tile commID (mpp_get_domain_tile_commid) and the full domain commID (mpp_get_domain_commid). For a single-tile domain, the tile and domain commIDs should be the same.

There is also a new optional commID argument to mpp_declare_pelist that will return the commID for any pelist group created.
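For reference, a sketch of how the UFS-side sequence quoted above might look with these interfaces. The argument lists are assumed from the descriptions in this comment, and, as noted in the next comment, this exact sequence still needed debugging at runtime:

     call mpp_get_layout(Atm%domain, read_layout)
     call mpp_copy_domain(Atm%domain, domain_for_read)
     call mpp_define_io_domain(domain_for_read, read_layout)

     GFS_restart%use_collective = .true.
     ! Replaces the earlier mpp_set_tile_comm()/mpp_get_tile_comm() pair.
     GFS_restart%TileComm = mpp_get_domain_tile_commid(domain_for_read)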

@dkokron (Contributor Author) commented Mar 4, 2024:

@bensonr I tried using mpp_get_domain_tile_commid() in the above sequence, but it failed at runtime. I'll investigate some more and report back here.

@dkokron (Contributor Author) commented Mar 5, 2024:

@bensonr Does FMS provide a way to access MPI_COMM_NULL?

@dkokron closed this Mar 5, 2024
@dkokron deleted the fms.ParallelStartup branch March 5, 2024 19:40
@dkokron (Contributor Author) commented Mar 5, 2024:

I think I accidentally closed this PR when I renamed my fms.ParallelStartup branch to fms.ParallelStartupLegacy. The fms.ParallelStartup branch is now built on top of recent FMS:main.

Should I open a new PR?

@rem1776 (Contributor) commented Mar 5, 2024:

@dkokron Yeah, you'll need to open a new PR from the new branch. With GitHub, there's no way to change the branch you're merging in once the PR is created.

And to answer your earlier question, MPI_COMM_NULL can be accessed through MPP_INFO_NULL in mpp_mod.

@dkokron (Contributor Author) commented Mar 5, 2024:

MPI_INFO_NULL and MPI_COMM_NULL are not the same thing.
/opt/cray/pe/mpich/8.1.12/ofi/intel/19.0/include/mpif.h: INTEGER MPI_COMM_NULL
/opt/cray/pe/mpich/8.1.12/ofi/intel/19.0/include/mpif.h: PARAMETER (MPI_COMM_NULL=67108864)

/opt/cray/pe/mpich/8.1.12/ofi/intel/19.0/include/mpif.h: INTEGER MPI_INFO_NULL
/opt/cray/pe/mpich/8.1.12/ofi/intel/19.0/include/mpif.h: PARAMETER (MPI_INFO_NULL=469762048)

@bensonr (Contributor) commented Mar 5, 2024:

@dkokron - sorry, I have been busy all day. It doesn't look like we are exposing MPI_COMM_NULL in FMS. If you need it, we can implement it as part of your PR in a fashion similar to the way MPI_INFO_NULL is exposed as MPP_INFO_NULL.

If you think we need to have a call for anything, let me know and we can set one up for tomorrow early morning or late afternoon.

@thomas-robinson (Member) commented:
@dkokron what happened to this PR?

@dkokron (Contributor Author) commented Mar 9, 2024:

@thomas-robinson It got closed accidentally when I updated the underlying branch to a more recent FMS branch. I will open a new PR once testing of the new code is complete.

@dkokron mentioned this pull request Mar 11, 2024.
@dkokron (Contributor Author) commented Mar 11, 2024:

New PR at #1477.
