When using wrong FV layout the model hangs and times out instead of immediate FATAL with a clear message #271

Open
nikizadehgfdl opened this issue May 10, 2023 · 1 comment
@nikizadehgfdl
Contributor

Describe the bug
I chose an invalid FV layout by mistake, and the job hung until it timed out (16 hours for production runs) instead of exiting right away with a FATAL message.
The stdout contains the following message:

Invalid layout, NPES_X:0006NPES_Y:0048ncells_X:0016ncells_Y:0002

To Reproduce
For ESM4_c96, use the following layouts:

               <atm ranks="1728" threads="2"   layout = "6,48"   io_layout = "1,4" />
               <lnd                            layout = "6,48"   io_layout = "1,4" />
               <ice                            layout = "36,48"   io_layout = "1,4" />
               <ocn ranks="2044" threads="1"   layout = "36,72"   io_layout = "1,4" mask_table="mask_table.548.36x72"/>

Expected behavior
The model run should exit immediately with a clear FATAL message when a user provides an invalid layout for any component.

System Environment
ncrc5.intel22

Additional context
stdout:
/lustre/f2/scratch/Niki.Zadeh/FMS2023.01_mom6_20221213/ESM4p2_piControl_spinup_J_redoyr450_FMS2/ncrc5.intel22-prod-openmp/stdout/run/ESM4p2_piControl_spinup_J_redoyr450_FMS2__8.o134508369

stdout of a second run with another wrong layout:

/lustre/f2/scratch/Niki.Zadeh/FMS2023.01_mom6_20221213/ESM4p2_piControl_spinup_J_redoyr450_FMS2/ncrc5.intel22-prod-openmp/stdout/run/ESM4p2_piControl_spinup_J_redoyr450_FMS2__6.o134508105

@nikizadehgfdl
Contributor Author

@bensonr wrote:
Niki - this error comes from FV3. The logic chosen to shut down the model has an error: there is an MPI barrier that expects all ranks to check in, which isn't valid for concurrent coupled runs. Please open the issue in FV3 and assign it to me.
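The barrier problem described above can be illustrated with a small analogy. This is only a sketch using Python threads, not MPI and not the FV3 source: `run_group` is a hypothetical helper, the threads stand in for ranks, and the timeout stands in for the job's wall-clock limit. A barrier sized for every rank never releases when only some ranks (for example, only the atmosphere ranks in a concurrent run) reach it.

```python
import threading


def run_group(parties_expected, parties_arriving, timeout=1.0):
    """Let `parties_arriving` threads wait on a barrier sized for `parties_expected`.

    Returns a sorted list of (rank, outcome), where outcome is "released" if the
    barrier opened and "stuck" if the wait timed out or the barrier was broken.
    """
    barrier = threading.Barrier(parties_expected)
    results = []
    lock = threading.Lock()

    def worker(rank):
        try:
            barrier.wait(timeout=timeout)
            outcome = "released"
        except threading.BrokenBarrierError:
            # Raised when the wait times out or another waiter broke the barrier.
            outcome = "stuck"
        with lock:
            results.append((rank, outcome))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(parties_arriving)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


# All expected parties arrive: the barrier opens for everyone.
print(run_group(parties_expected=4, parties_arriving=4))
# Only a subset arrives: nobody is released until the timeout fires.
print(run_group(parties_expected=4, parties_arriving=2, timeout=0.2))
```

In the real run there is no short timeout on the barrier, so the waiting ranks sit idle until the scheduler kills the job.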

FV3 has always enforced the rule that every domain needs a minimum of 4 cells in both the i- and j-dimensions. For a c96 grid (96 cells per tile edge), the largest valid layout value in either direction is therefore 24 (96 / 4).
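The 4-cell rule can be checked before a job is submitted. The following is a minimal sketch, assuming that the cells per domain are the tile resolution divided (integer division) by each layout entry, which reproduces the `ncells_X:0016ncells_Y:0002` values in the stdout message above. The names `cells_per_domain`, `validate_layout`, and `MIN_CELLS_PER_DOMAIN` are hypothetical and do not come from FV3 or FMS.

```python
MIN_CELLS_PER_DOMAIN = 4  # rule quoted above: at least 4 cells in i and j


def cells_per_domain(tile_cells, layout):
    """Return (cells_x, cells_y) owned by one rank on a tile_cells x tile_cells tile."""
    npes_x, npes_y = layout
    return tile_cells // npes_x, tile_cells // npes_y


def validate_layout(tile_cells, layout):
    """Return (cells_x, cells_y), or raise ValueError if a domain is too small."""
    npes_x, npes_y = layout
    if npes_x < 1 or npes_y < 1:
        raise ValueError(f"layout entries must be positive, got {layout}")
    cells_x, cells_y = cells_per_domain(tile_cells, layout)
    if cells_x < MIN_CELLS_PER_DOMAIN or cells_y < MIN_CELLS_PER_DOMAIN:
        raise ValueError(
            f"Invalid layout {layout} for c{tile_cells}: "
            f"{cells_x} x {cells_y} cells per domain, "
            f"need at least {MIN_CELLS_PER_DOMAIN} in each direction"
        )
    return cells_x, cells_y


# The layout from this issue fails in the y-direction (96 // 48 == 2).
try:
    validate_layout(96, (6, 48))
except ValueError as err:
    print(err)

# A layout that respects the limit of 24 per direction passes.
print(validate_layout(96, (6, 24)))
```

A check like this only catches the problem earlier on the user side; it does not replace a proper FATAL exit in the model itself.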

I cannot assign this issue.
