Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[develop] Improve error messaging and logging for generate_FV3LAM_wflow.py #414

Merged

Conversation

mkavulich
Copy link
Collaborator

@mkavulich mkavulich commented Oct 13, 2022

This PR is now open for review

DESCRIPTION OF CHANGES:

The error messaging in the generate_FV3LAM_wflow.py has been overhauled to be more pythonic and more user-friendly. The old bash-like print_err_msg_exit() calls are replaced with proper exception handling, and print_info_msg() calls are replaced with the use of Python's built-in logging module. The latter has the added benefit of giving the same "tee"-like functionality--where messages are printed to screen and written to a log file simultaneously--without the need for messy subprocesses and the confusing error messages that come with them.

Errors are now handled in a consistent way, with the errors clearly highlighted both on screen and in the log file, as well as an easy-to-follow exception traceback for those wanting that information. In addition, several error conditions that were previously not caught at the generate step now cause the script to fail with helpful error messages.

From the user perspective, functionality will not change, aside from differences when error conditions are present. Successful runs will remain the same, except that some log messaging has been clarified, and the log file contains additional information for each message, as well as some indentation changes for ease of reading.

Example output

In this example, I specified an invalid variable "INVALID_VAR" in config.yaml. In the current code, the script succeeds, which is a dangerous condition that can produce unexpected results:
Updated code:

./generate_FV3LAM_wflow.py 

  ========================================================================
  Starting experiment generation...
  ========================================================================

  ========================================================================
  Starting function setup() in "setup.py"...
  ========================================================================

*********************************************************************
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: /mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/log.generate_FV3LAM_wflow
*********************************************************************

Traceback (most recent call last):
  File "./generate_FV3LAM_wflow.py", line 1037, in <module>
    generate_FV3LAM_wflow(USHdir, logfile)
  File "./generate_FV3LAM_wflow.py", line 103, in generate_FV3LAM_wflow
    setup()
  File "/mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/setup.py", line 107, in setup
    raise Exception(dedent(f'''
Exception: 
User-specified variable "INVALID_VAR" in config.yaml is not valid
Check config_defaults.yaml for allowed user-specified variables

For this same case, the log file shows the following text:

root                   DEBUG    Finished setting up debug file logging in /mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/log.generate_FV3LAM_wflow
root                   DEBUG    Logging set up successfully
generate_FV3LAM_wflow  INFO
  ========================================================================
  Starting experiment generation...
  ========================================================================
setup                  INFO
  ========================================================================
  Starting function setup() in "setup.py"...
  ========================================================================
root                   ERROR
*********************************************************************
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: /mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/log.generate_FV3LAM_wflow
*********************************************************************

Traceback (most recent call last):
  File "./generate_FV3LAM_wflow.py", line 1037, in <module>
    generate_FV3LAM_wflow(USHdir, logfile)
  File "./generate_FV3LAM_wflow.py", line 103, in generate_FV3LAM_wflow
    setup()
  File "/mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/setup.py", line 107, in setup
    raise Exception(dedent(f'''
Exception:
User-specified variable "INVALID_VAR" in config.yaml is not valid
Check config_defaults.yaml for allowed user-specified variables

In the log file you can see the above-mentioned additional information. Before each message, the subroutine writing the message is listed (a few top-level processes involving setting up logging and printing the error message are logged as "root"). Next on the line is a note of the "logging level"; see Logging HOWTO for more details, but basically this denotes the "urgency" of the message, with "debug" being the lowest and "error" being the highest used here, indicating the script has failed in some way. While not implemented here, this will allow us to properly quarantine lower-priority printouts to the log file, or only print them to screen if DEBUG=TRUE, by making use of lower-priority levels such as logging.debug. And finally, log messages that extend to multiple lines are indented by two spaces, to make individual messages easier to parse visually.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
    • Attempted to run as many error condition checks as possible to ensure smooth error handling/messaging
      • Ran ./generate_FV3LAM_wflow.py with a number of different failure conditions
        • Prior to checking out code
        • Prior to building code
          • Note: currently this does not exit with an error (in develop or with these new changes), as it would break existing tests. Will need to re-visit this in the future.
        • Prior to loading regional_workflow conda environment
          • Note: On some platforms this error will be impossible to catch gracefully, due to not even being able to start running the script
        • Removed Externals.cfg
        • malformed Externals.cfg
        • With no config.yaml
        • With malformed config.yaml
        • PREEXISTING_DIR_METHOD = quit
        • Incorrect config.yaml settings
          • Did not include MACHINE
          • Did not include ACCOUNT with WORKFLOW_MANAGER=rocoto
          • Included invalid variable
          • Mismatch between SPP array sizes
          • Mismatch between LSM_SPP array sizes
          • Incorrect PREDEF_GRID_NAME
          • RESTART_INTERVAL set to non-int value
          • DATE_FIRST_CYCL or DATE_LAST_CYCL set to non-date value
          • USE_CUSTOM_POST_CONFIG_FILE true but CUSTOM_POST_CONFIG_FP does not exist
          • USE_CRTM true but CRTM_DIR does not exist
          • FCST_LEN_HRS > 999
          • FCST_LEN_HRS not evenly divisible by LBC_SPEC_INTVL_HRS
          • Mandatory variables (DT_ATMOS, LAYOUT_X, LAYOUT_Y, BLOCKSIZE) not set
          • CCPP_PHYS_SUITE not valid
          • EXTRN_MDL_SOURCE_BASEDIR_ICS, EXTRN_MDL_SOURCE_BASEDIR_LBCS set to non-existent location while USE_USER_STAGED_EXTRN_FILES is True
          • RUN_TASK_VX_ENSGRID or RUN_TASK_VX_ENSPOINT set True while DO_ENSEMBLE is False
          • With RUN_TASK_MAKE_GRID = False
            • Set GRID_DIR to non-existent location
            • Set GRID_DIR to ""
          • With RUN_TASK_MAKE_OROG = False
            • Set OROG_DIR to non-existent location
            • Set OROG_DIR to ""
          • With RUN_TASK_MAKE_SFC_CLIMO = False
            • Set SFC_CLIMO_DIR to non-existent location
            • Set SFC_CLIMO_DIR to ""
          • Set PREEXISTING_DIR_METHOD to ""
        • Incorrect machine file settings
          • "MACHINE" does not correspond to a machine file
          • machine file is malformed
      • Ran ./run_WE2E_tests.sh with a number of different failure conditions
        • Prior to checking out code
        • Prior to loading regional_workflow conda environment
      • Ran ALL WE2E tests, compared output to current develop
        • Develop: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/issue_385/before_and_after_testing/develop/expt_dirs
        • This branch: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/issue_385/before_and_after_testing/after/expt_dirs/
        • No unexpected differences
  • cheyenne.intel
    • Ran fundamental suite of WE2E tests
  • cheyenne.gnu
    • Ran fundamental suite of WE2E tests
  • Jenkins testing
  • All CI checks passed

DEPENDENCIES:

None

DOCUMENTATION:

None, though the new, clearer error messaging and removal of extraneous comments could be considered a type of documentation change.

Check the comments on #385 for more examples of the differences in error messages.

ISSUE:

Fixes #385

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS (optional):

@danielabdi-noaa
Copy link
Collaborator

danielabdi-noaa commented Oct 14, 2022

@mkavulich The unittest is failing probably due to a mixup of the community and nco test cases settings. I recall that when i first called generate_workflow directly, it remembered the RUN_ENVIR setting from the first test and used it for the latter. So I was forced to run them in separate processes.

@mkavulich mkavulich force-pushed the feature/improve_error_messaging branch 3 times, most recently from ceb4c4d to b00f98d Compare October 14, 2022 22:56
@mkavulich
Copy link
Collaborator Author

@danielabdi-noaa I restored the multiprocessing to the unit tests in generate_FV3LAM_wflow.py, but I am still seeing errors that I don't understand (I can't figure out how to see the full logs). I am on PTO Monday and Tuesday but I may ask for a quick meeting after that to try to figure out what I did wrong.

@danielabdi-noaa
Copy link
Collaborator

@mkavulich You can run the unittest locally

python3 -m unittest generate_FV3LAM_wflow.py

It looks like you are just missing a dedent import in get_crontab_contents.py

Traceback (most recent call last):
  File "/contrib/miniconda3/4.5.12/envs/regional_workflow/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/contrib/miniconda3/4.5.12/envs/regional_workflow/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/generate_FV3LAM_wflow.py", line 494, in generate_FV3LAM_wflow
    add_crontab_line()
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/get_crontab_contents.py", line 104, in add_crontab_line
    logger.info(dedent(
NameError: name 'dedent' is not defined

@danielabdi-noaa
Copy link
Collaborator

By the way, is it possible to override how the logger prints messages so that it automatically dedents them just like print_info_msg does? That way we do not have to dedent every message going to the log file or the screen. There is a case where you have to dedent the message beforehand, e.g. when concatenating two strings, but in most cases it is convenient to just send the string as is.

@mkavulich
Copy link
Collaborator Author

@danielabdi-noaa Thanks for the info, the tests seem to be working now!

By the way, is it possible to override how the logger prints messages so that it automatically dedents them just like print_info_msg does? That way we do not have to dedent every message going to the log file or the screen. There is a case where you have to dedent the message beforehand, e.g. when concatenating two strings, but in most cases it is convenient to just send the string as is.

There's no built-in way to do this as far as I can tell, but I could add a new function that does this. I didn't want to change the existing print_info_msg() or print_err_msg_exit() because I wanted to keep these initial improvements confined to the context of workflow generation and the existing functions are used throughout the system in several different contexts. I will put that in and then open this PR for review.

@danielabdi-noaa
Copy link
Collaborator

@mkavulich Are you thinking of modifying the logger.info/debug behaviour to include dedentation automatically or a separate function? As you pointed out, there is no built-in way but there seem to be workarounds like this that will automatically dedent both console and logfile output. I tried it quickly on your PR with this patch, and it looks like it does the job.

@mkavulich
Copy link
Collaborator Author

@danielabdi-noaa We could discuss a better strategy in the future but I don't want to "force" dedenting on all log messages, so I have implemented it as a separate function similar to print_info_msg() for this initial effort.

Thanks for all your comments here so far, I am opening this up for wider review but let me know if you have additional comments/suggestions.

@mkavulich mkavulich added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 20, 2022
@venitahagerty venitahagerty removed ci-jet-intel-WE Kicks off automated workflow test on jet with intel ci-hera-intel-WE Kicks off automated workflow test on hera with intel labels Oct 20, 2022
@venitahagerty
Copy link
Collaborator

venitahagerty commented Oct 20, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1085525584/20221020000517/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
2022-10-20 00:36:11 +0000 :: hfe05 :: Task make_orog, jobid=36829776, in state DEAD (FAILED), ran for 39.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-10-20 00:36:13 +0000 :: hfe05 :: Task make_orog, jobid=36829774, in state DEAD (FAILED), ran for 51.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
2022-10-20 00:36:10 +0000 :: hfe11 :: Task make_orog, jobid=36829772, in state DEAD (FAILED), ran for 55.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
2022-10-20 00:36:06 +0000 :: hfe05 :: Task make_orog, jobid=36829779, in state DEAD (FAILED), ran for 38.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2022-10-20 00:36:14 +0000 :: hfe12 :: Task make_orog, jobid=36829780, in state DEAD (FAILED), ran for 41.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
2022-10-20 00:36:10 +0000 :: hfe08 :: Task make_orog, jobid=36829770, in state DEAD (FAILED), ran for 53.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
2022-10-20 00:36:12 +0000 :: hfe08 :: Task make_orog, jobid=36829778, in state DEAD (FAILED), ran for 40.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
2022-10-20 00:36:09 +0000 :: hfe07 :: Task make_orog, jobid=36829775, in state DEAD (FAILED), ran for 43.0 seconds, exit status=256, try=2 (of 2)
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
All experiments completed

@venitahagerty
Copy link
Collaborator

venitahagerty commented Oct 20, 2022

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1085525584/20221020000511/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta

@mkavulich mkavulich added the ci-jet-intel-WE Kicks off automated workflow test on jet with intel label Oct 20, 2022
@venitahagerty venitahagerty removed the ci-jet-intel-WE Kicks off automated workflow test on jet with intel label Oct 20, 2022
@mkavulich mkavulich marked this pull request as ready for review October 20, 2022 02:41
 - Remove/consolidate outdated or unnecessary comments
 - Remove print of rocoto details; it is now impossible for users to get
   this far *without* having loaded a module that includes rocoto
 - Add logging to get_crontab_contents.py
Bug in check_var_valid_value.py caused an unrelated exception if the
intended check failed and err_msg was not specified. Turns out
that nowhere in the code is err_msg used, so let's get rid of it! In
addition, trying to check an empty string caused an *additional*
exception, so added some logic to handle that case.
@mkavulich mkavulich force-pushed the feature/improve_error_messaging branch from 5d43ec2 to 73a5c50 Compare October 21, 2022 21:19
@mkavulich
Copy link
Collaborator Author

@christinaholtNOAA I believe I have addressed most of your concerns and followed up on the rest with a few comments/questions. Let me know what you think.

@mkavulich mkavulich added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 21, 2022
@venitahagerty venitahagerty removed ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 21, 2022
@venitahagerty
Copy link
Collaborator

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1085525584/20221021213518/ufs-srweather-app
If test failed, please make changes and add the following label back:
ci-jet-intel-WE

Copy link
Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good to me. The last two items (a comment below and an unresolved previous comment) are purely cosmetic.

ush/setup.py Show resolved Hide resolved
@venitahagerty
Copy link
Collaborator

venitahagerty commented Oct 21, 2022

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1085525584/20221021213515/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR

@mkavulich
Copy link
Collaborator Author

@danielabdi-noaa I think you still have one un-resolved comment as well, let me know if everything looks good!

@danielabdi-noaa
Copy link
Collaborator

I've now run almost all error testing cases listed in the PR description and they work as expected so approving.

Only the following case looks like it is impossible to catch on Hera atleast. If we run a WE2E test case though without conda activation, it would have been probably caught!

  • Prior to loading regional_workflow conda environment

This one is disabled to get unittests working and i think for now it should be that way in my opinion. I've had to disable a couple of hard exits to get unittests working before, but if the benefits outweight the cons we could get rid of the unittests.

  • Prior to building code

@mkavulich
Copy link
Collaborator Author

@danielabdi-noaa Agreed on both counts. It would be nice to implement a check that the code is built, but I agree we will need to think of a better way to sync that with automated tests.

@mkavulich mkavulich merged commit e08e5c5 into ufs-community:develop Oct 21, 2022
@mkavulich mkavulich deleted the feature/improve_error_messaging branch July 19, 2023 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve error trapping and messaging to the end user
5 participants