[develop] Improve error messaging and logging for generate_FV3LAM_wflow.py #414

mkavulich · 2022-10-13T05:16:55Z

This PR is now open for review

DESCRIPTION OF CHANGES:

The error messaging in generate_FV3LAM_wflow.py has been overhauled to be more Pythonic and more user-friendly. The old bash-like print_err_msg_exit() calls are replaced with proper exception handling, and print_info_msg() calls are replaced with Python's built-in logging module. The latter has the added benefit of giving the same "tee"-like functionality, where messages are printed to screen and written to a log file simultaneously, without the need for messy subprocesses and the confusing error messages that come with them.
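As a rough illustration of the "tee"-like behavior described above, the following minimal sketch attaches two handlers to the root logger, one for the screen and one for a log file. The function name `setup_logging`, its `debug` argument, and the format string are assumptions made for this example, not necessarily what the PR implements.

```python
import logging
import sys


def setup_logging(logfile: str, debug: bool = False) -> logging.Logger:
    """Send every log message both to the screen and to a log file.

    Hypothetical helper for illustration only.
    """
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    # File handler: keep everything, and prefix each message with the
    # logger name and the logging level.
    file_handler = logging.FileHandler(logfile, mode="w")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(name)-22s %(levelname)-8s %(message)s")
    )
    root.addHandler(file_handler)

    # Console handler: bare message text, DEBUG output only on request.
    console = logging.StreamHandler(sys.stdout)
    console.setLevel(logging.DEBUG if debug else logging.INFO)
    console.setFormatter(logging.Formatter("%(message)s"))
    root.addHandler(console)

    root.debug("Logging set up successfully")
    return root
```

Because both handlers hang off the same logger, a single `logging.info(...)` call reaches the screen and the file without any subprocess.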

Errors are now handled in a consistent way, with the errors clearly highlighted both on screen and in the log file, as well as an easy-to-follow exception traceback for those wanting that information. In addition, several error conditions that were previously not caught at the generate step now cause the script to fail with helpful error messages.
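The consistent error handling described above could be sketched as a single top-level wrapper that catches any exception, logs a banner followed by the traceback, and reports failure to the caller. The helper name `run_with_error_report` is hypothetical; the banner wording is taken from the example output in this description.

```python
import logging

BANNER = "*" * 69


def run_with_error_report(func, logfile: str) -> int:
    """Run func(); on any exception, log a banner plus traceback.

    Returns 0 on success and 1 on failure (hypothetical helper).
    """
    try:
        func()
    except Exception:
        # logging.exception logs at ERROR level and appends the traceback
        # of the exception currently being handled.
        logging.exception(
            "\n%s\n"
            "Experiment generation failed. See the error message(s) printed below.\n"
            "For more detailed information, check the log file from the workflow\n"
            "generation script: %s\n"
            "%s\n",
            BANNER,
            logfile,
            BANNER,
        )
        return 1
    return 0
```

A script entry point could then pass the return value to `sys.exit()` so that the shell sees a nonzero status on failure.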

From the user perspective, functionality will not change, aside from differences when error conditions are present. Successful runs will remain the same, except that some log messaging has been clarified, and the log file contains additional information for each message, as well as some indentation changes for ease of reading.

Example output

In this example, I specified an invalid variable "INVALID_VAR" in config.yaml. In the current code, the script succeeds, which is a dangerous condition that can produce unexpected results:
Updated code:

./generate_FV3LAM_wflow.py 

  ========================================================================
  Starting experiment generation...
  ========================================================================

  ========================================================================
  Starting function setup() in "setup.py"...
  ========================================================================

*********************************************************************
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: /mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/log.generate_FV3LAM_wflow
*********************************************************************

Traceback (most recent call last):
  File "./generate_FV3LAM_wflow.py", line 1037, in <module>
    generate_FV3LAM_wflow(USHdir, logfile)
  File "./generate_FV3LAM_wflow.py", line 103, in generate_FV3LAM_wflow
    setup()
  File "/mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/setup.py", line 107, in setup
    raise Exception(dedent(f'''
Exception: 
User-specified variable "INVALID_VAR" in config.yaml is not valid
Check config_defaults.yaml for allowed user-specified variables
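A check that produces an exception like the one above might look roughly like the sketch below. The function name `check_user_vars` and the assumption that both configurations are flat dictionaries are illustrative only; the actual logic in setup.py may differ.

```python
from textwrap import dedent


def check_user_vars(user_cfg: dict, default_cfg: dict) -> None:
    """Raise if the user config contains a key the defaults do not know.

    Hypothetical helper; assumes flat dictionaries for simplicity.
    """
    for name in user_cfg:
        if name not in default_cfg:
            raise Exception(dedent(f'''
                User-specified variable "{name}" in config.yaml is not valid
                Check config_defaults.yaml for allowed user-specified variables
                '''))
```

Raising here, rather than printing and continuing, is what turns a silently accepted typo into a failure at the generate step.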

For this same case, the log file shows the following text:

root                   DEBUG    Finished setting up debug file logging in /mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/log.generate_FV3LAM_wflow
root                   DEBUG    Logging set up successfully
generate_FV3LAM_wflow  INFO
  ========================================================================
  Starting experiment generation...
  ========================================================================
setup                  INFO
  ========================================================================
  Starting function setup() in "setup.py"...
  ========================================================================
root                   ERROR
*********************************************************************
Experiment generation failed. See the error message(s) printed below.
For more detailed information, check the log file from the workflow
generation script: /mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/log.generate_FV3LAM_wflow
*********************************************************************

Traceback (most recent call last):
  File "./generate_FV3LAM_wflow.py", line 1037, in <module>
    generate_FV3LAM_wflow(USHdir, logfile)
  File "./generate_FV3LAM_wflow.py", line 103, in generate_FV3LAM_wflow
    setup()
  File "/mnt/lfs4/BMC/gsd-fv3-dev/kavulich/workdir/PR_414/ufs-srweather-app/ush/setup.py", line 107, in setup
    raise Exception(dedent(f'''
Exception:
User-specified variable "INVALID_VAR" in config.yaml is not valid
Check config_defaults.yaml for allowed user-specified variables

In the log file you can see the above-mentioned additional information. Before each message, the subroutine writing the message is listed (a few top-level processes that set up logging and print the error message are logged as "root"). Next on the line is the "logging level"; see the Logging HOWTO for more details, but basically this denotes the "urgency" of the message, with "debug" being the lowest and "error" being the highest used here, indicating the script has failed in some way. While not implemented here, this will allow us to properly quarantine lower-priority printouts to the log file, or only print them to screen if DEBUG=TRUE, by making use of lower-priority levels such as logging.debug. Finally, log messages that extend to multiple lines are indented by two spaces, to make individual messages easier to parse visually.
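One possible way to get both the level-based filtering and the two-space indentation of continuation lines is a small custom formatter combined with a per-handler level. This is a sketch under assumptions: the class name `IndentMultilineFormatter` and the helper `make_logger` are invented for this example.

```python
import io
import logging


class IndentMultilineFormatter(logging.Formatter):
    """Indent every line after the first by two spaces (illustrative)."""

    def format(self, record):
        text = super().format(record)
        first, *rest = text.split("\n")
        return "\n".join([first] + ["  " + line for line in rest])


def make_logger(name, stream, level):
    """Build a logger whose single handler only emits `level` and above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)   # the logger itself lets everything through
    logger.propagate = False         # keep the example self-contained
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)          # the handler decides what is shown
    handler.setFormatter(
        IndentMultilineFormatter("%(name)-22s %(levelname)-8s %(message)s")
    )
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
log = make_logger("demo_levels", buf, logging.INFO)
log.debug("only visible if the handler level is DEBUG")
log.info("first line\nsecond line")
print(buf.getvalue())
```

Setting the level on the handler rather than on the logger is what would let a file handler keep DEBUG messages while the screen handler shows only INFO and above.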

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
This change requires a documentation update

TESTS CONDUCTED:

DEPENDENCIES:

None

DOCUMENTATION:

None, though the new, clearer error messaging and removal of extraneous comments could be considered a type of documentation change.

Check the comments on #385 for more examples of the differences in error messages.

ISSUE:

Fixes #385

CHECKLIST

My code follows the style guidelines in the Contributor's Guide
I have performed a self-review of my own code using the Code Reviewer's Guide
I have commented my code, particularly in hard-to-understand areas
My changes need updates to the documentation. I have made corresponding changes to the documentation
My changes do not require updates to the documentation (explain).
My changes generate no new warnings
New and existing tests pass with my changes
Any dependent changes have been merged and published

CONTRIBUTORS (optional):

danielabdi-noaa · 2022-10-14T03:52:05Z

@mkavulich The unittest is probably failing because of a mix-up between the community and nco test case settings. I recall that when I first called generate_workflow directly, it remembered the RUN_ENVIR setting from the first test and used it for the second. So I was forced to run them in separate processes.
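The effect described here, where module-level state survives from one call to the next within a single interpreter, can be reproduced with a toy example. Note that this sketch uses `subprocess` with a fresh interpreter rather than the multiprocessing approach used in the actual unit tests, and the names `_settings`, `run_in_process`, and `run_isolated` are hypothetical.

```python
import subprocess
import sys

# Module-level state standing in for settings remembered between calls.
_settings = {}


def run_in_process(run_envir: str) -> str:
    """First caller wins: later calls see the stale value."""
    _settings.setdefault("RUN_ENVIR", run_envir)
    return _settings["RUN_ENVIR"]


def run_isolated(run_envir: str) -> str:
    """Run the same logic in a fresh interpreter, so no state carries over."""
    code = (
        "import sys\n"
        "settings = {}\n"
        "settings.setdefault('RUN_ENVIR', sys.argv[1])\n"
        "print(settings['RUN_ENVIR'])\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, run_envir],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `run_in_process("community")` and then `run_in_process("nco")` returns "community" both times, while `run_isolated` returns whichever value it was given.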

mkavulich · 2022-10-15T06:06:56Z

@danielabdi-noaa I restored the multiprocessing to the unit tests in generate_FV3LAM_wflow.py, but I am still seeing errors that I don't understand (I can't figure out how to see the full logs). I am on PTO Monday and Tuesday but I may ask for a quick meeting after that to try to figure out what I did wrong.

danielabdi-noaa · 2022-10-15T12:24:36Z

@mkavulich You can run the unittest locally

python3 -m unittest generate_FV3LAM_wflow.py

It looks like you are just missing a dedent import in get_crontab_contents.py

Traceback (most recent call last):
  File "/contrib/miniconda3/4.5.12/envs/regional_workflow/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
  File "/contrib/miniconda3/4.5.12/envs/regional_workflow/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/generate_FV3LAM_wflow.py", line 494, in generate_FV3LAM_wflow
    add_crontab_line()
  File "/scratch2/BMC/gsd-hpcs/Daniel.Abdi/ufs-srweather-app/ush/get_crontab_contents.py", line 104, in add_crontab_line
    logger.info(dedent(
NameError: name 'dedent' is not defined

danielabdi-noaa · 2022-10-15T13:13:06Z

By the way, is it possible to override how the logger prints messages so that it automatically dedents them just like print_info_msg does? That way we do not have to dedent every message going to the log file or the screen. There is a case where you have to dedent the message beforehand, e.g. when concatenating two strings, but in most cases it is convenient to just send the string as is.

mkavulich · 2022-10-19T16:51:34Z

@danielabdi-noaa Thanks for the info, the tests seem to be working now!

> By the way, is it possible to override how the logger prints messages so that it automatically dedents them just like print_info_msg does? That way we do not have to dedent every message going to the log file or the screen. There is a case where you have to dedent the message beforehand, e.g. when concatenating two strings, but in most cases it is convenient to just send the string as is.

There's no built-in way to do this as far as I can tell, but I could add a new function that does this. I didn't want to change the existing print_info_msg() or print_err_msg_exit() because I wanted to keep these initial improvements confined to the context of workflow generation and the existing functions are used throughout the system in several different contexts. I will put that in and then open this PR for review.

danielabdi-noaa · 2022-10-19T20:42:26Z

@mkavulich Are you thinking of modifying the logger.info/debug behaviour to include dedentation automatically or a separate function? As you pointed out, there is no built-in way but there seem to be workarounds like this that will automatically dedent both console and logfile output. I tried it quickly on your PR with this patch, and it looks like it does the job.
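The linked workaround is not reproduced here, but one generic way to get automatic dedenting with the standard library is a `logging.Filter` that rewrites each record's message before it is emitted. The class name `DedentFilter` is an assumption for this sketch and is not necessarily what the referenced patch does.

```python
import io
import logging
from textwrap import dedent


class DedentFilter(logging.Filter):
    """Dedent string messages in place; never drops a record (illustrative)."""

    def filter(self, record):
        if isinstance(record.msg, str):
            record.msg = dedent(record.msg)
        return True


# Attach the filter to a handler so that every record passing through that
# handler is dedented, whichever logger produced it.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.addFilter(DedentFilter())

logger = logging.getLogger("demo_dedent_filter")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("""
    Starting experiment generation...
      (indented detail line)
    """)
print(buf.getvalue())
```

To cover both console and log file output, the same filter would be added to each handler. The trade-off is that dedenting is then applied to every message, whether or not the caller wants it.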

mkavulich · 2022-10-20T00:04:03Z

@danielabdi-noaa We could discuss a better strategy in the future but I don't want to "force" dedenting on all log messages, so I have implemented it as a separate function similar to print_info_msg() for this initial effort.
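A separate, opt-in helper along these lines might look like the following minimal sketch. The name `log_info`, its parameters, and the optional `logger` argument are assumptions made for illustration; the function added in this PR may have a different signature.

```python
import logging
from textwrap import dedent


def log_info(msg: str, verbose: bool = True, dedent_: bool = True, logger=None) -> None:
    """Log msg at INFO level, optionally dedenting it first.

    Hypothetical sketch: nothing is logged when verbose is False, and
    dedenting can be switched off per call.
    """
    if not verbose:
        return
    if logger is None:
        logger = logging.getLogger()
    logger.info(dedent(msg) if dedent_ else msg)
```

Keeping the dedent in a wrapper leaves plain `logger.info` calls untouched, so callers that have already formatted their strings are not affected.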

Thanks for all your comments here so far, I am opening this up for wider review but let me know if you have additional comments/suggestions.

venitahagerty · 2022-10-20T00:23:05Z

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1085525584/20221020000517/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
2022-10-20 00:36:11 +0000 :: hfe05 :: Task make_orog, jobid=36829776, in state DEAD (FAILED), ran for 39.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
2022-10-20 00:36:13 +0000 :: hfe05 :: Task make_orog, jobid=36829774, in state DEAD (FAILED), ran for 51.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
2022-10-20 00:36:10 +0000 :: hfe11 :: Task make_orog, jobid=36829772, in state DEAD (FAILED), ran for 55.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
2022-10-20 00:36:06 +0000 :: hfe05 :: Task make_orog, jobid=36829779, in state DEAD (FAILED), ran for 38.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
2022-10-20 00:36:14 +0000 :: hfe12 :: Task make_orog, jobid=36829780, in state DEAD (FAILED), ran for 41.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
2022-10-20 00:36:10 +0000 :: hfe08 :: Task make_orog, jobid=36829770, in state DEAD (FAILED), ran for 53.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
2022-10-20 00:36:12 +0000 :: hfe08 :: Task make_orog, jobid=36829778, in state DEAD (FAILED), ran for 40.0 seconds, exit status=256, try=2 (of 2)
Experiment Failed on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
2022-10-20 00:36:09 +0000 :: hfe07 :: Task make_orog, jobid=36829775, in state DEAD (FAILED), ran for 43.0 seconds, exit status=256, try=2 (of 2)
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
All experiments completed

venitahagerty · 2022-10-20T00:39:39Z

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1085525584/20221020000511/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-jet-intel-WE
Experiment Succeeded on jet: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on jet: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
Experiment Succeeded on jet: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta

- Remove/consolidate outdated or unnecessary comments
- Remove print of rocoto details; it is now impossible for users to get this far *without* having loaded a module that includes rocoto
- Add logging to get_crontab_contents.py

Bug in check_var_valid_value.py caused an unrelated exception if the intended check failed and err_msg was not specified. Turns out that nowhere in the code is err_msg used, so let's get rid of it! In addition, trying to check an empty string caused an *additional* exception, so added some logic to handle that case.
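For illustration only, a value check without the unused err_msg argument could be sketched as below. How the real function treats an empty string is not stated here, so rejecting it with its own clear message is an assumption of this example, as is the parameter name `valid_values`.

```python
def check_var_valid_value(var, valid_values):
    """Return True if var is one of valid_values, else raise ValueError.

    Illustrative sketch; the empty-string handling is an assumption.
    """
    if var is None or var == "":
        raise ValueError(
            f"An empty value is not supported; valid values are: {list(valid_values)}"
        )
    if var not in valid_values:
        raise ValueError(
            f"The value {var!r} is not supported; valid values are: {list(valid_values)}"
        )
    return True
```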

mkavulich · 2022-10-21T21:20:55Z

@christinaholtNOAA I believe I have addressed most of your concerns and followed up on the rest with a few comments/questions. Let me know what you think.

venitahagerty · 2022-10-21T21:35:18Z

Machine: jet
Compiler: intel
Job: WE
Repo location: /lfs1/BMC/nrtrr/rrfs_ci/autoci/pr/1085525584/20221021213518/ufs-srweather-app
If test failed, please make changes and add the following label back:
ci-jet-intel-WE

christinaholtNOAA

Everything looks good to me. The last two items (a comment below and an unresolved previous comment) are purely cosmetic.

ush/setup.py

venitahagerty · 2022-10-21T21:51:00Z

Machine: hera
Compiler: intel
Job: WE
Repo location: /scratch1/BMC/zrtrr/rrfs_ci/autoci/pr/1085525584/20221021213515/ufs-srweather-app
Build was Successful
Rocoto jobs started
Long term tracking will be done on 9 experiments
If test failed, please make changes and add the following label back:
ci-hera-intel-WE
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
Experiment Succeeded on hera: grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
Experiment Succeeded on hera: nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
Experiment Succeeded on hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR

mkavulich · 2022-10-21T22:06:50Z

@danielabdi-noaa I think you still have one un-resolved comment as well, let me know if everything looks good!

danielabdi-noaa · 2022-10-21T22:09:42Z

I've now run almost all error testing cases listed in the PR description and they work as expected so approving.

Only the following case looks like it is impossible to catch, on Hera at least. If we ran a WE2E test case without conda activation, though, it would probably have been caught!

Prior to loading regional_workflow conda environment

This one is disabled to get the unittests working, and I think it should stay that way for now. I've had to disable a couple of hard exits to get unittests working before, but if the benefits outweigh the cons we could get rid of the unittests.

Prior to building code

mkavulich · 2022-10-21T22:14:35Z

@danielabdi-noaa Agreed on both counts. It would be nice to implement a check that the code is built, but I agree we will need to think of a better way to sync that with automated tests.

mkavulich force-pushed the feature/improve_error_messaging branch 3 times, most recently from ceb4c4d to b00f98d Compare October 14, 2022 22:56

mkavulich mentioned this pull request Oct 20, 2022: Improve error trapping and messaging to the end user #385 (Closed)

mkavulich added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 20, 2022

venitahagerty removed ci-jet-intel-WE Kicks off automated workflow test on jet with intel ci-hera-intel-WE Kicks off automated workflow test on hera with intel labels Oct 20, 2022

mkavulich added the ci-jet-intel-WE Kicks off automated workflow test on jet with intel label Oct 20, 2022

venitahagerty removed the ci-jet-intel-WE Kicks off automated workflow test on jet with intel label Oct 20, 2022

mkavulich marked this pull request as ready for review October 20, 2022 02:41

mkavulich requested review from gsketefian, JeffBeck-NOAA, RatkoVasic-NOAA, BenjaminBlake-NOAA, ywangwof, chan-hoo, panll, christinaholtNOAA, christopherwharrop-noaa and danielabdi-noaa as code owners October 20, 2022 02:41

mkavulich added 13 commits October 21, 2022 21:19

a91df94 Add testing subprocess code back in per Daniel Abdi's advice; this is necessary for CI testing
2ba941a Forgot to set up logger for check_for_preexist_dir_file.py; not sure why this didn't fail in manual tests...
a90be2d More fixes to Daniels tests
6389c76 Add missing import to fix failing test
c76df4f Address some comments from Daniel: implement logging.info as its own function with built-in dedenting.
3835b9b Fail with error if model executable does not exist
6f56ed6 Revert "Fail with error if model executable does not exist" (This reverts commit 252f91d.)
5b7093e Restore "verbose" functionality now that we have a function
2d4f5b0 Missed one VERBOSE pass
d2c9cbf The f-string "=" specifier is not supported for python 3.6 (our minimum requirement); reverting to older format
73a5c50 Address reviewer comments
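For context on the f-string commit above: the "=" specifier was added in Python 3.8, so code that must also run on Python 3.6 needs the longer spelling. The variable name below is just an example.

```python
run_envir = "nco"

# Python 3.8+ only: f"{run_envir=}" expands to the expression text, an
# equals sign, and the repr of the value.
# Equivalent spelling that also works on Python 3.6 and 3.7:
message = f"run_envir={run_envir!r}"
print(message)  # prints: run_envir='nco'
```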

mkavulich force-pushed the feature/improve_error_messaging branch from 5d43ec2 to 73a5c50 Compare October 21, 2022 21:19

mkavulich added ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 21, 2022

venitahagerty removed ci-hera-intel-WE Kicks off automated workflow test on hera with intel ci-jet-intel-WE Kicks off automated workflow test on jet with intel labels Oct 21, 2022

christinaholtNOAA approved these changes Oct 21, 2022

mkavulich added 2 commits October 21, 2022 16:03

d581228 Comment to clarify use of manual "raise" in setup.py
d525918 One more comment addressed

danielabdi-noaa approved these changes Oct 21, 2022

mkavulich merged commit e08e5c5 into ufs-community:develop Oct 21, 2022

mkavulich deleted the feature/improve_error_messaging branch July 19, 2023 14:49
