Add support for native ERA5 data in GRIB format #2178

schlunma · 2023-08-23T06:58:01Z

Description

This PR allows ESMValCore to process native ERA5 data in GRIB format, which is for example available on Levante in the /pool/data/ERA5 directory.

Reading the data

The following settings are necessary in the user configuration file:

rootpath:
  ...
  native6:
    /pool/data/ERA5: DKRZ-ERA5-GRIB
  ...

I added an extra facets file which includes reasonable defaults for all supported variables. You can check it out here.

Thus, reading this data is as easy as

datasets:
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: tas, mip: Amon}
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: cl, mip: Amon, tres: 1H, frequency: 1hr}
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: ta, mip: Amon, type: fc, typeid: '12'}

Regridding

Native ERA5 data in GRIB format is on a reduced Gaussian grid (i.e., an unstructured grid). Thus, in 99% of the use cases, it is necessary to regrid this data, especially since no cell areas are available for the data (thus, we cannot even calculate global/regional statistics over the native data). This is done automatically by the CMORizer (as recommended by the ECMWF), but can be disabled in the recipe:

datasets:
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: tas, mip: Amon, regrid: false}
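
To illustrate why a reduced Gaussian grid counts as unstructured, here is a minimal sketch in plain Python. The grid is a toy with made-up point counts per latitude row, and `build_reduced_grid` and `is_regular` are hypothetical helpers, not ESMValCore or iris functions. The point is that the number of longitudes differs between latitude rows, so the data can only be stored as a flat list of (lat, lon) points rather than as a 2D latitude x longitude array.

```python
def build_reduced_grid(points_per_row):
    """Return a flat list of (lat, lon) pairs for a toy grid.

    points_per_row maps each latitude to the number of equally spaced
    longitude points on that row (made-up numbers, for illustration only).
    """
    points = []
    for lat, n_lon in points_per_row.items():
        step = 360.0 / n_lon
        points.extend((lat, i * step) for i in range(n_lon))
    return points


def is_regular(points):
    """Return True if every latitude row has the same set of longitudes."""
    rows = {}
    for lat, lon in points:
        rows.setdefault(lat, set()).add(lon)
    return len({frozenset(lons) for lons in rows.values()}) == 1


# Toy reduced grid: fewer points at high latitudes than at the equator.
reduced = build_reduced_grid({60.0: 4, 0.0: 8, -60.0: 4})
# Toy regular grid: the same longitudes on every row.
regular = build_reduced_grid({60.0: 8, 0.0: 8, -60.0: 8})

print(len(reduced), is_regular(reduced))
print(len(regular), is_regular(regular))
```

Because the rows of the reduced grid do not share a common longitude axis, any operation that assumes a rectangular grid needs the data to be regridded first.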

This PR depends on the following other PRs:

Closes #1991
Closes ESMValGroup/ESMValTool#3238

Link to documentation: https://esmvaltool--2178.org.readthedocs.build/projects/ESMValCore/en/2178/quickstart/find_data.html#supported-native-reanalysis-observational-datasets

Before you get started

☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

🧪 The new functionality is relevant and scientifically sound
🛠 This pull request has a descriptive title and labels
🛠 Code is written according to the code quality guidelines
🧪 and 🛠 Documentation is available
🛠 Unit tests have been added
🛠 Changes are backward compatible
🛠 Any changed dependencies have been added or removed correctly
🛠 The list of authors is up to date
🛠 All checks below this pull request were successful

To help with the number of pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov · 2023-08-23T07:06:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.95%. Comparing base (2247a29) to head (22cdf6c).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2178      +/-   ##
==========================================
+ Coverage   94.77%   94.95%   +0.18%     
==========================================
  Files         249      249              
  Lines       14081    14164      +83     
==========================================
+ Hits        13345    13450     +105     
+ Misses        736      714      -22


schlunma · 2023-08-25T08:22:15Z

This is ready from my side, but there are two issues that need to be resolved before I mark this ready for review:

Cleaned and extended function that extracts datetimes from paths #2181
For some reason I can't get iris-grib to run on CircleCI; locally it works well. This is probably closely related to the problem reported by @remi-kazeroni here.

I tested this thoroughly with the following recipe: recipe_000.yml.txt

An example run is available on Levante here: /home/b/b309141/scratch/esmvaltool_output/recipe_000_20230825_080240

Note that with the default dask scheduler, this recipe ran into a timeout after 8 hours with 67/76 tasks finished. With the following dask configuration, I could run the same recipe on the same node (regular Levante compute node with 256 GiB of memory) in 5:27 min (!!) 🚀.

cluster:
  type: distributed.LocalCluster
  n_workers: 32
  threads_per_worker: 4
  memory_limit: 8 GiB
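
As a quick sanity check of the numbers above, the following sketch works out what this configuration adds up to. All values are taken from this comment; the only assumption is that `memory_limit` applies per worker, which is how `distributed.LocalCluster` interprets it.

```python
# Values taken from the cluster configuration above.
n_workers = 32
threads_per_worker = 4
memory_limit_gib = 8  # per worker

# Node memory as stated in the comment (regular Levante compute node).
node_memory_gib = 256

total_threads = n_workers * threads_per_worker
total_memory_gib = n_workers * memory_limit_gib

# Runtimes from the comment: timeout after 8 hours (run unfinished)
# versus 5:27 min with the distributed scheduler.
timeout_s = 8 * 3600
distributed_s = 5 * 60 + 27
min_speedup = timeout_s / distributed_s

print(f"threads: {total_threads}")
print(f"worker memory: {total_memory_gib} GiB of {node_memory_gib} GiB")
print(f"speedup: at least {min_speedup:.0f}x")
```

The workers together are allowed to use exactly the node's 256 GiB, and since the default-scheduler run had not finished at the timeout, the computed speedup is only a lower bound.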

@ESMValGroup/technical-lead-development-team

schlunma · 2024-06-11T12:10:25Z

This is now ready for review.

Here is a recipe that can be used to test this: recipe_era5_grib.yml.txt. Please note that this needs the iris-grib package and an updated config-user.yml file as described here.

Here is the output of that recipe: https://esmvaltool.dkrz.de/shared/esmvaltool/era5_grib_tests/recipe_era5_grib_20240611_114507/

axel-lauer

Great work @schlunma ! I looked at the output of your test recipe and everything looks fine. I also ran some tests for specific humidity (hus) at 100 hPa. The results look good, so ESMValGroup/ESMValTool#3238 can be closed once this PR is merged. Yay!
The only thing I find slightly annoying is that you have to use the regrid preprocessor to obtain ERA5 output that is usable with all diagnostics. If I want to use the same preprocessor also for model datasets (which is what I do often), I cannot analyze the model data on their native grid.
An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.

P.S.: I think iris-grib needs to be added to the environment so it does not need to be installed manually.

schlunma · 2024-06-26T14:31:39Z

Great, thanks for reviewing @axel-lauer!

The only thing I find slightly annoying is that you have to use the regrid preprocessor to obtain ERA5 output that is usable with all diagnostics. If I want to use the same preprocessor also for model datasets (which is what I do often), I cannot analyze the model data on their native grid. An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.

Good point. Initially I decided not to do that because the fix (more specifically fix_metadata) is executed for each input file separately (and for hourly data there are lots of files!), so this would be a huge performance bottleneck. However, I just realized that we also have fix_data, which is executed after cube concatenation and time range clipping, so this would be the perfect place for the regridding.

The only issue I see here is that there is a risk of regridding twice if users use a custom regrid preprocessor and the "default" regridding. Do you think this is a problem? Should we enable the "default" regridding by default (I guess yes?)?

P.S.: I think iris-grib needs to be added to the environment so it does not need to be installed manually.

This has already been added in #2453 to let the tests pass 👍

schlunma · 2024-07-04T15:26:02Z

An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.

This is implemented in cc632dc now.

@ESMValGroup/technical-lead-development-team could any of you have a brief look at this and perform a technical review? That would be awesome, thanks!

bouweandela · 2024-07-08T07:43:03Z

esmvalcore/config/extra_facets/native6-mappings.yml

+  # Settings for all variables of all MIPs
+  '*':
+    '*':
+      family: E5


Is there a place where the meaning of these facets is explained? If yes, it would be nice to add a link here to make it easier to understand where they come from and how to update them.

Added in 4c3f1be
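
As a rough illustration of how the wildcard entries in the hunk above could combine with variable-specific settings, here is a minimal sketch. It assumes that all matching entries are merged in order, with later matches overriding earlier ones; the real lookup in ESMValCore may differ in detail. `resolve_extra_facets` is a hypothetical helper, and the `tas` entry with `example_facet` is made up for the example. Only `family: E5` comes from the hunk.

```python
from fnmatch import fnmatchcase

# Toy mapping in the mip -> short_name -> facets layout of the hunk above.
# 'family: E5' is from the hunk; the 'tas' entry is hypothetical.
MAPPING = {
    '*': {
        '*': {'family': 'E5'},
        'tas': {'example_facet': 'example_value'},
    },
}


def resolve_extra_facets(mapping, mip, short_name):
    """Merge all matching entries; later matches override earlier ones."""
    facets = {}
    for mip_pattern, variables in mapping.items():
        if not fnmatchcase(mip, mip_pattern):
            continue
        for var_pattern, values in variables.items():
            if fnmatchcase(short_name, var_pattern):
                facets.update(values)
    return facets


print(resolve_extra_facets(MAPPING, 'Amon', 'tas'))
print(resolve_extra_facets(MAPPING, 'Amon', 'pr'))
```

Under this reading, every variable of every MIP picks up the wildcard defaults, and a variable-specific block only needs to list what differs.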

bouweandela · 2024-07-08T08:12:28Z

esmvalcore/cmor/_fixes/native6/era5.py

+                self.vardef.short_name,
+                DEFAULT_ERA5_GRID,
+            )
+            cube = regrid(cube, DEFAULT_ERA5_GRID, 'linear')


While I can see that the feature is useful, in particular to make seamlessly switching between NetCDF and GRIB ERA5 data possible, I'm not sure it should be enabled by default. I checked what the effect would be on a few recipes:

clouds/recipe_lauer22jclim_fig5_lifrac.yml
clouds/recipe_lauer22jclim_fig1_clim.yml
recipe_climwip_test_performance_sigma.yml
bock20jgr/recipe_bock20jgr_fig_1-4.yml
ipccwg1ar6ch3/recipe_ipccwg1ar6ch3_fig_3_19.yml
model_evaluation/recipe_model_evaluation_basics.yml

and in all cases, it seems this will lead to double regridding. Therefore I think that disabled by default may be a better option.

Another concern is that a fix seems conceptually the wrong place to implement regridding, as this starts mixing preprocessing with data loading. But maybe there is no way around it.

It would also be possible to automatically add a regrid preprocessor in esmvalcore/_recipe/recipe.py for ERA5 data, but I'm not sure if that would be any cleaner.

Yeah, double regridding is certainly not optimal. However, I am not sure if disabling this by default is really useful, because then a user could make use of the regular preprocessor anyway. I think this feature is only really useful if you don't need any special settings.

It would also be possible to automatically add a regrid preprocessor in esmvalcore/_recipe/recipe.py for ERA5 data, but I'm not sure if that would be any cleaner.

Unfortunately that won't work, since the regridding should only be performed if the data is on an unstructured grid. However, the data is not loaded yet when the relevant part of esmvalcore/_recipe/recipe.py is executed, so we don't know the type of grid yet.

So, how should we proceed? I am rather undecided and can see the advantages and disadvantages of both sides...

There is the principle of least astonishment and to me at least, automatic regridding for one dataset and not for any other dataset is rather surprising, even if it may make sense in this case.

One could also argue that the least surprise would be to get files on a regular grid (like the ones you get when you download the netCDF data).

How about checking the extension of the files in esmvalcore/_recipe/recipe.py and applying the regridding based on this? This would avoid double regridding.
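
One possible reading of this suggestion, sketched in plain Python: decide from the file names alone, before any data is loaded. The suffix tuple is copied from the `esmvalcore/preprocessor/_io.py` hunk quoted later in this thread; `is_grib_file` and `needs_default_regridding` are hypothetical helpers, and the file names in the usage lines are made up.

```python
from pathlib import Path

# Suffixes copied from the hunk in esmvalcore/preprocessor/_io.py.
GRIB_SUFFIXES = ('.grib2', '.grib', '.grb2', '.grb', '.gb2', '.gb')


def is_grib_file(path):
    """Return True if the file suffix marks a GRIB file (hypothetical helper)."""
    return Path(path).suffix in GRIB_SUFFIXES


def needs_default_regridding(paths):
    """Decide on regridding from file names alone, without loading data."""
    return bool(paths) and all(is_grib_file(p) for p in paths)


# Made-up file names, for illustration only.
print(needs_default_regridding(['era5_tas_2000.grb', 'era5_tas_2001.grb']))
print(needs_default_regridding(['era5_tas_2000.nc']))
```

Note that this comparison is case-sensitive, so a file ending in `.GRB` would not be recognized, and that a suffix check says nothing about the grid actually stored in the file.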


bouweandela · 2024-07-08T08:19:37Z

esmvalcore/config/extra_facets/native6-mappings.yml

+      regrid: true
+      type: an
+      typeid: '00'
+      version: ''  # necessary to get a nice output file name


Suggested change

-      version: ''  # necessary to get a nice output file name
+      version: 'v1'  # necessary to get a nice output file name

may be nicer? This is also the version we use for NetCDF ERA5 data.

Actually, I think this should be specified in the recipe, e.g:
https://github.com/ESMValGroup/ESMValTool/blob/64c371e88d79accb300574f04832e16127a1d9df/esmvaltool/recipes/clouds/recipe_lauer22jclim_fig5_lifrac.yml#L170-L171

Hmm, I am not sure about this. Does it really make sense to invent an arbitrary version number here? The only reason we need this is to derive the output file name; it's not used at all to find the data.

Yes, we invent version numbers for all observational and reanalysis datasets.

This is certainly not true for all observational and reanalysis datasets. I will change it in the extra facets file.

bouweandela · 2024-07-08T08:58:36Z

esmvalcore/config/extra_facets/native6-mappings.yml

+      version: ''  # necessary to get a nice output file name
+
+    # Variable-specific settings
+    albsn:


Would it be possible to add var_name or a similar name (e.g. standard_name, long_name) to this mapping and add a simple check in AllVars.fix_metadata that the input data actually contains the right variable? I've always been concerned that people may put the wrong file in the wrong directory and get completely wrong results out of these fixes.

In principle I like that idea, but I don't know if that works in practice. For example, are we sure that the names of the netCDF input files are the exact same as for the GRIB input files? If this is not the case, the fix will start failing for one or the other.

I would argue that for the GRIB files the risk of using wrong files is rather low, as the GRIB ID (= the variable name) is part of the file name. So this is more of a problem for the netCDF input files, for which we simply use *.nc as input file pattern.

I would expect (but have not checked), that ECMWF uses the same names for their variables, regardless of the file format. e.g. are these variable names present in both file formats? https://confluence.ecmwf.int/pages/viewpage.action?pageId=82870405#ERA5:datadocumentation-Table4

While this is probably true, I don't think this check is relevant for the GRIB files. As I mentioned, they have their GRIB ID (= variable name) in the file name. We also don't check the variable name for CMIP model output.

Thus, IMHO, it would make more sense to implement this in a different PR (this one here has already >1000 lines).

valeriupredoi · 2024-07-08T13:21:27Z

esmvalcore/preprocessor/_io.py

+            # level, etc.
+            grib_formats = ('.grib2', '.grib', '.grb2', '.grb', '.gb2', '.gb')
+            if file.suffix in grib_formats:
+                raw_cubes = iris.load(file, callback=_load_callback)


Manu, you should tell @trexfeathers and Iris folk to tell ECMWF to list Iris as a viable/recommended GRIB loader, see https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them - one q with regards to Iris stability wrt GRIB files - is this a gods-given one now? If not, I'd argue we should perform a few consistency tests on the loaded cube(s)

Since the recent update(s) of iris-grib I didn't have any difficulties reading those ERA5 GRIB files. Since I would not expect that these files will change any time soon, I won't expect any problems with that. We also perform the same extensive CMOR checks on those files as we do for the netCDF files, so I am pretty confident we are fine here.

About reading GRIB files in general - I am not 100% sure if we (or better, iris-grib) support all possible GRIB files now, but I wouldn't know what to do about this here. If we encounter problems at some point in the future, we can fix that there. Again, given that we have extensive CMOR checks in our pipeline, I am fairly sure we will know if there are problems, even if we won't get very obvious errors.

Hi folks, I've raised SciTools/iris-grib#515

Iris-grib definitely cannot load all possible GRIB files. As you may have noticed: the GRIB specification, and how eccodes interprets this, is not as specific as NetCDF. This means it can be hard to know for certain if Iris-grib is doing the right thing, and we typically therefore rely on finding expert users who can tell us.

So the list of loadable templates is limited to those where we could find a user to check our work. And this also means that we might later discover problems with our existing work, but so far that appears to be rare.

bouweandela · 2024-07-23T13:51:56Z

doc/quickstart/find_data.rst

@@ -131,6 +140,74 @@ ERA5
  of both liquid and solid phases to vapor (from underlying surface and vegetation)."
  Therefore, the ERA5 (and ERA5-Land) CMORizer switches the signs of ``evspsbl`` and ``evspsblpot`` to be compatible with the CMOR standard used e.g. by the CMIP models.

+.. _read_native_era5_grib:
+
+ERA5 (in GRIB format available on DKRZ's Levante)


Mention that ERA5 data in grib format can also be downloaded from ECMWF?

I am not sure how useful that is in practice to be honest. All the facets and paths here are tailored towards the files on Levante. I think you'll get different paths if you download them manually since the names of the facets seem to be different in the docs.

To me, it's easier for the user if we just recommend manual download of the files in netCDF format via era5cli.

schlunma · 2024-08-13T15:43:11Z

There is currently a problem with iris-grib: SciTools/iris-grib#520

schlunma · 2024-09-04T09:03:36Z

@bouweandela would you be able to have another look at my answers to your review comments? It would be great to get this into main very soon. We want to use the features from this PR to add support for further datasets in GRIB format (e.g., CAMS reanalysis). Thanks 🙏

trexfeathers · 2024-09-04T14:18:10Z

There is currently a problem with iris-grib: SciTools/iris-grib#520

Note that a new version of Iris-grib is now available and that should fix this problem?

schlunma · 2024-09-04T16:08:54Z

Note that a new version of Iris-grib is now available and that should fix this problem?

Looks like this is indeed the case, the tests run fine now 🎉

bouweandela · 2024-09-05T07:46:45Z

@bouweandela would you be able to have another look at my answers to your review comments?

Not this week, but hopefully next week.

schlunma added 7 commits August 18, 2023 13:56

First working prototype of ERA5 GRIB reader

f998ae3

Extended list of supported variables for ERA5 GRIB support

8c48834

Merge remote-tracking branch 'origin/main' into read_era5_grib

9740cd3

Added public function to check for unstructured grids

60488dc

Make regridding much faster

f9a4ab7

Add support for more variables and make regridding optional

fc7384a

Add doc

e0c4da3

schlunma added the observations label Aug 23, 2023

schlunma added this to the v2.10.0 milestone Aug 23, 2023

schlunma self-assigned this Aug 23, 2023

schlunma added 13 commits August 23, 2023 09:49

Added first tests

3db12bb

Added test for loading grib files

09aabcb

Added iris-grib to environment and setup.py

8c73373

Fixed environment

744c20b

Fixed eccodes dependency

0c8ce64

Next try to get environment working

39c6677

Temporarily remove GRIB loading test

b7a0c68

Fixed tests

e907ff1

Added missing tests

0b8fbfa

Fixed test

bef5b5e

Improved test coverage of ERA5 CMORizer

2693066

Increased test coverage of regrid module

e7e7285

Optimized doc

911ed28

schlunma and others added 2 commits August 25, 2023 13:44

More customizable automatic regriddind for ERA5 GRIB

7afa551

Merge branch 'main' into read_era5_grib

7d09738

schlunma modified the milestones: v2.10.0, v2.11.0 Sep 28, 2023

Merge remote-tracking branch 'origin/main' into read_era5_grib

d0ae8d2

schlunma added 3 commits June 11, 2024 10:18

Update docs to latest changes

846930b

Do not fix time bounds for variables with no time dim coord

e6125ab

Add ERA5-GRIB to Levante-specific options

d6bbf21

Merge branch 'main' into read_era5_grib

7be8966

schlunma marked this pull request as ready for review June 11, 2024 12:10

schlunma requested a review from axel-lauer June 11, 2024 14:04

axel-lauer approved these changes Jun 18, 2024


schlunma added 2 commits July 4, 2024 15:43

Merge remote-tracking branch 'origin/main' into read_era5_grib

67d9d31

Re-enable automatic regridding

cc632dc

bouweandela reviewed Jul 8, 2024


bouweandela reviewed Jul 8, 2024


schlunma added 2 commits July 8, 2024 14:01

Rename extra facets file and add link to Levante doc

4c3f1be

Fix doc build

317cf22

valeriupredoi reviewed Jul 8, 2024


bouweandela reviewed Jul 23, 2024


schlunma added 2 commits July 24, 2024 11:08

Update version of ERA5 GRIB data in extra facets

0fea4ab

Merge remote-tracking branch 'origin/main' into read_era5_grib

5e6d149

Merge branch 'main' into read_era5_grib

22cdf6c

trexfeathers mentioned this pull request Sep 4, 2024

Reading file fails with latest iris version SciTools/iris-grib#520

Open
