Add support for native ERA5 data in GRIB format #2178

Open
wants to merge 45 commits into
base: main

schlunma
Contributor

@schlunma schlunma commented Aug 23, 2023

Description

This PR allows ESMValCore to process native ERA5 data in GRIB format, which is for example available on Levante in the /pool/data/ERA5 directory.

Reading the data

The following settings are necessary in the user configuration file:

rootpath:
  ...
  native6:
    /pool/data/ERA5: DKRZ-ERA5-GRIB
  ...

I added an extra facets file which includes reasonable defaults for all supported variables. You can check it out here.

Thus, reading this data is as easy as

datasets:
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: tas, mip: Amon}
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: cl, mip: Amon, tres: 1H, frequency: 1hr}
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: ta, mip: Amon, type: fc, typeid: '12'}

Regridding

Native ERA5 data in GRIB format is on a reduced Gaussian grid (i.e., an unstructured grid). Thus, in almost all use cases it is necessary to regrid this data, especially since no cell areas are available for the data (so we cannot even calculate global/regional statistics over the native data). The CMORizer does this automatically (as recommended by ECMWF), but it can be disabled in the recipe:

datasets:
  - {project: native6, dataset: ERA5, timerange: '2000/2001', short_name: tas, mip: Amon, regrid: false}

This PR depends on the following other PRs:


Closes #1991
Closes ESMValGroup/ESMValTool#3238

Link to documentation: https://esmvaltool--2178.org.readthedocs.build/projects/ESMValCore/en/2178/quickstart/find_data.html#supported-native-reanalysis-observational-datasets


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@schlunma schlunma added this to the v2.10.0 milestone Aug 23, 2023
@schlunma schlunma self-assigned this Aug 23, 2023
@codecov

codecov bot commented Aug 23, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.95%. Comparing base (2247a29) to head (22cdf6c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2178      +/-   ##
==========================================
+ Coverage   94.77%   94.95%   +0.18%     
==========================================
  Files         249      249              
  Lines       14081    14164      +83     
==========================================
+ Hits        13345    13450     +105     
+ Misses        736      714      -22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@schlunma
Contributor Author

schlunma commented Aug 25, 2023

This is ready from my side, but there are two issues that need to be resolved before I mark this ready for review:

I tested this thoroughly with the following recipe: recipe_000.yml.txt

An example run is available on Levante here: /home/b/b309141/scratch/esmvaltool_output/recipe_000_20230825_080240

Note that with the default dask scheduler, this recipe ran into a timeout after 8 hours with 67/76 tasks finished. With the following dask configuration, I could run the same recipe on the same node (regular Levante compute node with 256 GiB of memory) in 5:27 min (!!) 🚀.

cluster:
  type: distributed.LocalCluster
  n_workers: 32
  threads_per_worker: 4
  memory_limit: 8 GiB
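
For what it's worth, the numbers in this configuration add up to the node size: 32 workers with 8 GiB each use the full 256 GiB. A minimal sketch of that arithmetic in Python follows; the helper name `local_cluster_config` is made up for illustration and is not part of dask or ESMValCore.

```python
def local_cluster_config(node_memory_gib, n_workers, threads_per_worker):
    """Build a config dict that splits the node memory evenly among workers.

    Hypothetical helper for illustration only.
    """
    if node_memory_gib % n_workers:
        raise ValueError("node memory does not divide evenly among workers")
    per_worker_gib = node_memory_gib // n_workers
    return {
        "cluster": {
            "type": "distributed.LocalCluster",
            "n_workers": n_workers,
            "threads_per_worker": threads_per_worker,
            "memory_limit": f"{per_worker_gib} GiB",
        }
    }


config = local_cluster_config(256, n_workers=32, threads_per_worker=4)
cluster = config["cluster"]
total_threads = cluster["n_workers"] * cluster["threads_per_worker"]
print(cluster["memory_limit"], total_threads)
```

The sketch only reproduces the values shown above; suitable worker and thread counts for other nodes or recipes would need their own testing.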

@ESMValGroup/technical-lead-development-team

@schlunma schlunma modified the milestones: v2.10.0, v2.11.0 Sep 28, 2023
@schlunma
Contributor Author

This is now ready for review.

Here is a recipe that can be used to test this: recipe_era5_grib.yml.txt. Please note that this needs the iris-grib package and an updated config-user.yml file as described here.

Here is the output of that recipe: https://esmvaltool.dkrz.de/shared/esmvaltool/era5_grib_tests/recipe_era5_grib_20240611_114507/

@schlunma schlunma marked this pull request as ready for review June 11, 2024 12:10
@schlunma schlunma requested a review from axel-lauer June 11, 2024 14:04
Contributor

@axel-lauer axel-lauer left a comment

Great work @schlunma ! I looked at the output of your test recipe and everything looks fine. I also ran some tests for specific humidity (hus) at 100 hPa. The results look good, so ESMValGroup/ESMValTool#3238 can be closed once this PR is merged. Yay!
The only thing I find slightly annoying is that you have to use the regrid preprocessor to obtain ERA5 output that is usable with all diagnostics. If I want to use the same preprocessor also for model datasets (which is what I do often), I cannot analyze the model data on their native grid.
An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.

P.S.: I think iris-grib needs to be added to the environment so it does not need to be installed manually.

@schlunma
Contributor Author

Great, thanks for reviewing @axel-lauer!

The only thing I find slightly annoying is that you have to use the regrid preprocessor to obtain ERA5 output that is usable with all diagnostics. If I want to use the same preprocessor also for model datasets (which is what I do often), I cannot analyze the model data on their native grid. An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.

Good point. Initially I decided not to do that because the fix (more specifically fix_metadata) is executed for each input file separately (and for hourly data there are lots of files!), so this would be a huge performance bottleneck. However, I just realized that we also have fix_data, which is executed after cube concatenation and time range clipping, so this would be the perfect place for the regridding.

The only issue I see here is that there is a risk of regridding twice if users use a custom regrid preprocessor and the "default" regridding. Do you think this is a problem? Should we enable the "default" regridding by default (I guess yes?)?

P.S.: I think iris-grib needs to be added to the environment so it does not need to be installed manually.

This has already been added in #2453 to let the tests pass 👍
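
To illustrate the performance argument above, here is a toy sketch that only counts how often an expensive step would run when it sits in a per-file fix versus in a step that runs once after concatenation. The function `count_regrid_calls` and the file count of 24 are invented for this example; nothing here calls ESMValCore.

```python
def count_regrid_calls(n_files, regrid_per_file):
    """Count regrid calls in a toy load -> fix -> concatenate -> fix pipeline."""
    calls = 0
    loaded = []
    for index in range(n_files):
        cube = {"source_file": index}  # stand-in for one loaded input file
        if regrid_per_file:
            calls += 1  # expensive step repeated for every input file
        loaded.append(cube)
    # Stand-in for concatenating all loaded files into one object
    merged = {"source_files": [cube["source_file"] for cube in loaded]}
    if not regrid_per_file:
        calls += 1  # expensive step runs once on the concatenated data
    return merged, calls


_, per_file_calls = count_regrid_calls(24, regrid_per_file=True)
_, single_calls = count_regrid_calls(24, regrid_per_file=False)
print(per_file_calls, single_calls)
```

This counts calls only. The total amount of data to regrid is the same in both cases, so the real gain would depend on how much fixed overhead each call carries.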

@schlunma
Contributor Author

schlunma commented Jul 4, 2024

An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.

This is implemented in cc632dc now.

@ESMValGroup/technical-lead-development-team could any of you have a brief look at this and perform a technical review? That would be awesome, thanks!

# Settings for all variables of all MIPs
'*':
  '*':
    family: E5
Member

Is there a place where the meaning of these facets is explained? If yes, it would be nice to add a link here to make it easier to understand where they come from and how to update them.

Contributor Author

Added in 4c3f1be

self.vardef.short_name,
DEFAULT_ERA5_GRID,
)
cube = regrid(cube, DEFAULT_ERA5_GRID, 'linear')
Member

@bouweandela bouweandela Jul 8, 2024

While I can see that the feature is useful, in particular to make seamlessly switching between NetCDF and GRIB ERA5 data possible, I'm not sure if it should be enabled by default. I just checked what the effect of this would be for a few recipes:

clouds/recipe_lauer22jclim_fig5_lifrac.yml
clouds/recipe_lauer22jclim_fig1_clim.yml
recipe_climwip_test_performance_sigma.yml
bock20jgr/recipe_bock20jgr_fig_1-4.yml
ipccwg1ar6ch3/recipe_ipccwg1ar6ch3_fig_3_19.yml
model_evaluation/recipe_model_evaluation_basics.yml

and in all cases, it seems this will lead to double regridding. Therefore I think that disabled by default may be a better option.

Another concern is that a fix seems conceptually the wrong place to implement regridding, as this starts mixing preprocessing with data loading. But maybe there is no way around it.

Member

It would also be possible to automatically add a regrid preprocessor in esmvalcore/_recipe/recipe.py for ERA5 data, but I'm not sure if that would be any cleaner.

Contributor Author

Yeah, double regridding is certainly not optimal. However, I am not sure if disabling this by default is really useful, because then a user could make use of the regular preprocessor anyway. I think this feature is only really useful if you don't need any special settings.

It would also be possible to automatically add a regrid preprocessor in esmvalcore/_recipe/recipe.py for ERA5 data, but I'm not sure if that would be any cleaner.

Unfortunately that won't work, since the regridding should only be performed if the data is on an unstructured grid. However, the data is not loaded yet when the relevant part of esmvalcore/_recipe/recipe.py is executed, so we don't know the type of grid yet.

So, how should we proceed? I am rather undecided and can see the advantages and disadvantages of both sides...

Member

There is the principle of least astonishment and to me at least, automatic regridding for one dataset and not for any other dataset is rather surprising, even if it may make sense in this case.

Contributor Author

One could also argue that the least surprise would be to get files on a regular grid (like the ones you get when you download the netCDF data).

How about checking the extension of the files in esmvalcore/_recipe/recipe.py and applying the regridding based on this? This would avoid double regridding.
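
A rough sketch of what such an extension-based decision could look like is given below. It is not the code from this PR: the function `needs_default_regrid`, the shape of the preprocessor settings dict, and the example file names are assumptions made for illustration. Only the list of GRIB suffixes is taken from the diff in this PR.

```python
from pathlib import PurePosixPath

# Suffixes treated as GRIB, copied from the loader change in this PR
GRIB_SUFFIXES = {".grib2", ".grib", ".grb2", ".grb", ".gb2", ".gb"}


def needs_default_regrid(files, preprocessor_settings):
    """Decide whether a default regrid step should be added.

    Hypothetical helper: `preprocessor_settings` is assumed to map
    preprocessor step names to their options.
    """
    if "regrid" in preprocessor_settings:
        # The user already regrids; adding a default would regrid twice
        return False
    return any(PurePosixPath(f).suffix in GRIB_SUFFIXES for f in files)


grib_only = needs_default_regrid(["example_tas.grb"], {})
grib_with_regrid = needs_default_regrid(
    ["example_tas.grb"],
    {"regrid": {"target_grid": "1x1", "scheme": "linear"}},
)
netcdf_only = needs_default_regrid(["example_tas.nc"], {})
print(grib_only, grib_with_regrid, netcdf_only)
```

Note that `suffix` is case-sensitive and only looks at the last extension, so a file named `example.GRB` or `example.grb.gz` would not be recognized by this check.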

regrid: true
type: an
typeid: '00'
version: '' # necessary to get a nice output file name
Member

Suggested change
- version: '' # necessary to get a nice output file name
+ version: 'v1' # necessary to get a nice output file name

may be nicer? This is also the version we use for NetCDF ERA5 data.

Member

@bouweandela bouweandela Jul 8, 2024

Contributor Author

Hmm, I am not sure about this. Does it really make sense to invent an arbitrary version number here? The only reason we need this is to derive the output file name; it's not used at all to find the data.

Member

Yes, we invent version numbers for all observational and reanalysis datasets.

Contributor Author

This is certainly not true for all observational and reanalysis datasets. I will change it in the extra facets file.

Contributor Author

version: '' # necessary to get a nice output file name

# Variable-specific settings
albsn:
Member

Would it be possible to add var_name or a similar name (e.g. standard_name, long_name) to this mapping and add a simple check in AllVars.fix_metadata that the input data actually contains the right variable? I've always been concerned that people may put the wrong file in the wrong directory and get completely wrong results out of these fixes.

Contributor Author

In principle I like that idea, but I don't know if that works in practice. For example, are we sure that the names of the netCDF input files are the exact same as for the GRIB input files? If this is not the case, the fix will start failing for one or the other.

I would argue that for the GRIB files the risk of using wrong files is rather low, as the GRIB ID (= the variable name) is part of the file name. So this is more of a problem for the netCDF input files, for which we simply use *.nc as input file pattern.

Member

I would expect (but have not checked), that ECMWF uses the same names for their variables, regardless of the file format. e.g. are these variable names present in both file formats? https://confluence.ecmwf.int/pages/viewpage.action?pageId=82870405#ERA5:datadocumentation-Table4

Contributor Author

While this is probably true, I don't think this check is relevant for the GRIB files. As I mentioned, they have their GRIB ID (= variable name) in the file name. We also don't check the variable name for CMIP model output.

Thus, IMHO, it would make more sense to implement this in a different PR (this one here has already >1000 lines).
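
For a possible follow-up PR, the check discussed in this thread could look roughly like the sketch below. Everything in it is hypothetical: the function `check_raw_name`, the mapping `ACCEPTED_RAW_NAMES`, and its entries are illustrative and would need to be verified against real files. Allowing several accepted names per variable is one way to handle the case where netCDF and GRIB files use different names.

```python
# Hypothetical mapping from CMOR short_name to raw names accepted in the input
# files; the entries are illustrative and not checked against real ERA5 files.
ACCEPTED_RAW_NAMES = {
    "tas": {"2t", "t2m"},
}


def check_raw_name(short_name, raw_name, accepted=ACCEPTED_RAW_NAMES):
    """Raise ValueError if the loaded variable is not an expected one."""
    names = accepted.get(short_name)
    if names is None:
        return  # no expectation recorded for this variable, nothing to check
    if raw_name not in names:
        raise ValueError(
            f"Expected one of {sorted(names)} for '{short_name}', "
            f"got '{raw_name}'; is the right file in the right directory?"
        )


check_raw_name("tas", "2t")  # passes silently
try:
    check_raw_name("tas", "tp")
    wrong_file_detected = False
except ValueError:
    wrong_file_detected = True
print(wrong_file_detected)
```

Skipping variables without a recorded expectation keeps such a check opt-in, so it would not start failing for datasets whose names have not been collected yet.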

# level, etc.
grib_formats = ('.grib2', '.grib', '.grb2', '.grb', '.gb2', '.gb')
if file.suffix in grib_formats:
    raw_cubes = iris.load(file, callback=_load_callback)
Contributor

Manu, you should tell @trexfeathers and Iris folk to tell ECMWF to list Iris as a viable/recommended GRIB loader, see https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them - one q with regards to Iris stability wrt GRIB files - is this a gods-given one now? If not, I'd argue we should perform a few consistency tests on the loaded cube(s)

Contributor Author

Since the recent update(s) of iris-grib I didn't have any difficulties reading those ERA5 GRIB files. Since I would not expect these files to change any time soon, I don't expect any problems with that. We also perform the same extensive CMOR checks on those files as we do for the netCDF files, so I am pretty confident we are fine here.

About reading GRIB files in general - I am not 100% sure if we (or better, iris-grib) support all possible GRIB files now, but I wouldn't know what to do about this here. If we encounter problems at some point in the future, we can fix that there. Again, given that we have extensive CMOR checks in our pipeline, I am fairly sure we will know if there are problems, even if we won't get very obvious errors.

@trexfeathers trexfeathers Aug 2, 2024

Hi folks, I've raised SciTools/iris-grib#515

Iris-grib definitely cannot load all possible GRIB files. As you may have noticed: the GRIB specification, and how eccodes interprets this, is not as specific as NetCDF. This means it can be hard to know for certain if Iris-grib is doing the right thing, and we typically therefore rely on finding expert users who can tell us.

So the list of loadable templates is limited to those where we could find a user to check our work. And this also means that we might later discover problems with our existing work, but so far that appears to be rare.

@@ -131,6 +140,74 @@ ERA5
of both liquid and solid phases to vapor (from underlying surface and vegetation)."
Therefore, the ERA5 (and ERA5-Land) CMORizer switches the signs of ``evspsbl`` and ``evspsblpot`` to be compatible with the CMOR standard used e.g. by the CMIP models.

.. _read_native_era5_grib:

ERA5 (in GRIB format available on DKRZ's Levante)
Member

Mention that ERA5 data in grib format can also be downloaded from ECMWF?

Contributor Author

I am not sure how useful that is in practice to be honest. All the facets and paths here are tailored towards the files on Levante. I think you'll get different paths if you download them manually since the names of the facets seem to be different in the docs.

To me, it's easier for the user if we just recommend manual download of the files in netCDF format via era5cli.

@schlunma
Contributor Author

There is currently a problem with iris-grib: SciTools/iris-grib#520

@schlunma
Contributor Author

schlunma commented Sep 4, 2024

@bouweandela would you be able to have another look at my answers to your review comments? It would be great to get this into main very soon. We want to use the features from this PR to add support for further datasets in GRIB format (e.g., CAMS reanalysis). Thanks 🙏

@trexfeathers

There is currently a problem with iris-grib: SciTools/iris-grib#520

Note that a new version of Iris-grib is now available and that should fix this problem?

@schlunma
Contributor Author

schlunma commented Sep 4, 2024

Note that a new version of Iris-grib is now available and that should fix this problem?

Looks like this is indeed the case, the tests run fine now 🎉

@bouweandela
Member

@bouweandela would you be able to have another look at my answers to your review comments?

Not this week, but hopefully next week.

Projects
None yet
5 participants