-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for native ERA5 data in GRIB format #2178
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2178 +/- ##
==========================================
+ Coverage 94.77% 94.95% +0.18%
==========================================
Files 249 249
Lines 14081 14164 +83
==========================================
+ Hits 13345 13450 +105
+ Misses 736 714 -22 ☔ View full report in Codecov by Sentry. |
This is ready from my side, but there's two issues that need to be resolved before I mark this ready for review:
I tested this thoroughly with the following recipe: recipe_000.yml.txt An example run is available on Levante here: Note that with the default dask scheduler, this recipe ran into a timeout after 8 hours with 67/76 tasks finished. With the following dask configuration, I could run the same recipe on the same node (regular Levante compute node with 256 GiB of memory) in 5:27 min (!!) 🚀. cluster:
type: distributed.LocalCluster
n_workers: 32
threads_per_worker: 4
memory_limit: 8 GiB @ESMValGroup/technical-lead-development-team |
This is now ready for review. Here is a recipe that can be used to test this: recipe_era5_grib.yml.txt. Please not that this needs the Here is the output of that recipe: https://esmvaltool.dkrz.de/shared/esmvaltool/era5_grib_tests/recipe_era5_grib_20240611_114507/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @schlunma ! I looked at the output of your test recipe and everything looks fine. I also ran some tests for specific humidity (hus) at 100 hPa. The results look good, so ESMValGroup/ESMValTool#3238 can be closed once this PR is merged. Yay!
The only thing I find slightly annoying is that you have to use the regrid preprocessor to obtain ERA5 output that is usable with all diagnostics. If I want to use the same preprocessor also for model datasets (which is what I do often), I cannot analyze the model data on their native grid.
An idea (please feel free to ignore) might be to introduce an additional facet for native6 datasets, e.g. "default_regridding: true", which would apply the "default" regridding to the data, i.e. 0.25°x0.25° with a linear scheme as recommended by ECMWF.
P.S.: I think iris-grib needs to be added to the environment so it does not need to be installed manually.
Great, thanks for reviewing @axel-lauer!
Good point. Initially I decided not to do that because the fix (more specifically The only issue I see here is that there is a risk of regridding twice if users use a custom regrid preprocessor and the "default" regridding. Do you think this is a problem? Should we enable the "default" regridding by default (I guess yes?)?
This has already been added in #2453 to let the tests pass 👍 |
This is implemented in cc632dc now. @ESMValGroup/technical-lead-development-team could any of you have a brief look on this and perform a technical review? That would be awesome, thanks! |
# Settings for all variables of all MIPs | ||
'*': | ||
'*': | ||
family: E5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place where the meaning of these facets is explained? If yes, it would be nice to add a link here to make it easier to understand where they come from and how to update them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 4c3f1be
self.vardef.short_name, | ||
DEFAULT_ERA5_GRID, | ||
) | ||
cube = regrid(cube, DEFAULT_ERA5_GRID, 'linear') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I can see that the feature is useful, in particular to make seamlessly switching between NetCDF and GRIB ERA5 data possible, I'm not not sure if it should be enabled by default: I just checked a few recipes what the effect of this would be:
clouds/recipe_lauer22jclim_fig5_lifrac.yml
clouds/recipe_lauer22jclim_fig1_clim.yml
recipe_climwip_test_performance_sigma.yml
bock20jgr/recipe_bock20jgr_fig_1-4.yml
ipccwg1ar6ch3/recipe_ipccwg1ar6ch3_fig_3_19.yml
model_evaluation/recipe_model_evaluation_basics.yml
and in all cases, it seems this will lead to double regridding. Therefore I think that disabled by default may be a better option.
Another concern is that a fix seems conceptually the wrong place to implement regridding, as this starts mixing preprocessing with data loading. But maybe there is no way around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be possible to automatically add a regrid preprocessor in esmvalcore/_recipe/recipe.py
for ERA5 data, but I'm not sure if that would be any cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, double regridding is certainly not optimal. However, I am not sure if disabling this by default is really useful, because then a user could make use of the regular preprocessor anyway. I think this feature is only really useful if you don't need any special settings.
It would also be possible to automatically add a regrid preprocessor in
esmvalcore/_recipe/recipe.py
for ERA5 data, but I'm not sure if that would be any cleaner.
Unfortunately that won't work, since the regridding should only be performed if the data is on an unstructured grid. However, the data is not loaded yet when the relevant part of esmvalcore/_recipe/recipe.py
is executed, so we don't know the type of grid yet.
So, how should we proceed? I am rather undecided and can see the advantages and disadvantages of both sides...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is the principle of least astonishment and to me at least, automatic regridding for one dataset and not for any other dataset is rather surprising, even if it may make sense in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One could also argue that the least surprise would be to get files on a regular grid (like the ones you get when you download the netCDF data).
How about checking the extension of the files in esmvalcore/_recipe/recipe.py
and applying the regridding based on this? This would avoid double regridding.
regrid: true | ||
type: an | ||
typeid: '00' | ||
version: '' # necessary to get a nice output file name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
version: '' # necessary to get a nice output file name | |
version: 'v1' # necessary to get a nice output file name |
may be nicer? This is also the version we use for NetCDF ERA5 data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think this should be specified in the recipe, e.g:
https://github.com/ESMValGroup/ESMValTool/blob/64c371e88d79accb300574f04832e16127a1d9df/esmvaltool/recipes/clouds/recipe_lauer22jclim_fig5_lifrac.yml#L170-L171
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I am not sure about this. Does it really make sense to invent an arbitrary version number here? The only reason we need this is to derive the output file name; it's not used at all to find the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we invent version numbers for all observational and reanalysis datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is certainly not true for all observational and reanalysis datasets. I will change it in the extra facets file..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
version: '' # necessary to get a nice output file name | ||
|
||
# Variable-specific settings | ||
albsn: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add var_name
or a similar name (e.g. standard_name, long_name) to this mapping and add a simple check in AllVars.fix_metadata
that the input data actually contains the right variable? I've always been concerned that people may put the wrong file in the wrong directory and get completely wrong results out of these fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle I like that idea, but I don't know if that works in practice. For example, are we sure that the names of the netCDF input files are the exact same as for the GRIB input files? If this is not the case, the fix will start failing for one or the other.
I would argue that for the GRIB files the risk of using wrong files is rather low, as the GRIB ID (= the variable name) is part of the file name. So this is more of a problem for the netCDF input files, for which we simple use *.nc
as input file pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect (but have not checked), that ECMWF uses the same names for their variables, regardless of the file format. e.g. are these variable names present in both file formats? https://confluence.ecmwf.int/pages/viewpage.action?pageId=82870405#ERA5:datadocumentation-Table4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is probably true, I don't think this check is relevant for the GRIB files. As I mentioned, they have their GRIB ID (= variable name) in the file name. We also don't check the variable name for CMIP model output.
Thus, IMHO, it would make more sense to implement this in a different PR (this one here as already >1000 lines).
# level, etc. | ||
grib_formats = ('.grib2', '.grib', '.grb2', '.grb', '.gb2', '.gb') | ||
if file.suffix in grib_formats: | ||
raw_cubes = iris.load(file, callback=_load_callback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manu, you should tell @trexfeathers and Iris folk to tell ECMWF to list Iris as a viable/recommended GRIB loader, see https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them - one q with regards to Iris stability wrt GRIB files - is this a gods-given one now? If not, I'd argue we should perform a few consistency tests on the loaded cube(s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the recent update(s) of iris-grib I didn't have any difficulties reading those ERA5 GRIB files. Since I would not expect that these files will change any time soon, I won't expect any problems with that. We also perform the same extensive CMOR checks on those files as we do for the netCDF files, so I am pretty confident we are fine here.
About reading GRIB files in general - I am not 100% sure if we (or better, iris-grib) support all possible GRIB files now, but I wouldn't know what to do about this here. If we encounter problems at some point in the future, we can fix that there. Again, given that we have extensive CMOR checks in our pipeline, I am fairly sure we will know if there are problems, even if we won't get very obvious errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi folks, I've raised SciTools/iris-grib#515
Iris-grib definitely cannot load all possible GRIB files. As you may have noticed: the GRIB specification, and how eccodes interprets this, is not as specific as NetCDF. This means it can be hard to know for certain if Iris-grib is doing the right thing, and we typically therefore rely on finding expert users who can tell us.
So the list of loadable templates is limited to those where we could find a user to check our work. And this also means that we might later discover problems with our existing work, but so far that appears to be rare.
@@ -131,6 +140,74 @@ ERA5 | |||
of both liquid and solid phases to vapor (from underlying surface and vegetation)." | |||
Therefore, the ERA5 (and ERA5-Land) CMORizer switches the signs of ``evspsbl`` and ``evspsblpot`` to be compatible with the CMOR standard used e.g. by the CMIP models. | |||
|
|||
.. _read_native_era5_grib: | |||
|
|||
ERA5 (in GRIB format available on DKRZ's Levante) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention that ERA5 data in grib format can also be downloaded from ECMWF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how useful that is in practice to be honest. All the facets and paths here are tailored towards the files on Levante. I think you'll get different paths if you download them manually since the names of the facets seem to be different in the docs.
To me, it's easier for the user if we just recommend manual download of the files in netCDF format via era5cli.
There is currently a problem with iris-grib: SciTools/iris-grib#520 |
@bouweandela would you be able to have another look on my answers to your review comments? It would be great to get this into main very soon. We want to use the features from this PR to add support for further datasets in GRIB format (e.g., CAMS reanalysis). Thanks 🙏 |
Note that a new version of Iris-grib is now available and that should fix this problem? |
Looks like this is indeed the case, the tests run fine now 🎉 |
Not this week, but hopefully next week. |
Description
This PR allows ESMValCore to process native ERA5 data in GRIB format, which is for example available on Levante in the
/pool/data/ERA5
directory.Reading the data
The following settings are necessary in the user configuration file:
I added an extra facets file which includes reasonable default for all supported variables. You can check it out here.
Thus, reading this data is as easy as
Regridding
Native ERA5 data in GRIB format is on a reduced Gaussian grid (i.e., an unstructured grid). Thus, in 99% of the use cases, it is necessary to regrid this data, especially since no cell areas are available for the data (thus, we cannot even calculate global/regional statistics over the native data). This is done automatically by the CMORizer (as recommended by the ECMWF), but can be disabled in the recipe:
This PR depends on the following other PRs:
eccodes-python
PyPI package with neweccodes
in core requirements SciTools/iris-grib#357Closes #1991
Closes ESMValGroup/ESMValTool#3238
Link to documentation: https://esmvaltool--2178.org.readthedocs.build/projects/ESMValCore/en/2178/quickstart/find_data.html#supported-native-reanalysis-observational-datasets
Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
To help with the number pull requests: