diff --git a/.zenodo.json b/.zenodo.json index d74202de..9ad96650 100644 --- a/.zenodo.json +++ b/.zenodo.json @@ -34,6 +34,11 @@ "name": "Caron, Louis-Philippe", "affiliation": "Ouranos", "orcid": "0000-0001-5221-0147" + }, + { + "name": "Braun, Marco", + "affiliation": "Ouranos", + "orcid": "0000-0001-5061-3217" } ], "keywords": [ diff --git a/AUTHORS.rst b/AUTHORS.rst index fd562276..be4540bd 100644 --- a/AUTHORS.rst +++ b/AUTHORS.rst @@ -21,3 +21,4 @@ Contributors * Louis-Philippe Caron * Sarah Gammon `@SarahG-579462 `_ * Yannick Rousseau +* Marco Braun `@vindelico `_ diff --git a/HISTORY.rst b/HISTORY.rst index 5880388f..7b32c3a4 100644 --- a/HISTORY.rst +++ b/HISTORY.rst @@ -4,7 +4,7 @@ History v0.7.0 (unreleased) ------------------- -Contributors to this version: Gabriel Rondeau-Genesse (:user:`RondeauG`), Pascal Bourgault (:user:`aulemahal`), Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`). +Contributors to this version: Gabriel Rondeau-Genesse (:user:`RondeauG`), Pascal Bourgault (:user:`aulemahal`), Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`), Marco Braun (:user:`vindelico`). Announcements ^^^^^^^^^^^^^ @@ -16,6 +16,7 @@ New features and enhancements * New function `get_warming_level` to search within the IPCC CMIP global temperatures CSV without requiring data. (:issue:`208`, :pull:`210`). * File re-structuration from catalogs with ``xscen.catutils.build_path``. (:pull:`205`). * New scripting functions `save_and_update` and `move_and_delete`. (:pull:`214`). +* Spatial dimensions can be generalized as X/Y when rechunking and will be mapped to rlon/rlat or lon/lat accordingly. (:pull:`221`). Breaking changes ^^^^^^^^^^^^^^^^ diff --git a/docs/notebooks/2_getting_started.ipynb b/docs/notebooks/2_getting_started.ipynb index d9cb7569..69a97796 100644 --- a/docs/notebooks/2_getting_started.ipynb +++ b/docs/notebooks/2_getting_started.ipynb @@ -61,7 +61,7 @@ "### Searching a subset of datasets within *DataCatalogs*\n", "\n", "
INFO\n", - " \n", + "\n", "At this stage, the search criteria should be for variables that will be **bias corrected**, not necessarily the variables required for the final product. For example, if `sfcWindfromdir` is the final product, then `uas` and `vas` should be searched for since these are the variables that will be bias corrected.\n", "
\n", "\n", @@ -130,8 +130,8 @@ "source": [ "## Extracting data\n", "\n", - "
WARNING \n", - " \n", + "
WARNING\n", "\n", "It is strongly recommended to stop and analyse the results of `search_data_catalogs` before proceeding to the extraction function.\n", "
\n", "\n", @@ -221,8 +221,8 @@ "\n", "**NOTE:** Calling the extraction function without passing by `search_data_catalogs` beforehand is possible, but will not support *DerivedVariables*.\n", "\n", - "
NOTE \n", - " \n", + "
NOTE\n", + "\n", "`extract_dataset` currently only accepts a single unique ID at a time.\n", "
" ] @@ -259,11 +259,14 @@ "`xscen` has two functions for the purpose of saving data: `save_to_netcdf` and `save_to_zarr`. If possible for a given project, it is strongly recommended to use Zarr files since these are often orders of magnitude faster to read and create compared to NetCDF. They do have a few quirks, however:\n", "\n", "- Chunk size must separate the dataset in *exactly* equal chunks (with the exception of the last). While it is recommended to calculate ideal chunking and provide them explicitely to the function, `io.estimate_chunks()` can be used to automatically estimate a decent chunking. In a similar way, `io.subset_maxsize()` can be used to roughly cut a dataset along the *time* dimension into subsets of a given size (in Gb), which is especially useful for saving NetCDF files.\n", + " Chunk sizes can be passed to the two saving functions in a dictionary. Spatial dimensions can be generalized as `'X'` and `'Y'`, which will be mapped to the *xarray.Dataset*'s actual grid type's dimension names.\n", "\n", "\n", "- Default behaviour for a Zarr is to act like a directory, with each new variable being assigned a subfolder. This is great when all required variables have the same dimensions and frequency, but will crash otherwise. If you have daily *tasmax* and subdaily *pr*, for example, they need to be assigned different paths.\n", "\n", "\n", + "\n", + "\n", "#### Updating the catalog\n", "\n", "`intake-esm` will automatically copy the catalog's entry in the dataset's metadata, with a `cat:attr` format. Where appropriate, `xscen` updates that information to keep the metadata up to date with the manipulations. `ProjectCatalog.update_from_ds` will in turn read the metadata of a Dataset and fill in the information into a new catalog entry.\n", @@ -376,7 +379,7 @@ "source": [ "## Regridding data\n", "\n", - "
NOTE \n", + "
NOTE\n", "\n", "Regridding in `xscen` is built upon `xESMF`. For more information on basic usage and available regridding methods, [please consult their documentation](https://xesmf.readthedocs.io/en/latest/). Their [masking and extrapolation tutorial](https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html) is of particular interest.\n", "\n", @@ -481,8 +484,8 @@ "source": [ "### Preparing arguments for xESMF.Regridder\n", "\n", - "
NOTE \n", - " \n", + "
NOTE\n", "\n", "xESMF's API documentation appears to be broken on ReadTheDocs. For a list of available arguments and options in `Regridder()`, please consult their [GitHub page](https://github.com/pangeo-data/xESMF/blob/master/xesmf/frontend.py) directly.\n", "
\n", "\n", @@ -522,7 +525,7 @@ "Other options exist in `ESMF/ESMPy`, but not `xESMF`. As they get implemented, they should automatically get supported by `xscen`.\n", "\n", "
NOTE\n", - " \n", + "\n", "Some utilities that exist in `xESMF` have not yet been explicitly added to `xscen`. If *conservative* regridding is desired, for instance, some additional scripts might be required on the user's side to create the lon/lat boundaries.\n", "
" ] @@ -642,8 +645,8 @@ "source": [ "## Bias adjusting data\n", "\n", - "
NOTE \n", - " \n", + "
NOTE\n", + "\n", "Bias adjustment in `xscen` is built upon `xclim.sdba`. For more information on basic usage and available methods, [please consult their documentation](https://xclim.readthedocs.io/en/stable/sdba.html).\n", "
\n", "\n", @@ -694,9 +697,9 @@ "- `periods` defines the period(s) to bias adjust.\n", "- `xclim_adjust_kwargs` is described above.\n", "\n", - "
NOTE \n", - " \n", - "These functions currently do not support multiple variables due to the fact that train and adjust arguments might vary. The function must be called separately for every variable. \n", + "
NOTE\n", "\n", "These functions currently do not support multiple variables because the train and adjust arguments might vary between them. The function must be called separately for every variable.\n", "
" ] }, @@ -804,8 +807,8 @@ "source": [ "## Computing indicators\n", "\n", - "
NOTE \n", - " \n", + "
NOTE\n", + "\n", "`xscen` relies heavily on `xclim`'s YAML support for calculating indicators. For more information on how to build the YAML file, consult [this Notebook](https://xclim.readthedocs.io/en/latest/notebooks/extendxclim.html?highlight=yaml#YAML-file).\n", "
\n", "\n", @@ -1077,7 +1080,7 @@ "\n", "Usually, we would create an ensemble out of different models, but for this toy example, it will be made out of the 2 experiments of the same model that we have available.\n", "\n", - "
WARNING \n", + "
WARNING\n", "\n", "If given a set of paths, `xclim.ensembles.create_ensemble` will ignore the chunking on disk and open the datasets with only a chunking along the new `realization` dimension. Thus, for large datasets, `create_kwargs` should be used to explicitly specify chunks.\n", "
" @@ -1246,7 +1249,7 @@ "\n", "print(\"\")\n", "print(\"Inspect final attributes\")\n", - "display(ds_clean)" + "display(ds_clean)\n" ] } ], diff --git a/tests/conftest.py b/tests/conftest.py index a1e19f0f..47993f20 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -2,7 +2,11 @@ import shutil from pathlib import Path +import numpy as np +import pandas as pd import pytest +import xarray as xr +from xclim.testing.helpers import test_timeseries as timeseries import xscen as xs @@ -43,3 +47,119 @@ def samplecat(): xr_open_kwargs={"engine": "h5netcdf"}, ) return xs.DataCatalog({"esmcat": xs.catalog.esm_col_data, "df": df}) + + +@pytest.fixture +def datablock_3d(): + def _datablock_3d( + values, + variable, + x, + x_start, + y, + y_start, + x_step=0.1, + y_step=0.1, + start="7/1/2000", + freq="D", + units=None, + as_dataset=False, + ): + """ + Create a generic timeseries object based on pre-defined dictionaries of existing variables. + + Parameters + ---------- + values : np.ndarray + The values to be assigned to the variable. Dimensions are interpreted [T, Y, X]. + variable : str + The variable name. + x : str + The name of the x coordinate. + x_start : float + The starting value of the x coordinate. + y : str + The name of the y coordinate. + y_start : float + The starting value of the y coordinate. + x_step : float + The step between x values. + y_step : float + The step between y values. + start : str + The starting date of the time coordinate. + freq : str + The frequency of the time coordinate. + units : str + The units of the variable. + as_dataset : bool + If True, return a Dataset, else a DataArray. + """ + attrs = { + "lat": { + "units": "degrees_north", + "description": "Latitude.", + "standard_name": "latitude", + }, + "lon": { + "units": "degrees_east", + "description": "Longitude.", + "standard_name": "longitude", + }, + "rlon": { + "units": "degrees", + "description": "Rotated longitude.", + "standard_name": "grid_longitude", + }, + "rlat": { + "units": "degrees", + "description": "Rotated latitude.", + "standard_name": "grid_latitude", + }, + "x": { + "units": "m", + "description": "Projection x coordinate.", + "standard_name": "projection_x_coordinate", + }, + "y": { + "units": "m", + "description": "Projection y coordinate.", + "standard_name": "projection_y_coordinate", + }, + } + + dims = { + "time": xr.DataArray( + pd.date_range(start, periods=values.shape[0], freq=freq), dims="time" + ), + y: xr.DataArray( + np.arange(y_start, y_start + values.shape[1] * y_step, y_step), + dims=y, + attrs=attrs[y], + ), + x: xr.DataArray( + np.arange(x_start, x_start + values.shape[2] * x_step, x_step), + dims=x, + attrs=attrs[x], + ), + } + + # Get the attributes using xclim, then create the DataArray + ts_1d = timeseries(values[:, 0, 0], variable, start, units=units, freq=freq) + da = xr.DataArray( + values, coords=dims, dims=dims, name=variable, attrs=ts_1d.attrs + ) + + # Add the axis information + da[x].attrs["axis"] = "X" + da[y].attrs["axis"] = "Y" + da["time"].attrs["axis"] = "T" + + # TODO: Fully support rotated grids (2D coordinates + grid_mapping) + + if as_dataset: + return da.to_dataset() + else: + return da + + return _datablock_3d diff --git a/tests/test_io.py b/tests/test_io.py new file mode 100644 index 00000000..6fa6efaa --- /dev/null +++ b/tests/test_io.py @@ -0,0 +1,68 @@ +import numpy as np +import pytest + +import xscen as xs + + +class TestRechunkForSaving: + @pytest.mark.parametrize( + "dims, xy", + [ + (["lon", "lat"], True), + (["lon", "lat"], False), + 
(["rlon", "rlat"], True), + (["rlon", "rlat"], False), + ], + ) + def test_options(self, datablock_3d, dims, xy): + ds = datablock_3d( + np.random.random((30, 30, 50)), + variable="tas", + x=dims[0], + x_start=-70 if dims[0] == "lon" else 0, + y=dims[1], + y_start=45 if dims[1] == "lat" else 0, + as_dataset=True, + ) + x = "X" if xy else dims[0] + y = "Y" if xy else dims[1] + new_chunks = {y: 10, x: 10, "time": 20} + ds_ch = xs.io.rechunk_for_saving(ds, new_chunks) + for dim, chunks in ds_ch.chunks.items(): + dim = ( + dim + if (xy is False or dim == "time") + else "X" + if dim == dims[0] + else "Y" + ) + assert chunks[0] == new_chunks[dim] + + def test_variables(self, datablock_3d): + ds = datablock_3d( + np.random.random((30, 30, 50)), + variable="tas", + x="lon", + x_start=-70, + y="lat", + y_start=45, + as_dataset=True, + ) + ds["pr"] = datablock_3d( + np.random.random((30, 30, 50)), + variable="pr", + x="lon", + x_start=-70, + y="lat", + y_start=45, + as_dataset=False, + ) + + new_chunks = { + "tas": {"time": 20, "lon": 10, "lat": 7}, + "pr": {"time": 10, "lon": 5, "lat": 3}, + } + ds_ch = xs.io.rechunk_for_saving(ds, new_chunks) + for v in ds_ch.data_vars: + for dim, chunks in zip(list(ds.dims), ds_ch[v].chunks): + assert chunks[0] == new_chunks[v][dim] diff --git a/xscen/io.py b/xscen/io.py index 13f4da3b..54796fef 100644 --- a/xscen/io.py +++ b/xscen/io.py @@ -29,6 +29,7 @@ "save_to_netcdf", "save_to_zarr", "subset_maxsize", + "rechunk_for_saving", ] @@ -291,6 +292,8 @@ def save_to_netcdf( Name of the NetCDF file to be saved. rechunk : dict, optional This is a mapping from dimension name to new chunks (in any format understood by dask). + Spatial dimensions can be generalized as 'X' and 'Y', which will be mapped to the actual grid type's + dimension names. Rechunking is only done on *data* variables sharing dimensions with this argument. netcdf_kwargs : dict, optional Additional arguments to send to_netcdf() @@ -304,22 +307,7 @@ def save_to_netcdf( xarray.Dataset.to_netcdf """ if rechunk: - for rechunk_var in ds.data_vars: - # Support for chunks varying per variable - if rechunk_var in rechunk: - rechunk_dims = rechunk[rechunk_var] - else: - rechunk_dims = rechunk - - ds[rechunk_var] = ds[rechunk_var].chunk( - { - d: chnks - for d, chnks in rechunk_dims.items() - if d in ds[rechunk_var].dims - } - ) - ds[rechunk_var].encoding.pop("chunksizes", None) - ds[rechunk_var].encoding.pop("chunks", None) + ds = rechunk_for_saving(ds, rechunk) path = Path(filename) path.parent.mkdir(parents=True, exist_ok=True) @@ -371,6 +359,8 @@ def save_to_zarr( Name of the Zarr file to be saved. rechunk : dict, optional This is a mapping from dimension name to new chunks (in any format understood by dask). + Spatial dimensions can be generalized as 'X' and 'Y' which will be mapped to the actual grid type's + dimension names. Rechunking is only done on *data* variables sharing dimensions with this argument. 
zarr_kwargs : dict, optional Additional arguments to send to_zarr() @@ -404,22 +394,7 @@ def save_to_zarr( ds[v].encoding.clear() if rechunk: - for rechunk_var in ds.data_vars: - # Support for chunks varying per variable - if rechunk_var in rechunk: - rechunk_dims = rechunk[rechunk_var] - else: - rechunk_dims = rechunk - - ds[rechunk_var] = ds[rechunk_var].chunk( - { - d: chnks - for d, chnks in rechunk_dims.items() - if d in ds[rechunk_var].dims - } - ) - ds[rechunk_var].encoding.pop("chunksizes", None) - ds[rechunk_var].encoding.pop("chunks", None) + ds = rechunk_for_saving(ds, rechunk) path = Path(filename) path.parent.mkdir(parents=True, exist_ok=True) @@ -518,6 +493,44 @@ def coerce_attrs(attrs): raise +def rechunk_for_saving(ds, rechunk): + """Rechunk before saving to .zarr or .nc, generalized as Y/X for different axes lat/lon, rlat/rlon. + + Parameters + ---------- + ds : xr.Dataset + The xr.Dataset to be rechunked. + rechunk : dict + A dictionary with the dimension names of ds and the new chunk size. Spatial dimensions + can be provided as X/Y. + + Returns + ------- + xr.Dataset + The dataset with new chunking. + """ + for rechunk_var in ds.data_vars: + # Support for chunks varying per variable + if rechunk_var in rechunk: + rechunk_dims = rechunk[rechunk_var].copy() + else: + rechunk_dims = rechunk.copy() + + # get actual axes labels + if "X" in rechunk_dims and "X" not in ds.dims: + rechunk_dims[ds.cf.axes["X"][0]] = rechunk_dims.pop("X") + if "Y" in rechunk_dims and "Y" not in ds.dims: + rechunk_dims[ds.cf.axes["Y"][0]] = rechunk_dims.pop("Y") + + ds[rechunk_var] = ds[rechunk_var].chunk( + {d: chnks for d, chnks in rechunk_dims.items() if d in ds[rechunk_var].dims} + ) + ds[rechunk_var].encoding.pop("chunksizes", None) + ds[rechunk_var].encoding.pop("chunks", None) + + return ds + + @parse_config def rechunk( path_in: Union[os.PathLike, str, xr.Dataset],
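The `io.py` changes above route both `save_to_netcdf` and `save_to_zarr` through the new `rechunk_for_saving`, so the `rechunk` argument can now use the generalized `'X'`/`'Y'` keys documented in the docstrings. Below is a minimal usage sketch; the toy rotated-pole dataset, file names and chunk sizes are illustrative only and are not taken from this PR:

```python
import cf_xarray  # noqa: F401  # registers the .cf accessor used by rechunk_for_saving
import numpy as np
import pandas as pd
import xarray as xr

import xscen as xs

# Toy rotated-pole dataset: "tas" over (time, rlat, rlon). The "axis" attributes
# are what lets cf-xarray (ds.cf.axes) resolve the generalized "X"/"Y" keys.
ds = xr.Dataset(
    {"tas": (("time", "rlat", "rlon"), np.random.rand(30, 20, 20))},
    coords={
        "time": pd.date_range("2000-01-01", periods=30),
        "rlat": ("rlat", np.arange(20.0), {"axis": "Y", "standard_name": "grid_latitude"}),
        "rlon": ("rlon", np.arange(20.0), {"axis": "X", "standard_name": "grid_longitude"}),
    },
)

# "X"/"Y" are mapped to rlon/rlat before chunking; on a regular grid they would
# resolve to lon/lat instead.
xs.io.save_to_zarr(ds, "example_tas.zarr", rechunk={"time": 10, "X": 10, "Y": 10})

# Chunks can also be given per variable, as exercised in tests/test_io.py.
xs.io.save_to_netcdf(ds, "example_tas.nc", rechunk={"tas": {"time": 10, "X": 10, "Y": 10}})
```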
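For reference, `rechunk_for_saving` resolves the generalized keys through cf-xarray's axis detection (`ds.cf.axes`), as shown in the function body above. A small standalone sketch of that lookup, assuming the coordinates carry CF `axis` attributes (which the new `datablock_3d` fixture sets explicitly); the DataArray and values are illustrative:

```python
import cf_xarray  # noqa: F401  # registers the .cf accessor
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("rlat", "rlon"),
    coords={
        "rlat": ("rlat", [0.0, 1.0], {"axis": "Y"}),
        "rlon": ("rlon", [0.0, 1.0, 2.0], {"axis": "X"}),
    },
)
print(da.cf.axes)  # e.g. {'X': ['rlon'], 'Y': ['rlat']}

# Translate generalized keys into the grid's real dimension names,
# mirroring the lookup done inside rechunk_for_saving.
chunks = {"X": 2, "Y": 1}
resolved = {da.cf.axes[k][0] if k in ("X", "Y") else k: v for k, v in chunks.items()}
print(resolved)  # {'rlon': 2, 'rlat': 1}
```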