Merge pull request #221 from Ouranosinc/generalize_rlatrlon_latlon
generalize dims as X/Y when lat/lon or rlat/rlon
RondeauG committed Jul 27, 2023
2 parents f1eb39e + 7b93897 commit 267ed0b
Showing 7 changed files with 262 additions and 51 deletions.
5 changes: 5 additions & 0 deletions .zenodo.json
@@ -34,6 +34,11 @@
"name": "Caron, Louis-Philippe",
"affiliation": "Ouranos",
"orcid": "0000-0001-5221-0147"
},
{
"name": "Braun, Marco",
"affiliation": "Ouranos",
"orcid": "0000-0001-5061-3217"
}
],
"keywords": [
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -21,3 +21,4 @@ Contributors
* Louis-Philippe Caron <caron.louis-philippe@ouranos.ca>
* Sarah Gammon <gammon.sarah@ouranos.ca> `@SarahG-579462 <https://github.com/SarahG-579462>`_
* Yannick Rousseau
* Marco Braun <Braun.Marco@ouranos.ca> `@vindelico <https://github.com/vindelico>`_
3 changes: 2 additions & 1 deletion HISTORY.rst
@@ -4,7 +4,7 @@ History

v0.7.0 (unreleased)
-------------------
Contributors to this version: Gabriel Rondeau-Genesse (:user:`RondeauG`), Pascal Bourgault (:user:`aulemahal`), Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`).
Contributors to this version: Gabriel Rondeau-Genesse (:user:`RondeauG`), Pascal Bourgault (:user:`aulemahal`), Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`), Marco Braun (:user:`vindelico`).

Announcements
^^^^^^^^^^^^^
@@ -16,6 +16,7 @@ New features and enhancements
* New function `get_warming_level` to search within the IPCC CMIP global temperatures CSV without requiring data. (:issue:`208`, :pull:`210`).
* File restructuring from catalogs with ``xscen.catutils.build_path``. (:pull:`205`).
* New scripting functions `save_and_update` and `move_and_delete`. (:pull:`214`).
* Spatial dimensions can be generalized as X/Y when rechunking and will be mapped to rlon/rlat or lon/lat accordingly. (:pull:`221`).

Breaking changes
^^^^^^^^^^^^^^^^
39 changes: 21 additions & 18 deletions docs/notebooks/2_getting_started.ipynb
@@ -61,7 +61,7 @@
"### Searching a subset of datasets within *DataCatalogs*\n",
"\n",
"<div class=\"alert alert-info\"> <b>INFO</b>\n",
" \n",
"\n",
"At this stage, the search criteria should be for variables that will be **bias corrected**, not necessarily the variables required for the final product. For example, if `sfcWindfromdir` is the final product, then `uas` and `vas` should be searched for since these are the variables that will be bias corrected.\n",
"</div>\n",
"\n",
@@ -130,8 +130,8 @@
"source": [
"## Extracting data\n",
"\n",
"<div class=\"alert alert-warning\"> <b>WARNING</b> \n",
" \n",
"<div class=\"alert alert-warning\"> <b>WARNING</b>\n",
"\n",
"It is heavily recommended to stop and analyse the results of `search_data_catalogs` before proceeding to the extraction function.\n",
"</div>\n",
"\n",
@@ -221,8 +221,8 @@
"\n",
"**NOTE:** Calling the extraction function without passing by `search_data_catalogs` beforehand is possible, but will not support *DerivedVariables*.\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b> \n",
" \n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
"\n",
"`extract_dataset` currently only accepts a single unique ID at a time.\n",
"</div>"
]
@@ -259,11 +259,14 @@
"`xscen` has two functions for the purpose of saving data: `save_to_netcdf` and `save_to_zarr`. If possible for a given project, it is strongly recommended to use Zarr files since these are often orders of magnitude faster to read and create compared to NetCDF. They do have a few quirks, however:\n",
"\n",
"- Chunk size must separate the dataset in *exactly* equal chunks (with the exception of the last). While it is recommended to calculate ideal chunking and provide them explicitely to the function, `io.estimate_chunks()` can be used to automatically estimate a decent chunking. In a similar way, `io.subset_maxsize()` can be used to roughly cut a dataset along the *time* dimension into subsets of a given size (in Gb), which is especially useful for saving NetCDF files.\n",
" Chunk sizes can be passed to the two saving functions in a dictionary. Spatial dimensions can be generalized as `'X'` and `'Y'`, which will be mapped to the *xarray.Dataset*'s actual grid type's dimension names.\n",
"\n",
"\n",
"- Default behaviour for a Zarr is to act like a directory, with each new variable being assigned a subfolder. This is great when all required variables have the same dimensions and frequency, but will crash otherwise. If you have daily *tasmax* and subdaily *pr*, for example, they need to be assigned different paths.\n",
"\n",
"\n",
"\n",
"\n",
"#### Updating the catalog\n",
"\n",
"`intake-esm` will automatically copy the catalog's entry in the dataset's metadata, with a `cat:attr` format. Where appropriate, `xscen` updates that information to keep the metadata up to date with the manipulations. `ProjectCatalog.update_from_ds` will in turn read the metadata of a Dataset and fill in the information into a new catalog entry.\n",
@@ -376,7 +379,7 @@
"source": [
"## Regridding data\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b> \n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
"\n",
"Regridding in `xscen` is built upon `xESMF`. For more information on basic usage and available regridding methods, [please consult their documentation](https://xesmf.readthedocs.io/en/latest/). Their [masking and extrapolation tutorial](https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html) is of particular interest.\n",
"\n",
@@ -481,8 +484,8 @@
"source": [
"### Preparing arguments for xESMF.Regridder\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b> \n",
" \n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
"\n",
"xESMF's API appears to be broken on their ReadTheDocs. For a list of available arguments and options in `Regridder()`, please consult their [Github page](https://github.com/pangeo-data/xESMF/blob/master/xesmf/frontend.py) directly.\n",
"</div>\n",
"\n",
@@ -522,7 +525,7 @@
"Other options exist in `ESMF/ESMPy`, but not `xESMF`. As they get implemented, they should automatically get supported by `xscen`.\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
" \n",
"\n",
"Some utilities that exist in `xESMF` have not yet been explicitely added to `xscen`. If *conservative* regridding is desired, for instance, some additional scripts might be required on the User's side to create the lon/lat boundaries\n",
"</div>"
]
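The note above says conservative regridding may require the user to build the lon/lat boundaries themselves. A minimal sketch of that step, assuming regular 1D cell centres (`centers_to_bounds` is a hypothetical helper, not part of xESMF or xscen):

```python
# Hypothetical helper, not part of xESMF or xscen: derive the n+1 cell
# boundaries that conservative regridding needs from n 1D cell centres.
def centers_to_bounds(centers):
    # Midpoints between adjacent centres form the interior boundaries.
    interior = [(a + b) / 2 for a, b in zip(centers[:-1], centers[1:])]
    # Extrapolate half a step outward at each end.
    first = centers[0] - (interior[0] - centers[0])
    last = centers[-1] + (centers[-1] - interior[-1])
    return [first] + interior + [last]

print(centers_to_bounds([0.0, 1.0, 2.0]))  # four boundaries for three centres
```

The boundary arrays would then be attached to the dataset (e.g. as `lon_b`/`lat_b`) before building the `Regridder`.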
@@ -642,8 +645,8 @@
"source": [
"## Bias adjusting data\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b> \n",
" \n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
"\n",
"Bias adjustment in `xscen` is built upon `xclim.sdba`. For more information on basic usage and available methods, [please consult their documentation](https://xclim.readthedocs.io/en/stable/sdba.html).\n",
"</div>\n",
"\n",
@@ -694,9 +697,9 @@
"- `periods` defines the period(s) to bias adjust.\n",
"- `xclim_adjust_kwargs` is described above.\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b> \n",
" \n",
"These functions currently do not support multiple variables due to the fact that train and adjust arguments might vary. The function must be called separately for every variable. \n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
"\n",
"These functions currently do not support multiple variables due to the fact that train and adjust arguments might vary. The function must be called separately for every variable.\n",
"</div>"
]
},
@@ -804,8 +807,8 @@
"source": [
"## Computing indicators\n",
"\n",
"<div class=\"alert alert-info\"> <b>NOTE</b> \n",
" \n",
"<div class=\"alert alert-info\"> <b>NOTE</b>\n",
"\n",
"`xscen` relies heavily on `xclim`'s YAML support for calculating indicators. For more information on how to build the YAML file, consult [this Notebook](https://xclim.readthedocs.io/en/latest/notebooks/extendxclim.html?highlight=yaml#YAML-file).\n",
"</div>\n",
"\n",
@@ -1077,7 +1080,7 @@
"\n",
"Usually, we would create an ensemble out of different models, but for this toy example, it will be made out of the 2 experiments of the same model that we have available.\n",
"\n",
"<div class=\"alert alert-warning\"> <b>WARNING</b> \n",
"<div class=\"alert alert-warning\"> <b>WARNING</b>\n",
"\n",
"If given a set of paths, `xclim.ensembles.create_ensemble` will ignore the chunking on disk and open the datasets with only a chunking along the new `realization` dimension. Thus, for large datasets, `create_kwargs` should be used to explicitely specify chunks.\n",
"</div>"
@@ -1246,7 +1249,7 @@
"\n",
"print(\"\")\n",
"print(\"Inspect final attributes\")\n",
"display(ds_clean)"
"display(ds_clean)\n"
]
}
],
120 changes: 120 additions & 0 deletions tests/conftest.py
Expand Up @@ -2,7 +2,11 @@
import shutil
from pathlib import Path

import numpy as np
import pandas as pd
import pytest
import xarray as xr
from xclim.testing.helpers import test_timeseries as timeseries

import xscen as xs

@@ -43,3 +47,119 @@ def samplecat():
xr_open_kwargs={"engine": "h5netcdf"},
)
return xs.DataCatalog({"esmcat": xs.catalog.esm_col_data, "df": df})


@pytest.fixture
def datablock_3d():
def _datablock_3d(
values,
variable,
x,
x_start,
y,
y_start,
x_step=0.1,
y_step=0.1,
start="7/1/2000",
freq="D",
units=None,
as_dataset=False,
):
"""
Create a generic timeseries object based on pre-defined dictionaries of existing variables.

Parameters
----------
values : np.ndarray
The values to be assigned to the variable. Dimensions are interpreted [T, Y, X].
variable : str
The variable name.
x : str
The name of the x coordinate.
x_start : float
The starting value of the x coordinate.
y : str
The name of the y coordinate.
y_start : float
The starting value of the y coordinate.
x_step : float
The step between x values.
y_step : float
The step between y values.
start : str
The starting date of the time coordinate.
freq : str
The frequency of the time coordinate.
units : str
The units of the variable.
as_dataset : bool
If True, return a Dataset, else a DataArray.
"""
attrs = {
"lat": {
"units": "degrees_north",
"description": "Latitude.",
"standard_name": "latitude",
},
"lon": {
"units": "degrees_east",
"description": "Longitude.",
"standard_name": "longitude",
},
"rlon": {
"units": "degrees",
"description": "Rotated longitude.",
"standard_name": "grid_longitude",
},
"rlat": {
"units": "degrees",
"description": "Rotated latitude.",
"standard_name": "grid_latitude",
},
"x": {
"units": "m",
"description": "Projection x coordinate.",
"standard_name": "projection_x_coordinate",
},
"y": {
"units": "m",
"description": "Projection y coordinate.",
"standard_name": "projection_y_coordinate",
},
}

dims = {
"time": xr.DataArray(
pd.date_range(start, periods=values.shape[0], freq=freq), dims="time"
),
y: xr.DataArray(
np.arange(y_start, y_start + values.shape[1] * y_step, y_step),
dims=y,
attrs=attrs[y],
),
x: xr.DataArray(
np.arange(x_start, x_start + values.shape[2] * x_step, x_step),
dims=x,
attrs=attrs[x],
),
}

# Get the attributes using xclim, then create the DataArray
ts_1d = timeseries(values[:, 0, 0], variable, start, units=units, freq=freq)
da = xr.DataArray(
values, coords=dims, dims=dims, name=variable, attrs=ts_1d.attrs
)

# Add the axis information
da[x].attrs["axis"] = "X"
da[y].attrs["axis"] = "Y"
da["time"].attrs["axis"] = "T"

# TODO: Fully support rotated grids (2D coordinates + grid_mapping)

if as_dataset:
return da.to_dataset()
else:
return da

return _datablock_3d
68 changes: 68 additions & 0 deletions tests/test_io.py
@@ -0,0 +1,68 @@
import numpy as np
import pytest

import xscen as xs


class TestRechunkForSaving:
@pytest.mark.parametrize(
"dims, xy",
[
(["lon", "lat"], True),
(["lon", "lat"], False),
(["rlon", "rlat"], True),
(["rlon", "rlat"], False),
],
)
def test_options(self, datablock_3d, dims, xy):
ds = datablock_3d(
np.random.random((30, 30, 50)),
variable="tas",
x=dims[0],
x_start=-70 if dims[0] == "lon" else 0,
y=dims[1],
y_start=45 if dims[1] == "lat" else 0,
as_dataset=True,
)
x = "X" if xy else dims[0]
y = "Y" if xy else dims[1]
new_chunks = {y: 10, x: 10, "time": 20}
ds_ch = xs.io.rechunk_for_saving(ds, new_chunks)
for dim, chunks in ds_ch.chunks.items():
dim = (
dim
if (xy is False or dim == "time")
else "X"
if dim == dims[0]
else "Y"
)
assert chunks[0] == new_chunks[dim]

def test_variables(self, datablock_3d):
ds = datablock_3d(
np.random.random((30, 30, 50)),
variable="tas",
x="lon",
x_start=-70,
y="lat",
y_start=45,
as_dataset=True,
)
ds["pr"] = datablock_3d(
np.random.random((30, 30, 50)),
variable="pr",
x="lon",
x_start=-70,
y="lat",
y_start=45,
as_dataset=False,
)

new_chunks = {
"tas": {"time": 20, "lon": 10, "lat": 7},
"pr": {"time": 10, "lon": 5, "lat": 3},
}
ds_ch = xs.io.rechunk_for_saving(ds, new_chunks)
for v in ds_ch.data_vars:
for dim, chunks in zip(list(ds.dims), ds_ch[v].chunks):
assert chunks[0] == new_chunks[v][dim]
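The behaviour these tests exercise — generalized `'X'`/`'Y'` chunk keys resolved to the dataset's real dimension names — can be sketched as follows (an illustrative re-implementation, not xscen's actual code; the dimension-name pairs are assumed from the tests above):

```python
# Illustrative sketch, not xscen's implementation: resolve generalized
# 'X'/'Y' chunk keys to a dataset's actual spatial dimension names,
# assuming one of the usual pairs (rlon/rlat, lon/lat or x/y) is present.
def translate_xy_chunks(chunks, dims):
    x_name = next((d for d in ("rlon", "lon", "x") if d in dims), None)
    y_name = next((d for d in ("rlat", "lat", "y") if d in dims), None)
    translated = {}
    for key, size in chunks.items():
        if key == "X" and x_name is not None:
            translated[x_name] = size
        elif key == "Y" and y_name is not None:
            translated[y_name] = size
        else:
            # 'time' and already-explicit dimension names pass through.
            translated[key] = size
    return translated
```

With `dims=("time", "rlat", "rlon")`, the spec `{"time": 20, "Y": 10, "X": 10}` resolves to `{"time": 20, "rlat": 10, "rlon": 10}`, matching what `test_options` checks for the rotated grid.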
