cos-lat averaging #125

Merged: 19 commits, Mar 10, 2023
13 changes: 12 additions & 1 deletion HISTORY.rst
@@ -4,12 +4,22 @@ History

v0.6.0 (unreleased)
-------------------
Contributors to this version: Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`), Pascal Bourgault (:user:`aulemahal`).
Contributors to this version: Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`), Pascal Bourgault (:user:`aulemahal`), Gabriel Rondeau-Genesse (:user:`RondeauG`).

New features and enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* New 'cos-lat' averaging in `spatial_mean` (:issue:`94`, :pull:`125`).

Breaking changes
^^^^^^^^^^^^^^^^
* 'mean' averaging has been deprecated in `spatial_mean` (:pull:`125`).
* 'interp_coord' has been renamed to 'interp_centroid' in `spatial_mean` (:pull:`125`).

Bug fixes
^^^^^^^^^
* Forbid pandas v1.5.3 in the environment files, as the linux conda build breaks the data catalog parser. (:issue:`161`, :pull:`162`).
* Only return requested variables when using ``DataCatalog.to_dataset`` (:pull:`163`).
* `compute_indicators` no longer crashes if less than 3 timesteps are produced (:pull:`125`).

Internal changes
^^^^^^^^^^^^^^^^
@@ -18,6 +28,7 @@ Internal changes
* Minimal documentation for templates (:pull:`163`).
* `xscen` is now indexed in `Zenodo <https://zenodo.org/>`_, under the `ouranos` community of projects (:pull:`164`).
* Added a few relevant `Shields <https://shields.io/>`_ to the README.rst (:pull:`164`).
* Better warning messages in `_subset_file_coverage` when coverage is insufficient (:pull:`125`).

v0.5.0 (2023-02-28)
-------------------
13 changes: 8 additions & 5 deletions docs/notebooks/2_getting_started.ipynb
@@ -387,6 +387,10 @@
"\n",
"ds_grid = xesmf.util.cf_grid_2d(-68.5, -67.5, 0.25, 48.5, 49.5, 0.25)\n",
"\n",
"# cf_grid_2d does not set the 'axis' attribute\n",
"ds_grid[\"lon\"].attrs[\"axis\"] = \"X\"\n",
"ds_grid[\"lat\"].attrs[\"axis\"] = \"Y\"\n",
"\n",
"# xscen will use the domain to re-assign attributes, so it is important to set it up for custom grids like this\n",
"ds_grid.attrs[\"cat:domain\"] = \"regular0-25\"\n",
"ds_grid"
@@ -983,11 +987,10 @@
"\n",
"`xs.spatial_mean` is used to compute the spatial average over a given region, using various methods. The argument `call_clisops` can also be used to subset the domain prior to the averaging.\n",
"\n",
"- `method: mean` will simply perform a `.mean()` operation over the spatial dimensions.\n",
" - `kwargs` can be used to instruct on which dimension(s) to perform the averaging, using `{\"dim\": list}`. Otherwise, the script will try to guess the X and Y axes using CF Conventions.\n",
"- `method: cos-lat` will perform an average operation over the spatial dimensions, accounting for changes in grid cell area along the 'lat' coordinate.\n",
"\n",
"\n",
"- `method: interp_coord` will perform a linear interpolation towards given coordinates.\n",
"- `method: interp_centroid` will perform an interpolation towards given coordinates or towards the centroid of a region.\n",
"  - `kwargs` is used to send arguments to `.interp()`, including `lon` and `lat`.\n",
" - `region` can alternatively be used to send a gridpoint, bbox, or shapefile and compute the centroid. This argument is a dictionary that follows the same requirements as the one for `xs.extract` described previously.\n",
"\n",
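In essence, the `cos-lat` method weights each grid cell by the cosine of its latitude before averaging, since cells on a regular lon/lat grid shrink toward the poles. A minimal NumPy sketch of the idea (`cos_lat_mean` is a hypothetical helper, and this simplification omits the cell-area factor that xscen also derives from the coordinate bounds):

```python
import numpy as np

def cos_lat_mean(data, lat):
    """Spatial mean weighted by cos(latitude) on a regular (lat, lon) grid.

    Hypothetical helper for illustration; xscen additionally weights by
    the cell area computed from the coordinate bounds.
    """
    weights = np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones_like(data)
    return (data * weights).sum() / weights.sum()

lat = np.linspace(20.0, 60.0, 5)   # degrees_north
field = np.full((5, 8), 3.0)       # constant field on a 5x8 grid
print(cos_lat_mean(field, lat))    # a constant field averages to ~3.0

# An unweighted mean over-counts high latitudes, where cells are smaller
tilted = lat[:, np.newaxis] * np.ones((5, 8))
print(bool(cos_lat_mean(tilted, lat) < tilted.mean()))  # True
```

The second check shows why the deprecated plain `mean` biases results poleward on global or wide-latitude domains.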
@@ -1010,7 +1013,7 @@
"for key, ds in ds_dict.items():\n",
" ds_savg = xs.spatial_mean(\n",
" ds=ds,\n",
" method=\"interp_coord\",\n",
" method=\"interp_centroid\",\n",
" kwargs={\"method\": \"linear\", \"lon\": -70, \"lat\": 48.5},\n",
" to_domain=\"aggregated\",\n",
" )\n",
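For reference, when no coordinates are passed through `kwargs`, `interp_centroid` falls back to the centroid of the grid (the mean of each spatial coordinate). The sketch below illustrates that default with a plain-NumPy bilinear interpolation; `interp_to_point` is a hypothetical helper for illustration, not the actual xarray `.interp()` call that xscen makes:

```python
import numpy as np

def interp_to_point(data, lon, lat, target=None):
    """Bilinear interpolation of a (lat, lon) field to one point.

    With target=None, fall back to the grid centroid (the mean of each
    coordinate), mirroring interp_centroid's default behaviour.
    Hypothetical sketch for illustration only.
    """
    if target is None:
        target = (lon.mean(), lat.mean())
    tx, ty = target
    # Index of the cell's lower-left corner surrounding the target point
    i = max(np.searchsorted(lon, tx) - 1, 0)
    j = max(np.searchsorted(lat, ty) - 1, 0)
    # Fractional position of the target inside that cell
    wx = (tx - lon[i]) / (lon[i + 1] - lon[i])
    wy = (ty - lat[j]) / (lat[j + 1] - lat[j])
    return ((1 - wy) * ((1 - wx) * data[j, i] + wx * data[j, i + 1])
            + wy * ((1 - wx) * data[j + 1, i] + wx * data[j + 1, i + 1]))

lon = np.array([-71.0, -70.0, -69.0])
lat = np.array([48.0, 49.0, 50.0])
field = lat[:, None] + 0.0 * lon[None, :]   # field equal to the latitude
print(interp_to_point(field, lon, lat))     # centroid is (-70, 49) -> 49.0
```

Because the test field equals the latitude, interpolating to the centroid recovers the centroid's latitude exactly.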
@@ -1238,7 +1241,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.9"
}
},
"nbformat": 4,
2 changes: 2 additions & 0 deletions environment-dev.yml
@@ -7,6 +7,7 @@ dependencies:
# Main packages
- cartopy
- cftime
- cf_xarray >= 0.7.6
- clisops
- dask
- flox
@@ -19,6 +20,7 @@ dependencies:
- netCDF4
- numpy
- pandas!=1.5.3
- pygeos
- pyyaml
- rechunker
- shapely
2 changes: 2 additions & 0 deletions environment.yml
@@ -7,6 +7,7 @@ dependencies:
# Main packages
- cartopy
- cftime
- cf_xarray >= 0.7.6
- clisops
- dask
- flox
@@ -19,6 +20,7 @@ dependencies:
- netCDF4
- numpy
- pandas!=1.5.3
- pygeos
- pyyaml
- rechunker
- shapely
Expand Down
2 changes: 2 additions & 0 deletions setup.py
@@ -27,6 +27,7 @@
requirements = [
"cartopy",
"cftime",
"cf_xarray>=0.7.6",
"clisops",
"dask",
"flox",
@@ -39,6 +40,7 @@
"netCDF4",
"numpy",
"pandas",
"pygeos",
"pyyaml",
"rechunker",
"shapely",
102 changes: 73 additions & 29 deletions xscen/aggregate.py
@@ -10,6 +10,7 @@
import geopandas as gpd
import numpy as np
import pandas as pd
import pygeos
import xarray as xr
import xesmf as xe
from shapely.geometry import Polygon
@@ -291,15 +292,15 @@ def spatial_mean(
ds : xr.Dataset
Dataset to use for the computation.
method : str
'mean' will perform a .mean() over the spatial dimensions of the Dataset.
'interp_coord' will find the region's centroid (if coordinates are not fed through kwargs), then perform a .interp() over the spatial dimensions of the Dataset.
'cos-lat' will weight the area covered by each pixel using an approximation based on latitude.
'interp_centroid' will find the region's centroid (if coordinates are not fed through kwargs), then perform a .interp() over the spatial dimensions of the Dataset.
The coordinate can also be directly fed to .interp() through the 'kwargs' argument below.
'xesmf' will make use of xESMF's SpatialAverager. This will typically be more precise, especially for irregular regions, but can be much slower than other methods.
call_clisops : bool
If True, xscen.extraction.clisops_subset will be called prior to the other operations. This requires the 'region' argument.
region : dict
Description of the region and the subsetting method (required fields listed in the Notes).
If method=='interp_coord', this is used to find the region's centroid.
If method=='interp_centroid', this is used to find the region's centroid.
If method=='xesmf', the bounding box or shapefile is given to SpatialAverager.
kwargs : dict
Arguments to send to either mean(), interp() or SpatialAverager().
@@ -334,36 +335,70 @@ def spatial_mean(
xarray.Dataset.mean, xarray.Dataset.interp, xesmf.SpatialAverager
"""
kwargs = kwargs or {}
if method == "mean":
warnings.warn(
"xs.spatial_mean with method=='mean' is deprecated and will be abandoned in a future release. Use method=='cos-lat' instead for a more robust but similar method.",
category=DeprecationWarning,
stacklevel=2,
)
elif method == "interp_coord":
warnings.warn(
"xs.spatial_mean with method=='interp_coord' is deprecated. Use method=='interp_centroid' instead.",
category=DeprecationWarning,
stacklevel=2,
)
method = "interp_centroid"

# If requested, call xscen.extraction.clisops_subset prior to averaging
if call_clisops:
ds = clisops_subset(ds, region)

if method == "cos-lat":
if "latitude" not in ds.cf.coordinates:
raise ValueError(
"Could not determine the latitude name using CF conventions. "
"Use kwargs = {lat: str} to specify the coordinate name."
)

if "units" not in ds.cf["latitude"].attrs:
logger.warning(
f"{ds.attrs.get('cat:id', '')}: Latitude does not appear to have units. Make sure that the computation is right."
)
elif ds.cf["latitude"].attrs["units"] != "degrees_north":
logger.warning(
f"{ds.attrs.get('cat:id', '')}: Latitude units is '{ds.cf['latitude'].attrs['units']}', expected 'degrees_north'. "
f"Make sure that the computation is right."
)

if (ds.cf["longitude"].min() < -160) & (ds.cf["longitude"].max() > 160):
logger.warning(
"The region appears to be crossing the -180/180° meridian. Bounds computation is currently bugged in cf_xarray. "
"Make sure that the computation is right."
)

ds = ds.cf.add_bounds(["longitude", "latitude"])
weights = xr.DataArray(
pygeos.area(
pygeos.polygons(pygeos.linearrings(ds.lon_bounds, ds.lat_bounds))
),
dims=ds.cf["longitude"].dims,
coords=ds.cf["longitude"].coords,
) * np.cos(np.deg2rad(ds.cf["latitude"]))

ds_agg = ds.weighted(weights).mean(
[d for d in ds.cf.axes["X"] + ds.cf.axes["Y"]], keep_attrs=True
)

# Prepare the History field
new_history = (
f"[{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] "
f"weighted mean(dim={[d for d in ds.cf.axes['X'] + ds.cf.axes['Y']]}) using a 'cos-lat' approximation"
)

# This simply calls .mean() over the spatial dimensions
if method == "mean":
elif method == "mean":
if "dim" not in kwargs:
# Determine the X and Y names
spatial_dims = []
for d in ["X", "Y"]:
if d in ds.cf.axes:
spatial_dims.extend([ds.cf[d].name])
elif (
(d == "X")
and ("longitude" in ds.cf.coordinates)
and (len(ds[ds.cf.coordinates["longitude"][0]].dims) == 1)
):
spatial_dims.extend(ds.cf.coordinates["longitude"])
elif (
(d == "Y")
and ("latitude" in ds.cf.coordinates)
and (len(ds[ds.cf.coordinates["latitude"][0]].dims) == 1)
):
spatial_dims.extend(ds.cf.coordinates["latitude"])
if len(spatial_dims) == 0:
raise ValueError(
"Could not determine the spatial dimension(s) using CF conventions. Use kwargs = {dim: list} to specify on which dimension to perform the averaging."
)
kwargs["dim"] = spatial_dims
kwargs["dim"] = ds.cf.axes["X"] + ds.cf.axes["Y"]

ds_agg = ds.mean(keep_attrs=True, **kwargs)

@@ -374,9 +409,14 @@ def spatial_mean(
)

# This calls .interp() to a pair of coordinates
elif method == "interp_coord":
elif method == "interp_centroid":
# Find the centroid
if region is not None:
if region is None:
if ds.cf.axes["X"][0] not in kwargs:
kwargs[ds.cf.axes["X"][0]] = ds[ds.cf.axes["X"][0]].mean().values
if ds.cf.axes["Y"][0] not in kwargs:
kwargs[ds.cf.axes["Y"][0]] = ds[ds.cf.axes["Y"][0]].mean().values
else:
if region["method"] == "gridpoint":
if len(region["gridpoint"]["lon"]) != 1:
raise ValueError(
Expand Down Expand Up @@ -413,6 +453,10 @@ def spatial_mean(

# Uses xesmf.SpatialAverager
elif method == "xesmf":
logger.warning(
"A bug has been found with xesmf.SpatialAverager that impacts big or irregular regions. "
"Until this is fixed, make sure that the computation is right or use multiple smaller regions."
)
# If the region is a bounding box, call shapely and geopandas to transform it into an input compatible with xesmf
if region["method"] == "bbox":
lon_point_list = [
@@ -455,7 +499,7 @@
)

else:
raise ValueError("'method' not understood.")
raise ValueError("'method' should be one of [bbox, shape].")

kwargs_copy = deepcopy(kwargs)
skipna = kwargs_copy.pop("skipna", False)
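For a rectilinear grid, the pygeos-based weights in the `cos-lat` branch reduce to the planar cell area (width × height in degrees, taken from the coordinate bounds) scaled by cos(latitude). The sketch below shows that special case; `cell_weights` is a hypothetical helper for illustration, since the PR's pygeos polygons also handle curvilinear grids:

```python
import numpy as np

def cell_weights(lon_bounds, lat_bounds):
    """Rectilinear special case of the 'cos-lat' weights: planar cell area
    (from the coordinate bounds) scaled by cos(latitude). Hypothetical
    helper for illustration only.
    """
    dlon = np.diff(lon_bounds)                        # cell widths, (nlon,)
    dlat = np.diff(lat_bounds)                        # cell heights, (nlat,)
    lat_centers = 0.5 * (lat_bounds[:-1] + lat_bounds[1:])
    area = dlat[:, None] * dlon[None, :]              # deg^2 per cell
    return area * np.cos(np.deg2rad(lat_centers))[:, None]

# Bounds matching the 0.25 deg grid built in the getting-started notebook
lon_b = np.linspace(-68.5, -67.5, 5)
lat_b = np.linspace(48.5, 49.5, 5)
w = cell_weights(lon_b, lat_b)
print(w.shape)                          # (4, 4): one weight per grid cell
print(bool(w[0].sum() > w[-1].sum()))   # True: southern cells weigh more
```

Feeding such weights to `ds.weighted(weights).mean(...)` is exactly what the new branch does, only with the cell polygons computed by pygeos instead of this rectilinear shortcut.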
59 changes: 28 additions & 31 deletions xscen/extract.py
@@ -1310,46 +1310,43 @@ def _subset_file_coverage(
)
files_in_range = file_intervals.apply(lambda r: period_interval.overlaps(r))

if len(df[files_in_range]) == 0:
logging.warning(
f"{df['id'].iloc[0] + ': ' if 'id' in df.columns else ''}Insufficient coverage (no files in range)."
)
return pd.DataFrame(columns=df.columns)

# Very rough guess of the coverage relative to the requested period,
# without having to open the files or checking day-by-day
# This is only checking that you have the first and last time point, not that you have everything in between.

guessed_nb_hrs = np.min(
[
df[files_in_range]["date_end"].max(),
date_parser(str(period[1]), end_of_period=True, freq="H"),
]
) - np.max(
[
df[files_in_range]["date_start"].min(),
date_parser(str(period[0]), freq="H"),
]
)

# Number of hours in the requested period
period_nb_hrs = date_parser(
str(period[1]), end_of_period=True, freq="H"
) - date_parser(str(period[0]), freq="H")

# This checks the sum of hours in all selected files
if len(df[files_in_range]) != 0:
guessed_nb_hrs_sum = (
df[files_in_range]["date_end"] - df[files_in_range]["date_start"]
).sum()
# if no files are selected and guessed_nb_hrs_sum=0, the division breaks the code.
# The warning will be raised anyway even without this test.
else:
guessed_nb_hrs_sum = period_nb_hrs
# Sum of hours in all selected files, restricted by the requested period
guessed_nb_hrs_sum = (
df[files_in_range].apply(
lambda x: np.min(
[
x["date_end"],
date_parser(str(period[1]), end_of_period=True, freq="H"),
]
),
axis=1,
)
- df[files_in_range].apply(
lambda x: np.max(
[x["date_start"], date_parser(str(period[0]), freq="H")]
),
axis=1,
)
).sum()

if (
guessed_nb_hrs / period_nb_hrs < coverage
or len(df[files_in_range]) == 0
or guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos < coverage
):
if guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos < coverage:
logging.warning(
f"{df['id'].iloc[0] + ': ' if 'id' in df.columns else ''}Insufficient coverage."
f"% covered, min to max : {guessed_nb_hrs / period_nb_hrs:.1%}, "
f"% covered, sum of hours : {guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos:.1%}, "
f"number of files in range : {len(df[files_in_range])}."
f"{df['id'].iloc[0] + ': ' if 'id' in df.columns else ''}Insufficient coverage "
f"(guessed at {guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos:.1%})."
)
return pd.DataFrame(columns=df.columns)
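The new check boils down to: clip each file's interval to the requested period, sum the clipped durations, and compare the total to the period's length. A standalone sketch with plain datetimes (`coverage_fraction` is hypothetical; like the rough guess above, it does not guard against double-counting files whose intervals overlap):

```python
from datetime import datetime

def coverage_fraction(files, period):
    """Fraction of `period` covered by the given file intervals.

    files: list of (start, end) datetime pairs; period: (start, end).
    Hypothetical, simplified sketch of the check in _subset_file_coverage.
    """
    p0, p1 = period
    total = (p1 - p0).total_seconds()
    covered = 0.0
    for f0, f1 in files:
        lo, hi = max(f0, p0), min(f1, p1)  # clip the file to the period
        if hi > lo:
            covered += (hi - lo).total_seconds()
    return covered / total

files = [
    (datetime(2000, 1, 1), datetime(2000, 7, 1)),
    (datetime(2000, 7, 1), datetime(2001, 1, 1)),
]
period = (datetime(2000, 1, 1), datetime(2001, 1, 1))
print(coverage_fraction(files, period))  # 1.0: the two files tile the year

# A file entirely outside the period contributes nothing
print(coverage_fraction([(datetime(1990, 1, 1), datetime(1991, 1, 1))], period))  # 0.0
```

Comparing this fraction against the `coverage` threshold then decides whether the files are kept or an "Insufficient coverage" warning is raised.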
