cos-lat averaging #125

Merged: 19 commits, Mar 10, 2023
13 changes: 12 additions & 1 deletion HISTORY.rst
@@ -4,12 +4,22 @@ History

v0.6.0 (unreleased)
-------------------
Contributors to this version: Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`), Pascal Bourgault (:user:`aulemahal`).
Contributors to this version: Trevor James Smith (:user:`Zeitsperre`), Juliette Lavoie (:user:`juliettelavoie`), Pascal Bourgault (:user:`aulemahal`), Gabriel Rondeau-Genesse (:user:`RondeauG`).

New features and enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* New 'cos-lat' averaging in `spatial_mean` (:issue:`94`, :pull:`125`).

Breaking changes
^^^^^^^^^^^^^^^^
* 'mean' averaging has been deprecated in `spatial_mean` (:pull:`125`).
* 'interp_coord' has been renamed to 'interp_centroid' in `spatial_mean` (:pull:`125`).

Bug fixes
^^^^^^^^^
* Forbid pandas v1.5.3 in the environment files, as the linux conda build breaks the data catalog parser. (:issue:`161`, :pull:`162`).
* Only return requested variables when using ``DataCatalog.to_dataset`` (:pull:`163`).
* `compute_indicators` no longer crashes if less than 3 timesteps are produced (:pull:`125`).

Internal changes
^^^^^^^^^^^^^^^^
@@ -18,6 +28,7 @@ Internal changes
* Minimal documentation for templates (:pull:`163`).
* `xscen` is now indexed in `Zenodo <https://zenodo.org/>`_, under the `ouranos` community of projects (:pull:`164`).
* Added a few relevant `Shields <https://shields.io/>`_ to the README.rst (:pull:`164`).
* Better warning messages in `_subset_file_coverage` when coverage is insufficient (:pull:`125`).

v0.5.0 (2023-02-28)
-------------------
13 changes: 8 additions & 5 deletions docs/notebooks/2_getting_started.ipynb
@@ -387,6 +387,10 @@
"\n",
"ds_grid = xesmf.util.cf_grid_2d(-68.5, -67.5, 0.25, 48.5, 49.5, 0.25)\n",
"\n",
"# cf_grid_2d does not set the 'axis' attribute\n",
"ds_grid[\"lon\"].attrs[\"axis\"] = \"X\"\n",
"ds_grid[\"lat\"].attrs[\"axis\"] = \"Y\"\n",
"\n",
"# xscen will use the domain to re-assign attributes, so it is important to set it up for custom grids like this\n",
"ds_grid.attrs[\"cat:domain\"] = \"regular0-25\"\n",
"ds_grid"
@@ -983,11 +987,10 @@
"\n",
"`xs.spatial_mean` is used to compute the spatial average over a given region, using various methods. The argument `call_clisops` can also be used to subset the domain prior to the averaging.\n",
"\n",
"- `method: mean` will simply perform a `.mean()` operation over the spatial dimensions.\n",
" - `kwargs` can be used to instruct on which dimension(s) to perform the averaging, using `{\"dim\": list}`. Otherwise, the script will try to guess the X and Y axes using CF Conventions.\n",
"- `method: cos-lat` will perform an average operation over the spatial dimensions, accounting for changes in grid cell area along the 'lat' coordinate.\n",
"\n",
"\n",
"- `method: interp_coord` will perform a linear interpolation towards given coordinates.\n",
"- `method: interp_centroid` will perform an interpolation towards given coordinates or towards the centroid of a region.\n",
"  - `kwargs` is used to send arguments to `.interp()`, including `lon` and `lat`.\n",
" - `region` can alternatively be used to send a gridpoint, bbox, or shapefile and compute the centroid. This argument is a dictionary that follows the same requirements as the one for `xs.extract` described previously.\n",
"\n",
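In essence, the `cos-lat` method weights each grid cell by the cosine of its latitude before averaging, since cells on a regular lon/lat grid shrink toward the poles. A minimal NumPy sketch of the idea (`cos_lat_mean` is a hypothetical helper, and this simplification omits the cell-area factor that xscen also derives from the coordinate bounds):

```python
import numpy as np

def cos_lat_mean(data, lat):
    """Spatial mean weighted by cos(latitude) on a regular (lat, lon) grid.

    Hypothetical helper for illustration; xscen additionally weights by
    the cell area computed from the coordinate bounds.
    """
    weights = np.cos(np.deg2rad(lat))[:, np.newaxis] * np.ones_like(data)
    return (data * weights).sum() / weights.sum()

lat = np.linspace(20.0, 60.0, 5)   # degrees_north
field = np.full((5, 8), 3.0)       # constant field on a 5x8 grid
print(cos_lat_mean(field, lat))    # a constant field averages to ~3.0

# An unweighted mean over-counts high latitudes, where cells are smaller
tilted = lat[:, np.newaxis] * np.ones((5, 8))
print(bool(cos_lat_mean(tilted, lat) < tilted.mean()))  # True
```

The second check shows why the deprecated plain `mean` biases results poleward on global or wide-latitude domains.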
@@ -1010,7 +1013,7 @@
"for key, ds in ds_dict.items():\n",
" ds_savg = xs.spatial_mean(\n",
" ds=ds,\n",
" method=\"interp_coord\",\n",
" method=\"interp_centroid\",\n",
" kwargs={\"method\": \"linear\", \"lon\": -70, \"lat\": 48.5},\n",
" to_domain=\"aggregated\",\n",
" )\n",
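For reference, when no coordinates are passed through `kwargs`, `interp_centroid` falls back to the centroid of the grid (the mean of each spatial coordinate). The sketch below illustrates that default with a plain-NumPy bilinear interpolation; `interp_to_point` is a hypothetical helper for illustration, not the actual xarray `.interp()` call that xscen makes:

```python
import numpy as np

def interp_to_point(data, lon, lat, target=None):
    """Bilinear interpolation of a (lat, lon) field to one point.

    With target=None, fall back to the grid centroid (the mean of each
    coordinate), mirroring interp_centroid's default behaviour.
    Hypothetical sketch for illustration only.
    """
    if target is None:
        target = (lon.mean(), lat.mean())
    tx, ty = target
    # Index of the cell's lower-left corner surrounding the target point
    i = max(np.searchsorted(lon, tx) - 1, 0)
    j = max(np.searchsorted(lat, ty) - 1, 0)
    # Fractional position of the target inside that cell
    wx = (tx - lon[i]) / (lon[i + 1] - lon[i])
    wy = (ty - lat[j]) / (lat[j + 1] - lat[j])
    return ((1 - wy) * ((1 - wx) * data[j, i] + wx * data[j, i + 1])
            + wy * ((1 - wx) * data[j + 1, i] + wx * data[j + 1, i + 1]))

lon = np.array([-71.0, -70.0, -69.0])
lat = np.array([48.0, 49.0, 50.0])
field = lat[:, None] + 0.0 * lon[None, :]   # field equal to the latitude
print(interp_to_point(field, lon, lat))     # centroid is (-70, 49) -> 49.0
```

Because the test field equals the latitude, interpolating to the centroid recovers the centroid's latitude exactly.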
@@ -1238,7 +1241,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.9"
}
},
"nbformat": 4,
2 changes: 2 additions & 0 deletions environment-dev.yml
@@ -7,6 +7,7 @@ dependencies:
# Main packages
- cartopy
- cftime
- cf_xarray >= 0.7.6
- clisops
- dask
- flox
@@ -19,6 +20,7 @@ dependencies:
- netCDF4
- numpy
- pandas!=1.5.3
- pygeos
- pyyaml
- rechunker
- shapely
2 changes: 2 additions & 0 deletions environment.yml
@@ -7,6 +7,7 @@ dependencies:
# Main packages
- cartopy
- cftime
- cf_xarray >= 0.7.6
- clisops
- dask
- flox
@@ -19,6 +20,7 @@ dependencies:
- netCDF4
- numpy
- pandas!=1.5.3
- pygeos
- pyyaml
- rechunker
- shapely
Expand Down
2 changes: 2 additions & 0 deletions setup.py
@@ -27,6 +27,7 @@
requirements = [
"cartopy",
"cftime",
"cf_xarray>=0.7.6",
"clisops",
"dask",
"flox",
@@ -39,6 +40,7 @@
"netCDF4",
"numpy",
"pandas",
"pygeos",
"pyyaml",
"rechunker",
"shapely",
102 changes: 73 additions & 29 deletions xscen/aggregate.py
@@ -10,6 +10,7 @@
import geopandas as gpd
import numpy as np
import pandas as pd
import pygeos
import xarray as xr
import xesmf as xe
from shapely.geometry import Polygon
@@ -291,15 +292,15 @@ def spatial_mean(
ds : xr.Dataset
Dataset to use for the computation.
method : str
'mean' will perform a .mean() over the spatial dimensions of the Dataset.
'interp_coord' will find the region's centroid (if coordinates are not fed through kwargs), then perform a .interp() over the spatial dimensions of the Dataset.
'cos-lat' will weight the area covered by each pixel using an approximation based on latitude.
'interp_centroid' will find the region's centroid (if coordinates are not fed through kwargs), then perform a .interp() over the spatial dimensions of the Dataset.
The coordinate can also be directly fed to .interp() through the 'kwargs' argument below.
'xesmf' will make use of xESMF's SpatialAverager. This will typically be more precise, especially for irregular regions, but can be much slower than other methods.
call_clisops : bool
If True, xscen.extraction.clisops_subset will be called prior to the other operations. This requires the 'region' argument.
region : dict
Description of the region and the subsetting method (required fields listed in the Notes).
If method=='interp_coord', this is used to find the region's centroid.
If method=='interp_centroid', this is used to find the region's centroid.
If method=='xesmf', the bounding box or shapefile is given to SpatialAverager.
kwargs : dict
Arguments to send to either mean(), interp() or SpatialAverager().
@@ -334,36 +335,70 @@ def spatial_mean(
xarray.Dataset.mean, xarray.Dataset.interp, xesmf.SpatialAverager
"""
kwargs = kwargs or {}
if method == "mean":
warnings.warn(
"xs.spatial_mean with method=='mean' is deprecated and will be abandoned in a future release. Use method=='cos-lat' instead for a more robust but similar method.",
category=DeprecationWarning,
stacklevel=2,
)
elif method == "interp_coord":
warnings.warn(
"xs.spatial_mean with method=='interp_coord' is deprecated. Use method=='interp_centroid' instead.",
category=DeprecationWarning,
stacklevel=2,
)
method = "interp_centroid"

# If requested, call xscen.extraction.clisops_subset prior to averaging
if call_clisops:
ds = clisops_subset(ds, region)

if method == "cos-lat":
if "latitude" not in ds.cf.coordinates:
raise ValueError(
"Could not determine the latitude name using CF conventions. "
"Use kwargs = {lat: str} to specify the coordinate name."
)

if "units" not in ds.cf["latitude"].attrs:
logger.warning(
f"{ds.attrs.get('cat:id', '')}: Latitude does not appear to have units. Make sure that the computation is right."
)
elif ds.cf["latitude"].attrs["units"] != "degrees_north":
logger.warning(
f"{ds.attrs.get('cat:id', '')}: Latitude units is '{ds.cf['latitude'].attrs['units']}', expected 'degrees_north'. "
f"Make sure that the computation is right."
)

if (ds.cf["longitude"].min() < -160) & (ds.cf["longitude"].max() > 160):
logger.warning(
"The region appears to be crossing the -180/180° meridian. Bounds computation is currently bugged in cf_xarray. "
"Make sure that the computation is right."
)

ds = ds.cf.add_bounds(["longitude", "latitude"])
weights = xr.DataArray(
pygeos.area(
pygeos.polygons(pygeos.linearrings(ds.lon_bounds, ds.lat_bounds))
),
dims=ds.cf["longitude"].dims,
coords=ds.cf["longitude"].coords,
) * np.cos(np.deg2rad(ds.cf["latitude"]))

ds_agg = ds.weighted(weights).mean(
[d for d in ds.cf.axes["X"] + ds.cf.axes["Y"]], keep_attrs=True
)

# Prepare the History field
new_history = (
f"[{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] "
f"weighted mean(dim={[d for d in ds.cf.axes['X'] + ds.cf.axes['Y']]}) using a 'cos-lat' approximation"
)

# This simply calls .mean() over the spatial dimensions
if method == "mean":
elif method == "mean":
if "dim" not in kwargs:
# Determine the X and Y names
spatial_dims = []
for d in ["X", "Y"]:
if d in ds.cf.axes:
spatial_dims.extend([ds.cf[d].name])
elif (
(d == "X")
and ("longitude" in ds.cf.coordinates)
and (len(ds[ds.cf.coordinates["longitude"][0]].dims) == 1)
):
spatial_dims.extend(ds.cf.coordinates["longitude"])
elif (
(d == "Y")
and ("latitude" in ds.cf.coordinates)
and (len(ds[ds.cf.coordinates["latitude"][0]].dims) == 1)
):
spatial_dims.extend(ds.cf.coordinates["latitude"])
if len(spatial_dims) == 0:
raise ValueError(
"Could not determine the spatial dimension(s) using CF conventions. Use kwargs = {dim: list} to specify on which dimension to perform the averaging."
)
kwargs["dim"] = spatial_dims
kwargs["dim"] = ds.cf.axes["X"] + ds.cf.axes["Y"]

ds_agg = ds.mean(keep_attrs=True, **kwargs)

@@ -374,9 +409,14 @@ def spatial_mean(
)

# This calls .interp() to a pair of coordinates
elif method == "interp_coord":
elif method == "interp_centroid":
# Find the centroid
if region is not None:
if region is None:
if ds.cf.axes["X"][0] not in kwargs:
kwargs[ds.cf.axes["X"][0]] = ds[ds.cf.axes["X"][0]].mean().values
if ds.cf.axes["Y"][0] not in kwargs:
kwargs[ds.cf.axes["Y"][0]] = ds[ds.cf.axes["Y"][0]].mean().values
else:
if region["method"] == "gridpoint":
if len(region["gridpoint"]["lon"]) != 1:
raise ValueError(
Expand Down Expand Up @@ -413,6 +453,10 @@ def spatial_mean(

# Uses xesmf.SpatialAverager
elif method == "xesmf":
logger.warning(
"A bug has been found with xesmf.SpatialAverager that impacts big or irregular regions. "
"Until this is fixed, make sure that the computation is right or use multiple smaller regions."
)
# If the region is a bounding box, call shapely and geopandas to transform it into an input compatible with xesmf
if region["method"] == "bbox":
lon_point_list = [
@@ -455,7 +499,7 @@
)

else:
raise ValueError("'method' not understood.")
raise ValueError("'method' should be one of [bbox, shape].")

kwargs_copy = deepcopy(kwargs)
skipna = kwargs_copy.pop("skipna", False)
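For a rectilinear grid, the pygeos-based weights in the `cos-lat` branch reduce to the planar cell area (width × height in degrees, taken from the coordinate bounds) scaled by cos(latitude). The sketch below shows that special case; `cell_weights` is a hypothetical helper for illustration, since the PR's pygeos polygons also handle curvilinear grids:

```python
import numpy as np

def cell_weights(lon_bounds, lat_bounds):
    """Rectilinear special case of the 'cos-lat' weights: planar cell area
    (from the coordinate bounds) scaled by cos(latitude). Hypothetical
    helper for illustration only.
    """
    dlon = np.diff(lon_bounds)                        # cell widths, (nlon,)
    dlat = np.diff(lat_bounds)                        # cell heights, (nlat,)
    lat_centers = 0.5 * (lat_bounds[:-1] + lat_bounds[1:])
    area = dlat[:, None] * dlon[None, :]              # deg^2 per cell
    return area * np.cos(np.deg2rad(lat_centers))[:, None]

# Bounds matching the 0.25 deg grid built in the getting-started notebook
lon_b = np.linspace(-68.5, -67.5, 5)
lat_b = np.linspace(48.5, 49.5, 5)
w = cell_weights(lon_b, lat_b)
print(w.shape)                          # (4, 4): one weight per grid cell
print(bool(w[0].sum() > w[-1].sum()))   # True: southern cells weigh more
```

Feeding such weights to `ds.weighted(weights).mean(...)` is exactly what the new branch does, only with the cell polygons computed by pygeos instead of this rectilinear shortcut.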
59 changes: 28 additions & 31 deletions xscen/extract.py
@@ -1310,46 +1310,43 @@ def _subset_file_coverage(
)
files_in_range = file_intervals.apply(lambda r: period_interval.overlaps(r))

if len(df[files_in_range]) == 0:
logging.warning(
f"{df['id'].iloc[0] + ': ' if 'id' in df.columns else ''}Insufficient coverage (no files in range)."
)
return pd.DataFrame(columns=df.columns)

# Very rough guess of the coverage relative to the requested period,
# without having to open the files or checking day-by-day
# This is only checking that you have the first and last time point, not that you have everything in between.

guessed_nb_hrs = np.min(
[
df[files_in_range]["date_end"].max(),
date_parser(str(period[1]), end_of_period=True, freq="H"),
]
) - np.max(
[
df[files_in_range]["date_start"].min(),
date_parser(str(period[0]), freq="H"),
]
)

# Number of hours in the requested period
period_nb_hrs = date_parser(
str(period[1]), end_of_period=True, freq="H"
) - date_parser(str(period[0]), freq="H")

# This checks the sum of hours in all selected files
if len(df[files_in_range]) != 0:
guessed_nb_hrs_sum = (
df[files_in_range]["date_end"] - df[files_in_range]["date_start"]
).sum()
# if no files are selected and guessed_nb_hrs_sum=0, the division breaks the code.
# The warning will be raised anyway even without this test.
else:
guessed_nb_hrs_sum = period_nb_hrs
# Sum of hours in all selected files, restricted by the requested period
guessed_nb_hrs_sum = (
df[files_in_range].apply(
lambda x: np.min(
[
x["date_end"],
date_parser(str(period[1]), end_of_period=True, freq="H"),
]
),
axis=1,
)
- df[files_in_range].apply(
lambda x: np.max(
[x["date_start"], date_parser(str(period[0]), freq="H")]
),
axis=1,
)
).sum()

if (
guessed_nb_hrs / period_nb_hrs < coverage
or len(df[files_in_range]) == 0
or guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos < coverage
):
if guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos < coverage:
logging.warning(
f"{df['id'].iloc[0] + ': ' if 'id' in df.columns else ''}Insufficient coverage."
f"% covered, min to max : {guessed_nb_hrs / period_nb_hrs:.1%}, "
f"% covered, sum of hours : {guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos:.1%}, "
f"number of files in range : {len(df[files_in_range])}."
f"{df['id'].iloc[0] + ': ' if 'id' in df.columns else ''}Insufficient coverage "
f"(guessed at {guessed_nb_hrs_sum.nanos / period_nb_hrs.nanos:.1%})."
)
return pd.DataFrame(columns=df.columns)
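The new check boils down to: clip each file's interval to the requested period, sum the clipped durations, and compare the total to the period's length. A standalone sketch with plain datetimes (`coverage_fraction` is hypothetical; like the rough guess above, it does not guard against double-counting files whose intervals overlap):

```python
from datetime import datetime

def coverage_fraction(files, period):
    """Fraction of `period` covered by the given file intervals.

    files: list of (start, end) datetime pairs; period: (start, end).
    Hypothetical, simplified sketch of the check in _subset_file_coverage.
    """
    p0, p1 = period
    total = (p1 - p0).total_seconds()
    covered = 0.0
    for f0, f1 in files:
        lo, hi = max(f0, p0), min(f1, p1)  # clip the file to the period
        if hi > lo:
            covered += (hi - lo).total_seconds()
    return covered / total

files = [
    (datetime(2000, 1, 1), datetime(2000, 7, 1)),
    (datetime(2000, 7, 1), datetime(2001, 1, 1)),
]
period = (datetime(2000, 1, 1), datetime(2001, 1, 1))
print(coverage_fraction(files, period))  # 1.0: the two files tile the year

# A file entirely outside the period contributes nothing
print(coverage_fraction([(datetime(1990, 1, 1), datetime(1991, 1, 1))], period))  # 0.0
```

Comparing this fraction against the `coverage` threshold then decides whether the files are kept or an "Insufficient coverage" warning is raised.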
