Parallel weight generation with Dask #290

charlesgauthier-udm · 2023-08-18T17:43:19Z

Implemented parallel weight generation using Dask and xarray's map_blocks. In short: the user can pass parallel=True to the Regridder and the weights will be computed in parallel.

Key points:

Parallel weight generation uses the chunks of the output Dataset or DataArray given to Regridder.
There is overhead associated with map_blocks and Dask, especially when creating the template for map_blocks, so serial weight generation is preferred for small grids. The default is therefore parallel=False.
When parallel=True, the Regridder object returned is identical to the one from the serial case. We could add a self.parallel attribute to the Regridder to record whether the weights were generated in parallel.
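To illustrate the idea behind the key points above, here is a minimal sketch that is not xESMF code: a toy 1-D nearest-neighbour weight generator is run once serially and once per chunk of the output grid, and the per-chunk results are concatenated. The function names, the toy grids, the chunk size of 5 and the use of threads in place of Dask workers are all assumptions made for the illustration.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def nearest_weights(src, dst, offset=0):
    """Return sparse (row, col, weight) triplets mapping each dst point
    to its nearest point in the sorted list src."""
    triplets = []
    for i, x in enumerate(dst):
        j = bisect_left(src, x)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(src)]
        col = min(candidates, key=lambda k: abs(src[k] - x))
        triplets.append((offset + i, col, 1.0))
    return triplets


def chunked(seq, size):
    """Yield (start_index, block) pairs covering seq."""
    for start in range(0, len(seq), size):
        yield start, seq[start:start + size]


src = [0.0, 1.0, 2.0, 3.0]            # coarse input grid (sorted)
dst = [i * 0.25 for i in range(13)]   # finer output grid, 0.0 to 3.0

# Serial: one pass over the whole output grid.
serial = nearest_weights(src, dst)

# Chunked: each block of the output grid is handled independently,
# then the triplets are concatenated in block order.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [
        pool.submit(nearest_weights, src, block, start)
        for start, block in chunked(dst, 5)
    ]
    parallel = [t for f in futures for t in f.result()]

print(parallel == serial)
```

Because each output point depends only on the input grid, the blocks can be processed independently and the combined weights match the serial result, which is the property the PR relies on when it says the resulting Regridder is identical.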

Examples

Using Dask to compute the weights allows larger-than-memory datasets to be used. The following examples use subsets of the Gridded Population of the World (GPW) and the CORDEX WRF output on a Lambert conformal grid at 0.22° resolution (y:281, x:297):

WRF (y:281, x:297) --> GPW_subset(lat:5000, lon:5000); parallel=False: memory overflows, parallel=True: Regridder created in ~86s on my 4-core machine.
With parallel=True I can tackle even bigger datasets: WRF (y:281, x:297) --> GPW_subset(lat:7000, lon:7000): Regridder created in ~2 min.

For small datasets, the overhead of Dask and map_blocks makes the parallel path slower than the serial one. For a bigger dataset where both run to completion, they compare as follows:

WRF (y:281, x:297) --> GPW_subset(lat:5000, lon:4000); parallel=False: Regridder created in ~100s, parallel=True: Regridder created in ~50s. Roughly 2x faster.

Execution time and memory usage are highly dependent on chunk sizes and the number of cores available. However, by chunking the output dataset, the user can tune both to a specific problem.
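As a rough aid for picking chunk sizes, the sketch below counts how many blocks a given chunking produces and how many output cells the largest block holds. The helper name block_layout and the candidate chunk sizes are hypothetical; only the 5000 x 5000 output shape comes from the examples above.

```python
import math


def block_layout(shape, chunks):
    """Return (number of blocks, cells in the largest block) for a grid
    of the given shape split into chunks of the given size."""
    n_blocks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))
    cells_per_block = math.prod(min(n, c) for n, c in zip(shape, chunks))
    return n_blocks, cells_per_block


out_shape = (5000, 5000)  # GPW subset from the examples above

# Hypothetical chunk sizes, from a single block down to many small ones.
for chunks in [(5000, 5000), (1250, 1250), (500, 500)]:
    n_blocks, cells = block_layout(out_shape, chunks)
    print(f"chunks={chunks}: {n_blocks} blocks, up to {cells} cells each")
```

Fewer, larger blocks mean less scheduling overhead but more memory per task; more, smaller blocks do the opposite.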



aulemahal · 2023-08-25T15:04:46Z

xesmf/frontend.py

+            if 'lon_b' in ds_out.variables and 'lon_b' not in ds_out.coords.variables:
+                ds_out = ds_out.assign_coords(lon_b=ds_out.lon_b, lat_b=ds_out.lat_b)
+            if 'lon_b' in ds_in.variables and 'lon_b' not in ds_in.coords.variables:
+                ds_in = ds_in.assign_coords(lon_b=ds_in.lon_b, lat_b=ds_in.lat_b)


Suggested change

-            if 'lon_b' in ds_out.variables and 'lon_b' not in ds_out.coords.variables:
-                ds_out = ds_out.assign_coords(lon_b=ds_out.lon_b, lat_b=ds_out.lat_b)
-            if 'lon_b' in ds_in.variables and 'lon_b' not in ds_in.coords.variables:
-                ds_in = ds_in.assign_coords(lon_b=ds_in.lon_b, lat_b=ds_in.lat_b)
+            if 'lon_b' in ds_out.data_vars:
+                ds_out = ds_out.set_coords(['lon_b', 'lat_b'])
+            if 'lon_b' in ds_in.data_vars:
+                ds_in = ds_in.set_coords(['lon_b', 'lat_b'])

simply simpler ;) !
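For readers unfamiliar with set_coords, here is a small sketch of what the suggested lines do, assuming numpy and xarray are installed. The toy dataset, its dimension names and the variable name promoted are made up for the illustration; only the set_coords call and the data_vars check mirror the suggestion.

```python
import numpy as np
import xarray as xr

# Toy dataset in which the cell bounds were loaded as data variables.
ds = xr.Dataset(
    {
        'mask': (('y', 'x'), np.ones((2, 3))),
        'lon_b': (('y_b', 'x_b'), np.zeros((3, 4))),
        'lat_b': (('y_b', 'x_b'), np.zeros((3, 4))),
    }
)

# Same test as in the suggestion: only act if the bounds are data variables.
if 'lon_b' in ds.data_vars:
    promoted = ds.set_coords(['lon_b', 'lat_b'])
else:
    promoted = ds

print(sorted(promoted.data_vars))  # ['mask']
print(sorted(promoted.coords))     # ['lat_b', 'lon_b']
```

set_coords returns a new Dataset and leaves the original untouched, so reassigning the result, as the suggestion does, is required.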

aulemahal · 2023-08-25T15:07:40Z

xesmf/frontend.py

+                    ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
+                        ['longitude', 'latitude']
+                    )
+                    ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)


Suggested change

-                    ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
-                        ['longitude', 'latitude']
-                    )
-                    ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)

Is this still needed? If the only variable left is mask, do we really care whether extra dims are still present?

And if we do, why is this drop not done for the case above ('mask' in ds_out)?

charlesgauthier-udm · 2023-08-25

The mask is the only variable left, but coordinates such as time can still be attached, and depending on the length of the time series they can take up significant memory. Those lines drop the time coord. I agree that we should do it in both cases, and also when locstream_out=True. I'll make that change.
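The selection logic in those lines can be sketched without xarray. Below, a plain dict stands in for ds_out.cf.coordinates, which maps CF coordinate names to variable names; the dict contents are hypothetical and the actual drop via ds_out.cf.drop_dims is not reproduced.

```python
# Hypothetical stand-in for ds_out.cf.coordinates.
cf_coordinates = {
    'longitude': ['lon'],
    'latitude': ['lat'],
    'time': ['time'],
    'vertical': ['lev'],
}

# Everything except the horizontal coordinates is a candidate for dropping.
to_drop = set(cf_coordinates.keys()).difference(['longitude', 'latitude'])

print(sorted(to_drop))  # ['time', 'vertical']
```

Only longitude and latitude survive, which is all the weight generation needs from the output grid.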

aulemahal · 2023-08-25T15:10:38Z

xesmf/frontend.py

+                    ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
+                        ['longitude', 'latitude']
+                    )
+                    ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)

            # Drop unnecessary variables in ds_in to save memory
            if not locstream_in:


Why are these steps not done in the locstream case? I guess the problem they avoid is smaller in that case, but it would still exist?

charlesgauthier-udm · 2023-08-25

Yes, I'll move it out of the if locstream_out condition.

aulemahal

Really nice work @charlesgauthier-udm!


charlesgauthier-udm · 2023-08-31T19:16:51Z

@huard @aulemahal Looks like moving the para_regrid code out of __init__ and into its own method does not solve the issue of __init__ being too complex.

huard · 2023-08-31T19:29:20Z

I can live with that.


charlesgauthier-udm added 4 commits August 15, 2023 14:54

9a72154 Parallel weight generation implementation
4988fa2 Renaming bounds in ds_in ds_out so there's no confusion in map_blocks
0bd4224 added Para weight gen with locstreams, fixed weight gen when bounds are given.
ef1b594 Wrote tests for para weight gen, updated docs
05bc733 [pre-commit.ci] auto fixes from pre-commit.com hooks

huard requested review from huard and aulemahal August 18, 2023 17:47

bdb3fff Fixed failing tests

huard reviewed Aug 18, 2023

xesmf/frontend.py (resolved)
xesmf/frontend.py (outdated, resolved)
xesmf/tests/test_frontend.py (outdated, resolved)

huard reviewed Aug 18, 2023

doc/notebooks/Dask.ipynb (resolved)
doc/notebooks/Dask.ipynb (resolved)
doc/notebooks/Dask.ipynb (resolved)

charlesgauthier-udm and others added 4 commits August 18, 2023 14:25

549a58e Fixed error in tests, added comments to frontend.py
214269a [pre-commit.ci] auto fixes from pre-commit.com hooks
bf27c57 Simplified Dask notebook, added comments to code and fixed ds_in issue in dropping unnecessary vars
3651877 [pre-commit.ci] auto fixes from pre-commit.com hooks

huard approved these changes Aug 23, 2023

aulemahal requested changes Aug 23, 2023

xesmf/frontend.py (outdated, resolved)
xesmf/frontend.py (outdated, resolved)
xesmf/frontend.py (outdated, resolved)
xesmf/frontend.py (outdated, resolved)
xesmf/frontend.py (outdated, resolved)

charlesgauthier-udm and others added 2 commits August 25, 2023 10:12

e83ae39 Implemented requested changes
9f6084f [pre-commit.ci] auto fixes from pre-commit.com hooks

aulemahal reviewed Aug 25, 2023

aulemahal approved these changes Aug 25, 2023

charlesgauthier-udm and others added 3 commits August 25, 2023 13:03

0b45750 Included dims drop for locstream_out=True case.
ffae767 Moved para regridding code outside of init and into its own method to keep init from getting too complex
299de0d [pre-commit.ci] auto fixes from pre-commit.com hooks

huard reviewed Aug 31, 2023

xesmf/tests/test_frontend.py (outdated, resolved)

huard added 2 commits September 1, 2023 08:33

605081f Merge branch 'master' into paraweights
292c1c3 add comment

huard merged commit 910a20c into pangeo-data:master Sep 1, 2023
8 of 10 checks passed
