Parallel weight generation with Dask #290

Merged
17 commits merged on Sep 1, 2023

charlesgauthier-udm
Contributor

This PR implements parallel weight generation using Dask and xarray's map_blocks. In short: the user can pass parallel=True to the Regridder, and the weights will be computed in parallel.

Key points:

  • Parallel weight generation uses the chunks on the output dataset or dataarray given to Regridder.
  • There is overhead associated with map_blocks and dask, especially with the creation of a template for map_blocks, so for small grids serial weight generation is preferred. Therefore, the default is parallel=False.
  • When parallel=True, the Regridder object returned is identical to the one from the serial case. We could possibly add a self.parallel attribute to the Regridder to record whether it was generated in parallel.
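
A minimal usage sketch of the option described above. It assumes ds_in and ds_out are xarray datasets that xESMF can read grid coordinates from, and that ds_out is already chunked with Dask. The helper name build_parallel_regridder and the "bilinear" default are illustrative only and are not part of this PR.

```python
def build_parallel_regridder(ds_in, ds_out, method="bilinear"):
    """Create a Regridder whose weights are computed in parallel with Dask.

    Assumption: ds_out is already chunked. Per the key points above, the
    chunks on the output dataset drive the parallel weight generation.
    """
    # Imported lazily so this sketch can be loaded without ESMF installed.
    import xesmf as xe

    # parallel=False stays the default; serial is preferred for small grids.
    return xe.Regridder(ds_in, ds_out, method, parallel=True)
```

The returned object should be usable exactly like a serially built Regridder, since the two are described as identical.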

Examples

Using dask to compute the weights allows larger-than-memory datasets to be used. Using subsets of the Gridded Population of the World (GPW) and the CORDEX WRF grid (Lambert conformal, 0.22° resolution; y:281, x:297), we get the following examples:

  • WRF (y:281, x:297) --> GPW_subset(lat:5000, lon:5000); parallel=False: runs out of memory; parallel=True: Regridder created in ~86s on my 4-core machine.
  • Using parallel=True, I can tackle even bigger datasets: WRF (y:281, x:297) --> GPW_subset(lat:7000, lon:7000): Regridder created in ~2 min.

The overhead from dask and map_blocks makes the parallel path slower for small datasets, but for a bigger dataset where both paths complete, the two can be compared directly:

  • WRF (y:281, x:297) --> GPW_subset(lat:5000, lon:4000); parallel=False: Regridder created in ~100s; parallel=True: Regridder created in ~50s. Roughly 2x faster.

Execution time and memory usage are highly dependent on chunk sizes and the number of cores available. However, by chunking the output dataset, the user can tune both for a specific problem.
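
To get a feel for that trade-off, here is a rough sketch that counts how many blocks a given chunking produces for the 5000 x 5000 output grid from the examples. The chunk sizes are hypothetical (the PR does not state which ones were used), and count_blocks is a helper made up for illustration. Each block is one unit of work for map_blocks, so fewer, larger blocks generally mean less scheduling overhead but more memory per task.

```python
import math


def count_blocks(shape, chunks):
    """Number of Dask blocks for a grid of `shape` split into `chunks`."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


# Output grid from the examples above; the chunk sizes are hypothetical.
grid = (5000, 5000)
for chunks in [(5000, 5000), (2500, 2500), (1000, 1000)]:
    print(chunks, "->", count_blocks(grid, chunks), "blocks")
```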

@huard huard requested review from huard and aulemahal August 18, 2023 17:47

xesmf/frontend.py
Comment on lines 958 to 961
if 'lon_b' in ds_out.variables and 'lon_b' not in ds_out.coords.variables:
    ds_out = ds_out.assign_coords(lon_b=ds_out.lon_b, lat_b=ds_out.lat_b)
if 'lon_b' in ds_in.variables and 'lon_b' not in ds_in.coords.variables:
    ds_in = ds_in.assign_coords(lon_b=ds_in.lon_b, lat_b=ds_in.lat_b)
Collaborator

Suggested change
- if 'lon_b' in ds_out.variables and 'lon_b' not in ds_out.coords.variables:
-     ds_out = ds_out.assign_coords(lon_b=ds_out.lon_b, lat_b=ds_out.lat_b)
- if 'lon_b' in ds_in.variables and 'lon_b' not in ds_in.coords.variables:
-     ds_in = ds_in.assign_coords(lon_b=ds_in.lon_b, lat_b=ds_in.lat_b)
+ if 'lon_b' in ds_out.data_vars:
+     ds_out = ds_out.set_coords(['lon_b', 'lat_b'])
+ if 'lon_b' in ds_in.data_vars:
+     ds_in = ds_in.set_coords(['lon_b', 'lat_b'])

simply simpler ;) !

Comment on lines 974 to 977
ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
    ['longitude', 'latitude']
)
ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)
Collaborator

Suggested change
- ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
-     ['longitude', 'latitude']
- )
- ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)

Is this still needed? If the only variable left is mask, do we really care if extra dims are still present?

And if we do, why is this drop not done for the case above ('mask' in ds_out)?

Contributor Author

@charlesgauthier-udm charlesgauthier-udm Aug 25, 2023

The mask is the only variable left, but you could still have the time coord, for example, which could be significant depending on the length of the time series. Those lines drop the time coord. I agree that we should do it in both cases, and also when locstream_out=True. I'll make that change.
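
The selection logic in those lines can be looked at in isolation. This is only a sketch: a plain dict stands in for ds_out.cf.coordinates (which maps CF coordinate names to variable names), and dims_to_drop is a hypothetical helper, not a function in xESMF.

```python
def dims_to_drop(cf_coordinates):
    """Return the CF coordinate keys other than longitude/latitude."""
    return set(cf_coordinates.keys()).difference(["longitude", "latitude"])


# Hypothetical mapping, shaped like the dict from ds_out.cf.coordinates.
example = {"longitude": ["lon"], "latitude": ["lat"], "time": ["time"]}
print(dims_to_drop(example))  # {'time'}
```

Only the non-spatial entries (here, time) are selected, so the spatial coordinates needed for weight generation stay untouched.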

ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
    ['longitude', 'latitude']
)
ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)

# Drop unnecessary variables in ds_in to save memory
if not locstream_in:
Collaborator

Why are these steps not done in the locstream case? I guess the problem they avoid is smaller in that case, but it would still exist?

Contributor Author

Yes, I'll move it out of the if locstream_out condition.

Collaborator

@aulemahal aulemahal left a comment

Really nice work @charlesgauthier-udm!

@charlesgauthier-udm
Contributor Author

@huard @aulemahal Looks like moving the para_regrid code out of __init__ into its own method does not solve the issue of __init__ being too complex.

@huard
Contributor

huard commented Aug 31, 2023

I can live with that.

@huard huard merged commit 910a20c into pangeo-data:master Sep 1, 2023
8 of 10 checks passed