Parallel weight generation with Dask #290
Conversation
xesmf/frontend.py
Outdated
if 'lon_b' in ds_out.variables and 'lon_b' not in ds_out.coords.variables:
    ds_out = ds_out.assign_coords(lon_b=ds_out.lon_b, lat_b=ds_out.lat_b)
if 'lon_b' in ds_in.variables and 'lon_b' not in ds_in.coords.variables:
    ds_in = ds_in.assign_coords(lon_b=ds_in.lon_b, lat_b=ds_in.lat_b)
Suggested change:

if 'lon_b' in ds_out.data_vars:
    ds_out = ds_out.set_coords(['lon_b', 'lat_b'])
if 'lon_b' in ds_in.data_vars:
    ds_in = ds_in.set_coords(['lon_b', 'lat_b'])
simply simpler ;) !
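A hedged sketch of the suggested simplification: xarray's `set_coords()` promotes existing data variables to coordinates, which achieves the same result as the `assign_coords()` version. The bounds values here are dummies; only the variable names (`lon_b`/`lat_b`) follow the xESMF bounds convention.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        'lon_b': (('y_b', 'x_b'), np.zeros((3, 4))),
        'lat_b': (('y_b', 'x_b'), np.zeros((3, 4))),
    }
)
assert 'lon_b' in ds.data_vars  # bounds start out as plain data variables

# The suggested form: promote both bounds to coordinates in one call
if 'lon_b' in ds.data_vars:
    ds = ds.set_coords(['lon_b', 'lat_b'])

print('lon_b' in ds.coords and 'lat_b' in ds.coords)  # True
```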
xesmf/frontend.py
Outdated
ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
    ['longitude', 'latitude']
)
ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)
Is this still needed? If the only variable left is `mask`, do we really care if extra dims are still present? And if we do, why is this drop not done for the case above (`'mask' in ds_out`)?
The mask is the only variable left, but you could still have the `time` coordinate, for example, which could be significant depending on the length of the time series. Those lines drop the `time` coord. I agree that we should do it in both cases and also when `locstream_out=True`. I'll make that change.
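A hedged sketch of what the discussed lines accomplish: dropping every dimension the horizontal coordinates don't use, so that a large `time` coordinate does not travel along with the mask. Plain xarray is used here instead of the cf-xarray accessor from the PR; all names and sizes are illustrative.

```python
import numpy as np
import xarray as xr

ds_out = xr.Dataset(
    {'mask': (('y', 'x'), np.ones((4, 5), dtype='i1'))},
    coords={
        'lat': (('y', 'x'), np.zeros((4, 5))),
        'lon': (('y', 'x'), np.zeros((4, 5))),
        'time': ('time', np.arange(1000)),  # potentially large, unneeded
    },
)

# Keep only the dimensions used by the horizontal coordinates
keep = set(ds_out['lat'].dims) | set(ds_out['lon'].dims)
ds_out = ds_out.drop_dims(set(ds_out.dims) - keep)

print(sorted(ds_out.dims))  # 'time' is gone; only 'x' and 'y' remain
```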
xesmf/frontend.py
Outdated
ds_out_dims_drop = set(ds_out.cf.coordinates.keys()).difference(
    ['longitude', 'latitude']
)
ds_out = ds_out.cf.drop_dims(ds_out_dims_drop)

# Drop unnecessary variables in ds_in to save memory
if not locstream_in:
Why are these steps not done in the `locstream` case? I guess the problem they avoid is lesser in that case, but it would still exist?
Yes, I'll move it out of the `if locstream_out` condition.
Really nice work @charlesgauthier-udm!
… keep init from getting too complex
@huard @aulemahal Looks like moving the para_regrid code outside of
I can live with that.
Implemented parallel weight generation using Dask and xarray's `map_blocks`. Here is a quick summary: the user can pass `parallel=True` to the `Regridder` and the weights will be computed in parallel.

Key points:
- Chunks are defined by the `dataset` or `dataarray` given to `Regridder`.
- There is overhead with `map_blocks` and dask, especially with the creation of a template for `map_blocks`, so for small grids serial weight generation is preferred. Therefore, the default is `parallel=False`.
- With `parallel=True`, a `Regridder` object identical to the serial case is returned. Could possibly add a `self.parallel` attribute in the `Regridder` to keep track of whether it was generated in parallel.

Examples:
Using dask to compute the weights allows larger-than-memory datasets to be used. Using subsets of the Gridded Population of the World (GPW) and the CORDEX WRF grid in Lambert conformal projection with a 0.22° resolution (y: 281, x: 297), we get the following examples:
- WRF (y: 281, x: 297) → GPW_subset (lat: 5000, lon: 5000); `parallel=False`: memory overflows; `parallel=True`: `Regridder` created in ~86 s on my 4-core machine.
- With `parallel=True` I can tackle even bigger datasets: WRF (y: 281, x: 297) → GPW_subset (lat: 7000, lon: 7000): `Regridder` created in ~2 min.

Comparing serial vs. parallel, the overhead related to dask and `map_blocks` makes it slower for small datasets, but for bigger datasets we can compare both:
- WRF (y: 281, x: 297) → GPW_subset (lat: 5000, lon: 4000); `parallel=False`: `Regridder` created in ~100 s; `parallel=True`: `Regridder` created in ~50 s. Roughly 2× faster.

Execution time and memory usage are highly dependent on chunk sizes and the number of cores available. However, by chunking the output dataset, the user can adjust them to a specific problem.