Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented distributed regridding #762

Merged
merged 8 commits into from
Jul 18, 2019
Merged

Conversation

philippjfr
Copy link
Member

@philippjfr philippjfr commented Jul 5, 2019

Implements regridding of dask.array.Array backed xarray.DataArray objects without loading the entire array into memory.

The basic idea here is that the operation will no longer load dask arrays into memory, instead it will resample the original data in chunks and then stack the resampled chunks along the two axes. By default it will try to keep the chunksize consistent, however it now also exposes a memory constraint which can be used to limit the amount of data in memory at any one time.

If enabled it is possible to end up with situations where resampling cannot be performed given the memory constraints, e.g. let us say you have a 1 GB array which is roughly 12k x 12k in shape, and we want to resample it into a 10x10 grid. If we we give it a memory limit of 100 KB, even if each output chunk is only 1x1 it still has to load a 1200x1200 grid, i.e. about 11.5 MB, into memory . This could be solved by splitting the task and creating temporary chunks which are then aggregated but I considered that out of scope. Instead you will get an error that you have to relax the memory constraint.

Here is the profiling output when regridding a 10k x 10k grid into a 100x100 grid:

bokeh_plot - 2019-07-15T131812 114

Here is the same, but this time the array isn't already in memory but loaded from disk with zarr:

bokeh_plot - 2019-07-15T140555 076

Here we are zooming on a 50k x 50k (20 GB) array loaded with zarr (note how it slows down when we zoom out), while trying to minimize memory load:

dask_zarr

Known issues

  • Upsampling with 'linear' interpolation does not yet use correct padding so boundaries are wrong

  • Add tests

@philippjfr philippjfr added the wip label Jul 5, 2019
@philippjfr philippjfr force-pushed the philippjfr/distributed_regrid branch from 6b1bb75 to ab85b6c Compare July 15, 2019 11:00
@philippjfr philippjfr force-pushed the philippjfr/distributed_regrid branch from 8fba932 to 84d0bfc Compare July 15, 2019 16:06
@philippjfr philippjfr removed the wip label Jul 16, 2019
Copy link
Member

@jbednar jbednar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Please merge once the minor review comments have been addressed and the tests pass. Thanks for doing the clearly somewhat painful work involved in this. I tried my best to focus on some of the hairier C-like code here, and it looks fine, but it would be difficult for me to spot an off-by-one error...

datashader/core.py Outdated Show resolved Hide resolved
datashader/core.py Outdated Show resolved Hide resolved
datashader/resampling.py Outdated Show resolved Hide resolved
datashader/resampling.py Outdated Show resolved Hide resolved
datashader/resampling.py Outdated Show resolved Hide resolved
datashader/resampling.py Outdated Show resolved Hide resolved
Co-Authored-By: James A. Bednar <jbednar@users.noreply.github.com>
@philippjfr philippjfr merged commit 1b9f300 into master Jul 18, 2019
@jbednar jbednar deleted the philippjfr/distributed_regrid branch July 18, 2019 13:22
@jbednar jbednar mentioned this pull request Jan 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants