
Quick out-of-core downsampling to provide an overview of large data #560

jbednar opened this issue Feb 9, 2018 · 4 comments
jbednar commented Feb 9, 2018

As suggested in #553, we need to improve our raster support so that it can access large rasters chunk by chunk. This will let us work with raster datasets larger than system memory, just as we can currently work with larger-than-memory columnar Dask dataframes, and should make it very practical to work with a small, zoomed-in region of a very large raster dataset.
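
As a point of reference, this is roughly what lazy, chunked access looks like on the xarray/dask side; the file and variable names below are just placeholders:

```python
import xarray as xr

# Open the raster lazily: each (1000, 1000) block becomes a dask chunk, so
# nothing is read from disk until a computation actually needs it.
# "elevation.nc" and the variable name "z" are hypothetical.
ds = xr.open_dataset("elevation.nc", chunks={"x": 1000, "y": 1000})

# Selecting a small window is still lazy; only the chunks overlapping the
# window are read when .values is accessed.
window = ds["z"].sel(x=slice(0, 1000), y=slice(0, 1000))
values = window.values
```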

However, even if full resolution is needed only on deep zooms, it is very often unclear which region of the large dataset should be viewed at full resolution until a zoomed-out overview of the data has been provided to the user. Unfortunately, constructing such an overview using datashader currently requires making a complete pass through the entire dataset, reading all of the data into Python in order to aggregate it into a coarse grid.

Even if the Python code were to explicitly downsample the data, accessing only a strided subset of the data from disk, with most file formats and storage technologies the entire dataset will still end up being accessed.
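
For example, a strided read through h5py (file and dataset names hypothetical) still forces HDF5 to decompress every chunk the selection touches; with a stride smaller than the chunk size, that is essentially the whole file:

```python
import h5py

# Keep only every 100th value in each dimension.  Because HDF5 reads whole
# chunks, a stride of 100 over (1000, 1000)-sized chunks still touches every
# chunk, so the I/O cost is close to that of a full read.
with h5py.File("large_raster.h5", "r") as f:
    overview = f["data"][::100, ::100]
```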

After the above issue is addressed, it would be great if we could work on some specific combinations of aggregations and file formats where constructing such an overview would be relatively cheap. We could also consider providing examples where a pre-processed overview is stored alongside the data, with HoloViews used to switch dynamically between the overview and the full data depending on zoom level.
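
As a rough sketch of that last idea (not an existing API; the zoom threshold, file names, and coordinate handling are all placeholder assumptions):

```python
import holoviews as hv
import xarray as xr
from holoviews.streams import RangeXY

hv.extension("bokeh")

# Hypothetical inputs: a small precomputed overview plus the chunked full data.
overview = xr.open_dataarray("overview.nc")
full = xr.open_dataarray("full.nc", chunks={"x": 1000, "y": 1000})

def view(x_range, y_range):
    # Zoomed out (or no zoom yet): show the cheap precomputed overview.
    if x_range is None or (x_range[1] - x_range[0]) > 10_000:  # placeholder threshold
        return hv.Image(overview)
    # Zoomed in: read and display only the requested window of the full data.
    return hv.Image(full.sel(x=slice(*x_range), y=slice(*sorted(y_range))))

dmap = hv.DynamicMap(view, streams=[RangeXY()])
```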

jbednar commented Feb 9, 2018

Note that I'm discussing raster data above, but the same issues apply to timeseries data, where a very long timeseries needs a quick overview that can be zoomed in to show the full data.
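
For a concrete (if naive) illustration of the timeseries case, with all names hypothetical: a min/max envelope gives a faithful overview, but building it this way still makes a full pass over the data, which is exactly the cost this issue is about.

```python
import dask.array as da
import h5py
import numpy as np

# One very long 1-D timeseries stored in a hypothetical HDF5 file.
f = h5py.File("timeseries.h5", "r")
ts = da.from_array(f["signal"], chunks=1_000_000)

# Collapse every window of 10,000 samples to its min and max, giving a coarse
# envelope that is cheap to plot but preserves extremes.
env_min = da.coarsen(np.min, ts, {0: 10_000}, trim_excess=True).compute()
env_max = da.coarsen(np.max, ts, {0: 10_000}, trim_excess=True).compute()
```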

@LEMUEGGE

I am currently implementing something similar for our data. I came across #553 and then started working on this myself. So far I haven't found a way to wrap xarray's apply_ufunc or dask's map_blocks, even though the latter would support ghosting, which, as mentioned by @philippjfr, will probably be important for the sampling. The problem is that one needs to know the size of the processed data chunks, and these are chunk-dependent for non-uniformly chunked datasets (e.g. an array of size 2055 along one axis, chunked into 1000, 1000, 55).
However, for the sampling methods I am currently interested in (nearest-neighbor upsampling and area-mean downsampling), there seems to be a satisfactory solution using dask.delayed and manual creation of chunks. I addressed the issue of variable chunk sizes in unevenly chunked datasets by instead dividing the plot canvas into equal tiles and mapping (as evenly as possible) equally sized data chunks onto those tiles.
My algorithm looks like this:

1. `data = dask.array.from_array(hdf5_file, chunks=(1000, 1000))`
2. Preprocess `xrange` and `yrange` to match the data extent and resolution.
3. Estimate the number of plot tiles: `xtiles = max(xrange // 2000, 1)` and `ytiles = max(yrange // 2000, 1)` (2000 being something that fits comfortably in memory).
4. `dask_delayed = [dask.delayed(resampling_function)(data[i-th chunk in dim 1, j-th chunk in dim 2], plotwidth, plotheight, <indicator of which canvas tile (position and extent, as a function of i, j) this data maps to>) for i, j in meshgrid(range(xtiles), range(ytiles))]`
5. `agg = dask.delayed(numpy.sum)(dask_delayed)`
6. `agg.compute(num_workers=4)`

Inside the resampling_function, I initialize an array of zeros with the size of the plot canvas and fill one tile of it with the sampled array from gridtools.resample_2d. In the end, when I sum over dask_delayed, I end up with the complete agg. I am still testing for equivalence, but so far it seems to work very well; on my laptop I can downsample more than 7 GB in a few seconds.
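
For anyone wanting to try this, here is a minimal, self-contained sketch of the scheme described above; a nearest-neighbour stand-in replaces gridtools.resample_2d, random data stands in for the HDF5-backed array, and all sizes are placeholders:

```python
import dask
import dask.array as da
import numpy as np

def resample_tile(chunk, canvas_shape, y0, y1, x0, x1):
    """Fill one tile of a canvas-sized array with a resampled version of `chunk`.
    Nearest-neighbour sampling stands in for gridtools.resample_2d here."""
    chunk = np.asarray(chunk)                       # materialize this data block
    canvas = np.zeros(canvas_shape, dtype=float)
    ys = np.linspace(0, chunk.shape[0] - 1, y1 - y0).astype(int)
    xs = np.linspace(0, chunk.shape[1] - 1, x1 - x0).astype(int)
    canvas[y0:y1, x0:x1] = chunk[np.ix_(ys, xs)]
    return canvas

def make_overview(data, plot_width, plot_height, tile_px=2000, num_workers=4):
    """Tile the data, resample each tile in a delayed task, then sum the canvases."""
    ny, nx = data.shape
    ytiles = max(ny // tile_px, 1)                  # something that fits in memory
    xtiles = max(nx // tile_px, 1)
    tasks = []
    for j in range(ytiles):
        for i in range(xtiles):
            # Data slice feeding this tile, and the tile's pixel extent on the canvas.
            dy0, dy1 = j * ny // ytiles, (j + 1) * ny // ytiles
            dx0, dx1 = i * nx // xtiles, (i + 1) * nx // xtiles
            py0, py1 = j * plot_height // ytiles, (j + 1) * plot_height // ytiles
            px0, px1 = i * plot_width // xtiles, (i + 1) * plot_width // xtiles
            tasks.append(dask.delayed(resample_tile)(
                data[dy0:dy1, dx0:dx1], (plot_height, plot_width),
                py0, py1, px0, px1))
    # Each task returns a canvas that is zero outside its own tile, so an
    # element-wise sum over the tasks assembles the complete aggregate.
    return dask.delayed(np.sum)(tasks, axis=0).compute(num_workers=num_workers)

data = da.random.random((8000, 8000), chunks=(1000, 1000))  # stand-in for an HDF5-backed array
agg = make_overview(data, plot_width=600, plot_height=600)
```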

In addition, I added a cache of the downsampled full dataset to my plot class, so that resetting to the full data extent (by far the biggest computation) is only performed once.

@LEMUEGGE

Thinking a little more about it: the above works well for me without any chunk overlap because of the nature of the data (a matrix of probabilities). The values are of similar magnitude within their respective neighborhoods and will converge to an area mean regardless of where exactly the split occurs.
For more heterogeneous data I definitely expect some problems in the downsampling process, although one could alter the above algorithm to take data[(i + 1/2)-th chunk : (j + 1/2)-th chunk] so that neighboring tiles share some overlap. For upsampling we are only dealing with a single data chunk on a canvas of reasonable size (w, h < 1000 pixels) anyway, so no overlap is needed.
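
For reference, dask's built-in overlap machinery (the "ghosting" mentioned earlier) is another way to get such a halo without the half-chunk trick; a sketch with an identity function standing in for the real resampler:

```python
import dask.array as da

x = da.random.random((4000, 4000), chunks=(1000, 1000))

def resample_block(block):
    return block  # identity stand-in; a real neighbourhood-based resampler goes here

# map_overlap adds a halo of 50 cells around each chunk, applies the function,
# and trims the halo off again, so the resampler can see across chunk edges.
result = x.map_overlap(resample_block, depth=50, boundary="reflect").compute()
```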

@youngsaj

Now that #553 is done, and I've been able to use it successfully, this issue comes to light again. I find that when I downsample and then zoom, the plot ends up with the wrong range during the zoom. Are there any tricks to using resample_2d that would help? It seems like the approach above was a mechanism to solve #553; what I want to solve is not having to read all of the data. I just want a very partial read of the data for a fast overview, while still having the zoom ranges work correctly.
I've done it brute force, constantly re-reading, but would prefer some sort of mixed datashader mechanism, or recommendations on how to get there.
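
Not an authoritative answer, but one pattern that keeps the zoom ranges consistent while only touching part of the data is to slice a dask-backed DataArray to the current viewport before handing it to Canvas.raster (from #553), so that the explicit ranges and the data window agree; the names and coordinate ordering here are assumptions:

```python
import datashader as ds
import xarray as xr

# Hypothetical dask-backed raster; slicing it is lazy, so only the chunks that
# overlap the requested window end up being read from disk.
raster = xr.open_dataarray("large_raster.nc", chunks={"x": 1000, "y": 1000})

def render(x_range, y_range, plot_width=600, plot_height=600):
    window = raster.sel(x=slice(*x_range), y=slice(*y_range))
    cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height,
                    x_range=x_range, y_range=y_range)
    # Passing the same x_range/y_range to the Canvas keeps the resampled
    # output aligned with the zoomed view.
    return cvs.raster(window)
```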
