-
-
Notifications
You must be signed in to change notification settings - Fork 366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Quick out-of-core downsampling to provide an overview of large data #560
Comments
Note that I'm discussing raster data above, but the same issues apply to timeseries data, where a very long timeseries needs a quick overview that can be zoomed in to show the full data. |
I am currently implementing something similar for our data. I came along #553 and then started working on this myself. So far I didn't find a way to wrap xarrays apply_ufunc or dask.map_blocks even though the latter would support ghosting which, as mentioned by @philippjfr, will probably be of importance for the sampling. The problem is that one needs to know the size of the processed data chunks and these will be chunk-dependent for non-symmetrically chunked data sets (i.e. an array with size 2055 along one axis, chunked into 1000,1000,55). -1. data = dask.array.from_array(hdf5 file, chunks=(1000,1000)) #estimate number of plot tiles
Inside the resampling_function, I initialize an array of zeros with the size of the plot canvas and fill one tile with the sampled array from gridtools.resample_2d. In the end when I sum over dask_delayed I end up with the complete agg. I am still testing the equivalence but so far it seems to work very well. On my laptop I can downsample more than 7 GByte in a few seconds. In addition I added a cache of the full downsampled dataset in my plot class so that resetting to the full data extent, by far the biggest computation, is only performed once. |
Thinking a little bit more about it, above works well for me without any chunk overlap because of the data (a matrix of probabilities). The values are of the same magnitude in their respective neighborhood and will converge to an area mean regardless of where exactly the split occurs. |
Now that #553 is done, and I've been able to successfully use that, this one comes to light again. I find when trying to do a downsample and zoom it ends up with the wrong range during zoom. Is there any tricks to using resample_2d that would help? It seems like the above was a mechanism to solve 553. I want to solve not having to have it ultimately read all the data. I just want a very partial read of the data for a fast overview, but still have the zooming ranges work correctly. |
As suggested in #553, we need to improve our raster support so that it can access large rasters chunk by chunk. This will let us work with raster datasets larger than system memory, just as we can currently work with larger-than-memory columnar Dask dataframes, and should make it very practical to work with a small, zoomed-in region of a very large raster dataset.
However, even if full resolution is needed only on deep zooms, very often it is not clear which region of the large dataset should be viewed at full resolution, until a zoomed-out overview of the data has been provided to the user. And unfortunately, constructing such an overview using datashader will currently require making a complete pass through the entire dataset, reading all of the data into Python in order to aggregate it into a coarse grid.
Even if the Python code were to explicitly downsample the data, accessing only a strided subset of the data from disk, with most file formats and storage technologies the entire dataset will still end up being accessed.
After the above issue is addressed, it would be great if we could work on some specific combinations of aggregations and file formats where constructing such an overview would be relatively cheap. We could also consider making examples of having some pre-processed overview of the data stored with it, and then using HoloViews to switch dynamically between the overview and the full data depending on zoom level.
The text was updated successfully, but these errors were encountered: