Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] add Rasters.jl benchmarks #18

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

asinghvi17
Copy link

@asinghvi17 asinghvi17 commented Sep 23, 2024

This PR adds benchmarks for Rasters.jl, and adds a pixi environment file to manage the installations of R, Python, Julia, and packages.

I added the pixi file because there's a version conflict if you try to install exactextractr and stars in the same environment, they require different versions of gdal. So now each package has a separate environment.

download-3

@asinghvi17
Copy link
Author

should I add #15 too while I'm at it @kadyb?

This allows exactextractr to be installed in a separate environment, so it doesn't provoke a version conflict with sf and stars via gdal.

This also segments each language and creates dependency trees so life is a bit easier.
conforming with all the other benchmarks
@asinghvi17
Copy link
Author

Edited the first post now that the PR is functionally complete from the Rasters.jl end.

Rasters.checkmem!(false) # Avoid memory check bug on some machines
raster = Raster(RasterStack(raster_files; name=band_names))
# Downsample from 30m to 90m (1/3 of the original resolution)
benchmark = @be Rasters.resample($raster; res=90) seconds=60

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to use Rasters.aggregate here instead? That would work if we are resampling to exactly 1/3 of the resolution

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this was just copied from python code and we left it as resample. But for the actual problem I'm not sure why they would use resample when you can just aggregate X/Y by 3.

@kadyb
Copy link
Owner

kadyb commented Sep 25, 2024

should I add #15 too while I'm at it @kadyb?

@asinghvi17, let's wait with this, because it uses a different dataset. Separate issue is that depending on the dataset, {fasterize} can be significantly slower than GDAL.

I have another question -- In task "zonal" are exact zone statistics like in {exactextractr} used or approximate like in the other packages?

@rafaqz
Copy link

rafaqz commented Sep 25, 2024

Its approximate like the other packages. Now I've seen exactextractr is exact I'll add that option in a few months!

But I think it will be even faster after that in most cases because it would be best with switching to an online stats approach (at least where that is possible for sum/prod/mean etc).

Currently zonal is Rasters.jl is just a nice but very basic shortcut for applying a function to the result of mask and crop over each geometry - its the rasterization machinery under mask that is fast.

So for #15 it would be really nice to have mask and rasterize benchmarks here too from my perspective!

(It would also be good for rasterize to have a range of datasets with different kinds of geometries with varying node densities and target raster resolutions to get a clear picture of the tradroffs. I also think in some cases fasterize will be close to Rasters.jl and others slower.

Another thing is there are things Rasters.jl and gdal can do that fasterize can't do, and actually a lot Rasters can do that gdal can't do either - like arbitrary functions (even functions like median that need to sort) and custom objects/number types. it would be good to cover some of those - at least things gdal can do that fasterize cant)

Rasters.checkmem!(false) # avoid a memory size bug on some machines
ras = Raster(RasterStack(raster_files; name=band_names))
stack_file = joinpath(data_dir, "stack.tif")
benchmark = @be write($stack_file, $ras; force=true) seconds=60
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rasters compression default is "ZSTD" but the others are all using "LZW".

We don't have a compress keyword yet to handle this so it uses the ugly
options=Dict("COMPRESS"=>"ZSTD"). Maybe we should change that

@kadyb
Copy link
Owner

kadyb commented Sep 25, 2024

FWIW: Ideally, it would be useful to compare the performance of the current GDAL algorithm and scan line algorithm at the C++ level, since what I presented #15 is quite limited. If the latter algorithm turned out to be significantly better, it would be worth implementing it directly in GDAL so that all packages could benefit from it. And as you mentioned, {fasterize} has some limitations, e.g. it only works with {raster} objects and polygons, and fewer options compared to GDAL. Here is related issue in the GDAL repository: OSGeo/gdal#7200

CC: @mdsumner because is also involved in this topic.

@mdsumner
Copy link

I was working on fasterize today ...

Btw, see this related effort here, and discussion in six hours from now:

https://github.com/developmentseed/warp-resample-profiling

https://discourse.pangeo.io/t/pangeo-showcase-geospatial-reprojection-in-python-2024-whats-available-and-whats-next/4531

@rafaqz
Copy link

rafaqz commented Sep 25, 2024

Well, Rasters wont benefit from faster gdal rasterize as we only use gdal for i/o and for gdalwarp. But it would be good to have detailed comparisons of these algorithms. I put a bunch of work into optimising the scanline in Rasters but there will be places it will be slower than fasterize - a few nice diagrams of the performance space would really help understand where things are at.

@asinghvi17
Copy link
Author

asinghvi17 commented Sep 26, 2024

I almost forgot about this, but for exact zonal statistics we have GO.coverage that is an efficient way to get area of a rectangle that a polygon covers. I think the rectangle there has to be axis aligned, though, which may present a problem for affine spaces or matrix lookups.

@rafaqz
Copy link

rafaqz commented Sep 26, 2024

I imagined we can do that in the line burning phase and get coverage for each pixel the line touches instead of burning.

But zonal is doing a lot very fast, so it needs to be very much tuned to purpose to not end up with an order of magnitude or 2 slowdown.

We also have subsampling coverage in Rasters

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants