
[WIP] Add satellite image processing benchmark #1550

Open · wants to merge 1 commit into main

Conversation

jrbourbeau (Member)

xref #1548

@jrbourbeau (Member, Author) left a comment


Still TODO:

  • Try with odc.stac instead of stackstac
  • Write output to Zarr instead of persisting to cluster memory (rough sketch below)

@guillaumeeb @Kirill888 does this look like it captures a common use case for folks processing satellite images?
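A rough sketch of the Zarr TODO above, assuming an object store the CI runner can write to (the bucket URL is a placeholder, not something configured in this PR):

# Instead of persisting the monthly means to cluster memory, stream
# them to Zarr in object storage. The URL below is hypothetical.
result = humidity.groupby("time.month").mean()
result.to_dataset(name="humidity").to_zarr(
    "s3://benchmark-bucket/humidity-monthly.zarr", mode="w"
)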

@jrbourbeau self-assigned this Sep 20, 2024
@guillaumeeb left a comment


Thanks @jrbourbeau, this is clearly what I had in mind, and I think it's a very good basic benchmark of imagery processing and of the underlying infrastructure (here, Microsoft Planetary Computer). I just made some comments, but more questions than remarks!

Were you able to run this? Do you have an idea of the data volume you read? Did you have problems with the datacube being too sparse?

cc @TomAugspurger

from distributed import wait


def harmonize_to_old(data):


Interesting, I knew that there was some version change, but didn't know that it could affect the processing. This adds a real-world concern, but it also complicates the benchmark, so it might not be needed...
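For context, a minimal sketch of what this harmonization step typically looks like, following the approach in the Planetary Computer Sentinel-2 examples (the cutoff date and offset are the documented baseline 04.00 values, but treat the exact code as an assumption rather than this PR's implementation):

import numpy as np
import xarray as xr

CUTOFF = np.datetime64("2022-01-25")  # processing baseline 04.00 rollout
OFFSET = 1000                         # DN offset added by the new baseline

def harmonize_to_old(data):
    # Shift post-baseline-change scenes back onto the old DN scale.
    old = data.sel(time=data.time < CUTOFF)
    new = data.sel(time=data.time >= CUTOFF).clip(min=OFFSET) - OFFSET
    return xr.concat([old, new], dim="time")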

items = search.item_collection()

# Construct Xarray Dataset from stack items
ds = stackstac.stack(items, assets=["B08", "B11"], chunksize=(400, 1, 256, 256))


So here you could go further and select other bands for other indices like NDVI, NDSI, or NDWI.

@jrbourbeau (Member, Author)


Do you have a suggestion on which indices are most common? I see NDVI come up a lot. Is there any computational difference between these indices, or are they all relatively straightforward reductions like the one we already have here?


Yeah, I see NDVI as the HelloWorld of satellite image processing. There is no computational difference between these indices; it's always the same formula, you just switch the bands. You probably already know, but NDVI is for vegetation, NDSI for snow, and NDWI for water.
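To make that concrete, a small sketch of the shared formula; the band pairs are the commonly cited Sentinel-2 choices, an assumption rather than anything specified in this PR:

def normalized_difference(ds, a, b):
    # Every index below is the same reduction, (a - b) / (a + b),
    # just with a different band pair.
    return (ds.sel(band=a) - ds.sel(band=b)) / (ds.sel(band=a) + ds.sel(band=b))

ndvi = normalized_difference(ds, "B08", "B04")  # vegetation (NIR vs. red)
ndwi = normalized_difference(ds, "B03", "B08")  # water (green vs. NIR)
ndsi = normalized_difference(ds, "B03", "B11")  # snow (green vs. SWIR)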

modifier=planetary_computer.sign_inplace,
)

# GeoJSON for region of interest is from https://github.com/isellsoap/deutschlandGeoJSON/tree/main/1_deutschland


How many Sentinel-2 tiles does that represent? Do you have an idea?

@jrbourbeau (Member, Author)


search.item_collection() returns 2835 items. Is that what you're asking about, or something else?


That was more a question about the number of Sentinel-2 UTM zones, but never mind, that's already a good answer. If my calculation is correct, this already means half a TB of data to load.

ds = harmonize_to_old(ds)

# Compute humidity index
humidity = (ds.sel(band="B08") - ds.sel(band="B11")) / (


It would be prettier to have (ds.B08 - ds.B11) / (ds.B08 + ds.B11), but I guess this is the way stackstac works, and it's not a big deal.

@jrbourbeau (Member, Author)


Yeah, you're right, this has to do with what stackstac returns (a DataArray). odc.stac returns a Dataset with the structure you're proposing. Thoughts on stackstac vs odc.stac?
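For comparison, a sketch of the two access patterns being discussed (the odc.stac call is illustrative, not this PR's code):

import odc.stac
import stackstac

# stackstac: one DataArray with a "band" dimension
da = stackstac.stack(items, assets=["B08", "B11"])
humidity = (da.sel(band="B08") - da.sel(band="B11")) / (
    da.sel(band="B08") + da.sel(band="B11")
)

# odc.stac: a Dataset with one data variable per band
ds = odc.stac.load(items, bands=["B08", "B11"], chunks={})
humidity = (ds.B08 - ds.B11) / (ds.B08 + ds.B11)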


> Thoughts on stackstac vs odc.stac?

Nope, there's a thread about that on the Pangeo Discourse, I think. Both probably have advantages and drawbacks...

ds.sel(band="B08") + ds.sel(band="B11")
)
result = humidity.groupby("time.month").mean()
result = result.persist()


I would also love to see the plot and whether it shows something, typically seasonality at least.
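One way to eyeball seasonality, assuming result comes out with (month, y, x) dimensions (a sketch, not code from the PR):

import matplotlib.pyplot as plt

# Collapse the spatial dimensions so the seasonal cycle shows up
# as a single 12-point line.
result.mean(dim=["x", "y"]).plot(marker="o")
plt.xlabel("month")
plt.ylabel("mean humidity index")
plt.show()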

@jrbourbeau (Member, Author)


+1

@TomAugspurger commented Sep 25, 2024

Are these benchmarks running in a stable cloud / region? It'd be nice to find a dataset in the same region if possible (to cut down on egress costs and speed up the I/O portion of the benchmark, which probably isn't relevant for dask).

(edit: I see the cluster_kwargs answers that, nice).

On stackstac vs. odc-stac, the main things to be aware of are

  1. they build task graphs differently. Beyond just DataArray vs. Dataset, I think that odc-stac loading includes a groupby stage to ensure that all of the pixels from the same time end up in the same pixel plane (where "same time" is configurable, so that a scene captured a few seconds later can be considered the same if you want).
  2. odc-stac will automatically use overviews if you're requesting lower-resolution data (but not relevant here, since you don't pass resolution=)

I'll give this workload a shot today or tomorrow and will report back.

else:
time_of_interest = "2015-01-01/2024-09-01"

search = catalog.search(


Does Coiled have a spot where we could store a small parquet file? As written, this is hitting the search endpoint every time the benchmark runs, which isn't exercising Dask at all and has the potential to throw errors.

We could instead do the search once and use stac-geoparquet to cache that result.
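A sketch of that caching flow using stac-geoparquet's to_geodataframe/to_item_collection helpers (the file path is a placeholder, and the API surface should be double-checked against the stac-geoparquet version in use):

import geopandas as gpd
import stac_geoparquet

# One-time setup: run the search, then cache the items as parquet.
items = search.item_collection()
df = stac_geoparquet.to_geodataframe([item.to_dict() for item in items])
df.to_parquet("sentinel2-items.parquet")  # placeholder path

# In the benchmark: read the cache instead of hitting the search endpoint.
cached = gpd.read_parquet("sentinel2-items.parquet")
items = stac_geoparquet.to_item_collection(cached)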

@jrbourbeau (Member, Author)


Yeah, I may need to do a little setup to give the CI runner access to an Azure bucket, but we can definitely do that

Contributor


I assume that the parquet file is small enough that we can also put it into s3 if that’s easier?

@jrbourbeau (Member, Author)


Ah, yeah, fair point

@jrbourbeau (Member, Author)

> Are these benchmarks running in a stable cloud / region?

Right now this is running in westeurope on Azure, which should be where the underlying data is stored, but we can run in any region on AWS, GCP, or Azure.

> I'll give this workload a shot today or tomorrow and will report back.

That'd be great. I'm happy to chat generally about this. Also, let me know if you need access to a Coiled workspace that's configured for Azure.

@jrbourbeau (Member, Author)

Okay, so here's a notebook (https://gist.github.com/jrbourbeau/900b602d19fe8087cafc0490b5c26f68) that runs the same computation using odc.stac. Here's the specific odc.stac.load call:

resolution = 10  # Sentinel-2 native resolution in meters
SHRINK = 4       # coarsen by 4x so lower-resolution overviews can be used
resolution = resolution * SHRINK

ds = odc.stac.load(
    items,
    chunks={},                          # lazy, Dask-backed arrays
    patch_url=planetary_computer.sign,  # sign asset URLs for Planetary Computer
    resolution=resolution,
    crs="EPSG:3857",
    groupby="solar_day",                # merge scenes from the same solar day
)

where I use things like groupby="solar_day", which I saw used in a couple of examples I found. This seems to produce a much smaller task graph and is more performant in general.
