
[WIP] Add satellite image processing benchmark #1550

Open · wants to merge 1 commit into main

Conversation

jrbourbeau (Member)

xref #1548

@jrbourbeau (Member, Author) left a comment


Still TODO:

  • Try with odc.stac instead of stackstac
  • Write output to Zarr instead of persisting to cluster memory (rough sketch below)

@guillaumeeb @Kirill888 does this look like it captures a common use case for folks processing satellite images?
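A rough sketch of the Zarr TODO above, assuming an object store the CI runner can write to (the bucket URL is a placeholder, not something configured in this PR):

# Instead of persisting the monthly means to cluster memory, stream
# them to Zarr in object storage. The URL below is hypothetical.
result = humidity.groupby("time.month").mean()
result.to_dataset(name="humidity").to_zarr(
    "s3://benchmark-bucket/humidity-monthly.zarr", mode="w"
)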

@jrbourbeau self-assigned this Sep 20, 2024
@guillaumeeb left a comment


Thanks @jrbourbeau, this is clearly what I had in mind, and I think it's a very good basic benchmark of imagery processing and of the underlying infrastructure (here, Microsoft Planetary Computer). I just made some comments, but more questions than remarks!

Were you able to run this? Do you have an idea of the data volume you read? Did you have problems with the datacube being too sparse?

cc @TomAugspurger

from distributed import wait


def harmonize_to_old(data):


Interesting, I knew that there was some version change, but didn't know that it could affect the processing. This adds a real-world concern, but it also complicates the benchmark, so it might not be needed...
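For context, a minimal sketch of what this harmonization step typically looks like, following the approach in the Planetary Computer Sentinel-2 examples (the cutoff date and offset are the documented baseline 04.00 values, but treat the exact code as an assumption rather than this PR's implementation):

import numpy as np
import xarray as xr

CUTOFF = np.datetime64("2022-01-25")  # processing baseline 04.00 rollout
OFFSET = 1000                         # DN offset added by the new baseline

def harmonize_to_old(data):
    # Shift post-baseline-change scenes back onto the old DN scale.
    old = data.sel(time=data.time < CUTOFF)
    new = data.sel(time=data.time >= CUTOFF).clip(min=OFFSET) - OFFSET
    return xr.concat([old, new], dim="time")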

items = search.item_collection()

# Construct Xarray Dataset from stack items
ds = stackstac.stack(items, assets=["B08", "B11"], chunksize=(400, 1, 256, 256))


So here you could go further and select other bands for other indices like NDVI, NDSI, or NDWI.

@jrbourbeau (Member, Author)


Do you have a suggestion on which indices are most common? I see NDVI come up a lot. Is there any computational difference between these indices, or are they all relatively straightforward reductions like the one we already have here?


Yeah, I see NDVI as the HelloWorld of satellite image processing. There is no computational difference between these indices; it's always the same formula, you just switch the bands. You probably already know, but NDVI is for vegetation, NDSI for snow, and NDWI for water.
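To make that concrete, a small sketch of the shared formula; the band pairs are the commonly cited Sentinel-2 choices, an assumption rather than anything specified in this PR:

def normalized_difference(ds, a, b):
    # Every index below is the same reduction, (a - b) / (a + b),
    # just with a different band pair.
    return (ds.sel(band=a) - ds.sel(band=b)) / (ds.sel(band=a) + ds.sel(band=b))

ndvi = normalized_difference(ds, "B08", "B04")  # vegetation (NIR vs. red)
ndwi = normalized_difference(ds, "B03", "B08")  # water (green vs. NIR)
ndsi = normalized_difference(ds, "B03", "B11")  # snow (green vs. SWIR)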

modifier=planetary_computer.sign_inplace,
)

# GeoJSON for region of interest is from https://github.com/isellsoap/deutschlandGeoJSON/tree/main/1_deutschland


How many Sentinel-2 tiles does that represent? Do you have an idea?

@jrbourbeau (Member, Author)


search.item_collection() returns 2835 items. Is that what you're asking about, or something else?


That was more a question about the number of Sentinel-2 UTM zones, but never mind, that's already a good answer. If my calculation is correct, this already means half a TB of data to load.

ds = harmonize_to_old(ds)

# Compute humidity index
humidity = (ds.sel(band="B08") - ds.sel(band="B11")) / (


It would be prettier to have (ds.B08 - ds.B11) / (ds.B08 + ds.B11), but I guess this is the way stackstac works, and it's not a big deal.

@jrbourbeau (Member, Author)


Yeah, you're right, this has to do with what stackstac returns (a DataArray). odc.stac returns a Dataset with the structure you're proposing. Thoughts on stackstac vs odc.stac?
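For comparison, a sketch of the two access patterns being discussed (the odc.stac call is illustrative, not this PR's code):

import odc.stac
import stackstac

# stackstac: one DataArray with a "band" dimension
da = stackstac.stack(items, assets=["B08", "B11"])
humidity = (da.sel(band="B08") - da.sel(band="B11")) / (
    da.sel(band="B08") + da.sel(band="B11")
)

# odc.stac: a Dataset with one data variable per band
ds = odc.stac.load(items, bands=["B08", "B11"], chunks={})
humidity = (ds.B08 - ds.B11) / (ds.B08 + ds.B11)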


> Thoughts on stackstac vs odc.stac?

Nope, there's a thread about that on the Pangeo Discourse, I think. Both probably have advantages and drawbacks...

ds.sel(band="B08") + ds.sel(band="B11")
)
result = humidity.groupby("time.month").mean()
result = result.persist()


I would also love to see the plot and whether it shows something, typically seasonality at least.
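One way to eyeball seasonality, assuming result comes out with (month, y, x) dimensions (a sketch, not code from the PR):

import matplotlib.pyplot as plt

# Collapse the spatial dimensions so the seasonal cycle shows up
# as a single 12-point line.
result.mean(dim=["x", "y"]).plot(marker="o")
plt.xlabel("month")
plt.ylabel("mean humidity index")
plt.show()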

@jrbourbeau (Member, Author)


+1

@TomAugspurger commented Sep 25, 2024

Are these benchmarks running in a stable cloud / region? It'd be nice to find a dataset in the same region if possible (to cut down on egress costs and speed up the I/O portion of the benchmark, which probably isn't relevant for dask).

(edit: I see the cluster_kwargs answers that, nice).

On stackstac vs. odc-stac, the main things to be aware of are

  1. they build task graphs differently. Beyond just DataArray vs. Dataset, I think that odc-stac loading includes a groupby stage to ensure that all of the pixels from the same time end up in the same pixel plane (where "same time" is configurable, so that a scene captured a few seconds later can be considered the same if you want).
  2. odc-stac will automatically use overviews if you're requesting lower-resolution data (but not relevant here, since you don't pass resolution=)

I'll give this workload a shot today or tomorrow and will report back.

else:
time_of_interest = "2015-01-01/2024-09-01"

search = catalog.search(


Does Coiled have a spot where we could store a small parquet file? As written, this is hitting the search endpoint every time the benchmark runs, which isn't exercising Dask at all and has the potential to throw errors.

We could instead do the search once and use stac-geoparquet to cache that result.
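A sketch of that caching flow using stac-geoparquet's to_geodataframe/to_item_collection helpers (the file path is a placeholder, and the API surface should be double-checked against the stac-geoparquet version in use):

import geopandas as gpd
import stac_geoparquet

# One-time setup: run the search, then cache the items as parquet.
items = search.item_collection()
df = stac_geoparquet.to_geodataframe([item.to_dict() for item in items])
df.to_parquet("sentinel2-items.parquet")  # placeholder path

# In the benchmark: read the cache instead of hitting the search endpoint.
cached = gpd.read_parquet("sentinel2-items.parquet")
items = stac_geoparquet.to_item_collection(cached)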

@jrbourbeau (Member, Author)


Yeah, I may need to do a little setup to give the CI runner access to an Azure bucket, but we can definitely do that

Contributor


I assume that the parquet file is small enough that we can also put it into s3 if that’s easier?

@jrbourbeau (Member, Author)


Ah, yeah, fair point

@jrbourbeau (Member, Author)

> Are these benchmarks running in a stable cloud / region?

Right now this is running in westeurope on Azure, which should be where the underlying data is stored, but we can run in any region on AWS, GCP, or Azure.

> I'll give this workload a shot today or tomorrow and will report back.

That'd be great. I'm happy to chat generally about this. Also, let me know if you need access to a Coiled workspace that's configured for Azure.

@jrbourbeau (Member, Author)

Okay, so here's a notebook (https://gist.github.com/jrbourbeau/900b602d19fe8087cafc0490b5c26f68) that runs the same computation using odc.stac. Here's the specific odc.stac.load call:

resolution = 10  # Sentinel-2 native resolution in meters
SHRINK = 4       # coarsen by 4x so lower-resolution overviews can be used
resolution = resolution * SHRINK

ds = odc.stac.load(
    items,
    chunks={},                          # lazy, Dask-backed arrays
    patch_url=planetary_computer.sign,  # sign asset URLs for Planetary Computer
    resolution=resolution,
    crs="EPSG:3857",
    groupby="solar_day",                # merge scenes from the same solar day
)

where I use things like groupby="solar_day", which I saw used in a couple of examples I found. This seems to produce a much smaller task graph and is more performant in general.
