[WIP] Add satellite image processing benchmark #1550

Open · wants to merge 1 commit into base: main
117 changes: 117 additions & 0 deletions tests/geospatial/test_satellite_filtering.py
@@ -0,0 +1,117 @@
import datetime

import fsspec
import geojson
import planetary_computer
import pystac_client
import stackstac
import xarray as xr
from distributed import wait


def harmonize_to_old(data):

Reviewer: Interesting, I knew there was a processing-baseline change, but I didn't know it could affect the processing like this. It adds a realistic complication, but it also makes the benchmark more complex; this might not be needed...

"""
Harmonize new Sentinel-2 data to the old baseline.

Parameters
----------
data: xarray.DataArray
A DataArray with four dimensions: time, band, y, x

Returns
-------
harmonized: xarray.DataArray
A DataArray with all values harmonized to the old
processing baseline.
"""
    cutoff = datetime.datetime(2022, 1, 25)
    offset = 1000
    bands = [
        "B01",
        "B02",
        "B03",
        "B04",
        "B05",
        "B06",
        "B07",
        "B08",
        "B8A",
        "B09",
        "B10",
        "B11",
        "B12",
    ]

    # Scenes acquired before the cutoff already use the old processing baseline
    old = data.sel(time=slice(cutoff))

    # Only harmonize the bands that are actually present in the stack
    to_process = list(set(bands) & set(data.band.data.tolist()))
    new = data.sel(time=slice(cutoff, None)).drop_sel(band=to_process)

    # New-baseline scenes carry a fixed +1000 offset; clip to the offset,
    # then subtract it so harmonized values floor at zero
    new_harmonized = data.sel(time=slice(cutoff, None), band=to_process).clip(offset)
    new_harmonized -= offset

    new = xr.concat([new, new_harmonized], "band").sel(band=data.band.data.tolist())
    return xr.concat([old, new], dim="time")


def test_satellite_filtering(
    scale,
    client_factory,
    cluster_kwargs={
        "workspace": "dask-engineering-azure",
        "region": "westeurope",
        "wait_for_workers": True,
    },
    scale_kwargs={
        "small": {"n_workers": 10},
        "large": {"n_workers": 100},
    },
):
    with client_factory(
        **scale_kwargs[scale], **cluster_kwargs
    ) as client:  # noqa: F841
        catalog = pystac_client.Client.open(
            "https://planetarycomputer.microsoft.com/api/stac/v1",
            modifier=planetary_computer.sign_inplace,
        )

        # GeoJSON for region of interest is from https://github.com/isellsoap/deutschlandGeoJSON/tree/main/1_deutschland

Reviewer: How many Sentinel-2 tiles does that represent? Do you have an idea?

Member Author: search.item_collection() returns 2835 items. Is that what you're asking about, or something else?

Reviewer: That was more a question about the number of Sentinel-2 UTM zones, but never mind, it's already a good answer. If my calculation is correct, that already means half a TB of data to load.
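A rough back-of-envelope supports that order of magnitude (a sketch under stated assumptions: full 10980 x 10980 uint16 tiles, B08 at 10 m and B11 at 20 m, ignoring clipping to the area of interest):

# Hypothetical sizing estimate for the two requested assets
n_items = 2835
b08_bytes = 10980 * 10980 * 2  # 10 m band, uint16, ~241 MB per item
b11_bytes = 5490 * 5490 * 2    # 20 m band, uint16, ~60 MB per item
total_tb = n_items * (b08_bytes + b11_bytes) / 1e12
print(f"~{total_tb:.2f} TB raw")  # ~0.85 TB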

        with fsspec.open(
            "https://raw.githubusercontent.com/isellsoap/deutschlandGeoJSON/main/1_deutschland/3_mittel.geo.json"
        ) as f:
            gj = geojson.load(f)

        # Flatten MultiPolygon to single Polygon
        coordinates = []
        for x in gj.features[0]["geometry"]["coordinates"]:
            coordinates.extend(x)
        area_of_interest = {
            "type": "Polygon",
            "coordinates": coordinates,
        }

        # Get stack items
        if scale == "small":
            time_of_interest = "2024-01-01/2024-09-01"
        else:
            time_of_interest = "2015-01-01/2024-09-01"

        search = catalog.search(

Reviewer: Does Coiled have a spot where we could store a small parquet file? As written, this is hitting the search endpoint every time the benchmark runs, which isn't exercising Dask at all and has the potential to throw errors.

We could instead do the search once and use stac-geoparquet to cache that result.

Member Author: Yeah, I may need to do a little setup to give the CI runner access to an Azure bucket, but we can definitely do that.

Contributor: I assume that the parquet file is small enough that we could also put it into S3 if that's easier?

Member Author: Ah, yeah, fair point.
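A minimal sketch of that caching approach, assuming stac-geoparquet's to_geodataframe/to_item_collection helpers; the bucket path is hypothetical:

import geopandas as gpd
import stac_geoparquet

# One-time setup: serialize the search results to (geo)parquet
df = stac_geoparquet.to_geodataframe([item.to_dict() for item in items])
df.to_parquet("az://benchmark-data/sentinel2-items.parquet")  # hypothetical path

# In the benchmark: read the cached items instead of hitting the STAC API
df = gpd.read_parquet("az://benchmark-data/sentinel2-items.parquet")
items = stac_geoparquet.to_item_collection(df)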

            collections=["sentinel-2-l2a"],
            intersects=area_of_interest,
            datetime=time_of_interest,
        )
        items = search.item_collection()

        # Construct xarray DataArray from STAC items
        # (chunks correspond to the (time, band, y, x) dimensions)
        ds = stackstac.stack(items, assets=["B08", "B11"], chunksize=(400, 1, 256, 256))

Reviewer: So here you could add complexity and select other bands for other indices like NDVI, NDSI, or NDWI.

Member Author: Do you have a suggestion on which indices are most common? I see NDVI come up a lot. Is there any computational difference between these indices, or are they all relatively straightforward reductions like the one we already have here?

Reviewer: Yeah, I see NDVI as the "Hello World" of satellite image processing. There is no computational difference between these indices; it's always the same formula, you only switch the bands. You probably already know, but NDVI is for vegetation, NDSI for snow, and NDWI for water.
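For reference, the shared formula with the usual Sentinel-2 band pairings (a sketch; it assumes the corresponding assets were requested from stackstac):

def normalized_difference(ds, a, b):
    # Generic (a - b) / (a + b) index over two Sentinel-2 bands
    return (ds.sel(band=a) - ds.sel(band=b)) / (ds.sel(band=a) + ds.sel(band=b))

ndvi = normalized_difference(ds, "B08", "B04")  # vegetation
ndsi = normalized_difference(ds, "B03", "B11")  # snow
ndwi = normalized_difference(ds, "B03", "B08")  # water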

        # See https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a#Baseline-Change
        ds = harmonize_to_old(ds)

        # Compute humidity index
        humidity = (ds.sel(band="B08") - ds.sel(band="B11")) / (

Reviewer: It would be prettier to have (ds.B08 - ds.B11) / (ds.B08 + ds.B11), but I guess this is the way stackstac works, and it's not a big deal.

Member Author: Yeah, you're right, this has to do with what stackstac returns (a DataArray). odc.stac returns a Dataset with the structure you're proposing. Thoughts on stackstac vs odc.stac?

Reviewer (quoting "Thoughts on stackstac vs odc.stac?"): Nope, there is a thread about that on the Pangeo Discourse, I think. Both probably have advantages and drawbacks...
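For comparison, a minimal sketch of the odc.stac variant, which yields a Dataset with one variable per band (the chunking shown is illustrative):

import odc.stac

ds = odc.stac.load(items, bands=["B08", "B11"], chunks={"x": 256, "y": 256})
humidity = (ds.B08 - ds.B11) / (ds.B08 + ds.B11)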

            ds.sel(band="B08") + ds.sel(band="B11")
        )
        result = humidity.groupby("time.month").mean()
        result = result.persist()

Reviewer: I would also love to see the plot, and whether it shows something meaningful, typically at least some seasonality.

Member Author: +1
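A quick way to eyeball that seasonality from the monthly means (a hypothetical follow-up, not part of the benchmark itself):

import matplotlib.pyplot as plt

# Reduce each monthly composite to a single scalar and plot the 12-point series
monthly = result.mean(dim=["x", "y"]).compute()
monthly.plot(marker="o")
plt.title("Mean humidity index by month")
plt.savefig("humidity_seasonality.png")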

        wait(result)