
aggregate_polygon: Output format #2

Closed · jdries opened this issue Apr 4, 2018 · 21 comments

Labels: help wanted (Extra attention is needed), interoperability

@jdries (Contributor) commented Apr 4, 2018

Currently the zonal statistics process does not have a clear, standardized output format.
A JSON encoding would make the most sense. You could also look into, or borrow concepts from, OGC TimeseriesML.

Here's an example of a simple JSON encoding that we currently use:

  • totalCount and validCount refer to the number of pixels encountered in the polygon.
  • I'm not proposing this actual format, just trying to clarify what I'm talking about.
{
  "results": [
    {
      "date": "2015-07-06",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.8150000000000001
      }
    },
    {
      "date": "2015-07-16",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.49
      }
    },
    {
      "date": "2015-07-26",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.215
      }
    },
    {
      "date": "2015-08-02",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.93
      }
    },
...
@flahn (Member) commented Apr 4, 2018

Currently we have the possibility to specify the raster or vector data format using GDAL and OGR identifiers. What you suggest is that we incorporate additional standards to model vector data in particular when we use formats like plain JSON or XML.

Considering this, I am not sure if we really should specify this for the output of this particular "zonal_statistics" process. I mean, we haven't specified an output format for any of the processes that produce images as output.

But I do see that it would be beneficial to model vector data after a specific standard, and I could imagine that this would be best modeled as a process, which can later become part of an openEO core profile.

@jdries (Contributor, Author) commented Apr 4, 2018

Basically, as a user of the openEO API, if I compute zonal stats, I need to know what I'll get as output, and the API should describe this as clearly as possible, so the definition of this process should somehow reference the output format.
By proposing a simple JSON-based format, you'll encourage back-end implementors to at least support this simple case, leading to more consistency among back-ends.
We should not wait too long on this, as various back-ends are already implementing it, probably in different ways.

@m-mohr (Member) commented Apr 4, 2018

Every process needs to define what it returns, but that is not necessarily the output of a job. Most processes will return an image collection, but that is probably not the case for zonal statistics. Yet, I am not sure how we smoothly convert from a process result to the actual output as defined in the job.

If this definition is urgent for any back-end, then I am happy to pick up concrete proposals. Feel free to come up with a proposal/format you think is a good solution and we can discuss it and add it to the current "collection" of processes.

A general remark: process definitions will not be part of the API specification itself, but part of a separate process/profile specification. AFAIK, they are due with "D3.3 openEO data set and process descriptions" in March 2019 (API version 0.4.0). That's why I changed the milestone for now, but that's more of a categorization than setting a priority.

@m-mohr removed their assignment Apr 23, 2018
@m-mohr (Member) commented Jun 11, 2018

Disclaimer: I have very little experience with zonal statistics. These ideas evolved from implementing zonal statistics for the GEE back-end.

As far as I know, we expect the user to specify a GeoJSON object as input for the region(s). Therefore, I'd suggest simply using GeoJSON as the output format, too.
If it's a Feature(Collection), we just add the additional data to the properties (example 1). If the input is a Geometry(Collection), we need to wrap it in a Feature(Collection) first (example 2).

Example 1

Multiple polygons stored in a file with a single aggregation function.

Process: zonal_statistics
Arguments:

  • regions: polygon.json
  • func: mean

polygon.json:

{
  "type":"FeatureCollection",
  "features":[
    {
      "type":"Feature",
      "properties":{},
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]
      }
    },
    {
      "type":"Feature",
      "properties":{},
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,47.9],[16.6,47.9],[16.6,48.25],[16.1,48.25],[16.1,47.9]]]
      }
    }
  ]
}

Result:

{
  "type":"FeatureCollection",
  "features":[
    {
      "type":"Feature",
      "properties":{
        "zonal_statistics": [
          {
            "date": "2015-07-06",
            "totalCount": 9,
            "validCount": 8,
            "results": {
              "mean": 0.8150000000000001
            }
          },
          {
            "date": "2015-07-16",
            "totalCount": 9,
            "validCount": 8,
            "results": {
              "mean": 0.49
            }
          }
        ]
      },
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]
      }
    },
    {
      "type":"Feature",
      "properties":{
        "zonal_statistics": [
          {
            "date": "2015-07-06",
            "totalCount": 9,
            "validCount": 9,
            "results": {
              "mean": 0.975
            }
          },
          {
            "date": "2015-07-16",
            "totalCount": 9,
            "validCount": 9,
            "results": {
              "mean": 0.425
            }
          }
        ]
      },
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,47.9],[16.6,47.9],[16.6,48.25],[16.1,48.25],[16.1,47.9]]]
      }
    }
  ]
}

Remarks:

  • totalCount and validCount are probably optional attributes; at least I don't get any information about them from GEE (see example 2)
  • date is expected to be ISO 8601 (date and/or date-time?!)
  • results could include multiple results, e.g. min, max and mean (see example 2)

Example 2

A single polygon embedded into the process graph with multiple aggregation functions.

Process: zonal_statistics
Arguments:

  • regions: {"type":"Polygon","coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]}
  • func: [min, max, mean]

Result:

{
  "type":"Feature",
  "properties":{
    "zonal_statistics": [
      {
        "date": "2015-07-06",
        "results": {
          "min": 0.578415,
          "max": 0.9857545,
          "mean": 0.85754
        }
      },
      {
        "date": "2015-07-16",
        "results": {
          "min": 0.40,
          "max": 0.57,
          "mean": 0.49
        }
      }
    ]
  },
  "geometry":{
    "type":"Polygon",
    "coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]
  }
}

We could also wrap the Feature into a FeatureCollection if that would be simpler to read and/or write.
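
To illustrate how a client might consume this, here is a minimal sketch (plain Python with the standard json module; the file name result.json and the flattening logic are my own assumptions, not part of the proposal) that reads either variant of the proposed output and prints one row per polygon and date:

import json

# Hypothetical file holding a result like the examples above.
with open("result.json") as f:
    result = json.load(f)

# A bare Feature (example 2) is treated as a one-element collection (example 1).
features = result["features"] if result["type"] == "FeatureCollection" else [result]

for i, feature in enumerate(features):
    for entry in feature["properties"]["zonal_statistics"]:
        # totalCount/validCount are probably optional (see remarks above).
        valid = entry.get("validCount")
        print(f"polygon {i}, {entry['date']}: {entry['results']} (validCount={valid})")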

What do you think?
@jdries, could you provide your thoughts on that whenever you have the time and then assign it back to me, please?

@neteler (Member) commented Jun 11, 2018

Just a small remark: IMHO "totalCount" and "validCount" are essential, because the user polygon used for computing the zonal statistics might be located outside of, or only partially inside, the available data. And/or pixels could be invalid (cloud contamination, etc.). From a GIS point of view, these attributes should be mandatory so that the user can evaluate the delivered results.

@m-mohr changed the title from "Output format for zonal statistics" to "zonal_statistics: Output format" Jun 27, 2018
@m-mohr transferred this issue from Open-EO/openeo-api Nov 30, 2018
@m-mohr added the "help wanted" label Nov 30, 2018
@mkadunc (Member) commented Dec 7, 2018

Given the discussion at VITO this week about vector collections as arrays, I suggest we do the following:

  • the most frugal way of structuring the response would be to keep the zone axis in the results in the same order as the geometries that came in; then we don't need the reference geometry
  • possibly we could use the IDs of zones (if they have IDs) to link to the zone geometries from the request
  • don't return geometries in the result; they will be known to the caller (either by ID or by position in the output array)
  • define the properties to be returned as a dimension in the output array (statsAxis = categorical_axis(["totalCount", "validCount", "min", "max", "mean"]))
  • zone is another dimension (zoneAxis = index_axis(N_zones))
  • other dimensions can be arbitrary, depending on the process (timeAxis = irregular_axis(datetime, [t0,t1..tN]), bandAxis = categorical_axis(N_bands))
  • data is returned in a multi-dimensional array

In JSON, this could look like the following (modeled after NetCDF; the metadata part at the beginning is just a quick attempt at encoding the relevant info into JSON). Note that there is no duplication of property names in the data part, so it should produce a much smaller file than using GeoJSON features:

//openEOCubeJSON
{
	"axes": {
		"time": { 
			"type" : "regular",
			"data" : [{
				"variable" : "timestamp",
				"dataType" : "datetime",
				"range" : {
					"length" : 2,
					"offset" : "2015-07-06",
					"samplingDistance" : "P10D"
				}
			}]
		},
		"zone": { "type" : "index" },
		"band": {
			"type" : "categorical",
			"data" : [{
				"variable" : "name",
				"dataType" : "string",
				"values" : ["NDVI"]
			}]
		},
		"stat": {
			"type" : "categorical",
			"data" : [{
				"variable" : "name",
				"dataType" : "string",
				"values" : ["totalCount", "validCount", "min", "max", "mean"]
			}]
		}
	},
	"data": [
		{
			"variable" : "value",
			"axes" : ["zone", "time", "band", "stat"],
			"raster" : [
				//zone 0
			    [
					//time 2015-07-06
					[
						//band NDVI
						[9,8,0.58,0.99,0.86]
					],
					//time 2015-07-16
					[
						//band NDVI
						[9,8,0.40,0.57,0.49]
					]
				],
				//zone 1
				[
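					//time 2015-07-06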
					[
						//band NDVI
						[9,9,0.78,0.98,0.975]
					],
					//time 2015-07-16
					[
						//band NDVI
						[9,9,0.37,0.51,0.425]
					]
				]
			]
		}
	]
}

Now if polygons or their IDs are needed, they could be returned in the data part of the zone axis information.
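
As a quick check of how a consumer would address this layout, here is a minimal sketch (plain Python; it assumes the // comments above are stripped so the document is valid JSON, and the file name cube.json is hypothetical):

import json

# Hypothetical file holding the (comment-stripped) cube document above.
with open("cube.json") as f:
    cube = json.load(f)

var = cube["data"][0]                 # the single "value" variable
assert var["axes"] == ["zone", "time", "band", "stat"]

stats = cube["axes"]["stat"]["data"][0]["values"]
stat_idx = stats.index("mean")        # position of "mean" on the stat axis

# Mean of band NDVI for zone 0 at the second timestamp (2015-07-16):
value = var["raster"][0][1][0][stat_idx]
print(value)                          # 0.49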

@edzer (Member) commented Dec 7, 2018

I like that. There is also CoverageJSON; see for instance the MULTIPOLYGON or POINT example at https://covjson.org/playground/. What they call domain and range, you call axes and data. In their POINT example, they additionally have parameters, which makes sense e.g. for different variables that have different measurement units.

@mkadunc (Member) commented Dec 7, 2018

Nice link, I completely forgot about covjson. My source was mostly the NetCDF Data Model.

Some observations comparing the two:

  • what covjson calls parameters, I call variables,
  • covjson parameters are specified in advance, which allows a parameter to be defined outside of any data matrix; my variables are all specified inline along with 'raster' or 'values',
  • their parameter doesn't include the dataType; it's only available in "ranges" alongside the values,
  • I don't like that their data arrays are one-dimensional; nested arrays are IMO better for human-readable JSON (for non-human consumption we will probably use some binary format anyway),
  • while I think it's nice that the data arrays can be linked from an external source, it would be better if the shape (or at least the dimensions) of the array and its data types were known in the main JSON document,
  • they have many domain and axis types that encode different types of geospatial rasters; in my proposal (and NetCDF) all axes have simple integer indexing, and any additional info about the axis values is provided in variables. I included the axis type as a way of hinting that axis values may be reordered (categorical), extended (regular / irregular / categorical), aggregated (regular / irregular), or compared (regular / irregular); just something I think might be useful for defining the operations of a datacube algebra.

covjson is definitely something we should look at; maybe it could also help model those parts of image collection metadata that aren't handled by STAC.

@m-mohr changed the title from "zonal_statistics: Output format" to "aggregate_zonal: Output format" Jan 14, 2019
@m-mohr added this to the v0.5 milestone Jan 15, 2019
@m-mohr (Member) commented Jan 15, 2019

Very interesting proposal and discussion, thanks @mkadunc. I don't think we'll come up with a final approach until mid-February, so I'll stick with the simple GeoJSON-based approach for now and add the experimental flag to aggregate_zonal. See the current specification: https://open-eo.github.io/openeo-api/v/0.4.0/processreference/#aggregate_zonal
We can work on the process in a separate PR after v0.4.

@m-mohr (Member) commented Feb 13, 2019

There is also CF-JSON, which seems somewhat similar: http://cf-json.org/specification

Edit: I was told there's also NCO-JSON, which departed from CF-JSON and may be merged at some point again. See pangeo-data/pangeo-datastore#3 (comment)

@jdries (Contributor, Author) commented Mar 29, 2019

I didn't know this, but NetCDF also supports it:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch09.html
They are even working on a way to encode the geometries inside NetCDF:
https://github.com/cf-convention/cf-conventions/blob/master/ch07.adoc#geometries

However, browsers will not like such a binary format, so I still see a clear use case for something based on JSON.

@m-mohr changed the title from "aggregate_zonal: Output format" to "aggregate_polygons: Output format" Mar 29, 2019
@m-mohr changed the title from "aggregate_polygons: Output format" to "aggregate_polygon: Output format" Mar 29, 2019
@m-mohr (Member) commented Sep 13, 2019

3rd year planning: @jdries will look into CovJSON and CFJSON (fka NCO-JSON, see cf-json/cf-json.github.io#10). Related discussions: covjson/specification#86 and pangeo-data/pangeo-datastore#3 (comment)

@soxofaan (Member) commented

The CovJSON output format has been implemented for aggregate_polygon in the VITO back-end (still to be deployed at the moment) through PRs Open-EO/openeo-python-driver#23 and Open-EO/openeo-geopyspark-driver#25.

Usage example against my localhost setup of the VITO back-end (note the save_result(format="CovJSON") call to enable the CovJSON output format):

import json
import shapely.geometry
import openeo

connection = openeo.connect("http://localhost:8080/openeo/0.4.2/")

polygon1 = shapely.geometry.Polygon([(3.71, 51.01), (3.72, 51.02), (3.73, 51.01)])
polygon2 = shapely.geometry.Polygon([(3.71, 51.015), (3.725, 51.025), (3.725, 51.03), (3.71, 51.03)])
geometry_collection = shapely.geometry.GeometryCollection([polygon1, polygon2])

cube = (
    connection
        .load_collection('CGS_SENTINEL2_RADIOMETRY_V102_001')
        .filter_temporal('2019-07-15', '2019-07-30')
        .polygonal_mean_timeseries(polygon=geometry_collection)
        .save_result(format="CovJSON")
)

result = cube.execute()
print(json.dumps(result))

output:

{
  "domain": {
    "axes": {
      "composite": {
        "values": [
          [
            [
              [
                3.71,
                51.01
              ],
              [
                3.72,
                51.02
              ],
              [
                3.73,
                51.01
              ],
              [
                3.71,
                51.01
              ]
            ]
          ],
          [
            [
              [
                3.71,
                51.015
              ],
              [
                3.725,
                51.025
              ],
              [
                3.725,
                51.03
              ],
              [
                3.71,
                51.03
              ],
              [
                3.71,
                51.015
              ]
            ]
          ]
        ],
        "coordinates": [
          "x",
          "y"
        ],
        "dataType": "polygon"
      },
      "t": {
        "values": [
          "2019-07-20T00:00:00Z",
          "2019-07-23T00:00:00Z",
          "2019-07-25T00:00:00Z",
          "2019-07-28T00:00:00Z",
          "2019-07-30T00:00:00Z"
        ]
      }
    },
    "domainType": "MultiPolygonSeries",
    "type": "Domain",
    "referencing": [
      {
        "coordinates": [
          "x",
          "y"
        ],
        "system": {
          "type": "GeographicCRS",
          "id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
        }
      },
      {
        "coordinates": [
          "t"
        ],
        "system": {
          "type": "TemporalRS",
          "calendar": "Gregorian"
        }
      }
    ]
  },
  "ranges": {
    "band0": {
      "values": [
        8279.030674846626,
        6925.279187817259,
        null,
        null,
        647.3476972527749,
        606.1746575342465,
        null,
        null,
        598.0162638242506,
        539.5157878068092
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    },
    "band2": {
      "values": [
        7540.171779141105,
        6389.071065989848,
        null,
        null,
        878.138968409422,
        825.1263001014713,
        null,
        null,
        816.8533253265275,
        744.4639429928741
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    },
    "band1": {
      "values": [
        7925.613496932516,
        6670.365482233503,
        null,
        null,
        844.6788709758425,
        798.5987760020295,
        null,
        null,
        792.2864935194916,
        728.3024861441013
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    },
    "band3": {
      "values": [
        9019.067484662577,
        7498.243654822335,
        null,
        null,
        2178.021344985184,
        2224.413654236428,
        null,
        null,
        2106.431366661662,
        2155.113032462391
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    }
  },
  "type": "Coverage",
  "parameters": {
    "band0": {
      "observedProperty": {
        "label": {
          "en": "Band 0"
        }
      },
      "type": "Parameter"
    },
    "band2": {
      "observedProperty": {
        "label": {
          "en": "Band 2"
        }
      },
      "type": "Parameter"
    },
    "band1": {
      "observedProperty": {
        "label": {
          "en": "Band 1"
        }
      },
      "type": "Parameter"
    },
    "band3": {
      "observedProperty": {
        "label": {
          "en": "Band 3"
        }
      },
      "type": "Parameter"
    }
  }
}
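
For reference, a minimal sketch (plain Python; the file name covjson_result.json is an assumed dump of the response above) of unpacking one range of this CovJSON back into per-polygon time series. CovJSON NdArray values are stored with the last axis varying fastest, so with axisNames ["t", "composite"] the value for (time i, polygon j) sits at index i * n_polygons + j:

import json

# Hypothetical file holding the CovJSON response above.
with open("covjson_result.json") as f:
    cov = json.load(f)

times = cov["domain"]["axes"]["t"]["values"]
rng = cov["ranges"]["band0"]
n_t, n_poly = rng["shape"]            # [5, 2], axis order ["t", "composite"]

for i, t in enumerate(times):
    # One row per timestamp: the values for both polygons.
    row = rng["values"][i * n_poly:(i + 1) * n_poly]
    print(t, row)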

@m-mohr (Member) commented Nov 26, 2019

What would be the actual return value of aggregate_polygon? A general "vector cube", which can be passed to the save_result process, or would it return CovJSON or so directly? What should the schema for the return value of the process look like?

@mkadunc (Member) commented Nov 26, 2019

> What would be the actual return value of aggregate_polygon?

A vector cube.

JSON should be considered just an output encoding, similar to "GTiff" in save_result.

@m-mohr (Member) commented Nov 26, 2019

So, coming from the point (see #68) that we generally don't support vector cubes at the moment: would it be enough to support vector cubes in aggregate_polygon, save_result, load_result and maybe load_collection, add_dimension, drop_dimension and rename_dimension? That's probably what we need to actually use that return type from aggregate_polygon. I guess I'd not enable vector cubes in the filter_* processes yet? Or am I missing something?

@mkadunc (Member) commented Nov 26, 2019

IMO the only thing we need to do in order to support vector cubes is to allow objects as dimension labels (currently we only allow number, string, date, date-time and time). A vector cube is then just a cube with simple-feature geometries as dimension labels on its single spatial dimension. If we go for this approach, we already support vector cubes in all processes (but we treat the spatial dimension as nothing special); a sketch of this follows below.

We could also ignore vector cubes altogether (for now), returning a raster cube with an ordinal dimension that encodes the index of the corresponding polygon. This should be quite intuitive for the user...
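
To make the first option concrete, here is a minimal sketch (assuming xarray and shapely are available; the values are reused from m-mohr's example 1 above, purely for illustration) of a cube whose spatial dimension is labeled with geometries:

import numpy as np
import xarray as xr
from shapely.geometry import Polygon

# The two polygons from example 1, used as dimension labels.
zones = [
    Polygon([(16.1, 48.25), (16.6, 48.25), (16.6, 48.6), (16.1, 48.6)]),
    Polygon([(16.1, 47.9), (16.6, 47.9), (16.6, 48.25), (16.1, 48.25)]),
]

# A "vector cube": one spatial dimension with geometry labels, one temporal dimension.
means = xr.DataArray(
    np.array([[0.815, 0.49], [0.975, 0.425]]),  # mean values from example 1
    dims=("geometry", "t"),
    coords={"geometry": zones, "t": ["2015-07-06", "2015-07-16"]},
)
print(means.sel(t="2015-07-06"))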

@m-mohr (Member) commented Dec 18, 2019

Moved discussions about vector cubes to #68 and thus this issue can be closed, I think.
The output format is not something we define with the processes.

@m-mohr closed this as completed Jan 14, 2020
@jdries (Contributor, Author) commented Apr 2, 2020

FYI, I implemented NetCDF as an output format for time series. I used the simple orthogonal multidimensional array representation and converted the polygons to points to make life easier. (Support for polygons in NetCDF is being worked on, but seems less standard.)
Otherwise the representation seems quite promising, and it can also be converted to nco-json; a sketch of the layout follows below.
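
As an illustration of that layout, a minimal sketch (assuming xarray; the coordinates are made-up polygon centroids and the values are reused from the examples above, not taken from the actual implementation):

import numpy as np
import xarray as xr

# CF "orthogonal multidimensional array" time-series layout: one (feature, t)
# value matrix, with a representative point per polygon (e.g. its centroid).
ds = xr.Dataset(
    {"mean": (("feature", "t"), np.array([[0.815, 0.49], [0.975, 0.425]]))},
    coords={
        "t": np.array(["2015-07-06", "2015-07-16"], dtype="datetime64[ns]"),
        "lon": ("feature", [16.35, 16.35]),
        "lat": ("feature", [48.425, 48.075]),
    },
    attrs={"featureType": "timeSeries"},  # CF convention for discrete sampling
)
ds.to_netcdf("timeseries.nc")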

@m-mohr (Member) commented Apr 2, 2020

@jdries Sounds good. Do you plan to support nco-json anytime soon? If so, I think I'd also try to implement it for GEE. Also, I'd be interested in an example process graph + result...

@jdries (Contributor, Author) commented Apr 3, 2020

I'll give nco-json a try, but I will have to install the nco dependencies for that, as I haven't found a Python library yet that does the conversion. It's a bit of a side thing, so I'm not sure when I'll be able to finish.
I'm also working on that example!
