
aggregate_polygon: Output format #2

Closed · jdries opened this issue Apr 4, 2018 · 21 comments

Labels: help wanted (Extra attention is needed), interoperability

@jdries (Contributor) commented Apr 4, 2018

Currently the zonal statistics process does not have a clear, standardized output format.
A JSON encoding would make the most sense. You could also look into, or borrow concepts from, OGC TimeseriesML.

Here's an example of a simple JSON encoding that we currently use:

  • totalCount and validCount refer to the number of pixels encountered in the polygon.
  • I'm not proposing this actual format, just trying to clarify what I'm talking about.
{
  "results": [
    {
      "date": "2015-07-06",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.8150000000000001
      }
    },
    {
      "date": "2015-07-16",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.49
      }
    },
    {
      "date": "2015-07-26",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.215
      }
    },
    {
      "date": "2015-08-02",
      "result": {
        "totalCount": 1,
        "validCount": 1,
        "average": 0.93
      }
    },
...
@flahn (Member) commented Apr 4, 2018

Currently we have the possibility to specify the raster or vector data format using GDAL and OGR identifiers. What you suggest is that we incorporate additional standards to model vector data in particular when we use formats like plain JSON or XML.

Considering this, I am not sure if we really should specify this for the output of this particular "zonal_statistics" process. I mean, we haven't specified an output format for any of the processes that produce images as output.

But I do see that it would be beneficial to model vector data after a specific standard, and I could imagine that this would be best modeled as a process, which can later become part of an openEO core profile.

@jdries (Contributor, Author) commented Apr 4, 2018

Basically, as a user of the openEO API, if I compute zonal stats, I need to know what I'll get as output, and the API should describe this as clearly as possible, so the definition of this process should somehow reference the output format.
By proposing a simple JSON-based format, you'll encourage back-end implementors to at least support this simple case, leading to more consistency among back-ends.
We should not wait too long on this, as various back-ends are already implementing it, probably in different ways.

@m-mohr (Member) commented Apr 4, 2018

Every process needs to define what it returns, but that is not necessarily the output of a job. Most processes will return an image collection, but that is probably not the case for zonal statistics. Yet, I am not sure how we smoothly convert from a process result to the actual output as defined in the job.

If this definition is urgent for any back-end, then I am happy to pick up concrete proposals. Feel free to come up with a proposal/format you think is a good solution and we can discuss it and add it to the current "collection" of processes.

A general remark: process definitions will not be part of the API specification itself, but part of a separate process/profile specification. AFAIK, they are due with "D3.3 openEO data set and process descriptions" in March 2019 (API version 0.4.0). That's why I changed the milestone for now, but that's more of a categorization than setting a priority.

@m-mohr removed their assignment Apr 23, 2018
@m-mohr (Member) commented Jun 11, 2018

Disclaimer: I have very little experience with zonal statistics. These ideas evolved from implementing zonal statistics for the GEE back-end.

As far as I know, we expect the user to specify a GeoJSON object as input for the region(s). Therefore, I'd suggest simply using GeoJSON as the output format, too.
If it's a Feature(Collection), we just add the additional data to the properties (example 1). If the input is a Geometry(Collection), we need to wrap it in a Feature(Collection) first (example 2).

Example 1

Multiple polygons stored in a file with a single aggregation function.

Process: zonal_statistics
Arguments:

  • regions: polygon.json
  • func: mean

polygon.json:

{
  "type":"FeatureCollection",
  "features":[
    {
      "type":"Feature",
      "properties":{},
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]
      }
    },
    {
      "type":"Feature",
      "properties":{},
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,47.9],[16.6,47.9],[16.6,48.25],[16.1,48.25],[16.1,47.9]]]
      }
    }
  ]
}

Result:

{
  "type":"FeatureCollection",
  "features":[
    {
      "type":"Feature",
      "properties":{
        "zonal_statistics": [
          {
            "date": "2015-07-06",
            "totalCount": 9,
            "validCount": 8,
            "results": {
              "mean": 0.8150000000000001
            }
          },
          {
            "date": "2015-07-16",
            "totalCount": 9,
            "validCount": 8,
            "results": {
              "mean": 0.49
            }
          }
        ]
      },
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]
      }
    },
    {
      "type":"Feature",
      "properties":{
        "zonal_statistics": [
          {
            "date": "2015-07-06",
            "totalCount": 9,
            "validCount": 9,
            "results": {
              "mean": 0.975
            }
          },
          {
            "date": "2015-07-16",
            "totalCount": 9,
            "validCount": 9,
            "results": {
              "mean": 0.425
            }
          }
        ]
      },
      "geometry":{
        "type":"Polygon",
        "coordinates":[[[16.1,47.9],[16.6,47.9],[16.6,48.25],[16.1,48.25],[16.1,47.9]]]
      }
    }
  ]
}

Remarks:

  • totalCount and validCount are probably optional attributes; at least I don't get any information about them from GEE (see example 2)
  • date is expected to be ISO 8601 (date and/or date-time?!)
  • results could include multiple results, e.g. min, max and mean (see example 2)

Example 2

A single polygon embedded into the process graph with multiple aggregation functions.

Process: zonal_statistics
Arguments:

  • regions: {"type":"Polygon","coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]}
  • func: [min, max, mean]

Result:

{
  "type":"Feature",
  "properties":{
    "zonal_statistics": [
      {
        "date": "2015-07-06",
        "results": {
          "min": 0.578415,
          "max": 0.9857545,
          "mean": 0.85754
        }
      },
      {
        "date": "2015-07-16",
        "results": {
          "min": 0.40,
          "max": 0.57,
          "mean": 0.49
        }
      }
    ]
  },
  "geometry":{
    "type":"Polygon",
    "coordinates":[[[16.1,48.25],[16.6,48.25],[16.6,48.6],[16.1,48.6],[16.1,48.25]]]
  }
}

We could also wrap the Feature into a FeatureCollection if that would be simpler to read and/or write.
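
To illustrate how a client might consume this, here is a minimal sketch (plain Python with the standard json module; the file name result.json and the flattening logic are my own assumptions, not part of the proposal) that reads either variant of the proposed output and prints one row per polygon and date:

import json

# Hypothetical file holding a result like the examples above.
with open("result.json") as f:
    result = json.load(f)

# A bare Feature (example 2) is treated as a one-element collection (example 1).
features = result["features"] if result["type"] == "FeatureCollection" else [result]

for i, feature in enumerate(features):
    for entry in feature["properties"]["zonal_statistics"]:
        # totalCount/validCount are probably optional (see remarks above).
        valid = entry.get("validCount")
        print(f"polygon {i}, {entry['date']}: {entry['results']} (validCount={valid})")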

What do you think?
@jdries, could you provide your thoughts on that whenever you have the time and then assign it back to me, please?

@neteler (Member) commented Jun 11, 2018

Just a small remark: IMHO "totalCount" and "validCount" are essential, because the user polygon used for computing the zonal statistics might be located outside of, or only partially inside, the available data. And/or pixels could be invalid (cloud contamination, etc.). From a GIS point of view, these attributes should be mandatory so that the user can evaluate the delivered results.

@m-mohr changed the title from "Output format for zonal statistics" to "zonal_statistics: Output format" Jun 27, 2018
@m-mohr transferred this issue from Open-EO/openeo-api Nov 30, 2018
@m-mohr added the "help wanted" label Nov 30, 2018
@mkadunc (Member) commented Dec 7, 2018

Given the discussion at VITO this week about vector collections as arrays, I suggest we do the following:

  • the most frugal way of structuring the response would be to keep the zone axis in the results in the same order as the geometries that came in; then we don't need the reference geometry
  • possibly we could use the IDs of zones (if they have IDs) to link to the zone geometries from the request
  • don't return geometries in the result; they will be known to the caller (either by ID or by position in the output array)
  • define the properties to be returned as a dimension in the output array (statsAxis = categorical_axis(["totalCount", "validCount", "min", "max", "mean"]))
  • zone is another dimension (zoneAxis = index_axis(N_zones))
  • other dimensions can be arbitrary, depending on the process (timeAxis = irregular_axis(datetime, [t0,t1..tN]), bandAxis = categorical_axis(N_bands))
  • data is returned in a multi-dimensional array

In JSON, this could look like the following (modeled after NetCDF; the metadata part at the beginning is just a quick attempt at encoding the relevant info into JSON). Note that there is no duplication of property names in the data part, so it should produce a much smaller file than using GeoJSON features:

//openEOCubeJSON
{
	"axes": {
		"time": { 
			"type" : "regular",
			"data" : [{
				"variable" : "timestamp",
				"dataType" : "datetime",
				"range" : {
					"length" : 2,
					"offset" : "2015-07-06",
					"samplingDistance" : "P10D"
				}
			}]
		},
		"zone": { "type" : "index" },
		"band": {
			"type" : "categorical",
			"data" : [{
				"variable" : "name",
				"dataType" : "string",
				"values" : ["NDVI"]
			}]
		},
		"stat": {
			"type" : "categorical",
			"data" : [{
				"variable" : "name",
				"dataType" : "string",
				"values" : ["totalCount", "validCount", "min", "max", "mean"]
			}]
		}
	},
	"data": [
		{
			"variable" : "value",
			"axes" : ["zone", "time", "band", "stat"],
			"raster" : [
				//zone 0
			    [
					//time 2015-07-06
					[
						//band NDVI
						[9,8,0.58,0.99,0.86]
					],
					//time 2015-07-16
					[
						//band NDVI
						[9,8,0.40,0.57,0.49]
					]
				],
				//zone 1
				[
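					//time 2015-07-06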
					[
						//band NDVI
						[9,9,0.78,0.98,0.975]
					],
					//time 2015-07-16
					[
						//band NDVI
						[9,9,0.37,0.51,0.425]
					]
				]
			]
		}
	]
}

Now if polygons or their IDs are needed, they could be returned in the data part of the zone axis information.
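
As a quick check of how a consumer would address this layout, here is a minimal sketch (plain Python; it assumes the // comments above are stripped so the document is valid JSON, and the file name cube.json is hypothetical):

import json

# Hypothetical file holding the (comment-stripped) cube document above.
with open("cube.json") as f:
    cube = json.load(f)

var = cube["data"][0]                 # the single "value" variable
assert var["axes"] == ["zone", "time", "band", "stat"]

stats = cube["axes"]["stat"]["data"][0]["values"]
stat_idx = stats.index("mean")        # position of "mean" on the stat axis

# Mean of band NDVI for zone 0 at the second timestamp (2015-07-16):
value = var["raster"][0][1][0][stat_idx]
print(value)                          # 0.49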

@edzer (Member) commented Dec 7, 2018

I like that. There is also CoverageJSON; see for instance the MULTIPOLYGON or POINT example at https://covjson.org/playground/. What they call domain and range, you call axes and data. In their POINT example, they additionally have parameters, which makes sense e.g. for different variables that have different measurement units.

@mkadunc (Member) commented Dec 7, 2018

Nice link, I completely forgot about covjson. My source was mostly the NetCDF Data Model.

Some observations comparing the two:

  • what covjson calls parameters, I call variables,
  • covjson parameters are specified in advance, which allows a parameter to be defined outside of any data matrix; my variables are all specified inline along with 'raster' or 'values',
  • their parameter doesn't include the dataType; it's only available in "ranges" alongside the values,
  • I don't like that their data arrays are one-dimensional; nested arrays are IMO better for human-readable JSON (for non-human consumption we will probably use some binary format anyway),
  • while I think it's nice that the data arrays can be linked from an external source, it would be better if the shape (or at least the dimensions) of the array and its data types were known in the main JSON document,
  • they have many domain and axis types that encode different types of geospatial rasters; in my proposal (and NetCDF) all axes have simple integer indexing, and any additional info about the axis values is provided in variables. I included the axis type as a way of hinting that axis values may be reordered (categorical), extended (regular / irregular / categorical), aggregated (regular / irregular), or compared (regular / irregular); just something I think might be useful for defining the operations of a datacube algebra.

covjson is definitely something we should look at; maybe it could also help model those parts of image collection metadata that aren't handled by STAC.

@m-mohr changed the title from "zonal_statistics: Output format" to "aggregate_zonal: Output format" Jan 14, 2019
@m-mohr added this to the v0.5 milestone Jan 15, 2019
@m-mohr (Member) commented Jan 15, 2019

Very interesting proposal and discussion, thanks @mkadunc. I don't think we'll come up with a final approach until mid-February, so I'll stick with the simple GeoJSON-based approach for now and add the experimental flag to aggregate_zonal. See the current specification: https://open-eo.github.io/openeo-api/v/0.4.0/processreference/#aggregate_zonal
We can work on the process in a separate PR after v0.4.

@m-mohr (Member) commented Feb 13, 2019

There is also CF-JSON, which seems somewhat similar: http://cf-json.org/specification

Edit: I was told there's also NCO-JSON, which departed from CF-JSON and may be merged at some point again. See pangeo-data/pangeo-datastore#3 (comment)

@jdries (Contributor, Author) commented Mar 29, 2019

I didn't know this, but NetCDF also supports it:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch09.html
They are even working on a way to encode the geometries inside NetCDF:
https://github.com/cf-convention/cf-conventions/blob/master/ch07.adoc#geometries

However, browsers will not like such a binary format, so I still see a clear use case for something based on JSON.

@m-mohr changed the title from "aggregate_zonal: Output format" to "aggregate_polygons: Output format" Mar 29, 2019
@m-mohr changed the title from "aggregate_polygons: Output format" to "aggregate_polygon: Output format" Mar 29, 2019
@m-mohr (Member) commented Sep 13, 2019

3rd year planning: @jdries will look into CovJSON and CFJSON (fka NCO-JSON, see cf-json/cf-json.github.io#10). Related discussions: covjson/specification#86 and pangeo-data/pangeo-datastore#3 (comment)

@soxofaan (Member) commented

The CovJSON output format has been implemented for aggregate_polygon in the VITO back-end (still to be deployed at the moment) through PRs Open-EO/openeo-python-driver#23 and Open-EO/openeo-geopyspark-driver#25.

Usage example against my localhost setup of the VITO back-end (note the save_result(format="CovJSON") call to enable the CovJSON output format):

import json
import shapely.geometry
import openeo

connection = openeo.connect("http://localhost:8080/openeo/0.4.2/")

polygon1 = shapely.geometry.Polygon([(3.71, 51.01), (3.72, 51.02), (3.73, 51.01)])
polygon2 = shapely.geometry.Polygon([(3.71, 51.015), (3.725, 51.025), (3.725, 51.03), (3.71, 51.03)])
geometry_collection = shapely.geometry.GeometryCollection([polygon1, polygon2])

cube = (
    connection
        .load_collection('CGS_SENTINEL2_RADIOMETRY_V102_001')
        .filter_temporal('2019-07-15', '2019-07-30')
        .polygonal_mean_timeseries(polygon=geometry_collection)
        .save_result(format="CovJSON")
)

result = cube.execute()
print(json.dumps(result))

output:

{
  "domain": {
    "axes": {
      "composite": {
        "values": [
          [
            [
              [
                3.71,
                51.01
              ],
              [
                3.72,
                51.02
              ],
              [
                3.73,
                51.01
              ],
              [
                3.71,
                51.01
              ]
            ]
          ],
          [
            [
              [
                3.71,
                51.015
              ],
              [
                3.725,
                51.025
              ],
              [
                3.725,
                51.03
              ],
              [
                3.71,
                51.03
              ],
              [
                3.71,
                51.015
              ]
            ]
          ]
        ],
        "coordinates": [
          "x",
          "y"
        ],
        "dataType": "polygon"
      },
      "t": {
        "values": [
          "2019-07-20T00:00:00Z",
          "2019-07-23T00:00:00Z",
          "2019-07-25T00:00:00Z",
          "2019-07-28T00:00:00Z",
          "2019-07-30T00:00:00Z"
        ]
      }
    },
    "domainType": "MultiPolygonSeries",
    "type": "Domain",
    "referencing": [
      {
        "coordinates": [
          "x",
          "y"
        ],
        "system": {
          "type": "GeographicCRS",
          "id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
        }
      },
      {
        "coordinates": [
          "t"
        ],
        "system": {
          "type": "TemporalRS",
          "calendar": "Gregorian"
        }
      }
    ]
  },
  "ranges": {
    "band0": {
      "values": [
        8279.030674846626,
        6925.279187817259,
        null,
        null,
        647.3476972527749,
        606.1746575342465,
        null,
        null,
        598.0162638242506,
        539.5157878068092
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    },
    "band2": {
      "values": [
        7540.171779141105,
        6389.071065989848,
        null,
        null,
        878.138968409422,
        825.1263001014713,
        null,
        null,
        816.8533253265275,
        744.4639429928741
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    },
    "band1": {
      "values": [
        7925.613496932516,
        6670.365482233503,
        null,
        null,
        844.6788709758425,
        798.5987760020295,
        null,
        null,
        792.2864935194916,
        728.3024861441013
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    },
    "band3": {
      "values": [
        9019.067484662577,
        7498.243654822335,
        null,
        null,
        2178.021344985184,
        2224.413654236428,
        null,
        null,
        2106.431366661662,
        2155.113032462391
      ],
      "axisNames": [
        "t",
        "composite"
      ],
      "type": "NdArray",
      "dataType": "float",
      "shape": [
        5,
        2
      ]
    }
  },
  "type": "Coverage",
  "parameters": {
    "band0": {
      "observedProperty": {
        "label": {
          "en": "Band 0"
        }
      },
      "type": "Parameter"
    },
    "band2": {
      "observedProperty": {
        "label": {
          "en": "Band 2"
        }
      },
      "type": "Parameter"
    },
    "band1": {
      "observedProperty": {
        "label": {
          "en": "Band 1"
        }
      },
      "type": "Parameter"
    },
    "band3": {
      "observedProperty": {
        "label": {
          "en": "Band 3"
        }
      },
      "type": "Parameter"
    }
  }
}
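
For reference, a minimal sketch (plain Python; the file name covjson_result.json is an assumed dump of the response above) of unpacking one range of this CovJSON back into per-polygon time series. CovJSON NdArray values are stored with the last axis varying fastest, so with axisNames ["t", "composite"] the value for (time i, polygon j) sits at index i * n_polygons + j:

import json

# Hypothetical file holding the CovJSON response above.
with open("covjson_result.json") as f:
    cov = json.load(f)

times = cov["domain"]["axes"]["t"]["values"]
rng = cov["ranges"]["band0"]
n_t, n_poly = rng["shape"]            # [5, 2], axis order ["t", "composite"]

for i, t in enumerate(times):
    # One row per timestamp: the values for both polygons.
    row = rng["values"][i * n_poly:(i + 1) * n_poly]
    print(t, row)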

@m-mohr (Member) commented Nov 26, 2019

What would be the actual return value of aggregate_polygon? A general "vector cube", which can be passed to the save_result process, or would it return CovJSON or so directly? What should the schema for the return value of the process look like?

@mkadunc (Member) commented Nov 26, 2019

> What would be the actual return value of aggregate_polygon?

A vector cube.

JSON should be considered just an output encoding, similar to "GTiff" in save_result.

@m-mohr (Member) commented Nov 26, 2019

So, coming from the point (see #68) that we generally don't support vector cubes at the moment: would it be enough to support vector cubes in aggregate_polygon, save_result, load_result and maybe load_collection, add_dimension, drop_dimension and rename_dimension? That's probably what we need to actually use that return type from aggregate_polygon. I guess I'd not enable vector cubes in the filter_* processes yet? Or am I missing something?

@mkadunc (Member) commented Nov 26, 2019

IMO the only thing we need to do in order to support vector cubes is to allow objects as dimension labels (currently we only allow number, string, date, date-time and time). A vector cube is then just a cube with simple-feature geometries as dimension labels on its single spatial dimension. If we go for this approach, we already support vector cubes in all processes (but we treat the spatial dimension as nothing special); a sketch of this follows below.

We could also ignore vector cubes altogether (for now), returning a raster cube with an ordinal dimension that encodes the index of the corresponding polygon. This should be quite intuitive for the user...
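
To make the first option concrete, here is a minimal sketch (assuming xarray and shapely are available; the values are reused from m-mohr's example 1 above, purely for illustration) of a cube whose spatial dimension is labeled with geometries:

import numpy as np
import xarray as xr
from shapely.geometry import Polygon

# The two polygons from example 1, used as dimension labels.
zones = [
    Polygon([(16.1, 48.25), (16.6, 48.25), (16.6, 48.6), (16.1, 48.6)]),
    Polygon([(16.1, 47.9), (16.6, 47.9), (16.6, 48.25), (16.1, 48.25)]),
]

# A "vector cube": one spatial dimension with geometry labels, one temporal dimension.
means = xr.DataArray(
    np.array([[0.815, 0.49], [0.975, 0.425]]),  # mean values from example 1
    dims=("geometry", "t"),
    coords={"geometry": zones, "t": ["2015-07-06", "2015-07-16"]},
)
print(means.sel(t="2015-07-06"))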

@m-mohr (Member) commented Dec 18, 2019

Moved discussions about vector cubes to #68 and thus this issue can be closed, I think.
The output format is not something we define with the processes.

@m-mohr closed this as completed Jan 14, 2020
@jdries (Contributor, Author) commented Apr 2, 2020

FYI, I implemented NetCDF as an output format for time series. I used the simple orthogonal multidimensional array representation and converted the polygons to points to make life easier. (Support for polygons in NetCDF is being worked on, but seems less standard.)
Otherwise the representation seems quite promising, and it can also be converted to nco-json; a sketch of the layout follows below.
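
As an illustration of that layout, a minimal sketch (assuming xarray; the coordinates are made-up polygon centroids and the values are reused from the examples above, not taken from the actual implementation):

import numpy as np
import xarray as xr

# CF "orthogonal multidimensional array" time-series layout: one (feature, t)
# value matrix, with a representative point per polygon (e.g. its centroid).
ds = xr.Dataset(
    {"mean": (("feature", "t"), np.array([[0.815, 0.49], [0.975, 0.425]]))},
    coords={
        "t": np.array(["2015-07-06", "2015-07-16"], dtype="datetime64[ns]"),
        "lon": ("feature", [16.35, 16.35]),
        "lat": ("feature", [48.425, 48.075]),
    },
    attrs={"featureType": "timeSeries"},  # CF convention for discrete sampling
)
ds.to_netcdf("timeseries.nc")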

@m-mohr (Member) commented Apr 2, 2020

@jdries Sounds good. Do you plan to support nco-json anytime soon? If so, I think I'd also try to implement it for GEE. Also, I'd be interested in an example process graph + result...

@jdries (Contributor, Author) commented Apr 3, 2020

I'll give nco-json a try, but I will have to install the nco dependencies for that, as I haven't found a Python library yet that does the conversion. It's a bit of a side thing, so I'm not sure when I'll be able to finish.
I'm also working on that example!
