implementation.md

File metadata and controls

265 lines (210 loc) · 13 KB

Implementation Guide for back-ends

This file is meant to provide some additional implementation details for back-ends.

No-data value

A data cube shall always keep reference of the applicable no-data value(s). The no-data values can be chosen by the back-end implementation, e.g. depending on the data type of the data. No-data values should be exposed for each pre-defined Collection in its metadata. For all data generated through openEO (e.g. through synchronous or batch jobs), the metadata and/or data shall expose the no-data values.

The openEO process specifications generally use null as a generic value to express no-data values. This is primarily meant for the JSON encoding, which means it is used:

  1. in the process specification (data type null in the schema), and
  2. in the process graph (if the no-data value exposed through the metadata can't be used in JSON, e.g. NaN).

Back-ends may or may not use null as a no-data value internally.

NaN: If NaN is the no-data value for floating-point numbers, be aware that the behavior of no-data values in openEO and NaN (IEEE 754) sometimes differs.

Array processes: Some array processes (e.g. array_find or any) use null as a return value. In the context of data cube operations (e.g. in reduce_dimension), null values returned by the array processes shall be replaced with the no-data value of the data cube. As the processes may be used outside of data cubes where the no-data values are undefined, most processes fall back to null in this case (reflected through the mention of "(or null)" in the process description). This effectively means that null is the default value for an undefined no-data value in openEO.
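The replacement rule above can be sketched as follows. This is a minimal illustration, not openEO API: the helpers `reduce_with` and `array_find` and the `nodata` value are hypothetical.

```python
# Sketch: mapping null (None) results from array processes back to the
# cube's no-data value during a reduce operation. All names here are
# illustrative, not part of any openEO library.

def reduce_with(fn, rows, nodata):
    """Apply an array process `fn` to each row; map None to `nodata`."""
    out = []
    for row in rows:
        result = fn(row)
        out.append(nodata if result is None else result)
    return out

def array_find(values, target):
    """Return the zero-based index of `target`, or None (JSON null) if absent."""
    for i, v in enumerate(values):
        if v == target:
            return i
    return None

rows = [[1, 2, 3], [4, 5, 6]]
print(reduce_with(lambda r: array_find(r, 5), rows, nodata=-9999))
# [-9999, 1]
```

Outside of a data cube context, the `None` would be kept as-is, matching the "(or null)" fallback described above.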

Optimizations for conditions (e.g. if)

None of the openEO processes per se is "special" and thus all are treated the same way by default. Nevertheless, there are some cases where a special treatment can make a huge difference.

Character encoding

String-related processes previously stated that strings have to be "encoded in UTF-8 by default". This requirement was removed, and we clarify the intended behavior here:

For data transfer through the API, the character encoding of strings is specified using HTTP headers. This means all strings provided in the process graph have the encoding specified in the HTTP headers. Back-ends can internally use any character encoding and may therefore need to convert the character encoding upon receipt of the process graph. It is recommended to use a Unicode character encoding such as UTF-8. In case of doubt, clients and servers should assume UTF-8 as the character encoding.
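A minimal sketch of this fallback behavior, assuming the charset is read from the HTTP Content-Type header; `decode_body` is a hypothetical helper, not openEO API:

```python
# Sketch: decode a request body using the charset parameter from the
# Content-Type header, falling back to UTF-8 in case of doubt.

def decode_body(raw: bytes, content_type: str) -> str:
    charset = "utf-8"  # recommended fallback
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            charset = value.strip('"')
    return raw.decode(charset)

print(decode_body("grüß".encode("latin-1"), "application/json; charset=ISO-8859-1"))
# grüß
```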

Branching behavior

The if process (and any other process that works on some kind of condition) is usually a special control structure rather than a normal function. Such conditionals decide between one outcome or the other. Evaluating them naively would compute both outcomes and, depending on the condition, use one and discard the other. This can and should be optimized by "lazily" computing only the outcome that is actually used. This can have a huge impact on performance, as some computations don't need to be executed at all.

openEO doesn't require special handling for the if process, but it is strongly recommended that back-ends treat it specially and only compute the outcome that is actually needed. In the end, this is faster and cheaper for the user, so users may prefer back-ends that offer this optimization. Fortunately, both approaches lead to the same results, so comparability and reproducibility of the results are preserved.
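The lazy evaluation strategy can be sketched by passing the branches as callables, so only the selected branch is ever executed. `lazy_if` and the branch functions are illustrative names, not openEO API:

```python
# Sketch of lazy `if` evaluation: both branches are deferred (callables)
# and only the one selected by the condition is computed.

def lazy_if(condition, accept, reject):
    return accept() if condition else reject()

calls = []  # records which branch was actually executed

def cheap():
    calls.append("cheap")
    return "a"

def expensive():
    calls.append("expensive")
    return "b"

result = lazy_if(True, cheap, expensive)
print(result, calls)  # a ['cheap']  -- `expensive` was never run
```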

Short-circuit evaluation

Similarly, back-ends should "short-circuit" the evaluation of conditions that use processes such as and, or, or xor: once a condition has reached an unambiguous result, the evaluation should stop and provide the result directly. This is basically the same behavior that is also described in the processes all and any.

For example, for the condition A > 0 or B > 0, the expression B > 0 should only be executed if A > 0 is false, as otherwise the result is already unambiguous and will be true regardless of the rest of the condition.

Implementing this behavior does not have any negative side effects, so comparability and reproducibility of the results are preserved.
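The example above can be sketched with deferred operands; `or_` and `gt_zero` are illustrative helpers, not openEO API:

```python
# Sketch of short-circuit `or`: evaluate deferred operands left to
# right and stop at the first one that is true.

def or_(operands):
    """Evaluate callables left to right; stop at the first True."""
    for operand in operands:
        if operand():
            return True
    return False

evaluated = []  # records which operands were actually evaluated

def gt_zero(name, value):
    def check():
        evaluated.append(name)
        return value > 0
    return check

# A > 0 or B > 0 with A = 1: B is never evaluated.
print(or_([gt_zero("A", 1), gt_zero("B", 5)]), evaluated)  # True ['A']
```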

Enums for processing methods

There are numerous processes that provide a predefined set of processing methods. For example:

  • ard_surface_reflectance: atmospheric_correction_method and cloud_detection_method
  • atmospheric_correction: method
  • cloud_detection: method
  • resample_cube_spatial: method
  • resample_spatial: method

Those methods are meant to provide common names for well-known processing methods. Back-ends should check which methods they can implement and remove all the methods they can't implement. Similarly, you can add new methods. In that case, we kindly ask you to open a new issue and provide us your additions so that we can align implementations and eventually update the process specifications with all methods out there. Thanks in advance!

Also make sure to update the textual descriptions accordingly.

This applies similarly to other enums specified in parameter schemas, e.g. the period parameter in aggregate_temporal_period.

Proprietary options in ard_surface_reflectance, atmospheric_correction and cloud_detection

The processes mentioned above all have at least one parameter for proprietary options that can be passed to the corresponding methods:

  • ard_surface_reflectance: atmospheric_correction_options and cloud_detection_options
  • atmospheric_correction: options
  • cloud_detection: options

By default, the parameters don't allow any value except an empty object. Back-ends have to either remove the parameter or define a schema that gives users details about the supported parameters per supported method.

For example, if you support the methods iCor and FORCE in atmospheric_correction, you may define something like the following for the parameter:

{
	"description": "Proprietary options for the atmospheric correction method.",
	"name": "options",
	"optional": true,
	"default": {},
	"schema": [
		{
			"title": "FORCE options",
			"type": "object",
			"properties": {
				"force_option1": {
					"type": "number",
					"description": "Description for option 1",
					"default": 0
				},
				"force_option2": {
					"type": "boolean",
					"description": "Description for option 2",
					"default": true
				}
			}
		},
		{
			"title": "iCor options",
			"type": "object",
			"properties": {
				"icor_option1": {
					"type": "string",
					"description": "Description for option 1",
					"default": "example"
				}
			}
		}

	]
}

Default values should be specified for each of the additional options given in properties. The top-level default value should always be an empty object {}. The default values for the empty object will be provided by the schema. None of the additional options should be required for better interoperability.

Date and Time manipulation

Working with dates is a lot more complex than it seems to be at first sight. Issues arise especially with daylight saving times (DST), time zones, leap years and leap seconds.

The date/time functions in openEO don't have any effect on time zones right now as only dates and times in UTC (with potential numerical time zone modifier) are supported.

Month overflows, including the specific case of leap years, are implemented in a way that computations handle them gracefully. For example:

  • If you add a month to January 31st, the result is February 29th (in leap years) or February 28th (in other years). This means that for invalid dates caused by month overflow, we round down (or "snap") to the last valid date of the month.
  • If you add a month to February 29th, the result is March 29th. So the "snap" behavior doesn't work the other way round.
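The "snap to the last valid day" rule can be sketched with the standard library; `add_months` is an illustrative helper, not the openEO date_shift implementation:

```python
import calendar
import datetime

# Sketch of month overflow handling: if the target month is shorter
# than the source day, snap down to the last valid day of that month.

def add_months(d, months):
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    # monthrange returns (weekday of first day, number of days in month)
    day = min(d.day, calendar.monthrange(year, month)[1])
    return datetime.date(year, month, day)

print(add_months(datetime.date(2020, 1, 31), 1))  # 2020-02-29 (leap year)
print(add_months(datetime.date(2020, 2, 29), 1))  # 2020-03-29 (no snap back)
```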

Leap seconds are basically ignored in manipulations as they don't follow a regular pattern. So leap seconds may be passed into the processes, but will never be returned by date manipulation processes in openEO. See the examples for the leap second 2016-12-31T23:59:60Z:

  • If you add a minute to 2016-12-31T23:59:60Z, the result is 2017-01-01T00:00:59Z. This means that for invalid times we round down (or "snap") to the last valid time.
  • If you add a second to 2016-12-31T23:59:59Z, the result is 2017-01-01T00:00:00Z.
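The first example can be reproduced by snapping the leap second down before any arithmetic; `clamp_leap_second` is an illustrative helper, not openEO API:

```python
import datetime

# Sketch of the "round down" rule for leap seconds: a seconds field of
# 60 cannot be represented by `datetime`, so it is snapped down to 59
# before any further manipulation.

def clamp_leap_second(iso_string):
    """Parse an RFC 3339 timestamp, mapping a :60 seconds field to :59."""
    if iso_string[17:19] == "60":  # seconds field of YYYY-MM-DDThh:mm:ss
        iso_string = iso_string[:17] + "59" + iso_string[19:]
    return datetime.datetime.fromisoformat(iso_string.replace("Z", "+00:00"))

t = clamp_leap_second("2016-12-31T23:59:60Z")
print((t + datetime.timedelta(minutes=1)).isoformat())
# 2017-01-01T00:00:59+00:00
```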

Language support

To make date_shift easier to implement, we have found some libraries that follow this specification and can be used for implementations:

inspect process

The inspect process (previously known as debug) allows users to debug their workflows. Back-ends should not execute the process for log levels below the minimum log level that can be specified through the API (>= v1.2.0) for each data processing request.
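A minimal sketch of this filtering, assuming a simple level ordering; `inspect_process` and the level table are illustrative, not openEO API:

```python
# Sketch: skip `inspect` log entries whose level is below the minimum
# log level requested for the data processing request.

LEVELS = {"debug": 0, "info": 1, "warning": 2, "error": 3}

def inspect_process(data, level, job_min_level, log_sink):
    """Record the log entry only if `level` passes the job's threshold."""
    if LEVELS[level] >= LEVELS[job_min_level]:
        log_sink.append({"level": level, "data": data})
    return data  # inspect passes its data through unchanged

logs = []
inspect_process([1, 2, 3], "debug", "warning", logs)  # filtered out
inspect_process([1, 2, 3], "error", "warning", logs)  # recorded
print(len(logs))  # 1
```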

Data Types

The process is only useful for users if a common behavior for data types passed into the data parameter has been agreed on across implementations.

The following chapters include some proposals for common data (sub-)types, but the list is incomplete and will be extended in the future. Also, for some data types a JSON encoding is still missing; we'll add more details once agreed upon: #299

Scalars

For the data types boolean, number, string and null it is recommended to log the values as given.

Arrays

It is recommended to summarize arrays as follows:

{
	"data": [3,1,6,4,8], // Return a reasonable excerpt of the data, e.g. the first 5 or 10 elements
	"length": 10, // Return the length of the array; this is important to determine whether the data above is complete or an excerpt
	"min": 0, // optional: Return additional statistics if possible, ideally using the corresponding openEO process names as keys
	"max": 10
}
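A helper producing this structure could look as follows; `summarize_array` and the excerpt length are illustrative choices, not openEO API:

```python
# Sketch: summarize an array for `inspect` logs, following the JSON
# structure above (excerpt, length, optional statistics).

def summarize_array(values, excerpt=5):
    summary = {"data": values[:excerpt], "length": len(values)}
    if values and all(isinstance(v, (int, float)) for v in values):
        # optional statistics, keyed by the openEO process names
        summary["min"] = min(values)
        summary["max"] = max(values)
    return summary

print(summarize_array([3, 1, 6, 4, 8, 0, 10, 2, 7, 5]))
# {'data': [3, 1, 6, 4, 8], 'length': 10, 'min': 0, 'max': 10}
```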

Data Cubes

It is recommended to return them summarized in a structure compliant with the STAC data cube extension. If reasonable, it is valuable for users to provide all dimension labels (e.g. individual timestamps for the temporal dimension) instead of value ranges. The top-level object and/or each dimension can be enhanced with additional statistics if possible, ideally using the corresponding openEO process names as keys.

{
	"cube:dimensions": {
		"x": {
			"type": "spatial",
			"axis": "x",
			"extent": [8.253, 12.975],
			"reference_system": 4326
		},
		"y": {
			"type": "spatial",
			"axis": "y",
			"extent": [51.877,55.988],
			"reference_system": 4326
		},
		"t": {
			"type": "temporal",
			"values": [
				"2015-06-21T12:56:55Z",
				"2015-06-23T09:12:14Z",
				"2015-06-25T23:44:44Z",
				"2015-06-27T21:11:34Z",
				"2015-06-30T17:33:12Z"
			],
			"step": null
		},
		"bands": {
			"type": "bands",
			"values": ["NDVI"]
		}
	},
	// optional: Return additional data or statistics for the data cube if possible (see also the chapter "Arrays" above).
	"min": -1,
	"max": 1
}

Quantile algorithms

The quantiles process could implement a number of different algorithms; the literature usually distinguishes 9 types. Right now it's not possible to choose between them, but this may be added in the future. To improve interoperability, version 1.2.0 of the openEO processes added details about the algorithm that must be implemented. A survey has shown that most libraries implement type 7, so this was chosen as the default.
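Type 7 interpolates linearly between order statistics. In Python's standard library this corresponds to `method="inclusive"` of `statistics.quantiles` (NumPy's default `method="linear"` implements the same type):

```python
import statistics

# Quartiles of [1, 2, 3, 4] using type 7 (linear interpolation):
data = [1, 2, 3, 4]
print(statistics.quantiles(data, n=4, method="inclusive"))
# [1.75, 2.5, 3.25]
```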

We have found some libraries that can be used for an implementation: