Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter_bands / load_collection / reduce: Contradicting language regarding bands #77

Closed
m-mohr opened this issue Sep 5, 2019 · 11 comments

Comments

@m-mohr
Copy link
Member

m-mohr commented Sep 5, 2019

Originates from Open-EO/openeo-api#208

There's contradicting wording regarding dimension type bands, especially when reading filter_bands and load_collection.

reduce / filter:

For multi-spectral imagery there is usually a separate dimension of type bands for the bands.

filter_bands:

The data cube is expected to have only one dimension of type bands.

load_collection (which unfortunately contradicts to filter_bands above):

Applies to all dimensions of type bands if there are multiple of them.

It's still valid as it's just a recommendation on how the passed data cube should look like, but I should double-check the language here.

@m-mohr m-mohr added this to the v1.0 milestone Sep 5, 2019
@m-mohr m-mohr changed the title Contradicting wording regarding dimension type bands Contradicting language regarding dimension type bands Sep 5, 2019
@soxofaan
Copy link
Member

also related (I'm not sure whether I should make a separate issue for this):

filter_bands has arguments bands, common_names and wavelengths and recommends common names:

To keep algorithms interoperable it is recommended to prefer the common bands names or the wavelengths over collection and/or back-end specific band names.

load_collection only has the bands argument, so doesn't support what is recommended by filter_bands

One could argue to remove the extent and bands arguments from load_collection on the processgraph/API level and let the filtering to specialized processes (filter_temporal, filter_spatial, filter_bands) to eliminate duplication issues.
Clients can still optionally provide an interface where load_collection and filtering are provided in a single function (but unroll it to multiple process graph nodes under the hood)

@m-mohr
Copy link
Member Author

m-mohr commented Oct 25, 2019

I don't like these things where clients must implement special handling for a set of processes. This must be documented somewhere and is likely to lead to issues if back-ends don't support a process or clients don't follow the spec closely.

Still, I see the issue with filter_bands and load_collection and I don't like the inconsistency. load_collection has the bands in there as in some cases it can optimize loading the data (e.g. bands stored in different files). On the other hand I don't want to overload load_collection. What I could imagine to work is that we combine bands and common_names in both processes. Remove common_names from filter_bands and allow common names to be used in the bands property. Band IDs have priority over common names, but we recommend to use common names in the process whenever possible.

Would that make sense, @soxofaan ?

@m-mohr m-mohr changed the title Contradicting language regarding dimension type bands filter_bands / load_collection / reduce: Contradicting language regarding bands Oct 25, 2019
@soxofaan
Copy link
Member

where clients must implement special handling for a set of processes.

I don't think client MUST do that, I just said they could.
IMHO clients and client applications operate on a higher (abstraction) level than the API anyway (to accommodate for language specific features and restrictions), so I do not really mind that the client level function/method calls do not map one-on-one to the API level process graphs. For example: the "band math" operations you can do with the python client (e.g. b1 - b2 / b3) is massively shorter and more developer friendly than the resulting process graph representation.
By allowing some decoupling, you can focus better on what is important for each level:

  • a small but consistent, versatile process graph API
  • a user/developer friendly client API

load_collection has the bands in there as in some cases it can optimize loading the data

The goal in the Python client is to optimize this kind of data loading anyway, even if filter_bands is specified as a separate process in the graph. Same for spatial and temporal filtering.

combine bands and common_names

We try to do that already in the python client in a way: the ImageCollectionClient.band(band) method handles both internal names and common names (and even band indices), but with priority for common names at the moment.
But I see this more as a client level API UX optimization, and I'm not sure if it is a good idea to push that to the process graph level API.

@m-mohr
Copy link
Member Author

m-mohr commented Oct 25, 2019

I don't think client MUST do that, I just said they could.

My point is, if we specify that in the API, the clients MUST (or at least SHOULD) implement it so that all the three clients don't differ too much from each other. Otherwise some clients may be stuck with incomplete functionality regarding the processes.

IMHO clients and client applications operate on a higher (abstraction) level [...]

Agreed, but those things should not be specified in the API or in the processes.

The goal in the Python client is to optimize this kind of data loading anyway, even if filter_bands is specified as a separate process in the graph. Same for spatial and temporal filtering.

I'm not sure whether a client can really judge what is optimal and what isn't. That should be up to the back-end to decide as only the back-end really knows what is better for its architecture.

combine bands and common_names

We try to do that already in the python client in a way: the ImageCollectionClient.band(band) method handles both internal names and common names (and even band indices), but with priority for common names at the moment.
But I see this more as a client level API UX optimization, and I'm not sure if it is a good idea to push that to the process graph level API.

I think we could do that for the band names / common names as written above. As you said above, it would make the processes more consistent (you can use band names and common names in all processes) and easier to use. In many cases users will still use the processes directly and then it would make the UX better. I don't really see a problem in allowing both band names and common names through the same parameter. Question is whether common names or band names should be preferred.

@soxofaan
Copy link
Member

I'm not sure whether a client can really judge what is optimal and what isn't. That should be up to the back-end to decide

Indeed, I mis-wrote, I meant "the goal of the VITO geopyspark backend is to optimize ..."

@soxofaan
Copy link
Member

Otherwise some clients may be stuck with incomplete functionality regarding the processes.
Agreed, but those things should not be specified in the API or in the processes.

I'm not sure are talking about the same thing. What I originally suggested as possible solution is to define just these processes on the API level:

  • load_collection(collection_id) (so without spatial, temporal and band extents)
  • filter_temporal(...), filter_spatial(...) and filter_bands(...)

that would make this API orthogonal and avoid duplication issues.

A client library would then be expected to provide at least this minimal API in some form (functions, methods, ...).
Optionally, a client library could extend its load_collection function with additional (optional) arguments to specify spatial/temporal/band filtering directly in a load_collection call.

  • In Python for example, which has this handy keyword arguments feature, it would be a natural choice to do this with "keyword addressable" arguments. Under the hood, this load_collection function call would appropriately inject the relevant filter_ processes in the process graph. Note that the original filter_... methods should be kept too, so you lose orthogonality but win user friendliness.
  • In JavaScript however, maybe it would be more natural to keep orthogonal methods and promote a fluent/chained approach.

@soxofaan
Copy link
Member

I don't really see a problem in allowing both band names and common names through the same parameter. Question is whether common names or band names should be preferred.

If common names are recommend over internal names for interoperability reasons, I think it is more logical to also give them higher priority in this lookup problem

@m-mohr
Copy link
Member Author

m-mohr commented Oct 25, 2019

I'm not sure are talking about the same thing. What I originally suggested as possible solution is to define just these processes on the API level:

  • load_collection(collection_id) (so without spatial, temporal and band extents)
  • filter_temporal(...), filter_spatial(...) and filter_bands(...)

I understood this. We had load_collection(collection_id ) in 0.3 and that was not well received by back-ends (a bit of discussion is in #44). That lead to the current definition of load_collection and I'll only revert this if there's consensus about this for most back-ends.

A client library would then be expected to provide at least this minimal API in some form (functions, methods, ...).
Optionally, a client library could extend its load_collection function with additional (optional) arguments to specify spatial/temporal/band filtering directly in a load_collection call.

That only works if you hard-code behaviour as you do in Python, but R and JS for example generate their UX only based on the process descriptions (which I highly prefer instead of hard-coding behaviour) and then add custom behaviour on top (e.g. band math).

If we split up into what you proposed, this optimization gets either complicated for back-ends or we need additional good documentation that indicates clients must put filter_* directly after load_collection. I don't think this is a good idea and feedback on previous versions of the API supports that.

If common names are recommend over internal names for interoperability reasons, I think it is more logical to also give them higher priority in this lookup problem

That's not the same thing. The recommendation is for the users to use the common names whenever feasible. The back-end implementation to resolve them is a completely different thing. The issue here is that common names may collide with band names (but they shouldn't) and that common names are not unique. Therefore it seems better to first look for band names and then for common names to make resolving it more logical (i.e. more concrete selector vs less concrete selector, as for example also in CSS).

@soxofaan
Copy link
Member

(About the orthogonality issue)

Thanks, I wasn't aware of that historical background.
But does this then mean that some backends do not support these filter_.. processes at all? Or that usage of filter_... (instead of filtering in load_collection process) can have performance implications with some backends? Moreover: does order of filter_... processes matter then as well? And what if there is is both filtering in load_collection and with additional filter_... processes?

At the moment, when I develop against the VITO backend, I tend to use separate filter_... calls instead of doing it directly in load_collection. If I understand correctly this is a pattern that should be avoided and should not trickle down in general documentation, demo's, example notebooks, etc ...

@m-mohr
Copy link
Member Author

m-mohr commented Oct 28, 2019

But does this then mean that some backends do not support these filter_.. processes at all?

At the moment the openEO API doesn't require any of the processes to be implemented. It may happen that some back-ends don't implement filter_..., but most should have it, I guess.

Or that usage of filter_... (instead of filtering in load_collection process) can have performance implications with some backends?

Yes, could surely happen.

Moreover: does order of filter_... processes matter then as well?

Yes, could be the case. It depends on the back-end implementation and their optimizations.

And what if there is is both filtering in load_collection and with additional filter_... processes?

Then the data is filtered twice, I guess? Again, depends on back-end implementation whether that's slow or fast...

At the moment, when I develop against the VITO backend, I tend to use separate filter_... calls instead of doing it directly in load_collection. If I understand correctly this is a pattern that should be avoided and should not trickle down in general documentation, demo's, example notebooks, etc ...

Yes.

m-mohr added a commit to Open-EO/openeo-api that referenced this issue Dec 13, 2019
m-mohr added a commit that referenced this issue Dec 13, 2019
…oad_collection`: Parameter `bands` accepts common band names.
@m-mohr
Copy link
Member Author

m-mohr commented Dec 13, 2019

There's contradicting wording regarding dimension type bands, especially when reading filter_bands and load_collection.

It's not contradicting as explained below:

reduce / filter:

For multi-spectral imagery there is usually a separate dimension of type bands for the bands.

This is just a hint and says that there is usually a dimension of type bands.

filter_bands:

The data cube is expected to have only one dimension of type bands.

filter_bands expects a single dimension of type bands. It's more concrete than load_collection to allow fine-grained control over dimensions.

load_collection (which unfortunately contradicts to filter_bands above):

Applies to all dimensions of type bands if there are multiple of them.

This is more general than filter_bands to be easier to use. Also, you can't control the dimensions before loading in contrast to filter_bands. Nevertheless, there's usually just one dimension of type bands, of course.

Commits Open-EO/openeo-api@2e283ea and e27a460 solve this issue.

@m-mohr m-mohr closed this as completed Dec 13, 2019
m-mohr added a commit to Open-EO/openeo-api that referenced this issue Dec 13, 2019
m-mohr added a commit to Open-EO/openeo-api that referenced this issue Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants