Fix/improve data filtering #84

sinergise-anze · 2019-10-09T10:21:54Z

We are trying to implement support for Sentinel-1 GRD data in the load_collection process. The data in this collection has a property "orbitDirection" (sar:pass_direction), with possible values "ascending", "descending" or null (any). Our backend supports filtering by this property.

In the load_collection process, I assume properties is meant for filtering the data that is to be loaded. However, it is unclear to me what the process graph would look like. Are there any examples available, or could someone please create a short example for this?
The issue with properties (as defined, that is: with callbacks) is that it would imho be quite difficult to convert the properties filters to query parameters for our backend. That means load_collection would need to fetch all metadata, perform filtering on its end (for example, keep just the items with orbitDirection == ascending), and then request data only for the remaining items. Not impossible, just more difficult to implement and a bit less optimal performance-wise. Ideally we would pass the appropriate parameters when fetching metadata, so that the backend would already take care of filtering.

That said, since properties are not finalized yet, one alternative suggestion would be to use filters instead of properties, and use the notation of the STAC Query API. For example:
```
{
  "process_graph": {
    "loadco1": {
      "process_id": "load_collection",
      "arguments": {
        "id": "S1GRD",
        "spatial_extent": ...,
        "temporal_extent": ["2019-08-16", "2019-08-18"],
        "filters": {
          "sar:pass_direction": {
            "eq": "ascending"
          }
        }
      }
    },
    ...
  }
}
```
This would of course raise the question of how to describe which kinds of queries the backend supports (startsWith might be unsupported). Alternatively, load_collection could simulate filtering if the backend lacks such support.
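The fallback mentioned above, where load_collection simulates the filters a backend cannot handle, might look roughly like the sketch below. It assumes the STAC-style `filters` layout from the example; `OPERATORS`, `split_filters` and `matches` are hypothetical names invented for illustration, and the operator list is not meant to be complete.

```python
from typing import Any, Callable, Dict, Set, Tuple

Filters = Dict[str, Dict[str, Any]]

# Hypothetical operator table; the names follow the STAC-style notation above.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda a, b: a == b,
    "lt": lambda a, b: a is not None and a < b,
    "gt": lambda a, b: a is not None and a > b,
    "startsWith": lambda a, b: isinstance(a, str) and a.startswith(b),
    "in": lambda a, b: a in b,
}


def split_filters(filters: Filters, backend_ops: Set[str]) -> Tuple[Filters, Filters]:
    """Split filters into those the backend applies and those simulated locally."""
    pushed: Filters = {}
    local: Filters = {}
    for prop, conditions in filters.items():
        for op, value in conditions.items():
            target = pushed if op in backend_ops else local
            target.setdefault(prop, {})[op] = value
    return pushed, local


def matches(item_properties: Dict[str, Any], filters: Filters) -> bool:
    """Return True if the item's properties satisfy every remaining condition."""
    return all(
        OPERATORS[op](item_properties.get(prop), value)
        for prop, conditions in filters.items()
        for op, value in conditions.items()
    )
```

The `pushed` part would go into the metadata request as query parameters, while `matches` would be applied to each returned item. An operator missing from the table raises a `KeyError`, which seems preferable to silently ignoring a filter.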

I would appreciate some thoughts on this.

m-mohr · 2019-10-10T12:46:34Z

> In the load_collection process, I assume properties is meant for filtering the data that is to be loaded.

Correct.

> However, it is unclear to me what the process graph would look like. Are there any examples available, or could someone please create a short example for this?

I don't have a process graph at hand, but it should be something like:

load_collection("S1", null, null, null, eq(property("sar:pass_direction"), "ascending"))

Notes:

- This is simplified.
- This doesn't work with the current processes. property() is badly defined: it tries to work for load_collection, filter and other processes. property() expects a data cube to be passed, which is not available in load_collection. filter, on the other hand, doesn't pass the data cube to the expression as a callback parameter. So overall, our approach is broken and needs a better solution.

> The issue with properties (as defined, that is: with callbacks) is that it would imho be quite difficult to convert the properties filters to query parameters for our backend. That means load_collection would need to fetch all metadata, perform filtering on its end (for example, keep just the items with orbitDirection == ascending), and then request data only for the remaining items. Not impossible, just more difficult to implement and a bit less optimal performance-wise. Ideally we would pass the appropriate parameters when fetching metadata, so that the backend would already take care of filtering.

Couldn't you just use the callback to create whatever filter you need? I mean the callback could be more than just a set of parameters, e.g. a logical expression with some `and` and `or` etc.

> That said, since properties are not finalized yet, one alternative suggestion would be to use filters instead of properties, and use the notation of the STAC Query API.

I like this idea, but the STAC Query API is not finished yet and is likely to change in the next STAC/WFS sprint. So I'd like to wait for the sprint to happen before deciding on this approach.

sinergise-anze · 2019-10-15T07:25:31Z

@m-mohr Thank you for the explanations, appreciate it! We will try to implement something and keep it as close to the specification as possible. At least this way we will be better prepared with concrete suggestions on how to improve it.

> Couldn't you just use the callback to create whatever filter you need? I mean the callback could be more than just a set of parameters, e.g. a logical expression with some `and` and `or` etc.

We could. Our main concern is that one often wants to limit loading to a subset of the data, which is usually achieved by passing additional parameters to the backend. Generic filtering (which supports logical expressions) is more powerful than just specifying keys/values, so it would need to be implemented on the driver side (instead of just passing the appropriate parameters to the backend), because at least Sentinel Hub doesn't support a similar mechanism. This means that all data (or at least metadata) would need to be loaded by the driver, only to be discarded by filtering, which is not ideal.

The way I see it, there are three options:

1. specify properties as generic callbacks
2. specify properties as keys/values
3. specify properties as keys/values and add filters property (generic callbacks)

I might be biased, but I find the 2nd and 3rd options much easier to implement in an efficient way. :)
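As a rough illustration of the third option, the sketch below passes key/value properties to the backend and applies an optional generic callback to whatever metadata comes back. `load_items`, `fetch_metadata`, `fake_backend` and the item layout are hypothetical names made up for this example; they are not part of openEO or Sentinel Hub.

```python
from typing import Any, Callable, Dict, List, Optional

Item = Dict[str, Any]


def load_items(
    fetch_metadata: Callable[[Dict[str, Any]], List[Item]],
    properties: Dict[str, Any],
    filter_callback: Optional[Callable[[Item], bool]] = None,
) -> List[Item]:
    """Fetch item metadata, filtering on the backend first and locally second."""
    # Key/value properties map directly to backend query parameters,
    # so the backend only returns matching items.
    items = fetch_metadata(properties)
    # A generic callback can only run once the metadata has been loaded.
    if filter_callback is not None:
        items = [item for item in items if filter_callback(item)]
    return items


def fake_backend(params: Dict[str, Any]) -> List[Item]:
    """Stand-in for a real backend call (assumption: exact key/value matching)."""
    catalog = [
        {"id": "a", "orbitDirection": "ascending", "polarization": "DV"},
        {"id": "b", "orbitDirection": "ascending", "polarization": "SH"},
        {"id": "c", "orbitDirection": "descending", "polarization": "DV"},
    ]
    return [i for i in catalog if all(i.get(k) == v for k, v in params.items())]
```

With this shape, the cheap and common case (keys/values) never loads more metadata than needed, and the callback only costs something when a user actually supplies one.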

> I like this idea, but the STAC Query API is not finished yet and is likely to change in the next STAC/WFS sprint. So I'd like to wait for the sprint to happen before deciding on this approach.

👍

Thank you again for the example, will try to implement something similar.

m-mohr · 2019-12-13T10:16:48Z

@sinergise-anze How did you proceed with your implementation? Any lessons learned to share?

Unfortunately, the STAC/OGC meeting ended with a less definitive solution than I had hoped for. OGC is trying to (re-)define a CQL-based query language. STAC will for now stick with an updated version of their query language (probably until CQL is ready).

I have some concerns that the STAC query language is a bit flawed at the moment (see radiantearth/stac-spec#692) and might limit us. CQL is not ready at all yet. Our approach is also flawed (how to define which field to work on?), but it is still better aligned with our general data model and probably relatively easy to fix. Also, I think it is relatively easy to convert into something "STAC-ish".

m-mohr · 2019-12-13T10:47:55Z

Example of two equivalent queries:

STAC:

{
  "query": {
    "eo:cloud_cover": {
      "lt": 50
    },
    "provider": {
      "eq": "Planet"
    },
    "published": {
      "gte": "2018-02-12T00:00:00Z",
      "lte": "2018-03-18T12:31:12Z"
    },
    "pl:item_type": {
      "startsWith": "PSScene"
    },
    "product": {
      "in": ["foo","bar"]
    }
  }
}

openEO, based on what I expect to be in 1.0 (might slightly differ).

{
  "all": {
    "process_id": "all",
    "arguments": {
      "expressions": [
        {
          "process_id": "lt",
          "arguments": {
            "x": {"from_metadata": "eo:cloud_cover"},
            "y": 50
          }
        },
        {
          "process_id": "eq",
          "arguments": {
            "x": {"from_metadata": "provider"},
            "y": "Planet"
          }
        },
        {
          "process_id": "between",
          "arguments": {
            "x": {"from_metadata": "published"},
            "min": "2018-02-12T00:00:00Z",
            "max": "2018-03-18T12:31:12Z"
          }
        },
        {
          "process_id": "text_begins",
          "arguments": {
            "data": {"from_metadata": "pl:item_type"},
            "pattern": "PSScene",
            "case_sensitive": false
          }
        },
        {
          "process_id": "any",
          "arguments": {
            "expressions": [
              {
                "process_id": "array_contains",
                "arguments": {
                  "data": {"from_metadata": "product"},
                  "value": "foo"
                }
              },
              {
                "process_id": "array_contains",
                "arguments": {
                  "data": {"from_metadata": "product"},
                  "value": "bar"
                }
              }
            ]
          }
        }
      ]
    },
    "result": true
  }
}

Yes, that's more verbose...
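Going the other way, a simple conjunction in the openEO-style notation can be flattened back into a STAC-style query. The following is a minimal sketch, assuming the inline node layout from the example above (`process_id`, `arguments`, `from_metadata`). `to_stac_query` is a hypothetical helper, and anything beyond `all`, basic comparisons, `between` and `text_begins` is rejected rather than guessed at.

```python
from typing import Any, Dict

Query = Dict[str, Dict[str, Any]]

# Comparison processes assumed to share their name with a STAC-style operator.
COMPARISONS = {"eq", "lt", "lte", "gt", "gte"}


def _field(arg: Any) -> str:
    """Extract the metadata field name from a {"from_metadata": ...} reference."""
    if isinstance(arg, dict) and "from_metadata" in arg:
        return arg["from_metadata"]
    raise ValueError("expected a metadata reference")


def to_stac_query(node: Dict[str, Any]) -> Query:
    """Convert one expression node into a STAC-style query dict, if possible."""
    pid, args = node["process_id"], node["arguments"]
    if pid == "all":
        query: Query = {}
        for sub in args["expressions"]:
            for field, conditions in to_stac_query(sub).items():
                # Note: a repeated operator on the same field overwrites the earlier one.
                query.setdefault(field, {}).update(conditions)
        return query
    if pid in COMPARISONS:
        return {_field(args["x"]): {pid: args["y"]}}
    if pid == "between":
        return {_field(args["x"]): {"gte": args["min"], "lte": args["max"]}}
    if pid == "text_begins":
        # The case_sensitive flag has no counterpart here and is dropped.
        return {_field(args["data"]): {"startsWith": args["pattern"]}}
    raise ValueError(f"cannot express process {pid!r} as a STAC-style query")
```

A driver could try this conversion first and fall back to filtering the metadata itself whenever a `ValueError` is raised, for example for the nested `any` in the example above.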

sinergise-anze · 2019-12-13T12:28:10Z

@m-mohr We kept it simple. Since the only thing we needed was to pass some options to the backend, we implemented it like this:

{
  "loadco1": {
    "process_id": "load_collection",
    "arguments": {
      "bands": [
        "B04",
        "B08"
      ],
      "id": "S2L1C",
      "temporal_extent": [
        "2019-08-10",
        "2019-08-18"
      ],
      "spatial_extent": {
        "west": 11.2499,
        "east": 16.8750,
        "north": 45.0890,
        "south": 40.9798
      },
      "options": {
        "width": 256,
        "height": 256
      }
    }
  },
  ...
}

Note that this approach proved to be sufficient for the use-cases we have tried, and since it translates more or less 1:1 to what our service supports, I think it should suffice. That said, a more powerful mechanism is of course better, as long as it is not too difficult to implement.

m-mohr · 2020-01-17T14:43:18Z

I finished work on this, see PR #128.

m-mohr changed the title ~~Examples for custom data filtering in load_collection, alternative proposal~~ Custom data filtering in load_collection, alternative proposal Nov 18, 2019

m-mohr mentioned this issue Dec 13, 2019

Fix filters #104

Closed

m-mohr changed the title ~~Custom data filtering in load_collection, alternative proposal~~ Fix/improve data filtering Dec 13, 2019

m-mohr added this to the v1.0 milestone Dec 13, 2019

m-mohr added accepted bug critical and removed accepted labels Dec 13, 2019

m-mohr added a commit that referenced this issue Jan 15, 2020

Removed property process. #84

1babe17

m-mohr added the work in progress label Jan 15, 2020

m-mohr mentioned this issue Jan 17, 2020

Improvements for data filtering and other minor changes #128

Merged

m-mohr added has PR and removed work in progress critical labels Jan 17, 2020

m-mohr closed this as completed Jan 20, 2020
