Fix/improve data filtering #84

Closed
sinergise-anze opened this issue Oct 9, 2019 · 6 comments

@sinergise-anze

We are trying to implement support for Sentinel-1 GRD data in the load_collection process. The data in this collection has a property "orbitDirection" (sar:pass_direction), with possible values "ascending", "descending" or null (any). Our backend supports filtering by this property.

  1. In the load_collection process, I assume properties is meant for filtering the data that is to be loaded. However, it is unclear to me what the process graph would look like. Are there any examples available, or could someone please create a short example for this?

  2. The issue with properties (as defined, that is: with callbacks) is that it would imho be quite difficult to convert the properties filters to query parameters for our backend. This means that load_collection would need to fetch all metadata, perform filtering on its end (for example, keep just the items with orbitDirection == ascending), and then request only data for the remaining items. Not impossible, just more difficult to implement and a bit less optimal performance-wise. Ideally we would want to pass the appropriate parameters when fetching metadata, so that the backend would already take care of filtering.

    That said, since properties are not finalized yet, one alternative suggestion would be to use filters instead of properties, and to use the notation of the STAC Query API. For example:

    {
      "process_graph": {
        "loadco1": {
          "process_id": "load_collection",
          "arguments": {
            "id": "S1GRD",
            "spatial_extent": ...,
            "temporal_extent": ["2019-08-16", "2019-08-18"],
            "filters": {
              "sar:pass_direction": {
                "eq": "ascending"
              }
            }
          }
        },
        ...
      }
    }
    

    This would of course raise the question of how to describe which kinds of queries the backend supports (startsWith might be unsupported, for example). Alternatively, load_collection could simulate filtering if the backend lacks such support.
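The trade-off described above could be pictured as a driver that splits the STAC-style filters into those the backend can evaluate itself and those that must be simulated on metadata that was already fetched. This is only a minimal sketch: SUPPORTED_OPS, LOCAL_OPS, split_filters and matches are hypothetical names, and the item metadata is assumed to be a flat dict. None of this is part of openEO or Sentinel Hub.

```python
import operator

# Hypothetical: operators this particular backend can evaluate itself.
SUPPORTED_OPS = {"eq"}

# Client-side fallbacks for operators the backend lacks (illustrative subset).
LOCAL_OPS = {
    "eq": operator.eq,
    "lt": operator.lt,
    "gt": operator.gt,
    "startsWith": lambda value, prefix: str(value).startswith(prefix),
}


def split_filters(filters):
    """Split STAC-style filters into backend parameters and local checks."""
    pushed, local = {}, []
    for prop, conditions in filters.items():
        for op, expected in conditions.items():
            if op == "eq" and op in SUPPORTED_OPS:
                pushed[prop] = expected  # becomes a plain key/value parameter
            elif op in LOCAL_OPS:
                local.append((prop, op, expected))  # simulated by the driver
            else:
                raise ValueError(f"unsupported operator: {op}")
    return pushed, local


def matches(item, local):
    """Apply the remaining local checks to one item's metadata dict."""
    return all(
        prop in item and LOCAL_OPS[op](item[prop], expected)
        for prop, op, expected in local
    )


filters = {
    "sar:pass_direction": {"eq": "ascending"},
    "title": {"startsWith": "S1A"},  # "title" is an assumed property name
}
pushed, local = split_filters(filters)
```

Here pushed would be sent along when fetching metadata, while local is applied to each returned item with matches.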

I would appreciate some thoughts on this.

@m-mohr
Member

m-mohr commented Oct 10, 2019

> 1. In load_collection process, I assume properties is meant for filtering the data that is to be loaded.

Correct.

> However, it is unclear to me what the process graph would look like. Are there any examples available, or could someone please create a short example for this?

I don't have a process graph at hand, but it should be something like:

load_collection("S1", null, null, null, eq(property("sar:pass_direction"), "ascending"))

Notes:

  1. This is simplified.
  2. This doesn't work with the current processes. property() is badly defined because it tries to serve load_collection, filter and other processes at once. It expects a data cube to be passed, which is not available in load_collection. filter, on the other hand, doesn't pass the data cube to the expression as a callback parameter. So overall, our approach is broken and needs a better solution.

> 2. The issue with properties (as defined, that is: with callbacks) is that it would imho be quite difficult to convert the properties filters to query parameters for our backend. This means that load_collection would need to fetch all metadata, perform filtering on its end (for example, keep just the items with orbitDirection == ascending), and then request only data for the remaining items. Not impossible, just more difficult to implement and a bit less optimal performance-wise. Ideally we would want to pass the appropriate parameters when fetching metadata, so that the backend would already take care of filtering.

Couldn't you just use the callback to create whatever filter you need? I mean the callback could be more than just a set of parameters, e.g. a logical expression that combines conditions with "and", "or", etc.
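To illustrate what "more than just a set of parameters" could mean, here is a minimal sketch that evaluates a nested logical expression against one item's metadata. The tuple-based expression format and the evaluate helper are assumptions made for illustration. They mirror the simplified notation above, not an actual openEO process graph.

```python
import operator

# Comparison processes supported by this sketch (illustrative subset).
COMPARISONS = {"eq": operator.eq, "lt": operator.lt, "gt": operator.gt}


def evaluate(expr, metadata):
    """Evaluate a nested expression such as ("eq", ("property", name), value)."""
    if not isinstance(expr, tuple):
        return expr  # literal value
    name, *args = expr
    if name == "property":
        return metadata.get(args[0])
    if name == "and":
        return all(evaluate(arg, metadata) for arg in args)
    if name == "or":
        return any(evaluate(arg, metadata) for arg in args)
    if name in COMPARISONS:
        left, right = (evaluate(arg, metadata) for arg in args)
        if left is None or right is None:
            return False  # a missing property never matches
        return COMPARISONS[name](left, right)
    raise ValueError(f"unknown process: {name}")


callback = (
    "and",
    ("eq", ("property", "sar:pass_direction"), "ascending"),
    ("lt", ("property", "eo:cloud_cover"), 50),
)
item = {"sar:pass_direction": "ascending", "eo:cloud_cover": 10}
keep = evaluate(callback, item)
```

A driver that filters on its own side would call evaluate once per item and keep only the items for which it returns True.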

> That said, since properties are not finalized yet, one alternative suggestion would be to use filters instead of properties, and to use the notation of the STAC Query API.

I like this idea, but the STAC Query API is not finished yet and is likely to change in the next STAC/WFS sprint. So I'd like to wait for the sprint to happen before deciding on this approach.

@sinergise-anze
Author

@m-mohr Thank you for the explanations, appreciate it! We will try to implement something and keep it as close to the specification as possible. That way we will at least be better prepared with concrete suggestions on how to improve it.

> Couldn't you just use the callback to create whatever filter you need? I mean the callback could be more than just a set of parameters, e.g. a logical expression that combines conditions with "and", "or", etc.

We could, but our main concern is that one often wants to limit loading to a subset of the data, which is usually achieved by passing additional parameters to the backend. Generic filtering (which supports logical expressions) is more powerful than just specifying keys/values, so it would need to be implemented in the driver (instead of just passing the appropriate parameters to the backend), because at least Sentinel Hub doesn't support a similar mechanism. This means that all data (or at least all metadata) would need to be loaded by the driver, only to be discarded by filtering, which is not ideal.

The way I see it, there are three options:

  1. specify properties as generic callbacks
  2. specify properties as keys/values
  3. specify properties as keys/values and add filters property (generic callbacks)

I might be biased, but I find the 2nd and 3rd options much easier to implement in an efficient way. :)
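The difference between option 1 and options 2 and 3 can be sketched as a "push down what you can" step. If a generic callback happens to be a plain conjunction of equality checks, it reduces to keys/values that can be passed to the backend directly. Anything else would fall back to filtering in the driver. The tuple format and to_key_values are hypothetical and only meant as an illustration.

```python
def to_key_values(expr):
    """Return {property: value} if expr consists only of eq checks joined by
    "and"; return None otherwise (the driver would have to filter itself)."""
    if not isinstance(expr, tuple) or not expr:
        return None
    name, *args = expr
    if name == "eq" and len(args) == 2:
        left, right = args
        is_property = isinstance(left, tuple) and len(left) == 2 and left[0] == "property"
        if is_property and not isinstance(right, tuple):
            return {left[1]: right}
        return None
    if name == "and":
        merged = {}
        for arg in args:
            part = to_key_values(arg)
            if part is None:
                return None
            for key, value in part.items():
                if key in merged and merged[key] != value:
                    return None  # contradictory checks, do not push down
                merged[key] = value
        return merged
    return None  # "or", comparisons other than eq, etc.


simple = ("eq", ("property", "sar:pass_direction"), "ascending")
params = to_key_values(simple)
```

For the simple callback this yields a single key/value pair, which corresponds to option 2. A callback that uses "or" yields None, so it would need the generic path of option 1.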

> I like this idea, but the STAC Query API is not finished yet and is likely to change in the next STAC/WFS sprint. So I'd like to wait for the sprint to happen before deciding on this approach.

👍

Thank you again for the example, will try to implement something similar.

m-mohr changed the title from "Examples for custom data filtering in load_collection, alternative proposal" to "Custom data filtering in load_collection, alternative proposal" on Nov 18, 2019
@m-mohr
Member

m-mohr commented Dec 13, 2019

@sinergise-anze How did you proceed with your implementation? Any lessons learned to share?

Unfortunately, the STAC/OGC meeting ended with a less definitive solution than I had hoped for. OGC is trying to (re-)define a CQL-based query language. STAC will stick with an updated version of its own query language for now (probably until CQL is ready).

I have some concerns that the STAC query language is a bit flawed at the moment (see radiantearth/stac-spec#692) and might limit us. CQL is not ready at all yet. Our approach is also flawed (how do we define which field to work on?), but it is still better aligned with our general data model and probably relatively easy to fix. I also think it is relatively easy to convert it into something "STAC-ish".

@m-mohr
Member

m-mohr commented Dec 13, 2019

Example of two equivalent queries:

STAC:

{
  "query": {
    "eo:cloud_cover": {
      "lt": 50
    },
    "provider": {
      "eq": "Planet"
    },
    "published": {
      "gte": "2018-02-12T00:00:00Z",
      "lte": "2018-03-18T12:31:12Z"
    },
    "pl:item_type": {
      "startsWith": "PSScene"
    },
    "product": {
      "in": ["foo","bar"]
    }
  }
}

openEO, based on what I expect to be in 1.0 (might differ slightly):

{
  "all": {
    "process_id": "all",
    "arguments": {
      "expressions": [
        {
          "process_id": "lt",
          "arguments": {
            "x": {"from_metadata": "eo:cloud_cover"},
            "y": 50
          }
        },
        {
          "process_id": "eq",
          "arguments": {
            "x": {"from_metadata": "provider"},
            "y": "Planet"
          }
        },
        {
          "process_id": "between",
          "arguments": {
            "x": {"from_metadata": "published"},
            "min": "2018-02-12T00:00:00Z",
            "max": "2018-03-18T12:31:12Z"
          }
        },
        {
          "process_id": "text_begins",
          "arguments": {
            "data": {"from_metadata": "pl:item_type"},
            "pattern": "PSScene",
            "case_sensitive": false
          }
        },
        {
          "process_id": "any",
          "arguments": {
            "expressions": [
              {
                "process_id": "array_contains",
                "arguments": {
                  "data": {"from_metadata": "product"},
                  "value": "foo"
                }
              },
              {
                "process_id": "array_contains",
                "arguments": {
                  "data": {"from_metadata": "product"},
                  "value": "bar"
                }
              }
            ]
          }
        }
      ]
    },
    "result": true
  }
}

Yes, that's more verbose...
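Since the two notations line up operator by operator, the more verbose form could in principle be generated from the compact one. The following is a rough sketch of such a conversion. The mapping (gte plus lte to between, startsWith to text_begins, in to any over array_contains) is read off the example above, and stac_to_openeo and _call are hypothetical helpers. The openEO side is only the expected 1.0 notation, so the argument names might differ in the final specification.

```python
def _call(process_id, **arguments):
    """Build one node in the nested notation used in the example above."""
    return {"process_id": process_id, "arguments": arguments}


def stac_to_openeo(query):
    """Convert a STAC-style query dict into a single "all" expression."""
    expressions = []
    for prop, conditions in query.items():
        field = {"from_metadata": prop}
        remaining = dict(conditions)  # copy so handled operators can be popped
        # A gte/lte pair on one property collapses into "between".
        if "gte" in remaining and "lte" in remaining:
            expressions.append(
                _call("between", x=field, min=remaining.pop("gte"), max=remaining.pop("lte"))
            )
        for op, value in remaining.items():
            if op in {"eq", "neq", "lt", "lte", "gt", "gte"}:
                expressions.append(_call(op, x=field, y=value))
            elif op == "startsWith":
                # case_sensitive=False mirrors the example above.
                expressions.append(
                    _call("text_begins", data=field, pattern=value, case_sensitive=False)
                )
            elif op == "in":
                choices = [_call("array_contains", data=field, value=v) for v in value]
                expressions.append(_call("any", expressions=choices))
            else:
                raise ValueError(f"no mapping for operator: {op}")
    return {"all": {**_call("all", expressions=expressions), "result": True}}


stac_query = {
    "eo:cloud_cover": {"lt": 50},
    "published": {"gte": "2018-02-12T00:00:00Z", "lte": "2018-03-18T12:31:12Z"},
    "pl:item_type": {"startsWith": "PSScene"},
    "product": {"in": ["foo", "bar"]},
}
openeo_filter = stac_to_openeo(stac_query)
```

The reverse direction is not generally possible, because nested "any"/"all" combinations have no counterpart in the flat STAC notation.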

@m-mohr m-mohr mentioned this issue Dec 13, 2019
@sinergise-anze
Author

@m-mohr We kept it simple. Since the only thing we needed was to pass some options to the backend, we implemented it like this:

{
  "loadco1": {
    "process_id": "load_collection",
    "arguments": {
      "bands": [
        "B04",
        "B08"
      ],
      "id": "S2L1C",
      "temporal_extent": [
        "2019-08-10",
        "2019-08-18"
      ],
      "spatial_extent": {
        "west": 11.2499,
        "east": 16.8750,
        "north": 45.0890,
        "south": 40.9798
      },
      "options": {
        "width": 256,
        "height": 256
      }
    }
  },
  ...
}

Note that this approach proved to be sufficient for the use-cases we have tried, and since it translates more or less 1:1 to what our service supports, I think it should suffice. That said, a more powerful mechanism is of course better, as long as it is not too difficult to implement.
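As a rough illustration of the "1:1" translation mentioned above, a driver could merge the options into its request parameters after checking them against a known set. This is only a sketch: build_request, ALLOWED_OPTIONS and the parameter names collection, bbox, time and bands are all hypothetical and do not describe the actual Sentinel Hub interface.

```python
# Hypothetical: option names the service accepts; anything else is rejected.
ALLOWED_OPTIONS = {"width", "height"}


def build_request(arguments):
    """Turn load_collection arguments into flat request parameters."""
    options = arguments.get("options", {})
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    extent = arguments["spatial_extent"]
    params = {
        "collection": arguments["id"],
        "bbox": [extent["west"], extent["south"], extent["east"], extent["north"]],
        "time": "/".join(arguments["temporal_extent"]),
        "bands": list(arguments.get("bands", [])),
    }
    params.update(options)  # options map 1:1 onto request parameters
    return params


request = build_request({
    "id": "S2L1C",
    "bands": ["B04", "B08"],
    "temporal_extent": ["2019-08-10", "2019-08-18"],
    "spatial_extent": {"west": 11.2499, "east": 16.8750, "north": 45.0890, "south": 40.9798},
    "options": {"width": 256, "height": 256},
})
```

Rejecting unknown option names early keeps typos from being silently ignored by the backend.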

m-mohr changed the title from "Custom data filtering in load_collection, alternative proposal" to "Fix/improve data filtering" on Dec 13, 2019
@m-mohr m-mohr added this to the v1.0 milestone Dec 13, 2019
m-mohr added a commit that referenced this issue Jan 15, 2020
@m-mohr
Member

m-mohr commented Jan 17, 2020

I finished work on this, see PR #128.
