User-generated Collections #376

Open
m-mohr opened this issue Apr 9, 2021 · 11 comments · May be fixed by #518
m-mohr commented Apr 9, 2021

How do we make user-generated or user-uploaded collections available? We could make them available in /collections (after authentication). Currently load_result and load_uploaded_files are options, but maybe it's also just easier to make them available via collections or a new endpoint. This is very much open to discussion...


jdries commented Apr 19, 2021

Note that this is again a step further than simply loading a result; it may not be possible to do this within the platform project.


m-mohr commented Apr 19, 2021

What exactly is the difference (from your point of view)? Both the user's results and collections need to be loaded into a data cube, so is there a big difference whether they are exposed via the results endpoints or the collections endpoints? Once we are at the point to expose all results as collections instead of items (see #359), it seems to be just a small additional step.

Edit: Removed outdated stuff

@m-mohr m-mohr added this to the future milestone Nov 2, 2022

m-mohr commented Aug 18, 2023

Below, I'm discussing some options for user-generated collections that we can consider for openEO (and SAP07 in openEO Platform @christophreimer).

Options:

  1. File Format Collection via save_result
  2. Process save_collection
  3. STAC API endpoint POST /collections or PUT /collections/{collection_id}
  4. Create from batch job: POST /jobs/{job_id}/collection (or property in batch job metadata)
  5. Keep the APIs separate and use /jobs/{job_id}/results/...
  6. OGC alternative: POST /processes/{processId}/execution?response=collection

General

I'm assuming in the following that the internal storage details (including the file format) are at the discretion of the back-end. It seems that's not true, though, and people want to define the file format.

The collection metadata is generated by the back-end, but can usually be overridden. Some general thoughts:

  • The required license should probably be derived from the source data (or be set to proprietary with a generic link).
  • The extent is derived from the generated product.
  • Most properties can likely be auto-generated by the back-end (e.g. extents, cube:dimensions, summaries, ...). Exceptions are the "descriptive" properties such as title, description and keywords, which are usually set by the user.
  • I'd recommend exposing the updated/created fields.
  • The process graph can be exposed via the STAC processing extension as we do for batch job results - unfortunately this is not consistent with other endpoints where we simply have the process property. (?)

Each option (except 5) creates collections in the GET /collections and GET /collections/{collection_id} endpoints. I assume that for every user-generated collection the following endpoints are available in addition to the existing endpoints in the openEO API:

  • PATCH /collections/{collection_id} - Update selected metadata fields of the STAC Collection
  • DELETE /collections/{collection_id} - Delete the Collection
  • GET /collections and GET /collections/{collection_id}: We probably want to add a property that identifies a user-generated collection, or do we just identify them by the availability of links to the PATCH/DELETE endpoints? How does OGC identify them?

We should try to align the endpoints with OGC APIs and/or STAC Transaction extension (if available).
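A minimal sketch of what such a metadata update could look like, assuming the PATCH endpoint accepts a partial STAC Collection and only changes the submitted fields (the endpoint itself is the proposal above, not an existing part of the openEO API):

PATCH /collections/MyCollection

{
  "title": "My Collection (revised)",
  "keywords": [
    "openEO",
    "user-generated"
  ]
}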

1. File Format Collection via save_result

A pre-defined file format is defined for/added to GET /file_formats.

Example

Response from GET /file_formats:

GET /file_formats

{
  "output": {
    "Collection": {
      "title": "User-generated Collection",
      "gis_data_type": "other",
      "parameters": {
        "id": {
          "description": "Provide an ID for the newly created collection. Errors if the ID exists.",
          "type": "string",
          "minLength": 1,
          "required": true
        },
        "title": {
          "description": "Title for the collection",
          "type": "string",
          "minLength": 1
        },
        "description": {
          "description": "Description for the collection",
          "type": "string",
          "format": "commonmark",
          "minLength": 1
        },
        "keywords": {
          "description": "Keywords for the collection",
          "type": "array",
          "items": {
            "type": "string",
            "minLength": 1
          }
        },
        ...
      }
    }
  }
}
  • The ID parameter is required because if the back-end generated one, we could not inform the user about the generated ID and they would need to search for the collection.
  • A description is required in STAC, so if the user doesn't provide one the back-end would need to generate one.

Usage in a process:

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "format": "Collection",
      "options": {
        "id": "MyCollection",
        "title": "My Collection",
        "description": "This describes my collection"
      }
    }
  }
}

The process behaves as follows for the different processing modes:

  • Synchronous: "Virtual" collection, data is processed on demand (?)
  • Batch Job: Data is pre-computed and collection becomes available once completed
  • Secondary Web Services: Either rejected or data is made available as web service and back-end requests data on demand

Pros

  • No process changes required
  • No or minimal API changes required (see below)

Cons

  • Currently we have no way to pre-define file formats (but we could include it as an option in the schema for output formats in GET /file_formats)
  • Inconsistent, because we have a load_collection process but no input file format for Collections
  • The behavior for synchronous and secondary web services is not intuitive
  • Can't define a specific file format

2. Process save_collection

A new process save_collection is added.

Example

Draft process schema for save_collection:

{
  "id": "save_collection",
  "summary": "Save as collection",
  "description": "Save as a user-generated Collection ...",
  "categories": [
    "cubes",
    "export"
  ],
  "parameters": [
    {
      "name": "id",
      "description": "Provide an ID for the newly created collection. Throws a `CollectionIdExists` exception if the ID exists.",
      "schema": {
        "type": "string",
        "minLength": 1
      }
    },
    {
      "name": "description",
      "description": "Description for the collection",
      "schema": {
        "type": "string",
        "format": "commonmark",
        "minLength": 1
      }
    },
    {
      "name": "title",
      "description": "Title for the collection",
      "schema": {
        "type": ["string", "null"],
        "minLength": 1
      },
      "default": null,
      "optional": true
    },
    {
      "name": "keywords",
      "description": "Keywords for the collection",
      "schema": {
        "type": "array",
        "items": {
          "type": "string",
          "minLength": 1
        }
      },
      "default": [],
      "optional": true
    },
    ...
  ],
  "returns": {
    "description": "Always returns `true` as in case of an error an exception is thrown which aborts the execution of the process.",
    "schema": {
      "type": "boolean",
      "const": true
    }
  },
  "exceptions": {
    "CollectionIdExists": {
      "message": "The given Collection ID exists."
    }
  },
  "links": [
    {
      "rel": "about",
      "href": "https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md",
      "title": "STAC Collection specification"
    }
  ]
}
  • The ID parameter is required because if the back-end generated one, we could not inform the user about the generated ID and they would need to search for the collection.
  • The title can be set to null to omit it, but back-ends can't set the title to null in the metadata as it's not allowed in STAC. So null requires removing the property from the output, which is a bit annoying.
  • A description is required in STAC, so it's also required here. Alternatively, we could allow null, with the caveats that exist for title, too.

Usage in a process:

{
  ...
  "save": {
    "process_id": "save_collection",
    "arguments": {
      "id": "MyCollection",
      "description": "This describes my collection",
      "title": "My Collection"
    }
  }
}

The process behaves as follows for the different processing modes:

  • Synchronous: "Virtual" collection, data is processed on demand (?)
  • Batch Job: Data is pre-computed and collection becomes available once completed
  • Secondary Web Services: Either rejected or data is made available as web service and back-end requests data on demand

Pros

  • No API changes required
  • Consistent with load_collection
  • Could have parameters to define the file format (see save_result)

Cons

  • The behavior for synchronous and secondary web services is weird

3. STAC API endpoint POST /collections or PUT /collections/{collection_id}

We use RESTful CRUD operations for the /collections endpoints, ideally aligned with OGC APIs and/or STAC Transaction extension (but currently undefined).

  • Creating:
    • PUT /collections/{collection_id} - If we allow users to select a Collection ID
    • POST /collections - if we don't allow users to select a Collection ID
  • Reading/Updating/Deleting: see above

Example

This is an example for PUT /collections/{collection_id}. It works similarly for POST, but without a client-chosen ID.

From a batch job result (pre-computed):

PUT /collections/{collection_id}

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection",
  "job_id": "abcd-123"
}

With a process (virtual collection, computed on demand):

PUT /collections/{collection_id}

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection",
  # Variant 1 (preferably?)
  "process": {
    "process_graph": {...},
    ...
  }
  # Variant 2
  "processing:expression": {
    "format": "openeo",
    "expression": {
      "process_graph": {...},
      ...
    }
  }
}

Pros

  • No additional processes required
  • Relatively consistent interface, no confusion around processing modes
  • Potential for alignment with OGC API - Processes
  • File format can be specified in the process graph

Cons

  • In a STAC API, you'd usually submit all metadata and not just a subset. It would likely be inconsistent with STAC/OGC APIs.
  • Do we provide the process in the process property or via the STAC processing extension (property processing:expression)?
  • For the first example: What happens if the batch job gets deleted though? Does the collection get deleted, too?

4. Create from batch job: POST /jobs/{job_id}/collection (or property in batch job metadata)

We use the endpoints for batch jobs to create a collection.

I don't add this to the /jobs/{job_id}/results endpoints as they may conflict with whatever we define in the future for #484.

Example

Separate endpoint

POST /jobs/{job_id}/collection

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection",
  "job_id": "abcd-123"
}

Alternative: Property in batch job metadata

PATCH /jobs/{job_id}

{
  "collection_id": "MyCollection", # or null to not publish / unpublish
  ...
}

Pros

  • Metadata (title, description) can easily be used from the batch job
  • File format can be specified in the process graph

Cons

  • Only pre-computed processing
  • Close relation between jobs and collections. What happens if the batch job gets deleted though? Does the collection get deleted, too?
  • Most "proprietary" solution, no alignment with OGC API - Processes possible.

5. Keep the APIs separate and use /jobs/{job_id}/results/...

We don't include the collections in GET /collections and instead provide separate STAC APIs for batch job results.

We'd get endpoints such as:

  • GET /jobs/{job_id}/results - Landing page + result metadata (but conflicts, see below)
  • GET /jobs/{job_id}/results/collections - Collection Overview (but always just one)
  • GET /jobs/{job_id}/results/collections/{collection_id} - Individual collections (but always just one)

The "cleaner" interface would be to make GET /jobs/{job_id}/results consistent with the GET /collections/{collectionid}/items endpoint in STAC API, but that might lead to issues if clients assume a different endpoint structure.

Pros

Cons

  • Only pre-computed processing
  • Inconsistent with how load_collection works so far (ID is any collection from GET /collections)
  • It is actually not really possible in a standards-compliant and non-breaking way (see the sketch below)
    • openEO allows a Collection or Item at GET /jobs/{job_id}/results
    • STAC API requires the landing page to be a Catalog
  • Each STAC API has just one collection
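A minimal sketch of the conflict mentioned above (response bodies heavily abbreviated; the exact fields depend on the STAC versions in use):

GET /jobs/abcd-123/results (openEO today: a STAC Collection or Item)

{
  "type": "Collection",
  "id": "abcd-123",
  "description": "...",
  ...
}

GET /jobs/abcd-123/results (as a STAC API landing page: must be a Catalog)

{
  "type": "Catalog",
  "id": "abcd-123",
  "conformsTo": [
    "https://api.stacspec.org/v1.0.0/core",
    ...
  ],
  ...
}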

6. OGC alternative: POST /processes/{processId}/execution?response=collection

OGC API - Processes - Part 3 defines a conformance class for "Collection Output".
They send the inputs and outputs to the API endpoint POST /processes/{processId}/execution?response=collection, which creates a collection.
It's not really an option for us as we don't have this endpoint and can't send the request to a single specific process (we usually combine multiple processes), but I want to mention it for context.
Details: https://docs.ogc.org/DRAFTS/21-009.html#_2d1e95c2-17fb-4a88-b3f4-e937fd9ff1a1
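For context, a rough sketch of such an execution request per my reading of the draft (the process ID and the input names are made up; in openEO we would have to map a whole process graph onto this, which is the problem mentioned above):

POST /processes/ndvi/execution?response=collection

{
  "inputs": {
    "data": { "href": "https://example.com/collections/sentinel-2-l2a" },
    "datetime": "2021-04-01/2021-04-30"
  }
}

The response is then a collection description (or a redirect to one) instead of the computed result itself.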

I've asked Jerome St. Louis to alternatively allow sending it to POST /collections or PUT /collections/{collectionId} for alignment (if we also choose this way); the feedback was relatively neutral though.

Summary

Reasonable approaches without any particular order:

  • Process save_collection
  • File Format Collection via save_result
  • API endpoint POST /collections or PUT /collections/{collection_id}
  • Create from batch job: POST /jobs/{job_id}/collection (or property in batch job metadata)

I don't see a strong preference for any of them. It's not completely clear what the process-based proposals do in the different processing modes, and they may not be available in all processing modes. The third option might be one that we can align on with the OGC APIs, so I have a slight tendency towards this approach.

Unreasonable approaches are:

  • Keep the APIs separate and use /jobs/{job_id}/results/...: Too many issues / cons, requires openEO API 2.0
  • OGC alternative: POST /processes/{processId}/execution?response=collection: Incompatible


jdries commented Aug 21, 2023

Thanks for putting this together, we also want to move forward with this!
I do have some important input in terms of assumptions:

  • our users do want control over the output file format, as they might be using the STAC collections outside of openEO as well
  • in most cases, user-defined collections will contain results from multiple jobs, because single jobs are still limited in terms of the spatiotemporal extents that can be processed in a single job.

I see most promise in a combination of options 3 and 4. So basically an API that aligns with STAC transaction to allow management of the collection, and a specific process to add job results to that collection.
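To illustrate how such a combination could look (everything here is just a sketch; export_to_collection and the node name "reduce" are made up, not defined anywhere): first, the collection is created and managed via the API (option 3):

PUT /collections/MyCollection

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection"
}

Then each batch job appends its results to it via a process (option 4-like, but process-based):

{
  ...
  "save": {
    "process_id": "export_to_collection",
    "arguments": {
      "data": {"from_node": "reduce"},
      "collection": "MyCollection"
    },
    "result": true
  }
}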


m-mohr commented Aug 21, 2023

Thank you for these points. Comments below.

our users do want control over the output file format, as they might be using the STAC collections outside of openEO as well

Good to know. I'll add this above. So then it's less of an internal thing, and reading such collections is more related to load_stac than to load_collection, I guess.

in most cases, user-defined collections will contain results from multiple jobs, because single jobs are still limited in terms of the spatiotemporal extents that can be processed in a single job.

Hmm, this sounds like a back-end issue that should be handled internally and not by the API.
I assume that users submit a job with a large extent and internally (e.g. by the aggregator) the job is split into pieces, but then it's also the job of the "splitting entity" to combine the results into a single collection.

On the other hand, it could be generally useful if users could combine multiple results. But then I'm wondering about the consistency of the data. How can this be ensured? It feels cleaner and safer to just allow a "controlled entity" like the aggregator to combine data under a single collection, if at all.

I see most promise in a combination of options 3 and 4. So basically an API that aligns with STAC transaction to allow management of the collection, and a specific process to add job results to that collection.

I'm not sure whether I made it clear enough in my post, but unfortunately STAC Transactions are currently only defined on the item level. So you can't create a collection through the API and you'd need to push all items individually (or as an ItemCollection?), which is rather bad. We'd need to define collection-level transactions and contribute them back to STAC/OGC, which I assume will be a rather lengthy process. It seems unfeasible during the runtime of SAP07, at least.
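For reference, the existing Transaction extension only defines item-level endpoints such as the following, so each Item (or ItemCollection) would have to be pushed separately into a collection that already exists:

POST /collections/{collection_id}/items
PUT /collections/{collection_id}/items/{item_id}
PATCH /collections/{collection_id}/items/{item_id}
DELETE /collections/{collection_id}/items/{item_id}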


jdries commented Aug 21, 2023

Another example of multiple jobs contributing to a collection is when you add a new product, for instance, every 10 days as new satellite observations come in. Here I don't see how we could make that work with a single batch job.
Collection consistency is up to the user to handle, just like for every other STAC collection. Perhaps openEO can help a little bit, for instance by checking if band labels in the items match up with the collection metadata.


m-mohr commented Aug 21, 2023

That's a good point. So saving to an existing collection should add data instead of overwriting it or throwing an error.
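A sketch of what that could mean for the save_result variant from option 1, assuming the "errors if the ID exists" rule above is relaxed to "appends if the ID exists" (the semantics are hypothetical):

Job 1 (creates the collection):

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "format": "Collection",
      "options": {
        "id": "MyCollection",
        "description": "This describes my collection"
      }
    }
  }
}

Job 2, e.g. 10 days later (appends new items to the existing collection):

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "format": "Collection",
      "options": {
        "id": "MyCollection"
      }
    }
  }
}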

@christophreimer

@GeraldIr fyi

@m-mohr m-mohr linked a pull request Oct 11, 2023 that will close this issue

m-mohr commented Dec 6, 2023

Idea from the openEO community call:

  • create new STAC subtype
  • update save_result to return STAC subtype
  • add processes to allow saving to a workspace and saving to a collection that accept the STAC subtype (see the sketch below)
    • for user collections we need a STAC API extension that allows managing STAC collections (similar to STAC Transactions for Items)
  • add process to merge STAC collections
  • maybe: add process to update STAC metadata or work with STAC tooling
  • workspaces:
    • check current API against EODC implementation (GI)
    • update/remove the collection saving/updating part to reflect ideas from above
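A sketch of how the chained processes could look in a process graph, assuming save_result returns the new STAC subtype and reusing the save_collection name from option 2 as the process that accepts it (the chaining, the node names and the format name are purely illustrative):

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "data": {"from_node": "reduce"},
      "format": "GTiff"
    }
  },
  "publish": {
    "process_id": "save_collection",
    "arguments": {
      "data": {"from_node": "save"},
      "id": "MyCollection"
    },
    "result": true
  }
}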

m-mohr added a commit to Open-EO/openeo-processes that referenced this issue Dec 8, 2023
`save_results` returns the STAC resource instead of boolean `true` Open-EO/openeo-api#376

m-mohr commented Dec 8, 2023

PR: Open-EO/openeo-processes#485


m-mohr commented Dec 8, 2023

Proposed extension for Collection Transactions in STAC APIs:
https://github.com/stac-api-extensions/collection-transaction
