User-generated Collections #376

Open
m-mohr opened this issue Apr 9, 2021 · 11 comments · May be fixed by #518
m-mohr commented Apr 9, 2021

How do we make user-generated or user-uploaded collections available? We could make them available in /collections (after authentication). Currently load_result and load_uploaded_files are options, but maybe it's also just easier to make them available via collections or a new endpoint. This is very much open to discussion...


jdries commented Apr 19, 2021

Note that this is again a step further than simply loading a result; it may not be possible to do this within the platform project.


m-mohr commented Apr 19, 2021

What exactly is the difference (from your point of view)? Both the user's results and collections need to be loaded into a data cube, so is there a big difference whether they are exposed via the results endpoints or the collections endpoints? Once we are at the point to expose all results as collections instead of items (see #359), it seems to be just a small additional step.

Edit: Removed outdated stuff

@m-mohr m-mohr added this to the future milestone Nov 2, 2022

m-mohr commented Aug 18, 2023

Below, I'm discussing some options for user-generated collections that we can consider for openEO (and SAP07 in openEO Platform @christophreimer).

Options:

  1. File Format Collection via save_result
  2. Process save_collection
  3. STAC API endpoint POST /collections or PUT /collections/{collection_id}
  4. Create from batch job: POST /jobs/{job_id}/collection (or property in batch job metadata)
  5. Keep the APIs separate and use /jobs/{job_id}/results/...
  6. OGC alternative: POST /processes/{processId}/execution?response=collection

General

I'm assuming in the following that the internal storage details (including the file format) are at the discretion of the back-end. It seems that's not true, though, and people want to define the file format.

The collection metadata is generated by the back-end, but can usually be overridden. Some general thoughts:

  • The required license should probably be derived from the source data (or be set to proprietary with a generic link).
  • The extent is derived from the generated product.
  • Most properties can likely be auto-generated by the back-end (e.g. extents, cube:dimensions, summaries, ...). Exceptions are the "descriptive" properties such as title, description and keywords, which are usually set by the user.
  • I'd recommend exposing the updated/created fields.
  • The process graph can be exposed via the STAC processing extension as we do for batch job results - unfortunately this is not consistent with other endpoints where we simply have the process property. (?)

Each option (except 5) creates collections in the GET /collections and GET /collections/{collection_id} endpoints. I assume that for every user-generated collection the following endpoints are available in addition to the existing endpoints in the openEO API:

  • PATCH /collections/{collection_id} - Update selected metadata fields of the STAC Collection
  • DELETE /collections/{collection_id} - Delete the Collection
  • GET /collections and GET /collections/{collection_id}: We probably want to add a property that identifies a user-generated collection, or do we just identify them by the availability of links to the PATCH/DELETE endpoints? How does OGC identify them?

We should try to align the endpoints with OGC APIs and/or STAC Transaction extension (if available).
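A minimal sketch of what such a metadata update could look like, assuming the PATCH endpoint accepts a partial STAC Collection and only changes the submitted fields (the endpoint itself is the proposal above, not an existing part of the openEO API):

PATCH /collections/MyCollection

{
  "title": "My Collection (revised)",
  "keywords": [
    "openEO",
    "user-generated"
  ]
}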

1. File Format Collection via save_result

A pre-defined file format is defined for/added to GET /file_formats.

Example

Response from GET /file_formats:

GET /file_formats

{
  "output": {
    "Collection": {
      "title": "User-generated Collection",
      "gis_data_type": "other",
      "parameters": {
        "id": {
          "description": "Provide an ID for the newly created collection. Errors if the ID exists.",
          "type": "string",
          "minLength": 1,
          "required": true
        },
        "title": {
          "description": "Title for the collection",
          "type": "string",
          "minLength": 1
        },
        "description": {
          "description": "Description for the collection",
          "type": "string",
          "format": "commonmark",
          "minLength": 1
        },
        "keywords": {
          "description": "Keywords for the collection",
          "type": "array",
          "items": {
            "type": "string",
            "minLength": 1
          }
        },
        ...
      }
    }
  }
}
  • The ID parameter is required because if the back-end generated one, we could not inform the user about the generated ID and they would need to search for the collection.
  • A description is required in STAC, so if the user doesn't provide one the back-end would need to generate one.

Usage in a process:

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "format": "Collection",
      "options": {
        "id": "MyCollection",
        "title": "My Collection",
        "description": "This describes my collection"
      }
    }
  }
}

The process behaves as follows for the different processing modes:

  • Synchronous: "Virtual" collection, data is processed on demand (?)
  • Batch Job: Data is pre-computed and collection becomes available once completed
  • Secondary Web Services: Either rejected or data is made available as web service and back-end requests data on demand

Pros

  • No process changes required
  • No or minimal API changes required (see below)

Cons

  • Currently we have no way to pre-define file formats (but we could include it as an option in the schema for output formats in GET /file_formats)
  • Inconsistent, because we have a load_collection process but no input file format for Collections
  • The behavior for synchronous and secondary web services is not intuitive
  • Can't define a specific file format

2. Process save_collection

A new process save_collection is added.

Example

Draft process schema for save_collection:

{
  "id": "save_collection",
  "summary": "Save as collection",
  "description": "Save as a user-generated Collection ...",
  "categories": [
    "cubes",
    "export"
  ],
  "parameters": [
    {
      "name": "id",
      "description": "Provide an ID for the newly created collection. Throws a `CollectionIdExists` exception if the ID exists.",
      "schema": {
        "type": "string",
        "minLength": 1
      }
    },
    {
      "name": "description",
      "description": "Description for the collection",
      "schema": {
        "type": "string",
        "format": "commonmark",
        "minLength": 1
      }
    },
    {
      "name": "title",
      "description": "Title for the collection",
      "schema": {
        "type": ["string", "null"],
        "minLength": 1
      },
      "default": null,
      "optional": true
    },
    {
      "name": "keywords",
      "description": "Keywords for the collection",
      "schema": {
        "type": "array",
        "items": {
          "type": "string",
          "minLength": 1
        }
      },
      "default": [],
      "optional": true
    },
    ...
  ],
  "returns": {
    "description": "Always returns `true` as in case of an error an exception is thrown which aborts the execution of the process.",
    "schema": {
      "type": "boolean",
      "const": true
    }
  },
  "exceptions": {
    "CollectionIdExists": {
      "message": "The given Collection ID exists."
    }
  },
  "links": [
    {
      "rel": "about",
      "href": "https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md",
      "title": "STAC Collection specification"
    }
  ]
}
  • The ID parameter is required because if the back-end generated one, we could not inform the user about the generated ID and they would need to search for the collection.
  • The title can be set to null to omit it, but back-ends can't set the title to null in the metadata as it's not allowed in STAC. So null requires removing the property from the output, which is a bit annoying.
  • A description is required in STAC, so it's also required here. Alternatively, we could allow null, with the caveats that exist for title, too.

Usage in a process:

{
  ...
  "save": {
    "process_id": "save_collection",
    "arguments": {
      "id": "MyCollection",
      "description": "This describes my collection",
      "title": "My Collection"
    }
  }
}

The process behaves as follows for the different processing modes:

  • Synchronous: "Virtual" collection, data is processed on demand (?)
  • Batch Job: Data is pre-computed and collection becomes available once completed
  • Secondary Web Services: Either rejected or data is made available as web service and back-end requests data on demand

Pros

  • No API changes required
  • Consistent with load_collection
  • Could have parameters to define the file format (see save_result)

Cons

  • The behavior for synchronous and secondary web services is weird

3. STAC API endpoint POST /collections or PUT /collections/{collection_id}

We use RESTful CRUD operations for the /collections endpoints, ideally aligned with OGC APIs and/or STAC Transaction extension (but currently undefined).

  • Creating:
    • PUT /collections/{collection_id} - If we allow users to select a Collection ID
    • POST /collections - if we don't allow users to select a Collection ID
  • Reading/Updating/Deleting: see above

Example

This is an example for PUT /collections/{collection_id}. It works similarly for POST, but without a client-chosen ID.

From a batch job result (pre-computed):

PUT /collections/{collection_id}

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection",
  "job_id": "abcd-123"
}

With a process (virtual collection, computed on demand):

PUT /collections/{collection_id}

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection",
  # Variant 1 (preferably?)
  "process": {
    "process_graph": {...},
    ...
  }
  # Variant 2
  "processing:expression": {
    "format": "openeo",
    "expression": {
      "process_graph": {...},
      ...
    }
  }
}

Pros

  • No additional processes required
  • Relatively consistent interface, no confusion around processing modes
  • Potential for alignment with OGC API - Processes
  • File format can be specified in the process graph

Cons

  • In a STAC API, you'd usually submit all metadata and not just a subset. It would likely be inconsistent with STAC/OGC APIs.
  • Do we provide the process in the process property or via the STAC processing extension (property processing:expression)?
  • For the first example: What happens if the batch job gets deleted though? Does the collection get deleted, too?

4. Create from batch job: POST /jobs/{job_id}/collection (or property in batch job metadata)

We use the endpoints for batch jobs to create a collection.

I don't add this to the /jobs/{job_id}/results endpoints as they may conflict with whatever we define in the future for #484.

Example

Separate endpoint

POST /jobs/{job_id}/collection

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection",
  "job_id": "abcd-123"
}

Alternative: Property in batch job metadata

PATCH /jobs/{job_id}

{
  "collection_id": "MyCollection", # or null to not publish / unpublish
  ...
}

Pros

  • Metadata (title, description) can easily be used from the batch job
  • File format can be specified in the process graph

Cons

  • Only pre-computed processing
  • Close relation between jobs and collections. What happens if the batch job gets deleted though? Does the collection get deleted, too?
  • Most "proprietary" solution, no alignment with OGC API - Processes possible.

5. Keep the APIs separate and use /jobs/{job_id}/results/...

We don't include the collections in GET /collections and instead provide separate STAC APIs for batch job results.

We'd get endpoints such as:

  • GET /jobs/{job_id}/results - Landing page + result metadata (but conflicts, see below)
  • GET /jobs/{job_id}/results/collections - Collection Overview (but always just one)
  • GET /jobs/{job_id}/results/collections/{collection_id} - Individual collections (but always just one)

The "cleaner" interface would be to make GET /jobs/{job_id}/results consistent with the GET /collections/{collectionid}/items endpoint in STAC API, but that might lead to issues if clients assume a different endpoint structure.

Pros

Cons

  • Only pre-computed processing
  • Inconsistent with how load_collection works so far (ID is any collection from GET /collections)
  • It is actually not really possible in a standards-compliant and non-breaking way (see the sketch below)
    • openEO allows a Collection or Item at GET /jobs/{job_id}/results
    • STAC API requires the landing page to be a Catalog
  • Each STAC API has just one collection
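A minimal sketch of the conflict mentioned above (response bodies heavily abbreviated; the exact fields depend on the STAC versions in use):

GET /jobs/abcd-123/results (openEO today: a STAC Collection or Item)

{
  "type": "Collection",
  "id": "abcd-123",
  "description": "...",
  ...
}

GET /jobs/abcd-123/results (as a STAC API landing page: must be a Catalog)

{
  "type": "Catalog",
  "id": "abcd-123",
  "conformsTo": [
    "https://api.stacspec.org/v1.0.0/core",
    ...
  ],
  ...
}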

6. OGC alternative: POST /processes/{processId}/execution?response=collection

OGC API - Processes - Part 3 defines a conformance class for "Collection Output".
They send the inputs and outputs to the API endpoint POST /processes/{processId}/execution?response=collection, which creates a collection.
It's not really an option for us as we don't have this endpoint and can't send the request to a single specific process (we usually combine multiple processes), but I want to mention it for context.
Details: https://docs.ogc.org/DRAFTS/21-009.html#_2d1e95c2-17fb-4a88-b3f4-e937fd9ff1a1
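For context, a rough sketch of such an execution request per my reading of the draft (the process ID and the input names are made up; in openEO we would have to map a whole process graph onto this, which is the problem mentioned above):

POST /processes/ndvi/execution?response=collection

{
  "inputs": {
    "data": { "href": "https://example.com/collections/sentinel-2-l2a" },
    "datetime": "2021-04-01/2021-04-30"
  }
}

The response is then a collection description (or a redirect to one) instead of the computed result itself.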

I've asked Jerome St. Louis to alternatively allow sending it to POST /collections or PUT /collections/{collectionId} for alignment (if we also choose this way); the feedback was relatively neutral though.

Summary

Reasonable approaches without any particular order:

  • Process save_collection
  • File Format Collection via save_result
  • API endpoint POST /collections or PUT /collections/{collection_id}
  • Create from batch job: POST /jobs/{job_id}/collection (or property in batch job metadata)

I don't see a strong preference for any of them. It's not completely clear what the process-based proposals do in the different processing modes, and they may not be available in all processing modes. The third option might be one that we can align on with the OGC APIs, so I have a slight tendency towards this approach.

Unreasonable approaches are:

  • Keep the APIs separate and use /jobs/{job_id}/results/...: Too many issues / cons, requires openEO API 2.0
  • OGC alternative: POST /processes/{processId}/execution?response=collection: Incompatible


jdries commented Aug 21, 2023

Thanks for putting this together, we also want to move forward with this!
I do have some important input in terms of assumptions:

  • our users do want control over the output file format, as they might be using the STAC collections outside of openEO as well
  • in most cases, user-defined collections will contain results from multiple jobs, because single jobs are still limited in terms of the spatiotemporal extents that can be processed in a single job.

I see most promise in a combination of options 3 and 4. So basically an API that aligns with STAC transaction to allow management of the collection, and a specific process to add job results to that collection.
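To illustrate how such a combination could look (everything here is just a sketch; export_to_collection and the node name "reduce" are made up, not defined anywhere): first, the collection is created and managed via the API (option 3):

PUT /collections/MyCollection

{
  "id": "MyCollection",
  "title": "My Collection",
  "description": "This describes my collection"
}

Then each batch job appends its results to it via a process (option 4-like, but process-based):

{
  ...
  "save": {
    "process_id": "export_to_collection",
    "arguments": {
      "data": {"from_node": "reduce"},
      "collection": "MyCollection"
    },
    "result": true
  }
}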


m-mohr commented Aug 21, 2023

Thank you for these points. Comments below.

our users do want control over the output file format, as they might be using the STAC collections outside of openEO as well

Good to know. I'll add this above. So then it's less of an internal thing, and reading such collections is more related to load_stac than to load_collection, I guess.

in most cases, user-defined collections will contain results from multiple jobs, because single jobs are still limited in terms of the spatiotemporal extents that can be processed in a single job.

Hmm, this sounds like a back-end issue that should be handled internally and not by the API.
I assume that users submit a job with a large extent and internally (e.g. by the aggregator) the job is split into pieces, but then it's also the job of the "splitting entity" to combine the results into a single collection.

On the other hand, it could be generally useful if users could combine multiple results. But then I'm wondering about the consistency of the data. How can this be ensured? It feels cleaner and safer to just allow a "controlled entity" like the aggregator to combine data under a single collection, if at all.

I see most promise in a combination of options 3 and 4. So basically an API that aligns with STAC transaction to allow management of the collection, and a specific process to add job results to that collection.

I'm not sure whether I made it clear enough in my post, but unfortunately STAC Transactions are currently only defined on the item level. So you can't create a collection through the API and you'd need to push all items individually (or as an ItemCollection?), which is rather bad. We'd need to define collection-level transactions and contribute them back to STAC/OGC, which I assume will be a rather lengthy process. It seems unfeasible during the runtime of SAP07, at least.
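For reference, the existing Transaction extension only defines item-level endpoints such as the following, so each Item (or ItemCollection) would have to be pushed separately into a collection that already exists:

POST /collections/{collection_id}/items
PUT /collections/{collection_id}/items/{item_id}
PATCH /collections/{collection_id}/items/{item_id}
DELETE /collections/{collection_id}/items/{item_id}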


jdries commented Aug 21, 2023

Another example of multiple jobs contributing to a collection is when you add a new product, for instance, every 10 days as new satellite observations come in. Here I don't see how we could make that work with a single batch job.
Collection consistency is up to the user to handle, just like for every other STAC collection. Perhaps openEO can help a little bit, for instance by checking if band labels in the items match up with the collection metadata.


m-mohr commented Aug 21, 2023

That's a good point. So saving to an existing collection should add data instead of overwriting it or throwing an error.
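A sketch of what that could mean for the save_result variant from option 1, assuming the "errors if the ID exists" rule above is relaxed to "appends if the ID exists" (the semantics are hypothetical):

Job 1 (creates the collection):

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "format": "Collection",
      "options": {
        "id": "MyCollection",
        "description": "This describes my collection"
      }
    }
  }
}

Job 2, e.g. 10 days later (appends new items to the existing collection):

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "format": "Collection",
      "options": {
        "id": "MyCollection"
      }
    }
  }
}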

@christophreimer

@GeraldIr fyi

@m-mohr m-mohr linked a pull request Oct 11, 2023 that will close this issue

m-mohr commented Dec 6, 2023

Idea from the openEO community call:

  • create new STAC subtype
  • update save_result to return STAC subtype
  • add processes to allow saving to a workspace and saving to a collection that accept the STAC subtype (see the sketch below)
    • for user collections we need a STAC API extension that allows managing STAC collections (similar to STAC Transactions for Items)
  • add process to merge STAC collections
  • maybe: add process to update STAC metadata or work with STAC tooling
  • workspaces:
    • check current API against EODC implementation (GI)
    • update/remove the collection saving/updating part to reflect ideas from above
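A sketch of how the chained processes could look in a process graph, assuming save_result returns the new STAC subtype and reusing the save_collection name from option 2 as the process that accepts it (the chaining, the node names and the format name are purely illustrative):

{
  ...
  "save": {
    "process_id": "save_result",
    "arguments": {
      "data": {"from_node": "reduce"},
      "format": "GTiff"
    }
  },
  "publish": {
    "process_id": "save_collection",
    "arguments": {
      "data": {"from_node": "save"},
      "id": "MyCollection"
    },
    "result": true
  }
}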

m-mohr added a commit to Open-EO/openeo-processes that referenced this issue Dec 8, 2023
`save_results` returns the STAC resource instead of boolean `true` Open-EO/openeo-api#376

m-mohr commented Dec 8, 2023

PR: Open-EO/openeo-processes#485


m-mohr commented Dec 8, 2023

Proposed extension for Collection Transactions in STAC APIs:
https://github.com/stac-api-extensions/collection-transaction
