Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

soxofaan · 2020-07-03T09:21:41Z

(I remember discussions about this but couldn't find in-depth github ticket about this, feel free to close as duplicate)

At VITO, we're working on support for user-defined processes and basic sharing, which raises some concerns about naming collisions, not only between pre-defined and user-defined processes, but also across users when some kind of sharing mechanism is in place.

Some data points:

From User-defined processes #256:

Decide whether user-defined processes should be prefixed in process graphs (e.g. user:ndvi) - I'll leave this up to the back-end for now. We may revisit this for 1.0 final.

The description of the process_id component:

openeo-api/openapi.yaml

Lines 4661 to 4663 in 1247172

    
                   MUST be unique across all predefined and user-defined processes available for the authenticated user. 
        
                   If a back-end adds a process with the name of a user-defined process, the user-defined process takes 
        
                   preference over predefined processes in execution to not break existing process graphs.

I think there is a contradiction/inconsistency here: process_id MUST be unique .... but if backend adds same name, it's ok to be not unique anymore (can user still update their UDP afterwards?). I think this is an indication that it "needs more work".

Process ids currently must follow this regex:

openeo-api/openapi.yaml

Line 4667 in 1247172

pattern: '^\w+$'

meaning Latin characters, digits and underscore. Which means there is very little wiggle room to implement some kind of robust namespace/prefix scheme. Also somewhat relevant here, especially when sharing across users comes in the picture: in context of OpenID Connect, the user id returned by the OIDC provider might not fit \w+ (dashes, dots, ...)

I don't have a concrete suggestions at the moment, but just wanted to start off this discussion

The text was updated successfully, but these errors were encountered:

m-mohr · 2020-07-03T09:47:15Z

I remember discussions about this but couldn't find in-depth github ticket about this, feel free to close as duplicate [...]

Decide whether user-defined processes should be prefixed in process graphs (e.g. user:ndvi) -

I think that was discussed mainly in calls. Result was that some back-ends want prefixes for UDP and some not. Thus, there's now this compromise in the docs:

Back-ends may choose to enforce a prefix for user-defined processes while storing the process, e.g. user_ndvi with user_ being the prefix. Prefixes must still follow the pattern.

I think there is a contradiction/inconsistency here: process_id MUST be unique ....

... in its scope (pre-defined/user-defined) and from the user-perspective at the time of submission.

but if backend adds same name, it's ok to be not unique anymore (can user still update their UDP afterwards?). I think this is an indication that it "needs more work".

The issue is that if you don't prefix all, it can be problematic to add new processes as back-end as all good names might be taken by users. Thus to not break existing process graphs, user processes are preferred over back-end processes if this case happens.

Process ids currently must follow this regex:

openeo-api/openapi.yaml

Line 4667 in 1247172

pattern: '^\w+$'

meaning Latin characters, digits and underscore. Which means there is very little wiggle room to implement some kind of robust namespace/prefix scheme.

The API suggests user_ as prefix. What's wrong with that? How would you think it could get more robust? The characters are so limited as the process ids are used in programming languages directly and thus a common denominator had to be found, which was \w (e.g. for Python).

Also somewhat relevant here, especially when sharing across users comes in the picture: in context of OpenID Connect, the user id returned by the OIDC provider might not fit \w+ (dashes, dots, ...)

Indeed, at some point, it might be interesting to use URLs as process_id in process graphs. But there's no sharing yet, so the API is consistent. We can reconsider sharing and URLs as process IDs in a future version for sure.

soxofaan · 2020-07-03T12:32:17Z

Result was that some back-ends want prefixes for UDP and some not. Thus, there's now this compromise in the docs ...

Indeed, but that breaks reproducibility across backends. Unless there would be a mechanism that the user, when storing a UDP, provides a suggested process_id and the backend returns an updated "collision-safe" process_id.

.. from the user-perspective at the time of submission.

I think this also breaks reproducibility in a way. First, I assume that it is ok to update a conflicting UDP as long as it was created before the PDP (the API is not explicit about this I think). This feels a bit weird to me: updating is fine, but delete+create is not.
Secondly: we use PUT to store an UDP (update if it exists, create otherwise), which means that the user is "allowed to be negligent" of the fact whether their UDP already exists when storing. In that context it also feels pretty arbitrary that you can or can not PUT a UDP depending on the backend's release cycle of predefined processes.

it can be problematic to add new processes as back-end as all good names might be taken by users.

That's indeed the problem with using prefixes in the process_id naming, because it's a "commitment for life". It could be more flexible with something like optional namespaces: for example, as long as there is no conflict you can use a normal process_id without a namespace, e.g. awesome_ndvi, but once a backend introduces the same PDP, the user has to choose one explicitly with a namespace, e.g. core:awesome_ndvi vs user:awesome_ndvi. Obviously the user is allowed to use user: namespace even if the backend does not has that PDP (yet), e.g. to make sure you have reproducability across backends.

The API suggests user_ as prefix. What's wrong with that?

As noted above: it's just a suggestion and backends are still free to follow/enforce that, which breaks reproducability across backends. The optional namespacing I suggested above is a bit more flexible, but it does not fit the current ^\w+$ pattern.

Background story: we are starting with a use case where we already would like to do some kind of public UDP sharing (also see #85 (comment) ) and we'd like to have a solution soon that is as close as possible to some kind consensus that would end up in a later openEO API version.

m-mohr · 2020-07-03T13:43:28Z

There's two issues here: (1)the potential conflict of names and (2) how we reference shared processes.

For (2) I'd suggest to simply use URL's instead of process ids in the future. That means we still use \w+ for new processes stored at a back-end, but in a process graph you also allow URIs in process_id.

For (1) I really hesitate to define a fixed prefix, especially as we can't really move away from \w+. What I could imagine is that clients add a "user": true to the process graph node for user-defined processes. That is similar to your suggestion and would be mostly be handled by the clients.

Unless there would be a mechanism that the user, when storing a UDP, provides a suggested process_id and the backend returns an updated "collision-safe" process_id.

When suggesting user_ as potential prefix, I imagined that providers inform and reject all processes not matching user_\w+ for example. That must be communicated somehow, but that leaves room for providers to decide what they actually want to enforce. Otherwise it could also use the URL, but clients should hide this.

but once a backend introduces the same PDP, the user has to choose one explicitly with a namespace, e.g. core:awesome_ndvi vs user:awesome_ndvi.

That doesn't really solve the problem. What's happening with old process graphs?

soxofaan · 2020-07-06T09:49:36Z

What I could imagine is that clients add a "user": true to the process graph node for user-defined processes.

I like this idea. Instead of boolean "user" field, you could also call it for example "namespace" (or "library", "package", "source", "origin", ...) and handle sharing (or even packaging) as well. For example (probably going over the top here):

namespace "backend" (default) -> backend's predefined implementation
namespace "openeo" -> use "process_graph" implementation of official openeo process spec
namespace "user" -> personal user defined process
namespace "johndoe1976" -> user defined process from another user
namespace "crew42" -> user defined process from some user group (whatever that might mean)
namespace "https://example.com/path/to/myudp.json" -> UDP from process graph hosted somewhere online
namespace "https://example.com/path/to/myudps/" -> UDP from repository of process graphs hosted somewhere online

There are of course a lot of details to get right regarding sharing (security and privacy-wise), which is out of scope here
At least, this "namespace" field with just "backend" vs "user" would already solve the collision problem we have in openEO 1.0, while keeping room for later extensions (e.g. sharing)

m-mohr · 2020-07-06T09:52:56Z

Should we bring in the distinction between pre-defined (default) and user-defined in for 1.0? In that case we need a PR and approvals ASAP (this week). Otherwise the next possibility for adding it would be openEO API 2.0.

soxofaan · 2020-07-06T10:42:13Z

I think it would make the API cleaner, especially regarding the somewhat adhocy

MUST be unique across all PDP and UDP, unless the UDP was first, then UDP takes preference over the PDP ....
Back-ends may choose to enforce a prefix for user-defined processes while storing the process

and it easily allows backends to experiment with sharing approaches for usecases that might already need it.

m-mohr · 2020-07-06T10:44:09Z

I'll issue a PR and then we need to find consensus.

jdries · 2020-07-06T10:59:51Z

Indeed, some of these issues should preferably be solved in 1.0. We're probably one of the first backends trying this out anyway?
I like the suggestion to support URI's. This can solve a lot, as I can just share my pg through github, and have versioning and everything. Or even use some kind of non-public URI in case I don't want to share my graph publicly.

m-mohr · 2020-07-06T11:06:55Z

Yes, I was working on this with GEE also in 0.4, but dropped it in 1.0.

I think the namespaces are easy to add to at least distinguish user and pre-defined processes. Should also be relatively easy for clients to set correctly.

On the other hand, I'm concerned that the URI change could be to much for 1.0. For URIs you'd actually not even need a namespace. You could simply use the URI as process_id, but then all back-ends would need to implement it otherwise it would not be clear whether a back-end supports it or not. If you put a URL in the namespace, then the process_id is pretty much redundant.

For now, I'd suggest to just use the two namespaces and leave out URIs. What clients could do is to allow passing URIs, then they download it automatically and store it at /process_graphs and put the corresponding id and namespace in the process graph. If we's have chosen the URI approach in process graphs directly, the back-ends should store URI-based processes anyway during submission as just loading them during execution sounds a bit less reliable and prone to attacks when changed later on (see npm issues with deleted packages, for example).

m-mohr · 2020-07-06T11:22:46Z

Please review #309

soxofaan · 2020-07-06T12:40:22Z

If you put a URL in the namespace, then the process_id is pretty much redundant.

Not necessarily: the namespace URL could be a "folder" and process_id still a traditional \w+ string, which would resolve then together to for example to a real file {namespace}/{process_id}.json. This approach would make things a bit cleaner for the client/user side maybe.

m-mohr · 2020-07-06T13:00:33Z

Yes, would work, but I'd like to see client support/helpers on this. Because splitting URIs is nothing a typical user wants to do. Also, it assumes things are called [...].json, which may not always be the case. So definitely something to think about more in the future.

see Open-EO/openeo-api#305, Open-EO/openeo-api#309

soxofaan · 2020-07-06T14:35:59Z

possible next step (but not necessarily to include for 1.0 final):

merge GET /processes and GET /process_graphs and work with an additional namespace request parameter to select PDP or UDP
unify all /processes and /process_graphs requests to something like /processes/{namespace}/{process_graph_id}

m-mohr · 2020-07-06T14:37:28Z

That's definitely something that would only be available in openEO API 2.0 and should live in it's own issue as this is going to be closed for 1.0.0.

* Add namespace to process graph #305

m-mohr · 2020-07-07T17:22:06Z

This has been merged.

@soxofaan For the other ideas open a new issue, please.

m-mohr · 2020-07-13T13:14:00Z

I just realized again why we haven't had namespaces ( #309 ) in the API: User could simply replace missing pre-defined processes (e.g. normalized_difference) with our "alternative" processing instructions in openeo-processes. Just store it as user-defined process and it still runs. That's now broken with namespaces. Thus, I changed the default behavior again in #313. The namespaces for explicit choosing stay as is.

…-EO#305)

also see Open-EO/openeo-api#305

…pace Add "namespace" placeholder to ProcessUnsupported error message (#305)

m-mohr added this to the future milestone Jul 3, 2020

m-mohr added the process discovery and profile discovery label Jul 3, 2020

m-mohr modified the milestones: future, v1.0.0-final Jul 6, 2020

m-mohr added a commit that referenced this issue Jul 6, 2020

Add namespace to process graph #305

8da345d

m-mohr mentioned this issue Jul 6, 2020

#305: Add namespace to process graph nodes #309

Merged

m-mohr linked a pull request Jul 6, 2020 that will close this issue

#305: Add namespace to process graph nodes #309

Merged

soxofaan added a commit to Open-EO/openeo-python-client that referenced this issue Jul 6, 2020

UDP: add support for process namespaces

81656e7

see Open-EO/openeo-api#305, Open-EO/openeo-api#309

m-mohr added a commit that referenced this issue Jul 7, 2020

#305: Add namespace to process graph nodes (#309)

843c6db

* Add namespace to process graph #305

m-mohr closed this as completed Jul 7, 2020

soxofaan mentioned this issue Jul 9, 2020

Unification of /processes and /process_graphs #310

Open

m-mohr added a commit that referenced this issue Jul 13, 2020

Revert a bit of #305/#309

01aa077

soxofaan added a commit to soxofaan/openeo-api that referenced this issue Jul 22, 2020

Add "namespace" placeholder to ProcessUnsupported error message (Open…

6cd9394

…-EO#305)

soxofaan added a commit to Open-EO/openeo-python-driver that referenced this issue Jul 23, 2020

EP-3449/EP-3544 add support for URL as process namespace (experimental)

e21f5e4

also see Open-EO/openeo-api#305

m-mohr added a commit that referenced this issue Nov 25, 2020

Merge pull request #328 from soxofaan/error-process-unsupported-names…

05ce5a4

…pace Add "namespace" placeholder to ProcessUnsupported error message (#305)

soxofaan mentioned this issue Mar 4, 2021

Issue #182 add namespace support to DataCube.process and related Open-EO/openeo-python-client#183

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

soxofaan commented Jul 3, 2020

m-mohr commented Jul 3, 2020 •

edited

Loading

soxofaan commented Jul 3, 2020

m-mohr commented Jul 3, 2020 •

edited

Loading

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020 •

edited

Loading

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020

jdries commented Jul 6, 2020

m-mohr commented Jul 6, 2020 •

edited

Loading

m-mohr commented Jul 6, 2020

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020 •

edited

Loading

m-mohr commented Jul 7, 2020

m-mohr commented Jul 13, 2020

Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

Comments

soxofaan commented Jul 3, 2020

m-mohr commented Jul 3, 2020 • edited Loading

soxofaan commented Jul 3, 2020

m-mohr commented Jul 3, 2020 • edited Loading

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020 • edited Loading

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020

jdries commented Jul 6, 2020

m-mohr commented Jul 6, 2020 • edited Loading

m-mohr commented Jul 6, 2020

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020

soxofaan commented Jul 6, 2020

m-mohr commented Jul 6, 2020 • edited Loading

m-mohr commented Jul 7, 2020

m-mohr commented Jul 13, 2020

m-mohr commented Jul 3, 2020 •

edited

Loading

m-mohr commented Jul 3, 2020 •

edited

Loading

m-mohr commented Jul 6, 2020 •

edited

Loading

m-mohr commented Jul 6, 2020 •

edited

Loading

m-mohr commented Jul 6, 2020 •

edited

Loading