Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

Closed
soxofaan opened this issue Jul 3, 2020 · 16 comments · Fixed by #309
Closed

Processes: pre-defined + user-defined + sharing: namespacing/prefixing? #305

soxofaan opened this issue Jul 3, 2020 · 16 comments · Fixed by #309
Labels
process discovery and profile discovery
Milestone

Comments

@soxofaan
Copy link
Member

soxofaan commented Jul 3, 2020

(I remember discussions about this but couldn't find in-depth github ticket about this, feel free to close as duplicate)

At VITO, we're working on support for user-defined processes and basic sharing, which raises some concerns about naming collisions, not only between pre-defined and user-defined processes, but also across users when some kind of sharing mechanism is in place.

Some data points:

  • From User-defined processes #256:

    Decide whether user-defined processes should be prefixed in process graphs (e.g. user:ndvi) - I'll leave this up to the back-end for now. We may revisit this for 1.0 final.

  • The description of the process_id component:

    openeo-api/openapi.yaml

    Lines 4661 to 4663 in 1247172

    MUST be unique across all predefined and user-defined processes available for the authenticated user.
    If a back-end adds a process with the name of a user-defined process, the user-defined process takes
    preference over predefined processes in execution to not break existing process graphs.

    I think there is a contradiction/inconsistency here: process_id MUST be unique .... but if backend adds same name, it's ok to be not unique anymore (can user still update their UDP afterwards?). I think this is an indication that it "needs more work".

  • Process ids currently must follow this regex:

    pattern: '^\w+$'

    meaning Latin characters, digits and underscore. Which means there is very little wiggle room to implement some kind of robust namespace/prefix scheme. Also somewhat relevant here, especially when sharing across users comes in the picture: in context of OpenID Connect, the user id returned by the OIDC provider might not fit \w+ (dashes, dots, ...)

I don't have a concrete suggestions at the moment, but just wanted to start off this discussion

@m-mohr
Copy link
Member

m-mohr commented Jul 3, 2020

I remember discussions about this but couldn't find in-depth github ticket about this, feel free to close as duplicate [...]

Decide whether user-defined processes should be prefixed in process graphs (e.g. user:ndvi) -

I think that was discussed mainly in calls. Result was that some back-ends want prefixes for UDP and some not. Thus, there's now this compromise in the docs:

Back-ends may choose to enforce a prefix for user-defined processes while storing the process, e.g. user_ndvi with user_ being the prefix. Prefixes must still follow the pattern.

I think there is a contradiction/inconsistency here: process_id MUST be unique ....

... in its scope (pre-defined/user-defined) and from the user-perspective at the time of submission.

but if backend adds same name, it's ok to be not unique anymore (can user still update their UDP afterwards?). I think this is an indication that it "needs more work".

The issue is that if you don't prefix all, it can be problematic to add new processes as back-end as all good names might be taken by users. Thus to not break existing process graphs, user processes are preferred over back-end processes if this case happens.

  • Process ids currently must follow this regex:

    pattern: '^\w+$'

    meaning Latin characters, digits and underscore. Which means there is very little wiggle room to implement some kind of robust namespace/prefix scheme.

The API suggests user_ as prefix. What's wrong with that? How would you think it could get more robust? The characters are so limited as the process ids are used in programming languages directly and thus a common denominator had to be found, which was \w (e.g. for Python).

Also somewhat relevant here, especially when sharing across users comes in the picture: in context of OpenID Connect, the user id returned by the OIDC provider might not fit \w+ (dashes, dots, ...)

Indeed, at some point, it might be interesting to use URLs as process_id in process graphs. But there's no sharing yet, so the API is consistent. We can reconsider sharing and URLs as process IDs in a future version for sure.

@m-mohr m-mohr added this to the future milestone Jul 3, 2020
@m-mohr m-mohr added the process discovery and profile discovery label Jul 3, 2020
@soxofaan
Copy link
Member Author

soxofaan commented Jul 3, 2020

Result was that some back-ends want prefixes for UDP and some not. Thus, there's now this compromise in the docs ...

Indeed, but that breaks reproducibility across backends. Unless there would be a mechanism that the user, when storing a UDP, provides a suggested process_id and the backend returns an updated "collision-safe" process_id.

.. from the user-perspective at the time of submission.

I think this also breaks reproducibility in a way. First, I assume that it is ok to update a conflicting UDP as long as it was created before the PDP (the API is not explicit about this I think). This feels a bit weird to me: updating is fine, but delete+create is not.
Secondly: we use PUT to store an UDP (update if it exists, create otherwise), which means that the user is "allowed to be negligent" of the fact whether their UDP already exists when storing. In that context it also feels pretty arbitrary that you can or can not PUT a UDP depending on the backend's release cycle of predefined processes.

it can be problematic to add new processes as back-end as all good names might be taken by users.

That's indeed the problem with using prefixes in the process_id naming, because it's a "commitment for life". It could be more flexible with something like optional namespaces: for example, as long as there is no conflict you can use a normal process_id without a namespace, e.g. awesome_ndvi, but once a backend introduces the same PDP, the user has to choose one explicitly with a namespace, e.g. core:awesome_ndvi vs user:awesome_ndvi. Obviously the user is allowed to use user: namespace even if the backend does not has that PDP (yet), e.g. to make sure you have reproducability across backends.

The API suggests user_ as prefix. What's wrong with that?

As noted above: it's just a suggestion and backends are still free to follow/enforce that, which breaks reproducability across backends. The optional namespacing I suggested above is a bit more flexible, but it does not fit the current ^\w+$ pattern.

Background story: we are starting with a use case where we already would like to do some kind of public UDP sharing (also see #85 (comment) ) and we'd like to have a solution soon that is as close as possible to some kind consensus that would end up in a later openEO API version.

@m-mohr
Copy link
Member

m-mohr commented Jul 3, 2020

There's two issues here: (1)the potential conflict of names and (2) how we reference shared processes.

For (2) I'd suggest to simply use URL's instead of process ids in the future. That means we still use \w+ for new processes stored at a back-end, but in a process graph you also allow URIs in process_id.

For (1) I really hesitate to define a fixed prefix, especially as we can't really move away from \w+. What I could imagine is that clients add a "user": true to the process graph node for user-defined processes. That is similar to your suggestion and would be mostly be handled by the clients.

Unless there would be a mechanism that the user, when storing a UDP, provides a suggested process_id and the backend returns an updated "collision-safe" process_id.

When suggesting user_ as potential prefix, I imagined that providers inform and reject all processes not matching user_\w+ for example. That must be communicated somehow, but that leaves room for providers to decide what they actually want to enforce. Otherwise it could also use the URL, but clients should hide this.

but once a backend introduces the same PDP, the user has to choose one explicitly with a namespace, e.g. core:awesome_ndvi vs user:awesome_ndvi.

That doesn't really solve the problem. What's happening with old process graphs?

@soxofaan
Copy link
Member Author

soxofaan commented Jul 6, 2020

What I could imagine is that clients add a "user": true to the process graph node for user-defined processes.

I like this idea. Instead of boolean "user" field, you could also call it for example "namespace" (or "library", "package", "source", "origin", ...) and handle sharing (or even packaging) as well. For example (probably going over the top here):

  • namespace "backend" (default) -> backend's predefined implementation
  • namespace "openeo" -> use "process_graph" implementation of official openeo process spec
  • namespace "user" -> personal user defined process
  • namespace "johndoe1976" -> user defined process from another user
  • namespace "crew42" -> user defined process from some user group (whatever that might mean)
  • namespace "https://example.com/path/to/myudp.json" -> UDP from process graph hosted somewhere online
  • namespace "https://example.com/path/to/myudps/" -> UDP from repository of process graphs hosted somewhere online

There are of course a lot of details to get right regarding sharing (security and privacy-wise), which is out of scope here
At least, this "namespace" field with just "backend" vs "user" would already solve the collision problem we have in openEO 1.0, while keeping room for later extensions (e.g. sharing)

@m-mohr
Copy link
Member

m-mohr commented Jul 6, 2020

Should we bring in the distinction between pre-defined (default) and user-defined in for 1.0? In that case we need a PR and approvals ASAP (this week). Otherwise the next possibility for adding it would be openEO API 2.0.

@m-mohr m-mohr modified the milestones: future, v1.0.0-final Jul 6, 2020
@soxofaan
Copy link
Member Author

soxofaan commented Jul 6, 2020

I think it would make the API cleaner, especially regarding the somewhat adhocy

MUST be unique across all PDP and UDP, unless the UDP was first, then UDP takes preference over the PDP ....
Back-ends may choose to enforce a prefix for user-defined processes while storing the process

and it easily allows backends to experiment with sharing approaches for usecases that might already need it.

@m-mohr
Copy link
Member

m-mohr commented Jul 6, 2020

I'll issue a PR and then we need to find consensus.

@jdries
Copy link

jdries commented Jul 6, 2020

Indeed, some of these issues should preferably be solved in 1.0. We're probably one of the first backends trying this out anyway?
I like the suggestion to support URI's. This can solve a lot, as I can just share my pg through github, and have versioning and everything. Or even use some kind of non-public URI in case I don't want to share my graph publicly.

@m-mohr
Copy link
Member

m-mohr commented Jul 6, 2020

Yes, I was working on this with GEE also in 0.4, but dropped it in 1.0.

I think the namespaces are easy to add to at least distinguish user and pre-defined processes. Should also be relatively easy for clients to set correctly.

On the other hand, I'm concerned that the URI change could be to much for 1.0. For URIs you'd actually not even need a namespace. You could simply use the URI as process_id, but then all back-ends would need to implement it otherwise it would not be clear whether a back-end supports it or not. If you put a URL in the namespace, then the process_id is pretty much redundant.

For now, I'd suggest to just use the two namespaces and leave out URIs. What clients could do is to allow passing URIs, then they download it automatically and store it at /process_graphs and put the corresponding id and namespace in the process graph. If we's have chosen the URI approach in process graphs directly, the back-ends should store URI-based processes anyway during submission as just loading them during execution sounds a bit less reliable and prone to attacks when changed later on (see npm issues with deleted packages, for example).

@m-mohr
Copy link
Member

m-mohr commented Jul 6, 2020

Please review #309

@soxofaan
Copy link
Member Author

soxofaan commented Jul 6, 2020

If you put a URL in the namespace, then the process_id is pretty much redundant.

Not necessarily: the namespace URL could be a "folder" and process_id still a traditional \w+ string, which would resolve then together to for example to a real file {namespace}/{process_id}.json. This approach would make things a bit cleaner for the client/user side maybe.

@m-mohr
Copy link
Member

m-mohr commented Jul 6, 2020

Yes, would work, but I'd like to see client support/helpers on this. Because splitting URIs is nothing a typical user wants to do. Also, it assumes things are called [...].json, which may not always be the case. So definitely something to think about more in the future.

soxofaan added a commit to Open-EO/openeo-python-client that referenced this issue Jul 6, 2020
@soxofaan
Copy link
Member Author

soxofaan commented Jul 6, 2020

possible next step (but not necessarily to include for 1.0 final):

  • merge GET /processes and GET /process_graphs and work with an additional namespace request parameter to select PDP or UDP
  • unify all /processes and /process_graphs requests to something like /processes/{namespace}/{process_graph_id}

@m-mohr
Copy link
Member

m-mohr commented Jul 6, 2020

That's definitely something that would only be available in openEO API 2.0 and should live in it's own issue as this is going to be closed for 1.0.0.

m-mohr added a commit that referenced this issue Jul 7, 2020
@m-mohr
Copy link
Member

m-mohr commented Jul 7, 2020

This has been merged.

@soxofaan For the other ideas open a new issue, please.

@m-mohr
Copy link
Member

m-mohr commented Jul 13, 2020

I just realized again why we haven't had namespaces ( #309 ) in the API: User could simply replace missing pre-defined processes (e.g. normalized_difference) with our "alternative" processing instructions in openeo-processes. Just store it as user-defined process and it still runs. That's now broken with namespaces. Thus, I changed the default behavior again in #313. The namespaces for explicit choosing stay as is.

soxofaan added a commit to Open-EO/openeo-python-driver that referenced this issue Jul 23, 2020
m-mohr added a commit that referenced this issue Nov 25, 2020
…pace

Add "namespace" placeholder to ProcessUnsupported error message (#305)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
process discovery and profile discovery
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants