
Allow external workflows to make changes #7563

Closed
qqmyers opened this issue Jan 28, 2021 · 3 comments · Fixed by #7568
Labels
GDCC: DANS related to GDCC work for DANS

Comments

@qqmyers
Member

qqmyers commented Jan 28, 2021

Workflows may need to read or write information in the dataset during their execution. For external workflows, i.e. ones that interact with Dataverse via the API, this is currently blocked by the lock placed for the workflow itself (see the sketch after the lists below). Use cases include:
PrePublication:

  • Adding metadata from an external service (DANS data migration),
  • Extracting geospatial information and writing a csv file to the dataset (UCLA geo integration),
  • Future spam or virus check analysis that would need to get file contents

PostPublication:

  • The archiver workflows set a field in the datasetversion to indicate the archivalcopylocation. If this workflow were made external, it would have to make that change via the API.

New triggers (TBD):

  • External ingest apps that read files after upload and produce metadata.
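
For concreteness, here is a minimal sketch of the kind of call an external workflow would make, and that the workflow lock currently blocks. It uses the native editMetadata endpoint; the server URL, PID, JSON body, and api key are placeholders, and the exact error returned depends on how the Command's lock check reports the conflict:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WorkflowClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // An external pre-publication workflow trying to add metadata while
        // the dataset holds a Workflow lock. The apiKey belongs to the user
        // who triggered publication.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://demo.dataverse.org/api/datasets/:persistentId/editMetadata"
                        + "?persistentId=doi:10.5072/FK2/EXAMPLE"))
                .header("X-Dataverse-key", "<users-api-key>")
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"fields\": [{\"typeName\": \"subtitle\", \"value\": \"Added by workflow\"}]}"))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // Today this fails: the Command sees the dataset is locked for the
        // workflow itself, even though the caller *is* that workflow acting
        // on behalf of the publishing user.
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```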

Prepublication is the most critical case: for PostPublication and potential triggers like 'UploadFinished', there is no internal Dataverse processing that has to wait, so workflows could immediately respond with a success message and then process as they wish after the lock is released. For pre-publication, responding would trigger FinalizePublication, and the workflow could no longer make changes that would go into the published version.

The mechanism for lock enforcement I see in API calls is that locks are only checked within the commands themselves, and different lock types are checked separately.
I can see two options for a fix: give either the user who started the workflow, or the holder of a token (such as the workflowID), an exemption from the workflow lock. (I think all the use cases would work if other lock types are still enforced, and I think many would need the protection of those locks, e.g. Ingest, EditInProgress, etc.)

  • Commands already know which user is calling them, so it would be easy to add an exception in the current lock checks: allow the call to run if there is a workflow lock and you are the user who started the workflow. I think this would be OK in practice, since any action that would cause conflicts will also trigger additional non-workflow locks, and if the UI stays as is (does not exempt a workflow lock), the only potential for conflicting calls would be API calls from the same user. In theory, something like running the DCT to change variable metadata while a triggered pre-publication workflow is still in progress could result in uncoordinated changes, but real conflicts should (?) still be stopped by other locks rather than causing an exception/server error.

  • An alternate approach that would be stricter, but more work, would be to add a way to pass the lockID (short and potentially easy to guess) or workflowID (better: long and unguessable on short timescales) to Commands, so that the workflow lock check can verify that the API call originates from the workflow itself. In either case, a way to pass the token to the Command is needed, probably by adding it to the DataverseRequest used to start the Command (DataverseRequest has access to the HttpRequest and should be able to read a param or header; see the sketch below). To use the workflowID, the relationship between the Workflow and its DatasetLock would also need to be kept.
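
A minimal sketch of that token-passing idea, assuming a new header whose name is hypothetical here. The real DataverseRequest wraps the HttpServletRequest; the field and accessor below are stand-ins, not the actual implementation:

```java
import javax.servlet.http.HttpServletRequest;

public class DataverseRequestSketch {
    // Hypothetical header carrying the workflow's unguessable ID.
    private static final String WORKFLOW_HEADER = "X-Dataverse-invocationID";

    private final String workflowInvocationId; // null for ordinary API calls

    public DataverseRequestSketch(HttpServletRequest httpRequest) {
        // Accept the ID as a header or, as a fallback, a query parameter.
        String fromHeader = httpRequest.getHeader(WORKFLOW_HEADER);
        this.workflowInvocationId = (fromHeader != null)
                ? fromHeader
                : httpRequest.getParameter("invocationID");
    }

    public String getWorkflowInvocationId() {
        return workflowInvocationId;
    }
}
```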

Unless there are other ideas or a strong opinion, I'm inclined to just go for the latter: it's not too much more work, and it may avoid future problems if long-running apps associated with the user's apikey are also running (external request-access tools, QDR's annotation tool (in development), etc.).

@scolapasta
Contributor

scolapasta commented Jan 28, 2021

@qqmyers I lean towards the latter as well.

A variant of the latter is to use the notion of a more robust API token system (e.g. #7244) and give the workflow one of these newer one-time or short-lived tokens that would have permission to make changes.

@qqmyers
Member Author

qqmyers commented Jan 28, 2021

With some further discussion, I'm even more in favor of the 'latter' option over the 'former'. I like #7244 as a longer-term option as well. For security, I think we need the workflowID (versus the guessable lockId). I also think it should work to send only the workflow ID to the workflow implementation, not the apiKey, and then essentially use the workflowID both as a shorter-lived api key and as proof that the caller may override the indicated workflow lock. So the proposal is:

  1. Modify dataverse to send the workflow ID in place of the apikey to workflows
  2. Have the workflow send the workflowID in a header when making API calls
  3. Modify dataverse to validate the workflowID and identify the associated user (analogous to how api keys are used)
  4. Proceed as is done currently to check that the user has the permissions to take the action for the given api call
  5. When the underlying Command is executed and it checks, before starting, whether the relevant Dataset is locked, allow a call using that workflow ID to proceed, ignoring the workflow lock. (And conversely, if someone uses an apiKey to authenticate, do not ignore the lock.) See the sketch after this list.
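
A sketch of the lock check in steps 3-5, under stated assumptions: DatasetLock and its Reason enum do exist in Dataverse, but the enum subset, the dataset-to-invocation mapping, and the method names below are illustrative stand-ins for however the Workflow/DatasetLock relationship ends up being persisted:

```java
import java.util.Map;
import java.util.Objects;

public class WorkflowLockCheckSketch {

    // Illustrative subset of DatasetLock.Reason.
    enum LockReason { Workflow, Ingest, EditInProgress }

    // Hypothetical persisted mapping: dataset id -> invocation ID of the
    // workflow currently holding that dataset's Workflow lock.
    private final Map<Long, String> pendingInvocations;

    public WorkflowLockCheckSketch(Map<Long, String> pendingInvocations) {
        this.pendingInvocations = pendingInvocations;
    }

    /** True if the given lock should block a request carrying invocationId (may be null). */
    public boolean blocks(LockReason reason, long datasetId, String invocationId) {
        if (reason != LockReason.Workflow) {
            return true;  // Ingest, EditInProgress, etc. stay enforced
        }
        if (invocationId == null) {
            return true;  // plain apiKey authentication: the workflow lock still applies
        }
        // Step 5: only the workflow that owns this dataset's lock is exempt.
        return !Objects.equals(invocationId, pendingInvocations.get(datasetId));
    }
}
```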

I think this both solves the current issue and tightens up security: the workflow does not get a long-lived key for the user. Instead, when it completes and sends a success/failure response, that response invalidates the key that gave it access.

@qqmyers qqmyers added the GDCC: DANS related to GDCC work for DANS label Jan 28, 2021
@janvanmansum
Contributor

janvanmansum commented Jan 29, 2021

@qqmyers: the solution in your summary above is exactly what we are looking for.
