Feature Request: Enable file PIDs at Dataverse Collection Level #8889

Closed
sbarbosadataverse opened this issue Aug 3, 2022 · 13 comments · Fixed by #9614, #9716 or #9721
Labels
Feature: DOI & Handle Size: 10 A percentage of a sprint. 7 hours.
Milestone

Comments

@sbarbosadataverse

sbarbosadataverse commented Aug 3, 2022

The Harvard Dataverse Model for supporting file-level DOIs will support collection level instead of production-wide file-level PIDs.
This requires the ability to enable file-level PIDs at the collection level. Eventually, we may discuss enabling at dataset level, if needed.

Overview of the Feature Request

What kind of user is the feature intended for?
SUPERUSER/INSTALLATION ADMIN

What inspired the request?
THE COST OF PROVIDING FILE LEVEL PIDs FOR GENERALIST, OPEN REPOSITORIES

What existing behavior do you want changed?
MANAGED FILE LEVEL PIDs distribution

Any brand new behavior do you want to add to Dataverse?

Any related open or closed issues to this feature request?
NO

@scolapasta

Contributor

Some notes:

We'll want to introduce this as a setting per dataverse collection, and model it after other such settings (metadata blocks, facets, etc.) in that a child dataverse would inherit the setting from its parent unless it chooses to override it.
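
A minimal sketch of that inheritance rule, for illustration only. The class and field names here (`CollectionNode`, `filePidsEnabled`) are hypothetical and are not taken from the Dataverse codebase; the sketch assumes an unset value means "inherit from the parent" and that the installation-wide default applies when no ancestor sets a value.

```java
// Hypothetical model of a collection; names are illustrative, not Dataverse classes.
final class CollectionNode {
    private final CollectionNode parent;   // null for the root collection
    private final Boolean filePidsEnabled; // null means "not set here, inherit"

    CollectionNode(CollectionNode parent, Boolean filePidsEnabled) {
        this.parent = parent;
        this.filePidsEnabled = filePidsEnabled;
    }

    // Walk up the hierarchy until a collection with an explicit value is found;
    // fall back to the installation-wide default if no ancestor sets one.
    boolean effectiveFilePidsEnabled(boolean installationDefault) {
        for (CollectionNode c = this; c != null; c = c.parent) {
            if (c.filePidsEnabled != null) {
                return c.filePidsEnabled;
            }
        }
        return installationDefault;
    }
}
```

With this shape, a child that sets its own value overrides its parent, and a child that leaves it unset picks up the nearest ancestor's choice.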

For now this will be superuser only, so we can set via API (optionally, we could include a rudimentary UI way, though further design would be warranted if we opened it up to dataverse collection admins).

Additionally, we'll want to modify the current installation wide setting from just true / false to also allow for this option (so likely, off, on, and "set all collection level" - though we may not need "on" and just off and "set all collection level" - something to discuss when we start working on this).

We'll also need an API (or move the current admin only API) to register for existing datasets (for when a superuser turns this on in a dataverse collection with existing datasets).

@j-n-c
Contributor

j-n-c commented Sep 21, 2022

PID settings have been discussed recently:

This feature was in fact suggested by @qqmyers in session 2 of the community call held on the 20th Sep 2022

Adding to @scolapasta's comment, perhaps the file PID configuration could be set only on desired collections (passed as a parameter to the API or set through the GUI). On other collections, the system-wide default would be used.

Please consider adding this feature in a future release of the Dataverse Software.

@mreekie mreekie added the bk2211 label Nov 1, 2022
@mreekie mreekie removed the bk2211 label Jan 11, 2023
@mreekie

mreekie commented Feb 27, 2023

Sizing:

What is needed is to assign XX to a specific collection

  • All datasets automatically get a persistent identifier.
  • Files: a persistent identifier is optional.
  • Persistent identifiers for files are currently disabled in production.
  • It's a global switch.

Sonia would like this to be turned on/off on a per-collection basis.
This feels like it could be one sprint: Size 80

@mreekie mreekie added the Size: 80 A percentage of a sprint. 56 hours. label Mar 6, 2023
@landreev
Contributor

BTW, @poikilotherm's #9462 is already adding a per-collection (and inheritable) setting that's related to PID behavior, for version PIDs. It can be used as a model for this setting.

@landreev landreev self-assigned this May 17, 2023
landreev added a commit that referenced this issue May 23, 2023
landreev added a commit that referenced this issue May 23, 2023
landreev added a commit that referenced this issue May 24, 2023
landreev added a commit that referenced this issue May 24, 2023
landreev added a commit that referenced this issue May 24, 2023
landreev added a commit that referenced this issue May 24, 2023
landreev added a commit that referenced this issue May 31, 2023
landreev added a commit that referenced this issue Jun 22, 2023
landreev added a commit that referenced this issue Jun 27, 2023
@pdurbin pdurbin added this to the 5.14 milestone Jul 12, 2023
@scolapasta
Contributor

After a discussion with @qqmyers and reviewing the code and documentation, I'm adding this back to the board. The issue is that the current implementation allows a user with edit dataverse permissions (e.g. Dataverse Admin) to set this despite it being turned off for the installation.

While there are some cases where this is OK, we should also allow for the possibility of an installation that does not want any file DOIs minted.

One suggestion was to change the setting at the installation level from just true / false to allow for null / true / false. In this case null would mean no file PIDs ever, true would be file PIDs on by default but can be overridden for a particular collection, and false would be file PIDs off by default, but can be overridden for a particular collection.
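
To make the three cases concrete, here is a rough sketch of how the decision could be evaluated. `FilePidPolicy` and `shouldMintFilePids` are made-up names for illustration and do not reflect the actual implementation; the sketch also assumes the collection-level value is itself nullable, with null meaning "no override".

```java
// Illustrative only: hypothetical helper, not part of the Dataverse codebase.
final class FilePidPolicy {
    private FilePidPolicy() {}

    // installationSetting: null = no file PIDs ever,
    //                      TRUE = on by default, FALSE = off by default.
    // collectionOverride:  null = the collection (or its ancestors) set nothing.
    static boolean shouldMintFilePids(Boolean installationSetting, Boolean collectionOverride) {
        if (installationSetting == null) {
            return false; // installation opted out entirely; overrides are ignored
        }
        if (collectionOverride != null) {
            return collectionOverride; // a collection-level choice wins over the default
        }
        return installationSetting; // otherwise use the installation default
    }
}
```

Note that the null case is the only one in which a collection-level value has no effect, which is what protects an installation that never wants file DOIs minted.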

One question I have for @cmbz: is this enough? Do we also need the ability for file PIDs to be on at the installation level and not overridable at the collection level? If so, we would want to handle this either with something other than a boolean or with two booleans (one for whether file PIDs are on at the installation level, and one for whether that can be overridden).

If we're OK with only the 3 possibilities, then we can just change the way the current setting works, and then update the check for whether this is turned on at the collection level to account for that change (plus, of course, fully document it).

@scolapasta scolapasta reopened this Jul 13, 2023
@scolapasta
Contributor

The other option, which @cmbz and I discussed, is that only superusers could make this change. This would be a bigger code change from how it's currently implemented, but I am in support of this way. We can always expand its use to other users in the future, if it's desired.

This would put this feature more in line with, for example, storage drivers, which can be changed for a collection but only by a superuser.

@cmbz added "I think that this feature differs significantly from other default overrides (like adding optional metadata blocks) in terms of possible impact to the installation and therefore the installation owner/manager should be the one to authorize a collection to use this feature."

@scolapasta
Contributor

Actually, rereading the description above, the original request from Sonia was for this to be superuser level:

What kind of user is the feature intended for?
SUPERUSER/INSTALLATION ADMIN

So I think we have Sonia's buy-in to this approach. :)

@poikilotherm
Contributor

poikilotherm commented Jul 14, 2023

@scolapasta you are aware of how I implement this for version PIDs, aye?

Would it make sense to align how both features work to keep the configuration experience consistent?

Also: as both are API-only things right now (aside from the installation setting), maybe we could make the edit endpoint execute the permission checks and add an installation setting for whether a superuser is required to do this or not? (Not sure where/how this is checked for the storage backend via API at the moment.)

Some installations like us are probably fine with curators enabling/disabling this per collection.

@kcondon
Contributor

kcondon commented Jul 14, 2023

@scolapasta What about the use case where the sys admin wants to enable them on a collection level but only selectively to super users?

Yes, just saw the last line in above comment.

@qqmyers
Member

qqmyers commented Jul 14, 2023

FWIW: We have several things like this now - stores, curation labels, metadataLanguage, guestbook-at-request (not yet merged). In general, the pattern we have is that if the setting isn't set, you get legacy behavior, and we've made these inheritable down through the collection hierarchy. Beyond that, there are differences due to the individual use cases, w.r.t. whether a dataset can override the collection-level choice, whether the choice requires superuser or not, etc.

In this case, where DOIs can be a significant expense, I'd argue that the pattern of not allowing file PIDs unless the admin has enabled it makes sense, as does at least an option/default of requiring superuser to turn it on for a collection or dataset. If there's a demand for self-service for normal collection/dataset managers to toggle file PIDs, I'd suggest making that a separate choice/setting (rather than the default).
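
As a rough illustration of that pattern, the check for who may toggle file PIDs on a collection could look something like the sketch below. All names (`FilePidToggleGuard`, `mayToggle`, and the four flags) are hypothetical, and the separate self-service setting is an assumption for the sake of the example, not an existing Dataverse option.

```java
// Illustrative only: hypothetical guard, not an existing Dataverse class.
final class FilePidToggleGuard {
    private FilePidToggleGuard() {}

    static boolean mayToggle(boolean adminEnabledFilePids,
                             boolean selfServiceAllowed,
                             boolean isSuperuser,
                             boolean canEditCollection) {
        if (!adminEnabledFilePids) {
            return false; // nothing can be toggled unless the admin has enabled file PIDs
        }
        if (isSuperuser) {
            return true;  // default: superusers may turn it on or off per collection
        }
        // Non-superusers need both the separate self-service setting
        // and edit permission on the collection.
        return selfServiceAllowed && canEditCollection;
    }
}
```

Keeping self-service as its own flag means the default stays superuser only, and an installation that is fine with curators toggling this can opt in explicitly.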

(FWIW - I'm currently making the opposite choice in the guestbook-at-request work as the choice of whether the guestbook appears at request or download doesn't, at first glance, cause any costs/issues a site-level admin would care about - still checking with ADA to see if they have automation where making this a superuser choice is required.)

@scolapasta
Contributor

scolapasta commented Jul 14, 2023

@poikilotherm I just looked over #9462 to see how you are implementing it there. I think that could be a good approach. I do wonder, though, about a situation where an installation would want to have it always on and not allow overriding - I don't think I see that as an option.

Having them both work similarly would ideally be nice, if we decide the use cases are similar enough (I do think they can vary, in that you can have datasets with 10k+ files and you will (hopefully) never have that many versions).

I think for the purpose of 5.14 we should get something out now and we can tweak as we think about it. If others agree, then I do think the easiest way for now is to make it superuser only. It's also possible that that may be sufficient always? But in the short term it allows us to tweak how we allow these things to be defined in a way that doesn't allow non-superusers to blow up the # of DOIs being minted.

I still think this combined with the settings approach I suggested above:
One suggestion was to change the setting at the installation level from just true / false to allow for null / true / false. In this case null would mean no file PIDs ever, true would be file PIDs on by default but can be overridden for a particular collection, and false would be file PIDs off by default, but can be overridden for a particular collection.
would be satisfactory for 5.14. I'm open to other suggestions, of course.

(And if we decide we really want to change this considerably, there's always the option to comment out the code for now and release 5.14 without this functionality and revisit for 6.1. I'm not sure how pressing this is. But that's why I like the superuser-only approach: it gets something out now and the tweaks can be for 6.1 or beyond.)

@cmbz

cmbz commented Jul 14, 2023

@scolapasta said: "I do think the easiest way for now is to make it superuser only. It's also possible that that may be sufficient always? But in the short term it allows us to tweak how we allow these things to be defined in a way that doesn't allow non-superusers to blow up the # of DOIs being minted."

I am happy with only superusers being granted this option for now.

@pdurbin
Member

pdurbin commented Jul 17, 2023

there's always the option to comment out the code for now and release 5.14 without this functionality and revisit for 6.1

Yes, this is the most straightforward and expedient way to release 5.14... to simply comment out the new PID code (or hide it behind a feature flag) and regroup in 6.1.
