21. Distribution — DataCite Metadata Schema 4.5 documentation #7
Comments
mediaType - contentURL.lastUpdated - contentURL.accessRights - how is this related to the dataset Rights? Do we really need it twice? Collections - |
Thanks Ted! Tagging in @kjgarza to help clarify these points. |
Hi @tedhabermann 👋
mediaType is per the distribution; if one distribution were to contain multiple files, one should follow either of the two recommendations mentioned (archival format OR BagIt). In both cases, the mediaType for such a distribution would be the archival format (see https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/guidance/distribution.html).
This is about the last time the
I think the doc makes clear that this is not about dataset rights in the same way the Rights fields are. Do we need to clarify further?
@KellyStathis and I talked about this and will add a footnote. |
@kjgarza - |
contentURL.lastUpdated - What user story requires knowing when the URL changed? How would the user know when the content changed, e.g. if they needed to download it again to get the most recent version? |
IMHO introducing another layer of rights and another vocabulary here is unnecessary and confusing. The Rights section of the metadata has this information in a way that people are used to. This vocabulary has four items, the three in the example and "restricted access" which don't really add very useful information. |
I agree with this point - if a harvester is downloading data files, it is more helpful to know whether content has been updated since the last download. |
@KellyStathis @tedhabermann for that use case, I think you should look at I guess the intent of |
Exactly. Checksum is a better attribute to look up for that use case. Thanks @xaucerr. The need behind lastUpdated is about dealing with changes that metadata providers (i.e., repositories) can potentially make to the contentUrl value that might not affect the file(s). For example, suppose the repository goes bust. In that case, the metadata will be persisted by DataCite (FAIR A2). Having a record of when the link to the content was last updated would allow agents/machines to understand when the link was previously working. Other use cases involve changes in the path/domain/protocol of the Distribution's contentUrl. I would advise you to take a look at the proposal for Distribution. This was discussed by the Working Group in June (see notes), and there are more details about the proposal in the board. |
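To make the checksum use case above concrete, here is a minimal sketch of how a harvester might decide whether to download a file again. It assumes the published checksum is a SHA-256 hex digest; the function names `sha256_of` and `needs_redownload` are hypothetical and are not part of the DataCite schema or any DataCite tooling.

```python
import hashlib
import io


def sha256_of(stream, chunk_size=65536):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def needs_redownload(local_stream, published_checksum):
    """True when the local copy does not match the published checksum."""
    return sha256_of(local_stream) != published_checksum.lower()


# Hypothetical data: "published" stands in for a checksum value that a
# harvester would read from the distribution metadata (an assumption).
published = hashlib.sha256(b"version 2 of the data").hexdigest()
stale_copy = io.BytesIO(b"version 1 of the data")
fresh_copy = io.BytesIO(b"version 2 of the data")

print(needs_redownload(stale_copy, published))
print(needs_redownload(fresh_copy, published))
```

Note that this only detects changes to the content. It says nothing about whether the contentUrl itself has moved, which is the separate case lastUpdated is meant to cover.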
As described in the proposal, the I think that's the rationale for that decision and the discussion back in June. I hope I'm not missing anything, but @sarala or Jan can add comments in case I missed something. |
This discussion here is relevant: IQSS/dataverse#5086 Worth considering how repositories using file DOIs (like Dataverse) might use the Distribution property, and whether an additional resourceTypeGeneral is needed for file DOIs. |
Sharing some from the Dataverse team here:
As a followup, I had asked: "Would you see this recommendation as practical for Dataverse to implement (as in, would you be able to provide a distribution as an archive format or a BagIt structure)?"
"Or would it be preferable to repeat the distribution property for each file in the dataset? I realize this may also depend on whether the specific Dataverse repository is using file DOIs."
I think this is worth exploring further. Specifically, we may want to revise this language to be a recommendation rather than a requirement: Every distribution should represent the same resource in its entirety. It should NOT be used to describe collections. |
I’m excited about the addition of distribution, but I have some concerns about the restriction “Every distribution should represent the same resource in its entirety.” In practice, this means that most of the records in our repository would have to be represented as a BagIt structure. This defeats a lot of the utility of providing a distribution, as data users will first have to download the entire bag and then parse out the file metadata. Here are some example use cases where the requirement is particularly problematic:
I recommend the committee think carefully about what concerns they have about not allowing broad use of the Distribution field. More specific recommendations (e.g. no more than 10 distributions, all distributions should have unique file names, etc.) might help alleviate concerns without unnecessarily restricting the use of the distribution property. |
The assumption that people are tracking the checksum for data files they are downloading is an interesting one. It seems unlikely to me. Of course they get the download date automatically as the creation date of the downloaded file. This can also be determined using the operating system directly, making it easy to check the date in deciding to repeat the download.
Also, as a user of the data I do not care about the URL changing as long as it resolves to the current content.
Ted
On Sep 2, 2022, at 4:44 AM, Kristian Garza wrote:
Exactly Checksum is a better attribute to look up for that use case . Thanks @xaucerr <https://github.com/xaucerr>
The need behind lastUpdated is about dealing with changes that metadata providers (i.e., repository) can potentially make to the contentUrl value that might not affect the file(s). For example, suppose the repository goes bust. In that case, the metadata will be persisted by DataCite (FAIR A2). Having a record of when the link to the content was last updated would allow agents/machines to understand when the link was previously working.
Other use cases involved changes in the path/domain/protocol of the Distribution's contentUrl. I might advise you to take a look at the proposal for Distribution <https://docs.google.com/document/d/1YZYBrL3z6fOY0j_nTFd97qq73XAiwJV9RrusxY-_dpI/edit#>. This was discussed by the Working Group in June <https://docs.google.com/document/d/1WIHn9RkgxyfwJVQJvbDCc5AQM8_EEQaJ_l2uNcDEqS0/edit#bookmark=id.i3aqd6lsu262>, and there are more details about the proposal in the board <https://miro.com/app/board/uXjVOLwzj6w=/?share_link_id=433060045641>.
|
@tmorrell I might be looking at a different part of the RFC, but I think the RFC guidance talks about using not a normal bag but what is defined as a holey bag. A holey bag is sent with “holes”, which saves disk space and network bandwidth. Therefore, I think that's not a concern, as the bags can be empty.
|
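As a rough illustration of the holey bag idea: in the BagIt specification (RFC 8493), a holey bag lists the missing payload files in a `fetch.txt` file, one per line, as a URL, a length in octets (or `-` if unknown), and the file path within the bag. The sketch below parses such a file. The `FetchEntry` type, the `parse_fetch_txt` function, and the example.org URLs are hypothetical and only for illustration.

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives "-" (size unknown)
    path: str


def parse_fetch_txt(text: str) -> List[FetchEntry]:
    """Parse the lines of a BagIt fetch.txt into (url, length, path) entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first two runs of whitespace only, so that a file
        # path containing spaces stays in one piece.
        url, length, path = line.split(None, 2)
        entries.append(FetchEntry(url, None if length == "-" else int(length), path))
    return entries


# Hypothetical fetch.txt content for a bag whose payload is stored elsewhere.
SAMPLE_FETCH = """\
https://example.org/data/part1.csv 1024 data/part1.csv
https://example.org/data/part2.csv - data/part2.csv
"""

for entry in parse_fetch_txt(SAMPLE_FETCH):
    print(entry.path, entry.length, entry.url)
```

With a structure like this, a client can read the file list and fetch only the files it needs, without downloading the full payload first.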
@xaucerr Ah, you're right. The paragraph at the top of https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/guidance/distribution.html should be clearer about this, since I've previously only run across bags with all the files included. |
Hi everyone, Sorry for being "late to the party", I only recently found out about the new Distribution element of the proposed v4.5 metadata schema. I'm excited about the introduction of That said, I have a number of concerns about the v4.5 schema as-is, in particular with how BagIt is currently the recommended way for describing multiple files. I've tried to put those concerns in the Google doc and as comments on the readthedocs page. At the risk of repeating myself, perhaps a succinct summary here would help. Here are my five concerns:
As an alternative, I've been looking at the Metalink format. Similar to BagIt, the Metalink format has been defined by an RFC, and provides links to the underlying data. Metalink supports a file having multiple URLs and has an IANA-registered mediaType. It's also more widely adopted than BagIt: there are several open-source client software packages that support Metalink. Do you see DataCite metadata supporting both BagIt and Metalink as a valid way to describe a dataset with multiple files? If the focus is on BagIt, how do you plan to address the above concerns? I think (as a minimum) BagIt should have a registered media type, but I don't even know who to contact to request this. Who's maintaining BagIt? Cheers, |
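For readers unfamiliar with Metalink, here is a minimal sketch of reading a Metalink document (RFC 5854) that gives one file two alternative URLs. The sample document, the example.org and example.net URLs, and the `list_files` helper are hypothetical; only the XML namespace and the `file`, `size`, and `url` element names come from the RFC.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by RFC 5854 for Metalink documents.
NS = {"ml": "urn:ietf:params:xml:ns:metalink"}

# Hypothetical Metalink document: one file reachable at two locations.
SAMPLE_METALINK = """<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="data/part1.csv">
    <size>1024</size>
    <url>https://example.org/data/part1.csv</url>
    <url>https://mirror.example.net/data/part1.csv</url>
  </file>
</metalink>
"""


def list_files(metalink_xml):
    """Map each file name in a Metalink document to its list of URLs."""
    root = ET.fromstring(metalink_xml)
    files = {}
    for file_el in root.findall("ml:file", NS):
        urls = [u.text.strip() for u in file_el.findall("ml:url", NS)]
        files[file_el.get("name")] = urls
    return files


for name, urls in list_files(SAMPLE_METALINK).items():
    print(name, urls)
```

The point of the sketch is only that a standard XML parser is enough to recover the per-file URL list; whether this suits DataCite metadata is the open question raised above.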
Thanks everyone for the feedback here, on the Schema 4.5 RFC, and on the Guidance page (#17)! @paulmillar, appreciate the summary here as well. It seems that most of the issues revolve around the proposed requirement to use a single Distribution to describe multiple files: "Every distribution should represent the same resource in its entirety". This requirement directly led the Metadata WG to the recommendation to use BagIt, as articulated in the Guidance section: Using Distribution for a collection of files. I might suggest that if we change the first requirement to a recommendation (as in, we allow users to include multiple files as separate Distribution instances, but recommend container files/bags/etc.), it will improve the situation. If there is no requirement for a single Distribution to represent a collection, the guidance for how to represent a collection as a single Distribution becomes less urgent. This change would not only allow increased flexibility for those who don't have container files (or resources to implement BagIt/Metalink etc.), but it would also give the Metadata WG more time to work on the guidance. The proposed Distribution property is agnostic to the file type used: BagIt was only a recommendation, and we can change recommendations as we learn more and as best practices evolve. I will leave the BagIt-specific questions for @kjgarza, who is more familiar with this. Even if the schema no longer requires a single Distribution, I still think we want to improve on Using Distribution for a collection of files to address these concerns. Another option would be for the DataCite Metadata WG not to make a specific recommendation here, but to leave this for the community to develop best practices. |
Hi. Thanks for the feedback @KellyStathis. Others can correct me, but my impression is that this topic (how to represent multiple files within a Distribution) is somewhat immature, in the sense that I'm not sure the scientific communities using the DataCite metadata schema have reached a consensus on how to represent "a set of links". Of course, this could just be me! To be honest, the introduction of Distribution caught me as a (pleasant) surprise. Therefore, I support the idea of making DataCite metadata v4.5 somewhat non-committed on how to represent multiple files. I think this should be coupled with some process through which DataCite can reach consensus and make more concrete recommendations in some future version of the schema. This process could be just another round of public comments, but I think something more engaging (e.g., a working group) might be needed. That said, I'm not sure it's helpful to allow Distribution to represent the individual files from a dataset. The BagIt proposal provides a concrete and practical way of representing access to the files in a dataset. I do have some concerns, but I like the overall idea and direction. BagIt might not be the optimal solution for all communities but (for each community) that could change over time. As a concrete proposal, I would suggest:
|
21. Distribution — DataCite Metadata Schema 4.5 documentation
https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/properties/recommended_optional/property_distribution.html