21. Distribution — DataCite Metadata Schema 4.5 documentation #7

Closed
utterances-bot opened this issue Aug 24, 2022 · 20 comments · Fixed by #23


@utterances-bot

21. Distribution — DataCite Metadata Schema 4.5 documentation

https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/properties/recommended_optional/property_distribution.html

Collaborator

mediaType -
This has a cardinality of 1 - what about multiple files? Does not seem to be mentioned in bagIt docs.

contentURL.lastUpdated -
is defined as the last time the content URL was updated. Is this supposed to be the last time the content was updated? What is the relationship of the distribution metadata and the dataset metadata (i.e. are the lastUpdated dates the same?).

contentURL.accessRights - how is this related to the dataset Rights? Do we really need it twice?

Collections -
Length as a number of ‘octets’ - I think octet is the French name for a byte - can we use bytes instead? We do in contentURL discussion.

@KellyStathis
Collaborator

Thanks Ted! Tagging in @kjgarza to help clarify these points.

@kjgarza

kjgarza commented Aug 31, 2022

Hi @tedhabermann 👋

mediaType -
This has a cardinality of 1 - what about multiple files? Does not seem to be mentioned in bagIt docs.

mediaType is per distribution; if one distribution contains multiple files, one should follow either of the two recommendations mentioned (archival format OR BagIt). In both cases, the mediaType for such a distribution would be that of the archival format (see https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/guidance/distribution.html).

contentURL.lastUpdated -
is defined as the last time the content URL was updated. Is this supposed to be the last time the content was updated? What is the relationship of the distribution metadata and the dataset metadata (i.e. are the lastUpdated dates the same?).

This is about the last time the contentURL value was updated. Maybe we can add the word value to clarify further.

contentURL.accessRights - how is this related to the dataset Rights? Do we really need it twice?

I think the doc makes clear that this is not about dataset rights in the same way the Rights fields are. Do we need to clarify further?

Collections -
Length as a number of ‘octets’ - I think octet is the French name for a byte - can we use bytes instead? We do in contentURL discussion.

@KellyStathis and I talked about this and will add a footnote.

@tedhabermann
Collaborator

@kjgarza -
In reading the documentation for the data files I did not see a way to describe the mime types of the actual data files. These are probably more important than the mime type of the archive. Did I miss something?

@tedhabermann
Collaborator

contentURL.lastUpdated - What user story requires knowing when the URL changed? How would the user know when the content changed, e.g. if they needed to download it again to get the most recent version?

Collaborator

IMHO introducing another layer of rights and another vocabulary here is unnecessary and confusing. The Rights section of the metadata has this information in a way that people are used to. This vocabulary has four items, the three in the example and "restricted access" which don't really add very useful information.

@KellyStathis
Collaborator

KellyStathis commented Sep 1, 2022

contentURL.lastUpdated - What user story requires knowing when the URL changed? How would the user know when the content changed, e.g. if they needed to download it again to get the most recent version?

I agree with this point - if a harvester is downloading data files, it is more helpful to know whether content has been updated since the last download.

@xaucerr

xaucerr commented Sep 2, 2022

@KellyStathis @tedhabermann for that use case, I think you should look at 21.2 Checksum. Parsing this page, I think a change in the Distribution file would be reflected in the checksum (as it would be different, and a harvester can confirm that when downloading the content or comparing checksums). Wouldn't you agree?

I guess the intent of 21.1a lastUpdated is different because it's a sub-property of 21.1 ContentUrl; hence only refers to that field.
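
To make the checksum idea concrete, here is a minimal sketch of how a harvester might decide whether to re-download content. It assumes the harvester stored the hex checksum published in the metadata; the function name `content_changed`, the choice of SHA-256, and the sample bytes are illustrative assumptions, not part of the schema.

```python
import hashlib


def content_changed(data: bytes, recorded_checksum: str, algorithm: str = "sha256") -> bool:
    """Return True when the downloaded bytes no longer match the recorded checksum."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest.lower() != recorded_checksum.strip().lower()


# Hypothetical content and the checksum a harvester recorded for it earlier
original = b"temperature,depth\n4.1,10\n"
recorded = hashlib.sha256(original).hexdigest()

print(content_changed(original, recorded))                # same bytes as before
print(content_changed(original + b"4.3,20\n", recorded))  # content has been modified
```

Note that this only detects a change after the content has been fetched again (or after the published checksum itself changes); it says nothing about when the change happened.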

@kjgarza

kjgarza commented Sep 2, 2022

Exactly, Checksum is a better attribute to look up for that use case. Thanks @xaucerr

The need behind lastUpdated is about dealing with changes that metadata providers (i.e., repository) can potentially make to the contentUrl value that might not affect the file(s). For example, suppose the repository goes bust. In that case, the metadata will be persisted by DataCite (FAIR A2). Having a record of when the link to the content was last updated would allow agents/machines to understand when the link was previously working.

Other use cases involve changes in the path/domain/protocol of the Distribution's contentUrl. I'd advise you to take a look at the proposal for Distribution. This was discussed by the Working Group in June (see notes), and there are more details about the proposal on the board.

@KellyStathis
Collaborator

Thanks @kjgarza and @xaucerr, this is helpful. Since the public feedback period hasn't officially opened yet, I am going to add a note to clarify this. For comments on or after Sept 12, I will not be making changes until after the feedback period is closed.

@kjgarza

kjgarza commented Sep 7, 2022

IMHO introducing another layer of rights and another vocabulary here is unnecessary and confusing. The Rights section of the metadata has this information in a way that people are used to. This vocabulary has four items, the three in the example and "restricted access" which don't really add very useful information.

As described in the proposal, the Distribution emphasizes machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention). In that regard, the subgroup concluded that the 16. Rights property allows "Findability" of the resource or resources but doesn't allow "Machine-Actionability" of the resource (FAIR). Without the 21.3 accessRights sub-property, a machine/agent would have to make assumptions: for example, whether multiple Distributions share the same access rights or which access rights apply to which distribution.
Moreover, 21.3 only describes access rights, so it doesn't overlap with Rights, whose scope is broader (copyright or licensing information).
Furthermore, 21.3 accessRights aligns the Distribution property with W3C vocabularies such as DCAT and with schema.org. With regard to the vocabulary choice, this is a recommended value.

I think that's the rationale for that decision and the discussion back in June. I hope I'm not missing anything, but @sarala or Jan can add comments in case I missed something.
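
As a rough illustration of the machine-actionability argument, the sketch below shows an agent filtering distributions by a per-distribution access-rights value without any guessing. The `Distribution` class, its field names, and the "open access" value are assumptions made for this example; only "restricted access" is a vocabulary term mentioned in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Distribution:
    content_url: str
    media_type: str
    access_rights: str  # illustrative values; the real vocabulary is defined in the draft


def openly_accessible(distributions: List[Distribution]) -> List[Distribution]:
    """Keep only the distributions an agent can fetch without negotiating access."""
    return [d for d in distributions if d.access_rights == "open access"]


# Hypothetical record with two distributions that have different access rights
record = [
    Distribution("https://example.org/data/full.zip", "application/zip", "restricted access"),
    Distribution("https://example.org/data/sample.zip", "application/zip", "open access"),
]

for dist in openly_accessible(record):
    print(dist.content_url)
```

With a single record-level Rights statement, the agent could not tell these two distributions apart.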

Collaborator

This discussion here is relevant: IQSS/dataverse#5086

Worth considering how repositories using file DOIs (like Dataverse) might use the Distribution property, and whether an additional resourceTypeGeneral is needed for file DOIs.

@KellyStathis
Collaborator

Sharing some feedback from the Dataverse team here:

The new Distribution property and its subproperties look like great additions, compared to the current way that the Dataverse software includes information in its "OpenAIRE" metadata export. For example, see the XML metadata record at https://dataverse.harvard.edu/api/datasets/export?exporter=oai_datacite&persistentId=doi%3A10.7910/DVN/URL6A4, where each file's size and format are listed separately. One way this seems less desirable is there's no way to know which format and size belongs to the same file.

As a followup, I had asked:

"Would you see this recommendation as practical for Dataverse to implement (as in, would you be able to provide a distribution as an archive format or a BagIt structure)?"

Ah, so the way I thought we'd use the Distribution property, for each file in a dataset, would not be recommended. Thanks for the clarification!

Dataverse is able to package all files of a dataset in a BagIt structure, and Dataverse does let people download all files in a dataset as a regular old .zip file. So I think both are possible.

"Or would it be preferable to repeat the distribution property for each file in the dataset? I realize this may also depend on whether the specific Dataverse repository is using file DOIs."

We would need to discuss this with the Dataverse community, maybe starting with whoever in the community added those sections to the OpenAIRE exports (e.g. the one I mentioned at https://dataverse.harvard.edu/api/datasets/export?exporter=oai_datacite&persistentId=doi%3A10.7910/DVN/URL6A4). However, we use Schema.org's distribution property to describe and add a download link to each file, e.g. https://dataverse.harvard.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.7910/DVN/URL6A4. And I think folks in the Dataverse community have found this helpful.

I think this is worth exploring further. Specifically, we may want to revise this language to be a recommendation rather than a requirement: Every distribution should represent the same resource in its entirety. It should NOT be used to describe collections.


tmorrell commented Sep 30, 2022

I’m excited about the addition of distribution, but I have some concerns about the restriction “Every distribution should represent the same resource in its entirety.” In practice, this means that most of the records in our repository would have to be represented as a BagIt structure. This defeats a lot of the utility of providing a distribution, as data users will first have to download the entire bag and then parse out the file metadata. Here are some example use cases where the requirement is particularly problematic:

  • A record with a .netcdf data file, license file, and readme file. Users or automated systems want to download either the .netcdf data file, or the readme, or the license. It would be nice to list them in a distribution, even though each file is not the ‘entirety’ of the record. https://doi.org/10.14291/tccon.ggg2014.anmeyondo01.R0/1149284
  • A record with files at different resolutions. If these are listed in a distribution, it makes it easy to download the version of interest by looking at the file size. But it’s not clear that this would be allowed since they are not the “entirety” of the record. https://doi.org/10.22002/D1.1961

I also have concerns for requiring Bags for records that have large files, as the format will require users to download the entire Bag and all files to see what files are included in the collection.

I recommend the committee think carefully about what concerns they have about not allowing broad use of the Distribution field. More specific recommendations (e.g. no more than 10 distributions, all distributions should have unique file names, etc.) might help alleviate concerns without unnecessarily restricting the use of the distribution property.
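
To illustrate the per-file use cases above, here is a minimal sketch of a client reading several Distribution entries and picking one by size. The `Distribution` element with a `mediaType` attribute and a `contentURL` child follows the draft; the `<distributions>` wrapper, the `byteSize` attribute placement, the media types, and the URLs are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical fragment: one Distribution per file in the record
SAMPLE = """<distributions>
  <Distribution mediaType="application/x-netcdf" byteSize="52428800">
    <contentURL>https://example.org/record/1/data.nc</contentURL>
  </Distribution>
  <Distribution mediaType="text/plain" byteSize="2048">
    <contentURL>https://example.org/record/1/README.txt</contentURL>
  </Distribution>
</distributions>"""


def list_distributions(xml_text: str) -> List[Tuple[str, str, int]]:
    """Return (url, media type, size in bytes) for every Distribution element."""
    root = ET.fromstring(xml_text)
    rows = []
    for dist in root.findall("Distribution"):
        url = dist.findtext("contentURL", default="").strip()
        rows.append((url, dist.get("mediaType", ""), int(dist.get("byteSize", "0"))))
    return rows


def smallest(xml_text: str) -> Tuple[str, str, int]:
    """Pick the distribution with the smallest declared size."""
    return min(list_distributions(xml_text), key=lambda row: row[2])


for row in list_distributions(SAMPLE):
    print(row)
print("smallest:", smallest(SAMPLE)[0])
```

Because size and format sit on the same element as the URL, there is no ambiguity about which values belong to which file.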

@tedhabermann
Collaborator

tedhabermann commented Oct 11, 2022 via email

@xaucerr

xaucerr commented Oct 12, 2022

I also have concerns for requiring Bags for records that have large files, as the format will require users to download the entire Bag and all files to see what files are included in the collection.

@tmorrell I might be looking at a different part of the RFC, but I think the RFC guidance talks about using not a normal bag but what would be defined as a holey bag. A holey bag is sent with “holes”, which saves disk space and network bandwidth. Therefore, I think that's not a concern, as the bags can be empty.

A subdirectory named “data/”, which is intended to hold the bag’s content payload but will be empty in the current use case.
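
For readers who have only seen complete bags: in a holey bag the payload is listed in `fetch.txt`, where each line gives a URL, a length in octets (or `-` when unknown), and the file path inside the bag. Below is a minimal parsing sketch under that reading of the BagIt specification; the `FetchEntry` type, the function name, and the sample URLs are hypothetical.

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # size in octets, or None when the bag gives "-"
    path: str              # where the file belongs inside the bag


def parse_fetch_txt(text: str) -> List[FetchEntry]:
    """Parse fetch.txt lines of the form: url length filepath."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two runs of whitespace; the path may contain spaces
        url, length, path = line.split(None, 2)
        entries.append(FetchEntry(url, None if length == "-" else int(length), path.strip()))
    return entries


# Hypothetical fetch.txt content
sample = (
    "https://example.org/files/data.nc 1048576 data/data.nc\n"
    "https://example.org/files/README.txt - data/README.txt\n"
)

for entry in parse_fetch_txt(sample):
    print(entry.path, entry.length, entry.url)
```

A client can therefore list the files and their sizes from the small bag alone and download only what it needs.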

@tmorrell

@xaucerr Ah, you're right. The paragraph at the top of https://datacite-metadata-schema.readthedocs.io/en/4.5_draft/guidance/distribution.html should be clearer about this, since I've previously only run across bags with all the files included.

@paulmillar

Hi everyone,

Sorry for being "late to the party", I only recently found out about the new Distribution element of the proposed v4.5 metadata schema.

I'm excited about the introduction of the Distribution element in the metadata schema. There are a number of use-cases (both explicitly in my science domain and more generally) for using this information, where a DOI should resolve to the underlying data.

That said, I have a number of concerns about the v4.5 schema as-is, in particular with how BagIt is currently the recommended way for describing multiple files. I've tried to put those concerns in the Google doc and as comments on the readthedocs page. At the risk of repeating myself, perhaps a succinct summary here would help.

Here are my five concerns:

  1. BagIt has no registered media type. The mediaType attribute would be registered as a .zip file. Therefore, from DataCite metadata alone, there's no way to distinguish between BagIt and any other zip file. For example:
<Distribution mediaType="application/zip">
    <contentURL>http://example.org/my-dataset/maybe-bagit-or-maybe-the-data.zip</contentURL>
</Distribution>
  2. The metadata's use of BagIt to describe access to data (i.e., a BagIt file with no data, just links) is a profile. I believe there is currently no formal definition of this profile. The BagIt profile specification describes how to write down the expectations when DataCite metadata is using BagIt as a container for multiple files.
  3. A dataset might already be packaged in BagIt (or one of the derived formats, such as DataCrate or RO-Crate). There's currently no way (from only DataCite metadata) to distinguish between different BagIt profiles. A client (currently) cannot know whether it is downloading the data or downloading just the URLs to the data. A client might guess from byteSize (a file larger than some threshold is probably data), but it would only be a guess.
  4. There doesn't seem to be wide-spread client application support for BagIt fetch URLs. Given a BagIt file that follows the DataCite recommendation, how do I download the files?
  5. BagIt has some limitations. In particular, it doesn't support multiple URLs per file. I have a use-case where different clients would use different protocols (so, different URLs) to obtain the same file. I would have to use a non-standard extension (as a short-term fix) and flag this as something that needs fixing upstream.
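
To illustrate the first and third concerns: since the media type alone says only "zip", a client has to download the archive and look inside before it knows what it has. The sketch below is one possible heuristic, not an agreed procedure; it assumes a bag declares itself with a `bagit.txt` file either at the archive root or inside a single top-level directory. The helper names and sample archives are hypothetical.

```python
import io
import zipfile


def looks_like_bagit(zip_bytes: bytes) -> bool:
    """Heuristic check for a bagit.txt declaration near the top of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            parts = name.strip("/").split("/")
            # Accept bagit.txt at the root or inside one top-level directory
            if parts[-1] == "bagit.txt" and len(parts) <= 2:
                return True
    return False


def make_zip(files: dict) -> bytes:
    """Build a small in-memory zip so the example needs no files on disk."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


bag = make_zip({
    "my-bag/bagit.txt": "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
    "my-bag/fetch.txt": "https://example.org/files/data.nc 1048576 data/data.nc\n",
})
plain = make_zip({"data.csv": "a,b\n1,2\n"})

print(looks_like_bagit(bag), looks_like_bagit(plain))
```

Even when this check succeeds, it does not tell the client which BagIt profile applies, which is the gap the second concern points at.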

As an alternative, I've been looking at the Metalink format.

Similar to BagIt, the Metalink format has been defined by an RFC and provides links to the underlying data. Metalink supports a file having multiple URLs and has an IANA-registered mediaType. It's also more widely adopted than BagIt: there are several open-source client software packages that support Metalink.
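
As a small illustration of the multiple-URLs-per-file point, here is a sketch that reads a Metalink 4 document and lists each file's URLs in priority order (lower `priority` values are preferred). The sample document, its URLs, and the function name are made up for this example; only the namespace and element names come from the Metalink RFC.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

NS = {"ml": "urn:ietf:params:xml:ns:metalink"}

# Hypothetical Metalink document: one file reachable over two protocols
SAMPLE = """<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="data.nc">
    <size>1048576</size>
    <url priority="2">ftp://example.org/files/data.nc</url>
    <url priority="1">https://example.org/files/data.nc</url>
  </file>
</metalink>"""


def urls_by_file(document: str) -> Dict[str, List[str]]:
    """Map each file name to its URLs, most preferred first."""
    root = ET.fromstring(document)
    result = {}
    for file_el in root.findall("ml:file", NS):
        urls = file_el.findall("ml:url", NS)
        # Missing priority is treated as the lowest preference
        urls.sort(key=lambda u: int(u.get("priority", "999999")))
        result[file_el.get("name")] = [u.text.strip() for u in urls]
    return result


for name, urls in urls_by_file(SAMPLE).items():
    print(name, urls)
```

A client that only speaks one protocol can simply pick the first URL whose scheme it supports.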

Do you see DataCite metadata supporting both BagIt and Metalink as a valid way to describe a dataset with multiple files?

If the focus is on BagIt, how do you plan to address the above concerns?

I think (as a minimum) BagIt should have a registered media type, but I don't even know who to contact to request this. Who's maintaining BagIt?

Cheers,
Paul.

@KellyStathis
Collaborator

Thanks everyone for the feedback here, on the Schema 4.5 RFC, and on the Guidance page (#17)! @paulmillar, appreciate the summary here as well.

It seems that most of the issues revolve around the proposed requirement to use a single Distribution to describe multiple files: "Every distribution should represent the same resource in its entirety". This requirement directly led the Metadata WG to the recommendation to use Bagit, as articulated in the Guidance section: Using Distribution for a collection of files.

I might suggest that if we change the first requirement to a recommendation (as in, we allow users to include multiple files as separate Distribution instances, but recommend container files/bags/etc.), it will improve the situation. If there is no requirement for a single Distribution to represent a collection, the guidance for how to represent a collection as a single Distribution becomes less urgent.

This change would not only allow increased flexibility for those who don't have container files (or resources to implement BagIt/Metalink etc), but it would also give the Metadata WG more time to work on the guidance. The proposed Distribution property is agnostic to the file type used—Bagit was only a recommendation, and we can change recommendations as we learn more and as best practices evolve.

I will leave the Bagit-specific questions for @kjgarza who is more familiar with this. Even if the schema no longer requires a single Distribution, I still think we want to improve on Using Distribution for a collection of files to address these concerns. Another option would be for the DataCite Metadata WG not to make a specific recommendation here, but to leave this for the community to develop best practices.

@paulmillar

Hi.

Thanks for the feedback @KellyStathis.

Others can correct me, but my impression is that this topic (how to represent multiple files within a Distribution) is somewhat immature, in the sense that I'm not sure the scientific communities using the DataCite metadata schema have reached a consensus on how to represent "a set of links". Of course, this could just be me! To be honest, the introduction of Distribution came as a (pleasant) surprise.

Therefore, I support the idea of making DataCite metadata v4.5 somewhat non-committed on how to represent multiple files. I think this should be coupled with some process through which DataCite can reach consensus and make more concrete recommendations in some future version of the schema. This process could be just another round of public comments, but I think something more engaging (e.g., a working group) might be needed.

That said, I'm not sure it's helpful to allow Distribution to represent the individual files from a dataset. The BagIt proposal provides a concrete and practical way of representing access to the files in a dataset. I do have some concerns, but I like the overall idea and direction. BagIt might not be the optimal solution for all communities but (for each community) that could change over time.

As a concrete proposal, I would suggest:

  • Leave Distribution as representing the whole of the dataset.
  • Add some kind of note/disclaimer to the v4.5 specification that best practice regarding distributing multiple files is still under discussion. Early adopters are encouraged to use the BagIt format (as described) and report their experiences, but that (for now) other approaches are allowed.
  • Establish a mechanism through which DataCite metadata users can reach a consensus, or (at least) easily provide feedback.
  • More concrete specification will be provided on the v4.6 or v5.0 timeline [delete as appropriate].


7 participants