Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iqss/6497 semantic api #7414

Merged

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Nov 16, 2020

What this PR does / why we need it: This adds an API call that allows metadata to be added using json-ld - in the same format as the OAI-ORE export, and without having to retrieve the existing metadata and having to understand the way Dataverse is storing things internally. It's one step in supporting dataset migration via RDA Bags (see #6497) that might be useful independent of additional api calls to migrate a dataset. For example, it allows submission of new terms and conditions.

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Suggestions on how to test this: see examples in comments below

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

Conflicts:
	src/main/java/edu/harvard/iq/dataverse/api/Datasets.java
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is looking really good and I'm excited to have a smaller, more compact JSON format for creating a dataset. That was the dream of #3068 (which focused on output, initially).

I left comments and questions throughout. I'd be happy to get on a call to discuss them. I haven't run the code locally yet but I might like to do that too.

@pdurbin pdurbin assigned qqmyers and unassigned pdurbin Jul 1, 2021
@qqmyers
Copy link
Member Author

qqmyers commented Jul 2, 2021

@pdurbin - thanks for reviewing! I think I've addressed the straight-forward issues. The one thing that is tbd would be to decide on whether to leave the metadataOnOrig functionality in or not. It could probably be pulled out for now and treated as another PR (that could sit for a while) if there's concern about having it in the code now. (FWIW: I think DANS has created a custom metadata block for the metadata they had in their old system that doesn't match and therefore isn't relying on the metadataOnOrig functionality.)

If I missed something, let me know. I'm also happy to just talk through things as well.

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the changes are looking good. A few things on my mind:

  • As suggested by @qqmyers, I vote we remove the metadataOnOrig code and tsv if we really don't need it. It's the kind of thing we can add in later with more discussion, if necessary. If we decided to keep it, it definitely needs to be documented.
  • In the guides, I'd be happy to have a little more of an intro to JSON-LD and examples of how to create a dataset with one author vs. multiple, for example. We can live without this though, so please consider it a "nice to have".
  • Under "Suggestions on how to test this" it says "see examples in comments below" but I'm confused about what this means. Perhaps it means the examples in the guides. I'd suggest clarifying this.
  • I'm still slightly weirded out that we have URLs like https://dataverse.org/schema/citation/Description in examples but they're 404s.
  • I left a comment about the case being used in the JSON-LD examples. At first glance it doesn't seem consistent.

@@ -0,0 +1,5 @@
#metadataBlock name dataverseAlias displayName blockURI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was all non-obvious to me from a quick look at the code. If it's on the table to remove the metadataOnOrig functionality, I think we should. If you get other opinions that we should keep it, please document how it works.

"http://purl.org/dc/terms/creator": {
"https://dataverse.org/schema/citation/author#Name": "Finch, Fiona",
"https://dataverse.org/schema/citation/author#Affiliation": "Birds Inc."
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I figured. When I was working on the Schema.org JSON-LD output I was shocked by how loosey goosey JSON-LD is.

At some point it might behoove us to add a little crash course on JSON-LD to the guides, or at least plenty of examples so that users get the hang of it. Otherwise, I think we can anticipate questions like "How do I specify multiple authors?"

"https://dataverse.org/schema/citation/datasetContact#E-mail": "finch@mailinator.com",
"https://dataverse.org/schema/citation/datasetContact#Name": "Finch, Fiona"
},
"https://dataverse.org/schema/citation/Description": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capitalization seems inconsistent.

Contact and Description (title case) but also author, datasetContact and dsDescription (camel case)?

What are the rules?

Copy link
Member Author

@qqmyers qqmyers Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These come from the citation.tsv block with the pattern <block name>/<field title> for single fields and <block name>/<field title>/<child field name> being used. That choice was made back when OAI_ORE/BagIt/archiving was introduced. The use if title was an attempt to use URIs that mirrored what users see in the UI. I'm less sure why I used name for child fields - not sure if there was a conflict or if it was an issue with originally just trying a flatter <blockname>/<childfield title> and realizing that there are multiple child fields with the title 'Name' for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Sounds like a preexisting condition to me. 😄 It would be nice to have more consistency but oh well. I assume we don't want to revisit decisions made during BagIt export.

@qqmyers
Copy link
Member Author

qqmyers commented Jul 8, 2021

I've removed the metadataOnOrig logic (and created https://github.com/GlobalDataverseCommunityConsortium/dataverse/tree/RDA-with_metadataOnOrig to keep that code around as a proof-of-concept for RDA/if/when needed).

Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still haven't run the code myself but this seems ready for QA. Thanks for the pull request and for hanging there with changes I asked for, @qqmyers!

"http://purl.org/dc/terms/creator": {
"https://dataverse.org/schema/citation/author#Name": "Finch, Fiona",
"https://dataverse.org/schema/citation/author#Affiliation": "Birds Inc."
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for more docs over time. I think we can live with what we have now, especially since it's experimental.

@kcondon kcondon self-assigned this Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GDCC: SciencesPO related to GDCC work for Sciences PO
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants