Harvesting DOI metadata from non-OAI-PMH sources #5402
What we are currently doing to prepare our repository:
Now that is the stuff I want to get into our institutional dataverse. This is only about metadata! The data would reside at its original source.
#5104 seems to be closely related.
@donsizemore You mentioned Python code in the chat. What does it do exactly?
@RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. 😄 For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse Also, breaking down issues is almost always good. It makes them easier to estimate. 👍
Very nice. Sign me up! Don't think that your threat will stop me 😆
I feel honored to be mentioned :D
I'd be glad to. It would be great if some other people with general interest in harvesting features joined on this issue to make it easier to smash it into digestible pieces and prioritize them. Maybe there are also already good solutions to (some of) the problems in existence... @pdurbin, could you help me out with some more of your community magic?
How about re-framing this as "Harvest metadata from a list of DOIs"?
Maybe. Maybe we should try to tell a user story. How's this? "As a user, I'd like to collect datasets in Dataverse based on metadata available in DataCite. These datasets would behave somewhat like harvested datasets in that they are read only and would clearly indicate that they did not originate in Dataverse." I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?
We don't publish any data ourselves. Therefore, it is necessary to collect references (DOIs) from the diverse places where data has been published. For example, our unit DD is responsible for Components 1 and 6 of the German Longitudinal Election Study. The data resides at GESIS, but we would like to reference it in our institutional repository (based on Dataverse). So we would like to add the following DOIs from that page to our catalogue and map (link) them to the dataverse of the unit DD and to those of individual researchers (if they want to). Also, we want to use that information to feed the CRIS. https://doi.org/10.4232/1.13089
Metadata should be retrieved from the best source available:
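One practical way to pull metadata from "the best source available" is DOI content negotiation against the doi.org resolver, which forwards the request to whichever registration agency (DataCite, Crossref, ...) minted the DOI. A minimal sketch, assuming the standard `application/vnd.datacite.datacite+json` media type; the helper name is purely illustrative:

```python
# Sketch: build a content-negotiation request for DOI metadata.
# The doi.org proxy and the DataCite JSON media type are standard;
# the function name is illustrative, not from any library.
def datacite_request(doi):
    url = f"https://doi.org/{doi}"
    headers = {"Accept": "application/vnd.datacite.datacite+json"}
    return url, headers

url, headers = datacite_request("10.4232/1.13089")
print(url)  # → https://doi.org/10.4232/1.13089
```

The same pattern works for Crossref-registered DOIs, since doi.org forwards the Accept header to whichever agency registered the DOI.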
@RightInTwo thanks, this is helping. For now could you use the "Related Datasets" field to collect those DOIs? I just tried this at https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/24U2VG and here's a screenshot: The "Related Datasets" field is multivalued, which is nice, and it supports HTML, so I was able to link to the DOIs, but there isn't much structure to it. It all just goes in a single text area. What do you think? What does @jggautier think? 😄
For now I would just use the "Other ID" field, but it would be best to have the DOI in the actual "Dataset Persistent ID" field. We are currently collecting them in a database outside of Dataverse, but at some point it would be great to get them in there together with the metadata. Before we manage that, we don't really want to make our Dataverse public (not even within the institute). (Improving on #5998 would be appreciated anyway...)
I have no idea if this factoid is helpful or not but Dataverse can harvest its own native JSON format over OAI-PMH. This means that every single metadata field is available, even custom metadata blocks. (That's my understanding anyway.) The downside, of course, is that you'd have to implement our crazy native JSON format in the harvesting server you create. 😄
Wouldn't it be easier to implement a separate service for this? I'm also thinking in the direction of maybe slicing up Dataverse a bit and moving the complete harvesting into a separate module. It could run on its own, offer easier scaling and use the Dataverse API to load new stuff into the database. (No microservice, but a modulith) It could either use Quarkus/Spring (stay'n in Java) or Python (excellent pyDataverse) 😉
@poikilotherm Hi Oliver, that is kind of what I'm building now, except that I don't use any of the libraries but rather try to build it without constraints or validation in JS with jquery. Not because it is so great, but because the colleagues who will take over for me (my contract ends in March) don't have any programming background except for some jquery runtime manipulation in the browser... Now, that all could be totally different if...
Good point! Maybe a small OAI-PMH server could be part of the solution then. @donsizemore We once had a chat about this topic - are you still interested? |
@RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! 🎉 Developed primarily by @tainguyenbui it may be new but it's moving fast! And it's on npm. |
Yes, I just discovered that yesterday! Let's see what the other people on here think about the choices regarding language and architecture. Maybe @skasberger could also contribute with his opinion? |
Here is some code as an example of how a quick and dirty import of Datacite metadata via DDI-XML works. After some experiments with mapping from one Python dict to another, aiming to create a dict in the Dataverse JSON format, I ended up with a solution that really earns the "quick and dirty" tag: just insert everything into a string in the DDI-XML format accepted by /datasets/:importddi. Here is the Python code (just as an example; should I put it in its own repo?)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
from jsonpath_ng.ext import parse as parseJsonPath
from requests import get, post
# pyDataverse doesn't provide an API to import DDI-XML yet, so we just use requests
#from pyDataverse.api import Api
#from pyDataverse.models import Dataverse
#%%
######################################################
# Setup
## Define API URLs
apimethods = {
    'datacite_get_datacitejson': {
        'usage': "get the datacite+json representation of the DOI metadata. you need to append a DOI (just the ID!) to the url",
        'url': 'https://data.datacite.org/application/vnd.datacite.datacite+json/'
    },
    'datacite_get_xbibliography': {
        'usage': "get the x-bibliography representation of the DOI metadata. you need to append a DOI (just the ID!) to the url",
        'url': 'https://data.datacite.org/text/x-bibliography/'
    }
}
## Provide API key
apikey = {
    'wzbdataverse': '{insert API key here}'
}
## Provide base URL
baseurl = {
    'wzbdataverse': 'https://dataverse.wzb.eu'
}
#%%
######################################################
# QUICK AND DIRTY function to map from Datacite+json (as a python dict) to DDI-XML (as a string)
def ddiXmlFromDoi(doi):
    # .json() / .text instead of raw .content, so we don't interpolate bytes into the XML
    md = get(apimethods['datacite_get_datacitejson']['url'] + doi).json()
    citation = get(apimethods['datacite_get_xbibliography']['url'] + doi).text
    issueDate = parseJsonPath('$.dates[?dateType="Issued"].date').find(md)[0].value
    pubYear = parseJsonPath('$.publicationYear').find(md)[0].value
    title = parseJsonPath('$.titles[0].title').find(md)[0].value
    creators = parseJsonPath('$.creators').find(md)[0].value
    keywords = parseJsonPath('$.subjects').find(md)[0].value
    descriptions = parseJsonPath('$.descriptions').find(md)[0].value
    version = 1
    subTitle = ''
    ddixml = f"""<?xml version='1.0' encoding='UTF-8'?>
<codeBook
    xmlns="ddi:codebook:2_5"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5">
    <docDscr>
        <citation>
            <titlStmt>
                <titl>{title}</titl>
                <IDNo agency="DOI">doi:{doi}</IDNo>
            </titlStmt>
            <distStmt>
                <distrbtr>{md['publisher']}</distrbtr>
                <distDate>{issueDate}</distDate>
            </distStmt>
            <verStmt source="Datacite">
                <version date="{issueDate}" type="RELEASED">{version}</version>
            </verStmt>
            <biblCit>{citation}</biblCit>
        </citation>
    </docDscr>
    <stdyDscr>
        <citation>
            <titlStmt>
                <titl>{title}</titl>
                <subTitl>{subTitle}</subTitl>
                <IDNo agency="DOI">doi:{doi}</IDNo>
            </titlStmt>
            <rspStmt>
"""
    ### creators
    for creator in creators:
        affiliation = creator.get('affiliation', '')
        name = creator['name']
        ddixml += f'<AuthEnty affiliation="{affiliation}">{name}</AuthEnty>'
    ddixml += f"""
            </rspStmt>
            <prodStmt>
                <prodDate>{pubYear}</prodDate>
            </prodStmt>
            <distStmt>
                <distrbtr>{md['publisher']}</distrbtr>
                <distDate>{issueDate}</distDate>
            </distStmt>
        </citation>
        <stdyInfo>
            <subject>
"""
    ### subjects
    for keyword in keywords:
        word = keyword['subject']
        scheme = keyword.get('subjectScheme', '')
        ddixml += f'<keyword subjectScheme="{scheme}">{word}</keyword>'
    ddixml += """
            </subject>
"""
    ### abstracts / descriptions
    for desc in descriptions:
        descText = desc['description']
        ddixml += f'<abstract>{descText}</abstract>'
    # f-string here, so the publisher actually gets interpolated
    ddixml += f"""
            <distrbtr>{md['publisher']}</distrbtr>
        </stdyInfo>
    </stdyDscr>
</codeBook>"""
    return ddixml
#%%
######################################################
# Loop through the list of DOIs, get the DDI-XML and import it into Dataverse
dois = ['doi1', 'doi2', 'doi...', 'doin', ]
for doi in dois:
    ddiXml = ddiXmlFromDoi(doi).encode(encoding='UTF-8')
    params = {'key': apikey['wzbdataverse']}
    import_api = f'/api/dataverses/open/datasets/:importddi?pid=doi:{doi}&release=yes'
    response = post(baseurl['wzbdataverse'] + import_api, data=ddiXml, params=params, verify=False)
With a custom OAI-PMH server (which holds the metadata for a specified list of DOIs, also see #6425 ), the solution could be achieved with a harvesting client in Dataverse. Steps 1 & 2 would be run regularly (daily?).
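Such a custom server mainly has to answer the six OAI-PMH protocol verbs. A minimal sketch of the GetRecord envelope that a harvesting client would fetch; the identifier scheme and datestamp are placeholder assumptions, and the helper name is illustrative:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Sketch: wrap one harvested metadata payload in the OAI-PMH
# GetRecord envelope. The oai:... identifier scheme and the
# datestamp value are placeholder assumptions.
def get_record_envelope(identifier, datestamp):
    root = Element("OAI-PMH", xmlns=OAI_NS)
    rec = SubElement(SubElement(root, "GetRecord"), "record")
    header = SubElement(rec, "header")
    SubElement(header, "identifier").text = identifier
    SubElement(header, "datestamp").text = datestamp
    SubElement(rec, "metadata")  # the harvested DDI/DataCite XML goes here
    return tostring(root, encoding="unicode")

xml = get_record_envelope("oai:example.org:doi:10.4232/1.13089", "2020-01-21")
```

ListIdentifiers and ListRecords would iterate the DOI list the same way, which is why a "specified list of DOIs" maps so naturally onto this protocol.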
@RightInTwo I could create an empty repo for you if you want. You'd want to mention prominently in the README that it's community supported. Nice diagram! (And nice code earlier. 😄 )
@tcoupin Would you agree to administering this? Since my contract ends at the end of March, I cannot commit to that, but will gladly take part in developing it until then.
Yes 😀
@RightInTwo @tcoupin ok I just created https://github.com/IQSS/doi2pmh-server and made you admins of it. Again, please make sure you indicate that this is a community-supported project. Have fun you two. 😄
Thanks @pdurbin for setting up that repo. @tcoupin @RightInTwo Thanks for working on this. I think a solution that allows institutions to easily set up their collections in an OAI-PMH server and then have the metadata reflected in Dataverse for discoverability purposes is great. |
@pdurbin @djbrooke Thanks for making this happen! @poikilotherm @tcoupin See you on the other side! |
See https://github.com/IQSS/doi2pmh-server for the continuation!
I would like to harvest heterogeneous sources that don't necessarily present the datasets I need through OAI-PMH or in the form I need them. The issues I see with OAI-PMH:
These datasets would be described and updated using the metadata for the DOIs supplied by Datacite and Crossref through the import API (which is currently not its purpose!). One solution would also be to set up our own harvesting server, but that would limit the available metadata fields to those supplied by OAI-PMH and create quite a big overhead.