Harvesting DOI metadata from non-OAI-PMH sources #5402
What we are currently doing to prepare our repository:
Now that is the stuff I want to get into our institutional dataverse. This is only about metadata! The data would reside at its original source.
#5104 seems to be closely related.
@donsizemore You mentioned Python code in the chat. What does it do exactly?
@RightInTwo to help us keep this on our radar I think you should consider creating a project for your installation at https://github.com/orgs/IQSS/projects . If you're interested, please let me know and I can add you to a "read only" group. Beware that this also means we can assign issues to you. 😄 For more context on boards for installations, please see https://scholar.harvard.edu/pdurbin/blog/2019/jupyter-notebooks-and-crazy-ideas-for-dataverse Also, breaking down issues is almost always good. It makes them easier to estimate. 👍
Very nice. Sign me up! Don't think that your threat will stop me 😆
I feel honored to be mentioned :D
I'd be glad to. It would be great if some other people with general interest in harvesting features joined on this issue to make it easier to smash it into digestible pieces and prioritize them. Maybe there are also already good solutions to (some of) the problems in existence... @pdurbin, could you help me out with some more of your community magic?
How about re-framing this as "Harvest metadata from a list of DOIs"?
Maybe. Maybe we should try to tell a user story. How's this? "As a user, I'd like to collect datasets in Dataverse based on metadata available in DataCite. These datasets would behave somewhat like harvested datasets in that they are read only and would clearly indicate that they did not originate in Dataverse." I worry that I'm not understanding the "why" though. Are you saying that the researchers need a tool to collect related datasets together and that Dataverse could be that tool? What do they do now? Do they just have a bunch of bookmarks in their browser?
We don't publish any data ourselves. Therefore, it is necessary to collect references (DOIs) from the diverse places where data has been published. For example, our unit DD is responsible for Components 1 and 6 of the German Longitudinal Election Study. The data resides at GESIS, but we would like to reference it in our institutional repository (based on Dataverse). So we would like to add the following DOIs from that page to our catalogue and map (link) them to the dataverse of the unit DD and to those of individual researchers (if they want to). Also, we want to use that information to feed the CRIS. https://doi.org/10.4232/1.13089
Metadata should be retrieved from the best source available:
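One practical way to pull metadata from "the best source available" is DOI content negotiation against the doi.org resolver, which forwards the request to whichever registration agency (DataCite, Crossref, ...) minted the DOI. A minimal sketch, assuming the standard `application/vnd.datacite.datacite+json` media type; the helper name is purely illustrative:

```python
# Sketch: build a content-negotiation request for DOI metadata.
# The doi.org proxy and the DataCite JSON media type are standard;
# the function name is illustrative, not from any library.
def datacite_request(doi):
    url = f"https://doi.org/{doi}"
    headers = {"Accept": "application/vnd.datacite.datacite+json"}
    return url, headers

url, headers = datacite_request("10.4232/1.13089")
print(url)  # → https://doi.org/10.4232/1.13089
```

The same pattern works for Crossref-registered DOIs, since doi.org forwards the Accept header to whichever agency registered the DOI.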
@RightInTwo thanks, this is helping. For now could you use the "Related Datasets" field to collect those DOIs? I just tried this at https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/24U2VG and here's a screenshot: The "Related Datasets" field is multivalued, which is nice, and it supports HTML, so I was able to link to the DOIs, but there isn't much structure to it. It all just goes in a single text area. What do you think? What does @jggautier think? 😄
For now I would just use the "Other ID" field, but it would be best to have the DOI in the actual "Dataset Persistent ID" field. We are currently collecting them in a database outside of Dataverse, but at some point it would be great to get them in there together with the metadata. Before we manage that, we don't really want to make our Dataverse public (not even within the institute). (Improving on #5998 would be appreciated anyway...)
I have no idea if this factoid is helpful or not but Dataverse can harvest its own native JSON format over OAI-PMH. This means that every single metadata field is available, even custom metadata blocks. (That's my understanding anyway.) The downside, of course, is that you'd have to implement our crazy native JSON format in the harvesting server you create. 😄
Wouldn't it be easier to implement a separate service for this? I'm also thinking in the direction of maybe slicing up Dataverse a bit and moving the complete harvesting into a separate module. It could run on its own, offer easier scaling and use the Dataverse API to load new stuff into the database. (No microservice, but a modulith) It could either use Quarkus/Spring (stay'n in Java) or Python (excellent pyDataverse) 😉
@poikilotherm Hi Oliver, that is kind of what I'm building now, except that I don't use any of the libraries but rather try to build it without constraints or validation in JS with jquery. Not because it is so great, but because the colleagues who will take over for me (my contract ends in March) don't have any programming background except for some jquery runtime manipulation in the browser... Now, that all could be totally different if...
Good point! Maybe a small OAI-PMH server could be part of the solution then. @donsizemore We once had a chat about this topic - are you still interested? |
@RightInTwo since you're using Javascript you should definitely check out the new kid on the block when it comes to Dataverse API client libraries: dataverse-client-javascript! 🎉 Developed primarily by @tainguyenbui it may be new but it's moving fast! And it's on npm. |
Yes, I just discovered that yesterday! Let's see what the other people on here think about the choices regarding language and architecture. Maybe @skasberger could also contribute with his opinion? |
Here is some code as an example of how a quick and dirty import of Datacite metadata via DDI-XML works. After some experiments with mapping from one Python dict to another, aiming to create a dict in the Dataverse JSON format, I ended up with a solution that really earns the "quick and dirty" tag: just insert everything into a string in the DDI-XML format accepted by /datasets/:importddi. Here is the Python code (just as an example; should I put it in its own repo?)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
from jsonpath_ng.ext import parse as parseJsonPath
from requests import get, post
# pyDataverse doesn't provide an API to import DDI-XML yet, so we just use requests
#from pyDataverse.api import Api
#from pyDataverse.models import Dataverse
#%%
######################################################
# Setup
## Define API URLs
apimethods = {
    'datacite_get_datacitejson': {
        'usage': "get the datacite+json representation of the DOI metadata. you need to append a DOI (just the ID!) to the url",
        'url': 'https://data.datacite.org/application/vnd.datacite.datacite+json/'
    },
    'datacite_get_xbibliography': {
        'usage': "get the x-bibliography representation of the DOI metadata. you need to append a DOI (just the ID!) to the url",
        'url': 'https://data.datacite.org/text/x-bibliography/'
    }
}
## Provide API key
apikey = {
    'wzbdataverse': '{insert API key here}'
}
## Provide base URL
baseurl = {
    'wzbdataverse': 'https://dataverse.wzb.eu'
}
#%%
######################################################
# QUICK AND DIRTY function to map from Datacite+json (as a python dict) to DDI-XML (as a string)
def ddiXmlFromDoi(doi):
    # .json() / .text instead of raw .content, so we don't interpolate bytes into the XML
    md = get(apimethods['datacite_get_datacitejson']['url'] + doi).json()
    citation = get(apimethods['datacite_get_xbibliography']['url'] + doi).text
    issueDate = parseJsonPath('$.dates[?dateType="Issued"].date').find(md)[0].value
    pubYear = parseJsonPath('$.publicationYear').find(md)[0].value
    title = parseJsonPath('$.titles[0].title').find(md)[0].value
    creators = parseJsonPath('$.creators').find(md)[0].value
    keywords = parseJsonPath('$.subjects').find(md)[0].value
    descriptions = parseJsonPath('$.descriptions').find(md)[0].value
    version = 1
    subTitle = ''
    ddixml = f"""<?xml version='1.0' encoding='UTF-8'?>
<codeBook
    xmlns="ddi:codebook:2_5"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5">
    <docDscr>
        <citation>
            <titlStmt>
                <titl>{title}</titl>
                <IDNo agency="DOI">doi:{doi}</IDNo>
            </titlStmt>
            <distStmt>
                <distrbtr>{md['publisher']}</distrbtr>
                <distDate>{issueDate}</distDate>
            </distStmt>
            <verStmt source="Datacite">
                <version date="{issueDate}" type="RELEASED">{version}</version>
            </verStmt>
            <biblCit>{citation}</biblCit>
        </citation>
    </docDscr>
    <stdyDscr>
        <citation>
            <titlStmt>
                <titl>{title}</titl>
                <subTitl>{subTitle}</subTitl>
                <IDNo agency="DOI">doi:{doi}</IDNo>
            </titlStmt>
            <rspStmt>
"""
    ### creators
    for creator in creators:
        affiliation = creator.get('affiliation', '')
        name = creator['name']
        ddixml += f'<AuthEnty affiliation="{affiliation}">{name}</AuthEnty>'
    ddixml += f"""
            </rspStmt>
            <prodStmt>
                <prodDate>{pubYear}</prodDate>
            </prodStmt>
            <distStmt>
                <distrbtr>{md['publisher']}</distrbtr>
                <distDate>{issueDate}</distDate>
            </distStmt>
        </citation>
        <stdyInfo>
            <subject>
"""
    ### subjects
    for keyword in keywords:
        word = keyword['subject']
        scheme = keyword.get('subjectScheme', '')
        ddixml += f'<keyword subjectScheme="{scheme}">{word}</keyword>'
    ddixml += """
            </subject>
"""
    ### abstracts / descriptions
    for desc in descriptions:
        descText = desc['description']
        ddixml += f'<abstract>{descText}</abstract>'
    # f-string here, so the publisher actually gets interpolated
    ddixml += f"""
            <distrbtr>{md['publisher']}</distrbtr>
        </stdyInfo>
    </stdyDscr>
</codeBook>"""
    return ddixml
#%%
######################################################
# Loop through the list of DOIs, get the DDI-XML and import it into Dataverse
dois = ['doi1', 'doi2', 'doi...', 'doin', ]
for doi in dois:
    ddiXml = ddiXmlFromDoi(doi).encode(encoding='UTF-8')
    params = {'key': apikey['wzbdataverse']}
    import_api = f'/api/dataverses/open/datasets/:importddi?pid=doi:{doi}&release=yes'
    response = post(baseurl['wzbdataverse'] + import_api, data=ddiXml, params=params, verify=False)
With a custom OAI-PMH server (which holds the metadata for a specified list of DOIs, also see #6425 ), the solution could be achieved with a harvesting client in Dataverse. Steps 1 & 2 would be run regularly (daily?).
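Such a custom server mainly has to answer the six OAI-PMH protocol verbs. A minimal sketch of the GetRecord envelope that a harvesting client would fetch; the identifier scheme and datestamp are placeholder assumptions, and the helper name is illustrative:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Sketch: wrap one harvested metadata payload in the OAI-PMH
# GetRecord envelope. The oai:... identifier scheme and the
# datestamp value are placeholder assumptions.
def get_record_envelope(identifier, datestamp):
    root = Element("OAI-PMH", xmlns=OAI_NS)
    rec = SubElement(SubElement(root, "GetRecord"), "record")
    header = SubElement(rec, "header")
    SubElement(header, "identifier").text = identifier
    SubElement(header, "datestamp").text = datestamp
    SubElement(rec, "metadata")  # the harvested DDI/DataCite XML goes here
    return tostring(root, encoding="unicode")

xml = get_record_envelope("oai:example.org:doi:10.4232/1.13089", "2020-01-21")
```

ListIdentifiers and ListRecords would iterate the DOI list the same way, which is why a "specified list of DOIs" maps so naturally onto this protocol.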
@RightInTwo I could create an empty repo for you if you want. You'd want to mention prominently in the README that it's community supported. Nice diagram! (And nice code earlier. 😄 )
@tcoupin Would you agree to administering this? Since my contract ends at the end of March, I cannot commit to that, but will gladly take part in developing it until then.
Yes 😀
@RightInTwo @tcoupin ok I just created https://github.com/IQSS/doi2pmh-server and made you admins of it. Again, please make sure you indicate that this is a community-supported project. Have fun you two. 😄
Thanks @pdurbin for setting up that repo. @tcoupin @RightInTwo Thanks for working on this. I think a solution that allows institutions to easily set up their collections in an OAI-PMH server and then have the metadata reflected in Dataverse for discoverability purposes is great. |
@pdurbin @djbrooke Thanks for making this happen! @poikilotherm @tcoupin See you on the other side! |
See https://github.com/IQSS/doi2pmh-server for the continuation!
I would like to harvest heterogeneous sources that don't necessarily present the datasets I need through OAI-PMH or in the form I need them. The issues I see with OAI-PMH:
These datasets would be described and updated using the metadata for the DOIs supplied by Datacite and Crossref through the import API (which is currently not its purpose!). One solution would also be to set up our own harvesting server, but that would limit the available metadata fields to those supplied by OAI-PMH and create quite a big overhead.