Epic: Harvard Dataverse Repository NIH Metrics #217

Open
Tracked by #118
cmbz opened this issue Apr 1, 2024 · 8 comments
Labels
GREI 4 Analytics and Reporting Project: NIH GREI Tasks related to the NIH GREI project

Comments

cmbz commented Apr 1, 2024

Overview

Tracking issue for monthly reports of NIH-funded datasets in Harvard Dataverse.

Resources

How these metrics are gathered for the monthly reports

At the beginning of each month @jggautier runs a Python script that:

  • uses Dataverse's Native and Search APIs to gather the persistent IDs, publication months, storage sizes of the latest published versions, and metadata matches of datasets published in the previous month where the metadata includes the names and acronyms of NIH centers and institutes (see Search details for more information)
  • scrapes each dataset's page to get its file download count
  • uses the DataCite API to get each published dataset's citation count
  • creates a CSV file with the following information for each published dataset: PID URL, publication month, citation count, file download count, storage size, and where NIH institute names and acronyms appear in the metadata
  • reports the PIDs of any datasets that were removed (datasets in the previous month's report that aren't in the current month's) and datasets that were added (datasets in the current month's report that aren't in the previous month's)
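
The last two steps (writing the CSV and reporting added and removed PIDs) could be sketched roughly as follows. This is only an illustration, not the actual script: the column names and the function names `write_report`, `read_pids`, and `diff_reports` are assumptions made for this example.

```python
import csv
import io

# Hypothetical column names, based on the fields listed above.
FIELDS = [
    "pid_url",
    "publication_month",
    "citation_count",
    "file_download_count",
    "storage_size_bytes",
    "nih_metadata_matches",
]


def write_report(rows, handle):
    """Write one CSV row per published dataset to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_pids(handle):
    """Return the set of PID URLs found in a report CSV."""
    return {row["pid_url"] for row in csv.DictReader(handle)}


def diff_reports(previous_pids, current_pids):
    """Return (added, removed) PIDs between two monthly reports."""
    added = sorted(current_pids - previous_pids)
    removed = sorted(previous_pids - current_pids)
    return added, removed
```

Comparing sets of PIDs keeps the comparison independent of row order and of changes to the other columns, such as download counts that grow each month.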

@jggautier then reviews any datasets that were included in previous months but removed, reviews the metadata of newly added datasets to make sure there's actually some indication of NIH funding, removes any datasets that aren't from NIH-funded research, and adjusts the script so that those datasets are ignored when the script is used again. The script is also adjusted to include datasets that @jggautier and colleagues know have been funded by the NIH and are missing such indications in their metadata.
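
One way the manual adjustments described above might be encoded is as two curated sets of PIDs that are applied to the search results on every run. This is a sketch under assumptions: the PIDs below are placeholders, and `apply_manual_review` is a hypothetical helper, not a function from the actual script.

```python
# Placeholder PIDs for illustration only.
# Datasets that matched the search but were judged not to be NIH funded.
EXCLUDED_PIDS = {"doi:10.7910/DVN/EXAMPLE1"}
# Datasets known to be NIH funded whose metadata lacks any such indication.
INCLUDED_PIDS = {"doi:10.7910/DVN/EXAMPLE2"}


def apply_manual_review(found_pids, excluded=EXCLUDED_PIDS, included=INCLUDED_PIDS):
    """Drop reviewed false positives, then add known NIH-funded datasets."""
    return (set(found_pids) - set(excluded)) | set(included)
```

Keeping both lists in the script means a reviewed dataset does not need to be checked again the following month.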

Search details
The Python script uses the Search API to look across four metadata fields (Funding Information Agency, Contributor Name, Description, and Notes) for the full name and acronym of the NIH, the full names of all NIH centers and institutes, and most of their acronyms.
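
As a rough sketch of what such a query could look like, the snippet below builds a Search API URL that looks for a few NIH terms in the four fields. The index field names (`grantNumberAgency`, `contributorName`, `dsDescriptionValue`, `notesText`) are assumed to correspond to the four metadata fields, the term list is a small illustrative subset, and the function names are hypothetical. No request is sent.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://dataverse.harvard.edu/api/search"

# Assumed index names for Funding Information Agency, Contributor Name,
# Description, and Notes.
SEARCH_FIELDS = ["grantNumberAgency", "contributorName", "dsDescriptionValue", "notesText"]

# Illustrative subset; the real list covers all NIH centers and institutes.
NIH_TERMS = ["National Institutes of Health", "NIH", "National Cancer Institute", "NCI"]


def build_query(fields, terms):
    """Build a query that matches any term as an exact phrase in any field."""
    quoted = " OR ".join(f'"{term}"' for term in terms)
    return " OR ".join(f"{field}:({quoted})" for field in fields)


def build_search_url(fields, terms, start=0, per_page=100):
    """Return a Search API URL for one page of dataset results."""
    params = {
        "q": build_query(fields, terms),
        "type": "dataset",
        "start": start,
        "per_page": per_page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"
```

Quoting each term searches for it as a phrase, which avoids matching the individual words of an institute's name scattered through a description.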

When looking through metadata in the Description field and Notes field, the script also looks for variations of the words "fund", "sponsor", "award", and "support" to increase the chances that it finds only datasets with metadata that acknowledges NIH funding.
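
The extra check on the Description and Notes fields might look something like the following. It is a minimal sketch: the pattern, the term list, and the function names are assumptions, and the actual script may match these words differently.

```python
import re

# Matches "fund", "sponsor", "award", and "support" plus any suffix
# (funded, funding, sponsored, awards, supported, ...). Note that this also
# matches unrelated words such as "fundamental".
FUNDING_WORDS = re.compile(r"\b(?:fund|sponsor|award|support)\w*", re.IGNORECASE)

# Illustrative subset of NIH names and acronyms.
NIH_TERMS = ["National Institutes of Health", "NIH", "National Cancer Institute", "NCI"]


def mentions_term(text, terms=NIH_TERMS):
    """Return True if any term appears in the text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)


def acknowledges_nih_funding(text, terms=NIH_TERMS):
    """Return True if the text has both an NIH term and a funding-related word."""
    return bool(FUNDING_WORDS.search(text)) and mentions_term(text, terms)
```

For example, "This work was supported by the National Institutes of Health (NIH)." passes both checks, while "Survey of NIH staff members." is rejected because it contains no funding-related word.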

cmbz commented Apr 1, 2024

Status: March 2024

  • Total NIH-funded items: 290
  • Total storage used by NIH-funded items: 3.11 TB
  • Total downloads of NIH-funded items: 738,696
  • Total citations of NIH-funded items: 22

cmbz commented Apr 10, 2024

Status: April 2024

  • Total NIH-funded items: 290
    • 3 datasets were added (one published in April 2024 and two published earlier)
    • 3 non-NIH-funded datasets were removed
  • Total storage used by NIH-funded items: 3.11 TB
  • Total downloads of NIH-funded items: 763,697
  • Total citations of NIH-funded items: 22

cmbz commented May 7, 2024

Status: May 2024

  • Harvard Dataverse datasets have been indexed in the NIH's Dataset Catalog since the catalog launched in Feb. 2024
  • NIH-funded datasets in Harvard Dataverse
    • Total NIH-funded items: 293
      • 3 datasets were added (one published in May 2024 and two published earlier)
    • Total storage used by NIH-funded items: 3.11 TB
    • Total downloads of NIH-funded items: 840,550
    • Total citations of NIH-funded items: 23

cmbz commented May 28, 2024

Status: June 2024

  • Total NIH-funded items: 295
    • 2 datasets were added (both published in June 2024)
  • Total storage used by NIH-funded items: 3.11 TB
  • Total downloads of NIH-funded items: 853,080
  • Total citations of NIH-funded items: 49

cmbz commented Jul 17, 2024

Status: July 2024

  • Total NIH-funded items: 300
    • 5 datasets were added (all published in July 2024)
  • Total storage used by NIH-funded items: 3.11 TB
  • Total downloads of NIH-funded items: 868,460
  • Total citations of NIH-funded items: 50

cmbz commented Aug 26, 2024

Status: August 2024

  • Total NIH-funded items: 308
    • 8 datasets were added (all published in August 2024)
  • Total storage used by NIH-funded items: 3.11 TB
  • Total downloads of NIH-funded items: 963,659
  • Total citations of NIH-funded items: 28

cmbz commented Sep 25, 2024

Status: September 2024

  • Total NIH-funded items: 311
    • 3 datasets were added (all published in September 2024)
  • Total storage used by NIH-funded items: 3.12 TB
  • Total downloads of NIH-funded items: 1,003,532
  • Total citations of NIH-funded items: 29

cmbz commented Oct 8, 2024

Status: October 2024

  • Pending
