Releases: IQSS/dataverse

v5.11

13 Jun 20:49
21ac7e1

Dataverse Software 5.11

Please note: We have removed the 5.11 war file and dvinstall.zip because there are very serious bugs in the 5.11 release. For the upgrade instructions below, please use the 5.11.1 war file instead. New installations should start with 5.11.1. The bugs are explained in the 5.11.1 release notes.

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Terms of Access or Request Access Required for Restricted Files

Beginning in this release, datasets with restricted files must have either Terms of Access or Request Access enabled. This change ensures that for every file in a Dataverse installation there is a clear path to the data: either users can request access, or the dataset provides context about why requesting access is not enabled.

Published datasets are not affected by this change. Datasets that are in draft and that have neither Terms of Access nor Request Access enabled must be updated to select one or the other (or both). Otherwise, datasets cannot be further edited or published. Dataset authors will be able to tell if their dataset is affected by the presence of the following message at the top of their dataset (when they are logged in):

"Datasets with restricted files are required to have Request Access enabled or Terms of Access to help people access the data. Please edit the dataset to confirm Request Access or provide Terms of Access to be in compliance with the policy."

At this point, authors should click "Edit Dataset" then "Terms" and then check the box for "Request Access" or fill in "Terms of Access for Restricted Files" (or both). Afterwards, authors will be able to further edit metadata and publish.

In the "Notes for Dataverse Installation Administrators" section, we have provided a query to help proactively identify datasets that need to be updated.

See also Issue #8191 and PR #8308.

Muting Notifications

Users can control which notifications they receive if the system is configured to allow this. See also Issue #7492 and PR #8530.

Major Use Cases and Infrastructure Enhancements

Changes and fixes in this release include:

  • Terms of Access or Request Access required for restricted files. (Issue #8191, PR #8308)
  • Users can control which notifications they receive if the system is configured to allow this. (Issue #7492, PR #8530)
  • A 500 error was occurring when creating a dataset if a template did not have an associated "termsofuseandaccess". See "Legacy Templates Issue" below for details. (Issue #8599, PR #8789)
  • Tabular ingest can be skipped via API. (Issue #8525, PR #8532)
  • The "Verify Email" button has been changed to "Send Verification Email" and rather than sometimes showing a popup now always sends a fresh verification email (and invalidates previous verification emails). (Issue #8227, PR #8579)
  • For Shibboleth users, the emailconfirmed timestamp is now set on login and the UI should show "Verified". (Issue #5663, PR #8579)
  • Information about the license selection (or custom terms) is now available in the confirmation popup when contributors click "Submit for Review". Previously, this was only available in the confirmation popup for the "Publish" button, which contributors do not see. (Issue #8561, PR #8691)
  • For installations configured to support multiple languages, controlled vocabulary fields that do not allow multiple entries (e.g. journalArticleType) are now indexed properly. (Issue #8595, PR #8601, PR #8624)
  • Two-letter ISO-639-1 codes for languages are now supported, in metadata imports and harvesting. (Issue #8139, PR #8689)
  • The API endpoint for listing notifications has been enhanced to show the subject, text, and timestamp of notifications. (Issue #8487, PR #8530)
  • The API Guide has been updated to explain that the Content-type header is now (as of Dataverse 5.6) necessary to create datasets via native API. (Issue #8663, PR #8676)
  • Admin API endpoints have been added to find and delete dataset templates. (Issue #8600, PR #8706)
  • The BagIt file handler detects and transforms zip files with a BagIt package format into Dataverse data files, validating checksums along the way. See the BagIt File Handler section of the Installation Guide for details. (Issue #8608, PR #8677)
  • For BagIt Export, the number of threads used when zipping data files into an archival bag is now configurable using the :BagGeneratorThreads database setting. (Issue #8602, PR #8606)
  • PostgreSQL 14 can now be used (though we've tested mostly with 13). PostgreSQL 10+ is required. (Issue #8295, PR #8296)
  • As always, widgets can be embedded in the <iframe> HTML tag, but the HTTP header "Content-Security-Policy" is now being sent on non-widget pages to prevent them from being embedded. (PR #8662)
  • URIs in the experimental Semantic API have changed (details below). (Issue #8533, PR #8592)
  • Installations running Make Data Count can upgrade to Counter Processor-0.1.04. (Issue #8380, PR #8391)
  • PrimeFaces, the UI framework we use, has been upgraded from 10 to 11. (Issue #8456, PR #8652)

Notes for Dataverse Installation Administrators

Identifying Datasets Requiring Terms of Access or Request Access Changes

In support of the change to require either Terms of Access or Request Access for all restricted files (see above for details), we have provided a query to identify datasets in your installation where at least one restricted file has neither Terms of Access nor Request Access enabled:

https://github.com/IQSS/dataverse/blob/v5.11/scripts/issues/8191/datasets_without_toa_or_request_access

This will allow you to reach out to those dataset owners as appropriate.

Legacy Templates Issue

When custom license functionality was added, dataverses that had an older legacy template as their default template stopped allowing the creation of new datasets (500 error).

This occurred because those legacy templates did not have an associated termsofuseandaccess linked to them.

In this release, we run a script that creates a default empty termsofuseandaccess for each of these templates and links them.

Note that the termsofuseandaccess records created this way default to the license with id=1 (CC0) and set fileaccessrequest to false.

See also Issue #8599 and PR #8789.

PostgreSQL Version 10+ Required

This release upgrades the bundled PostgreSQL JDBC driver to support major version 14.

Note that the newer PostgreSQL driver required a Flyway version bump, which entails positive and negative consequences:

  • The newer version of Flyway supports PostgreSQL 14 and includes a number of security fixes.
  • As of version 8.0 the Flyway Community Edition dropped support for PostgreSQL 9.6 and older.

This means that as foreshadowed in the 5.10 and 5.10.1 release notes, version 10 or higher of PostgreSQL is now required. For suggested upgrade steps, please see "PostgreSQL Update" in the release notes for 5.10: https://github.com/IQSS/dataverse/releases/tag/v5.10

Counter Processor 0.1.04 Support

This release includes support for counter-processor-0.1.04 for processing Make Data Count metrics. If you are running Make Data Count support, you should reinstall/reconfigure counter-processor as described in the latest Guides. (For existing installations, note that counter-processor-0.1.04 requires a newer version of Python, so you will need to follow the full counter-processor install. Also note that if you configure the new version the same way, it will reprocess the days in the current month when it is first run. This is normal and will not affect the metrics in Dataverse.)

New JVM Options and DB Settings

The following DB settings have been added:

  • :ShowMuteOptions
  • :AlwaysMuted
  • :NeverMuted
  • :CreateDataFilesMaxErrorsToDisplay
  • :BagItHandlerEnabled
  • :BagValidatorJobPoolSize
  • :BagValidatorMaxErrors
  • :BagValidatorJobWaitInterval
  • :BagGeneratorThreads

See the Database Settings section of the Guides for more information.
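Each of these is managed through the standard admin settings API (PUT /api/admin/settings/&lt;name&gt;). As a minimal sketch, the following builds (but does not send) the request for the :BagGeneratorThreads setting mentioned above; the localhost base URL is an assumption, so adjust it for your installation:

```python
# Sketch: setting :BagGeneratorThreads via the admin settings API.
# The endpoint pattern (PUT /api/admin/settings/<name>) is the standard
# Dataverse mechanism for DB settings; the host below is an assumption.
from urllib.request import Request

def settings_request(name: str, value: str,
                     base: str = "http://localhost:8080") -> Request:
    """Build a PUT request for a database setting (not sent here)."""
    return Request(f"{base}/api/admin/settings/{name}",
                   data=value.encode(), method="PUT")

req = settings_request(":BagGeneratorThreads", "4")
print(req.get_method(), req.full_url)
```

To apply a setting, pass the built request to urllib.request.urlopen() against your running installation.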

Notes for Developers and Integrators

See the "Backward Incompatibilities" section below.

Backward Incompatibilities

Semantic API Changes

This release includes an update to the experimental semantic API and the underlying assignment of URIs to metadata block terms that are not explicitly mapped to terms in community vocabularies. The change affects the output of the OAI_ORE metadata export, the OAI_ORE file in archival bags, and the input/output allowed for those terms in the semantic API.

For those updating integration code or existing files intended as input for this release of Dataverse, URIs of the form...

https://dataverse.org/schema/<block name>/<parentField name>#<childField title>

and

https://dataverse.org/schema/<block name>/<Field title>

...are both replaced with URIs of the form...

https://dataverse.org/schema/<block name>/<Field name>
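For code that stored the old URIs, the rewrite can be sketched as below. Note that mapping a field title back to its field name requires your own metadata block definitions; the mapping entry used here is an illustrative assumption, not an actual Dataverse field:

```python
# Hypothetical helper: rewrite pre-5.11 semantic API URIs to the new
# <block name>/<Field name> form. The title-to-name mapping must be
# supplied from your metadata block definitions.
BASE = "https://dataverse.org/schema"

def migrate_uri(uri: str, title_to_name: dict) -> str:
    if not uri.startswith(BASE):
        return uri                       # not a Dataverse schema URI
    prefix, _, last = uri.rpartition("/")
    if "#" in last:                      # old child form: <parent>#<child title>
        _, _, title = last.partition("#")
        return f"{prefix}/{title_to_name.get(title, title)}"
    return f"{prefix}/{title_to_name.get(last, last)}"   # old title form

mapping = {"Affiliation": "authorAffiliation"}   # assumed example entry
print(migrate_uri(f"{BASE}/citation/author#Affiliation", mapping))
```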

Create Dataset API Requires Content-type Header (Since 5.6)

Due to a code change introduced in Dataverse 5.6, calls to the native API without the Content-type header will fail to create a dataset. The API Guide has been updated to indicate the necessity of this header: https://guides.dataverse.org/en/5.11/api/native-api.html#create-a-dataset-in-a-dataverse-collection
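A minimal sketch of a compliant request follows. The request is built but not sent; the host, API token, and payload are placeholders, and the endpoint follows the "Create a Dataset in a Dataverse Collection" pattern from the API Guide:

```python
# Sketch: creating a dataset via the native API requires an explicit
# Content-type header as of Dataverse 5.6. Host, API token, and the
# JSON payload below are placeholders, not a complete dataset record.
import json
from urllib.request import Request

payload = {"datasetVersion": {"metadataBlocks": {}}}   # stand-in JSON body
req = Request(
    "http://localhost:8080/api/dataverses/root/datasets",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",   # required since Dataverse 5.6
        "X-Dataverse-key": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    },
    method="POST",
)
print(req.get_header("Content-type"))
```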

Complete List of Changes

...


v5.10.1

06 Apr 19:49
b844672

Dataverse Software 5.10.1

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Bug Fix for Request Access

Dataverse Software 5.10 contains a bug where the "Request Access" button doesn't work from the file listing on the dataset page if the dataset contains custom terms. This has been fixed in PR #8555.

Bug Fix for Searching and Selecting Controlled Vocabulary Values

Dataverse Software 5.10 contains a bug where the search option is no longer present when selecting from more than ten controlled vocabulary values. This has been fixed in PR #8521.

Major Use Cases and Infrastructure Enhancements

Changes and fixes in this release include:

  • Users can use the "Request Access" button when the dataset has custom terms. (Issue #8553, PR #8555)
  • Users can search when selecting from more than ten controlled vocabulary values. (Issue #8519, PR #8521)
  • The default file categories ("Documentation", "Data", and "Code") can be redefined through the :FileCategories database setting. (Issue #8461, PR #8478)
  • Documentation on troubleshooting Excel ingest errors was improved. (PR #8541)
  • Internationalized controlled vocabulary values can now be searched. (Issue #8286, PR #8435)
  • Curation labels can be internationalized. (Issue #8381, PR #8466)
  • "NONE" is no longer accepted as a license using the SWORD API (since 5.10). See "Backward Incompatibilities" below for details. (Issue #8551, PR #8558).

Notes for Dataverse Installation Administrators

PostgreSQL Version 10+ Required Soon

Because 5.10.1 is a bug fix release, an upgrade to PostgreSQL is not required. However, this upgrade is still coming in the next non-bug fix release. For details, please see the release notes for 5.10: https://github.com/IQSS/dataverse/releases/tag/v5.10

Payara Upgrade

You may notice that the Payara version used in the install scripts has been updated from 5.2021.5 to 5.2021.6. This was to address a bug where it was not possible to easily update the logging level. For existing installations, this release does not require upgrading Payara and a Payara upgrade is not part of the Upgrade Instructions below. For more information, see PR #8508.

New JVM Options and DB Settings

The following DB settings have been added:

  • :FileCategories - The default list of the pre-defined file categories ("Documentation", "Data" and "Code") can now be redefined with a comma-separated list (e.g. 'Docs,Data,Code,Workflow').

See the Database Settings section of the Guides for more information.

Notes for Developers and Integrators

In the "Backward Incompatibilities" section below, note changes in the API regarding licenses and the SWORD API.

Backward Incompatibilities

As of Dataverse 5.10, "NONE" is no longer supported as a valid license when creating a dataset using the SWORD API. The API Guide has been updated to reflect this. Additionally, if you specify an invalid license, a list of available licenses will be returned in the response.

Complete List of Changes

For the complete list of code changes in this release, see the 5.10.1 Milestone in GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide. Please also contact us to get added to the Dataverse Project Map if you have not done so already.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.10.1.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse-<version>

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start

4. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.10.1.war

5. Restart Payara

  • service payara stop
  • service payara start

v5.10

18 Mar 19:46
a57ce53

Dataverse Software 5.10

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Multiple License Support

Users can now select from a set of configured licenses in addition to or instead of the previous Creative Commons CC0 choice or provide custom terms of use (if configured) for their datasets. Administrators can configure their Dataverse instance via API to allow any desired license as a choice and can enable or disable the option to allow custom terms. Administrators can also mark licenses as "inactive" to disallow future use while keeping that license for existing datasets. For upgrades, only the CC0 license will be preinstalled. New installations will have both CC0 and CC BY preinstalled. The Configuring Licenses section of the Installation Guide shows how to add or remove licenses.

Note: Datasets in existing installations will automatically be updated to conform to new requirements that custom terms cannot be used with a standard license and that custom terms cannot be empty. Administrators may wish to manually update datasets with these conditions if they do not like the automated migration choices. See the "Notes for Dataverse Installation Administrators" section below for details.

This release also makes the license selection and/or custom terms more prominent when publishing and viewing a dataset and when downloading files.

Ingest and File Upload Messaging Improvements

Messaging around ingest failure has been softened to reduce unnecessary support tickets. In addition, messaging during file upload has been improved, especially with regard to showing size limits and providing links to the guides about tabular ingest. For screenshots and additional details see PR #8271.

Downloading of Guestbook Responses with Fewer Clicks

A download button has been added to the page that lists guestbooks. This saves a click but you can still download responses from the "View Responses" page, as before.

Also, links to the guides about guestbooks have been added in additional places.

Dynamically Request Arbitrary Metadata Fields from Search API

The Search API now allows arbitrary metadata fields to be requested when displaying results from datasets. You can request all fields from metadata blocks or pick and choose certain fields.

The new parameter is called metadata_fields and the Search API documentation contains details and examples: https://guides.dataverse.org/en/5.10/api/search.html
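As one illustration, such a query can be assembled as below. The host is a placeholder and the block/field names are assumptions; see the Search API documentation linked above for the real syntax and examples:

```python
# Sketch: building a Search API query with the new metadata_fields parameter.
# Host, block, and field names are illustrative assumptions.
from urllib.parse import urlencode

base = "https://demo.dataverse.org/api/search"
params = [
    ("q", "climate"),
    ("type", "dataset"),
    ("metadata_fields", "citation:*"),          # every field in a block
    ("metadata_fields", "geospatial:country"),  # a single named field
]
print(f"{base}?{urlencode(params)}")
```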

Solr 8 Upgrade

The Dataverse Software now runs on Solr 8.11.1, the latest available stable release in the Solr 8.x series.

PostgreSQL Upgrade

A PostgreSQL upgrade is not required for this release but is planned for the next release. See below for details.

Major Use Cases and Infrastructure Enhancements

Changes and fixes in this release include:

  • When creating or updating datasets, users can select from a set of licenses configured by the administrator (CC, CC BY, custom licenses, etc.) or provide custom terms (if the installation is configured to allow them). (Issue #7440, PR #7920)
  • Users can get better feedback on tabular ingest errors and more information about size limits when uploading files. (Issue #8205, PR #8271)
  • Users can more easily download guestbook responses and learn how guestbooks work. (Issue #8244, PR #8402)
  • Search API users can specify additional metadata fields to be returned in the search results. (Issue #7863, PR #7942)
  • The "Preview" tab on the file page can now show restricted files. (Issue #8258, PR #8265)
  • Users wanting to upload files from GitHub to Dataverse can learn about a new GitHub Action called "Dataverse Uploader". (PR #8416)
  • Users requesting access to files now get feedback that it was successful. (Issue #7469, PR #8341)
  • Users may notice various accessibility improvements. (Issue #8321, PR #8322)
  • Users of the Social Science metadata block can now add multiples of the "Collection Mode" field. (Issue #8452, PR #8473)
  • Guestbooks now support multi-line text area fields. (Issue #8288, PR #8291)
  • Guestbooks can better handle commas in responses. (Issue #8193, PR #8343)
  • Dataset editors can now deselect a guestbook. (Issue #2257, PR #8403)
  • Administrators with a large actionlogrecord table can read docs on archiving and then trimming it. (Issue #5916, PR #8292)
  • Administrators can list locks across all datasets. (PR #8445)
  • Administrators can run a version of Solr that doesn't include a version of log4j2 with serious known vulnerabilities. We trust that you have patched the version of Solr you are running now following the instructions that were sent out. An upgrade to the latest version is recommended for extra peace of mind. (PR #8415)
  • Administrators can run a version of Dataverse that doesn't include a version of log4j with known vulnerabilities. (PR #8377)

Notes for Dataverse Installation Administrators

Updating for Multiple License Support

Adding and Removing Licenses and How Existing Datasets Will Be Automatically Updated

As part of installing or upgrading an existing installation, administrators may wish to add additional license choices and/or configure Dataverse to allow custom terms. Adding additional licenses is managed via API, as explained in the Configuring Licenses section of the Installation Guide. Licenses are described via a JSON structure providing a name, URL, short description, and optional icon URL. Additionally, licenses may be marked as active (selectable for new or updated datasets) or inactive (only allowed on existing datasets), and one license can be marked as the default. Custom Terms are allowed by default (backward compatible with the current option to select "No" to using CC0) and can be disabled by setting :AllowCustomTermsOfUse to false.

Further, administrators should review the following automated migration of existing licenses and terms into the new license framework and, if desired, should manually find and update any datasets for which the automated update is problematic.
To understand the migration process, it is useful to understand how the multiple license feature works in this release:

"Custom Terms", aka a custom license, are defined through entries in the following fields of the dataset "Terms" tab:

  • Terms of Use
  • Confidentiality Declaration
  • Special Permissions
  • Restrictions
  • Citation Requirements
  • Depositor Requirements
  • Conditions
  • Disclaimer

"Custom Terms" require, at a minimum, a non-blank entry in the "Terms of Use" field. Entries in other fields are optional.

Since these fields are intended for terms/conditions that would potentially conflict with or modify the terms in a standard license, they are no longer shown when a standard license is selected.

In earlier Dataverse releases, it was possible to select the CC0 license and have entries in the fields above. It was also possible to say "No" to using CC0 and leave all of these terms fields blank.

The automated process will update existing datasets as follows.

  • "CC0 Waiver" and no entries in the fields above -> CC0 License (no change)
  • No CC0 Waiver and an entry in the "Terms of Use" field and possibly other fields listed above -> "Custom Terms" with the same entries in these fields (no change)
  • CC0 Waiver and an entry in some of the fields listed -> "Custom Terms" with the following text prepended in the "Terms of Use" field: "This dataset is made available under a Creative Commons CC0 license with the following additional/modified terms and conditions:"
  • No CC0 Waiver and entries in one or more fields other than the "Terms of Use" field -> "Custom Terms" with the following "Terms of Use" added: "This dataset is made available with limited information on how it can be used. You may wish to communicate with the Contact(s) specified before use."
  • No CC0 Waiver and no entry in any of the listed fields -> "Custom Terms" with the following "Terms of Use" added: "This dataset is made available without information on how it can be used. You should communicate with the Contact(s) specified before use."
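The five rules above can be sketched as a decision function. This is a reading aid only, under the stated rules, and not the actual Flyway migration shipped in the release:

```python
# Sketch of the automated license migration rules listed above.
# Inputs: whether CC0 was selected, the Terms of Use text (may be empty),
# and whether any of the other terms fields listed above were filled in.
def migrate(cc0_waiver: bool, terms_of_use: str, other_terms: bool):
    if cc0_waiver and not terms_of_use and not other_terms:
        return ("CC0", terms_of_use)                    # rule 1: no change
    if not cc0_waiver and terms_of_use:
        return ("Custom Terms", terms_of_use)           # rule 2: no change
    if cc0_waiver:                                      # rule 3: prepend notice
        prefix = ("This dataset is made available under a Creative Commons "
                  "CC0 license with the following additional/modified terms "
                  "and conditions: ")
        return ("Custom Terms", prefix + terms_of_use)
    if other_terms:                                     # rule 4
        return ("Custom Terms",
                "This dataset is made available with limited information on "
                "how it can be used. You may wish to communicate with the "
                "Contact(s) specified before use.")
    return ("Custom Terms",                             # rule 5
            "This dataset is made available without information on how it "
            "can be used. You should communicate with the Contact(s) "
            "specified before use.")

print(migrate(True, "", False))
```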

Administrators who have datasets where CC0 has been selected along with additional terms, or datasets where the Terms of Use field is empty, may wish to modify those datasets prior to upgrading to avoid the automated changes above. This is discussed next.

Handling Datasets that No Longer Comply With Licensing Rules

In most Dataverse installations, one would expect the vast majority of datasets to either use the CC0 Waiver or have non-empty Terms of Use. As noted above, these will be migrated without any issue. Administrators may however wish to find and manually update datasets that specified a CC0 license but also had terms (no longer allowed) or had no license and no terms of use (also no longer allowed) rather than accept the default migrations for these datasets listed above.

Finding and Modifying Datasets with a CC0 License and Non-Empty Terms

To find datasets with a CC0 license and non-empty terms:

```sql
select CONCAT('doi:', dvo.authority, '/', dvo.identifier),
       v.alias as dataverse_alias,
       case when versionstate='RELEASED'
            then concat(dv.versionnumber, '.', dv.minorversionnumber)
            else versionstate end as version,
       dv.id as datasetversion_id,
       t.id as termsofuseandaccess_id,
       t.termsofuse, t.confidentialitydeclaration, t.specialpermissions,
       t.restrictions, t.citationrequirements, t.depositorrequirements,
       t.conditions, t.disclaimer
  from dvobject dvo, termsofuseandaccess t, datasetversion dv, dataverse v
 where dv.dataset_id=dvo.id and...
```

v5.9

09 Dec 19:29
fb24c87

Dataverse Software 5.9

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Dataverse Collection Page Optimizations

The Dataverse Collection page, which also serves as the search page and the homepage in most Dataverse installations, has been optimized, with a specific focus on reducing the number of queries for each page load. These optimizations will be more noticeable on Dataverse installations with higher traffic.

Support for HTTP "Range" Header for Partial File Downloads

Dataverse now supports the HTTP "Range" header, which allows users to download parts of a file. Here are some examples:

  • bytes=0-9 gets the first 10 bytes.
  • bytes=10-19 gets 10 bytes from the middle.
  • bytes=-10 gets the last 10 bytes.
  • bytes=10- gets all bytes except the first 10.

Only a single range is supported. For more information, see the Data Access API section of the API Guide.
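The semantics above can be sketched locally; the slicing below mirrors what the server returns for each spec, though for a real download you would send the header on a GET against the Data Access API:

```python
# Sketch of single-range "Range" header semantics, simulated on an
# in-memory byte string rather than over HTTP.
def apply_range(data: bytes, spec: str) -> bytes:
    """Apply a single 'bytes=...' range spec to a byte string."""
    assert spec.startswith("bytes=")
    start, _, end = spec[len("bytes="):].partition("-")
    if start == "":                        # suffix form: bytes=-N -> last N bytes
        return data[-int(end):]
    if end == "":                          # open form: bytes=N- -> offset N to end
        return data[int(start):]
    return data[int(start):int(end) + 1]   # inclusive range: bytes=M-N

data = bytes(range(100))
print(len(apply_range(data, "bytes=0-9")))
```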

Support for Optional External Metadata Validation Scripts

The Dataverse software now allows an installation administrator to provide custom scripts for additional metadata validation when datasets are being published and/or when Dataverse collections are being published or modified. The Harvard Dataverse Repository has been using this mechanism to combat content that violates our Terms of Use, specifically spam content. All the validation or verification logic is defined in these external scripts, thus making it possible for an installation to add checks custom-tailored to their needs.

Please note that only the metadata are subject to these validation checks. This does not check the content of any uploaded files.

For more information, see the Database Settings section of the Guide. The new settings are listed below, in the "New JVM Options and DB Settings" section of these release notes.
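As one illustration, a validator might scan the exported metadata for known spam markers. The sketch below is hypothetical: the exact invocation contract is defined by the settings in the Database Settings section, and here we simply assume the script receives a path to a JSON metadata export as its first argument and signals a validation failure with a nonzero exit status:

```python
# Hypothetical external validator sketch. Assumptions (not from these
# release notes): the script receives a path to a JSON metadata export
# as argv[1] and signals a validation failure via a nonzero exit status.
import json
import sys

BLOCKLIST = {"casino", "free followers"}   # assumed spam markers

def metadata_is_clean(path: str) -> bool:
    """Return True if no blocklisted marker appears in the metadata."""
    with open(path) as f:
        text = json.dumps(json.load(f)).lower()
    return not any(marker in text for marker in BLOCKLIST)

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(0 if metadata_is_clean(sys.argv[1]) else 1)
```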

Displaying Author's Identifier as Link

In the dataset page's metadata tab, the author's identifier is now displayed as a clickable link that points to the author's profile page in the external service (ORCID, VIAF, etc.) when the identifier scheme provides a resolvable landing page. If the identifier does not match the expected scheme, no link is shown.

Auxiliary File API Enhancements

This release includes updates to the Auxiliary File API. These updates include:

  • Auxiliary files can now also be associated with non-tabular files
  • Auxiliary files can now be deleted
  • Duplicate Auxiliary files can no longer be created
  • A new API has been added to list Auxiliary files by their origin
  • Some auxiliary files were being saved with the wrong content type (MIME type); now the user can supply the content type on upload, overriding the type that would otherwise be assigned
  • Improved error reporting
  • A bugfix involving checksums for Auxiliary files

Please note that the Auxiliary files feature is experimental and is designed to support integration with tools from the OpenDP Project. If the API endpoints are not needed they can be blocked.

Major Use Cases and Infrastructure Enhancements

Newly-supported major use cases in this release include:

  • The Dataverse collection page has been optimized, resulting in quicker load times on one of the most common pages in the application (Issue #7804, PR #8143)
  • Users will now be able to specify a certain byte range in their downloads via API, allowing for downloads of file parts. (Issue #6397, PR #8087)
  • A Dataverse installation administrator can now set up metadata validation for datasets and Dataverse collections, allowing for publish-time and create-time checks for all content. (Issue #8155, PR #8245)
  • Users will be provided with clickable links to authors' ORCIDs and other IDs in the dataset metadata (Issue #7978, PR #7979)
  • Users will now be able to associate Auxiliary files with non-tabular files (Issue #8235, PR #8237)
  • Users will no longer be able to create duplicate Auxiliary files (Issue #8235, PR #8237)
  • Users will be able to delete Auxiliary files (Issue #8235, PR #8237)
  • Users can retrieve a list of Auxiliary files based on their origin (Issue #8235, PR #8237)
  • Users will be able to supply the content type of Auxiliary files on upload (Issue #8241, PR #8282)
  • The indexing process has been updated so that datasets with fewer files are indexed first, resulting in fewer failures and making it easier to identify problematically large datasets. (Issue #8097, PR #8152)
  • Users will no longer be able to create metadata records with problematic special characters, which would later require Dataverse installation administrator intervention and a database change (Issue #8018, PR #8242)
  • The Dataverse software will now appropriately recognize files with the .geojson extension as GeoJSON files rather than "unknown" (Issue #8261, PR #8262)
  • A Dataverse installation administrator can now retrieve more information about role deletion from the ActionLogRecord (Issue #2912, PR #8211)
  • Users will be able to use a new role to allow a user to respond to file download requests without also giving them the power to manage the dataset (Issue #8109, PR #8174)
  • Users will no longer be forced to update their passwords when moving from Dataverse 3.x to Dataverse 4.x (PR #7916)
  • Improved accessibility of buttons on the Dataset and File pages (Issue #8247, PR #8257)

Notes for Dataverse Installation Administrators

Indexing Performance on Datasets with Large Numbers of Files

We discovered that whenever a full reindexing needs to be performed, datasets with large numbers of files take an exceptionally long time to index. For example, in the Harvard Dataverse Repository, it takes several hours for a dataset that has 25,000 files. In situations where the Solr index needs to be erased and rebuilt from scratch (such as a Solr version upgrade, or a corrupt index, etc.) this can significantly delay the repopulation of the search catalog.

We are still investigating the reasons behind this performance issue. For now, even though some improvements have been made, a dataset with thousands of files is still going to take a long time to index. In this release, we've made a simple change to the reindexing process, to index any such datasets at the very end of the batch, after all the datasets with fewer files have been reindexed. This does not improve the overall reindexing time, but will repopulate the bulk of the search index much faster for the users of the installation.

Custom Analytics Code Changes

You should update your custom analytics code to pick up a bug fix related to tracking within the dataset files table. This release restores that tracking.

For more information, see the documentation and sample analytics code snippet provided in the Installation Guide. This update can be applied to any version 5.4+.

New ManageFilePermissions Permission

Dataverse can now support a use case in which an Admin or Curator would like to delegate the ability to grant access to restricted files to other users. This can be implemented by creating a custom role (e.g. DownloadApprover) that has the new ManageFilePermissions permission. This release introduces the new permission, and a Flyway script adjusts the existing Admin and Curator roles so they continue to have the ability to grant file download requests.
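A sketch of such a role definition follows. The JSON shape follows the custom-role pattern in the API Guide, but the alias, name, description, and host below are assumptions; check the guide for the authoritative format before creating roles:

```python
# Sketch: a custom "DownloadApprover" role carrying only the new
# ManageFilePermissions permission. Alias, name, and host are assumptions.
import json
from urllib.request import Request

role = {
    "alias": "downloadApprover",
    "name": "DownloadApprover",
    "description": "May grant file download requests without managing the dataset.",
    "permissions": ["ManageFilePermissions"],
}
req = Request(
    "http://localhost:8080/api/admin/roles",   # assumed local admin endpoint
    data=json.dumps(role).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
```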

Thumbnail Defaults

New default values have been added for the JVM settings dataverse.dataAccess.thumbnail.image.limit and dataverse.dataAccess.thumbnail.pdf.limit, of 3MB and 1MB respectively. This means that, unless specified otherwise by the JVM settings already in your domain configuration, the application will skip attempting to generate thumbnails for image files and PDFs that are above these size limits.
In previous versions, if these limits were not explicitly set, the application would try to create thumbnails for files of unlimited size, which would occasionally cause problems with very large images.
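To override the defaults, these JVM options can be set with asadmin; the byte values below are illustrative only:

```shell
# Raise the thumbnail generation limits to 4MB for images and 2MB for
# PDFs (values are in bytes and are examples; adjust $PAYARA and the
# values for your installation, then restart Payara).
$PAYARA/bin/asadmin create-jvm-options \
  "-Ddataverse.dataAccess.thumbnail.image.limit=4000000"
$PAYARA/bin/asadmin create-jvm-options \
  "-Ddataverse.dataAccess.thumbnail.pdf.limit=2000000"
```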

New JVM Options and DB Settings

The following DB settings allow configuration of the external metadata validator:

  • :DataverseMetadataValidatorScript
  • :DataverseMetadataPublishValidationFailureMsg
  • :DataverseMetadataUpdateValidationFailureMsg
  • :DatasetMetadataValidatorScript
  • :DatasetMetadataValidationFailureMsg
  • :ExternalValidationAdminOverride

See the Database Settings section of the Guides for more information.
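Database settings are applied via the admin settings API. As a sketch, the validator script could be registered like this (the script path and failure message are placeholders):

```shell
# Point dataset metadata validation at a local script
# (path is a placeholder for your own validator).
curl -X PUT -d '/usr/local/dataverse/scripts/validate_dataset.sh' \
  http://localhost:8080/api/admin/settings/:DatasetMetadataValidatorScript

# Customize the message shown to users when validation fails.
curl -X PUT -d 'This dataset failed metadata validation.' \
  http://localhost:8080/api/admin/settings/:DatasetMetadataValidationFailureMsg
```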

Notes for Developers and Integrators

Two sections of the Developer Guide have been updated:

  • Instructions on how to sync a PR in progress with develop have been added in the version control section
  • Guidance on avoiding inefficiencies in JSF render logic has been added to the "Tips" section

Complete List of Changes

For the complete list of code changes in this release, see the 5.9 Milestone in GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide. Please also contact us to get added to the Dataverse Project Map if you have not done so already.

Upgrade Instructions

...


v5.8

05 Nov 18:51
9161cd6

Dataverse Software 5.8

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Support for Data Embargoes

The Dataverse Software now supports file-level embargoes. The ability to set embargoes, up to a maximum duration (in months), can be configured by a Dataverse installation administrator. For more information, see the Embargoes section of the Dataverse Software Guides.

  • Users can configure a specific embargo, defined by an end date and a short reason, on a set of selected files or an individual file, by selecting the 'Embargo' menu item and entering information in a popup dialog. Embargoes can only be set, changed, or removed before a file has been published. After publication, only Dataverse installation administrators can make changes, using an API.

  • While embargoed, files cannot be previewed or downloaded (as if restricted, with no option to allow access requests). After the embargo expires, files become accessible. If the files were also restricted, they remain inaccessible and functionality is the same as for any restricted file.

  • By default, the citation date reported for the dataset and the datafiles in version 1.0 reflects the longest embargo period on any file in version 1.0, which is consistent with recommended practice from DataCite. Administrators can still specify an alternate date field to be used in the citation date via the Set Citation Date Field Type for a Dataset API Call.

The work to add this functionality was initiated by Data Archiving and Networked Services (DANS-KNAW), the Netherlands. It was further developed by the Global Dataverse Community Consortium (GDCC) in cooperation with and with funding from DANS.

Major Use Cases and Infrastructure Enhancements

Newly-supported major use cases in this release include:

  • Users can set file-level embargoes. (Issue #7743, #4052, #343, PR #8020)
  • Improved accessibility of form labels on the advanced search page (Issue #8169, PR #8170)

Notes for Dataverse Installation Administrators

Mitigate Solr Schema Management Problems

With Release 5.5, the <copyField> definitions had been reincluded into schema.xml to fix searching for datasets.

This release includes a final update to schema.xml and a new script update-fields.sh to manage your custom metadata fields, and to provide opportunities for other future improvements. The broken script updateSchemaMDB.sh has been removed.

You will need to replace your schema.xml with the one provided in order to make sure that the new script can function. If you do not use any custom metadata blocks in your installation, this is the only change to be made. If you do use custom metadata blocks you will need to take a few extra steps, enumerated in the step-by-step instructions below.

New JVM Options and DB Settings

  • :MaxEmbargoDurationInMonths controls whether embargoes are allowed in a Dataverse installation and can limit the maximum duration users are allowed to specify. A value of 0 months, or no setting at all, indicates embargoes are not supported. A value of -1 allows embargoes of any length.
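For example, embargoes of up to five years could be enabled via the settings API (the 60-month limit is illustrative):

```shell
# Enable embargoes with a 60-month maximum duration (value is an example).
curl -X PUT -d 60 \
  http://localhost:8080/api/admin/settings/:MaxEmbargoDurationInMonths
```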

Complete List of Changes

For the complete list of code changes in this release, see the 5.8 Milestone in GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide. Please also contact us to get added to the Dataverse Project Map if you have not done so already.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.8.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse<-version>

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start

4. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.8.war

5. Restart payara

  • service payara stop
  • service payara start

6. Update Solr schema.xml.

/usr/local/solr/solr-8.8.1/server/solr/collection1/conf is used in the examples below as the location of your Solr schema. Please adapt it to the correct location, if different in your installation. Use find / -name schema.xml if in doubt.

6a. Replace schema.xml with the base version included in this release.

   wget https://github.com/IQSS/dataverse/releases/download/v5.8/schema.xml
   cp schema.xml /usr/local/solr/solr-8.8.1/server/solr/collection1/conf

For installations that are not using any Custom Metadata Blocks, you can skip the next step.

6b. For installations with Custom Metadata Blocks

Use the script provided in the release to add the custom fields to the base schema.xml installed in the previous step.

   wget https://github.com/IQSS/dataverse/releases/download/v5.8/update-fields.sh
   chmod +x update-fields.sh
   curl "http://localhost:8080/api/admin/index/solr/schema" | ./update-fields.sh /usr/local/solr/solr-8.8.1/server/solr/collection1/conf/schema.xml

(Note that the curl command above calls the admin API on localhost to obtain the list of the custom fields. In the unlikely case that you are running the main Dataverse application and Solr on different servers, generate the schema.xml on the application node, then copy it onto the Solr server.)

7. Restart Solr

Usually service solr stop; service solr start, but may be different on your system. See the Installation Guide for more details.

v5.7

13 Oct 18:36
78c9a44

Dataverse Software 5.7

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Experimental Support for External Vocabulary Services

Dataverse can now be configured to associate specific metadata fields with third-party vocabulary services, providing an easy way for users to select values from those vocabularies. The mapping involves the use of external JavaScript. Two such scripts have been developed so far: one for vocabularies served via the SKOSMOS protocol and one allowing people to be identified via their ORCID. The guides contain information about the new :CVocConf setting used for configuration and additional information about this functionality. Scripts, examples, and additional documentation are available at the GDCC GitHub Repository.

Please watch the online presentation, read the document with requirements, and join the Dataverse Working Group on Ontologies and Controlled Vocabularies if you have questions and want to contribute.

This functionality was initially developed by Data Archiving and Networked Services (DANS-KNAW), the Netherlands, and funded by SSHOC, "Social Sciences and Humanities Open Cloud". SSHOC has received funding from the European Union’s Horizon 2020 project call H2020-INFRAEOSC-04-2018, grant agreement #823782. It was further improved by the Global Dataverse Community Consortium (GDCC) and extended with the support of semantic search.

Curation Status Labels

A new :AllowedCurationLabels setting allows sysadmins to define one or more sets of labels that can be applied to a draft Dataset version via the user interface or API to indicate the status of the dataset with respect to a defined curation process.

Labels are completely customizable (alphanumeric characters or spaces, up to 32 characters, e.g. "Author contacted", "Privacy Review", "Awaiting paper publication"). Superusers can select a specific set of labels, or disable this functionality, per collection. Anyone who can publish a draft dataset (e.g. curators) can set/change/remove labels (from the set specified for the collection containing the dataset) via the user interface or via an API. The API also allows external tools to search for, read, and set labels on Datasets, providing an integration mechanism. Labels are visible on the Dataset page and in Dataverse collection listings/search results. Internally, the labels have no effect, and at publication, any existing label will be removed. A reporting API call allows admins to get a list of datasets and their curation statuses.

The Solr schema must be updated as part of installing the release of Dataverse containing this feature for it to work.
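As an illustrative sketch, label sets could be defined via the settings API; the set names and labels below are examples, not defaults:

```shell
# Define two named sets of allowed curation labels (names and labels
# are examples). Superusers can then choose a set per collection.
curl -X PUT http://localhost:8080/api/admin/settings/:AllowedCurationLabels \
  -d '{"Standard Process":["Author contacted","Privacy Review","Awaiting paper publication"],"Alternate Process":["State 1","State 2","State 3"]}'
```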

Major Use Cases

Newly-supported major use cases in this release include:

  • Administrators will be able to set up integrations with external vocabulary services, allowing for autocomplete-assisted metadata entry, metadata standardization, and better integration with other systems (Issue #7711, PR #7946)
  • Users viewing datasets in the root Dataverse collection will now see breadcrumbs that have have a link back to the root Dataverse collection (Issue #7527, PR #8078)
  • Users will be able to more easily differentiate between datasets and files through new iconography (Issue #7991, PR #8021)
  • Users retrieving large guestbooks over the API will experience fewer failures (Issue #8073, PR #8084)
  • Dataverse collection administrators can specify which language will be used when entering metadata for new Datasets in a collection, based on a list of languages specified by the Dataverse installation administrator (Issue #7388, PR #7958)
    • Users will see the language used for metadata entry indicated at the document or element level in metadata exports (Issue #7388, PR #7958)
    • Administrators will now be able to specify the language(s) of controlled vocabulary entries, in addition to the installation's default language (Issue #6751, PR #7959)
  • Administrators and curators can now receive notifications when a dataset is created (Issue #8069, PR #8070)
  • Administrators with large files in their installation can disable the automatic checksum verification process at publish time (Issue #8043, PR #8074)

Notes for Dataverse Installation Administrators

Dataset Creation Notifications for Administrators

A new :SendNotificationOnDatasetCreation setting has been added. When true, administrators and curators (those who can publish the dataset) will get a notification when a new dataset is created. This makes it easier to track activity in a Dataverse and, for example, allow admins to follow up when users do not publish a new dataset within some period of time.

Skip Checksum Validation at Publish Based on Size

When a user requests to publish a dataset, the time taken to complete the publishing process varies based on the dataset/datafile size.

With the additional settings of :DatasetChecksumValidationSizeLimit and :DataFileChecksumValidationSizeLimit, the checksum validation can be skipped while publishing.

If the Dataverse administrator chooses to set these values, it's strongly recommended to have an external auditing system run periodically in order to monitor the integrity of the files in the Dataverse installation.

Guestbook Response API Update

With this release the Retrieve Guestbook Responses for a Dataverse Collection API will no longer produce a file by default. You may specify an output file by adding -o $YOURFILENAME to the curl command.
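For example (the server, collection alias, and guestbook ID below are placeholders):

```shell
# Download guestbook responses for a collection to a CSV file
# (server URL, collection alias, and guestbook ID are placeholders).
export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
curl -H "X-Dataverse-key:$API_TOKEN" \
  "https://demo.dataverse.org/api/dataverses/mycollection/guestbookResponses?guestbookId=1" \
  -o responses.csv
```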

Dynamic JavaServer Faces Configuration Options

This release includes a new way to easily change JSF settings via MicroProfile Config, especially useful during development.
See the development guide on "Debugging" for more information.

Enhancements to DDI Metadata Exports

Several changes have been made to the DDI exports to improve support for internationalization and to improve compliance with CESSDA requirements. These changes include:

  • Addition of xml:lang attributes specifying the dataset metadata language at the document level and for individual elements such as title and description
  • Specification of controlled vocabulary terms in duplicate elements in multiple languages (in the installation default language and, if different, the dataset metadata language)

While these changes are intended to improve harvesting and integration with external systems, they could break existing connections that make assumptions about the elements and attributes that have been changed.

New JVM Options and DB Settings

  • :SendNotificationOnDatasetCreation - A boolean setting that, if true, will send an email and notification to additional users when a Dataset is created. Messages go to those, other than the dataset creator, who have the permission necessary to publish the dataset.
  • :DatasetChecksumValidationSizeLimit - Disables the checksum validation while publishing for any dataset size greater than the limit.
  • :DataFileChecksumValidationSizeLimit - Disables the checksum validation while publishing for any datafiles greater than the limit.
  • :CVocConf - A JSON-structured setting that configures Dataverse to associate specific metadatablock fields with external vocabulary services and specific vocabularies/sub-vocabularies managed by that service.
  • :MetadataLanguages - Sets which languages can be used when entering dataset metadata.
  • :AllowedCurationLabels - A JSON Object containing lists of allowed labels (up to 32 characters, spaces allowed) that can be set, via API or UI by users with the permission to publish a dataset. The set of labels allowed for datasets can be selected by a superuser - via the Dataverse collection page (Edit/General Info) or set via API call.
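These settings are applied via the admin settings API; the values below are illustrative only:

```shell
# Skip checksum validation at publish time for datasets over ~5 GB
# (the limit is an example value, in bytes).
curl -X PUT -d 5000000000 \
  http://localhost:8080/api/admin/settings/:DatasetChecksumValidationSizeLimit

# Notify curators and admins when a new dataset is created.
curl -X PUT -d true \
  http://localhost:8080/api/admin/settings/:SendNotificationOnDatasetCreation
```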

Notes for Tool Developers and Integrators

Bags Now Support File Paths

The original Bag generation code stored all dataset files directly under the /data directory. With the addition in Dataverse of a directory path for files and then a change to allow files with different paths to have the same name, archival Bags will now use the directory path from Dataverse to avoid name collisions within the /data directory. Prior to this update, Bags from Datasets with multiple files with the same name would have been created with only one of the files with that name (with warnings in the log, but still generating the Bag).

Complete List of Changes

For the complete list of code changes in this release, see the 5.7 Milestone in GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.7.

If yo...


v5.6

04 Aug 19:30
1c2d8d8

Dataverse Software 5.6

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Release Highlights

Anonymized Access in Support of Double Blind Review

Dataverse installations can select whether or not to allow users to create anonymized private URLs and can control which specific identifying fields are anonymized. If this is enabled, users can create private URLs that do not expose identifying information about dataset depositors, allowing for double blind reviews of datasets in the Dataverse installation.

Guestbook Responses API

A new API to retrieve Guestbook responses has been added. This makes it easier to retrieve the records for large guestbooks and also makes it easier to integrate with external systems.

Dataset Semantic API (Experimental)

Dataset metadata can be retrieved, set, and updated using a new, flatter JSON-LD format - following the format of an OAI-ORE export (RDA-conformant Bags), allowing for easier transfer of metadata to/from other systems (i.e. without needing to know Dataverse's metadata block and field storage architecture). This new API also allows for the update of terms metadata (#5899).

This development was supported by the Research Data Alliance, DANS, and Sciences PO and follows the recommendations from the Research Data Repository Interoperability Working Group.
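As a sketch of the new semantic API, dataset metadata could be retrieved in the flatter JSON-LD format like this (the server and persistent identifier are placeholders):

```shell
# Retrieve dataset metadata as JSON-LD via the semantic metadata API
# (server URL and DOI are placeholders).
export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
curl -H "X-Dataverse-key:$API_TOKEN" \
  "https://demo.dataverse.org/api/datasets/:persistentId/metadata?persistentId=doi:10.5072/FK2/EXAMPLE"
```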

Dataset Migration API (Experimental)

Datasets can now be imported following the format of an OAI-ORE export (RDA-conformant Bags), allowing for easier migration from one Dataverse installation to another, as well as migration from other systems. This experimental, superuser-only endpoint also allows keeping the existing persistent identifier (where the authority and shoulder match those for which the software is configured) and publication dates.

This development was supported by DANS and the Research Data Alliance and follows the recommendations from the Research Data Repository Interoperability Working Group.

Direct Upload API Now Available for Adding Multiple Files' Metadata to the Dataset

Using the Direct Upload API, users can now add metadata of multiple files to the dataset after the files exist in the S3 bucket. This makes direct uploads more efficient and reduces server load by only updating the dataset once instead of once per file. For more information, see the Direct DataFile Upload/Replace API section of the Dataverse Software Guides.

Major Use Cases

Newly-supported major use cases in this release include:

  • Users can create Private URLs that anonymize dataset metadata, allowing for double blind peer review. (Issue #1724, PR #7908)
  • Users can download Guestbook records using a new API. (Issue #7767, PR #7931)
  • Users can update terms metadata using the new semantic API. (Issue #5899, PR #7414)
  • Users can retrieve, set, and update metadata using a new, flatter JSON-LD format. (Issue #6497, PR #7414)
  • Administrators can use the Traces API to retrieve information about specific types of user activity (Issue #7952, PR #7953)

Notes for Dataverse Installation Administrators

New Database Constraint

A new DB Constraint has been added in this release. Full instructions on how to identify whether or not your database needs any cleanup before the upgrade can be found in the Dataverse software GitHub repository. This information was also emailed out to Dataverse installation contacts.

Payara 5.2021.5 (or Higher) Required

Some changes in this release require an upgrade to Payara 5.2021.5 or higher. (See the upgrade section).

Instructions on how to update can be found in the Payara documentation. We've included the necessary steps below, but we recommend that you review the Payara upgrade instructions, as they could be helpful during any troubleshooting.

Installations upgrading from a previous Payara version shouldn't encounter a logging configuration bug in Payara 5.2021.5, but if your server.log fills with repeated notes about logging configuration and WELD complaints about loading beans, see the paragraph on logging.properties in the Installation Guide.

Enhancement to DDI Metadata Exports

To increase support for internationalization and to improve compliance with CESSDA requirements, DDI exports now have a holdings element with a URI attribute whose value is the URL form of the dataset PID.

New JVM Options and DB Settings

:AnonymizedFieldTypeNames can be used to enable creation of anonymized Private URLs and to specify which fields will be anonymized.
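For example, a typical set of identifying fields could be anonymized like this (the field list is illustrative and should be adjusted for your metadata blocks):

```shell
# Enable anonymized Private URLs and choose which field types to
# anonymize (field list is an example).
curl -X PUT \
  -d 'author, datasetContact, contributor, depositor, grantNumber, publication' \
  http://localhost:8080/api/admin/settings/:AnonymizedFieldTypeNames
```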

Notes for Tool Developers and Integrators

Semantic API

The new Semantic API is especially helpful in data migrations and getting metadata into a Dataverse installation. Learn more in the Developers Guide.

Complete List of Changes

For the complete list of code changes in this release, see the 5.6 Milestone in GitHub.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.6.

The steps below include a required upgrade to Payara 5.2021.5 or higher (it is a simple matter of reusing your existing domain directory with the new distribution). We also recommend that you review the Payara upgrade instructions in the Payara documentation, as they could be helpful during any troubleshooting.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse<-version>

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Move the current Payara directory out of the way

  • mv $PAYARA $PAYARA.MOVED

4. Download the new Payara version (5.2021.5+), and unzip it in its place

5. Replace the brand new payara/glassfish/domains/domain1 with your old, preserved domain1

6. Start Payara

  • service payara start

7. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.6.war

8. Restart payara

  • service payara stop
  • service payara start

10. Run ReExportall to update JSON Exports http://guides.dataverse.org/en/5.6/admin/metadataexport.html?highlight=export#batch-exports-through-the-api
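A sketch of the re-export call against the admin API on localhost:

```shell
# Re-export metadata for all published datasets; this runs
# asynchronously, so the call returns before the exports finish.
curl http://localhost:8080/api/admin/metadata/reExportAll
```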

Additional Release Steps

If your installation relies on the database-side stored procedure for generating sequential numeric identifiers:

Note that you can skip this step if your installation uses the default-style, randomly-generated six alphanumeric character-long identifiers for your datasets! This is the case with most Dataverse installations.

The underlying database framework has been modified in this release, to make it easier for installations to create custom procedures for generating identifier strings that suit their needs. Your current configuration will be automatically updated by the database upgrade (Flyway) script incorporated in the release. No manual configuration changes should be necessary. However, after the upgrade, we recommend that you confirm that your installation can still create new datasets, and that they are still assigned sequential numeric identifiers. In the unlikely chance that this is no longer working, please re-create the stored procedure following the steps described in the documentation for the :IdentifierGenerationStyle setting in the Configuration section of the Installation Guide for this release (v5.6).

(Running the script supplied there will NOT overwrite the position on the sequence you are currently using!)

v5.5

19 May 21:27
5fc0150

Dataverse Software 5.5

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project.

Note: this release has a change to the default value for the :ZipDownloadLimit setting, from 100 MB to 0 bytes. If you have not previously adjusted this setting from the default, your Dataverse installation will no longer generate zip files once v5.5 is installed, as the setting will now be 0 bytes. This behavior will be revisited in a later release.

Release Highlights

Auxiliary Files Accessible Through the UI

Auxiliary Files can now be downloaded from the web interface. Auxiliary files uploaded as type=DP appear under "Differentially Private Statistics" under file level download. The rest appear under "Other Auxiliary Files".

Please note that the auxiliary files feature is experimental and is designed to support integration with tools from the OpenDP Project. If the API endpoints are not needed they can be blocked.

Improved Workflow for Downloading Large Zip Files

Users trying to download a zip file larger than the Dataverse installation's :ZipDownloadLimit will now receive messaging that the zip file is too large, and the user will be presented with alternate access options. Previously, the zip file would download and files above the :ZipDownloadLimit would be excluded and noted in a MANIFEST.TXT file.

Guidelines on Depositing Code

The Software Metadata Working Group has created guidelines on depositing research code in a Dataverse installation. Learn more in the Dataset Management section of the Dataverse Guides.

New Metrics API

Users can retrieve new types of metrics and per-collection metrics. The new capabilities are described in the guides. A new version of the Dataverse Metrics web app adds interactive graphs to display these metrics. Anyone running the existing Dataverse Metrics app will need to upgrade or apply a small patch to continue retrieving metrics from Dataverse instances upgrading to this release.

Major Use Cases

Newly-supported major use cases in this release include:

  • Users can now select and download auxiliary files through the UI. (Issue #7400, PR #7729)
  • Users attempting to download zip files above the installation's size limit will receive better messaging and be directed to other download options. (Issue #7714, PR #7806)
  • Superusers can now sort users on the Dashboard. (Issue #7814, PR #7815)
  • Users can now access expanded and new metrics through a new API (Issue #7177, PR #7178)
  • Dataverse collection administrators can now add a search facet on their collection pages for the Geospatial metadatablock's "Other" field, so that others can narrow searches in their collections using the values entered in that "Other" field (Issue #7399, PR #7813)
  • Depositors can now receive guidance about depositing code into a Dataverse installation (PR #7717)

Notes for Dataverse Installation Administrators

Simple Search Fix for Solr Configuration

The introduction in v4.17 of a schema_dv_mdb_copies.xml file as part of the Solr configuration accidentally removed the contents of most metadata fields from the index used for simple searches in Dataverse (i.e., when one types a word in the normal search box without specifying which field to search). This was somewhat ameliorated/hidden by the fact that many common fields, such as description, were still included by other means.

This release removes the schema_dv_mdb_copies.xml file and includes the updates needed in the schema.xml file. Installations with no custom metadata blocks can simply replace their current schema.xml file for Solr, restart Solr, and run a 'Reindex in Place' as described in the guides.

Installations using custom metadata blocks should manually copy the contents of their schema_dv_mdb_copies.xml file (excluding the enclosing <schema> element and only including the <copyField> elements) into their schema.xml file, replacing the section between

<!-- Dataverse copyField from http://localhost:8080/api/admin/index/solr/schema -->

and

<!-- End: Dataverse-specific -->.

In existing schema.xml files, this section currently includes only one line:

<xi:include href="schema_dv_mdb_copies.xml" xmlns:xi="http://www.w3.org/2001/XInclude" />.

In this release, that line has already been replaced with the default set of <copyFields>.
It doesn't matter whether schema_dv_mdb_copies.xml was originally created manually or via the recommended updateSchemaMDB.sh script, and this fix will work with all prior versions of Dataverse from v4.17 on. If you make further changes to metadata blocks in your installation, you can repeat this process (i.e. run updateSchemaMDB.sh, copy the entries from schema_dv_mdb_copies.xml into the same section of schema.xml, restart Solr, and reindex).

Once schema.xml is updated, Solr should be restarted and a 'Reindex in Place' will be required. (Future Dataverse Software versions will avoid this manual copy step.)

Geospatial Metadata Block Updated

The Geospatial metadata block (geospatial.tsv) was updated. Dataverse collection administrators can now add a search facet on their collection pages for the metadata block's "Other" field, so that people searching in their collections can narrow searches using the values entered in that field.

Extended support for S3 Download Redirects ("Direct Downloads")

If your installation uses S3 for storage and you have "direct downloads" enabled, please note that it will now cover the following download types that were not handled by redirects in the earlier versions: saved originals of tabular data files, cached RData frames, resized thumbnails for image files and other auxiliary files. In other words, all the forms of the file download API that take extra arguments, such as "format" or "imageThumb" - for example:

/api/access/datafile/12345?format=original

/api/access/datafile/:persistentId?persistentId=doi:1234/ABCDE/FGHIJ&imageThumb=true

etc., that were previously excluded.

Since browsers follow redirects automatically, this change should not in any way affect the web GUI users. However, some API users may experience problems, if they use it in a way that does not expect to receive a redirect response. For example, if a user has a script where they expect to download a saved original of an ingested tabular file with the following command:

curl https://yourhost.edu/api/access/datafile/12345?format=original > orig.dta

it will fail to save the file when it receives a 303 (redirect) response instead of a 200. The user will need to add "-L" to the command line above to instruct curl to follow redirects:

curl -L https://yourhost.edu/api/access/datafile/12345?format=original > orig.dta

Most of your API users have likely figured this out already, since S3 redirects have long been enabled for "straightforward" downloads in your installation. But we feel it is worth a heads-up, just in case.
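For scripted clients, the redirect handling described above can be wrapped in a small helper. This is a minimal sketch; the host name, file ID, and output path are placeholders you would replace with your own.

```shell
# Hypothetical helper around the Access API download call shown above.
# -L follows the 303 redirect to S3; -f makes curl exit non-zero on
# HTTP errors instead of saving an error page as the file; -sS keeps
# output quiet but still prints errors.
download_original() {
  file_id="$1"
  out_file="$2"
  curl -L -f -sS "https://yourhost.edu/api/access/datafile/${file_id}?format=original" -o "$out_file"
}
```

Usage would look like `download_original 12345 orig.dta`.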

Authenticated User Deactivated Field Updated

The "deactivated" field on the Authenticated User table has been updated to be a non-nullable field. When the field was added in version 5.3, it was set to 'false' in an update script. If for whatever reason that update failed in the 5.3 deployment, you will need to re-run it before deploying 5.5. The update query you may need to run is: UPDATE authenticateduser SET deactivated = false WHERE deactivated IS NULL;
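If you do need to run that query, it can be applied with psql. The database name and role below are placeholders; adjust them to match your installation's settings.

```
# Hypothetical invocation; substitute your actual database name and role.
sudo -u postgres psql dvndb -c \
  "UPDATE authenticateduser SET deactivated = false WHERE deactivated IS NULL;"

# Verify that no NULLs remain (the count should be 0):
sudo -u postgres psql dvndb -c \
  "SELECT COUNT(*) FROM authenticateduser WHERE deactivated IS NULL;"
```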

Notes for Tool Developers and Integrators

S3 Download Redirects

See the above note about download redirects. If your application integrates with the Dataverse software using the APIs, you may need to change how redirects are handled in your tool or integration.

Complete List of Changes

For the complete list of code changes in this release, see the 5.5 Milestone in Github.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.5.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse<-version>

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start
Read more

v5.4.1

13 Apr 15:47
80361bf

Dataverse Software 5.4.1

This release provides a fix for a regression introduced in 5.4 and implements a few other small changes. Please use 5.4.1 for production deployments instead of 5.4.

Release Highlights

API Backwards Compatibility Maintained

The syntax in the example in the Basic File Access section of the Dataverse Software Guides will continue to work.

Direct Upload API Now Available for Replacing Files

Users can now replace files using the direct upload API. For more information, see the Direct DataFile Upload/Replace API section of the Dataverse Software Guides.

Complete List of Changes

For the complete list of code changes in this release, see the 5.4.1 Milestone in Github.

For help with upgrading, installing, or general questions please post to the Dataverse Community Google Group or email support@dataverse.org.

Installation

If this is a new installation, please see our Installation Guide.

Upgrade Instructions

0. These instructions assume that you've already successfully upgraded from Dataverse Software 4.x to Dataverse Software 5 following the instructions in the Dataverse Software 5 Release Notes. After upgrading from the 4.x series to 5.0, you should progress through the other 5.x releases before attempting the upgrade to 5.4.1.

If you are running Payara as a non-root user (and you should be!), remember not to execute the commands below as root. Use sudo to change to that user first. For example, sudo -i -u dataverse if dataverse is your dedicated application user.

In the following commands we assume that Payara 5 is installed in /usr/local/payara5. If not, adjust as needed.

export PAYARA=/usr/local/payara5

(or setenv PAYARA /usr/local/payara5 if you are using a csh-like shell)

1. Undeploy the previous version.

  • $PAYARA/bin/asadmin list-applications
  • $PAYARA/bin/asadmin undeploy dataverse<-version>

2. Stop Payara and remove the generated directory

  • service payara stop
  • rm -rf $PAYARA/glassfish/domains/domain1/generated

3. Start Payara

  • service payara start

4. Deploy this version.

  • $PAYARA/bin/asadmin deploy dataverse-5.4.1.war

5. Restart payara

  • service payara stop
  • service payara start

v5.4

05 Apr 16:29
ea91390

Dataverse Software 5.4

This release brings new features, enhancements, and bug fixes to the Dataverse Software. Thank you to all of the community members who contributed code, suggestions, bug reports, and other assistance across the project. Please note that there is an API backwards compatibility issue in 5.4, and we recommend using 5.4.1 for any production environments.

Release Highlights

Deactivate Users API, Get User Traces API, Revoke Roles API

A new API has been added to deactivate users to prevent them from logging in, receiving communications, or otherwise being active in the system. Deactivating a user is an alternative to deleting a user, especially when the latter is not possible due to the amount of interaction the user has had with the Dataverse installation. In order to learn more about a user before deleting, deactivating, or merging, a new "get user traces" API is available that will show objects created, roles, group memberships, and more. Finally, the "remove all roles" button available in the superuser dashboard is now also available via API.
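As a sketch of how these calls look from the command line (the endpoint paths below reflect our reading of the API Guide for this release, and $SERVER, $API_TOKEN, and the username jdoe are placeholders; please consult the guide for the authoritative syntax):

```
# Deactivate a user (superuser API token required)
curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER/api/admin/authenticatedUsers/jdoe/deactivate"

# Get the "traces" of a user before deciding to deactivate, delete, or merge
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER/api/users/jdoe/traces"

# Remove all of a user's assigned roles
curl -H "X-Dataverse-key:$API_TOKEN" -X POST "$SERVER/api/users/jdoe/removeRoles"
```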

New File Access API

A new API offers a crawlable view of the folders and files within a dataset:

/api/datasets/<dataset id>/dirindex/

will output a simple HTML listing, based on the standard Apache directory index, with Access API download links for individual files and recursive calls to the API above for sub-folders. Please see the Native API Guide for more information.

Using this API, wget --recursive (or a similar crawling client) can be used to download all the files in a dataset, preserving the file names and folder structure, without having to use the download-as-zip API. In addition to being faster (zipping is a relatively resource-intensive operation on the server side), this process can be restarted if interrupted (with wget --continue or equivalent), unlike zipped multi-file downloads, which always have to start from the beginning.

On a system that uses S3 with download redirects, the individual file downloads will be handled by S3 directly (with the exception of tabular files), without having to be proxied through the Dataverse application.
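A crawl of the dirindex listing might look like the following; the host, dataset ID, and the wget options that trim the local directory layout are illustrative, so adjust them for your installation.

```
# Recursively fetch every file in dataset 12345, preserving folder structure.
# -np stays below the starting URL; -nH and --cut-dirs drop the host and
# leading path components from the local copy; -c resumes an interrupted run.
wget -r -np -nH --cut-dirs=3 -c "https://yourhost.edu/api/datasets/12345/dirindex/"
```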

Restricted Files and DDI "dataDscr" Information (Summary Statistics, Variable Names, Variable Labels)

In previous releases, DDI "dataDscr" information (summary statistics, variable names, and variable labels, sometimes known as "variable metadata") for successfully ingested tabular files was available even if the files were restricted. This has been changed in the following ways:

  • At the dataset level, DDI exports no longer show "dataDscr" information for restricted files. There is only one version of this export and it is the version that's suitable for public consumption with the "dataDscr" information hidden for restricted files.
  • Similarly, at the dataset level, the DDI HTML Codebook no longer shows "dataDscr" information for restricted files.
  • At the file level, "dataDscr" information is no longer publicly available for restricted files. In practice, it was only possible to get this publicly via API (the download/access button was hidden).
  • At the file level, "dataDscr" (variable metadata) information can still be downloaded for restricted files if you have access to download the file.

Search with Accented Characters

Many languages include characters that have close analogs in ASCII (e.g. á, à, â, ç, é, è, ê, ë, í, ó, ö, ú, ù, û, ü). This release changes the default Solr configuration so that search matches words based on these associations; e.g. a search for Mercè would match the word Merce in a dataset, and vice versa. This should generally be helpful but can result in false positives, e.g. "canon" will be found when searching for "cañon".
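Under the hood, Solr handles this kind of matching with an ASCII-folding filter in a field type's analyzer chain. The fragment below is a generic Solr configuration example, not necessarily the exact analyzer shipped in this release's schema.xml:

```xml
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- folds á, ç, è, ñ, etc. to their ASCII analogs at index and query time -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```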

Java 11, PostgreSQL 13, and Solr 8 Support/Upgrades

Several of the core components of the Dataverse Software have been upgraded. Specifically:

  • The Dataverse Software now runs on and requires Java 11. This will provide performance and security enhancements, allows developers to take advantage of new and updated Java features, and moves the project to a platform with better longer term support. This upgrade requires a few extra steps in the release process, outlined below.
  • The Dataverse Software has now been tested with PostgreSQL versions up to 13. Versions 9.6+ will still work, but this update is necessary to support the software beyond PostgreSQL EOL later in 2021.
  • The Dataverse Software now runs on Solr 8.8.1, the latest available stable release in the Solr 8.x series.

Saved Search Performance Improvements

A refactoring has greatly improved Saved Search performance in the application. If your installation has multiple, potentially long-running Saved Searches in place, this greatly improves the probability that those search jobs will complete without timing out.

Worldmap/Geoconnect Integration Now Obsolete

As of this release, the Geoconnect/Worldmap integration is no longer available. The Harvard University Worldmap is going through a migration process, and instead of updating this code to work with the new infrastructure, the decision was made to pursue future Geospatial exploration/analysis through other tools, following the External Tools Framework in the Dataverse Software.

Guides Updates

The Dataverse Software Guides have been updated to follow recent changes to how different terms are used across the Dataverse Project. For more information, see Mercè's note to the community:

https://groups.google.com/g/dataverse-community/c/pD-aFrpXMPo

Conditionally Required Metadata Fields

Prior to this release, when defining metadata for compound fields (via their dataset field types), fields could either be optional or required; i.e., if required, you must always have (at least one) value for that field. For example, Author Name being required means you must have at least one Author with a nonempty Author Name.

In order to support more robust metadata (and specifically to resolve #7551), a third case is now allowed: Conditionally Required, that is, the field is required if and only if any of its "sibling" fields are entered. For example, Producer Name is now conditionally required in the citation metadata block. A user does not have to enter a Producer, but if they do, they have to enter a Producer Name.

Major Use Cases

Newly-supported major use cases in this release include:

  • Dataverse Installation Administrators can now deactivate users using a new API. (Issue #2419, PR #7629)
  • Superusers can remove all of a user's assigned roles using a new API. (Issue #2419, PR #7629)
  • Superusers can use an API to gather more information about actions a user has taken in the system in order to make an informed decision about whether or not to deactivate or delete a user. (Issue #2419, PR #7629)
  • Superusers will now be able to harvest from installations using ISO-639-3 language codes. (Issue #7638, PR #7690)
  • Users interacting with the workflow system will receive status messages (Issue #7564, PR #7635)
  • Users interacting with prepublication workflows will see speed improvements (Issue #7681, PR #7682)
  • API Users will receive Dataverse collection API responses in a deterministic order. (Issue #7634, PR #7708)
  • API Users will be able to access a list of crawlable URLs for file download, allowing for faster and easily resumable transfers. (Issue #7084, PR #7579)
  • Users will no longer be able to access summary stats for restricted files. (Issue #7619, PR #7642)
  • Users will now see truncated versions of long strings (primarily checksums) throughout the application (Issue #6685, PR #7312)
  • Users will now be able to easily copy checksums, API tokens, and private URLs with a single click (Issue #6039, Issue #6685, PR #7539, PR #7312)
  • Users uploading data through the Direct Upload API will now be able to use additional checksums (Issue #7600, PR #7602)
  • Users searching for content will now be able to search using non-ASCII characters. (Issue #820, PR #7378)
  • Users can now replace files in draft datasets, a functionality previously only available on published datasets. (Issue #7149, PR #7337)
  • Dataverse Installation Administrators can now set subfields of compound fields as conditionally required, that is, the field is required if and only if any of its "sibling" fields are entered. For example, Producer Name is now conditionally required in the citation metadata block. A user does not have to enter a Producer, but if they do, they have to enter a Producer Name. (Issue #7606, PR #7608)

Notes for Dataverse Installation Administrators

Java 11 Upgrade

There are some things to note and keep in mind regarding the move to Java 11:

  • You should install the JDK/JRE following your usual methods, depending on your operating system. An example of this on a RHEL/CentOS 7 or RHEL/CentOS 8 system is:

    $ sudo yum remove java-1.8.0-openjdk java-1.8.0-openjdk-devel java-1.8.0-openjdk-headless

    $ sudo yum install java-11-openjdk-devel

    The remove command may provide an error message if -headless isn't installed.

  • We targeted and tested Java 11, but Java 11+ will likely work. Java 11 was targeted because of its long-term support.

  • If you're moving from a Dataverse installation that was previously running Glassfish 4.x (typically this would be Dataverse Software 4.x), you will need to adjust some JVM options in domain.xml as part of the upgrade process. We've provided these optional steps below. These steps are not required if your first installed Dataverse Software version was running Payara 5.x (typically Dataverse Software 5.x).
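After switching packages, a quick sanity check (guarded so it is a no-op on machines without java on the PATH) confirms which JDK is active before you redeploy:

```shell
# Print the active Java version, if any; expect a line mentioning 11.x
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
fi
```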

PostgreSQL Versions Up To 13 Supported

Up until this release our installation guide "strongly recommended" to install PostgreSQL v. 9.6. While tha...

Read more