Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IQSS/6497 migrate dataset api #7504

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Jan 8, 2021

What this PR does / why we need it: This PR supports migration of datasets from other systems, keeping their PID, version, and publicationdate, and allowing file upload via the existing API. It leverages leveraging the json-ld metadata api from #7414.
(Note that #7414 in turn requires resolution of updating our Java version or having a work-around for Java 8 added there - see that PR for details).

Which issue(s) this PR closes:

Closes #6497

Special notes for your reviewer: This is similar to the :import API but is different (hopefully usefully) in a few ways:

  • that API call requires files to be prepositioned. The api calls here allow you to create a dataset to be migrated, then add files/filemetadata using the existingAPIs (inlcuding direct upload or using the HttpOverlay store in Simple support for remote web stores #7325), and then conclude by marking the dataset as published.
  • this call uses the json-ld metadata format from the ORE export, Bag archiving, and Iqss/6497 semantic api #7414 which is flatter and supports incremental metadata updates (one doesn't have to download the whole metadata tree, make an edit and reupload everything),
  • the last point makes it highly compatible with re-importing an archived Bag which includes the same ORE/json-ld formatted metadata
  • since dataset creation and version update/publish capabilities are separated into two calls, it may be simpler to support migration of multiple versions via this API (it may work already - I still need to test that capability).

FYI: The last step is slightly different than the normal publish step since version and publication dates are preserved. The PR does update the PID/DOI (e.g. the landing URL specifically) assuming it has the authority/shoulder configured in Dataverse.

Suggestions on how to test this: The DVUploader uses this API to re-import Bags. Nominally running an upload on Bags of the sample datasets would test this fairly well (current versions at https://drive.google.com/drive/folders/1YXAmJLziFs7XhsvRfqfZ6o9AvLXywyBT?usp=sharing , note that work to improve the Bag/ORE contents is ongoing, so these may get out of date w.r.t. code, so check dates/remind me if an update looks like its needed).

Assuming #7414 is tested separately, the testing here could/should focus on just the added migrate calls, so looking for datasets with a broad range of metadata probably isn't necessary (since testing #7414 covers that.) and the things to look at in a migrated dataset are the DOI, creation/publication dates, whether the DOI is updated at DataCite, etc.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: Not directly - there are questions in the community (remote store/registry metadata working group topic, DANS migration, etc.) as to how one should indicate a migrated dataset, what metadata about the source should be captured, etc. This PR doesn't add any metadata or make an display changes, but this PR may increase interest in resolving these other questions.

Is there a release notes update needed for this change?:
Probably worthwhile to note it exists. As a v~1 implementation, I'd phrase this as an ~experimental API (I don't think we've decided exactly how to phrase this but there have been other recent API adds in this category).
Additional documentation:

FWIW: We have an error handler for the
edu.harvard.iq.dataverse.util.json.JsonParseException class but not for
javax.json.stream.JsonParsingException which was getting caught by the
Throwable handler and returned as a 500 error with json message {}
and avoid having contexts with specific entries for terms that are in a
namespace already
Conflicts:
	src/main/java/edu/harvard/iq/dataverse/util/bagit/OREMap.java
which can be used with existing AbstractApiBean.ok()
@qqmyers qqmyers marked this pull request as draft January 8, 2021 17:16
qqmyers and others added 3 commits July 8, 2021 13:12
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
IQSS/6497-migrate_dataset_api

Conflicts:
	src/main/java/edu/harvard/iq/dataverse/util/json/JSONLDUtil.java
@@ -22,7 +22,7 @@ This dashboard tool allows you to define sets of local datasets to make availabl
Metadata Export
---------------

This part of the Dashboard is simply a reminder message that metadata export happens through the Dataverse Software API. See the :doc:`metadataexport` section and the :doc:`/api/native-api` section of the API Guide for more details.
This part of the Dashboard is simply a reminder message that metadata export happens through the Dataverse Software API. See the :doc:`/admin/metadataexport` section and the :doc:`/api/native-api` section of the API Guide for more details.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: unrelated - just found it when fixing another link

if(migrating) {
// unparsable PID passed. Terminate.
throw new BadRequestException("Cannot parse the @id. Make sure it is in valid form - see Dataverse Native API documentation.");
if (migrating) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: Submitting an '@id' value for the dataset PID, and a modificationdate (next diff) are only allowed for migrate (not valid in the create dataset case), so these are cleaner fixes to just skip running into the null values in the create case.

@qqmyers
Copy link
Member Author

qqmyers commented Jul 13, 2021

OK - with the semantic api PR merged, this is ready for review - just minor changes beyond the new migrate and releasemigrated methods.

@pdurbin pdurbin self-assigned this Jul 16, 2021
Copy link
Member

@pdurbin pdurbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code-wise, I think this is ready for QA but there are two things I think we should fix:

  • The new integrations tests are yielding a 500 error.
  • We should update the list of APIs to include this one and others that have been recently added. Otherwise they're hard to find. @qqmyers let me know if you'd like me to make this change.

// Now use the migrate API to recreate the dataset
response = UtilIT.recreateDatasetJsonLD(apiToken, dataverseAlias, expectedString);
String body = response.getBody().asString();
response.then().assertThat().statusCode(CREATED.getStatusCode());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is showing a 500 error:

Screen Shot 2021-07-16 at 2 52 51 PM

@@ -0,0 +1,51 @@
Dataset Migration API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're seeing a number of APIs show up under the Dev Guide, it's probably time to update the "List of Dataverse APIs" I started when I did a light re-write of the API Guide. As of https://guides.dataverse.org/en/5.5/api/intro.html#lists-of-dataverse-apis this is how it looks:

Screen Shot 2021-07-16 at 2 56 08 PM

@qqmyers can you please update the "Please note that some APIs are only documented in other guides that are more suited to their audience" section and add "Developer Guide" and under that recent APIs that have been added? (Please include aux files in this.) Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please update the "Please note that some APIs are only documented in other guides that are more suited to their audience" section and add "Developer Guide" and under that recent APIs that have been added?

@qqmyers I just opened this pull request for this: GlobalDataverseCommunityConsortium#13

@qqmyers
Copy link
Member Author

qqmyers commented Jul 23, 2021

OK - @pdurbin made the doc change he asked for and I fixed the test - moving to QA (feel free to drop it back if there's something else that needs to happen first).

@kcondon kcondon self-assigned this Jul 26, 2021
@kcondon
Copy link
Contributor

kcondon commented Jul 27, 2021

Some questions/issues encountered:

  1. Update release notes, remove solr step since it is no longer necessary, per Jim
  2. Check and mention super user only in doc for api
  3. In (re)create example in doc, shows PERSISTENT_IDENTIFIER in env variable but does not reference it on command line
  4. When updating sample jsonld file with a real, published PID from demo, fails saying cannot recreate dataset that had yet to be created (sample file missing publication date)
  5. Recreate Endpoint doesn't seem to read doi jvm-options, verifies against prod namespace, doesn't work with test namespace: curl -H X-Dataverse-key:2642aa13-5e8b-4ff4-884d-4841545c06ce -X POST "http://localhost:8080/api/dataverses/kc072621/datasets/:startmigration" --upload-file dataset-migrate.jsonld
    {"status":"ERROR","message":"Provided PID does not exist. Status code for GET 'https://doi.org/10.70122/FK2/T6ESOU' is 404."

@qqmyers
Copy link
Member Author

qqmyers commented Jul 27, 2021

@kcondon - I addressed 1-4 in the latest commit. For 5), I added #8028 and referenced that in the docs.

@kcondon kcondon merged commit 1c3dbe8 into IQSS:develop Jul 27, 2021
@djbrooke djbrooke added this to the 5.6 milestone Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
GDCC: DANS related to GDCC work for DANS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Semantic Metadata API calls
6 participants