RO-Crate exporter PoC #10086

Closed

Conversation

beepsoft
Contributor

@beepsoft beepsoft commented Nov 2, 2023

What this PR does / why we need it:

This PR is a proof of concept implementation of an RO-Crate metadata JSON exporter, as a followup to this issue: #8688

There are two strategies to generate ro-crate-metadata.json:

  1. The RO-Crate spec suggests the use of Schema.org for describing datasets (https://www.researchobject.org/ro-crate/1.1/metadata.html#base-metadata-standard-schemaorg). For this we need to map the Dataverse dataset metadata fields to Schema.org properties. The advantage of this approach is that the created ro-crate-metadata.json could be interpreted by all RO-Crate tools using Schema.org. The disadvantage of this is, however, that there could be data fields in a DV dataset that cannot be mapped to a Schema.org property, and so this would lead to a lossy export.

  2. The RO-Crate spec allows the use of alternative schema/vocabularies as well:

However, as RO-Crate uses the Linked Data principles, adopters of RO-Crate are free to
supplement RO-Crate using Schema.org metadata and/or assertions using other Linked Data
vocabularies.

This opens up the possibility to use Dataverse's metadatablocks as our schemas/vocabularies and generate the ro-crate-metadata.json accordingly.

This PR implements RO-Crate metadata generation this way, based on metadatablocks, thus exporting all available data of a dataset.
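To make the trade-off between the two strategies concrete, here is a minimal sketch. It is not the code in this PR: the class name, the method names and the three-entry mapping table are hypothetical, and `unitOfAnalysis` is simply assumed to have no Schema.org counterpart for the purpose of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the two export strategies; not the PR's implementation. */
final class ExportStrategySketch {

    // Illustrative mapping only (an assumption); a real mapping would be much larger.
    static final Map<String, String> SCHEMA_ORG_MAPPING = Map.of(
            "title", "name",
            "dsDescription", "description",
            "keyword", "keywords");

    /** Strategy 1: keep only fields that have a Schema.org counterpart (lossy). */
    static Map<String, String> toSchemaOrg(Map<String, String> datasetFields) {
        Map<String, String> out = new LinkedHashMap<>();
        datasetFields.forEach((field, value) -> {
            String property = SCHEMA_ORG_MAPPING.get(field);
            if (property != null) {
                out.put(property, value); // unmapped fields are silently dropped
            }
        });
        return out;
    }

    /** Strategy 2: keep every field under its metadatablock field name (lossless). */
    static Map<String, String> toMetadatablockTerms(Map<String, String> datasetFields) {
        return new LinkedHashMap<>(datasetFields);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Sample dataset");
        fields.put("unitOfAnalysis", "household");
        System.out.println("Schema.org strategy:    " + toSchemaOrg(fields));
        System.out.println("Metadatablock strategy: " + toMetadatablockTerms(fields));
    }
}
```

With the first strategy the `unitOfAnalysis` value disappears from the output, while the second keeps it at the cost of using Dataverse-specific terms.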

Known issues:

RO-Crate and JSON-LD require URIs to be specified for properties in @context. For this to work, every metadatablock field must have a URI associated with it. In the Citation MDB, some fields have an explicitly associated URI (e.g. http://purl.org/dc/terms/title for the "title" field), while others have an implicit URI formed from the MDB's namespaceUri plus the field's name. However, other MDBs such as geospatial and socialscience have neither explicit URIs nor a namespaceUri. This results in null values in the @context:

image

This can be easily overcome by associating a namespaceUri with every MDB, similar to the citation MDB.
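The lookup order described above can be sketched roughly as follows. This is an illustration only: `ContextUriSketch`, `Field`, `resolveUri` and `buildContext` are hypothetical names, the `https://example.org/...` namespace is a placeholder, and none of it is the actual Dataverse or ARP code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of @context URI resolution; not Dataverse code. */
final class ContextUriSketch {

    /** Minimal stand-in for a metadatablock field (hypothetical type). */
    record Field(String name, String explicitUri) {}

    /**
     * Resolves the URI for one field:
     * 1. the field's explicit URI, if it has one;
     * 2. otherwise the block's namespaceUri plus the field name;
     * 3. otherwise null, which is the problem case described above.
     */
    static String resolveUri(Field field, String blockNamespaceUri) {
        if (field.explicitUri() != null) {
            return field.explicitUri();
        }
        if (blockNamespaceUri != null) {
            return blockNamespaceUri + field.name();
        }
        return null;
    }

    /** Builds a field name -> URI map for the fields of one metadatablock. */
    static Map<String, String> buildContext(List<Field> fields, String blockNamespaceUri) {
        Map<String, String> context = new LinkedHashMap<>();
        for (Field f : fields) {
            context.put(f.name(), resolveUri(f, blockNamespaceUri));
        }
        return context;
    }

    public static void main(String[] args) {
        // A block with a namespace: every field gets a URI.
        System.out.println(buildContext(
                List.of(new Field("title", "http://purl.org/dc/terms/title"),
                        new Field("subtitle", null)),
                "https://example.org/schema/citation/"));
        // A block without a namespace: the field ends up with a null URI.
        System.out.println(buildContext(
                List.of(new Field("geographicCoverage", null)), null));
    }
}
```

Giving every MDB a namespaceUri means the third branch is never reached.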

Suggestions on how to test this:

The easiest way to test it is to run Dataverse in docker following the container guide

mvn -Pct clean package
mvn -Pct docker:run

Once a dataset is created and published the "RO-Crate (ARP style)" button appears under "Export Metadata":

image

Clicking it generates the RO-Crate metadata JSON.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No, it just adds a new item to the "Export Metadata" dropdown.

Additional documentation:

The code for this implementation has been extracted from the ARP project in the frame of FAIR-IMPACT's 1st Open Call "Enabling FAIR Signposting and RO-Crate for content/metadata discovery and consumption" by SZTAKI DSD (Department of Distributed Systems).

Implements an exporter to generate RO-Crate JSON representation of a dataset
using the metadatablocks of the dataset as the schema of the RO-Crate.

This class has been extracted from the ARP project
 (https://science-research-data.hu/en) in the frame of FAIR-IMPACT's 1st Open
 Call "Enabling FAIR Signposting and RO-Crate for content/metadata discovery
 and consumption".
@qqmyers
Member

qqmyers commented Nov 2, 2023

Interesting work! Two quick comments: Fields should always be getting a valid context entry. If the field and the metadatablock don't have an assigned URI, one should be created via

private String getAssignedNamespaceUri() {
. If there are nulls appearing, it is a bug that should be reported/fixed asap.

Second, as of 5.14/6.0, Exporters can be created as stand-alone jars - see https://github.com/gdcc/dataverse-exporters (which is linked from https://guides.dataverse.org/en/latest/developers/metadataexport.html#building-an-exporter). Although this is a new feature, I think that it is probably now the preferred route for adding new exporters these days (versus a PR to add it to the Dataverse code base). I think this would be the first real exporter to use that mechanism and I'd/we'd be happy to help if you run into any problems. (hopefully not much of a code change for you as you're already using the new ExportDataProvider interface).

getJsonLDNamespace() uses getAssignedNamespaceUri(), which makes sure
a valid URI is always available.
@beepsoft
Contributor Author

beepsoft commented Nov 2, 2023

@qqmyers, thanks for the suggestions!

Interesting work! Two quick comments: Fields should always be getting a valid context entry. If the field and the metadatablock don't have an assigned URI, one should be created via

private String getAssignedNamespaceUri() {

. If there are nulls appearing, it is a bug that should be reported/fixed asap.

Unfortunately getAssignedNamespaceUri() is private, which is why I used getNamespaceUri(). Can getAssignedNamespaceUri() be made public? I also found getJsonLDNamespace(), which also generates a URI using getAssignedNamespaceUri(), so I am using it now, and this solves the null values in @context.

I think this would be the first real exporter to use that mechanism and I'd/we'd be happy to help if you run into any problems. (hopefully not much of a code change for you as you're already using the new ExportDataProvider interface).

Sure, I could try that. One thing is that ExportDataProvider provides JSON/XML/etc format of the metadata, while our RoCrateManager works directly with the Dataset/DatasetField/DatasetFieldType/etc objects. Moreover, the ultimate goal is to generate a downloadable RO-Crate .zip at the end, which packages all the files in the dataset together with the ro-crate-metadata.json. This would require deeper integration and UI changes at the end.

@qqmyers
Member

qqmyers commented Nov 2, 2023

Ah - I didn't look closely to see that you are going back to using the internal classes. Nominally the ExportDataProvider.getDatasetORE() gives you the json-ld context and 'everything' you'd need (all metadata, links to retrieve all the files) so you shouldn't have to go back to the database. (There are a couple known missing things in the ORE output, e.g. auxiliary files that were added after the ORE/archival bags functionality. If there's anything you need for an ROCrate that isn't available, we should add it.)

@pdurbin
Member

pdurbin commented Nov 2, 2023

Great work and as I said in the container meeting this morning, it's fun that you used containers!

I let the RO-Crate folks know about it: https://seek4science.slack.com/archives/C01LQQAAAS1/p1698956581547419

Also (back to containers), I'm glad it motivated you to work on faster redeploys! I can't wait to try this:

Thanks!!


@kinow kinow left a comment

Was curious about what Java library was used (I am using RO-Crate in Python, but former Java developer). The code looks interesting! A bit hard to follow all the for-loops and what's happening without knowing dataverse. But if you share some generated JSON files maybe I can have a look, run it through runcrate too (if you haven't done it). @simleo knows RO-Crate a lot more, and is a master in spotting issues in these JSON files 👍

@beepsoft
Contributor Author

beepsoft commented Nov 3, 2023

@kinow, thanks for the suggestions!

We use https://github.com/kit-data-manager/ro-crate-java. Quite nice library, but sometimes a bit rigid.

Here's an example ro-crate-metadata.json with multiple metadatablocks used:

ro-crate-metadata.json

And regarding the implementation: it could be improved in many ways, but we just wanted to start a discussion whether this MDB based approach or a Schema.org mapped one is a better solution for Dataverse. Or maybe both, depending on the user's use case. The Schema.org mapped version could provide easier transfer to other repositories, while the MDB based approach could allow easier transfer to other DV installations.

Moreover, in our ARP project we use RO-Crate to extend the file metadata capabilities of Dataverse. Namely, in the current form DV only provides minimal metadata for files. In our system we allow editing the ro-crate-metadata.json to add (practically any) metadata to files, so that when the dataset is exported as RO-Crate you get all the metadata you entered in Dataverse + all additional metadata you added at the "RO-Crate layer".

@kinow

kinow commented Nov 3, 2023

We use https://github.com/kit-data-manager/ro-crate-java. Quite nice library, but sometimes a bit rigid.

Saw that in the pom.xml, but hadn't heard about that lib before.

And regarding the implementation: it could be improved in many ways, but we just wanted to start a discussion whether this MDB based approach or a Schema.org mapped one is a better solution for Dataverse. Or maybe both, depending on the user's use case. The Schema.org mapped version could provide easier transfer to other repositories, while the MDB based approach could allow easier transfer to other DV installations.

Got it!

Moreover, in our ARP project we use RO-Crate to extend the file metadata capabilities of Dataverse. Namely, in the current form DV only provides minimal metadata for files. In our system we allow editing the ro-crate-metadata.json to add (practically any) metadata to files, so that when the dataset is exported as RO-Crate you get all the metadata you entered in Dataverse + all additional metadata you added at the "RO-Crate layer".

I am no expert, so I rely on runcrate, uploading things to WorkflowHub.eu, or asking people like Simone to help review my RO-Crates. I tested the file above with runcrate, and while it didn't produce an error, there was no output.

I think that's because you have no action in your file. You don't have the provenance of what was used to generate the data? A person, a workflow, some software? If so, I think a CreateAction + instrument would enable runcrate to produce some output. But I think what you have is already valid, and as you said you are testing things (there's even a PoC in the title) maybe the CreateAction is not needed now 👍

Cheers
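For readers unfamiliar with the CreateAction suggestion above, such an entity could look roughly like the one built below. This is only a sketch: the class and method names, the `#run-1` identifier and the file names are made up for illustration, and the exporter in this PR does not produce such an entity.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a CreateAction entity for the RO-Crate @graph. */
final class CreateActionSketch {

    /** A JSON-LD reference to another entity in the crate. */
    static Map<String, Object> ref(String id) {
        return Map.of("@id", id);
    }

    /** Builds a CreateAction that links an instrument (software, workflow) to its results. */
    static Map<String, Object> createAction(String actionId, String instrumentId, List<String> resultIds) {
        Map<String, Object> action = new LinkedHashMap<>();
        action.put("@id", actionId);
        action.put("@type", "CreateAction");
        action.put("instrument", ref(instrumentId));
        action.put("result", resultIds.stream().map(CreateActionSketch::ref).toList());
        return action;
    }

    public static void main(String[] args) {
        // All identifiers below are placeholders.
        System.out.println(createAction("#run-1", "workflow.cwl", List.of("output.csv")));
    }
}
```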

@qqmyers
Member

qqmyers commented Nov 4, 2023

FWIW: In general, Dataverse maps metadata items to schema.org if there's a term with the same meaning and range. If there are more that we could map, I think there would be interest in updating the metadata blocks, etc. Due to that, I'm not sure there's much difference between a schema.org vs. mdb approach, other than that schema.org only would be lossy.

@okaradeniz
Contributor

This looks so interesting! We are also working on an external exporter but took the first strategy instead, with customizability to counter the disadvantage of the approach.

@ptsefton

@kinow Re your comments about runcrate above: note that runcrate is a family of RO-Crate profiles for a particular purpose (documenting the outcomes of workflows). A general-purpose repository like Dataverse is geared more toward general descriptions of many kinds of data. Workflows may be among them, but not all RO-Crates are expected to be runcrate compatible.

@simleo

simleo commented Nov 29, 2023

Just to clarify: the collection of RO-Crate profiles for capturing workflow provenance is called Workflow Run RO-Crate, while runcrate is a toolkit to interact with RO-Crates that follow the Workflow Run RO-Crate profiles.

@pdurbin
Member

pdurbin commented Jan 5, 2024

Related:

Also, @beepsoft I just noted there are merge conflicts. Would you be able to resolve them? Thanks!

@pdurbin pdurbin added the Size: 10 A percentage of a sprint. 7 hours. label Feb 28, 2024
@scolapasta
Contributor

If you are still interested in this PR, can you please merge and resolve any merge conflicts with the latest from develop? If so, we can prioritize reviewing and QAing the changes. If we don’t hear from you by May 22, 2024, we’ll go ahead and close this PR (it can always be reopened after that date, if there is still interest).

@beepsoft
Contributor Author

Thanks @scolapasta for the heads up! I just fixed the issues, sorry for the long delay.

Do you have any other suggestions to update in this PR? For example, it is now added as a default exporter, so once it is merged, it will be available in all Dataverse installations. Can this be made an optional exporter somehow, which users can install if they need?

@qqmyers
Member

qqmyers commented Apr 24, 2024

FWIW: There is now an external exporter mechanism and separate repo for creating exporters that can be dropped into an installation as a separate jar file. See https://github.com/gdcc/dataverse-exporters and the examples and PRs there for details. That would make the exporter optional and something you could update outside the Dataverse release cycle.

@beepsoft
Contributor Author

@qqmyers the problem I faced with the ExportDataProvider was that it only provides the dataset JSON, whereas for the RO-Crate I needed a deeper understanding of the dataset. For example, I needed to access the DatasetServiceBean, which I would not be able to reference in a standalone exporter project. At least that is my current understanding.

@qqmyers
Member

qqmyers commented Apr 24, 2024

I haven't tried it but I think you may be able to access those classes in an external exporter - you would just have to build against Dataverse rather than the smaller spi jar. That said, the intent with the interface was to give you all the info needed to create an export so if there's additional information that needs to get sent, let us know and perhaps we can add it.

@beepsoft
Contributor Author

The necessary data could probably be gathered from a combination of the JSON and OAI_ORE exports, but we started implementing our solution before the new exporter API had been released.

@ErykKul
Collaborator

ErykKul commented Jul 9, 2024

@beepsoft
I have tried porting this exporter to the Dataverse Transformer Exporter: ARP RO-Crate example
To test it, you need to copy that example folder containing the config.json and the transformer.py files together with the JAR file to your exporters dir. See also README.md. Can you try it out? All feedback is appreciated!

@cmbz cmbz added the FY25 Sprint 3 FY25 Sprint 3 label Jul 31, 2024
@pdurbin pdurbin self-assigned this Aug 5, 2024
@pdurbin
Member

pdurbin commented Aug 5, 2024

This PR was in "ready for review" so I picked it up, but my understanding is that we have three different RO-Crate implementations to evaluate:

  1. This PR that changes the core Dataverse code, developed by @beepsoft.

  2. https://github.com/gdcc/exporter-ro-crate which is an external exporter, developed by @okaradeniz.

  3. https://github.com/gdcc/exporter-transformer/tree/main/examples/arp-ro-crate which is an external exporter built on top of the "transformer" framework, developed by @ErykKul.

Generally speaking, now that we have an external exporter framework, we are favoring implementations that use it for a number of reasons:

  • No changes necessary to the core code.
  • Can have a separate life and release cycle.

So at minimum, I'd like the guides to be updated to reference one or more external RO-Crate exporters.

As for this pull request, there are merge conflicts at present. @beepsoft please feel free to resolve them but I think I'm going to play around first with @okaradeniz's exporter. Then I'm hoping to look at the other two.

@beepsoft
Contributor Author

beepsoft commented Aug 6, 2024

@pdurbin I think we can ignore my PR and go with @ErykKul's implementation as it is more generic and works for our use case as well.

@pdurbin
Member

pdurbin commented Aug 6, 2024

@beepsoft sounds good. Thanks. I created a new issue to track testing those other two implementations:

@pdurbin pdurbin removed their assignment Aug 6, 2024
Labels
FY25 Sprint 3 FY25 Sprint 3 GREI 3 Search and Browse GREI 6 Connect Digital Objects Size: 10 A percentage of a sprint. 7 hours.