Merge pull request #10543 from IQSS/10169-JSON-schema-validation
Improved JSON Schema validation for datasets
landreev authored Aug 14, 2024
2 parents 7d4d534 + 55a8bce commit 53f9b45
Showing 9 changed files with 667 additions and 12 deletions.
3 changes: 3 additions & 0 deletions doc/release-notes/10169-JSON-schema-validation.md
@@ -0,0 +1,3 @@
### Improved JSON Schema validation for datasets

Enhanced JSON Schema validation with checks for required and allowed child objects, and type checking for the field types `primitive`, `compound`, and `controlledVocabulary`. Error messages are now more user-friendly, making it easier to pinpoint issues in the dataset JSON. See [Retrieve a Dataset JSON Schema for a Collection](https://guides.dataverse.org/en/6.3/api/native-api.html#retrieve-a-dataset-json-schema-for-a-collection) in the API Guide and PR #10543.
22 changes: 17 additions & 5 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -566,9 +566,7 @@ The fully expanded example above (without environment variables) looks like this
Retrieve a Dataset JSON Schema for a Collection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Retrieves a JSON schema customized for a given collection in order to validate a dataset JSON file prior to creating the dataset. This
first version of the schema only includes required elements and fields. In the future we plan to improve the schema by adding controlled
vocabulary and more robust dataset field format testing:
Retrieves a JSON schema customized for a given collection in order to validate a dataset JSON file prior to creating the dataset:

.. code-block:: bash
@@ -593,8 +591,22 @@ While it is recommended to download a copy of the JSON Schema from the collection
Validate Dataset JSON File for a Collection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Validates a dataset JSON file customized for a given collection prior to creating the dataset. The validation only tests for json formatting
and the presence of required elements:
Validates a dataset JSON file customized for a given collection prior to creating the dataset.

The validation tests for:

- JSON formatting
- required fields
- typeClass must follow these rules:

- if multiple = true then value must be a list
- if typeClass = ``primitive`` the value object is a String or a List of Strings depending on the multiple flag
- if typeClass = ``compound`` the value object is a FieldDTO or a List of FieldDTOs depending on the multiple flag
- if typeClass = ``controlledVocabulary`` the values are checked against the list of allowed values stored in the database
- typeName validations (child objects with their required and allowed typeNames are configured automatically by the database schema). Examples include:

- dsDescription validation includes checks for typeName = ``dsDescriptionValue`` (required) and ``dsDescriptionDate`` (optional)
- datasetContact validation includes checks for typeName = ``datasetContactName`` (required) and for ``datasetContactEmail`` and ``datasetContactAffiliation`` (optional)

.. code-block:: bash
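The typeClass rules listed above can be sketched in plain Java. This is an illustrative stand-in only: it uses `Map`/`List` in place of the real FieldDTO and parsed JSON types, and the class and method names (`TypeClassRules`, `valueMatches`) are hypothetical, not part of the Dataverse codebase.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the typeClass/multiple rules described in the guide;
// the real implementation lives in JSONDataValidation.
public class TypeClassRules {

    // Returns true when the "value" object is consistent with the field's
    // typeClass and multiple flag.
    public static boolean valueMatches(String typeClass, boolean multiple, Object value) {
        if (multiple && !(value instanceof List)) {
            return false; // multiple = true requires a list value
        }
        switch (typeClass) {
            case "primitive":
                // a String, or a List of Strings when multiple = true
                return multiple
                        ? ((List<?>) value).stream().allMatch(v -> v instanceof String)
                        : value instanceof String;
            case "compound":
                // a FieldDTO (here: a Map of child fields), or a List of them
                return multiple
                        ? ((List<?>) value).stream().allMatch(v -> v instanceof Map)
                        : value instanceof Map;
            case "controlledVocabulary":
                // Strings only; the real validator additionally checks each
                // value against the allowed values stored in the database.
                return multiple
                        ? ((List<?>) value).stream().allMatch(v -> v instanceof String)
                        : value instanceof String;
            default:
                return false;
        }
    }
}
```

For example, `valueMatches("primitive", false, "HTML & More")` holds, while `valueMatches("compound", true, "oops")` fails because a field with multiple = true must carry a list.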
102 changes: 102 additions & 0 deletions scripts/search/tests/data/dataset-finch3.json
@@ -0,0 +1,102 @@
{
"datasetVersion": {
"license": {
"name": "CC0 1.0",
"uri": "http://creativecommons.org/publicdomain/zero/1.0"
},
"metadataBlocks": {
"citation": {
"fields": [
{
"value": "HTML & More",
"typeClass": "primitive",
"multiple": false,
"typeName": "title"
},
{
"value": [
{
"authorName": {
"value": "Markup, Marty",
"typeClass": "primitive",
"multiple": false,
"typeName": "authorName"
},
"authorAffiliation": {
"value": "W4C",
"typeClass": "primitive",
"multiple": false,
"typeName": "authorAffiliation"
}
}
],
"typeClass": "compound",
"multiple": true,
"typeName": "author"
},
{
"value": [
{
"datasetContactEmail": {
"typeClass": "primitive",
"multiple": false,
"typeName": "datasetContactEmail",
"value": "markup@mailinator.com"
},
"datasetContactName": {
"typeClass": "primitive",
"multiple": false,
"typeName": "datasetContactName",
"value": "Markup, Marty"
}
}
],
"typeClass": "compound",
"multiple": true,
"typeName": "datasetContact"
},
{
"value": [
{
"dsDescriptionValue": {
"value": "BEGIN<br></br>END",
"multiple": false,
"typeClass": "primitive",
"typeName": "dsDescriptionValue"
},
"dsDescriptionDate": {
"typeName": "dsDescriptionDate",
"multiple": false,
"typeClass": "primitive",
"value": "2021-07-13"
}
}
],
"typeClass": "compound",
"multiple": true,
"typeName": "dsDescription"
},
{
"value": [
"Medicine, Health and Life Sciences"
],
"typeClass": "controlledVocabulary",
"multiple": true,
"typeName": "subject"
},
{
"typeName": "language",
"multiple": true,
"typeClass": "controlledVocabulary",
"value": [
"English",
"Afar",
"aar"
]
}
],
"displayName": "Citation Metadata"
}
}
}
}
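Each compound field in the sample above must contain its required child typeNames and no children outside the allowed set (e.g. ``dsDescription`` requires ``dsDescriptionValue`` and also allows ``dsDescriptionDate``, per the docs). A minimal sketch of that check, with a hypothetical class name (`ChildNameCheck` is not part of the codebase) and a `Map` standing in for the parsed compound value:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the child-typeName check: a compound value must
// contain every required child and only children from the allowed set.
public class ChildNameCheck {

    public static boolean childrenValid(Map<String, ?> compoundValue,
                                        Set<String> required,
                                        Set<String> allowed) {
        return compoundValue.keySet().containsAll(required)
                && allowed.containsAll(compoundValue.keySet());
    }
}
```

Applied to the ``dsDescription`` entry above, required = {``dsDescriptionValue``} and allowed = {``dsDescriptionValue``, ``dsDescriptionDate``}, so the sample passes; dropping ``dsDescriptionValue`` would fail the check.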
27 changes: 22 additions & 5 deletions src/main/java/edu/harvard/iq/dataverse/DataverseServiceBean.java
@@ -22,7 +22,7 @@
import edu.harvard.iq.dataverse.storageuse.StorageQuota;
import edu.harvard.iq.dataverse.util.StringUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
import edu.harvard.iq.dataverse.util.json.JsonUtil;

import java.io.File;
import java.io.IOException;
import java.sql.Timestamp;
@@ -34,6 +34,7 @@
import java.util.logging.Logger;
import java.util.Properties;

import edu.harvard.iq.dataverse.validation.JSONDataValidation;
import jakarta.ejb.EJB;
import jakarta.ejb.Stateless;
import jakarta.inject.Inject;
@@ -888,14 +889,16 @@ public List<Object[]> getDatasetTitlesWithinDataverse(Long dataverseId) {
return em.createNativeQuery(cqString).getResultList();
}


public String getCollectionDatasetSchema(String dataverseAlias) {
return getCollectionDatasetSchema(dataverseAlias, null);
}
public String getCollectionDatasetSchema(String dataverseAlias, Map<String, Map<String,List<String>>> schemaChildMap) {

Dataverse testDV = this.findByAlias(dataverseAlias);

while (!testDV.isMetadataBlockRoot()) {
if (testDV.getOwner() == null) {
break; // we are at the root; which by defintion is metadata blcok root, regarldess of the value
break; // we are at the root; which by definition is metadata block root, regardless of the value
}
testDV = testDV.getOwner();
}
Expand Down Expand Up @@ -932,6 +935,8 @@ public String getCollectionDatasetSchema(String dataverseAlias) {
dsft.setRequiredDV(dsft.isRequired());
dsft.setInclude(true);
}
List<String> childrenRequired = new ArrayList<>();
List<String> childrenAllowed = new ArrayList<>();
if (dsft.isHasChildren()) {
for (DatasetFieldType child : dsft.getChildDatasetFieldTypes()) {
DataverseFieldTypeInputLevel dsfIlChild = dataverseFieldTypeInputLevelService.findByDataverseIdDatasetFieldTypeId(testDV.getId(), child.getId());
@@ -944,8 +949,18 @@
child.setRequiredDV(child.isRequired() && dsft.isRequired());
child.setInclude(true);
}
if (child.isRequired()) {
childrenRequired.add(child.getName());
}
childrenAllowed.add(child.getName());
}
}
if (schemaChildMap != null) {
Map<String, List<String>> map = new HashMap<>();
map.put("required", childrenRequired);
map.put("allowed", childrenAllowed);
schemaChildMap.put(dsft.getName(), map);
}
if(dsft.isRequiredDV()){
requiredDSFT.add(dsft);
}
@@ -1021,11 +1036,13 @@ private String getCustomMDBSchema (MetadataBlock mdb, List<DatasetFieldType> req
}

public String isDatasetJsonValid(String dataverseAlias, String jsonInput) {
JSONObject rawSchema = new JSONObject(new JSONTokener(getCollectionDatasetSchema(dataverseAlias)));
Map<String, Map<String,List<String>>> schemaChildMap = new HashMap<>();
JSONObject rawSchema = new JSONObject(new JSONTokener(getCollectionDatasetSchema(dataverseAlias, schemaChildMap)));

try {
try {
Schema schema = SchemaLoader.load(rawSchema);
schema.validate(new JSONObject(jsonInput)); // throws a ValidationException if this object is invalid
JSONDataValidation.validate(schema, schemaChildMap, jsonInput); // throws a ValidationException if any objects are invalid
} catch (ValidationException vx) {
logger.info(BundleUtil.getStringFromBundle("dataverses.api.validate.json.failed") + " " + vx.getErrorMessage());
String accumulatedexceptions = "";
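The `schemaChildMap` populated in `getCollectionDatasetSchema` above maps each parent field name to its ``required`` and ``allowed`` child names, and is then passed to `JSONDataValidation.validate`. A sketch of the resulting shape for the ``dsDescription`` example from the docs (the class name here is hypothetical; the real map is filled from the collection's metadata blocks):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaChildMapExample {
    // Builds a map with the same shape as the schemaChildMap argument:
    // parent typeName -> { "required" -> [...], "allowed" -> [...] }.
    public static Map<String, Map<String, List<String>>> sample() {
        Map<String, Map<String, List<String>>> schemaChildMap = new HashMap<>();
        Map<String, List<String>> dsDescription = new HashMap<>();
        dsDescription.put("required", List.of("dsDescriptionValue"));
        dsDescription.put("allowed", List.of("dsDescriptionValue", "dsDescriptionDate"));
        schemaChildMap.put("dsDescription", dsDescription);
        return schemaChildMap;
    }
}
```

Passing the map as an out-parameter keeps `getCollectionDatasetSchema` usable on its own (the one-argument overload simply passes `null`) while letting `isDatasetJsonValid` collect the child constraints in a single pass over the field types.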
1 change: 0 additions & 1 deletion src/main/java/edu/harvard/iq/dataverse/api/Datasets.java
@@ -1,7 +1,6 @@
package edu.harvard.iq.dataverse.api;

import com.amazonaws.services.s3.model.PartETag;

import edu.harvard.iq.dataverse.*;
import edu.harvard.iq.dataverse.DatasetLock.Reason;
import edu.harvard.iq.dataverse.actionlogging.ActionLogRecord;