UNF Fixes #3629

Closed
djbrooke opened this issue Feb 8, 2017 · 8 comments

@djbrooke

djbrooke commented Feb 8, 2017

This issue is for fixing UNFs that were dropped from some of the dataset citations in production.

@landreev

Status updates:
There are several underlying issues, fixed in separate steps:

  1. Legacy UNFs (versions 3, 4 and 5): we no longer calculate these. They were supposed to be migrated from DVN 3 and, for published/archived versions of datasets, to stay frozen and preserved in perpetuity.
    Some of them were dropped from the citations of such old, published versions.
    All of these legacy UNFs should now be back in place and displayed correctly in citations.
    Example: an older version of Gary King's study hdl:1902.1/11044:
    https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/11044&version=6.0

@landreev

Another case:
2) In some production dataset versions the new version 6 UNFs had not been properly calculated. This was caused by several bugs that have since been fixed: for example, creating a new version without modifying any tabular data files was NOT automatically cloning the UNF from the previous version.
In 4.6.1 we added a new admin API for monitoring missing or invalid UNFs and for fixing them with REST calls. To fix the missing version 6 UNFs in production, one of the 3 prod servers was patched with this updated API endpoint, which made the process much easier.
An example of a dataset version fixed:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SD7SVE&version=2.0

Please note that, just as in case 1) above, the integrity of the UNF signatures was not compromised. The UNFs were not showing in the citations of some dataset versions, but the important file-level UNFs were still in place.
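The kind of check described above (file-level UNFs present, version-level UNF missing) can be sketched roughly as follows. This is only an illustration: the `DataFile` and `DatasetVersion` classes, their fields, and the placeholder UNF strings are assumptions made for this example and are not the actual Dataverse data model or the admin API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    # Hypothetical stand-in for a file in a dataset version.
    name: str
    unf: Optional[str] = None  # only tabular files carry a UNF


@dataclass
class DatasetVersion:
    # Hypothetical stand-in for a dataset version.
    label: str
    unf: Optional[str] = None
    files: List[DataFile] = field(default_factory=list)


def versions_missing_unf(versions):
    """Return versions that have file-level UNFs but no version-level UNF."""
    return [v for v in versions if not v.unf and any(f.unf for f in v.files)]


# Placeholder UNF values, not real signatures.
versions = [
    DatasetVersion("1.0", "UNF:6:VVV", [DataFile("data.tab", "UNF:6:FFF")]),
    DatasetVersion("2.0", None, [DataFile("data.tab", "UNF:6:FFF"),
                                 DataFile("readme.txt")]),
    DatasetVersion("3.0", None, [DataFile("readme.txt")]),  # no tabular data
]

missing = [v.label for v in versions_missing_unf(versions)]
print(missing)
```

In this sketch, version 3.0 is not flagged because it has no tabular files and therefore no UNF is expected.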

@landreev

  3. This case is a more complex/special case of 2):
    If a version is missing the UNF, running our recalculation method will always produce a UNF of the latest version, v. 6, even if all the tabular data files in the version have UNF versions < 6. This is by design of the UNF algorithm. However, if the tabular files in this version are the same as in the previous one, and the previous version has the UNF, we SHOULD USE THE UNF FROM THE PREVIOUS VERSION rather than recalculate, because we don't want the UNF to change from UNF:5:XXX to UNF:6:YYY when the tabular data files are exactly the same. So that was the problem with our "fix missing UNF" algorithm: it always recalculates. This is a TODO: it needs to be modified to recalculate only when necessary.
    For all such cases, the fixes have been done outside of Dataverse, via direct database updates.
    Note that this was the case with all of Gary's datasets that weren't showing the UNFs in later versions. All of Gary's studies except 1 should have older (< 6) version UNFs, since all of his tabular data files were finalized in the days before Dataverse 4, and all the recent updates had to do with documentation/extra files and the metadata.
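To make the UNF:5:XXX vs. UNF:6:YYY concern concrete, here is a small sketch that reads the algorithm version out of a UNF string and flags a fix that would change it. It assumes only that the version is the second colon-separated field, as in the examples above; the function names are hypothetical and do not exist in Dataverse.

```python
def unf_version(unf: str) -> int:
    """Extract the algorithm version from a string such as 'UNF:5:XXX'."""
    prefix, version, _rest = unf.split(":", 2)
    if prefix != "UNF":
        raise ValueError(f"not a UNF string: {unf!r}")
    return int(version)


def changes_unf_version(old_unf: str, new_unf: str) -> bool:
    """True if replacing old_unf with new_unf would change the UNF version."""
    return unf_version(old_unf) != unf_version(new_unf)


# Placeholder signatures taken from the comment above.
print(changes_unf_version("UNF:5:XXX", "UNF:6:YYY"))
```

A check like this could be run before any database update to catch a recalculation that silently moves a frozen legacy UNF to version 6.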

@landreev

More TODOs left:

  • Create scripts and documentation that could be passed to other Dataverse installations that recently upgraded to Dataverse 4, to make sure they have their legacy UNFs preserved.

  • Modify the recalculation method as described above;

  • Fix the bug in import (jsonparser) that can drop the UNF of the imported version;

  • Verify/double-check that we don't have any bugs in the app that can cause a version to lose the UNF. To the best of my knowledge all such issues have been fixed. HOWEVER, I heard some (anecdotal?) reports of UNFs disappearing from citations in the current production version (4.6)! We'll need to keep an eye on it and continue to run integrity checks.

landreev added a commit that referenced this issue Mar 1, 2017
 - fixes the version UNF in JSON import;
 - changes the logic in the fixMissingUnf call, to reuse the UNF of an earlier version, if available, and if the tabular file UNFs are identical; unless a recalculation is specifically requested, with a "force" option.
@landreev

landreev commented Mar 1, 2017

Created a pull request: #3656

There are 2 application code changes:

  • A fix for the bug in the json parser that was dropping the version UNF on import. Note that this fix will prevent the loss of UNFs in any dvn3 migrations - but then it's unlikely that there are any un-migrated dvn3s left.

  • Changed the logic of the fixMissingUnf API call so that it will reuse the UNF of an earlier version, if available, and if the tabular file UNFs are identical, unless a recalculation is specifically requested with a "force" option.
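The reuse-or-recalculate rule in the second change can be sketched as below. This is a simplified Python illustration, not the Java code from the pull request: the function name, its parameters, and the stub recalculation function are all assumptions made for the example.

```python
from typing import Callable, List, Optional


def fix_missing_unf(
    current_file_unfs: List[str],
    previous_version_unf: Optional[str],
    previous_file_unfs: List[str],
    recalculate: Callable[[List[str]], str],
    force: bool = False,
) -> str:
    """Return the UNF to store on a version that is missing one.

    Reuse the previous version's UNF when it exists and the tabular file
    UNFs are identical; otherwise (or when force is set) recalculate.
    """
    same_files = sorted(current_file_unfs) == sorted(previous_file_unfs)
    if not force and previous_version_unf and same_files:
        return previous_version_unf
    return recalculate(current_file_unfs)


def stub_recalculate(file_unfs: List[str]) -> str:
    # Stand-in for the real calculation, which always yields a v. 6 UNF.
    return "UNF:6:YYY"


files = ["UNF:5:AAA", "UNF:5:BBB"]  # placeholder file-level UNFs
reused = fix_missing_unf(files, "UNF:5:XXX", files, stub_recalculate)
forced = fix_missing_unf(files, "UNF:5:XXX", files, stub_recalculate, force=True)
changed = fix_missing_unf(files, "UNF:5:XXX", ["UNF:5:AAA"], stub_recalculate)
print(reused, forced, changed)
```

Comparing sorted lists treats the file UNFs as an unordered collection, which is a design choice of this sketch; the actual comparison in the application may differ.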

@landreev

landreev commented Mar 2, 2017

@scolapasta - I checked in the 2 minor fixes you suggested per code review.

@scolapasta scolapasta removed their assignment Mar 7, 2017
@kcondon kcondon self-assigned this Mar 8, 2017
@landreev

landreev commented Mar 8, 2017

For the migration/import test, here's a sample DDI:
dvn-build.hmdc.harvard.edu:/tmp/326-6.xml

  • this is one of Gary King's real studies from the old production.

To test the migration import:

Assuming you have a dataverse with the alias "king":

Copy the DDI file into a directory named after that alias, for example:
cp 326-6.xml /tmp/king/

And call the migration API, for example:

curl 'http://localhost:8080/api/batch/migrate?path=/tmp&key=YOURKEY'

(i.e., the "path" parameter is the directory above the directory where the DDI file is located; the name of the subdirectory is used to determine which dataverse the dataset will be imported into)
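The directory convention described above can be illustrated with a short sketch that lists which dataverse alias each DDI file would go to and builds the request URL. The helper names are hypothetical, and the assumption that only `*.xml` files are picked up is mine; only the endpoint path and the `path` and `key` parameters come from the curl example.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlencode


def planned_imports(path):
    """Map each subdirectory name (dataverse alias) to the DDI files in it."""
    root = Path(path)
    return {
        sub.name: sorted(f.name for f in sub.glob("*.xml"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def migrate_url(base, path, key):
    """Build the migration API URL; 'path' is the parent of the alias dirs."""
    query = urlencode({"path": path, "key": key}, safe="/")
    return f"{base}/api/batch/migrate?{query}"


# Recreate the layout from the example in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    alias_dir = Path(tmp) / "king"
    alias_dir.mkdir()
    (alias_dir / "326-6.xml").write_text("<codeBook/>")
    layout = planned_imports(tmp)

url = migrate_url("http://localhost:8080", "/tmp", "YOURKEY")
print(layout)
print(url)
```

Listing the layout before calling the API is a cheap way to confirm that the DDI file sits one level below the directory passed as `path`.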

@landreev

landreev commented Mar 8, 2017

(and, if successful, the imported dataset should have the UNF...)

kcondon added a commit that referenced this issue Mar 8, 2017
More unf improvements, from the todo list in #3629:
@kcondon kcondon closed this as completed Mar 8, 2017
@kcondon kcondon removed the Status: QA label Mar 8, 2017