
6522 duplicate dvobjects #6612

Merged
merged 6 commits into develop from 6522-duplicate-dvobjects on Feb 10, 2020

Conversation

landreev (Contributor) commented Feb 5, 2020

What this PR does / why we need it:
This addresses the issue of duplicate datafiles/dvobjects created on upload under some (unknown) conditions, referencing the same physical file. The code change adds a unique constraint on the storageidentifier within a dataset, which should prevent this from happening in the future.
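
A minimal sketch of what such a constraint could look like, assuming a PostgreSQL dvobject table where a datafile's dataset is referenced via owner_id; the actual Flyway migration in this PR may use different names:

```sql
-- Hypothetical sketch; the constraint and column names are assumptions,
-- not necessarily what the PR's migration script uses.
-- Enforce one dvobject per storage identifier within the same dataset (owner).
ALTER TABLE dvobject
    ADD CONSTRAINT unq_dvobject_storageidentifier
    UNIQUE (owner_id, storageidentifier);
```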

Before it can be deployed, however, the database must be checked for any existing duplicates. We've already done this in our production database, but we'll have to notify other installations and guide them through the cleanup process.

In addition to the normal release note, the PR contains a "pre-release note", plus the diagnostic script, for sending out to the partner installations. This should be done ahead of the next release, as soon as the branch is merged into develop.

Which issue(s) this PR closes:

Closes #6522

Special notes for your reviewer:
Up for review is the approach I chose for dealing with this on remote installations. Specifically, I concluded it was not practical to try to produce a script that would fix any potentially affected installation. There are too many variables involved: whether an affected file is published vs. draft, ingested as tabular, physical file intact, and so on. So my approach is to send out the diagnostic script (which has already been tested on one remote installation) that will detect whether this condition is present. If any duplicates are found, I'm asking them to send us the output, and we'll then assist them with the cleanup.
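
For a sense of what the diagnostics could look for, here is a rough sketch of that kind of query; the actual find_duplicates.sh script in the PR may work quite differently, and the table/column names below are assumptions:

```sql
-- Hypothetical example only; the real find_duplicates.sh may differ.
-- List datafiles that share a storageidentifier within the same dataset (owner).
SELECT owner_id, storageidentifier, COUNT(*) AS copies
FROM dvobject
WHERE dtype = 'DataFile'
GROUP BY owner_id, storageidentifier
HAVING COUNT(*) > 1;
```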

This involves committing to more support work down the road, but I feel this is the best and safest way to address the issue.

Suggestions on how to test this:
We do not know of a way to reproduce the condition that leads to the creation of duplicates on upload. So there is no clear way to test it.

QA should mostly consist of reviewing the diagnostic script and the information note that we'll be sending out to the remote installations.

Does this PR introduce a user interface change?:
No
Is there a release notes update needed for this change?:
Yes
Additional documentation:

Diagnostic script plus a "pre-release note", to be sent preemptively to partner installations, so that if the issue with the datafile duplicates is detected we can start assisting them with cleanup before they upgrade to the next release.
coveralls commented Feb 5, 2020

Coverage Status

Coverage decreased (-0.01%) to 19.46% when pulling bf45d02 on 6522-duplicate-dvobjects into 3e4481d on develop.

@scolapasta scolapasta removed their assignment Feb 7, 2020
@kcondon kcondon self-assigned this Feb 7, 2020
kcondon (Contributor) commented Feb 7, 2020

@landreev The script URL in the release notes results in a 404: https://github.com/IQSS/dataverse/raw/develop/scripts/issues/6522/find_duplicates.sh

but this one in the pre-release notes works: https://github.com/IQSS/dataverse/raw/6522-duplicate-dvobjects/scripts/issues/6522/find_duplicates.sh

I'm also getting a Flyway failure due to a checksum mismatch. That usually happens when a script has already been applied? Does this script need to be renamed for 4.20?

landreev (Contributor, Author) commented Feb 7, 2020

@kcondon
The Flyway issue is a conflict with the migration script in the multistore branch, which uses the same version number and has already been deployed on build 4.19.0.1. I didn't realize that branch had a Flyway script.
I'll rename the script in my branch quickly (it will become 4.19.0.2).

The link is just a matter of changing one of them to "develop" right before we merge, as discussed.


landreev (Contributor, Author) commented Feb 7, 2020

@kcondon The renamed script has been checked in.

@kcondon kcondon merged commit 1c57fe9 into develop Feb 10, 2020
@kcondon kcondon deleted the 6522-duplicate-dvobjects branch February 10, 2020 16:36
@djbrooke djbrooke added this to the 4.20 milestone Feb 11, 2020
landreev (Contributor, Author) commented

Doh - we may need to revert this PR, on account of harvested files. Those often have remote URLs in the storageidentifier field, which are not at all guaranteed to be unique... Our own production database has a ton of these, so the constraint doesn't work.

We don't really use these URLs for any practical purpose, so we can probably stop putting them into that field... But we can't proceed with adding the constraint until we figure that out.
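
For illustration only, here is the kind of check that would expose the problem, under the same hypothetical dvobject schema as in the sketches above (remote URLs stored in storageidentifier, datasets referenced via owner_id); the actual diagnostics may look different:

```sql
-- Hypothetical illustration; column names and the URL-prefix check are assumptions.
-- Harvested datafiles that store the same remote URL in storageidentifier more than
-- once within a single dataset would violate the new unique constraint.
SELECT owner_id, storageidentifier, COUNT(*) AS copies
FROM dvobject
WHERE storageidentifier LIKE 'http%'
GROUP BY owner_id, storageidentifier
HAVING COUNT(*) > 1;
```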

@landreev landreev mentioned this pull request Feb 20, 2020
sekmiller (Contributor) left a comment

Looks reasonable. I don't have anything locally to trigger all of the possible results.

Development

Successfully merging this pull request may close these issues.

DB Constraint to prevent duplicate DvObjects for the same physical file