DB Constraint to prevent duplicate DvObjects for the same physical file #6522

Closed
landreev opened this issue Jan 15, 2020 · 15 comments · Fixed by #6612

@landreev
Contributor

Another issue with files, discovered in production in recent days:
There appear to be a number of instances in the db where TWO dvobjects/datafiles are associated with the same storage identifier (i.e., the same physical file).
I have not yet identified the exact scenario that makes this possible, but it must be something going wrong during upload and save.
Opening the issue to diagnose and fix the problem.
As always with things like this, it will involve both code fixes and data cleanup.

@djbrooke
Contributor

djbrooke commented Jan 15, 2020

@landreev
Contributor Author

I started looking into this yesterday and was able to cleanly fix the 3 most affected datasets: 10.7910/DVN/IYGVYS, 10.7910/DVN/LQ2KFZ and 10.7910/DVN/8A1XO3. These had ~1300 duplicated datafiles between the 3 of them (!).
There are 7 datasets remaining, but with relatively few affected files.

@landreev
Contributor Author

In the 3 datasets above, all the affected files followed the same pattern: they were uploaded in large batches (of 250 to 550 files); they were assigned sequential ids in the 2N range, where N is the size of the batch; and each file ended up with 2 identical dvobjects/datafiles, with the ids i and i+N. So this definitely happens during the initial save.

I can think of various fanciful scenarios that could trigger it... But how much do we want to invest in figuring out how it happened, as opposed to just slapping a unique constraint on the storageidentifier?
I'm going to focus on cleaning up the remaining datasets for now.

@landreev
Contributor Author

All the existing duplicates have been removed.
I'll be making a PR adding a unique constraint on the storageidentifier in the dvobject table.
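
For reference, the straightforward form of such a constraint (just a sketch, with an assumed constraint name; not necessarily the exact DDL that will end up in the PR) would be:

ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE (storageidentifier);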

@donsizemore
Contributor

@landreev any chance you could include your SQL queries (and cleanup methods) in the PR? (and thank you!)

@landreev
Contributor Author

> @landreev any chance you could include your SQL queries (and cleanup methods) in the PR? (and thank you!)

Yes, definitely. The way I did it involved a lot of manual labor (unfortunately); I'm now trying to package all the queries etc. in a better organized form that can be passed on to the other installations.

Have you actually seen this condition in your database? I hope not; I'm still hoping it was something that happened because of our unique load conditions and the specific system instability we were experiencing here. But I'm preparing for the worst - for the possibility that at least some of the remote installations have experienced this as well. :(

@landreev
Contributor Author

landreev commented Jan 23, 2020

@donsizemore A quick test to check whether you have been affected by this issue:
First, count the number of unique non-harvested datafiles:

SELECT COUNT(DISTINCT o.id)
FROM datafile f, dataset s, dvobject p, dvobject o
WHERE s.id = p.id
  AND o.id = f.id
  AND o.owner_id = s.id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier IS NOT NULL;

Then count the number of distinct datafile storageidentifiers within datasets, and see if you get the same number:

SELECT COUNT(DISTINCT (o.owner_id, o.storageidentifier))
FROM datafile f, dataset s, dvobject p, dvobject o
WHERE s.id = p.id
  AND o.id = f.id
  AND o.owner_id = s.id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier IS NOT NULL;

Of course, if this is your production database, you may get two different numbers because somebody has just uploaded and/or deleted some files... Running the queries on a snapshot copy of the database eliminates that problem.
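
If the two counts differ, a query along the same lines (a sketch only, not the actual cleanup script) should show which datasets are affected and which storageidentifiers are duplicated:

SELECT o.owner_id, o.storageidentifier, COUNT(*)
FROM datafile f, dataset s, dvobject o
WHERE o.id = f.id
  AND o.owner_id = s.id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier IS NOT NULL
GROUP BY o.owner_id, o.storageidentifier
HAVING COUNT(*) > 1;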

If it looks like you have some duplicate dvObjects, please let me know.

@donsizemore
Contributor

donsizemore commented Jan 24, 2020

non-harvested datafiles:     42,920
datafile storageidentifiers: 42,912

We upgraded from 4.11 to 4.16 four days ago...

@landreev
Contributor Author

OK, it looks like you have 8 duplicates. I'll send further instructions shortly.

@landreev
Contributor Author

I realized (remembered) why we didn't put a unique constraint on the storageidentifier field in the first place: historically, they were not in fact unique. In its current form, the storage identifier is generated from the timestamp in milliseconds plus a reasonably long random string... so it's more or less guaranteed to be unique. But that wasn't always the case in the old days (DVN/VDC?); since each dataset has its own storage folder, a file name only needed to be unique within its dataset.

So we have a number of grandfathered/migrated files with non-unique storageidentifiers in the database. I guess that instead of making the field itself unique, we need to make the combination of {"owner_id","storageidentifier"} unique, similar to the constraint we have that makes version numbers unique within datasets.
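
In SQL terms, something like this (a sketch, with an assumed constraint name):

ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE (owner_id, storageidentifier);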

Although I'm still debating whether we should just go ahead and force uniqueness on the field, and migrate/rename any files with non-unique names. Does anyone have an opinion on how to proceed?

(@donsizemore: yours is likely the one installation outside of ours that's old enough to have such legacy files)

@landreev
Contributor Author

@donsizemore

> ...
> We upgraded from 4.11 to 4.16 four days ago...

The earliest dupes in our database were from late 2018. And that was before 4.11.

@pdurbin
Member

pdurbin commented Jan 24, 2020

@landreev thanks for bringing this up at standup. What I don't have a sense for is how much effort it would be (and how much risk there would be) to migrate/rename those old non-unique storage identifiers from the old days. If it's relatively easy it feels cleaner to me to migrate them and apply a single constraint on storageidentifier rather than having to add in owner_id to the constraint.
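
One way to gauge the size of that migration (a sketch, not a query from any of the PRs) would be to count the legacy storageidentifiers shared by more than one datafile, dataset-wide:

SELECT o.storageidentifier, COUNT(*)
FROM dvobject o, datafile f
WHERE o.id = f.id
  AND o.storageidentifier IS NOT NULL
GROUP BY o.storageidentifier
HAVING COUNT(*) > 1;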

@djbrooke djbrooke assigned djbrooke and unassigned scolapasta Jan 28, 2020
landreev added a commit that referenced this issue Feb 5, 2020
diagnostics script plus "pre-release note", to be sent preemptively to partner installations, so
that if the issue with the datafile duplicates is detected we could start assisting them with
cleanup before they can upgrade to the next release.
landreev added a commit that referenced this issue Feb 5, 2020
…hat we'll need to send out

to the remote installations). #6522
landreev added a commit that referenced this issue Feb 7, 2020
landreev added a commit that referenced this issue Feb 19, 2020
@djbrooke djbrooke reopened this Feb 20, 2020
@djbrooke djbrooke removed their assignment Feb 20, 2020
landreev added a commit that referenced this issue Feb 20, 2020
(this is a *combined* script for BOTH #6510 and #6522!)
landreev added a commit that referenced this issue Feb 20, 2020
This (and the proper release note) SUPERSEDES what was in PR #6522!
i.e. we are sending out only ONE note, not TWO, there's only one
script to run, etc.
(ref. #6510)
@landreev
Contributor Author

@djbrooke I'd like to reopen this one, pending the decision next week on how (or whether) to proceed with the database constraint.

@djbrooke
Contributor

OK, I didn't know whether we should reopen or add a new issue for the specific harvesting case. Reopening now...

@djbrooke djbrooke reopened this Feb 20, 2020
@djbrooke djbrooke changed the title Files: Duplicate DvObjects for the same physical file DB Constraint to prevent duplicate DvObjects for the same physical file Feb 21, 2020
@landreev
Contributor Author

landreev commented Dec 3, 2020

As discussed, closing the issue, with the new issue #7451 opened to finish the one remaining task: resolve the problem of non-unique "storageidentifiers" for harvested files in legacy databases, then add a flyway script to the next release that adds the constraint to any databases that don't have it yet.
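
For anyone adding the constraint by hand before that flyway script ships, a conditional version (a hypothetical sketch, assuming PostgreSQL, the composite {owner_id, storageidentifier} form discussed above, and an assumed constraint name) could look like:

DO $$
BEGIN
  -- skip if a previous deployment has already created the constraint
  IF NOT EXISTS (SELECT 1 FROM pg_constraint
                 WHERE conname = 'unq_dvobject_storageidentifier') THEN
    ALTER TABLE dvobject
      ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE (owner_id, storageidentifier);
  END IF;
END $$;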

@landreev landreev closed this as completed Dec 3, 2020