DB Constraint to prevent duplicate DvObjects for the same physical file #6522

Closed
landreev opened this issue Jan 15, 2020 · 15 comments · Fixed by #6612

@landreev
Contributor

Another issue with files, discovered in production in recent days:
There appear to be a number of instances in the db where TWO dvobjects/datafiles are associated with the same storage identifier (i.e., the same physical file).
I have not yet identified the exact scenario that makes this possible, but it must be something going wrong during upload and save.
Opening the issue to diagnose and fix the problem.
As always with things like this, it will involve both code fixes and data cleanup.

@djbrooke
Contributor

djbrooke commented Jan 15, 2020

@landreev
Contributor Author

I started looking into this yesterday and was able to cleanly fix the 3 most affected datasets: 10.7910/DVN/IYGVYS, 10.7910/DVN/LQ2KFZ and 10.7910/DVN/8A1XO3. These had ~1300 duplicated datafiles between the 3 of them (!).
There are 7 datasets remaining, but with relatively few affected files.

@landreev
Contributor Author

In the 3 datasets above, all the affected files followed the same pattern: they were uploaded in large batches (of 250 to 550 files); they were assigned sequential ids in the 2N range, where N is the size of the batch; and each file ended up with 2 identical dvobjects/datafiles, with the ids i and i+N. So this definitely happens during the initial save.

I can think of various fanciful scenarios that could trigger it... But how much do we want to invest in figuring out how it happened, as opposed to just slapping a unique constraint on the storageidentifier?
I'm going to focus on cleaning up the remaining datasets for now.

@landreev
Contributor Author

All the existing duplicates have been removed.
I'll be making a PR adding a unique constraint on the storageidentifier in the dvobject table.
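
For reference, the straightforward form of such a constraint (just a sketch, with an assumed constraint name; not necessarily the exact DDL that will end up in the PR) would be:

ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE (storageidentifier);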

@donsizemore
Contributor

@landreev any chance you could include your SQL queries (and cleanup methods) in the PR? (and thank you!)

@landreev
Contributor Author

> @landreev any chance you could include your SQL queries (and cleanup methods) in the PR? (and thank you!)

Yes, definitely. The way I did it involved a lot of manual labor (unfortunately); I'm now trying to package all the queries etc. in a better organized form that can be passed on to the other installations.

Have you actually seen this condition in your database? I hope not; I'm still hoping it was something that happened because of our unique load conditions and the specific system instability we were experiencing here. But I'm preparing for the worst - for the possibility that at least some of the remote installations have experienced this as well. :(

@landreev
Contributor Author

landreev commented Jan 23, 2020

@donsizemore A quick test to check whether you have been affected by this issue:
First, count the number of unique non-harvested datafiles:

SELECT COUNT(DISTINCT o.id)
FROM datafile f, dataset s, dvobject p, dvobject o
WHERE s.id = p.id
  AND o.id = f.id
  AND o.owner_id = s.id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier IS NOT NULL;

Then count the number of distinct datafile storageidentifiers within datasets, and see if you get the same number:

SELECT COUNT(DISTINCT (o.owner_id, o.storageidentifier))
FROM datafile f, dataset s, dvobject p, dvobject o
WHERE s.id = p.id
  AND o.id = f.id
  AND o.owner_id = s.id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier IS NOT NULL;

Of course, if this is your production database, you may get two different numbers because somebody has just uploaded and/or deleted some files... Running the queries on a snapshot copy of the database eliminates that problem.
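
If the two counts differ, a query along the same lines (a sketch only, not the actual cleanup script) should show which datasets are affected and which storageidentifiers are duplicated:

SELECT o.owner_id, o.storageidentifier, COUNT(*)
FROM datafile f, dataset s, dvobject o
WHERE o.id = f.id
  AND o.owner_id = s.id
  AND s.harvestingclient_id IS NULL
  AND o.storageidentifier IS NOT NULL
GROUP BY o.owner_id, o.storageidentifier
HAVING COUNT(*) > 1;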

If it looks like you have some duplicate dvObjects, please let me know.

@donsizemore
Contributor

donsizemore commented Jan 24, 2020

non-harvested datafiles:     42,920
datafile storageidentifiers: 42,912

We upgraded from 4.11 to 4.16 four days ago...

@landreev
Contributor Author

OK, it looks like you have 8 duplicates. I'll send further instructions shortly.

@landreev
Contributor Author

I realized (remembered) why we didn't put a unique constraint on the storageidentifier field in the first place: historically, they were not in fact unique. In its current form, the storage identifier is generated from the timestamp in milliseconds plus a reasonably long random string... so it's more or less guaranteed to be unique. But that wasn't always the case in the old days (DVN/VDC?); since each dataset has its own storage folder, a file name only needed to be unique within its dataset.

So we have a number of grandfathered/migrated files with non-unique storageidentifiers in the database. I guess that instead of making the field itself unique, we need to make the combination of {"owner_id","storageidentifier"} unique, similar to the constraint we have that makes version numbers unique within datasets.
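
In SQL terms, something like this (a sketch, with an assumed constraint name):

ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE (owner_id, storageidentifier);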

Although I'm still debating whether we should just go ahead and force uniqueness on the field, and migrate/rename any files with non-unique names. Does anyone have an opinion on how to proceed?

(@donsizemore: yours is likely the one installation outside of ours that's old enough to have such legacy files)

@landreev
Contributor Author

@donsizemore

> ...
> We upgraded from 4.11 to 4.16 four days ago...

The earliest dupes in our database were from late 2018. And that was before 4.11.

@pdurbin
Member

pdurbin commented Jan 24, 2020

@landreev thanks for bringing this up at standup. What I don't have a sense for is how much effort it would be (and how much risk there would be) to migrate/rename those old non-unique storage identifiers from the old days. If it's relatively easy it feels cleaner to me to migrate them and apply a single constraint on storageidentifier rather than having to add in owner_id to the constraint.
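
One way to gauge the size of that migration (a sketch, not a query from any of the PRs) would be to count the legacy storageidentifiers shared by more than one datafile, dataset-wide:

SELECT o.storageidentifier, COUNT(*)
FROM dvobject o, datafile f
WHERE o.id = f.id
  AND o.storageidentifier IS NOT NULL
GROUP BY o.storageidentifier
HAVING COUNT(*) > 1;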

@djbrooke djbrooke assigned djbrooke and unassigned scolapasta Jan 28, 2020
landreev added a commit that referenced this issue Feb 5, 2020
diagnostics script plus "pre-release note", to be sent preemptively to partner installations, so
that if the issue with the datafile duplicates is detected we could start assisting them with
cleanup before they can upgrade to the next release.
landreev added a commit that referenced this issue Feb 5, 2020
…hat we'll need to send out

to the remote installations). #6522
landreev added a commit that referenced this issue Feb 7, 2020
landreev added a commit that referenced this issue Feb 19, 2020
@djbrooke djbrooke reopened this Feb 20, 2020
@djbrooke djbrooke removed their assignment Feb 20, 2020
landreev added a commit that referenced this issue Feb 20, 2020
(this is a *combined* script for BOTH #6510 and #6522!)
landreev added a commit that referenced this issue Feb 20, 2020
This (and the proper release note) SUPERSEDES what was in PR #6522!
i.e. we are sending out only ONE note, not TWO, there's only one
script to run, etc.
(ref. #6510)
@landreev
Contributor Author

@djbrooke I'd like to reopen this one, pending the decision next week on how (or whether) to proceed with the database constraint.

@djbrooke
Contributor

OK, I didn't know whether we should reopen or add a new issue for the specific harvesting case. Reopening now...

@djbrooke djbrooke reopened this Feb 20, 2020
@djbrooke djbrooke changed the title Files: Duplicate DvObjects for the same physical file DB Constraint to prevent duplicate DvObjects for the same physical file Feb 21, 2020
@landreev
Contributor Author

landreev commented Dec 3, 2020

As discussed, closing the issue, with the new issue #7451 opened to finish the one remaining task: resolve the problem of non-unique "storageidentifiers" for harvested files in legacy databases, then add a flyway script to the next release that adds the constraint to any databases that don't have it yet.
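
For anyone adding the constraint by hand before that flyway script ships, a conditional version (a hypothetical sketch, assuming PostgreSQL, the composite {owner_id, storageidentifier} form discussed above, and an assumed constraint name) could look like:

DO $$
BEGIN
  -- skip if a previous deployment has already created the constraint
  IF NOT EXISTS (SELECT 1 FROM pg_constraint
                 WHERE conname = 'unq_dvobject_storageidentifier') THEN
    ALTER TABLE dvobject
      ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE (owner_id, storageidentifier);
  END IF;
END $$;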

@landreev landreev closed this as completed Dec 3, 2020