
Handling and measuring (meta)data quality inside Dataverse #4751

Closed
pkiraly opened this issue Jun 13, 2018 · 7 comments

Comments

@pkiraly
Member

pkiraly commented Jun 13, 2018

Data quality is evident when we look at data with a human eye, but not so evident during the data creation process. I suggest starting a conversation about possible directions within the Dataverse community.

As a starting point I see at least three directions:

- Checking the structure of the metadata records. This is actually my PhD research, and I have started the work (http://pkiraly.github.io) on other library-related data, such as digital library metadata (Europeana), MARC bibliographic records, etc. What is (relatively) easy to check is the structure of the metadata records: do they fit the rules? Are there outliers? Are there missing parts? Are the descriptions unique? In some cases it is also possible to check some semantics. (A rough sketch of such checks follows below.)
- Checking the integrity of the data files themselves. There is a project at Bielefeld University called Conquaire (http://conquaire.uni-bielefeld.de/) which tries to set up rules for checking the integrity of files, triggered by commits to the repository.
- Validating table-like datasets. There is also the Frictionless Data project (https://frictionlessdata.io/) of the Open Knowledge International, which suggests some solutions for table-like datasets (see #4747).
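
To make the first direction more concrete, here is a minimal sketch of structural metadata checks (completeness, empty fields, uniqueness of descriptions). The field names and rules are invented for illustration; they are not an actual Dataverse or Europeana schema.

```python
# Minimal sketch of structural metadata checks: completeness, empty fields,
# and uniqueness of descriptions. Field names and rules are hypothetical,
# not an actual Dataverse or Europeana schema.
from collections import Counter

REQUIRED_FIELDS = {"title", "creator", "description", "subject"}

def check_record(record: dict) -> dict:
    """Return simple structural quality measurements for one metadata record."""
    missing = REQUIRED_FIELDS - record.keys()
    empty = {k for k, v in record.items() if v in ("", None, [])}
    return {
        "completeness": 1 - len(missing) / len(REQUIRED_FIELDS),
        "missing_fields": sorted(missing),
        "empty_fields": sorted(empty),
    }

def description_uniqueness(records: list) -> float:
    """Share of records whose description is unique within the collection."""
    counts = Counter(r.get("description", "") for r in records)
    unique = sum(1 for r in records if counts[r.get("description", "")] == 1)
    return unique / len(records) if records else 0.0
```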

@pameyer
Contributor

pameyer commented Jun 15, 2018

"Data quality" covers a fairly wide range of issues (re-usability, reproducability/agreement, etc). For another perspective to the discussion, end-to-end data quality could be evaluated by both the degree to which the data is consistent with the research results reported for a given dataset, and the degree of difficulty that evaluating this consistency. Depending on the research domain, this may be more computational analysis is a more sensitive measure of this than the human eye (aka - a human curator can see that there are metadata elements associated with data, but not realize the accuracy of those metadata values without running one or more compute processes).

Providing information on these measurements in a data repository can definitely help drive research forward, but these quality measurements might be better provided by external (domain-specific) tools communicating through a common standard. Associating these measures with a dataset also brings up a potential conflict of interest: ideally, the "quality" of a dataset would not be assessed by the depositor, but by an independent review.

@pkiraly
Member Author

pkiraly commented Jun 15, 2018

That's true, an independent review would provide a more objective evaluation.

In the Quality Assurance literature one of the most frequent phrases is "fitness for purpose". It means that we should know what the purpose of the data is. In the case of research data, my impression is that the most important purpose is to enable the reproducibility of the research. For that, a dataset should contain the input data (along with a formalized description of it, such as a data dictionary or structure description like the one suggested by the Frictionless Data project), and optionally the processing software, including the processing pipelines (as files or as links), or at least descriptions of those. To automate the reproduction process, the dataset should contain information that lets the process distinguish between the data and the software. At the file level there is no such metadata in Dataverse right now: you can tag files, but the tags are completely arbitrary and do not come from an established dictionary.

However, if we cannot set up an automated process, it does not mean that the components are not available in the dataset. It just means that Dataverse does not provide tools for categorizing these components, or that the depositor doesn't use them. So the quality evaluation process should not report a black/white result, but a value from an ordinal or continuous scale (actually there should be multiple aspects, each with its own value).
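
As an illustration of such a graded score (none of this is existing Dataverse functionality), a hypothetical file-role vocabulary could let a tool measure how many of the components needed for reproduction are identifiable:

```python
# Hypothetical file-role vocabulary for a dataset: a controlled set of roles
# that would let an automated process tell data from software. Neither the
# role names nor the manifest format exist in Dataverse today.
ALLOWED_ROLES = {"input-data", "data-dictionary", "processing-script",
                 "pipeline", "documentation"}

# Example manifest: file name -> role (purely illustrative).
manifest = {
    "survey_responses.csv": "input-data",
    "codebook.json": "data-dictionary",
    "clean_and_model.R": "processing-script",
    "run_all.sh": "pipeline",
}

def reproducibility_score(manifest: dict) -> float:
    """Very rough ordinal score: how many essential role categories are covered."""
    covered = set(manifest.values()) & ALLOWED_ROLES
    essential = {"input-data", "data-dictionary", "processing-script"}
    return len(covered & essential) / len(essential)

print(reproducibility_score(manifest))  # 1.0 for the example above
```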

The tool could also apply heuristics. For example, Hadley Wickham introduced the term "tidy data" [1, 2]: rectangular data organized in an optimal way. I don't know if there is a tool which helps us decide whether a CSV file is tidy or not, but it is clear that tidy data provides more options for the reproduction process than non-tidy data, so its score would be higher. Maybe it is possible to find or invent tools to detect "tidiness" (a rough sketch of one possible heuristic follows after the references).

[1] https://en.wikipedia.org/wiki/Tidy_data
[2] https://www.jstatsoft.org/article/view/v059i10/
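
Purely as a sketch of the kind of heuristic meant above: a crude "tidiness" check for a CSV file could test whether the table is rectangular, has unique non-empty column names, and avoids obviously composite cells. The specific rules below are invented for illustration, not a faithful test of Wickham's criteria.

```python
# Crude heuristic for CSV "tidiness": rectangular shape, unique non-empty
# header names, and no obviously composite cells. The rules are illustrative
# only, not a definitive test of the tidy-data criteria.
import csv

def tidiness_score(path: str) -> float:
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if len(rows) < 2:
        return 0.0
    header, body = rows[0], rows[1:]
    checks = [
        all(len(r) == len(header) for r in body),   # rectangular
        len(set(header)) == len(header),            # unique column names
        all(h.strip() for h in header),             # no blank headers
        not any(";" in cell or "|" in cell          # crude multi-value check
                for r in body for cell in r),
    ]
    return sum(checks) / len(checks)
```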

@pdurbin
Member

pdurbin commented Jun 18, 2018

There's some related discussion over at #2119. I agree that it would be nice to have some indication of the quality of datasets, tools that can help nudge authors toward better practices, and perhaps some concept of peer review some day. The most stringent quality standards that I'm aware of are applied to datasets in the American Journal of Political Science dataverse at https://dataverse.harvard.edu/dataverse/ajps (see https://ajps.org/ajps-replication-policy/). I'm sure there are other places with high standards.

@pameyer
Contributor

pameyer commented Jun 18, 2018

@pkiraly This is an area where things may need to be handled differently for domain-agnostic and domain-specific (which I tend to focus on) data repositories. For a domain-specific repository, "fitness for purpose" can be judged by comparing the results of a processing pipeline run on the dataset to datasets representing the final products (which can be identified by metadata associated with the dataset). The purpose of these processing pipelines is essentially dictated by the dataset type, but the exact software used is less significant than the consistency of the results (between multiple pipelines, and between pipeline results and other deposited results). How best to evaluate consistency for multi-step comparisons across several layers of both continuous and discrete attributes of these results (and how to present them in an easily interpretable manner to end users) is still an open question, at least as far as I'm aware.
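
For illustration only, a consistency check along these lines might compare re-computed results to deposited ones, treating continuous attributes with a tolerance and discrete attributes exactly; the attribute names and the tolerance are assumptions made for the sketch.

```python
# Sketch of comparing re-computed pipeline results against deposited results:
# continuous values within a relative tolerance, discrete values exactly.
# The attribute names and the tolerance are illustrative assumptions.
import math

def compare_results(deposited: dict, recomputed: dict, rel_tol: float = 1e-3) -> dict:
    """Per-attribute agreement between deposited and re-computed results."""
    agreement = {}
    for key, expected in deposited.items():
        actual = recomputed.get(key)
        if isinstance(expected, float) and isinstance(actual, float):
            agreement[key] = math.isclose(expected, actual, rel_tol=rel_tol)
        else:
            agreement[key] = expected == actual
    return agreement

# e.g. compare_results({"mean_value": 2.100, "category": "A"},
#                      {"mean_value": 2.1003, "category": "A"})
# -> {"mean_value": True, "category": True}
```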

I'd imagine generalizing this to having the input files in one dataset, the details of the processing pipeline(s) in another "dataset" (acknowledging the open question of whether a compute process should be a dataset or another type of research object), and the outputs in a third dataset, all linked via metadata. This avoids the issue of having file tags for inputs, scripts, source code, built tools, etc. (which, as you point out, aren't widely used). My perspective on this isn't universal - there seems to be a decent amount of mixing of analysis code, input files, log files, and output files in a single dataset. I don't have much insight into the re-usability of CSV files, tidy or not.

@mheppler
Contributor

Also related to badges: Dataset - Peer Reviewed Tag or Badge #565 and Crowdsourcing rating tool of dataset #924.

@poikilotherm
Contributor

Just my 2 cents: one of my ideas beyond PeerPub was to use this "internal peer review" tool in combination with data repositories.
It hasn't evolved to a usable state yet, but there is some slow movement on the horizon.

@djbrooke
Contributor

Hey @pkiraly, I'm closing this as we want to approach this along with a few other use cases in #6041 and I'm trying to get all of those cases in a single issue for design purposes. The details in this issue will be extremely helpful to reference, and we'd appreciate your knowledge on that larger issue. Thank you!
