
Handling and measuring (meta)data quality inside Dataverse #4751

Closed
pkiraly opened this issue Jun 13, 2018 · 7 comments

Comments

@pkiraly
Member

pkiraly commented Jun 13, 2018

Data quality is evident when we look at data with a human eye, but not so evident during the data creation process. I suggest starting a conversation about possible directions within the Dataverse community.

As a starting point I see at least three directions:

- Checking the structure of the metadata records. This is actually my PhD research, and I have started the work (http://pkiraly.github.io) on other library-related data, such as digital library metadata (Europeana), MARC bibliographic records, etc. What is (relatively) easy to check is the structure of the metadata records: do they fit the rules? Are there outliers? Are there missing parts? Are the descriptions unique? In some cases it is also possible to check some semantics. (A rough sketch of such checks follows below.)
- Checking the integrity of the data files themselves. There is a project at Bielefeld University called Conquaire (http://conquaire.uni-bielefeld.de/) which tries to set up rules for checking the integrity of files, triggered by commits to the repository.
- Validating table-like datasets. There is also the Frictionless Data project (https://frictionlessdata.io/) of the Open Knowledge International, which suggests some solutions for table-like datasets (see #4747).
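
To make the first direction more concrete, here is a minimal sketch of structural metadata checks (completeness, empty fields, uniqueness of descriptions). The field names and rules are invented for illustration; they are not an actual Dataverse or Europeana schema.

```python
# Minimal sketch of structural metadata checks: completeness, empty fields,
# and uniqueness of descriptions. Field names and rules are hypothetical,
# not an actual Dataverse or Europeana schema.
from collections import Counter

REQUIRED_FIELDS = {"title", "creator", "description", "subject"}

def check_record(record: dict) -> dict:
    """Return simple structural quality measurements for one metadata record."""
    missing = REQUIRED_FIELDS - record.keys()
    empty = {k for k, v in record.items() if v in ("", None, [])}
    return {
        "completeness": 1 - len(missing) / len(REQUIRED_FIELDS),
        "missing_fields": sorted(missing),
        "empty_fields": sorted(empty),
    }

def description_uniqueness(records: list) -> float:
    """Share of records whose description is unique within the collection."""
    counts = Counter(r.get("description", "") for r in records)
    unique = sum(1 for r in records if counts[r.get("description", "")] == 1)
    return unique / len(records) if records else 0.0
```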

@pameyer
Contributor

pameyer commented Jun 15, 2018

"Data quality" covers a fairly wide range of issues (re-usability, reproducability/agreement, etc). For another perspective to the discussion, end-to-end data quality could be evaluated by both the degree to which the data is consistent with the research results reported for a given dataset, and the degree of difficulty that evaluating this consistency. Depending on the research domain, this may be more computational analysis is a more sensitive measure of this than the human eye (aka - a human curator can see that there are metadata elements associated with data, but not realize the accuracy of those metadata values without running one or more compute processes).

Providing information on these measurements in a data repository can definitely help drive research forward, but these quality measurements might be better provided by external (domain-specific) tools communicating through a common standard. Associating these measures with a dataset also brings up a potential conflict of interest: ideally, the "quality" of a dataset would not be assessed by the depositor, but by an independent review.

@pkiraly
Member Author

pkiraly commented Jun 15, 2018

That's true, an independent review would provide a more objective evaluation.

In the Quality Assurance literature one of the most frequent phrases is "fitness for purpose". It means that we should know what the purpose of the data is. In the case of research data, my impression is that the most important purpose is to enable the reproducibility of the research. For that, a dataset should contain the input data (along with a formalized description of it, such as a data dictionary or structure description like the one suggested by the Frictionless Data project), and optionally the processing software, including the processing pipelines (as files or as links), or at least descriptions of those. To automate the reproduction process, the dataset should contain information that lets the process distinguish between the data and the software. At the file level there is no such metadata in Dataverse right now: you can tag files, but the tags are completely arbitrary and do not come from an established dictionary.

However, if we cannot set up an automated process, it does not mean that the components are not available in the dataset. It just means that Dataverse does not provide tools for categorizing these components, or that the depositor doesn't use them. So the quality evaluation process should not report a black/white result, but a value from an ordinal or continuous scale (actually there should be multiple aspects, each with its own value).
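
As an illustration of such a graded score (none of this is existing Dataverse functionality), a hypothetical file-role vocabulary could let a tool measure how many of the components needed for reproduction are identifiable:

```python
# Hypothetical file-role vocabulary for a dataset: a controlled set of roles
# that would let an automated process tell data from software. Neither the
# role names nor the manifest format exist in Dataverse today.
ALLOWED_ROLES = {"input-data", "data-dictionary", "processing-script",
                 "pipeline", "documentation"}

# Example manifest: file name -> role (purely illustrative).
manifest = {
    "survey_responses.csv": "input-data",
    "codebook.json": "data-dictionary",
    "clean_and_model.R": "processing-script",
    "run_all.sh": "pipeline",
}

def reproducibility_score(manifest: dict) -> float:
    """Very rough ordinal score: how many essential role categories are covered."""
    covered = set(manifest.values()) & ALLOWED_ROLES
    essential = {"input-data", "data-dictionary", "processing-script"}
    return len(covered & essential) / len(essential)

print(reproducibility_score(manifest))  # 1.0 for the example above
```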

The tool could also apply heuristics. For example, Hadley Wickham introduced the term "tidy data" [1, 2]: rectangular data organized in an optimal way. I don't know if there is a tool which helps us decide whether a CSV file is tidy or not, but it is clear that tidy data provides more options for the reproduction process than non-tidy data, so its score would be higher. Maybe it is possible to find or invent tools to detect "tidiness" (a rough sketch of one possible heuristic follows after the references).

[1] https://en.wikipedia.org/wiki/Tidy_data
[2] https://www.jstatsoft.org/article/view/v059i10/
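
Purely as a sketch of the kind of heuristic meant above: a crude "tidiness" check for a CSV file could test whether the table is rectangular, has unique non-empty column names, and avoids obviously composite cells. The specific rules below are invented for illustration, not a faithful test of Wickham's criteria.

```python
# Crude heuristic for CSV "tidiness": rectangular shape, unique non-empty
# header names, and no obviously composite cells. The rules are illustrative
# only, not a definitive test of the tidy-data criteria.
import csv

def tidiness_score(path: str) -> float:
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if len(rows) < 2:
        return 0.0
    header, body = rows[0], rows[1:]
    checks = [
        all(len(r) == len(header) for r in body),   # rectangular
        len(set(header)) == len(header),            # unique column names
        all(h.strip() for h in header),             # no blank headers
        not any(";" in cell or "|" in cell          # crude multi-value check
                for r in body for cell in r),
    ]
    return sum(checks) / len(checks)
```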

@pdurbin
Member

pdurbin commented Jun 18, 2018

There's some related discussion over at #2119. I agree that it would be nice to have some indication of the quality of datasets, tools that can help nudge authors toward better practices, and perhaps some concept of peer review some day. The most stringent quality standards that I'm aware of are applied to datasets in the American Journal of Political Science dataverse at https://dataverse.harvard.edu/dataverse/ajps (see https://ajps.org/ajps-replication-policy/). I'm sure there are other places with high standards.

@pameyer
Contributor

pameyer commented Jun 18, 2018

@pkiraly This is an area where things may need to be handled differently for domain-agnostic and domain-specific (which I tend to focus on) data repositories. For a domain-specific repository, "fitness for purpose" can be judged by comparing the results of a processing pipeline run on the dataset to datasets representing the final products (which can be identified by metadata associated with the dataset). The purpose of these processing pipelines is essentially dictated by the dataset type, but the exact software used is less significant than the consistency of the results (between multiple pipelines, and between pipeline results and other deposited results). How best to evaluate consistency for multi-step comparisons across several layers of both continuous and discrete attributes of these results (and how to present them in an easily interpretable manner to end users) is still an open question, at least as far as I'm aware.
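
For illustration only, a consistency check along these lines might compare re-computed results to deposited ones, treating continuous attributes with a tolerance and discrete attributes exactly; the attribute names and the tolerance are assumptions made for the sketch.

```python
# Sketch of comparing re-computed pipeline results against deposited results:
# continuous values within a relative tolerance, discrete values exactly.
# The attribute names and the tolerance are illustrative assumptions.
import math

def compare_results(deposited: dict, recomputed: dict, rel_tol: float = 1e-3) -> dict:
    """Per-attribute agreement between deposited and re-computed results."""
    agreement = {}
    for key, expected in deposited.items():
        actual = recomputed.get(key)
        if isinstance(expected, float) and isinstance(actual, float):
            agreement[key] = math.isclose(expected, actual, rel_tol=rel_tol)
        else:
            agreement[key] = expected == actual
    return agreement

# e.g. compare_results({"mean_value": 2.100, "category": "A"},
#                      {"mean_value": 2.1003, "category": "A"})
# -> {"mean_value": True, "category": True}
```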

I'd imagine generalizing this to having the input files in one dataset, the details of the processing pipeline(s) in another "dataset" (acknowledging the open question of whether a compute process should be a dataset or another type of research object), and the outputs in a third dataset, all linked via metadata. This avoids the issue of having file tags for inputs, scripts, source code, built tools, etc. (which, as you point out, aren't widely used). My perspective on this isn't universal - there seems to be a decent amount of mixing of analysis code, input files, log files, and output files in a single dataset. I don't have much insight into the re-usability of CSV files, tidy or not.

@mheppler
Contributor

Also related to badges: Dataset - Peer Reviewed Tag or Badge #565 and Crowdsourcing rating tool of dataset #924.

@poikilotherm
Contributor

Just my 2 cents: one of my ideas beyond PeerPub was to use this "internal peer review" tool in combination with data repositories.
It hasn't evolved to a usable state yet, but there is some slow movement on the horizon.

@djbrooke
Contributor

Hey @pkiraly, I'm closing this as we want to approach this along with a few other use cases in #6041 and I'm trying to get all of those cases in a single issue for design purposes. The details in this issue will be extremely helpful to reference, and we'd appreciate your knowledge on that larger issue. Thank you!
