---
output: github_document
editor_options:
  chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dplyr)
library(gtsummary)

trials_all <- readr::read_csv(here::here("data", "processed", "trials.csv"))
```
# IntoValue Dataset
## Overview
This dataset is modified from the [main IntoValue dataset](https://doi.org/10.5281/zenodo.5141342), and includes updated registry data from ClinicalTrials.gov and DRKS. It also includes additional data on associated results publications, including links in the registries and trial registration number reporting in the publications.
Detailed documentation for the parent IntoValue dataset is provided in a data dictionary and readme alongside the [dataset in Zenodo](https://doi.org/10.5281/zenodo.5141342). This readme highlights and documents the changes relative to that parent dataset.
Note that `summary_results_date` for DRKS summary results was changed from the parent IntoValue dataset. The parent dataset included only a subset of summary results *manually* found during searches, whereas this dataset includes additional summary results found via *automated* search. The parent dataset used the `summary_results_date` manually extracted from *PDFs*, whereas this dataset uses the `summary_results_date` manually extracted from DRKS’ *change history* and reflects the date the results were uploaded and made publicly available.
## Data sources
This dataset builds on several sources, detailed below. The latest query date is provided when applicable. Raw data, when permissible (i.e., not for full-text), is shared in either this repository or in [Zenodo](https://zenodo.org/record/5506434), depending on its size.
```{r query-logs, echo = FALSE}
query_logs <- loggit::read_logs(here::here("queries.log"))

# Return the date of the most recent logged query for a given source
get_latest_query <- function(query, logs) {
  logs %>%
    filter(log_msg == query) %>%
    arrange(desc(timestamp)) %>%
    slice_head(n = 1) %>%
    pull(timestamp) %>%
    as.Date()
}
```
| Source | Type | Date | Raw Data | Script |
|-----------------------------------|-------------------------------|----------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| IntoValue | Trials | NA | <https://doi.org/10.5281/zenodo.5141342> | [get-intovalue.R](https://github.com/maia-sh/intovalue-data/blob/main/scripts/01_get-intovalue.R) |
| PubMed | Bibliometric | `r get_latest_query("PubMed", query_logs)` | [Zenodo](https://zenodo.org/record/5506434/files/pubmed.zip?download=1) [last update: 2021-08-15] | [get-pubmed.R](https://github.com/maia-sh/intovalue-data/blob/main/scripts/02_get-pubmed.R) |
| ClinicalTrials.gov/ AACT | Registry | `r get_latest_query("AACT", query_logs)` | [Zenodo](https://zenodo.org/record/5506434/files/ctgov.zip?download=1) [last update: 2021-08-15] | [get-process-aact.R](https://github.com/maia-sh/intovalue-data/blob/main/scripts/06_get-process-aact.R) |
| DRKS | Registry | `r get_latest_query("DRKS", query_logs)` | [Zenodo](https://zenodo.org/record/5506434/files/drks.zip?download=1) [last update: 2021-08-15] | [get-drks.R](https://github.com/maia-sh/intovalue-data/blob/main/scripts/08_get-drks.R) |
| Unpaywall/ institutional licenses | Full-text PDF | NA | NA, available only on local server `/data01/responsible_metrics/intovalue-data/fulltext` | [get-ft-pdf.R](https://github.com/maia-sh/intovalue-data/blob/main/scripts/03_get-ft-pdf.R) |
| GROBID | Full-text XML | NA | NA, available only on local server `/data01/responsible_metrics/intovalue-data/fulltext` | PDF-to-XML conversion done in Python |
| Unpaywall | Open access status | `r get_latest_query("Unpaywall", query_logs)` | [oa-unpaywall.csv](https://github.com/maia-sh/intovalue-data/blob/main/data/raw/open-access/oa-unpaywall.csv) | [get-oa-unpaywall-data.R](https://github.com/maia-sh/intovalue-data/blob/main/scripts/13_get-oa-unpaywall-data.R) |
| ShareYourPaper | Green open access permissions | `r get_latest_query("ShareYourPaper", query_logs)` | [oa-syp-permissions.csv](https://github.com/maia-sh/intovalue-data/blob/main/data/raw/open-access/oa-syp-permissions.csv) | [get-oa-permissions.py](https://github.com/maia-sh/intovalue-data/blob/main/scripts/14_get-oa-permissions.py) |
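
The dates in the table above are filled in by `get_latest_query()`, which keeps the most recent timestamp logged for each source. The following is a minimal base R sketch of that lookup on a toy log. The log entries are made up, the ISO-style timestamp format is an assumption, and `latest_query_date()` is a hypothetical stand-in for illustration, not the helper used in this repository.

```r
# Toy query log (made-up entries) with the two columns the helper relies on.
# Assumption: timestamps are strings that start with an ISO date (YYYY-MM-DD).
toy_logs <- data.frame(
  timestamp = c(
    "2021-05-01T10:00:00+0200",
    "2021-08-15T09:30:00+0200",
    "2021-07-02T14:45:00+0200"
  ),
  log_msg = c("PubMed", "PubMed", "DRKS"),
  stringsAsFactors = FALSE
)

# Hypothetical base R equivalent of get_latest_query()
latest_query_date <- function(query, logs) {
  # Keep the date part of every timestamp logged for this source
  dates <- as.Date(substr(logs$timestamp[logs$log_msg == query], 1, 10))
  # A source that was never queried yields NA instead of an error
  if (length(dates) == 0) return(as.Date(NA))
  max(dates)
}

latest_query_date("PubMed", toy_logs)
```

A source with no log entry (here, for example, `"AACT"`) returns `NA`, which matches the `NA` shown in the table for sources that are not queried.
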
## Data directories
Aside from full-text publications, the data directories should be entirely reproducible from scripts. The data directory structure should look as follows. Directories with a large number of individual raw files are indicated with curly braces.
```{r data-directories, echo = FALSE, eval = FALSE}
fs::dir_tree(here::here("data", "processed"))
fs::dir_tree(here::here("data", "raw"), recurse = FALSE)
fs::dir_tree(here::here("data", "raw", "fulltext"))
```
    ├── data
        ├── processed
        │   ├── codebook.csv
        │   ├── trials.csv
        │   ├── trials.rds
        │   ├── pubmed
        │   │   ├── pubmed-abstract.rds
        │   │   ├── pubmed-ft-retrieved.rds
        │   │   ├── pubmed-main.rds
        │   │   └── pubmed-si.rds
        │   ├── registries
        │   │   ├── ctgov
        │   │   │   ├── ctgov-crossreg.rds
        │   │   │   ├── ctgov-facility-affiliations.rds
        │   │   │   ├── ctgov-ids.rds
        │   │   │   ├── ctgov-lead-affiliations.rds
        │   │   │   ├── ctgov-references.rds
        │   │   │   └── ctgov-studies.rds
        │   │   ├── drks
        │   │   │   ├── drks-crossreg.rds
        │   │   │   ├── drks-facility-affiliations.rds
        │   │   │   ├── drks-ids.rds
        │   │   │   ├── drks-lead-affiliations.rds
        │   │   │   ├── drks-references.rds
        │   │   │   └── drks-studies.rds
        │   │   ├── registry-crossreg.rds
        │   │   ├── registry-references.rds
        │   │   └── registry-studies.rds
        │   └── trn
        │       ├── cross-registrations.rds
        │       ├── n-cross-registrations.rds
        │       ├── trn-abstract.rds
        │       ├── trn-all.rds
        │       ├── trn-ft-doi.rds
        │       ├── trn-ft-pmid.rds
        │       ├── trn-reported-long.rds
        │       ├── trn-reported-wide.rds
        │       └── trn-si.rds
        └── raw
            ├── intovalue.csv
            ├── pubmed {raw files named [pmid].xml}
            ├── registries
            │   ├── ctgov
            │   │   ├── centers.csv
            │   │   ├── designs.csv
            │   │   ├── facilities.csv
            │   │   ├── ids.csv
            │   │   ├── interventions.csv
            │   │   ├── officials.csv
            │   │   ├── references.csv
            │   │   ├── responsible-parties.csv
            │   │   ├── sponsors.csv
            │   │   └── studies.csv
            │   └── drks {raw files named [drks trn]}
            ├── fulltext
            │   ├── doi
            │   │   ├── pdf {raw files named [doi].pdf}
            │   │   └── xml {raw files named [doi].tei.xml}
            │   └── pmid
            │       ├── pdf {raw files named [pmid].pdf}
            │       └── xml {raw files named [pmid].tei.xml}
            └── open-access
                ├── oa-syp-permissions.csv
                └── oa-unpaywall.csv
## Analysis dataset
We are interested in interventional trials with a German UMC lead completed between 2009 and 2017. Due to changes in the registry as well as discrepancies between IntoValue 1 and 2, we re-apply the IntoValue exclusion criteria and deduplicate to get the analysis dataset.
```{r filter-trials}
trials <-
  trials_all %>%
  filter(
    # Re-apply the IntoValue exclusion criteria
    iv_completion,
    iv_status,
    iv_interventional,
    has_german_umc_lead,

    # In case of dupes, exclude IV1 version
    !(is_dupe & iv_version == 1)
  )

n_iv_trials <- nrow(trials)
```
**Number of included trials**: `r n_iv_trials`
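
To illustrate the deduplication rule, here is a small base R sketch on made-up trials. It assumes that a trial present in both IntoValue versions is flagged `is_dupe` in both rows, so that only the IV1 row is dropped. The trial IDs and values are invented, and only one of the exclusion flags is included for brevity.

```r
# Made-up trials: NCT001 appears in both IntoValue 1 and 2
toy_trials <- data.frame(
  id = c("NCT001", "NCT001", "NCT002", "DRKS003"),
  iv_version = c(1, 2, 1, 2),
  is_dupe = c(TRUE, TRUE, FALSE, FALSE),
  iv_completion = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep rows that pass the exclusion flag and are not the IV1 copy of a duplicate
keep <- with(toy_trials, iv_completion & !(is_dupe & iv_version == 1))
toy_analysis <- toy_trials[keep, ]

toy_analysis[, c("id", "iv_version")]
```

In this toy example the IV2 row of `NCT001` and the non-duplicated `NCT002` remain, while `DRKS003` is dropped by the completion flag.
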
For analyses by UMC, split trials by UMC lead city:
```{r trials-by-umc}
trials_by_umc <-
  trials %>%
  mutate(lead_cities = strsplit(as.character(lead_cities), " ")) %>%
  tidyr::unnest(lead_cities)
```
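
As a rough illustration of what this split does, the base R sketch below expands a made-up table in which one trial has two lead cities. It assumes, as the chunk above does, that `lead_cities` is a single space-separated string. The data are invented.

```r
# Made-up trials: the first has two lead UMC cities
toy_trials <- data.frame(
  id = c("NCT001", "DRKS002"),
  lead_cities = c("Berlin Hamburg", "Leipzig"),
  stringsAsFactors = FALSE
)

# Split the space-separated cities, then repeat each trial once per city
cities <- strsplit(as.character(toy_trials$lead_cities), " ")
toy_by_umc <- data.frame(
  id = rep(toy_trials$id, times = lengths(cities)),
  lead_cities = unlist(cities),
  stringsAsFactors = FALSE
)

toy_by_umc
```

A trial with several lead cities appears once per city, so row counts in the by-UMC table can exceed the number of unique trials.
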
Some analyses apply only to trials whose results publication has a PMID that resolves to a PubMed record and whose full text we could acquire as a PDF. Optionally, these can be limited to journal articles, which excludes dissertations and abstracts.
```{r trials-with-pubs}
trials_pubs <-
  trials %>%
  filter(
    # publication_type == "journal publication", #optional
    has_pubmed,
    has_ft
  )

n_iv_trials_pubs <- nrow(trials_pubs)

# Trials that share a PMID with at least one other trial
trials_same_pmid <- janitor::get_dupes(trials_pubs, pmid)
n_trials_same_pmid <- n_distinct(trials_same_pmid$id)
n_pmids_same_trial <- n_distinct(trials_same_pmid$pmid)

# Range of the number of trials sharing a single PMID
n_pmids_dupes <- unique(range(trials_same_pmid$dupe_count))
```
**Number of trials with results publications**: `r n_iv_trials_pubs`
In general, there is at most one publication per trial and at most one trial per publication. However, `r n_trials_same_pmid` trials share `r n_pmids_same_trial` publications (i.e., `r n_pmids_dupes` trials per publication). Since the unit of analysis is trials, we disregard this double-counting of publications.
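
`janitor::get_dupes()` returns every row whose `pmid` occurs more than once, with `dupe_count` giving the number of rows that share that value. The base R sketch below reproduces the same counts on made-up data. The IDs and PMIDs are invented, and the variable names are illustrative only.

```r
# Made-up trials: NCT001 and NCT002 cite the same publication
toy_pubs <- data.frame(
  id = c("NCT001", "NCT002", "DRKS003", "NCT004"),
  pmid = c(111, 111, 222, 333),
  stringsAsFactors = FALSE
)

# Number of trials per PMID, then the PMIDs shared by more than one trial
trials_per_pmid <- table(toy_pubs$pmid)
shared_pmids <- names(trials_per_pmid)[trials_per_pmid > 1]
toy_same_pmid <- toy_pubs[as.character(toy_pubs$pmid) %in% shared_pmids, ]

# Counts analogous to those reported in the text
n_toy_trials_same_pmid <- length(unique(toy_same_pmid$id))
n_toy_shared_pmids <- length(unique(toy_same_pmid$pmid))
```
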
## TRN reporting in abstract
```{r trn-abs}
n_trn_abs <- nrow(filter(trials_pubs, has_iv_trn_abstract))
prop_trn_abs <- n_trn_abs/n_iv_trials_pubs
```
<!-- $$ \text{TRN in abstract (%)} = \frac{\text{Number of trials with PubMed publications with TRN in abstract}}{\text{Number of trials with PubMed publications available as PDF full-text}}$$ -->
**Numerator**: Number of trials with PubMed publications with IntoValue TRN in abstract
**Denominator**: Number of trials with PubMed publications available as PDF full-text
`r scales::percent(prop_trn_abs)` (`r n_trn_abs`/`r n_iv_trials_pubs`) of trials report a TRN in the abstract of their results publication.
## TRN reporting in full-text
```{r trn-ft}
n_trn_ft <- nrow(filter(trials_pubs, has_iv_trn_ft))
prop_trn_ft <- n_trn_ft/n_iv_trials_pubs
```
**Numerator**: Number of trials with PubMed publications with IntoValue TRN in PDF full-text
**Denominator**: Number of trials with PubMed publications available as PDF full-text
`r scales::percent(prop_trn_ft)` (`r n_trn_ft`/`r n_iv_trials_pubs`) of trials report a TRN in the full-text (PDF) of their results publication.
## Linked publication in registry
```{r reg-pub-link}
# ClinicalTrials.gov
trials_ctgov <- filter(trials_pubs, registry == "ClinicalTrials.gov")
n_iv_trials_pubs_ctgov <- nrow(trials_ctgov)
n_reg_pub_link_ctgov <- nrow(filter(trials_ctgov, has_reg_pub_link))
prop_reg_pub_link_ctgov <- n_reg_pub_link_ctgov / n_iv_trials_pubs_ctgov

# Automatically indexed (derived) vs. manually added references
n_auto <- nrow(filter(trials_ctgov, reference_derived))
n_manual <- nrow(filter(trials_ctgov, !reference_derived))

# DRKS
trials_drks <- filter(trials_pubs, registry == "DRKS")
n_iv_trials_pubs_drks <- nrow(trials_drks)
n_reg_pub_link_drks <- nrow(filter(trials_drks, has_reg_pub_link))
prop_reg_pub_link_drks <- n_reg_pub_link_drks / n_iv_trials_pubs_drks
```
*Registry Limitations*: ClinicalTrials.gov includes an often-used PMID field for references. In addition, ClinicalTrials.gov automatically indexes publications from PubMed using the TRN in PubMed's secondary identifier field. In contrast, DRKS includes references as a free-text field, leaving trialists to decide whether to enter any publication identifiers.
We consider a publication "linked" if the PMID or DOI is included in the trial registrations. Note that some publications are included in the registrations without a PMID or DOI (i.e., publication title and/or URL only).
**Numerator**: Number of trials with the PMID and/or DOI of their PubMed publication linked in the trial registration
**Denominator**: Number of trials with PubMed publications available as PDF full-text
`r scales::percent(prop_reg_pub_link_ctgov)` (`r n_reg_pub_link_ctgov`/`r n_iv_trials_pubs_ctgov`) of trials on ClinicalTrials.gov include a link (i.e., PMID, DOI) to their PubMed publication (as available in the IntoValue dataset). This includes `r n_auto` (`r scales::percent(n_auto/n_reg_pub_link_ctgov)`) trials with automatically indexed publications (i.e., using TRN in PubMed's secondary identifier field) and `r n_manual` (`r scales::percent(n_manual/n_reg_pub_link_ctgov)`) trials with manually added publications.
`r scales::percent(prop_reg_pub_link_drks)` (`r n_reg_pub_link_drks`/`r n_iv_trials_pubs_drks`) of trials on DRKS include a link (i.e., PMID, DOI) to their PubMed publication (as available in the IntoValue dataset).
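
The "linked" definition used in this section (the publication's PMID or DOI appears in the trial's registration) could be sketched in base R as follows. This is an illustration only: the table layout, the helper `is_linked()`, and all identifiers are hypothetical, and the repository's scripts may derive `has_reg_pub_link` differently. DOIs are compared case-insensitively because DOIs are not case-sensitive.

```r
# Made-up publication identifiers for three trials
toy_pubs <- data.frame(
  id = c("NCT001", "NCT002", "DRKS003"),
  pmid = c("111", "222", "333"),
  doi = c("10.1000/a", "10.1000/b", "10.1000/c"),
  stringsAsFactors = FALSE
)

# Made-up identifiers found in each trial's registration (one row per reference)
toy_reg_refs <- data.frame(
  id = c("NCT001", "NCT002", "NCT002"),
  pmid = c("111", NA, "999"),
  doi = c(NA, "10.1000/B", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the trial's registration lists the PMID or DOI
is_linked <- function(trial_id, pmid, doi, refs) {
  trial_refs <- refs[refs$id == trial_id, ]
  pmid_match <- !is.na(pmid) && pmid %in% trial_refs$pmid
  doi_match <- !is.na(doi) && tolower(doi) %in% tolower(trial_refs$doi)
  pmid_match || doi_match
}

toy_pubs$has_reg_pub_link <- mapply(
  is_linked, toy_pubs$id, toy_pubs$pmid, toy_pubs$doi,
  MoreArgs = list(refs = toy_reg_refs),
  USE.NAMES = FALSE
)
```

Here `NCT001` is linked via its PMID, `NCT002` via its DOI, and `DRKS003` has no reference in its registration and is therefore not linked.
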
## Registry summary results
```{r summary-results}
trials_pubs %>%
  count(registry, has_summary_results) %>%
  knitr::kable()
```
*Registry Limitations*: ClinicalTrials.gov includes a structured summary results field. In contrast, DRKS lists summary results alongside other references, so summary results were inferred from keywords in the reference title, such as "Ergebnisbericht" (results report) or "Abschlussbericht" (final report).
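
A keyword-based inference of this kind might look like the base R sketch below. The reference titles are made up, and the pattern contains only the two keywords named above. The actual keyword list and matching rules in this repository's scripts may differ, so treat this as an assumption-laden illustration.

```r
# Made-up DRKS reference titles
toy_refs <- data.frame(
  id = c("DRKS001", "DRKS002", "DRKS003"),
  title = c(
    "Ergebnisbericht der Studie",
    "Study protocol",
    "Abschlussbericht 2016"
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: only the two example keywords, matched case-insensitively
summary_pattern <- "ergebnisbericht|abschlussbericht"
toy_refs$has_summary_results <- grepl(
  summary_pattern, toy_refs$title, ignore.case = TRUE
)
```
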
## EUCTR Cross-registrations
```{r euctr-crossreg}
tbl_euctr <-
  trials %>%
  tbl_cross(
    row = has_crossreg_eudract,
    col = registry,
    margin = "column",
    percent = "column",
    label = list(
      has_crossreg_eudract ~ "EUCTR TRN in Registration",
      registry ~ "Registry"
    )
  )

as_kable(tbl_euctr)
```
Of the `r n_iv_trials` unique trials completed between 2009 and 2017 and meeting the IntoValue inclusion criteria, we found that `r inline_text(tbl_euctr, row_level = "TRUE", col_level = "Total")` include an EUCTR ID in their registration and are presumably cross-registered in EUCTR. This includes `r inline_text(tbl_euctr, row_level = "TRUE", col_level = "ClinicalTrials.gov")` from ClinicalTrials.gov and `r inline_text(tbl_euctr, row_level = "TRUE", col_level = "DRKS")` from DRKS.
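
For readers without `gtsummary`, the counts and column percentages in the cross-table can be approximated with base R, as in the sketch below. The data are made up, and this is not how the table above is produced.

```r
# Made-up trials with a registry and a cross-registration flag
toy_trials <- data.frame(
  registry = c(
    "ClinicalTrials.gov", "ClinicalTrials.gov", "ClinicalTrials.gov",
    "DRKS", "DRKS"
  ),
  has_crossreg_eudract = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Counts by cross-registration status (rows) and registry (columns)
counts <- table(toy_trials$has_crossreg_eudract, toy_trials$registry)

# Column percentages: share of each registry's trials with an EUCTR TRN
col_pct <- prop.table(counts, margin = 2)
```
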
## Documentation TODOs
- Update data dictionary
- Add information about categories of data changes from IV1/2 dataset