---
output: github_document
---
```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "img/README-"
)
library(valr)
library(dplyr)
```
# valr <img src="man/figures/logo.png" align="right" />
[![Build Status](https://travis-ci.org/rnabioco/valr.svg?branch=master)](https://travis-ci.org/rnabioco/valr)
[![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/rnabioco/valr?branch=master&svg=true)](https://ci.appveyor.com/project/jayhesselberth/valr)
[![Coverage Status](https://img.shields.io/codecov/c/github/rnabioco/valr/master.svg)](https://codecov.io/github/rnabioco/valr?branch=master)
[![](http://www.r-pkg.org/badges/version/valr)](http://www.r-pkg.org/pkg/valr)
[![](http://cranlogs.r-pkg.org/badges/valr?color=FFD700)](https://CRAN.R-project.org/package=valr)
**`valr` provides tools to read and manipulate genome intervals and signals**, similar to the [`BEDtools`][1] suite. `valr` enables analysis in the R/RStudio environment, leveraging modern R tools in the [`tidyverse`][14] for a terse, expressive syntax. Compute-intensive algorithms are implemented in [`Rcpp`][3]/C++, and many methods take advantage of the speed and grouping capability provided by [`dplyr`][2].
## Installation
The latest stable version can be installed from CRAN:
```r
install.packages('valr')
```
The latest development version can be installed from GitHub:
```r
# install.packages("devtools")
devtools::install_github('rnabioco/valr')
```
## Why `valr`?
**Why another tool set for interval manipulations?** Based on our experience teaching genome analysis, we were motivated to develop interval arithmetic software that facilitates genome analysis in a single environment (RStudio), eliminating the need to master both command-line and exploratory analysis tools.
**Note:** `valr` can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.
### Familiar tools, all within R
The functions in `valr` have similar names to their `BEDtools` counterparts, and so will be familiar to users coming from the `BEDtools` suite. Similar to [`pybedtools`](https://daler.github.io/pybedtools/#why-pybedtools), `valr` has a terse syntax:
```{r syntax_demo, message = FALSE, eval = FALSE}
library(valr)
library(dplyr)
snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)
# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)
nearby %>%
  select(starts_with('name'), .overlap, .dist) %>%
  filter(abs(.dist) < 5000)
```
### Visual documentation
`valr` includes helpful glyphs to illustrate the results of specific operations, similar to those found in the `BEDtools` documentation. For example, `bed_glyph()` can be used to illustrate the result of intersecting `x` and `y` intervals with `bed_intersect()`:
```{r intersect_glyph, eval = FALSE}
x <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 25,     50,
  'chr1', 100,    125
)
y <- trbl_interval(
  ~chrom, ~start, ~end,
  'chr1', 30,     75
)
bed_glyph(bed_intersect(x, y))
```
### Reproducible reports
`valr` can be used in RMarkdown documents to generate reproducible workflows for data processing. Because `valr` is reasonably fast, it can be used for exploratory analysis with `RMarkdown`, and for interactive analysis using `shiny`.
## API
Function names are similar to their [BEDtools][1] counterparts, with some additions.
### Data types
* `tbl_interval()` and `tbl_genome()` wrap tibbles and enforce strict column naming. `trbl_interval()` and `trbl_genome()` are constructors that take `tibble::tribble()` formatting.
### Reading data
* BED and related files are read with `read_bed()`, `read_bed12()`, `read_bedgraph()`, `read_narrowpeak()` and `read_broadpeak()`.
* Genome files containing chromosome name and size information are loaded with `read_genome()`.
* VCF files are loaded with `read_vcf()`.
* Remote databases can be accessed with `db_ucsc()` and `db_ensembl()`.
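
As a minimal sketch of the readers listed above, the following loads a three-column BED file and a genome file from the example data shipped with the package. The file names `3fields.bed.gz` and `hg19.chrom.sizes.gz` are assumed to be available through `valr_example()` in your installed version; substitute your own paths if they are not.

```r
library(valr)

# Read a 3-column BED file (chrom, start, end) from the bundled example data.
# The file name is an assumption about the installed package's extdata.
bed <- read_bed(valr_example("3fields.bed.gz"))

# Read a genome file holding chromosome names and sizes.
genome <- read_genome(valr_example("hg19.chrom.sizes.gz"))

head(bed)
head(genome)
```

Both readers return tibbles, so the results can be passed directly to `dplyr` verbs.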
### Transforming single interval sets
* Interval coordinates are adjusted with `bed_slop()` and `bed_shift()`, and new flanking intervals are created with `bed_flank()`.
* Nearby intervals are combined with `bed_merge()` and identified (but not merged) with `bed_cluster()`.
* Intervals not covered by a query are created with `bed_complement()`.
* Intervals are ordered with `dplyr::arrange()`.
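
The single-set operations above can be sketched on a small toy data set. The intervals and the one-chromosome genome below are made up for illustration, and they are built with `tibble::tribble()` on the assumption that your `valr` version accepts plain tibbles with `chrom`, `start` and `end` columns.

```r
library(valr)
library(dplyr)

# Toy intervals: the first two overlap, the third stands alone.
x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    200,
  "chr1", 150,    300,
  "chr1", 500,    600
)

# Toy genome with a single 1000 bp chromosome.
genome <- tibble::tribble(
  ~chrom, ~size,
  "chr1", 1000
)

# Combine overlapping intervals into single intervals.
merged <- bed_merge(x)

# Extend every interval by 50 bp on both sides, bounded by the genome.
slopped <- bed_slop(x, genome, both = 50)

# Intervals of the genome not covered by x.
gaps <- bed_complement(x, genome)

# Order intervals with dplyr.
sorted <- arrange(x, chrom, start)
```

Functions that can move coordinates past a chromosome end, such as `bed_slop()`, take the genome as an argument so that results stay within chromosome bounds.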
### Comparing multiple interval sets
* Find overlaps between two sets of intervals with `bed_intersect()`.
* Apply functions to selected columns for overlapping intervals with `bed_map()`.
* Remove intervals based on overlaps between two files with `bed_subtract()`.
* Find overlapping intervals within a window with `bed_window()`.
* Find the closest intervals independent of overlaps with `bed_closest()`.
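
A short sketch of two-set comparisons, reusing the coordinates from the glyph example earlier in this document. The inputs are built with `tibble::tribble()`, which assumes your `valr` version accepts plain tibbles.

```r
library(valr)

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 25,     50,
  "chr1", 100,    125
)

y <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 30,     75
)

# Overlapping pairs; columns from each input carry .x and .y suffixes,
# and .overlap reports the number of overlapping bases.
hits <- bed_intersect(x, y)

# Portions of x that are not covered by y.
remaining <- bed_subtract(x, y)

# Closest y interval for each x interval, with .overlap and .dist columns.
nearest <- bed_closest(x, y)
```

Only the first `x` interval overlaps `y`, so `hits` contains a single pair, while `nearest` also reports a neighbor for the non-overlapping second interval.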
### Randomizing intervals
* Generate random intervals from an input genome with `bed_random()`.
* Shuffle the coordinates of input intervals with `bed_shuffle()`.
* Random sampling of input intervals is done with the `sample_` function family in `dplyr`.
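
The randomization functions might be used as in the sketch below. The genome and intervals are toy values chosen for illustration, and the `seed` argument is set only to make the result repeatable.

```r
library(valr)
library(dplyr)

# Toy genome with two chromosomes.
genome <- tibble::tribble(
  ~chrom, ~size,
  "chr1", 10000,
  "chr2", 5000
)

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    200,
  "chr1", 500,    600,
  "chr2", 1000,   1200
)

# Ten random 50 bp intervals drawn from the genome.
rand <- bed_random(genome, length = 50, n = 10, seed = 1)

# Move the intervals in x to random positions, keeping their sizes.
shuffled <- bed_shuffle(x, genome, seed = 1)

# Sample two of the input intervals with dplyr.
subset <- sample_n(x, 2)
```

Shuffled or random intervals are commonly used as a background set when judging whether an observed overlap is larger than expected.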
### Interval statistics
* Calculate significance of overlaps between two sets of intervals with `bed_fisher()` and `bed_projection()`.
* Quantify relative and absolute distances between sets of intervals with `bed_reldist()` and `bed_absdist()`.
* Quantify extent of overlap between two sets of intervals with `bed_jaccard()`.
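
As one example of the statistics above, the following sketch computes a Jaccard similarity for two toy interval sets. The coordinates are made up for illustration, and the inputs assume that plain tibbles are accepted by your `valr` version.

```r
library(valr)

x <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 25,     50,
  "chr1", 100,    125
)

y <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 30,     75
)

# One-row summary that includes a `jaccard` column
# (0 means no shared bases, 1 means identical coverage).
j <- bed_jaccard(x, y)
j
```

Because the two sets overlap only partially, the statistic falls strictly between 0 and 1.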
### Utilities
* Visualize the actions of valr functions with `bed_glyph()`.
* Constrain intervals to a genome reference with `bound_intervals()`.
* Subdivide intervals with `bed_makewindows()`.
* Convert BED12 to BED6 format with `bed12_to_exons()`.
* Calculate spacing between intervals with `interval_spacing()`.
* Access remote databases with `db_ucsc()` and `db_ensembl()`.
## Related work
* Command-line tools [BEDtools][1] and [bedops][5].
* The Python library [pybedtools][4] wraps BEDtools.
* The R packages [GenomicRanges][6], [bedr][7], [IRanges][8] and [GenometriCorr][9] provide similar capability with a different philosophy.
[1]: http://bedtools.readthedocs.org/en/latest/
[2]: https://github.com/hadley/dplyr
[3]: http://www.rcpp.org/
[4]: https://pythonhosted.org/pybedtools/
[5]: http://bedops.readthedocs.org/en/latest/index.html
[6]: https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html
[7]: https://CRAN.R-project.org/package=bedr
[8]: https://bioconductor.org/packages/release/bioc/html/IRanges.html
[9]: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002529
[10]: http://rmarkdown.rstudio.com/
[11]: https://www.r-bloggers.com/why-i-dont-like-jupyter-fka-ipython-notebook/
[12]: https://bitbucket.org/snakemake/snakemake/wiki/Home
[13]: http://shiny.rstudio.com/
[14]: http://tidyverse.org