IntervalSurgeon
presents functions for intersecting, overlapping, piling and annotating integer-bounded intervals. Underlying algorithms are written in C++ for efficiency where appropriate (with the help of the Rcpp
package). A typical use case would be for manipulating genomic intervals.
For the purposes of this package, intervals are represented as two-column integer matrices where the inclusive start points are in the first column and the exclusive end points are in the second column.
library(IntervalSurgeon)
starts <- 3*1:10
ends <- 5*1:10
intervals <- cbind(starts, ends)
The lengths of the intervals are therefore: intervals[,2]-intervals[,1]
.
A key concept in IntervalSurgeon
is breaking ranges which contain intervals into 'sections' delimited by the unique start/end points in the set. The sections for a set of intervals x
is therefore a two-column matrix
representing a set of intervals which partitions the range of x
by the sorted start and end points. The sorted start and end points can be obtained using the breaks
function (so named to reflect start/end points of intervals frequently being referred to as 'breakpoints' in genomics), which is equivalent to: sort(unique(as.integer(intervals)))
). The sections can be computed from the sorted start and end points using the sections
function.
One can use the depth
and pile
functions respectively to obtain the depth of intervals over their 'sections' (within which the depth is constant), and the row indices of original intervals which cover each section.
The flatten
function returns a non-touching, non-overlapping set of intervals (as a matrix
) which overlap at least one of a given set.
sectioned <- sections(breaks(intervals))
flattened <- flatten(intervals)
depths <- depth(intervals)
piles <- pile(intervals)
IntervalSurgeon
provides functions which are optimised for dealing with detached (i.e. non-overlapping and non-touching) intervals which are sorted and non-empty. Here, we generate two such sets of intervals:
x_starts <- 10*1:10
x <- cbind(x_starts, x_starts + 5)
y_starts <- 20*1:5 + 1
y <- cbind(y_starts, y_starts + 7)
We can determine that they are indeed detached, sorted and non-empty using the detached_sorted_nonempty
function.
detached_sorted_nonempty(x)
## [1] TRUE
The IntervalSurgeon
functions for operating on such sets of detached, sorted, non-empty intervals are analogues of the set operation functions in the base
package, namely: intersect
, union
and setdiff
. Here, the function names are in plural (i.e. with extra s's on the end).
For a given set of sorted, non-overlapping intervals, some of the start points might be the same as the end points of adjacent intervals. These intervals are therefore 'touching' and can be 'stitched' together using the stitch
function. If there were overlaps, the flatten
function can be used to generate a set of sorted disjoint intervals. Note that flatten
will also stitch touching intervals, although the stitch
function is faster (albeit requiring intervals to be sorted).
Information about overlaps between sets of intervals can be obtained by 'joining' the sets. This is analogous to an SQL inner-join of two tables, and can be performed on sets of intervals using the join
function. This function returns a matrix containing all overlaps of intervals from each set. Each row in the returned matrix corresponds to a specific overlap of intervals with one from each of the sets passed to the function. The n
th element in a row contains the row index of the interval in the n
th set of intervals passed to the function. Depending on the value of the output
argument, there may two additional columns giving the start and end coordinates of the overlap (the default: output="intervals"
, no extra columns (output="indices"
) or one additional column giving the row index of the 'section' of the complete set of intervals (output="sections"
, see ?sections
).
x <- cbind(3*1:8, 5*1:8)
y <- cbind(4*1:4, 7*1:4)
join_xy <- join(x, y)
head(join_xy)
## [,1] [,2] [,3] [,4]
## [1,] 1 1 4 5
## [2,] 2 1 6 7
## [3,] 2 2 8 10
## [4,] 3 2 9 14
## [5,] 3 3 12 15
## [6,] 4 2 12 14
One common task would be to tag intervals with overlapping intervals, perhaps from a different set. For example, this might be useful for tagging a set of genomic intervals with the names of genes which they overlap. The annotate
function is provided for this exact purpose.
x <- cbind(0, c(a=10, b=20, c=30))
y <- cbind(c(A=0, B=10, C=20), c(5, 15, 25))
annotate(x, y)
## $a
## [1] "A"
##
## $b
## [1] "A" "B"
##
## $c
## [1] "A" "B" "C"
Genomic intervals are often represented in R as data.frame
s with columns corresponding to chromosome name, start position and end position. Obviously intervals do not intersect if they are on different chromosomes, so in order to manipulate such intervals with IntervalSurgeon
, genome-wide operations must be performed chromosome-at-a-time. Using split
to create a list of chromosome specific data.frames
, or looping over the names of chromosomes and subsetting the original table, before cbind
ing/as.matrix
ing the start and end columns then makes the data accessible to the IntervalSurgeon
functions.
regions_annotated_with_genes <- lapply(
c(1:22, "X", "Y"),
function(chromosome) {
regions_on_chr <- as.matrix(regions[regions$chr == chromosome,c("start", "end")])
genes_on_chr <- as.matrix(genes[genes$chr == chromosome,c("start","end")])
annotate(regions_on_chr, genes_on_chr)
}
)