Introduction

The bioinformatic evaluation of gene co-expression often begins with correlation-based analyses. However, correlation lacks validity when applied to relative data, including count data generated by next-generation sequencing. This vignette describes three metrics of proportionality that measure dependence between relative features using compositional data analysis: \(\phi\) (Lovell 2015), \(\rho\) (Lovell 2015; Erb 2016), and \(\phi_s\) (Erb 2016; Quinn 2017). Unlike correlation, proportionality gives the same result for both relative and absolute data. Meanwhile, pairs that are strongly proportional in relative space are also strongly correlated in absolute space. Proportionality helps avoid the pitfall of spurious correlation.

Theory

Let \(A_i\) and \(A_j\) each represent a log-ratio transformed feature vector measured across \(N\) samples. We then define the metrics \(\phi\), \(\rho\), and \(\phi_s\):

\[\phi(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i)}\]

\[\rho(A_i, A_j) = 1 - \frac{var(A_i - A_j)}{var(A_i) + var(A_j)}\]

\[\phi_s(A_i, A_j) = \frac{var(A_i - A_j)}{var(A_i + A_j)}\]

Log-ratio transformation is applied to each subject vector \(\textrm{x}\) with \(D\) features (e.g., genes or microbes). In this vignette, we consider the centered log-ratio (clr) transformation and the additive log-ratio (alr) transformation. In clr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and the geometric mean of the vector, \(g(\textrm{x}) = \sqrt[D]{x_i…x_D}\). In alr-transformation, sample vectors undergo a transformation based on the logarithm of the ratio between the individual elements and a chosen reference feature, \(x_D\). We define these accordingly:

\[\textrm{clr(x)} = \left[\ln\frac{x_i}{g(\textrm{x})};…;\ln\frac{x_D}{g(\textrm{x})}\right]\]

\[\textrm{alr(x)} = \left[\ln\frac{x_i}{x_D};…;\ln\frac{x_{D-1}}{x_D}\right]\]

Note that this package also implements the interquartile log-ratio transformation (iqlr) as used by the ALDEx2 package. For more information on integrating propr with ALDEx2, we refer the reader to the “Frequently Asked Questions” vignette.

Calculating proportionality

The measures \(\phi\) and \(\rho\) differ in three ways. First, the values of \(\phi\) range from \([0, \infty)\) (with lower \(\phi\) values indicating more proportionality) while the values of \(\rho\) range from \([-1, 1]\) (with greater \(|\rho|\) values indicating more proportionality and negative \(\rho\) values indicating inverse proportionality). Second, \(\phi\) lacks symmetry, although one can force symmetry by reflecting the lower left triangle of the matrix across the diagonal (toggled by the argument symmetrize = TRUE). Third, \(\rho\) corrects for the individual variance of each feature in the pair, rather than for just one of the features. On the other hand, \(\phi_s\) is a naturally symmetric variant of \(\phi\) that also corrects for the individual variance of each feature in the pair. We advise against interpretting the negative proportionality directly; it does not always behave like one would expect!

Let us begin by building an arbitrary data set of 4 features (e.g., genes) measured across 100 subjects. In this example, the feature pairs “a” and “b” will be highly proportional.

set.seed(12345)
N <- 100
X <- data.frame(a=(1:N), b=(1:N) * rnorm(N, 10, 0.1),
                c=(N:1), d=(N:1) * rnorm(N, 10, 1.0))

Let \(D\) represent any number of features measured across \(N\) observations exposed to a binary or continuous event (for example, case-control status, treatment status, treatment dose, or time). These functions convert a “count matrix” with \(N\) rows and \(D\) columns into a proportionality matrix of \(D\) rows and \(D\) columns measuring proportionality for each feature pair. One can think of this matrix as analogous to a dissimilarity matrix (in the case of \(\phi\)) or a correlation matrix (in the case of \(\rho\)). The propr function returns a proportionality matrix bundled within an object of the class propr. This object contains four key slots:

library(propr)
phi <- propr(X, metric = "phi", symmetrize = TRUE)
rho <- propr(X, metric = "rho", ivar = 0)
phs <- propr(X, metric = "phs", ivar = 0)

Note that the log-ratio transformation, by its nature, fails if the input data contain any zero values. By default, this function replaces all zero values with 1. Alternatively, the user may set the parameter alpha greater than zero to approximate log-ratios in the presence of zeros (via the Box-Cox transformation). However, the topic of zero replacement is controversial. Proceed carefully when analyzing data that contain zero values.

How proportional is proportional enough?

By default, the propr function creates 100 permutations of the original data set, optionally used later by updateCutoffs to permute the false discovery rate (FDR) for a set of proportionality cutoffs. This method works for all proportionality measures. In the example below, there are no false discoveries across the range of cutoffs.

updateCutoffs(rho, cutoff =  seq(.05, .95, .3))
## Alert: Try parallelizing updateCutoffs with ncores > 1.
## |------------(25%)----------(50%)----------(75%)----------|
## Not weighted and not alpha-transformed 
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: index with `[` method
## @fdr summary: iterations
##   cutoff randcounts truecounts FDR
## 1   0.05          0          2   0
## 2   0.35          0          2   0
## 3   0.65          0          2   0
## 4   0.95          0          2   0
## See ?propr for object methods

Choose the largest cutoff that keeps the FDR below 0.05.

Subsetting propr objects

We have provided methods for indexing and subsetting objects belonging to the propr class. Using the familiar [ method, we can efficiently index the proportionality matrix (i.e., the @matrix slot) based on an inequality operator and a reference value.

In this first example, we use [ to index the matrix by \(\rho > .95\). This indexes the location of all values satisfying that inequality (i.e., in the lower left triangle of the matrix), and saves those indices to the @pairs slot. Indexing helps guide some of the bundled visualization methods in lieu of copy-on-modify subsetting. Note that indexing an already indexed object appends the new index to the previous index (via a union merge).

rho99 <- rho[">", .95]
rho99@pairs
## [1]  2 12

Alternatively, using the subset method, we can subset an entire propr object based on a vector of feature indices or names. However, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large data sets.

In this second example, we subset by the feature names “a” and “b”.

rhoab <- subset(rho, select = c("a", "b"))
rhoab@matrix
##           a         b
## a 1.0000000 0.9999151
## b 0.9999151 1.0000000

The convenience function, simplify, can subset an entire propr object based on the index saved in its @pairs slot. This function converts the saved index into a paired list of coordinates and passes them along to the subset method. As such, this method does copy-on-modify the proportionality matrix, making it potentially unsuitable for large data sets. Unlike subset, simplify returns an object with the @pairs slot updated. Most users will find simplify preferable to subset.

simplify(rho99)
## Not weighted and not alpha-transformed 
## @counts summary: 100 subjects by 4 features
## @logratio summary: 100 subjects by 4 features
## @matrix summary: 4 features by 4 features
## @pairs summary: 2 feature pairs
## @fdr summary: iterations
## See ?propr for object methods

Computational burden

High-throughput genomic sequencing has the ability to measure tens of thousands of features for each subject. Since calculating proportionality generates a matrix sized \(D^2\), this method uses a lot of RAM when applied to real biological data sets. To address this, propr harnesses the power of C++ (via the Rcpp package) to minimize the run-time and RAM overhead. Below, we provide a table that estimates the approximate amount of RAM needed to render a proportionality matrix based on the number of features studied. The user should account for up to 25% more RAM for subsequent [ indexing.

Features Peak RAM (MiB)
1000 8
2000 31
4000 123
8000 491
16000 1959
24000 4405
32000 7829
64000 31301
100000 76406

Limitations

Although we developed this package with biological count data in mind, compositional count data is not truly compositional in that count data contain integer values only. As such, measuring “Gene A” as \(1\) in one subject and “Gene B” as \(2\) in another subject (i.e., the feature vector \([1, 2]\)) does not carry the same information as measuring “Gene A” as \(1000\) in one subject and “Gene B” as \(2000\) in another subject (i.e., the feature vector \([1000, 2000]\)) due to how additive variation affects the relative abundance of small counts more than large counts (Quinn 2017). We advise the investigator to proceed with caution when analyzing low counts.

References

  1. Erb, Ionas, and Cedric Notredame. “How Should We Measure Proportionality on Relative Gene Expression Data?” Theory in Biosciences = Theorie in Den Biowissenschaften 135, no. 1-2 (June 2016): 21-36. http://dx.doi.org/10.1007/s12064-015-0220-8.

  2. Lovell, David, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. “Proportionality: A Valid Alternative to Correlation for Relative Data.” PLoS Computational Biology 11, no. 3 (March 2015): e1004075. http://dx.doi.org/10.1371/journal.pcbi.1004075.

  3. Quinn, Thomas P., Mark F. Richardson, David Lovell, and Tamsyn M. Crowley. “Propr: An R-Package for Identifying Proportionally Abundant Features Using Compositional Data Analysis.” Scientific Reports 7, no. 1 (November 24, 2017): 16252. https://doi.org/10.1038/s41598-017-16520-0.