ZicoSeq
is a linear model and permutation-based method for dfferential
abundance analysis of zero-inflated compositional sequencing data such as
microbiome sequencing data. It has the following components:
Currently ZicoSeq
supports:
count data or proportion data. For both count and proportion data, a reference-based ratio approach is used to account for compositional effects. When a count matrix is provided, it provides an option to draw posterior samples of the underlying proportions to account for the sampling variability during the sequencing process. The test results are aggregated over these posterior samples.
other data types. As a general methodology, ZicoSeq
can be used
to differential analysis of other high-dimensional datasets such as
transcriptomics, epigenomics, metabolomics, and proteomics data.
Install the package.
# install.packages("GUniFrac")
Load the package.
library(GUniFrac)
ZicoSeq
This example data set contains the OTU count data from the throat microbiome of
the left body side, which is available in GUniFrac
package. It contains 60
subjects consisting of 32 nonsmokers and 28 smokers (Charlson, Emily S., et
al.,
2010)..
In this example, we will identify smoking-associated OTUs while adjusting the
sex since sex is a potential confounder (we see more males in smokers).
data(throat.otu.tab)
data(throat.meta)
comm <- t(throat.otu.tab)
meta.dat <- throat.meta
meta.dat
ZicoSeq
functionZicoSeq.obj <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm,
grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "count",
# Filter to remove rare taxa
prev.filter = 0.2, mean.abund.filter = 0,
max.abund.filter = 0.002, min.prop = 0,
# Winsorization to replace outliers
is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
# Posterior sampling
is.post.sample = TRUE, post.sample.no = 25,
# Use the square-root transformation
link.func = list(function (x) x^0.5), stats.combine.func = max,
# Permutation-based multiple testing correction
perm.no = 99, strata = NULL,
# Reference-based multiple stage normalization
ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
# Family-wise error rate control
is.fwer = TRUE, verbose = TRUE, return.feature.dat = T)
ZicoSeq
Microbiome data may contain outliers with extremely large counts. These
outliers could reduce the statistical power, or more seriously increase the type
I error rate. In ZicoSeq
, we use winsorization to replace those extremely
large counts with a certain quantile (e.g. 97%). We recommend always setting
is.winsor = TRUE
for real data analysis. The specific quantile used depends on
the expected proportion of outliers.
Posterior sampling can be enabled by setting is.post.sample = TRUE
. It can
only be used when the data type is “count”. For microbiome data, the majority
of the taxa are rare, resulting in a large number of zeros in the data given a
limited sequencing depth. As the probability of zero strongly depends on the
sequencing depth, zeros from different samples are not comparable. In
ZicoSeq
, we provide a method to infer/impute the underlying true proportions
using an empirical Bayes approach by pooling information across samples. The
imputed proportions depend on both the sequencing depth and the estimated
(prior) distribution of the proportions across samples. To better model the
prior distribution, we used beta mixtures instead a single-component beta
distribution. Posterior sampling effectively reduces the type I error when
the sequencing depth is associated with the covariate of interest (e.g. cases
and controls differ in sequencing depth) and is more powerful than
rarefaction. When the sequencing depth is not a confounding factor,
is.post.sample
slightly improves the power and could be disabled for analyzing
large datasets.
The relationship between the taxa abundance and the covariate of interest
may vary across taxa, a single transformation function may not be powerful
enough to capture diverse relationships. ZicoSeq
allows specifying multiple
possible transformation functions (link.func
) and performs omnibus testing.
Although the log transformation is commonly used for microbiome due to its
interpretability, we found that power transformations may be more powerful. The
default is the square-root transformation.
ZicoSeq
uses a reference-based approach to address compositional effects. The
default uses 30% of the least variable/significant taxa as the reference. It
starts with selecting 50% of the least variable taxa as the reference and
differential abundance testing is then run multiple times excluding 20% taxa
with the smallest p-values in each iteration. This procedure is to make sure the
remaining 30% are more likely to be non-differential.
ZicoSeq
uses a permutation-based approach to control false discovery rate
(FDR). The permutation-based FDR control preserves the correlation structure in
the abundance data and is more robust and powerful than the traditional BH-based
FDR control based on raw p values. ZicoSeq
also allows performing
permutation-based family-wise error rate control. Users can determine the
appropriate error control method to suit their needs.
ZicoSeq
outputThe output of ZicoSeq
consists of mainly:
p.raw
: raw p-values based on permutations, it will not be accurate if
perm.no
is small.p.adj.fdr
: permutation-based FDR-adjusted p-values.p.adj.fwer
: permutation-based FWER-adjusted (West-Young), only if
is.fwer = T
in ZicoSeq
. perm.no
is recommened to set at least 999 for
accurate FWER-adjusted p-values.A matrix of R2 values (number of features by number of transformation functions) will be provided. R2 is an effect size measure and can be used to assess the association strength between the taxa abundance and the covariate of interest while adjusting for other covariates. When the omnibus testing is in action, i.e., multiple transformations are used, there will be multiple R2 s, each corresponding to a specific transformation.
ZicoSeq
output visualizationZicoSeq.plot
function produces a volcano plot with the y-axis being the
log10 (adjusted) p-value and the x-axis being the signed R2 with the sign
indicating the association direction determined based on the sign of the
regression coefficient (for mutli-categorical variables, sign is not
applicable). When multiple transformation functions are used, the largest
R2 is used. The names of differential taxa passing a specific significance
cutoff will be printed on the figure. When data types are counts and
proportions, the mean abundance and prevalence will be visualized; when the data
type is other
, mean and standard deviation of the features will be visualized.
Users need to set return.feature.dat = T
when using the plot function.
ZicoSeq.plot(ZicoSeq.obj, meta.dat, pvalue.type = 'p.adj.fdr', cutoff = 0.1, text.size = 10,
out.dir = NULL, width = 10, height = 6)
For some bioinformatics pipelines, the output could be proportion data.
ZicoSeq
can be applied to proportion data by specifying feature.dat.type =
"proportion"
. Posterior sampling will not be applied when analyzing the
proportions and other parameter settings are similar to the count case.
comm.p <- t(t(comm) / colSums(comm))
ZicoSeq.obj.p <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.p,
grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "proportion",
# Filter to remove rare taxa
prev.filter = 0.2, mean.abund.filter = 0, max.abund.filter = 0.002, min.prop = 0,
# Winsorization to replace outliers
is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
# Posterior sampling will be automatically disabled
is.post.sample = FALSE, post.sample.no = 25,
# Use the square-root transformation
link.func = list(function (x) x^0.5, function (x) x^0.25), stats.combine.func = max,
# Permutation-based multiple testing correction
perm.no = 99, strata = NULL,
# Reference-based multiple stage normalization
ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
# Family-wise error rate control
is.fwer = TRUE, verbose = TRUE, return.feature.dat = T)
suppressWarnings(ZicoSeq.plot(ZicoSeq.obj = ZicoSeq.obj.p, meta.dat = meta.dat, pvalue.type = 'p.adj.fdr',
cutoff = 0.1, text.size = 10, out.dir = NULL, width = 10, height = 6))
## Warning: Removed 103 rows containing missing values (geom_text_repel).
ZicoSeq
as a general linear model-based permutation test can be applied to
association analyses of other high-dimensional datasets such as transcriptome-,
methylome-, metabolome- and proteome-wide association testing, where the omics
features are treated as the outcomes. In this case, posterior sampling will be
automatically disabled. The user should be responsible for properly normalizing,
transforming, and filtering the data before applying ZicoSeq
. The user should
also decide whether to winsorize the top, bottom or both ends of the
distribution. The following code is just for demonstration purposes and does
not mean to be applied for differential abundance analysis.
comm.o <- comm[rowMeans(comm != 0) >= 0.2, ] + 1
comm.o <- log(t(t(comm.o) / colSums(comm.o)))
ZicoSeq.obj.o <- ZicoSeq(meta.dat = meta.dat, feature.dat = comm.o,
grp.name = 'SmokingStatus', adj.name = 'Sex', feature.dat.type = "other",
# Filter will not be applied
prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0, min.prop = 0,
# Winsorization the top end
is.winsor = TRUE, outlier.pct = 0.03, winsor.end = 'top',
# Posterior sampling will be automatically disabled
is.post.sample = FALSE, post.sample.no = 25,
# Identity function is used
link.func = list(function (x) x), stats.combine.func = max,
# Permutation-based multiple testing correction
perm.no = 99, strata = NULL,
# Reference-based multiple-stage normalization will not be performed
ref.pct = 0.5, stage.no = 6, excl.pct = 0.2,
# Family-wise error rate control
is.fwer = TRUE, verbose = TRUE, return.feature.dat = T)
ZicoSeq.plot(ZicoSeq.obj = ZicoSeq.obj.o, meta.dat = meta.dat, pvalue.type = 'p.adj.fdr',
cutoff = 0.1, text.size = 10, out.dir = NULL, width = 10, height = 6)
ZicoSeq
can be applied to perform within-subject comparisons (e.g. before
treatment vs. after treatment) by using the strata
parameter. For example,
if the subject variable is “subject_id” in the meta data file, the user can set
strata = 'subject_id'
.
Lu Yang & Jun Chen. 2022. A comprehensive evaluation of differential abundance analysis methods: current status and potential solutions. Microbiome. In revision.
Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, Sinha R, Hwang J, Bushman FD, Collman RG. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS One. 2010 Dec 20;5(12):e15216. doi: 10.1371/journal.pone.0015216. PMID: 21188149; PMCID: PMC3004851.
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GUniFrac_1.6
##
## loaded via a namespace (and not attached):
## [1] statmod_1.4.35 tidyselect_1.1.1 xfun_0.16
## [4] purrr_0.3.3 splines_3.6.3 lattice_0.20-40
## [7] colorspace_1.4-1 vctrs_0.3.8 generics_0.0.2
## [10] rmutil_1.1.5 mgcv_1.8-31 utf8_1.1.4
## [13] rlang_1.0.1 stable_1.1.4 pillar_1.7.0
## [16] glue_1.6.1 DBI_1.1.0 matrixStats_0.57.0
## [19] foreach_1.5.1 lifecycle_1.0.1 stringr_1.4.0
## [22] timeDate_3043.102 munsell_0.5.0 fBasics_3042.89.1
## [25] gtable_0.3.0 timeSeries_3062.100 statip_0.2.3
## [28] codetools_0.2-16 evaluate_0.14 labeling_0.3
## [31] knitr_1.29 permute_0.9-5 modeest_2.4.0
## [34] parallel_3.6.3 fansi_0.4.1 highr_0.8
## [37] Rcpp_1.0.7 scales_1.1.1 vegan_2.5-6
## [40] farver_2.0.3 digest_0.6.25 ggplot2_3.3.5
## [43] stringi_1.4.6 dplyr_1.0.8 ggrepel_0.9.1
## [46] clue_0.3-57 grid_3.6.3 stabledist_0.7-1
## [49] cli_3.1.1 tools_3.6.3 magrittr_1.5
## [52] tibble_3.1.6 cluster_2.1.0 crayon_1.3.4
## [55] ape_5.4-1 spatial_7.3-11 pkgconfig_2.0.3
## [58] MASS_7.3-51.5 ellipsis_0.3.2 Matrix_1.3-3
## [61] iterators_1.0.12 rpart_4.1-15 R6_2.4.1
## [64] nlme_3.1-145 compiler_3.6.3