Phylostratum
and Divergence Stratum
Enrichment AnalysesPhylostratum
and Divergence Stratum
enrichment analyses have been performed by several studies to correlate organ or metabolic pathway evolution with the origin of organ or pathway specific genes (Sestak and Domazet-Loso, 2015).
In detail, Phylostratum
and Divergence Stratum
enrichment analyses can be performed analogously to Gene Ontology and Kegg enrichment analyses to study the enrichment of evolutionary age or sequence divergence in a set of selected genes against the entire genome/transcriptome. In case specific age categories are significantly over- or underrepresented in the selected gene set, assumptions or potential correlations between the evolutionary origin of a particular organ or metabolic pathway can be implied.
In this vignette we will use the data set published by Sestak and Domazet-Loso, 2015 to demonstrate how to perform enrichment analyses using myTAI
.
PlotEnrichment()
The PlotEnrichment()
function implemented in myTAI
computes and visualizes the significance of enriched (over- or underrepresented) Phylostrata
or Divergence Strata
within an input set of process/tissue specific genes. In detail this function takes the Phylostratum
or Divergence Stratum
distribution of all genes stored in the input ExpressionSet
as background set and the Phylostratum
or Divergence Stratum
distribution of the specific gene set and performs a Fisher’s exact test for each Phylostratum
or Divergence Stratum
to quantify the statistical significance of over- or under-represented Phylostrata
or Divergence Strata
within the set of selected genes. In other words, the frequency distribution of Phylostrata
or Divergence Strata
in the complete sample is compared with the frequency distribution of Phylostrata
or Divergence Strata
in the set of selected genes and over- or under-representation is visualized by log-odds (or odds), where a log-odd of zero means that both frequency distributions are equal (see also Sestak and Domazet-Loso, 2015).
Before using the PlotEnrichment()
function, we need to download the example data set from Sestak and Domazet-Loso, 2015.
Download the Phylostratigraphic Map
of D. rerio:
# download the Phylostratigraphic Map of Danio rerio
# from Sestak and Domazet-Loso, 2015
download.file( url = "http://mbe.oxfordjournals.org/content/suppl/2014/11/17/msu319.DC1/TableS3-2.xlsx",
destfile = "MBE_2015a_Drerio_PhyloMap.xlsx" )
Read the *.xlsx
file storing the Phylostratigraphic Map
of D. rerio and format it for the use with myTAI
:
# install the readxl package
install.packages("readxl")
# load package readxl
library(readxl)
# read the excel file
DrerioPhyloMap.MBEa <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 1, skip = 4)
# format Phylostratigraphic Map for use with myTAI
Drerio.PhyloMap <- DrerioPhyloMap.MBEa[ , 1:2]
# have a look at the final format
head(Drerio.PhyloMap)
Phylostrata ZFIN_ID
1 1 ZDB-GENE-000208-13
2 1 ZDB-GENE-000208-17
3 1 ZDB-GENE-000208-18
4 1 ZDB-GENE-000208-23
5 1 ZDB-GENE-000209-3
6 1 ZDB-GENE-000209-4
Now, Drerio.PhyloMap
stores the Phylostratigraphic Map
of D. rerio which is used as background set to perform enrichment analyses with PlotEnrichment()
.
Now, the PlotEnrichment()
function visualizes the over- and underrepresented Phylostrata
of brain specific genes when compared with the total number of genes stored in the Phylostratigraphic Map
of D. rerio.
# read expression data (organ specific genes) from Sestak and Domazet-Loso, 2015
Drerio.OrganSpecificExpression <- read_excel("MBE_2015a_Drerio_PhyloMap.xlsx", sheet = 2, skip = 3)
# select only brain specific genes
Drerio.Brain.Genes <- unique(na.omit(Drerio.OrganSpecificExpression[ , "brain"]))
# visualize enriched Phylostrata of genes annotated as brain specific
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
use.only.map = TRUE,
legendName = "PS")
Here, the first argument is either a standard ExpressionSet
object (in case use.only.map = FALSE
: default) or a Phylostratigraphic Map
or Divergence Map
(in case use.only.map = TRUE
; see Introduction for details). The second argument test.set
specifies the gene ids also stored in the corresponding ExpressionSet
or Phylostratigraphic Map/ Divergence Map
for which enrichment shall be quantified and visualized.
To visualize the odds or log-odds of over- or underrepresented genes within the test.set
the following procedure is performed:
\(N_{ij}\) denotes the number of genes in group j and deriving from PS \(i\), with \(i = 1, .. , n\) and where \(j = 1\) denotes the background set and \(j = 2\) denotes the test.set
\(N_{i.}\) denotes the total number of genes within PS \(i\)
\(N_{.j}\) denotes the total number of genes within group \(j\)
\(N_{..}\) is the total number of genes within all groups \(j\) and all PS \(i\)
\(f_{ij}\) = \(N_{ij}\) / \(N_{..}\) and \(g_{ij}\) = \(f_{ij}\) / \(f_{.j}\) denote relative frequencies between groups
\(f_{i.}\) denotes the between group sum of \(f_{ij}\)
The result is the fold-change value (odds; measure = "foldchange"
) denoted as \(C_2 = g_{i2} / f_{i.}\) which is visualized above and below zero or the log fold-change value (log-odds; measure = "log-foldchange"
), where \(log_2\)(C) = \(log_2\)(\(g_{i2}\)) - \(log_2\)(\(f_{i.}\)) which is visualized symmetrically above and below zero by PlotEnrichment()
. Analogously, \(C_1 = g_{i1} / f_{i.}\) but is not visualized by this function.
Internally, PlotEnrichment()
performs a Fisher’s exact test for each Phylostratum
or Divergence Stratum
separately, to quantify the significance of over- or under-representation of corresponding Phylostrata
or Divergence Strata
within the test.set
when compared with the entire ExpressionSet
. PlotEnrichment()
visualizes significantly enriched (over- or underrepresented) Phylostrata
or Divergence Strata
with asterisks ’*’.
Notation:
Users will notice that when performing the PlotEnrichment()
function, the p-values and the enrichment matrix (storing \(C_1\) and \(C_2\)) will be returned.
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
use.only.map = TRUE,
legendName = "PS")
$p.values
PS1 PS2 PS3 PS4 PS5 PS6
8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01
PS7 PS8 PS9 PS10 PS11 PS12
2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01
PS13 PS14
3.929949e-03 1.593834e-05
$enrichment.matrix
BG_Set Test_Set
PS1 -0.001132832 0.007668216
PS2 0.023733936 -0.172380714
PS3 -0.040879607 0.250587496
PS4 -0.048920465 0.294399729
PS5 -0.114888949 0.603817643
PS6 0.008678915 -0.060350168
PS7 -0.062948352 0.367240944
PS8 0.115630474 -1.206210187
PS9 -0.007353969 0.048964218
PS10 -0.031971192 0.200141519
PS11 0.039742253 -0.303363314
PS12 -0.002418079 0.016311853
PS13 0.101449988 -0.984621732
PS14 0.098211044 -0.938724783
In case users are only interested in the p-values of the Fisher test and the enrichment matrix without illustrating the bar plot, they can specify the plot.bars = FALSE
argument to only retrieve the numeric results.
# specify plot.bars = FALSE to retrieve only numeric results
EnrichmentResult <- PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
use.only.map = TRUE,
legendName = "PS",
plot.bars = FALSE)
# access p-values quantifying the enrichment for each Phylostratum
EnrichmentResult$p.values
PS1 PS2 PS3 PS4 PS5 PS6
8.283490e-01 8.362880e-05 6.778981e-02 1.373816e-02 7.946309e-13 6.017041e-01
PS7 PS8 PS9 PS10 PS11 PS12
2.185021e-03 2.281194e-03 8.943147e-01 5.699612e-01 4.717058e-02 9.367759e-01
PS13 PS14
3.929949e-03 1.593834e-05
BG_Set Test_Set
PS1 -0.001132832 0.007668216
PS2 0.023733936 -0.172380714
PS3 -0.040879607 0.250587496
PS4 -0.048920465 0.294399729
PS5 -0.114888949 0.603817643
PS6 0.008678915 -0.060350168
PS7 -0.062948352 0.367240944
PS8 0.115630474 -1.206210187
PS9 -0.007353969 0.048964218
PS10 -0.031971192 0.200141519
PS11 0.039742253 -0.303363314
PS12 -0.002418079 0.016311853
PS13 0.101449988 -0.984621732
PS14 0.098211044 -0.938724783
The Fisher test which is performed inside PlotEnrichment()
assumes that all genes stored in the input ExpressionSet
or Phylostratigraphic Map
/Divergence Map
are used to define the background set for constructing the test statistic. However, since in most cases the test.set
is an subset of the input ExpressionSet
or Phylostratigraphic Map
/Divergence Map
one could also specify the complete.bg
argument to remove all test.set
genes from the background set when performing the Fisher test and visualization.
The following two examples allow users to compare the results when retaining all genes as background set compared with the option to remove test.set
genes from the background set.
# complete.bg = TRUE (default) -> retain test.set genes in background set
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
complete.bg = TRUE,
use.only.map = TRUE,
legendName = "PS")
# complete.bg = FALSE -> remove test.set genes from background set
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
complete.bg = FALSE,
use.only.map = TRUE,
legendName = "PS")
Users will notice that although some p-values change, the qualitative result did not change. In border line cases however, the results might influence whether or not some Phylostrata
or Divergence Strata
are denoted as significantly enriched or not. So always be aware of the interpretation when retaining or removing the test.set
from the background set, because both options are valid and have advantages and disadvantages and depend on a valid interpretation.
For the D. rerio brain genes example you can see that PS4, PS5, and PS7 are significantly over-represented in the set of brain specific genes.
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "foldchange",
complete.bg = TRUE,
use.only.map = TRUE,
legendName = "PS")
Again, we retrieve the D. rerio specific taxonomy represented by PS1-14 using the taxonomy()
funtion (see Introduction and Taxonomy for details).
id name rank
1 cellular organisms no rank 131567
2 Eukaryota superkingdom 2759
3 Opisthokonta no rank 33154
4 Metazoa kingdom 33208
5 Eumetazoa no rank 6072
6 Bilateria no rank 33213
7 Deuterostomia no rank 33511
8 Chordata phylum 7711
9 Craniata subphylum 89593
10 Vertebrata no rank 7742
11 Gnathostomata no rank 7776
12 Teleostomi no rank 117570
13 Euteleostomi no rank 117571
14 Actinopterygii superclass 7898
15 Actinopteri class 186623
16 Neopterygii subclass 41665
17 Teleostei infraclass 32443
18 Osteoglossocephalai no rank 1489341
19 Clupeocephala no rank 186625
20 Otomorpha no rank 186634
21 Ostariophysi no rank 32519
22 Otophysa no rank 186626
23 Cypriniphysae superorder 186627
24 Cypriniformes order 7952
25 Cyprinoidea superfamily 30727
26 Cyprinidae family 7953
27 Danio genus 7954
28 Danio rerio species 7955
Sestak and Domazet-Loso, 2015 collapsed these 28 taxonomic nodes into 14 taxonomic nodes (see Figure 2
in Sestak and Domazet-Loso, 2015) and labelled them as phylostrata 1 to phylostrata 14, where PS1 represents cellular organisms
and PS14 represents D. rerio
specific genes. Based on the phylostratum categorization of Sestak and Domazet-Loso, 2015, PS4 represents Holozoa (= Metazoa + allies)
, PS5 represents Metazoa
, and PS7 represents Bilateria
.
Now, the over-representation results of brain specific genes returned by PlotEnrichment()
provide evidence, that brain specific genes might indeed have originated during the emergence of the nervous system at the metazoan-eumetazoan transition leading to the interpretation that the vertebrate brain has a step wise adaptive history where most of its extant organization was already present in the chordate ancestor as argued by Sestak and Domazet-Loso, 2015.
This example shall illustrate how the PlotEnrichment()
function can be used to trace the evolutionary origin of tissue or process specific genes by investigating their age enrichment.
In case users have an ExpressionSet
storing the Phylostratigraphic Map
of D. rerio as well as an expression set, they can furthermore use the PlotGeneSet()
function implemented in myTAI
to visualize the expression levels of brain specific genes which have been shown to be significantly enriched in Metazoa
specific phylostrata.
Example:
# the best parameter setting to visualize this plot:
# png("DrerioBrainSpecificGeneExpression.png",700,400)
PlotGeneSet(ExpressionSet = DrerioPhyloExpressionSet,
gene.set = Drerio.Brain.Genes,
plot.legend = FALSE,
type = "l",
lty = 1,
lwd = 4,
xlab = "Ontogeny",
ylab = "Expression Level")
# dev.off()
Here DrerioPhyloExpressionSet
denotes a hypothetical ExpressionSet
of D. rerio development.
Additionally, the SelectGeneSet()
function allows users to obtain the ExpresisonSet
subset of selected genes (gene.set
) for subsequent analyses.
# select the ExpressionSet subset of Brain specific genes
Brain.PhyloExpressionSet <- SelectGeneSet( ExpressionSet = DrerioPhyloExpressionSet,
gene.set = Drerio.Brain.Genes )
head(Brain.PhyloExpressionSet)
In case a large number of Phylostrata or Divergence Strata is included in the input ExpressionSet
, p-values returned by PlotEnrichment()
should be adjusted for multiple comparisons. For this purpose PlotEnrichment()
includes the argument p.adjust.method
. Here, all methods implemented in ?p.adjust
can be specified:
# adjust p-values for multiple comparisons with Benjamini & Hochberg (1995)
PlotEnrichment(Drerio.PhyloMap,
test.set = Drerio.Brain.Genes,
measure = "log-foldchange",
complete.bg = FALSE,
use.only.map = TRUE,
legendName = "PS",
p.adjust.method = "BH")
Please consult these reviews (Biostatistics Handbook, Gelman et al., 2008, and Slides) to decide whether or not to apply p-value adjustment to your own dataset.
Sestak MS and Domazet-Loso T. Phylostratigraphic Profiles in Zebrafish Uncover Chordate Origins of the Vertebrate Brain. Mol. Biol. Evol. (2015) 32 (2): 299-312.