Introduction
This document illustrates some of the phangorn (Schliep 2011) specialized features which are useful but maybe not as well-known or just not (yet) described elsewhere. This is mainly interesting for someone who wants to explore different models or set up some simulation studies. We show how to construct data objects for different character states other than nucleotides or amino acids or how to set up different models to estimate transition rate.
The vignette Trees describes in detail how to estimate phylogenies from nucleotide or amino acids.
User defined data formats
To better understand how to define our own data type it is useful to
know a bit more about the internal representation of phyDat
objects. The internal representation of phyDat
object is
very similar to factor
objects.
As an example we will show here several possibilities to define nucleotide data with gaps defined as a fifth state. Ignoring gaps or coding them as ambiguous sites - as it is done in most programs, also in phangorn as default - may be misleading (see (Warnow 2012)). When the number of gaps is low and the gaps are missing at random coding gaps as separate state may be not important.
Let assume we have given a matrix where each row contains a character vector of a taxonomic unit:
library(phangorn)
## Lade nötiges Paket: ape
= matrix(c("r","a","y","g","g","a","c","-","c","t","c","g",
data "a","a","t","g","g","a","t","-","c","t","c","a",
"a","a","t","-","g","a","c","c","c","t","?","g"),
dimnames = list(c("t1", "t2", "t3"),NULL), nrow=3, byrow=TRUE)
data
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## t1 "r" "a" "y" "g" "g" "a" "c" "-" "c" "t" "c" "g"
## t2 "a" "a" "t" "g" "g" "a" "t" "-" "c" "t" "c" "a"
## t3 "a" "a" "t" "-" "g" "a" "c" "c" "c" "t" "?" "g"
Normally we would transform this matrix into an phyDat object and gaps are handled as ambiguous character like “?”.
= phyDat(data)
gapsdata1 gapsdata1
## 3 sequences with 12 character and 11 different site patterns.
## The states are a c g t
Now we will define a “USER” defined object and have to supply a vector levels of the character states for the new data, in our case the for nucleotide states and the gap. Additional we can define ambiguous states which can be any of the states.
= phyDat(data, type="USER", levels=c("a","c","g","t","-"),
gapsdata2 ambiguity = c("?", "n"))
## Warning in phyDat.default(data, levels = levels, return.index = return.index, :
## Found unknown characters (not supplied in levels). Deleted sites with unknown
## states.
gapsdata2
## 3 sequences with 10 character and 9 different site patterns.
## The states are a c g t -
This is not yet what we wanted as two sites of our alignment, which contain the ambiguous characters “r” and “y”, got deleted. To define ambiguous characters like “r” and “y” explicitly we have to supply a contrast matrix similar to contrasts for factors.
= matrix(data = c(1,0,0,0,0,
contrast 0,1,0,0,0,
0,0,1,0,0,
0,0,0,1,0,
1,0,1,0,0,
0,1,0,1,0,
0,0,0,0,1,
1,1,1,1,0,
1,1,1,1,1),
ncol = 5, byrow = TRUE)
dimnames(contrast) = list(c("a","c","g","t","r","y","-","n","?"),
c("a", "c", "g", "t", "-"))
contrast
## a c g t -
## a 1 0 0 0 0
## c 0 1 0 0 0
## g 0 0 1 0 0
## t 0 0 0 1 0
## r 1 0 1 0 0
## y 0 1 0 1 0
## - 0 0 0 0 1
## n 1 1 1 1 0
## ? 1 1 1 1 1
= phyDat(data, type="USER", contrast=contrast)
gapsdata3 gapsdata3
## 3 sequences with 12 character and 11 different site patterns.
## The states are a c g t -
Here we defined “n” as a state which can be any nucleotide but not a gap “-” and “?” can be any state including a gap.
These data can be used in all functions available in phangorn to compute distance matrices or perform parsimony and maximum likelihood analysis.
Markov models of character evolution
To model nucleotide substitutions across the edges of a tree T we can assign a transition matrix. In the case of nucleotides, with four character states, each 4 \(\times\) 4 transition matrix has, at most, 12 free parameters.
Time-reversible Markov models are used to describe how characters change over time, and use fewer parameters. Time-reversible means that these models need not be directed in time, and the Markov property states that these models depend only on the current state. These models are used in analysis of phylogenies using maximum likelihood and MCMC, computing pairwise distances, as well in simulating sequence evolution.
We will now describe the General Time-Reversible (GTR) model (Tavaré 1986). The parameters of the GTR model are the equilibrium frequencies \(\pi = (\pi_A ,\pi_C ,\pi_G ,\pi_T)\) and a rate matrix \(Q\) which has the form \[\begin{equation} Q = \begin{pmatrix} \ast & \alpha\pi_C & \beta\pi_G & \gamma\pi_T \\ \alpha\pi_A & \ast & \delta\pi_G & \epsilon\pi_T \\ \beta\pi_A & \delta\pi_C & \ast & \eta\pi_T \\ \gamma\pi_A & \epsilon\pi_C & \eta\pi_G & \ast \\ \end{pmatrix} (1) \end{equation}\]
where we need to assign 6 parameters \(\alpha, \dots, \eta\). The elements on the diagonal are chosen so that the rows sum to zero. The Jukes-Cantor (JC) (Jukes and Cantor 1969) model can be derived as special case from the GTR model, for equal equilibrium frequencies \(\pi_A = \pi_C = \pi_G = \pi_T = 0.25\) and equal rates set to \(\alpha = \beta = \gamma = \delta = \eta\). Table @ref(tab:models) lists all built-in nucleotide models in phangorn. The transition probabilities which describe the probabilities of change from character \(i\) to \(j\) in time \(t\), are given by the corresponding entries of the matrix exponential \[ P(t) = (p_{ij}(t)) = e^{Qt}, \qquad \sum_j p_{ij}=1 \] where \(P(t)\) is the transition matrix spanning a period of time \(t\).
Estimation of non-standard transition rate matrices
In the last section @ref(user) we described how to set up user defined data formats. Now we describe how to estimate transition matrices with pml.
Again for nucleotide data the most common models can be called
directly in the optim.pml
function (e.g. “JC69”, “HKY”,
“GTR” to name a few). Table 2 lists all the available nucleotide models,
which can estimated directly in optim.pml
. For amino acids
several transition matrices are available (“WAG”, “JTT”, “LG”,
“Dayhoff”, “cpREV”, “mtmam”, “mtArt”, “MtZoa”, “mtREV24”, “VT”,“RtREV”,
“HIVw”, “HIVb”, “FLU”, “Blosum62”, “Dayhoff_DCMut” and “JTT-DCMut”) or
can be estimated with optim.pml
. For example Mathews et
al. (2010) (Mathews, Clements, and Beilstein
2010) used this function to estimate a phytochrome amino acid
transition matrix.
We will now show how to estimate a rate matrix with different transition (\(\alpha\)) and transversion ratio (\(\beta\)) and a fixed rate to the gap state (\(\gamma\)) - a kind of Kimura two-parameter model (K81) for nucleotide data with gaps as fifth state (see table 1).
a | c | g | t | - | |
---|---|---|---|---|---|
a | |||||
c | \(\beta\) | ||||
g | \(\alpha\) | \(\beta\) | |||
t | \(\beta\) | \(\alpha\) | \(\beta\) | ||
- | \(\gamma\) | \(\gamma\) | \(\gamma\) | \(\gamma\) |
If we want to fit a non-standard transition rate matrices, we have to
tell optim.pml
, which transitions in Q get the same rate.
The parameter vector subs accepts a vector of consecutive integers and
at least one element has to be zero (these gets the reference rate of
1). Negative values indicate that there is no direct transition is
possible and the rate is set to zero.
library(ape)
<- unroot(rtree(3))
tree <- pml(tree, gapsdata3)
fit <- optim.pml(fit, optQ=TRUE, subs=c(1,0,1,2,1,0,2,1,2,2),
fit control=pml.control(trace=0))
fit
##
## model: Mk
##
## loglikelihood: -33.01
##
## unconstrained loglikelihood: -28.43
##
## Rate matrix:
## a c g t -
## a 0.000e+00 2.584e-06 1.000e+00 2.584e-06 0.6912
## c 2.584e-06 0.000e+00 2.584e-06 1.000e+00 0.6912
## g 1.000e+00 2.584e-06 0.000e+00 2.584e-06 0.6912
## t 2.584e-06 1.000e+00 2.584e-06 0.000e+00 0.6912
## - 6.912e-01 6.912e-01 6.912e-01 6.912e-01 0.0000
##
## Base frequencies:
## 0.2 0.2 0.2 0.2 0.2
Here are some conventions how the models are estimated:
If a model is supplied the base frequencies bf and rate matrix Q are optimized according to the model (nucleotides) or the adequate rate matrix and frequencies are chosen (for amino acids). If optQ=TRUE and neither a model or subs are supplied than a symmetric (optBf=FALSE) or reversible model (optBf=TRUE, i.e. the GTR for nucleotides) is estimated. This can be slow if the there are many character states, e.g. for amino acids. Table 2 shows how parameters are optimized and number of parameters to estimate. The elements of the vector subs correspond to \(\alpha, \dots, \eta\) in equation (1)
model | optQ | optBf | subs | df |
---|---|---|---|---|
JC | FALSE | FALSE | \(c(0, 0, 0, 0, 0, 0)\) | 0 |
F81 | FALSE | TRUE | \(c(0, 0, 0, 0, 0, 0)\) | 3 |
K80 | TRUE | FALSE | \(c(0, 1, 0, 0, 1, 0)\) | 1 |
HKY | TRUE | TRUE | \(c(0, 1, 0, 0, 1, 0)\) | 4 |
TrNe | TRUE | FALSE | \(c(0, 1, 0, 0, 2, 0)\) | 2 |
TrN | TRUE | TRUE | \(c(0, 1, 0, 0, 2, 0)\) | 5 |
TPM1 | TRUE | FALSE | \(c(0, 1, 2, 2, 1, 0)\) | 2 |
K81 | TRUE | FALSE | \(c(0, 1, 2, 2, 1, 0)\) | 2 |
TPM1u | TRUE | TRUE | \(c(0, 1, 2, 2, 1, 0)\) | 5 |
TPM2 | TRUE | FALSE | \(c(1, 2, 1, 0, 2, 0)\) | 2 |
TPM2u | TRUE | TRUE | \(c(1, 2, 1, 0, 2, 0)\) | 5 |
TPM3 | TRUE | FALSE | \(c(1, 2, 0, 1, 2, 0)\) | 2 |
TPM3u | TRUE | TRUE | \(c(1, 2, 0, 1, 2, 0)\) | 5 |
TIM1e | TRUE | FALSE | \(c(0, 1, 2, 2, 3, 0)\) | 3 |
TIM1 | TRUE | TRUE | \(c(0, 1, 2, 2, 3, 0)\) | 6 |
TIM2e | TRUE | FALSE | \(c(1, 2, 1, 0, 3, 0)\) | 3 |
TIM2 | TRUE | TRUE | \(c(1, 2, 1, 0, 3, 0)\) | 6 |
TIM3e | TRUE | FALSE | \(c(1, 2, 0, 1, 3, 0)\) | 3 |
TIM3 | TRUE | TRUE | \(c(1, 2, 0, 1, 3, 0)\) | 6 |
TVMe | TRUE | FALSE | \(c(1, 2, 3, 4, 2, 0)\) | 4 |
TVM | TRUE | TRUE | \(c(1, 2, 3, 4, 2, 0)\) | 7 |
SYM | TRUE | FALSE | \(c(1, 2, 3, 4, 5, 0)\) | 5 |
GTR | TRUE | TRUE | \(c(1, 2, 3, 4, 5, 0)\) | 8 |
Codon substitution models
A special case of the transition rates are codon models._phangorn_
now offers the possibility to estimate the \(d_N/d_S\) ratio (sometimes called ka/ks),
for an overview see (Yang 2014). These
functions extend the option to estimates the \(d_N/d_S\) ratio for pairwise sequence
comparison as it is available through the function kaks
in
seqinr. The transition rate between between codon \(i\) and \(j\) is defined as follows: \[\begin{eqnarray}
q_{ij}=\left\{
\begin{array}{l@{\quad}l}
0 & \textrm{if i and j differ in more than one position} \\
\pi_j & \textrm{for synonymous transversion} \\
\pi_j\kappa & \textrm{for synonymous transition} \\
\pi_j\omega & \textrm{for non-synonymous transversion} \\
\pi_j\omega\kappa & \textrm{for non synonymous transition}
\end{array}
\right. \nonumber
\end{eqnarray}\] where \(\omega\) is the \(d_N/d_S\) ratio, \(\kappa\) the transition transversion ratio
and \(\pi_j\) is the the equilibrium
frequencies of codon \(j\). For \(\omega\sim1\) the an amino acid change is
neutral, for \(\omega < 1\)
purifying selection and \(\omega >
1\) positive selection. There are four models available:
“codon0”, where both parameter \(\kappa\) and \(\omega\) are fixed to 1, “codon1” where
both parameters are estimated and “codon2” or “codon3” where \(\kappa\) or \(\omega\) is fixed to 1.
We compute the \(d_N/d_S\) for some
sequences given a tree using the ML functions pml
and
optim.pml
. First we have to transform the the nucleotide
sequences into codons (so far the algorithms always takes triplets).
library(phangorn)
<- system.file("extdata/trees", package = "phangorn")
fdir <- read.phyDat(file.path(fdir, "primates.dna"),
primates format = "interleaved")
<- NJ(dist.ml(primates))
tree <- dna2codon(primates)
dat <- pml(tree, dat, bf="F3x4")
fit <- optim.pml(fit, model="codon0", control=pml.control(trace=0))
fit0 <- optim.pml(fit, model="codon1", control=pml.control(trace=0))
fit1 <- optim.pml(fit, model="codon2", control=pml.control(trace=0))
fit2 <- optim.pml(fit, model="codon3", control=pml.control(trace=0))
fit3 anova(fit0, fit2, fit3, fit1)
## Likelihood Ratio Test Table
## Log lik. Df Df change Diff log lik. Pr(>|Chi|)
## 1 -2867 34
## 2 -2866 35 1 2 0.14
## 3 -2522 35 0 688 <2e-16 ***
## 4 -2521 36 1 2 0.20
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are several ways to estimate the codon frequencies \(\pi_j\). The simplest model is to assume
they have equal frequencies (=1/61). A second is to use the empirical
codon frequencies, either computed using baseFreq
or using
the argument bf="empirical"
in pml
. Last but
not least the frequencies can be derived from the base frequencies at
each codon position, the F3x4 model is set by the argument
bf="F3x4"
. One can estimate the codon frequencies setting
the option to optBf=TRUE
in optim.pml
. As the
convergence of the 60 parameters the convergence is likely slow set the
maximal iterations to a higher value than the default
(e.g. control = pml.control(maxit=50)
). Similar parameters
of the F3x4 can also estimated using ML optF3x4=TRUE
instead.
The “YN98” model (Yang and Nielsen 1998) is similar to the “codon1”, but additional optimizes the codon frequencies.
Generating trees
phangorn has several functions to generate tree topologies,
which may are interesting for simulation studies. allTrees
computes all possible bifurcating tree topologies either rooted or
unrooted for up to 10 taxa. One has to keep in mind that the number of
trees is growing exponentially, use howmanytrees
from
ape as a reminder.
<- allTrees(5)
trees par(mfrow=c(3,5), mar=rep(0,4))
for(i in 1:15)plot(trees[[i]], cex=1, type="u")
nni
returns a list of all trees which are one nearest
neighbor interchange away.
nni(trees[[1]])
## 4 phylogenetic trees
rNNI
and rSPR
generate trees which are a
defined number of NNI (nearest neighbor interchange) or SPR (subtree
pruning and regrafting) away.
Session info
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=de_AT.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_AT.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=de_AT.UTF-8
## [7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] phangorn_2.9.0 ape_5.6-2
##
## loaded via a namespace (and not attached):
## [1] igraph_1.3.1 Rcpp_1.0.8.3 rstudioapi_0.13 knitr_1.39
## [5] magrittr_2.0.3 lattice_0.20-45 R6_2.5.1 quadprog_1.5-8
## [9] rlang_1.0.2 fastmatch_1.1-3 fastmap_1.1.0 highr_0.9
## [13] stringr_1.4.0 tools_4.2.0 parallel_4.2.0 grid_4.2.0
## [17] nlme_3.1-157 xfun_0.31 cli_3.3.0 jquerylib_0.1.4
## [21] htmltools_0.5.2 yaml_2.3.5 digest_0.6.29 Matrix_1.4-1
## [25] codetools_0.2-18 sass_0.4.1 prettydoc_0.4.1 evaluate_0.15
## [29] rmarkdown_2.14 stringi_1.7.6 compiler_4.2.0 bslib_0.3.1
## [33] generics_0.1.2 jsonlite_1.8.0 pkgconfig_2.0.3