It can be a bit fiddly to get a phylogenetic dataset into R, particularly if you are not used to working with files in the Nexus format.
First off, make sure that you are comfortable telling R where to find a file.
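As a minimal sketch of what this involves, the base R functions below build a path, confirm that the file exists, and show the working directory against which relative paths are resolved. The file is a throwaway created with tempfile() purely so that the example runs anywhere; the variable name same_file is illustrative only. Substitute the path to your own dataset.

```r
# Create a small placeholder file so the example is self-contained;
# in practice `filename` would point to your own dataset.
filename <- tempfile(fileext = ".nex")
writeLines("#NEXUS", filename)

# Relative paths are resolved against the working directory
getwd()

# Check that R can see the file before trying to parse it
file.exists(filename)

# file.path() joins folder and file names with a suitable separator
same_file <- file.path(dirname(filename), basename(filename))
file.exists(same_file)
```

If file.exists() returns FALSE for your own path, the usual culprits are a typo in the path or a working directory that is not what you expected.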
Then you are ready to load raw data:
If your data is in an Excel spreadsheet, one way to load it into R is using the xlsx package. First you’ll have to install it:
install.packages('xlsx') # You only need to do this once
Then you should prepare your Excel spreadsheet such that each row corresponds to a taxon, and each column to a character.
Then you can read the data from the Excel file by telling R which sheet, rows and columns contain your data:
library('xlsx')
raw_data <- as.matrix(read.xlsx(filename,
  sheetIndex = 1, # Loads sheet number 1 from the excel file
  rowIndex = 2:21, # Extracts rows 2 to 21
  colIndex = 2:26, # Extracts columns B to Z
  header = FALSE
))
# In this example, the names of taxa are in column 1
taxon_names <- read.xlsx(filename, sheetIndex = 1, rowIndex = 2:21,
                         colIndex = 1, as.data.frame = FALSE)
rownames(raw_data) <- taxon_names
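Before moving on, it may be worth checking that the matrix has the shape you expect. The sketch below uses base R only, with a small made-up matrix standing in for the values read from Excel; the taxon names and codings are illustrative, not real data.

```r
# Hypothetical stand-in for the matrix read from the spreadsheet:
# three taxa (rows) scored for four characters (columns)
raw_data <- matrix(c("0", "1", "?", "-",
                     "1", "1", "0", "-",
                     "0", "0", "1", "2"),
                   nrow = 3, byrow = TRUE)
taxon_names <- c("Taxon_A", "Taxon_B", "Taxon_C")
rownames(raw_data) <- taxon_names

# One row per taxon, one column per character
dim(raw_data)

# No taxon should appear twice
anyDuplicated(rownames(raw_data)) == 0

# Tabulate the tokens present, to spot typos or stray blank cells
token_counts <- table(raw_data, useNA = "ifany")
token_counts
```

An unexpected token in the table (a stray space, say, or an NA from an empty cell) usually means that the row and column ranges need adjusting.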
TreeTools contains an inbuilt Nexus parser:
raw_data <- ReadCharacters(filename)
# Or, to go straight to PhyDat format:
as_phydat <- ReadAsPhyDat(filename)
This will extract character names and codings from a dataset. It’s been written to work with datasets downloaded from MorphoBank, but my aim is for this function to handle most valid (and many invalid) NEXUS files. If you find a file that this function can’t handle, please let me know and I’ll try to fix it.
In the meantime, alternative Nexus parsers are available: try
raw_data <- ape::read.nexus.data(filename)
Non-standard elements of a Nexus file might be beyond the capabilities of ape’s parser. In particular, you will need to replace spaces in taxon names with an underscore, and to arrange all data into a single block starting BEGIN DATA. You’ll need to strip out comments, character definitions and separate taxon blocks.
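Note that ape’s read.nexus.data returns a named list with one character vector per taxon, rather than a matrix. A minimal sketch of turning such a list into a taxon-by-character matrix follows; the list is built by hand here, with made-up taxa and the illustrative name nexus_list, so that the example runs without a Nexus file.

```r
# Hand-built stand-in for the output of ape::read.nexus.data():
# a named list holding one vector of tokens per taxon
nexus_list <- list(
  Taxon_A = c("0", "1", "?"),
  Taxon_B = c("1", "1", "0"),
  Taxon_C = c("0", "-", "1")
)

# Every taxon should be scored for the same number of characters
stopifnot(length(unique(lengths(nexus_list))) == 1)

# Stack the vectors so that each row is a taxon and each column a character
raw_data <- do.call(rbind, nexus_list)
raw_data
```

The names of the list become the row names of the matrix, so taxon labels are carried across without further work.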
The function readNexus in package phylobase uses the NCL library and promises to be more powerful, but I’ve not been able to get it to work.
A TNT format dataset downloaded from MorphoBank can be parsed with ReadTntCharacters, which might also handle other TNT-compatible files. If there’s a file that’s not being read correctly, please let me know and I’ll try to fix it.
raw_data <- ReadTntCharacters(filename)
# Or, to go straight to PhyDat format:
my_data <- ReadTntAsPhyDat(filename)
The next stage is to get the raw data into a format that most R packages can understand. If you’ve used the ReadAsPhyDat or ReadTntAsPhyDat functions, then you can skip this step: you’re already there.
Otherwise, you can try
my_data <- PhyDat(raw_data)
or if that doesn’t work,
my_data <- MatrixToPhyDat(raw_data)
These functions are pretty robust, but might return an error when they encounter an unexpected dataset format. If they don’t work on your dataset, please let me know.
Failing that, you can enlist the help of the ‘phangorn’ package:
install.packages('phangorn')
library('phangorn')
my_data <- phyDat(raw_data, type='USER', levels=c(0:9, '-'))
type='USER' tells the parser to expect morphological data.
The levels parameter simply lists all the states that any character might take. 0:9 includes all the integer digits from 0 to 9. If you have inapplicable data in your matrix, you should list - as a separate level, as it represents an additional state (as handled by the Morphy implementation of Brazeau, Guillerme, & Smith, 2019). If you have more complicated ambiguities, you may need to use a contrast matrix to decode your matrix.
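One way to tell whether a simple list of levels will suffice is to look for tokens in the matrix that the levels do not cover. The sketch below does this in base R with a small made-up matrix; the names state_levels, known and unexpected are illustrative, and treating ‘?’ as an already-understood missing-data token is an assumption that you should check against the parser you use.

```r
# Hypothetical matrix of tokens; substitute your own `raw_data`
raw_data <- matrix(c("0", "1", "-",
                     "1", "A", "?"),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Taxon_A", "Taxon_B"), NULL))

# The states that would be passed as `levels`, as character strings
state_levels <- c(as.character(0:9), "-")

# Assumption: '?' is handled separately as the missing-data token
known <- c(state_levels, "?")

# Tokens in the matrix that nothing above accounts for
unexpected <- setdiff(unique(as.vector(raw_data)), known)
unexpected
```

Any token reported here (in this made-up case, ‘A’) is a sign that a contrast matrix will be needed.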
A contrast matrix translates the tokens used in your dataset to the character states to which they correspond: for example decoding ‘A’ to {01}. For more details, see the ‘phangorn-specials’ vignette in the phangorn package, accessible by typing ‘?phangorn’ in the R prompt and navigating to index > package vignettes.
contrast.matrix <- matrix(data = c(
# 0 1 -  # Each column corresponds to a character-state
  1,0,0, # Each row corresponds to a token, here 0, denoting the
         # character-state set {0}
  0,1,0, # 1 | {1}
  0,0,1, # - | {-}
  1,1,0, # A | {01}
  1,1,0, # + | {01}
  1,1,1  # ? | {01-}
), ncol = 3, # ncol should correspond to the number of columns in the matrix
byrow = TRUE)
dimnames(contrast.matrix) <- list(
c(0, 1, '-', 'A', '+', '?'), # A list of the tokens corresponding to each row
# in the contrast matrix
c(0, 1, '-') # A list of the character-states corresponding to the columns
# in the contrast matrix
)
contrast.matrix
##   0 1 -
## 0 1 0 0
## 1 0 1 0
## - 0 0 1
## A 1 1 0
## + 1 1 0
## ? 1 1 1
If you need to use a contrast matrix, convert the data using
my.phyDat <- phyDat(my.data, type='USER', contrast=contrast.matrix)
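Typing out a contrast matrix by hand becomes error-prone when there are many tokens. As an alternative sketch, which is not part of phangorn or TreeTools, the matrix can be generated in base R from a list that maps each token to the set of states it denotes. The names states, token_sets and built are illustrative.

```r
# Character states (columns) and the set of states each token stands for
states <- c("0", "1", "-")
token_sets <- list(
  "0" = "0",
  "1" = "1",
  "-" = "-",
  "A" = c("0", "1"),
  "+" = c("0", "1"),
  "?" = c("0", "1", "-")
)

# One row per token: 1 where the state belongs to the token's set, else 0
built <- t(vapply(token_sets,
                  function(set) as.numeric(states %in% set),
                  numeric(length(states))))
colnames(built) <- states
built
```

The resulting matrix has the same rows, columns and entries as the hand-written contrast.matrix above, and could be supplied to the contrast argument in the same way.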
You might want to:
Load a phylogenetic tree into R.
Conduct parsimony search using Brazeau, Guillerme & Smith’s (2019) approach to inapplicable data, or using Profile parsimony (Faith & Trueman, 2001).