This vignette will introduce users to the retrieval of taxonomic information with myTAI
. The taxonomy()
function implemented in myTAI
relies on the powerful package taxize. Nevertheless, taxonomic information retrieval has been customized for the myTAI
standard and for organism specific information retrieval.
Specifically, the taxonomy()
function implemented in myTAI
can be used to classify genomes according to phylogenetic classification into Phylostrata (Phylostratigraphy) or to retrieve species specific taxonomic information when performing Divergence Stratigraphy (see Introduction for details).
For larger taxonomy queries it may be useful to create an NCBI Account and set up an ENTREZ API KEY.
# install.packages(c("taxize", "usethis"))
taxize::use_entrez()
# Create your key from your (brand-new) account's.
# After generating your key set it as ENTREZ_KEY in .Renviron.
# ENTREZ_KEY='youractualkeynotthisstring'
# For that, use usethis::edit_r_environ()
usethis::edit_r_environ()
The myTAI
package provides the taxonomy()
function to retrieve taxonomic information.
In the following example we will obtain the taxonomic hierarchy of Arabidopsis thaliana
from NCBI Taxonomy.
# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from NCBI Taxonomy
myTAI::taxonomy( organism = "Arabidopsis thaliana",
db = "ncbi",
output = "classification" )
name rank id
1 cellular organisms no rank 131567
2 Eukaryota superkingdom 2759
3 Viridiplantae kingdom 33090
4 Streptophyta phylum 35493
5 Streptophytina no rank 131221
6 Embryophyta no rank 3193
7 Tracheophyta no rank 58023
8 Euphyllophyta no rank 78536
9 Spermatophyta no rank 58024
10 Magnoliophyta no rank 3398
11 Mesangiospermae no rank 1437183
12 eudicotyledons no rank 71240
13 Gunneridae no rank 91827
14 Pentapetalae no rank 1437201
15 rosids subclass 71275
16 malvids no rank 91836
17 Brassicales order 3699
18 Brassicaceae family 3700
19 Camelineae tribe 980083
20 Arabidopsis genus 3701
21 Arabidopsis thaliana species 3702
The organism
argument takes the scientific name of a query organism, the db
argument specifies that database from which the corresponding taxonomic information shall be retrieved, e.g. ncbi
(NCBI Taxonomy) and itis
(Integrated Taxonomic Information System) and the output
argument specifies the type of taxonomic information that shall be returned for the query organism, e.g. classification
, taxid
, or children
.
The output of classification
is a data.frame
storing the taxonomic hierarchy of Arabidopsis thaliana
starting with cellular organisms
up to Arabidopsis thaliana
. The first column stores the taxonomic name, the second column the taxonomic rank, and the third column the NCBI Taxonomy id for corresponding taxa.
Analogous classification
information can be obtained from different databases.
# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from the Integrated Taxonomic Information System
myTAI::taxonomy( organism = "Arabidopsis thaliana",
db = "itis",
output = "classification" )
name rank id
1 Plantae Kingdom 202422
2 Viridiplantae Subkingdom 954898
3 Streptophyta Infrakingdom 846494
4 Embryophyta Superdivision 954900
5 Tracheophyta Division 846496
6 Spermatophytina Subdivision 846504
7 Magnoliopsida Class 18063
8 Rosanae Superorder 846548
9 Brassicales Order 822943
10 Brassicaceae Family 22669
11 Arabidopsis Genus 23040
The output
argument allows you to directly access taxonomy ids for a query organism or species.
# retrieving the taxonomy id of the query organism from NCBI Taxonomy
myTAI::taxonomy( organism = "Arabidopsis thaliana",
db = "ncbi",
output = "taxid" )
id
1 3702
# retrieving the taxonomy id of the query organism from Integrated Taxonomic Information Service
myTAI::taxonomy( organism = "Arabidopsis",
db = "itis",
output = "taxid" )
id
1 23040
So far, the following data bases can be accesses to retrieve taxonomic information:
db = "itis"
: Integrated Taxonomic Information Servicedb = "ncbi"
: National Center for Biotechnology InformationAnother output
supported by taxonomy()
is children
that returns the immediate children taxa for a query organism. This feature is useful to determine species relationships for quantifying recent evolutionary conservation with Divergence Stratigraphy.
# retrieve children taxa of the query organism stored in the correspondning database
myTAI::taxonomy( organism = "Arabidopsis",
db = "ncbi",
output = "children" )
childtaxa_id childtaxa_name childtaxa_rank
1 1547872 Arabidopsis umezawana species
2 1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica species
3 1240361 Arabidopsis thaliana x Arabidopsis arenosa species
4 869750 Arabidopsis thaliana x Arabidopsis lyrata species
5 412662 Arabidopsis pedemontana species
6 378006 Arabidopsis arenosa x Arabidopsis thaliana species
7 347883 Arabidopsis arenicola species
8 302551 Arabidopsis petrogena species
9 97980 Arabidopsis croatica species
10 97979 Arabidopsis cebennensis species
11 81970 Arabidopsis halleri species
12 59690 Arabidopsis kamchatica species
13 59689 Arabidopsis lyrata species
14 45251 Arabidopsis neglecta species
15 45249 Arabidopsis suecica species
16 38785 Arabidopsis arenosa species
17 3702 Arabidopsis thaliana species
# retrieve children taxa of the query organism stored in the correspondning database
myTAI::taxonomy( organism = "Arabidopsis",
db = "itis",
output = "children" )
parentname parenttsn rankname taxonname tsn
1 Arabidopsis 23040 Species Arabidopsis thaliana 23041
2 Arabidopsis 23040 Species Arabidopsis arenicola 823113
3 Arabidopsis 23040 Species Arabidopsis arenosa 823130
4 Arabidopsis 23040 Species Arabidopsis lyrata 823171
These results allow us to choose subject
organisms for Divergence Stratigraphy.