Retrieving Taxonomic Information

2021-02-23

This vignette will introduce users to the retrieval of taxonomic information with myTAI. The taxonomy() function implemented in myTAI relies on the powerful package taxize. Nevertheless, taxonomic information retrieval has been customized for the myTAI standard and for organism specific information retrieval.

Specifically, the taxonomy() function implemented in myTAI can be used to classify genomes according to phylogenetic classification into Phylostrata (Phylostratigraphy) or to retrieve species specific taxonomic information when performing Divergence Stratigraphy (see Introduction for details).

For larger taxonomy queries it may be useful to create an NCBI Account and set up an ENTREZ API KEY.

# install.packages(c("taxize", "usethis"))
taxize::use_entrez()
# Create your key from your (brand-new) account's. 
# After generating your key set it as ENTREZ_KEY in .Renviron.
# ENTREZ_KEY='youractualkeynotthisstring'
# For that, use usethis::edit_r_environ()
usethis::edit_r_environ()

Taxonomic Information Retrieval

The myTAI package provides the taxonomy() function to retrieve taxonomic information.

In the following example we will obtain the taxonomic hierarchy of Arabidopsis thaliana from NCBI Taxonomy.

# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from NCBI Taxonomy
myTAI::taxonomy( organism = "Arabidopsis thaliana", 
          db       = "ncbi",
          output   = "classification" )
                   name         rank      id
1    cellular organisms      no rank  131567
2             Eukaryota superkingdom    2759
3         Viridiplantae      kingdom   33090
4          Streptophyta       phylum   35493
5        Streptophytina      no rank  131221
6           Embryophyta      no rank    3193
7          Tracheophyta      no rank   58023
8         Euphyllophyta      no rank   78536
9         Spermatophyta      no rank   58024
10        Magnoliophyta      no rank    3398
11      Mesangiospermae      no rank 1437183
12       eudicotyledons      no rank   71240
13           Gunneridae      no rank   91827
14         Pentapetalae      no rank 1437201
15               rosids     subclass   71275
16              malvids      no rank   91836
17          Brassicales        order    3699
18         Brassicaceae       family    3700
19           Camelineae        tribe  980083
20          Arabidopsis        genus    3701
21 Arabidopsis thaliana      species    3702

The organism argument takes the scientific name of a query organism, the db argument specifies that database from which the corresponding taxonomic information shall be retrieved, e.g. ncbi (NCBI Taxonomy) and itis (Integrated Taxonomic Information System) and the output argument specifies the type of taxonomic information that shall be returned for the query organism, e.g. classification, taxid, or children.

The output of classification is a data.frame storing the taxonomic hierarchy of Arabidopsis thaliana starting with cellular organisms up to Arabidopsis thaliana. The first column stores the taxonomic name, the second column the taxonomic rank, and the third column the NCBI Taxonomy id for corresponding taxa.

Analogous classification information can be obtained from different databases.

# retrieving the taxonomic hierarchy of "Arabidopsis thaliana"
# from the Integrated Taxonomic Information System
myTAI::taxonomy( organism = "Arabidopsis thaliana", 
          db       = "itis",
          output   = "classification" )
              name          rank     id
1          Plantae       Kingdom 202422
2    Viridiplantae    Subkingdom 954898
3     Streptophyta  Infrakingdom 846494
4      Embryophyta Superdivision 954900
5     Tracheophyta      Division 846496
6  Spermatophytina   Subdivision 846504
7    Magnoliopsida         Class  18063
8          Rosanae    Superorder 846548
9      Brassicales         Order 822943
10    Brassicaceae        Family  22669
11     Arabidopsis         Genus  23040

The output argument allows you to directly access taxonomy ids for a query organism or species.

# retrieving the taxonomy id of the query organism from NCBI Taxonomy
myTAI::taxonomy( organism = "Arabidopsis thaliana", 
          db       = "ncbi", 
          output   = "taxid" )
    id
1 3702
# retrieving the taxonomy id of the query organism from Integrated Taxonomic Information Service
myTAI::taxonomy( organism = "Arabidopsis", 
          db       = "itis", 
          output   = "taxid" )
    id
1 23040

So far, the following data bases can be accesses to retrieve taxonomic information:

Retrieve Children Nodes

Another output supported by taxonomy() is children that returns the immediate children taxa for a query organism. This feature is useful to determine species relationships for quantifying recent evolutionary conservation with Divergence Stratigraphy.

# retrieve children taxa of the query organism stored in the correspondning database
myTAI::taxonomy( organism = "Arabidopsis", 
          db       = "ncbi", 
          output   = "children" )
   childtaxa_id                                                     childtaxa_name childtaxa_rank
1       1547872                                              Arabidopsis umezawana        species
2       1328956 (Arabidopsis thaliana x Arabidopsis arenosa) x Arabidopsis suecica        species
3       1240361                         Arabidopsis thaliana x Arabidopsis arenosa        species
4        869750                          Arabidopsis thaliana x Arabidopsis lyrata        species
5        412662                                            Arabidopsis pedemontana        species
6        378006                         Arabidopsis arenosa x Arabidopsis thaliana        species
7        347883                                              Arabidopsis arenicola        species
8        302551                                              Arabidopsis petrogena        species
9         97980                                               Arabidopsis croatica        species
10        97979                                            Arabidopsis cebennensis        species
11        81970                                                Arabidopsis halleri        species
12        59690                                             Arabidopsis kamchatica        species
13        59689                                                 Arabidopsis lyrata        species
14        45251                                               Arabidopsis neglecta        species
15        45249                                                Arabidopsis suecica        species
16        38785                                                Arabidopsis arenosa        species
17         3702                                               Arabidopsis thaliana        species
# retrieve children taxa of the query organism stored in the correspondning database
myTAI::taxonomy( organism = "Arabidopsis", 
          db       = "itis", 
          output   = "children" )
   parentname parenttsn rankname             taxonname    tsn
1 Arabidopsis     23040  Species  Arabidopsis thaliana  23041
2 Arabidopsis     23040  Species Arabidopsis arenicola 823113
3 Arabidopsis     23040  Species   Arabidopsis arenosa 823130
4 Arabidopsis     23040  Species    Arabidopsis lyrata 823171

These results allow us to choose subject organisms for Divergence Stratigraphy.