In this section we will start exploring what is stored inside the pagoo
object and how we can access this information. Keep in mind that this object has its own associated data and methods that can be easily queried with the $
operator. These methods allow for the rapid subsetting, extraction and visualization of pangenome data.
First of all, we will load a pangenome using a toy dataset included in the package. This is a preloaded set of 10 Campylobacter spp. genomes, with metadata associated.
library(pagoo) # Load package
<- system.file('extdata', 'campylobacter.RDS', package = 'pagoo')
rds <- load_pangenomeRDS(rds) p
A pangenome can be stratified in different gene subsets according to their frequency in the dataset. The core genes
can be defined as those present in all or almost every genome (typically 95-100%). The remaining genes are defined as the accessory genome, that can be subdivided in cloud genes
or singletons (present in one genome or in genomes that are identical) and shell genes
which are those in the middle. Let’s see this using pagoo
:
$summary_stats p
## DataFrame with 4 rows and 2 columns
## Category Number
## <character> <integer>
## 1 Total 2588
## 2 Core 1627
## 3 Shell 413
## 4 Cloud 548
The core level
defines the minimum number of genomes (as a percentage) in which a certain gene should be present to be considered a core gene. By default, pagoo
considers as core all genes present in at least 95% of organisms. The core level can be modified to be more or less stringent defining the core genome. This feature exemplifies R6’s reference semantics, since modifying the core level will affect the pangenome object state resulting in different core, shell and cloud sets. Have a look:
$core_level p
## [1] 95
$core_level <- 100 # Change value
p$summary_stats # Updated object p
## DataFrame with 4 rows and 2 columns
## Category Number
## <character> <integer>
## 1 Total 2588
## 2 Core 1554
## 3 Shell 486
## 4 Cloud 548
As you can see, changing the core level from 95% to a more stringent 100% cause in decrease in the number of core genes from 1627 to 1554, and a concomitant increase in shell genes from 413 to 486. This means that 73 genes migrated from core to shell when increasing the threshold to consider a cluster as “core” to 100%. Now this changes remain in the object for subsequent analysis, or can be reverted by setting the core level again at the original value.
$core_level <- 95 p
The pangenome matrix is one of the most useful things when analyzing pangenomes. Typically, it represents organisms in rows and clusters of orthologous in columns informing about gene abundance (considering paralogues). The pangenome matrix looks like this (printing only first 5 columns):
$pan_matrix[, 1:5] p
## group0001 group0002 group0003 group0004 group0005
## 16244_6_6 1 1 1 1 1
## 16244_6_18 1 1 1 1 1
## 17059_2_16 1 1 1 1 1
## 17059_2_23 1 1 1 1 0
## 17059_2_27 1 1 1 1 1
## 17150_1_73 1 1 1 1 1
## 17059_2_42 1 1 1 1 0
Individual gene metadata can be accessed by using the $genes
suffix. It always contains the gene name, the organism to which it belongs, the cluster to where it was assigned, and a gene identifier (gid
) that is mainly used internally to organize the data. Also, it may typically include annotation data, genomic coordinates, etc, but this other metadata is optional. Gene metadata is spitted by cluster, so it consist in a List
of DataFrame
s.
$genes p
## SplitDataFrameList of length 2588
## $group0001
## DataFrame with 7 rows and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0001 16244_6_6 16244_6_6_00150 16244_6_6__16244_6_6.. ilvC
## 2 group0001 16244_6_18 16244_6_18_00172 16244_6_18__16244_6_.. ilvC
## 3 group0001 17059_2_16 17059_2_16_01012 17059_2_16__17059_2_.. ilvC
## 4 group0001 17059_2_23 17059_2_23_01040 17059_2_23__17059_2_.. ilvC
## 5 group0001 17059_2_27 17059_2_27_00492 17059_2_27__17059_2_.. ilvC
## 6 group0001 17150_1_73 17150_1_73_00221 17150_1_73__17150_1_.. ilvC
## 7 group0001 17059_2_42 17059_2_42_00176 17059_2_42__17059_2_.. ilvC
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 ketol-acid reductois.. ERS672247|SC|contig0.. 137333 138355 +
## 2 ketol-acid reductois.. ERS672259|SC|contig0.. 173541 174563 +
## 3 ketol-acid reductois.. ERS739235|SC|contig0.. 43645 44667 -
## 4 ketol-acid reductois.. ERS739242|SC|contig0.. 173578 174600 +
## 5 ketol-acid reductois.. ERS739246|SC|contig0.. 184035 185057 -
## 6 ketol-acid reductois.. ERS686652|SC|contig0.. 207206 208228 +
## 7 ketol-acid reductois.. ERS739261|SC|contig0.. 173569 174591 +
##
## $group0002
## DataFrame with 7 rows and 10 columns
## cluster org gene gid geneName
## <factor> <factor> <factor> <character> <character>
## 1 group0002 16244_6_6 16244_6_6_01290 16244_6_6__16244_6_6.. hprA
## 2 group0002 16244_6_18 16244_6_18_01307 16244_6_18__16244_6_.. hprA
## 3 group0002 17059_2_16 17059_2_16_01627 17059_2_16__17059_2_.. hprA
## 4 group0002 17059_2_23 17059_2_23_00348 17059_2_23__17059_2_.. hprA
## 5 group0002 17059_2_27 17059_2_27_01208 17059_2_27__17059_2_.. hprA
## 6 group0002 17150_1_73 17150_1_73_01398 17150_1_73__17150_1_.. hprA
## 7 group0002 17059_2_42 17059_2_42_00751 17059_2_42__17059_2_.. hprA
## product contig from to strand
## <character> <character> <integer> <integer> <character>
## 1 2-hydroxyacid dehydr.. ERS672247|SC|contig0.. 274585 275517 -
## 2 2-hydroxyacid dehydr.. ERS672259|SC|contig0.. 272748 273680 -
## 3 2-hydroxyacid dehydr.. ERS739235|SC|contig0.. 11617 12549 +
## 4 2-hydroxyacid dehydr.. ERS739242|SC|contig0.. 308107 309039 -
## 5 2-hydroxyacid dehydr.. ERS739246|SC|contig0.. 114555 115487 -
## 6 2-hydroxyacid dehydr.. ERS686652|SC|contig0.. 273866 274798 -
## 7 2-hydroxyacid dehydr.. ERS739261|SC|contig0.. 15409 16341 +
##
## ...
## <2586 more elements>
If you want to work with this data as a single DataFrame
, just unlist
it:
unlist(p$genes, use.names = FALSE)
pagoo
also includes predefined subsets fields to list only certain pangenome category, these are queried by adding a prefix with the desired category followed by an underscore: $core_genes
, $shell_genes
, and $cloud_genes
. These kind of subsets are better explained in the ‘4 - Subets’ tutorial, and also apply to other pangenome data described below.
Groups of orthologues (clusters) are also stored in pagoo
objects as a table with a cluster identifier per row, and optional metadata associated as additional columns.
$clusters p
## DataFrame with 2588 rows and 2 columns
## cluster Pfam_Arch
## <factor> <character>
## 1 group0001 2-Hacid_dh_C
## 2 group0002 2-Hacid_dh_C;2-Hacid..
## 3 group0003 2-Hacid_dh_C;ACT;2-H..
## 4 group0004 2Fe-2S_thioredx
## 5 group0005 4HB_MCP_1;MCPsignal
## ... ... ...
## 2584 group2584 zf-RING_7
## 2585 group2585 zf-TFIIB
## 2586 group2586 ZinT
## 2587 group2587 ZnuA
## 2588 group2588 ZT_dimer
Subsets also exists for this field: $core_clusters
, $shell_clusters
, and $cloud_clusters
.
Although is an optional field (it exists only if user provide this data as an argument when object is created), $sequences
gives access to sequence data. Sequences are stored as a List
of DNAStringSet
(a.k.a DNAStringSetList
, Biostrings package), grouped by cluster.
$sequences # List all sequences grouped by cluster p
## DNAStringSetList of length 2588
## [["group0001"]] 16244_6_6__16244_6_6_00150=ATGGCGATAACAGTTTATTACGACAAAGATTGCG...
## [["group0002"]] 16244_6_6__16244_6_6_01290=ATGAAAATAGTATGCTTAGATGCCGACACGCTTG...
## [["group0003"]] 16244_6_6__16244_6_6_01710=ATGAAAACAGTTATAGTTTGCGATGCAATACATC...
## [["group0004"]] 16244_6_6__16244_6_6_01754=ATGAAATTCGAATTTACTCATGAGCAATTATCGG...
## [["group0005"]] 16244_6_6__16244_6_6_00049=ATGTCAAATTTAACTACTAACTTAACTACCAAAA...
## [["group0006"]] 16244_6_6__16244_6_6_01069=ATGAATTATTTTGAGAATTTAAAAGTTTCAACAA...
## [["group0007"]] 16244_6_6__16244_6_6_01612=ATGCGAATTAGAATTTATTATGAAGATACCGATG...
## [["group0008"]] 16244_6_6__16244_6_6_01679=ATGATGAAAGATATGGGCGAGCCACGTATAAAAA...
## [["group0009"]] 16244_6_18__16244_6_18_01216=ATGGGGCTTACTACGAGTACGACAAAGTATAT...
## [["group0010"]] 16244_6_6__16244_6_6_00758=ATGAAAAGAGTGGTTATAAAAGTAGGCTCTCACG...
## ...
## <2578 more elements>
$sequences[["group0001"]] # List first cluster p
## DNAStringSet object of length 7:
## width seq names
## [1] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 16244_6_6__16244_...
## [2] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 16244_6_18__16244...
## [3] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_16__17059...
## [4] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_23__17059...
## [5] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_27__17059...
## [6] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17150_1_73__17150...
## [7] 1023 ATGGCGATAACAGTTTATTACGA...TAGTTAATAAAGACAAAAATTAA 17059_2_42__17059...
Note that sequence names are created by pasting organism names and gene names, separated by a string that by default is sep = '__'
(two underscores). This are the same as the gid
column in the $genes
field, and are initially set when pagoo
object is created. If you think your dataset contain names with this separator, then you should set this parameter to other string to avoid conflicts. $sequences
field also has predefined subsets: $core_sequences
, $shell_sequences
, and $cloud_sequences
.
The $organisms
field contain a table with organisms and metadata as additional columns if provided.
$organisms p
## DataFrame with 7 rows and 8 columns
## org id strain year country host
## <factor> <character> <character> <integer> <character> <character>
## 1 16244_6_6 FR15 2008/170h 2008 France Human
## 2 16244_6_18 FR27 2012/185h 2012 France Human
## 3 17059_2_16 AR1 99/801 1999 Argentina Bovine
## 4 17059_2_23 AR8 04/875 2004 Argentina Bovine
## 5 17059_2_27 AR12 06/195 2006 Argentina Bovine
## 6 17150_1_73 CA1 001A-0374 2005 Canada Human
## 7 17059_2_42 TW6 1830 2008 Taiwan Human
## source accession
## <character> <character>
## 1 Feces ERS672247
## 2 Blood ERS672259
## 3 Prepuce ERS739235
## 4 Fetus ERS739242
## 5 VM ERS739246
## 6 Blood ERS686652
## 7 Blood ERS739261