Abstract
Describes the background of the package, the most important functions it defines, and some of its applications and uses.
This vignette explains how to use the methods available in this package.
This is a basic example which shows you how to create a TidySet
object, to store associations between genes and sets:
library("BaseSet")
gene_lists <- list(
geneset1 = c("A", "B"),
geneset2 = c("B", "C", "D")
)
tidy_set <- tidySet(gene_lists)
tidy_set
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
This is then stored internally in three slots: relations, elements, and sets.
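As a rough illustration of this layout, the sketch below rebuilds the long relations table from the same named list using only base R. The names long and relations_df are illustrative and not part of BaseSet, and the fuzzy value of 1 is assumed to mean full membership, matching the printed output.

```r
# Base R sketch (not BaseSet internals): one row per element-set pair.
gene_lists <- list(
    geneset1 = c("A", "B"),
    geneset2 = c("B", "C", "D")
)
long <- stack(gene_lists)  # columns: values (elements) and ind (set names)
relations_df <- data.frame(
    elements = long$values,
    sets = as.character(long$ind),
    fuzzy = 1  # assumed: 1 means the element fully belongs to the set
)
relations_df
```

A data frame with these three columns is the kind of input meant when this vignette mentions creating a TidySet from a data.frame.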
If you have more information for each element or set it can be added:
gene_data <- data.frame(
stat1 = c( 1, 2, 3, 4 ),
info1 = c("a", "b", "c", "d")
)
tidy_set <- add_column(tidy_set, "elements", gene_data)
set_data <- data.frame(
Group = c( 100, 200 ),
Colum = c( "abc", "def")
)
tidy_set <- add_column(tidy_set, "sets", set_data)
tidy_set
#> elements sets fuzzy Group Colum stat1 info1
#> 1 A geneset1 1 100 abc 1 a
#> 2 B geneset1 1 100 abc 2 b
#> 3 B geneset2 1 200 def 2 b
#> 4 C geneset2 1 200 def 3 c
#> 5 D geneset2 1 200 def 4 d
This data is stored in one of the three slots, which can be directly accessed using their getter methods:
relations(tidy_set)
#> elements sets fuzzy
#> 1 A geneset1 1
#> 2 B geneset1 1
#> 3 B geneset2 1
#> 4 C geneset2 1
#> 5 D geneset2 1
elements(tidy_set)
#> elements stat1 info1
#> 1 A 1 a
#> 2 B 2 b
#> 3 C 3 c
#> 4 D 4 d
sets(tidy_set)
#> sets Group Colum
#> 1 geneset1 100 abc
#> 2 geneset2 200 def
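One way to picture how the three slots relate to the combined table printed earlier is as a join on the elements and sets keys. The sketch below reproduces that view with base R merge on plain data frames; it is only an illustration and makes no claim about how BaseSet builds its display. The *_df names are hypothetical.

```r
# Plain data frames mirroring the three slots shown above.
relations_df <- data.frame(
    elements = c("A", "B", "B", "C", "D"),
    sets = c("geneset1", "geneset1", "geneset2", "geneset2", "geneset2"),
    fuzzy = 1
)
elements_df <- data.frame(
    elements = c("A", "B", "C", "D"),
    stat1 = c(1, 2, 3, 4),
    info1 = c("a", "b", "c", "d")
)
sets_df <- data.frame(
    sets = c("geneset1", "geneset2"),
    Group = c(100, 200),
    Colum = c("abc", "def")
)
# Join the set information first, then the element information.
with_sets <- merge(relations_df, sets_df, by = "sets")
combined <- merge(with_sets, elements_df, by = "elements")
combined
```

Keeping the three tables separate means that information about an element or a set is stored once, however many relations it takes part in.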
You can add as much information as you want; the only restriction is the “fuzzy” column of the relations, which is reserved for the membership of each element in a set. See the Fuzzy sets vignette.
As you can see it is possible to create a TidySet from a list and a data.frame, but it is also possible from a matrix:
m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
dimnames = list(letters[1:3], LETTERS[1:3]))
m
#> A B C
#> a 0 1 0
#> b 0 1 1
#> c 1 1 0
tidy_set <- tidySet(m)
TidySets can also be created from GeneSet and GeneSetCollection objects. Additionally, the package has several functions to read files related to sets, such as OBO files (getOBO) and GAF files (getGAF).
It is possible to extract the gene sets as a list, for use with functions such as lapply.
as.list(tidy_set)
#> $A
#> c
#> 1
#>
#> $B
#> a b c
#> 1 1 1
#>
#> $C
#> b
#> 1
Or, if you need to apply some network methods and you need a matrix, you can create it with incidence:
incidence(tidy_set)
#> A B C
#> c 1 1 0
#> a 0 1 0
#> b 0 1 1
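To make the shape of this matrix concrete, here is a small base R sketch that builds the same kind of element-by-set matrix from a named list. It is an illustration of what an incidence matrix contains, not the code used by incidence; sets_list, all_elements and inc are made-up names.

```r
# Sets as a named list of their elements (same content as the example).
sets_list <- list(A = "c", B = c("a", "b", "c"), C = "b")
all_elements <- unique(unlist(sets_list))
# One column per set: 1 if the element belongs to the set, 0 otherwise.
inc <- vapply(
    sets_list,
    function(s) as.numeric(all_elements %in% s),
    numeric(length(all_elements))
)
rownames(inc) <- all_elements
inc
```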
To work with sets several methods are provided. In general you can provide a new name for the set resulting from the operation; if you don’t, one will be generated automatically using naming. All methods work with fuzzy and non-fuzzy sets.
You can make a union or an intersection of sets present in the same object:
BaseSet::union(tidy_set, sets = c("C", "B"), name = "D")
#> elements sets fuzzy
#> 1 a D 1
#> 2 b D 1
#> 3 c D 1
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = TRUE)
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c D 1
The keep argument determines whether all the previous sets are kept in the result (keep = TRUE, as above) or only the new set is returned (keep = FALSE):
intersection(tidy_set, sets = c("A", "B"), name = "D", keep = FALSE)
#> elements sets fuzzy
#> 1 c D 1
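As a cross-check of the results above, the same union and intersection can be computed on plain character vectors in base R. The second half of the sketch shows one common convention for fuzzy sets (maximum for the union, minimum for the intersection); the membership values are made up, and whether a given BaseSet call uses exactly this rule should be checked in its documentation.

```r
# Crisp sets from the matrix example, as plain character vectors.
A <- "c"
B <- c("a", "b", "c")
C <- "b"
union_CB <- base::union(C, B)             # elements in C or in B
intersection_AB <- base::intersect(A, B)  # elements in both A and B

# Fuzzy memberships (made-up values) of the same three elements in two sets.
fuzzy_x <- c(a = 0.2, b = 0.9, c = 0.5)
fuzzy_y <- c(a = 0.7, b = 0.1, c = 0.5)
fuzzy_union <- pmax(fuzzy_x, fuzzy_y)         # max rule
fuzzy_intersection <- pmin(fuzzy_x, fuzzy_y)  # min rule
```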
We can look for the complement of one or several sets:
complement_set(tidy_set, sets = c("A", "B"))
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c ∁A∪B 0
#> 7 a ∁A∪B 0
#> 8 b ∁A∪B 0
Observe that we haven’t provided a name for the resulting set, but we can provide one if we prefer:
complement_set(tidy_set, sets = c("A", "B"), name = "F")
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
#> 6 c F 0
#> 7 a F 0
#> 8 b F 0
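The zeros in the fuzzy column of the new set can be understood with a small base R sketch. It assumes the usual complement rule, 1 minus the membership, which is consistent with the output above but is stated here as an assumption rather than as the package's implementation.

```r
A <- "c"
B <- c("a", "b", "c")
universe <- c("c", "a", "b")   # all elements known to the object
a_or_b <- base::union(A, B)    # the set being complemented
# Membership of every element in the union of A and B (1 = belongs).
membership <- as.numeric(universe %in% a_or_b)
names(membership) <- universe
# Assumed complement rule: 1 - membership.
complement_membership <- 1 - membership
complement_membership
# The crisp equivalent: no element is left outside the union.
outside <- setdiff(universe, a_or_b)
```

Because A and B together cover every element, each element has membership 0 in the complement.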
The subtract method is the equivalent of setdiff, but clearer:
out <- subtract(tidy_set, set_in = "A", not_in = "B", name = "A-B")
out
#> elements sets fuzzy
#> 1 c A 1
#> 2 a B 1
#> 3 b B 1
#> 4 c B 1
#> 5 b C 1
name_sets(out)
#> [1] "A" "B" "C" "A-B"
subtract(tidy_set, set_in = "B", not_in = "A", keep = FALSE)
#> elements sets fuzzy
#> 1 a B∖A 1
#> 2 b B∖A 1
See that in the first case there isn’t any element present in A that is not in B, so no relation is added, but the new set “A-B” is still stored. In the second case we focus just on the elements that are present in B but not in A.
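Both cases can be checked against base R setdiff on plain character vectors; the variable names are illustrative only.

```r
A <- "c"
B <- c("a", "b", "c")
a_minus_b <- setdiff(A, B)  # empty: the only element of A is also in B
b_minus_a <- setdiff(B, A)  # the elements of B that are not in A
length(a_minus_b)
b_minus_a
```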
The number of unique elements, sets and relations can be obtained using the nElements, nSets and nRelations methods.
nElements(tidy_set)
#> [1] 3
nSets(tidy_set)
#> [1] 3
nRelations(tidy_set)
#> [1] 5
The size of each gene set can be obtained using the set_size method.
set_size(tidy_set, "A")
#> sets size probability
#> 1 A 1 1
Conversely, the number of sets associated with each gene is returned by the element_size function.
element_size(tidy_set)
#> elements size probability
#> 1 c 2 1
#> 2 a 1 1
#> 3 b 2 1
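For non-fuzzy sets these counts can be read directly off the incidence matrix. The following base R sketch recomputes them from the matrix m used to build the object; it is a sanity check under that assumption, not a description of how set_size or element_size work for fuzzy sets.

```r
# The matrix the TidySet was created from: elements in rows, sets in columns.
m <- matrix(c(0, 0, 1, 1, 1, 1, 0, 1, 0), ncol = 3, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:3]))
set_sizes <- colSums(m)      # number of elements in each set
element_sizes <- rowSums(m)  # number of sets each element belongs to
n_relations <- sum(m)        # number of element-set pairs
set_sizes
element_sizes
n_relations
```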
The identifiers of elements and sets can be inspected and renamed using name_elements and name_sets:
name_elements(tidy_set)
#> [1] "c" "a" "b"
name_elements(tidy_set) <- paste0("Gene", seq_len(nElements(tidy_set)))
name_elements(tidy_set)
#> [1] "Gene1" "Gene2" "Gene3"
name_sets(tidy_set)
#> [1] "A" "B" "C"
name_sets(tidy_set) <- paste0("Geneset", seq_len(nSets(tidy_set)))
name_sets(tidy_set)
#> [1] "Geneset1" "Geneset2" "Geneset3"
dplyr verbs
You can also use mutate, filter and other dplyr verbs with TidySets (the only exception being group_by), but you usually need to select which of the three slots you want to affect with activate:
library("dplyr")
m_TS <- tidy_set %>%
activate("relations") %>%
mutate(Important = runif(nRelations(tidy_set)))
m_TS
#> elements sets fuzzy Important
#> 1 Gene1 Geneset1 1 0.03576740
#> 2 Gene2 Geneset2 1 0.00581606
#> 3 Gene3 Geneset2 1 0.96527650
#> 4 Gene1 Geneset2 1 0.91654958
#> 5 Gene3 Geneset3 1 0.66068543
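Because runif draws random numbers, the values in the Important column will differ from run to run. If reproducible values are wanted, one option is to fix the seed before the call, as in this base R sketch (the seed value 42 is arbitrary):

```r
# Same seed, same draws: the two vectors are identical.
set.seed(42)
first_draw <- runif(5)
set.seed(42)
second_draw <- runif(5)
identical(first_draw, second_draw)
```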
You can use activate to select which slot the verbs modify:
set_modified <- m_TS %>%
activate("elements") %>%
mutate(Pathway = if_else(elements %in% c("Gene1", "Gene2"),
"pathway1",
"pathway2"))
set_modified
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.03576740 pathway1
#> 2 Gene2 Geneset2 1 0.00581606 pathway1
#> 3 Gene3 Geneset2 1 0.96527650 pathway2
#> 4 Gene1 Geneset2 1 0.91654958 pathway1
#> 5 Gene3 Geneset3 1 0.66068543 pathway2
set_modified %>%
deactivate() %>% # To apply a filter independently of where it is
filter(Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.03576740 pathway1
#> 2 Gene2 Geneset2 1 0.00581606 pathway1
#> 3 Gene1 Geneset2 1 0.91654958 pathway1
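The filter above uses a column that lives in the elements slot while returning rows of relations. On plain data frames the same result would need an explicit join before filtering, as in this base R sketch (illustrative names, with the Important column left out for brevity):

```r
relations_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3", "Gene1", "Gene3"),
    sets = c("Geneset1", "Geneset2", "Geneset2", "Geneset2", "Geneset3"),
    fuzzy = 1
)
elements_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3"),
    Pathway = c("pathway1", "pathway1", "pathway2")
)
# Bring the element information next to each relation, then filter.
joined <- merge(relations_df, elements_df, by = "elements")
kept <- joined[joined$Pathway == "pathway1", ]
kept
```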
If you think you need group_by, this usually means that you need a new set. You can create one with group:
# A new set with the elements in pathway1
set_modified %>%
deactivate() %>%
group(name = "new", Pathway == "pathway1")
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.03576740 pathway1
#> 2 Gene2 Geneset2 1 0.00581606 pathway1
#> 3 Gene3 Geneset2 1 0.96527650 pathway2
#> 4 Gene1 Geneset2 1 0.91654958 pathway1
#> 5 Gene3 Geneset3 1 0.66068543 pathway2
#> 6 Gene1 new 1 NA pathway1
#> 7 Gene2 new 1 NA pathway1
set_modified %>%
group("pathway1", elements %in% c("Gene1", "Gene2"))
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.03576740 pathway1
#> 2 Gene2 Geneset2 1 0.00581606 pathway1
#> 3 Gene3 Geneset2 1 0.96527650 pathway2
#> 4 Gene1 Geneset2 1 0.91654958 pathway1
#> 5 Gene3 Geneset3 1 0.66068543 pathway2
#> 6 Gene1 pathway1 1 NA pathway1
#> 7 Gene2 pathway1 1 NA pathway1
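Conceptually, grouping adds one relation to the new set for every element that matches the condition. The base R sketch below mimics that on plain data frames; the names are illustrative, the Important column is omitted, and this is not how group is implemented.

```r
relations_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3", "Gene1", "Gene3"),
    sets = c("Geneset1", "Geneset2", "Geneset2", "Geneset2", "Geneset3"),
    fuzzy = 1
)
elements_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3"),
    Pathway = c("pathway1", "pathway1", "pathway2")
)
# Elements that satisfy the condition become members of the new set.
selected <- elements_df$elements[elements_df$Pathway == "pathway1"]
new_rows <- data.frame(elements = selected, sets = "new", fuzzy = 1)
grouped <- rbind(relations_df, new_rows)
grouped
```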
After grouping or mutating, we might be interested in moving a column that describes something to another slot. We can do this with move_to:
elements(set_modified)
#> elements Pathway
#> 1 Gene1 pathway1
#> 2 Gene2 pathway1
#> 3 Gene3 pathway2
out <- move_to(set_modified, "elements", "relations", "Pathway")
relations(out)
#> elements sets fuzzy Important Pathway
#> 1 Gene1 Geneset1 1 0.03576740 pathway1
#> 2 Gene2 Geneset2 1 0.00581606 pathway1
#> 3 Gene3 Geneset2 1 0.96527650 pathway2
#> 4 Gene1 Geneset2 1 0.91654958 pathway1
#> 5 Gene3 Geneset3 1 0.66068543 pathway2
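The effect of the move can be pictured on plain data frames: the column is looked up for each relation through the element key and then dropped from the elements table. This base R sketch uses illustrative names, omits the Important column, and is not the implementation of move_to.

```r
relations_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3", "Gene1", "Gene3"),
    sets = c("Geneset1", "Geneset2", "Geneset2", "Geneset2", "Geneset3"),
    fuzzy = 1
)
elements_df <- data.frame(
    elements = c("Gene1", "Gene2", "Gene3"),
    Pathway = c("pathway1", "pathway1", "pathway2")
)
# Look up each relation's element in the elements table.
idx <- match(relations_df$elements, elements_df$elements)
relations_df$Pathway <- elements_df$Pathway[idx]
# The column now lives with the relations, so remove it from the elements.
elements_df$Pathway <- NULL
relations_df
```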
sessionInfo()
#> R version 4.0.1 (2020-06-06)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] reactome.db_1.74.0 forcats_0.5.1 ggplot2_3.3.5
#> [4] GO.db_3.12.1 org.Hs.eg.db_3.12.0 AnnotationDbi_1.52.0
#> [7] IRanges_2.24.1 S4Vectors_0.28.1 Biobase_2.50.0
#> [10] BiocGenerics_0.36.1 dplyr_1.0.6 BaseSet_0.0.17
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.1 xfun_0.24 bslib_0.2.5 purrr_0.3.4
#> [5] colorspace_2.0-1 vctrs_0.3.8 generics_0.1.0 htmltools_0.5.1.1
#> [9] yaml_2.2.1 utf8_1.2.1 blob_1.2.1 XML_3.99-0.6
#> [13] rlang_0.4.11 jquerylib_0.1.4 pillar_1.6.1 withr_2.4.2
#> [17] glue_1.4.2 DBI_1.1.1 bit64_4.0.5 lifecycle_1.0.0
#> [21] stringr_1.4.0 munsell_0.5.0 gtable_0.3.0 memoise_2.0.0
#> [25] evaluate_0.14 labeling_0.4.2 knitr_1.33 fastmap_1.1.0
#> [29] fansi_0.5.0 highr_0.9 GSEABase_1.52.1 Rcpp_1.0.6
#> [33] xtable_1.8-4 scales_1.1.1 cachem_1.0.5 graph_1.68.0
#> [37] annotate_1.68.0 jsonlite_1.7.2 farver_2.1.0 bit_4.0.4
#> [41] digest_0.6.27 stringi_1.6.2 grid_4.0.1 tools_4.0.1
#> [45] magrittr_2.0.1 sass_0.4.0 RSQLite_2.2.7 tibble_3.1.2
#> [49] crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.2 assertthat_0.2.1
#> [53] rmarkdown_2.9 httr_1.4.2 R6_2.5.0 compiler_4.0.1