We consider the acute lymphocytic leukemia dataset, included with the simPATHy package, first published in (Chiaretti et al. 2005). This dataset contains expression of 3405 genes in two conditions with sample sizes \(n_1=37\) and \(n_2=41\).
library(simPATHy)
library(graph)
data("chimera")
dim(chimera)
## [1] 3405 78
table(colnames(chimera))
##
## 1 2
## 37 41
The column names indicate the condition (1 or 2) of each sample.
In this example, we restrict our attention to a subset of genes in this dataset, corresponding to genes participating in the KEGG’s ``Acute Myeloid Leukemia’’ pathway.
The graph that we consider in what follows is a directed acyclic graph derived manually from this pathway.
graph<-gRbase::dag(~867:25+867:613+5295:867+5294:867+
207:5295+207:5294+4193:207+3551:207+
4792:3551+7157:4193+3265:6654+
3845:6654+6654:2885+2885:25+2885:613)
We take the first condition of this dataset as a reference condition for our example and begin by estimating the covariance matrix.
genes<-graph::nodes(graph)
data<-t(chimera[genes,colnames(chimera)==1])
S<-cov(data)
The matrix S
does not reflect conditional independence constraints imposed by the graph
. To impose these structural constraints simPATHy provides a function fitSgraph
for maximum likelihood estimation of covariance matrices in graphical models, for both Gaussian bayesian networks and Gaussian graphical models.
S<-fitSgraph(graph,S)
round(S[1:5,1:5],3)
## 867 25 613 5295 5294
## 867 0.115 -0.006 0.071 0.008 0.045
## 25 -0.006 0.240 0.000 0.000 -0.003
## 613 0.071 0.000 0.121 0.005 0.028
## 5295 0.008 0.000 0.005 0.317 0.003
## 5294 0.045 -0.003 0.028 0.003 0.246
The package also provides two plotting functions for graphical models. The first plotGraphNELD3
, focuses on the graphical structure, while the second one, plotCorGraph
, focuses on the correlation matrix. In both cases, the colors represent the strength of relation between nodes, where the user by setting the parameter type
chooses whether to show the pairwise correlation coefficient (type= "cor"
) or the partial correlation coefficient (type= "pcor"
).
plotGraphNELD3(graph,type = "cor",S1 = S)
plotCorGraph(S1 = S,type = "cor")
When the number of nodes is high and relations between them week, a user can improve the visibility by adjusting the color range colLim
(uncomment to launch).
lim<-round(max(abs(simPATHy:::riscala(S))[upper.tri(S)]),2)
plotCorGraph(S1 = S,type ="cor",colLim = c(-lim,lim))
Note that when an element is outside of the colLim
interval, it is colored gray in plotCorGraph
and represented as a dashed link in plotGraphNELD3
.
When plotting a correlation matrix, a user can also pass the associated graph to the plotCorGraph
function to plot the adjacency matrix over the correlation (or partial correlation) matrix.
plotCorGraph(S1 = S,type = "cor", graph = graph)
The zero elements of the adjacency matrix are represented as shaded squares, whereas non-zero elements are represented as squares with grey borderline.
An important aspect is that the minimum input requirement for simPATHy is the graph. In absence of any other information, simPATHy sets the value of the corresponding reference condition covariance matrix to an estimate obtained from the Acute Lymphocytic Leukemia dataset. This is achieved by randomly selecting a gene set of the same cardinality as the order of the input graph (the number of vertices).
Clearly, the parameter values obtained reflect the properties of the original dataset, and, in particular, represent starting values suitable for gene expression measurements. For more general situations, when the goal is to simulate data other than gene expression, the user might want to input a covariance matrix more representative of the desired type of data.
In simPATHy, a path is defined as a list (possibly of length 1) of edges of the graph.
The following command specifies manually a path.
path<-list(c("613","867"),c("867","5295"),c("5295","207"),
c("207","4193"),c("4193","7157"))
Alternatively, the user can select the same path interactively by means of the function getPathShiny
. The desired path is chosen by clicking sequentially on its edges, and pressing the button to exit the interactive mode.
path <- getPathShiny(graph)
path
## [[1]]
## [1] "613" "867"
##
## [[2]]
## [1] "867" "5295"
##
## [[3]]
## [1] "5295" "207"
##
## [[4]]
## [1] "207" "4193"
##
## [[5]]
## [1] "4193" "7157"
Finally, simPATHy provides a generatePATH
function that finds the shortest path connecting two given nodes.
path <- generatePath(graph,from="613",to="7157")
Note that simPATHy allows to simultaneously select more, possibly disjoint, paths to be dysregulated, as in the following command.
path1 <- list(c("613","2885"), c("4193","7157"))
By dysregulation we intend some multiplicative change of a subset of correlation coefficients. To specify the strength of dysregulation, the user is requested to provide a range of values for the possible dysregulation by fixing two positive parameters, min
and max
. The strength of dysregulation is sampled uniformly from the interval [min, max]
: a value smaller than 1 represents deactivation (the relation between two variables is weakened), a value greater than 1 represents activation (the relation be- tween two variables is strengthened). A dysregulation range, is required for each edge in the path.
There are several types of signalling dysregulations and depending on the scenario that the user wants to simulate, different magnitudes should be applied in our model. In case of a mutation almost completely suppressing the expression of a gene or the level of a protein, the magnitude should be strong (and thus both values min
, max
would be close to zero). Otherwise, in stress conditions characterized by mild dysregulation of some (possibly many) genes, the magnitude may be moderate.
In general, to test the sensitivity and specificity of gene set methods it would be reasonable to try different strengths of dysregulation. Moreover, values for min
and max
might depend on the type of data simulated. simPATHy provides default values min=2
, max=3
representing moderate dysregulations.
min<-c(2,8,2,0.1,0.5)
max<-c(2,10,2,4,0.5)
In some applications it might be of interest to change the direction of relation between two variables (the correlation coefficient changes sign). To allow for this possibility, simPATHy
provides prob
parameter. prob
is a number between 0 and 1, with 0 implying that the sign of the correlation coefficient should be changed, and 1 implying that the sign should be left unaltered (default). Values between the two extremes allow for random sign switch: the sign is changed with probability 1-prob
.
prob<-c(1,0,0,0.5,1)
dys<-cbind(min,max,prob)
rownames(dys)<-sapply(path,paste,collapse = "~")
dys
## min max prob
## 613~867 2.0 2.0 1.0
## 867~5295 8.0 10.0 0.0
## 5295~207 2.0 2.0 0.0
## 207~4193 0.1 4.0 0.5
## 4193~7157 0.5 0.5 1.0
Here, the correlation coefficient between variables 613
and 867
is to be activated in the dysregulated condition, more precisely, multiplied by two, while the relation between 4193
and 7157
is to be deactivated in the second condition (the correlation coefficient multiplied by 0.5). On the other hand, the relation between 5295
and 207
is multiplied by two in the dysregulated condition and reversed, since prob=0
forces a sign change.
It is worth noting that the dysregulation specified by the user might be causing the modified correlation coefficient to lie outside the (-1,1)
range. To avoid this problem and prevent excessively strong dysregulations, the absolute value of the dysregulated coefficient is controlled by simPATHy and set equal to \[\min(0.9; 1.25\max \left\{|\rho_{u,v}|, u\neq v\right\}),\] where \(R=(\rho_{u,v})\) is the correlation matrix of the reference condition.
After choosing the sample sizes n1
and n2
for the two conditions (default is 500), we have set all the required parameters and can proceed by calling the main function simPATHy
.
Note that the choice of the sample sizes depends on the use of the resulting simulated data. If simulated data are used to assess asymptotic properties of a method under study, sample size could be 1000 or higher. On the other hand, if one is interested in the finite sample properties, the sample size could be as small as 10.
set.seed(123)
Result<-simPATHy(graph,path,S,min,max,prob)
The output is a simPATHy
class object represented by a list of nine elements.
class(Result)
## [1] "simPATHy"
names(Result)
## [1] "dataset" "S1" "S2" "path" "strength"
## [6] "param" "correction" "mu1" "mu2"
The key element is the simulated dataset
containing n1+n2
observations from two conditions–reference condition cl1
and dysregulated condition cl2
–sampled from multivariate normal distributions with covariance matrices Result$S1
and Result$S2
, respectively.
round(Result$dataset[c(1:3,501:503),1:5],3)
## 867 25 613 5295 5294
## cl1 0.620 -0.045 0.132 0.178 0.696
## cl1 -0.125 -0.495 -0.460 -0.121 -0.258
## cl1 -0.228 -0.099 0.283 -0.023 -0.002
## cl2 -0.169 1.201 -0.201 0.374 0.545
## cl2 -0.557 0.680 -0.303 0.631 0.580
## cl2 0.555 1.337 0.304 -0.391 -0.694
By default observations are sampled from zero mean normal distributions; however, a user can specify different values for mu1
and mu2
.
Argument param
allows to recover the dysregulation parameters.
Result$param
## $min
## [1] 2.0 8.0 2.0 0.1 0.5
##
## $max
## [1] 2.0 10.0 2.0 4.0 0.5
##
## $prob
## [1] 1.0 0.0 0.0 0.5 1.0
The actual multiplicative constant applied to each path edge correlation coefficient is stored in argument strength
.
Result$strength
## [1] 1.4933490 -8.8179538 -2.0000000 -0.2776703 0.5000000
When the dysregulation of the initial (reference condition) covariance matrix leads to a matrix that is no longer positive definite, the internal function makePositiveDefinite
is automatically called. The correction
argument signals the use of the correction giving the value \((0.1 − \lambda_\min)\).
Result$correction
## $isCorrected
## [1] TRUE
##
## $correction
## [1] 0.1
The summary of the output is provided by the function easyLookDys
.
easyLookDys(Result)
edge | type | strength | cov.S1 | cov.S2 | cor..S1 | cor..S2 |
---|---|---|---|---|---|---|
613~867 | activation | 1.4933 | 0.0711 | 0.1062 | 0.3260 | 0.4869 |
867~5295 | switch | -8.8180 | 0.0084 | -0.0737 | 0.0279 | -0.2460 |
5295~207 | switch | -2.0000 | -0.0366 | 0.0682 | -0.0725 | 0.1353 |
207~4193 | switch | -0.2777 | 0.0240 | -0.0066 | 0.0700 | -0.0194 |
4193~7157 | deactivation | 0.5000 | 0.0216 | 0.0108 | 0.1157 | 0.0578 |
To visualize differences in two conditions, simPATHy provides above mentioned plotting functions, i.e. plotGraphNELD3
and plotCorGraph
. In this case, both functions take, in addition to the graph, two covariance matrices corresponding to two conditions and plot the difference between them (uncomment to launch).
plotCorGraph(S1 = Result$S1, S2 = Result$S2, type = "cor",
graph = graph, path = Result$path,colLim = c(-0.4,0.4))
A user can examine these plots in more detail by calling an interactive easyLookShiny
function.
easyLookShiny(Result, graph)
The interactive dashboard contains three boxes (each of which can be hidden by clicking on the minus sign situated on top right) showing results of the three simPATHy functions: plotGraphNELD3
, plotCorMatrix
and easyLookDys
.
colLim
and allows to adjust the color limits representing correlation coefficients (matrix cells and graph edges). It is possible to set different values for positive coefficients (blue color) and negative coefficients (red color). In case of color limits that are too stringent, correlation coefficients that lie outside the specified range will be coloured grey.
S1
and S2
will be represented, with colours representing the strength of dysregulation. Green squares correspond to edges that have been dysregulated by simPATHy, while grey squares correspond to the remaining, unaltered graph edges. Shaded squares correspond to pairs of variables not connected by a graph edge, but whose associated partial correlation coefficient was affected by the perturbation.
As already stated, datasets simulated by simPATHy are especially suited for studying the performance of GSA tools that use the topology of the underlying biological network.
For instance, (Massa, Chiogna, and Romualdi 2010) consider the test of equality of the two structured covariance matrices
\[H_0 : \Sigma_1 = \Sigma_2 \mbox{ vs } H_1 : \Sigma_1 \neq \Sigma_2 \]
The pathway.var.test
function implemented in the topologyGSA R package tests the equality of the concentration matrices of the two Gaussian graphical models, or in other words, the equality of the strength of the relations among genes on the pathway. The inputs are two experimental conditions, the graph and the significance level \(\alpha\).
library(topologyGSA)
#?pathway.var.test
y1<-Result$dataset[rownames(Result$dataset)=="cl1",]
y2<-Result$dataset[rownames(Result$dataset)=="cl2",]
alpha<-0.05
pathway.var.test(y1, y2, dag = graph, alpha)
clipper, the approach implemented in (Martini et al. 2013), goes a step further and identifies the paths within a pathway that differ the most between the two experimental conditions.
library(clipper)
#?clipper
expr<-t(Result$dataset)
classes<-as.numeric(gsub("cl","",colnames(expr)))
clipped<-clipper(expr,classes,graph)
clipped
simPATHy depends on some BioC packages.
The command install.packages("simPATHy")
will not install the necessary BioC packages that would have to be installed separately. For this reason we suggest to use the command biocLite("simPATHy")
: it will install all the dependencies from both CRAN and BioC.
#References
Chiaretti, Sabina, Xiaochun Li, Robert Gentleman, Antonella Vitale, Kathy S Wang, Franco Mandelli, Robin Foa, and Jerome Ritz. 2005. “Gene Expression Profiles of B-Lineage Adult Acute Lymphocytic Leukemia Reveal Genetic Patterns That Identify Lineage Derivation and Distinct Mechanisms of Transformation.” Clinical Cancer Research 11 (20): 7209–19.
Martini, Paolo, Gabriele Sales, M Sofia Massa, Monica Chiogna, and Chiara Romualdi. 2013. “Along Signal Paths: An Empirical Gene Set Approach Exploiting Pathway Topology.” Nucleic Acids Research 41 (1): e19–e19.
Massa, Maria Sofia, Monica Chiogna, and Chiara Romualdi. 2010. “Gene Set Analysis Exploiting the Topology of a Pathway.” BMC Systems Biology 4 (1): 1.