We load some simulated data.
library(coloc)
data(coloc_test_data)
attach(coloc_test_data) ## datasets D1, D2, D3 and D4
## The following objects are masked from coloc_test_data (pos = 3):
##
## D1, D2, D3, D4
## The following objects are masked from coloc_test_data (pos = 4):
##
## D1, D2, D3, D4
## The following objects are masked from coloc_test_data (pos = 5):
##
## D1, D2, D3, D4
## The following objects are masked from coloc_test_data (pos = 6):
##
## D1, D2, D3, D4
Datasets 3 and 4 are constructed to deliberately break the single
causal variant assumption in coloc.abf()
.
par(mfrow=c(2,1))
plot_dataset(D3, main="Dataset D3")
plot_dataset(D4, main="Dataset D4")
First, let us do a standard coloc (single causal variant) analysis to serve as a baseline comparison. The analysis concludes there is colocalisation, because it “sees” the SNPs on the left which are strongly associated with both traits. But it misses the SNPs on the right of the top left plot which are associated with only one trait.
<- coloc.abf(dataset1=D3, dataset2=D4) my.res
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.39e-15 2.84e-07 5.35e-12 9.76e-05 1.00e+00
## [1] "PP abf for shared variant: 100%"
class(my.res)
## [1] "coloc_abf" "list"
## print.coloc_abf
my.res
## Coloc analysis of trait 1, trait 2
##
## SNP Priors
## p1 p2 p12
## 1e-04 1e-04 1e-05
##
## Hypothesis Priors
## H0 H1 H2 H3 H4
## 0.9894755 0.005 0.005 2.45e-05 5e-04
##
## Posterior
## nsnps H0 H1 H2 H3 H4
## 5.000000e+01 1.386329e-15 2.843375e-07 5.351202e-12 9.763482e-05 9.999021e-01
sensitivity(my.res,"H4 > 0.9")
## Results pass decision rule H4 > 0.9
Even though the sensitivity analysis itself looks good, the Manhattan plots suggest we are violating the assumption of a single causal variant per trait.
coloc has adopted the SuSiE framework for fine mapping in the presence of multiple causal variants. This framework requires the LD matrix is known, so first check our datasets hold an LD matrix of the right format. =check_dataset= should return NULL if there are no problems, or print informative error messages if there are.
check_dataset(D3,req="LD")
## NULL
check_dataset(D4,req="LD")
## NULL
SuSiE can take a while to run on larger datasets, so it is best to run once per dataset with the =runsusie= function, store the results and feed those into subsequent analyses. =runsusie= is just a wrapper around the =susie_rss= function in the susieR package that automates running until convergence and saves a little extra information about snp names to make subsequent coloc processing simpler. Here, it does indeed find two signals for dataset D3 (there are two rows in the credible sets summary) and one for dataset D4.
=runsusie(D3) S3
## running max iterations: 100
## WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).
## converged: TRUE
summary(S3)
##
## Variables in credible sets:
##
## variable variable_prob cs
## 5 0.99068429 1
## 38 0.67052634 2
## 43 0.13918754 2
## 41 0.07184442 2
## 44 0.05385962 2
## 49 0.04166083 2
##
## Credible sets summary:
##
## cs cs_log10bf cs_avg_r2 cs_min_r2 variable
## 1 8.886435 1.0000000 1.0000000 5
## 2 2.838112 0.4601606 0.2522791 38,41,43,44,49
=runsusie(D4) S4
## running max iterations: 100
## WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).
## converged: TRUE
summary(S4)
##
## Variables in credible sets:
##
## variable variable_prob cs
## 5 0.9517864 1
##
## Credible sets summary:
##
## cs cs_log10bf cs_avg_r2 cs_min_r2 variable
## 1 6.614133 1 1 5
With these susie output objects stored, we can colocalise every pair of signals. This analysis says the first pair, tagged by s25.1 and s25 for datasets D3 and D4, do not colocalise (posterior for H3 is close to 1), whilst the second pair, tagged by the same SNP, s25, for both datasets, do (posterior for H4 is close to 1).
if(requireNamespace("susieR",quietly=TRUE)) {
=coloc.susie(S3,S4)
susie.resprint(susie.res$summary)
}
## Using 50/ 50 and 50 available
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 2.68e-14 1.03e-07 5.51e-10 1.23e-04 1.00e+00
## [1] "PP abf for shared variant: 100%"
## PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf PP.H4.abf
## 1.09e-05 3.75e-05 2.24e-01 7.72e-01 4.14e-03
## [1] "PP abf for shared variant: 0.414%"
## nsnps hit1 hit2 PP.H0.abf PP.H1.abf PP.H2.abf PP.H3.abf
## 1: 50 s5 s5 2.681604e-14 1.032284e-07 5.514394e-10 0.0001230134
## 2: 50 s38 s5 1.089658e-05 3.752945e-05 2.240750e-01 0.7717393815
## PP.H4.abf idx1 idx2
## 1: 0.999876883 1 1
## 2: 0.004137205 2 1
Note that because we are doing multiple colocalisations, sensitivity() needs to know which to consider, and we need to give it the datasets used if we want to see the Manhattan plots.
if(requireNamespace("susieR",quietly=TRUE)) {
sensitivity(susie.res,"H4 > 0.9",row=1,dataset1=D3,dataset2=D4)
sensitivity(susie.res,"H4 > 0.9",row=2,dataset1=D3,dataset2=D4)
}
## Results pass decision rule H4 > 0.9
## Results fail decision rule H4 > 0.9
runsusie()
is a wrapper around
susieR::susie_rss()
. It sets the null_weight parameter -
this must be non-zero or we cannot back calculate Bayes factors needed
for coloc - adds some colnames to the returned objects (so that snps can
be identified easily) and repeats the calls to susie_rss()
until convergence is achieved. In particular, the
null_weight
is by default set to an (implausibly) small
value of \(\frac{1}{1+nsnps}\). This
ensures that we get a posterior probability for the null hypothesis of
no association, so we can calculate Bayes factors when needed, but also,
by setting such a low value, allows us to capture weak signals that
would not be detected if null_weight
is large. For coloc
purposes, this is fine, because we will include in the coloc step our
real prior for a SNP to be causally associated (and by implication our
prior belief that no SNPs are causally associated). At this stage, weak
signals will get weeded out.
You can also pass that per-SNP prior to runsusie()
by
setting the parameter p
(equivalent to p1
or
p2
in coloc functions). But be aware
either option differs from the default behaviour of
susie_rss()
which is not to specify a prior for the null
hypothesis at all.
susie_rss()
has many other options and you should look
at them if runsusie()
is not giving the output you expect.
They can be passed directly through runsusie()
.
if(requireNamespace("susieR",quietly=TRUE)) {
::susie_rss
?susieR }
One option I have varied to help SuSiE detect weaker signals, is the
coverage parameter. By default susie_rss
looks for signals
for which it can form a 95% credible set. By reducing the coverage, we
can find weaker signals. For example we find nothing in this weaker
signal dataset
if(requireNamespace("susieR",quietly=TRUE)) {
## make a dataset with a weaker signal
=D3
D5$varbeta=D5$varbeta * 2
D5$N=D5$N / 2
D5par(mfrow=c(1,2))
plot_dataset(D3, main="original D3")
plot_dataset(D5, main="weaker signal D5")
summary(runsusie(D5)) # default coverage 0.95
}
But by reducing the coverage we can find one signal
if(requireNamespace("susieR",quietly=TRUE)) {
summary(runsusie(D5,coverage=0.1)) # lower coverage
}
And reducing it further finds the other as well
if(requireNamespace("susieR",quietly=TRUE)) {
summary(runsusie(D5,coverage=0.01)) # even lower
}
These values are just for illustration, I probably wouldn’t believe a signal in a real dataset with a \(p > 10^{-5}\). They let you find weaker signals, but the coloc results should be treated with caution.
if(requireNamespace("susieR",quietly=TRUE)) {
=runsusie(D5,coverage=0.1) # lower coverage
S5summary(S5)
=coloc.susie(S5,S4)
resprint(res$summary)
}