The GWAS Catalog is a service provided by the EMBL-EBI and NHGRI that offers a manually curated and freely available database of published genome-wide association studies (GWAS). The Catalog website and infrastructure is hosted by the EMBL-EBI.
There are three ways to access the Catalog database:
gwasrapidd facilitates the access to the Catalog via the REST API, allowing you to programmatically retrieve data directly into R.
The Catalog REST API is organized around four core entities: studies,
associations, variants, and traits. gwasrapidd provides
four corresponding functions to get each of the entities:
get_studies()
, get_associations()
,
get_variants()
, and get_traits()
.
Each function maps to an appropriately named S4 classed object: studies, associations, variants, and traits (see Figure 1).
You can use a combination of several search criteria with each retrieval function as shown in Figure 2. For example, if you want to get studies using either one of these two criteria: study accession identifier and variant identifier, you could run the following code:
library(gwasrapidd)
<- get_studies(study_id = 'GCST000858', variant_id = 'rs12752552') my_studies
This command returns all studies that match either
'GCST000858'
or 'rs12752552'
. This is
equivalent to running get_studies
separately on each
criteria, and combining the results afterwards:
<- get_studies(study_id = 'GCST000858')
s1 <- get_studies(variant_id = 'rs12752552')
s2 <- gwasrapidd::union(s1, s2) my_studies
All four retrieval functions accept the set_operation
parameter which defines the way the results obtained with each criterion
are combined. The two options for this parameter are
'union'
(default) or 'intersection'
,
resulting, respectively, in an OR or AND operation.
As a first example, take the work by Light et al. (2014). In this work the authors focused on variants that had been previously reported in genome-wide association studies (GWAS) for autoimmune disease.
With gwasrapidd we can interrogate the GWAS Catalog for the study/studies by searching by autoimmune disease (an EFO trait). To do that let’s load gwasrapidd first:
library(gwasrapidd)
Then query the GWAS Catalog by EFO trait:
<- get_studies(efo_trait = 'autoimmune disease') my_studies
We can now check how many GWAS studies we got back:
::n(my_studies)
gwasrapidd#> [1] 8
@studies$study_id
my_studies#> [1] "GCST003097" "GCST011008" "GCST007071" "GCST009873"
#> [5] "GCST011005" "GCST011009" "GCST90029015" "GCST90029016"
Apparently only 8 studies: GCST003097, GCST011008, GCST007071, GCST009873, GCST011005, GCST011009, GCST90029015, GCST90029016. Let’s see the associated publication titles:
@publications$title
my_studies#> [1] "Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases."
#> [2] "Genetic factors underlying the bidirectional relationship between autoimmune and mental disorders - findings from a Danish population-based study."
#> [3] "Leveraging Polygenic Functional Enrichment to Improve GWAS Power."
#> [4] "Meta-analysis of Immunochip data of four autoimmune diseases reveals novel single-disease and cross-phenotype associations."
#> [5] "Genetic factors underlying the bidirectional relationship between autoimmune and mental disorders - findings from a Danish population-based study."
#> [6] "Genetic factors underlying the bidirectional relationship between autoimmune and mental disorders - findings from a Danish population-based study."
#> [7] "Mixed-model association for biobank-scale datasets."
#> [8] "Mixed-model association for biobank-scale datasets."
If you want to further inspect these publications, you can quickly browse the respective PubMed entries:
# This launches your web browser at https://www.ncbi.nlm.nih.gov/pubmed/26301688
open_in_pubmed(my_studies@publications$pubmed_id)
Now if we want to know the variants previously associated with autoimmune disease, as used by Light et al. (2014), we need to retrieve statistical association information on these variants, and then filter them based on the same level of significance \(P < 1\times 10^{-6}\) (Light et al. 2014).
So let’s start by getting the associations by
study_id
:
# You could have also used get_associations(efo_trait = 'autoimmune disease')
<- get_associations(study_id = my_studies@studies$study_id) my_associations
Seemingly, there are 182 associations.
::n(my_associations)
gwasrapidd#> [1] 182
However, not all variants meet the level of significance, as required by Light et al. (2014):
# Get association ids for which pvalue is less than 1e-6.
::filter(my_associations@associations, pvalue < 1e-6) %>% # Filter by p-value
dplyr::drop_na(pvalue) %>%
tidyr::pull(association_id) -> association_ids # Extract column association_id dplyr
Here we subset the my_associations
object by a vector of
association identifiers (association_ids
) into a smaller
object, my_associations2
:
# Extract associations by association id
<- my_associations[association_ids]
my_associations2 ::n(my_associations2)
gwasrapidd#> [1] 180
Of the 182 associations found in GWAS Catalog, 180 meet the p-value threshold of \(1\times 10^{-6}\). Here are the variants, and their respective risk allele and risk frequency:
@risk_alleles[c('variant_id', 'risk_allele', 'risk_frequency')]
my_associations2#> # A tibble: 180 × 3
#> variant_id risk_allele risk_frequency
#> <chr> <chr> <dbl>
#> 1 rs11580078 G 0.43
#> 2 rs6679677 A 0.09
#> 3 rs34884278 C 0.3
#> 4 rs6689858 C 0.29
#> 5 rs2075184 T 0.23
#> 6 rs36001488 C 0.48
#> 7 rs4676410 A 0.19
#> 8 rs4625 G 0.31
#> 9 rs62324212 A 0.42
#> 10 rs7725052 C 0.43
#> # … with 170 more rows
#> # ℹ Use `print(n = ...)` to see more rows