Genotype imputation is an essential tool in genomics, enabling association testing with markers not directly genotyped, increasing statistical power, and facilitating data pooling between studies that employ different genotyping platforms. Two commonly used software packages for imputation are minimac and Impute2. Furthermore, services such as the Michigan Imputation Server have made genotype imputation much more accessible and streamlined.
While a number of software options are available for analyses of imputed data (e.g. PLINK, EPACTS), fewer are available for Genomewide Gene x Environment Interaction Scan (GWIS). Furthermore, data management tasks such as parsing, subsetting, and merging, while manageable in smaller studies, quickly become unwieldy and prohibitively slow with very large samples sizes. We aim to address these limitations by converting imputation outputs into a binary dosage file. The benefits of a binary format are two fold - decreased hard drive storage requirements (compared to a VCF file), and speed of parsing/analyses. The BinaryDosage package contains functions to convert VCF and Impute2 formatted files into binary dosage files, along with functions to merge samples.
For GWAS/GWIS analysis of BinaryDosage files, please refer to the GxEScanR package.
There are 4 formats for a binary dosage data set. Data sets in formats 1, 2, and 3 have 3 files, a sample information file, a SNP information file, and a genetic information file. Data sets in format 4 have just 1 file. This file contains all the information listed above and may contain the following information.
Note: Format 4 is recommended and is the default value for all functions.
remove.packages("BinaryDosage")
devtools::install_github("https://github.com/USCbiostats/BinaryDosage")
library(BinaryDosage)
The general workflow for using binary dosage data sets is as follows:
The examples below use the default values for the functions. More information about the functions and their options can be found using the help files and the vignettes.
In the examples below the input files are included with the binary dosage package and the output files are written to R temporary files. In normal use, the user would provide the names of the input and output files.
Example datasets set1a.vcf and set1b.vcf are representative of VCF output files obtained from the Michigan Imputation Server. An information file is also included for each set, set1a.info and set1b.info.
Example datasets set3a.imp and set3b.imp are representative of files return by the Impute imputation software. For GEN files the subject IDs are contained in separated files. For this example these are set3a.sample and set3b.sample.
The VCF and GEN files contain the same data. These files are in the extdata directory of the BinaryDosage package. These sets contain the following:
Set | Number of subjects | Number of SNPS |
---|---|---|
1a,3a | 60 | 10 |
1b,3b | 40 | 10 |
Since these files are distributed with the Binary Dosage package, it is necessary to get the complete file name and path for use in the following examples. The following code gets all the file names needed for the examples.
library(BinaryDosage)
# Get the file names for the VCF and information files
vcf1afile <- system.file("extdata", "set1a.vcf", package = "BinaryDosage")
vcf1ainfo <- system.file("extdata", "set1a.info", package = "BinaryDosage")
vcf1bfile <- system.file("extdata", "set1b.vcf", package = "BinaryDosage")
vcf1binfo <- system.file("extdata", "set1b.info", package = "BinaryDosage")
# Get the file names for the GEN and sample files
gen3afile <- system.file("extdata", "set3a.imp", package = "BinaryDosage")
gen3asample <- system.file("extdata", "set3a.sample", package = "BinaryDosage")
gen3bfile <- system.file("extdata", "set3b.imp", package = "BinaryDosage")
gen3bsample <- system.file("extdata", "set3b.sample", package = "BinaryDosage")
The binary dosage output files will be written to temporary files. There needs to be only one output file per data set because the examples use the default format value of 4. The following code creates these temporary output files.
# The output files for set 1
bdfile1a <- tempfile()
bdfile1b <- tempfile()
mergebd1 <- tempfile()
# The output files for set 3
bdfile3a <- tempfile()
bdfile3b <- tempfile()
mergebd3 <- tempfile()
Converting a VCF file into a binary dosage file is simple. The user passes the names of the VCF and information files along with the name for the binary dosage file to the vcftobd function. There are some options available for the vcftobd functions such as using gz compressed files vcf files. More information about these options can be found using the help files or reading the vignette usingvcffiles.
The following commands convert VCF data sets 1a and 1b into the binary dosage format.
vcftobd(vcffiles = c(vcf1afile, vcf1ainfo), bdfiles = bdfile1a)
vcftobd(vcffiles = c(vcf1bfile, vcf1binfo), bdfiles = bdfile1b)
Converting GEN files to binary dosage files is a little more difficult than converting VCF files. This is because GEN files aren’t as strictly formatted as VCF files. The user needs to have knowledge of how the GEN file is formatted. More information on this can be found in the help files and the vignette usinggenfiles.
In the example GEN file, the first column contains “–” for each SNP and the second column contains the SNP ID in the format
<chromosome>:<location>_<reference allele>_<alternate allele>
Because of this formatting, the function gentobd requires the snpcolumns parameter to have the value c(0L, 2L:5L). To convert the GEN data sets to binary dosage data sets, the names of the input and output files are passed to gentobd along with the needed value for snpcolumns.
The following commands convert the two GEN files into binary dosage files.
gentobd(genfiles = c(gen3afile, gen3asample), snpcolumns = c(0L, 2L:5L), bdfiles = bdfile3a)
gentobd(genfiles = c(gen3bfile, gen3bsample), snpcolumns = c(0L, 2L:5L), bdfiles = bdfile3b)
Merging binary dosage files is done by SNP ID. The files to merge cannot have the same subject IDs. See the vignette usingbdfiles for more information. In this example we are assuming two separate groups of subjects were imputed separately to the same reference panel.
To merge files, the user calls the bdmerge function and passes the names of the files to merge along with a file name for the merged data set. Other options exist for bdmerge and can be found in the help files and the vignette mergingfiles.
The following code first merges the binary files bdfile1a and bdfile1b created from the VCF files into a single file, mergedbd1, and then does the analogous action for the binary dosage files created from the GEN files.
bdmerge(mergefiles = mergebd1, bdfiles = c(bdfile1a, bdfile1b))
bdmerge(mergefiles = mergebd3, bdfiles = c(bdfile3a, bdfile3b))
Once binary dosage files have been created, a function can be applied to all the SNPs in a file.
The function applied to the SNPs in a binary dosage file must have the following four parameters, dosage, p0, p1, and p2. These are the dosage, Pr(g=0), Pr(g=1), and Pr(g=2), respectively. Other parameters can also be passed. For more information on defining the function see the vignette usingbdfiles.
The following code defines a function to calculate the alternate allele frequency.
To apply the function the user needs to call bdapply and pass information about the binary dosage file and the function. The information about the dosage file is obtained by calling the function getbdinfo. If the user is going to call bdapply multiple times, the user may wish to save the results of getbdinfo.
mergebd1info <- getbdinfo(mergebd1)
aaf1 <- bdapply(mergebd1info, calculateaaf)
mergebd3info <- getbdinfo(mergebd3)
aaf3 <- bdapply(mergebd3info, calculateaaf)
Since the VCF and GEN files contain the same information, the alternate allele frequencies should be the same. The following code creates a data frame with the SNP IDs and the alternate allele frequencies for both data sets.
Here is a table showing the results.
chromosome | location | snpid | reference | alternate | aaf_set1 | aaf_set3 |
---|---|---|---|---|---|---|
1 | 10000 | 1:10000:C:A | C | A | 0.3527 | 0.3527 |
1 | 11000 | 1:11000:T:C | T | C | 0.0135 | 0.0135 |
1 | 12000 | 1:12000:T:C | T | C | 0.2400 | 0.2400 |
1 | 13000 | 1:13000:T:C | T | C | 0.3375 | 0.3375 |
1 | 14000 | 1:14000:G:C | G | C | 0.1901 | 0.1901 |
1 | 15000 | 1:15000:A:C | A | C | 0.5627 | 0.5627 |
1 | 16000 | 1:16000:G:A | G | A | 0.4569 | 0.4569 |
1 | 17000 | 1:17000:C:A | C | A | 0.4578 | 0.4578 |
1 | 18000 | 1:18000:C:G | C | G | 0.2591 | 0.2591 |
1 | 19000 | 1:19000:T:G | T | G | 0.2431 | 0.2431 |
After doing an analysis, the user may want to extract a SNP from the data set for further analysis. This can be done using the getsnp function. By default the function returns a list with the dosage values for all the subjects. The genotype probabilities can be added to the list by setting the dosageonly option to FALSE. See the help files or the vignette usingbdfiles for more information.
The following code extracts the 6th SNP from both the binary dosage data sets generated above.
# Get the dosage values for the 6th SNP
set1snp6 <- getsnp(mergebd1info, 6)
# Get the dosage values for the 6th SNP
set3snp6 <- getsnp(mergebd3info, 6)
The results from the above lines were merged into a data frame with the subject IDs. Here are the first 10 lines of the data frame.
subjectid | set1snp6 | set3snp6 |
---|---|---|
I1 | 1.000 | 1.000 |
I2 | 1.849 | 1.849 |
I3 | 1.000 | 1.000 |
I4 | 2.000 | 2.000 |
I5 | 1.046 | 1.046 |
I6 | 1.915 | 1.915 |
I7 | 2.000 | 2.000 |
I8 | 2.000 | 2.000 |
I9 | 1.000 | 1.000 |
I10 | 1.000 | 1.000 |