This package is designed to entail the whole processing chain to use SamBada, from pre- to post-processing. In this documentation, we will work through all these steps with an example given in the data available with the package. This is a subset of ten SNPs from 800 Ougandan cattle including the sample location (see SamBada’s documentation below for more information)
If you want more documentation, you can read the documentation of the package or SamBada’s documentation. Also read the article …
NB: By default, all examples write to a temporary folder (use tempdir() to see where it is actually saved). You can remove the tempdir() statements and files will be saved in your current directory.
NB2: All steps presented in this vignette follow one another, in which output files of one step are often used in a subsequent step. However, to facilitate the use of this vignette, input files will always refer to test files of the package. If you want to can use the ones you generated in previous steps (look in the tempdir() folder if you did not specify other output locations). Some steps are necessary to be able to complete the subsequent steps. It is the case of downloadSambada() and prepareOutput()
NB3: As a general rule, please avoid spaces in input files (and paths leading to them), that might make some commands crash
SamBada can only be used with GCC7 compiler, which is not the default one on MacOS. You can install it from the terminal by first downloading homebrew and then installing gcc7. For the function createEnv, you also need GDAL to be installed. This can also be done with Homebrew
# install Homebrew (https://brew.sh/). At the moment of writing the command to install it is
/usr/bin/ruby -e "$(curl -fsSL \https://raw.githubusercontent.com/Homebrew/install/master/install)"
# install GCC7
brew install gcc@7
# install GDAL
brew install gdal
For the function createEnv, GDAL must be be installed. This can be done by using the osgeo4w installer. At the time of writing, this can be accessed under https://trac.osgeo.org/osgeo4w/. Download the 64-bit (or 32-bit) installer, check the ‘Express Desktop Install’, choose one of the proposed URL and uncheck all packages, but the GDAL package (alternatively you can use https://anaconda.org/conda-forge/gdal/files)
For the function createEnv, GDAL must be be installed. At the moment of writing this can be done with the following command in the terminal
sudo apt install libgdal-dev
sudo apt install libproj-dev
To run this vignette, you will need to install R.SamBada. It is recommended to install all dependencies, including the “suggested packages” (i.e. packages that are used in only one or a few functions and that are therefore not mandatory to install the package). Note that one of the dependency (the SNPRelate package) should be installed with the BiocManager separately.
install.packages("BiocManager")
BiocManager::install(pkgs=c("SNPRelate","biomaRt")) #if asked, please update packages that needs to be updated
install.packages("R.SamBada", dependencies=c("Depends",
"Imports", "LinkingTo", "Suggests"))
library("R.SamBada")
For running sambada, you need to download sambada’s binaries. This can be done with downloadSambada
which downloads Sambada from GitHub and unpacks it into the directory of your choice. You might already be a Sambada user and do not have to download it again! Note that if you plan to often use Sambada, it is recommended that you put the binaries folder of Sambada into your path environmental variable (this procedure is OS-dependent, look on the internet how to proceed), otherwise you will have to specify this path every time you start a new R session.
#Load help
?downloadSambada()
#Downloads Sambada into the temporary directory
downloadSambada(tempdir())
Most of the functions described here have an interactiveChecks
mode. When you run the function for the first time on your dataset, we strongly advise that you set it to TRUE. This prints plots that allows you to detect anomalies.However, to facilitate the use of this vignette, this mode has been disabled.We advise you to try with interactiveChecks==TRUE.
The first step when you have your genomic matrix is to prepare it into a format that samBada accepts. You can use prepareGeno
for this, which can process plink ped, plink bed, vcf or gds input file. In the meantime, you can also filter out SNPs based on Minor Allele Frequency (MAF), Missingness, Linkage Disequilibrium (LD) and Major Genotype Frequency (MGF). In order to work with your dataset, a GDS file (from SNPRelate package) is first created. If saveGDS is set to TRUE, then the file will be saved in the active directory. The GDS file is used in other functions, so we recommend that you keep it.
#Loads data
#================
#These files are distributed within your package. The system.file will return the full path to them. With your data, you can just use the name of the file, provided the file is in the active directory
genoFile=system.file("extdata", "uganda-subset-mol.ped", package = "R.SamBada")
genoFile #Check the path to the file
#Load help
#================
?prepareGeno
#Run prepareGeno
#================
#Make sure you ran downloadSambada() before running this command.
prepareGeno(fileName=genoFile,outputFile=file.path(tempdir(),'uganda-subset-mol.csv'),saveGDS=TRUE,mafThresh=0.05, missingnessThresh=0.1,interactiveChecks=FALSE) #Also try with interactiveChecks=TRUE
If coordinates are unknown, you can use setLocation
to help you defining sample coordinates. The procedure is self-explained: it opens a local web page in which you need to upload a file with a list of IDs. Then specify the name of the column containing IDs (and longitude and latitude if present). Once processed, select one or several samples, and click “Select coordinate” at the end of the list. Then select a point on the map (first zoom to a satisfying level): you should see the coordinates being updated in the list. When finished, click “Export Coordinates” to export the new csv file. In the data presented in this vignette, samples are already georeferenced. However, if you want to try this function :
#Locate file containing only IDs of individuals
#================
idFile=system.file("extdata", "uganda-subset-id.csv", package = "R.SamBada")
idFile #Check the path to the file
#Load help
#================
?setLocation
#Run setLocation
#================
setLocation()
#Once the browser opens, you can load the file uganda-subset-id.csv mentioned above
Then from the point locations, you need to create your environmental dataset from point location. Use createEnv
for this task.
You can use rasters of your study site that you already have or use the function to automatically download rasters of your study site from global databases.
#Loads data
#================
#These files are distributed within your package. The system.file will return the full path to them. With your data, you can just use the name of the file, provided the file is in the active directory
locationFile=system.file("extdata", "uganda-subset.csv", package = "R.SamBada")
readLines(locationFile, n=20) #View the first 20 lines of the input file
#Load help
#================
?createEnv
#Create environmental dataset
#================
#downloads the raster tiles from global databases and create the environmental file
#Warning: the download and processing of raster is both heavy in space and time-consuming
#If you want to save time, you can skip this step continue to the next function
createEnv(locationFileName=locationFile, outputFile=file.path(tempdir(),'uganda-subset-env.csv'), x='longitude',y='latitude',locationProj=4326,separator=';', worldclim=TRUE, saveDownload=TRUE, rasterName=NULL,rasterProj=NULL, interactiveChecks=FALSE) #Also try with interactiveChecks=TRUE
You can now use the prepareEnv
function. This function has 3 goals
maxCorr
)The function creates a new file with the name specified in outputFile.
#Loads data
#================
#Locate gds file. Note: this file has also been generated from prepareGeno
#GDS files are binary files that are system dependent
if(Sys.info()['sysname']=='Windows'){
gdsFile=system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada") #If you run Windows
}else{
gdsFile=system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada") #If you run MacOS or Linux
}
gdsFile #Check the path to the file
#Locate the envFile (generated from createEnv)
envFile=system.file("extdata", "uganda-subset-env.csv", package = "R.SamBada")
readLines(envFile, n=20) #View the first 20 lines of the file
#Load help
#================
?prepareEnv
#prepareEnv
#================
#Stores Principal components scores
prepareEnv(envFile=envFile, outputFile=file.path(tempdir(),'uganda-subset-env-export.csv'), maxCorr=0.8, idName='short_name', genoFile=gdsFile, numPc=0.2, mafThresh=0.05, missingnessThresh=0.1, ldThresh=0.2, numPop=NULL, x='longitude', y='latitude', interactiveChecks=FALSE, locationProj=4326 )
#Also try with interactiveChecks=TRUE
If you run samBada from the command line, you will have to create a parameter file. All parameters that can be calculated automatically from your file are calculated for you in sambadaParallel
. Furthermore, samBada includes a module called supervision to split the molecular file into several subfiles to allow parallel computing.
#Loads data
#================
#Locate envFile (created with preapreEnv)
envFile2=system.file("extdata", "uganda-subset-env-export.csv", package = "R.SamBada")
readLines(envFile2, n=20) #View the first 20 line of the environmental file
#Locate genoFile in csv format (created with prepareGeno)
genoFile2=system.file("extdata", "uganda-subset-mol.csv", package = "R.SamBada")
readLines(genoFile2, n=20) #View the first 20 line of the genetic file
#Load help
#================
?sambadaParallel
#sambadaParallel
#================
#Run sambada on two cores.
#Make sure you ran downloadSambada() before running this command.
#Only one pop var was saved, so set dimMax=2. prepareEnv puts the population variables at the end of the file (=> LAST).
sambadaParallel(genoFile=genoFile2, envFile=envFile2, idGeno='ID_indiv', idEnv='short_name', dimMax=2, cores=2, saveType='END ALL', populationVar='LAST', outputFile=file.path(tempdir(),'uganda-subset-mol'))
The function prepareOutput
does the following things on sambada’s output
#Loads data
#================
#You first need to copy the output file of sambadaParallel and prepareGeno into the active directory with the following command
file.copy(system.file("extdata", "uganda-subset-mol-Out-2.csv", package = "R.SamBada"), 'uganda-subset-mol-Out-2.csv')
file.copy(system.file("extdata", "uganda-subset-mol-storey.csv", package = "R.SamBada"), 'uganda-subset-mol-storey.csv')
#Copy GDS file (generated from prepareGeno)
if(Sys.info()['sysname']=='Windows'){
file.copy(system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada"),'uganda-subset-mol.gds') #If you run Windows
} else {
file.copy(system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada"),'uganda-subset-mol.gds') #If you run MacOS or Linux
}
#Load help
#================
?prepareOutput
#prepareOutput
#================
prep = prepareOutput(sambadaname='uganda-subset-mol', dimMax=2, popStr=TRUE, interactiveChecks=FALSE)
#Also try with interactiveChecks=TRUE
To explore your results, the first thing you could do is draw a manhattan plot of each environmental variable to detect one or several interesting peaks in particular variables
#Loads data
#================
#You need to run prepareOutput to run this function
#Load help
#================
?plotManhattan
#plotManhattan
#================
#Plot manhattan of all kept variables
plotManhattan(prep, c('bio1','bio2','bio3'),chromo='all',valueName='pvalueG')
#Warning: the manhattan plot is different from what we are used to see, given the small number of SNPs
The function plotResultInteractive
can now be invoked. This will open a local page on your web-browser with a manhattan plot (you have to choose the environmental variable you want to study and can choose the chromosomes to draw).
The plot is interactive so that you can select a point on the manhattan which will query the Ensembl database to indicate nearby SNPs and the consequence of the variant (intergenic, synonymous, non-synonymous, stop gained, stop lost,…) with VEP.
The map of the marker, population variables and environmental variable is available.
A boxplot showing the environmental range of both present and absent individuals is drawn as well.
#Loads data
#================
#You need to run prepareOutput to run this function
#Locate environmantal file (generated with prepareEnv)
envFile2=system.file("extdata", "uganda-subset-env-export.csv", package = "R.SamBada")
#Locate GDS file (generated from prepareGeno)
if(Sys.info()['sysname']=='Windows'){
gdsFile=system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada") #If you run Windows
}else {
gdsFile=system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada") #If you run MacOS or Linux
}
#Load help
#================
?plotResultInteractive
#plotResultInteractive
#================
plotResultInteractive(preparedOutput=prep, varEnv='bio1', envFile=envFile2,species='btaurus', pass=50000,x='longitude',y='latitude',gdsFile=gdsFile, IDCol='short_name', popStrCol='pop1')
#Accepts the Dataset and SNP Data found
#Once the interactive window opens, click on any point of the manhattan plot
You might be interested in plotting a map (though it is already done in the plotResultInteractive
). The advantages of the plotMap
are
You can draw a map of
#Loads data
#================
#You need to run prepareOutput to run this function
#Locate environmental file (generated with prepareEnv)
envFile2=system.file("extdata", "uganda-subset-env-export.csv", package = "R.SamBada")
#Locate GDS file (generated from prepareGeno)
if(Sys.info()['sysname']=='Windows'){
gdsFile=system.file("extdata", "uganda-subset-mol_windows.gds", package = "R.SamBada") #If you run Windows
}else {
gdsFile=system.file("extdata", "uganda-subset-mol_unix.gds", package = "R.SamBada") #If you run MacOS or Linux
}
#Load help
#================
?plotMap
#plotMap
#================
plotMap(envFile=envFile2, x='longitude', y='latitude', locationProj=4326, popStrCol='pop1', gdsFile=gdsFile, markerName='Hapmap28985-BTA-73836_GG', mapType='marker', varEnvName='bio1', simultaneous=FALSE)
Good luck with the analysis of your own dataset!