A quick guide to crossover identification using RTIGER

Introduction

Accurate identification of meiotic crossing-over sites (COs) is essential for correct genotyping of recombining samples. RTIGER is a method for predicting genome-wide COs using allele-counts at pre-defined SNP marker positions. RTIGER trains a rigid Hidden Markov Model (rHMM1) where each genomic state (homozygous parent 1, homozygous parent 2 or heterozygous) corresponds to respectively one hidden state and the allele-counts as the observed variable. COs are identified as transitions in the HMM state.

To account for sparse sampling of individual marker positions in the sequencing data, RTIGER uses Viterbi Path Algorithm and the rigidity parameter. This parameter defines the minimum number of SNP markers required to support a state-transition. This filters out low-confidence state-transitions, improving COs identification performance.

Installation

Pre-Requisites

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.14")

BiocManager::install(c("GenomicRanges", "GenomeInfoDb", "TailRank", "IRanges", "Gviz"))

Install

RTIGER can be installed directly from CRAN using:

install.packages("RTIGER")

Additionally, the development version (with the most recent features and bug-fixes) can be downloaded from GitHub using the following:

install.packages("devtools")
library(devtools)
install_github("rfael0cm/RTIGER", ref = 'main')

Preparing input data

RTIGER uses the allele-count information at the SNP marker positions. The SNP markers correspond to differences between the two genotypes (i.e. parent_1 vs parent_2). RTIGER requires as input one allele-count file for each sample. The allele-count file should be in tab-separated value format, where each row corresponds to a SNP marker. For example:

Example allele frequency file
Chr1 37388 C 0 T 2
Chr1 71348 T 1 G 3
Chr2 18057 A 0 C 1
Chr2 38554 G 0 A 2
Chr2 75348 A 1 T 0
Chr3 32210 T 2 G 0

Here the columns are:

The SNPs can be identified using any generic SNP identification pipeline3.

SNPs in repetitive regions should be filtered out. Further, as crossing-over usually takes place in syntenic regions between the two genomes, for best results, only SNPs in syntenic regions should be selected as markers. If whole genome assemblies are present for both genomes, then this can be easily achieved using methods like SyRI4.

Notes

Using RTIGER

Setting up Julia environment

RTIGER uses Julia to perform computationally intensive model training. All Julia packages that are used by RTIGER can be installed using using:

library(RTIGER)
setupJulia()

This step is necessary when using RTIGER for the first time, but can be skipped for later analysis as all required Julia packages would already be installed.

The Julia functions need to be loaded in the R environment using:

sourceJulia()

This step is required every time when using RTIGER.

Creating input objects

The primary input for RTIGER is a data-frame termed expDesign. The first column of expDesign should have paths to allele-count files for all samples and the second column should have unique samples IDs5.

# Get paths to example allele count files originating from a
# cross between Col-0 and Ler accession of the A.thaliana
file_paths = list.files(system.file("extdata",  package = "RTIGER"), full.names = TRUE)

# Get sample names
sampleIDs <- basename(file_paths)

# Create the expDesign object
expDesign = data.frame(files=file_paths, name=sampleIDs)

print(expDesign)
#>                                                          files         name
#> 1 /tmp/Rtmpco6loc/Rinstc012e7d1318/RTIGER/extdata/sampleAA.txt sampleAA.txt
#> 2 /tmp/Rtmpco6loc/Rinstc012e7d1318/RTIGER/extdata/sampleAC.txt sampleAC.txt
#> 3  /tmp/Rtmpco6loc/Rinstc012e7d1318/RTIGER/extdata/sampleB.txt  sampleB.txt

RTIGER also requires chromosome lengths for the parent 1. These need to be provided as a named vector where the values are chromosome lengths and the names are chromosome ids.

# Get chromosome lengths for the example data included in the package
chr_len <- RTIGER::ATseqlengths
names(chr_len) <- c('Chr1' , 'Chr2', 'Chr3', 'Chr4', 'Chr5')
print(chr_len)
#>     Chr1     Chr2     Chr3     Chr4     Chr5 
#> 34964571 22037565 25499034 20862711 31270811

Finding crossing-over sites using RTIGER

RTIGER does model training, CO identification, and creates summary plots using the RTIGER function.

myres = RTIGER(expDesign = expDesign,
               outputdir = "/PATH/TO/OUTPUT/DIRECTORY",
               seqlengths = chr_len,
               rigidity = 200)

The rigidity parameter defines the required minimum number of continuous markers that together support a state change of the HMM model (Figure 1). Smaller rigidity values increase the sensitivity in detecting COs that are close to each other, but may result in false-positive CO identification because of variation in sequencing coverage. Larger rigidity values improve precision but COs that are close to each other might not be identified. Users are supposed to test and adjust rigidity based on their specific experimental setup.

Effect of different R values on CO identificaion. The top panel shows the allele count for parent 1 (red) and parent 2 (blue) at the marker positions on a chormosomes. Remaining panels show the annotation of genomic regions by RTIGER for different values of R. Here, red blocks are regions that are homozygous for parent 1, purple blocks are for heterozygous regions, and blue blocks are for regions homozygous for parent 2.

Effect of different R values on CO identificaion. The top panel shows the allele count for parent 1 (red) and parent 2 (blue) at the marker positions on a chormosomes. Remaining panels show the annotation of genomic regions by RTIGER for different values of R. Here, red blocks are regions that are homozygous for parent 1, purple blocks are for heterozygous regions, and blue blocks are for regions homozygous for parent 2.

RTIGER Output

RTIGER identifies COs for each sample level and provides summary plots and statistics for each sample as well as for the entire population.

Per sample output

RTIGER creates a folder for each sample in the outputdir. This folder contains:

Summary plots

In addition to per sample output, RTIGER also generates three summary plots for the samples.

Analysing backcrossed populations

Backcrossed populations are formed by crossing a hybrid organism with one of its parent. These populations are different from the populations based on outcrossing as only two genomic states are possible (homozygous for the backrossed parent and heterozygous for both parents). To identify COs in such population, set nstates=2 in the RTIGER command.

myres = RTIGER(expDesign = expDesign, 
               outputdir = "PATH/TO/OUTPUT/DIR",
               seqlengths = chr_len,
               rigidity = 200, 
               nstates=2)

Citation

Campos-Martin R, et al., Reliable genotyping of recombinant genomes using a robust hidden Markov Model, 2022

Session info

#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.37
#> 
#> loaded via a namespace (and not attached):
#>  [1] qpdf_1.1               tidyselect_1.1.1       xfun_0.29             
#>  [4] bslib_0.3.1            purrr_0.3.4            reshape2_1.4.4        
#>  [7] generics_0.1.1         colorspace_2.0-3       vctrs_0.3.8           
#> [10] oompaData_3.1.1        htmltools_0.5.2        stats4_4.1.3          
#> [13] yaml_2.3.5             utf8_1.2.2             rlang_1.0.1           
#> [16] e1071_1.7-9            jquerylib_0.1.4        pillar_1.7.0          
#> [19] DBI_1.1.2              TailRank_3.2.1         glue_1.6.2            
#> [22] BiocGenerics_0.40.0    GenomeInfoDbData_1.2.7 lifecycle_1.0.1       
#> [25] plyr_1.8.6             stringr_1.4.0          zlibbioc_1.40.0       
#> [28] munsell_0.5.0          gtable_0.3.0           evaluate_0.15         
#> [31] Biobase_2.54.0         IRanges_2.28.0         fastmap_1.1.0         
#> [34] GenomeInfoDb_1.30.1    class_7.3-20           fansi_1.0.2           
#> [37] highr_0.9              Rcpp_1.0.8             scales_1.1.1          
#> [40] RTIGER_1.99.0          S4Vectors_0.32.3       jsonlite_1.7.2        
#> [43] XVector_0.34.0         askpass_1.1            ggplot2_3.3.5         
#> [46] digest_0.6.29          stringi_1.7.6          dplyr_1.0.7           
#> [49] oompaBase_3.2.9        GenomicRanges_1.46.1   grid_4.1.3            
#> [52] cli_3.2.0              tools_4.1.3            bitops_1.0-7          
#> [55] magrittr_2.0.2         sass_0.4.0             RCurl_1.98-1.6        
#> [58] proxy_0.4-26           tibble_3.1.6           cluster_2.1.2         
#> [61] pkgconfig_2.0.3        crayon_1.5.0           ellipsis_0.3.2        
#> [64] assertthat_0.2.1       rmarkdown_2.11         rstudioapi_0.13       
#> [67] JuliaCall_0.17.4       R6_2.5.1               compiler_4.1.3

  1. Longer description: https://kups.ub.uni-koeln.de/37286/↩︎

  2. https://www.geeksforgeeks.org/how-to-setup-julia-path-to-environment-variable/?ref=lbp↩︎

  3. For example: https://www.ebi.ac.uk/sites/ebi.ac.uk/files/content.ebi.ac.uk/materials/2014/140217_AgriOmics/dan_bolser_snp_calling.pdf↩︎

  4. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1911-0↩︎

  5. Avoid use of special characters or spaces in file-names as that may result in errors.↩︎