‘geneHapR’ data

Zhang RenLiang

2022-08-19

The first step in using any software is to know what its inputs and outputs are.

Inputs

Before using geneHapR, the user should prepare several data sets.

Haplotype analysis requires a file that stores the DNA-level differences among individuals. This file can be supplied in variant call format (VCF), in FASTA format, or in a multiple-alignment format.

The following files are also recommended for filtration, visualization, and phenotype association analysis: an annotation file in General Feature Format (GFF) that stores the annotations for the target species, and two tables that store the phenotype data and the accession grouping information, respectively. The two tables can also be supplied as ‘R’ objects of class data.frame.

Input data

VCF file (variant call format file), imported into ‘R’ as a vcfR object.

GFF file (genome annotations), imported into ‘R’ as a GRanges object.

DNA sequences (FASTA format), imported into ‘R’ as a DNAStringSet object.

Phenotype data and accession group information, imported into ‘R’ as data.frame objects.
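As a minimal sketch of the last item, the two tables could look like the data.frame objects below. The accession IDs, trait names and group labels are made up for illustration, and storing the accessions as row names is an assumption of this sketch, not something stated above. Only base ‘R’ is used.

```r
# Toy phenotype table: one row per accession, one column per trait.
# All IDs and values are hypothetical.
pheno <- data.frame(
  GrainWeight = c(21.3, 25.8, 19.4),
  PlantHeight = c(98, 105, 91),
  row.names   = c("Acc1", "Acc2", "Acc3")
)

# Toy grouping table: the group each accession belongs to.
accGroup <- data.frame(
  Type      = c("landrace", "cultivar", "landrace"),
  row.names = c("Acc1", "Acc2", "Acc3")
)

# A tab-delimited file can be read into the same shape with base R.
# Round trip through a temporary file to show the idea.
tmp <- tempfile(fileext = ".txt")
write.table(pheno, tmp, sep = "\t", quote = FALSE, col.names = NA)
pheno2 <- read.delim(tmp, row.names = 1, check.names = FALSE)
```

Keeping the accession IDs in one place (here, the row names) makes it easy to match the phenotype and grouping tables against the accessions found in the variant data.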

Output/results

The main results, hapResult and hapSummary, can be exported as tab-delimited tables; visualizations can be exported as image files or PDF files.

hapResult and hapSummary

hapResult and hapSummary are each effectively a matrix that can be divided into three parts, plus some additional attributes.

Part I consists of a single column that indicates the content type of each row. The first four rows are fixed and hold additional information: CHROM, POS, INFO and ALLELE. Further annotations are stored in the fields of INFO, with the fields separated by semicolons (;). The following rows hold the names of the haplotypes.
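To illustrate the semicolon-separated INFO fields, here is a small base ‘R’ sketch that splits one INFO string into named fields. The example string and its key=value layout follow the usual VCF convention and are assumptions of this sketch; the actual fields depend on the input data.

```r
# Hypothetical INFO string for one site (not taken from real data)
info <- "AC=12;AN=50;ANN=missense_variant"

# Split the string into its fields at each semicolon
fields <- strsplit(info, ";", fixed = TRUE)[[1]]

# Split each field at the first "=" into a key and a value.
# A field without "=" would keep its full text as both key and value.
keys   <- sub("=.*$", "", fields)
values <- sub("^[^=]*=", "", fields)

# Named list, so a single annotation can be looked up by its key
info_list <- setNames(as.list(values), keys)
info_list[["ANN"]]
```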

Part II: consists of at least one column, and each column represents a site. The first four elements of each column contain the information and annotations of that site (CHROM, POS, INFO and ALLELE). The following elements give the genotype of the corresponding haplotype at that site.

Part III: in hapResult, part III consists of one column named Accession, while in hapSummary it consists of two columns named Accession and freq.
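The three-part layout can be pictured with a toy character matrix built in base ‘R’. This is only a sketch of the described structure, not an object produced by geneHapR: the column name Hap, the site names, the genotypes, and the empty Accession cells in the first four rows are all assumptions made for illustration.

```r
# Toy matrix shaped like a hapResult: four fixed rows, then one row per accession
hap <- rbind(
  c("CHROM",  "Chr1", "Chr1", ""),
  c("POS",    "1001", "1050", ""),
  c("INFO",   "AC=2", "AC=1", ""),
  c("ALLELE", "A/G",  "T/C",  ""),
  c("H001",   "A",    "T",    "Acc1"),
  c("H001",   "A",    "T",    "Acc2"),
  c("H002",   "G",    "C",    "Acc3")
)
colnames(hap) <- c("Hap", "1001", "1050", "Accession")

# Part I: row type (CHROM, POS, INFO, ALLELE, then haplotype names)
partI <- hap[, 1]

# Part II: one column per site
partII <- hap[, 2:3, drop = FALSE]

# Part III: the accession carried by each haplotype row
partIII <- hap[5:nrow(hap), "Accession"]

# Genotypes only, without the four site-information rows
genotypes <- partII[-(1:4), , drop = FALSE]
```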

The differences between hapResult and hapSummary lie only in part III: (a) hapSummary has a freq column while hapResult does not; (b) in hapSummary, multiple accessions are joined by semicolons in a single row, while in hapResult each row holds one accession.
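The relationship between the two layouts can be sketched in base ‘R’ by collapsing a one-accession-per-row table into one row per haplotype. This is an illustration of the difference described above, not the package's own implementation; the table, its column name Hap and the accession IDs are hypothetical.

```r
# Toy part III in hapResult style: one accession per row
hapRes <- data.frame(
  Hap       = c("H001", "H001", "H002"),
  Accession = c("Acc1", "Acc2", "Acc3"),
  stringsAsFactors = FALSE
)

# Join the accessions of each haplotype with semicolons
accs <- tapply(hapRes$Accession, hapRes$Hap, paste, collapse = ";")

# Count how many accessions carry each haplotype
freq <- tapply(hapRes$Accession, hapRes$Hap, length)

# hapSummary style: one row per haplotype, with a freq column
hapSum <- data.frame(
  Hap       = names(accs),
  Accession = as.vector(accs),
  freq      = as.vector(freq),
  stringsAsFactors = FALSE
)
```

In this toy case, H001 ends up with the accession string Acc1;Acc2 and a freq of 2, while H002 keeps its single accession with a freq of 1.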

Cartoon representation of hapResult and hapSummary contents
