The purpose of this vignette is to show how to work with pedigrees
and marker data in pedtools
.
The following command installs the current CRAN version of the package:
install.packages("pedtools")
Alternatively, you may want the latest development version from GitHub:
# install.packages("devtools") # install devtools if needed
::install_github("magnusdv/pedtools") devtools
Now you should be able to load pedtools
.
library(pedtools)
In pedtools
and all related packages, pedigrees are
stored as ped
objects. We start by explaining briefly what
these objects look like, and their basic constructor. If you are reading
this vignette simply to learn how to create a particular pedigree, you
may want to skip ahead to section 1.3 where we describe practical
shortcuts to common pedigree structures.
ped
classThe ped
constructor function
The most direct way to create a pedigree in pedtools
is
with the ped()
constructor. This takes as input 4 vectors
of equal length:
id
: individual ID labels (numeric or character)fid
: id of the fathers (0 if not included)mid
: id of the mothers (0 if not included)sex
: gender codes, with entries 0 (unknown), 1 (male)
or 2 (female)In other words, the j’th pedigree member has label
id[j]
, father fid[j]
, mother
mid[j]
, and gender given by sex[j]
.
For example, the following creates a family trio, i.e. father, mother and child:
ped(id = 1:3, fid = c(0,0,1), mid = c(0,0,2), sex = c(1,2,2))
#> id fid mid sex
#> 1 * * 1
#> 2 * * 2
#> 3 1 2 2
In this example the child (id=3
) is female, since the
associated entry in sex
is 2. Note that missing parents are
printed as *
. Individuals without parents are called
founders of the pedigree, while the nonfounders have
both parents specified. It is not allowed to have exactly one
parent.
Instead of numerical labels as above, we could have used character
strings. Let us create the trio again, with more informative labels, and
store it in a variable named trio
.
= ped(id = c("fa", "mo", "girl"), fid = c("","","fa"), mid = c("","","mo"), sex = c(1,2,2))
trio
trio#> id fid mid sex
#> fa * * 1
#> mo * * 2
#> girl fa mo 2
The special strings "0"
, ""
and
NA
are all interpreted as a missing parent.
The internal structure of ped
objects
From the way it is printed, the object trio
appears to be a
data frame, but this is not exactly true. Rather it is an object of
class ped
, which is basically a list. We can see the actual
content of trio
by unclassing it:
unclass(trio)
#> $ID
#> [1] "fa" "mo" "girl"
#>
#> $FIDX
#> [1] 0 0 1
#>
#> $MIDX
#> [1] 0 0 2
#>
#> $SEX
#> [1] 1 2 2
#>
#> $FAMID
#> [1] ""
#>
#> $UNBROKEN_LOOPS
#> [1] FALSE
#>
#> $LOOP_BREAKERS
#> NULL
#>
#> $FOUNDER_INBREEDING
#> NULL
#>
#> $MARKERS
#> NULL
In most cases it is not recommended for regular users to interact
directly with the internal slots of a ped
, since this can
have unfortunate consequences unless you know exactly what you are
doing. Instead, one should use accessor functions like
labels()
, getMarkers()
and
founderInbreeding()
. The most important accessors are
described within this vignette, while others are documented in the help
page ?ped_utils
.
To plot a pedigree, simply use plot()
.
plot(trio)
Under the hood, pedtools::plot()
is an elaborate wrapper
of the excellent plotting functionality of the kinship2
package. Most of the possibilities provided by kinship2 are available
from pedtools, and several features are added. An overview can be found
in the documentation ?plot.ped
, but a quick example should
get you started:
plot(trio, deceased = "fa", starred = "mo", hatched = "girl",
col = c("green", "red", "blue"), title = "Trio 1")
See Section 2.2 for how to add, and control the appearance of, marker genotypes to pedigree plots.
Rather than using the ped()
function directly, it is
usually quicker and safer to build pedigrees step by step, applying the
arsenal of utility functions offered by pedtools
. A typical
workflow is as follows:
You will find several examples below, but first let us list the available tools for each of the 3 steps.
Basic pedigrees
The following pedigree structures serve as starting points for pedigree
constructions. For parameters and details, see
?ped_basic
.
singleton()
, a pedigree consisting of a single
individualnuclearPed()
, a nuclear pedigree
(parents+children)halfSibPed()
, two sibships with one parent in
commonlinearPed()
, a straight line of successorsavuncularPed()
, uncle-nephew and similar
relationshipscousinPed()
, cousins of specified degree/removalhalfCousinPed()
, half cousins of specified
degree/removalancestralPed()
, a family tree containing the ancestors
of a single personThere are also more specialized structures, including double cousins,
selfing pedigrees, and consecutive matings between full siblings. Look
them up in ?ped_complex
if you are interested.
Add/remove/extract individuals
The functions below are used to modify an existing ped
object by adding/removing individuals, or extracting a sub-pedigree. For
details, see ?ped_modify
.
addChildren()
, with special cases addSon()
and addDaughter()
addParents()
removeIndividuals()
branch()
subset()
Edit labels and attributes
The following functions modify various attributes of a ped
object. See ?ped_modify
for parameters and details.
setSex()
swapSex()
relabel()
As our first example we will recreate the trio
pedigree
without using the ped()
constructor. To give a hint of the
flexibility, we show 3 alternative ways to code this.
Alternative A
The obvious starting point is nuclearPed()
, with
nch = 1
to indicate 1 child. By default, this creates a
trio with numeric labels (father=1; mother=2; child=3) and a male child.
Hence we fix the gender with swapSex()
, and edit the labels
with relabel()
:
= nuclearPed(nch = 1)
trio2 = swapSex(trio2, ids = 3)
trio2 = relabel(trio2, new = c("fa", "mo", "girl")) trio2
Alternative B (quickest and best)
The previous approach can be condensed into a one-liner, since
nuclearPed()
allows an alternative syntax in which child
genders and labels are specified directly:
= nuclearPed(father = "fa", mother = "mo", children = "girl", sex = 2) trio3
Alternative C
Here is another possibility. We start by creating the father as a
singleton, and then add the daughter:
= singleton("fa")
trio4 = addDaughter(trio4, parent = "fa", id = "girl")
trio4 #> Mother: Creating new individual with ID = NN_1
= relabel(trio4, old = "NN_1", new = "mo") trio4
Note that addDaughter()
automatically created the mother
as “NN_1”, so we needed to relabel her.
This time we will create this inbred family:
Alternative A
One approach is to first create the pedigree consisting of individuals
1-6, with halfSibPed()
, and then use addSon()
to add the inbred child.
= halfSibPed(nch1 = 1, nch2 = 2, sex1 = 1, sex2 = 2:1)
x1 = addSon(x1, parents = 4:5) x1
Alternative B
We could also view the half siblings 4 and 5 as half cousins of degree
0. The halfCousinPed()
function accepts an option
child = TRUE
adding an inbred child. The labels will be
different with this approach, so you should plot the pedigree after each
command to see who-is-who. Also, we must relabel in the end.
= halfCousinPed(0, child = T)
x2 = addSon(x2, parents = 2:3)
x2 = relabel(x2, old = c(7,6), new = c(6,7)) x2
A note about the order of pedigree members
Although both x1
and x2
reproduce exactly
the plot shown above, they are not identical objects:
identical(x1, x2)
#> [1] FALSE
The reason is the order in which the individuals are stored.
For x1
the ordering is the natural sequence
1,2,3,4,5,6,7
, but for x2
our construction
process has produced a slightly different order:
x2#> id fid mid sex
#> 1 * * 2
#> 2 * * 1
#> 3 * * 2
#> 4 2 1 1
#> 5 2 3 2
#> 7 4 5 1
#> 6 2 3 1
The internal ordering is usually of little importance in
applications.1 However, if you get annoyed by “wrong”
orderings such as for x2
above, you can use
reorderPed()
to permute the pedigree any way you like. In
fact, the default action of this function is to permute into the natural
order of the labels, which is exactly what we need to make
x2
identical to x1
:
= reorderPed(x2)
x2 identical(x1, x2)
#> [1] TRUE
Here we consider the family tree below, extending both upwards and
downwards from a single person. We will use this example to demonstrate
the mergePed()
function, which “glues” together two
pedigrees by the indicated members.
# Top part
= ancestralPed(g = 2) # bottom person is "7"
x
# Bottom part
= halfCousinPed(degree = 1) # top person is "2"
y = swapSex(y, 4)
y #> Changing sex of spouses as well: 5
# Merge
= mergePed(x, y, by = c("7" = "2"), relabel = TRUE) z
Note the argument by = c("7" = "2")
, which means that
individual “7” in x
should be identified with “2” in
y
. As seen in the plot below, this individual ends up as
“8” after relabelling in the final result.
plotPedList(list(x, y, z))
Many situations call for selecting all pedigree members sharing some property, e.g., all females, or all descendants of some person. Several utilities in pedtools exist to help with such tasks. Generally these come in two flavours: 1) members with certain global property, and 2) members with a certain relationship to a given individual.
Pedigree members with a certain property
Each of the following functions returns a vector specifying the members
with the given property.
founders()
nonfounders()
leaves()
males()
females()
typedMembers()
untypedMembers()
By default, the output of these functions is a character vector
containing ID labels. However, adding the option
internal = TRUE
will give you an integer vector instead,
reporting the internal indices of the members. This is frequently used
in the source code of pedtools
, but is usually not intended
for end users of the package.
Relatives of a given individual
The functions below take as input a ped
object and the
label of a single member. They return a vector of all members with the
given relation to that individual.
father()
mother()
parents()
grandparents()
children()
spouses()
siblings()
cousins()
nephews_nieces()
ancestors()
descendants()
unrelated()
The other main theme of the pedtools
package (pedigrees
being the first) are marker genotypes.
Marker objects created with the marker()
function. For
example, the following command makes an empty marker associated with the
trio
pedigree:
marker(trio)
#> id <NA>
#> fa -/-
#> mo -/-
#> girl -/-
#> * * * * * *
#> Position: NA
#> Mutation: none
#> Frequencies:
#> 1 2
#> 0.5 0.5
As shown in the output, the marker is indeed empty: All pedigree
members have missing genotypes, and there is no assigned name or
position. By default, markers are diallelic, with alleles 1 and 2, with
equal frequencies. For a more interesting example, let us make a SNP
named “snp1”, with alleles “A” and “B”. The father is homozygous “A/A”,
while the mother is heterozygous. We store it in a variable
m1
for later use.
= marker(trio, fa = "A/A", mo = "A/B", name = "snp1") m1
This illustrates several points. Firstly, individual genotypes may be specified using the ID labels. The different alleles occurring in the genotypes is interpreted as the complete set of alleles for the marker. Finally, these are assigned equal frequencies. Of course, this behaviour can be overridden, by declaring alleles frequencies explicitly:
marker(trio, fa = "A/A", mo = "A/B", afreq = c(A = .2, B = .3, C = .5))
#> id <NA>
#> fa A/A
#> mo A/B
#> girl -/-
#> * * * * * *
#> Position: NA
#> Mutation: none
#> Frequencies:
#> A B C
#> 0.2 0.3 0.5
The markers chromosome can be declared using the chrom
argument, and similarly its position by posMb
(megabases).
Markers with unknown chromosome are treated as autosomal. To define an
X-linked marker, put chrom = "X"
. the fact that males are
hemizygous on X (i.e. they have only one allele) is reflected in the
printout of such markers:
= marker(trio, fa = "A/A", mo = "A/B", chrom = "X", name = "snpX")
m2
m2#> id snpX
#> fa A
#> mo A/B
#> girl -/-
#> * * * * * *
#> Position: chr = X, Mb = NA
#> Mutation: none
#> Frequencies:
#> A B
#> 0.5 0.5
A side note: It may come as a surprise that you don’t need quotes
around the ID labels (which are characters!) in the above commands. This
is because marker()
uses non-standard evaluation
(NSE), a peculiarity of the R language which often leads to less
typing and more readable code.2 Unfortunately, this doesn’t work with
numerical ID labels. Thus to assign a genotype to someone labelled “1”
you need quotes, as in marker(trio, "1" = "A/A")
.
Including marker data in a pedigree plot is straightforward:
plot(trio, marker = m1)
The appearance of the genotypes can be tweaked in various ways, as
documented in ?plot.ped
. Here’s an example:
plot(trio, marker = m1, showEmpty = T, missing = "?", sep = "")
ped
objectsIn most applications it is useful to attach markers to their
ped
object. In particular for bigger projects with many
markers, this makes it easier to manipulate the dataset as a unit.
To attach a marker object m
(which could be a list of
several markers) to a pedigree x
, there are two main
options:
setMarkers(x, m)
addMarkers(x, m)
The difference between these is that setMarkers()
replaces all existing markers, while addMarkers()
appends
m
to the existing ones. In our trio
example
the two are equivalent since there are no existing markers.
= setMarkers(trio, list(m1, m2))
trio
trio#> id fid mid sex snp1 snpX
#> fa * * 1 A/A A
#> mo * * 2 A/B A/B
#> girl fa mo 2 -/- -/-
There is a handy shortcut, addMarker()
(without the
‘s’), allowing you to create and attach a single marker in one go. The
command addMarker(x, ...)
is essentially equivalent to
setMarkers(x, marker(x, ...))
. It is also well adapted to
piping, as in this example:
nuclearPed(1) |>
addMarker(name = "myMarker", alleles = c("a", "b", "c")) |>
setGenotype(marker = 1, id = 3, geno = "a/c")
#> id fid mid sex myMarker
#> 1 * * 1 -/-
#> 2 * * 2 -/-
#> 3 1 2 1 a/c
Selecting and removing attached markers
Four closely related functions functions are useful for manipulating
markers attached to a pedigree:
selectMarkers()
, returns a ped
object
where only the indicated markers are retainedremoveMarkers()
, returns a ped
object
where the indicated markers are removedgetMarkers()
, returns a list of the indicated
markerswhichMarkers()
, returns the indices of the indicated
markersAll of these have exactly the same arguments, described in more
detail in ?marker_select
. Let us do a couple of examples
here. Recall that by now, our trio
has two attached
markers; the first is called “snp1”, and the other is on the X
chromosome.
whichMarkers(trio, chrom = "X")
#> [1] 2
selectMarkers(trio, markers = "snp1")
#> id fid mid sex snp1
#> fa * * 1 A/A
#> mo * * 2 A/B
#> girl fa mo 2 -/-
Internally, a marker object is stored as a matrix with two columns (one for each allele) and one row for each pedigree member. The matrix is numeric (for computational convenience) while the allele labels and other meta information are added as attributes. The most important of these are:
alleles
: The allele labels, stored as a character
vector.afreq
: The allele frequencies, in the same order as
the alleles. An error is issued if the frequencies do not sum to 1 after
rounding to 3 decimals.name
: The marker name, which can be any character
string not consisting solely of digits.chrom
: The chromosome name. This can be given as an
integer, but is always converted to character. The special values “23”
and “X” are recognized as the human X chromosome, which affects the way
genotypes are printed.posMb
: Chromosomal position given in megabases.In addition to those listed above, there are two more attributes:
pedmembers
and sex
. They store the ID labels
and genders of the pedigree associated with the marker, and are only
used to empower the printing method of marker objects.
Marker accessor functions
For each marker attribute listed above, there is a corresponding
function with the same name for retrieving its content. These functions
take as input either a marker
object, or a ped
object together with the name (or index) of an attached marker. This may
sound a bit confusing, but a few examples will make it clear!
Recall that our marker “snp1” exists in two copies: One is stored in
the variable m1
, while the other is attached to
trio
. In both cases we can extract the allele frequencies
with the function afreq()
.
afreq(m1)
#> A B
#> 0.5 0.5
afreq(trio, marker = "snp1")
#> A B
#> 0.5 0.5
We can also modify the frequencies using this syntax. To
avoid confusion about the allele order, the frequencies must be named
with the allele labels (just as in the output of afreq()
above).
afreq(trio, marker = "snp1") = c(A = 0.9, B = 0.1)
In addition to the functions getting and setting marker attributes,
there is one more important marker accessor, namely
genotype()
. This returns the genotype of a specified
individual, and can also be used to modify genotypes. As the others, it
can be applied to marker objects directly, or to pedigrees with attached
markers. Here we show a few examples of the latter type:
# Girl is not genotyped
genotype(trio, "snpX", id = "girl")
#> [1] NA NA
# Set genotype
genotype(trio, "snpX", id = "girl") = "A/A"
# Inspect the result
trio#> id fid mid sex snp1 snpX
#> fa * * 1 A/A A
#> mo * * 2 A/B A/B
#> girl fa mo 2 -/- A/A
pedtools
are indented for modifying
many (or all) markers at the same time. Their purpose and typical use
cases are summarised in the table below. The argument x
always denotes a ped
object.
Use … | When you want to … | For example to … |
---|---|---|
Get | ||
getAlleles(x)
|
extract all alleles as a matrix. | do summary stats on the marker alleles |
getFreqDatabase(x)
|
extract allele frequencies as a data.frame in allelic ladder format. | transfer to other objects, or write the database to a file |
getMarkers(x)
|
extract list of marker objects. Each marker is a N * 2
allele matrix (N = pedsize(x) ) with locus annotations as
attributes
|
do computations |
Set | ||
setAlleles(x, ...)
|
replace the genotypes of x without changing the locus
attributes.
|
erase all genotypes |
setFreqDatabase(x, db)
|
replace all allele frequencies without changing the genotype data. The
input is a data.frame in allelic ladder format. Conceptually
equivalent to
setMarkers(x, alleleMatrix = getAlleles(x), locusAnnotations = db) .
|
change the frequency database |
setMarkers(x, ...)
|
attach marker objects with or without genotype data. Locus attributes are indicated as a list; genotypes as a matrix or data.frame. | prepare joint manipulation of a pedigree and marker data |
Convert | ||
as.data.frame(x)
|
convert x to a data.frame, with pedigree columns in
standard format followed by genotype columns. One column per marker,
with genotype format a/b and missing alleles indicated as
- .
|
pretty-print ped objects |
as.matrix(x)
|
convert x to a numerical matrix, with additional
info attached as attributes.
|
modify a pedigree with marker data |
Transfer | ||
transferMarkers(from, to)
|
transfer genotypes and attributes between pedigree objects (or lists of such). | transfer simulated marker data |
There is an important exception to this: Certain
algorithms in pedigree analysis work “top-down”, in the sense that
parents must be treated before their children. For this reason, many
implementations require, for simplicity, that the individuals are stored
in this fashion, i.e. that parents always precede their children.
pedtools
offers a special reordering function to ensure
this, parents_before_children()
, which you will find
utilised in the source code of packages like ribd
and
ibdsim2
.↩︎
You may have come across NSE before, for instance when
using subset()
on a data.frame. To learn more about NSE, I
recommend this book chapter by Hadley Wickham:
http://adv-r.had.co.nz/Computing-on-the-language.html↩︎