In this step-by-step guide we will look how the function gawdis()
works and to apply it using simple data (either available in the FD package or made up below). The function gawdis()
, designed by Pavel Fibich, is an extension of the classic function gowdis()
in the package FD (Laliberté, Legendre and Shipley 2014). The function computes dissimilarity between units, usually species, based on multiple types of variables (e.g. quantitative, categorical etc.), usually species’ traits. Hence it can normally be applied to compute multi-trait dissimilarity between species in functional trait ecology studies, but it can be used in other applications as well. The dissimilarity obtained can be computed in order to attain a quasi-identical contribution of individual variables (e.g. traits) or group of associated variables (on variables reflecting similar information, e.g. multiple leaf traits). The dissimilarity is computed by either an analytical approach or through iterations. The function borrows several arguments from gowdis()
, with additional ones described below. This includes the option to consider fuzzy and dummy variables (e.g. multiple columns defining a single trait).
Let’s first load functions.
We will now load the package gawdis, which you should have previously installed on your computer (it is available on standard CRAN repository). The gawdis package includes the function gawdis()
and already loads other packages, such as FD and GA among others"
library(gawdis)
## Loading required package: FD
## Loading required package: ade4
## Loading required package: ape
## Loading required package: geometry
## Loading required package: vegan
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-7
## Loading required package: GA
## Loading required package: foreach
## Loading required package: iterators
## Package 'GA' version 3.2
## Type 'citation("GA")' for citing this R package in publications.
##
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
##
## de
We will start now by looking what the function does, overall. To do this let’s first look at the original function gawdis
with the data dummy$trait
, which includes invented data for few species and their traits, available in the FD package (gawdis sources the FD package).
$trait dummy
## num1 num2 fac1 fac2 ord1 ord2 bin1 bin2
## sp1 9.0 4.5 A X 3 2 0 1
## sp2 8.1 6.0 A Z <NA> 1 0 1
## sp3 NA 2.3 C Y 5 3 1 1
## sp4 3.2 5.4 B Z 1 7 0 0
## sp5 5.8 1.2 C X 2 6 NA 0
## sp6 3.4 8.5 C Y 2 1 1 1
## sp7 7.5 2.1 B X 3 2 1 0
## sp8 4.3 6.5 <NA> Z 1 3 0 1
<- gowdis(dummy$trait)
ex1 round(ex1, 3)##just to see only 3 decimals, enough
## sp1 sp2 sp3 sp4 sp5 sp6 sp7
## sp2 0.218
## sp3 0.524 0.668
## sp4 0.674 0.561 0.823
## sp5 0.529 0.815 0.486 0.484
## sp6 0.610 0.593 0.278 0.707 0.607
## sp7 0.448 0.686 0.485 0.558 0.302 0.619
## sp8 0.407 0.204 0.596 0.239 0.559 0.447 0.703
class(ex1)
## [1] "dist"
The data dummy$trait
include trait information for 8 species. Some traits are numerical (num1 and num2), some are categorical, i.e. factors (fac1 and fac2, nice names btw.), some semi-quantitative traits, i.e. ordinal traits (ord1 and ord2) and some binary traits (bin1 and bin2). Some traits have NA values. The function gawdis
computes the dissimilarity between the 8 species in the dummy$trait
, based on all traits available. The function returns a ‘triangular’ distance object, of class dist
, which express the (multi-traits) dissimilarity between each pair of species. Value are a scaled between 0 (species are exactly the same) and 1 (species are completely different).
Let’s make even a simpler example, by taking only two traits (num2 and bin2) without NA. Let’s see exactly what gawdis
is doing. First a dissimilarity is computed for each trait. For the quantitative trait the differences in values between each pair of species are scaled to a maximum value 1, by dividing the dissimilarity values to the maximum possible difference for this trait. For example, when we looked at dummy$trait
above, we would see that the sp6 had the highest value for num2 (i.e. 8.5) and sp7 had the lowest (i.e. 2.1). The difference between these two species will be equal to 1. For binary trait (e.g. if a species fly or not, 1 vs. 0), the maximum value is 1, so that there is no need to make such a standardization. Finally, the dissimilarity of the two traits is simply the average of the distances for each individual trait.
<-dist(dummy$trait[, "num2", drop=F])/max(dist(dummy$trait$num2))#note the drop=F not to loose the species names in the process#
distance.num2round(distance.num2, 3)
## sp1 sp2 sp3 sp4 sp5 sp6 sp7
## sp2 0.205
## sp3 0.301 0.507
## sp4 0.123 0.082 0.425
## sp5 0.452 0.658 0.151 0.575
## sp6 0.548 0.342 0.849 0.425 1.000
## sp7 0.329 0.534 0.027 0.452 0.123 0.877
## sp8 0.274 0.068 0.575 0.151 0.726 0.274 0.603
<-dist(dummy$trait$bin2)
distance.bin2 distance.bin2
## 1 2 3 4 5 6 7
## 2 0
## 3 0 0
## 4 1 1 1
## 5 1 1 1 0
## 6 0 0 0 1 1
## 7 1 1 1 0 0 1
## 8 0 0 0 1 1 0 1
<-(distance.num2+distance.bin2)/2
distance.twotraits.byhand<-gowdis(dummy$trait[, c("num2", "bin2")])
distance.twotraits.gowdisplot(distance.twotraits.gowdis, distance.twotraits.byhand)
abline(0, 1)
These simple steps show what the function gawdis
does “automatically” for us. Notice that, for the categorical trait the process is similar as for the binary traits. If the species belong to the same category, then the dissimilarity=0, if they belong to different categories, then the dissimilarity=1. For example:
$trait$fac2 dummy
## [1] X Z Y Z X Y X Z
## Levels: X Y Z
gowdis(dummy$trait[, "fac2", drop=F])#notice that gowdis wants a matrix, not a vector, here a simple solution to solve this and keep species names
## sp1 sp2 sp3 sp4 sp5 sp6 sp7
## sp2 1
## sp3 1 1
## sp4 1 0 1
## sp5 0 1 1 1
## sp6 1 1 0 1 1
## sp7 0 1 1 1 0 1
## sp8 1 0 1 0 1 1 1
You can see that sp1 and sp2 have different categories (X and Z), so the dissimilarity is=1.
Let’s now see why we actually need the new function gawdis
, in addition to the traditional gowdis
. In the example above distance.twotraits.gowdis
reflects the dissimilarity of both traits, i.e. multi-trait dissimilarity, while distance.num2
and distance.bin2
reflect the dissimilarity for individual traits. What is the contribution of each single trait to the multi-trait dissimilarity? how much each single trait contribute to the final multi-trait dissimilarity? to answer this we can do, for example, a correlation between the multi-trait dissimilarity with the individual trait dissimilarity:
cor(distance.twotraits.gowdis, distance.num2)
## [1] 0.5167754
cor(distance.twotraits.gowdis, distance.bin2)
## [1] 0.8985725
This tells you how much the multi-trait dissimilarity reflects the information of each individual trait. If we look at this example we can see that the binary trait (bin2) is much more correlated to the multi-trait dissimilarity than the quantitative trait (num2), Pearson R=0.89 for the binary trait and R=0.51 for the quantitative. This means that the contribution of the binary trait is much much greater than of the quantitative trait. By the way, categorical/nominal traits will work similarly to binary traits, in general terms.
Are we happy with this result? are we happy that when we combine different variables (traits in this case), the multivariable dissimilarity has a far greater contribution from some variables? We think it is rather fair to say that this is not an ideal solution. Luckily the gawdis
has an useful argument w
, or weight which we can use to modify the contribution of each trait. Remember that the multi-trait dissimilarity is, by default with gawdis
, a simple average of the dissimilarity from individual traits. Instead of doing a simple average, we could do a weighted average, in which some traits could have bigger weights. For example, we could reduce to the weight of the binary trait, to reduce its contribution to the multi-trait dissimilarity, and increase the one for the quantitative trait. In the next lines we show how this can be done “by hand” or we could be using directly the argument w
in gowdis
(notice that we are trying, to start with, to give twice the weight to the binary trait, compared to the quantitative one). See both options below:
<-0.67*(distance.num2)+0.33*(distance.bin2)
distance.twotraits.byhand.weighted<-gowdis(dummy$trait[, c("num2", "bin2")], w=c(2, 1))
distance.twotraits.gowdis.weightedplot(distance.twotraits.byhand.weighted, distance.twotraits.gowdis.weighted)
abline(0, 1)
By doing this the new multi-trait dissimilarity will be more evenly affected by each trait.
cor(distance.twotraits.gowdis.weighted, distance.num2)
## [1] 0.7454126
cor(distance.twotraits.gowdis.weighted, distance.bin2)
## [1] 0.7300753
Specifically the Pearson correlation is R=0.74 for the binary trait and R=0.73 for the quantitative. This is quite even, good job! What we did is modifying the weight of each trait to modify their contribution to the multi-trait dissimilarity. Cool! But…how do we know which values we have to select for the argument w
to allow all traits to have a similar weight? if we have only 2 traits maybe we can try various w
values and hope to find some decent combination, and try until we find a decent combination. The thing gets more complicated if we have many traits. Moreover, as we show below there are also other cases more complicated cases.
gawdis
: basicsThis is where the new function gowdis
will get useful. The function looks for the best values for the w
argument, to obtain an equal contribution of individual traits.
<- gawdis(dummy$trait[,c("num2", "bin2")]) equalcont
## [1] "Running w.type=analytic"
equalcont
## sp1 sp2 sp3 sp4 sp5 sp6
## sp2 0.13584757
## sp3 0.19924310 0.33509067
## sp4 0.42038371 0.39321419 0.61962681
## sp5 0.63773982 0.77358739 0.43849672 0.38037319
## sp6 0.36226018 0.22641261 0.56150328 0.61962681 1.00000000
## sp7 0.55623127 0.69207884 0.35698817 0.29886465 0.08150854 0.91849146
## sp8 0.18113009 0.04528252 0.38037319 0.43849672 0.81886991 0.18113009
## sp7
## sp2
## sp3
## sp4
## sp5
## sp6
## sp7
## sp8 0.73736137
attr(equalcont,"correls")
## num2 bin2
## 0.7377916 0.7377916
attr(equalcont,"weights")
## num2 bin2
## 0.6611248 0.3388752
In one step we have a computed a multi-trait dissimilarity in which the contribution of each trait (provided in output of the function, i.e. in “correls”) are basically identical. This was done by finding an ideal w
value for each trait (provided in the output of the function, “weights”). Notice that the values are very close to the 2:1 ratio we ‘tried’ above with distance.twotraits.byhand.weighted
and distance.twotraits.gowdis.weighted
). Actually we ‘tried’ those values because we knew what was already the solution, but otherwise we would have spent some time really trying multiple combinations.
Of course the function gawdis
can also be used in the exact same way as the original gowdis
. For example, if you recall the object ex1
created above with gowdis
, and copied again below, now we can do the same with gowdis
, just telling that we want to have a similar weight (and hence different contribution) for the different traits. This is the defauls in gowdis
and in gawdis
this is set by either using w.type =“equal” or by w.type =“user” and then saying W
is the same for all traits.
<- gowdis(dummy$trait)
ex1 <- gawdis(dummy$trait, w.type ="equal") ex1.gaw1
## [1] "Running w.type=equal"
<- gawdis(dummy$trait, w.type ="user", W=rep(1, 8)) ex1.gaw2
## [1] "Running w.type=user"
plot(ex1, ex1.gaw1)
abline(0, 1)
plot(ex1, ex1.gaw2)
abline(0, 1)
gawdis
: analytical vs. iterative approachesNow that we verified that gawdis
works fine, let’s see more more details about the function and its different applications. The solutions provided by the function can be obtained in two ways. The first one is by an analytical solution, using purely a mathematical formulas (see main paper and corresponding supplementary material). This is the approach used by default in the function, as we used to create the object equalcont
above. In case of doubts, this approach can be set by using w.type=“analytic”. NOTE that unfortunately this solution cannot be used when there are missing values (NAs).
When there are NAs, even if you set the argument w.type =“analytic” the function will apply a second approach, based on iterations, i.e. it will keep trying several solutions until finding a good one (actually it is based on a fitness improving approach, which gradually improves the output of the results). This will take some time, as multiple alternatives are tested sequentially. It can be obtained by using w.type =“optimized”. The running time will depend, mostly, on the number of species and the number of iterations, which is set by the argument opti.maxiter
, by default set to 300. Below we apply both approach to a subset of the dummy$trait data
, without NAs
<-gawdis(dummy$trait[,c(2,4,6,8)], w.type ="analytic")# it is not needed to add the argument w.type, this is the approach used by default if not defined# analytical
## [1] "Running w.type=analytic"
attr(analytical, "correls")
## num2 fac2 ord2 bin2
## 0.5350139 0.5350139 0.5350139 0.5350139
attr(analytical, "weights")
## num2 fac2 ord2 bin2
## 0.2422698 0.2467646 0.3575842 0.1533815
<-gawdis(dummy$trait[,c(2,4,6,8)], w.type ="optimized", opti.maxiter=100)# here we used 'only' 100 iterations, to speed up the process and because with such few species this is likely more than enough# iterations
## [1] "Running w.type=optimized"
attr(iterations, "correls")
## num2 fac2 ord2 bin2
## 0.5354906 0.5348100 0.5341519 0.5360814
attr(iterations, "weights")
## num2 fac2 ord2 bin2
## 0.2427054 0.2465698 0.3568413 0.1538836
plot(analytical, iterations)
Here you see some slight differences in the results, for example in terms of correlations between the dissimilarity of single traits and the multitrait (i.e. “correls”) and the weights finally used for each trait (“weights”). The final results, the dissimilarity values displayed in the plot, also vary just a tiny bit. This is because the iterations are just based on a trial-error approach, which improves with more iterations, but it might “miss” the perfect mathematical solution, and just approach very close to it. Otherwise we can be quite confident that the results using the iteration tend quite well to the ideal results.
gawdis
: grouping traitsIn some cases the datasets we are considering have multiple variables which reflect some partially overlapping or redundant information. For example, many plant trait databases include a lot of leaf traits. This is the case, for example, of the dataset tussock$trait
, also available in the FD package. Let’s have a look:
dim(tussock$trait)
## [1] 53 16
head(tussock$trait)
## growthform height LDMC leafN leafP leafS SLA
## Achi_mill Semi basal 0.09727 163.1264 3.4 0.310 0.210 15.637656
## Agro_capi Semi basal 0.19100 302.7657 2.5 0.250 0.180 21.846627
## Anth_odor Tussock 0.06350 280.2013 2.0 0.240 0.150 24.547661
## Apha_arve Semi basal 0.04470 185.9606 1.7 0.290 0.180 18.587383
## Arre_elat Tussock 0.56400 311.9551 2.6 0.256 0.174 20.777386
## Brac_bell Short basal 0.00338 223.5052 1.2 0.330 0.220 6.549107
## nutrientuptake raunkiaer clonality leafsize dispersal
## Achi_mill Mycorrhizal Chamaephyte Clonal belowground 313.51484 Wind
## Agro_capi Mycorrhizal Hemicryptophyte Clonal belowground 141.78019 Wind
## Anth_odor Mycorrhizal Hemicryptophyte None 161.43904 Passive
## Apha_arve Mycorrhizal Therophyte None 20.89516 Passive
## Arre_elat Mycorrhizal Hemicryptophyte Clonal belowground 805.42883 Passive
## Brac_bell Mycorrhizal Hemicryptophyte None 216.33835 Wind
## seedmass resprouting pollination lifespan
## Achi_mill 0.1600 Yes Hymenoptera (bees) Perennial
## Agro_capi 0.0580 Yes Wind Perennial
## Anth_odor 0.5600 Yes Wind Perennial
## Apha_arve 0.1590 No Self Annual
## Arre_elat 2.9000 Yes Wind Perennial
## Brac_bell 0.0524 Yes Hymenoptera (bees) Perennial
head(tussock$trait[, 3:7])
## LDMC leafN leafP leafS SLA
## Achi_mill 163.1264 3.4 0.310 0.210 15.637656
## Agro_capi 302.7657 2.5 0.250 0.180 21.846627
## Anth_odor 280.2013 2.0 0.240 0.150 24.547661
## Apha_arve 185.9606 1.7 0.290 0.180 18.587383
## Arre_elat 311.9551 2.6 0.256 0.174 20.777386
## Brac_bell 223.5052 1.2 0.330 0.220 6.549107
cor(tussock$trait[, 3:7], use = "complete")
## LDMC leafN leafP leafS SLA
## LDMC 1.0000000 -0.6848213 -0.6682552 -0.5366135 -0.6417381
## leafN -0.6848213 1.0000000 0.6052319 0.7546005 0.6384765
## leafP -0.6682552 0.6052319 1.0000000 0.7703387 0.6212733
## leafS -0.5366135 0.7546005 0.7703387 1.0000000 0.5966755
## SLA -0.6417381 0.6384765 0.6212733 0.5966755 1.0000000
In the dataset there are 5 leaf traits, likely measured on the same leaves, and quite correlated between them. In this case we do suggest to create groups of traits, using the argument groups
. Why it is so? the problem is that if many trait provide a similar information, actually all these leaf traits are very much correlated. So the information could become too prominent in the multi-trait dissimilarity. Let’s see all this, with a step-by-step approach. Basically the same test can be found in the de Bello et al. (2021, Methods in Ecology and Evolution, doi: https://doi.org/10.1111/2041-210X.13537 ). First we slightly reduce the number of traits, just for simplicity, and because the dataset included some very unbalanced categorical traits (with too few entries for a number of categories). Then we need to log-transform some of the quantitative traits, and previously we checked that height, seedmass and leafS were the one needing log-transformation. NOTE that, by the way, that if we do not use log-transformation is applied trait with abnormal trait distribution will have bigger contribution, simply because they have greater variance (see Pavoine et al. 2009 Oikos).
<-tussock$trait[, c("height", "LDMC", "leafN","leafS", "leafP", "SLA", "seedmass", "raunkiaer", "pollination", "clonality", "growthform")]
tussock.trait<-tussock.trait#some traits needed log-tranformation, just creating a matrix to store the new data
tussock.trait.log$height<-log(tussock.trait$height)
tussock.trait.log$seedmass<-log(tussock.trait$seedmass)
tussock.trait.log$leafS<-log(tussock.trait$leafS)
tussock.trait.logcolnames(tussock.trait.log)
## [1] "height" "LDMC" "leafN" "leafS" "leafP"
## [6] "SLA" "seedmass" "raunkiaer" "pollination" "clonality"
## [11] "growthform"
Let’s first compute the simple Gower distance with gowdis
. We we will also compute the dissimilarity only on leaf traits and correlate it to the multi-trait dissimilarity:
#straightgowdis<-gowdis(tussock.trait.log)
.2<-gawdis(tussock.trait.log, w.type = "equal", silent = T)#we compute 'normal' gower with the new function because it provides more results
straightgowdis#plot(straightgowdis, straightgowdis.2)# if you want to check that the results are the same
<-attr(straightgowdis.2,"correls")
cors.gow12]<-mantel(straightgowdis.2, gowdis(tussock.trait.log[, 2:6]), na.rm=T)$statistic
cors.gow[names(cors.gow)[12]<-"leaves"
cors.gow
## height LDMC leafN leafS leafP SLA
## 0.2023004 0.5083692 0.4522899 0.4944064 0.4292520 0.3729605
## seedmass raunkiaer pollination clonality growthform leaves
## 0.2604944 0.5941643 0.4528619 0.3237053 0.4668307 0.5889437
We can see that the heightest
contribution (across the cors.gow
values) were obtained for a categorical traits, raunkier life form and the combination of leaf traits (‘leaves’). This is simply because with w.type = “equal” the multi-trait dissimilarity is an average of the dissimilarity for single traits, so leaf traits are represented 5/11 times in this average, a disproportional effect with respect to other traits, right? We could say that, although there are differences between the leaf traits, they are the same TYPE of trait, counted 5 times, when combining the traits.
The problem is not solved by using PCoA analyses, as suggested by the current literature (with the idea to reduce the number of traits and synthesize correlated traits into some multivariate axis). Here is a clear example:
<-dbFD(cailliez(straightgowdis.2), tussock$abun)###checking how many PCoA axes are retained testpcoa
## FRic: Dimensionality reduction was required. The last 40 PCoA axes (out of 51 in total) were removed.
## FRic: Quality of the reduced-space representation = 0.6098694
## CWM: When 'x' is a distance matrix, CWM cannot be calculated.
<-dudi.pco(cailliez(straightgowdis.2), scannf = FALSE, nf = 11)
pcoaaxes<-dist(pcoaaxes$li)
gowdis.PCoAsum(pcoaaxes$eig[1:11])/sum(pcoaaxes$eig)#how much variability the axes explain
## [1] 0.6098694
#contribution of traits on the combined dissimilarity, done by hand#
<-vector()
cors.pcoafor(i in 1:dim(tussock.trait.log)[2]){
<-mantel(gowdis.PCoA, gowdis(as.data.frame(tussock.trait.log[, i])), na.rm=T)$statistic
cors.pcoa[i]
}12]<-mantel(gowdis.PCoA, gowdis(tussock.trait.log[, 2:6]), na.rm=T)$statistic
cors.pcoa[names(cors.pcoa)<-c(colnames(tussock.trait.log), "leaves")#contributions of traits to the overall multi-trait dissimilarity
cors.pcoa
## height LDMC leafN leafS leafP SLA
## 0.2167067 0.4735296 0.4298086 0.4453686 0.3826323 0.3335101
## seedmass raunkiaer pollination clonality growthform leaves
## 0.2993358 0.5685074 0.4333025 0.3334111 0.4647386 0.5419322
Let’s just, for simplicity, focus only on the final result of this part. If we look at the object cors.pcoa
we can see very similar contribution of traits, as obtained with the previous approach cors.gow
. So, in very simplified terms, the PCoA approach does not help very much to solve the case of many redundant/correlated traits, as the correlated traits still have, altogether, a superior contribution to the overall multi-trait dissimilarity.
The function gawdis
could help, but only if we define groups. If we do not, let’s see what happen:
<-gawdis(tussock.trait.log, w.type = "optimized", opti.maxiter = 200)#there are NAs so the iteration approach is the only possible gaw.nogroups
## [1] "Running w.type=optimized"
<-attr(gaw.nogroups,"correls")
cors.gaw12]<-mantel(gaw.nogroups, gowdis(as.data.frame(tussock.trait.log[, 2:6])), na.rm=T)$statistic
cors.gaw[names(cors.gaw)[12]<-"leaves"
cors.gaw
## height LDMC leafN leafS leafP SLA
## 0.4140883 0.4311780 0.4118054 0.4268804 0.3961230 0.4117168
## seedmass raunkiaer pollination clonality growthform leaves
## 0.4045750 0.4238674 0.4033490 0.4191094 0.4076496 0.5558241
Now, if we look at the cors.gaw
, we can see the function, apparently, successfully accomplished its mission to obtain a quasi-equal contribution of each single trait. But, the solution provided this is not a good solution neither, because leaves, altogether, have a much bigger contribution that other traits (0.54 while other traits are around ~0.41). Again, the function considered each leaf trait separately, so that altogether leaf traits will have a greater contribution.
A better solution can be obtained by saying that all leaf traits belong to a given group. Thus gawdis
will first compute a dissimilarity for all leaf traits together and then try to get for this leaf-combined dissimilarity the same weight as for other traits. Let’s see it:
colnames(tussock.trait.log)
## [1] "height" "LDMC" "leafN" "leafS" "leafP"
## [6] "SLA" "seedmass" "raunkiaer" "pollination" "clonality"
## [11] "growthform"
<-gawdis(tussock.trait.log, w.type = "optimized", opti.maxiter = 200, groups.weight=T, groups = c(1,2, 2, 2, 2, 2, 3, 4, 5, 6, 7))#there are NAs so the iteration approach is the only possible gaw.groups
## [1] "Running w.type=optimized on groups=c(1,2,2,2,2,2,3,4,5,6,7)"
## [1] "Traits inside the group were weighted - optimized."
<-attr(gaw.groups,"correls")
cors.gaw.gr12]<-attr(gaw.groups,"group.correls")[2]
cors.gaw.gr[names(cors.gaw.gr)[12]<-"leaves"
cors.gaw.gr
## height LDMC leafN leafS leafP SLA
## 0.4344114 0.3917150 0.3591459 0.3515101 0.2753182 0.2631728
## seedmass raunkiaer pollination clonality growthform leaves
## 0.4269570 0.4408426 0.4528826 0.4496304 0.4426791 0.4424059
Now if we look at the cors.gaw.gr
we can see single leaf traits have lower contribution than other traits, but in combination (leaves), they have a comparable contribution! Of course it will be difficult to decide when and in which case group of traits should be defined. But we think that if traits are measured in the same organ and provide, to a good extent, some overlapping information, then they should be considered as a group.
gawdis
: fuzzing coding and dummy variablesThe function gawdis
can be also useful in the case of traits coded as dummy variables or, as a specific case of this, as fuzzy variables. Let’s first create this new trait matrix, tall
.
<-c(10, 20, 30, 40, 50, NA, 70)
bodysize<-c(1, 1, 0, 1, 0,1, 0)
carnivory<-c(1, 0, 0.5, 0, 0.2, 0, 1)
red<-c(0, 1, 0, 0, 0.3, 1, 0)
yellow<-c(0, 0, 0.5,1, 0.5, 0, 0)
blue<-cbind(red, yellow, blue)
colors.fuzzynames(bodysize)<-paste("sp", 1:7, sep="")
names(carnivory)<-paste("sp", 1:7, sep="")
rownames(colors.fuzzy)<-paste("sp", 1:7, sep="")
<-as.data.frame(cbind(bodysize, carnivory, colors.fuzzy))
tall tall
## bodysize carnivory red yellow blue
## sp1 10 1 1.0 0.0 0.0
## sp2 20 1 0.0 1.0 0.0
## sp3 30 0 0.5 0.0 0.5
## sp4 40 1 0.0 0.0 1.0
## sp5 50 0 0.2 0.3 0.5
## sp6 NA 1 0.0 1.0 0.0
## sp7 70 0 1.0 0.0 0.0
The object tall
includes 3 traits for 7 species, bodysize
(quantitative), carnivory
(binary/categorical) and color. The last one is represented by 3 columns, one for each basic color (yellow, red, blue). For example the first species (sp1), is red, so it gets 1 in the the ‘red’ column and 0 in the others. This type of variables, defined by multiple columns is called dummy variable, and in this specific case fuzzy coding, because the value in each column can be different from 0 and 1, with the sum over the 3 columns being equal to 1. For more information see (https://www.univie.ac.at/arctictraits/traitspublic_fuzzyCoding.php, https://stattrek.com/multiple-regression/dummy-variables.aspx).
We surely cannot apply gowdis()
to this type of matrices, because the function will “think” there is a total of 5 traits, since the matrix contains 5 columns, and will treat each of the 3 columns for colors as independent traits, resulting in a higher contribution of this single trait. Moreover the function gets a bit ‘crazy’ with this type of variables. Let’s see why. We can use, to start with, the function only on the colors traits:
round(gowdis(tall[, 3:5]), 3)
## sp1 sp2 sp3 sp4 sp5 sp6
## sp2 0.667
## sp3 0.333 0.667
## sp4 0.667 0.667 0.333
## sp5 0.533 0.467 0.200 0.333
## sp6 0.667 0.000 0.667 0.667 0.467
## sp7 0.000 0.667 0.333 0.667 0.533 0.667
We can appreciate that these results are not correct. Species 1 (sp1) was red, Species 2 was yellow. If for simplicity we think that each column means a completely different color, then we do expect the dissimilarity between these two species should be equal to 1. This was not the case. Similarly sp3 is half red and so the dissimilarity with sp1 should equal to 0.5. But this is not the case. What can we do to solve this? Do simply the following, i.e. divide the dissimilarity by the maximum dissimilarity value:
round(gowdis(tall[, 3:5])/max(gowdis(tall[, 3:5])), 3)
## sp1 sp2 sp3 sp4 sp5 sp6
## sp2 1.0
## sp3 0.5 1.0
## sp4 1.0 1.0 0.5
## sp5 0.8 0.7 0.3 0.5
## sp6 1.0 0.0 1.0 1.0 0.7
## sp7 0.0 1.0 0.5 1.0 0.8 1.0
So if we want to combine this trait (color, defined by 3 columns in the tall
), we need first to compute the dissimilarity for color this way, and then combine it with the other traits, for example with an average. For example, if we want a simple average of the dissimilarity for the 3 traits:
<-gowdis(tall[, "bodysize", drop=F])
dissim.bodysize<-gowdis(tall[, "carnivory", drop=F])
dissim.carnivory<-gowdis(tall[, 3:5])/max(gowdis(tall[, 3:5]))
dissim.colour<-list(as.matrix(dissim.bodysize), as.matrix(dissim.carnivory), as.matrix(dissim.colour))
dall<-as.dist(apply(simplify2array(dall), c(1, 2), mean, na.rm=T), 2)
mean.dissim.allround(mean.dissim.all, 3)
## sp1 sp2 sp3 sp4 sp5 sp6 sp7
## sp1 0.000
## sp2 0.389 0.000
## sp3 0.611 0.722 0.000
## sp4 0.500 0.444 0.556 0.000
## sp5 0.822 0.733 0.211 0.556 0.000
## sp6 0.500 0.000 1.000 0.500 0.850 0.000
## sp7 0.667 0.944 0.389 0.833 0.378 1.000 0.000
This is all a bit time consuming to do it line by line. There is other function in the package ‘ade4’, i.e. the function dist.ktab()
(together with the associated functions prep.fuzzy()
and ktab.list.df()
). However this solution is quite time consuming as well, as it requires a number steps and coding lines (you can surely try, just in case). Ideally we can thus also solve this problem with the function gawdis()
, possibly in a simple way. The solution is obtained by defining all columns belonging to a given trait (like colors above) to a certain group, as we saw above and then setting the argument fuzzy=c(2)
(fuzzy
argument allows to specify which groups should be fuzzy coded, here these are only the last three columns in group 2). Let’s see this now.
gawdis(tall, w.type="equal", groups =c(1, 1, 2, 2, 2), fuzzy=c(2))
## [1] "Running w.type=equal on groups=c(1,1,2,2,2)"
## [1] "Running w.type=equal on groups=c(1,1)"
## [1] "Running w.type=equal on groups=c(2,2,2)"
## sp1 sp2 sp3 sp4 sp5 sp6
## sp2 0.5416667
## sp3 0.5833333 0.7916667
## sp4 0.6250000 0.5833333 0.5416667
## sp5 0.8166667 0.7250000 0.2333333 0.5416667
## sp6 0.5000000 0.0000000 1.0000000 0.5000000 0.8500000
## sp7 0.5000000 0.9583333 0.4166667 0.8750000 0.4833333 1.0000000