The smacof package is the most flexible package for multidimensional scaling (MDS) computation in R. This vignette gives a very basic introduction to MDS using smacof. More technical details and advanced examples can be found in the main package vignette (vignette("smacof")
). An accessible introductory MDS book written for applied researchers is Borg, Groenen, and Mair (2018).
MDS is a method to represent associations among objects (variables, individuals, etc.) and requires a proximity matrix as input. We distinguish between two types of input data: directly observed proximities and derived proximities. In the first scenario the proximities are directly observed/collected in an experiment (e.g., participants rate pairwise similarities between stimuli). A classical example in the MDS literature is the Ekman dataset that comes with the package:
library(smacof)
#> Loading required package: plotrix
#> Loading required package: colorspace
#> Loading required package: e1071
#>
#> Attaching package: 'smacof'
#> The following object is masked from 'package:base':
#>
#> transform
ekman#> 434 445 465 472 490 504 537 555 584 600 610 628 651
#> 445 0.86
#> 465 0.42 0.50
#> 472 0.42 0.44 0.81
#> 490 0.18 0.22 0.47 0.54
#> 504 0.06 0.09 0.17 0.25 0.61
#> 537 0.07 0.07 0.10 0.10 0.31 0.62
#> 555 0.04 0.07 0.08 0.09 0.26 0.45 0.73
#> 584 0.02 0.02 0.02 0.02 0.07 0.14 0.22 0.33
#> 600 0.07 0.04 0.01 0.01 0.02 0.08 0.14 0.19 0.58
#> 610 0.09 0.07 0.02 0.00 0.02 0.02 0.05 0.04 0.37 0.74
#> 628 0.12 0.11 0.01 0.01 0.01 0.02 0.02 0.03 0.27 0.50 0.76
#> 651 0.13 0.13 0.05 0.02 0.02 0.02 0.02 0.02 0.20 0.41 0.62 0.85
#> 674 0.16 0.14 0.03 0.04 0.00 0.01 0.00 0.02 0.23 0.28 0.55 0.68 0.76
The elements of this matrix are similarities (confusion proportions) of colors with wavelengths from 434 nm to 674 nm. Important: smacof requires dissimilarities as inputs! Therefore, we need to convert these similarities into dissimilarities. The sim2diss()
function supports the user with this conversion. Here we can simply subtract the proportions from 1 (method = "confusion"
).
<- sim2diss(ekman, method = "confusion", to.dist = TRUE) ekmanD
In modern MDS applications, the more common input data scenario are derived proximities. The starting point is a standard data frame structure. As an example we use a dataset on PTSD symptoms from the MPsychoR
package.
library(MPsychoR)
data(Wenchuan)
head(Wenchuan)
#> intrusion dreams flash upset physior avoidth avoidact amnesia lossint distant
#> 1 2 2 2 2 3 2 3 2 3 2
#> 2 2 2 2 3 3 3 3 2 3 3
#> 3 2 4 4 4 3 3 3 5 4 3
#> 4 2 1 2 2 1 1 2 2 2 1
#> 5 2 2 2 2 2 2 2 2 3 2
#> 6 4 3 2 2 2 2 3 3 2 2
#> numb future sleep anger concen hyper startle
#> 1 2 1 3 4 3 4 2
#> 2 2 2 3 3 2 3 3
#> 3 2 3 4 4 4 3 4
#> 4 1 2 2 1 2 3 3
#> 5 2 2 3 2 3 2 3
#> 6 2 3 2 3 2 3 2
In total we have 17
PTSD symptoms collected on 362
individuals. We need to convert this matrix into dissimilarities. If we want to scale the symptoms, a popular strategy is to compute the correlation matrix (similarities) and subsequently convert it into a dissimilarity matrix.
<- cor(Wenchuan, use = "pairwise.complete.obs")
Rmat <- sim2diss(Rmat, method = "corr", to.dist = TRUE)
Delta round(Delta, 2)
#> intrusion dreams flash upset physior avoidth avoidact amnesia lossint
#> dreams 0.53
#> flash 0.59 0.56
#> upset 0.65 0.65 0.67
#> physior 0.64 0.65 0.64 0.54
#> avoidth 0.75 0.73 0.71 0.67 0.71
#> avoidact 0.76 0.74 0.73 0.68 0.70 0.45
#> amnesia 0.77 0.76 0.73 0.74 0.73 0.73 0.72
#> lossint 0.77 0.78 0.77 0.77 0.75 0.70 0.69 0.74
#> distant 0.79 0.79 0.80 0.79 0.74 0.78 0.73 0.76 0.64
#> numb 0.81 0.84 0.80 0.83 0.77 0.82 0.81 0.75 0.73
#> future 0.72 0.79 0.75 0.73 0.74 0.81 0.79 0.73 0.77
#> sleep 0.69 0.69 0.71 0.70 0.72 0.70 0.73 0.79 0.73
#> anger 0.74 0.77 0.78 0.74 0.75 0.78 0.79 0.78 0.72
#> concen 0.76 0.74 0.72 0.75 0.74 0.77 0.78 0.75 0.66
#> hyper 0.69 0.68 0.67 0.70 0.66 0.74 0.72 0.75 0.71
#> startle 0.75 0.75 0.70 0.69 0.71 0.75 0.77 0.78 0.74
#> distant numb future sleep anger concen hyper
#> dreams
#> flash
#> upset
#> physior
#> avoidth
#> avoidact
#> amnesia
#> lossint
#> distant
#> numb 0.67
#> future 0.73 0.67
#> sleep 0.78 0.79 0.71
#> anger 0.76 0.74 0.71 0.63
#> concen 0.73 0.72 0.67 0.63 0.57
#> hyper 0.75 0.70 0.65 0.64 0.64 0.58
#> startle 0.80 0.74 0.72 0.63 0.67 0.59 0.53
At this point we are ready to fit an MDS.
What MDS is trying to achieve is to represent these dissimilarities as distances in a low-dimensional space. Most often researchers use two or three dimensions in order to be able plot the resulting configuration. The simplest call to perform a 2D MDS is the following.
<- mds(Delta)
mds_wen1
mds_wen1#>
#> Call:
#> mds(delta = Delta)
#>
#> Model: Symmetric SMACOF
#> Number of objects: 17
#> Stress-1 value: 0.309
#> Number of iterations: 59
The print output shows a measure called “stress” which tells us how well the solution fits the data. It becomes 0 in case of a perfect fit. A value of 0.309
is not great for this simple MDS problem. Note that in the MDS literature there exist stress rules of thumb. We strongly advice to use several diagnostics in combination (see Mair, Borg, and Rusch 2016) in order to assess the goodness-of-fit, instead of relying on these rules of thumb.
This said, we need to improve this solution. We could increase the number of dimensions but it will be difficult to plot the solution. Rather, we can choose a more flexible transformation function. Modern MDS implementations provide the option to transform the input dissimilarities. This is a big advantage over classical scaling implementations as in cmdscale()
. The default used in mds()
is a ratio transformation which essentially says that input dissimilarities are not touched. We could relax this restriction by allowing something more flexible such as an ordinal transformation which results in a step function. Shepard plots give insights into these transformations.
<- mds(Delta, type = "ordinal")
mds_wen2 plot(mds_wen1, plot.type = "Shepard", main = "Shepard Diagram (Ratio Transformation)")
plot(mds_wen2, plot.type = "Shepard", main = "Shepard Diagram (Ordinal Transformation)")
How should we decide on a transformation function? One option is to decide on a data driven basis. Try out different transformation functions ("interval"
is another popular one), look at the Shepard plots, and pick one that leads to an acceptably low stress value while keeping the transformation as simple as possible (an ordinal transformation always gives the lowest stress as it is the most flexible transformation). Another option is to decide on the basis of ad hoc scale levels or substantive considerations. For instance, in this example we could argue that the original data are measured on a Likert scale and we therefore process the dissimilarities using an ordinal transformation. More thorough discussions on such “optimal scaling” procedures can be found in Mair (2018).
Let us look at the output of the ordinal MDS fit:
mds_wen2#>
#> Call:
#> mds(delta = Delta, type = "ordinal")
#>
#> Model: Symmetric SMACOF
#> Number of objects: 17
#> Stress-1 value: 0.145
#> Number of iterations: 22
We see the drastic reduction in the stress value compared to the ratio fit. Assume we are happy with this solution we are now ready to produce the most important output in MDS: the configuration plot.
plot(mds_wen2, main = "PTSD Symptoms Configuration")
A nice feature of MDS is that this configuration plot is super intuitive to interpret as the location of the points matters substantively (as opposed to correlation networks) and the distances between any two points are Euclidean. The closer two symptoms in the configuration, the stronger their association.
The interpretation of the dimensions is not that crucial as, for instance, in PCA. Sometimes researchers find an interpretation, whereas sometimes, as in our example, there is no straightforward interpretation. Note that this configuration can be rotated arbitrarily if this helps for interpretation. Further details on interpreting configuration plots can be found in Borg, Groenen, and Mair (2018), Mair (2018), and the main package vignette.
As a final note, for users more attracted to ggplot
style, here is some basic code:
library(ggplot2)
<- as.data.frame(mds_wen2$conf)
conf_wen <- ggplot(conf_wen, aes(x = D1, y = D2, label = rownames(conf_wen)))
p + geom_point(size = 0.9) + coord_fixed() + geom_text(size = 3.5, vjust = -0.8) +
p ggtitle("PTSD Symptoms Configuration")
It is important to fix the aspect ratio to 1 in order to interpret the distances in a Euclidean way. To avoid overlapping labels, the ggrepel
package provides some nice options.