We recently added visualizations based on Self-Organizing Maps (SOM) and Uniform Manifold Approximation and Projection (UMAP) for objects in the Mercator package. This addition relies on the core implementations in the kohonen and umap packages, respectively. The main challenge that we faced was that both of those implementations want to receive the raw data as inputs, while Mercator objects only store a distance matrix computed from the raw data. The purpose of this vignette is, first, to describe how we worked around this restriction, and second, to illustrate how to use these methods in the package.
First we load the package.
Next, we load a “fake” set of synthetic continuous data that comes with the Mercator package.
## [1] "C" "Cl4" "Delta" "J" "N" "P"
## [7] "SV" "X" "fakeclin" "fakedata" "filename" "hclass"
## [13] "jacc.Vis" "kk" "mercury" "my.binmat" "my.clust" "my.data"
## [19] "neptune" "pts" "sokal.Vis" "tab" "temp.Vis" "venus"
## [1] 776 300
## [1] 300 4
Let’s create a UMAP visualization directly from the raw data.
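The chunk that produced this plot is not reproduced here. A minimal sketch of such a call follows. It assumes that the umap and Mercator packages are installed and that the `fakedata` data set supplies the 776 x 300 matrix whose dimensions are printed above (features in rows, samples in columns). The object name `udirect` and the plot labels are our own choices for illustration.

```r
# Sketch only: skipped silently when the required packages are not installed.
if (requireNamespace("umap", quietly = TRUE) &&
    requireNamespace("Mercator", quietly = TRUE)) {
  data("fakedata", package = "Mercator")  # assumed to load 'fakedata' and 'fakeclin'
  udirect <- umap::umap(t(fakedata))      # umap expects one sample per row
  # The two-dimensional coordinates are stored in the 'layout' component.
  plot(udirect$layout, xlab = "U1", ylab = "U2", main = "UMAP from raw data")
}
```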
We aren’t going to do anything fancy with labels or colors in this plot; we just want an idea of the main structure in the data. In particular, it looks like there are seven clusters, divided into one group of three and two groups of two each.
Next, we are going to create a Mercator object with a multi-dimensional scaling (MDS) visualization. Inspired by the first plot, we will arbitrarily assign seven groups. (This assignment uses the PAM algorithm internally.)
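The construction step might look like the following sketch. It assumes the constructor accepts a `dist` object together with `metric`, `method`, and `K` arguments; the metric label "euclid" and the view name "mds" are taken from the summary output shown in this vignette, and the plot title is our own.

```r
# Sketch only: skipped silently when Mercator is not installed.
if (requireNamespace("Mercator", quietly = TRUE)) {
  library(Mercator)
  data(fakedata)   # assumed to load 'fakedata' (features x samples)
  # Only the distance matrix is handed to the constructor, not the raw data.
  mercury <- Mercator(dist(t(fakedata)), metric = "euclid", method = "mds", K = 7)
  plot(mercury, view = "mds", main = "MDS")
}
```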
Interestingly, here we see a possibly different number of groups. But that’s a separate issue, and we don’t want to be distracted from our main thread.
We want to confirm that the actual raw data is not contained in mercury, just the distance matrix.
## An object of the 'Mercator' class, using the ' euclid ' metric, of size
## [1] 300 300
## Contains these visualizations: mds
## [1] "metric" "distance" "view" "palette" "symbols" "clusters"
## [1] "euclid"
## NC1 NC2 NC3 NC4 NC5 NC6 NC7 NC8
## "#2E91E5" "#E15F99" "#1CA71C" "#FB0D0D" "#DA16FF" "#222A2A" "#B68100" "#750D86"
## NC9 NC10 NC11 NC12 NC13 NC14 NC15 NC16
## "#EB663B" "#511CFB" "#00A08B" "#FB00D1" "#FC0080" "#B2828D" "#6C7C32" "#778AAE"
## NC17 NC18 NC19 NC20 NC21 NC22 NC23 NC24
## "#862A16" "#A777F1" "#620042" "#1616A7" "#DA60CA" "#6C4516" "#0D2A63" "#AF0038"
## [1] 16 15 17 18 10 7 11 9
##
## 1 2 3 4 5 6 7
## 44 68 38 38 37 39 36
## [1] "list"
## [1] "mds"
## [1] "NULL"
## [1] "dist"
Next, we add a “umap” visualization to this object.
mercury <- addVisualization(mercury, "umap")
plot(mercury, view = "umap", main="UMAP from distance matrix")
Notice that the plot method knows that we want to see the internal “layout” component of the umap object. It also automatically assigns axis names that (subliminally?) remind us that we computed this visualization with UMAP. And it colors the points using the same values from the initial PAM clustering assignments.
As an aside, we point out that the PAM clustering doesn’t really match the implicit UMAP clustering of the data.
We obviously constructed this plot just from the distance matrix, not from the raw data. It nevertheless has a structure essentially the same as the original one. But how?
Well, internally, we use this function:
## function (D)
## {
## M <- as.matrix(D)
## X <- scale(t(scale(t(M^2), scale = FALSE)), scale = FALSE)
## E <- eigen(-X/2, symmetric = TRUE)$values
## R <- sum(E > 1e-10)
## cmdscale(D, k = R)
## }
## <bytecode: 0x000001268a531568>
## <environment: namespace:Mercator>
The first four lines of code inside this function are basically the algorithm used by the cmdscale function to compute the eigenvalues needed to create its dimension reduction for visualization. When counting the nonzero eigenvalues (which defines R), we require each one to exceed 1e-10 rather than simply to be positive. We do this to avoid round-off errors. If we miss by one or two eigenvalues, then the final call to cmdscale will generate a confusing warning. That final line, by the way, creates an embedding into Euclidean space that should preserve almost all of the distances between pairs of points in the original data set.
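To illustrate why the threshold matters, here is a minimal base R sketch on synthetic data (20 points in 3 dimensions; all variable names are ours, not Mercator's). The double-centered matrix has exactly three genuinely nonzero eigenvalues. The rest are zero only up to round-off, so they can land on either side of zero.

```r
set.seed(42)
xyz <- matrix(rnorm(20 * 3), nrow = 20)   # 20 synthetic points in 3 dimensions
D <- dist(xyz)                            # Euclidean distance matrix

# Same steps as in the function printed above
M <- as.matrix(D)
X <- scale(t(scale(t(M^2), scale = FALSE)), scale = FALSE)  # double centering
E <- eigen(-X/2, symmetric = TRUE)$values

R_strict <- sum(E > 0)       # may also count round-off noise
R_safe   <- sum(E > 1e-10)   # counts only clearly positive eigenvalues
c(strict = R_strict, safe = R_safe)
range(E[-(1:3)])             # the remaining eigenvalues are tiny
```

Only the thresholded count is guaranteed to recover the true dimension here; the strict count depends on the sign of the round-off noise.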
To see that, let’s carry out that step explicitly.
## [1] 300 299
Now we compute distances between pairs of points in this new embedding.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.101e-13 -1.776e-14 0.000e+00 -1.496e-15 1.421e-14 1.066e-13
We see that the absolute errors between any pair of points in the two distance computations are less than \(10^{-12}\). And that is the fundamental reason that we expect equivalent structures whether we apply UMAP to the original data or to the re-embedded data arising from the distance matrix.
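The same check can be reproduced without Mercator. The sketch below uses synthetic data with more features than samples (30 samples, 50 features; the names are ours) and repeats the steps of the function printed earlier. As with the 300 x 299 result above, n samples can be embedded in at most n - 1 dimensions.

```r
set.seed(7)
n <- 30; p <- 50
raw <- matrix(rnorm(n * p), nrow = n)   # one sample per row
D <- dist(raw)                          # the only input used from here on

M <- as.matrix(D)
X <- scale(t(scale(t(M^2), scale = FALSE)), scale = FALSE)
R <- sum(eigen(-X/2, symmetric = TRUE)$values > 1e-10)
Y <- cmdscale(D, k = R)                 # re-embedded coordinates

dim(Y)                                  # n rows, at most n - 1 columns
summary(as.vector(dist(Y) - D))         # differences are at round-off level
```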
The same approach works to compute self-organizing maps from a distance matrix. Here is the result using the distance matrix.
mercury <- addVisualization(mercury, "som",
                            grid = kohonen::somgrid(6, 5, "hexagonal"))
plot(mercury, view = "som")
And here is the corresponding result using the original data.
library(kohonen)
mysom <- som(t(fakedata), grid = somgrid(6, 5, "hexagonal"))
plot(mysom, type = "mapping")
This time, we will work a little harder to color the plot in the same way.
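The chunk that colors this plot is not reproduced here. The sketch below shows the general idea on synthetic data, assuming the kohonen and cluster packages are installed; the names `toy`, `grp`, `pal`, and `toysom` are hypothetical. With the objects from this vignette, one would presumably index `mercury@palette` by `mercury@clusters` in the same way.

```r
# Sketch only: skipped silently when the required packages are not installed.
if (requireNamespace("kohonen", quietly = TRUE) &&
    requireNamespace("cluster", quietly = TRUE)) {
  set.seed(2022)
  # Three synthetic groups of 40 samples each, 10 features per sample
  centers <- rep(c(0, 4, 8), each = 40)
  toy <- matrix(rnorm(120 * 10, mean = centers), nrow = 120)
  grp <- cluster::pam(dist(toy), k = 3)$clustering  # PAM on the distance matrix
  pal <- c("#2E91E5", "#E15F99", "#1CA71C")        # first three palette colors printed earlier
  toysom <- kohonen::som(toy, grid = kohonen::somgrid(6, 5, "hexagonal"))
  # One color per sample, chosen by its cluster label
  plot(toysom, type = "mapping", col = pal[grp], pchs = 16)
}
```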
Qualitatively, most of the SOM plots should be similar. The fundamental exception, however, is the “codes” plot. This plot shows the patterns of intensities in each of the underlying dimensions (averaged over the assigned objects in each node). Because we have performed a multi-dimensional scaling analysis, we have changed the underlying coordinate system. Here are the corresponding plots.
## Warning in par(opar): argument 1 does not name a graphical parameter
## Warning in par(opar): argument 1 does not name a graphical parameter
As we have seen, when we start with a Euclidean distance matrix, we can get qualitatively equivalent results from the original data or from creating a “realization” of the distance matrix in a high-dimensional Euclidean space. The major advantage, however, arises when we start with a completely different distance metric. In that case, we cannot apply SOM or UMAP directly, since both implementations expect raw coordinates as input. But we can still embed that distance matrix into a Euclidean space and then run SOM or UMAP.
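As a rough illustration of that last case, the base R sketch below starts from a binary (Jaccard-type) distance on synthetic 0/1 data and applies the same embedding steps. The names are ours. Note one caveat: when the distance is not Euclidean, some eigenvalues can be negative, and the embedding then only approximates the original distances.

```r
set.seed(99)
bin <- matrix(rbinom(25 * 40, size = 1, prob = 0.5), nrow = 25)  # 25 binary samples
D <- dist(bin, method = "binary")   # Jaccard-type distance, not Euclidean

M <- as.matrix(D)
X <- scale(t(scale(t(M^2), scale = FALSE)), scale = FALSE)
E <- eigen(-X/2, symmetric = TRUE)$values
R <- sum(E > 1e-10)                 # keep only clearly positive eigenvalues
Y <- cmdscale(D, k = R)             # Euclidean coordinates, one row per sample

c(samples = nrow(Y), dims = ncol(Y), negative = sum(E < -1e-10))
# Y can now be passed to any method that needs coordinates,
# for example umap::umap(Y) or kohonen::som(Y, ...).
```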
This analysis was performed in the following environment:
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kohonen_3.0.11 umap_0.2.8.0 Mercator_1.1.2
## [4] Thresher_1.1.3 PCDimension_1.1.11 ClassDiscovery_3.4.3
## [7] oompaBase_3.2.9 cluster_2.1.3
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.2 sass_0.4.1 jsonlite_1.8.0
## [4] viridisLite_0.4.0 bslib_0.3.1 askpass_1.1
## [7] highr_0.9 stats4_4.2.1 yaml_2.3.5
## [10] slam_0.1-50 pillar_1.7.0 lattice_0.20-45
## [13] glue_1.6.2 reticulate_1.25 digest_0.6.29
## [16] skmeans_0.2-14 colorspace_2.0-3 oompaData_3.1.2
## [19] htmltools_0.5.2 Matrix_1.4-1 pkgconfig_2.0.3
## [22] purrr_0.3.4 cpm_2.3 scales_1.2.0
## [25] RSpectra_0.16-1 Rtsne_0.16 tibble_3.1.7
## [28] openssl_2.0.2 generics_0.1.2 ggplot2_3.3.6
## [31] movMF_0.2-7 ellipsis_0.3.2 nnet_7.3-17
## [34] cli_3.3.0 magrittr_2.0.3 crayon_1.5.1
## [37] mclust_5.4.10 evaluate_0.15 fansi_1.0.3
## [40] MASS_7.3-57 changepoint_2.2.3 tools_4.2.1
## [43] lifecycle_1.0.1 stringr_1.4.0 kernlab_0.9-31
## [46] munsell_0.5.0 ade4_1.7-19 compiler_4.2.1
## [49] jquerylib_0.1.4 rlang_1.0.3 grid_4.2.1
## [52] igraph_1.3.2 rmarkdown_2.14 gtable_0.3.0
## [55] flexmix_2.3-18 R6_2.5.1 gridExtra_2.3
## [58] zoo_1.8-10 knitr_1.39 dplyr_1.0.9
## [61] fastmap_1.1.0 utf8_1.2.2 clue_0.3-61
## [64] KernSmooth_2.23-20 dendextend_1.15.2 modeltools_0.2-23
## [67] stringi_1.7.6 Polychrome_1.5.1 Rcpp_1.0.8.3
## [70] vctrs_0.4.1 png_0.1-7 scatterplot3d_0.3-41
## [73] tidyselect_1.1.2 xfun_0.31