---
title: "Optimizing the calculation of Hopkins statistic"
author: "Kevin Wright"
date: "2021-04-19"
bibliography: clustertend.bib
output:
  rmarkdown::html_vignette:
    md_extensions: [
      "-autolink_bare_uris"
    ]
vignette: >
  %\VignetteIndexEntry{Optimizing the calculation of Hopkins statistic}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
The `clustertend::hopkins()` function used two nested `for()` loops:
```r
for (i in 1:n) {
  distp[1] <- dist(rbind(p[i, ], data[1, ]))
  minqi <- dist(rbind(q[i, ], data[1, ]))
  for (j in 2:nrow(data)) {
    distp[j] <- dist(rbind(p[i, ], data[j, ]))
    error <- q[i, ] - data[j, ]
    if (sum(abs(error)) != 0) {
      distq <- dist(rbind(q[i, ], data[j, ]))
      if (distq < minqi) minqi <- distq
    }
  }
}
```
Whenever nested `for()` loops are used, you should immediately consider whether it is possible to vectorize one or both loops. In this case, we can define a function that calculates the Euclidean distance from a vector to the nearest row of a matrix, and then use this function to eliminate the inner loop:
```r
DistVecToMat <- function(x, Y) {
  min(apply(Y, 1, function(yi) sqrt(sum((x - yi)^2))))
}
# DistVecToMat(c(1,2), matrix(c(4,5,6,7), nrow=2, byrow=TRUE))
# row distances are sqrt(c(18,50)) = 4.242641 7.071068, so the minimum is 4.242641
```
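As a quick sanity check on `DistVecToMat()`, here is a tiny self-contained example on made-up data (the function definition is repeated so the chunk runs on its own), verified by hand against the row distances `sqrt(c(18, 50))`:

```r
# Euclidean distance from vector x to the nearest row of matrix Y
DistVecToMat <- function(x, Y) {
  min(apply(Y, 1, function(yi) sqrt(sum((x - yi)^2))))
}

# Toy data: two rows, (4,5) and (6,7)
Y <- matrix(c(4, 5, 6, 7), nrow = 2, byrow = TRUE)
DistVecToMat(c(1, 2), Y)  # sqrt(18) = 4.242641, the nearer of the two rows
```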
```r
# For each ith row of the sample, calculate the distance of:
#   U[i,] to X
#   W[i,] to X[-k[i], ], i.e. omitting W[i,] from X
dux <- rep(NA, n)
dwx <- rep(NA, n)
for (i in 1:n) {
  dux[i] <- DistVecToMat(U[i,], X)
  dwx[i] <- DistVecToMat(W[i,], X[-k[i],])
}
```
When thinking about manipulating two vectors or two matrices, you should always keep in mind that there are R functions like `crossprod()`, `outer()`, and `apply()` that might come in handy. I played around with these functions but was having trouble getting the results I wanted. I then used Google to search for ideas and discovered the `pdist` package, which can efficiently compute the distance between all pairs of rows of two matrices. This is exactly what we need.
```r
library(pdist)

# pdist(W,X) computes distance between rows of W and rows of X
tmp <- as.matrix(pdist(W, X))
dwx <- apply(tmp, 1, min)

# pdist(U,X) computes distance between rows of U and rows of X
tmp <- as.matrix(pdist(U, X))
dux <- apply(tmp, 1, min)
```
Finally, there are two different ways to change some elements of the distance matrix to `Inf`:
```r
library(microbenchmark)

# Method 1. Loop over vector 1:n
# for(i in 1:n) dwx[i, k[i]] <- Inf
microbenchmark(hopkins(X12, 500))
```
```
## Unit: milliseconds
##              expr     min      lq     mean  median       uq      max neval
## hopkins(X12, 500) 38.9493 42.8045 50.73876 45.0668 47.69945 120.9366   100
```
```r
# Method 2. Build a matrix of indexes to the cells that need changing
# dwx[ matrix( c(1:n, k), nrow=n) ] <- Inf
microbenchmark(hopkins(X12, 500))
```
```
## Unit: milliseconds
##              expr     min       lq     mean   median      uq      max neval
## hopkins(X12, 500) 39.2668 42.41565 50.21628 43.74215 46.9462 126.3522   100
```
The median times across 100 calls to the function are virtually identical for this test case. Results could be different for smaller or larger datasets. To our (purely subjective) taste, the loop method is a bit easier to understand.
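To see that the two masking methods really do produce the same result, here is a toy check on a small made-up matrix:

```r
n <- 3
k <- c(2, 1, 3)                # column to mask in each row
dwx1 <- matrix(1:9, nrow = n)
dwx2 <- dwx1

# Method 1: loop over rows
for (i in 1:n) dwx1[i, k[i]] <- Inf

# Method 2: one assignment via a matrix of (row, column) index pairs
dwx2[matrix(c(1:n, k), nrow = n)] <- Inf

identical(dwx1, dwx2)  # TRUE
```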
How good is the optimization? In one simulated-data example with 1000 rows and 2 columns, sampling 500 rows, the non-optimized function used about 17 seconds, while the optimized function used approximately 0.05 seconds.