
In general, dann struggles when unrelated variables are intermingled with informative ones. To deal with this, sub_dann first projects the data onto a lower-dimensional subspace and then calls dann, which mitigates the influence of noise variables. See section 3 of Discriminant Adaptive Nearest Neighbor Classification for details. Section 4 of the paper compares dann and sub_dann to a number of other approaches.
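
To make the subspace idea concrete, here is a minimal base R sketch. It is not the package's implementation: it skips the sphering and weighting steps, and the helper name `local_between`, the seed, and the simulated data are assumptions made for illustration. It averages local between-class covariance matrices over all training points and projects the data onto the leading eigenvectors of that average.

```r
# Simulated stand-in for the circle data: 2 informative and 5 noise variables.
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
n <- 500
x <- matrix(runif(n * 7, -1, 1), ncol = 7)
colnames(x) <- c("X1", "X2", paste0("U", 1:5))
# Class 1 inside a circle covering half of the square, class 2 outside.
y <- ifelse(x[, "X1"]^2 + x[, "X2"]^2 < 2 / pi, 1, 2)

# Hypothetical helper: between-class covariance among the `size` nearest
# neighbors of point i (Euclidean distance, no sphering).
local_between <- function(i, x, y, size = 50) {
  d <- colSums((t(x) - x[i, ])^2)
  nbrs <- order(d)[seq_len(size)]
  xn <- x[nbrs, , drop = FALSE]
  yn <- y[nbrs]
  center <- colMeans(xn)
  b <- matrix(0, ncol(x), ncol(x))
  for (cl in unique(yn)) {
    idx <- yn == cl
    delta <- colMeans(xn[idx, , drop = FALSE]) - center
    b <- b + mean(idx) * outer(delta, delta)
  }
  b
}

# Average the local matrices, then look at the eigen-decomposition.
b_avg <- Reduce(`+`, lapply(seq_len(n), local_between, x = x, y = y)) / n
eig <- eigen(b_avg, symmetric = TRUE)
round(eig$values, 4)

# Project onto the two leading eigenvectors before running a classifier.
projected <- x %*% eig$vectors[, 1:2]
dim(projected)
```

If the informative variables dominate the local class structure, the first two eigenvalues should stand out from the rest, which is the pattern the eigenvalue graph later in this example is meant to reveal.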


Example: Circle Data With Random Variables

In the example below there are 2 related variables and 5 unrelated ones. Let's see how dann, sub_dann, and dann with only the correct features perform. First, let's make a data set to work with.

 library(dann)
 library(dplyr, warn.conflicts = FALSE)

 # Circle data with unrelated variables
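 # Assumption: the generator call was lost from this line; mlbench.circle()
 # matches the "circle data" description and the 500 rows used below.
 train <- mlbench::mlbench.circle(500, 2) %>%
   as.data.frame()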
 colnames(train)[1:3] <- c("X1", "X2", "Y")
 train <- train %>%
  mutate(Y = as.numeric(Y))

 # Add 5 unrelated variables
 train <- train %>%
   mutate(
     U1 = runif(500, -1, 1),
     U2 = runif(500, -1, 1),
     U3 = runif(500, -1, 1),
     U4 = runif(500, -1, 1),
     U5 = runif(500, -1, 1)
   )
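
 # Test data, built the same way as train
 # (assumption: same lost generator call as above)
 test <- mlbench::mlbench.circle(500, 2) %>%
   as.data.frame()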
 colnames(test)[1:3] <- c("X1", "X2", "Y")
 test <- test %>%
  mutate(Y = as.numeric(Y))

 # Add 5 unrelated variables
 test <- test %>%
   mutate(
     U1 = runif(500, -1, 1),
     U2 = runif(500, -1, 1),
     U3 = runif(500, -1, 1),
     U4 = runif(500, -1, 1),
     U5 = runif(500, -1, 1)
   )

As expected, dann is not performant when the unrelated variables are included.

 dannPreds <- dann_df(
  formula = Y~X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE)
 mean(dannPreds == test$Y)
## [1] 0.668

Moving on to sub_dann, the dimension of the subspace should be chosen based on the number of large eigenvalues. The graph suggests 2 (the correct answer).

 graph_eigenvalues_df(formula = Y~X1 + X2 + U1 + U2 + U3 + U4 + U5, train = train, 
                   neighborhood_size = 50, weighted = FALSE, sphere = "mcd")
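
Reading "the number of large eigenvalues" off a graph is a judgment call. One way to make it explicit is to look for the biggest drop between consecutive eigenvalues. The sketch below is a hypothetical heuristic, not part of dann, and the eigenvalues in it are made-up numbers chosen only to mimic a plot with two dominant values.

```r
# Hypothetical helper: number of eigenvalues before the largest drop.
pick_num_dim <- function(eigenvalues) {
  ev <- sort(eigenvalues, decreasing = TRUE)
  gaps <- -diff(ev)  # drop from each eigenvalue to the next
  which.max(gaps)
}

# Made-up eigenvalues for illustration only (not output from dann).
example_values <- c(0.031, 0.027, 0.006, 0.005, 0.004, 0.004, 0.003)
pick_num_dim(example_values)
## [1] 2
```

A gap rule like this can be fooled when the eigenvalues decay smoothly, so it is best treated as a cross-check on the graph rather than a replacement for it.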

While continuing to use unrelated variables, sub_dann did much better than dann.

 subDannPreds <- sub_dann_df(formula = Y~X1 + X2 + U1 + U2 + U3 + U4 + U5, 
                             train = train, test = test,
                             k = 3, neighborhood_size = 50, epsilon = 1, 
                             probability = FALSE, 
                             weighted = FALSE, sphere = "mcd", numDim = 2)
 mean(subDannPreds == test$Y)
## [1] 0.882

As an upper bound on performance for this approach, let's try dann using only the informative variables. Is there much of a difference?

 variableSelectionDann <- dann_df(formula = Y~X1 + X2, 
                               train = train, test = test,
                               k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE)
 mean(variableSelectionDann == test$Y)
## [1] 0.944

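Using only the related variables produced the best model (0.944 accuracy). In practice, the related variables are often unknown. Even with the unrelated variables included, sub_dann produced a model that was nearly as performant (0.882), well ahead of plain dann (0.668).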
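
To put "nearly as performant" in numbers, the short calculation below uses the three accuracies reported above and computes how much of the gap between dann on all variables and dann on the informative variables sub_dann closes. The variable names are chosen here for illustration.

```r
# Accuracies reported in this example
acc <- c(dann_all = 0.668, sub_dann = 0.882, dann_informative = 0.944)

# Share of the gap between the worst and best model that sub_dann recovers
recovered <- unname(
  (acc["sub_dann"] - acc["dann_all"]) /
    (acc["dann_informative"] - acc["dann_all"])
)
round(recovered, 3)
## [1] 0.775
```

On this data set, then, sub_dann recovers roughly three quarters of the accuracy lost to the noise variables without being told which variables matter.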