In general, dann will struggle as unrelated variables are intermingled with informative variables. To deal with this, sub_dann projects the data onto a subspace and then calls dann, which mitigates the influence of noise variables. See section 3 of Discriminant Adaptive Nearest Neighbor Classification for details. Section 4 compares dann and sub_dann to a number of other approaches.
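To make the projection idea concrete, here is a minimal, self-contained sketch of a project-then-classify pipeline using a global between-class scatter matrix. This is only an illustration with made-up data; sub_dann's actual subspace is built from local between-class information as described in section 3 of the paper.

# Toy illustration: find the direction along which the class means differ,
# then project the data onto it before running a nearest neighbor method.
set.seed(42)
y <- rep(1:2, each = 50)
x <- cbind(rnorm(100, mean = y), rnorm(100))  # only column 1 carries signal
m <- rowsum(x, y) / 50                        # per-class means
B <- crossprod(sweep(m, 2, colMeans(x)))      # global between-class scatter
v <- eigen(B)$vectors[, 1]                    # leading direction
projected <- x %*% v                          # 1-D subspace; classify on this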
In the example below there are 2 informative variables and 5 that are unrelated. Let's see how dann, sub_dann, and dann with only the correct features perform. First, let's make a data set to work with.
library(dann)
library(mlbench)
library(magrittr)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
######################
# Circle data with unrelated variables
######################
set.seed(1)
train <- mlbench.circle(500, 2) %>%
  tibble::as_tibble()
colnames(train)[1:3] <- c("X1", "X2", "Y")
train <- train %>%
  mutate(Y = as.numeric(Y))
# Add 5 unrelated variables
train <- train %>%
  mutate(
    U1 = runif(500, -1, 1),
    U2 = runif(500, -1, 1),
    U3 = runif(500, -1, 1),
    U4 = runif(500, -1, 1),
    U5 = runif(500, -1, 1)
  )
test <- mlbench.circle(500, 2) %>%
  tibble::as_tibble()
colnames(test)[1:3] <- c("X1", "X2", "Y")
test <- test %>%
  mutate(Y = as.numeric(Y))
# Add 5 unrelated variables
test <- test %>%
  mutate(
    U1 = runif(500, -1, 1),
    U2 = runif(500, -1, 1),
    U3 = runif(500, -1, 1),
    U4 = runif(500, -1, 1),
    U5 = runif(500, -1, 1)
  )
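Before modeling, a quick look at the two informative columns confirms the circle structure (the U columns are pure noise):

# Plot the informative dimensions; the classes form a circle in X1/X2.
ggplot(train, aes(x = X1, y = X2, colour = factor(Y))) +
  geom_point() +
  labs(colour = "Y")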
As expected, dann is not performant.
dannPreds <- dann_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE
)
mean(dannPreds == test$Y)
## [1] 0.668
Moving on to sub_dann, the dimension of the subspace should be chosen based on the number of large eigenvalues; graph_eigenvalues_df plots them. Here the graph suggests 2 (the correct answer).
graph_eigenvalues_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train,
  neighborhood_size = 50, weighted = FALSE, sphere = "mcd"
)
Even though the unrelated variables are still included, sub_dann does much better than dann.
subDannPreds <- sub_dann_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1,
  probability = FALSE,
  weighted = FALSE, sphere = "mcd", numDim = 2
)
mean(subDannPreds == test$Y)
## [1] 0.882
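As an aside, if class probabilities are needed instead of hard predictions, the same call can be made with probability = TRUE (per the argument shown above; the returned object then holds per-class probabilities rather than labels):

# Same model, but request class probabilities instead of predicted labels.
subDannProbs <- sub_dann_df(
  formula = Y ~ X1 + X2 + U1 + U2 + U3 + U4 + U5,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1,
  probability = TRUE,
  weighted = FALSE, sphere = "mcd", numDim = 2
)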
As an upper bound on performance for this approach, let's try dann using only the informative variables. Is there much of a difference?
variableSelectionDann <- dann_df(
  formula = Y ~ X1 + X2,
  train = train, test = test,
  k = 3, neighborhood_size = 50, epsilon = 1, probability = FALSE
)
mean(variableSelectionDann == test$Y)
## [1] 0.944
Using only the informative variables produced the best model. In practice, though, the informative variables are usually unknown ahead of time. sub_dann was able to produce a model that was nearly as performant without that knowledge.
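To see all three results side by side, the accuracies computed above can be collected into a single tibble:

# Gather the three test-set accuracies for comparison.
tibble::tibble(
  model = c("dann, all variables",
            "sub_dann, all variables",
            "dann, informative variables only"),
  accuracy = c(mean(dannPreds == test$Y),
               mean(subDannPreds == test$Y),
               mean(variableSelectionDann == test$Y))
)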