Please see the paper cited below (Kamulete 2022) for details. We denote the R package as dsos to avoid confusion with D-SOS, the method.
We show how easy it is to implement D-SOS for a particular notion of outlyingness. Suppose we want to test for no adverse shift based on isolation scores in a two-sample comparison. To do so, we need two main ingredients: a scoring function and a method to compute the \(p\)-value.
First, the scores are obtained from the predictions of an isolation forest, fit with the isotree package (Cortes 2020). Isolation forest detects isolated points: instances that are typically out-of-distribution relative to the high-density regions of the data distribution. Naturally, any performant method for density-based out-of-distribution detection can be used to the same end. The function score_od in the dsos package implements one such scoring function.
dsos::score_od
## function (x_train, x_test, n_trees = 500L, threshold = 0.6)
## {
##     if (!requireNamespace("isotree", quietly = TRUE)) {
##         stop("Package \"isotree\" must be installed to use this function.",
##             call. = FALSE)
##     }
##     iso_fit <- isotree::isolation.forest(data = x_train, ntrees = n_trees)
##     os_train <- predict(iso_fit, newdata = x_train)
##     os_test <- predict(iso_fit, newdata = x_test)
##     os_train[os_train < threshold] <- threshold
##     os_test[os_test < threshold] <- threshold
##     return(list(test = os_test, train = os_train))
## }
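To get a feel for these scores before any testing, we can call score_od on its own. The snippet below is a minimal sketch of ours, not from the original text: the random split of iris is purely illustrative, and the isotree package must be installed.

set.seed(123)
idx <- sample(nrow(iris), 75)
x_ref <- iris[idx, 1:4]   # reference sample
x_new <- iris[-idx, 1:4]  # test sample
os <- dsos::score_od(x_ref, x_new)
# Higher scores flag more isolated, likely out-of-distribution, rows
summary(os$train)
summary(os$test)

Note, from the function body above, that scores below threshold are floored at threshold: only the outlying upper tail of the scores carries information.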
Second, we estimate the empirical null distribution for the \(p\)-value via permutations. For speed, this is implemented as a sequential Monte Carlo test with the simctest package (Gandy 2009). The function pt_refit in the dsos package combines scoring with inference. The prefix pt stands for permutation test. The code for pt_refit is relatively straightforward.
dsos also offers sample splitting and out-of-bag variants as alternatives to permutations for \(p\)-value calculation. Both rely on the asymptotic null distribution of the test statistic and can therefore be appreciably faster than inference based on permutations.
dsos::pt_refit
## function (x_train, x_test, scorer, n_pt = 2000)
## {
##     result <- exchangeable_null(x_train, x_test, scorer = scorer,
##         n_pt = n_pt, is_oob = FALSE)
##     return(result)
## }
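For intuition, a plain (fixed-budget) permutation test of this kind can be sketched in a few lines of base R. This is our illustration, not the package's implementation: perm_pvalue is a made-up name, and pt_refit actually delegates the stopping rule to simctest rather than using a fixed number of permutations.

perm_pvalue <- function(x_train, x_test, scorer, statistic, n_pt = 200) {
  observed <- statistic(scorer(x_train, x_test))
  pooled <- rbind(x_train, x_test)
  n_train <- nrow(x_train)
  null_stats <- replicate(n_pt, {
    # Under the null of no shift, train/test labels are exchangeable
    idx <- sample(nrow(pooled))
    os <- scorer(pooled[idx[1:n_train], ], pooled[idx[-(1:n_train)], ])
    statistic(os)
  })
  # Empirical p-value with the usual +1 correction
  (1 + sum(null_stats >= observed)) / (1 + n_pt)
}

Here statistic is any function that maps the list of train and test scores to a scalar; in dsos, that role is played by the weighted AUC (see wauc_from_os below).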
Take the iris dataset, for example. When the training set consists only of setosa (a flower species) and the test set only of versicolor, the data is incompatible with the null of no adverse shift. In other words, we have strong evidence that the test set contains a disproportionate number of outliers, taking the training set as the reference distribution.
set.seed(12345)
data(iris)
x_train <- iris[1:50, 1:4]   # Training sample: Species == 'setosa'
x_test <- iris[51:100, 1:4]  # Test sample: Species == 'versicolor'
iris_test <- dsos::pt_refit(x_train, x_test, scorer = dsos::score_od)
plot(iris_test)
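As a sanity check of our own (not part of the original example), we can also run the test where no shift is expected, drawing both samples from the same species. We would then expect a large \(p\)-value, i.e. data compatible with the null of no adverse shift.

set.seed(12345)
idx <- sample(50)
x_same_train <- iris[idx[1:25], 1:4]  # half of 'setosa'
x_same_test <- iris[idx[26:50], 1:4]  # the other half of 'setosa'
same_test <- dsos::pt_refit(x_same_train, x_same_test, scorer = dsos::score_od)
plot(same_test)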
You can plug your own scores into this framework. Those already implemented in dsos can be useful, but they are by no means the only ones. If you favour a different method for out-of-distribution (outlier) detection, want to tune the hyperparameters, or prefer a different notion of outlyingness altogether, dsos provides the building blocks to roll your own. The workhorse function, powering the approach behind the scenes, computes the test statistic from the outlier scores (see wauc_from_os).
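To illustrate, here is a hypothetical custom scorer based on the Mahalanobis distance to the training sample. The name score_mahalanobis is ours; judging from score_od above, the contract a scorer must honour is simply to accept the two samples and return a list with test and train scores. We reuse x_train and x_test from the iris example.

# Hypothetical scorer: squared Mahalanobis distance from the training centroid
score_mahalanobis <- function(x_train, x_test) {
  center <- colMeans(x_train)
  covmat <- stats::cov(x_train)
  list(
    test = stats::mahalanobis(x_test, center, covmat),
    train = stats::mahalanobis(x_train, center, covmat)
  )
}
iris_md <- dsos::pt_refit(x_train, x_test, scorer = score_mahalanobis)
plot(iris_md)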