splitTools is a fast, lightweight toolkit for data splitting. Its two main functions partition and create_folds support
data partitioning (e.g. into training, validation and test),
creating folds for cross-validation,
creating repeated folds for cross-validation,
stratified splitting (e.g. for stratified cross-validation),
grouped splitting (e.g. for group-k-fold cross-validation) as well as
blocked splitting (if the sequential order of the data should be retained).
The function create_timefolds does time-series splitting where the out-of-sample data follows the (growing) in-sample data.
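Both main functions switch between these modes via a type argument (the blocked variant of partition appears in the time-series example further below). As a minimal sketch, and assuming create_folds accepts the same type values ("grouped", "blocked") as partition, grouped and blocked folds on an illustrative grouping vector g could look like this:
library(splitTools)

# Illustrative grouping vector: 20 rows belonging to 5 groups
g <- rep(LETTERS[1:5], each = 4)

# Grouped folds: all rows of a group end up in the same fold
str(create_folds(g, k = 5, type = "grouped"))

# Blocked folds: contiguous chunks that keep the sequential order of the rows
str(create_folds(seq_along(g), k = 5, type = "blocked"))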
We will now illustrate how to use splitTools in a typical modelling workflow. We will go through the following three steps:
Split the data set iris into 60% training data, 20% validation data and 20% test data, stratified by the variable Sepal.Length. Since it is numeric, stratification is done internally by quantile binning.
We then tune the parameter mtry of a random forest to predict Sepal.Length as well as possible from the other variables. We do this on the validation data.
After selecting the best mtry, we challenge the final model on the remaining test data set.
library(splitTools)
library(ranger)
# Split data into partitions
set.seed(3451)
inds <- partition(iris$Sepal.Length, p = c(train = 0.6, valid = 0.2, test = 0.2))
str(inds)
#> List of 3
#> $ train: int [1:81] 2 3 6 7 8 10 11 18 19 20 ...
#> $ valid: int [1:34] 1 12 14 15 27 34 36 38 42 48 ...
#> $ test : int [1:35] 4 5 9 13 16 17 25 39 41 45 ...
train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test <- iris[inds$test, ]
# Root-mean-squared error function used to evaluate results
rmse <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}
# Tune mtry on validation data
valid_mtry <- numeric(ncol(train) - 1)
for (i in seq_along(valid_mtry)) {
  fit <- ranger(Sepal.Length ~ ., data = train, mtry = i)
  valid_mtry[i] <- rmse(valid$Sepal.Length, predict(fit, valid)$predictions)
}
valid_mtry
#> [1] 0.3809234 0.3242173 0.3208119 0.3241518
(best_mtry <- which.min(valid_mtry))
#> [1] 3
# Fit and test final model
final_fit <- ranger(Sepal.Length ~ ., data = train, mtry = best_mtry)
rmse(test$Sepal.Length, predict(final_fit, test)$predictions)
#> [1] 0.3480947
Since our data set consists of only 150 rows, investing 20% of observations for validation is neither very robust nor data efficient. Let’s modify the modelling strategy by replacing simple validation by five-fold cross-validation, again using stratification on the response variable.
Split the data set iris into 80% training data and 20% test data, stratified by the variable Sepal.Length.
Use stratified five-fold cross-validation to tune the parameter mtry. We do this on the training data.
After selecting the best mtry by this simple “GridSearchCV”, we challenge the final model on the test data set.
# Split into training and test
inds <- partition(iris$Sepal.Length, p = c(train = 0.8, test = 0.2))
train <- iris[inds$train, ]
test <- iris[inds$test, ]
# Get stratified cross-validation in-sample indices
folds <- create_folds(train$Sepal.Length, k = 5)
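# Note: each fold holds the in-sample (training) row numbers,
# which is why the loop below fits on train[fold, ] and evaluates on train[-fold, ]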
# Tune mtry by GridSearchCV
valid_mtry <- numeric(ncol(train) - 1)
for (i in seq_along(valid_mtry)) {
  cv_mtry <- numeric()
  for (fold in folds) {
    fit <- ranger(Sepal.Length ~ ., data = train[fold, ], mtry = i)
    cv_mtry <- c(cv_mtry, 
                 rmse(train[-fold, "Sepal.Length"], predict(fit, train[-fold, ])$predictions))
  }
  valid_mtry[i] <- mean(cv_mtry)
}
# Result of cross-validation
valid_mtry
#> [1] 0.3870915 0.3460605 0.3337763 0.3302831
(best_mtry <- which.min(valid_mtry))
#> [1] 4
# Use optimal mtry to make model
final_fit <- ranger(Sepal.Length ~ ., data = train, mtry = best_mtry)
rmse(test$Sepal.Length, predict(final_fit, test)$predictions)
#> [1] 0.2970876
If feasible, repeated cross-validation is recommended in order to reduce uncertainty in decisions. The process is the same as above. Instead of five performance values per mtry value (one per fold), we now get five times the number of repetitions (here, 5 × 3 = 15).
# We start by making repeated, stratified cross-validation folds
folds <- create_folds(train$Sepal.Length, k = 5, m_rep = 3)
length(folds)
#> [1] 15
for (i in seq_along(valid_mtry)) {
  cv_mtry <- numeric()
  for (fold in folds) {
    fit <- ranger(Sepal.Length ~ ., data = train[fold, ], mtry = i)
    cv_mtry <- c(cv_mtry, 
                 rmse(train[-fold, "Sepal.Length"], predict(fit, train[-fold, ])$predictions))
  }
  valid_mtry[i] <- mean(cv_mtry)
}
# Result of cross-validation
valid_mtry
#> [1] 0.3934294 0.3544207 0.3422013 0.3393454
(best_mtry <- which.min(valid_mtry))
#> [1] 4
# Use optimal mtry to make model
final_fit <- ranger(Sepal.Length ~ ., data = train, mtry = best_mtry)
rmse(test$Sepal.Length, predict(final_fit, test)$predictions)
#> [1] 0.2937055
When modelling time series, standard cross-validation destroys the sequential nature of the data. This can be avoided by the following modification of k-fold cross-validation:
The data is first split into \(k+1\) blocks \(B_1, ..., B_{k+1}\), in sequential order. Depending on whether type = "extending" (default) or type = "moving" is chosen, the following data sets are used in cross-validation:
First fold: \(B_1\) is used for training, \(B_2\) for evaluation.
Second fold: \(\{B_1, B_2\}\) (type “extending”) or \(B_2\) (type “moving”) is used for training, \(B_3\) for evaluation.
…
\(k\)-th fold: \(\{B_1, ..., B_k\}\) (type “extending”) or \(B_{k}\) (type “moving”) is used for training, \(B_{k+1}\) for evaluation.
These schemes make sure that the evaluation data always follows the training data. Note that the training data grows linearly over the folds with type = "extending", whereas its length stays approximately constant with type = "moving".
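To see the two schemes side by side, the fold structure can be inspected directly. Here is a minimal sketch on a short, hypothetical toy series y_demo (not part of the workflow below):
# Toy series only used to compare the two fold types
y_demo <- rnorm(60)

# Default type = "extending": the in-sample block grows from fold to fold
str(create_timefolds(y_demo, k = 3))

# type = "moving": the in-sample block keeps an approximately constant length
str(create_timefolds(y_demo, k = 3, type = "moving"))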
In order to have a final evaluation of the optimized model, an initial blocked split into sequential training and test data is typically made.
We first create a time series and derive lagged features for training. Then, again, we optimize mtry of a random forest by time-series cross-validation. We evaluate the optimized model on the last 10% of the time series.
# Create data
set.seed(452)
n <- 1000
t <- seq(0, 2 * pi, length.out = n)
y <- 0.2 * sin(t) - 0.1 * cos(t) + 0.2 * runif(n)
plot(y ~ t, pch = ".", cex = 2)
# Helper function
Lag <- function(z, k = 1) {
  c(z[-seq_len(k)], rep(NA, k))
}
Lag(1:4, k = 1)
#> [1] 2 3 4 NA
# Add lagged features
dat <- data.frame(y, 
                  lag1 = Lag(y), 
                  lag2 = Lag(y, k = 2), 
                  lag3 = Lag(y, k = 3))
dat <- dat[complete.cases(dat), ]
head(dat)
#> y lag1 lag2 lag3
#> 1 -0.085447174 0.075314649 -0.007841658 0.10081516
#> 2 0.075314649 -0.007841658 0.100815165 0.09635395
#> 3 -0.007841658 0.100815165 0.096353945 -0.06476294
#> 4 0.100815165 0.096353945 -0.064762935 0.10474890
#> 5 0.096353945 -0.064762935 0.104748904 0.07030228
#> 6 -0.064762935 0.104748904 0.070302283 0.01085425
cor(dat)
#> y lag1 lag2 lag3
#> y 1.0000000 0.8789100 0.8858208 0.8840929
#> lag1 0.8789100 1.0000000 0.8791226 0.8858995
#> lag2 0.8858208 0.8791226 1.0000000 0.8791388
#> lag3 0.8840929 0.8858995 0.8791388 1.0000000
# Block partitioning
inds <- partition(dat$y, p = c(train = 0.9, test = 0.1), type = "blocked")
str(inds)
#> List of 2
#> $ train: int [1:898] 1 2 3 4 5 6 7 8 9 10 ...
#> $ test : int [1:99] 899 900 901 902 903 904 905 906 907 908 ...
train <- dat[inds$train, ]
test <- dat[inds$test, ]
# Get time series folds
folds <- create_timefolds(train$y, k = 5)
str(folds)
#> List of 5
#> $ Fold1:List of 2
#> ..$ insample : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 151 152 153 154 155 156 157 158 159 160 ...
#> $ Fold2:List of 2
#> ..$ insample : int [1:300] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 301 302 303 304 305 306 307 308 309 310 ...
#> $ Fold3:List of 2
#> ..$ insample : int [1:450] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 451 452 453 454 455 456 457 458 459 460 ...
#> $ Fold4:List of 2
#> ..$ insample : int [1:600] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:150] 601 602 603 604 605 606 607 608 609 610 ...
#> $ Fold5:List of 2
#> ..$ insample : int [1:750] 1 2 3 4 5 6 7 8 9 10 ...
#> ..$ outsample: int [1:148] 751 752 753 754 755 756 757 758 759 760 ...
# Tune mtry by GridSearchCV
valid_mtry <- numeric(ncol(train) - 1)
for (i in seq_along(valid_mtry)) {
  cv_mtry <- numeric()
  for (fold in folds) {
    fit <- ranger(y ~ ., data = train[fold$insample, ], mtry = i)
    cv_mtry <- c(cv_mtry, 
                 rmse(train[fold$outsample, "y"], 
                      predict(fit, train[fold$outsample, ])$predictions))
  }
  valid_mtry[i] <- mean(cv_mtry)
}
# Result of cross-validation
valid_mtry
#> [1] 0.08227426 0.08188822 0.08230784
(best_mtry <- which.min(valid_mtry))
#> [1] 2
# Use optimal mtry to make model and evaluate on future test data
final_fit <- ranger(y ~ ., data = train, mtry = best_mtry)
test_pred <- predict(final_fit, test)$predictions
rmse(test$y, test_pred)
#> [1] 0.07184702
# Plot
x <- seq_along(dat$y)
plot(x, dat$y, pch = ".", cex = 2)
points(tail(x, length(test$y)), test$y, col = "red", pch = ".", cex = 2)
lines(tail(x, length(test$y)), test_pred)