Introduction

The motivation for this package is to provide functions which help with the development and tuning of machine learning models in biomedical data where the sample size is frequently limited, but the number of predictors may be significantly larger (P >> n). While most machine learning pipelines involve splitting data into training and testing cohorts, typically 2/3 and 1/3 respectively, medical datasets may be too small for this, and so determination of accuracy in the left-out test set suffers because the test set is small. Nested cross-validation (CV) provides a way to get round this, by maximising use of the whole dataset for testing overall accuracy, while maintaining the split between training and testing.

In addition, typical biomedical datasets often have tens of thousands of possible predictors, so filtering of predictors is commonly needed. However, it has been demonstrated that filtering on the whole dataset creates a bias when determining accuracy of models (Vabalas et al, 2019). Feature selection of predictors should be considered an integral part of a model, with feature selection performed only on training data. Then the selected features and accompanying model can be tested on hold-out test data without bias. Thus, it is recommended that any filtering of predictors is performed within the CV loops, to prevent test data information leakage.

This package enables nested cross-validation (CV) to be performed using the commonly used glmnet package, which fits elastic net regression models, and the caret package, which is a general framework for fitting a large number of machine learning models. In addition, nestedcv adds functionality to enable cross-validation of the elastic net alpha parameter when fitting glmnet models.

nestedcv partitions the dataset into outer and inner folds (default 10x10 folds). The inner fold CV (default 10-fold) is used to tune optimal hyperparameters for models. Then the model is fitted on the whole outer training fold (i.e. all of its inner folds combined) and tested on the left-out data from the outer fold. This is repeated across all outer folds (default 10 outer folds). The unseen test predictions from the outer folds are concatenated and compared against the true results for the outer test folds, to give measures of accuracy (e.g. AUC and accuracy for classification, or RMSE for regression) across the whole dataset.
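The nested scheme described above can be sketched in base R. This is only an illustrative skeleton, not the internals of nestedcv: the helper names (make_folds, ridge_fit, ridge_predict), the toy ridge regression model and the simulated data are all assumptions made for the example.

```r
# Minimal nested CV skeleton in base R (illustrative only).
# Hypothetical toy model: ridge regression with lambda tuned in the inner loop.
set.seed(1)
n <- 60; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% c(2, -1, 0, 0, 0)) + rnorm(n)

# Assign each sample to one of k folds at random
make_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

# Closed-form ridge fit on centred data, plus a prediction helper
ridge_fit <- function(x, y, lambda) {
  xm <- colMeans(x); ym <- mean(y)
  xc <- sweep(x, 2, xm)
  beta <- solve(crossprod(xc) + lambda * diag(ncol(x)), crossprod(xc, y - ym))
  list(beta = drop(beta), xm = xm, ym = ym)
}
ridge_predict <- function(fit, newx) {
  drop(sweep(newx, 2, fit$xm) %*% fit$beta) + fit$ym
}

lambdas <- c(0.01, 0.1, 1, 10)
n_outer <- 5; n_inner <- 5
outer_id <- make_folds(n, n_outer)
pred <- numeric(n)              # unseen predictions collected across outer folds
best_lambda <- numeric(n_outer)

for (i in seq_len(n_outer)) {
  train <- which(outer_id != i)
  test <- which(outer_id == i)
  inner_id <- make_folds(length(train), n_inner)
  # Inner CV: mean squared error for each candidate lambda
  inner_mse <- sapply(lambdas, function(lam) {
    mean(sapply(seq_len(n_inner), function(j) {
      tr <- train[inner_id != j]
      va <- train[inner_id == j]
      fit <- ridge_fit(x[tr, , drop = FALSE], y[tr], lam)
      mean((y[va] - ridge_predict(fit, x[va, , drop = FALSE]))^2)
    }))
  })
  best_lambda[i] <- lambdas[which.min(inner_mse)]
  # Refit on the whole outer training fold, predict the left-out outer fold
  fit <- ridge_fit(x[train, , drop = FALSE], y[train], best_lambda[i])
  pred[test] <- ridge_predict(fit, x[test, , drop = FALSE])
}

rmse <- sqrt(mean((y - pred)^2))  # performance across the whole dataset
rmse
```

Every sample is predicted exactly once, by a model that never saw it during tuning or fitting, which is what allows the performance measure to use the whole dataset.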

Finally, the tuning parameters for each model in the outer folds are averaged to give the mean best parameters across all outer folds. A final model is fitted across the whole data using these final hyperparameters and can be used for prediction with external data.
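As a small worked illustration of this averaging step, the sketch below uses made-up best parameters from five outer folds. Whether a given parameter is averaged on its raw scale or on a log scale is not stated here, so both are shown for lambda as assumptions.

```r
# Hypothetical best hyperparameters from 5 outer folds (made-up values)
best_tunes <- data.frame(
  lambda = c(0.10, 0.20, 0.15, 0.12, 0.18),
  alpha  = c(0.70, 0.80, 0.90, 0.80, 0.75)
)

# Arithmetic mean of each parameter across outer folds
final_param <- colMeans(best_tunes)
final_param

# Geometric mean of lambda (mean on the log scale), an assumed alternative
# that may suit parameters searched over a log-spaced grid
lambda_geo <- exp(mean(log(best_tunes$lambda)))
lambda_geo
```

Note that an averaged value need not lie on the original tuning grid.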

Variable selection

While some models such as glmnet allow for sparsity and have variable selection built-in, many models fail to fit when given massive numbers of predictors, or perform poorly due to overfitting without variable selection. In addition, in medicine one of the goals of predictive modelling is commonly the development of diagnostic or biomarker tests, for which reducing the number of predictors is typically a practical necessity.

Several filter functions (t-test, Wilcoxon test, anova, Pearson/Spearman correlation, random forest variable importance, and ReliefF from the CORElearn package) for feature selection are provided, and can be embedded within the outer loop of the nested CV.

Installation

install.packages("nestedcv")
library(nestedcv)

Examples

Importance of nested CV

The following simulated example demonstrates the bias intrinsic to datasets where P >> n when applying filtering of predictors to the whole dataset rather than to training folds.

## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04)  # predictors
y <- factor(rbinom(150, 1, 0.5))  # binary response

## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)

## t-test filter using whole dataset (training and test samples)
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]

## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-4
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")

## Predict response on test set
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)

## Results on test set
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)
#>               AUC          Accuracy Balanced accuracy 
#>         0.9198718         0.8600000         0.8621795

## Nested CV
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 7:10 / 10,
                      filterFUN = ttest_filter,
                      filter_options = list(nfilter = 100))
fit2
#> Nested cross-validation with glmnet
#> Filter:  ttest_filter 
#> 
#> Final parameters:
#>    lambda      alpha  
#> 0.0001938  0.7000000  
#> 
#> Final coefficients:
#> (Intercept)       V1137      V15198       V4189      V13976      V17882 
#>    0.298695   -0.952609    0.826657    0.810132   -0.808610    0.796586 
#>       V3871      V17426      V13613       V9547      V14847       V6486 
#>   -0.743440    0.714931   -0.698244    0.664076   -0.660436   -0.650877 
#>       V5157      V16867       V2468       V3082       V8072      V15172 
#>   -0.644675    0.637427   -0.633349   -0.626273   -0.585535   -0.583907 
#>        V985      V12400      V10336      V15992      V15841      V16597 
#>   -0.571197   -0.569178   -0.567913    0.563064    0.547011    0.546713 
#>      V10275      V19353      V10637      V18792         V41      V10167 
#>   -0.541976   -0.530889    0.529753    0.510927   -0.491562   -0.489788 
#>       V2568       V6530      V11525       V3242       V2067       V9855 
#>    0.468789   -0.458550   -0.457703    0.456534   -0.452629   -0.439260 
#>      V11857      V15511       V7902       V1083       V7875       V7810 
#>    0.434469    0.421734    0.419901    0.409053    0.405210   -0.404394 
#>      V12928       V7265      V11994       V6034       V3869       V8222 
#>    0.395207    0.392503   -0.382519    0.373850   -0.371960   -0.369064 
#>      V15851       V4273      V19628      V17989       V2862      V15904 
#>    0.366102   -0.354554    0.327694    0.326189   -0.307085    0.298820 
#>      V10360      V14759       V2714       V4968       V3125        V114 
#>    0.292886   -0.289364    0.288755   -0.284631   -0.283124   -0.270684 
#>      V10096      V15473      V14592      V11415       V9651       V4292 
#>   -0.269675   -0.266762    0.265028   -0.258715   -0.249426    0.245484 
#>      V16987      V19094      V15542      V17573       V5158      V15809 
#>   -0.221939   -0.189252    0.188926   -0.155194   -0.146151   -0.140997 
#>      V19459       V2002      V10112      V11153       V2147      V19210 
#>    0.134456   -0.129726    0.128568    0.128120   -0.084154    0.084125 
#>       V3154        V621       V2887        V631      V16124        V319 
#>   -0.066919   -0.060291    0.026694   -0.016878   -0.002106   -0.001262 
#> 
#> Result:
#>               AUC           Accuracy  Balanced accuracy  
#>            0.4940             0.4667             0.4623

testroc <- pROC::roc(output$testy, output$predyp, direction = "<", quiet = TRUE)
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds", 
                                 "Test partition, non-nested filtering"), 
       col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")

In this example the dataset is pure noise. Filtering of predictors on the whole dataset is a source of leakage of information about the test set, leading to substantially overoptimistic performance on the test set as measured by ROC AUC.

Figures A & B below show two commonly used but biased methods, in which cross-validation is used to fit models but the resulting estimate of model performance is biased. In scheme A, there is no hold-out test set at all, so there are two sources of bias/data leakage: first, the filtering on the whole dataset, and second, the use of left-out CV folds for measuring performance. Left-out CV folds are known to lead to biased estimates of performance as the tuning parameters are ‘learnt’ from optimising the result on the left-out CV fold.

In scheme B, the CV is used to tune parameters and a hold-out set is used to measure performance, but information leakage occurs when filtering is applied to the whole dataset. Unfortunately this is commonly observed in many studies which apply differential expression analysis on the whole dataset to select predictors which are then passed to machine learning algorithms.

Figures C & D below show two valid methods for fitting a model with CV for tuning parameters as well as unbiased estimates of model performance. Figure C is a traditional hold-out test set, with the dataset partitioned 2/3 training, 1/3 test. The critical difference from scheme B above is that the filtering is only done on the training set and not on the whole dataset.

Figure D shows the scheme for fully nested cross-validation. Note that filtering is applied to each outer CV training fold. The key advantage of nested CV is that outer CV test folds are collated to give an improved estimate of performance compared to scheme C since the numbers for total testing are larger.

Nested CV with glmnet

In the real life example below, RNA-Sequencing gene expression data from synovial biopsies from patients with rheumatoid arthritis in the R4RA randomised clinical trial (Humby et al, 2021) is used to predict clinical response to the biologic drug rituximab. Treatment response is determined by a clinical measure, namely Clinical Disease Activity Index (CDAI) 50% response, which has a binary outcome: treatment success or failure (response or non-response). This dataset contains gene expression on over 50,000 genes in arthritic synovial tissue from 133 individuals, who were randomised to two drugs (rituximab and tocilizumab). First, we remove genes of low expression using a median cut-off (this still leaves >16,000 genes), and we subset the dataset to the rituximab treated individuals (n=68).

# Raw RNA-Seq data for this example is located at:
# https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-11611/

# set up data
load("/../R4RA_270821.RData")

index <- r4ra.meta$Outliers_Detected_On_PCA != "outlier" & r4ra.meta$Visit == 3 &
          !is.na(r4ra.meta$Visit)
metadata <- r4ra.meta[index, ]
dim(metadata)  # 133 individuals

medians <- Rfast::rowMedians(as.matrix(r4ra.vst))
data <- t(as.matrix(r4ra.vst))
# remove low expressed genes
data <- data[index, medians > 6]  
dim(data)  # 16254 genes

# Rituximab cohort only
yrtx <- metadata$CDAI.response.status.V7[metadata$Randomised.medication == "Rituximab"]
yrtx <- factor(yrtx)
data.rtx <- data[metadata$Randomised.medication == "Rituximab", ]

# no filter
res.rtx <- nestcv.glmnet(y = yrtx, x = data.rtx,
                         family = "binomial", cv.cores = 8,
                         alphaSet = seq(0.7, 1, 0.05))
res.rtx
## Nested cross-validation with glmnet
## No filter
## 
## Final parameters:
## lambda   alpha  
## 0.1511  0.7950  
## 
## Final coefficients:
## (Intercept)  AC016582.3       PCBP3    TMEM170B      EIF4E3     SEC14L6       CEP85        APLF 
##   0.8898659  -0.2676580  -0.2667770   0.2456329   0.2042326  -0.1992225   0.1076051  -0.1072684 
##       EARS2        PTK7       EFNA5        MEST      IQANK1    MTATP6P1       GSK3B       STK40 
##  -0.1036846  -0.0919594  -0.0882686   0.0769173  -0.0708992   0.0545392   0.0469272   0.0316988 
##     SUV39H2  AC005670.2      ZNF773        XIST       STAU2      DIRAS3 
##   0.0297370   0.0184851  -0.0170861  -0.0100934   0.0016182  -0.0009975 
## 
## Result:
##               AUC           Accuracy  Balanced accuracy  
##            0.7648             0.7059             0.6773

Use summary() to see full information from the nested model fitting. coef() can be used to show the coefficients of the final fitted model. Filters can be used by setting the filterFUN argument. Options for the filter function are passed as a list through filter_options.

# t-test filter
res.rtx <- nestcv.glmnet(y = yrtx, x = data.rtx, filterFUN = ttest_filter,
                         filter_options = list(nfilter = 300, p_cutoff = NULL),
                         family = "binomial", cv.cores = 8,
                         alphaSet = seq(0.7, 1, 0.05))
summary(res.rtx)

Output from the nested CV with glmnet can be plotted to show how deviance is affected by alpha and lambda.

plot_alphas(res.rtx)
plot_lambdas(res.rtx)

The tuning of alpha for each outer fold can be plotted.

# Fold 1 line plot
plot(res.rtx$outer_result[[1]]$cvafit)

# Scatter plot
plot(res.rtx$outer_result[[1]]$cvafit, type = 'p')

# Number of non-zero coefficients
plot(res.rtx$outer_result[[1]]$cvafit, xaxis = 'nvar')

ROC curves from left-out folds from both outer and inner CV can be plotted. Note that the AUC based on the left-out outer folds is the unbiased estimate of accuracy, while the left-out inner folds demonstrate bias due to the optimisation of the model’s hyperparameters on the inner fold data.

# Outer CV ROC
plot(res.rtx$roc, main = "Outer fold ROC", font.main = 1, col = 'blue')
legend("bottomright", legend = paste0("AUC = ", signif(pROC::auc(res.rtx$roc), 3)), bty = 'n')

# Inner CV ROC
rtx.inroc <- innercv_roc(res.rtx)
plot(rtx.inroc, main = "Inner fold ROC", font.main = 1, col = 'red')
legend("bottomright", legend = paste0("AUC = ", signif(pROC::auc(rtx.inroc), 3)), bty = 'n')

The overall expression level of each gene selected in the final model can be compared with a boxplot.

boxplot_model(res.rtx, data.rtx, ylab = "VST")

Leave-one-out cross-validation (LOOCV) can be performed on the outer folds.

# Outer LOOCV
res.rtx <- nestcv.glmnet(y = yrtx, x = data.rtx, min_1se = 0, filterFUN = ttest_filter,
                         filter_options = list(nfilter = 300, p_cutoff = NULL),
                         outer_method = "LOOCV",
                         family = "binomial", cv.cores = 8,
                         alphaSet = seq(0.7, 1, 0.05))
summary(res.rtx)

Filters

Multiple filters for variable reduction are available including:

ttest_filter      t-test
wilcoxon_filter   Wilcoxon (Mann-Whitney) test
anova_filter      one-way ANOVA
correl_filter     Pearson or Spearman correlation for regression modelling
lm_filter         linear model with covariates
rf_filter         random forest variable importance
relieff_filter    ReliefF and other methods available via CORElearn
boruta_filter     Boruta

# Random forest filter
res.rtx <- nestcv.glmnet(y = yrtx, x = data.rtx, min_1se = 0.5, filterFUN = rf_filter,
                         filter_options = list(nfilter = 300),
                         family = "binomial", cv.cores = 8, 
                         alphaSet = seq(0.7, 1, 0.05))
summary(res.rtx)

# ReliefF algorithm filter
res.rtx <- nestcv.glmnet(y = yrtx, x = data.rtx, min_1se = 0, filterFUN = relieff_filter,
                         filter_options = list(nfilter = 300),
                         family = "binomial", cv.cores = 8, 
                         alphaSet = seq(0.7, 1, 0.05))
summary(res.rtx)

Bootstrapped versions of the univariate filters are available [see boot_ttest()]. These use repeated random sampling to try to improve stability of ranking of predictors based on univariate statistics.
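The general idea of stabilising a univariate ranking by resampling can be sketched in base R as follows. This is not the package's implementation: the helpers t_rank and boot_rank are hypothetical, and the sketch assumes bootstrap resampling of samples followed by averaging of the per-resample ranks.

```r
set.seed(42)
n <- 40; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c("a", "b"), each = n / 2))
x[y == "b", 1:3] <- x[y == "b", 1:3] + 2  # three truly informative predictors

# Rank predictors by absolute Welch t-statistic (rank 1 = strongest)
t_rank <- function(y, x) {
  tstat <- apply(x, 2, function(col) t.test(col ~ y)$statistic)
  rank(-abs(tstat))
}

# Hypothetical helper: average the ranks over B bootstrap resamples
boot_rank <- function(y, x, B = 50) {
  ranks <- sapply(seq_len(B), function(b) {
    idx <- sample(length(y), replace = TRUE)
    t_rank(y[idx], x[idx, , drop = FALSE])
  })
  rowMeans(ranks)
}

mean_rank <- boot_rank(y, x)
top3 <- order(mean_rank)[1:3]  # predictors with the best average rank
top3
```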

Custom filter

It is fairly straightforward to create your own custom filter, which can be embedded within the outer CV loops via nestcv.glmnet, nestcv.train or outercv. The function simply must be of the form

filter <- function(y, x, ...) {}

Other arguments can be passed in to the filter function as a named list via the filter_options argument. The function must return a vector of indices of those predictors in x which are to be retained for downstream model fitting as well as prediction on left-out outer folds. Importantly the filter function is applied independently to each outer CV fold and not run on the whole data.
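As a minimal sketch of a function in the required form, the hypothetical var_filter below keeps the predictors with the highest variance and returns their column indices. The function name, its nfilter argument and the toy data are assumptions made for illustration.

```r
# Hypothetical custom filter: keep the nfilter predictors with highest variance.
# Follows the form function(y, x, ...) and returns column indices of x.
var_filter <- function(y, x, nfilter = 10, ...) {
  v <- apply(x, 2, var)
  ord <- order(v, decreasing = TRUE)
  sort(ord[seq_len(min(nfilter, ncol(x)))])
}

# Toy data: predictors 1 to 5 are given a larger variance
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
x[, 1:5] <- x[, 1:5] * 3
y <- rnorm(100)

keep <- var_filter(y, x, nfilter = 5)
keep
```

A function like this could then be supplied through filterFUN = var_filter, with filter_options = list(nfilter = 5) carrying the extra argument.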

Finally, once the model performance has been calculated by nested CV, the filter is applied to the whole dataset when refitting the final model to the full dataset.

Class imbalance

Class imbalance is known to impact on model fitting for certain model types, e.g. random forest. Models may tend to aim to predict the majority class and ignore the minority class as selecting the majority class can give high accuracy purely by chance. While performance measures such as balanced accuracy can give improved estimates of model performance, techniques for rebalancing data have been developed. These include:

  • Random oversampling of the minority class
  • Random undersampling of the majority class
  • Combination of oversampling and undersampling
  • Synthesising new data in the minority class, e.g. SMOTE (Chawla et al, 2002)

These are available within nestedcv using the balance argument to specify a balancing function. Other arguments to control the balancing process are passed to the function as a list via balance_options.

randomsample   Random oversampling of the minority class and/or undersampling of the majority class
smote          Synthetic minority oversampling technique (SMOTE)

Note that in nestedcv balancing is performed only on the outer training folds, immediately prior to filtering of features. This is important as balancing the whole dataset prior to the outer CV leads to data leakage of outer CV hold-out samples into the outer training folds.

The number of samples in each class in the outer CV folds can be checked on nestedcv objects using the function class_balance().
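To illustrate why plain accuracy can mislead with imbalanced classes, the following base R sketch scores a trivial classifier that always predicts the majority class. The 90/10 class split is a made-up example.

```r
# Imbalanced toy outcome: 90 controls, 10 cases
truth <- factor(rep(c("control", "case"), times = c(90, 10)),
                levels = c("control", "case"))
# A trivial classifier that always predicts the majority class
pred <- factor(rep("control", 100), levels = levels(truth))

accuracy <- mean(pred == truth)
# Balanced accuracy: mean of the per-class recall (sensitivity and specificity)
recall <- tapply(pred == truth, truth, mean)
balanced_accuracy <- mean(recall)

accuracy           # 0.9
balanced_accuracy  # 0.5
```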

Custom balancing function

A custom balancing function can be provided. The function must be of the form:

balance <- function(y, x, ...) {}

Other arguments can be passed in to the balance function as a named list via the balance_options argument. The function must return a list containing y, an expanded/altered response vector, and x, the matrix or dataframe of predictors with increased/decreased samples in rows and predictors in columns. In theory, manipulation of predictors to generate additional, composite predictors would be supported here.
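A minimal sketch of a function in this form is shown below. The name oversample is hypothetical; it performs simple random oversampling with replacement so that every class reaches the size of the largest class, and returns a list with elements y and x.

```r
# Hypothetical custom balancing function: random oversampling with replacement
# so that every class reaches the size of the largest class.
oversample <- function(y, x, ...) {
  y <- factor(y)
  target <- max(table(y))
  idx <- unlist(lapply(levels(y), function(lev) {
    cls <- which(y == lev)
    extra <- target - length(cls)
    # keep all original samples, then draw the shortfall with replacement
    c(cls, cls[sample.int(length(cls), extra, replace = TRUE)])
  }))
  list(y = y[idx], x = x[idx, , drop = FALSE])
}

set.seed(1)
x <- matrix(rnorm(30 * 4), 30, 4)
y <- factor(rep(c("a", "b"), times = c(24, 6)))
bal <- oversample(y, x)
table(bal$y)
```

Such a function would be supplied through the balance argument, so that it is applied to outer training folds only and never to the hold-out samples.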

Nested CV with caret

Nested CV can also be performed using the caret package framework written by Max Kuhn (https://topepo.github.io/caret/index.html). This enables access to the large library of machine learning models available within caret.

Here we use caret for tuning the alpha and lambda parameters of glmnet.

# nested CV using caret
tg <- expand.grid(lambda = exp(seq(log(2e-3), log(1e0), length.out = 100)),
                  alpha = seq(0.8, 1, 0.1))
ncv <- nestcv.train(y = yrtx, x = data.rtx,
               method = "glmnet",
               savePredictions = "final",
               filterFUN = ttest_filter, filter_options = list(nfilter = 300),
               tuneGrid = tg, cv.cores = 8)
ncv$summary

# Plot ROC on outer folds
plot(ncv$roc)

# Plot ROC on inner LO folds
inroc <- innercv_roc(ncv)
plot(inroc)
pROC::auc(inroc)

# Extract coefficients of final fitted model
glmnet_coefs(ncv$final_fit$finalModel, s = ncv$finalTune$lambda)

Notes on caret

It is important to realise that the train() function in caret sets a parameter known as tuneLength to 3 by default, so the default tuning grid is minimal. tuneLength can easily be increased to give a tuning grid of greater granularity. Tuneable parameters for individual models can be inspected using modelLookup(), while getModelInfo() gives further information.

When fitting classification models, the usual default metric for tuning model hyperparameters in caret is Accuracy. However, with small datasets, accuracy is disproportionately affected by changes in a single individual’s prediction outcome from incorrect to correctly classified or vice versa. For this reason, we suggest using logLoss with smaller datasets as it provides more stable measures of model tuning behaviour. In nestedcv, when fitting classification models with caret, the default metric is changed to use logLoss.
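The sensitivity of accuracy in small samples can be seen in a short base R sketch. The helper functions and the ten made-up predictions below are assumptions for illustration: a single predicted probability moving from 0.51 to 0.49 changes accuracy by a full 10 percentage points, while log loss barely moves.

```r
# Binary log loss; probabilities are clipped to avoid log(0)
log_loss <- function(truth, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}
accuracy <- function(truth, prob) mean((prob > 0.5) == (truth == 1))

truth <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)   # 10 samples
prob_a <- c(0.9, 0.8, 0.7, 0.2, 0.3, 0.1, 0.51, 0.4, 0.6, 0.35)
prob_b <- prob_a
prob_b[7] <- 0.49   # one prediction moves slightly across the 0.5 threshold

acc <- c(accuracy(truth, prob_a), accuracy(truth, prob_b))
ll <- c(log_loss(truth, prob_a), log_loss(truth, prob_b))
acc  # drops from 1.0 to 0.9
ll   # changes only slightly
```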

We recommend that the results of tuning are plotted to understand whether parameters have a systematic effect on model accuracy. With small datasets, tuning may pick parameters partially at random because of random fluctuations in measured accuracy during tuning. This may add more noise to the measured performance than if the parameters were fixed.

# Example tuning plot for outer fold 1
plot(ncv$outer_result[[1]]$fit, xTrans = log)

# ggplot2 version
library(ggplot2)
ggplot(ncv$outer_result[[1]]$fit) +
  scale_x_log10()

Making predictions

For all of the nestedcv model training functions described above, predictions on new data can be made by referencing the final fitted object which is fitted to the whole dataset.

# for nestcv.glmnet object
preds <- predict(res.rtx, newdata = data.rtx, type = 'response')

# for nestcv.train object
preds <- predict(ncv, newdata = data.rtx)

Bayesian shrinkage models

The two examples below implement Bayesian linear and logistic regression models using the horseshoe prior over parameters to encourage a sparse model. Models are fitted using the hsstan R package, which performs full Bayesian inference through a Stan implementation. In Bayesian inference model meta-parameters such as the amount of shrinkage are also given prior distributions and are thus directly learned from the data through sampling. This bypasses the need to cross-validate results over a grid of values for the meta-parameters, as would be required to find the optimal lambda in a lasso or elastic net model.

However, Bayesian inference is computationally intensive. In high-dimensional settings, with e.g. more than 10,000 biomarkers, pre-filtering of inputs based on univariate measures of association with the outcome may be beneficial. If pre-filtering of inputs is used then a cross-validation procedure is needed to ensure that the data points used for pre-filtering and model fitting differ from the data points used to quantify model performance. The outercv() function is used to perform univariate pre-filtering and cross-validate model performance in this setting.

CAUTION should be used when setting the number of cores available for parallelisation. The code will use the product of cv.cores and mc.cores. The former controls parallelisation over folds, while the latter controls parallelisation in the Bayesian inference procedure. The default setting for hsstan is to use 4 Markov chains for sampling. We recommend setting options(mc.cores = 4) which will parallelise computation over the chains. If we are also using a 10-fold cross-validation procedure, and set cv.cores = 10 to parallelise computation over folds, then 40 cores will be used in total.
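The arithmetic can be checked before launching a run. The sketch below simply multiplies the two settings and compares the product with the number of cores detected on the machine; the variable names are illustrative.

```r
cv.cores <- 10   # parallelisation over CV folds
mc.cores <- 4    # parallelisation over Markov chains
total_cores <- cv.cores * mc.cores
total_cores  # 40

# Compare against the cores detected on this machine (may be NA on some systems)
available <- parallel::detectCores()
if (!is.na(available) && total_cores > available) {
  message("Requested ", total_cores, " cores but only ", available, " detected")
}
```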

Linear regression with a Bayesian shrinkage model (continuous outcomes)

We use cross-validation and apply univariate filtering of predictors and model fitting in one part of the data (training fold), followed by evaluation of model performance on the left-out data (testing fold), repeated in each fold.

Only one cross-validation split is needed (function outercv) as the Bayesian model does not require cross-validation for meta-parameters.


# specify options for running the code
nfolds <- 3

# specify number of cores for parallelising computation
# the product of cv.cores and mc.cores will be used in total
# number of cores for parallelising over CV folds
cv.cores <- 1
# number of cores for parallelising hsstan sampling
# to pass CRAN package checks we need to create object oldopt
# in the end we reset options to the old configuration
oldopt <- options(mc.cores = 2)

# load iris dataset and simulate a continuous outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
dt$outcome.cont <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)

# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.cont, x = dt[, c(uvars, pvars)],
                         model = model.hsstan,
                         filterFUN = lm_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         n_outer_folds = nfolds, cv.cores = cv.cores,
                         unpenalized = uvars, warmup = 1000, iter = 2000)
# view prediction performance based on testing folds
summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop:  3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#> 
#> Model:  model.hsstan 
#> Filter:  lm_filter 
#>         n.filter
#> Fold 1         3
#> Fold 2         3
#> Fold 3         3
#> 
#> Final fit:             mean   sd  2.5% 97.5% n_eff Rhat
#> (Intercept) -3.14 0.17 -3.47 -2.81  3586    1
#> marker1      0.50 0.28  0.01  1.12  3663    1
#> marker2      1.58 0.19  1.19  1.95  3773    1
#> marker3     -0.06 0.25 -0.72  0.41  4104    1
#> 
#> Result:
#>     RMSE  Rsquared       MAE  
#>   2.1769    0.3339    1.7953

# load hsstan package to examine the Bayesian model
library(hsstan)
#> hsstan 0.8.1: using 2 cores, set 'options(mc.cores)' to change.
sampler.stats(res.cv.hsstan$final_fit)
#>         accept.stat stepsize divergences treedepth gradients warmup sample
#> chain:1      0.9825   0.0363           0         7    110720   0.83   0.59
#> chain:2      0.9881   0.0270           0         8    128216   0.82   0.69
#> chain:3      0.9735   0.0425           0         7     98832   0.92   0.53
#> chain:4      0.9933   0.0233           0         8    158808   1.02   0.86
#> all          0.9843   0.0323           0         8    496576   3.59   2.67
print(projsel(res.cv.hsstan$final_fit), digits = 4)  # adding marker2
#>                                                      Model        KL        ELPD
#>                                             Intercept only   0.22969  -360.70301
#>                                           Initial submodel   0.22249  -360.29745
#>                                                    marker2   0.00097  -326.05259
#>                                                    marker3   0.00000  -326.13922
#>                var        kl rel.kl.null rel.kl   elpd delta.elpd
#> 1   Intercept only 0.2296893     0.00000     NA -360.7  -34.56380
#> 2 Initial submodel 0.2224910     0.03134 0.0000 -360.3  -34.15824
#> 3          marker2 0.0009713     0.99577 0.9956 -326.1    0.08663
#> 4          marker3 0.0000000     1.00000 1.0000 -326.1    0.00000
options(oldopt)  # reset options

Here adding marker2 improves the model fit: substantial decrease of KL-divergence from the full model to the submodel. Adding marker3 does not improve the model fit: no decrease of KL-divergence from the full model to the submodel.
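The relative KL columns in the output above appear consistent with a simple ratio against a reference model. The sketch below reproduces them from the printed kl values; it is an interpretation of the printed numbers rather than a statement of how hsstan computes them.

```r
# KL values copied from the projsel() output above
kl <- c("Intercept only" = 0.2296893, "Initial submodel" = 0.2224910,
        marker2 = 0.0009713, marker3 = 0)

# Relative reduction in KL compared with the intercept-only model
rel_kl_null <- 1 - kl / kl[["Intercept only"]]
# Relative reduction compared with the initial submodel (unpenalised covariates)
rel_kl <- 1 - kl / kl[["Initial submodel"]]

round(rel_kl_null, 5)
round(rel_kl[c("marker2", "marker3")], 4)
```

On this reading, marker2 accounts for over 99% of the reduction in KL-divergence, and marker3 contributes almost nothing further.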

Logistic regression with a Bayesian shrinkage model (binary outcomes)

We use cross-validation and apply univariate filtering of predictors and model fitting in one part of the data (training fold), followed by evaluation of model performance on the left-out data (testing fold), repeated in each fold.

Only one cross-validation split is needed (function outercv) as the Bayesian model does not require cross-validation for meta-parameters.

# sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

# specify options for running the code
nfolds <- 3

# specify number of cores for parallelising computation
# the product of cv.cores and mc.cores will be used in total
# number of cores for parallelising over CV folds
cv.cores <- 1
# number of cores for parallelising hsstan sampling
# to pass CRAN package checks we need to create object oldopt
# in the end we reset options to the old configuration
oldopt <- options(mc.cores = 2)

# load iris dataset and create a binary outcome
set.seed(267)
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
rownames(dt) <- paste0("sample", c(1:nrow(dt)))
dt$outcome.bin <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
dt$outcome.bin <- factor(dt$outcome.bin)


# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
res.cv.hsstan <- outercv(y = dt$outcome.bin,
                         x = as.matrix(dt[, c(uvars, pvars)]),
                         model = model.hsstan,
                         filterFUN = ttest_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         n_outer_folds = nfolds, cv.cores = cv.cores,
                         unpenalized = uvars, warmup = 1000, iter = 2000)


# view prediction performance based on testing folds
summary(res.cv.hsstan)
#> Single cross-validation to measure performance
#> Outer loop:  3-fold CV
#> No inner loop
#> 150 observations, 4 predictors
#> FALSE  TRUE 
#>    73    77 
#> 
#> Model:  model.hsstan 
#> Filter:  ttest_filter 
#>         n.filter
#> Fold 1         3
#> Fold 2         3
#> Fold 3         3
#> 
#> Final fit:            mean   sd  2.5% 97.5% n_eff Rhat
#> (Intercept) 0.09 0.20 -0.30  0.48  3570    1
#> marker1     0.39 0.33 -0.42  0.94  3388    1
#> marker2     1.76 0.34  1.15  2.46  3969    1
#> marker3     0.14 0.33 -0.29  1.10  2888    1
#> 
#> Result:
#>               AUC           Accuracy  Balanced accuracy  
#>            0.8127             0.7467             0.7468

# examine the Bayesian model
sampler.stats(res.cv.hsstan$final_fit)
#>         accept.stat stepsize divergences treedepth gradients warmup sample
#> chain:1      0.9910   0.0358           0         7    107952   0.68   0.83
#> chain:2      0.9952   0.0275           0         8    129032   0.69   0.99
#> chain:3      0.9896   0.0426           0         7     91872   0.85   0.71
#> chain:4      0.9936   0.0309           0         7    117480   0.75   0.92
#> all          0.9923   0.0342           0         8    446336   2.97   3.45
print(projsel(res.cv.hsstan$final_fit), digits = 4)  # adding marker2
#>                                                      Model        KL        ELPD
#>                                             Intercept only   0.18519  -104.24943
#>                                           Initial submodel   0.17839  -103.92563
#>                                                    marker2   0.00132  -76.74973
#>                                                    marker3   0.00000  -76.69643
#>                var        kl rel.kl.null rel.kl    elpd delta.elpd
#> 1   Intercept only 1.852e-01     0.00000     NA -104.25   -27.5530
#> 2 Initial submodel 1.784e-01     0.03673 0.0000 -103.93   -27.2292
#> 3          marker2 1.316e-03     0.99289 0.9926  -76.75    -0.0533
#> 4          marker3 6.320e-18     1.00000 1.0000  -76.70     0.0000
options(oldopt)  # reset options

Here adding marker2 improves the model fit: substantial decrease of KL-divergence from the full model to the submodel. Adding marker3 does not improve the model fit: no decrease of KL-divergence from the full model to the submodel.

Parallelisation

Currently nestcv.glmnet, nestcv.train and outercv all allow parallelisation of the outer CV loop using mclapply from the parallel package on unix/mac and parLapply on windows. This is enabled by specifying the number of cores using the argument cv.cores. Since in some cases the filtering process can be time consuming (e.g. rf_filter, relieff_filter), we tend to recommend parallelisation of the outer CV loop over parallelisation of the inner CV loop.
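The platform-dependent pattern described above can be sketched with the parallel package alone. Here run_fold is a hypothetical stand-in for the work done in one outer fold, and the sketch is not the package's own code.

```r
library(parallel)

# Hypothetical stand-in for the work done in one outer fold
run_fold <- function(i) {
  set.seed(i)
  mean(rnorm(1000))
}

folds <- 1:4
cv.cores <- 2

if (.Platform$OS.type == "windows") {
  # Windows cannot fork, so use a socket cluster with parLapply
  cl <- makeCluster(cv.cores)
  res <- parLapply(cl, folds, run_fold)
  stopCluster(cl)
} else {
  # unix/mac: fork worker processes with mclapply
  res <- mclapply(folds, run_fold, mc.cores = cv.cores)
}

# Compare with a sequential run over the same folds
seq_res <- lapply(folds, run_fold)
identical(res, seq_res)
```

Seeding inside each fold keeps results reproducible regardless of how folds are distributed across workers.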

The caret package is set up for parallelisation using foreach (https://topepo.github.io/caret/parallel-processing.html). We generally do not recommend nested parallelisation of both outer and inner loops, although this is theoretically feasible if you have enough cores. If P processors are registered with the parallel backend, some caret functionality leads to P^2 processes being generated. We generally find this does not lead to much speed-up once the number of processes reaches the number of physical cores, as all processors are saturated and there are both time and memory overheads for duplicating data and packages for each process.

Note that at the time of writing, there appears to be a bug in rstan (used by hsstan) leading to it ignoring the argument cores and instead spawning multiple processes as specified by the chains argument. This behaviour can be limited by setting options(mc.cores = ..).

References

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002; 16:321-357.

Humby F, Durez P, Buch MH, Lewis MJ et al. Rituximab versus tocilizumab in anti-TNF inadequate responder patients with rheumatoid arthritis (R4RA): 16-week outcomes of a stratified, biopsy-driven, multicentre, open-label, phase 4 randomised controlled trial. Lancet. 2021; 397(10271):305-317. https://doi.org/10.1016/s0140-6736(20)32341-2

Kuhn M. Building Predictive Models in R Using the caret Package. Journal of Statistical Software. 2008; 28(5):1-26.

Rivellese F, Surace AEA, Goldmann K, et al, Lewis MJ, Pitzalis C and the R4RA collaborative group. Rituximab versus tocilizumab in rheumatoid arthritis: Synovial biopsy-based biomarker analysis of the phase 4 R4RA randomized trial. Nature Medicine. 2022; 28(6):1256-1268. https://doi.org/10.1038/s41591-022-01789-0

Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One. 2019; 14(11):e0224365.

Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005; 67(2):301-320.