Learning algorithms and hyperparameter optimisation

Alex Zwanenburg

2022-08-12

Learning algorithms create models that relate input data to the outcome. Development data is used to train models, after which they can be used to predict outcomes for new data. Familiar implements several commonly used algorithms. This vignette first describes the learners and their hyperparameters. Then, hyperparameter optimisation is described in more detail.

Overview of learners implemented in familiar. (a) These learners test multiple distributions or linking families and attempt to find the best option. (b) k-nearest neighbours learners allow for setting the distance metric using *. If omitted (e.g. knn), the distance metric is either euclidean (numeric features) or gower (mixed features). (c) The SVM kernel is indicated by *. If omitted (e.g. svm_c), the radial basis function is used as kernel. (d) The SVM type is indicated by *, and must be one of c, nu, or epsilon.
learner tag binomial multinomial continuous count survival
generalised linear models
general (a) glm × × × × ×
logistic glm_logistic ×
cauchy glm_cauchy ×
complementary log-log glm_loglog ×
normal glm_probit ×
multinomial glm_multinomial ×
log-normal glm_log, glm_log_gaussian ×
normal (gaussian) glm_gaussian ×
inverse gaussian glm_inv_gaussian ×
log-poisson glm_log_poisson × ×
poisson glm_poisson × ×
cox cox ×
exponential survival_regr_exponential ×
gaussian survival_regr_gaussian ×
logistic survival_regr_logistic ×
log-logistic survival_regr_loglogistic ×
log-normal survival_regr_lognormal ×
survival regression (a) survival_regr ×
weibull survival_regr_weibull ×
lasso regression models
general (a) lasso × × × × ×
logistic lasso_binomial ×
multi-logistic lasso_multinomial ×
normal (gaussian) lasso_gaussian ×
poisson lasso_poisson ×
cox lasso_cox ×
ridge regression models
general (a) ridge × × × × ×
logistic ridge_binomial ×
multi-logistic ridge_multinomial ×
normal (gaussian) ridge_gaussian ×
poisson ridge_poisson ×
cox ridge_cox ×
elastic net regression models
general (a) elastic_net × × × × ×
logistic elastic_net_binomial ×
multi-logistic elastic_net_multinomial ×
normal (gaussian) elastic_net_gaussian ×
poisson elastic_net_poisson ×
cox elastic_net_cox ×
boosted linear models
general (a) boosted_glm × × × ×
area under the curve boosted_glm_auc ×
cauchy boosted_glm_cauchy ×
complementary log-log boosted_glm_loglog ×
logistic boosted_glm_logistic ×
log-logistic boosted_glm_log ×
normal boosted_glm_probit ×
gaussian boosted_glm_gaussian ×
huber loss boosted_glm_huber ×
laplace boosted_glm_laplace ×
poisson boosted_glm_poisson ×
concordance index boosted_glm_cindex ×
cox boosted_glm_cox ×
log-logistic boosted_glm_loglog ×
log-normal boosted_glm_lognormal ×
rank-based estimation boosted_glm_gehan ×
survival regression (a) boosted_glm_surv ×
weibull boosted_glm_weibull ×
extreme gradient boosted linear models
general (a) xgboost_lm × × × × ×
logistic xgboost_lm_logistic × × ×
gamma xgboost_lm_gamma ×
gaussian xgboost_lm_gaussian × ×
poisson xgboost_lm_poisson ×
cox xgboost_lm_cox ×
random forests
random forest (RFSRC) random_forest_rfsrc × × × × ×
random forest (ranger) random_forest_ranger × × × × ×
boosted regression trees
general (a) boosted_tree × × × ×
area under the curve boosted_tree_auc ×
cauchy boosted_tree_cauchy ×
complementary log-log boosted_tree_loglog ×
logistic boosted_tree_logistic ×
log-logistic boosted_tree_log ×
normal boosted_tree_probit ×
gaussian boosted_tree_gaussian ×
huber loss boosted_tree_huber ×
laplace boosted_tree_laplace ×
poisson boosted_tree_poisson ×
concordance index boosted_tree_cindex ×
cox boosted_tree_cox ×
log-logistic boosted_tree_loglog ×
log-normal boosted_tree_lognormal ×
rank-based estimation boosted_tree_gehan ×
survival regression (a) boosted_tree_surv ×
weibull boosted_tree_weibull ×
extreme gradient boosted trees
general (a) xgboost_tree × × × × ×
logistic xgboost_tree_logistic × × ×
gamma xgboost_tree_gamma ×
gaussian xgboost_tree_gaussian × ×
poisson xgboost_tree_poisson ×
cox xgboost_tree_cox ×
extreme gradient boosted DART trees
general (a) xgboost_dart × × × × ×
logistic xgboost_dart_logistic × × ×
gamma xgboost_dart_gamma ×
gaussian xgboost_dart_gaussian × ×
poisson xgboost_dart_poisson ×
cox xgboost_dart_cox ×
bayesian models
naive bayes naive_bayes × ×
nearest neighbour models
k-nearest neighbours (b) k_nearest_neighbours_*, knn_* × × × ×
support vector machines
\(C\)-classification (c) svm_c_* × ×
\(\nu\)-classification/regression (c) svm_nu_* × × × ×
\(\epsilon\)-regression (c) svm_eps_* × ×
linear kernel (d) svm_*_linear × × × ×
polynomial kernel (d) svm_*_polynomial × × × ×
radial kernel (d) svm_*_radial × × × ×
sigmoid kernel (d) svm_*_sigmoid × × × ×

Configuration options

Learners, their hyperparameters, and parallelisation options can be specified using the tags or arguments below:

Configuration options for model development.
tag / argument description default
learner The desired learner. Multiple learners may be provided at the same time. This setting has no default and must be provided. – (required)
hyperparameter Each learner has one or more hyperparameters which can be specified. Sequential model-based optimisation is used to identify hyperparameter values from the data unless these are specifically specified here. See the section on hyperparameter optimisation for more details. – (optional)
parallel_model_development Enables parallel processing for model development. Ignored if parallel=FALSE. TRUE
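
As a minimal sketch, these arguments can be passed directly to summon_familiar. The data set df, its columns, the chosen learners and the remaining settings (such as the feature selection method and experimental design) are illustrative placeholders:

# Minimal sketch: specifying learners and parallelisation for model development.
# The data set `df`, its columns and all values shown are illustrative only.
library(familiar)

summon_familiar(
  data = df,
  experimental_design = "fs+mb",
  outcome_type = "binomial",
  outcome_column = "outcome",
  fs_method = "mrmr",
  learner = c("glm_logistic", "random_forest_ranger"),
  parallel_model_development = TRUE
)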

Generalised linear models

Generalised linear models (GLM) are easy to understand, use and share. In many situations, GLM may be preferred over more complex models as they are easier to report and validate. The most basic GLM is the linear model. The linear model for an outcome variable \(y_j\) and a single predictor variable \(x_j\) for sample \(j\) is:

\[y_j=\beta_0 + \beta_1 x_j + \epsilon_j\] Here, \(\beta_0\) and \(\beta_1\) are both model coefficients. \(\beta_0\) is also called the model intercept. \(\epsilon_j\) is an error term that describes the difference between the predicted and the actual response for sample \(j\). When a linear model is developed, the \(\beta\)-coefficients are set in such manner that the mean-squared error over the sample population is minimised.

The above model is a univariate model as it includes only a single predictor. A multivariate linear model includes multiple predictors \(\mathbf{X_j}\) and is denoted as:

\[y_j= \mathbf{\beta}\mathbf{X_j} + \epsilon_j\]

\(\mathbf{\beta}\mathbf{X_j}\) is the linear predictor. The GLM relates this linear predictor to the expected outcome through a linking function \(g\), i.e. \(g\left(\mathbb{E}\left[y_j\right]\right) = \mathbf{\beta}\mathbf{X_j}\) (Nelder and Wedderburn 1972), or equivalently:

\[y_j=g^{-1}\left(\mathbf{\beta}\mathbf{X_j}\right) + \epsilon_j\]

For example, binomial outcomes are commonly modelled using logistic regression with the logit linking function. Multiple linking functions are available in familiar and are detailed below.
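
As an illustration of a linking function in practice, the sketch below fits a simple logistic regression model with the stats package (on which the binomial learners described below are based); the simulated data are for illustration only:

# Illustration of the logit linking function using stats::glm.
# The simulated data are purely illustrative.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-(0.5 + 1.2 * x))))

model <- glm(y ~ x, family = binomial(link = "logit"))

# Predictions on the linear predictor scale (beta * X) and on the response
# scale (after applying the inverse of the linking function).
head(predict(model, type = "link"))
head(predict(model, type = "response"))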

Linear models for binomial outcomes

Linear models for binomial outcomes are based on the stats package, which is part of the R core distribution (R Core Team 2019). Hyperparameters for these models include linking functions and are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
linking function family logistic, probit, loglog, cauchy glm only The linking function is not optimised when it is specified, e.g. glm_logistic.
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).

Linear models for multinomial outcomes

The linear model for multinomial outcomes is implemented using the VGAM::vglm function (T. W. Yee and Wild 1996; T. Yee 2010; Thomas W. Yee 2015). Hyperparameters for this model are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
linking function family multinomial no There is only one linking function available.
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).

Linear models for continuous and count-type outcomes

Linear models for continuous and count-type outcomes are based on the stats package of the R core distribution (R Core Team 2019). Hyperparameters for these models include linking functions and are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
linking function family gaussian, log_gaussian, inv_gaussian, poisson, log_poisson glm only The linking function is not optimised when it is specified, e.g. glm_poisson. gaussian, log_gaussian, inv_gaussian linking functions are not available for count-type outcomes.

Linear models for survival outcomes

Linear models for survival outcomes are divided into semi-parametric and parametric models. The semi-parametric Cox proportional hazards model (Cox 1972) is based on the survival::coxph function (Therneau and Grambsch 2000). Tied survival times are resolved using the default method by Efron (1977).

Various fully parametric models are also available, which differ in the assumed distribution of the outcome, i.e. the linking function. The parametric models are all based on survival::survreg (Therneau and Grambsch 2000).

Hyperparameters for semi-parametric and parametric survival models are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
linking function family weibull, exponential, gaussian, logistic, lognormal, loglogistic survival_regr only The linking function is not optimised when it is specified, e.g. survival_regr_weibull. The semi-parametric cox learner does not have a linking function.

Lasso, ridge and elastic net regression

Generalised linear models can be problematic as there is no inherent limit to model complexity, which can easily lead to overfitting. Penalised regression, or shrinkage, methods address this issue by penalising model complexity.

Three shrinkage methods are implemented in familiar, namely ridge regression, lasso and elastic net, which are all implemented using the glmnet package (Hastie, Tibshirani, and Friedman 2009; Simon et al. 2011). The set of hyperparameters is shown in the table below. The optimal lambda parameter is determined by cross-validation as part of the glmnet::cv.glmnet function, and is not directly determined using hyperparameter optimisation.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
elastic net penalty alpha \(\mathbb{R} \in \left[0,1\right]\) elastic net This penalty is fixed for ridge regression (alpha = 0) and lasso (alpha = 1).
optimal lambda lambda_min lambda.1se, lambda.min no Default is lambda.min.
number of CV folds n_folds \(\mathbb{Z} \in \left[3,n\right]\) no Default is \(3\) if \(n<30\), \(\lfloor n/10\rfloor\) if \(30\leq n \leq 200\) and \(20\) if \(n>200\).
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples for binomial and multinomial outcomes, and none otherwise. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).
normalisation normalise FALSE, TRUE no Default is FALSE, as normalisation is part of pre-processing in familiar.
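
The sketch below illustrates the underlying glmnet call for an elastic net model; it is not familiar's internal code, and the simulated data and alpha value are for illustration only:

# Illustration of elastic net fitting with glmnet::cv.glmnet. The lambda
# penalty is selected by internal cross-validation (here lambda.min), not by
# familiar's hyperparameter optimisation.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- rbinom(100, size = 1, prob = plogis(x[, 1] - 0.5 * x[, 2]))

cv_fit <- cv.glmnet(
  x = x,
  y = y,
  family = "binomial",
  alpha = 0.5,  # 0 = ridge, 1 = lasso, in between = elastic net
  nfolds = 10
)

# Coefficients at the cross-validated optimum.
coef(cv_fit, s = "lambda.min")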

Boosted generalised linear models and regression trees

Boosting is a procedure which combines many weak learners to form a more powerful ensemble (Schapire 1990; Hastie, Tibshirani, and Friedman 2009). Boosting learners are implemented using the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner, Boccuto, and Göker 2015). Both linear regression (mboost::glmboost) and regression trees (mboost::blackboost) are implemented. The hyperparameters are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
family family logistic, probit, bin_loglog, cauchy, log, auc, gaussian, huber, laplace, poisson, cox, weibull, lognormal, surv_loglog, gehan, cindex general tags The family is not optimised when it is specified, e.g. boosted_tree_gaussian.
boosting iterations n_boost \(\mathbb{R} \in \left[0,\infty\right)\) yes This parameter is expressed on the \(\log_{10}\) scale, i.e. the actual input value will be \(10^\texttt{n_boost}\). The default range is \(\left[0, 3\right]\).
learning rate learning_rate \(\mathbb{R} \in \left(-\infty,0\right]\) yes This parameter is expressed on the \(\log_{10}\) scale. The default range is \(\left[-5,0\right]\).
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples for binomial and multinomial outcomes, and none otherwise. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).
maximum tree depth tree_depth \(\mathbb{Z} \in \left[1,n\right]\) trees only Maximum depth to which trees are allowed to grow.
minimum sum of instance weight min_child_weight \(\mathbb{R} \in \left[0,\infty\right)\) trees only Minimal instance weight required for further branch partitioning, or the number of instances required in each node. This parameter is expressed on the \(\log_{10}\) scale with a \(-1\) offset. The default range is \(\left[0, 2\right]\).
significance split threshold alpha \(\mathbb{R} \in \left(0.0, 1.0\right]\) trees only Minimum significance level for further splitting. The default range is \(\left[10^{-6}, 1.0\right]\)

Optimising the number of boosting iterations may be slow as models with a large number of boosting iterations take longer to learn and assess. If this is an issue, the n_boost parameter should be set to a smaller range, or provided with a fixed value.

Low learning rates may require an increased number of boosting iterations.

Also note that hyperparameter optimisation time depends on the type of family chosen, e.g. cindex is considerably slower than cox.
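
As an illustration, a boosted generalised linear model with a logistic family can be fitted along the lines of the sketch below, mapping n_boost and learning_rate onto mboost's mstop and nu arguments; the simulated data and exact arguments are illustrative and need not match familiar's internal code:

# Illustration of a boosted generalised linear model using mboost::glmboost.
# The simulated data are purely illustrative.
library(mboost)

set.seed(1)
data <- data.frame(
  x1 = rnorm(100),
  x2 = rnorm(100)
)
data$y <- factor(rbinom(100, size = 1, prob = plogis(data$x1 - data$x2)))

model <- glmboost(
  y ~ x1 + x2,
  data = data,
  family = Binomial(link = "logit"),
  control = boost_control(
    mstop = 100,  # number of boosting iterations, i.e. 10^n_boost
    nu = 0.1      # learning rate, i.e. 10^learning_rate
  )
)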

Extreme gradient boosted linear models and trees

Boosting is a procedure which combines many weak learners to form a more powerful ensemble (Schapire 1990; Hastie, Tibshirani, and Friedman 2009). Extreme gradient boosting is a gradient boosting implementation that has been highly successful in several machine learning competitions. Learners are implemented using the xgboost package (Chen and Guestrin 2016). Three types are implemented: regression-based boosting (xgboost_lm), regression tree-based boosting (xgboost_tree) and regression tree-based boosting with drop-outs (xgboost_dart).

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
family learn_objective gaussian, continuous_logistic, multinomial_logistic, binomial_logistic, poisson, gamma, cox general tags The family is not optimised when it is specified, e.g. xgboost_lm_gaussian.
boosting iterations n_boost \(\mathbb{R} \in \left[0,\infty \right)\) yes This parameter is expressed on the \(\log_{10}\) scale, i.e. the actual value will be \(10^\texttt{n_boost}\). The default range is \(\left[0, 3 \right]\).
learning rate learning_rate \(\mathbb{R} \in \left(-\infty,0\right]\) yes This parameter is expressed on the \(\log_{10}\) scale. The default range is \(\left[-5, 0 \right]\).
L1 regularisation alpha \(\mathbb{R} \in \left[-6, \infty\right)\) yes This parameter is expressed on the \(\log_{10}\) scale with a \(10^{-6}\) offset. The default range is \(\left[-6, 3\right]\).
L2 regularisation lambda \(\mathbb{R} \in \left[-6, \infty\right)\) yes This parameter is expressed on the \(\log_{10}\) scale with a \(10^{-6}\) offset. The default range is \(\left[-6, 3\right]\).
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples for binomial and multinomial outcomes, and none otherwise. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).
maximum tree depth tree_depth \(\mathbb{Z} \in \left[1,\infty\right)\) gbtree, dart Maximum depth to which trees are allowed to grow. The default range is \(\left[1, 10\right]\).
subsampling fraction sample_size \(\mathbb{R} \in \left(0, 1.0\right]\) gbtree, dart Fraction of available data that is used to create a single tree. The default range is \(\left[2 / m, 1.0\right]\), with \(m\) the number of samples.
minimum sum of instance weight min_child_weight \(\mathbb{R} \in \left[0,\infty\right)\) gbtree, dart Minimal instance weight required for further branch partitioning, or the number of instances required in each node. This parameter is expressed on the \(\log_{10}\) scale with a \(-1\) offset. The default range is \(\left[0, 2\right]\).
min. splitting error reduction gamma \(\mathbb{R} \in \left[-6, \infty\right)\) gbtree, dart Minimum error reduction required to allow splitting. This parameter is expressed on the \(\log_{10}\) scale with a \(10^{-6}\) offset. The default range is \(\left[-6, 3\right]\). continuous and count-type outcomes are normalised to the \(\left[0, 1\right]\) range prior to model fitting to deal with scaling issues. Values are converted back to the original scale after prediction.
DART sampling algorithm sample_type uniform, weighted dart
drop-out rate rate_drop \(\mathbb{R} \in \left[0,1\right)\) dart
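
The sketch below shows how the log-scaled n_boost and learning_rate hyperparameters translate into an xgboost call; the simulated data and remaining parameter values are illustrative and do not reproduce familiar's internal code:

# Illustration of gradient boosted trees using xgboost, with n_boost and
# learning_rate expressed on the log10 scale as in the table above.
library(xgboost)

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)
y <- rbinom(100, size = 1, prob = plogis(x[, 1]))

n_boost <- 2         # 10^2 = 100 boosting iterations
learning_rate <- -1  # 10^-1 = 0.1

dtrain <- xgb.DMatrix(data = x, label = y)

model <- xgb.train(
  params = list(
    booster = "gbtree",
    objective = "binary:logistic",
    eta = 10^learning_rate,
    max_depth = 6,
    min_child_weight = 1,
    subsample = 0.8
  ),
  data = dtrain,
  nrounds = round(10^n_boost),
  verbose = 0
)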

Random forest (RFSRC)

Random forests are ensembles of decision trees, where each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split (Breiman 2001). The random forest learner is implemented through the randomForestSRC package, which provides a unified interface for different types of forests (Ishwaran et al. 2008, 2011).

An outcome variable that represents count-type data is first transformed using a \(\log(x+1)\) transformation. Predicted responses are then transformed to the original scale using the inverse \(\exp(x)-1\) transformation.

Hyperparameters for random forest learners are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
number of trees n_tree \(\mathbb{Z} \in \left[0,\infty\right)\) yes This parameter is expressed on the \(\log_{2}\) scale, i.e. the actual input value will be \(2^\texttt{n_tree}\) (Oshiro, Perez, and Baranauskas 2012). The default range is \(\left[4, 10\right]\).
subsampling fraction sample_size \(\mathbb{R} \in \left(0, 1.0\right]\) yes Fraction of available data that is used to create a single tree. The default range is \(\left[2 / m, 1.0\right]\), with \(m\) the number of samples.
number of features at each node m_try \(\mathbb{R} \in \left[0.0, 1.0\right]\) yes Familiar ensures that there is always at least one candidate feature.
node size node_size \(\mathbb{Z} \in \left[1, \infty\right)\) yes Minimum number of unique samples in terminal nodes. The default range is \(\left[5, \lfloor m / 3\rfloor\right]\), with \(m\) the number of samples.
maximum tree depth tree_depth \(\mathbb{Z} \in \left[1,\infty\right)\) yes Maximum depth to which trees are allowed to grow. The default range is \(\left[1, 10\right]\).
number of split points n_split \(\mathbb{Z} \in \left[0, \infty\right)\) no The default value of \(0\) results in deterministic splitting, in which all possible split points are considered.
splitting rule split_rule gini, auc, entropy, mse, quantile.regr, la.quantile.regr, logrank, logrankscore, bs.gradient no Default splitting rules are gini for binomial and multinomial outcomes, mse for continuous and count outcomes and logrank for survival outcomes.
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples for binomial and multinomial outcomes, and none otherwise. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).

Note that optimising the number of trees can be slow, as big forests take longer to construct and perform more computations for predictions. Hence, hyperparameter optimisation may be sped up by limiting the range of the n_tree parameter, or by setting a single value.

Random forest (ranger)

The second implementation of random forests comes from the ranger package (Wright and Ziegler 2017). It is generally faster than the implementation in randomForestSRC and allows for different splitting rules, such as maximally selected rank statistics (Lausen and Schumacher 1992; Wright, Dankowski, and Ziegler 2017) and concordance index-based splitting (Schmid, Wright, and Ziegler 2016). Hyperparameters for the random forest are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
number of trees n_tree \(\mathbb{Z} \in \left[0,\infty\right)\) yes This parameter is expressed on the \(\log_{2}\) scale, i.e. the actual input value will be \(2^\texttt{n_tree}\) (Oshiro, Perez, and Baranauskas 2012). The default range is \(\left[4, 10\right]\).
subsampling fraction sample_size \(\mathbb{R} \in \left(0, 1.0\right]\) yes Fraction of available data that is used to create a single tree. The default range is \(\left[2 / m, 1.0\right]\), with \(m\) the number of samples.
number of features at each node m_try \(\mathbb{R} \in \left[0.0, 1.0\right]\) yes Familiar ensures that there is always at least one candidate feature.
node size node_size \(\mathbb{Z} \in \left[1, \infty\right)\) yes Minimum number of unique samples in terminal nodes. The default range is \(\left[5, \lfloor m / 3\rfloor\right]\), with \(m\) the number of samples.
maximum tree depth tree_depth \(\mathbb{Z} \in \left[1,\infty\right)\) yes Maximum depth to which trees are allowed to grow. The default range is \(\left[1, 10\right]\).
splitting rule split_rule gini, extratrees, variance, logrank, C, maxstat no Default splitting rules are gini for binomial and multinomial outcomes and maxstat for continuous, count and survival outcomes.
significance split threshold alpha \(\mathbb{R} \in \left(0.0, 1.0\right]\) maxstat Minimum significance level for further splitting. The default range is \(\left[10^{-6}, 1.0\right]\)
sample weighting sample_weighting inverse_number_of_samples, effective_number_of_samples, none no Sample weighting allows for mitigating class imbalances. The default is inverse_number_of_samples for binomial and multinomial outcomes, and none otherwise. Instances with the majority class receive less weight.
beta sample_weighting_beta \(\mathbb{Z} \in \left[-6,-1\right]\) effective_number_of_samples only Specifies the \(\beta\) parameter for effective number of samples weighting (Cui et al. 2019). It is expressed on the \(\log_{10}\) scale: \(\beta=1-10^\texttt{beta}\).

Note that optimising the number of trees can be slow, as big forests take longer to construct and perform more computations for predictions. Hence hyperparameter optimisation may be sped up by limiting the range of the n_tree parameter or setting a single value.
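
A sketch of the underlying ranger call is shown below, including the conversion of n_tree from the \(\log_2\) scale and of m_try from a fraction to a feature count; the simulated data and exact mapping are illustrative and may differ from familiar's internal code:

# Illustration of a ranger random forest, with n_tree on the log2 scale and
# m_try as a fraction of the number of features, as in the table above.
library(ranger)

set.seed(1)
data <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
data$y <- factor(rbinom(100, size = 1, prob = plogis(data$X1)))

n_tree <- 7   # 2^7 = 128 trees
m_try <- 0.3  # fraction of features considered at each node

model <- ranger(
  y ~ .,
  data = data,
  num.trees = 2^n_tree,
  mtry = max(1, floor(m_try * (ncol(data) - 1))),  # at least one candidate feature
  min.node.size = 5,
  max.depth = 6,
  sample.fraction = 0.8,
  splitrule = "gini"
)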

Naive Bayes

The naive Bayes classifier uses Bayes' rule to predict posterior probabilities, and is implemented using the e1071::naiveBayes function (Meyer et al. 2021). The hyperparameters of the classifier are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
laplace smoothing laplace \(\mathbb{R} \in \left[0.0, \infty \right)\) yes The default range is \(\left[0, 10\right]\).

k-nearest neighbours

k-nearest neighbours is a simple non-parametric algorithm that classifies samples based on the classes of their \(k\) nearest neighbours in feature space. The e1071::gknn function is used to implement k-nearest neighbours (Meyer et al. 2021). The hyperparameters are shown in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
number of nearest neighbours k \(\mathbb{Z} \in \left[1,m\right]\) yes The default range is \(\left[1, \lceil 2 m^{1/3} \rceil \right]\), with \(m\) the number of samples.
distance metric distance_metric all metrics supported by proxy::dist yes The default set of metrics is gower, euclidean and manhattan.

Support vector machines

Support vector machines (SVM) were originally formulated to find optimal margins between classes in classification problems. They were popularised after the kernel trick was described: using the kernel trick, calculations are performed in an implicit high-dimensional feature space, which is considerably more efficient than explicit computation (Boser, Guyon, and Vapnik 1992).

Familiar implements SVM using e1071::svm. We tried kernlab::ksvm, but this function would reproducibly freeze during unit testing. By default, both features and outcome data are scaled internally. This was used to derive the default hyperparameter ranges specified in the table below.

parameter tag values optimised comments
signature size sign_size \(\mathbb{Z} \in \left[1,n\right]\) yes
SVM kernel kernel linear, polynomial, radial, sigmoid no Default is radial, unless specified as part of the learner’s name.
\(C\) c \(\mathbb{R}\) yes The cost for violation of constraints is expressed on a \(\log_{10}\) scale, i.e. the actual value is \(10^\texttt{c}\). The default range is \(\left[-5, 3\right]\).
\(\epsilon\) epsilon \(\mathbb{R}\) continuous and count outcomes The error tolerance \(\epsilon\) is only used for regression SVM. \(\epsilon\) is expressed on a \(\log_{10}\) scale. The default range is \(\left[-5, 1\right]\).
\(\nu\) nu \(\mathbb{R}\) nu SVM type nu provides upper bounds for the fraction of training errors and lower bounds for the fraction of support vectors (Chang and Lin 2011). It is expressed on a \(\log_{10}\) scale. The default range is \(\left[-5, 1\right]\).
inverse kernel width gamma \(\mathbb{R}\) non-linear kernels This parameter specifies the inverse of the kernel width. It is expressed on the \(\log_{10}\) scale. The default range is \(\left[-9, 3\right]\).
polynomial degree degree \(\mathbb{Z} \in \left[1, \infty\right)\) polynomial kernel The default range is \(\left[1, 5\right]\).
kernel offset offset \(\mathbb{R} \in \left[0, \infty\right)\) polynomial and sigmoid kernels Negative values are not allowed. The default range is \(\left[0, 1 \right]\).

Several types of SVM algorithm exist for classification and regression. The following SVM types are implemented: \(C\)-classification (c), \(\nu\)-classification and regression (nu), and \(\epsilon\)-regression (eps). The following kernels are implemented: linear, polynomial, radial basis function and sigmoid kernels.
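
As an illustration, a \(C\)-classification SVM with a radial kernel can be fitted along the lines of the sketch below, with the cost parameter expressed on the \(\log_{10}\) scale as in the table above; the simulated data are for illustration only:

# Illustration of a C-classification SVM with a radial kernel using e1071::svm.
# The cost parameter c is expressed on the log10 scale, as in the table above.
library(e1071)

set.seed(1)
data <- data.frame(matrix(rnorm(100 * 4), ncol = 4))
data$y <- factor(rbinom(100, size = 1, prob = plogis(data$X1 + data$X2)))

c_value <- 0       # 10^0 = 1
gamma_value <- -1  # 10^-1 = 0.1

model <- svm(
  y ~ .,
  data = data,
  type = "C-classification",
  kernel = "radial",
  cost = 10^c_value,
  gamma = 10^gamma_value,
  scale = TRUE  # features are scaled internally by default
)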

Hyperparameter optimisation

Hyperparameter optimisation is conducted to select model parameters that are more likely to lead to generalisable results. The main hyperparameter optimisation framework used by familiar is based on sequential model-based optimisation (SMBO) (Hutter, Hoos, and Leyton-Brown 2011), but with significant updates and extensions.

Overall, the following steps are conducted to optimise hyperparameters for a given learner and dataset:

  1. An initial set of hyperparameter values is created, either at random or from a grid in parameter space (see smbo_random_initialisation).

  2. Models are trained with these hyperparameter sets on bootstrap samples of the development data, and objective and optimisation scores are computed.

  3. A model is trained to predict optimisation scores for new hyperparameter sets, and an acquisition function is used to select promising challenger sets.

  4. Challenger sets are compared against the best-known hyperparameter set on additional bootstrap samples in an intensify step, and poorly performing sets are pruned.

  5. Steps 3 and 4 are repeated until the optimisation scores converge, the maximum number of iterations is reached, or the time limit is exceeded.

These steps are described in more detail below.

Predicting run time of model

Models take a certain time to train. Familiar actively measures this time during hyperparameter optimisation for two reasons: first, to optimise the assignment of jobs to parallel nodes; and second, to prune challenger hyperparameter sets that produce models which tend to take longer to train than the best-known model. The pruning becomes stricter as exploration progresses.

Let \(m\) be the number of bootstrap samples on which the most-visited hyperparameter set has been evaluated, and \(M\) the total number of bootstrap samples that can be visited. Let \(t_{\texttt{opt}}\) be the runtime of the best-known model. The maximum time that a challenger hyperparameter set is allowed for training is then empirically set to:

\[t_{\texttt{max}} = \left(5 - 4 \frac{m}{M} \right) t_{\texttt{opt}} \]

Hence, \(t_{\texttt{max}}\) converges to \(t_{\texttt{opt}}\) for \(m \rightarrow M\).

If the maximum runtime is relatively insignificant, i.e. \(t_{\texttt{max}} < 10.0\) seconds, a threshold of \(t_{\texttt{max}} = 10.0\) seconds is used instead.

If the maximum runtime is not known, i.e. none of the hyperparameter sets evaluated so far has produced a valid model, the maximum time threshold is set to infinity. In effect, no hyperparameter sets are then pruned based on expected runtime.
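
The time threshold rule can be summarised in the following sketch, assuming times are measured in seconds:

# Maximum training time allowed for a challenger hyperparameter set, following
# the empirical rule above. Times are in seconds.
max_training_time <- function(t_opt, m, M) {
  # No pruning when no valid model (and hence no runtime) is known yet.
  if (is.null(t_opt) || !is.finite(t_opt)) return(Inf)

  # Threshold tightens from 5 * t_opt towards t_opt as m approaches M.
  t_max <- (5 - 4 * m / M) * t_opt

  # Do not prune based on insignificant runtimes.
  max(t_max, 10.0)
}

max_training_time(t_opt = 4.0, m = 5, M = 20)   # 16 seconds
max_training_time(t_opt = 4.0, m = 20, M = 20)  # 10 seconds (floor applies)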

A random forest is trained on the runtimes observed for visited hyperparameter sets and is used to predict the runtime of challenger hyperparameter sets. The predicted runtime is compared against \(t_{\texttt{max}}\); if it exceeds this threshold, the challenger set is rejected and not evaluated.

Assessing goodness of hyperparameter sets

A summary score \(s\) determines how good a set of hyperparameters \(\mathbf{x}\) is. This score is computed in three steps: first, an objective score is computed to assess model performance for a hyperparameter set in a bootstrap subsample; then, an optimisation score is computed within the subsample; and finally, a summary score is determined from the optimisation scores over all subsamples.

An objective score \(s''\) is computed from the performance metric value for a specific hyperparameter set for in-bag (IB; \(s''_{\textrm{IB}}\)) and out-of-bag data (OOB; \(s''_{\textrm{OOB}}\)). The objective score always lies in the interval \([-1.0, 1.0]\). An objective score of \(1.0\) always indicates the best possible score. A score of \(0.0\) indicates that the hyperparameter set leads to the same degree of performance as the best educated guess, i.e. the majority class (for binomial and multinomial outcomes), the median value (for continuous and count outcomes), or tied risk or survival times (for survival outcomes). Objective scores that are either missing or below \(-1.0\) are truncated to the value of \(-1.0\). In case multiple metrics are used to assess model performance, the mean of the respective objective scores is used.

The optimisation scores \(s'\) and summary score \(s\) of each hyperparameter set are subsequently computed using one of the following functions, which can be set using the optimisation_function parameter:

  1. validation, max_validation (default): The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The summary score \(s\) is then the average of optimisation scores \(s'\). This is a commonly used criterion that tries to maximise the score on the OOB validation data.

  2. balanced: The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}} - \left| s''_{\textrm{OOB}} - s''_{\textrm{IB}} \right|\). The summary score \(s\) is then the average of optimisation scores \(s'\). A variation on max_validation with a penalty for differences in performance between the IB and OOB data. The underlying idea is that a good set of hyperparameters should lead to models that perform well on both development and validation data.

  3. strong_balance: The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}} - 2\left| s''_{\textrm{OOB}}-s''_{\textrm{IB}} \right|\). The summary score \(s\) is then the average of optimisation scores \(s'\). A variant of balanced with a stronger penalty term.

  4. validation_minus_sd: The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The summary score \(s\) is then the average of optimisation scores \(s'\) minus the standard deviation of \(s'\). This penalises hyperparameter sets that lead to wide variance on OOB data.

  5. validation_25th_percentile: The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The summary score \(s\) is the 25th percentile of optimisation scores \(s'\). Like validation_minus_sd this penalises hyperparameter sets that lead to wide variance on OOB data.

  6. model_estimate: The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The model used to infer utility of new hyperparameter sets (see the next section Predicting optimisation score for new hyperparameter sets) is used to predict the expected optimisation score \(\mu(\mathbf{x})\). This score is used as summary score \(s\). Not available for random search.

  7. model_estimate_minus_sd: The optimisation score for each bootstrap subsample is \(s'=s''_{\textrm{OOB}}\). The model used to infer utility of new hyperparameter sets (see the next section Predicting optimisation score for new hyperparameter sets) is used to predict the expected optimisation score \(\mu(\mathbf{x})\) and its standard deviation \(\sigma(\mathbf{x})\). Then \(s = \mu(\mathbf{x}) - \sigma(\mathbf{x})\). This penalises hyperparameter sets for which a large variance is expected. Not available for random search.

The summary score is then used to select the best-known hyperparameter set, i.e. the set that maximises the summary score. Moreover, the optimisation scores are used to identify candidate hyperparameter sets, as described in the following section.
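
As a sketch of how several of these options combine in-bag (IB) and out-of-bag (OOB) objective scores, consider a single hyperparameter set with made-up objective scores over three bootstrap subsamples:

# Sketch of summary score computation for a single hyperparameter set over
# several bootstrap subsamples, using made-up objective scores.
s_ib  <- c(0.85, 0.80, 0.90)  # in-bag objective scores
s_oob <- c(0.70, 0.65, 0.75)  # out-of-bag objective scores

# max_validation: average OOB objective score.
s_max_validation <- mean(s_oob)

# balanced: OOB score penalised by the IB-OOB difference, then averaged.
s_balanced <- mean(s_oob - abs(s_oob - s_ib))

# validation_minus_sd: average OOB score minus its standard deviation.
s_validation_minus_sd <- mean(s_oob) - sd(s_oob)

s_max_validation       # 0.70
s_balanced             # 0.55
s_validation_minus_sd  # 0.65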

Predicting optimisation score for new hyperparameter sets

Any point in the hyperparameter space has a single, scalar, optimisation score value that is a priori unknown. During the optimisation process, the algorithm samples from the hyperparameter space by selecting hyperparameter sets and computing the optimisation score for one or more bootstraps. For each hyperparameter set the resulting scores are distributed around the actual value.

A key point of Bayesian optimisation is the ability to estimate the usefulness or utility of new hyperparameter sets. This ability is facilitated by modelling the optimisation score of new hyperparameter sets using the optimisation scores of observed hyperparameter sets.

The following models can be used for this purpose: a localised approximate Gaussian process (the default; Gramacy 2016), Bayesian additive regression trees (Sparapani, Spanbauer, and McCulloch 2021), and a random forest.

In addition, familiar can perform random search (random or random_search). This forgoes the use of models to steer optimisation. Instead, the hyperparameter space is sampled at random.

The learner used to predict optimisation scores can be specified using the hyperparameter_learner parameter.

Acquisition functions for utility of hyperparameter sets

The expected value and posterior distribution of the optimisation score predicted by these models are used to compute the utility of new hyperparameter sets through an acquisition function. Let \(\nu=f(\mathbf{x})\) be the posterior distribution of the optimisation score of hyperparameter set \(\mathbf{x}\), \(\tau\) the best observed optimisation score, and \(\mu(\mathbf{x})\) and \(\sigma(\mathbf{x})\) the sample mean and sample standard deviation for set \(\mathbf{x}\) (implicitly for round \(m\)). The acquisition functions available in familiar (Shahriari et al. 2016) include the probability of improvement, the expected improvement (the default), the upper confidence bound (Srinivas et al. 2012), and the Bayesian upper confidence bound (Kaufmann, Cappé, and Garivier 2012).

The acquisition function can be specified using the acquisition_function parameter.
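
As an illustration, the expected improvement has a closed form when the posterior optimisation score is approximated as Gaussian with mean \(\mu(\mathbf{x})\) and standard deviation \(\sigma(\mathbf{x})\); this is the standard formulation (Shahriari et al. 2016), not necessarily familiar's exact implementation:

# Expected improvement under a Gaussian approximation of the posterior
# optimisation score, with mu and sigma the predicted mean and standard
# deviation for a hyperparameter set, and tau the best observed score.
expected_improvement <- function(mu, sigma, tau) {
  if (sigma <= 0) return(max(mu - tau, 0))
  z <- (mu - tau) / sigma
  (mu - tau) * pnorm(z) + sigma * dnorm(z)
}

expected_improvement(mu = 0.72, sigma = 0.05, tau = 0.70)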

Exploring challenger sets

After selecting challenger hyperparameter sets, the optimisation scores of models trained using these hyperparameter sets are compared against the best-known score in a run-off intensify step. During this step, models are trained on new bootstrap data, compared, then trained on another set of bootstraps, and so on. Pruning the set of challenger hyperparameter sets after each iteration reduces the computational load. Familiar implements the following methods: successive halving (the default), in which the poorer-performing half of the challenger sets is dropped after each iteration (Jamieson and Talwalkar 2016); stochastic rejection, in which challenger sets are dropped when their optimisation scores are significantly worse than those of the best-known set (see smbo_stochastic_reject_p_value); and no pruning, in which all challenger sets are retained throughout the intensify iterations.

The method used to steer exploration can be set using the exploration_method parameter.

Providing hyperparameters manually

It is possible to set hyperparameters manually. This can be used to change hyperparameters that are fixed by default, to set a fixed value for randomised hyperparameters, or to provide a different search range for randomised hyperparameters.

Hyperparameters can be provided using the hyperparameter tag. For the glm_logistic learner, such a tag may look as follows:

<hyperparameter>
  <glm_logistic>
    <sign_size>5</sign_size>
  </glm_logistic>
</hyperparameter>

Or as a nested list passed as the hyperparameter argument to summon_familiar:

hyperparameter = list("glm_logistic"=list("sign_size"=5))

More than one value can be provided. The behaviour then depends on whether the hyperparameter is a categorical or a numeric variable. For categorical hyperparameters, the provided values define the search range. For numeric hyperparameters, providing two values sets the bounds of the search range, whereas providing more than two values defines the search range itself.
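
For example, the following sketch, using hyperparameters of the random_forest_ranger learner, fixes one hyperparameter, sets bounds for a second, and provides an explicit search range for a third:

hyperparameter = list(
  "random_forest_ranger" = list(
    "node_size" = 5,                     # single value: fixed
    "n_tree" = c(4, 8),                  # two values: bounds of the search range
    "sample_size" = c(0.50, 0.75, 1.00)  # more than two values: the search range itself
  )
)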

Configuration options for hyperparameter optimisation

Hyperparameter optimisation may be configured using the tags/arguments in the table below.

Configuration options for hyperparameter optimisation.
tag / argument description default
optimisation_bootstraps Maximum number of bootstraps created for optimisation. 20
optimisation_determine_vimp If TRUE, compute variable importance for each bootstrap. If FALSE use variable importance computed during the feature selection step. TRUE
smbo_random_initialisation If random, initial parameter sets are generated randomly from the default ranges. If fixed or fixed_subsample, the initial parameter sets are based on a grid in parameter space. fixed_subsample
smbo_n_random_sets Sets the number of hyperparameter sets drawn for initialisation. Ignored if smbo_random_initialisation is fixed. If smbo_random_initialisation is fixed_subsample, the number of selected hyperparameters may be lower. 100
max_smbo_iterations Maximum number of intensify iterations of the SMBO algorithm. 20
smbo_stop_convergent_iterations Number of subsequent convergent SMBO iterations required to stop hyperparameter optimisation early. 3
smbo_stop_tolerance Tolerance for recent optimisation scores to determine convergence. between 0.01 and 0.001
smbo_time_limit Time limit (in minutes) for the optimisation process. NULL
smbo_initial_bootstraps Number of bootstraps assessed initially. 1
smbo_step_bootstraps Number of bootstraps used within each step of an intensify iteration. 3
smbo_intensify_steps Number of intensify steps within each intensify iteration. 5
optimisation_metric The metric used for optimisation, e.g. auc. See the vignette on performance metrics for available options. More than one metric may be specified. auc_roc (binomial, multinomial); mse (continuous); msle (count); and concordance_index (survival)
optimisation_function The optimisation function used (see Assessing goodness of hyperparameter sets). balanced
hyperparameter_learner Learner used to predict optimisation scores for new hyperparameter sets (see Predicting optimisation score for new hyperparameter sets). gaussian_process
acquisition_function The function used to quantify utility of hyperparameter sets (see Acquisition functions for utility of hyperparameter sets). expected_improvement
exploration_method Method used to explore challenger hyperparameter sets (see Exploring challenger sets). successive_halving
smbo_stochastic_reject_p_value The p-value level for stochastic pruning. 0.05
parallel_hyperparameter_optimisation Enables parallel processing for hyperparameter optimisation. Ignored if parallel=FALSE. Can be outer to process multiple optimisation processes simultaneously. TRUE
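
As with other settings, these options can be passed as arguments to summon_familiar. The sketch below uses illustrative values only; the data set df and its columns are placeholders:

# Sketch of hyperparameter optimisation settings passed to summon_familiar.
# All values shown are illustrative.
summon_familiar(
  data = df,
  outcome_type = "survival",
  outcome_column = c("time", "event"),
  learner = "random_forest_ranger",
  optimisation_metric = "concordance_index",
  optimisation_function = "balanced",
  hyperparameter_learner = "gaussian_process",
  acquisition_function = "expected_improvement",
  exploration_method = "successive_halving",
  smbo_time_limit = 60,
  parallel_hyperparameter_optimisation = TRUE
)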

Model recalibration

Even though learners may be good at discrimination, model calibration can be lacklustre. Some learners are therefore recalibrated as follows:

  1. 3-fold cross-validation is performed. Responses are predicted using a new model trained on each training fold.

  2. Three recalibration models are trained using the responses for the training folds as input (Niculescu-Mizil and Caruana 2005).

  3. The three recalibration models are then applied to (new) responses of the full model, and the resulting values averaged for each sample.

The following learners currently undergo recalibration:

learner outcome recalibration model
xgboost_lm, xgboost_lm_logistic binomial, multinomial glm_logistic
xgboost_tree, xgboost_tree_logistic binomial, multinomial glm_logistic
xgboost_lm_cox survival glm_cox
xgboost_tree_cox survival glm_cox
boosted_glm_cindex, boosted_glm_gehan survival glm_cox
boosted_tree_cindex, boosted_tree_gehan survival glm_cox

Familiar currently does not recalibrate models by inverting the model calibration curves, but may do so in the future.

References

Boser, Bernhard E, Isabelle M Guyon, and Vladimir N Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144–52. ACM.
Breiman, Leo. 2001. “Random Forests.” Mach. Learn. 45 (1): 5–32.
Bühlmann, Peter, and Torsten Hothorn. 2007. “Boosting Algorithms: Regularization, Prediction and Model Fitting.” Stat. Sci. 22 (4): 477–505.
Chang, Chih-Chung, and Chih-Jen Lin. 2011. “LIBSVM: A Library for Support Vector Machines.” ACM Trans. Intell. Syst. Technol. 2 (3): 27:1–27.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94.
Cox, D R. 1972. “Regression Models and Life-Tables.” J. R. Stat. Soc. Series B Stat. Methodol. 34 (2): 187–202.
Cui, Yin, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. “Class-Balanced Loss Based on Effective Number of Samples.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9268–77. Long Beach, CA, USA: IEEE.
Efron, Bradley. 1977. “The Efficiency of Cox’s Likelihood Function for Censored Data.” J. Am. Stat. Assoc. 72 (359): 557–65.
Gramacy, Robert B. 2016. “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software 72 (1): 1–46.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition. Springer Series in Statistics. New York, NY, United States: Springer Science+Business Media, LLC.
Hofner, Benjamin, Luigi Boccuto, and Markus Göker. 2015. “Controlling False Discoveries in High-Dimensional Situations: Boosting with Stability Selection.” BMC Bioinformatics 16 (May): 144.
Hothorn, Torsten, Peter Bühlmann, Thomas Kneib, Matthias Schmid, and Benjamin Hofner. 2010. “Model-Based Boosting 2.0.” J. Mach. Learn. Res. 11 (Aug): 2109–13.
Hutter, Frank, Holger H Hoos, and Kevin Leyton-Brown. 2011. “Sequential Model-Based Optimization for General Algorithm Configuration.” In Learning and Intelligent Optimization, edited by Carlos A Coello Coello, 6683:507–23. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg.
Ishwaran, Hemant, Udaya B Kogalur, Eugene H Blackstone, and Michael S Lauer. 2008. “Random Survival Forests.” Ann. Appl. Stat. 2 (3): 841–60.
Ishwaran, Hemant, Udaya B Kogalur, Xi Chen, and Andy J Minn. 2011. “Random Survival Forests for High-Dimensional Data.” Stat. Anal. Data Min. 4 (1): 115–32.
Jamieson, Kevin, and Ameet Talwalkar. 2016. “Non-Stochastic Best Arm Identification and Hyperparameter Optimization.” In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, edited by Arthur Gretton and Christian C Robert, 51:240–48. Proceedings of Machine Learning Research. Cadiz, Spain: PMLR.
Kaufmann, Emilie, Olivier Cappé, and Aurélien Garivier. 2012. “On Bayesian Upper Confidence Bounds for Bandit Problems.” In Artificial Intelligence and Statistics, 592–600.
Lausen, Berthold, and Martin Schumacher. 1992. “Maximally Selected Rank Statistics.” Biometrics 48 (1): 73.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2021. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. https://CRAN.R-project.org/package=e1071.
Nelder, J A, and R W M Wedderburn. 1972. “Generalized Linear Models.” J. R. Stat. Soc. Ser. A 135 (3): 370–84.
Niculescu-Mizil, Alexandru, and Rich Caruana. 2005. “Predicting Good Probabilities with Supervised Learning.” In Proceedings of the 22nd International Conference on Machine Learning, 625–32. ACM.
Oshiro, Thais Mayumi, Pedro Santoro Perez, and José Augusto Baranauskas. 2012. “How Many Trees in a Random Forest?” In Machine Learning and Data Mining in Pattern Recognition, 154–68. Springer Berlin Heidelberg.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schapire, Robert E. 1990. “The Strength of Weak Learnability.” Mach. Learn. 5 (2): 197–227.
Schmid, Matthias, Marvin N Wright, and Andreas Ziegler. 2016. “On the Use of Harrell’s C for Clinical Risk Prediction via Random Survival Forests.” Expert Syst. Appl. 63 (November): 450–59.
Shahriari, B, K Swersky, Z Wang, R P Adams, and N de Freitas. 2016. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proc. IEEE 104 (1): 148–75.
Simon, Noah, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” J. Stat. Softw. 39 (5): 1–13.
Sparapani, Rodney, Charles Spanbauer, and Robert McCulloch. 2021. “Nonparametric Machine Learning and Efficient Computation with Bayesian Additive Regression Trees: The BART R Package.” Journal of Statistical Software 97 (1): 1–66.
Srinivas, N, A Krause, S M Kakade, and M W Seeger. 2012. “Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting.” IEEE Trans. Inf. Theory 58 (5): 3250–65.
Therneau, Terry M, and Patricia M Grambsch. 2000. Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. New York: Springer Science & Business Media.
Wright, Marvin N, Theresa Dankowski, and Andreas Ziegler. 2017. “Unbiased Split Variable Selection for Random Survival Forests Using Maximally Selected Rank Statistics.” Stat. Med. 36 (8): 1272–84.
Wright, Marvin N, and Andreas Ziegler. 2017. “Ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” J. Stat. Softw. 77 (1).
Yee, T W, and C J Wild. 1996. “Vector Generalized Additive Models.” J. R. Stat. Soc. Series B Stat. Methodol. 58 (3): 481–93.
Yee, Thomas. 2010. “The VGAM Package for Categorical Data Analysis.” Journal of Statistical Software 32 (10): 1–34.
Yee, Thomas W. 2015. Vector Generalized Linear and Additive Models: With an Implementation in R. Springer.