A beneficial property of the lasso penalty is that it shrinks coefficients to zero. A less beneficial property is that, in the process, the lasso tends to overshrink the large coefficients too far to zero. Authors have argued that the lasso should “be considered as a variable screener rather than a model selector” (Su et al., 2017). It has a high true positive rate (i.e., high probability that a true predictor will be selected) at the cost of high false positive rate (i.e., high probability that a non-predictor will be selected).
When fitting prediction rule ensembles (PREs), this may be a nuisance, especially when we want to enforce sparsity. In what follows, we show how we can easily select a smaller number of terms for the final ensemble, while retaining relatively high predictive accuracy, using the relaxed lasso.
The relaxed lasso was proposed by Meinshausen (2007). Investigations of Su et al. (2107) provide insight on why the relaxed lasso is beneficial1. Hastie, Tibshirani & Tibshirani (2017) propose a simplified version of the relaxed lasso, which is implemented in package glmnet
and can be employed in package pre
. Hastie et al. (2017) find that “best subset selection generally performing better in high signal-to-noise (SNR) ratio regimes, and the lasso better in low SNR regimes” and that “the relaxed lasso […] is the overall winner, performing just about as well as the lasso in low SNR scenarios, and as well as best subset selection in high SNR scenarios”. A short introduction to the relaxed lasso is provided in glmnet
vignette “The Relaxed lasso” (accessible in R
by typing vignette("relax", "glmnet")
.
library("pre")
We fit a PRE to predict Ozone
and inspect the result:
<- airquality[complete.cases(airquality), ]
airq set.seed(42)
<- pre(Ozone ~ ., data = airq)
airq.ens airq.ens
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 3.543968
## number of terms = 12
## mean cv error (se) = 352.395 (99.13754)
##
## cv error type : Mean-Squared Error
##
## rule coefficient description
## (Intercept) 68.48270407 1
## rule191 -10.97368180 Wind > 5.7 & Temp <= 87
## rule173 -10.90385519 Wind > 5.7 & Temp <= 82
## rule42 -8.79715538 Wind > 6.3 & Temp <= 84
## rule204 7.16114781 Wind <= 10.3 & Solar.R > 148
## rule10 -4.68646145 Temp <= 84 & Temp <= 77
## rule192 -3.34460038 Wind > 5.7 & Temp <= 87 & Day <= 23
## rule51 -2.27864287 Wind > 5.7 & Temp <= 84
## rule93 2.18465676 Temp > 77 & Wind <= 8.6
## rule74 -1.36479545 Wind > 6.9 & Temp <= 84
## rule28 -1.15326093 Temp <= 84 & Wind > 7.4
## rule25 -0.70818400 Wind > 6.3 & Temp <= 82
## rule166 -0.04751152 Wind > 6.9 & Temp <= 82
What if we find an ensemble of 12 rules too complex and we want to retain only five rules? We could extract the fitted lasso path, and choose the penalty parameter so that we retain only five rules:
plot(airq.ens$glmnet.fit)
<- airq.ens$glmnet.fit$glmnet.fit
tab <- print(tab) tab
9:16, ] tab[
## Df %Dev Lambda
## 9 2 35.27 8.577
## 10 4 38.25 8.187
## 11 5 41.46 7.815
## 12 5 44.45 7.460
## 13 5 47.17 7.121
## 14 5 49.65 6.797
## 15 6 51.94 6.488
## 16 9 54.46 6.193
From the Df
and Lambda
column, we can see that a \(\lambda\) value of 7.815 would result in a final ensemble comprising five terms. However, from the plot above, which shows the cross-validated error (\(y\)-axis) against the value of penalty parameter \(\lambda\) on the lower \(x\)-axis, and the corresponding number of selected terms on the upper \(x\)-axis, we see that this would yield substantially higher error. In part, this is due to overshrinkage, which we can mitigate using the relaxed lasso, which ‘unshrinks’ the non-zero coefficients.
We can use the relaxed lasso by specifying relax = TRUE
when fitting rule ensemble using function pre
:
set.seed(42)
<- pre(Ozone ~ ., data = airq, relax = TRUE) airq.ens.rel
If we specify relax = TRUE
, the gamma
argument (see ?cv.glmnet
for documentation on arguments relax
and gamma
) will by default be set to a range of five values in the interval [0, 1]. This can be overruled by specifying different values for argument gamma
in the call to function pre
.
Let us take a look at the regularization paths for the relaxed fits:
plot(airq.ens.rel$glmnet.fit)
We obtained one regularization path for each value of \(\gamma\). The path with \(\gamma = 1\) is the default lasso path. Lower values of \(\gamma\) ‘unshrink’ the value of the non-zero coefficients of the lasso towards their unpenalized values. We see that for the \(\lambda\) value yielding the minimum MSE (indicated by the left-most vertical dotted line), the value of \(\gamma\) does not make a lot of difference for the MSE, but when \(\lambda\) values increase, higher values of \(\gamma\) tend to yield lower MSE.
For model selection using the "lambda.min"
criterion, by default the \(\lambda\) and \(\gamma\) combination yielding the lowest CV error is returned. For the "lambda.1se"
criterion, the \(\lambda\) and \(\gamma\) combination yielding the sparsest solution within 1 standard error of the error criterion of the minimum is returned:
<- airq.ens.rel$glmnet.fit$relaxed
fit <- data.frame(lambda.1se = c(fit$lambda.1se, fit$gamma.1se, fit$nzero.1se),
mat lambda.min = c(fit$lambda.min, fit$gamma.min, fit$nzero.min),
row.names = c("lamda", "gamma", "# of non-zero terms"))
mat
## lambda.1se lambda.min
## lamda 6.193185 3.382889
## gamma 0.000000 0.000000
## # of non-zero terms 9.000000 12.000000
Thus, as the dotted vertical lines in the plots already suggest, with the default "lambda.1se"
criterion, a final model with 9 terms will be selected, with coefficients obtained using a \(\lambda\) value of 6.193 and a \(\gamma\) value of 0. With the "lambda.min"
criterion, we obtain a more complex fit; \(\gamma = 0\) still yields the lowest CV error. Note that use of "lambda.min"
increases the likelihood of overfitting, because function pre
uses the same data to extract the rules and fit the penalized regression, so in most cases the default "lambda.1se"
criterion can be expected to provide a less complex, better generalizable, often more accurate fit.
The default of function pre
is to use the "lambda.1se"
criterion. When relax = TRUE
has been specified in the call to function pre
, the default of all functions and S3
methods applied to objects of class pre
(print
, plot
, coef
, predict
, importance
, explain
, cvpre
, singleplot
, pairplot
, interact
) is to use the solution obtained with "lambda.1se"
and the \(\gamma\) value yielding lowest CV error at that value of \(\lambda\). This can be overruled by specifying a different value of \(\lambda\) (penalty.par.val
) and/or \(\gamma\) (gamma
). Some examples:
summary(airq.ens.rel)
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 6.193185
## gamma = 0
## number of terms = 9
## mean cv error (se) = 304.7364 (79.60512)
##
## cv error type : Mean-Squared Error
summary(airq.ens.rel, penalty = "lambda.min")
## Final ensemble with minimum cv error:
##
## lambda = 3.382889
## gamma = 0
## number of terms = 12
## mean cv error (se) = 244.7256 (67.54855)
##
## cv error type : Mean-Squared Error
summary(airq.ens.rel, penalty = 8, gamma = 0)
## Final ensemble:
##
## lambda = 7.814913
## gamma = 0
## number of terms = 5
## mean cv error (se) = 390.2582 (101.9163)
##
## cv error type : Mean-Squared Error
summary(airq.ens.rel, penalty = 8, gamma = 1)
## Final ensemble:
##
## lambda = 7.814913
## gamma = 1
## number of terms = 5
## mean cv error (se) = 682.127 (146.0948)
##
## cv error type : Mean-Squared Error
Note how lowest CV error is indeed obtained with the "lambda.min"
criterion, while the default "lambda.1se"
yields a sparser model, with accuracy within 1 standard error of "lambda.min"
. If we want to go (much) sparser, we need to specify a lower value for the \(\lambda\) penalty, and a lower value of \(\gamma\) should likely be preferred, to retain good-enough predictive accuracy.
Some rules for specification of \(\lambda\) and \(\gamma\):
If a numeric value of \(\lambda\) has been supplied, a (numeric) value for \(\gamma\) must be supplied.
Otherwise (if the default "lambda.1se"
criterion is employed, or "lambda.min"
specified), the \(\gamma\) value yielding lowest CV error (at the \(\lambda\) value associated with the specified criterion) will be used; this \(\gamma\) value can be overruled by supplying the desired \(\gamma\) value to the gamma
argument.
Multiple values of \(\gamma\) can be passed to function pre
, but all other methods and functions accept only a single value for \(\gamma\) (this differs from several glmnet
functions) .
If a specific \(\lambda\) value is supplied, results are returned for a penalty parameter value that was used in the path, and closest to the specified value.
Also note that in the code chunk above we refer to the penalty.par.val
argument by abbreviating it to penalty
; this has the same effect as writing penalty.par.val
in full.
Using \(\gamma = 0\) amounts to a forward stepwise selection approach, with entry order of the variables (rules and linear terms) determined by the lasso. This approach can be useful if we want a rule ensemble with low complexity and high generalizability, and especially when we want to decide a-priori on the number of terms we want to retain. By specifying a high value of \(\lambda\), we can retain a small number of rules, while specifying \(\gamma = 0\) will provide unbiased (unpenalized) coefficients. This avoids the overshrinking of large coefficients. In terms of predictive accuracy, this approach may not perform best, but if low complexity (interpretability) is most important, this is a very useful approach, which does not reduce predictive accuracy too much.
To use forward stepwise regression with variable entry order determined by the lasso, we specify a \(\gamma\) value of 0, and specify the number of variables we want to retain through specification of \(\lambda\) (penalty.par.val
). To find the value of \(\lambda\) corresponding to the number of terms one want to retain, check (results not shown for space considerations):
$glmnet.fit$glmnet.fit airq.ens.rel
Here, we use the value of \(\lambda\) that we found earlier to yield a five-term ensemble:
<- coef(airq.ens.rel, gamma = 1, penalty = 8)
coefs $coefficient != 0, ] coefs[coefs
## rule coefficient description
## 201 (Intercept) 5.613926e+01 1
## 150 rule173 -1.115460e+01 Wind > 5.7 & Temp <= 82
## 39 rule42 -9.541935e+00 Wind > 6.3 & Temp <= 84
## 179 rule204 5.202272e-01 Wind <= 10.3 & Solar.R > 148
## 68 rule74 -2.972928e-01 Wind > 6.9 & Temp <= 84
## 48 rule51 -2.283879e-04 Wind > 5.7 & Temp <= 84
<- coef(airq.ens.rel, gamma = 0, penalty = 8)
coefs $coefficient != 0, ] coefs[coefs
## rule coefficient description
## 201 (Intercept) 67.841887 1
## 150 rule173 -22.228421 Wind > 5.7 & Temp <= 82
## 179 rule204 19.456547 Wind <= 10.3 & Solar.R > 148
## 39 rule42 -11.507352 Wind > 6.3 & Temp <= 84
## 48 rule51 -9.953640 Wind > 5.7 & Temp <= 84
## 68 rule74 -6.251821 Wind > 6.9 & Temp <= 84
Note we have retained the exact same set of terms with the unpenalized relaxed lasso (\(\gamma = 0\)) as with the default (non-relaxed) lasso (\(\gamma = 1\)), but the terms obtained different coefficient values. The CV error estimates (returned by fuction summary
above) indicate that this is beneficial for prediction.
To evaluate predictive accuracy of the final fitted model, and to estimate generalization error to unseen observations, cross-validation would be more appropriate than using training data for fitting and evaluating the model. To that end, function cvpre
can for example be used.
In case you obtained different results, these results were obtained using the following:
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=Dutch_Netherlands.1252
## [3] LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pre_1.0.5 mice_3.14.0
##
## loaded via a namespace (and not attached):
## [1] shape_1.4.6 tidyselect_1.1.2 xfun_0.29 bslib_0.3.1
## [5] inum_1.0-4 purrr_0.3.4 splines_4.1.3 lattice_0.20-45
## [9] vctrs_0.3.8 generics_0.1.2 htmltools_0.5.2 yaml_2.2.1
## [13] utf8_1.2.2 survival_3.3-1 rlang_1.0.2 jquerylib_0.1.4
## [17] pillar_1.7.0 glue_1.6.2 withr_2.5.0 DBI_1.1.2
## [21] plotmo_3.6.1 foreach_1.5.2 lifecycle_1.0.1 stringr_1.4.0
## [25] MatrixModels_0.5-0 mvtnorm_1.1-3 codetools_0.2-18 evaluate_0.15
## [29] knitr_1.38 fastmap_1.1.0 fansi_1.0.3 highr_0.9
## [33] broom_0.7.12 Rcpp_1.0.8.3 backports_1.4.1 plotrix_3.8-2
## [37] jsonlite_1.8.0 TeachingDemos_2.12 digest_0.6.27 stringi_1.7.6
## [41] dplyr_1.0.8 grid_4.1.3 cli_3.2.0 tools_4.1.3
## [45] magrittr_2.0.2 sass_0.4.1 glmnet_4.1-3 tibble_3.1.6
## [49] Formula_1.2-4 crayon_1.5.1 tidyr_1.2.0 pkgconfig_2.0.3
## [53] partykit_1.2-15 libcoin_1.0-9 ellipsis_0.3.2 Matrix_1.4-1
## [57] assertthat_0.2.1 rmarkdown_2.13 rstudioapi_0.13 iterators_1.0.14
## [61] earth_5.3.1 rpart_4.1.16 R6_2.5.1 compiler_4.1.3
Hastie, T., Tibshirani, R., & Tibshirani, R. J. (2017). Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv:1707.08692, https://arxiv.org/abs/1707.08692.
Meinshausen, N. (2007). Relaxed lasso. Computational Statistics & Data Analysis, 52(1), 374-393. https://doi.org/10.1016/j.csda.2006.12.019
Su, W., Bogdan, M., & Candes, E. (2017). False discoveries occur early on the lasso path. The Annals of Statistics, 45(5), 2133-2150. https://doi.org/10.1214/16-AOS1521
As explained by Su et al. (2017), the shrinkage of the lasso penalty introduces pseudo-noise. When the regularization parameter \(\lambda\) is large, the estimated coefficients are seriously biased downwards, and the residuals still contain much of the effects associated with already selected variables. As strong variables get picked up, the shrinkage inflates pseudo-noise. If non-predictors are at least slightly correlated with this pseudo-noise, they will get picked up as predictors. Thus, for variable selection we might want a somewhat larger value of \(\lambda\), but to reduce the pseudo-noise, we want to deshrink the non-zero coefficients.↩︎