This package implements the Terminating-LARS (T-LARS) algorithm, i.e., it computes the solution path of the T-LARS algorithm. The T-LARS algorithm appends dummy predictors to the original predictor matrix and terminates the forward-selection process after a pre-defined number of dummy variables has been selected.
In contrast, the original LARS algorithm computes the entire solution path without stopping early. However, in certain applications there exists very little or no useful information in later steps of the solution path and, therefore, stopping early allows for huge improvements in terms of computation time and no loss in the variable selection accuracy. Especially the T-Rex selector (Paper and R package) requires terminating multiple solution paths early after a pre-defined number of dummy variables has been included.
The T-LARS algorithm is a major building block of the T-Rex selector. The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the false discovery rate (FDR), i.e., the expected fraction of selected false positives among all selected variables, at the user-defined target level while maximizing the number of selected variables.
In the following, we show how to use the package and give you an idea of why terminating the solution path early is a reasonable approach in high-dimensional and sparse variable selection: In many applications, most active variables enter the solution path early!
\[ \DeclareMathOperator{\FDP}{FDP} \DeclareMathOperator{\FDR}{FDR} \DeclareMathOperator{\TPP}{TPP} \DeclareMathOperator{\TPR}{TPR} \newcommand{\A}{\mathcal{A}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=} \]
You can install the ‘tlars’ package from GitHub with
install.packages("devtools")
::install_github("jasinmachkour/tlars") devtools
You can open the help pages with
library(tlars)
help(package = "tlars")
?tlars
?tlars_model
?tlars_cpp
?plot.Rcpp_tlars_cpp
?print.Rcpp_tlars_cpp ?Gauss_data
To cite the package ‘tlars’ in publications use
citation("tlars")
In the following, we illustrate the basic usage of the ‘tlars’ package for performing variable selection in sparse and high-dimensional regression settings using the T-LARS algorithm:
library(tlars)
# Setup
<- 150 # Number of observations
n <- 300 # Number of variables
p <- 5 # Number of true active variables
num_act <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # Coefficient vector
beta <- which(beta > 0) # Indices of true active variables
true_actives <- p # Number of dummy predictors (or dummies)
num_dummies
# Generate Gaussian data
set.seed(123)
<- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
X <- X %*% beta + stats::rnorm(n) y
set.seed(1234)
<- matrix(stats::rnorm(n * num_dummies), nrow = n, ncol = num_dummies)
dummies <- cbind(X, dummies) XD
<- tlars_model(X = XD, y = y, num_dummies = num_dummies)
mod_tlars #> Created an object of class tlars_cpp...
#> The first p = 300 predictors in 'XD' are the original predictors and
#> the last num_dummies = 300 predictors are dummies
tlars(model = mod_tlars, T_stop = 1, early_stop = TRUE) # Perform one T-LARS step on object "mod_tlars"
#> Executing T-LARS step by reference...
#> Finished T-LARS step(s)...
#> - The results are stored in the C++ object 'mod_tlars'.
#> - New value of T_stop: 1.
#> - Time elapsed: 0.002 sec.
print(mod_tlars) # Print information about the results of the performed T-LARS steps
#> 'mod_tlars' is a C++ object of class 'tlars_cpp' ...
#> - Number of dummies: 300.
#> - Number of included dummies: 1.
#> - Selected variables: 5, 2, 4, 3, 1.
plot(mod_tlars) # Plot the terminated solution path
The object “mod_tlars” stores the results and, therefore, allows for warm starts. That is, after performing a T-LARS step with, e.g., “T_stop = 1”, we can perform another T-LARS step with, e.g., “T_stop = 2, 3, …”, by continuing to build the solution path from its last T-LARS step.
# Numerical zero
<- .Machine$double.eps
eps
# Perform one additional T-LARS step (going from T_stop = 1 to T_stop = 2) on object "mod_tlars"
tlars(model = mod_tlars, T_stop = 2, early_stop = TRUE)
#> Executing T-LARS step by reference...
#> Finished T-LARS step(s)...
#> - The results are stored in the C++ object 'mod_tlars'.
#> - New value of T_stop: 2.
#> - Time elapsed: 0.001 sec.
print(mod_tlars)
#> 'mod_tlars' is a C++ object of class 'tlars_cpp' ...
#> - Number of dummies: 300.
#> - Number of included dummies: 2.
#> - Selected variables: 5, 2, 4, 3, 1, 257.
plot(mod_tlars)
# Coefficient vector corresponding to original and dummy variables at the terminal T-LARS step
<- mod_tlars$get_beta()
beta_hat
<- which(abs(beta_hat[seq(p)]) > eps) # Indices of selected original variables
selected_var <- p + which(abs(beta_hat[seq(p + 1, ncol(XD))]) > eps) # Indices of selected dummy variables
selected_dummies
<- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) # False discovery proportion (FDP)
FDP <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) # True positive proportion (TPP)
TPP
selected_var#> [1] 1 2 3 4 5 257
selected_dummies#> [1] 357 417
FDP#> [1] 0.1666667
TPP#> [1] 1
# Perform three additional T-LARS steps (going from T_stop = 2 to T_stop = 5) on object "mod_tlars"
tlars(model = mod_tlars, T_stop = 5, early_stop = TRUE)
#> Executing T-LARS step by reference...
#> Finished T-LARS step(s)...
#> - The results are stored in the C++ object 'mod_tlars'.
#> - New value of T_stop: 5.
#> - Time elapsed: 0.002 sec.
print(mod_tlars)
#> 'mod_tlars' is a C++ object of class 'tlars_cpp' ...
#> - Number of dummies: 300.
#> - Number of included dummies: 5.
#> - Selected variables: 5, 2, 4, 3, 1, 257, 67, 255, 91, 227, 113.
plot(mod_tlars)
# Coefficient vector corresponding to original and dummy variables at the terminal T-LARS step
<- mod_tlars$get_beta()
beta_hat
<- which(abs(beta_hat[seq(p)]) > eps) # Indices of selected original variables
selected_var <- p + which(abs(beta_hat[seq(p + 1, ncol(XD))]) > eps) # Indices of selected dummy variables
selected_dummies
<- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) # False discovery proportion (FDP)
FDP <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) # True positive proportion (TPP)
TPP
selected_var#> [1] 1 2 3 4 5 67 91 113 227 255 257
selected_dummies#> [1] 320 321 357 417 507
FDP#> [1] 0.5454545
TPP#> [1] 1
From a statistical point of view, it is desirable to use a variable selection method that allows for controlling the expected value of the FDP at a user-defined target level \(\alpha \in [0, 1]\) while maximizing the number of selected variables. These type of methods exist and are called false discovery rate (FDR)-controlling methods. For example, the T-Rex selector (Paper and R package) is a fast and FDR-controlling variable/feature selection framework for large-scale high-dimensional settings that relies on the T-LARS method.
Definitions (FDR and TPR) Let \(\widehat{\A} \subseteq \lbrace 1, \ldots, p \rbrace\) be the set of selected variables, \(\A \subseteq \lbrace 1, \ldots, p \rbrace\) the set of true active variables, \(| \widehat{\A} |\) the cardinality of \(\widehat{\A}\), and define \(1 \lor a \coloneqq \max\lbrace 1, a \rbrace\), \(a \in \mathbb{R}\). Then, the false discovery rate (FDR) and the true positive rate (TPR) are defined by \[ \FDR \coloneqq \mathbb{E} \big[ \FDP \big] \coloneqq \mathbb{E} \left[ \dfrac{\big| \widehat{\A} \backslash \A \big|}{1 \lor \big| \widehat{\A} \big|} \right] \] and
\[ \TPR \coloneqq \mathbb{E} \big[ \TPP \big] \coloneqq \mathbb{E} \left[ \dfrac{| \A \cap \widehat{\A} |}{1 \lor | \A |} \right], \] respectively.
We conduct Monte Carlo simulations and plot the resulting averaged FDP and TPP over the number of included dummies T. Note that the averaged FDP and TPP are estimates of the FDR and TPR, respectively.
# Setup
<- 100 # number of observations
n <- 300 # number of variables
p
# Parameters
<- 10 # number of true active variables
num_act <- rep(0, times = p) # coefficient vector (all zeros first)
beta sample(seq(p), size = num_act, replace = FALSE)] <- 3 # coefficient vector (active variables with non-zero coefficients)
beta[<- which(beta > 0) # indices of true active variables
true_actives <- p # number of dummies
num_dummies <- c(1, 2, 5, 10, 20, 50, 100) # stopping points, i.e, number of included dummies before terminating the solution path
T_vec <- 500 # number of Monte Carlo runs per stopping point
MC
# Initialize results vectors
<- matrix(NA, nrow = MC, ncol = length(T_vec))
FDP <- matrix(NA, nrow = MC, ncol = length(T_vec))
TPP
# Numerical zero
<- .Machine$double.eps
eps
# Seed
set.seed(12345)
# Run simulations
for (t in seq_along(T_vec)) {
for (mc in seq(MC)) {
# Generate Gaussian data
<- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
X <- X %*% beta + stats::rnorm(n)
y
# Generate dummy matrix and append it to X
<- matrix(stats::rnorm(n * p), nrow = n, ncol = num_dummies)
dummies <- cbind(X, dummies)
XD
# Create object of class tlars_cpp
<- tlars_model(X = XD, y = y, num_dummies = num_dummies, type = "lar", info = FALSE)
mod_tlars
# Run T-LARS steps
tlars(model = mod_tlars, T_stop = t, early_stop = TRUE, info = FALSE)
<- mod_tlars$get_beta()
beta_hat <- which(abs(beta_hat[seq(p)]) > eps)
selected_var
# Results
<- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var))
FDP[mc, t] <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives))
TPP[mc, t]
}
}
# Compute estimates of FDR and TPR by averaging FDP and TPP over MC Monte Carlo runs
<- colMeans(FDP)
FDR <- colMeans(TPP) TPR
We observe that with growing number of included dummies \(T\) the FDR and TPR increase. Moreover, we see that there seems to be a trade-off between FDR and TPR. For more details and discussions on these observations, we refer the interested reader to [1].
The T-LARS algorithm is a major building block of the T-Rex selector (Paper and R package). The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the FDR at the user-defined target level while maximizing the number of selected variables. If you are working in genomics, financial engineering, or any other field that requires a fast and FDR-controlling variable/feature selection method for large-scale high-dimensional settings, then this is for you. Check it out!