Table of Contents


Motivation

This package implements the Terminating-LARS (T-LARS) algorithm, i.e., it computes the solution path of the T-LARS algorithm. The T-LARS algorithm appends dummy predictors to the original predictor matrix and terminates the forward-selection process after a pre-defined number of dummy variables has been selected.

In contrast, the original LARS algorithm computes the entire solution path without stopping early. However, in certain applications there exists very little or no useful information in later steps of the solution path and, therefore, stopping early allows for huge improvements in terms of computation time and no loss in the variable selection accuracy. Especially the T-Rex selector (Paper and R package) requires terminating multiple solution paths early after a pre-defined number of dummy variables has been included.

The T-LARS algorithm is a major building block of the T-Rex selector. The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the false discovery rate (FDR), i.e., the expected fraction of selected false positives among all selected variables, at the user-defined target level while maximizing the number of selected variables.

In the following, we show how to use the package and give you an idea of why terminating the solution path early is a reasonable approach in high-dimensional and sparse variable selection: In many applications, most active variables enter the solution path early!

\[ \DeclareMathOperator{\FDP}{FDP} \DeclareMathOperator{\FDR}{FDR} \DeclareMathOperator{\TPP}{TPP} \DeclareMathOperator{\TPR}{TPR} \newcommand{\A}{\mathcal{A}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=} \]

Installation

You can install the ‘tlars’ package from GitHub with

install.packages("devtools")
devtools::install_github("jasinmachkour/tlars")

You can open the help pages with

library(tlars)
help(package = "tlars")
?tlars
?tlars_model
?tlars_cpp
?plot.Rcpp_tlars_cpp
?print.Rcpp_tlars_cpp
?Gauss_data

To cite the package ‘tlars’ in publications use

citation("tlars")

Quick Start

In the following, we illustrate the basic usage of the ‘tlars’ package for performing variable selection in sparse and high-dimensional regression settings using the T-LARS algorithm:

  1. First, we generate a high-dimensional Gaussian data set with sparse support:
library(tlars)

# Setup
n <- 150 # Number of observations
p <- 300 # Number of variables
num_act <- 5 # Number of true active variables
beta <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # Coefficient vector
true_actives <- which(beta > 0) # Indices of true active variables
num_dummies <- p # Number of dummy predictors (or dummies)

# Generate Gaussian data
set.seed(123)
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
y <- X %*% beta + stats::rnorm(n)
  1. Second, we generate a dummy matrix containing n rows and num_dummies dummy predictors that are sampled from the standard normal distribution and append it to the original predictor matrix:
set.seed(1234)
dummies <- matrix(stats::rnorm(n * num_dummies), nrow = n, ncol = num_dummies)
XD <- cbind(X, dummies)
  1. Third, we generate an object of the C++ class ‘tlars_cpp’ and supply the information that the last num_dummies predictors in XD are dummy predictors:
mod_tlars <- tlars_model(X = XD, y = y, num_dummies = num_dummies)
#> Created an object of class tlars_cpp... 
#>       The first p = 300 predictors in 'XD' are the original predictors and 
#>       the last num_dummies = 300 predictors are dummies
  1. Finally, we perform one T-LARS step on ‘mod_tlars’, i.e., the T-LARS algorithm is run until T_stop = 1 dummy has entered the solution path and stops there:
tlars(model = mod_tlars, T_stop = 1, early_stop = TRUE) # Perform one T-LARS step on object "mod_tlars"
#> Executing T-LARS step by reference...
#>       Finished T-LARS step(s)... 
#>           - The results are stored in the C++ object 'mod_tlars'.
#>           - New value of T_stop: 1.
#>           - Time elapsed: 0.002 sec.
print(mod_tlars) # Print information about the results of the performed T-LARS steps
#> 'mod_tlars' is a C++ object of class 'tlars_cpp' ... 
#>   - Number of dummies: 300.
#>   - Number of included dummies: 1.
#>   - Selected variables: 5, 2, 4, 3, 1.
plot(mod_tlars) # Plot the terminated solution path

T-LARS Warm Starts

The object “mod_tlars” stores the results and, therefore, allows for warm starts. That is, after performing a T-LARS step with, e.g., “T_stop = 1”, we can perform another T-LARS step with, e.g., “T_stop = 2, 3, …”, by continuing to build the solution path from its last T-LARS step.


  • T_stop = 2:
# Numerical zero
eps <- .Machine$double.eps

# Perform one additional T-LARS step (going from T_stop = 1 to T_stop = 2) on object "mod_tlars"
tlars(model = mod_tlars, T_stop = 2, early_stop = TRUE)
#> Executing T-LARS step by reference...
#>       Finished T-LARS step(s)... 
#>           - The results are stored in the C++ object 'mod_tlars'.
#>           - New value of T_stop: 2.
#>           - Time elapsed: 0.001 sec.
print(mod_tlars)
#> 'mod_tlars' is a C++ object of class 'tlars_cpp' ... 
#>   - Number of dummies: 300.
#>   - Number of included dummies: 2.
#>   - Selected variables: 5, 2, 4, 3, 1, 257.
plot(mod_tlars)


# Coefficient vector corresponding to original and dummy variables at the terminal T-LARS step 
beta_hat <- mod_tlars$get_beta()

selected_var <- which(abs(beta_hat[seq(p)]) > eps) # Indices of selected original variables
selected_dummies <- p + which(abs(beta_hat[seq(p + 1, ncol(XD))]) > eps) # Indices of selected dummy variables

FDP <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) # False discovery proportion (FDP)
TPP <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) # True positive proportion (TPP)

selected_var
#> [1]   1   2   3   4   5 257
selected_dummies
#> [1] 357 417
FDP
#> [1] 0.1666667
TPP
#> [1] 1

  • T_stop = 5:
# Perform three additional T-LARS steps (going from T_stop = 2 to T_stop = 5) on object "mod_tlars"
tlars(model = mod_tlars, T_stop = 5, early_stop = TRUE)
#> Executing T-LARS step by reference...
#>       Finished T-LARS step(s)... 
#>           - The results are stored in the C++ object 'mod_tlars'.
#>           - New value of T_stop: 5.
#>           - Time elapsed: 0.002 sec.
print(mod_tlars)
#> 'mod_tlars' is a C++ object of class 'tlars_cpp' ... 
#>   - Number of dummies: 300.
#>   - Number of included dummies: 5.
#>   - Selected variables: 5, 2, 4, 3, 1, 257, 67, 255, 91, 227, 113.
plot(mod_tlars)


# Coefficient vector corresponding to original and dummy variables at the terminal T-LARS step 
beta_hat <- mod_tlars$get_beta()

selected_var <- which(abs(beta_hat[seq(p)]) > eps) # Indices of selected original variables
selected_dummies <- p + which(abs(beta_hat[seq(p + 1, ncol(XD))]) > eps) # Indices of selected dummy variables

FDP <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var)) # False discovery proportion (FDP)
TPP <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives)) # True positive proportion (TPP)

selected_var
#>  [1]   1   2   3   4   5  67  91 113 227 255 257
selected_dummies
#> [1] 320 321 357 417 507
FDP
#> [1] 0.5454545
TPP
#> [1] 1

FDR and TPR

False discovery rate (FDR) and true positive rate (TPR)

From a statistical point of view, it is desirable to use a variable selection method that allows for controlling the expected value of the FDP at a user-defined target level \(\alpha \in [0, 1]\) while maximizing the number of selected variables. These type of methods exist and are called false discovery rate (FDR)-controlling methods. For example, the T-Rex selector (Paper and R package) is a fast and FDR-controlling variable/feature selection framework for large-scale high-dimensional settings that relies on the T-LARS method.

Definitions (FDR and TPR) Let \(\widehat{\A} \subseteq \lbrace 1, \ldots, p \rbrace\) be the set of selected variables, \(\A \subseteq \lbrace 1, \ldots, p \rbrace\) the set of true active variables, \(| \widehat{\A} |\) the cardinality of \(\widehat{\A}\), and define \(1 \lor a \coloneqq \max\lbrace 1, a \rbrace\), \(a \in \mathbb{R}\). Then, the false discovery rate (FDR) and the true positive rate (TPR) are defined by \[ \FDR \coloneqq \mathbb{E} \big[ \FDP \big] \coloneqq \mathbb{E} \left[ \dfrac{\big| \widehat{\A} \backslash \A \big|}{1 \lor \big| \widehat{\A} \big|} \right] \] and

\[ \TPR \coloneqq \mathbb{E} \big[ \TPP \big] \coloneqq \mathbb{E} \left[ \dfrac{| \A \cap \widehat{\A} |}{1 \lor | \A |} \right], \] respectively.

Simulations

We conduct Monte Carlo simulations and plot the resulting averaged FDP and TPP over the number of included dummies T. Note that the averaged FDP and TPP are estimates of the FDR and TPR, respectively.

# Setup
n <- 100 # number of observations
p <- 300 # number of variables

# Parameters
num_act <- 10 # number of true active variables
beta <- rep(0, times = p) # coefficient vector (all zeros first)
beta[sample(seq(p), size = num_act, replace = FALSE)] <- 3 # coefficient vector (active variables with non-zero coefficients)
true_actives <- which(beta > 0) # indices of true active variables
num_dummies <- p # number of dummies
T_vec <- c(1, 2, 5, 10, 20, 50, 100) # stopping points, i.e, number of included dummies before terminating the solution path
MC <- 500 # number of Monte Carlo runs per stopping point

# Initialize results vectors
FDP <- matrix(NA, nrow = MC, ncol = length(T_vec))
TPP <- matrix(NA, nrow = MC, ncol = length(T_vec))

# Numerical zero
eps <- .Machine$double.eps

# Seed
set.seed(12345)

# Run simulations
for (t in seq_along(T_vec)) {
  for (mc in seq(MC)) {
    # Generate Gaussian data
    X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
    y <- X %*% beta + stats::rnorm(n)

    # Generate dummy matrix and append it to X
    dummies <- matrix(stats::rnorm(n * p), nrow = n, ncol = num_dummies)
    XD <- cbind(X, dummies)

    # Create object of class tlars_cpp
    mod_tlars <- tlars_model(X = XD, y = y, num_dummies = num_dummies, type = "lar", info = FALSE)

    # Run T-LARS steps
    tlars(model = mod_tlars, T_stop = t, early_stop = TRUE, info = FALSE)
    beta_hat <- mod_tlars$get_beta()
    selected_var <- which(abs(beta_hat[seq(p)]) > eps)

    # Results
    FDP[mc, t] <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var))
    TPP[mc, t] <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives))
  }
}

# Compute estimates of FDR and TPR by averaging FDP and TPP over MC Monte Carlo runs
FDR <- colMeans(FDP)
TPR <- colMeans(TPP)

We observe that with growing number of included dummies \(T\) the FDR and TPR increase. Moreover, we see that there seems to be a trade-off between FDR and TPR. For more details and discussions on these observations, we refer the interested reader to [1].

Outlook

The T-LARS algorithm is a major building block of the T-Rex selector (Paper and R package). The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the FDR at the user-defined target level while maximizing the number of selected variables. If you are working in genomics, financial engineering, or any other field that requires a fast and FDR-controlling variable/feature selection method for large-scale high-dimensional settings, then this is for you. Check it out!

References

[1]
Machkour, J., Muma, M. and Palomar, D. P. (2021). The Terminating-Knockoff filter: Fast high-dimensional variable selection with false discovery rate control. arXiv preprint arXiv:2110.06048.
[2]
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–99.