Sören Künzel, Theo Saarinen, Simon Walter, Edward Liu, Allen Tang, Jasjeet Sekhon
Rforestry is a fast implementation of Honest Random Forests, Gradient Boosting, and Linear Random Forests, with an emphasis on inference and interpretability.
install.packages("devtools")

The installation requires compilation, so a development environment is needed; run

devtools::has_devel()

to check whether you have one. If no development environment exists, Windows users should download and install Rtools and macOS users should download and install Xcode. The package can then be installed from GitHub:

devtools::install_github("forestry-labs/Rforestry")

Windows users will need to skip 64-bit compilation due to an outstanding gcc issue:

devtools::install_github("forestry-labs/Rforestry", INSTALL_opts = c('--no-multiarch'))
library(Rforestry)

set.seed(292315)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]

rf <- forestry(x = x_train, y = y_train, nthread = 2)
predict(rf, x_test)
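As a quick follow-up (not part of the original example), the predictions can be compared with the held-out responses; the y_test name below is illustrative:

# Held-out responses for the three test rows (illustrative follow-up)
y_test <- iris[test_idx, 1]

# Mean squared error on the held-out rows
mean((predict(rf, x_test) - y_test)^2)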
A fast implementation of random forests using ridge-penalized splitting and ridge regression for predictions is also included. To use this version of random forests, set the linear option to TRUE.
library(Rforestry)

set.seed(49)
n <- c(100)
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a, b, c)

forest <- forestry(x, y, linear = TRUE, nthread = 2)
predict(forest, x)
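Since the response above is an exact linear function of the features, the ridge-based splits should track it closely. As a rough, illustrative comparison (not from the original documentation), a standard forest can be fit on the same data and the in-sample fits compared:

# Standard (non-linear) forest on the same data, for comparison (illustrative only)
forest_std <- forestry(x, y, nthread = 2)

# In-sample mean squared error of the linear forest vs. the standard forest
mean((predict(forest, x) - y)^2)
mean((predict(forest_std, x) - y)^2)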
The parameter monotonicConstraints strictly enforces monotonicity of partition averages when evaluating potential splits on the indicated features. This parameter can be used to specify both monotone increasing and monotone decreasing constraints.
library(Rforestry)

set.seed(49)
x <- rnorm(150) + 5
y <- .15*x + .5*sin(3*x)
data_train <- data.frame(x1 = x, x2 = rnorm(150) + 5, y = y + rnorm(150, sd = .4))

monotone_rf <- forestry(x = data_train[,-3],
                        y = data_train$y,
                        monotonicConstraints = c(1,1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)
predict(monotone_rf, newdata = data_train[,-3])
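The example above imposes monotone increasing constraints on both features. Assuming the usual encoding (1 for increasing, -1 for decreasing, 0 for unconstrained), a decreasing constraint could be specified as in the following sketch, which flips the sign of the response so that the relationship is actually decreasing:

# Sketch: monotone decreasing constraint on x1, x2 left unconstrained
# (assumes the encoding 1 = increasing, -1 = decreasing, 0 = none)
decreasing_rf <- forestry(x = data_train[,-3],
                          y = -data_train$y,
                          monotonicConstraints = c(-1, 0),
                          nodesizeStrictSpl = 5,
                          nthread = 1,
                          ntree = 25)
predict(decreasing_rf, newdata = data_train[,-3])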
We can return the predictions for the training data set using only the trees in which each observation was out-of-bag (OOB). Note that when there are few trees, or a high proportion of the observations is sampled for each tree, some observations may not be out-of-bag for any tree; their predictions are returned as NaN.
library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
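With very few trees, some observations may never be left out of a tree's sample, and their OOB predictions come back as NaN. A small illustrative sketch of how one might check for this:

# Sketch: with only a handful of trees, some observations may not be
# out-of-bag for any tree, and their OOB predictions are NaN
rf_small <- forestry(x = iris[,-1],
                     y = iris[,1],
                     nthread = 2,
                     ntree = 3)
oob_small <- predict(rf_small, aggregation = "oob")
sum(is.na(oob_small))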
If OOB predictions are going to be used, it is advised that one use OOB honesty during training (OOBhonest = TRUE). In this version of honesty, the OOB observations for each tree are used as the honest (averaging) set. OOB honesty also changes how predictions are constructed. When predicting for observations that are out-of-sample (using predict(..., aggregation = "average")), all of the trees in the forest are used to construct the predictions. When predicting for an observation that was in-sample (using predict(..., aggregation = "oob")), only the trees for which that observation was not in the averaging set are used to construct its prediction.

aggregation = "oob" (out-of-bag) ensures that an observation's own outcome value is never used to construct its prediction, even when it is in sample. This property does not hold in standard honesty, which relies on an asymptotic subsampling argument. OOB honesty, when used in combination with aggregation = "oob" at the prediction stage, cannot overfit IID data, at either the training or prediction stage. The outputs of such models are also more stable and more easily interpretable; one can observe this when querying the model with interpretation tools such as ALEs, PDPs, and LIME.
library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500,
               OOBhonest = TRUE)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
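The distinction described above can be made concrete with a small sketch (the train/test split below is illustrative): in-sample predictions use aggregation = "oob", while genuinely out-of-sample predictions use aggregation = "average" and all trees:

# Sketch: in-sample vs. out-of-sample predictions with OOB honesty
train_idx <- sample(nrow(iris), 100)
rf_split <- forestry(x = iris[train_idx, -1],
                     y = iris[train_idx, 1],
                     OOBhonest = TRUE,
                     nthread = 2)

# In-sample: only trees whose averaging set excluded the observation are used
in_sample_preds <- predict(rf_split, aggregation = "oob")

# Out-of-sample: all trees in the forest are used
out_sample_preds <- predict(rf_split, newdata = iris[-train_idx, -1], aggregation = "average")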
The package also includes two functions for saving and loading a trained model. The following code shows how to use saveForestry and loadForestry to save and load a forestry model.
library(Rforestry)

# Train a forest
forest <- forestry(x = iris[,-1],
                   y = iris[,1],
                   nthread = 2,
                   ntree = 500,
                   OOBhonest = TRUE)

# Get predictions before saving the forest
y_pred_before <- predict(forest, iris[,-1])

# Save the forest
saveForestry(forest, filename = file.path("forest.Rda"))

# Delete the forest
rm(forest)

# Load the forest
forest_after <- loadForestry(file.path("forest.Rda"))

# Predict after loading the forest
y_pred_after <- predict(forest_after, iris[,-1])
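As a quick check (not part of the original snippet), the reloaded forest should reproduce the predictions made before saving:

# The reloaded forest should give the same predictions as before saving
all.equal(y_pred_before, y_pred_after)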