Sören Künzel, Theo Saarinen, Simon Walter, Edward Liu, Allen Tang, Jasjeet Sekhon
Rforestry is a fast implementation of Honest Random Forests, Gradient Boosting, and Linear Random Forests, with an emphasis on inference and interpretability.
install.packages("devtools")

The installation requires compilation, so a development environment is needed; run

devtools::has_devel()

to check whether you have one. If no development environment exists, Windows users should download and install Rtools and macOS users should download and install Xcode. The package can then be installed from GitHub:

devtools::install_github("forestry-labs/Rforestry")

Windows users will need to skip 64-bit compilation due to an outstanding gcc issue:

devtools::install_github("forestry-labs/Rforestry", INSTALL_opts = c('--no-multiarch'))
library(Rforestry)

set.seed(292315)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]

rf <- forestry(x = x_train, y = y_train, nthread = 2)
predict(rf, x_test)
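As a quick follow-up (not part of the original example), the predictions can be compared with the held-out responses; the y_test name below is illustrative:

# Held-out responses for the three test rows (illustrative follow-up)
y_test <- iris[test_idx, 1]

# Mean squared error on the held-out rows
mean((predict(rf, x_test) - y_test)^2)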
A fast implementation of random forests using ridge-penalized splitting and ridge regression for predictions is also included. To use this version of random forests, set the linear option to TRUE.
library(Rforestry)

set.seed(49)
n <- c(100)
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a, b, c)

forest <- forestry(x, y, linear = TRUE, nthread = 2)
predict(forest, x)
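Since the response above is an exact linear function of the features, the ridge-based splits should track it closely. As a rough, illustrative comparison (not from the original documentation), a standard forest can be fit on the same data and the in-sample fits compared:

# Standard (non-linear) forest on the same data, for comparison (illustrative only)
forest_std <- forestry(x, y, nthread = 2)

# In-sample mean squared error of the linear forest vs. the standard forest
mean((predict(forest, x) - y)^2)
mean((predict(forest_std, x) - y)^2)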
The parameter monotonicConstraints strictly enforces monotonicity of partition averages when evaluating potential splits on the indicated features. This parameter can be used to specify both monotone increasing and monotone decreasing constraints.
library(Rforestry)

set.seed(49)
x <- rnorm(150) + 5
y <- .15*x + .5*sin(3*x)
data_train <- data.frame(x1 = x, x2 = rnorm(150) + 5, y = y + rnorm(150, sd = .4))

monotone_rf <- forestry(x = data_train[,-3],
                        y = data_train$y,
                        monotonicConstraints = c(1,1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)
predict(monotone_rf, newdata = data_train[,-3])
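The example above imposes monotone increasing constraints on both features. Assuming the usual encoding (1 for increasing, -1 for decreasing, 0 for unconstrained), a decreasing constraint could be specified as in the following sketch, which flips the sign of the response so that the relationship is actually decreasing:

# Sketch: monotone decreasing constraint on x1, x2 left unconstrained
# (assumes the encoding 1 = increasing, -1 = decreasing, 0 = none)
decreasing_rf <- forestry(x = data_train[,-3],
                          y = -data_train$y,
                          monotonicConstraints = c(-1, 0),
                          nodesizeStrictSpl = 5,
                          nthread = 1,
                          ntree = 25)
predict(decreasing_rf, newdata = data_train[,-3])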
We can return the predictions for the training data set using only the trees in which each observation was out-of-bag (OOB). Note that when there are few trees, or a high proportion of the observations is sampled for each tree, some observations may not be out-of-bag for any tree; their predictions are returned as NaN.
library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
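With very few trees, some observations may never be left out of a tree's sample, and their OOB predictions come back as NaN. A small illustrative sketch of how one might check for this:

# Sketch: with only a handful of trees, some observations may not be
# out-of-bag for any tree, and their OOB predictions are NaN
rf_small <- forestry(x = iris[,-1],
                     y = iris[,1],
                     nthread = 2,
                     ntree = 3)
oob_small <- predict(rf_small, aggregation = "oob")
sum(is.na(oob_small))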
If OOB predictions are going to be used, it is advised that one use OOB honesty during training (OOBhonest = TRUE). In this version of honesty, the OOB observations for each tree are used as the honest (averaging) set. OOB honesty also changes how predictions are constructed. When predicting for observations that are out-of-sample (using predict(..., aggregation = "average")), all of the trees in the forest are used to construct the predictions. When predicting for an observation that was in-sample (using predict(..., aggregation = "oob")), only the trees for which that observation was not in the averaging set are used to construct its prediction.

aggregation = "oob" (out-of-bag) ensures that an observation's own outcome value is never used to construct its prediction, even when it is in sample. This property does not hold in standard honesty, which relies on an asymptotic subsampling argument. OOB honesty, when used in combination with aggregation = "oob" at the prediction stage, cannot overfit IID data, at either the training or prediction stage. The outputs of such models are also more stable and more easily interpretable; one can observe this when querying the model with interpretation tools such as ALEs, PDPs, and LIME.
library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500,
               OOBhonest = TRUE)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds - iris[,1])^2)
getOOB(rf)
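The distinction described above can be made concrete with a small sketch (the train/test split below is illustrative): in-sample predictions use aggregation = "oob", while genuinely out-of-sample predictions use aggregation = "average" and all trees:

# Sketch: in-sample vs. out-of-sample predictions with OOB honesty
train_idx <- sample(nrow(iris), 100)
rf_split <- forestry(x = iris[train_idx, -1],
                     y = iris[train_idx, 1],
                     OOBhonest = TRUE,
                     nthread = 2)

# In-sample: only trees whose averaging set excluded the observation are used
in_sample_preds <- predict(rf_split, aggregation = "oob")

# Out-of-sample: all trees in the forest are used
out_sample_preds <- predict(rf_split, newdata = iris[-train_idx, -1], aggregation = "average")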
The package also includes two functions for saving and loading a trained model. The following code shows how to use saveForestry and loadForestry to save and load a forestry model.
library(Rforestry)

# Train a forest
forest <- forestry(x = iris[,-1],
                   y = iris[,1],
                   nthread = 2,
                   ntree = 500,
                   OOBhonest = TRUE)

# Get predictions before saving the forest
y_pred_before <- predict(forest, iris[,-1])

# Save the forest
saveForestry(forest, filename = file.path("forest.Rda"))

# Delete the forest
rm(forest)

# Load the forest
forest_after <- loadForestry(file.path("forest.Rda"))

# Predict after loading the forest
y_pred_after <- predict(forest_after, iris[,-1])
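As a quick check (not part of the original snippet), the reloaded forest should reproduce the predictions made before saving:

# The reloaded forest should give the same predictions as before saving
all.equal(y_pred_before, y_pred_after)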