Using missRanger

Introduction

The aim of this vignette is to introduce the R package missRanger for imputation of missing values and to explain how to use it for multiple imputation.

missRanger uses the ranger package (Wright and Ziegler 2017) to do fast missing value imputation by chained random forests. As such, it can be used as an alternative to missForest, a beautiful algorithm introduced in (Stekhoven and Buehlmann 2011). Basically, each variable is imputed by predictions from a random forest that uses all other variables as covariates. missRanger iterates over all variables multiple times until the average out-of-bag prediction error of the models stops improving.
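To make the chaining idea concrete, here is a minimal sketch of such a loop written directly against ranger. It is not missRanger's source code: the helper name chained_rf_sketch and its arguments max_iter and num_trees are made up for illustration, the initial fill by random draws from the observed values is an assumption, and the sketch leaves out predictive mean matching, the formula interface and case weights.

```r
library(ranger)

# Hypothetical helper for illustration only; this is NOT missRanger's source code.
chained_rf_sketch <- function(data, max_iter = 5, num_trees = 50) {
  miss <- is.na(data)                            # logical matrix of missing cells
  to_impute <- colnames(miss)[colSums(miss) > 0]

  # Initial fill (assumption): draw from the observed values of each variable
  for (v in to_impute) {
    observed <- data[[v]][!miss[, v]]
    draws <- sample.int(length(observed), sum(miss[, v]), replace = TRUE)
    data[[v]][miss[, v]] <- observed[draws]
  }

  best_error <- Inf
  for (iter in seq_len(max_iter)) {
    previous <- data
    errors <- numeric(0)
    for (v in to_impute) {
      # Fit on the rows where v was observed, using all other columns
      fit <- ranger(
        dependent.variable.name = v,
        data = data[!miss[, v], , drop = FALSE],
        num.trees = num_trees
      )
      # Overwrite the originally missing cells of v with predictions
      data[[v]][miss[, v]] <- predict(fit, data[miss[, v], , drop = FALSE])$predictions
      # OOB error: 1 - R-squared for numeric targets, misclassification rate for factors
      oob <- if (is.numeric(data[[v]])) 1 - fit$r.squared else fit$prediction.error
      errors <- c(errors, oob)
    }
    # Stop once the average OOB error no longer improves; keep the previous fill
    if (mean(errors) >= best_error) return(previous)
    best_error <- mean(errors)
  }
  data
}

# Usage on a small example with two incomplete columns
set.seed(1)
x <- iris
x[sample(150, 30), "Sepal.Length"] <- NA
x[sample(150, 30), "Species"] <- NA
res <- chained_rf_sketch(x)
anyNA(res)
```

The sketch returns the data from the last iteration that still improved the average error, which mirrors the stopping rule described above.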

Why should you consider missRanger?

It is fast.
It is flexible and intuitive to apply: for example, calling missRanger(data, . ~ 1) would impute all variables univariately, while missRanger(data, Species ~ Sepal.Width) would use Sepal.Width to impute Species.
It can deal with most realistic variable types, even dates and times, without destroying the original data structure.
It combines random forest imputation with predictive mean matching. This generates realistic variability and avoids “new” values like 0.3334 in a 0-1 coded variable. As a result, missRanger can be used for realistic multiple imputation scenarios; see e.g. (Rubin 1987) for the statistical background.
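The predictive mean matching step can be sketched in a few lines of base R. The helper pmm_sketch below is hypothetical and only illustrates the idea; it is not the function missRanger uses internally. For each prediction of a missing value, it looks up the k observed cases with the closest predictions and returns the observed value of one of them, so only values that already occur in the data can be imputed.

```r
# Hypothetical helper for illustration; not missRanger's internal implementation.
# Assumes k is not larger than the number of observed cases.
pmm_sketch <- function(pred_missing, pred_observed, y_observed, k = 3) {
  vapply(pred_missing, function(p) {
    # Positions of the k observed cases whose predictions are closest to p
    nearest <- order(abs(pred_observed - p))[seq_len(k)]
    # Return the observed value of one randomly chosen neighbour
    y_observed[nearest[sample.int(k, 1)]]
  }, FUN.VALUE = numeric(1))
}

# A 0-1 coded variable: raw predictions such as 0.3334 are not valid values
y_observed    <- c(0, 0, 0, 1, 1, 1)
pred_observed <- c(0.05, 0.20, 0.35, 0.60, 0.80, 0.95)
pred_missing  <- c(0.3334, 0.90)

set.seed(1)
imputed <- pmm_sketch(pred_missing, pred_observed, y_observed, k = 3)
all(imputed %in% y_observed)  # TRUE
```

Because one of the k neighbours is drawn at random, repeated runs give different imputations, which is the source of the variability needed for multiple imputation.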

In the examples below, we will meet two functions from the missRanger package:

generateNA: To replace values in a data set by missing values.
missRanger: To impute missing values in a data frame.

Installation

From CRAN:

install.packages("missRanger")

Latest version from GitHub:

library(devtools)
install_github("mayer79/missRanger", subdir = "release/missRanger")

Examples

We first generate a data set with about 20% missing values per column and then fill them in again with missRanger.

library(missRanger)
library(dplyr)

set.seed(84553)

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

# Generate data with missing values in all columns
head(irisWithNA <- generateNA(iris, p = 0.2))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5           NA         0.2  setosa
#> 2          4.9         3.0          1.4          NA  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4           NA          NA           NA         0.2  setosa
#> 5          5.0         3.6          1.4          NA  setosa
#> 6          5.4         3.9           NA         0.4  setosa
 
# Impute missing values with missRanger
head(irisImputed <- missRanger(irisWithNA, num.trees = 100))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1     5.100000    3.500000     1.503583   0.2000000  setosa
#> 2     4.900000    3.000000     1.400000   0.2845833  setosa
#> 3     4.700000    3.200000     1.300000   0.2000000  setosa
#> 4     5.673567    3.273117     2.505867   0.2000000  setosa
#> 5     5.000000    3.600000     1.400000   0.1914333  setosa
#> 6     5.400000    3.900000     1.509900   0.4000000  setosa

It worked! Unfortunately, the new values look somewhat unnatural due to different rounding. To avoid this, set the pmm.k argument to a positive number. All imputations done during the process are then combined with a predictive mean matching (PMM) step, leading to more natural imputations and improved distributional properties of the resulting values:

head(irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#> iter 4:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          5.8         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.4         0.4  setosa

Note that missRanger offers a ... argument to pass options to ranger, e.g. num.trees or min.node.size. How would we use its “extra trees” variant with 50 trees?

head(irisImputed_et <- missRanger(irisWithNA, pmm.k = 3, splitrule = "extratrees", num.trees = 50))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#> iter 4:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.3         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.8         2.7          1.3         0.2  setosa
#> 5          5.0         3.6          1.4         0.4  setosa
#> 6          5.4         3.9          1.3         0.4  setosa

It is that simple!

Further note that missRanger does not rely on the tidyverse, but you can embed it into a dplyr pipeline (without group_by). Make sure to set verbose = 0 in order to suppress messages.

iris %>% 
  generateNA() %>% 
  missRanger(verbose = 0) %>% 
  head()
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1     5.100000    3.500000     1.400000   0.3102322  setosa
#> 2     4.900000    3.038516     1.400000   0.2000000  setosa
#> 3     4.700000    3.200000     1.300000   0.2000000  setosa
#> 4     4.600000    3.100000     1.500000   0.2000000  setosa
#> 5     5.000000    3.211776     1.484725   0.2000000  setosa
#> 6     6.691223    3.900000     3.082777   0.4000000  setosa

By default, missRanger uses all columns in the data set to impute all columns with missing values. To override this behaviour, you can use an intuitive formula interface: the left hand side specifies the variables to be imputed (variable names separated by a +), while the right hand side lists the variables used for imputation.

# Impute all variables with all (default behaviour). Note that variables without
# missing values are dropped from the left hand side of the formula.
head(m <- missRanger(irisWithNA, formula = . ~ ., pmm.k = 3, num.trees = 10))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.5         0.2  setosa
#> 2          4.9         3.0          1.4         0.5  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          6.1         3.0          3.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.3  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

# Same
head(m <- missRanger(irisWithNA, pmm.k = 3, num.trees = 10))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#> iter 4:  .....
#> iter 5:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          3.3         0.2  setosa
#> 2          4.9         3.0          1.4         0.3  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          5.5         3.1          3.8         0.2  setosa
#> 5          5.0         3.6          1.4         0.3  setosa
#> 6          5.4         3.9          1.4         0.4  setosa

# Impute all variables with all except Species
head(m <- missRanger(irisWithNA, . ~ . - Species, pmm.k = 3, num.trees = 10))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#> iter 4:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          4.5         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          6.0         2.7          4.0         0.2  setosa
#> 5          5.0         3.6          1.4         0.3  setosa
#> 6          5.4         3.9          4.0         0.4  setosa

# Impute Sepal.Width by Species 
head(m <- missRanger(irisWithNA, Sepal.Width ~ Species, pmm.k = 3, num.trees = 10))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Width
#>   Variables used to impute:  
#> iter 1:  .
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5           NA         0.2  setosa
#> 2          4.9         3.0          1.4          NA  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4           NA         2.0           NA         0.2  setosa
#> 5          5.0         3.6          1.4          NA  setosa
#> 6          5.4         3.9           NA         0.4  setosa

# No success. Why? Species contains missing values itself, so it can only be used for imputation if it is imputed as well.
head(m <- missRanger(irisWithNA, Sepal.Width + Species ~ Species, pmm.k = 3, num.trees = 10))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Width, Species
#>   Variables used to impute:  Species
#> iter 1:  ..
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5           NA         0.2  setosa
#> 2          4.9         3.0          1.4          NA  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4           NA         3.2           NA         0.2  setosa
#> 5          5.0         3.6          1.4          NA  setosa
#> 6          5.4         3.9           NA         0.4  setosa

# Impute all variables univariately
head(m <- missRanger(irisWithNA, . ~ 1))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  
#> iter 1:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          6.0         0.2  setosa
#> 2          4.9         3.0          1.4         1.8  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          6.3         3.5          4.5         0.2  setosa
#> 5          5.0         3.6          1.4         1.8  setosa
#> 6          5.4         3.9          1.4         0.4  setosa

Imputation takes too much time. What can I do?

missRanger is based on iteratively fitting random forests for each variable with missing values. Since the underlying random forest implementation ranger uses 500 trees by default, a huge number of trees might be grown. For larger data sets, the overall process can take very long.
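As a rough back-of-the-envelope calculation, the number of trees grows with the number of incomplete variables, the trees per forest and the number of iterations. The figures below assume a data set with 10 incomplete columns; the iteration count of 4 is an assumption for illustration only, since the actual number depends on when the out-of-bag error stops improving.

```r
n_vars    <- 10   # variables with missing values (assumed)
num_trees <- 500  # ranger default
n_iter    <- 4    # assumed number of iterations, for illustration only

trees_per_iter <- n_vars * num_trees
trees_total    <- trees_per_iter * n_iter

trees_per_iter  # 5000
trees_total     # 20000
```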

Here are tweaks to make things faster:

Use fewer trees, e.g. by setting num.trees = 50. Even one single tree might be sufficient. Typically, though, the number of iterations until convergence will increase with fewer trees.
Use smaller bootstrap samples by setting e.g. sample.fraction = 0.1.
Use the less greedy splitrule = "extratrees".
Use a low tree depth, e.g. max.depth = 6.
Use large leaves, e.g. min.node.size = 10000.
Use a low maxiter, e.g. 1 or 2.
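Several of these tweaks can be combined in one call. The following is a sketch on the small iris data, so the chosen values (20 trees, half of the rows per tree, leaves of at least 10 observations) are assumptions scaled down for illustration rather than recommendations. The iteration cap is passed as maxiter, which is the argument name in the missRanger signature as far as we are aware; the remaining tuning arguments are forwarded to ranger through the ... argument.

```r
library(missRanger)

set.seed(84553)
irisWithNA <- generateNA(iris, p = 0.2)

irisFast <- missRanger(
  irisWithNA,
  pmm.k = 3,
  maxiter = 2,               # at most two passes over all variables
  verbose = 0,
  num.trees = 20,            # fewer trees per forest
  sample.fraction = 0.5,     # smaller bootstrap samples
  splitrule = "extratrees",  # less greedy split rule
  max.depth = 6,             # shallow trees
  min.node.size = 10         # larger leaves
)

# Check that no missing values remain
anyNA(irisFast)
```

Each of these settings trades some accuracy for speed, so it is worth comparing the imputed values against a slower run on a subsample before applying them to a large data set.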

Examples evaluated on a normal laptop (not run here)

library(ggplot2) # for diamonds data
dim(diamonds) # 53940    10

diamonds_with_NA <- generateNA(diamonds)

# Takes 270 seconds (10 * 500 trees per iteration!)
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3))

# Takes 19 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50))

# Takes 7 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 1))

# Takes 9 seconds
system.time(m <- missRanger(diamonds_with_NA, pmm.k = 3, num.trees = 50, sample.fraction = 0.1))

Trick: Use case.weights to weight down the contribution of rows with many missing values

Using the case.weights argument, you can pass case weights to the imputation models. This can be useful, for example, to weight down the contribution of rows with many missing values.

Example

# Count the number of non-missing values per row
non_miss <- rowSums(!is.na(irisWithNA))
table(non_miss)
#> non_miss
#>  1  2  3  4  5 
#>  2  6 28 68 46

# No weighting
head(m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#> iter 4:  .....
#> iter 5:  .....
#> iter 6:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.6         0.2  setosa
#> 2          4.9         3.0          1.4         0.1  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.9         3.7          1.4         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.5         0.4  setosa

# Weighted by number of non-missing values per row. 
head(m <- missRanger(irisWithNA, num.trees = 20, pmm.k = 3, seed = 5, case.weights = non_miss))
#> 
#> Missing value imputation by random forests
#> 
#>   Variables to impute:       Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#>   Variables used to impute:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
#> iter 1:  .....
#> iter 2:  .....
#> iter 3:  .....
#> iter 4:  .....
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.0         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          5.4         3.4          1.4         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.3         0.4  setosa

Michael Mayer

2021-03-27
