missRanger for multiple imputation?
For machine learning tasks, imputation is typically treated as a fixed data preparation step, like dummy coding. Multiple imputation is rarely applied there because it adds another level of complexity to the analysis. This might be fine, since a good validation scheme will account for the variation introduced by imputation.
For tasks that focus on statistical inference (p values, standard errors, confidence intervals, estimation of effects), the extra variability introduced by imputation has to be accounted for unless only very few values are missing. One of the standard approaches is to impute the data set multiple times, generating e.g. 10 or 100 versions of a complete data set. Then, the intended analysis (t-test, linear model etc.) is applied independently to each of the complete data sets. The results are combined afterward in a pooling step, usually by Rubin’s rules (Rubin 1987). For parameter estimates, averages are taken. Their variance is essentially the average squared standard error plus the variance of the parameter estimates across the imputed data sets. This leads to larger standard errors and thus larger p values and wider confidence intervals.
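As a minimal sketch of the pooling arithmetic described above, the following base R snippet pools a single coefficient by hand. The estimates and standard errors are made-up numbers and all variable names are illustrative only. The factor (1 + 1/m) on the between-imputation variance belongs to the usual formulation of Rubin's rules and corrects for the finite number of imputations.

```r
# Hypothetical estimates and standard errors of one coefficient,
# obtained from m = 5 imputed data sets (made-up numbers)
est <- c(0.42, 0.45, 0.40, 0.44, 0.41)
se  <- c(0.10, 0.11, 0.10, 0.12, 0.10)
m   <- length(est)

# Pooled point estimate: plain average across imputations
pooled_est <- mean(est)

# Within-imputation variance: average squared standard error
within_var <- mean(se^2)

# Between-imputation variance: variance of the estimates
between_var <- var(est)

# Total variance and pooled standard error
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

c(estimate = pooled_est, std.error = pooled_se)
```

The pooled standard error is always at least as large as the square root of the average squared standard error, which is where the wider confidence intervals come from. A full implementation also adjusts the degrees of freedom, which this sketch leaves out.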
The package mice (Buuren and Groothuis-Oudshoorn 2011) takes care of this pooling step. The multiple complete data sets can be created by mice or by missRanger. In the latter case, to keep the variance of imputed values at a realistic level, we suggest using predictive mean matching on top of the random forest imputations.
The following example shows how simple such a workflow is.
library(missRanger)
library(mice)
set.seed(19)
irisWithNA <- generateNA(iris, p = c(0, 0.1, 0.1, 0.1, 0.1))
# Generate 20 complete data sets
filled <- replicate(20, missRanger(irisWithNA, verbose = 0, num.trees = 50, pmm.k = 5), simplify = FALSE)
# Run a linear model for each of the completed data sets
models <- lapply(filled, function(x) lm(Sepal.Length ~ ., x))
# Pool the results by mice
summary(pooled_fit <- pool(models))
#> term estimate std.error statistic df p.value
#> 1 (Intercept) 2.5318379 0.35328813 7.166496 78.42046 3.702585e-10
#> 2 Sepal.Width 0.4219628 0.10914841 3.865955 86.24543 2.139366e-04
#> 3 Petal.Length 0.7463245 0.09206148 8.106805 55.91280 5.215939e-11
#> 4 Petal.Width -0.1942882 0.18919218 -1.026936 65.08255 3.082525e-01
#> 5 Speciesversicolor -0.7083371 0.28697395 -2.468298 92.54440 1.541229e-02
#> 6 Speciesvirginica -0.9094360 0.39326030 -2.312555 91.05446 2.300101e-02
# Compare with model on original data
summary(lm(Sepal.Length ~ ., data = iris))
#>
#> Call:
#> lm(formula = Sepal.Length ~ ., data = iris)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.79424 -0.21874 0.00899 0.20255 0.73103
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
#> Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
#> Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
#> Petal.Width -0.31516 0.15120 -2.084 0.03889 *
#> Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
#> Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3068 on 144 degrees of freedom
#> Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
#> F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
The standard errors and p values from the multiple imputation are larger than those from the original data set. This realistically reflects the additional uncertainty introduced by the missing values.
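To make this comparison easier to read, the two sets of standard errors can be put side by side. The snippet below is a sketch that repeats the workflow from the example and assumes that summary(pool(...)) returns the columns term and std.error, as in the output shown above. The names comparison, se_pooled, se_full and ratio are illustrative only, and the exact numbers may vary with package versions.

```r
library(missRanger)
library(mice)

set.seed(19)
irisWithNA <- generateNA(iris, p = c(0, 0.1, 0.1, 0.1, 0.1))

# Generate 20 complete data sets and fit the same model on each
filled <- replicate(
  20,
  missRanger(irisWithNA, verbose = 0, num.trees = 50, pmm.k = 5),
  simplify = FALSE
)
models <- lapply(filled, function(x) lm(Sepal.Length ~ ., data = x))

# Pooled results and the coefficient table from the original data
pooled <- summary(pool(models))
full <- coef(summary(lm(Sepal.Length ~ ., data = iris)))

# Match rows by term name rather than by position
terms <- as.character(pooled$term)
comparison <- data.frame(
  term = terms,
  se_pooled = pooled$std.error,
  se_full = unname(full[terms, "Std. Error"])
)

# Ratio above 1 means the pooled standard error is larger
comparison$ratio <- comparison$se_pooled / comparison$se_full
print(comparison)
```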