This vignette provides a brief demonstration of rMIDAS. We show how to use the package to multiply impute missing values in the Adult census dataset, which is commonly used for benchmarking machine learning tasks.
rMIDAS relies on Python to implement the MIDAS
imputation algorithm, so you should ensure you have a Python 3.x
environment installed on your machine. When the package is first loaded,
it will try to locate a suitable Python environment automatically; if
this fails, you will receive a warning message. In that case, you
can manually specify a Python binary using set_python_env()
or via reticulate directly (see this vignette for more
information).
Once rMIDAS is initialized, we can load our data. For the purpose of this example, we’ll use a subset of the Adult data:
library(rMIDAS)

adult <- read.csv("https://raw.githubusercontent.com/MIDASverse/MIDASpy/master/Examples/adult_data.csv",
                  row.names = 1)[1:1000,]
As the dataset has a very low proportion of missingness (one of the
reasons it is favored for machine learning tasks), we randomly set 10%
of observed values as missing in each column using rMIDAS'
add_missingness() function:
set.seed(89)

adult <- add_missingness(adult, prop = 0.1)
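To make the masking step concrete, here is a minimal base-R sketch of setting a fixed proportion of observed values to missing in each column. The helper `mask_column()` and the `toy` data frame are hypothetical and only illustrate the idea; this is not rMIDAS's actual implementation of add_missingness().

```r
# Hypothetical helper: set a proportion `prop` of the observed values in
# a vector to NA, with positions chosen uniformly at random.
mask_column <- function(x, prop) {
  observed <- which(!is.na(x))
  n_mask <- round(prop * length(observed))
  # Index into `observed` rather than calling sample(observed, ...),
  # which would misbehave if `observed` had length 1.
  drop <- observed[sample.int(length(observed), n_mask)]
  x[drop] <- NA
  x
}

set.seed(89)
toy <- data.frame(age = 21:40, sex = rep(c("F", "M"), 10))

# Apply the mask column by column, as in the text above
toy_missing <- as.data.frame(lapply(toy, mask_column, prop = 0.1))

# Count the missing values introduced in each column
colSums(is.na(toy_missing))
```

With 20 rows and `prop = 0.1`, each column ends up with two missing values, although which rows are masked differs between columns.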
Next, we make a list of all categorical and binary variables, before
preprocessing the data for training using the convert() function.
Setting the minmax_scale argument to TRUE ensures that continuous
variables are scaled between 0 and 1, which can substantially improve
convergence in the training step. All preprocessing steps can be
reversed after imputation:
adult_cat <- c('workclass','marital_status','relationship','race','education','occupation','native_country')
adult_bin <- c('sex','class_labels')

# Apply rMIDAS preprocessing steps
adult_conv <- convert(adult,
                      bin_cols = adult_bin,
                      cat_cols = adult_cat,
                      minmax_scale = TRUE)
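As an illustration of why min-max scaling is reversible, the following base-R sketch scales a numeric vector to the interval from 0 to 1 and then restores the original values from the stored minimum and maximum. The helpers `minmax_scale()` and `minmax_unscale()` and the `hours` vector are hypothetical names used for this example only; they are not part of rMIDAS and may differ from what convert() does internally.

```r
# Hypothetical helper: scale a numeric vector to [0, 1] and keep the
# minimum and maximum so that the transformation can be undone later.
minmax_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  list(scaled = (x - rng[1]) / (rng[2] - rng[1]),
       min = rng[1],
       max = rng[2])
}

# Hypothetical helper: invert the scaling using the stored bounds.
minmax_unscale <- function(scaled, min, max) {
  scaled * (max - min) + min
}

# Made-up values with one missing entry, which stays NA throughout
hours <- c(40, 13, NA, 60, 25)

s <- minmax_scale(hours)
s$scaled

restored <- minmax_unscale(s$scaled, s$min, s$max)
restored
```

Because the bounds are computed from the observed values only, missing entries do not affect the scaling and remain missing until they are imputed.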
The data are now ready to be fed into the MIDAS algorithm, which
involves a single call of the train() function. At this stage, we
specify the dimensions, input corruption proportion, and other
hyperparameters of the MIDAS neural network as well as the number of
training epochs:
# Train the model for 20 epochs
adult_train <- train(adult_conv,
                     training_epochs = 20,
                     layer_structure = c(128,128),
                     input_drop = 0.75,
                     seed = 89)
Once training is complete, we can generate any number of completed
datasets using the complete() function (below we generate 10). The
completed dataframes can also be saved as ‘.csv’ files using the file
and file_root arguments (not demonstrated here). By default, complete()
unscales continuous variables and converts binary and categorical
variables back to their original form.
Since the MIDAS algorithm returns predicted probabilities for binary
and categorical variables, imputed values of such variables can be
generated using one of two options. When fast = FALSE (the default),
complete() uses the predicted probabilities for each category level to
take a weighted random draw from the set of all levels. When
fast = TRUE, the function selects the level with the highest predicted
probability. If completed datasets are very large or complete() is
taking a long time to run, users may benefit from choosing the latter
option:
# Generate 10 imputed datasets
adult_complete <- complete(adult_train, m = 10)

# Inspect first imputed dataset:
head(adult_complete[[1]])
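The difference between the two options for categorical variables can be sketched in base R. The probability vector `probs` below is made up for a single missing cell, and the code only mirrors the behavior described above; it is not how complete() is implemented.

```r
# Hypothetical predicted probabilities for one missing cell of a
# categorical variable with three levels.
probs <- c("Private" = 0.6, "Self-emp" = 0.3, "State-gov" = 0.1)

# Analogue of fast = FALSE: a weighted random draw, so repeated draws
# reflect the uncertainty in the prediction.
set.seed(89)
draws_weighted <- sample(names(probs), size = 10, replace = TRUE, prob = probs)
draws_weighted

# Analogue of fast = TRUE: always pick the most probable level, which is
# cheaper but yields the same value in every completed dataset.
draw_argmax <- names(probs)[which.max(probs)]
draw_argmax
```

The weighted draw varies from one completed dataset to the next, whereas the highest-probability rule is deterministic for a given set of predicted probabilities.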
Finally, the combine() function allows users to estimate
regression models on the completed datasets with Rubin’s combination
rules. This function wraps the glm() function, whose
arguments can be used to select different families of estimation methods
(gaussian/OLS, binomial etc.) and to specify other aspects of the
model:
# Estimate logit model on 10 completed datasets (using Rubin's combination rules)
adult_model <- combine("class_labels ~ hours_per_week + sex",
                       adult_complete,
                       family = stats::binomial)

adult_model
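For readers unfamiliar with Rubin's combination rules, the sketch below pools a single coefficient across several completed datasets in base R. The helper `pool_rubin()` and the input numbers are hypothetical and purely illustrative; combine() may report additional quantities or organize the calculation differently.

```r
# Hypothetical helper: pool one coefficient across m completed datasets
# using Rubin's rules.
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  q_bar <- mean(estimates)       # pooled point estimate
  w <- mean(std_errors^2)        # within-imputation variance
  b <- var(estimates)            # between-imputation variance
  total_var <- w + (1 + 1 / m) * b
  c(estimate = q_bar, std_error = sqrt(total_var))
}

# Made-up coefficient estimates and standard errors from m = 3 models
pooled <- pool_rubin(estimates = c(0.9, 1.0, 1.1),
                     std_errors = c(0.2, 0.2, 0.2))
pooled
```

The pooled standard error exceeds the per-model standard error of 0.2 because the between-imputation variance adds the uncertainty that comes from the missing data.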