R package mma is used for general third-variable effect analysis (Yu and Li 2022). The third-variable effect (TVE) refers to the effect conveyed by intervening variables to an observed relationship between an exposure and a response variable (outcome). In this package, the exposure is also called the predictor, the intervening variables are called mediators or confounders (MC). The TVE include total effect, direct effect, and indirect effects from different third variables. The effects are defined in Yu et al. (2014). The algorithms on how to make inferences on TVEs and how to explain the effects are described in Yu et al. (2014).
To use the R package mma, we first install the package in R (install.packages("mma")
) and load it.
library(mma)
#> Loading required package: gbm
#> Loaded gbm 2.1.8
#> Loading required package: splines
#> Loading required package: survival
#> Loading required package: car
#> Loading required package: carData
#> Loading required package: gplots
#>
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#>
#> lowess
#source('C:/Users/qyu/Desktop/Research/mma/version current/R/mma.r')
We use weight_behavior data as an example to show how to do a preliminary data analysis to identify potential MCs and covariates. The function data.org is used for this purpose. The list of arguments for data.org are defined in Yu and Li (2017). In the following example, the predictor, “pred,” is gender, and the outcome, “y,” is a binary indicator of overweight (1) or not (0). The reference group of the exposure variable is the male as defined by “predref=”M"“. There are 12 variables in”x“, which includes all potential MCs and covariates. As an argument in the function data.org,”mediator=5:12" indicates that the variables from column 5 to column 12 in x should be tested for potential MCs. Other columns in “x” are to be treated as covariates. The “mediator” argument can also be a character vector with names of the potential MCs in x. Two tests are needed to identify a potential MC: first, the variable is significantly related with the predictor adjusting for other covariates. The other covariates here is defined by “cova.” If cova is a data frame by itself, all covariates in cova are for all potential MCs. If cova is a list, the first item will be the covariate data frame and the second item list the names or column numbers of MCs in “x” that should adjust for the covariates. The significance level is set by “alpha2,” whose default value is 0.1. Second, the variable has to be significantly related with the outcome adjusting (testtype=1, by default) or not adjusting (testtype=2) for the predictor and other variables. The significance level is set by “alpha.” In the following example, p-value 1 shows the results for the second test. and p-value 2 are the results for the first test. A variable that pass the second test but not the first test is considered as covariates. Variables do not pass the second test are not adopted in further analysis. Variables in “x” but not in “mediator” are forced in further analysis as covariates.
“jointm” is a list with the first item identify the number of groups of MCs whose joint effects are of interesting, where each group is defined in an item of “jointm.” In the above example, the joint effect of Variables 5, 7, 9 is of interesting. These variables are forced into analysis as potential MCs. A variable can be shown in “jointm” multiple times. The joint effect of each group of variables will be reported. A trick to force in variables as potential MCs is to assign them in a separate group of “jointm.”
data("weight_behavior")
#binary predictor x
#binary y
=weight_behavior[,c(2,4:14)]
x=weight_behavior[,"sex"]
pred=weight_behavior[,"overweigh"]
y.1<-data.org(x,y,mediator=5:12,jointm=list(n=1,j1=7:9),
data.b.bpred=pred,predref="M", alpha=0.4,alpha2=0.4)
summary(data.b.b.1)
#> Identified as mediators:
#> [1] "tvhours" "cmpthours" "cellhours" "exercises" "sweat" "sports"
#> Selected as covariates:
#> [1] "age" "race" "numpeople" "car"
#> Tests:
#> P-Value 1.y P-Value 2.pred
#> age 0.836 NA
#> race 0.850 NA
#> numpeople 0.561 NA
#> car 0.052 NA
#> gotosch 0.710 NA
#> snack 0.724 NA
#> tvhours - 0.830 0.484
#> cmpthours - 0.826 0.760
#> cellhours - 0.067 0.688
#> sports * 0.000 0.003
#> exercises * 0.176 0.203
#> sweat * 0.181 0.046
#> pred 0.178 NA
#> ----
#> *:mediator,-:joint mediator
#> P-Value 1:Type-3 tests in the full model (data.org) or estimated coefficients (data.org.big)
#> when testtype=1, univariate relationship test with the outcome when testtype=2
#> P-Value 2:Tests of relationship with the Predictor
The results from “data.org” function is summarized as above.
A potential MC defined in “mediator” argument is identified as categorical or binary if it has only two unique values or if the variable is a vector factors or characters. By default, the reference group is the first level of the variable. If the reference group needs to be changed, the categorical variable should be defined in binmed or catmed, and then the corresponding reference group in binref and catref. Continuous MCs can be defined by contmed. The following example did the same analysis as above, only that it defines that potential continuous mediators are columns 7,8,9,11, and 12 of x, binary mediators are columns 6 and 10, and categorical mediator is column 5 of x with 1 to be the reference group for all categorical and binary mediators.
.2<-data.org(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10), jointm=list(n=1,j1=7:9),
data.b.bbinref=c(1,1),catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4)
summary(data.b.b.2)
The functions in mma can deal with binary, categorical, continuous, or time-to-event outcomes. If the outcome is time-to-event, it should be defined by the “Surv” function in the survival package. An example of application on breast cancer survival can be found in Yu et al. (2019).
<-cgd0
cgd1<-ifelse(is.na(cgd1$etime1),0,1)
status=Surv(cgd1$futime,status)
y=cgd1[,c(5:7,9:12)]
x=cgd1[,4]
pred<-data.org(x,y,pred=pred,mediator=1:ncol(x),
data.b.survalpha=0.4,alpha2=0.4)
summary(data.b.surv)
In addition, the package can handle multivariate and multicategorical predictors. If the predictor is multicategorical of k levels, “data.org” first transform the predictor to k-1 binary predictors. If a variable significantly relates with any of the k-1 predictors, the variable passes the first test described above. P-value 2 is shown for each predictor. In the following example, the predictor is the race with six levels.
#multivariate predictor
=weight_behavior[,c(2:3,5:14)]
x=weight_behavior[,4]
pred=weight_behavior[,15]
y<-data.org(x,y,mediator=5:12,jointm=list(n=1,j1=c(5,7,9)),
data.mb.b pred=pred,predref="OTHER", alpha=0.4,alpha2=0.4)
summary(data.mb.b)
#> Identified as mediators:
#> [1] "tvhours" "cellhours" "sweat" "sports" "gotosch"
#> Selected as covariates:
#> [1] "age" "sex" "numpeople" "car" "exercises"
#> Tests:
#> P-Value 1.y P-Value 2.pred P-Value 2.predAFRICAN
#> age 0.836 NA NA
#> sex 0.178 NA NA
#> numpeople 0.561 NA NA
#> car 0.052 NA NA
#> gotosch - 0.710 0.112 0.796
#> snack 0.724 NA NA
#> tvhours - 0.830 0.748 0.535
#> cmpthours 0.826 NA NA
#> cellhours - 0.067 0.880 0.994
#> sports * 0.000 0.587 0.163
#> exercises 0.176 0.629 0.731
#> sweat * 0.181 0.647 0.174
#> pred 0.543 NA NA
#> predAFRICAN 0.663 NA NA
#> predCAUCASIAN 0.242 NA NA
#> predINDIAN 0.890 NA NA
#> predMIXED 0.782 NA NA
#> P-Value 2.predCAUCASIAN P-Value 2.predINDIAN P-Value 2.predMIXED
#> age NA NA NA
#> sex NA NA NA
#> numpeople NA NA NA
#> car NA NA NA
#> gotosch - 0.996 0.097 0.033
#> snack NA NA NA
#> tvhours - 0.334 0.916 0.092
#> cmpthours NA NA NA
#> cellhours - 0.707 0.555 0.383
#> sports * 0.816 0.453 0.091
#> exercises 0.723 0.912 0.881
#> sweat * 0.214 0.374 0.203
#> pred NA NA NA
#> predAFRICAN NA NA NA
#> predCAUCASIAN NA NA NA
#> predINDIAN NA NA NA
#> predMIXED NA NA NA
#> ----
#> *:mediator,-:joint mediator
#> P-Value 1:Type-3 tests in the full model (data.org) or estimated coefficients (data.org.big)
#> when testtype=1, univariate relationship test with the outcome when testtype=2
#> P-Value 2:Tests of relationship with the Predictor
Similarly, the package can deal with multivariate outcomes. The following code deals with multivariate predictors and multivariate responses. If the variable is significantly related with any one of the outcomes, it passes the second test described above. The results from “data.org” are summarized for each combination of the exposure-outcome relationship.
#multivariate responses
=weight_behavior[,c(2:3,5:14)]
x=weight_behavior[,4]
pred=weight_behavior[,c(1,15)]
y<-data.org(x,y,mediator=5:12,jointm=list(n=1,j1=c(5,7,9)),
data.mb.mbpred=pred,predref="OTHER", alpha=0.4,alpha2=0.4)
summary(data.mb.mb)
#> Identified as mediators:
#> [1] "tvhours" "cellhours" "sweat" "sports" "gotosch"
#> Selected as covariates:
#> [1] "age" "sex" "numpeople" "car" "exercises"
#> Tests:
#> P-Value 1.bmi P-Value 1.overweigh P-Value 2.pred
#> age 0.458 0.836 NA
#> sex 0.003 0.178 NA
#> numpeople 0.282 0.561 NA
#> car 0.059 0.052 NA
#> gotosch - 0.527 0.710 0.112
#> snack 0.830 0.724 NA
#> tvhours - 0.505 0.830 0.748
#> cmpthours 0.676 0.826 NA
#> cellhours - 0.084 0.067 0.880
#> sports * 0.001 0.000 0.587
#> exercises 0.089 0.176 0.629
#> sweat * 0.078 0.181 0.647
#> pred 0.344 0.543 NA
#> predAFRICAN 0.900 0.663 NA
#> predCAUCASIAN 0.032 0.242 NA
#> predINDIAN 0.446 0.890 NA
#> predMIXED 0.833 0.782 NA
#> P-Value 2.predAFRICAN P-Value 2.predCAUCASIAN
#> age NA NA
#> sex NA NA
#> numpeople NA NA
#> car NA NA
#> gotosch - 0.796 0.996
#> snack NA NA
#> tvhours - 0.535 0.334
#> cmpthours NA NA
#> cellhours - 0.994 0.707
#> sports * 0.163 0.816
#> exercises 0.731 0.723
#> sweat * 0.174 0.214
#> pred NA NA
#> predAFRICAN NA NA
#> predCAUCASIAN NA NA
#> predINDIAN NA NA
#> predMIXED NA NA
#> P-Value 2.predINDIAN P-Value 2.predMIXED
#> age NA NA
#> sex NA NA
#> numpeople NA NA
#> car NA NA
#> gotosch - 0.097 0.033
#> snack NA NA
#> tvhours - 0.916 0.092
#> cmpthours NA NA
#> cellhours - 0.555 0.383
#> sports * 0.453 0.091
#> exercises 0.912 0.881
#> sweat * 0.374 0.203
#> pred NA NA
#> predAFRICAN NA NA
#> predCAUCASIAN NA NA
#> predINDIAN NA NA
#> predMIXED NA NA
#> ----
#> *:mediator,-:joint mediator
#> P-Value 1:Type-3 tests in the full model (data.org) or estimated coefficients (data.org.big)
#> when testtype=1, univariate relationship test with the outcome when testtype=2
#> P-Value 2:Tests of relationship with the Predictor
##Third-Variable Effect Analysis Next, we use med function to estimate the TVE using the results from data.org function. The default approach is using a generalized linear model to fit the final full model. If “nonlinear=TRUE”", multiple Additive Regression Trees (MART) will be used to fit the relationship between outcome(s) and other variables, and smoothing splines used to fit the relationship between the potential MC and exposure variables(s).
.2<-med(data=data.b.b.2,n=2,nonlinear=TRUE) med.b.b
Using the linear method and to show the results:
.1<-med(data=data.b.b.2,n=2)
med.b.b.1
med.b.b#>
#>
#> For the predictor pred :
#> The estimated total effect:[1] 0.6475
#>
#> The estimated indirect effect:
#> y1.all y1.tvhours y1.cmpthours y1.cellhours y1.exercises y1.sweat
#> 0.2945 0.0072 0.0398 0.0843 -0.0218 -0.0114
#> y1.sports y1.j1
#> 0.1464 0.0768
We can plot the marginal effect of the MC on the outcome, and the marginal effect of the predictor on the MC. If the predictor is binary, plot.med draws a histogram or boxplot of the marginal density of the MC at each different value of the predictor.
plot(med.b.b.1, data.b.b.2,vari="exercises",xlim=c(0,50))
plot(med.b.b.2, data.b.b.2,vari="sports")
For survival outcome, the default option is to fit the final full model using Cox proportional hazard model. Note that when use type=“lp,” the linear predictor is the type of predicted value as in predict.coxph, the results is similar to those from MART. Readers are referred to Yu et al. (2019) for more details.
.1<-med(data=data.b.surv,n=2,type="lp")
med.b.surv#close to mart results when use type="lp"
.2<-med(data=data.b.surv,n=2,nonlinear=TRUE)
med.b.surv#results in the linear part unit
boot.med function is used to make inferences on the TVE. Results return from data.org function is input as the first argument.
.1<-boot.med(data=data.b.b.2,n=2,n2=50)
bootmed.b.b#> [1] 1
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6
#> [1] 7
#> [1] 8
#> [1] 9
#> [1] 10
#> [1] 11
#> [1] 12
#> [1] 13
#> [1] 14
#> [1] 15
#> [1] 16
#> [1] 17
#> [1] 18
#> [1] 19
#> [1] 20
#> [1] 21
#> [1] 22
#> [1] 23
#> [1] 24
#> [1] 25
#> [1] 26
#> [1] 27
#> [1] 28
#> [1] 29
#> [1] 30
#> [1] 31
#> [1] 32
#> [1] 33
#> [1] 34
#> [1] 35
#> [1] 36
#> [1] 37
#> [1] 38
#> [1] 39
#> [1] 40
#> [1] 41
#> [1] 42
#> [1] 43
#> [1] 44
#> [1] 45
#> [1] 46
#> [1] 47
#> [1] 48
#> [1] 49
#> [1] 50
summary(bootmed.b.b.1)
#> MMA Analysis: Estimated Mediation Effects Using GLM
#> For Predictor/Moderator at pred
#> $total.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.603 0.644 0.375 1.379 -0.091 1.283 -0.031 1.529 -0.090 0.086 0.120
#>
#> $direct.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.339 0.374 0.377 1.113 -0.364 1.071 -0.284 1.154 -0.400 0.320 0.320
#>
#> $indirect.effect
#> y1.all y1.tvhours y1.cmpthours y1.cellhours y1.exercises y1.sweat
#> est 0.263 -0.003 0.001 0.026 -0.006 0.007
#> mean 0.270 0.004 0.042 0.015 -0.014 0.016
#> sd 0.149 0.052 0.073 0.062 0.028 0.032
#> upbd 0.562 0.105 0.185 0.138 0.042 0.078
#> lwbd -0.023 -0.097 -0.102 -0.107 -0.070 -0.046
#> upbd_q 0.539 0.104 0.207 0.139 0.030 0.074
#> lwbd_q 0.003 -0.082 -0.077 -0.117 -0.078 -0.037
#> upbd_b 0.603 0.137 0.217 0.169 0.040 0.076
#> lwbd_b -0.046 -0.094 -0.084 -0.138 -0.080 -0.072
#> p_norm 0.071 0.945 0.569 0.807 0.616 0.612
#> p_quan 0.080 0.920 0.600 0.600 0.680 0.480
#> y1.sports y1.j1
#> est 0.147 0.017
#> mean 0.164 0.083
#> sd 0.084 0.099
#> upbd 0.328 0.277
#> lwbd -0.001 -0.112
#> upbd_q 0.356 0.297
#> lwbd_q 0.057 -0.081
#> upbd_b 0.394 0.342
#> lwbd_b 0.017 -0.099
#> p_norm 0.051 0.405
#> p_quan 0.000 0.360
Similarly, the default approach is using a generalized linear model to fit the final full model. If the argument nonlinear=TRUE, multiple Additive Regression Trees (MART) will be used to fit the model in estimating the outcome. Applications for nonlinear Mediation Analysis are described in Yu et al. (2017, 2018).
.2<-boot.med(data=data.b.b.2,n=2,n2=40,nu=0.05,nonlinear=TRUE) bootmed.b.b
summary(bootmed.b.b.2)
#> MMA Analysis: Estimated Mediation Effects Using MART
#> For Predictor/Moderator at pred
#> $total.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.067 0.271 0.034 0.337 0.205 0.313 0.237 0.316 0.235 0.000 0.000
#>
#> $direct.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> -0.014 0.078 0.087 0.249 -0.093 0.192 0.004 0.203 0.001 0.370 0.000
#>
#> $indirect.effect
#> y1.all y1.tvhours y1.cmpthours y1.cellhours y1.exercises y1.sweat
#> est 0.081 0.004 0.001 -0.013 0.000 0.000
#> mean 0.193 0.004 0.049 0.012 -0.006 0.006
#> sd 0.108 0.011 0.059 0.045 0.035 0.016
#> upbd 0.405 0.025 0.165 0.101 0.061 0.037
#> lwbd -0.020 -0.016 -0.066 -0.076 -0.074 -0.025
#> upbd_q 0.266 0.015 0.128 0.059 0.038 0.027
#> lwbd_q 0.047 -0.008 0.007 -0.042 -0.033 -0.003
#> upbd_b 0.268 0.016 0.041 0.062 0.042 0.030
#> lwbd_b 0.033 0.004 0.007 -0.046 -0.029 -0.002
#> p_norm 0.076 0.678 0.405 0.782 0.852 0.694
#> p_quan 0.000 0.500 0.000 0.500 0.500 1.000
#> y1.sports y1.j1
#> est 0.070 0.011
#> mean 0.165 0.067
#> sd 0.097 0.050
#> upbd 0.354 0.164
#> lwbd -0.025 -0.030
#> upbd_q 0.267 0.120
#> lwbd_q 0.049 0.016
#> upbd_b 0.192 0.122
#> lwbd_b 0.041 0.015
#> p_norm 0.089 0.176
#> p_quan 0.000 0.000
plot.mma() plots the marginal effect of the selected variable on the outcome, and the marginal effect of the predictor on the selected variable. Interpretations for similar plots can be found in Yu et al. (2017).
plot(bootmed.b.b.1,vari="exercises",xlim=c(0,50))
plot(bootmed.b.b.1,vari="sports")
The mma function is a combined function that automatically identify potential MCs, based on which to make statistical inference on the mediation effects.
=weight_behavior[,c(2,4:14)]
x=weight_behavior[,3]
pred=weight_behavior[,15]
y<-mma(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10),binref=c(1,1), catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4,n=2,n2=2)
mma.b.b.glm#> [1] 1
#> [1] 2
summary(mma.b.b.glm)
#> MMA Analysis: Estimated Mediation Effects Using GLM
#> For Predictor/Moderator at pred
#> $total.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.498 0.840 0.313 1.453 0.227 1.050 0.630 0.619 0.619 0.007 0.000
#>
#> $direct.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.429 0.619 0.258 1.125 0.113 0.792 0.446 0.437 0.437 0.016 0.000
#>
#> $indirect.effect
#> y1.all y1.exercises y1.sweat y1.sports
#> est 0.069 -0.021 0.014 0.137
#> mean 0.221 -0.022 -0.023 0.170
#> sd 0.055 0.020 0.037 0.011
#> upbd 0.328 0.018 0.049 0.193
#> lwbd 0.113 -0.063 -0.095 0.148
#> upbd_q 0.258 -0.009 0.001 0.178
#> lwbd_q 0.184 -0.036 -0.048 0.163
#> upbd_b 0.182 0.182 0.182 0.182
#> lwbd_b -0.008 -0.008 -0.008 -0.008
#> p_norm 0.000 0.272 0.527 0.000
#> p_quan 0.000 0.000 1.000 0.000
When the argument nonlinear=TRUE, multiple Additive Regression Trees (MART) will be used to fit the final full model in estimating the outcome.
<-mma(x,y,pred=pred,contmed=c(7:9,11:12),binmed=c(6,10),binref=c(1,1), catmed=5,catref=1,predref="M",alpha=0.4,alpha2=0.4,nonlinear=TRUE,n=2,n2=5)
mma.b.b.martsummary(mma.b.b.mart)
summary(mma.b.b.mart)
#> MMA Analysis: Estimated Mediation Effects Using MART
#> For Predictor/Moderator at pred
#> $total.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.207 0.334 0.168 0.663 0.005 0.581 0.174 0.346 0.167 0.047 0.000
#>
#> $direct.effect
#> est mean sd upbd lwbd upbd_q lwbd_q upbd_b lwbd_b p_norm p_quan
#> 0.078 0.213 0.212 0.629 -0.203 0.522 0.007 0.213 -0.002 0.316 0.400
#>
#> $indirect.effect
#> y1.all y1.exercises y1.sweat y1.sports
#> est 0.128 0.003 0.004 0.090
#> mean 0.121 0.014 0.017 0.082
#> sd 0.047 0.011 0.024 0.025
#> upbd 0.212 0.037 0.064 0.131
#> lwbd 0.030 -0.008 -0.031 0.033
#> upbd_q 0.167 0.032 0.054 0.110
#> lwbd_q 0.056 0.005 -0.002 0.047
#> upbd_b 0.169 0.034 0.058 0.111
#> lwbd_b 0.100 0.008 -0.003 0.077
#> p_norm 0.009 0.207 0.492 0.001
#> p_quan 0.000 0.000 0.400 0.000
plot(mma.b.b.mart,vari="exercises")
#> Error in pred_name[, z.b]: incorrect number of dimensions
plot(mma.b.b.glm,vari="sweat")